NLP Programming Tutorial 13 - Beam and A* Search

NLP Programming Tutorial 13 - Beam and A* Search
Graham Neubig
Nara Institute of Science and Technology (NAIST)

Prediction Problems
Given observable information X, find the hidden Y that maximizes P(Y|X):
    argmax_Y P(Y|X)
This is used in POS tagging, word segmentation, and parsing.
Solving this argmax is search.
Until now, we have mainly used the Viterbi algorithm.

Hidden Markov Models (HMMs) for POS Tagging
POS→POS transition probabilities (like a bigram model!):
    P(Y) ≈ ∏_{i=1..I+1} P_T(y_i | y_{i-1})
POS→word emission probabilities:
    P(X|Y) ≈ ∏_{i=1..I} P_E(x_i | y_i)
Example:
    <s>  JJ       NN        NN          LRB  NN   RRB  ...  </s>
         natural  language  processing  (    nlp  )    ...
    Transitions: P_T(JJ|<s>) * P_T(NN|JJ) * P_T(NN|NN) * ...
    Emissions:   P_E(natural|JJ) * P_E(language|NN) * P_E(processing|NN) * ...
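
To make "solving this argmax is search" concrete, here is a minimal, self-contained Python sketch (not part of the tutorial) that finds the best tag sequence by brute force, scoring every candidate with the HMM decomposition above. The toy probability tables and the two-tag tagset are invented purely for illustration; the point is that the number of candidates grows as T^N, which is why the rest of the tutorial is about smarter search.

    import itertools, math

    # Toy (invented) transition and emission probabilities
    P_T = {("<s>", "JJ"): 0.3, ("<s>", "NN"): 0.5, ("JJ", "NN"): 0.6,
           ("JJ", "JJ"): 0.1, ("JJ", "</s>"): 0.1, ("NN", "NN"): 0.3,
           ("NN", "JJ"): 0.1, ("NN", "</s>"): 0.4}
    P_E = {("natural", "JJ"): 0.2, ("natural", "NN"): 0.01,
           ("language", "NN"): 0.1, ("language", "JJ"): 0.01}
    TAGS = ["NN", "JJ"]

    def neg_log_prob(words, tags, smooth=1e-10):
        """-log( P(Y) * P(X|Y) ) for one candidate tag sequence."""
        score, prev = 0.0, "<s>"
        for word, tag in zip(words, tags):
            score += -math.log(P_T.get((prev, tag), smooth))   # transition
            score += -math.log(P_E.get((word, tag), smooth))   # emission
            prev = tag
        return score + -math.log(P_T.get((prev, "</s>"), smooth))

    words = ["natural", "language"]
    # Brute force: enumerate all T^N tag sequences and keep the best one
    best = min(itertools.product(TAGS, repeat=len(words)),
               key=lambda tags: neg_log_prob(words, tags))
    print(best)   # ('JJ', 'NN')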

Finding POS Tags with Markov Models
The best path through the lattice is our POS sequence.
[Figure: a lattice over the sentence "natural language processing ( nlp )", with a start node 0:<S> and candidate nodes i:NN, i:JJ, i:VB, i:LRB, i:RRB for each word position i; the best path corresponds to the tag sequence <s> JJ NN NN LRB NN RRB.]

Remember: Viterbi Algorithm Steps
Forward step: calculate the best path to each node, i.e. the path with the lowest negative log probability.
Backward step: reproduce the path.
This is easy, almost the same as for word segmentation.

Forward Step: Part 1
First, calculate the transition from <S> and the emission of the first word for every POS:
    best_score[1 NN]  = -log P_T(NN|<S>)  + -log P_E(natural|NN)
    best_score[1 JJ]  = -log P_T(JJ|<S>)  + -log P_E(natural|JJ)
    best_score[1 VB]  = -log P_T(VB|<S>)  + -log P_E(natural|VB)
    best_score[1 LRB] = -log P_T(LRB|<S>) + -log P_E(natural|LRB)
    best_score[1 RRB] = -log P_T(RRB|<S>) + -log P_E(natural|RRB)

Forward Step: Middle Parts
For middle words, calculate the minimum score over all possible previous POS tags:
    best_score[2 NN] = min(
        best_score[1 NN]  + -log P_T(NN|NN)  + -log P_E(language|NN),
        best_score[1 JJ]  + -log P_T(NN|JJ)  + -log P_E(language|NN),
        best_score[1 VB]  + -log P_T(NN|VB)  + -log P_E(language|NN),
        best_score[1 LRB] + -log P_T(NN|LRB) + -log P_E(language|NN),
        best_score[1 RRB] + -log P_T(NN|RRB) + -log P_E(language|NN),
        ... )
    best_score[2 JJ] = min(
        best_score[1 NN]  + -log P_T(JJ|NN)  + -log P_E(language|JJ),
        best_score[1 JJ]  + -log P_T(JJ|JJ)  + -log P_E(language|JJ),
        best_score[1 VB]  + -log P_T(JJ|VB)  + -log P_E(language|JJ),
        ... )

Forward Step: Final Part
Finish the sentence with the sentence-final symbol:
    best_score[I+1 </S>] = min(
        best_score[I NN]  + -log P_T(</S>|NN),
        best_score[I JJ]  + -log P_T(</S>|JJ),
        best_score[I VB]  + -log P_T(</S>|VB),
        best_score[I LRB] + -log P_T(</S>|LRB),
        best_score[I RRB] + -log P_T(</S>|RRB),
        ... )
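
Putting the three forward parts together with the backward (backtracking) step gives the full Viterbi tagger. The sketch below is one possible Python version, not the tutorial's reference solution; it assumes the model is available as hypothetical nested dictionaries trans[prev][next] and emit[tag][word] holding probabilities, and it smooths missing entries with a tiny constant so the logs are always defined.

    import math

    def viterbi(words, tags, trans, emit, smooth=1e-10):
        """Plain Viterbi decoding for an HMM POS tagger (a sketch)."""
        I = len(words)
        best_score = {"0 <s>": 0.0}   # negative log prob of the best path to each node
        best_edge = {"0 <s>": None}   # back-pointer to the previous node

        # Forward step over all word positions; for the first word the only
        # possible previous tag is <s> (Part 1), afterwards all tags (Middle Parts)
        for i in range(I):
            prev_tags = ["<s>"] if i == 0 else tags
            for nxt in tags:
                for prev in prev_tags:
                    p_key = f"{i} {prev}"
                    score = (best_score[p_key]
                             + -math.log(trans.get(prev, {}).get(nxt, smooth))
                             + -math.log(emit.get(nxt, {}).get(words[i], smooth)))
                    n_key = f"{i+1} {nxt}"
                    if n_key not in best_score or score < best_score[n_key]:
                        best_score[n_key] = score
                        best_edge[n_key] = p_key

        # Final Part: transition into the sentence-final symbol </s>
        end_key = f"{I+1} </s>"
        for prev in (tags if I > 0 else ["<s>"]):
            p_key = f"{I} {prev}"
            score = best_score[p_key] + -math.log(trans.get(prev, {}).get("</s>", smooth))
            if end_key not in best_score or score < best_score[end_key]:
                best_score[end_key] = score
                best_edge[end_key] = p_key

        # Backward step: follow the back-pointers from </s> to recover the tags
        path, node = [], best_edge[end_key]
        while node is not None and node != "0 <s>":
            path.append(node.split(" ")[1])
            node = best_edge[node]
        return list(reversed(path))

With suitable trans and emit dictionaries, calling viterbi("natural language processing ( nlp )".split(), ["NN", "JJ", "VB", "LRB", "RRB"], trans, emit) reproduces the lattice computation illustrated above.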

Viterbi Algorithm and Time
The time taken by the Viterbi algorithm depends on:
    the type of problem: POS tagging? word segmentation? parsing?
    the length of the sentence: longer sentence = more time
    the number of tags: more tags = more time
What is the time complexity of HMM POS tagging, where T = number of tags and N = length of the sentence?

Simple Viterbi Doesn't Scale
Tagging:
    Named entity recognition: T = types of named entities (100s to 1000s)
    Supertagging: T = grammar rules (100s)
Other difficult search problems:
    Parsing: T * N^3
    Speech recognition: (frames) * (WFST states, millions)
    Machine translation: NP-complete

Two Popular Solutions
Beam search: remove low-probability partial hypotheses.
    + Simple, search time is stable
    - Might not find the best answer
A* search: depth-first search with a heuristic function estimating the cost of processing the remaining hypotheses.
    + Faster than Viterbi, exact
    - Must be able to create a heuristic, search time is not stable

Beam Search

Beam Search
Choose a beam of B hypotheses.
Do the Viterbi algorithm, but keep only the best B hypotheses at each step.
The definition of "step" depends on the task:
    Tagging: same number of words tagged
    Machine translation: same number of words translated
    Speech recognition: same number of frames processed

Calculate Best Scores (First Word)
Calculate the best scores for the first word:
    best_score[1 NN]  = -3.1
    best_score[1 JJ]  = -4.2
    best_score[1 VB]  = -5.4
    best_score[1 LRB] = -8.2
    best_score[1 RRB] = -8.1

Keep Best B Hypotheses (w_1)
Remove the hypotheses with low scores. For example, with B=3 we keep 1:NN (-3.1), 1:JJ (-4.2), and 1:VB (-5.4), and remove 1:RRB (-8.1) and 1:LRB (-8.2).
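
The pruning itself is tiny. Here is a sketch using the scores above (the variable names are made up): since these example scores are log probabilities, the best B are the largest; with the negative log probabilities used in the pseudocode later, you would take the smallest instead (heapq.nsmallest).

    import heapq

    # Scores from the slide above; here larger (less negative) is better
    scores = {"1 NN": -3.1, "1 JJ": -4.2, "1 VB": -5.4,
              "1 LRB": -8.2, "1 RRB": -8.1}
    B = 3

    # Keep only the B best-scoring hypotheses for this step
    active = heapq.nlargest(B, scores, key=scores.get)
    print(active)   # ['1 NN', '1 JJ', '1 VB']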

Calculate Probabilities (w_2)
Calculate the scores as before, but ignore the removed hypotheses (1:LRB and 1:RRB no longer appear in the min):
    best_score[2 NN] = min(
        best_score[1 NN] + -log P_T(NN|NN) + -log P_E(language|NN),
        best_score[1 JJ] + -log P_T(NN|JJ) + -log P_E(language|NN),
        best_score[1 VB] + -log P_T(NN|VB) + -log P_E(language|NN) )
    best_score[2 JJ] = min(
        best_score[1 NN] + -log P_T(JJ|NN) + -log P_E(language|JJ),
        best_score[1 JJ] + -log P_T(JJ|JJ) + -log P_E(language|JJ),
        best_score[1 VB] + -log P_T(JJ|VB) + -log P_E(language|JJ) )
    ...

Beam Search is Faster
Removing some candidates from consideration means faster speed!
What is the time complexity, where T = number of tags, N = length of the sentence, and B = beam width?

Implementation: Forward Step

    best_score[0 <s>] = 0   # Start with <s>
    best_edge[0 <s>] = NULL
    active_tags[0] = [ <s> ]
    for i in 0 .. I-1:
        make map my_best
        for each prev in keys of active_tags[i]
            for each next in keys of possible_tags
                if best_score[i prev] and transition[prev next] exist
                    score = best_score[i prev] + -log P_T(next|prev) + -log P_E(word[i]|next)
                    if best_score[i+1 next] is new or > score
                        best_score[i+1 next] = score
                        best_edge[i+1 next] = i prev
                        my_best[next] = score
        active_tags[i+1] = best B elements of my_best
    # Finally, do the same for </s>
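
For reference, here is one way the pseudocode above could be turned into runnable Python. It is a sketch under the same assumptions as the earlier Viterbi example (hypothetical trans[prev][next] and emit[tag][word] probability dictionaries, smoothing instead of the existence checks), not the tutorial's reference solution.

    import math

    def beam_viterbi(words, tags, trans, emit, B=3, smooth=1e-10):
        """Beam-pruned Viterbi forward step plus backtrace (a sketch)."""
        I = len(words)
        best_score = {"0 <s>": 0.0}
        best_edge = {"0 <s>": None}
        active_tags = {0: ["<s>"]}

        for i in range(I):
            my_best = {}
            for prev in active_tags[i]:
                for nxt in tags:
                    p_key = f"{i} {prev}"
                    score = (best_score[p_key]
                             + -math.log(trans.get(prev, {}).get(nxt, smooth))
                             + -math.log(emit.get(nxt, {}).get(words[i], smooth)))
                    n_key = f"{i+1} {nxt}"
                    if n_key not in best_score or score < best_score[n_key]:
                        best_score[n_key] = score
                        best_edge[n_key] = p_key
                        my_best[nxt] = score
            # Keep only the B lowest-scoring (most probable) hypotheses active
            active_tags[i + 1] = sorted(my_best, key=my_best.get)[:B]

        # Finally, do the same for </s>
        end_key = f"{I+1} </s>"
        for prev in active_tags[I]:
            p_key = f"{I} {prev}"
            score = best_score[p_key] + -math.log(trans.get(prev, {}).get("</s>", smooth))
            if end_key not in best_score or score < best_score[end_key]:
                best_score[end_key] = score
                best_edge[end_key] = p_key

        # Backward step: follow the back-pointers to recover the tag sequence
        path, node = [], best_edge[end_key]
        while node is not None and node != "0 <s>":
            path.append(node.split(" ")[1])
            node = best_edge[node]
        return list(reversed(path))

With B equal to the number of tags this reduces to plain Viterbi; shrinking B trades accuracy for speed, which is what the exercise at the end asks you to measure.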

A* Search

Depth-First Search
Always expand the state with the highest score.
Use a heap (priority queue) to keep track of states:
    heap: a data structure that lets us add elements and retrieve the highest-scoring element efficiently (O(log n) per operation for a binary heap)
Start with only the initial state on the heap.
Expand the best state on the heap until the search finishes.
Compare this with breadth-first search, which expands states at the same step (Viterbi, beam search).
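
In Python the heap can be the standard heapq module, which maintains a min-heap over a list. One common pattern, shown in this small self-contained sketch with scores taken from the example lattice, is to push (negated score, state) tuples so that the highest-scoring state pops first, and to keep a set of already-processed states so that stale heap entries are skipped (as on the "do not process 2:NN" slide below).

    import heapq

    # Scores from the example lattice: larger (less negative) is better, so we
    # push the negated score because heapq always pops the smallest item first
    heap = []
    for state, score in [("1 NN", -3.1), ("1 JJ", -4.2), ("1 VB", -5.4),
                         ("1 RRB", -8.1), ("1 LRB", -8.2)]:
        heapq.heappush(heap, (-score, state))

    processed = set()                 # states that have already been expanded
    while heap:
        neg_score, state = heapq.heappop(heap)
        if state in processed:
            continue                  # stale entry with a worse score: skip it
        processed.add(state)
        print(state, -neg_score)      # pops 1 NN (-3.1) first, then 1 JJ, 1 VB, ...
        # a real search would now expand `state` and push its successors here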

Depth-First Search: Initial State
The lattice over "natural language processing" has no scores yet; only the initial state is on the heap.
Heap: 0:<S> (0)

Depth-First Search: Process 0:<S>
Expanding 0:<S> scores every tag for the first word: 1:NN (-3.1), 1:JJ (-4.2), 1:VB (-5.4), 1:RRB (-8.1), 1:LRB (-8.2).
Heap: 1:NN (-3.1), 1:JJ (-4.2), 1:VB (-5.4), 1:RRB (-8.1), 1:LRB (-8.2)

Depth-First Search: Process 1:NN
Expanding 1:NN scores the second word: 2:NN (-5.5), 2:VB (-5.7), 2:JJ (-6.7), 2:LRB (-11.2), 2:RRB (-11.4).
Heap: 1:JJ (-4.2), 1:VB (-5.4), 2:NN (-5.5), 2:VB (-5.7), 2:JJ (-6.7), 1:RRB (-8.1), 1:LRB (-8.2), 2:LRB (-11.2), 2:RRB (-11.4)

Depth-First Search: Process 1:JJ (candidates from 1:NN vs. from 1:JJ)
Expanding 1:JJ gives new candidates for the second word: 2:NN (-5.3), 2:JJ (-5.9), 2:VB (-7.2), 2:RRB (-11.7), 2:LRB (-11.9). The candidates for 2:NN and 2:JJ beat the scores found from 1:NN, so they are pushed onto the heap.
Heap: 2:NN (-5.3), 1:VB (-5.4), 2:NN (-5.5), 2:VB (-5.7), 2:JJ (-5.9), 2:JJ (-6.7), 1:RRB (-8.1), 1:LRB (-8.2), 2:LRB (-11.2), 2:RRB (-11.4)

Depth-First Search: Process 1:JJ (result)
The lattice now holds the improved scores 2:NN (-5.3) and 2:JJ (-5.9); 2:VB, 2:LRB, and 2:RRB keep their old scores. The stale entries 2:NN (-5.5) and 2:JJ (-6.7) remain on the heap.
Heap: 2:NN (-5.3), 1:VB (-5.4), 2:NN (-5.5), 2:VB (-5.7), 2:JJ (-5.9), 2:JJ (-6.7), 1:RRB (-8.1), 1:LRB (-8.2), 2:LRB (-11.2), 2:RRB (-11.4)

Depth-First Search: Process 2:NN
The best state, 2:NN (-5.3), is popped and expanded, scoring the third word: 3:NN (-7.2), 3:VB (-7.3), 3:JJ (-9.8), 3:LRB (-16.3), 3:RRB (-17.0).
Heap: 1:VB (-5.4), 2:NN (-5.5), 2:VB (-5.7), 2:JJ (-5.9), 2:JJ (-6.7), 3:NN (-7.2), 3:VB (-7.3), 1:RRB (-8.1), 1:LRB (-8.2), 3:JJ (-9.8), 2:LRB (-11.2), ...

Depth-First Search: Process 1:VB
Next, 1:VB (-5.4) is popped and expanded, but none of its candidates for the second word (e.g. 2:JJ at -8.9 vs. the existing -5.9, 2:VB at -12.7 vs. -5.7) beat the scores already in the lattice.
Heap: 2:NN (-5.5), 2:VB (-5.7), 2:JJ (-5.9), 2:JJ (-6.7), 3:NN (-7.2), 3:VB (-7.3), 1:RRB (-8.1), 1:LRB (-8.2), 3:JJ (-9.8), 2:LRB (-11.2), ...

Depth-First Search: Do Not Process 2:NN Again
The next state popped is the stale entry 2:NN (-5.5), but 2:NN has already been processed (with the better score -5.3), so it is skipped.
Heap: 2:VB (-5.7), 2:JJ (-5.9), 2:JJ (-6.7), 3:NN (-7.2), 3:VB (-7.3), 1:RRB (-8.1), 1:LRB (-8.2), 3:JJ (-9.8), 2:LRB (-11.2), ...

Problem: Still Inefficient
Depth-first search does not work well for long sentences.
Why? Hint: think of 1:VB in the previous example.

A* Search: Add Optimistic Heuristic
Consider the words remaining, and use an optimistic heuristic: the BEST score possible for them.
Optimistic heuristic for tagging: the best emission probability of each remaining word.

                      natural   language   processing
    log P(word|NN)     -2.4       -2.4       -2.5
    log P(word|JJ)     -2.0       -3.0       -3.4
    log P(word|VB)     -3.1       -3.2       -1.5
    log P(word|LRB)    -7.0       -7.9       -6.9
    log P(word|RRB)    -7.0       -7.9       -6.9

    H(1+) = -5.9    H(2+) = -3.9    H(3+) = -1.5    H(4+) = 0.0
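
The heuristic table can be precomputed in a single backward pass over the sentence before the search starts. The sketch below (again assuming a hypothetical emit[tag][word] probability dictionary) builds H as a suffix sum of each remaining word's best emission log probability.

    import math

    def optimistic_heuristic(words, tags, emit, smooth=1e-10):
        """H[i] = best possible emission score for words[i:], i.e. the suffix sum
        of each remaining word's best emission log probability.
        H[0] corresponds to the slide's H(1+)."""
        best_emit = [max(math.log(emit.get(t, {}).get(w, smooth)) for t in tags)
                     for w in words]
        H = [0.0] * (len(words) + 1)          # no words remaining => heuristic is 0
        for i in range(len(words) - 1, -1, -1):
            H[i] = H[i + 1] + best_emit[i]
        return H

    # With emission probabilities matching the table above, this gives
    # H ≈ [-5.9, -3.9, -1.5, 0.0] for "natural language processing".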

A* Search: Add Optimistic Heuristic
Use forward score + optimistic heuristic to order the heap (the two columns below are separate orderings of the same states):

    Regular heap (by F)              A* heap (by F + H)
    2:VB   F=-5.7   H(3+)=-1.5       3:NN   -7.2
    2:JJ   F=-5.9   H(3+)=-1.5       2:VB   -7.2
    2:JJ   F=-6.7   H(3+)=-1.5       3:VB   -7.3
    3:NN   F=-7.2   H(4+)=-0.0       2:JJ   -7.4
    3:VB   F=-7.3   H(4+)=-0.0       2:JJ   -8.2
    1:RRB  F=-8.1   H(2+)=-3.9       3:JJ   -9.8
    1:LRB  F=-8.2   H(2+)=-3.9       1:RRB  -12.0
    3:JJ   F=-9.8   H(4+)=-0.0       1:LRB  -12.1
    2:LRB  F=-11.2  H(3+)=-1.5       2:LRB  -12.7
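
In code, the only change from the plain depth-first heap is the priority used when pushing: forward score plus heuristic. A small sketch reusing the numbers from these slides:

    import heapq

    # Heuristic for the words not yet tagged, as computed on the previous slide
    H = {1: -5.9, 2: -3.9, 3: -1.5, 4: 0.0}

    # (state, forward score F) pairs taken from the regular heap above
    states = [("2 VB", -5.7), ("2 JJ", -5.9), ("3 NN", -7.2), ("1 RRB", -8.1)]

    heap = []
    for state, forward in states:
        done = int(state.split(" ")[0])      # number of words already tagged
        priority = forward + H[done + 1]     # A* priority: F + optimistic H
        heapq.heappush(heap, (-priority, state))

    while heap:
        neg_priority, state = heapq.heappop(heap)
        print(state, round(-neg_priority, 1))
    # 2 VB and 3 NN (both -7.2) now come out before 2 JJ (-7.4) and 1 RRB (-12.0)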

Exercise

Exercise
Write test-hmm-beam and test the program:
    Input:  test/05-{train,test}-input.txt
    Answer: test/05-{train,test}-answer.txt
Train an HMM model on data/wiki-en-train.norm_pos and run the program on data/wiki-en-test.norm.
Measure the accuracy of your tagging with:
    script/gradepos.pl data/wiki-en-test.pos my_answer.pos
Report the accuracy for different beam sizes.
Challenge: implement A* search.

Thank You!