Grammars and introduction to machine learning. Computers Playing Jeopardy! course, Stony Brook University




Last class: grammars and parsing in Prolog. Example grammar fragment: S -> NP VP, VP -> Verb NP, with lexical rules such as Noun -> roller and Verb -> thrills, illustrated on the slide with a parse tree for the sentence "A roller coaster thrills every teenager".

Today: ambiguity in NLP.
Example: "books" can be a NOUN or a VERB:
    You do not need books for this class.
    She books her trips early.
Another example: "Thank you for not smoking or playing iPods without earphones." can be misread as "Thank you for not smoking (...) without earphones."
These cases can be detected as special uses of the same word.
Caveat: if we write too many rules, we may write unnatural grammars; special rules become general rules, and it puts too large a burden on the person writing the grammar.
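In the Prolog/DCG style of the previous lecture, lexical ambiguity can be encoded simply by listing the same word form under two categories; the parser then picks whichever reading lets the surrounding sentence parse. The toy lexicon below is an assumption for illustration, not the course grammar:

    % "books" is both a noun and a verb in this toy lexicon.
    noun --> [books].          % "You do not need books for this class."
    verb --> [books].          % "She books her trips early."

    np --> [she].
    np --> [her, trips].
    np --> noun.
    vp --> verb, np.

    % ?- phrase(np, [books]).              % "books" read as a noun
    % true.
    % ?- phrase(vp, [books, her, trips]).  % "books" read as a verb
    % true.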

Ambiguity in parsing. Grammar: S -> NP VP, NP -> Det N, NP -> NP PP, VP -> V NP, VP -> VP PP, PP -> P NP; lexicon: NP -> Papa, N -> caviar, N -> spoon, V -> ate, V -> spoon, P -> with, Det -> the, Det -> a.

The sentence "Papa ate the caviar with a spoon" has two parses, drawn as two parse trees on the slides: in one, the PP "with a spoon" attaches to the NP "the caviar"; in the other, it attaches to the VP "ate the caviar".

Scores. The slide shows a pipeline diagram: a parser takes a grammar and test sentences and produces parse trees; a scorer compares them against the correct test trees to compute accuracy. Recent parsers are quite accurate, good enough to help NLP tasks!

Speech processing ambiguity. Speech processing is a very hard problem (gender, accent, background noise). Solution: n-grams.
Letter or word frequencies (1-grams), e.g., THE, COURSE: useful in solving cryptograms.
If you know the previous letter/word (2-grams): h is rare in English (4%; 4 points in Scrabble), but h is common after t (20%)!
If you know the previous 2 letters/words (3-grams): h is really common after (space) t.

N-grams. An n-gram is a contiguous sequence of n items from a given sequence; the items can be letters, words, phonemes, syllables, etc.
Example: consider the paragraph "Suppose you should be walking down Broadway after dinner, with ten minutes allotted to the consummation of your cigar while you are choosing between a diverting tragedy and something serious in the way of vaudeville. Suddenly a hand is laid upon your arm." (from "The Green Door" in the collection The Four Million, by O. Henry)
    unigrams (n-grams of size 1): Suppose, you, should, be, walking, ...
    bigrams (n-grams of size 2): Suppose you, you should, should be, be walking, ...
    trigrams (n-grams of size 3): Suppose you should, you should be, should be walking, ...
    four-grams: Suppose you should be, you should be walking, ...
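As a minimal Prolog sketch (the predicate name bigrams/2 and the token-list representation are assumptions, not part of the course code), the bigrams of a tokenized text can be extracted by pairing each token with its successor:

    % bigrams(+Tokens, -Bigrams): pair each token with its successor.
    bigrams([], []).
    bigrams([_], []).
    bigrams([W1, W2 | Rest], [(W1, W2) | Bs]) :-
        bigrams([W2 | Rest], Bs).

    % ?- bigrams([suppose, you, should, be, walking], B).
    % B = [(suppose,you), (you,should), (should,be), (be,walking)].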

N-gram models. An n-gram model is a type of probabilistic language model for predicting the next item in such a sequence.
Example: from a novel, we can extract all bigrams (sequences of 2 words) with their probabilities of showing up:
    probability("you can") = number of "you can"s in the text / total number of bigrams = 0.002
    probability("you should") = number of "you should"s in the text / total number of bigrams = 0.001
    etc.
Suppose the current word is "you" and we want to predict the next word: "can" (bigram probability 0.002) is twice as likely to follow as "should" (bigram probability 0.001).
Advantages: easy training, simple/intuitive probabilistic model.
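For predicting the word after "you", what matters is how often each candidate follows "you". Here is a minimal Prolog sketch (next_word_prob/4 and count_in/3 are assumed names, not course code; it reuses bigrams/2 from the sketch above) that estimates P(Next | Current) by relative frequency:

    % count_in(+X, +List, -N): number of occurrences of X in List.
    count_in(X, List, N) :-
        findall(1, member(X, List), Ones),
        length(Ones, N).

    % next_word_prob(+Tokens, +Current, +Next, -P):
    % estimate P(Next | Current) = count(Current Next) / count(Current).
    next_word_prob(Tokens, Current, Next, P) :-
        bigrams(Tokens, Bigrams),
        count_in((Current, Next), Bigrams, NumPair),
        count_in(Current, Tokens, NumCurrent),
        NumCurrent > 0,
        P is NumPair / NumCurrent.

    % ?- next_word_prob([you, can, go, if, you, should], you, can, P).
    % P = 0.5.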

N-grams in NLP. N-gram models are widely used in statistical natural language processing (NLP):
    In speech recognition, phonemes and sequences of phonemes are modeled using an n-gram distribution: due to a multitude of accents, various voice pitches, gender, etc., speech recognition is very error prone, so information about the predicted next phonemes is very useful for detecting the exact transcription.
    In parsing text, words are modeled such that each n-gram is composed of n words. There is an infinite number of ways to express ideas in natural language; however, there are common and standardized ways used by humans to convey information. A probabilistic model is useful to the parser for constructing the data structure that best fits the grammatical rules.
    In language identification, sequences of characters are modeled for different languages (e.g., in English, t and w are followed by an h 90% of the time).

N-grams in NLP. Other applications: plagiarism detection, finding candidates for the correct spelling of a misspelled word, optical character recognition (OCR), machine translation, and correction codes (correcting words that were garbled during transmission).
Modern statistical models are typically made up of two parts: a prior distribution describing the inherent likelihood of a possible result, and a likelihood function used to assess the compatibility of a possible result with observed data.

Probabilities and statistics:
    descriptive: mean scores
    confirmatory: statistically significant?
    predictive: what will follow?
Probability notation p(X | Y):
    p(Paul Revere wins | weather's clear) = 0.9, i.e., Revere's won 90% of races with clear weather
    p(win | clear) = p(win, clear) / p(clear)
    (the bar notation is syntactic sugar; "win, clear" is a logical conjunction of predicates; "clear" is the predicate selecting races where the weather's clear)
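As a quick worked instance of the definition (the numbers are made up for illustration, chosen to be consistent with the 0.9 on the slide): if clear weather occurs in 30% of races and "Revere wins and the weather is clear" holds in 27% of races, then

    p(win | clear) = p(win, clear) / p(clear) = 0.27 / 0.30 = 0.9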

Properties of p (axioms):
    p(∅) = 0
    p(all outcomes) = 1
    p(X) ≤ p(Y) for any X ⊆ Y
    p(X ∪ Y) = p(X) + p(Y) provided X ∩ Y = ∅
    example: p(win & clear) + p(win & ¬clear) = p(win)

Properties and conjunction.
What happens as we add conjuncts to the left of the bar?
    p(Paul Revere wins, Valentine places, Epitaph shows | weather's clear)
    The probability can only decrease.
What happens as we add conjuncts to the right of the bar?
    p(Paul Revere wins | weather's clear, ground is dry, jockey getting over sprain)
    The probability can increase or decrease.
Simplifying the right side (backing off): dropping some of the conditions,
    p(Paul Revere wins | weather's clear, ground is dry, jockey getting over sprain) ≈ p(Paul Revere wins | weather's clear),
is often a reasonable estimate.

Factoring the left side: the chain rule.
    p(Revere, Valentine, Epitaph | weather's clear)
        = p(Revere | Valentine, Epitaph, weather's clear)
        * p(Valentine | Epitaph, weather's clear)
        * p(Epitaph | weather's clear)
True because numerators cancel against denominators; makes perfect sense when read from bottom to top. (The slide illustrates this with a decision tree over the Epitaph?, Valentine?, Revere? outcomes, whose branch probabilities multiply out as 1/3 * 1/5 * 1/4.)

Factoring the left side: the chain rule (continued).
    p(Revere | Valentine, Epitaph, weather's clear)
If this probability is unchanged by backoff, we say Revere was CONDITIONALLY INDEPENDENT of Valentine and Epitaph (conditioned on the weather's being clear). Conditional independence lets us use backed-off data from all four of these cases to estimate their shared probabilities. (On the slide, every Revere? branch of the decision tree has the same probabilities, yes 1/4 and no 3/4, regardless of the Valentine? and Epitaph? outcomes; given clear weather, those outcomes are irrelevant.)

Bayes' Theorem.
    p(A | B) = p(B | A) * p(A) / p(B)
Easy to check by removing the syntactic sugar.
Use 1: converts p(B | A) to p(A | B).
Use 2: updates p(A) to p(A | B).
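A quick worked use of the theorem, again with made-up numbers (only p(win | clear) = 0.9 comes from the slides; p(clear) = 0.3 and p(win) = 0.4 are assumptions for illustration):

    p(clear | win) = p(win | clear) * p(clear) / p(win) = 0.9 * 0.3 / 0.4 = 0.675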

Probabilistic algorithms. Example: the Viterbi algorithm computes the probability of a sequence of observed events and the most likely sequence of hidden states (the Viterbi path) that results in that sequence of observed events.
http://www.cs.stonybrook.edu/~pfodor/old_page/viterbi/viterbi.p

    forward_viterbi(+Observations, +States, +Start_probabilities,
                    +Transition_probabilities, +Emission_probabilities,
                    -Prob, -Viterbi_path, -Viterbi_prob)

The call

    forward_viterbi(['walk', 'shop', 'clean'], ['Rainy', 'Sunny'],
                    [0.6, 0.4],
                    [[0.7, 0.3], [0.4, 0.6]],
                    [[0.1, 0.4, 0.5], [0.6, 0.3, 0.1]],
                    Prob, Viterbi_path, Viterbi_prob)

will return: Prob = 0.03361, Viterbi_path = [Sunny, Rainy, Rainy, Rainy], Viterbi_prob = 0.0094.

Viterbi algorithm. Alice and Bob live far apart from each other. Bob does three activities: walks in the park, shops, and cleans his apartment. Alice has no definite information about the weather where Bob lives and tries to guess what the weather is based on what Bob does:
    The weather operates as a discrete Markov chain.
    There are two states, "Rainy" and "Sunny", hidden from Alice.
    start_probability = {'Rainy': 0.6, 'Sunny': 0.4}
(Wikipedia)

Viterbi algorithm: a dynamic programming algorithm.
Input: a first-order hidden Markov model (HMM):
    states Y
    initial probabilities π_i of being in state i
    transition probabilities a_{i,j} of transitioning from state i to state j
    observations x_0, ..., x_T
Output: the state sequence y_0, ..., y_T most likely to have produced the observations.
V_{t,k} is the probability of the most probable state sequence responsible for the first t + 1 observations that ends in state k:
    V_{0,k} = P(x_0 | k) * π_k
    V_{t,k} = P(x_t | k) * max_{y ∈ Y} (a_{y,k} * V_{t-1,y})
(Wikipedia)

Viterbi algorithm. The transition_probability represents the change of the weather:
    transition_probability = {
        'Rainy' : {'Rainy': 0.7, 'Sunny': 0.3},
        'Sunny' : {'Rainy': 0.4, 'Sunny': 0.6}}
The emission_probability represents how likely Bob is to perform a certain activity on each day:
    emission_probability = {
        'Rainy' : {'walk': 0.1, 'shop': 0.4, 'clean': 0.5},
        'Sunny' : {'walk': 0.6, 'shop': 0.3, 'clean': 0.1}}
(Wikipedia)

Viterbi algorithm. Alice talks to Bob and discovers the history of his activities:
    on the first day he went for a walk
    on the second day he went shopping
    on the third day he cleaned his apartment
    ['walk', 'shop', 'clean']
What is the most likely sequence of rainy/sunny days that would explain these observations? (Wikipedia)
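To make the V(t,k) recursion concrete for this example, here is a minimal, self-contained Prolog sketch. It is not the course's viterbi.p: the predicate names (viterbi/3, viterbi_steps/3) and the fact-based encoding of the probabilities are assumptions, and it scores only the three observed days, so its output is not directly comparable to the forward_viterbi results quoted earlier.

    % Hidden states and the probabilities from the slides.
    state(rainy).  state(sunny).

    start_p(rainy, 0.6).  start_p(sunny, 0.4).

    trans_p(rainy, rainy, 0.7).  trans_p(rainy, sunny, 0.3).
    trans_p(sunny, rainy, 0.4).  trans_p(sunny, sunny, 0.6).

    emit_p(rainy, walk, 0.1).  emit_p(rainy, shop, 0.4).  emit_p(rainy, clean, 0.5).
    emit_p(sunny, walk, 0.6).  emit_p(sunny, shop, 0.3).  emit_p(sunny, clean, 0.1).

    % viterbi(+Observations, -Path, -Prob): most likely hidden state sequence
    % and its probability (uses max_member/2 from SWI-Prolog's library(lists)).
    viterbi([Obs | Rest], Path, Prob) :-
        % Initialization: V(0,k) = P(x0 | k) * pi_k; the path so far is [k].
        findall(V-[K],
                ( state(K), start_p(K, Pi), emit_p(K, Obs, E), V is Pi * E ),
                Columns0),
        viterbi_steps(Rest, Columns0, Columns),
        max_member(Prob-RevPath, Columns),
        reverse(RevPath, Path).

    viterbi_steps([], Columns, Columns).
    viterbi_steps([Obs | Rest], Columns0, Columns) :-
        % Recursion: V(t,k) = P(xt | k) * max_y( a_{y,k} * V(t-1,y) ),
        % keeping, for each state k, the best reversed path ending in k.
        findall(V-[K | BestPrev],
                ( state(K), emit_p(K, Obs, E),
                  findall(Cand-Prev,
                          ( member(PrevV-Prev, Columns0),
                            Prev = [PrevState | _],
                            trans_p(PrevState, K, A),
                            Cand is PrevV * A ),
                          Cands),
                  max_member(BestCand-BestPrev, Cands),
                  V is E * BestCand ),
                Columns1),
        viterbi_steps(Rest, Columns1, Columns).

    % ?- viterbi([walk, shop, clean], Path, Prob).
    % Path = [sunny, rainy, rainy], Prob = 0.01344 (modulo floating-point rounding).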
