CS 533: Natural Language Processing
Lecture 03: N-Gram Models and Algorithms

Word Prediction

Suppose you read the following sequence of words:

    Sue swallowed the large green...

What word do you expect to see next? pill? frog? tree? car? mountain? the? that?

We might want to use syntactic or semantic knowledge to answer this question...
Word Prediction

But suppose we just use probabilities:

    P(pill | sue, swallowed, the, large, green)
    P(tree | sue, swallowed, the, large, green)
    P(that | sue, swallowed, the, large, green)
    ...

Generally:

    $P(w_i \mid w_1, w_2, \ldots, w_{i-1}) = \frac{P(w_1, w_2, \ldots, w_{i-1}, w_i)}{P(w_1, w_2, \ldots, w_{i-1})}$

Chain Rule

    $P(w_1, w_2, \ldots, w_n) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2) \cdots P(w_n \mid w_1, \ldots, w_{n-1}) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})$

Note: Also add markers <s> and </s> at the beginning and end of the sentence.
Markov Assumption

    $P(w_i \mid w_1, \ldots, w_{i-1}) \approx$
        unigram: $P(w_i)$
        bigram: $P(w_i \mid w_{i-1})$
        trigram: $P(w_i \mid w_{i-2}, w_{i-1})$

Example

    P(sue, swallowed, the, large, green, pill)
        = P(sue | <s>) * P(swallowed | sue) * P(the | swallowed) * P(large | the)
          * P(green | large) * P(pill | green) * P(</s> | pill)

using a bigram language model.
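To make the bigram factorization concrete, here is a minimal Python sketch that scores this sentence with a bigram model. The probability table and its values are hypothetical, made up purely for illustration.

```python
import math

# Hypothetical bigram probabilities P(w_i | w_{i-1}); the values are invented for illustration.
bigram_prob = {
    ("<s>", "sue"): 0.1, ("sue", "swallowed"): 0.05, ("swallowed", "the"): 0.3,
    ("the", "large"): 0.02, ("large", "green"): 0.01, ("green", "pill"): 0.005,
    ("pill", "</s>"): 0.2,
}

def sentence_logprob(words):
    """Sum log P(w_i | w_{i-1}) over the sentence, with <s> and </s> markers added."""
    padded = ["<s>"] + words + ["</s>"]
    return sum(math.log(bigram_prob[(prev, cur)])
               for prev, cur in zip(padded, padded[1:]))

print(sentence_logprob(["sue", "swallowed", "the", "large", "green", "pill"]))
```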
Estimating Probabilities

The Maximum Likelihood Estimate (MLE) is:

    $P(w_i \mid w_{i-1}) = \frac{\text{count}(w_{i-1}\, w_i)}{\text{count}(w_{i-1})}$

using a bigram language model, and

    $P(w_i \mid w_{i-2}, w_{i-1}) = \frac{\text{count}(w_{i-2}\, w_{i-1}\, w_i)}{\text{count}(w_{i-2}\, w_{i-1})}$

for example,

    $P(\text{pill} \mid \text{large}, \text{green}) = \frac{\text{count}(\text{large green pill})}{\text{count}(\text{large green})}$

using a trigram language model.

Example

Suppose the text is:

    <s> I am Sam </s>
    <s> Sam I am </s>
    <s> I do not like green eggs and ham </s>
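A short sketch of how the MLE bigram estimates fall out of the counts for this toy corpus (whitespace tokenization assumed, with the sentence markers already in the text):

```python
from collections import Counter

# The toy corpus from the slide, with sentence markers already in place.
corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def mle_bigram(prev, word):
    """MLE estimate P(word | prev) = count(prev word) / count(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

print(mle_bigram("<s>", "I"))   # 2/3
print(mle_bigram("I", "am"))    # 2/3
print(mle_bigram("am", "Sam"))  # 1/2
```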
What Should We Count?

From the Brown corpus:

    He stepped out into the hall, was delighted to encounter a water brother.

This sentence has 15 words if we count the comma and the period as words, 13 otherwise.

From the Switchboard corpus:

    I do uh main- mainly business data processing

Should we count the filler "uh"? Should we count the fragment "main-"?

What Should We Count?

Tokens and Types:

    They picnicked by the pool, then lay back on the grass and looked at the stars.

This sentence has 16 tokens (not counting punctuation) and 14 types.

Common notation for a corpus:
    V = the size of the vocabulary, or the number of types.
    N = the total number of running words, or the number of tokens.
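A quick way to see the token/type distinction in code. The tokenizer below is a deliberately naive assumption (strip punctuation, keep case), good enough for this one sentence:

```python
import re

sentence = "They picnicked by the pool, then lay back on the grass and looked at the stars."

# Naive tokenization: punctuation is stripped, case is kept,
# so "They" and "the" remain distinct types here.
tokens = re.findall(r"[A-Za-z-]+", sentence)
types = set(tokens)

print(len(tokens))  # 16 tokens
print(len(types))   # 14 types ("the" occurs three times)
```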
What Should We Count?

Should we distinguish cats and cat? geese and goose?

Terminology:
    Lemma: a set of lexical forms having the same stem, major part of speech, and rough word sense.
    Wordform: a fully inflected surface form.

For N-grams, we usually use wordforms.

Large Corpora

Brown et al. (1992):
    583,000,000 wordform tokens.
    293,181 wordform types.

Google N-gram Release (2006):
    1,024,908,267,229 tokens.
    13,588,391 wordform types.

Why so many types? Misspellings, numbers, names, acronyms, ...
Training and Testing

Training Set (~80%): often includes separate held-out data to estimate parameters of the model.

Test Set, often split into:
    development set (~10%), to be used in the iterative process of developing an algorithm.
    final test set (~10%).

Cardinal Rule: Don't train your model on the test set!

Chomsky's Argument

Syntactic Structures (1957):

    ... Second, the notion "grammatical" cannot be identified with "meaningful" or "significant" in any semantic sense. Sentences (1) and (2) are equally nonsensical, but any speaker of English will recognize that only the former is grammatical.

    (1) Colorless green ideas sleep furiously.
    (2) Furiously sleep ideas green colorless.
    ...
Chomsky's Argument

Syntactic Structures (1957):

    ... Third, the notion "grammatical in English" cannot be identified in any way with the notion "high order of statistical approximation to English." It is fair to assume that neither sentence (1) nor (2) (nor indeed any part of these sentences) has ever occurred in an English discourse. Hence, in any statistical model for grammaticalness, these sentences will be ruled out on identical grounds as equally 'remote' from English. Yet (1), though nonsensical, is grammatical, while (2) is not. ...

Problem: Sparse Data

Bigram counts for the Berkeley Restaurant Project corpus of 9332 sentences (J&M website):

    [bigram count table omitted]

What to do about the zero counts?
Problem: Sparse Data

Divide bigram counts by unigram counts:

    [unigram count table omitted]

to compute bigram probabilities:

    [bigram probability table omitted]

Zipf's Law

    [rank-frequency plot omitted]
Zipf's Law

Brown Corpus: $f \cdot r = k$

    [log-log plot of frequency vs. rank omitted]

Mandelbrot's Formula

Brown Corpus: $f = P\,(r + \rho)^{-B}$

    [log-log plot of frequency vs. rank omitted]
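A rough empirical check of Zipf's law: rank the word types of any sizable corpus by frequency and look at the product f * r, which should be roughly constant. A sketch (the file name in the usage comment is a placeholder; any large plain-text corpus would do):

```python
from collections import Counter

def zipf_check(words):
    """Rank word types by frequency and print f * r, which Zipf's law says is roughly constant."""
    freqs = sorted(Counter(words).values(), reverse=True)
    for rank in (1, 10, 100, 1000):
        if rank <= len(freqs):
            f = freqs[rank - 1]
            print(f"rank {rank:>4}: f = {f:>7}  f*r = {f * rank}")

# Usage (hypothetical file name):
# zipf_check(open("brown_corpus.txt").read().lower().split())
```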
Basic Idea: Smoothing

Decrease the probability of the previously seen events, so that there is some probability left over for the unseen events.

Several methods:
    Laplace (or "add one") smoothing.
    Good-Turing discounting.
    Interpolation.
    Katz backoff.

Laplace Smoothing

Laplace's Law (1814): Add one to all the counts! Thus, for bigrams:

    $P^*_{\mathrm{LAP}}(w_i \mid w_{i-1}) = \frac{\text{count}(w_{i-1}\, w_i) + 1}{\text{count}(w_{i-1}) + V}$

Alternatively, compute a "reconstructed count":

    $c^*(w_{i-1}\, w_i) = \frac{[\text{count}(w_{i-1}\, w_i) + 1] \cdot \text{count}(w_{i-1})}{\text{count}(w_{i-1}) + V}$
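A small sketch of both forms of Laplace smoothing. The counts below use the "want to" figures quoted a couple of slides later, with count(want) assumed to be 927 and V = 1446 as in the J&M restaurant-corpus tables; treat them as illustrative.

```python
from collections import Counter

# Illustrative counts (assumed from the restaurant-corpus discussion).
unigrams = Counter({"want": 927, "to": 2417, "chinese": 158})
bigrams = Counter({("want", "to"): 608, ("chinese", "food"): 82})
V = 1446  # number of word types in the vocabulary

def laplace_prob(prev, word):
    """Add-one estimate: P*(w_i | w_{i-1}) = (count(w_{i-1} w_i) + 1) / (count(w_{i-1}) + V)."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

def reconstructed_count(prev, word):
    """c*(w_{i-1} w_i) = (count(w_{i-1} w_i) + 1) * count(w_{i-1}) / (count(w_{i-1}) + V)."""
    return (bigrams[(prev, word)] + 1) * unigrams[prev] / (unigrams[prev] + V)

print(laplace_prob("want", "to"))          # roughly 0.26, down from the MLE 0.66
print(reconstructed_count("want", "to"))   # roughly 238, down from 608
```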
Laplace Smoothing

Smoothed bigram counts:

    [smoothed bigram count table omitted]

Laplace Smoothing

Smoothed bigram probabilities:

    $P^*_{\mathrm{LAP}}(w_i \mid w_{i-1}) = \frac{\text{count}(w_{i-1}\, w_i) + 1}{\text{count}(w_{i-1}) + V}$

    [smoothed bigram probability table omitted]
Laplace Smoothing

Reconstructed counts:

    $c^*(w_{i-1}\, w_i) = \frac{[\text{count}(w_{i-1}\, w_i) + 1] \cdot \text{count}(w_{i-1})}{\text{count}(w_{i-1}) + V}$

    [reconstructed count table omitted]

Laplace Smoothing

For N-gram models, this technique transfers too much probability mass to unseen events:
    count(want to) is reduced from 608 to 238.
    P(to | want) is reduced from 0.66 to 0.26.

We can measure this reduction by the discount $d_c = c^* / c$:
    the discount for "want to" is 0.39.
    the discount for "Chinese food" is 0.10.

Result: Laplace smoothing does not work very well in empirical tests (Church and Gale, 1991a).
Good-Turing Discounting

Good (1953) attributes the idea to Alan Turing:

Let $N_c$ be the number of N-grams that have occurred $c$ times (the "frequency of frequency $c$"). Re-estimate the smoothed counts as follows:

    $c^* = (c + 1)\,\frac{N_{c+1}}{N_c}$

Based on the Good-Turing Theorem, assuming binomial distributions.

We want to know the probability attributed to the N-grams that have not previously been seen, which we can think of as the "missing mass."

Good-Turing Discounting

    $\text{missing mass} = \sum_{x:\,\text{count}(x)=0} P^*(x) = \sum_{x:\,\text{count}(x)=0} \frac{c^*(x)}{N} = \sum_{x:\,\text{count}(x)=0} \frac{(0+1)\,N_1}{N_0\, N} = \frac{N_1}{N_0\, N}\, N_0 = \frac{N_1}{N}$

Thus we can estimate the count of unseen N-grams using the count of N-grams we have seen once.
Fishing Example

Assume there are 8 species of fish in a lake, and we have caught 6 species with the following counts:

    carp 10, perch 3, whitefish 2, trout 1, salmon 1, eel 1

The other two species (catfish and bass) have not yet been seen. What is the probability that the next fish we catch will be one of the previously unseen species?

Fishing Example

The probability that the next fish is a catfish is:

    $P^*_{GT}(\text{catfish}) \approx 0.085$

Note: The count for trout was reduced from 1 to:

    $c^*(\text{trout}) \approx 0.67$
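The fishing numbers can be reproduced in a few lines; a sketch:

```python
from collections import Counter

# Observed counts from the fishing example.
counts = {"carp": 10, "perch": 3, "whitefish": 2, "trout": 1, "salmon": 1, "eel": 1}
N = sum(counts.values())           # 18 fish caught in total
Nc = Counter(counts.values())      # frequency of frequencies: N_1 = 3, N_2 = 1, ...
N0 = 2                             # unseen species: catfish and bass

# Missing mass N_1 / N, split evenly over the N_0 unseen species.
p_unseen_total = Nc[1] / N         # 3/18, about 0.167
p_catfish = p_unseen_total / N0    # about 0.083 (the slide quotes 0.085)
print(p_catfish)

# Good-Turing re-estimated count for a species seen once: c* = (c+1) * N_{c+1} / N_c.
c_star_trout = (1 + 1) * Nc[2] / Nc[1]   # 2 * 1 / 3, about 0.67
print(c_star_trout)
```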
Good-Turing Discounting

Bigram frequencies of frequencies and Good-Turing re-estimates for a 22 million word AP Newswire corpus (Church and Gale, 1991) and for the 9332 sentences in the Berkeley Restaurant corpus:

    [frequency-of-frequency table omitted]

Good-Turing Discounting

In practice:
    Since some values of $N_c$ could be zero, smooth these values before discounting, e.g., use a log-linear regression: $\log N_c = a + b \log c$.
    Since large counts are reliable, don't apply discounting when $c > k$ for some threshold $k$, e.g., when $c > 5$.
    Treat counts of 1 as if they were 0.
    Combine discounting with interpolation and backoff.
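A sketch of the first point above: fit $\log N_c = a + b \log c$ by least squares, so that a smoothed $N_c$ is available even where the raw value is zero. The frequency-of-frequency table below is invented purely for illustration.

```python
import math

# Hypothetical frequency-of-frequency table N_c (some c values missing, i.e. N_c = 0 there).
Nc = {1: 2053, 2: 458, 3: 191, 4: 107, 6: 48, 9: 21}

# Fit log N_c = a + b log c by ordinary least squares over the observed (c, N_c) pairs.
xs = [math.log(c) for c in Nc]
ys = [math.log(n) for n in Nc.values()]
n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x

def smoothed_Nc(c):
    """Smoothed estimate of N_c from the fitted regression (usable even when N_c was 0)."""
    return math.exp(a + b * math.log(c))

def c_star(c):
    """Good-Turing re-estimate using the smoothed values: c* = (c+1) * S(N_{c+1}) / S(N_c)."""
    return (c + 1) * smoothed_Nc(c + 1) / smoothed_Nc(c)

print(c_star(5))
```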
Interpolation

Combine trigrams, bigrams and unigrams:

    $\hat{P}(w_i \mid w_{i-2}, w_{i-1}) = \lambda_1 P(w_i) + \lambda_2 P(w_i \mid w_{i-1}) + \lambda_3 P(w_i \mid w_{i-2}, w_{i-1})$

    with $\lambda_i \geq 0$ and $\sum_i \lambda_i = 1$.

Verify: $\sum_w \hat{P}(w \mid w_{i-2}, w_{i-1}) = 1$.

Set the $\lambda$s using a held-out corpus.

Katz Backoff

For trigrams:

    $P_{KB}(w_i \mid w_{i-2}, w_{i-1}) =
    \begin{cases}
    P^*(w_i \mid w_{i-2}, w_{i-1}), & \text{if count}(w_{i-2}\, w_{i-1}\, w_i) > 0 \\
    \alpha(w_{i-2}, w_{i-1})\, P_{KB}(w_i \mid w_{i-1}), & \text{else if count}(w_{i-2}\, w_{i-1}) > 0 \\
    P^*(w_i), & \text{otherwise}
    \end{cases}$

For bigrams:

    $P_{KB}(w_i \mid w_{i-1}) =
    \begin{cases}
    P^*(w_i \mid w_{i-1}), & \text{if count}(w_{i-1}\, w_i) > 0 \\
    \alpha(w_{i-1})\, P^*(w_i), & \text{otherwise}
    \end{cases}$

Here $\alpha(\cdot)$ is the backoff weight that redistributes the discounted probability mass to the lower-order model.
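A minimal sketch of the interpolation formula. The component models p1, p2, p3 are assumed to exist already, and the lambda weights are illustrative placeholders rather than values tuned on held-out data.

```python
# Illustrative weights; in practice they are set on a held-out corpus.
LAMBDAS = (0.1, 0.3, 0.6)  # must be nonnegative and sum to 1

def interpolated_prob(w, w_prev2, w_prev1, p1, p2, p3):
    """lambda_1 * P(w) + lambda_2 * P(w | w_prev1) + lambda_3 * P(w | w_prev2, w_prev1)."""
    l1, l2, l3 = LAMBDAS
    return (l1 * p1(w)
            + l2 * p2(w, w_prev1)
            + l3 * p3(w, w_prev2, w_prev1))
```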
Katz Backoff with Smoothing

Katz backoff bigram probabilities for eight words (out of V = 1446) using Good-Turing discounting with k = 5 and counts of 1 replaced with 0:

    [bigram probability table omitted]

Evaluating Language Models

Subjective Evaluation: use a technique attributed to Claude Shannon:
    Generate texts randomly by sampling from the N-gram models, e.g., for Shakespeare and for the Wall Street Journal.
    Do they look the same?
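A sketch of Shannon-style generation from a bigram model: starting at <s>, repeatedly sample the next word in proportion to its bigram count. Here bigram_counts is assumed to map each word to a Counter of its observed successors, collected from some training corpus.

```python
import random

def generate(bigram_counts, max_len=20):
    """Sample a sentence word by word from bigram counts, starting at <s>."""
    word, output = "<s>", []
    for _ in range(max_len):
        followers = bigram_counts[word]          # Counter of words seen after `word`
        words = list(followers.keys())
        weights = list(followers.values())
        word = random.choices(words, weights=weights)[0]
        if word == "</s>":
            break
        output.append(word)
    return " ".join(output)
```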
Shakespeare

    [sentences randomly generated from unigram, bigram, trigram, and quadrigram models trained on Shakespeare omitted]

The quadrigram really looks like Shakespeare. Why?

Wall Street Journal

    [sentences randomly generated from N-gram models trained on the Wall Street Journal omitted]

The Wall Street Journal is not Shakespeare. Can we measure these differences objectively?
Perplexity

Given a test set $W = w_1 w_2 \ldots w_N$, the perplexity is the probability of the test set, normalized by $N$:

    $PP(W) = P(w_1, w_2, \ldots, w_N)^{-\frac{1}{N}}$

For a bigram model, the perplexity is:

    $PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}}$

Perplexity

Models: Unigram, Bigram, Trigram.
Training Set: 38 million word Wall Street Journal corpus with V = 19,979.
Method: Katz backoff with Good-Turing discounting.
Test Set: 1.5 million words.
Results:

    [perplexity results table omitted]
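A sketch of the bigram perplexity computation, done in log space to avoid underflow. The function bigram_prob is assumed to be a smoothed model that never returns zero.

```python
import math

def bigram_perplexity(test_words, bigram_prob):
    """Perplexity of a test sequence under a bigram model:
    PP(W) = exp(-(1/N) * sum_i log P(w_i | w_{i-1}))."""
    padded = ["<s>"] + test_words
    log_sum = sum(math.log(bigram_prob(prev, cur))
                  for prev, cur in zip(padded, padded[1:]))
    N = len(test_words)
    return math.exp(-log_sum / N)
```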