Chapter 7. Language models. Statistical Machine Translation

Save this PDF as:

Size: px
Start display at page:

Transcription

1 Chapter 7 Language models Statistical Machine Translation

2 Language models Language models answer the question: How likely is a string of English words good English? Help with reordering p lm (the house is small) > p lm (small the is house) Help with word choice p lm (I am going home) > p lm (I am going house) Chapter 7: Language Models 1

3 N-Gram Language Models Given: a string of English words W = w 1, w 2, w 3,..., w n Question: what is p(w )? Sparse data: Many good English sentences will not have been seen before Decomposing p(w ) using the chain rule: p(w 1, w 2, w 3,..., w n ) = p(w 1 ) p(w 2 w 1 ) p(w 3 w 1, w 2 )...p(w n w 1, w 2,...w n 1 ) (not much gained yet, p(w n w 1, w 2,...w n 1 ) is equally sparse) Chapter 7: Language Models 2

4 Markov Chain Markov assumption: only previous history matters limited memory: only last k words are included in history (older words less relevant) kth order Markov model For instance 2-gram language model: p(w 1, w 2, w 3,..., w n ) p(w 1 ) p(w 2 w 1 ) p(w 3 w 2 )...p(w n w n 1 ) What is conditioned on, here w i 1 is called the history Chapter 7: Language Models 3

5 Estimating N-Gram Probabilities Maximum likelihood estimation Collect counts over a large text corpus p(w 2 w 1 ) = count(w 1, w 2 ) count(w 1 ) Millions to billions of words are easy to get (trillions of English words available on the web) Chapter 7: Language Models 4

6 Example: 3-Gram Counts for trigrams and estimated word probabilities the green (total: 1748) word c. prob. paper group light party ecu the red (total: 225) word c. prob. cross tape army card , the blue (total: 54) word c. prob. box flag , angel trigrams in the Europarl corpus start with the red 123 of them end with cross maximum likelihood probability is = Chapter 7: Language Models 5

7 How good is the LM? A good model assigns a text of real English W a high probability This can be also measured with cross entropy: H(W ) = 1 n log p(w n 1 ) Or, perplexity perplexity(w ) = 2 H(W ) Chapter 7: Language Models 6

8 Example: 4-Gram prediction p lm -log 2 p lm p lm (i </s><s>) p lm (would <s>i) p lm (like i would) p lm (to would like) p lm (commend like to) p lm (the to commend) p lm (rapporteur commend the) p lm (on the rapporteur) p lm (his rapporteur on) p lm (work on his) p lm (. his work) p lm (</s> work.) average Chapter 7: Language Models 7

9 Comparison 1 4-Gram word unigram bigram trigram 4-gram i would like to commend the rapporteur on his work </s> average perplexity Chapter 7: Language Models 8

10 Unseen N-Grams We have seen i like to in our corpus We have never seen i like to smooth in our corpus p(smooth i like to) = 0 Any sentence that includes i like to smooth will be assigned probability 0 Chapter 7: Language Models 9

11 Add-One Smoothing For all possible n-grams, add the count of one. c = count of n-gram in corpus n = count of history v = vocabulary size p = c + 1 n + v But there are many more unseen n-grams than seen n-grams Example: Europarl 2-bigrams: 86, 700 distinct words 86, = 7, 516, 890, 000 possible bigrams but only about 30, 000, 000 words (and bigrams) in corpus Chapter 7: Language Models 10

12 Add-α Smoothing Add α < 1 to each count p = c + α n + αv What is a good value for α? Could be optimized on held-out set Chapter 7: Language Models 11

13 Example: 2-Grams in Europarl Count Adjusted count Test count c (c + 1) n n (c + α) t n+v 2 n+αv 2 c Add-α smoothing with α = t c are average counts of n-grams in test set that occurred c times in corpus Chapter 7: Language Models 12

14 Deleted Estimation Estimate true counts in held-out data split corpus in two halves: training and held-out counts in training C t (w 1,..., w n ) number of ngrams with training count r: N r total times ngrams of training count r seen in held-out data: T r Held-out estimator: p h (w 1,..., w n ) = T r N r N where count(w 1,..., w n ) = r Both halves can be switched and results combined p h (w 1,..., w n ) = T 1 r + T 2 r N(N 1 r + N 2 r ) where count(w 1,..., w n ) = r Chapter 7: Language Models 13

15 Good-Turing Smoothing Adjust actual counts r to expected counts r with formula r = (r + 1) N r+1 N r N r number of n-grams that occur exactly r times in corpus N 0 total number of n-grams Chapter 7: Language Models 14

16 Good-Turing for 2-Grams in Europarl Count Count of counts Adjusted count Test count r N r r t 0 7,514,941, ,132, , , , , , , , , adjusted count fairly accurate when compared against the test count Chapter 7: Language Models 15

17 Derivation of Good-Turing A specific n-gram α occurs with (unknown) probability p in the corpus Assumption: all occurrences of an n-gram α are independent of each other Number of times α occurs in corpus follows binomial distribution p(c(α) = r) = b(r; N, p i ) = ( ) N p r (1 p) N r r Chapter 7: Language Models 16

18 Derivation of Good-Turing (2) Goal of Good-Turing smoothing: compute expected count c Expected count can be computed with help from binomial distribution: E(c (α)) = = N r p(c(α) = r) r=0 N r r=0 ( ) N p r (1 p) N r r Note again: p is unknown, we cannot actually compute this Chapter 7: Language Models 17

19 Derivation of Good-Turing (3) Definition: expected number of n-grams that occur r times: E N (N r ) We have s different n-grams in corpus let us call them α 1,..., α s each occurs with probability p 1,..., p s, respectively Given the previous formulae, we can compute E N (N r ) = = s p(c(α i ) = r) i=1 s i=1 ( ) N p r i (1 p i ) N r r Note again: p i is unknown, we cannot actually compute this Chapter 7: Language Models 18

20 Derivation of Good-Turing (4) Reflection we derived a formula to compute E N (N r ) we have N r for small r: E N (N r ) N r Ultimate goal compute expected counts c, given actual counts c E(c (α) c(α) = r) Chapter 7: Language Models 19

21 Derivation of Good-Turing (5) For a particular n-gram α, we know its actual count r Any of the n-grams α i may occur r times Probability that α is one specific α i p(α = α i c(α) = r) = p(c(α i ) = r) s j=1 p(c(α j) = r) Expected count of this n-gram α E(c (α) c(α) = r) = s N p i p(α = α i c(α) = r) i=1 Chapter 7: Language Models 20

22 Derivation of Good-Turing (6) Combining the last two equations: E(c (α) c(α) = r) = = s i=1 N p i p(c(α i ) = r) s j=1 p(c(α j) = r) s i=1 N p i p(c(α i ) = r) s j=1 p(c(α j) = r) We will now transform this equation to derive Good-Turing smoothing Chapter 7: Language Models 21

23 Derivation of Good-Turing (7) Repeat: E(c (α) c(α) = r) = s i=1 N p i p(c(α i ) = r) s j=1 p(c(α j) = r) Denominator is our definition of expected counts E N (N r ) Chapter 7: Language Models 22

24 Numerator: Derivation of Good-Turing (8) s N p i p(c(α i ) = r) = i=1 s ( N N p i r i=1 ) p r i (1 p i ) N r N! = N N r!r! pr+1 i (1 p i ) N r (r + 1) N + 1! = N N + 1 N r!r + 1! pr+1 i (1 p i ) N r N = (r + 1) N + 1 E N+1(N r+1 ) (r + 1) E N+1 (N r+1 ) Chapter 7: Language Models 23

25 Derivation of Good-Turing (9) Using the simplifications of numerator and denominator: r = E(c (α) c(α) = r) = (r + 1) E N+1(N r+1 ) E N (N r ) (r + 1) N r+1 N r QED Chapter 7: Language Models 24

26 Back-Off In given corpus, we may never observe Scottish beer drinkers Scottish beer eaters Both have count 0 our smoothing methods will assign them same probability Better: backoff to bigrams: beer drinkers beer eaters Chapter 7: Language Models 25

27 Interpolation Higher and lower order n-gram models have different strengths and weaknesses high-order n-grams are sensitive to more context, but have sparse counts low-order n-grams consider only very limited context, but have robust counts Combine them p I (w 3 w 1, w 2 ) = λ 1 p 1 (w 3 ) λ 2 p 2 (w 3 w 2 ) λ 3 p 3 (w 3 w 1, w 2 ) Chapter 7: Language Models 26

28 Recursive Interpolation We can trust some histories w i n+1,..., w i 1 more than others Condition interpolation weights on history: λ wi n+1,...,w i 1 Recursive definition of interpolation p I n(w i w i n+1,..., w i 1 ) = λ wi n+1,...,w i 1 p n (w i w i n+1,..., w i 1 ) + + (1 λ wi n+1,...,w i 1 ) p I n 1(w i w i n+2,..., w i 1 ) Chapter 7: Language Models 27

29 Back-Off Trust the highest order language model that contains n-gram Requires p BO n (w i w i n+1,..., w i 1 ) = α n (w i w i n+1,..., w i 1 ) if count n (w i n+1,..., w i ) > 0 = d n (w i n+1,..., w i 1 ) p BO n 1(w i w i n+2,..., w i 1 ) else adjusted prediction model α n (w i w i n+1,..., w i 1 ) discounting function d n (w 1,..., w n 1 ) Chapter 7: Language Models 28

30 Back-Off with Good-Turing Smoothing Previously, we computed n-gram probabilities based on relative frequency p(w 2 w 1 ) = count(w 1, w 2 ) count(w 1 ) Good Turing smoothing adjusts counts c to expected counts c count (w 1, w 2 ) count(w 1, w 2 ) We use these expected counts for the prediction model (but 0 remains 0) α(w 2 w 1 ) = count (w 1, w 2 ) count(w 1 ) This leaves probability mass for the discounting function d 2 (w 1 ) = 1 w 2 α(w 2 w 1 ) Chapter 7: Language Models 29

31 Diversity of Predicted Words Consider the bigram histories spite and constant both occur 993 times in Europarl corpus only 9 different words follow spite almost always followed by of (979 times), due to expression in spite of 415 different words follow constant most frequent: and (42 times), concern (27 times), pressure (26 times), but huge tail of singletons: 268 different words More likely to see new bigram that starts with constant than spite Witten-Bell smoothing considers diversity of predicted words Chapter 7: Language Models 30

32 Witten-Bell Smoothing Recursive interpolation method Number of possible extensions of a history w 1,..., w n 1 in training data N 1+ (w 1,..., w n 1, ) = {w n : c(w 1,..., w n 1, w n ) > 0} Lambda parameters 1 λ w1,...,w n 1 = N 1+ (w 1,..., w n 1, ) N 1+ (w 1,..., w n 1, ) + w n c(w 1,..., w n 1, w n ) Chapter 7: Language Models 31

33 Witten-Bell Smoothing: Examples Let us apply this to our two examples: 1 λ spite = = 1 λ constant = = N 1+ (spite, ) N 1+ (spite, ) + w n c(spite, w n ) = N 1+ (constant, ) N 1+ (constant, ) + w n c(constant, w n ) = Chapter 7: Language Models 32

34 Diversity of Histories Consider the word York fairly frequent word in Europarl corpus, occurs 477 times as frequent as foods, indicates and providers in unigram language model: a respectable probability However, it almost always directly follows New (473 times) Recall: unigram model only used, if the bigram model inconclusive York unlikely second word in unseen bigram in back-off unigram model, York should have low probability Chapter 7: Language Models 33

35 Kneser-Ney Smoothing Kneser-Ney smoothing takes diversity of histories into account Count of histories for a word N 1+ ( w) = {w i : c(w i, w) > 0} Recall: maximum likelihood estimation of unigram language model p ML (w) = c(w) i c(w i) In Kneser-Ney smoothing, replace raw counts with count of histories p KN (w) = N 1+ ( w) w i N 1+ (w i w) Chapter 7: Language Models 34

36 Based on interpolation Modified Kneser-Ney Smoothing Requires p BO n (w i w i n+1,..., w i 1 ) = α n (w i w i n+1,..., w i 1 ) if count n (w i n+1,..., w i ) > 0 = d n (w i n+1,..., w i 1 ) p BO n 1(w i w i n+2,..., w i 1 ) else adjusted prediction model α n (w i w i n+1,..., w i 1 ) discounting function d n (w 1,..., w n 1 ) Chapter 7: Language Models 35

37 Formula for α for Highest Order N-Gram Model Absolute discounting: subtract a fixed D from all non-zero counts α(w n w 1,..., w n 1 ) = c(w 1,..., w n ) D w c(w 1,..., w n 1, w) Refinement: three different discount values D 1 if c = 1 D(c) = D 2 if c = 2 D 3+ if c 3 Chapter 7: Language Models 36

38 Discount Parameters Optimal discounting parameters D 1, D 2, D 3+ can be computed quite easily Y = N 1 N 1 + 2N 2 D 1 = 1 2Y N 2 N 1 D 2 = 2 3Y N 3 N 2 D 3+ = 3 4Y N 4 N 3 Values N c are the counts of n-grams with exactly count c Chapter 7: Language Models 37

39 Formula for d for Highest Order N-Gram Model Probability mass set aside from seen events d(w 1,..., w n 1 ) = i {1,2,3+} D in i (w 1,..., w n 1 ) w n c(w 1,..., w n ) N i for i {1, 2, 3+} are computed based on the count of extensions of a history w 1,..., w n 1 with count 1, 2, and 3 or more, respectively. Similar to Witten-Bell smoothing Chapter 7: Language Models 38

40 Formula for α for Lower Order N-Gram Models Recall: base on count of histories N 1+ ( w) in which word may appear, not raw counts. α(w n w 1,..., w n 1 ) = N 1+( w 1,..., w n ) D w N 1+( w 1,..., w n 1, w) Again, three different values for D (D 1, D 2, D 3+ ), based on the count of the history w 1,..., w n 1 Chapter 7: Language Models 39

41 Formula for d for Lower Order N-Gram Models Probability mass set aside available for the d function d(w 1,..., w n 1 ) = i {1,2,3+} D in i (w 1,..., w n 1 ) w n c(w 1,..., w n ) Chapter 7: Language Models 40

42 Interpolated Back-Off Back-off models use only highest order n-gram if sparse, not very reliable. two different n-grams with same history occur once same probability one may be an outlier, the other under-represented in training To remedy this, always consider the lower-order back-off models Adapting the α function into interpolated α I function by adding back-off α I (w n w 1,..., w n 1 ) = α(w n w 1,..., w n 1 ) Note that d function needs to be adapted as well + d(w 1,..., w n 1 ) p I (w n w 2,..., w n 1 ) Chapter 7: Language Models 41

43 Evaluation Evaluation of smoothing methods: Perplexity for language models trained on the Europarl corpus Smoothing method bigram trigram 4-gram Good-Turing Witten-Bell Modified Kneser-Ney Interpolated Modified Kneser-Ney Chapter 7: Language Models 42

44 Managing the Size of the Model Millions to billions of words are easy to get (trillions of English words available on the web) But: huge language models do not fit into RAM Chapter 7: Language Models 43

45 Number of Unique N-Grams Number of unique n-grams in Europarl corpus 29,501,088 tokens (words and punctuation) Order Unique n-grams Singletons unigram 86,700 33,447 (38.6%) bigram 1,948,935 1,132,844 (58.1%) trigram 8,092,798 6,022,286 (74.4%) 4-gram 15,303,847 13,081,621 (85.5%) 5-gram 19,882,175 18,324,577 (92.2%) remove singletons of higher order n-grams Chapter 7: Language Models 44

46 Language models too large to build What needs to be stored in RAM? Estimation on Disk maximum likelihood estimation p(w n w 1,..., w n 1 ) = count(w 1,..., w n ) count(w 1,..., w n 1 ) can be done separately for each history w 1,..., w n 1 Keep data on disk extract all n-grams into files on-disk sort by history on disk only keep n-grams with shared history in RAM Smoothing techniques may require additional statistics Chapter 7: Language Models 45

47 Efficient Data Structures 4-gram the very 3-gram backoff very 2-gram backoff large boff: very large boff: important boff: best boff: serious boff: accept p: acceptable p: accession p: accidents p: accountancy p: accumulated p: accumulation p: action p: additional p: administration p: large boff: important boff: best boff: serious boff: amount p: amounts p: and p: area p: companies p: cuts p: degree p: extent p: financial p: foreign p: majority p: number p: and p: areas p: challenge p: debate p: discussion p: fact p: international p: issue p: gram backoff aa-afns p: aachen p: aaiun p: aalborg p: aarhus p: aaron p: aartsen p: ab p: abacha p: aback p: Need to store probabilities for the very large majority the very language number Both share history the very large no need to store history twice Trie Chapter 7: Language Models 46

48 Fewer Bits to Store Probabilities Index for words two bytes allow a vocabulary of 2 16 = 65, 536 words, typically more needed Huffman coding to use fewer bits for frequent words. Probabilities typically stored in log format as floats (4 or 8 bytes) quantization of probabilities to use even less memory, maybe just 4-8 bits Chapter 7: Language Models 47

49 Reducing Vocabulary Size For instance: each number is treated as a separate token Replace them with a number token num but: we want our language model to prefer p lm (I pay in May 2007) > p lm (I pay 2007 in May ) not possible with number token p lm (I pay num in May num) = p lm (I pay num in May num) Replace each digit (with unique symbol, or 5), retain some distinctions p lm (I pay in May 5555) > p lm (I pay 5555 in May ) Chapter 7: Language Models 48

50 Filtering Irrelevant N-Grams We use language model in decoding we only produce English words in translation options filter language model down to n-grams containing only those words Ratio of 5-grams needed to all 5-grams (by sentence length): 0.16 ratio of 5-grams required (bag-of-words) sentence length Chapter 7: Language Models 49

51 Summary Language models: How likely is a string of English words good English? N-gram models (Markov assumption) Perplexity Count smoothing add-one, add-α deleted estimation Good Turing Interpolation and backoff Good Turing Witten-Bell Kneser-Ney Managing the size of the model Chapter 7: Language Models 50

Probability Estimation

Probability Estimation D. De Cao R. Basili Corso di Web Mining e Retrieval a.a. 2008-9 April 23, 2009 Outline Laplace Estimator Good-Turing Backoff The Sparse Data Problem There is a major problem with

Language Modeling. Chapter 1. 1.1 Introduction

Chapter 1 Language Modeling (Course notes for NLP by Michael Collins, Columbia University) 1.1 Introduction In this chapter we will consider the the problem of constructing a language model from a set

CS 533: Natural Language. Word Prediction

CS 533: Natural Language Processing Lecture 03 N-Gram Models and Algorithms CS 533: Natural Language Processing Lecture 01 1 Word Prediction Suppose you read the following sequence of words: Sue swallowed

Perplexity Method on the N-gram Language Model Based on Hadoop Framework

94 International Arab Journal of e-technology, Vol. 4, No. 2, June 2015 Perplexity Method on the N-gram Language Model Based on Hadoop Framework Tahani Mahmoud Allam 1, Hatem Abdelkader 2 and Elsayed Sallam

Managed N-gram Language Model Based on Hadoop Framework and a Hbase Tables

Managed N-gram Language Model Based on Hadoop Framework and a Hbase Tables Tahani Mahmoud Allam Assistance lecture in Computer and Automatic Control Dept - Faculty of Engineering-Tanta University, Tanta,

Natural Language Processing. Today. Word Prediction. Lecture 6 9/10/2015. Jim Martin. Language modeling with N-grams. Guess the next word...

Natural Language Processing Lecture 6 9/10/2015 Jim Martin Today Language modeling with N-grams Basic counting Probabilistic model Independence assumptions 9/8/15 Speech and Language Processing - Jurafsky

Speech and Language Processing. Daniel Jurafsky & James H. Martin. Copyright c 204. All rights reserved. Draft of September, 204. CHAPTER 4 N-Grams You are uniformly charming! cried he, with a smile of

Language Modeling. Introduc*on to N- grams

Language Modeling Introduc*on to N- grams Probabilis1c Language Models Today s goal: assign a probability to a sentence Machine Transla*on: Why? P(high winds tonite) > P(large winds tonite) Spell Correc*on

Offline Recognition of Unconstrained Handwritten Texts Using HMMs and Statistical Language Models. Alessandro Vinciarelli, Samy Bengio and Horst Bunke

1 Offline Recognition of Unconstrained Handwritten Texts Using HMMs and Statistical Language Models Alessandro Vinciarelli, Samy Bengio and Horst Bunke Abstract This paper presents a system for the offline

Large Language Models in Machine Translation

Large Language Models in Machine Translation Thorsten Brants Ashok C. Popat Peng Xu Franz J. Och Jeffrey Dean Google, Inc. 1600 Amphitheatre Parkway Mountain View, CA 94303, USA {brants,popat,xp,och,jeff}@google.com

HMM Speech Recognition. Words: Pronunciations and Language Models. Pronunciation dictionary. Out-of-vocabulary (OOV) rate.

HMM Speech Recognition ords: Pronunciations and Language Models Recorded Speech Decoded Text (Transcription) Steve Renals Acoustic Features Acoustic Model Automatic Speech Recognition ASR Lecture 9 January-March

Basic probability theory and n-gram models

theory and n-gram models INF4820 H2010 Institutt for Informatikk Universitetet i Oslo 28. september Outline 1 2 3 Outline 1 2 3 Gambling the beginning of probabilities Example Throwing a fair dice Example

Compressing Trigram Language Models With Golomb Coding

Compressing Trigram Language Models With Golomb Coding Ken Church Microsoft One Microsoft Way Redmond, WA, USA Ted Hart Microsoft One Microsoft Way Redmond, WA, USA {church,tedhar,jfgao}@microsoft.com

Natural Language Processing: A Model to Predict a Sequence of Words

Natural Language Processing: A Model to Predict a Sequence of Words Gerald R. Gendron, Jr. Confido Consulting Spot Analytic Chesapeake, VA Gerald.gendron@gmail.com; LinkedIn: jaygendron ABSTRACT With the

Simple Language Models for Spam Detection

Simple Language Models for Spam Detection Egidio Terra Faculty of Informatics PUC/RS - Brazil Abstract For this year s Spam track we used classifiers based on language models. These models are used to

Strategies for Training Large Scale Neural Network Language Models

Strategies for Training Large Scale Neural Network Language Models Tomáš Mikolov #1, Anoop Deoras 2, Daniel Povey 3, Lukáš Burget #4, Jan Honza Černocký #5 # Brno University of Technology, Speech@FIT,

Search and Data Mining: Techniques. Text Mining Anya Yarygina Boris Novikov

Search and Data Mining: Techniques Text Mining Anya Yarygina Boris Novikov Introduction Generally used to denote any system that analyzes large quantities of natural language text and detects lexical or

Log-Linear Models. Michael Collins

Log-Linear Models Michael Collins 1 Introduction This note describes log-linear models, which are very widely used in natural language processing. A key advantage of log-linear models is their flexibility:

Word Completion and Prediction in Hebrew

Experiments with Language Models for בס"ד Word Completion and Prediction in Hebrew 1 Yaakov HaCohen-Kerner, Asaf Applebaum, Jacob Bitterman Department of Computer Science Jerusalem College of Technology

Information Retrieval Statistics of Text

Information Retrieval Statistics of Text James Allan University of Massachusetts, Amherst (NTU ST770-A) Fall 2002 All slides copyright Bruce Croft and/or James Allan Outline Zipf distribution Vocabulary

The 1997 CMU Sphinx-3 English Broadcast News Transcription System

The 1997 CMU Sphinx-3 English Broadcast News Transcription System K. Seymore, S. Chen, S. Doh, M. Eskenazi, E. Gouvêa,B.Raj, M. Ravishankar, R. Rosenfeld, M. Siegler, R. Stern, and E. Thayer Carnegie Mellon

NLP Programming Tutorial 1 - Unigram Language Models

NLP Programming Tutorial 1 - Unigram Language Models Graham Neubig Nara Institute of Science and Technology (NAIST) 1 Language Model Basics 2 Why Language Models? We have an English speech recognition

Estimating Probability Distributions

Estimating Probability Distributions Readings: Manning and Schutze, Section 6.2 Jurafsky & Martin, Section 6.3 One of the central problems we face in using probability models for NLP is obtaining the actual

LIUM s Statistical Machine Translation System for IWSLT 2010

LIUM s Statistical Machine Translation System for IWSLT 2010 Anthony Rousseau, Loïc Barrault, Paul Deléglise, Yannick Estève Laboratoire Informatique de l Université du Maine (LIUM) University of Le Mans,

Scaling Shrinkage-Based Language Models

Scaling Shrinkage-Based Language Models Stanley F. Chen, Lidia Mangu, Bhuvana Ramabhadran, Ruhi Sarikaya, Abhinav Sethy IBM T.J. Watson Research Center P.O. Box 218, Yorktown Heights, NY 10598 USA {stanchen,mangu,bhuvana,sarikaya,asethy}@us.ibm.com

Factored Language Models for Statistical Machine Translation

Factored Language Models for Statistical Machine Translation Amittai E. Axelrod E H U N I V E R S I T Y T O H F G R E D I N B U Master of Science by Research Institute for Communicating and Collaborative

Chapter 5. Phrase-based models. Statistical Machine Translation

Chapter 5 Phrase-based models Statistical Machine Translation Motivation Word-Based Models translate words as atomic units Phrase-Based Models translate phrases as atomic units Advantages: many-to-many

Language Model Adaptation for Video Lecture Transcription

UNIVERSITAT POLITÈCNICA DE VALÈNCIA DEPARTAMENT DE SISTEMES INFORMÀTICS I COMPUTACIÓ Language Model Adaptation for Video Lecture Transcription Master Thesis - MIARFID Adrià A. Martínez Villaronga Directors:

IBM Research Report. Scaling Shrinkage-Based Language Models

RC24970 (W1004-019) April 6, 2010 Computer Science IBM Research Report Scaling Shrinkage-Based Language Models Stanley F. Chen, Lidia Mangu, Bhuvana Ramabhadran, Ruhi Sarikaya, Abhinav Sethy IBM Research

A Mixed Trigrams Approach for Context Sensitive Spell Checking

A Mixed Trigrams Approach for Context Sensitive Spell Checking Davide Fossati and Barbara Di Eugenio Department of Computer Science University of Illinois at Chicago Chicago, IL, USA dfossa1@uic.edu, bdieugen@cs.uic.edu

Recurrent (Neural) Networks: Part 1. Tomas Mikolov, Facebook AI Research Deep Learning NYU, 2015

Recurrent (Neural) Networks: Part 1 Tomas Mikolov, Facebook AI Research Deep Learning course @ NYU, 2015 Structure of the talk Introduction Architecture of simple recurrent nets Training: Backpropagation

Data and Application

Carnegie Mellon University Data and Application Tutorial of Parameter Server Mu Li! CSD@CMU & IDL@Baidu! muli@cs.cmu.edu About me Ph.D student working with Alex Smola and Dave Andersen! large scale machine

Random Forests in Language Modeling

Random Forests in Language Modeling Peng Xu and Frederick Jelinek Center for Language and Speech Processing the Johns Hopkins University Baltimore, MD 21218, USA xp,jelinek @jhu.edu Abstract In this paper,

NLP Programming Tutorial 5 - Part of Speech Tagging with Hidden Markov Models

NLP Programming Tutorial 5 - Part of Speech Tagging with Hidden Markov Models Graham Neubig Nara Institute of Science and Technology (NAIST) 1 Part of Speech (POS) Tagging Given a sentence X, predict its

The OpenGrm open-source finite-state grammar software libraries

The OpenGrm open-source finite-state grammar software libraries Brian Roark Richard Sproat Cyril Allauzen Michael Riley Jeffrey Sorensen & Terry Tai Oregon Health & Science University, Portland, Oregon

NLP Lab Session Week 3 Bigram Frequencies and Mutual Information Scores in NLTK September 16, 2015

NLP Lab Session Week 3 Bigram Frequencies and Mutual Information Scores in NLTK September 16, 2015 Starting a Python and an NLTK Session Open a Python 2.7 IDLE (Python GUI) window or a Python interpreter

5.1 Radical Notation and Rational Exponents

Section 5.1 Radical Notation and Rational Exponents 1 5.1 Radical Notation and Rational Exponents We now review how exponents can be used to describe not only powers (such as 5 2 and 2 3 ), but also roots

List of Standard Counting Problems

List of Standard Counting Problems Advanced Problem Solving This list is stolen from Daniel Marcus Combinatorics a Problem Oriented Approach (MAA, 998). This is not a mathematical standard list, just a

POS Tagsets and POS Tagging. Definition. Tokenization. Tagset Design. Automatic POS Tagging Bigram tagging. Maximum Likelihood Estimation 1 / 23

POS Def. Part of Speech POS POS L645 POS = Assigning word class information to words Dept. of Linguistics, Indiana University Fall 2009 ex: the man bought a book determiner noun verb determiner noun 1

Three New Graphical Models for Statistical Language Modelling

Andriy Mnih Geoffrey Hinton Department of Computer Science, University of Toronto, Canada amnih@cs.toronto.edu hinton@cs.toronto.edu Abstract The supremacy of n-gram models in statistical language modelling

Pre-Algebra Lecture 6

Pre-Algebra Lecture 6 Today we will discuss Decimals and Percentages. Outline: 1. Decimals 2. Ordering Decimals 3. Rounding Decimals 4. Adding and subtracting Decimals 5. Multiplying and Dividing Decimals

MAT 0950 Course Objectives

MAT 0950 Course Objectives 5/15/20134/27/2009 A student should be able to R1. Do long division. R2. Divide by multiples of 10. R3. Use multiplication to check quotients. 1. Identify whole numbers. 2. Identify

Exercises with solutions (1)

Exercises with solutions (). Investigate the relationship between independence and correlation. (a) Two random variables X and Y are said to be correlated if and only if their covariance C XY is not equal

A PRELIMINARY REPORT ON A GENERAL THEORY OF INDUCTIVE INFERENCE

A PRELIMINARY REPORT ON A GENERAL THEORY OF INDUCTIVE INFERENCE R. J. Solomonoff Abstract Some preliminary work is presented on a very general new theory of inductive inference. The extrapolation of an

Compression techniques

Compression techniques David Bařina February 22, 2013 David Bařina Compression techniques February 22, 2013 1 / 37 Contents 1 Terminology 2 Simple techniques 3 Entropy coding 4 Dictionary methods 5 Conclusion

Automatic slide assignation for language model adaptation

Automatic slide assignation for language model adaptation Applications of Computational Linguistics Adrià Agustí Martínez Villaronga May 23, 2013 1 Introduction Online multimedia repositories are rapidly

Tightly Packed Tries: How to Fit Large Models into Memory, and Make them Load Fast, Too

Tightly Packed Tries: How to Fit Large Models into Memory, and Make them Load Fast, Too Eric Joanis National Research Council Canada Eric.Joanis@cnrc-nrc.gc.ca Ulrich Germann University of Toronto and

Grammars and introduction to machine learning. Computers Playing Jeopardy! Course Stony Brook University

Grammars and introduction to machine learning Computers Playing Jeopardy! Course Stony Brook University Last class: grammars and parsing in Prolog Noun -> roller Verb thrills VP Verb NP S NP VP NP S VP

Verification of Bangla Sentence Structure using N-Gram

Global Journal of Computer Science and Technology: A Hardware & Computation Volume 14 Issue 1 Version 1.0 Type: Double Blind Peer Reviewed International Research Journal Publisher: Global Journals Inc.

Phrases. Topics for Today. Phrases. POS Tagging. ! Text transformation. ! Text processing issues

Topics for Today! Text transformation Word occurrence statistics Tokenizing Stopping and stemming Phrases Document structure Link analysis Information extraction Internationalization Phrases! Many queries

GRE MATH REVIEW #5. 1. Variable: A letter that represents an unknown number.

GRE MATH REVIEW #5 Eponents and Radicals Many numbers can be epressed as the product of a number multiplied by itself a number of times. For eample, 16 can be epressed as. Another way to write this is

Name: Class: Date: 9. The compiler ignores all comments they are there strictly for the convenience of anyone reading the program.

Name: Class: Date: Exam #1 - Prep True/False Indicate whether the statement is true or false. 1. Programming is the process of writing a computer program in a language that the computer can respond to

Information Retrieval CS Lecture 02. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Information Retrieval CS 6900 Lecture 02 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Typical IR task Input: A large collection of unstructured text documents.

Tagging with Hidden Markov Models

Tagging with Hidden Markov Models Michael Collins 1 Tagging Problems In many NLP problems, we would like to model pairs of sequences. Part-of-speech (POS) tagging is perhaps the earliest, and most famous,

2 Programs: Instructions in the Computer

2 2 Programs: Instructions in the Computer Figure 2. illustrates the first few processing steps taken as a simple CPU executes a program. The CPU for this example is assumed to have a program counter (PC),

CSCI 5417 Information Retrieval Systems Jim Martin!

CSCI 5417 Information Retrieval Systems Jim Martin! Lecture 9 9/20/2011 Today 9/20 Where we are MapReduce/Hadoop Probabilistic IR Language models LM for ad hoc retrieval 1 Where we are... Basics of ad

Data Mining: An Overview of Methods and Technologies for Increasing Profits in Direct Marketing. C. Olivia Rud, VP, Fleet Bank

Data Mining: An Overview of Methods and Technologies for Increasing Profits in Direct Marketing C. Olivia Rud, VP, Fleet Bank ABSTRACT Data Mining is a new term for the common practice of searching through

NIRMAL: Automatic Identification of Software Relevant Tweets Leveraging Language Model

NIRMAL: Automatic Identification of Software Relevant Tweets Leveraging Language Model Abhishek Sharma, Yuan Tian and David Lo School of Information Systems Singapore Management University Email: {abhisheksh.2014,yuan.tian.2012,davidlo}@smu.edu.sg

DRAFT N-GRAMS. ... all of a sudden I notice three guys standing on the sidewalk...

Speech and Language Processing: An introduction to speech recognition, computational linguistics and natural language processing. Daniel Jurafsky & James H. Martin. Copyright c 2006, All rights reserved.

EE282 Computer Architecture and Organization Midterm Exam February 13, 2001. (Total Time = 120 minutes, Total Points = 100)

EE282 Computer Architecture and Organization Midterm Exam February 13, 2001 (Total Time = 120 minutes, Total Points = 100) Name: (please print) Wolfe - Solution In recognition of and in the spirit of the

Learning to Complete Sentences

Learning to Complete Sentences Steffen Bickel, Peter Haider, and Tobias Scheffer Humboldt-Universität zu Berlin, Department of Computer Science Unter den Linden 6, 99 Berlin, Germany {bickel,haider,scheffer}@informatik.hu-berlin.de

Estimating Language Models Using Hadoop and Hbase

Estimating Language Models Using Hadoop and Hbase Xiaoyang Yu E H U N I V E R S I T Y T O H F G R E D I N B U Master of Science Artificial Intelligence School of Informatics University of Edinburgh 2008

Modeling Lifetime Value in the Insurance Industry

Modeling Lifetime Value in the Insurance Industry C. Olivia Parr Rud, Executive Vice President, Data Square, LLC ABSTRACT Acquisition modeling for direct mail insurance has the unique challenge of targeting

Can you design an algorithm that searches for the maximum value in the list?

The Science of Computing I Lesson 2: Searching and Sorting Living with Cyber Pillar: Algorithms Searching One of the most common tasks that is undertaken in Computer Science is the task of searching. We

Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information

Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information Satoshi Sekine Computer Science Department New York University sekine@cs.nyu.edu Kapil Dalwani Computer Science Department

Reliable and Cost-Effective PoS-Tagging

Reliable and Cost-Effective PoS-Tagging Yu-Fang Tsai Keh-Jiann Chen Institute of Information Science, Academia Sinica Nanang, Taipei, Taiwan 5 eddie,chen@iis.sinica.edu.tw Abstract In order to achieve

Introduction to Bayesian Classification (A Practical Discussion) Todd Holloway Lecture for B551 Nov. 27, 2007

Introduction to Bayesian Classification (A Practical Discussion) Todd Holloway Lecture for B551 Nov. 27, 2007 Naïve Bayes Components ML vs. MAP Benefits Feature Preparation Filtering Decay Extended Examples

Lab 1: Text Processing in Linux; N-grams

Lab 1: Text Processing in Linux; N-grams Massimo Poesio 30 Settembre 2003 Linux machines come with a lot of software that can be used to manipulate text. The first goal of this lab is to get yourselves

Bayes Rule. How is this rule derived? Using Bayes rule for probabilistic inference: Evidence Cause) Evidence)

Bayes Rule P ( A B = B A A B ( Rev. Thomas Bayes (1702-1761 How is this rule derived? Using Bayes rule for probabilistic inference: P ( Cause Evidence = Evidence Cause Cause Evidence Cause Evidence: diagnostic

Math 0980 Chapter Objectives. Chapter 1: Introduction to Algebra: The Integers.

Math 0980 Chapter Objectives Chapter 1: Introduction to Algebra: The Integers. 1. Identify the place value of a digit. 2. Write a number in words or digits. 3. Write positive and negative numbers used

SMALL INDEX LARGE INDEX (SILT)

Wayne State University ECE 7650: Scalable and Secure Internet Services and Architecture SMALL INDEX LARGE INDEX (SILT) A Memory Efficient High Performance Key Value Store QA REPORT Instructor: Dr. Song

Introduction to Hypothesis Testing

I. Terms, Concepts. Introduction to Hypothesis Testing A. In general, we do not know the true value of population parameters - they must be estimated. However, we do have hypotheses about what the true

Efficient visual search of local features. Cordelia Schmid

Efficient visual search of local features Cordelia Schmid Visual search change in viewing angle Matches 22 correct matches Image search system for large datasets Large image dataset (one million images

Matrix Algebra and Applications

Matrix Algebra and Applications Dudley Cooke Trinity College Dublin Dudley Cooke (Trinity College Dublin) Matrix Algebra and Applications 1 / 49 EC2040 Topic 2 - Matrices and Matrix Algebra Reading 1 Chapters

Sample Solutions for Assignment 2.

AMath 383, Autumn 01 Sample Solutions for Assignment. Reading: Chs. -3. 1. Exercise 4 of Chapter. Please note that there is a typo in the formula in part c: An exponent of 1 is missing. It should say 4

Part III: Machine Learning. CS 188: Artificial Intelligence. Machine Learning This Set of Slides. Parameter Estimation. Estimation: Smoothing

CS 188: Artificial Intelligence Lecture 20: Dynamic Bayes Nets, Naïve Bayes Pieter Abbeel UC Berkeley Slides adapted from Dan Klein. Part III: Machine Learning Up until now: how to reason in a model and

Time series Forecasting using Holt-Winters Exponential Smoothing

Time series Forecasting using Holt-Winters Exponential Smoothing Prajakta S. Kalekar(04329008) Kanwal Rekhi School of Information Technology Under the guidance of Prof. Bernard December 6, 2004 Abstract

Micro blogs Oriented Word Segmentation System

Micro blogs Oriented Word Segmentation System Yijia Liu, Meishan Zhang, Wanxiang Che, Ting Liu, Yihe Deng Research Center for Social Computing and Information Retrieval Harbin Institute of Technology,

Discrete Mathematics: Homework 7 solution. Due: 2011.6.03

EE 2060 Discrete Mathematics spring 2011 Discrete Mathematics: Homework 7 solution Due: 2011.6.03 1. Let a n = 2 n + 5 3 n for n = 0, 1, 2,... (a) (2%) Find a 0, a 1, a 2, a 3 and a 4. (b) (2%) Show that

Mining a Corpus of Job Ads

Mining a Corpus of Job Ads Workshop Strings and Structures Computational Biology & Linguistics Jürgen Jürgen Hermes Hermes Sprachliche Linguistic Data Informationsverarbeitung Processing Institut Department

Basic Use of the TI-84 Plus

Basic Use of the TI-84 Plus Topics: Key Board Sections Key Functions Screen Contrast Numerical Calculations Order of Operations Built-In Templates MATH menu Scientific Notation The key VS the (-) Key Navigation

MATH 90 CHAPTER 1 Name:.

MATH 90 CHAPTER 1 Name:. 1.1 Introduction to Algebra Need To Know What are Algebraic Expressions? Translating Expressions Equations What is Algebra? They say the only thing that stays the same is change.

A Joint Sequence Translation Model with Integrated Reordering

A Joint Sequence Translation Model with Integrated Reordering Nadir Durrani, Helmut Schmid and Alexander Fraser Institute for Natural Language Processing University of Stuttgart Introduction Generation

Text Mining in JMP with R Andrew T. Karl, Senior Management Consultant, Adsurgo LLC Heath Rushing, Principal Consultant and Co-Founder, Adsurgo LLC

Text Mining in JMP with R Andrew T. Karl, Senior Management Consultant, Adsurgo LLC Heath Rushing, Principal Consultant and Co-Founder, Adsurgo LLC 1. Introduction A popular rule of thumb suggests that

Slovak Automatic Transcription and Dictation System for the Judicial Domain

Slovak Automatic Transcription and Dictation System for the Judicial Domain Milan Rusko 1, Jozef Juhár 2, Marian Trnka 1, Ján Staš 2, Sakhia Darjaa 1, Daniel Hládek 2, Miloš Cerňak 1, Marek Papco 2, Róbert

Large-Scale Test Mining

Large-Scale Test Mining SIAM Conference on Data Mining Text Mining 2010 Alan Ratner Northrop Grumman Information Systems NORTHROP GRUMMAN PRIVATE / PROPRIETARY LEVEL I Aim Identify topic and language/script/coding

Using classes has the potential of reducing the problem of sparseness of data by allowing generalizations

POS Tags and Decision Trees for Language Modeling Peter A. Heeman Department of Computer Science and Engineering Oregon Graduate Institute PO Box 91000, Portland OR 97291 heeman@cse.ogi.edu Abstract Language

Lecture 10: Sequential Data Models

CSC2515 Fall 2007 Introduction to Machine Learning Lecture 10: Sequential Data Models 1 Example: sequential data Until now, considered data to be i.i.d. Turn attention to sequential data Time-series: stock

A STATE-SPACE METHOD FOR LANGUAGE MODELING. Vesa Siivola and Antti Honkela. Helsinki University of Technology Neural Networks Research Centre

A STATE-SPACE METHOD FOR LANGUAGE MODELING Vesa Siivola and Antti Honkela Helsinki University of Technology Neural Networks Research Centre Vesa.Siivola@hut.fi, Antti.Honkela@hut.fi ABSTRACT In this paper,

Hindi-to-Urdu Machine Translation Through Transliteration

Hindi-to-Urdu Machine Translation Through Transliteration Nadir Durrani, Hassan Sajjad, Alexander Fraser and Helmut Schmid Institute of Natural Language Processing University of Stuttgart Transliteration

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Edward Bortnikov & Ronny Lempel Yahoo Labs, Haifa Indexing in Search Engines Information Retrieval s two main stages: Indexing process

Lesson 1: Comparison of Population Means Part c: Comparison of Two- Means

Lesson : Comparison of Population Means Part c: Comparison of Two- Means Welcome to lesson c. This third lesson of lesson will discuss hypothesis testing for two independent means. Steps in Hypothesis

CHAPTER 2 Estimating Probabilities

CHAPTER 2 Estimating Probabilities Machine Learning Copyright c 2016. Tom M. Mitchell. All rights reserved. *DRAFT OF January 24, 2016* *PLEASE DO NOT DISTRIBUTE WITHOUT AUTHOR S PERMISSION* This is a

7 th Grade Pre Algebra A Vocabulary Chronological List

SUM sum the answer to an addition problem Ex. 4 + 5 = 9 The sum is 9. DIFFERENCE difference the answer to a subtraction problem Ex. 8 2 = 6 The difference is 6. PRODUCT product the answer to a multiplication

Data Deduplication in Slovak Corpora

Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences, Bratislava, Slovakia Abstract. Our paper describes our experience in deduplication of a Slovak corpus. Two methods of deduplication a plain

Errata and updates for ASM Exam C/Exam 4 Manual (Sixteenth Edition) sorted by page

Errata for ASM Exam C/4 Study Manual (Sixteenth Edition) Sorted by Page 1 Errata and updates for ASM Exam C/Exam 4 Manual (Sixteenth Edition) sorted by page Practice exam 1:9, 1:22, 1:29, 9:5, and 10:8

Building A Vocabulary Self-Learning Speech Recognition System

INTERSPEECH 2014 Building A Vocabulary Self-Learning Speech Recognition System Long Qin 1, Alexander Rudnicky 2 1 M*Modal, 1710 Murray Ave, Pittsburgh, PA, USA 2 Carnegie Mellon University, 5000 Forbes

Information Retrieval

Information Retrieval Indexes All slides unless specifically mentioned are Addison Wesley, 2008; Anton Leuski & Donald Metzler 2012 1 Search Process Information need query text objects representation representation