Chapter 7. Language models. Statistical Machine Translation


Language Models

Language models answer the question: how likely is it that a string of English words is good English?

Help with reordering:
  p_LM(the house is small) > p_LM(small the is house)

Help with word choice:
  p_LM(I am going home) > p_LM(I am going house)

N-Gram Language Models

Given: a string of English words W = w_1, w_2, w_3, ..., w_n
Question: what is p(W)?

Sparse data: many good English sentences will not have been seen before.

Decomposing p(W) using the chain rule:

  p(w_1, w_2, w_3, ..., w_n) = p(w_1) p(w_2 | w_1) p(w_3 | w_1, w_2) ... p(w_n | w_1, w_2, ..., w_{n-1})

(Not much gained yet: p(w_n | w_1, w_2, ..., w_{n-1}) is equally sparse.)

Markov Chain

Markov assumption: only previous history matters
- limited memory: only the last k words are included in the history (older words are less relevant)
- kth-order Markov model

For instance, a 2-gram (bigram) language model:

  p(w_1, w_2, w_3, ..., w_n) ≃ p(w_1) p(w_2 | w_1) p(w_3 | w_2) ... p(w_n | w_{n-1})

What is conditioned on, here w_{i-1}, is called the history.

Estimating N-Gram Probabilities

Maximum likelihood estimation:

  p(w_2 | w_1) = count(w_1, w_2) / count(w_1)

Collect counts over a large text corpus. Millions to billions of words are easy to get (trillions of English words are available on the web).

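The maximum likelihood estimate above can be sketched in a few lines of Python. This is a toy illustration (the corpus and function name are invented, not from the slides):

```python
from collections import Counter

def train_bigram_mle(corpus):
    """Estimate p(w2 | w1) = count(w1, w2) / count(w1) from a list of token lists."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence + ["</s>"]
        unigrams.update(tokens[:-1])             # history counts (exclude final </s>)
        bigrams.update(zip(tokens, tokens[1:]))  # (w1, w2) pairs
    return lambda w1, w2: bigrams[(w1, w2)] / unigrams[w1]

corpus = [["the", "house", "is", "small"], ["the", "house", "is", "big"]]
p = train_bigram_mle(corpus)
print(p("house", "is"))   # 1.0: "is" always follows "house" in this corpus
print(p("is", "small"))   # 0.5: "small" follows "is" in one of two cases
```

On real corpora the counts would of course come from millions of sentences, but the estimator is exactly this ratio of counts.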
Example: 3-Gram

Counts for trigrams and estimated word probabilities in the Europarl corpus:

  the green (total: 1748): paper, group, light, party, ecu, ...
  the red   (total: 225):  cross, tape, army, card, ",", ...
  the blue  (total: 54):   box, flag, ",", angel, ...

225 trigrams in the Europarl corpus start with "the red"; 123 of them end with "cross". The maximum likelihood probability is 123/225 ≈ 0.547.

How Good is the LM?

A good model assigns a text of real English W a high probability.

This can also be measured with cross-entropy:

  H(W) = -(1/n) log_2 p(w_1, ..., w_n)

or with perplexity:

  perplexity(W) = 2^{H(W)}

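Given the per-word conditional probabilities a model assigns, perplexity is straightforward to compute. A minimal sketch (the probability values are invented for illustration):

```python
import math

def perplexity(word_probs):
    """perplexity(W) = 2^H(W), where H(W) = -(1/n) * sum_i log2 p(w_i | history)."""
    n = len(word_probs)
    cross_entropy = -sum(math.log2(p) for p in word_probs) / n
    return 2 ** cross_entropy

# per-word conditional probabilities, e.g. from an n-gram model
print(perplexity([0.25, 0.5, 0.5, 0.125]))  # 2^1.75 ≈ 3.364
```

Perplexity can be read as the geometric mean of the inverse word probabilities: a uniform model over k choices per word has perplexity k.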
Example: 4-Gram

Prediction of the sentence "<s> i would like to commend the rapporteur on his work . </s>", one word at a time:

  p_LM(i | </s> <s>), p_LM(would | <s> i), p_LM(like | i would), p_LM(to | would like),
  p_LM(commend | like to), p_LM(the | to commend), p_LM(rapporteur | commend the),
  p_LM(on | the rapporteur), p_LM(his | rapporteur on), p_LM(work | on his),
  p_LM(. | his work), p_LM(</s> | work .)

For each prediction, the slide lists p_LM and log_2 p_LM, and their average.

Comparison 1-Gram to 4-Gram

Word-by-word probabilities for "i would like to commend the rapporteur on his work </s>" under unigram, bigram, trigram, and 4-gram models, with the average log-probability and the resulting perplexity for each model order.

Unseen N-Grams

We have seen "i like to" in our corpus.
We have never seen "i like to smooth" in our corpus, so

  p(smooth | i like to) = 0

Any sentence that includes "i like to smooth" will be assigned probability 0.

Add-One Smoothing

For all possible n-grams, add a count of one:

  p = (c + 1) / (n + v)

  c = count of n-gram in corpus
  n = count of history
  v = vocabulary size

But there are many more unseen n-grams than seen n-grams. Example: Europarl bigrams:
- 86,700 distinct words
- 86,700^2 = 7,516,890,000 possible bigrams
- but only about 30,000,000 words (and bigrams) in the corpus

Add-α Smoothing

Add α < 1 to each count:

  p = (c + α) / (n + αv)

What is a good value for α? It could be optimized on a held-out set.

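Both add-one and add-α are the same one-line formula; α = 1 recovers add-one. A sketch, using the Europarl vocabulary size from the previous slide and an invented history count:

```python
def add_alpha(c, n, v, alpha=1.0):
    """p = (c + alpha) / (n + alpha * v); alpha = 1 reduces to add-one smoothing."""
    return (c + alpha) / (n + alpha * v)

# unseen bigram (c = 0), history seen n = 1000 times, vocabulary v = 86,700
print(add_alpha(0, 1000, 86700, alpha=1.0))    # add-one gives unseen events a lot of mass
print(add_alpha(0, 1000, 86700, alpha=0.005))  # a small alpha gives far less
```

The example shows why α matters: with add-one, the 86,700 possible continuations swamp the 1,000 observed ones, so almost all probability mass goes to unseen events.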
Example: 2-Grams in Europarl

For each count c, the original table compares the adjusted count under add-one smoothing, (c + 1) n / (n + v^2), and under add-α smoothing, (c + α) n / (n + αv^2), against the test count t_c, for add-α smoothing with an α optimized on held-out data. Here t_c is the average count in the test set of n-grams that occurred c times in the training corpus.

Deleted Estimation

Estimate true counts in held-out data:
- split the corpus in two halves: training and held-out
- counts in training: C_t(w_1, ..., w_n)
- number of n-grams with training count r: N_r
- total times n-grams of training count r are seen in held-out data: T_r

Held-out estimator:

  p_h(w_1, ..., w_n) = T_r / (N_r · N)    where count(w_1, ..., w_n) = r

Both halves can be switched and the results combined:

  p_h(w_1, ..., w_n) = (T_r^1 + T_r^2) / (N (N_r^1 + N_r^2))    where count(w_1, ..., w_n) = r

Good-Turing Smoothing

Adjust actual counts r to expected counts r* with the formula

  r* = (r + 1) N_{r+1} / N_r

where N_r is the number of n-grams that occur exactly r times in the corpus, and N_0 is the total number of possible but unseen n-grams.

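The adjusted count needs only the counts-of-counts N_r. A small sketch with invented toy counts (in practice N_r is smoothed for large r, which is omitted here):

```python
from collections import Counter

def good_turing(ngram_counts, num_possible_ngrams):
    """Return a function r -> r* = (r+1) * N_{r+1} / N_r (Good-Turing adjusted count)."""
    N = Counter(ngram_counts.values())              # N_r: n-grams seen exactly r times
    N[0] = num_possible_ngrams - len(ngram_counts)  # unseen n-grams
    return lambda r: (r + 1) * N[r + 1] / N[r]

counts = {"a b": 1, "a c": 1, "b d": 1, "b c": 2, "c d": 3}
r_star = good_turing(counts, num_possible_ngrams=25)
print(r_star(0))  # (0+1) * N_1 / N_0 = 3 / 20 = 0.15
print(r_star(1))  # (1+1) * N_2 / N_1 = 2 / 3
```

Note how singletons are discounted (r = 1 becomes roughly 0.67) and the freed mass pays for unseen events (r = 0 becomes 0.15 instead of 0).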
Good-Turing for 2-Grams in Europarl

The original table lists, for each count r, the count of counts N_r (e.g. N_0 ≈ 7,514,941,000 unseen bigrams and N_1 = 1,132,844 singletons), the adjusted count r*, and the test count t. The adjusted count is fairly accurate when compared against the test count.

Derivation of Good-Turing

A specific n-gram α occurs with (unknown) probability p in the corpus.

Assumption: all occurrences of the n-gram α are independent of each other.

The number of times α occurs in the corpus then follows a binomial distribution:

  p(c(α) = r) = b(r; N, p) = C(N, r) p^r (1 - p)^{N-r}

where C(N, r) is the binomial coefficient "N choose r".

Derivation of Good-Turing (2)

Goal of Good-Turing smoothing: compute the expected count c*.

The expected count can be computed with help from the binomial distribution:

  E(c*(α)) = Σ_{r=0}^{N} r · p(c(α) = r)
           = Σ_{r=0}^{N} r · C(N, r) p^r (1 - p)^{N-r}

Note again: p is unknown, so we cannot actually compute this.

Derivation of Good-Turing (3)

Definition: the expected number of n-grams that occur r times is E_N(N_r).

We have s different n-grams in the corpus:
- call them α_1, ..., α_s
- each occurs with probability p_1, ..., p_s, respectively

Given the previous formulae, we can compute

  E_N(N_r) = Σ_{i=1}^{s} p(c(α_i) = r)
           = Σ_{i=1}^{s} C(N, r) p_i^r (1 - p_i)^{N-r}

Note again: the p_i are unknown, so we cannot actually compute this.

Derivation of Good-Turing (4)

Reflection:
- we derived a formula to compute E_N(N_r)
- we have counts N_r, and for small r: E_N(N_r) ≈ N_r

Ultimate goal: compute expected counts c*, given actual counts c:

  E(c*(α) | c(α) = r)

Derivation of Good-Turing (5)

For a particular n-gram α, we know its actual count r. Any of the n-grams α_i may occur r times.

Probability that α is one specific α_i:

  p(α = α_i | c(α) = r) = p(c(α_i) = r) / Σ_{j=1}^{s} p(c(α_j) = r)

Expected count of this n-gram α:

  E(c*(α) | c(α) = r) = Σ_{i=1}^{s} N p_i · p(α = α_i | c(α) = r)

Derivation of Good-Turing (6)

Combining the last two equations:

  E(c*(α) | c(α) = r) = Σ_{i=1}^{s} N p_i · p(c(α_i) = r) / Σ_{j=1}^{s} p(c(α_j) = r)

We will now transform this equation to derive Good-Turing smoothing.

Derivation of Good-Turing (7)

Repeat:

  E(c*(α) | c(α) = r) = Σ_{i=1}^{s} N p_i · p(c(α_i) = r) / Σ_{j=1}^{s} p(c(α_j) = r)

The denominator is our definition of the expected counts E_N(N_r).

Derivation of Good-Turing (8)

Numerator:

  Σ_{i=1}^{s} N p_i · p(c(α_i) = r)
    = Σ_{i=1}^{s} N p_i · C(N, r) p_i^r (1 - p_i)^{N-r}
    = Σ_{i=1}^{s} N · (N! / ((N-r)! r!)) · p_i^{r+1} (1 - p_i)^{N-r}
    = Σ_{i=1}^{s} (r+1) · (N / (N+1)) · ((N+1)! / ((N-r)! (r+1)!)) · p_i^{r+1} (1 - p_i)^{N-r}
    = (r+1) · (N / (N+1)) · E_{N+1}(N_{r+1})
    ≈ (r+1) · E_{N+1}(N_{r+1})

Derivation of Good-Turing (9)

Using the simplifications of the numerator and the denominator:

  r* = E(c*(α) | c(α) = r) = (r+1) · E_{N+1}(N_{r+1}) / E_N(N_r) ≈ (r+1) N_{r+1} / N_r

QED

Back-Off

In a given corpus, we may never observe
- "Scottish beer drinkers"
- "Scottish beer eaters"

Both have count 0, so our smoothing methods will assign them the same probability.

Better: back off to bigrams:
- "beer drinkers"
- "beer eaters"

Interpolation

Higher- and lower-order n-gram models have different strengths and weaknesses:
- high-order n-grams are sensitive to more context, but have sparse counts
- low-order n-grams consider only very limited context, but have robust counts

Combine them:

  p_I(w_3 | w_1, w_2) = λ_1 p_1(w_3) + λ_2 p_2(w_3 | w_2) + λ_3 p_3(w_3 | w_1, w_2)

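The interpolated trigram probability is just a weighted sum of the component models. A minimal sketch with toy component models (the fixed probabilities are invented for illustration):

```python
def interpolate_trigram(p1, p2, p3, l1, l2, l3):
    """p_I(w3 | w1, w2) = l1*p1(w3) + l2*p2(w3|w2) + l3*p3(w3|w1,w2), weights summing to 1."""
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9
    return lambda w1, w2, w3: l1 * p1(w3) + l2 * p2(w2, w3) + l3 * p3(w1, w2, w3)

# toy component models returning fixed probabilities
p_uni = lambda w3: 0.1
p_bi  = lambda w2, w3: 0.2
p_tri = lambda w1, w2, w3: 0.4
p_I = interpolate_trigram(p_uni, p_bi, p_tri, 0.2, 0.3, 0.5)
print(p_I("the", "green", "paper"))  # 0.2*0.1 + 0.3*0.2 + 0.5*0.4 = 0.28
```

In practice the λ weights are tuned on held-out data, typically with EM.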
Recursive Interpolation

We can trust some histories w_{i-n+1}, ..., w_{i-1} more than others, so condition the interpolation weights on the history: λ_{w_{i-n+1}, ..., w_{i-1}}.

Recursive definition of interpolation:

  p_I^n(w_i | w_{i-n+1}, ..., w_{i-1}) = λ_{w_{i-n+1}, ..., w_{i-1}} · p_n(w_i | w_{i-n+1}, ..., w_{i-1})
                                       + (1 - λ_{w_{i-n+1}, ..., w_{i-1}}) · p_I^{n-1}(w_i | w_{i-n+2}, ..., w_{i-1})

Back-Off

Trust the highest-order language model that contains the n-gram:

  p_BO^n(w_i | w_{i-n+1}, ..., w_{i-1}) =
    α_n(w_i | w_{i-n+1}, ..., w_{i-1})                                        if count_n(w_{i-n+1}, ..., w_i) > 0
    d_n(w_{i-n+1}, ..., w_{i-1}) · p_BO^{n-1}(w_i | w_{i-n+2}, ..., w_{i-1})  otherwise

This requires
- an adjusted prediction model α_n(w_i | w_{i-n+1}, ..., w_{i-1})
- a discounting function d_n(w_1, ..., w_{n-1})

Back-Off with Good-Turing Smoothing

Previously, we computed n-gram probabilities based on relative frequency:

  p(w_2 | w_1) = count(w_1, w_2) / count(w_1)

Good-Turing smoothing adjusts counts c to expected counts c*:

  count*(w_1, w_2) ≤ count(w_1, w_2)

We use these expected counts for the prediction model (but 0 remains 0):

  α(w_2 | w_1) = count*(w_1, w_2) / count(w_1)

This leaves probability mass for the discounting function:

  d_2(w_1) = 1 - Σ_{w_2} α(w_2 | w_1)

Diversity of Predicted Words

Consider the bigram histories "spite" and "constant":
- both occur 993 times in the Europarl corpus
- only 9 different words follow "spite"; it is almost always followed by "of" (979 times), due to the expression "in spite of"
- 415 different words follow "constant"; the most frequent are "and" (42 times), "concern" (27 times), and "pressure" (26 times), but there is a huge tail of singletons: 268 different words

We are more likely to see a new bigram that starts with "constant" than one that starts with "spite". Witten-Bell smoothing considers this diversity of predicted words.

Witten-Bell Smoothing

A recursive interpolation method.

Number of possible extensions of a history w_1, ..., w_{n-1} in the training data:

  N_{1+}(w_1, ..., w_{n-1}, •) = |{ w_n : c(w_1, ..., w_{n-1}, w_n) > 0 }|

Lambda parameters:

  1 - λ_{w_1, ..., w_{n-1}} = N_{1+}(w_1, ..., w_{n-1}, •) / (N_{1+}(w_1, ..., w_{n-1}, •) + Σ_{w_n} c(w_1, ..., w_{n-1}, w_n))

Witten-Bell Smoothing: Examples

Let us apply this to our two examples:

  1 - λ_spite    = N_{1+}(spite, •) / (N_{1+}(spite, •) + Σ_{w_n} c(spite, w_n))
                 = 9 / (9 + 993) ≈ 0.0090

  1 - λ_constant = N_{1+}(constant, •) / (N_{1+}(constant, •) + Σ_{w_n} c(constant, w_n))
                 = 415 / (415 + 993) ≈ 0.2947

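The Witten-Bell back-off weight 1 - λ is a one-line computation from the number of distinct continuations and the total history count. A sketch reproducing the spite/constant figures from the slides:

```python
def witten_bell_backoff_weight(n_types, n_tokens):
    """1 - lambda_h = N1+(h, .) / (N1+(h, .) + sum_w c(h, w))."""
    return n_types / (n_types + n_tokens)

# "spite": 9 distinct continuations over 993 tokens; "constant": 415 over 993
print(witten_bell_backoff_weight(9, 993))    # ≈ 0.0090 — rarely back off
print(witten_bell_backoff_weight(415, 993))  # ≈ 0.2947 — back off much more often
```

Histories with diverse continuations reserve much more mass for the lower-order model, exactly as the "spite" vs. "constant" comparison suggests.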
Diversity of Histories

Consider the word "York":
- a fairly frequent word in the Europarl corpus, occurring 477 times
- as frequent as "foods", "indicates", and "providers"
- so in a unigram language model it gets a respectable probability

However, it almost always directly follows "New" (473 times).

Recall: the unigram model is only used if the bigram model is inconclusive:
- "York" is an unlikely second word in an unseen bigram
- so in a back-off unigram model, "York" should have a low probability

Kneser-Ney Smoothing

Kneser-Ney smoothing takes the diversity of histories into account.

Count of histories for a word:

  N_{1+}(• w) = |{ w_i : c(w_i, w) > 0 }|

Recall the maximum likelihood estimation of the unigram language model:

  p_ML(w) = c(w) / Σ_i c(w_i)

In Kneser-Ney smoothing, the raw counts are replaced with counts of histories:

  p_KN(w) = N_{1+}(• w) / Σ_{w_i} N_{1+}(• w_i)

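The Kneser-Ney unigram estimate counts distinct histories rather than tokens. A sketch with toy bigram counts (invented, but echoing the "York" example above):

```python
from collections import Counter

def kneser_ney_unigram(bigram_counts):
    """p_KN(w) = N1+(. w) / sum_w' N1+(. w'): count distinct histories, not tokens."""
    histories = Counter(w2 for (w1, w2) in bigram_counts)  # distinct (w1, w2) types
    total = sum(histories.values())
    return lambda w: histories[w] / total

bigrams = {("new", "york"): 473, ("in", "york"): 1,
           ("new", "deal"): 10, ("a", "deal"): 5, ("the", "deal"): 3}
p_kn = kneser_ney_unigram(bigrams)
print(p_kn("york"))  # 2 histories out of 5 distinct bigram types = 0.4
print(p_kn("deal"))  # 3 histories out of 5 = 0.6
```

Although "york" has far more token occurrences (474 vs. 18), "deal" follows more distinct histories and so gets the higher Kneser-Ney unigram probability.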
Modified Kneser-Ney Smoothing

Based on the back-off formula:

  p_BO^n(w_i | w_{i-n+1}, ..., w_{i-1}) =
    α_n(w_i | w_{i-n+1}, ..., w_{i-1})                                        if count_n(w_{i-n+1}, ..., w_i) > 0
    d_n(w_{i-n+1}, ..., w_{i-1}) · p_BO^{n-1}(w_i | w_{i-n+2}, ..., w_{i-1})  otherwise

This requires
- an adjusted prediction model α_n(w_i | w_{i-n+1}, ..., w_{i-1})
- a discounting function d_n(w_1, ..., w_{n-1})

Formula for α for the Highest-Order N-Gram Model

Absolute discounting: subtract a fixed discount D from all non-zero counts:

  α(w_n | w_1, ..., w_{n-1}) = (c(w_1, ..., w_n) - D) / Σ_w c(w_1, ..., w_{n-1}, w)

Refinement: three different discount values

  D(c) = D_1   if c = 1
         D_2   if c = 2
         D_3+  if c ≥ 3

Discount Parameters

The optimal discounting parameters D_1, D_2, D_3+ can be computed quite easily:

  Y = N_1 / (N_1 + 2 N_2)

  D_1  = 1 - 2Y N_2 / N_1
  D_2  = 2 - 3Y N_3 / N_2
  D_3+ = 3 - 4Y N_4 / N_3

The values N_c are the counts of n-grams with exactly count c.

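The discounts follow directly from the counts-of-counts. A sketch (the N_1..N_4 values below are hypothetical, not corpus statistics from the slides):

```python
def kn_discounts(N1, N2, N3, N4):
    """Y = N1/(N1 + 2*N2); D1 = 1 - 2Y*N2/N1, D2 = 2 - 3Y*N3/N2, D3+ = 3 - 4Y*N4/N3."""
    Y = N1 / (N1 + 2 * N2)
    return (1 - 2 * Y * N2 / N1,
            2 - 3 * Y * N3 / N2,
            3 - 4 * Y * N4 / N3)

# hypothetical counts-of-counts N_1..N_4 for some corpus
D1, D2, D3plus = kn_discounts(100_000, 40_000, 20_000, 12_000)
print(D1, D2, D3plus)
```

Note that D_1 algebraically equals Y itself: 1 - 2Y·N_2/N_1 = 1 - 2N_2/(N_1 + 2N_2) = N_1/(N_1 + 2N_2).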
Formula for d for the Highest-Order N-Gram Model

Probability mass set aside from seen events:

  d(w_1, ..., w_{n-1}) = Σ_{i ∈ {1,2,3+}} D_i N_i(w_1, ..., w_{n-1}) / Σ_{w_n} c(w_1, ..., w_n)

The N_i for i ∈ {1, 2, 3+} are computed based on the count of extensions of a history w_1, ..., w_{n-1} with count 1, 2, and 3 or more, respectively. This is similar to Witten-Bell smoothing.

Formula for α for Lower-Order N-Gram Models

Recall: base it on the count of histories N_{1+}(• w) in which a word may appear, not on raw counts:

  α(w_n | w_1, ..., w_{n-1}) = (N_{1+}(• w_1, ..., w_n) - D) / Σ_w N_{1+}(• w_1, ..., w_{n-1}, w)

Again, there are three different values for D (D_1, D_2, D_3+), based on the count of the history w_1, ..., w_{n-1}.

Formula for d for Lower-Order N-Gram Models

Probability mass set aside, available for the d function:

  d(w_1, ..., w_{n-1}) = Σ_{i ∈ {1,2,3+}} D_i N_i(w_1, ..., w_{n-1}) / Σ_{w_n} c(w_1, ..., w_n)

Interpolated Back-Off

Back-off models use only the highest-order n-gram, which, if sparse, is not very reliable:
- two different n-grams with the same history that each occur once get the same probability
- one may be an outlier, the other underrepresented in training

To remedy this, always consider the lower-order back-off models.

Adapt the α function into an interpolated α_I function by adding back-off:

  α_I(w_n | w_1, ..., w_{n-1}) = α(w_n | w_1, ..., w_{n-1}) + d(w_1, ..., w_{n-1}) · p_I(w_n | w_2, ..., w_{n-1})

Note that the d function needs to be adapted as well.

Evaluation

Evaluation of smoothing methods: perplexity for language models trained on the Europarl corpus, for bigram, trigram, and 4-gram models, comparing Good-Turing, Witten-Bell, Modified Kneser-Ney, and Interpolated Modified Kneser-Ney smoothing.

Managing the Size of the Model

Millions to billions of words are easy to get (trillions of English words are available on the web). But: huge language models do not fit into RAM.

Number of Unique N-Grams

Number of unique n-grams in the Europarl corpus (29,501,088 tokens — words and punctuation):

  Order     Unique n-grams   Singletons
  unigram       86,700          33,447  (38.6%)
  bigram     1,948,935       1,132,844  (58.1%)
  trigram    8,092,798       6,022,286  (74.4%)
  4-gram    15,303,847      13,081,621  (85.5%)
  5-gram    19,882,175      18,324,577  (92.2%)

→ remove singletons of higher-order n-grams

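Singleton pruning amounts to a simple filter over the count table. A sketch (the threshold order and toy counts are illustrative, not a prescription from the slides):

```python
def prune_singletons(ngram_counts, keep_orders_up_to=2):
    """Drop n-grams above a given order that occur only once; keep everything else."""
    return {ng: c for ng, c in ngram_counts.items()
            if c > 1 or len(ng) <= keep_orders_up_to}

counts = {("the",): 13, ("the", "green"): 1,
          ("the", "green", "paper"): 1, ("in", "spite", "of"): 979}
pruned = prune_singletons(counts)
print(pruned)  # singleton trigram dropped; unigrams, bigrams, frequent trigrams kept
```

Since 74-92% of the higher-order n-grams are singletons, this one filter removes most of the model while losing little of its predictive power.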
Estimation on Disk

Language models may be too large to build in RAM. What actually needs to be stored in RAM? Maximum likelihood estimation

  p(w_n | w_1, ..., w_{n-1}) = count(w_1, ..., w_n) / count(w_1, ..., w_{n-1})

can be done separately for each history w_1, ..., w_{n-1}.

Keep the data on disk:
- extract all n-grams into files on disk
- sort them by history on disk
- only keep the n-grams with a shared history in RAM

Smoothing techniques may require additional statistics.

Efficient Data Structures

The original slide shows a trie: 4-gram entries (with probabilities "p:") hang off 3-gram nodes, which hang off 2-gram nodes, each node storing a back-off weight ("boff:"), e.g. 4-grams continuing "the very large" (amount, amounts, and, area, ...) under the 3-gram node "very large".

We need to store probabilities for
- "the very large majority"
- "the very large number"

Both share the history "the very large", so there is no need to store the history twice → trie.

Fewer Bits to Store Probabilities

Index for words:
- two bytes allow a vocabulary of 2^16 = 65,536 words; typically more is needed
- Huffman coding can use fewer bits for frequent words

Probabilities:
- typically stored in log format as floats (4 or 8 bytes)
- quantization of probabilities uses even less memory, maybe just 4-8 bits

Reducing Vocabulary Size

For instance: each number is treated as a separate token. We could replace them all with a number token NUM, but we want our language model to prefer

  p_LM(I pay 950.00 in May 2007) > p_LM(I pay 2007 in May 950.00)

which is not possible with a single number token:

  p_LM(I pay NUM in May NUM) = p_LM(I pay NUM in May NUM)

Instead, replace each digit (with a unique symbol, e.g. 5) and retain some distinctions:

  p_LM(I pay 555.55 in May 5555) > p_LM(I pay 5555 in May 555.55)

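Digit normalization as described above is a one-line regular-expression substitution. A minimal sketch (function name invented):

```python
import re

def normalize_digits(text, symbol="5"):
    """Map every digit to a single symbol, e.g. '2007' -> '5555', keeping length and layout."""
    return re.sub(r"[0-9]", symbol, text)

print(normalize_digits("I pay 950.00 in May 2007"))  # "I pay 555.55 in May 5555"
```

All year-like tokens collapse to "5555" and all amount-like tokens to "555.55", so the model can still prefer the right ordering while the vocabulary of number tokens shrinks drastically.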
Filtering Irrelevant N-Grams

We use the language model in decoding:
- we only produce English words that appear in the translation options
- so filter the language model down to n-grams containing only those words

The original slide plots the ratio of 5-grams required (bag-of-words) against sentence length, showing that only a small fraction of all 5-grams is needed.

Summary

- Language models: how likely is it that a string of English words is good English?
- N-gram models (Markov assumption)
- Perplexity
- Count smoothing: add-one, add-α, deleted estimation, Good-Turing
- Interpolation and back-off: Good-Turing, Witten-Bell, Kneser-Ney
- Managing the size of the model
More informationLearning to Complete Sentences
Learning to Complete Sentences Steffen Bickel, Peter Haider, and Tobias Scheffer HumboldtUniversität zu Berlin, Department of Computer Science Unter den Linden 6, 99 Berlin, Germany {bickel,haider,scheffer}@informatik.huberlin.de
More informationEstimating Language Models Using Hadoop and Hbase
Estimating Language Models Using Hadoop and Hbase Xiaoyang Yu E H U N I V E R S I T Y T O H F G R E D I N B U Master of Science Artificial Intelligence School of Informatics University of Edinburgh 2008
More informationModeling Lifetime Value in the Insurance Industry
Modeling Lifetime Value in the Insurance Industry C. Olivia Parr Rud, Executive Vice President, Data Square, LLC ABSTRACT Acquisition modeling for direct mail insurance has the unique challenge of targeting
More informationCan you design an algorithm that searches for the maximum value in the list?
The Science of Computing I Lesson 2: Searching and Sorting Living with Cyber Pillar: Algorithms Searching One of the most common tasks that is undertaken in Computer Science is the task of searching. We
More informationNgram Search Engine with Patterns Combining Token, POS, Chunk and NE Information
Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information Satoshi Sekine Computer Science Department New York University sekine@cs.nyu.edu Kapil Dalwani Computer Science Department
More informationReliable and CostEffective PoSTagging
Reliable and CostEffective PoSTagging YuFang Tsai KehJiann Chen Institute of Information Science, Academia Sinica Nanang, Taipei, Taiwan 5 eddie,chen@iis.sinica.edu.tw Abstract In order to achieve
More informationIntroduction to Bayesian Classification (A Practical Discussion) Todd Holloway Lecture for B551 Nov. 27, 2007
Introduction to Bayesian Classification (A Practical Discussion) Todd Holloway Lecture for B551 Nov. 27, 2007 Naïve Bayes Components ML vs. MAP Benefits Feature Preparation Filtering Decay Extended Examples
More informationLab 1: Text Processing in Linux; Ngrams
Lab 1: Text Processing in Linux; Ngrams Massimo Poesio 30 Settembre 2003 Linux machines come with a lot of software that can be used to manipulate text. The first goal of this lab is to get yourselves
More informationBayes Rule. How is this rule derived? Using Bayes rule for probabilistic inference: Evidence Cause) Evidence)
Bayes Rule P ( A B = B A A B ( Rev. Thomas Bayes (17021761 How is this rule derived? Using Bayes rule for probabilistic inference: P ( Cause Evidence = Evidence Cause Cause Evidence Cause Evidence: diagnostic
More informationMath 0980 Chapter Objectives. Chapter 1: Introduction to Algebra: The Integers.
Math 0980 Chapter Objectives Chapter 1: Introduction to Algebra: The Integers. 1. Identify the place value of a digit. 2. Write a number in words or digits. 3. Write positive and negative numbers used
More informationSMALL INDEX LARGE INDEX (SILT)
Wayne State University ECE 7650: Scalable and Secure Internet Services and Architecture SMALL INDEX LARGE INDEX (SILT) A Memory Efficient High Performance Key Value Store QA REPORT Instructor: Dr. Song
More informationIntroduction to Hypothesis Testing
I. Terms, Concepts. Introduction to Hypothesis Testing A. In general, we do not know the true value of population parameters  they must be estimated. However, we do have hypotheses about what the true
More informationEfficient visual search of local features. Cordelia Schmid
Efficient visual search of local features Cordelia Schmid Visual search change in viewing angle Matches 22 correct matches Image search system for large datasets Large image dataset (one million images
More informationMatrix Algebra and Applications
Matrix Algebra and Applications Dudley Cooke Trinity College Dublin Dudley Cooke (Trinity College Dublin) Matrix Algebra and Applications 1 / 49 EC2040 Topic 2  Matrices and Matrix Algebra Reading 1 Chapters
More informationSample Solutions for Assignment 2.
AMath 383, Autumn 01 Sample Solutions for Assignment. Reading: Chs. 3. 1. Exercise 4 of Chapter. Please note that there is a typo in the formula in part c: An exponent of 1 is missing. It should say 4
More informationPart III: Machine Learning. CS 188: Artificial Intelligence. Machine Learning This Set of Slides. Parameter Estimation. Estimation: Smoothing
CS 188: Artificial Intelligence Lecture 20: Dynamic Bayes Nets, Naïve Bayes Pieter Abbeel UC Berkeley Slides adapted from Dan Klein. Part III: Machine Learning Up until now: how to reason in a model and
More informationTime series Forecasting using HoltWinters Exponential Smoothing
Time series Forecasting using HoltWinters Exponential Smoothing Prajakta S. Kalekar(04329008) Kanwal Rekhi School of Information Technology Under the guidance of Prof. Bernard December 6, 2004 Abstract
More informationMicro blogs Oriented Word Segmentation System
Micro blogs Oriented Word Segmentation System Yijia Liu, Meishan Zhang, Wanxiang Che, Ting Liu, Yihe Deng Research Center for Social Computing and Information Retrieval Harbin Institute of Technology,
More informationDiscrete Mathematics: Homework 7 solution. Due: 2011.6.03
EE 2060 Discrete Mathematics spring 2011 Discrete Mathematics: Homework 7 solution Due: 2011.6.03 1. Let a n = 2 n + 5 3 n for n = 0, 1, 2,... (a) (2%) Find a 0, a 1, a 2, a 3 and a 4. (b) (2%) Show that
More informationMining a Corpus of Job Ads
Mining a Corpus of Job Ads Workshop Strings and Structures Computational Biology & Linguistics Jürgen Jürgen Hermes Hermes Sprachliche Linguistic Data Informationsverarbeitung Processing Institut Department
More informationBasic Use of the TI84 Plus
Basic Use of the TI84 Plus Topics: Key Board Sections Key Functions Screen Contrast Numerical Calculations Order of Operations BuiltIn Templates MATH menu Scientific Notation The key VS the () Key Navigation
More informationMATH 90 CHAPTER 1 Name:.
MATH 90 CHAPTER 1 Name:. 1.1 Introduction to Algebra Need To Know What are Algebraic Expressions? Translating Expressions Equations What is Algebra? They say the only thing that stays the same is change.
More informationA Joint Sequence Translation Model with Integrated Reordering
A Joint Sequence Translation Model with Integrated Reordering Nadir Durrani, Helmut Schmid and Alexander Fraser Institute for Natural Language Processing University of Stuttgart Introduction Generation
More informationText Mining in JMP with R Andrew T. Karl, Senior Management Consultant, Adsurgo LLC Heath Rushing, Principal Consultant and CoFounder, Adsurgo LLC
Text Mining in JMP with R Andrew T. Karl, Senior Management Consultant, Adsurgo LLC Heath Rushing, Principal Consultant and CoFounder, Adsurgo LLC 1. Introduction A popular rule of thumb suggests that
More informationSlovak Automatic Transcription and Dictation System for the Judicial Domain
Slovak Automatic Transcription and Dictation System for the Judicial Domain Milan Rusko 1, Jozef Juhár 2, Marian Trnka 1, Ján Staš 2, Sakhia Darjaa 1, Daniel Hládek 2, Miloš Cerňak 1, Marek Papco 2, Róbert
More informationLargeScale Test Mining
LargeScale Test Mining SIAM Conference on Data Mining Text Mining 2010 Alan Ratner Northrop Grumman Information Systems NORTHROP GRUMMAN PRIVATE / PROPRIETARY LEVEL I Aim Identify topic and language/script/coding
More informationUsing classes has the potential of reducing the problem of sparseness of data by allowing generalizations
POS Tags and Decision Trees for Language Modeling Peter A. Heeman Department of Computer Science and Engineering Oregon Graduate Institute PO Box 91000, Portland OR 97291 heeman@cse.ogi.edu Abstract Language
More informationLecture 10: Sequential Data Models
CSC2515 Fall 2007 Introduction to Machine Learning Lecture 10: Sequential Data Models 1 Example: sequential data Until now, considered data to be i.i.d. Turn attention to sequential data Timeseries: stock
More informationA STATESPACE METHOD FOR LANGUAGE MODELING. Vesa Siivola and Antti Honkela. Helsinki University of Technology Neural Networks Research Centre
A STATESPACE METHOD FOR LANGUAGE MODELING Vesa Siivola and Antti Honkela Helsinki University of Technology Neural Networks Research Centre Vesa.Siivola@hut.fi, Antti.Honkela@hut.fi ABSTRACT In this paper,
More informationHinditoUrdu Machine Translation Through Transliteration
HinditoUrdu Machine Translation Through Transliteration Nadir Durrani, Hassan Sajjad, Alexander Fraser and Helmut Schmid Institute of Natural Language Processing University of Stuttgart Transliteration
More informationBig Data Technology MapReduce Motivation: Indexing in Search Engines
Big Data Technology MapReduce Motivation: Indexing in Search Engines Edward Bortnikov & Ronny Lempel Yahoo Labs, Haifa Indexing in Search Engines Information Retrieval s two main stages: Indexing process
More informationLesson 1: Comparison of Population Means Part c: Comparison of Two Means
Lesson : Comparison of Population Means Part c: Comparison of Two Means Welcome to lesson c. This third lesson of lesson will discuss hypothesis testing for two independent means. Steps in Hypothesis
More informationCHAPTER 2 Estimating Probabilities
CHAPTER 2 Estimating Probabilities Machine Learning Copyright c 2016. Tom M. Mitchell. All rights reserved. *DRAFT OF January 24, 2016* *PLEASE DO NOT DISTRIBUTE WITHOUT AUTHOR S PERMISSION* This is a
More information7 th Grade Pre Algebra A Vocabulary Chronological List
SUM sum the answer to an addition problem Ex. 4 + 5 = 9 The sum is 9. DIFFERENCE difference the answer to a subtraction problem Ex. 8 2 = 6 The difference is 6. PRODUCT product the answer to a multiplication
More informationData Deduplication in Slovak Corpora
Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences, Bratislava, Slovakia Abstract. Our paper describes our experience in deduplication of a Slovak corpus. Two methods of deduplication a plain
More informationErrata and updates for ASM Exam C/Exam 4 Manual (Sixteenth Edition) sorted by page
Errata for ASM Exam C/4 Study Manual (Sixteenth Edition) Sorted by Page 1 Errata and updates for ASM Exam C/Exam 4 Manual (Sixteenth Edition) sorted by page Practice exam 1:9, 1:22, 1:29, 9:5, and 10:8
More informationBuilding A Vocabulary SelfLearning Speech Recognition System
INTERSPEECH 2014 Building A Vocabulary SelfLearning Speech Recognition System Long Qin 1, Alexander Rudnicky 2 1 M*Modal, 1710 Murray Ave, Pittsburgh, PA, USA 2 Carnegie Mellon University, 5000 Forbes
More informationInformation Retrieval
Information Retrieval Indexes All slides unless specifically mentioned are Addison Wesley, 2008; Anton Leuski & Donald Metzler 2012 1 Search Process Information need query text objects representation representation
More informationUPM system for the translation task
UPM system for the translation task Verónica LópezLudeña Grupo de Tecnología del Habla Universidad Politécnica de Madrid veronicalopez@die.upm.es Rubén SanSegundo Grupo de Tecnología del Habla Universidad
More informationOPTIMIZATION OF NEURAL NETWORK LANGUAGE MODELS FOR KEYWORD SEARCH. Ankur Gandhe, Florian Metze, Alex Waibel, Ian Lane
OPTIMIZATION OF NEURAL NETWORK LANGUAGE MODELS FOR KEYWORD SEARCH Ankur Gandhe, Florian Metze, Alex Waibel, Ian Lane Carnegie Mellon University Language Technology Institute {ankurgan,fmetze,ahw,lane}@cs.cmu.edu
More information