Chapter 7. Language models. Statistical Machine Translation
2 Language models Language models answer the question: how likely is it that a string of English words is good English? Help with reordering: p_LM(the house is small) > p_LM(small the is house) Help with word choice: p_LM(I am going home) > p_LM(I am going house) Chapter 7: Language Models 1
3 N-Gram Language Models Given: a string of English words W = w_1, w_2, w_3, ..., w_n Question: what is p(W)? Sparse data: many good English sentences will not have been seen before Decomposing p(W) using the chain rule: p(w_1, w_2, w_3, ..., w_n) = p(w_1) p(w_2 | w_1) p(w_3 | w_1, w_2) ... p(w_n | w_1, w_2, ..., w_{n-1}) (not much gained yet, p(w_n | w_1, w_2, ..., w_{n-1}) is equally sparse) Chapter 7: Language Models 2
4 Markov Chain Markov assumption: only previous history matters limited memory: only the last k words are included in the history (older words are less relevant) → kth order Markov model For instance, a 2-gram language model: p(w_1, w_2, w_3, ..., w_n) ≈ p(w_1) p(w_2 | w_1) p(w_3 | w_2) ... p(w_n | w_{n-1}) What is conditioned on, here w_{i-1}, is called the history Chapter 7: Language Models 3
5 Estimating N-Gram Probabilities Maximum likelihood estimation: collect counts over a large text corpus p(w_2 | w_1) = count(w_1, w_2) / count(w_1) Millions to billions of words are easy to get (trillions of English words available on the web) Chapter 7: Language Models 4
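The counting and division above can be sketched in a few lines of Python. This is a minimal illustration, not code from the slides; the sentence-boundary markers `<s>` and `</s>` are a common convention assumed here.

```python
from collections import Counter

def bigram_mle(corpus):
    """Maximum likelihood bigram estimates from a list of token lists."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence + ["</s>"]
        unigrams.update(tokens[:-1])             # history counts
        bigrams.update(zip(tokens, tokens[1:]))  # (w1, w2) pair counts
    # p(w2 | w1) = count(w1, w2) / count(w1)
    return {(w1, w2): c / unigrams[w1] for (w1, w2), c in bigrams.items()}

p = bigram_mle([["the", "house", "is", "small"], ["the", "house", "is", "big"]])
print(p[("house", "is")])  # 1.0: "house" is always followed by "is"
print(p[("is", "small")])  # 0.5
```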
6 Example: 3-Gram Counts for trigrams and estimated word probabilities: after "the green" (total: 1748), continuations include paper, group, light, party, ecu; after "the red" (total: 225): cross, tape, army, card; after "the blue" (total: 54): box, flag, angel 225 trigrams in the Europarl corpus start with "the red", 123 of them end with "cross" → maximum likelihood probability is 123/225 = 0.547 Chapter 7: Language Models 5
7 How good is the LM? A good model assigns a text of real English W a high probability This can also be measured with cross-entropy: H(W) = -(1/n) log_2 p(w_1, ..., w_n) Or, perplexity: perplexity(W) = 2^{H(W)} Chapter 7: Language Models 6
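Given the per-word probabilities a model assigns to a text, perplexity follows directly from the two formulas above. A minimal sketch (the function name and inputs are illustrative):

```python
import math

def perplexity(word_probs):
    """Perplexity from the per-word probabilities a model assigns to a text.
    H(W) = -(1/n) * sum(log2 p_i); perplexity = 2 ** H(W)."""
    n = len(word_probs)
    cross_entropy = -sum(math.log2(p) for p in word_probs) / n
    return 2 ** cross_entropy

# A model assigning 1/4 to every word has perplexity 4
# (equivalent to a uniform four-way choice at each step).
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # 4.0
```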
8 Example: 4-Gram Word-by-word predictions for the sentence "<s> i would like to commend the rapporteur on his work . </s>": p_LM(i | </s> <s>), p_LM(would | <s> i), p_LM(like | i would), p_LM(to | would like), p_LM(commend | like to), p_LM(the | to commend), p_LM(rapporteur | commend the), p_LM(on | the rapporteur), p_LM(his | rapporteur on), p_LM(work | on his), p_LM(. | his work), p_LM(</s> | work .), each with p_LM and -log_2 p_LM, and the average (numeric values not preserved) Chapter 7: Language Models 7
9 Comparison 1-Gram to 4-Gram -log_2 p_LM for each word of "i would like to commend the rapporteur on his work </s>" under unigram, bigram, trigram, and 4-gram models, together with the average and perplexity of each model (numeric values not preserved) Chapter 7: Language Models 8
10 Unseen N-Grams We have seen "i like to" in our corpus We have never seen "i like to smooth" in our corpus → p(smooth | i like to) = 0 Any sentence that includes "i like to smooth" will be assigned probability 0 Chapter 7: Language Models 9
11 Add-One Smoothing For all possible n-grams, add a count of one: p = (c + 1) / (n + v) c = count of n-gram in corpus n = count of history v = vocabulary size But there are many more unseen n-grams than seen n-grams Example: Europarl bigrams: 86,700 distinct words 86,700^2 = 7,516,890,000 possible bigrams but only about 30,000,000 words (and bigrams) in corpus Chapter 7: Language Models 10
12 Add-α Smoothing Add α < 1 to each count: p = (c + α) / (n + αv) What is a good value for α? Could be optimized on a held-out set Chapter 7: Language Models 11
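Both add-one and add-α smoothing are the same formula with different α. A small sketch, assuming counts are kept in `Counter`s (all names are illustrative):

```python
from collections import Counter

def add_alpha_prob(bigram, bigrams, unigrams, vocab_size, alpha=1.0):
    """p = (c + alpha) / (n + alpha * v); alpha=1 gives add-one smoothing."""
    c = bigrams[bigram]      # count of the n-gram (0 if unseen)
    n = unigrams[bigram[0]]  # count of the history
    return (c + alpha) / (n + alpha * vocab_size)

unigrams = Counter({"i": 100})
bigrams = Counter({("i", "like"): 20})
v = 1000
# Seen bigram keeps most of its mass; the unseen one gets a small non-zero share.
print(add_alpha_prob(("i", "like"), bigrams, unigrams, v, alpha=0.1))
print(add_alpha_prob(("i", "smooth"), bigrams, unigrams, v, alpha=0.1))
```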
13 Example: 2-Grams in Europarl For each count c, compare the adjusted count (c+1) n / (n + v^2) under add-one smoothing, the adjusted count (c+α) n / (n + αv^2) under add-α smoothing, and the test count t_c, where t_c is the average count of n-grams in the test set that occurred c times in the corpus (table values not preserved) Chapter 7: Language Models 12
14 Deleted Estimation Estimate true counts in held-out data: split the corpus in two halves, training and held-out counts in training: C_t(w_1, ..., w_n) number of n-grams with training count r: N_r total times n-grams of training count r are seen in held-out data: T_r Held-out estimator: p_h(w_1, ..., w_n) = T_r / (N_r N) where count(w_1, ..., w_n) = r Both halves can be switched and the results combined: p_h(w_1, ..., w_n) = (T_r^1 + T_r^2) / (N (N_r^1 + N_r^2)) where count(w_1, ..., w_n) = r Chapter 7: Language Models 13
15 Good-Turing Smoothing Adjust actual counts r to expected counts r* with the formula r* = (r + 1) N_{r+1} / N_r N_r is the number of n-grams that occur exactly r times in the corpus N_0 is the number of unseen n-grams Chapter 7: Language Models 14
16 Good-Turing for 2-Grams in Europarl For each count r: count of counts N_r, adjusted count r*, and test count t (e.g. N_0 = 7,514,941,065 unseen bigrams; most table values not preserved) The adjusted count is fairly accurate when compared against the test count Chapter 7: Language Models 15
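The adjustment r* = (r + 1) N_{r+1} / N_r can be computed directly from a table of n-gram counts. A toy sketch (the fallback to the raw count when N_{r+1} is missing is one simple choice among several, not from the slides):

```python
from collections import Counter

def good_turing_adjusted(counts):
    """Adjust each count r to r* = (r + 1) * N_{r+1} / N_r."""
    count_of_counts = Counter(counts.values())  # N_r
    adjusted = {}
    for ngram, r in counts.items():
        if count_of_counts.get(r + 1):
            adjusted[ngram] = (r + 1) * count_of_counts[r + 1] / count_of_counts[r]
        else:
            adjusted[ngram] = float(r)  # no N_{r+1}: keep the raw count
    return adjusted

# Toy counts: three singleton bigrams, one bigram seen twice.
counts = {"a b": 1, "b c": 1, "c d": 1, "d e": 2}
adj = good_turing_adjusted(counts)
print(adj["a b"])  # r* = (1+1) * N_2 / N_1 = 2 * 1 / 3
```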
17 Derivation of Good-Turing A specific n-gram α occurs with (unknown) probability p in the corpus Assumption: all occurrences of an n-gram α are independent of each other The number of times α occurs in the corpus follows the binomial distribution: p(c(α) = r) = b(r; N, p) = (N choose r) p^r (1 - p)^{N-r} Chapter 7: Language Models 16
18 Derivation of Good-Turing (2) Goal of Good-Turing smoothing: compute the expected count c* The expected count can be computed with help from the binomial distribution: E(c*(α)) = Σ_{r=0}^{N} r p(c(α) = r) = Σ_{r=0}^{N} r (N choose r) p^r (1 - p)^{N-r} Note again: p is unknown, we cannot actually compute this Chapter 7: Language Models 17
19 Derivation of Good-Turing (3) Definition: expected number of n-grams that occur r times: E_N(N_r) We have s different n-grams in the corpus, call them α_1, ..., α_s, each occurring with probability p_1, ..., p_s, respectively Given the previous formulae, we can compute E_N(N_r) = Σ_{i=1}^{s} p(c(α_i) = r) = Σ_{i=1}^{s} (N choose r) p_i^r (1 - p_i)^{N-r} Note again: p_i is unknown, we cannot actually compute this Chapter 7: Language Models 18
20 Derivation of Good-Turing (4) Reflection: we derived a formula to compute E_N(N_r) we have N_r for small r: E_N(N_r) ≈ N_r Ultimate goal: compute expected counts c*, given actual counts c: E(c*(α) | c(α) = r) Chapter 7: Language Models 19
21 Derivation of Good-Turing (5) For a particular n-gram α, we know its actual count r Any of the n-grams α_i may occur r times Probability that α is one specific α_i: p(α = α_i | c(α) = r) = p(c(α_i) = r) / Σ_{j=1}^{s} p(c(α_j) = r) Expected count of this n-gram α: E(c*(α) | c(α) = r) = Σ_{i=1}^{s} N p_i p(α = α_i | c(α) = r) Chapter 7: Language Models 20
22 Derivation of Good-Turing (6) Combining the last two equations: E(c*(α) | c(α) = r) = Σ_{i=1}^{s} N p_i p(c(α_i) = r) / Σ_{j=1}^{s} p(c(α_j) = r) We will now transform this equation to derive Good-Turing smoothing Chapter 7: Language Models 21
23 Derivation of Good-Turing (7) Repeat: E(c*(α) | c(α) = r) = Σ_{i=1}^{s} N p_i p(c(α_i) = r) / Σ_{j=1}^{s} p(c(α_j) = r) The denominator is our definition of expected counts, E_N(N_r) Chapter 7: Language Models 22
24 Derivation of Good-Turing (8) Numerator: Σ_{i=1}^{s} N p_i p(c(α_i) = r) = Σ_{i=1}^{s} N p_i (N choose r) p_i^r (1 - p_i)^{N-r} = Σ_{i=1}^{s} N (N! / ((N-r)! r!)) p_i^{r+1} (1 - p_i)^{N-r} = Σ_{i=1}^{s} (r+1) (N / (N+1)) ((N+1)! / ((N-r)! (r+1)!)) p_i^{r+1} (1 - p_i)^{N-r} = (r+1) (N / (N+1)) E_{N+1}(N_{r+1}) ≈ (r+1) E_{N+1}(N_{r+1}) Chapter 7: Language Models 23
25 Derivation of Good-Turing (9) Using the simplifications of numerator and denominator: r* = E(c*(α) | c(α) = r) ≈ (r + 1) E_{N+1}(N_{r+1}) / E_N(N_r) ≈ (r + 1) N_{r+1} / N_r QED Chapter 7: Language Models 24
26 Back-Off In a given corpus, we may never observe "Scottish beer drinkers" or "Scottish beer eaters" Both have count 0 → our smoothing methods will assign them the same probability Better: back off to bigrams: "beer drinkers" and "beer eaters" Chapter 7: Language Models 25
27 Interpolation Higher and lower order n-gram models have different strengths and weaknesses: high-order n-grams are sensitive to more context, but have sparse counts low-order n-grams consider only very limited context, but have robust counts Combine them: p_I(w_3 | w_1, w_2) = λ_1 p_1(w_3) + λ_2 p_2(w_3 | w_2) + λ_3 p_3(w_3 | w_1, w_2) Chapter 7: Language Models 26
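Linear interpolation is just a weighted sum of the three estimates, with weights summing to one. A minimal sketch with hypothetical weights (the λ values shown are for illustration; in practice they would be tuned on held-out data):

```python
def interpolate(p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
    """p_I(w3 | w1, w2) = l1*p1(w3) + l2*p2(w3|w2) + l3*p3(w3|w1,w2)."""
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9  # weights must sum to one
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

# The trigram estimate is zero (unseen n-gram), but the lower-order
# models keep the interpolated probability above zero.
print(interpolate(p_uni=0.001, p_bi=0.05, p_tri=0.0))
```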
28 Recursive Interpolation We can trust some histories w_{i-n+1}, ..., w_{i-1} more than others Condition the interpolation weights on the history: λ_{w_{i-n+1}, ..., w_{i-1}} Recursive definition of interpolation: p_n^I(w_i | w_{i-n+1}, ..., w_{i-1}) = λ_{w_{i-n+1}, ..., w_{i-1}} p_n(w_i | w_{i-n+1}, ..., w_{i-1}) + (1 - λ_{w_{i-n+1}, ..., w_{i-1}}) p_{n-1}^I(w_i | w_{i-n+2}, ..., w_{i-1}) Chapter 7: Language Models 27
29 Back-Off Trust the highest order language model that contains the n-gram: p_n^BO(w_i | w_{i-n+1}, ..., w_{i-1}) = α_n(w_i | w_{i-n+1}, ..., w_{i-1}) if count_n(w_{i-n+1}, ..., w_i) > 0, else d_n(w_{i-n+1}, ..., w_{i-1}) p_{n-1}^BO(w_i | w_{i-n+2}, ..., w_{i-1}) Requires an adjusted prediction model α_n(w_i | w_{i-n+1}, ..., w_{i-1}) and a discounting function d_n(w_1, ..., w_{n-1}) Chapter 7: Language Models 28
30 Back-Off with Good-Turing Smoothing Previously, we computed n-gram probabilities based on relative frequency: p(w_2 | w_1) = count(w_1, w_2) / count(w_1) Good-Turing smoothing adjusts counts c to expected counts c*: count*(w_1, w_2) ≤ count(w_1, w_2) We use these expected counts for the prediction model (but 0 remains 0): α(w_2 | w_1) = count*(w_1, w_2) / count(w_1) This leaves probability mass for the discounting function: d_2(w_1) = 1 - Σ_{w_2} α(w_2 | w_1) Chapter 7: Language Models 29
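For one history, the discounted α values and the leftover mass d can be computed as a simple sketch (the toy counts and adjusted counts below are made up for illustration; in a real model the adjusted counts would come from Good-Turing):

```python
def backoff_alpha_d(counts, adjusted_counts, history_count):
    """alpha(w2|w1) = count*(w1, w2) / count(w1); d(w1) = 1 - sum of alphas."""
    alphas = {w2: adjusted_counts[w2] / history_count for w2 in counts}
    d = 1.0 - sum(alphas.values())  # mass left for backing off to lower order
    return alphas, d

# Toy history seen 10 times; discounting shaved the observed counts down,
# leaving some probability mass for unseen continuations.
counts = {"a": 6, "b": 3}         # raw counts after the history
adjusted = {"a": 5.4, "b": 2.6}   # c* <= c, e.g. from Good-Turing
alphas, d = backoff_alpha_d(counts, adjusted, history_count=10)
print(alphas["a"], d)
```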
31 Diversity of Predicted Words Consider the bigram histories "spite" and "constant": both occur 993 times in the Europarl corpus only 9 different words follow "spite", almost always "of" (979 times), due to the expression "in spite of" 415 different words follow "constant", most frequent: "and" (42 times), "concern" (27 times), "pressure" (26 times), but a huge tail of singletons: 268 different words We are more likely to see a new bigram that starts with "constant" than one that starts with "spite" Witten-Bell smoothing considers the diversity of predicted words Chapter 7: Language Models 30
32 Witten-Bell Smoothing Recursive interpolation method Number of possible extensions of a history w_1, ..., w_{n-1} in the training data: N_{1+}(w_1, ..., w_{n-1}, •) = |{w_n : c(w_1, ..., w_{n-1}, w_n) > 0}| Lambda parameters: 1 - λ_{w_1, ..., w_{n-1}} = N_{1+}(w_1, ..., w_{n-1}, •) / (N_{1+}(w_1, ..., w_{n-1}, •) + Σ_{w_n} c(w_1, ..., w_{n-1}, w_n)) Chapter 7: Language Models 31
33 Witten-Bell Smoothing: Examples Let us apply this to our two examples: 1 - λ_spite = N_{1+}(spite, •) / (N_{1+}(spite, •) + Σ_{w_n} c(spite, w_n)) = 9 / (9 + 993) = 0.00898 1 - λ_constant = N_{1+}(constant, •) / (N_{1+}(constant, •) + Σ_{w_n} c(constant, w_n)) = 415 / (415 + 993) = 0.29474 Chapter 7: Language Models 32
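The Witten-Bell weight for a single history reduces to one division once the two statistics are known. A minimal sketch using the "spite" and "constant" figures from the slides:

```python
def witten_bell_weight(distinct_continuations, history_count):
    """1 - lambda_h = N1+(h, *) / (N1+(h, *) + total count of words after h)."""
    return distinct_continuations / (distinct_continuations + history_count)

# 9 distinct words follow "spite" (993 occurrences);
# 415 distinct words follow "constant" (993 occurrences).
print(witten_bell_weight(9, 993))    # ~0.00898: rarely back off after "spite"
print(witten_bell_weight(415, 993))  # ~0.29474: back off far more after "constant"
```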
34 Diversity of Histories Consider the word "York": a fairly frequent word in the Europarl corpus, occurs 477 times as frequent as "foods", "indicates" and "providers" → in a unigram language model: a respectable probability However, it almost always directly follows "New" (473 times) Recall: the unigram model is only used if the bigram model is inconclusive "York" is an unlikely second word in an unseen bigram → in the back-off unigram model, "York" should have a low probability Chapter 7: Language Models 33
35 Kneser-Ney Smoothing Kneser-Ney smoothing takes the diversity of histories into account Count of histories for a word: N_{1+}(• w) = |{w_i : c(w_i, w) > 0}| Recall: maximum likelihood estimation of a unigram language model: p_ML(w) = c(w) / Σ_i c(w_i) In Kneser-Ney smoothing, replace raw counts with the count of histories: p_KN(w) = N_{1+}(• w) / Σ_{w_i} N_{1+}(• w_i) Chapter 7: Language Models 34
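The continuation counts N_{1+}(• w) can be gathered in one pass over the observed bigrams. A toy sketch (the bigram data is invented to mirror the "York" example; note the frequent-but-predictable word loses probability mass):

```python
from collections import defaultdict

def kneser_ney_unigram(bigrams):
    """p_KN(w) = N1+(. w) / sum over w' of N1+(. w'), from seen bigrams."""
    histories = defaultdict(set)
    for w1, w2 in bigrams:
        histories[w2].add(w1)  # distinct histories preceding w2
    total = sum(len(h) for h in histories.values())
    return {w: len(h) / total for w, h in histories.items()}

# "york" is frequent but follows only "new"; "house" follows many words.
bigrams = [("new", "york")] * 5 + [("the", "house"), ("a", "house"), ("my", "house")]
p = kneser_ney_unigram(bigrams)
print(p["york"], p["house"])  # 0.25 0.75: diversity of histories wins
```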
36 Modified Kneser-Ney Smoothing Based on interpolation: p_n^BO(w_i | w_{i-n+1}, ..., w_{i-1}) = α_n(w_i | w_{i-n+1}, ..., w_{i-1}) if count_n(w_{i-n+1}, ..., w_i) > 0, else d_n(w_{i-n+1}, ..., w_{i-1}) p_{n-1}^BO(w_i | w_{i-n+2}, ..., w_{i-1}) Requires an adjusted prediction model α_n(w_i | w_{i-n+1}, ..., w_{i-1}) and a discounting function d_n(w_1, ..., w_{n-1}) Chapter 7: Language Models 35
37 Formula for α for Highest Order N-Gram Model Absolute discounting: subtract a fixed D from all non-zero counts: α(w_n | w_1, ..., w_{n-1}) = (c(w_1, ..., w_n) - D) / Σ_w c(w_1, ..., w_{n-1}, w) Refinement: three different discount values: D(c) = D_1 if c = 1, D_2 if c = 2, D_3+ if c ≥ 3 Chapter 7: Language Models 36
38 Discount Parameters Optimal discounting parameters D_1, D_2, D_3+ can be computed quite easily: Y = N_1 / (N_1 + 2 N_2) D_1 = 1 - 2Y N_2 / N_1 D_2 = 2 - 3Y N_3 / N_2 D_3+ = 3 - 4Y N_4 / N_3 Values N_c are the counts of n-grams with exactly count c Chapter 7: Language Models 37
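The four formulas above translate directly into code. A minimal sketch; the counts-of-counts passed in below are hypothetical numbers chosen only to exercise the formulas:

```python
def kn_discounts(N1, N2, N3, N4):
    """Discounts from counts-of-counts: Y = N1/(N1 + 2*N2),
    D1 = 1 - 2*Y*N2/N1, D2 = 2 - 3*Y*N3/N2, D3+ = 3 - 4*Y*N4/N3."""
    Y = N1 / (N1 + 2 * N2)
    D1 = 1 - 2 * Y * N2 / N1
    D2 = 2 - 3 * Y * N3 / N2
    D3plus = 3 - 4 * Y * N4 / N3
    return D1, D2, D3plus

# Hypothetical counts-of-counts, for illustration only.
D1, D2, D3plus = kn_discounts(N1=1000, N2=400, N3=200, N4=100)
print(round(D1, 4), round(D2, 4), round(D3plus, 4))
```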
39 Formula for d for Highest Order N-Gram Model Probability mass set aside from seen events: d(w_1, ..., w_{n-1}) = Σ_{i ∈ {1,2,3+}} D_i N_i(w_1, ..., w_{n-1}) / Σ_{w_n} c(w_1, ..., w_n) N_i for i ∈ {1, 2, 3+} are computed based on the count of extensions of a history w_1, ..., w_{n-1} with count 1, 2, and 3 or more, respectively Similar to Witten-Bell smoothing Chapter 7: Language Models 38
40 Formula for α for Lower Order N-Gram Models Recall: base on the count of histories N_{1+}(• w) in which a word may appear, not raw counts: α(w_n | w_1, ..., w_{n-1}) = (N_{1+}(• w_1, ..., w_n) - D) / Σ_w N_{1+}(• w_1, ..., w_{n-1}, w) Again, three different values for D (D_1, D_2, D_3+), based on the count of the history w_1, ..., w_{n-1} Chapter 7: Language Models 39
41 Formula for d for Lower Order N-Gram Models Probability mass set aside, available for the d function: d(w_1, ..., w_{n-1}) = Σ_{i ∈ {1,2,3+}} D_i N_i(w_1, ..., w_{n-1}) / Σ_{w_n} c(w_1, ..., w_n) Chapter 7: Language Models 40
42 Interpolated Back-Off Back-off models use only the highest order n-gram: if sparse, not very reliable two different n-grams with the same history that occur once each get the same probability, yet one may be an outlier, the other under-represented in training To remedy this, always consider the lower-order back-off models Adapt the α function into an interpolated α_I function by adding back-off: α_I(w_n | w_1, ..., w_{n-1}) = α(w_n | w_1, ..., w_{n-1}) + d(w_1, ..., w_{n-1}) p_I(w_n | w_2, ..., w_{n-1}) Note that the d function needs to be adapted as well Chapter 7: Language Models 41
43 Evaluation Evaluation of smoothing methods: perplexity for language models trained on the Europarl corpus, comparing bigram, trigram, and 4-gram models under Good-Turing, Witten-Bell, Modified Kneser-Ney, and Interpolated Modified Kneser-Ney smoothing (perplexity values not preserved) Chapter 7: Language Models 42
44 Managing the Size of the Model Millions to billions of words are easy to get (trillions of English words available on the web) But: huge language models do not fit into RAM Chapter 7: Language Models 43
45 Number of Unique N-Grams Number of unique n-grams in Europarl corpus 29,501,088 tokens (words and punctuation) Order Unique n-grams Singletons unigram 86,700 33,447 (38.6%) bigram 1,948,935 1,132,844 (58.1%) trigram 8,092,798 6,022,286 (74.4%) 4-gram 15,303,847 13,081,621 (85.5%) 5-gram 19,882,175 18,324,577 (92.2%) remove singletons of higher order n-grams Chapter 7: Language Models 44
46 Estimation on Disk Language models too large to build: what needs to be stored in RAM? Maximum likelihood estimation p(w_n | w_1, ..., w_{n-1}) = count(w_1, ..., w_n) / count(w_1, ..., w_{n-1}) can be done separately for each history w_1, ..., w_{n-1} Keep data on disk: extract all n-grams into files on disk sort by history on disk only keep n-grams with a shared history in RAM Smoothing techniques may require additional statistics Chapter 7: Language Models 45
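Once the n-gram records are sorted by history on disk, each history's group can be processed independently while holding only that group in RAM. A minimal sketch of the grouping step (a small in-memory list stands in for the disk-sorted file, e.g. one produced by Unix `sort`):

```python
from itertools import groupby

def mle_by_history(sorted_ngram_counts):
    """Stream (history, word, count) records sorted by history; only one
    history's records are held in memory at a time."""
    for history, group in groupby(sorted_ngram_counts, key=lambda rec: rec[0]):
        records = list(group)                  # n-grams sharing this history
        total = sum(c for _, _, c in records)  # count of the history
        for _, word, c in records:
            yield history, word, c / total

records = [("the", "house", 3), ("the", "man", 1), ("a", "dog", 2)]
records.sort(key=lambda rec: rec[0])  # the on-disk sort, in miniature
probs = {(h, w): p for h, w, p in mle_by_history(records)}
print(probs[("the", "house")])  # 0.75
```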
47 Efficient Data Structures [figure: a trie over n-grams, with 4-gram, 3-gram back-off, 2-gram back-off, and 1-gram back-off levels, storing a probability p for each entry and a back-off weight for each history] Need to store probabilities for "the very large majority" and "the very large number" Both share the history "the very large" → no need to store the history twice → Trie Chapter 7: Language Models 46
48 Fewer Bits to Store Probabilities Index for words: two bytes allow a vocabulary of 2^16 = 65,536 words, typically more is needed Huffman coding to use fewer bits for frequent words Probabilities: typically stored in log format as floats (4 or 8 bytes) quantization of probabilities to use even less memory, maybe just 4-8 bits Chapter 7: Language Models 47
49 Reducing Vocabulary Size For instance: each number is treated as a separate token Replace them with a number token NUM but: we want our language model to prefer p_LM(I pay [amount] in May 2007) > p_LM(I pay 2007 in May [amount]) not possible with a single number token: p_LM(I pay NUM in May NUM) = p_LM(I pay NUM in May NUM) Replace each digit (with a unique symbol, e.g. 5), retaining some distinctions: p_LM(I pay [amount] in May 5555) > p_LM(I pay 5555 in May [amount]) Chapter 7: Language Models 48
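Digit replacement is a one-line normalization. A minimal sketch; the sample sentence and amount below are invented for illustration:

```python
import re

def normalize_digits(text):
    """Replace every digit with '5', keeping each number's shape
    (length, decimal point) so year-like and amount-like tokens stay distinct."""
    return re.sub(r"[0-9]", "5", text)

print(normalize_digits("I pay 950.00 in May 2007"))  # I pay 555.55 in May 5555
```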
50 Filtering Irrelevant N-Grams We use the language model in decoding: we only produce English words in translation options → filter the language model down to n-grams containing only those words [figure: ratio of 5-grams required (bag-of-words) to all 5-grams, by sentence length] Chapter 7: Language Models 49
51 Summary Language models: how likely is it that a string of English words is good English? N-gram models (Markov assumption) Perplexity Count smoothing: add-one, add-α deleted estimation Good-Turing Interpolation and back-off: Good-Turing Witten-Bell Kneser-Ney Managing the size of the model Chapter 7: Language Models 50
CLOUD ANALYTICS: Empowering the Army Intelligence Core Analytic Enterprise 5 APR 2011 1 2005... Advanced Analytics Harnessing Data for the Warfighter I2E GIG Brigade Combat Team Data Silos DCGS LandWarNet
More informationSample Solutions for Assignment 2.
AMath 383, Autumn 01 Sample Solutions for Assignment. Reading: Chs. -3. 1. Exercise 4 of Chapter. Please note that there is a typo in the formula in part c: An exponent of 1 is missing. It should say 4
More informationPart III: Machine Learning. CS 188: Artificial Intelligence. Machine Learning This Set of Slides. Parameter Estimation. Estimation: Smoothing
CS 188: Artificial Intelligence Lecture 20: Dynamic Bayes Nets, Naïve Bayes Pieter Abbeel UC Berkeley Slides adapted from Dan Klein. Part III: Machine Learning Up until now: how to reason in a model and
More informationThe TCH Machine Translation System for IWSLT 2008
The TCH Machine Translation System for IWSLT 2008 Haifeng Wang, Hua Wu, Xiaoguang Hu, Zhanyi Liu, Jianfeng Li, Dengjun Ren, Zhengyu Niu Toshiba (China) Research and Development Center 5/F., Tower W2, Oriental
More informationSMALL INDEX LARGE INDEX (SILT)
Wayne State University ECE 7650: Scalable and Secure Internet Services and Architecture SMALL INDEX LARGE INDEX (SILT) A Memory Efficient High Performance Key Value Store QA REPORT Instructor: Dr. Song
More informationdiscuss how to describe points, lines and planes in 3 space.
Chapter 2 3 Space: lines and planes In this chapter we discuss how to describe points, lines and planes in 3 space. introduce the language of vectors. discuss various matters concerning the relative position
More informationMATH 90 CHAPTER 1 Name:.
MATH 90 CHAPTER 1 Name:. 1.1 Introduction to Algebra Need To Know What are Algebraic Expressions? Translating Expressions Equations What is Algebra? They say the only thing that stays the same is change.
More informationUsing classes has the potential of reducing the problem of sparseness of data by allowing generalizations
POS Tags and Decision Trees for Language Modeling Peter A. Heeman Department of Computer Science and Engineering Oregon Graduate Institute PO Box 91000, Portland OR 97291 heeman@cse.ogi.edu Abstract Language
More informationTime series Forecasting using Holt-Winters Exponential Smoothing
Time series Forecasting using Holt-Winters Exponential Smoothing Prajakta S. Kalekar(04329008) Kanwal Rekhi School of Information Technology Under the guidance of Prof. Bernard December 6, 2004 Abstract
More informationSentiment analysis using emoticons
Sentiment analysis using emoticons Royden Kayhan Lewis Moharreri Steven Royden Ware Lewis Kayhan Steven Moharreri Ware Department of Computer Science, Ohio State University Problem definition Our aim was
More informationInformation Leakage in Encrypted Network Traffic
Information Leakage in Encrypted Network Traffic Attacks and Countermeasures Scott Coull RedJack Joint work with: Charles Wright (MIT LL) Lucas Ballard (Google) Fabian Monrose (UNC) Gerald Masson (JHU)
More informationMicro blogs Oriented Word Segmentation System
Micro blogs Oriented Word Segmentation System Yijia Liu, Meishan Zhang, Wanxiang Che, Ting Liu, Yihe Deng Research Center for Social Computing and Information Retrieval Harbin Institute of Technology,
More informationSlovak Automatic Transcription and Dictation System for the Judicial Domain
Slovak Automatic Transcription and Dictation System for the Judicial Domain Milan Rusko 1, Jozef Juhár 2, Marian Trnka 1, Ján Staš 2, Sakhia Darjaa 1, Daniel Hládek 2, Miloš Cerňak 1, Marek Papco 2, Róbert
More informationAlgebra I Vocabulary Cards
Algebra I Vocabulary Cards Table of Contents Expressions and Operations Natural Numbers Whole Numbers Integers Rational Numbers Irrational Numbers Real Numbers Absolute Value Order of Operations Expression
More informationDiscrete Mathematics: Homework 7 solution. Due: 2011.6.03
EE 2060 Discrete Mathematics spring 2011 Discrete Mathematics: Homework 7 solution Due: 2011.6.03 1. Let a n = 2 n + 5 3 n for n = 0, 1, 2,... (a) (2%) Find a 0, a 1, a 2, a 3 and a 4. (b) (2%) Show that
More informationInformation, Entropy, and Coding
Chapter 8 Information, Entropy, and Coding 8. The Need for Data Compression To motivate the material in this chapter, we first consider various data sources and some estimates for the amount of data associated
More informationMath 0980 Chapter Objectives. Chapter 1: Introduction to Algebra: The Integers.
Math 0980 Chapter Objectives Chapter 1: Introduction to Algebra: The Integers. 1. Identify the place value of a digit. 2. Write a number in words or digits. 3. Write positive and negative numbers used
More informationA STATE-SPACE METHOD FOR LANGUAGE MODELING. Vesa Siivola and Antti Honkela. Helsinki University of Technology Neural Networks Research Centre
A STATE-SPACE METHOD FOR LANGUAGE MODELING Vesa Siivola and Antti Honkela Helsinki University of Technology Neural Networks Research Centre Vesa.Siivola@hut.fi, Antti.Honkela@hut.fi ABSTRACT In this paper,
More informationA Joint Sequence Translation Model with Integrated Reordering
A Joint Sequence Translation Model with Integrated Reordering Nadir Durrani, Helmut Schmid and Alexander Fraser Institute for Natural Language Processing University of Stuttgart Introduction Generation
More informationA Concrete Introduction. to the Abstract Concepts. of Integers and Algebra using Algebra Tiles
A Concrete Introduction to the Abstract Concepts of Integers and Algebra using Algebra Tiles Table of Contents Introduction... 1 page Integers 1: Introduction to Integers... 3 2: Working with Algebra Tiles...
More informationImproving Data Driven Part-of-Speech Tagging by Morphologic Knowledge Induction
Improving Data Driven Part-of-Speech Tagging by Morphologic Knowledge Induction Uwe D. Reichel Department of Phonetics and Speech Communication University of Munich reichelu@phonetik.uni-muenchen.de Abstract
More informationLog-Linear Models a.k.a. Logistic Regression, Maximum Entropy Models
Log-Linear Models a.k.a. Logistic Regression, Maximum Entropy Models Natural Language Processing CS 6120 Spring 2014 Northeastern University David Smith (some slides from Jason Eisner and Dan Klein) summary
More informationIntroduction to Hypothesis Testing
I. Terms, Concepts. Introduction to Hypothesis Testing A. In general, we do not know the true value of population parameters - they must be estimated. However, we do have hypotheses about what the true
More informationReliable and Cost-Effective PoS-Tagging
Reliable and Cost-Effective PoS-Tagging Yu-Fang Tsai Keh-Jiann Chen Institute of Information Science, Academia Sinica Nanang, Taipei, Taiwan 5 eddie,chen@iis.sinica.edu.tw Abstract In order to achieve
More informationNaive Bayes Spam Filtering Using Word-Position-Based Attributes
Naive Bayes Spam Filtering Using Word-Position-Based Attributes Johan Hovold Department of Computer Science Lund University Box 118, 221 00 Lund, Sweden johan.hovold.363@student.lu.se Abstract This paper
More information7-2 Speech-to-Speech Translation System Field Experiments in All Over Japan
7-2 Speech-to-Speech Translation System Field Experiments in All Over Japan We explain field experiments conducted during the 2009 fiscal year in five areas of Japan. We also show the experiments of evaluation
More informationACL Command Reference
ACL Command Reference Test or Operation Explanation Command(s) Key fields*/records Output Basic Completeness Uniqueness Frequency & Materiality Distribution Multi-File Combinations, Comparisons, and Associations
More informationChapter 6. Decoding. Statistical Machine Translation
Chapter 6 Decoding Statistical Machine Translation Decoding We have a mathematical model for translation p(e f) Task of decoding: find the translation e best with highest probability Two types of error
More informationDistributed Computing over Communication Networks: Maximal Independent Set
Distributed Computing over Communication Networks: Maximal Independent Set What is a MIS? MIS An independent set (IS) of an undirected graph is a subset U of nodes such that no two nodes in U are adjacent.
More informationCHAPTER II THE LIMIT OF A SEQUENCE OF NUMBERS DEFINITION OF THE NUMBER e.
CHAPTER II THE LIMIT OF A SEQUENCE OF NUMBERS DEFINITION OF THE NUMBER e. This chapter contains the beginnings of the most important, and probably the most subtle, notion in mathematical analysis, i.e.,
More informationTIMING DIAGRAM O 8085
5 TIMING DIAGRAM O 8085 5.1 INTRODUCTION Timing diagram is the display of initiation of read/write and transfer of data operations under the control of 3-status signals IO / M, S 1, and S 0. As the heartbeat
More informationLesson 1: Comparison of Population Means Part c: Comparison of Two- Means
Lesson : Comparison of Population Means Part c: Comparison of Two- Means Welcome to lesson c. This third lesson of lesson will discuss hypothesis testing for two independent means. Steps in Hypothesis
More information(IALC, Chapters 8 and 9) Introduction to Turing s life, Turing machines, universal machines, unsolvable problems.
3130CIT: Theory of Computation Turing machines and undecidability (IALC, Chapters 8 and 9) Introduction to Turing s life, Turing machines, universal machines, unsolvable problems. An undecidable problem
More informationErrata and updates for ASM Exam C/Exam 4 Manual (Sixteenth Edition) sorted by page
Errata for ASM Exam C/4 Study Manual (Sixteenth Edition) Sorted by Page 1 Errata and updates for ASM Exam C/Exam 4 Manual (Sixteenth Edition) sorted by page Practice exam 1:9, 1:22, 1:29, 9:5, and 10:8
More informationExploratory Data Analysis
Exploratory Data Analysis Johannes Schauer johannes.schauer@tugraz.at Institute of Statistics Graz University of Technology Steyrergasse 17/IV, 8010 Graz www.statistics.tugraz.at February 12, 2008 Introduction
More informationCHAPTER 5 Round-off errors
CHAPTER 5 Round-off errors In the two previous chapters we have seen how numbers can be represented in the binary numeral system and how this is the basis for representing numbers in computers. Since any
More informationPROBLEMS #20,R0,R1 #$3A,R2,R4
506 CHAPTER 8 PIPELINING (Corrisponde al cap. 11 - Introduzione al pipelining) PROBLEMS 8.1 Consider the following sequence of instructions Mul And #20,R0,R1 #3,R2,R3 #$3A,R2,R4 R0,R2,R5 In all instructions,
More informationTerminology Extraction from Log Files
Terminology Extraction from Log Files Hassan Saneifar, Stéphane Bonniol, Anne Laurent, Pascal Poncelet, Mathieu Roche To cite this version: Hassan Saneifar, Stéphane Bonniol, Anne Laurent, Pascal Poncelet,
More informationIntroduction to Algorithmic Trading Strategies Lecture 2
Introduction to Algorithmic Trading Strategies Lecture 2 Hidden Markov Trading Model Haksun Li haksun.li@numericalmethod.com www.numericalmethod.com Outline Carry trade Momentum Valuation CAPM Markov chain
More informationTerminology Extraction from Log Files
Terminology Extraction from Log Files Hassan Saneifar 1,2, Stéphane Bonniol 2, Anne Laurent 1, Pascal Poncelet 1, and Mathieu Roche 1 1 LIRMM - Université Montpellier 2 - CNRS 161 rue Ada, 34392 Montpellier
More informationSYSTRAN 混 合 策 略 汉 英 和 英 汉 机 器 翻 译 系 统
SYSTRAN Chinese-English and English-Chinese Hybrid Machine Translation Systems Jin Yang, Satoshi Enoue Jean Senellart, Tristan Croiset SYSTRAN Software, Inc. SYSTRAN SA 9333 Genesee Ave. Suite PL1 La Grande
More information