Chapter 1

Language Modeling
(Course notes for NLP by Michael Collins, Columbia University)

1.1 Introduction

In this chapter we will consider the problem of constructing a language model from a set of example sentences in a language. Language models were originally developed for the problem of speech recognition; they still play a central role in modern speech recognition systems. They are also widely used in other NLP applications. The parameter estimation techniques that were originally developed for language modeling, as described in this chapter, are useful in many other contexts, such as the tagging and parsing problems considered in later chapters of this book.

Our task is as follows. Assume that we have a corpus, which is a set of sentences in some language. For example, we might have several years of text from the New York Times, or we might have a very large amount of text from the web. Given this corpus, we'd like to estimate the parameters of a language model.

A language model is defined as follows. First, we will define V to be the set of all words in the language. For example, when building a language model for English we might have

    V = {the, dog, laughs, saw, barks, cat, ...}

In practice V can be quite large: it might contain several thousands, or tens of thousands, of words. We assume that V is a finite set. A sentence in the language is a sequence of words

    x_1 x_2 ... x_n

where the integer n is such that n ≥ 1, we have x_i ∈ V for i ∈ {1 ... (n-1)}, and we assume that x_n is a special symbol, STOP (we assume that STOP is not a member of V).
We'll soon see why it is convenient to assume that each sentence ends in the STOP symbol. Example sentences could be

    the dog barks STOP
    the cat laughs STOP
    the cat saw the dog STOP
    the STOP
    cat the dog the STOP
    cat cat cat STOP
    STOP
    ...

We will define V† to be the set of all sentences with the vocabulary V: this is an infinite set, because sentences can be of any length. We then give the following definition:

Definition 1 (Language Model) A language model consists of a finite set V, and a function p(x_1, x_2, ..., x_n) such that:

1. For any x_1 ... x_n ∈ V†, p(x_1, x_2, ..., x_n) ≥ 0

2. In addition, ∑_{x_1 ... x_n ∈ V†} p(x_1, x_2, ..., x_n) = 1

Hence p(x_1, x_2, ..., x_n) is a probability distribution over the sentences in V†.

As one example of a (very bad) method for learning a language model from a training corpus, consider the following. Define c(x_1 ... x_n) to be the number of times that the sentence x_1 ... x_n is seen in our training corpus, and N to be the total number of sentences in the training corpus. We could then define

    p(x_1 ... x_n) = c(x_1 ... x_n) / N

This is, however, a very poor model: in particular it will assign probability 0 to any sentence not seen in the training corpus. Thus it fails to generalize to sentences that have not been seen in the training data. The key technical contribution of this chapter will be to introduce methods that do generalize to sentences that are not seen in our training data.
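As an illustration, here is a small Python sketch of this count-based estimate on a toy corpus (the corpus and the function name are invented for illustration). Note how any sentence that does not appear in the corpus receives probability 0.

    from collections import Counter

    # Naive "language model": p(sentence) = c(sentence) / N, where N is the
    # number of training sentences. The toy corpus is made up for illustration.
    corpus = [
        ("the", "dog", "barks", "STOP"),
        ("the", "cat", "laughs", "STOP"),
        ("the", "dog", "barks", "STOP"),
    ]
    sentence_counts = Counter(corpus)   # c(x_1 ... x_n)
    N = len(corpus)                     # total number of training sentences

    def p_naive(sentence):
        """Relative-frequency estimate; 0 for any unseen sentence."""
        return sentence_counts[tuple(sentence)] / N

    print(p_naive(["the", "dog", "barks", "STOP"]))               # 2/3
    print(p_naive(["the", "cat", "saw", "the", "dog", "STOP"]))   # 0.0: fails to generalize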
At first glance the language modeling problem seems like a rather strange task, so why are we considering it? There are a couple of reasons:

1. Language models are very useful in a broad range of applications, the most obvious perhaps being speech recognition and machine translation. In many applications it is very useful to have a good prior distribution p(x_1 ... x_n) over which sentences are or aren't probable in a language. For example, in speech recognition the language model is combined with an acoustic model that models the pronunciation of different words: one way to think about it is that the acoustic model generates a large number of candidate sentences, together with probabilities; the language model is then used to reorder these possibilities based on how likely they are to be a sentence in the language.

2. The techniques we describe for defining the function p, and for estimating the parameters of the resulting model from training examples, will be useful in several other contexts during the course: for example in hidden Markov models, which we will see next, and in models for natural language parsing.

1.2 Markov Models

We now turn to a critical question: given a training corpus, how do we learn the function p? In this section we describe Markov models, a central idea from probability theory; in the next section we describe trigram language models, an important class of language models that build directly on ideas from Markov models.

1.2.1 Markov Models for Fixed-length Sequences

Consider a sequence of random variables X_1, X_2, ..., X_n. Each random variable can take any value in a finite set V. For now we will assume that the length of the sequence, n, is some fixed number (e.g., n = 100). In the next section we'll describe how to generalize the approach to cases where n is also a random variable, allowing different sequences to have different lengths.

Our goal is as follows: we would like to model the probability of any sequence x_1 ... x_n, where n ≥ 1 and x_i ∈ V for i = 1 ... n, that is, to model the joint probability

    P(X_1 = x_1, X_2 = x_2, ..., X_n = x_n)

There are |V|^n possible sequences of the form x_1 ... x_n: so clearly, it is not feasible for reasonable values of |V| and n to simply list all |V|^n probabilities. We would like to build a much more compact model.

In a first-order Markov process, we make the following assumption, which considerably simplifies the model:

    P(X_1 = x_1, X_2 = x_2, ..., X_n = x_n)
      = P(X_1 = x_1) ∏_{i=2}^{n} P(X_i = x_i | X_1 = x_1, ..., X_{i-1} = x_{i-1})     (1.1)
      = P(X_1 = x_1) ∏_{i=2}^{n} P(X_i = x_i | X_{i-1} = x_{i-1})                     (1.2)
The first step, in Eq. 1.1, is exact: by the chain rule of probabilities, any distribution P(X_1 = x_1, ..., X_n = x_n) can be written in this form. So we have made no assumptions in this step of the derivation. However, the second step, in Eq. 1.2, is not necessarily exact: we have made the assumption that for any i ∈ {2 ... n}, for any x_1 ... x_i,

    P(X_i = x_i | X_1 = x_1, ..., X_{i-1} = x_{i-1}) = P(X_i = x_i | X_{i-1} = x_{i-1})

This is a (first-order) Markov assumption. We have assumed that the identity of the i'th word in the sequence depends only on the identity of the previous word, x_{i-1}. More formally, we have assumed that the value of X_i is conditionally independent of X_1 ... X_{i-2}, given the value for X_{i-1}.

In a second-order Markov process, which will form the basis of trigram language models, we make a slightly weaker assumption, namely that each word depends on the previous two words in the sequence:

    P(X_i = x_i | X_1 = x_1, ..., X_{i-1} = x_{i-1}) = P(X_i = x_i | X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1})

It follows that the probability of an entire sequence is written as

    P(X_1 = x_1, X_2 = x_2, ..., X_n = x_n) = ∏_{i=1}^{n} P(X_i = x_i | X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1})     (1.3)

For convenience, we will assume that x_0 = x_{-1} = * in this definition, where * is a special "start" symbol in the sentence.

1.2.2 Markov Sequences for Variable-length Sentences

In the previous section, we assumed that the length of the sequence, n, was fixed. In many applications, however, the length n can itself vary. Thus n is itself a random variable. There are various ways of modeling this variability in length: in this section we describe the most common approach for language modeling.

The approach is simple: we will assume that the n'th word in the sequence, X_n, is always equal to a special symbol, the STOP symbol. This symbol can only appear at the end of a sequence.
We use exactly the same assumptions as before: for example under a second-order Markov assumption, we have

    P(X_1 = x_1, X_2 = x_2, ..., X_n = x_n) = ∏_{i=1}^{n} P(X_i = x_i | X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1})     (1.4)

for any n ≥ 1, and for any x_1 ... x_n such that x_n = STOP, and x_i ∈ V for i = 1 ... (n-1).

We have assumed a second-order Markov process where at each step we generate a symbol x_i from the distribution

    P(X_i = x_i | X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1})

where x_i can be a member of V, or alternatively can be the STOP symbol. If we generate the STOP symbol, we finish the sequence. Otherwise, we generate the next symbol in the sequence. A little more formally, the process that generates sentences would be as follows:

1. Initialize i = 1, and x_0 = x_{-1} = *

2. Generate x_i from the distribution P(X_i = x_i | X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1})

3. If x_i = STOP then return the sequence x_1 ... x_i. Otherwise, set i = i + 1 and return to step 2.

Thus we now have a model that generates sequences that vary in length.
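The following short Python sketch illustrates this generative process, assuming we already have a conditional distribution q(w | u, v); the toy distribution and all names below are invented for illustration.

    import random

    # Toy conditional distribution q(w | u, v), stored as a dictionary; made up
    # purely to illustrate the sampling process described above.
    q = {
        ("*", "*"):        {"the": 1.0},
        ("*", "the"):      {"dog": 0.6, "cat": 0.4},
        ("the", "dog"):    {"barks": 0.8, "STOP": 0.2},
        ("the", "cat"):    {"laughs": 0.7, "STOP": 0.3},
        ("dog", "barks"):  {"STOP": 1.0},
        ("cat", "laughs"): {"STOP": 1.0},
    }

    def generate_sentence(q):
        """Sample x_1 ... x_n by repeatedly drawing x_i ~ q(. | x_{i-2}, x_{i-1})."""
        sentence, u, v = [], "*", "*"
        while True:
            words, probs = zip(*q[(u, v)].items())
            w = random.choices(words, weights=probs)[0]
            sentence.append(w)
            if w == "STOP":
                return sentence
            u, v = v, w

    print(generate_sentence(q))   # e.g. ['the', 'dog', 'barks', 'STOP']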
1.3 Trigram Language Models

There are various ways of defining language models, but we'll focus on a particularly important example, the trigram language model, in this chapter. This will be a direct application of Markov models, as described in the previous section, to the language modeling problem. In this section we give the basic definition of a trigram model, discuss maximum-likelihood parameter estimates for trigram models, and finally discuss strengths and weaknesses of trigram models.

1.3.1 Basic Definitions

As in Markov models, we model each sentence as a sequence of n random variables, X_1, X_2, ..., X_n. The length, n, is itself a random variable (it can vary across different sentences). We always have X_n = STOP. Under a second-order Markov model, the probability of any sentence x_1 ... x_n is then

    P(X_1 = x_1, X_2 = x_2, ..., X_n = x_n) = ∏_{i=1}^{n} P(X_i = x_i | X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1})

where we assume as before that x_0 = x_{-1} = *.

We will assume that for any i, for any x_{i-2}, x_{i-1}, x_i,

    P(X_i = x_i | X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1}) = q(x_i | x_{i-2}, x_{i-1})

where q(w | u, v) for any (u, v, w) is a parameter of the model. We will soon see how to derive estimates of the q(w | u, v) parameters from our training corpus. Our model then takes the form

    p(x_1 ... x_n) = ∏_{i=1}^{n} q(x_i | x_{i-2}, x_{i-1})

for any sequence x_1 ... x_n. This leads us to the following definition:

Definition 2 (Trigram Language Model) A trigram language model consists of a finite set V, and a parameter

    q(w | u, v)

for each trigram u, v, w such that w ∈ V ∪ {STOP}, and u, v ∈ V ∪ {*}. The value for q(w | u, v) can be interpreted as the probability of seeing the word w immediately after the bigram (u, v). For any sentence x_1 ... x_n where x_i ∈ V for i = 1 ... (n-1), and x_n = STOP, the probability of the sentence under the trigram language model is

    p(x_1 ... x_n) = ∏_{i=1}^{n} q(x_i | x_{i-2}, x_{i-1})

where we define x_0 = x_{-1} = *.

For example, for the sentence

    the dog barks STOP

we would have

    p(the dog barks STOP) = q(the | *, *) × q(dog | *, the) × q(barks | the, dog) × q(STOP | dog, barks)

Note that in this expression we have one term for each word in the sentence (the, dog, barks, and STOP). Each word depends only on the previous two words: this is the trigram assumption.
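As an illustrative sketch (the function and variable names are invented, and the parameters q are stored as a simple dictionary), the probability of a sentence is computed with one q(w | u, v) term per word:

    # Compute p(x_1 ... x_n) under a trigram model, with x_0 = x_{-1} = *.
    def sentence_probability(sentence, q):
        """sentence: list of words ending in 'STOP'; q: dict mapping (u, v, w) -> probability."""
        p = 1.0
        u, v = "*", "*"
        for w in sentence:
            p *= q.get((u, v, w), 0.0)   # unseen trigrams get probability 0 here
            u, v = v, w
        return p

    q = {
        ("*", "*", "the"): 0.5,
        ("*", "the", "dog"): 0.4,
        ("the", "dog", "barks"): 0.3,
        ("dog", "barks", "STOP"): 0.9,
    }
    print(sentence_probability(["the", "dog", "barks", "STOP"], q))  # 0.5*0.4*0.3*0.9 = 0.054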
The parameters satisfy the constraints that for any trigram u, v, w,

    q(w | u, v) ≥ 0

and for any bigram u, v,

    ∑_{w ∈ V ∪ {STOP}} q(w | u, v) = 1

Thus q(w | u, v) defines a distribution over possible words w, conditioned on the bigram context u, v.

The key problem we are left with is to estimate the parameters of the model, namely q(w | u, v), where w can be any member of V ∪ {STOP}, and u, v ∈ V ∪ {*}. There are around |V|^3 parameters in the model. This is likely to be a very large number. For example with |V| = 10,000 (this is a realistic number, most likely quite small by modern standards), we have |V|^3 = 10^12.

1.3.2 Maximum-Likelihood Estimates

We first start with the most generic solution to the estimation problem, the maximum-likelihood estimates. We will see that these estimates are flawed in a critical way, but we will then show how related estimates can be derived that work very well in practice.

First, some notation. Define c(u, v, w) to be the number of times that the trigram (u, v, w) is seen in the training corpus: for example, c(the, dog, barks) is the number of times that the sequence of three words "the dog barks" is seen in the training corpus. Similarly, define c(u, v) to be the number of times that the bigram (u, v) is seen in the corpus. For any w, u, v, we then define

    q(w | u, v) = c(u, v, w) / c(u, v)

As an example, our estimate for q(barks | the, dog) would be

    q(barks | the, dog) = c(the, dog, barks) / c(the, dog)

This estimate is very natural: the numerator is the number of times the entire trigram "the dog barks" is seen, and the denominator is the number of times the bigram "the dog" is seen. We simply take the ratio of these two terms.
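A small sketch of this estimation procedure, with invented names and a toy corpus, might look as follows; it simply collects trigram and bigram-context counts and takes their ratio.

    from collections import defaultdict

    # Maximum-likelihood estimates q(w | u, v) = c(u, v, w) / c(u, v).
    def mle_trigram_estimates(corpus):
        """corpus: iterable of sentences (lists of words, each ending in 'STOP')."""
        trigram_counts = defaultdict(int)   # c(u, v, w)
        bigram_counts = defaultdict(int)    # c(u, v), counted as trigram contexts
        for sentence in corpus:
            u, v = "*", "*"
            for w in sentence:
                trigram_counts[(u, v, w)] += 1
                bigram_counts[(u, v)] += 1
                u, v = v, w
        return {
            (u, v, w): count / bigram_counts[(u, v)]
            for (u, v, w), count in trigram_counts.items()
        }

    corpus = [["the", "dog", "barks", "STOP"], ["the", "cat", "laughs", "STOP"]]
    q = mle_trigram_estimates(corpus)
    print(q[("*", "*", "the")])        # 1.0
    print(q[("*", "the", "dog")])      # 0.5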
Unfortunately, this way of estimating parameters runs into a very serious issue. Recall that we have a very large number of parameters in our model (e.g., with a vocabulary size of 10,000, we have around 10^12 parameters). Because of this, many of our counts will be zero. This leads to two problems:

- Many of the above estimates will be q(w | u, v) = 0, due to the count in the numerator being 0. This will lead to many trigram probabilities being systematically underestimated: it seems unreasonable to assign probability 0 to any trigram not seen in training data, given that the number of parameters of the model is typically very large in comparison to the number of words in the training corpus.

- In cases where the denominator c(u, v) is equal to zero, the estimate is not well defined.

We will shortly see how to come up with modified estimates that fix these problems. First, however, we discuss how language models are evaluated, and then discuss strengths and weaknesses of trigram language models.

1.3.3 Evaluating Language Models: Perplexity

So how do we measure the quality of a language model? A very common method is to evaluate the perplexity of the model on some held-out data.

The method is as follows. Assume that we have some test data sentences x^(1), x^(2), ..., x^(m). Each test sentence x^(i) for i ∈ {1 ... m} is a sequence of words x_1^(i), ..., x_{n_i}^(i), where n_i is the length of the i'th sentence. As before we assume that every sentence ends in the STOP symbol.

It is critical that the test sentences are "held out", in the sense that they are not part of the corpus used to estimate the language model. In this sense, they are examples of new, unseen sentences.
For any test sentence x^(i), we can measure its probability p(x^(i)) under the language model. A natural measure of the quality of the language model would be the probability it assigns to the entire set of test sentences, that is

    ∏_{i=1}^{m} p(x^(i))

The intuition is as follows: the higher this quantity is, the better the language model is at modeling unseen sentences.

The perplexity on the test corpus is derived as a direct transformation of this quantity. Define M to be the total number of words in the test corpus. More precisely, under the definition that n_i is the length of the i'th test sentence,

    M = ∑_{i=1}^{m} n_i

Then the average log probability under the model is defined as

    (1/M) log_2 ∏_{i=1}^{m} p(x^(i)) = (1/M) ∑_{i=1}^{m} log_2 p(x^(i))

This is just the log probability of the entire test corpus, divided by the total number of words in the test corpus. Here we use log_2(z) for any z > 0 to refer to the log with respect to base 2 of z. Again, the higher this quantity is, the better the language model.

The perplexity is then defined as

    2^{-l}

where

    l = (1/M) ∑_{i=1}^{m} log_2 p(x^(i))

Thus we take the negative of the average log probability, and raise two to that power. (Again, we're assuming in this section that log_2 is log base two.) The perplexity is a positive number. The smaller the value of perplexity, the better the language model is at modeling unseen data.
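As a sketch (names invented for illustration), perplexity can be computed directly from this definition, given any function that returns p(x) for a sentence x:

    import math

    # Perplexity = 2^{-l}, where l is the average log_2 probability per word.
    def perplexity(test_corpus, sentence_probability):
        """test_corpus: list of sentences (each a list of words ending in 'STOP')."""
        M = sum(len(sentence) for sentence in test_corpus)   # total number of words
        log_prob = 0.0
        for sentence in test_corpus:
            p = sentence_probability(sentence)
            if p == 0.0:
                return float("inf")   # a zero-probability sentence gives infinite perplexity
            log_prob += math.log2(p)
        l = log_prob / M
        return 2.0 ** (-l)

    # A uniform model over N = 1000 "words" has perplexity N:
    N = 1000
    uniform = lambda sentence: (1.0 / N) ** len(sentence)
    print(perplexity([["a", "b", "STOP"]], uniform))   # approximately 1000 (the vocabulary size)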
Some intuition behind perplexity is as follows. Say we have a vocabulary V, where |V ∪ {STOP}| = N, and the model predicts

    q(w | u, v) = 1/N

for all u, v, w. Thus this is the "dumb" model that simply predicts the uniform distribution over the vocabulary together with the STOP symbol. In this case, it can be shown that the perplexity is equal to N. So under a uniform probability model, the perplexity is equal to the vocabulary size. Perplexity can be thought of as the effective vocabulary size under the model: if, for example, the perplexity of the model is 120 (even though the vocabulary size is say 10,000), then this is roughly equivalent to having an effective vocabulary size of 120.

To give some more motivation, it is relatively easy to show that the perplexity is equal to

    1/t    where    t = ( ∏_{i=1}^{m} p(x^(i)) )^{1/M}

Here z^{1/M} refers to the M'th root of z: so t is the value such that t^M = ∏_{i=1}^{m} p(x^(i)). Given that

    ∏_{i=1}^{m} p(x^(i)) = ∏_{i=1}^{m} ∏_{j=1}^{n_i} q(x_j^(i) | x_{j-2}^(i), x_{j-1}^(i))

and M = ∑_i n_i, the value for t is the geometric mean of the terms q(x_j^(i) | x_{j-2}^(i), x_{j-1}^(i)) appearing in ∏_{i=1}^{m} p(x^(i)). For example if the perplexity is equal to 100, then t = 0.01, indicating that the geometric mean is 0.01.

One additional useful fact about perplexity is the following. If for any trigram u, v, w seen in test data, we have the estimate q(w | u, v) = 0, then the perplexity will be ∞. To see this, note that in this case the probability of the test corpus under the model will be 0, and the average log probability will be -∞. Thus if we take perplexity seriously as our measure of a language model, then we should avoid giving 0 estimates at all costs.

Finally, some intuition about typical values for perplexity. Goodman ("A bit of progress in language modeling", figure 2) evaluates unigram, bigram and trigram language models on English data, with a vocabulary size of 50,000. In a bigram model we have parameters of the form q(w | v), and

    p(x_1 ... x_n) = ∏_{i=1}^{n} q(x_i | x_{i-1})
Thus each word depends only on the previous word in the sentence. In a unigram model we have parameters q(w), and

    p(x_1 ... x_n) = ∏_{i=1}^{n} q(x_i)

Thus each word is chosen completely independently of other words in the sentence.

Goodman reports perplexity figures of around 74 for a trigram model, 137 for a bigram model, and 955 for a unigram model. The perplexity for a model that simply assigns probability 1/50,000 to each word in the vocabulary would be 50,000. So the trigram model clearly gives a big improvement over bigram and unigram models, and a huge improvement over assigning a probability of 1/50,000 to each word in the vocabulary.

1.3.4 Strengths and Weaknesses of Trigram Language Models

The trigram assumption is arguably quite strong, and linguistically naive (see the lecture slides for discussion). However, it leads to models that are very useful in practice.

1.4 Smoothed Estimation of Trigram Models

As discussed previously, a trigram language model has a very large number of parameters. The maximum-likelihood parameter estimates, which take the form

    q(w | u, v) = c(u, v, w) / c(u, v)

will run into serious issues with sparse data. Even with a large set of training sentences, many of the counts c(u, v, w) or c(u, v) will be low, or will be equal to zero.

In this section we describe smoothed estimation methods, which alleviate many of the problems found with sparse data. The key idea will be to rely on lower-order statistical estimates, in particular estimates based on bigram or unigram counts, to smooth the estimates based on trigrams. We discuss two smoothing methods that are very commonly used in practice: first, linear interpolation; second, discounting methods.
1.4.1 Linear Interpolation

A linearly interpolated trigram model is derived as follows. We define the trigram, bigram, and unigram maximum-likelihood estimates as

    q_ML(w | u, v) = c(u, v, w) / c(u, v)

    q_ML(w | v) = c(v, w) / c(v)

    q_ML(w) = c(w) / c()

where we have extended our notation: c(w) is the number of times word w is seen in the training corpus, and c() is the total number of words seen in the training corpus.

The trigram, bigram, and unigram estimates have different strengths and weaknesses. The unigram estimate will never have the problem of its numerator or denominator being equal to 0: thus the estimate will always be well-defined, and will always be greater than 0 (providing that each word is seen at least once in the training corpus, which is a reasonable assumption). However, the unigram estimate completely ignores the context (previous two words), and hence discards very valuable information. In contrast, the trigram estimate does make use of context, but has the problem of many of its counts being 0. The bigram estimate falls between these two extremes.

The idea in linear interpolation is to use all three estimates, by defining the trigram estimate as follows:

    q(w | u, v) = λ_1 × q_ML(w | u, v) + λ_2 × q_ML(w | v) + λ_3 × q_ML(w)

Here λ_1, λ_2 and λ_3 are three additional parameters of the model, which satisfy

    λ_1 ≥ 0,  λ_2 ≥ 0,  λ_3 ≥ 0

and

    λ_1 + λ_2 + λ_3 = 1

Thus we take a weighted average of the three estimates.
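A minimal sketch of the interpolated estimate, assuming the trigram, bigram, and unigram counts have already been collected (all names below are invented for illustration):

    # c3[(u, v, w)] = c(u, v, w), c2[(u, v)] = c(u, v), c1[w] = c(w), total = c().
    def q_interpolated(w, u, v, c3, c2, c1, total, lambdas=(0.4, 0.4, 0.2)):
        """Return lambda1*q_ML(w|u,v) + lambda2*q_ML(w|v) + lambda3*q_ML(w)."""
        l1, l2, l3 = lambdas                      # each >= 0, summing to 1
        # If a context count is zero, the corresponding ML estimate is undefined;
        # here we simply drop that term (the text instead forces its lambda to zero).
        q_tri = c3.get((u, v, w), 0) / c2[(u, v)] if c2.get((u, v), 0) > 0 else 0.0
        q_bi = c2.get((v, w), 0) / c1[v] if c1.get(v, 0) > 0 else 0.0
        q_uni = c1.get(w, 0) / total
        return l1 * q_tri + l2 * q_bi + l3 * q_uni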
There are various ways of estimating the λ values. A common one is as follows. Say we have some additional held-out data, which is separate from both our training and test corpora. We will call this data the development data. Define c'(u, v, w) to be the number of times that the trigram (u, v, w) is seen in the development data. It is easy to show that the log-likelihood of the development data, as a function of the parameters λ_1, λ_2, λ_3, is

    L(λ_1, λ_2, λ_3) = ∑_{u,v,w} c'(u, v, w) log q(w | u, v)
                     = ∑_{u,v,w} c'(u, v, w) log ( λ_1 × q_ML(w | u, v) + λ_2 × q_ML(w | v) + λ_3 × q_ML(w) )

We would like to choose our λ values to make L(λ_1, λ_2, λ_3) as high as possible. Thus the λ values are taken to be

    arg max_{λ_1, λ_2, λ_3} L(λ_1, λ_2, λ_3)

such that

    λ_1 ≥ 0,  λ_2 ≥ 0,  λ_3 ≥ 0

and

    λ_1 + λ_2 + λ_3 = 1

Finding the optimal values for λ_1, λ_2, λ_3 is fairly straightforward (see section ?? for one algorithm that is often used for this purpose).

As described, our method has three smoothing parameters, λ_1, λ_2, and λ_3. The three parameters can be interpreted as an indication of the confidence, or weight, placed on each of the trigram, bigram, and unigram estimates. For example, if λ_1 is close to 1, this implies that we put a significant weight on the trigram estimate q_ML(w | u, v); conversely, if λ_1 is close to zero we have placed a low weighting on the trigram estimate.

In practice, it is important to add an additional degree of freedom, by allowing the values for λ_1, λ_2 and λ_3 to vary depending on the bigram (u, v) that is being conditioned on. In particular, the method can be extended to allow λ_1 to be larger when c(u, v) is larger, the intuition being that a larger value of c(u, v) should translate to us having more confidence in the trigram estimate. At the very least, the method is used to ensure that λ_1 = 0 when c(u, v) = 0, because in this case the trigram estimate

    q_ML(w | u, v) = c(u, v, w) / c(u, v)

is undefined. Similarly, if both c(u, v) and c(v) are equal to zero, we need λ_1 = λ_2 = 0, as both the trigram and bigram ML estimates are undefined.
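One simple, if crude, way to carry out this maximization is a coarse grid search over the simplex, evaluating the development log-likelihood at each point. The sketch below (names invented, reusing the q_interpolated sketch above) is only an illustration, not the particular algorithm referred to above.

    import math
    from itertools import product

    # dev_counts maps (u, v, w) -> c'(u, v, w), the development-data trigram counts.
    def dev_log_likelihood(dev_counts, lambdas, c3, c2, c1, total):
        ll = 0.0
        for (u, v, w), count in dev_counts.items():
            p = q_interpolated(w, u, v, c3, c2, c1, total, lambdas)
            if p == 0.0:
                return float("-inf")
            ll += count * math.log(p)
        return ll

    def choose_lambdas(dev_counts, c3, c2, c1, total, step=0.1):
        """Coarse grid search over lambda values on the simplex."""
        best, best_ll = None, float("-inf")
        grid = [i * step for i in range(int(1 / step) + 1)]
        for l1, l2 in product(grid, repeat=2):
            l3 = 1.0 - l1 - l2
            if l3 < -1e-9:
                continue
            lambdas = (l1, l2, max(l3, 0.0))
            ll = dev_log_likelihood(dev_counts, lambdas, c3, c2, c1, total)
            if ll > best_ll:
                best, best_ll = lambdas, ll
        return best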
One extension to the method, often referred to as bucketing, is described in section 1.5.1. Another, much simpler method, is to define

    λ_1 = c(u, v) / (c(u, v) + γ)

    λ_2 = (1 - λ_1) × c(v) / (c(v) + γ)

    λ_3 = 1 - λ_1 - λ_2

where γ > 0 is the only parameter of the method. It can be verified that λ_1 ≥ 0, λ_2 ≥ 0, and λ_3 ≥ 0, and also that λ_1 + λ_2 + λ_3 = 1. Under this definition, it can be seen that λ_1 increases as c(u, v) increases, and similarly that λ_2 increases as c(v) increases. In addition we have λ_1 = 0 if c(u, v) = 0, and λ_2 = 0 if c(v) = 0. The value for γ can again be chosen by maximizing log-likelihood of a set of development data.

This method is relatively crude, and is not likely to be optimal. It is, however, very simple, and in practice it can work well in some applications.

1.4.2 Discounting Methods

We now describe an alternative estimation method, which is again commonly used in practice. Consider first a method for estimating a bigram language model, that is, our goal is to define q(w | v) for any w ∈ V ∪ {STOP}, v ∈ V ∪ {*}.

The first step will be to define discounted counts. For any bigram (v, w) such that c(v, w) > 0, we define the discounted count as

    c*(v, w) = c(v, w) - β

where β is a value between 0 and 1 (a typical value might be β = 0.5). Thus we simply subtract a constant value, β, from the count. This reflects the intuition that if we take counts from the training corpus, we will systematically over-estimate the probability of bigrams seen in the corpus (and under-estimate bigrams not seen in the corpus).

For any bigram (v, w) such that c(v, w) > 0, we can then define

    q(w | v) = c*(v, w) / c(v)

Thus we use the discounted count on the numerator, and the regular count on the denominator of this expression.
    x                c(x)    c*(x)    c*(x)/c(the)
    the              48
    the, dog         15      14.5     14.5/48
    the, woman       11      10.5     10.5/48
    the, man         10       9.5      9.5/48
    the, park         5       4.5      4.5/48
    the, job          2       1.5      1.5/48
    the, telescope    1       0.5      0.5/48
    the, manual       1       0.5      0.5/48
    the, afternoon    1       0.5      0.5/48
    the, country      1       0.5      0.5/48
    the, street       1       0.5      0.5/48

Figure 1.1: An example illustrating discounting methods. We show a made-up example, where the unigram "the" is seen 48 times, and we list all bigrams (u, v) such that u = the, and c(u, v) > 0 (the counts for these bigrams sum to 48). In addition we show the discounted count c*(x) = c(x) - β, where β = 0.5, and finally we show the estimate c*(x)/c(the) based on the discounted count.
For any context v, this definition leads to some missing probability mass, defined as

    α(v) = 1 - ∑_{w : c(v,w) > 0} c*(v, w) / c(v)

As an example, consider the counts shown in the example in figure 1.1. In this case we show all bigrams (u, v) where u = the, and c(u, v) > 0. We use a discounting value of β = 0.5. In this case we have

    ∑_{w : c(v,w) > 0} c*(v, w) / c(v) = 43/48

and the missing mass is

    α(the) = 1 - 43/48 = 5/48

The intuition behind discounted methods is to divide this missing mass between the words w such that c(v, w) = 0.

More specifically, the complete definition of the estimate is as follows. For any v, define the sets

    A(v) = {w : c(v, w) > 0}    and    B(v) = {w : c(v, w) = 0}

For the data in figure 1.1, for example, we would have

    A(the) = {dog, woman, man, park, job, telescope, manual, afternoon, country, street}

and B(the) would be the set of remaining words in the vocabulary. Then the estimate is defined as

    q_D(w | v) = c*(v, w) / c(v)                                   if w ∈ A(v)

    q_D(w | v) = α(v) × q_ML(w) / ∑_{w' ∈ B(v)} q_ML(w')           if w ∈ B(v)

Thus if c(v, w) > 0 we return the estimate c*(v, w)/c(v); otherwise we divide the remaining probability mass α(v) in proportion to the unigram estimates q_ML(w).
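A sketch of the discounted bigram estimate (names invented; it assumes bigram counts c2, unigram counts c1, and the total word count have already been collected):

    def q_discounted_bigram(w, v, c2, c1, total, beta=0.5):
        """Discounted estimate q_D(w | v) with missing mass divided by q_ML(w)."""
        # c(v): number of times v is seen as a bigram context
        c_v = sum(count for (v2, _), count in c2.items() if v2 == v)
        if c_v == 0:
            # No context information at all: fall back to the unigram estimate
            # (a simple choice for the sketch; this case is not covered by the text).
            return c1.get(w, 0) / total
        if c2.get((v, w), 0) > 0:
            return (c2[(v, w)] - beta) / c_v               # c*(v, w) / c(v)
        # Otherwise divide the missing mass alpha(v) in proportion to q_ML(w)
        seen = {w2 for (v2, w2) in c2 if v2 == v}          # the set A(v)
        alpha = 1.0 - sum((c2[(v, w2)] - beta) / c_v for w2 in seen)
        # alpha(v) * q_ML(w) / sum of q_ML over B(v); the 1/total factors cancel
        unseen_mass = sum(count for w2, count in c1.items() if w2 not in seen)
        return alpha * c1.get(w, 0) / unseen_mass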
The method can be generalized to trigram language models in a natural, recursive way: for any bigram (u, v) define

    A(u, v) = {w : c(u, v, w) > 0}    and    B(u, v) = {w : c(u, v, w) = 0}

Define c*(u, v, w) to be the discounted count for the trigram (u, v, w): that is,

    c*(u, v, w) = c(u, v, w) - β

where β is again the discounting value. Then the trigram model is

    q_D(w | u, v) = c*(u, v, w) / c(u, v)                                     if w ∈ A(u, v)

    q_D(w | u, v) = α(u, v) × q_D(w | v) / ∑_{w' ∈ B(u,v)} q_D(w' | v)        if w ∈ B(u, v)

where

    α(u, v) = 1 - ∑_{w ∈ A(u,v)} c*(u, v, w) / c(u, v)

is again the missing probability mass. Note that we have divided the missing probability mass in proportion to the bigram estimates q_D(w | v), which were defined previously.

The only parameter of this approach is the discounting value, β. As in linearly interpolated models, the usual way to choose the value for this parameter is by optimization of likelihood of a development corpus, which is again a separate set of data from the training and test corpora. Define c'(u, v, w) to be the number of times that the trigram u, v, w is seen in this development corpus. The log-likelihood of the development data is

    ∑_{u,v,w} c'(u, v, w) log q_D(w | u, v)

where q_D(w | u, v) is defined as above. The parameter estimates q_D(w | u, v) will vary as the value for β varies. Typically we will test a set of possible values for β: for example, we might test all values in the set {0.1, 0.2, 0.3, ..., 0.9}, where for each value of β we calculate the log-likelihood of the development data. We then choose the value for β that maximizes this log-likelihood.
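A sketch of this selection procedure (it assumes a function q_discounted_trigram implementing q_D(w | u, v), which is not shown, and development counts c'(u, v, w) in dev_counts; all names are invented):

    import math

    def choose_beta(dev_counts, counts, betas=(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)):
        """Pick the discounting value beta that maximizes development log-likelihood."""
        best_beta, best_ll = None, float("-inf")
        for beta in betas:
            ll = 0.0
            for (u, v, w), count in dev_counts.items():
                p = q_discounted_trigram(w, u, v, counts, beta)   # assumed, not shown
                if p <= 0.0:
                    ll = float("-inf")
                    break
                ll += count * math.log(p)
            if ll > best_ll:
                best_beta, best_ll = beta, ll
        return best_beta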
1.5 Advanced Topics

1.5.1 Linear Interpolation with Bucketing

In linearly interpolated models the parameter estimates are defined as

    q(w | u, v) = λ_1 × q_ML(w | u, v) + λ_2 × q_ML(w | v) + λ_3 × q_ML(w)

where λ_1, λ_2 and λ_3 are smoothing parameters in the approach.

In practice, it is important to allow the smoothing parameters to vary depending on the bigram (u, v) that is being conditioned on: in particular, the higher the count c(u, v), the higher the weight should be for λ_1 (and similarly, the higher the count c(v), the higher the weight should be for λ_2). The classical method for achieving this is through an extension that is often referred to as bucketing.

The first step in this method is to define a function Π that maps bigrams (u, v) to values Π(u, v) ∈ {1, 2, ..., K}, where K is some integer specifying the number of buckets. Thus the function Π defines a partition of bigrams into K different subsets. The function is defined by hand, and typically depends on the counts seen in the training data. One such definition, with K = 3, would be

    Π(u, v) = 1    if c(u, v) > 0
    Π(u, v) = 2    if c(u, v) = 0 and c(v) > 0
    Π(u, v) = 3    otherwise

This is a very simple definition that simply tests whether the counts c(u, v) and c(v) are equal to 0. Another slightly more complicated definition, which is more sensitive to the frequency of the bigram (u, v), would be

    Π(u, v) = 1    if 100 ≤ c(u, v)
    Π(u, v) = 2    if 50 ≤ c(u, v) < 100
    Π(u, v) = 3    if 20 ≤ c(u, v) < 50
    Π(u, v) = 4    if 10 ≤ c(u, v) < 20
    Π(u, v) = 5    if 5 ≤ c(u, v) < 10
    Π(u, v) = 6    if 2 ≤ c(u, v) < 5
    Π(u, v) = 7    if c(u, v) = 1
    Π(u, v) = 8    if c(u, v) = 0 and c(v) > 0
    Π(u, v) = 9    otherwise

Given a definition of the function Π(u, v), we then introduce smoothing parameters λ_1^(k), λ_2^(k), λ_3^(k) for all k ∈ {1 ... K}. Thus each bucket has its own set of smoothing parameters. We have the constraints that for all k ∈ {1 ... K}

    λ_1^(k) ≥ 0,  λ_2^(k) ≥ 0,  λ_3^(k) ≥ 0

and

    λ_1^(k) + λ_2^(k) + λ_3^(k) = 1
The linearly interpolated estimate will be

    q(w | u, v) = λ_1^(k) × q_ML(w | u, v) + λ_2^(k) × q_ML(w | v) + λ_3^(k) × q_ML(w)

where

    k = Π(u, v)

Thus we have crucially introduced a dependence of the smoothing parameters on the value for Π(u, v). Each bucket of bigrams gets its own set of smoothing parameters; the values for the smoothing parameters can vary depending on the value of Π(u, v) (which is usually directly related to the counts c(u, v) and c(v)).

The smoothing parameters are again estimated using a development data set. If we again define c'(u, v, w) to be the number of times the trigram u, v, w appears in the development data, the log-likelihood of the development data is

    ∑_{u,v,w} c'(u, v, w) log q(w | u, v)
      = ∑_{u,v,w} c'(u, v, w) log ( λ_1^(Π(u,v)) × q_ML(w | u, v) + λ_2^(Π(u,v)) × q_ML(w | v) + λ_3^(Π(u,v)) × q_ML(w) )
      = ∑_{k=1}^{K} ∑_{u,v,w : Π(u,v)=k} c'(u, v, w) log ( λ_1^(k) × q_ML(w | u, v) + λ_2^(k) × q_ML(w | v) + λ_3^(k) × q_ML(w) )

The λ_1^(k), λ_2^(k), λ_3^(k) values are chosen to maximize this function.
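Tying these pieces together, a sketch of the bucketed estimate using the simple K = 3 definition above might look as follows (the names and the particular λ values are invented for illustration; it reuses the q_interpolated sketch and the count dictionaries c3, c2, c1, total from earlier). Note that the λ values chosen for buckets 2 and 3 set λ_1 = 0, since the trigram ML estimate is undefined when c(u, v) = 0, and bucket 3 additionally sets λ_2 = 0.

    def pi(u, v, c2, c1):
        """The simple K = 3 bucketing definition from the text."""
        if c2.get((u, v), 0) > 0:
            return 1
        if c1.get(v, 0) > 0:
            return 2
        return 3

    # One lambda triple per bucket; these particular values are illustrative only.
    bucket_lambdas = {1: (0.6, 0.3, 0.1), 2: (0.0, 0.6, 0.4), 3: (0.0, 0.0, 1.0)}

    def q_bucketed(w, u, v, c3, c2, c1, total):
        """Interpolated estimate whose lambdas depend on the bucket k = pi(u, v)."""
        k = pi(u, v, c2, c1)
        return q_interpolated(w, u, v, c3, c2, c1, total, lambdas=bucket_lambdas[k])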
