Advanced statistical inference

Suhasini Subba Rao
suhasini.subbarao@stat.tamu.edu

August 1, 2012


Chapter 1
Basic Inference

1.1 A review of results in statistical inference

In this section we review some results that you came across in STAT611 or equivalent. We will review the Cramer-Rao bound and some properties of the likelihood. In later sections we will use the likelihood as a means of parameter estimation (i.e. the maximum likelihood estimator, which you will have encountered in previous courses) and heuristically argue why the Fisher information, which gives the Cramer-Rao bound, is extremely important.

1.1.1 The likelihood function

Let $\{X_i\}$ be iid random variables with probability (mass) function or probability density function $f(x;\theta)$, where $f$ is known but the parameter $\theta$ is unknown. The likelihood function is defined as
\[
L_T(X;\theta) = \prod_{i=1}^{T} f(X_i;\theta) \qquad (1.1)
\]
and the log-likelihood is
\[
\mathcal{L}_T(X;\theta) = \log L_T(X;\theta) = \sum_{i=1}^{T} \log f(X_i;\theta). \qquad (1.2)
\]

Example
(i) Suppose that $\{X_t\}$ are iid normal random variables with mean $\mu$ and variance $\sigma^2$. The log-likelihood is
\[
\mathcal{L}_T(X;\mu,\sigma^2) \propto -T\log\sigma - \frac{1}{2\sigma^2}\sum_{t=1}^{T}(X_t-\mu)^2.
\]

(ii) Suppose that $\{X_t\}$ are iid binomial random variables, $X_t \sim \mathrm{Bin}(n,\pi)$. Then the log-likelihood is
\[
\mathcal{L}_T(X;\pi) = \sum_{t=1}^{T}\Big[ \log\binom{n}{X_t} + X_t\log\frac{\pi}{1-\pi} + n\log(1-\pi) \Big].
\]

(iii) Suppose that $\{X_t\}$ are independent binomial random variables such that $X_t \sim \mathrm{Bin}(n_t,\pi_t)$, where the regressors $x_t$ influence the mean of $X_t$ through $\pi_t = g(\beta' x_t)$. Then the log-likelihood is
\[
\mathcal{L}_T(X;\beta) = \sum_{t=1}^{T}\Big[ \log\binom{n_t}{X_t} + X_t\log\frac{g(\beta' x_t)}{1-g(\beta' x_t)} + n_t\log\big(1-g(\beta' x_t)\big) \Big].
\]

(iv) Suppose that $\{X_t\}$ are independent exponential random variables which have the density $\theta^{-1}\exp(-x/\theta)$. The log-likelihood is
\[
\mathcal{L}_T(X;\theta) = -T\log\theta - \frac{1}{\theta}\sum_{t=1}^{T} X_t.
\]

(v) A generalisation of the exponential distribution, which gives more freedom in the shape of the distribution, is the Weibull. Suppose that $\{X_t\}$ are independent Weibull random variables which have the density $\frac{\alpha x^{\alpha-1}}{\theta^{\alpha}}\exp\{-(x/\theta)^{\alpha}\}$, where $\theta,\alpha>0$ (in the case $\alpha=1$ we recover the regular exponential) and $x$ is defined over the positive real line. The log-likelihood is
\[
\mathcal{L}_T(X;\alpha,\theta) = \sum_{t=1}^{T}\Big[ \log\alpha + (\alpha-1)\log X_t - \alpha\log\theta - \Big(\frac{X_t}{\theta}\Big)^{\alpha} \Big].
\]
In the case that $\alpha$ is known but $\theta$ is unknown, the log-likelihood is proportional to
\[
\mathcal{L}_T(X;\theta) = \sum_{t=1}^{T}\Big[ -\alpha\log\theta - \Big(\frac{X_t}{\theta}\Big)^{\alpha} \Big].
\]

1.1.2 Bounds for the variance of an unbiased estimator

We require the following assumptions, often called the regularity assumptions. We state the assumptions and results for scalar $\theta$, but they can easily be extended to the case that $\theta$ is a vector.

Assumption 1.1.1 (Regularity Conditions 1) Let us suppose that $L_T$ is the likelihood with true parameter $\theta$, and that the following hold.

(i) $\int \frac{\partial \log L_T(x;\theta)}{\partial\theta} L_T(x;\theta)\,dx = 0$ (for iid data this is equivalent to $\int \frac{\partial \log f(x;\theta)}{\partial\theta} f(x;\theta)\,dx = 0$).

(ii) $\frac{\partial}{\partial\theta}\int L_T(x;\theta)\,dx = \int \frac{\partial L_T(x;\theta)}{\partial\theta}\,dx = 0$.

(iii) $\int g(x)\frac{\partial L_T(x;\theta)}{\partial\theta}\,dx = \frac{\partial}{\partial\theta}\int g(x) L_T(x;\theta)\,dx$, where $g$ is any function which is not a function of $\theta$ (for example an estimator of $\theta$).

(iv) $E\big(\frac{\partial \log L_T(X;\theta)}{\partial\theta}\big)^2 > 0$.

Theorem (The Cramer-Rao bound) Let $\tilde{\theta}(X)$ be an unbiased estimator of $\theta$ and suppose the likelihood $L_T(X;\theta)$ satisfies the regularity conditions given in Assumption 1.1.1. Then we have
\[
\mathrm{var}\big(\tilde{\theta}(X)\big) \ge \Big[ E\Big(\frac{\partial \log L_T(X;\theta)}{\partial\theta}\Big)^2 \Big]^{-1}.
\]

PROOF. Recall that $\tilde{\theta}(X)$ is an unbiased estimator of $\theta$, therefore $\int \tilde{\theta}(x) L_T(x;\theta)\,dx = \theta$. Differentiating both sides with respect to $\theta$ gives
\[
\int \tilde{\theta}(x) \frac{\partial L_T(x;\theta)}{\partial\theta}\,dx = 1.
\]
Since $\int \theta \frac{\partial L_T(x;\theta)}{\partial\theta}\,dx = 0$, we have
\[
\int \big(\tilde{\theta}(x)-\theta\big) \frac{\partial L_T(x;\theta)}{\partial\theta}\,dx = 1.
\]
Multiplying and dividing by $L_T(x;\theta)$ gives
\[
\int \big(\tilde{\theta}(x)-\theta\big) \frac{1}{L_T(x;\theta)}\frac{\partial L_T(x;\theta)}{\partial\theta}\, L_T(x;\theta)\,dx = E\Big[ \big(\tilde{\theta}(X)-\theta\big)\frac{\partial \log L_T(X;\theta)}{\partial\theta} \Big] = 1,
\]
since $L_T(x;\theta)$ is the distribution of $X$. Recalling that the Cauchy-Schwarz inequality is $E(UV) \le (EU^2)^{1/2}(EV^2)^{1/2}$, where equality only arises if $U = aV + b$ ($a$ and $b$ constants), and applying it to the above, we have
\[
\mathrm{var}\big(\tilde{\theta}(X)\big)\; E\Big(\frac{\partial \log L_T(X;\theta)}{\partial\theta}\Big)^2 \ge 1,
\]
thus giving us the Cramer-Rao inequality. Finally we need to prove that
\[
E\Big(\frac{\partial \log L_T(X;\theta)}{\partial\theta}\Big)^2 = -E\Big(\frac{\partial^2 \log L_T(X;\theta)}{\partial\theta^2}\Big).
\]
To prove this result we use the fact that $L_T$ is a density, so $\int L_T(x;\theta)\,dx = 1$.

Differentiating the above with respect to $\theta$ gives $\int \frac{\partial L_T(x;\theta)}{\partial\theta}\,dx = 0$. By using Assumption 1.1.1(ii) we have
\[
\int \frac{\partial L_T(x;\theta)}{\partial\theta}\,dx = 0 \;\Rightarrow\; \int \frac{\partial \log L_T(x;\theta)}{\partial\theta} L_T(x;\theta)\,dx = 0.
\]
Differentiating again with respect to $\theta$ and taking the derivative inside gives
\[
\int \frac{\partial^2 \log L_T(x;\theta)}{\partial\theta^2} L_T(x;\theta)\,dx + \int \frac{\partial \log L_T(x;\theta)}{\partial\theta}\frac{\partial L_T(x;\theta)}{\partial\theta}\,dx = 0,
\]
\[
\int \frac{\partial^2 \log L_T(x;\theta)}{\partial\theta^2} L_T(x;\theta)\,dx + \int \frac{\partial \log L_T(x;\theta)}{\partial\theta}\frac{1}{L_T(x;\theta)}\frac{\partial L_T(x;\theta)}{\partial\theta}\, L_T(x;\theta)\,dx = 0,
\]
\[
\int \frac{\partial^2 \log L_T(x;\theta)}{\partial\theta^2} L_T(x;\theta)\,dx + \int \Big(\frac{\partial \log L_T(x;\theta)}{\partial\theta}\Big)^2 L_T(x;\theta)\,dx = 0.
\]
Thus
\[
E\Big(\frac{\partial^2 \log L_T(X;\theta)}{\partial\theta^2}\Big) = -E\Big(\frac{\partial \log L_T(X;\theta)}{\partial\theta}\Big)^2,
\]
which gives us the required result.

Corollary (Estimators which attain the C-R bound) Suppose Assumption 1.1.1 is satisfied. Then the estimator $\tilde{\theta}(X)$ attains the C-R bound only if it can be written as
\[
\hat{\theta}(X) = a(\theta) + b(\theta)\frac{\partial \log L_T(X;\theta)}{\partial\theta}
\]
for some functions $a$ and $b$.

PROOF. The proof is clear and follows from when the Cauchy-Schwarz inequality is an actual equality in the derivation of the C-R bound.

We mention that there exist some well-known distributions which do not satisfy Assumption 1.1.1. These are non-regular distributions. A classical example of a distribution which violates this assumption is the uniform distribution, $f(x;\theta) = 1/\theta$ for $x\in[0,\theta]$ and zero elsewhere. Other examples include distributions where the support of the distribution is a function of the parameter. The Cramer-Rao lower bound need not hold, or even exist, for such distributions, and below we show why.

Example (The classical example of the uniform) Let us consider iid uniform random variables $\{X_t\}$ with density $f(x;\theta) = 1/\theta$ for $x\in[0,\theta]$. The likelihood (it is easier to study the likelihood rather than the log-likelihood) is
\[
L_T(X_T;\theta) = \frac{1}{\theta^T}\prod_{t=1}^{T} I_{[0,\theta]}(X_t).
\]
Since the support of the density involves the unknown parameter, the derivative of $\log L_T(X_T;\theta)$ is not well defined (what is the derivative of $\log I_{[0,\theta]}(X_t) = \log I_{[X_t,\infty)}(\theta)$ with respect to $\theta$? Observe that $\log 0$ is not well defined and the derivative at $X_t$ is not well defined), and Assumption 1.1.1(ii) is not satisfied. This is a classical example of a density which does not satisfy the regularity conditions, which means that the inverse of the Fisher information does not give a lower bound for the variance of an estimator.

In fact, using $L_T(X_T;\theta)$, the maximum likelihood estimator of $\theta$ is $\hat{\theta}_T = \max_{1\le t\le T} X_t$ (you can see this by making a plot of $L_T(X_T;\theta)$ against $\theta$). It is well known that the distribution of $\max_{1\le t\le T} X_t$ is
\[
P\big(\max_{1\le t\le T} X_t \le x\big) = P(X_1\le x,\ldots,X_T\le x) = \prod_{t=1}^{T} P(X_t\le x) = \Big(\frac{x}{\theta}\Big)^T,
\]
and the density of $\max_{1\le t\le T} X_t$ is $f_{\hat{\theta}_T}(x) = T x^{T-1}/\theta^T$.

Exercise: Find the variance of $\hat{\theta}_T$ defined above (a small simulation sketch is given below).

Often we want to estimate a function of $\theta$, say $\tau(\theta)$. The following corollary is a small generalisation of the Cramer-Rao bound.

Corollary Suppose the regularity conditions in Assumption 1.1.1 are satisfied and $T(X)$ is an unbiased estimator of $\tau(\theta)$. Then we have
\[
\mathrm{var}\big(T(X)\big) \ge \frac{\tau'(\theta)^2}{E\big(\frac{\partial \log L_T(X;\theta)}{\partial\theta}\big)^2}.
\]

We now define the notion of sufficiency, which gives us the ingredients for constructing a good estimator (see also Davison (2002)).

Definition (Sufficiency and the factorisation theorem) Suppose that $X = (X_1,\ldots,X_T)$ is a random vector. The statistic $s(X)$ is called a sufficient statistic for the parameter $\theta$ if the conditional distribution of $X$ given $s(X)$ is not a function of $\theta$.

Normally it is extremely hard to obtain the sufficient statistic from its definition. However, the factorisation theorem gives us a way of obtaining the sufficient statistic.

The Factorisation Theorem Suppose that the likelihood function can be partitioned as $L_T(X;\theta) = h(X)\,g(s(X);\theta)$, where $h(X)$ is not a function of $\theta$; then $s(X)$ is a sufficient statistic for $\theta$.
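To get a feel for the sampling distribution of $\hat{\theta}_T = \max_{1\le t\le T} X_t$, here is a minimal simulation sketch (the sample size, true $\theta$ and number of replications are chosen purely for illustration); the empirical variance can be compared with the answer to the exercise above.

```r
# Sampling distribution of theta.hat = max(X_1, ..., X_T) for iid U[0, theta] data
set.seed(8)
theta <- 2
T.len <- 50
theta.hat <- replicate(10000, max(runif(T.len, min = 0, max = theta)))
c(mean = mean(theta.hat), variance = var(theta.hat))

# Overlay the exact density of the maximum, f(x) = T x^(T-1) / theta^T
hist(theta.hat, freq = FALSE, breaks = 50, main = "Distribution of the maximum")
curve(T.len * x^(T.len - 1) / theta^T.len, from = 0, to = theta, add = TRUE)
```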

We see that a sufficient statistic contains all the information in the sample about the parameter $\theta$.

Theorem (Rao-Blackwell) Suppose $s(X)$ is a sufficient statistic and $\tilde{\theta}(X)$ is an unbiased estimator of $\theta$. Then, if we define the new unbiased estimator $E[\tilde{\theta}(X)\,|\,s(X)]$, we have
\[
\mathrm{var}\big(E[\tilde{\theta}(X)\,|\,s(X)]\big) \le \mathrm{var}\big(\tilde{\theta}(X)\big).
\]

The Rao-Blackwell theorem tells us that estimators with the smallest variance must be functions of the sufficient statistic. Of course, this begs the question: is there a unique estimator with the minimum variance? For this we require completeness of the sufficient statistic; uniqueness immediately follows from completeness.

Definition (Completeness) Let $s(X)$ be a sufficient statistic for $\theta$. $s(X)$ is a complete sufficient statistic if and only if, for every function $Z$, $E[Z(s(X))] = 0$ for all $\theta$ implies $Z(t) = 0$ for all $t$.

Theorem (Lehmann-Scheffe) Suppose that $s(X)$ is a complete sufficient statistic and $\tilde{\theta}(s(X))$ is an unbiased estimator of $\theta$. Then $\tilde{\theta}(s(X))$ is the unique minimum variance unbiased estimator of $\theta$.

The theorems above are theoretical, in the sense that under certain conditions they give a lower bound for the variance of a plausible estimator, and practical, in the sense that they tell us that the best estimator should be a function of the sufficient statistic. The natural question to ask is how to construct such estimators. One of the most popular estimators in statistics is the maximum likelihood estimator (MLE). The MLE of $\theta$ is
\[
\hat{\theta}_T = \arg\max_{\theta\in\Theta} \mathcal{L}_T(\theta),
\]
where $\Theta$ is the parameter space, which contains all values of $\theta$ with $\int f(x;\theta)\,dx = 1$. There are two reasons that MLEs are so widely used: (i) it can be shown for a wide range of probability distributions (including, under certain conditions, the exponential family of distributions defined below) that the MLE is a function of the sufficient statistic, hence the MLE is often the minimum variance unbiased estimator; (ii) asymptotically, at least, the MLE under certain conditions attains the C-R bound. Of course, one can construct examples where the regularity conditions are not satisfied and the MLE is not the optimal estimator (examples include estimation of the range of the uniform distribution, where an estimator can be constructed which has a smaller variance than the MLE). But for the vast majority of distributions the MLE is optimal. It is also worth mentioning that there can exist biased estimators which have a smaller mean squared error than the MLE; this intriguing notion is called super-efficiency, which is beyond this course (see Stoica and Ottersten (1996) for a review).

1.1.3 Additional Notes

We will use various distributions in this course; it would be useful if you compiled a list of these distributions and became familiar with them.

Example (Useful transformations)
Question: The distribution function of the random variable $X_t$ is $F_t(x) = 1-\exp(-\lambda_t x)$.
(i) Give a transformation of $X_t$ such that the transformed variable is uniformly distributed on the interval $[0,1]$.
(ii) Suppose that I observe the independent, but not necessarily identically distributed, random variables $\{X_t\}$, and I want to check whether they have the distribution function $F_t(x) = 1-\exp(-\lambda_t x)$. Using (i), suggest a method for checking this.

Answer:
(i) It is well known that if the random variable $X_t$ has the distribution function $F_t(x)$, then the transformed random variable $Y_t = F_t(X_t)$ is uniformly distributed on the interval $[0,1]$. To see this, note that the distribution of $Y_t$ can be evaluated as
\[
P(Y_t \le y) = P(F_t(X_t) \le y) = P(X_t \le F_t^{-1}(y)) = F_t(F_t^{-1}(y)) = y, \qquad y\in[0,1].
\]
Thus, to answer the question, we let $Y_t = 1-\exp(-\lambda_t X_t)$, which has a uniform distribution.
(ii) If we want to check whether $X_t$ follows the distribution $F_t(x) = 1-\exp(-\lambda_t x)$, we can make the transformation $Y_t = 1-\exp(-\lambda_t X_t)$ and use, for example, the Kolmogorov-Smirnov test to check whether $\{Y_t\}$ follows a uniform distribution.

Example
Question: Suppose that $Z$ is a Weibull random variable with density $f(x;\varphi,\alpha) = \frac{\alpha}{\varphi}\big(\frac{x}{\varphi}\big)^{\alpha-1}\exp\{-(x/\varphi)^{\alpha}\}$. Show that $E(Z^r) = \varphi^r\,\Gamma\big(1+\frac{r}{\alpha}\big)$.
Hint: Use $\int_0^{\infty} x^{a}\exp(-x^{b})\,dx = \frac{1}{b}\Gamma\big(\frac{a+1}{b}\big)$ for $a,b>0$.
Solution: Left as an exercise. This result may be useful in some of the examples given in this course.
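The check described in part (ii) above is easy to carry out in R. The following is a minimal sketch with assumed rates $\lambda_t$ and a hypothetical sample size, purely for illustration.

```r
# Probability integral transform followed by a Kolmogorov-Smirnov test for uniformity
set.seed(7)
lambda <- runif(200, min = 0.5, max = 2)   # hypothetical rates lambda_t (assumed known)
x <- rexp(200, rate = lambda)              # X_t with F_t(x) = 1 - exp(-lambda_t x)
y <- 1 - exp(-lambda * x)                  # Y_t = F_t(X_t); should be U[0, 1] under the model
ks.test(y, "punif")                        # Kolmogorov-Smirnov test against U[0, 1]
```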


Chapter 2
The Bayesian Cramer-Rao bound

2.1 The Bayesian Cramer-Rao inequality

The classical Cramer-Rao inequality is useful for assessing the quality of a given estimator. But from the derivation we can clearly see that it only holds if the estimator is unbiased; no such inequality can be derived for estimators which are biased. This can be a problem, for example, in nonparametric regression, where estimators in general will be biased. How does one assess the estimator in such cases? To answer this question we consider the Bayesian Cramer-Rao inequality. This is similar to the Cramer-Rao inequality but does not require that the estimator is unbiased, so long as we place a prior on the parameter space. This inequality is known as the Bayesian Cramer-Rao or van Trees inequality.

Suppose $\{X_i\}_{i=1}^{T}$ are random variables with distribution $L_T(x|\theta)$. Let $\tilde{\theta}(X)$ be an estimator of $\theta$. We now Bayesianise the set-up by placing a prior distribution on the parameter space $\Theta$; the density of this prior we denote by $\lambda$. Let $E[g(X)\,|\,\theta] = \int g(x) L_T(x|\theta)\,dx$, and let $E_\lambda$ denote the expectation over the prior density $\lambda$. For example,
\[
E_\lambda E\big[\tilde{\theta}(X)\,\big|\,\theta\big] = \int_a^b \int_{\mathbb{R}^T} \tilde{\theta}(x) L_T(x|\theta)\,dx\; \lambda(\theta)\,d\theta.
\]

Assumption (support of the prior) $\theta$ is defined over the compact interval $[a,b]$, and $\lambda(x)\to 0$ as $x\to a$ and $x\to b$, so $\lambda(a) = \lambda(b) = 0$.

Theorem (The Bayesian Cramer-Rao inequality) Suppose Assumption 1.1.1 and the assumption above hold, and let $\tilde{\theta}(X)$ be an estimator of $\theta$. Then we have
\[
E_\lambda\Big[ E\big( (\tilde{\theta}(X)-\theta)^2 \,\big|\, \theta \big) \Big] \ge \big( E_\lambda I(\theta) + I(\lambda) \big)^{-1},
\]
where
\[
E_\lambda I(\theta) = \int_a^b \int_{\mathbb{R}^T} \Big(\frac{\partial \log L_T(x;\theta)}{\partial\theta}\Big)^2 L_T(x;\theta)\lambda(\theta)\,dx\,d\theta
\quad\text{and}\quad
I(\lambda) = \int_a^b \Big(\frac{\partial \log \lambda(\theta)}{\partial\theta}\Big)^2 \lambda(\theta)\,d\theta.
\]

PROOF. We first note that under the assumption on the prior we have
\[
\int_a^b \frac{\partial \big(L_T(x;\theta)\lambda(\theta)\big)}{\partial\theta}\,d\theta = \Big[ L_T(x;\theta)\lambda(\theta) \Big]_a^b = 0.
\]
Therefore, by using the above, we have
\[
\int_{\mathbb{R}^T} \tilde{\theta}(x) \int_a^b \frac{\partial \big(L_T(x;\theta)\lambda(\theta)\big)}{\partial\theta}\,d\theta\,dx = 0. \qquad (2.1)
\]
Now let us consider $\int_{\mathbb{R}^T}\int_a^b \theta\,\frac{\partial (L_T(x;\theta)\lambda(\theta))}{\partial\theta}\,d\theta\,dx$. By integration by parts we have
\[
\int_{\mathbb{R}^T}\int_a^b \theta\,\frac{\partial \big(L_T(x;\theta)\lambda(\theta)\big)}{\partial\theta}\,d\theta\,dx
= \int_{\mathbb{R}^T} \Big[\theta\, L_T(x;\theta)\lambda(\theta)\Big]_a^b\,dx - \int_{\mathbb{R}^T}\int_a^b L_T(x;\theta)\lambda(\theta)\,d\theta\,dx
= -\int_{\mathbb{R}^T}\int_a^b L_T(x;\theta)\lambda(\theta)\,d\theta\,dx. \qquad (2.2)
\]
Subtracting (2.2) from (2.1) we have
\[
\int_{\mathbb{R}^T}\int_a^b \big(\tilde{\theta}(x)-\theta\big)\frac{\partial \big(L_T(x;\theta)\lambda(\theta)\big)}{\partial\theta}\,d\theta\,dx = \int_{\mathbb{R}^T}\int_a^b L_T(x;\theta)\lambda(\theta)\,d\theta\,dx = 1.
\]
Multiplying and dividing by $L_T(x;\theta)\lambda(\theta)$ gives
\[
\int_a^b\int_{\mathbb{R}^T} \big(\tilde{\theta}(x)-\theta\big) \frac{\partial \log\big(L_T(x;\theta)\lambda(\theta)\big)}{\partial\theta}\, L_T(x;\theta)\lambda(\theta)\,dx\,d\theta = 1.
\]
Now by using the Cauchy-Schwarz inequality we have
\[
1 \le \underbrace{\int_a^b\int_{\mathbb{R}^T} \big(\tilde{\theta}(x)-\theta\big)^2 L_T(x;\theta)\lambda(\theta)\,dx\,d\theta}_{E_\lambda E[(\tilde{\theta}(X)-\theta)^2|\theta]}
\;\times\;
\int_a^b\int_{\mathbb{R}^T} \Big(\frac{\partial \log\big(L_T(x;\theta)\lambda(\theta)\big)}{\partial\theta}\Big)^2 L_T(x;\theta)\lambda(\theta)\,dx\,d\theta.
\]
Rearranging the above gives
\[
E_\lambda E\big[(\tilde{\theta}(X)-\theta)^2\,\big|\,\theta\big] \ge \Big[ \int_a^b\int_{\mathbb{R}^T} \Big(\frac{\partial \log\big(L_T(x;\theta)\lambda(\theta)\big)}{\partial\theta}\Big)^2 L_T(x;\theta)\lambda(\theta)\,dx\,d\theta \Big]^{-1}.
\]

Finally we want to show that the integral on the right-hand side above equals $E_\lambda I(\theta) + I(\lambda)$. Using basic algebra we have
\[
\int_a^b\int_{\mathbb{R}^T} \Big(\frac{\partial \log (L_T(x;\theta)\lambda(\theta))}{\partial\theta}\Big)^2 L_T(x;\theta)\lambda(\theta)\,dx\,d\theta
= \int_a^b\int_{\mathbb{R}^T} \Big(\frac{\partial \log L_T(x;\theta)}{\partial\theta} + \frac{\partial \log \lambda(\theta)}{\partial\theta}\Big)^2 L_T(x;\theta)\lambda(\theta)\,dx\,d\theta,
\]
which expands as
\[
\underbrace{\int_a^b\int_{\mathbb{R}^T} \Big(\frac{\partial \log L_T(x;\theta)}{\partial\theta}\Big)^2 L_T\lambda\,dx\,d\theta}_{E_\lambda I(\theta)}
+ 2\int_a^b\int_{\mathbb{R}^T} \frac{\partial \log L_T(x;\theta)}{\partial\theta}\frac{\partial \log \lambda(\theta)}{\partial\theta}\, L_T\lambda\,dx\,d\theta
+ \int_a^b\int_{\mathbb{R}^T} \Big(\frac{\partial \log \lambda(\theta)}{\partial\theta}\Big)^2 L_T\lambda\,dx\,d\theta.
\]
We note that the cross term vanishes,
\[
\int_a^b\int_{\mathbb{R}^T} \frac{\partial \log L_T(x;\theta)}{\partial\theta}\frac{\partial \log \lambda(\theta)}{\partial\theta}\, L_T(x;\theta)\lambda(\theta)\,dx\,d\theta
= \int_a^b \frac{\partial \log \lambda(\theta)}{\partial\theta}\lambda(\theta) \underbrace{\int_{\mathbb{R}^T} \frac{\partial L_T(x;\theta)}{\partial\theta}\,dx}_{=0}\,d\theta = 0,
\]
and that
\[
\int_a^b\int_{\mathbb{R}^T} \Big(\frac{\partial \log \lambda(\theta)}{\partial\theta}\Big)^2 L_T(x;\theta)\lambda(\theta)\,dx\,d\theta = \int_a^b \Big(\frac{\partial \log \lambda(\theta)}{\partial\theta}\Big)^2 \lambda(\theta)\,d\theta = I(\lambda).
\]
Therefore
\[
\int_a^b\int_{\mathbb{R}^T} \Big(\frac{\partial \log (L_T(x;\theta)\lambda(\theta))}{\partial\theta}\Big)^2 L_T(x;\theta)\lambda(\theta)\,dx\,d\theta = E_\lambda I(\theta) + I(\lambda),
\]
as required.

We will consider applications of the Bayesian Cramer-Rao bound in a later section, for obtaining lower bounds for nonparametric density estimators.


Chapter 3
The Exponential Family

3.1 The exponential family of distributions

(See also Section 5.2, Davison (2002).)

It is possible to derive the properties (e.g. the mean, the variance and the maximum likelihood estimators, to be defined properly later on) for every distribution of interest. However, this can be cumbersome, the algebra can be tedious, and we may not see the big picture. Instead, we now consider an umbrella family of distributions which includes several well-known distributions. We will derive a general expression for the mean and variance of such distributions, which will be useful when we consider Generalised Linear Models later in this course, and use these results to show that the maximum likelihood estimator is a function of the sufficient statistic, and thus is the best unbiased estimator under the assumption of completeness. In other words, we show that for this family of distributions the maximum likelihood estimator, which we have encountered many times previously, is indeed the best parameter estimator in terms of minimum variance.

Suppose that the distribution of the random variable $X_t$ can be written in the form
\[
f(y;\omega) = \exp\big( s(y)\eta(\omega) - b(\omega) + c(y) \big). \qquad (3.1)
\]
If the distribution of $X_t$ (both the probability mass function for discrete random variables and the probability density function for continuous random variables) has the above representation, then $X_t$ is said to belong to the exponential family of distributions. A large number of well-known distribution functions belong to this family. Hence, by understanding the properties of the exponential family, we can draw conclusions about a large number of distribution functions.

Example
(a) The exponential distribution. If $X\sim\mathrm{Exp}(\lambda)$, the pdf is $f(y;\lambda) = \lambda\exp(-\lambda y)$, which can be written as
\[
\log f(y;\lambda) = -y\lambda + \log\lambda.
\]

Therefore $s(y) = y$, $\eta(\lambda) = -\lambda$, $b(\lambda) = -\log\lambda$ and $c(y) = 0$.

(b) The binomial distribution. $P(X=y) = \binom{n}{y}\pi^y(1-\pi)^{n-y}$ can be rewritten as
\[
\log P(y;\pi) = y\log\frac{\pi}{1-\pi} + n\log(1-\pi) + \log\binom{n}{y}.
\]
Therefore $s(y) = y$, $\eta(\pi) = \log\frac{\pi}{1-\pi}$, $b(\pi) = -n\log(1-\pi)$ and $c(y) = \log\binom{n}{y}$.

It should be mentioned that it is straightforward to generalise the exponential family to the case that the parameter is a vector of dimension greater than one. Suppose that $\omega$ is a $p$-dimensional vector. The order-$p$ exponential family is defined as the set of distributions which satisfy
\[
f(y;\omega) = \exp\big( s(y)'\theta(\omega) - b(\omega) + c(y) \big),
\]
where $s(y) = (s_1(y),\ldots,s_p(y))$ with $\{s_i\}$ linearly independent, and $\theta(\omega) = (\theta_1(\omega),\ldots,\theta_p(\omega))$.

3.1.1 The natural exponential family

If we let $\theta = \eta(\omega)$, where $\eta$ is an invertible function (hence there is a one-to-one correspondence between the space containing $\omega$ and the space containing $\theta$), then we can rewrite (3.1) as
\[
f(y;\theta) = \exp\big( s(y)\theta - \kappa(\theta) + c(y) \big),
\]
where $\kappa(\theta) = b(\eta^{-1}(\theta))$. The natural exponential family is the case $s(y) = y$. By transformation we now give examples of distributions in natural form.

(i) The exponential distribution is already in natural exponential form (with natural parameter $\theta = -\lambda$).

(ii) For the binomial distribution we let $\theta = \log\frac{\pi}{1-\pi}$; since $\log\frac{\pi}{1-\pi}$ is invertible, this gives the distribution in natural form:
\[
\log f(y;\theta) = y\theta - n\log\big(1+\exp(\theta)\big) + \log\binom{n}{y}.
\]
Hence the parameter of interest, $\pi$, has been transformed; often we fit a model (later in the course) to $\theta$, and transform back to obtain an estimator of $\pi$.

Some properties of the natural exponential family

Distributions which have a natural exponential representation have interesting properties, which we now discuss.

Lemma Suppose that $X$ is a random variable which has the natural exponential representation. Then the moment generating function of $X$ is
\[
E[\exp(Xt)] = \exp\big( \kappa(t+\theta) - \kappa(\theta) \big).
\]
Furthermore, $E(X) = \kappa'(\theta)$ and $\mathrm{var}(X) = \kappa''(\theta)$.

PROOF. Let us suppose that $t$ is sufficiently small such that $f(y;\theta+t)$ is a distribution. The mgf is
\[
M_X(t) = E[\exp(tX)] = \int \exp(ty)\exp\big(\theta y - \kappa(\theta) + c(y)\big)\,dy
= \exp\big(\kappa(\theta+t) - \kappa(\theta)\big)\int \exp\big((\theta+t)y - \kappa(\theta+t) + c(y)\big)\,dy
= \exp\big(\kappa(\theta+t) - \kappa(\theta)\big),
\]
since $\int \exp((\theta+t)y - \kappa(\theta+t) + c(y))\,dy = \int f(y;\theta+t)\,dy = 1$. To obtain the moments we recall that $M_X'(0) = E(X)$ and $\mathrm{var}(X) = M_X''(0) - M_X'(0)^2$. Now
\[
M_X'(t) = \kappa'(\theta+t)\exp\big(\kappa(\theta+t)-\kappa(\theta)\big), \qquad
M_X''(t) = \big(\kappa''(\theta+t)+\kappa'(\theta+t)^2\big)\exp\big(\kappa(\theta+t)-\kappa(\theta)\big).
\]
Hence $M_X'(0) = \kappa'(\theta)$ and $M_X''(0) = \kappa''(\theta)+\kappa'(\theta)^2$, which gives the result.

Remark The mean and variance of the natural exponential family make obtaining the MLE quite simple. We derive this later, but we first observe that since $E(X) = \kappa'(\theta)$, the mean of $X$ is a function of $\theta$; hence we can write $\mu(\theta) = \kappa'(\theta)$. Moreover, since $\mathrm{var}(X) = \kappa''(\theta) > 0$, the derivative of $\mu$, $\mu'(\theta) = \kappa''(\theta)$, is strictly positive. In other words, $\mu(\theta) = \kappa'(\theta)$ is an increasing function of $\theta$. Thus $\mu(\theta)$ is an invertible function, and therefore given $\mu(\theta)$ we can uniquely determine $\theta$. This observation will prove useful later when obtaining the MLE of $\theta$.

3.1.2 Maximum likelihood estimation for the exponential family

Suppose that $\{X_t\}$ are iid random variables which have a natural exponential distribution representation. Then the log-likelihood function is
\[
\mathcal{L}_T(X;\theta) = \theta\sum_{t=1}^{T} X_t - T\kappa(\theta) + \sum_{t=1}^{T} c(X_t).
\]
Hence, by using the factorisation theorem, we see that the sufficient statistic for $\theta$ is $s(X) = \sum_{t=1}^{T} X_t$. Hence, supposing that Assumption 1.1.1 is satisfied, the minimum variance unbiased estimator of $\theta$ should be a function of $s(X)$. We now obtain the maximum likelihood estimator of $\theta$, and derive conditions under which the MLE is a function of $s(X)$ (hence, by the Rao-Blackwell theorem and the Lehmann-Scheffe lemma, it is the best estimator). The MLE of $\theta$ is $\hat{\theta}_T$, where
\[
\hat{\theta}_T = \arg\max_{\theta\in\Theta}\Big\{ \theta\sum_{t=1}^{T} X_t - T\kappa(\theta) + \sum_{t=1}^{T} c(X_t) \Big\}.
\]

The natural way to obtain $\hat{\theta}_T$ is to find the solution of $\frac{\partial \mathcal{L}_T(X;\theta)}{\partial\theta} = 0$. However, whether $\frac{\partial \mathcal{L}_T(X;\theta)}{\partial\theta}\big|_{\theta=\hat{\theta}_T} = 0$ holds depends on a few conditions. Before we derive these conditions, we first consider the solution of the derivative of $\mathcal{L}_T(X;\theta)$. Differentiating $\mathcal{L}_T(X;\theta)$ gives
\[
\frac{\partial \mathcal{L}_T(X;\theta)}{\partial\theta} = \sum_{t=1}^{T} X_t - T\kappa'(\theta).
\]
Therefore, since $\mu(\theta) = \kappa'(\theta)$ is an invertible function, $\frac{\partial \mathcal{L}_T(X;\theta)}{\partial\theta} = 0$ when
\[
\hat{\theta}_T = \mu^{-1}\Big( \frac{1}{T}\sum_{t=1}^{T} X_t \Big).
\]
Of course, we need to know under what conditions
\[
\mu^{-1}\Big( \frac{1}{T}\sum_{t=1}^{T} X_t \Big) = \arg\max_{\theta\in\Theta}\Big\{ \theta\sum_{t=1}^{T} X_t - T\kappa(\theta) + \sum_{t=1}^{T} c(X_t) \Big\}.
\]
The above really depends on the parameter space $\Theta$.

Definition Let $\Theta$ be the parameter space of $\theta$ and $\mathcal{Y}$ the space of outcomes of the random variable $X$. Let $M = \{\mu = \mu(\theta);\ \theta\in\Theta\}$ denote the mean space, and let $\bar{\mathcal{Y}}_T = \{\bar{y} = \frac{1}{T}\sum_{t=1}^{T} x_t;\ x_t\in\mathcal{Y}\}$ denote the sample mean space.

Lemma Suppose that $\{X_t\}$ are iid random variables which have a natural exponential representation. If $\bar{\mathcal{Y}}_T \subseteq M$ then
\[
\mu^{-1}\Big( \frac{1}{T}\sum_{t=1}^{T} X_t \Big) = \arg\max_{\theta\in\Theta}\Big\{ \theta\sum_{t=1}^{T} X_t - T\kappa(\theta) + \sum_{t=1}^{T} c(X_t) \Big\}.
\]

PROOF. The proof is straightforward, since the first derivative is zero when $\theta = \mu^{-1}(\frac{1}{T}\sum_{t=1}^{T} X_t)$, which maximises $\mathcal{L}_T(X;\theta)$ over the sample mean space $\bar{\mathcal{Y}}_T$. Hence, in order for it to be the maximum over the mean space $M$, we need either $M = \bar{\mathcal{Y}}_T$ or $\bar{\mathcal{Y}}_T \subseteq M$.

Remark (Minimum variance unbiased estimators) Suppose $X_t$ has a distribution in the natural exponential family, the conditions of the above lemma are satisfied and $s(X)$ is the complete sufficient statistic for $\theta$. Moreover, if $\mu^{-1}(\frac{1}{T}\sum_{t=1}^{T} X_t)$ is an unbiased estimator of $\theta$, then $\mu^{-1}(\frac{1}{T}\sum_{t=1}^{T} X_t)$ is the minimum variance unbiased estimator of $\theta$. In general, however, this will not be the case. But by using Slutsky's theorem it can be shown that $\mu^{-1}(\frac{1}{T}\sum_{t=1}^{T} X_t) \overset{P}{\to} \theta$.
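To make the estimator $\hat{\theta}_T = \mu^{-1}(\bar{X})$ concrete, here is a minimal sketch for the exponential distribution of Example (a), written in natural form with $\theta = -\lambda$, $\kappa(\theta) = -\log(-\theta)$ and hence $\mu(\theta) = -1/\theta$; the data and true rate below are assumed purely for illustration, and the closed form is checked against a direct numerical maximisation.

```r
# MLE of the natural parameter for iid exponential data: theta.hat = mu^{-1}(mean(X))
set.seed(9)
x <- rexp(200, rate = 2)                    # true lambda = 2, so theta = -2
theta.closed <- -1 / mean(x)                # mu^{-1}(xbar) = -1/xbar

# Check against numerical maximisation of the natural-form log-likelihood
# log f(y; theta) = theta * y + log(-theta), for theta < 0
neg.loglik <- function(theta) -sum(theta * x + log(-theta))
fit <- optimize(neg.loglik, interval = c(-100, -1e-6))
c(theta.closed.form = theta.closed, theta.numerical = fit$minimum)
```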

Remark (Estimating $\omega$) Often we are interested in estimating $\omega$, where $\theta = \eta(\omega)$. Since
\[
\frac{\partial \mathcal{L}_T(X;\eta(\omega))}{\partial\omega} = \eta'(\omega)\,\frac{\partial \mathcal{L}_T(X;\theta)}{\partial\theta}\Big|_{\theta=\eta(\omega)} = \eta'(\omega)\Big( \sum_{t=1}^{T} X_t - T\kappa'(\eta(\omega)) \Big),
\]
then, if all the conditions regarding the parameter and sample mean spaces are satisfied, the MLE of $\omega$ is
\[
\hat{\omega}_T = \eta^{-1}\Big( \mu^{-1}\Big( \frac{1}{T}\sum_{t=1}^{T} X_t \Big) \Big).
\]
It should be noted that one great advantage of the exponential family of distributions is that the MLE is easy to obtain, with explicit expressions.

Many of the results above can be generalised to the setting where $\{X_t\}$ are independent but not necessarily identically distributed and there exist regressors which are known to influence the mean of $X_t$. We will revisit this problem when we consider generalised linear models.
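As a minimal illustration of this remark, consider the binomial example of Section 3.1.1, where $\eta(\pi) = \log\frac{\pi}{1-\pi}$ and $\mu(\theta) = n\exp(\theta)/(1+\exp(\theta))$. The data below are simulated purely for illustration.

```r
# Natural-parameter MLE and back-transformation to the parameter of interest pi
set.seed(10)
n <- 10
x <- rbinom(500, size = n, prob = 0.3)

theta.hat <- qlogis(mean(x) / n)    # mu^{-1}(xbar) = log( (xbar/n) / (1 - xbar/n) )
pi.hat <- plogis(theta.hat)         # eta^{-1}(theta.hat) = exp(theta)/(1 + exp(theta))
c(theta.hat = theta.hat, pi.hat = pi.hat)   # pi.hat coincides with xbar/n, as expected
```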


Chapter 4
The Maximum Likelihood Estimator

4.1 The maximum likelihood estimator

As illustrated with the exponential family of distributions discussed above, the maximum likelihood estimator of $\theta_0$ (the true parameter) is defined as
\[
\hat{\theta}_T = \arg\max_{\theta\in\Theta} \mathcal{L}_T(X;\theta) = \arg\max_{\theta\in\Theta} \mathcal{L}_T(\theta).
\]
Often we find that $\frac{\partial \mathcal{L}_T(\theta)}{\partial\theta}\big|_{\theta=\hat{\theta}_T} = 0$, hence the solution can be obtained by solving the derivative of the log-likelihood (often called the score function). However, if $\theta_0$ lies on or close to the boundary of the parameter space, this will not necessarily be true. Below we consider the sampling properties of $\hat{\theta}_T$ when the true parameter $\theta_0$ lies in the interior of the parameter space $\Theta$.

We note that the likelihood is invariant to one-to-one transformations of the data. For example, if $X$ has the density $f(\cdot;\theta)$ and we define the transformed random variable $Z = g(X)$, where the function $g$ has an inverse (it is a 1-1 transformation), then it is easy to show that the density of $Z$ is $f(g^{-1}(z);\theta)\,\big|\frac{\partial g^{-1}(z)}{\partial z}\big|$. Therefore the likelihood of $\{Z_t = g(X_t)\}$ is
\[
\prod_{t=1}^{T} f\big(g^{-1}(Z_t);\theta\big)\Big|\frac{\partial g^{-1}(z)}{\partial z}\Big|_{z=Z_t} = \prod_{t=1}^{T} f(X_t;\theta)\Big|\frac{\partial g^{-1}(z)}{\partial z}\Big|_{z=Z_t}.
\]
Hence it is proportional to the likelihood of $\{X_t\}$, and the maximum of the likelihood of $\{Z_t = g(X_t)\}$ is the same as the maximum of the likelihood of $\{X_t\}$.
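A minimal numerical check of this invariance (the data and parameter values below are assumed purely for illustration): with $X\sim N(\mu,\sigma^2)$ and $Z = g(X) = \exp(X)$, the density of $Z$ is the log-normal $f(g^{-1}(z);\theta)\,|\partial g^{-1}(z)/\partial z|$, and both likelihoods are maximised at the same $(\mu,\sigma)$.

```r
# Invariance of the likelihood under a 1-1 transformation of the data
set.seed(1)
x <- rnorm(200, mean = 2, sd = 0.5)
z <- exp(x)                                     # transformed data, Z = g(X) = exp(X)

neg.loglik.x <- function(par)                   # normal likelihood of the original data
  -sum(dnorm(x, mean = par[1], sd = exp(par[2]), log = TRUE))
neg.loglik.z <- function(par)                   # log-normal likelihood of the transformed data
  -sum(dlnorm(z, meanlog = par[1], sdlog = exp(par[2]), log = TRUE))

fit.x <- optim(c(0, 0), neg.loglik.x)
fit.z <- optim(c(0, 0), neg.loglik.z)
rbind(from.x = c(mu = fit.x$par[1], sigma = exp(fit.x$par[2])),
      from.z = c(mu = fit.z$par[1], sigma = exp(fit.z$par[2])))   # the two maxima agree
```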

4.1.1 Evaluating the MLE

Examples

Example 4.1.1 $\{X_t\}$ are iid random variables which follow a Normal (Gaussian) distribution $N(\mu,\sigma^2)$. The log-likelihood is proportional to
\[
\mathcal{L}_T(X;\mu,\sigma^2) = -T\log\sigma - \frac{1}{2\sigma^2}\sum_{t=1}^{T}(X_t-\mu)^2.
\]
Maximising the above with respect to $\mu$ and $\sigma^2$ gives $\hat{\mu}_T = \bar{X}$ and $\hat{\sigma}^2_T = \frac{1}{T}\sum_{t=1}^{T}(X_t-\bar{X})^2$.

Example 4.1.2
Question: $\{Y_t\}$ are iid random variables which follow a Weibull distribution, which has the density $\frac{\alpha y^{\alpha-1}}{\theta^{\alpha}}\exp\{-(y/\theta)^{\alpha}\}$, $\theta,\alpha>0$. Suppose that $\alpha$ is known, but $\theta$ is unknown and we need to estimate it. What is the maximum likelihood estimator of $\theta$?
Solution: The log-likelihood of interest is proportional to
\[
\mathcal{L}_T(Y;\theta) = \sum_{t=1}^{T}\Big[ \log\alpha + (\alpha-1)\log Y_t - \alpha\log\theta - \Big(\frac{Y_t}{\theta}\Big)^{\alpha} \Big]
\propto \sum_{t=1}^{T}\Big[ -\alpha\log\theta - \Big(\frac{Y_t}{\theta}\Big)^{\alpha} \Big].
\]
The derivative of the log-likelihood with respect to $\theta$ is
\[
\frac{\partial \mathcal{L}_T}{\partial\theta} = -\frac{T\alpha}{\theta} + \frac{\alpha}{\theta^{\alpha+1}}\sum_{t=1}^{T} Y_t^{\alpha} = 0.
\]
Solving the above gives $\hat{\theta}_T = \big( \frac{1}{T}\sum_{t=1}^{T} Y_t^{\alpha} \big)^{1/\alpha}$.

Example 4.1.3 Notice that if $\alpha$ is given, an explicit solution for the maximum of the likelihood in the above example can be obtained. Consider instead the maximum of the likelihood with respect to both $\alpha$ and $\theta$, i.e.
\[
\arg\max_{\theta,\alpha} \sum_{t=1}^{T}\Big[ \log\alpha + (\alpha-1)\log Y_t - \alpha\log\theta - \Big(\frac{Y_t}{\theta}\Big)^{\alpha} \Big].
\]

The derivatives of the log-likelihood are
\[
\frac{\partial \mathcal{L}_T}{\partial\theta} = -\frac{T\alpha}{\theta} + \frac{\alpha}{\theta^{\alpha+1}}\sum_{t=1}^{T} Y_t^{\alpha} = 0,
\qquad
\frac{\partial \mathcal{L}_T}{\partial\alpha} = \frac{T}{\alpha} + \sum_{t=1}^{T}\log Y_t - T\log\theta - \sum_{t=1}^{T}\log\Big(\frac{Y_t}{\theta}\Big)\Big(\frac{Y_t}{\theta}\Big)^{\alpha} = 0.
\]
It is clear that an explicit expression for the solution of the above does not exist, and we need to find alternative methods for finding a solution. Below we shall describe numerical routines which can be used in the maximisation. In special cases one can use other methods, such as the profile likelihood (we cover this later on).

Numerical routines

In an ideal world, to maximise a likelihood we would consider the derivative of the likelihood, solve $\frac{\partial \mathcal{L}_T(\theta)}{\partial\theta}\big|_{\theta=\hat{\theta}_T} = 0$, and an explicit expression would exist for this solution. In reality this rarely happens, as we illustrated in the section above. Usually we will be unable to obtain an explicit expression for the MLE. In such cases one has to do the maximisation using alternative, numerical methods. Typically it is relatively straightforward to maximise the likelihood of random variables which belong to the exponential family (numerical algorithms sometimes have to be used, but they tend to be fast and attain the maximum of the likelihood, not just a local maximum). However, the story becomes more complicated if we consider mixtures of exponential family distributions; these do not belong to the exponential family, and can be difficult to maximise using conventional numerical routines. We give an example of such a distribution here.

Let us suppose that $\{X_t\}$ are iid random variables which follow the classical normal mixture distribution
\[
f(y;\theta) = p f_1(y;\theta_1) + (1-p) f_2(y;\theta_2),
\]
where $f_1$ is the density of the normal with mean $\mu_1$ and variance $\sigma_1^2$ and $f_2$ is the density of the normal with mean $\mu_2$ and variance $\sigma_2^2$. The log-likelihood is
\[
\mathcal{L}_T(X;\theta) = \sum_{t=1}^{T}\log\Big( \frac{p}{\sqrt{2\pi\sigma_1^2}}\exp\Big(-\frac{(X_t-\mu_1)^2}{2\sigma_1^2}\Big) + \frac{1-p}{\sqrt{2\pi\sigma_2^2}}\exp\Big(-\frac{(X_t-\mu_2)^2}{2\sigma_2^2}\Big) \Big).
\]
Studying the above, it is clear that there is no explicit solution for the maximum. Hence one needs to use a numerical algorithm to maximise the above likelihood. We discuss a few such methods below.
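Before describing the routines, here is a minimal sketch (with simulated data, purely for illustration) of maximising the Weibull log-likelihood of Example 4.1.3 numerically with a general-purpose optimiser; plugging the resulting $\hat{\alpha}_T$ into the closed form of Example 4.1.2 provides a useful check on the numerical answer.

```r
# Numerical maximisation of the Weibull log-likelihood in (alpha, theta)
set.seed(3)
y <- rweibull(200, shape = 1.5, scale = 2)        # shape = alpha, scale = theta

neg.loglik <- function(par) {                     # par = (log alpha, log theta), unconstrained
  alpha <- exp(par[1]); theta <- exp(par[2])
  -sum(log(alpha) + (alpha - 1) * log(y) - alpha * log(theta) - (y / theta)^alpha)
}
fit <- optim(c(0, 0), neg.loglik)
alpha.hat <- exp(fit$par[1]); theta.hat <- exp(fit$par[2])

# Check: with alpha fixed at alpha.hat, Example 4.1.2 gives theta in closed form
c(alpha.hat = alpha.hat,
  theta.numerical = theta.hat,
  theta.closed.form = mean(y^alpha.hat)^(1 / alpha.hat))
```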

The Newton-Raphson routine

The Newton-Raphson routine is the standard method to numerically maximise the likelihood; this can often be done automatically in R by using the R functions optim or nlm. To apply Newton-Raphson, we have to assume that the derivative of the likelihood exists (this is not always the case; think about $\ell_1$-norm based estimators!) and that the maximum lies inside the parameter space, such that $\frac{\partial \mathcal{L}_T(\theta)}{\partial\theta}\big|_{\theta=\hat{\theta}_T} = 0$. We choose an initial value $\theta_1$ and apply the routine
\[
\theta_n = \theta_{n-1} - \Big( \frac{\partial^2 \mathcal{L}_T(\theta)}{\partial\theta^2}\Big|_{\theta_{n-1}} \Big)^{-1} \frac{\partial \mathcal{L}_T(\theta)}{\partial\theta}\Big|_{\theta_{n-1}}.
\]
Where this routine comes from will be clear from the Taylor expansion of $\frac{\partial \mathcal{L}_T(\theta)}{\partial\theta}\big|_{\theta_{n-1}}$ about $\theta_0$ (see Section 4.1.3). If the likelihood has just one global maximum and no local maxima (hence it is concave), then it is quite easy to maximise. If, on the other hand, the likelihood has a few local maxima and the initial value $\theta_1$ is not chosen close enough to the true maximum, then the routine may converge to a local maximum (not good!). In this case it may be a good idea to run the routine several times from several different initial values $\theta_1^{(i)}$, $i\ge 1$. For each convergence value $\hat{\theta}_T^{(i)}$, evaluate the likelihood $\mathcal{L}_T(\hat{\theta}_T^{(i)})$ and select the value which gives the largest likelihood. It is best to avoid these problems by starting with an informed choice of initial value. Implementing a Newton-Raphson routine without any thought can lead to estimators which take an incredibly long time to converge. If one carefully considers the likelihood, one can shorten the convergence time by rewriting the likelihood and using faster methods (often based on the Newton-Raphson).

Iterative least squares

This is a method that we shall describe later, when we consider generalised linear models. As the name suggests, the algorithm has to be iterated; at each step weighted least squares is implemented (see later in the course).

The EM algorithm

This works by the introduction of dummy variables, which leads to a new "unobserved" likelihood which can easily be maximised. In fact, one of the simplest methods of maximising the likelihood of mixture distributions is to use the EM algorithm. We cover this later in the course. See Example 4.23 on page 117 in Davison (2002).
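As a minimal sketch of these ideas (simulated data and hypothetical starting values, purely for illustration), the normal mixture log-likelihood above can be handed to a general-purpose optimiser such as optim, trying several starting values and keeping the fit with the largest likelihood, exactly as suggested for multimodal likelihoods.

```r
# Numerical maximisation of the two-component normal mixture log-likelihood
set.seed(2)
x <- c(rnorm(150, mean = 0, sd = 1), rnorm(50, mean = 4, sd = 0.5))

neg.loglik <- function(par) {                  # par = (logit p, mu1, mu2, log s1, log s2)
  p <- plogis(par[1]); mu1 <- par[2]; mu2 <- par[3]
  s1 <- exp(par[4]);   s2 <- exp(par[5])
  -sum(log(p * dnorm(x, mu1, s1) + (1 - p) * dnorm(x, mu2, s2)))
}

# Several starting values; keep the run with the smallest negative log-likelihood
starts <- list(c(0, -1, 1, 0, 0), c(0, 0, 5, 0, 0), c(1, 2, 2, 0, 0))
fits <- lapply(starts, function(s) optim(s, neg.loglik, method = "BFGS"))
best <- fits[[which.min(sapply(fits, function(f) f$value))]]
c(p = plogis(best$par[1]), mu1 = best$par[2], mu2 = best$par[3],
  sigma1 = exp(best$par[4]), sigma2 = exp(best$par[5]))
```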

The likelihood for dependent data

We mention that the likelihood for dependent data can also be constructed, though often the estimation and the asymptotic properties can be a lot harder to derive. Using Bayes' rule (i.e. $P(A_1,A_2,\ldots,A_T) = P(A_1)\prod_{i=2}^{T} P(A_i|A_{i-1},\ldots,A_1)$) we have
\[
L_T(X;\theta) = f(X_1;\theta)\prod_{t=2}^{T} f(X_t|X_{t-1},\ldots,X_1;\theta).
\]
Under certain conditions on $\{X_t\}$ the structure of $\prod_{t=2}^{T} f(X_t|X_{t-1},\ldots,X_1;\theta)$ can be simplified. For example, if $X_t$ were Markovian then $X_t$ conditioned on the past only depends on the most recent past observation, i.e. $f(X_t|X_{t-1},\ldots,X_1;\theta) = f(X_t|X_{t-1};\theta)$; in this case the above likelihood reduces to
\[
L_T(X;\theta) = f(X_1;\theta)\prod_{t=2}^{T} f(X_t|X_{t-1};\theta). \qquad (4.1)
\]

Example A lot of the material we cover in this class will be for independent observations; however, likelihood methods can also work for dependent observations. Consider the AR(1) time series
\[
X_t = aX_{t-1} + \varepsilon_t,
\]
where $\{\varepsilon_t\}$ are iid random variables with mean zero. We will assume that $|a|<1$. We see from the above that the observation $X_{t-1}$ has a linear influence on the next observation, and that the process is Markovian: given $X_{t-1}$, the random variable $X_{t-2}$ has no influence on $X_t$ (to see this, consider the distribution function $P(X_t\le x|X_{t-1},X_{t-2})$). Therefore, by using (4.1), the likelihood of $\{X_t\}$ is
\[
L_T(X;a) = f(X_1;a)\prod_{t=2}^{T} f_\varepsilon(X_t - aX_{t-1}), \qquad (4.2)
\]
where $f_\varepsilon$ is the density of $\varepsilon$ and $f(X_1;a)$ is the marginal density of $X_1$. This means the likelihood of $\{X_t\}$ only depends on $f_\varepsilon$ and the marginal density of $X_1$. We use $\hat{a}_T = \arg\max_a L_T(X;a)$ as the MLE of $a$. Often we ignore the term $f(X_1;a)$, because this is often hard to know (try and figure it out; it is relatively easy in the Gaussian case), and consider what is called the conditional likelihood
\[
Q_T(X;a) = \prod_{t=2}^{T} f_\varepsilon(X_t - aX_{t-1}),
\]
with $\tilde{a}_T = \arg\max_a Q_T(X;a)$ as the quasi-MLE of $a$.

Exercise: What is the conditional likelihood proportional to in the case that $\{\varepsilon_t\}$ are Gaussian random variables with mean zero?

It should be mentioned that the conditional likelihood is often derived as if the errors $\{\varepsilon_t\}$ were Gaussian, even when they are not. This is often called the quasi- or pseudo-likelihood.
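A minimal sketch (with simulated data, purely for illustration) of maximising the conditional Gaussian (quasi-)likelihood $Q_T(X;a)$ numerically; for comparison, R's arima function with method = "CSS" fits the same conditional criterion.

```r
# Conditional Gaussian (quasi-)likelihood for the AR(1) model X_t = a X_{t-1} + e_t
set.seed(4)
n <- 500
x <- as.numeric(arima.sim(model = list(ar = 0.6), n = n))   # true a = 0.6

neg.cond.loglik <- function(par) {            # par = (a, log sigma)
  a <- par[1]; sigma <- exp(par[2])
  e <- x[2:n] - a * x[1:(n - 1)]              # residuals X_t - a X_{t-1}, t = 2, ..., T
  -sum(dnorm(e, mean = 0, sd = sigma, log = TRUE))
}
fit <- optim(c(0, 0), neg.cond.loglik)
c(a.hat = fit$par[1], sigma.hat = exp(fit$par[2]))

# Comparison: conditional-sum-of-squares fit of the same model
arima(x, order = c(1, 0, 0), include.mean = FALSE, method = "CSS")$coef
```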

4.1.2 A quick review of the central limit theorem

In this section we will not endeavour to prove the central limit theorem, whose proof is usually based on showing that the characteristic function (a close cousin of the moment generating function) of the average converges to the characteristic function of the normal distribution. However, we will recall the general statement of the CLT and generalisations of it. The purpose of this section is not to lumber you with unnecessary mathematics but to help you understand when an estimator is close to normal or not.

Lemma (The famous CLT) Let us suppose that $\{X_t\}$ are iid random variables, and let $\mu = E(X_t) < \infty$ and $\sigma^2 = \mathrm{var}(X_t) < \infty$. Define $\bar{X} = \frac{1}{T}\sum_{t=1}^{T} X_t$. Then we have
\[
\sqrt{T}\big(\bar{X}-\mu\big) \overset{D}{\to} N(0,\sigma^2);
\]
alternatively, $\bar{X}-\mu$ is approximately $N(0,\sigma^2/T)$. What this means is that if we have a large enough sample size and plot the histogram of several replications of the average, this should be close to normal.

Remark
(i) The above lemma appears to be restricted to just averages. However, it can be used in several different contexts. Averages arise in several different situations, not just as the average of the observations. By judicious algebraic manipulations, one can show that several estimators can be rewritten as an average, or approximately as an average. At first appearance, the MLE of the Weibull parameters given in Example 4.1.3 does not look like an average; however, in a later section we consider general maximum likelihood estimators and show that they can be rewritten approximately as an average, hence the CLT applies to them too.
(ii) The CLT can be extended in several ways:
(a) to random variables whose variances are not all the same (i.e. independent but not identically distributed random variables);
(b) to dependent random variables, so long as the dependence decays in some way;
(c) to weighted averages, so long as the weights are distributed well over all the random variables. For example, suppose that $\{X_t\}$ are iid random variables. Then it is clear that $\sum_{t=1}^{10} X_t$ will never be normal unless $\{X_t\}$ are normal (observe that 10 is fixed!), but it seems plausible that $\frac{1}{n}\sum_{t=1}^{n}\sin(2\pi t/12)X_t$ is asymptotically normal, despite this not being the sum of iid random variables.

There exist several theorems which one can use to prove normality. But really the take-home message is: look at your estimator and see whether asymptotic normality looks plausible; you could even check it through simulations.

Example (Some problem cases) One should think a little before blindly applying the CLT. Suppose that the iid random variables $\{X_t\}$ follow a t-distribution with 2 degrees of freedom, i.e. the density function is
\[
f(x) = \frac{\Gamma(3/2)}{\sqrt{2\pi}}\Big(1+\frac{x^2}{2}\Big)^{-3/2}.
\]
Let $\bar{X} = \frac{1}{n}\sum_{t=1}^{n} X_t$ denote the sample mean. It is well known that the mean of the t-distribution with two degrees of freedom exists, but the variance does not (the distribution is too thick-tailed). Thus the assumptions required for the CLT to hold are violated and $\bar{X}$ is not asymptotically normally distributed (in fact it follows a stable law distribution). Intuitively this is clear: recall that the chance of outliers for a t-distribution with a small number of degrees of freedom is large. This makes it impossible for even averages to be well behaved; there is a large chance that an average could also be too large or too small. To see why the variance is infinite, study the form of the t-distribution with two degrees of freedom. For the variance to be finite, the tails of the distribution should converge to zero fast enough (in other words, the probability of outliers should not be too large). The tails of the t-distribution for large $x$ behave like $f(x)\approx Cx^{-3}$ (make a plot in Maple to check); thus the second moment satisfies
\[
E(X^2) \ge \int_M^{\infty} Cx^{-3}x^2\,dx = \int_M^{\infty} Cx^{-1}\,dx
\]
for some $C$ and $M$, which is clearly not finite. This argument can be made precise.

4.1.3 The Taylor series expansion - the statistician's tool

The Taylor series is used all over the place in statistics and you should be completely fluent in using it. It can be used to prove consistency of an estimator, to prove normality (based on the assumption that averages converge to a normal distribution), to obtain the limiting variance of an estimator, etc. We start by demonstrating its use for the log-likelihood.

We recall that the mean value theorem in the univariate case states that
\[
f(x) = f(x_0) + (x-x_0)f'(\bar{x}_1), \qquad
f(x) = f(x_0) + (x-x_0)f'(x_0) + \frac{(x-x_0)^2}{2}f''(\bar{x}_2),
\]
where $\bar{x}_1$ and $\bar{x}_2$ both lie between $x$ and $x_0$. In the case that $f$ is a multivariate function, we have
\[
f(x) = f(x_0) + (x-x_0)'\nabla f(x)\big|_{x=\bar{x}_1}, \qquad
f(x) = f(x_0) + (x-x_0)'\nabla f(x)\big|_{x=x_0} + \frac{1}{2}(x-x_0)'\nabla^2 f(x)\big|_{x=\bar{x}_2}(x-x_0),
\]
where $\bar{x}_1$ and $\bar{x}_2$ both lie between $x$ and $x_0$.

In the case that $f(x)$ is vector-valued, the mean value theorem does not directly apply. Strictly speaking, we cannot say that $f(x) = f(x_0) + (x-x_0)'\nabla f(x)|_{x=\bar{x}_1}$, where $\bar{x}_1$ lies between $x$ and $x_0$. However, it is quite straightforward to overcome this inconvenience. The mean value theorem does hold pointwise, for every element of the vector $f(x) = (f_1(x),\ldots,f_d(x))$, i.e. for every $1\le i\le d$ we have
\[
f_i(x) = f_i(x_0) + (x-x_0)'\nabla f_i(x)\big|_{x=\bar{x}_i},
\]
where $\bar{x}_i$ lies between $x$ and $x_0$. Thus, if $\nabla f_i(x)|_{x=\bar{x}_i} \approx \nabla f_i(x)|_{x=x_0}$, we do have that $f(x) \approx f(x_0) + (x-x_0)'\nabla f(x)|_{x=x_0}$. We use the above below.

Application 1: an expression for $\mathcal{L}_T(\hat{\theta}_T) - \mathcal{L}_T(\theta_0)$ in terms of $\hat{\theta}_T - \theta_0$.
The expansion of $\mathcal{L}_T(\theta_0)$ about $\hat{\theta}_T$ gives
\[
\mathcal{L}_T(\theta_0) - \mathcal{L}_T(\hat{\theta}_T) = \frac{\partial \mathcal{L}_T(\theta)}{\partial\theta}\Big|_{\hat{\theta}_T}\big(\theta_0-\hat{\theta}_T\big) + \frac{1}{2}\big(\theta_0-\hat{\theta}_T\big)'\frac{\partial^2 \mathcal{L}_T(\theta)}{\partial\theta^2}\Big|_{\bar{\theta}_T}\big(\theta_0-\hat{\theta}_T\big),
\]
where $\bar{\theta}_T$ lies between $\theta_0$ and $\hat{\theta}_T$. If $\hat{\theta}_T$ lies in the interior of the parameter space (this is an extremely important assumption here), then $\frac{\partial \mathcal{L}_T(\theta)}{\partial\theta}\big|_{\hat{\theta}_T} = 0$. Moreover, if it can be shown that $\hat{\theta}_T - \theta_0 \overset{P}{\to} 0$ (we show this in the section below), then under certain conditions on $\mathcal{L}_T$ (such as the existence of the third derivative, etc.) it can be shown that
\[
\frac{\partial^2 \mathcal{L}_T(\theta)}{\partial\theta^2}\Big|_{\bar{\theta}_T} \overset{P}{\to} E\Big(\frac{\partial^2 \mathcal{L}_T(\theta)}{\partial\theta^2}\Big|_{\theta_0}\Big) = -I(\theta_0).
\]
Hence the above is roughly
\[
2\big(\mathcal{L}_T(\hat{\theta}_T) - \mathcal{L}_T(\theta_0)\big) \approx \big(\hat{\theta}_T-\theta_0\big)'I(\theta_0)\big(\hat{\theta}_T-\theta_0\big).
\]
Note that in many of the derivations below we will use that $\frac{\partial^2 \mathcal{L}_T(\theta)}{\partial\theta^2}\big|_{\bar{\theta}_T} \overset{P}{\to} E\big(\frac{\partial^2 \mathcal{L}_T(\theta)}{\partial\theta^2}\big|_{\theta_0}\big) = -I(\theta_0)$; but it should be noted that this is only true if (i) $\hat{\theta}_T - \theta_0 \overset{P}{\to} 0$ and (ii) $\frac{\partial^2 \mathcal{L}_T(\theta)}{\partial\theta^2}$ converges uniformly to $E\big(\frac{\partial^2 \mathcal{L}_T(\theta)}{\partial\theta^2}\big)$. We consider below another closely related application.

Application 2: an expression for $\hat{\theta}_T - \theta_0$ in terms of $\frac{\partial \mathcal{L}_T(\theta)}{\partial\theta}\big|_{\theta_0}$.
The expansion of the $p$-dimensional vector $\frac{\partial \mathcal{L}_T(\theta)}{\partial\theta}\big|_{\hat{\theta}_T}$ pointwise about $\theta_0$ (the true parameter) gives, for $1\le i\le p$,
\[
\frac{\partial \mathcal{L}_{i,T}(\theta)}{\partial\theta}\Big|_{\hat{\theta}_T} = \frac{\partial \mathcal{L}_{i,T}(\theta)}{\partial\theta}\Big|_{\theta_0} + \frac{\partial^2 \mathcal{L}_{i,T}(\theta)}{\partial\theta^2}\Big|_{\bar{\theta}_T}\big(\hat{\theta}_T-\theta_0\big).
\]
Now, by using the same argument as in Application 1, we have
\[
\frac{\partial \mathcal{L}_T(\theta)}{\partial\theta}\Big|_{\theta_0} \approx I(\theta_0)\big(\hat{\theta}_T-\theta_0\big).
\]
We mention that $U_T(\theta_0) = \frac{\partial \mathcal{L}_T(\theta)}{\partial\theta}\big|_{\theta_0}$ is often called the score or U-statistic, and we see that the asymptotic sampling properties of $U_T$ determine the sampling properties of $\hat{\theta}_T - \theta_0$.

Example (The Weibull) Evaluate the second derivative of the log-likelihood given in Example 4.1.3 and take the expectation of its negative, $I(\theta,\alpha) = -E\big(\nabla^2\mathcal{L}_T\big)$ (we use $\nabla^2$ to denote the matrix of second derivatives with respect to the parameters $\alpha$ and $\theta$).
Exercise: Evaluate $I(\theta,\alpha)$.
Application 2 implies that the maximum likelihood estimators $\hat{\theta}_T$ and $\hat{\alpha}_T$ (recalling that no explicit expression for them exists) can be written as
\[
\begin{pmatrix} \hat{\theta}_T - \theta \\ \hat{\alpha}_T - \alpha \end{pmatrix}
\approx I(\theta,\alpha)^{-1}\sum_{t=1}^{T}
\begin{pmatrix} -\dfrac{\alpha}{\theta} + \dfrac{\alpha}{\theta^{\alpha+1}}Y_t^{\alpha} \\
\dfrac{1}{\alpha} + \log Y_t - \log\theta - \log\big(\tfrac{Y_t}{\theta}\big)\big(\tfrac{Y_t}{\theta}\big)^{\alpha} \end{pmatrix}.
\]

4.1.4 Sampling properties of the maximum likelihood estimator

(See also Davison (2002), p. 118.) These proofs will not be examined, but you should have some idea of why the theorems below are true.

We have shown that under certain conditions the maximum likelihood estimator can often be the minimum variance unbiased estimator (for example, in the case of the exponential family of distributions). However, for finite samples the MLE may not attain the C-R lower bound; hence, for finite samples, $\mathrm{var}(\hat{\theta}_T)$ can be greater than $I(\theta)^{-1}$. However, it can be shown that asymptotically the variance of the MLE attains this lower bound. In other words, for large samples, the variance of the MLE is close to the Cramer-Rao bound. We will prove the result in the case that $\mathcal{L}_T$ is the log-likelihood of independent, identically distributed random variables. The proof can be generalised to the case of non-identically distributed random variables. We first state sufficient conditions for this to be true.

Assumption 4.1.1 (Regularity Conditions 2) Let $\{X_t\}$ be iid random variables with density $f(x;\theta)$.

(i) The conditions in Assumption 1.1.1 hold.

(ii) (Almost sure uniform convergence; this part is optional.) For every $\varepsilon>0$,
\[
P\Big( \limsup_{T\to\infty}\ \sup_{\theta\in\Theta}\ \Big| \frac{1}{T}\mathcal{L}_T(X;\theta) - E\Big(\frac{1}{T}\mathcal{L}_T(\theta)\Big) \Big| > \varepsilon \Big) = 0.
\]
We mention that directly verifying uniform convergence can be difficult. However, it can be established by showing that the parameter space is compact, together with pointwise convergence of the likelihood to its expectation and almost sure equicontinuity in probability.

(iii) (Model identifiability) For every $\theta\in\Theta$, there does not exist another $\tilde{\theta}\in\Theta$ such that $f(x;\theta) = f(x;\tilde{\theta})$ for all $x$.

(iv) The parameter space $\Theta$ is finite-dimensional and compact.

(v) $\sup_{\theta} E\big| \frac{1}{T}\mathcal{L}_T(X;\theta) \big| < \infty$.

We require Assumption 4.1.1(ii),(iii) to show consistency, and Assumptions 1.1.1 and 4.1.1(ii)-(iv) to show asymptotic normality.

Theorem (Consistency of the MLE) Suppose Assumption 4.1.1(ii),(iii) holds. Let $\theta_0$ be the true parameter and $\hat{\theta}_T$ be the MLE. Then $\hat{\theta}_T \overset{a.s.}{\to} \theta_0$ (consistency).

PROOF. To prove the result we first need to show that the expectation of the (normalised) log-likelihood is maximised at the true parameter and that this is the unique maximum. In other words, we need to show that $E\big(\frac{1}{T}\mathcal{L}_T(X;\theta)\big) - E\big(\frac{1}{T}\mathcal{L}_T(X;\theta_0)\big) \le 0$ for all $\theta\in\Theta$. To do this, note that
\[
E\Big(\frac{1}{T}\mathcal{L}_T(X;\theta)\Big) - E\Big(\frac{1}{T}\mathcal{L}_T(X;\theta_0)\Big) = \int \log\frac{f(x;\theta)}{f(x;\theta_0)}\, f(x;\theta_0)\,dx = E\Big( \log\frac{f(X;\theta)}{f(X;\theta_0)} \Big).
\]
Now by using Jensen's inequality we have
\[
E\Big( \log\frac{f(X;\theta)}{f(X;\theta_0)} \Big) \le \log E\Big( \frac{f(X;\theta)}{f(X;\theta_0)} \Big) = \log\int f(x;\theta)\,dx = 0,
\]
thus giving $E\big(\frac{1}{T}\mathcal{L}_T(X;\theta)\big) - E\big(\frac{1}{T}\mathcal{L}_T(X;\theta_0)\big) \le 0$. To prove that this difference equals zero only when $\theta = \theta_0$, we use the identifiability assumption, Assumption 4.1.1(iii), which means that $f(x;\theta) = f(x;\theta_0)$ for all $x$ only when $\theta = \theta_0$, and no other parameter value gives equality.

Hence $E\big(\frac{1}{T}\mathcal{L}_T(X;\theta)\big)$ is uniquely maximised at $\theta_0$. Finally, we need to show that $\hat{\theta}_T \to \theta_0$. Write $\ell(\theta) = E\big(\frac{1}{T}\mathcal{L}_T(X;\theta)\big)$. By Assumption 4.1.1(ii) (and also the LLN) we have, for all $\theta\in\Theta$, that $\frac{1}{T}\mathcal{L}_T(X;\theta) \overset{a.s.}{\to} \ell(\theta)$, uniformly over $\Theta$. Since $\hat{\theta}_T$ maximises $\mathcal{L}_T$, we have $\frac{1}{T}\mathcal{L}_T(X;\theta_0) \le \frac{1}{T}\mathcal{L}_T(X;\hat{\theta}_T)$. Therefore
\[
0 \le \ell(\theta_0) - \ell(\hat{\theta}_T)
= \Big\{ \ell(\theta_0) - \frac{1}{T}\mathcal{L}_T(X;\theta_0) \Big\}
+ \Big\{ \frac{1}{T}\mathcal{L}_T(X;\theta_0) - \frac{1}{T}\mathcal{L}_T(X;\hat{\theta}_T) \Big\}
+ \Big\{ \frac{1}{T}\mathcal{L}_T(X;\hat{\theta}_T) - \ell(\hat{\theta}_T) \Big\}
\]
\[
\le \Big\{ \ell(\theta_0) - \frac{1}{T}\mathcal{L}_T(X;\theta_0) \Big\} + \Big\{ \frac{1}{T}\mathcal{L}_T(X;\hat{\theta}_T) - \ell(\hat{\theta}_T) \Big\}
\le 2\sup_{\theta\in\Theta}\Big| \frac{1}{T}\mathcal{L}_T(X;\theta) - \ell(\theta) \Big| \overset{a.s.}{\to} 0,
\]
where the second inequality uses that the middle term is non-positive. Since $\ell(\theta)$ has a unique maximum at $\theta_0$, this implies $\hat{\theta}_T \overset{a.s.}{\to} \theta_0$. Hence we have shown consistency of the MLE.

We now need to show asymptotic normality.

Theorem (Asymptotic normality of the MLE) Suppose Assumption 4.1.1 is satisfied.
(i) The score statistic satisfies
\[
\frac{1}{\sqrt{T}}\frac{\partial \mathcal{L}_T(X;\theta)}{\partial\theta}\Big|_{\theta_0} \overset{D}{\to} N\Big( 0,\; E\Big(\frac{\partial \log f(X;\theta)}{\partial\theta}\Big|_{\theta_0}\Big)^2 \Big). \qquad (4.5)
\]
(ii) The MLE satisfies
\[
\sqrt{T}\big(\hat{\theta}_T - \theta_0\big) \overset{D}{\to} N\Big( 0,\; \Big[E\Big(\frac{\partial \log f(X;\theta)}{\partial\theta}\Big|_{\theta_0}\Big)^2\Big]^{-1} \Big).
\]

(iii) The log-likelihood ratio satisfies
\[
2\big( \mathcal{L}_T(X;\hat{\theta}_T) - \mathcal{L}_T(X;\theta_0) \big) \overset{D}{\to} \chi^2_p.
\]

PROOF. First we prove (i). We recall that because $\{X_t\}$ are iid random variables,
\[
\frac{1}{\sqrt{T}}\frac{\partial \mathcal{L}_T(X;\theta)}{\partial\theta}\Big|_{\theta_0} = \frac{1}{\sqrt{T}}\sum_{t=1}^{T}\frac{\partial \log f(X_t;\theta)}{\partial\theta}\Big|_{\theta_0}.
\]
Hence $\frac{\partial \mathcal{L}_T(X;\theta)}{\partial\theta}\big|_{\theta_0}$ is the sum of iid random variables with mean zero and variance $\mathrm{var}\big(\frac{\partial \log f(X_t;\theta)}{\partial\theta}\big|_{\theta_0}\big)$. Therefore, by the CLT for iid random variables, we have (4.5).

We use (i) and the Taylor mean value theorem to prove (ii). We first note that by the mean value theorem we have
\[
\frac{1}{\sqrt{T}}\frac{\partial \mathcal{L}_T(X;\theta)}{\partial\theta}\Big|_{\hat{\theta}_T}
= \frac{1}{\sqrt{T}}\frac{\partial \mathcal{L}_T(X;\theta)}{\partial\theta}\Big|_{\theta_0}
+ \sqrt{T}\big(\hat{\theta}_T-\theta_0\big)\frac{1}{T}\frac{\partial^2 \mathcal{L}_T(X;\theta)}{\partial\theta^2}\Big|_{\bar{\theta}_T}. \qquad (4.6)
\]
Now it can be shown (because $\Theta$ has compact support, $\hat{\theta}_T-\theta_0 \overset{a.s.}{\to} 0$, and the expectation of the third derivative of $\mathcal{L}_T$ is bounded) that
\[
\frac{1}{T}\frac{\partial^2 \mathcal{L}_T(X;\theta)}{\partial\theta^2}\Big|_{\bar{\theta}_T} \overset{P}{\to} \frac{1}{T}E\Big(\frac{\partial^2 \mathcal{L}_T(X;\theta)}{\partial\theta^2}\Big|_{\theta_0}\Big) = E\Big(\frac{\partial^2 \log f(X;\theta)}{\partial\theta^2}\Big|_{\theta_0}\Big). \qquad (4.7)
\]
Substituting (4.7) into (4.6), and noting that the left-hand side of (4.6) is zero, gives
\[
\sqrt{T}\big(\hat{\theta}_T-\theta_0\big) = -\Big( \frac{1}{T}\frac{\partial^2 \mathcal{L}_T(X;\theta)}{\partial\theta^2}\Big|_{\bar{\theta}_T} \Big)^{-1} \frac{1}{\sqrt{T}}\frac{\partial \mathcal{L}_T(X;\theta)}{\partial\theta}\Big|_{\theta_0}
= -\Big( E\Big(\frac{\partial^2 \log f(X;\theta)}{\partial\theta^2}\Big|_{\theta_0}\Big) \Big)^{-1} \frac{1}{\sqrt{T}}\frac{\partial \mathcal{L}_T(X;\theta)}{\partial\theta}\Big|_{\theta_0} + o_p(1).
\]
We mention that the proof above is for a univariate second derivative $\frac{\partial^2 \mathcal{L}_T(X;\theta)}{\partial\theta^2}$, but by redoing the above steps pointwise it can easily be generalised to the multivariate case too. Hence, by substituting (4.5) into the above, we have (ii).

It is straightforward to prove (iii) by using
\[
2\big( \mathcal{L}_T(X;\hat{\theta}_T) - \mathcal{L}_T(X;\theta_0) \big) \approx \big(\hat{\theta}_T-\theta_0\big)'I(\theta_0)\big(\hat{\theta}_T-\theta_0\big),
\]
together with (i) and the result that if $X\sim N(0,\Sigma)$ then $A'X \sim N(0,A'\Sigma A)$.

Example (The Weibull) By using the expansion obtained for the Weibull in Section 4.1.3, we have
\[
\sqrt{T}\begin{pmatrix} \hat{\theta}_T - \theta \\ \hat{\alpha}_T - \alpha \end{pmatrix}
\approx I(\theta,\alpha)^{-1}\,\frac{1}{\sqrt{T}}\sum_{t=1}^{T}
\begin{pmatrix} -\dfrac{\alpha}{\theta} + \dfrac{\alpha}{\theta^{\alpha+1}}Y_t^{\alpha} \\
\dfrac{1}{\alpha} + \log Y_t - \log\theta - \log\big(\tfrac{Y_t}{\theta}\big)\big(\tfrac{Y_t}{\theta}\big)^{\alpha} \end{pmatrix},
\]
where $I(\theta,\alpha)$ here denotes the Fisher information of a single observation.

Now we observe that the right-hand side consists of a standardised sum of iid random variables (it can be viewed as an average). Since the variance of each summand exists (you can show that it is $I(\theta,\alpha)$), the CLT can be applied and we have
\[
\sqrt{T}\begin{pmatrix} \hat{\theta}_T - \theta \\ \hat{\alpha}_T - \alpha \end{pmatrix} \overset{D}{\to} N\big( 0,\; I(\theta,\alpha)^{-1} \big).
\]

Remark
(i) We recall that for iid random variables the Fisher information for sample size $T$ is
\[
I(\theta) = E\Big( \frac{\partial \log L_T(X;\theta)}{\partial\theta}\Big|_{\theta_0} \Big)^2 = T\,E\Big( \frac{\partial \log f(X;\theta)}{\partial\theta}\Big|_{\theta_0} \Big)^2.
\]
Hence, comparing with the above theorem, we see that for iid random variables (so long as the regularity conditions are satisfied) the MLE asymptotically attains the Cramer-Rao bound, even if for finite samples this may not be true. Moreover, since $\hat{\theta}_T - \theta_0 \approx I(\theta_0)^{-1}\frac{\partial \mathcal{L}_T(\theta)}{\partial\theta}\big|_{\theta_0}$ and $\mathrm{var}\big( \frac{\partial \mathcal{L}_T(\theta)}{\partial\theta}\big|_{\theta_0} \big) = I(\theta_0)$, it can be seen that $\hat{\theta}_T - \theta_0 = O_p(T^{-1/2})$.

(ii) Under suitable conditions a similar result holds true for data which are not iid.

In summary, the MLE under certain regularity conditions tends to have the smallest variance, and for large samples the variance is close to the lower bound, which is the Cramer-Rao bound. In the case that Assumption 4.1.1 is satisfied, the MLE is said to be asymptotically efficient. This means that for finite samples the MLE may not attain the C-R bound, but asymptotically it will.

(iii) A simple application of the asymptotic normality theorem is the derivation of the distribution of $I(\theta_0)^{1/2}(\hat{\theta}_T-\theta_0)$. It is clear that by using the theorem we have
\[
I(\theta_0)^{1/2}\big(\hat{\theta}_T-\theta_0\big) \overset{D}{\to} N(0, I_p),
\]
where $I_p$ is the identity matrix, and
\[
\big(\hat{\theta}_T-\theta_0\big)'I(\theta_0)\big(\hat{\theta}_T-\theta_0\big) \overset{D}{\to} \chi^2_p.
\]

(iv) Note that these results apply when $\theta_0$ lies inside the parameter space $\Theta$. As $\theta_0$ gets closer to the boundary of the parameter space ...

Remark (Generalised estimating equations) Closely related to the MLE are generalised estimating equations (GEEs), which are related to the score statistic. These are estimators not based on maximising the likelihood, but on equating a score-like statistic (the derivative of a likelihood) to zero and solving for the unknown parameters. Often they are equivalent to the MLE, but they can be adapted to be useful in themselves, and some adaptations will not be the derivative of any likelihood.

4.1.5 The Fisher information

(See also Section 4.3, Davison (2002).)

Let us return to the Fisher information. We recall that under certain regularity conditions an unbiased estimator $\tilde{\theta}(X)$ of a parameter $\theta_0$ is such that $\mathrm{var}(\tilde{\theta}(X)) \ge I(\theta_0)^{-1}$, where
\[
I(\theta) = E\Big( \frac{\partial \mathcal{L}_T(\theta)}{\partial\theta} \Big)^2 = -E\Big( \frac{\partial^2 \mathcal{L}_T(\theta)}{\partial\theta^2} \Big)
\]
is the Fisher information. Furthermore, under suitable regularity conditions, the MLE will asymptotically attain this bound. It is reasonable to ask how one can interpret this bound.

(i) Situation 1. $I(\theta_0) = -E\big(\frac{\partial^2 \mathcal{L}_T(\theta)}{\partial\theta^2}\big|_{\theta_0}\big)$ is large (hence the variance of the MLE will be small). This means that the expected curvature of $\mathcal{L}_T(\theta)$ around $\theta_0$ is large, so the gradient $\frac{\partial \mathcal{L}_T(\theta)}{\partial\theta}$ is steep. Hence, even for small deviations from $\theta_0$, $\frac{\partial \mathcal{L}_T(\theta)}{\partial\theta}$ is likely to be far from zero. This means the MLE $\hat{\theta}_T$ is likely to be in a close neighbourhood of $\theta_0$.

(ii) Situation 2. $I(\theta_0) = -E\big(\frac{\partial^2 \mathcal{L}_T(\theta)}{\partial\theta^2}\big|_{\theta_0}\big)$ is small (hence the variance of the MLE will be large). In this case the gradient of the likelihood $\frac{\partial \mathcal{L}_T(\theta)}{\partial\theta}$ is flatter, and hence $\frac{\partial \mathcal{L}_T(\theta)}{\partial\theta} \approx 0$ over a large neighbourhood of the true parameter $\theta_0$. Therefore the MLE $\hat{\theta}_T$ can lie in a large neighbourhood of $\theta_0$.

This is one explanation as to why $I(\theta)$ is called the Fisher information: it contains information on how close any estimator can get to $\theta_0$. Look at the censoring example, Example 4.20, page 112, Davison (2002).

Chapter 5
Confidence Intervals

5.1 Confidence intervals and testing

We first summarise the results in the previous section which will be useful in this section. For convenience, we will assume that the likelihood is for iid random variables whose density is $f(x;\theta_0)$ (though it is relatively simple to see how this can be generalised to general likelihoods of not necessarily iid random variables). Let us suppose that $\theta_0$ is the true parameter that we wish to estimate. Based on the asymptotic normality theorem of Section 4.1.4 we have
\[
\sqrt{T}\big(\hat{\theta}_T-\theta_0\big) \overset{D}{\to} N\Big( 0,\; \Big[E\Big(\frac{\partial \log f(X;\theta)}{\partial\theta}\Big|_{\theta_0}\Big)^2\Big]^{-1} \Big), \qquad (5.1)
\]
\[
\frac{1}{\sqrt{T}}\frac{\partial \mathcal{L}_T}{\partial\theta}\Big|_{\theta=\theta_0} \overset{D}{\to} N\Big( 0,\; E\Big(\frac{\partial \log f(X;\theta)}{\partial\theta}\Big|_{\theta_0}\Big)^2 \Big), \qquad (5.2)
\]
and
\[
2\big( \mathcal{L}_T(\hat{\theta}_T) - \mathcal{L}_T(\theta_0) \big) \overset{D}{\to} \chi^2_p, \qquad (5.3)
\]
where $p$ is the number of parameters in the vector $\theta$. Using any of (5.1), (5.2) and (5.3) we can construct 95% CIs for $\theta_0$.

5.1.1 Constructing confidence intervals using the likelihood

(See also Section 4.5, Davison (2002).)

One of the main reasons that we show asymptotic normality of an estimator (it is usually not possible to derive exact normality for finite samples) is to construct confidence intervals (CIs) and to test.

In the case that $\theta_0$ is a scalar (a vector of dimension one), it is easy to use (5.1) to obtain
\[
\sqrt{T}\,\Big[E\Big(\frac{\partial \log f(X;\theta)}{\partial\theta}\Big|_{\theta_0}\Big)^2\Big]^{1/2}\big(\hat{\theta}_T-\theta_0\big) \overset{D}{\to} N(0,1). \qquad (5.4)
\]
Based on the above, the 95% CI for $\theta_0$ is
\[
\Big[ \hat{\theta}_T - \frac{1}{\sqrt{T}}\Big[E\Big(\frac{\partial \log f(X;\theta)}{\partial\theta}\Big|_{\theta_0}\Big)^2\Big]^{-1/2} z_{\alpha/2},\;\;
\hat{\theta}_T + \frac{1}{\sqrt{T}}\Big[E\Big(\frac{\partial \log f(X;\theta)}{\partial\theta}\Big|_{\theta_0}\Big)^2\Big]^{-1/2} z_{\alpha/2} \Big].
\]
The above, of course, requires an estimate of the standardised Fisher information $E\big(\frac{\partial \log f(X;\theta)}{\partial\theta}\big|_{\theta_0}\big)^2 = -E\big(\frac{\partial^2 \log f(X;\theta)}{\partial\theta^2}\big|_{\theta_0}\big)$. Usually we evaluate the second derivative of $\frac{1}{T}\log L_T(\theta) = \frac{1}{T}\mathcal{L}_T(\theta)$ and replace $\theta$ with the estimator $\hat{\theta}_T$.

Exercise: Use (5.2) to construct a CI for $\theta_0$ based on the score.

The CI constructed above works well if $\theta$ is a scalar. But beyond dimension one, constructing a CI based on (5.1) and the $p$-dimensional normal is extremely difficult. More precisely, if $\theta_0$ is a $p$-dimensional vector then the analogous version of (5.4) is
\[
\sqrt{T}\,\Big[E\Big(\frac{\partial \log f(X;\theta)}{\partial\theta}\Big|_{\theta_0}\frac{\partial \log f(X;\theta)}{\partial\theta}\Big|_{\theta_0}'\Big)\Big]^{1/2}\big(\hat{\theta}_T-\theta_0\big) \overset{D}{\to} N(0,I_p),
\]
and using this it is difficult to obtain the CI of $\theta_0$. One way to construct the CI is to "square" $\hat{\theta}_T-\theta_0$ and use
\[
T\big(\hat{\theta}_T-\theta_0\big)'\,E\Big(\frac{\partial \log f(X;\theta)}{\partial\theta}\Big|_{\theta_0}\frac{\partial \log f(X;\theta)}{\partial\theta}\Big|_{\theta_0}'\Big)\,\big(\hat{\theta}_T-\theta_0\big) \overset{D}{\to} \chi^2_p. \qquad (5.5)
\]
Based on the above, a 95% CI is
\[
\Big\{ \theta:\; T\big(\hat{\theta}_T-\theta\big)'\,E\Big(\frac{\partial \log f(X;\theta)}{\partial\theta}\Big|_{\theta_0}\frac{\partial \log f(X;\theta)}{\partial\theta}\Big|_{\theta_0}'\Big)\,\big(\hat{\theta}_T-\theta\big) \le \chi^2_p(0.95) \Big\}. \qquad (5.6)
\]
Note that, as in the scalar case, this leads to the interval with the smallest length. A disadvantage of (5.6) is that we have to (a) estimate the information matrix and (b) try to find all $\theta$ such that the above holds. This can be quite unwieldy. An alternative method, which is asymptotically equivalent to the above but removes the need to estimate the information matrix, is to use (5.3). By using (5.3), a $100(1-\alpha)\%$ CI for $\theta_0$ is
\[
\Big\{ \theta:\; 2\big( \mathcal{L}_T(\hat{\theta}_T) - \mathcal{L}_T(\theta) \big) \le \chi^2_p(1-\alpha) \Big\}. \qquad (5.7)
\]
The above is not easy to calculate, but it is feasible.
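The following is a minimal sketch (with simulated exponential data, purely for illustration) comparing the two intervals for a scalar parameter: the symmetric interval based on (5.4) and the likelihood-ratio interval (5.7). For iid data with density $f(x;\theta)=\theta^{-1}\exp(-x/\theta)$, the MLE is the sample mean and the per-observation Fisher information is $1/\theta^2$.

```r
# Wald-type and likelihood-ratio 95% confidence intervals for the exponential mean
set.seed(5)
x <- rexp(100, rate = 1 / 3)                 # true theta = 3 (mean parameterisation)
T.len <- length(x)
loglik <- function(theta) sum(-log(theta) - x / theta)
theta.hat <- mean(x)                         # MLE of theta

# Interval based on (5.4): theta.hat +/- z_{alpha/2} / sqrt(T * I(theta)), I(theta) = 1/theta^2
se <- theta.hat / sqrt(T.len)
wald.ci <- theta.hat + c(-1, 1) * qnorm(0.975) * se

# Interval based on (5.7): all theta with 2(loglik(theta.hat) - loglik(theta)) <= chi^2_1(0.95)
grid <- seq(theta.hat / 2, theta.hat * 2, length.out = 2000)
keep <- 2 * (loglik(theta.hat) - sapply(grid, loglik)) <= qchisq(0.95, df = 1)
lr.ci <- range(grid[keep])

rbind(wald = wald.ci, likelihood.ratio = lr.ci)
```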

Example In the case that $\theta_0$ is a scalar, the 95% CI based on (5.7) is
\[
\Big\{ \theta:\; \mathcal{L}_T(\theta) \ge \mathcal{L}_T(\hat{\theta}_T) - \tfrac{1}{2}\chi^2_1(0.95) \Big\}.
\]

Both the 95% CIs in (5.6) and (5.7) will be very close for relatively large sample sizes. However, one advantage of using (5.7) instead of (5.6) is that it is easier to evaluate: there is no need to obtain the second derivative of the likelihood, etc. Another feature which differentiates the CIs in (5.6) and (5.7) is that the CI based on (5.6) is symmetric about $\hat{\theta}_T$ (recall that $(\bar{X}-1.96\sigma/\sqrt{T}, \bar{X}+1.96\sigma/\sqrt{T})$ is symmetric about $\bar{X}$), whereas the symmetry may not hold for small sample sizes when constructing a CI for $\theta_0$ using (5.7). This is a positive advantage of using (5.7) instead of (5.6). A disadvantage of using (5.7) instead of (5.6) is that the CI based on (5.7) may sometimes consist of more than one interval.

As you can see, if the dimension of $\theta$ is large it is quite difficult to evaluate the CI (try it for the simple case that the dimension is two!); indeed, for dimensions greater than three it is extremely hard. However, in most cases we are only interested in constructing CIs for certain parameters of interest; the other unknown parameters are simply nuisance parameters and CIs for them are not of interest. For example, for the normal distribution we may only be interested in CIs for the mean but not the variance. It is clear that directly using the log-likelihood ratio to construct CIs (and also to test) will mean also constructing CIs for the nuisance parameters. Therefore, in a later section we construct a variant of the likelihood, called the profile likelihood, which allows us to deal with nuisance parameters in a more efficient way.

5.1.2 Testing using the likelihood

Let us suppose we wish to test the hypothesis $H_0: \theta = \theta_0$ against the alternative $H_A: \theta \ne \theta_0$. We can use any of the results in (5.1), (5.2) and (5.3) to do the test; they will lead to slightly different p-values, but asymptotically they are all equivalent, because they are all based essentially on the same derivation. We now list the three tests that one can use.

The Wald test

The Wald statistic is based on (5.1). We recall from (5.1) that if the null is true, then we have
\[
\sqrt{T}\big(\hat{\theta}_T-\theta_0\big) \overset{D}{\to} N\Big( 0,\; \Big[E\Big(\frac{\partial \log f(X;\theta)}{\partial\theta}\Big|_{\theta_0}\Big)^2\Big]^{-1} \Big).
\]
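A minimal sketch (with simulated data and a hypothesised value chosen purely for illustration) of the resulting test for a scalar parameter, continuing the exponential example above: squaring the standardised difference gives a statistic that is approximately $\chi^2_1$ under $H_0$, with the information estimated at the MLE.

```r
# Wald test of H_0: theta = theta_0 for iid exponential data with mean theta
set.seed(6)
x <- rexp(100, rate = 1 / 3)
theta.hat <- mean(x)                        # MLE of theta
theta0 <- 2.5                               # hypothesised value (illustrative)

# T * I(theta.hat) * (theta.hat - theta0)^2, with I(theta) = 1/theta^2 per observation
wald.stat <- length(x) * (theta.hat - theta0)^2 / theta.hat^2
p.value <- pchisq(wald.stat, df = 1, lower.tail = FALSE)
c(wald.statistic = wald.stat, p.value = p.value)
```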


More information

The VAR models discussed so fare are appropriate for modeling I(0) data, like asset returns or growth rates of macroeconomic time series.

The VAR models discussed so fare are appropriate for modeling I(0) data, like asset returns or growth rates of macroeconomic time series. Cointegration The VAR models discussed so fare are appropriate for modeling I(0) data, like asset returns or growth rates of macroeconomic time series. Economic theory, however, often implies equilibrium

More information

Efficiency and the Cramér-Rao Inequality

Efficiency and the Cramér-Rao Inequality Chapter Efficiency and the Cramér-Rao Inequality Clearly we would like an unbiased estimator ˆφ (X of φ (θ to produce, in the long run, estimates which are fairly concentrated i.e. have high precision.

More information

Algebra Unpacked Content For the new Common Core standards that will be effective in all North Carolina schools in the 2012-13 school year.

Algebra Unpacked Content For the new Common Core standards that will be effective in all North Carolina schools in the 2012-13 school year. This document is designed to help North Carolina educators teach the Common Core (Standard Course of Study). NCDPI staff are continually updating and improving these tools to better serve teachers. Algebra

More information

LOGNORMAL MODEL FOR STOCK PRICES

LOGNORMAL MODEL FOR STOCK PRICES LOGNORMAL MODEL FOR STOCK PRICES MICHAEL J. SHARPE MATHEMATICS DEPARTMENT, UCSD 1. INTRODUCTION What follows is a simple but important model that will be the basis for a later study of stock prices as

More information

Stat 704 Data Analysis I Probability Review

Stat 704 Data Analysis I Probability Review 1 / 30 Stat 704 Data Analysis I Probability Review Timothy Hanson Department of Statistics, University of South Carolina Course information 2 / 30 Logistics: Tuesday/Thursday 11:40am to 12:55pm in LeConte

More information

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not. Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C

More information

CHAPTER II THE LIMIT OF A SEQUENCE OF NUMBERS DEFINITION OF THE NUMBER e.

CHAPTER II THE LIMIT OF A SEQUENCE OF NUMBERS DEFINITION OF THE NUMBER e. CHAPTER II THE LIMIT OF A SEQUENCE OF NUMBERS DEFINITION OF THE NUMBER e. This chapter contains the beginnings of the most important, and probably the most subtle, notion in mathematical analysis, i.e.,

More information

MATH10212 Linear Algebra. Systems of Linear Equations. Definition. An n-dimensional vector is a row or a column of n numbers (or letters): a 1.

MATH10212 Linear Algebra. Systems of Linear Equations. Definition. An n-dimensional vector is a row or a column of n numbers (or letters): a 1. MATH10212 Linear Algebra Textbook: D. Poole, Linear Algebra: A Modern Introduction. Thompson, 2006. ISBN 0-534-40596-7. Systems of Linear Equations Definition. An n-dimensional vector is a row or a column

More information

MULTIVARIATE PROBABILITY DISTRIBUTIONS

MULTIVARIATE PROBABILITY DISTRIBUTIONS MULTIVARIATE PROBABILITY DISTRIBUTIONS. PRELIMINARIES.. Example. Consider an experiment that consists of tossing a die and a coin at the same time. We can consider a number of random variables defined

More information

Lecture Notes 1. Brief Review of Basic Probability

Lecture Notes 1. Brief Review of Basic Probability Probability Review Lecture Notes Brief Review of Basic Probability I assume you know basic probability. Chapters -3 are a review. I will assume you have read and understood Chapters -3. Here is a very

More information

CHAPTER 2 Estimating Probabilities

CHAPTER 2 Estimating Probabilities CHAPTER 2 Estimating Probabilities Machine Learning Copyright c 2016. Tom M. Mitchell. All rights reserved. *DRAFT OF January 24, 2016* *PLEASE DO NOT DISTRIBUTE WITHOUT AUTHOR S PERMISSION* This is a

More information

How To Prove The Dirichlet Unit Theorem

How To Prove The Dirichlet Unit Theorem Chapter 6 The Dirichlet Unit Theorem As usual, we will be working in the ring B of algebraic integers of a number field L. Two factorizations of an element of B are regarded as essentially the same if

More information

PUTNAM TRAINING POLYNOMIALS. Exercises 1. Find a polynomial with integral coefficients whose zeros include 2 + 5.

PUTNAM TRAINING POLYNOMIALS. Exercises 1. Find a polynomial with integral coefficients whose zeros include 2 + 5. PUTNAM TRAINING POLYNOMIALS (Last updated: November 17, 2015) Remark. This is a list of exercises on polynomials. Miguel A. Lerma Exercises 1. Find a polynomial with integral coefficients whose zeros include

More information

Probability and Statistics Prof. Dr. Somesh Kumar Department of Mathematics Indian Institute of Technology, Kharagpur

Probability and Statistics Prof. Dr. Somesh Kumar Department of Mathematics Indian Institute of Technology, Kharagpur Probability and Statistics Prof. Dr. Somesh Kumar Department of Mathematics Indian Institute of Technology, Kharagpur Module No. #01 Lecture No. #15 Special Distributions-VI Today, I am going to introduce

More information

Differentiating under an integral sign

Differentiating under an integral sign CALIFORNIA INSTITUTE OF TECHNOLOGY Ma 2b KC Border Introduction to Probability and Statistics February 213 Differentiating under an integral sign In the derivation of Maximum Likelihood Estimators, or

More information

Ideal Class Group and Units

Ideal Class Group and Units Chapter 4 Ideal Class Group and Units We are now interested in understanding two aspects of ring of integers of number fields: how principal they are (that is, what is the proportion of principal ideals

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 5 9/17/2008 RANDOM VARIABLES

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 5 9/17/2008 RANDOM VARIABLES MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 5 9/17/2008 RANDOM VARIABLES Contents 1. Random variables and measurable functions 2. Cumulative distribution functions 3. Discrete

More information

Fairfield Public Schools

Fairfield Public Schools Mathematics Fairfield Public Schools AP Statistics AP Statistics BOE Approved 04/08/2014 1 AP STATISTICS Critical Areas of Focus AP Statistics is a rigorous course that offers advanced students an opportunity

More information

SAS Certificate Applied Statistics and SAS Programming

SAS Certificate Applied Statistics and SAS Programming SAS Certificate Applied Statistics and SAS Programming SAS Certificate Applied Statistics and Advanced SAS Programming Brigham Young University Department of Statistics offers an Applied Statistics and

More information

Assessing the Relative Power of Structural Break Tests Using a Framework Based on the Approximate Bahadur Slope

Assessing the Relative Power of Structural Break Tests Using a Framework Based on the Approximate Bahadur Slope Assessing the Relative Power of Structural Break Tests Using a Framework Based on the Approximate Bahadur Slope Dukpa Kim Boston University Pierre Perron Boston University December 4, 2006 THE TESTING

More information

Econometrics Simple Linear Regression

Econometrics Simple Linear Regression Econometrics Simple Linear Regression Burcu Eke UC3M Linear equations with one variable Recall what a linear equation is: y = b 0 + b 1 x is a linear equation with one variable, or equivalently, a straight

More information

State Space Time Series Analysis

State Space Time Series Analysis State Space Time Series Analysis p. 1 State Space Time Series Analysis Siem Jan Koopman http://staff.feweb.vu.nl/koopman Department of Econometrics VU University Amsterdam Tinbergen Institute 2011 State

More information

Vector and Matrix Norms

Vector and Matrix Norms Chapter 1 Vector and Matrix Norms 11 Vector Spaces Let F be a field (such as the real numbers, R, or complex numbers, C) with elements called scalars A Vector Space, V, over the field F is a non-empty

More information

Spatial Statistics Chapter 3 Basics of areal data and areal data modeling

Spatial Statistics Chapter 3 Basics of areal data and areal data modeling Spatial Statistics Chapter 3 Basics of areal data and areal data modeling Recall areal data also known as lattice data are data Y (s), s D where D is a discrete index set. This usually corresponds to data

More information

4. Continuous Random Variables, the Pareto and Normal Distributions

4. Continuous Random Variables, the Pareto and Normal Distributions 4. Continuous Random Variables, the Pareto and Normal Distributions A continuous random variable X can take any value in a given range (e.g. height, weight, age). The distribution of a continuous random

More information

Sections 2.11 and 5.8

Sections 2.11 and 5.8 Sections 211 and 58 Timothy Hanson Department of Statistics, University of South Carolina Stat 704: Data Analysis I 1/25 Gesell data Let X be the age in in months a child speaks his/her first word and

More information

Maximum likelihood estimation of mean reverting processes

Maximum likelihood estimation of mean reverting processes Maximum likelihood estimation of mean reverting processes José Carlos García Franco Onward, Inc. jcpollo@onwardinc.com Abstract Mean reverting processes are frequently used models in real options. For

More information

From the help desk: Bootstrapped standard errors

From the help desk: Bootstrapped standard errors The Stata Journal (2003) 3, Number 1, pp. 71 80 From the help desk: Bootstrapped standard errors Weihua Guan Stata Corporation Abstract. Bootstrapping is a nonparametric approach for evaluating the distribution

More information

Continued Fractions and the Euclidean Algorithm

Continued Fractions and the Euclidean Algorithm Continued Fractions and the Euclidean Algorithm Lecture notes prepared for MATH 326, Spring 997 Department of Mathematics and Statistics University at Albany William F Hammond Table of Contents Introduction

More information

Dongfeng Li. Autumn 2010

Dongfeng Li. Autumn 2010 Autumn 2010 Chapter Contents Some statistics background; ; Comparing means and proportions; variance. Students should master the basic concepts, descriptive statistics measures and graphs, basic hypothesis

More information

Factor analysis. Angela Montanari

Factor analysis. Angela Montanari Factor analysis Angela Montanari 1 Introduction Factor analysis is a statistical model that allows to explain the correlations between a large number of observed correlated variables through a small number

More information

**BEGINNING OF EXAMINATION** The annual number of claims for an insured has probability function: , 0 < q < 1.

**BEGINNING OF EXAMINATION** The annual number of claims for an insured has probability function: , 0 < q < 1. **BEGINNING OF EXAMINATION** 1. You are given: (i) The annual number of claims for an insured has probability function: 3 p x q q x x ( ) = ( 1 ) 3 x, x = 0,1,, 3 (ii) The prior density is π ( q) = q,

More information

Math 4310 Handout - Quotient Vector Spaces

Math 4310 Handout - Quotient Vector Spaces Math 4310 Handout - Quotient Vector Spaces Dan Collins The textbook defines a subspace of a vector space in Chapter 4, but it avoids ever discussing the notion of a quotient space. This is understandable

More information

Detekce změn v autoregresních posloupnostech

Detekce změn v autoregresních posloupnostech Nové Hrady 2012 Outline 1 Introduction 2 3 4 Change point problem (retrospective) The data Y 1,..., Y n follow a statistical model, which may change once or several times during the observation period

More information

Machine Learning and Pattern Recognition Logistic Regression

Machine Learning and Pattern Recognition Logistic Regression Machine Learning and Pattern Recognition Logistic Regression Course Lecturer:Amos J Storkey Institute for Adaptive and Neural Computation School of Informatics University of Edinburgh Crichton Street,

More information

University of Ljubljana Doctoral Programme in Statistics Methodology of Statistical Research Written examination February 14 th, 2014.

University of Ljubljana Doctoral Programme in Statistics Methodology of Statistical Research Written examination February 14 th, 2014. University of Ljubljana Doctoral Programme in Statistics ethodology of Statistical Research Written examination February 14 th, 2014 Name and surname: ID number: Instructions Read carefully the wording

More information

E3: PROBABILITY AND STATISTICS lecture notes

E3: PROBABILITY AND STATISTICS lecture notes E3: PROBABILITY AND STATISTICS lecture notes 2 Contents 1 PROBABILITY THEORY 7 1.1 Experiments and random events............................ 7 1.2 Certain event. Impossible event............................

More information

LOGISTIC REGRESSION. Nitin R Patel. where the dependent variable, y, is binary (for convenience we often code these values as

LOGISTIC REGRESSION. Nitin R Patel. where the dependent variable, y, is binary (for convenience we often code these values as LOGISTIC REGRESSION Nitin R Patel Logistic regression extends the ideas of multiple linear regression to the situation where the dependent variable, y, is binary (for convenience we often code these values

More information

Logistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression

Logistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression Logistic Regression Department of Statistics The Pennsylvania State University Email: jiali@stat.psu.edu Logistic Regression Preserve linear classification boundaries. By the Bayes rule: Ĝ(x) = arg max

More information

SF2940: Probability theory Lecture 8: Multivariate Normal Distribution

SF2940: Probability theory Lecture 8: Multivariate Normal Distribution SF2940: Probability theory Lecture 8: Multivariate Normal Distribution Timo Koski 24.09.2015 Timo Koski Matematisk statistik 24.09.2015 1 / 1 Learning outcomes Random vectors, mean vector, covariance matrix,

More information

The Heat Equation. Lectures INF2320 p. 1/88

The Heat Equation. Lectures INF2320 p. 1/88 The Heat Equation Lectures INF232 p. 1/88 Lectures INF232 p. 2/88 The Heat Equation We study the heat equation: u t = u xx for x (,1), t >, (1) u(,t) = u(1,t) = for t >, (2) u(x,) = f(x) for x (,1), (3)

More information

Summary of Formulas and Concepts. Descriptive Statistics (Ch. 1-4)

Summary of Formulas and Concepts. Descriptive Statistics (Ch. 1-4) Summary of Formulas and Concepts Descriptive Statistics (Ch. 1-4) Definitions Population: The complete set of numerical information on a particular quantity in which an investigator is interested. We assume

More information

Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics

Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics For 2015 Examinations Aim The aim of the Probability and Mathematical Statistics subject is to provide a grounding in

More information

An extension of the factoring likelihood approach for non-monotone missing data

An extension of the factoring likelihood approach for non-monotone missing data An extension of the factoring likelihood approach for non-monotone missing data Jae Kwang Kim Dong Wan Shin January 14, 2010 ABSTRACT We address the problem of parameter estimation in multivariate distributions

More information

Notes from Week 1: Algorithms for sequential prediction

Notes from Week 1: Algorithms for sequential prediction CS 683 Learning, Games, and Electronic Markets Spring 2007 Notes from Week 1: Algorithms for sequential prediction Instructor: Robert Kleinberg 22-26 Jan 2007 1 Introduction In this course we will be looking

More information

CITY UNIVERSITY LONDON. BEng Degree in Computer Systems Engineering Part II BSc Degree in Computer Systems Engineering Part III PART 2 EXAMINATION

CITY UNIVERSITY LONDON. BEng Degree in Computer Systems Engineering Part II BSc Degree in Computer Systems Engineering Part III PART 2 EXAMINATION No: CITY UNIVERSITY LONDON BEng Degree in Computer Systems Engineering Part II BSc Degree in Computer Systems Engineering Part III PART 2 EXAMINATION ENGINEERING MATHEMATICS 2 (resit) EX2005 Date: August

More information

Pattern Analysis. Logistic Regression. 12. Mai 2009. Joachim Hornegger. Chair of Pattern Recognition Erlangen University

Pattern Analysis. Logistic Regression. 12. Mai 2009. Joachim Hornegger. Chair of Pattern Recognition Erlangen University Pattern Analysis Logistic Regression 12. Mai 2009 Joachim Hornegger Chair of Pattern Recognition Erlangen University Pattern Analysis 2 / 43 1 Logistic Regression Posteriors and the Logistic Function Decision

More information

APPLIED MATHEMATICS ADVANCED LEVEL

APPLIED MATHEMATICS ADVANCED LEVEL APPLIED MATHEMATICS ADVANCED LEVEL INTRODUCTION This syllabus serves to examine candidates knowledge and skills in introductory mathematical and statistical methods, and their applications. For applications

More information

The equivalence of logistic regression and maximum entropy models

The equivalence of logistic regression and maximum entropy models The equivalence of logistic regression and maximum entropy models John Mount September 23, 20 Abstract As our colleague so aptly demonstrated ( http://www.win-vector.com/blog/20/09/the-simplerderivation-of-logistic-regression/

More information

9.2 Summation Notation

9.2 Summation Notation 9. Summation Notation 66 9. Summation Notation In the previous section, we introduced sequences and now we shall present notation and theorems concerning the sum of terms of a sequence. We begin with a

More information

Introduction to Detection Theory

Introduction to Detection Theory Introduction to Detection Theory Reading: Ch. 3 in Kay-II. Notes by Prof. Don Johnson on detection theory, see http://www.ece.rice.edu/~dhj/courses/elec531/notes5.pdf. Ch. 10 in Wasserman. EE 527, Detection

More information

STATISTICA Formula Guide: Logistic Regression. Table of Contents

STATISTICA Formula Guide: Logistic Regression. Table of Contents : Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary

More information

4: SINGLE-PERIOD MARKET MODELS

4: SINGLE-PERIOD MARKET MODELS 4: SINGLE-PERIOD MARKET MODELS Ben Goldys and Marek Rutkowski School of Mathematics and Statistics University of Sydney Semester 2, 2015 B. Goldys and M. Rutkowski (USydney) Slides 4: Single-Period Market

More information

Simple Linear Regression Inference

Simple Linear Regression Inference Simple Linear Regression Inference 1 Inference requirements The Normality assumption of the stochastic term e is needed for inference even if it is not a OLS requirement. Therefore we have: Interpretation

More information

2DI36 Statistics. 2DI36 Part II (Chapter 7 of MR)

2DI36 Statistics. 2DI36 Part II (Chapter 7 of MR) 2DI36 Statistics 2DI36 Part II (Chapter 7 of MR) What Have we Done so Far? Last time we introduced the concept of a dataset and seen how we can represent it in various ways But, how did this dataset came

More information

MAS2317/3317. Introduction to Bayesian Statistics. More revision material

MAS2317/3317. Introduction to Bayesian Statistics. More revision material MAS2317/3317 Introduction to Bayesian Statistics More revision material Dr. Lee Fawcett, 2014 2015 1 Section A style questions 1. Describe briefly the frequency, classical and Bayesian interpretations

More information

THE FUNDAMENTAL THEOREM OF ARBITRAGE PRICING

THE FUNDAMENTAL THEOREM OF ARBITRAGE PRICING THE FUNDAMENTAL THEOREM OF ARBITRAGE PRICING 1. Introduction The Black-Scholes theory, which is the main subject of this course and its sequel, is based on the Efficient Market Hypothesis, that arbitrages

More information

t := maxγ ν subject to ν {0,1,2,...} and f(x c +γ ν d) f(x c )+cγ ν f (x c ;d).

t := maxγ ν subject to ν {0,1,2,...} and f(x c +γ ν d) f(x c )+cγ ν f (x c ;d). 1. Line Search Methods Let f : R n R be given and suppose that x c is our current best estimate of a solution to P min x R nf(x). A standard method for improving the estimate x c is to choose a direction

More information

4.5 Linear Dependence and Linear Independence

4.5 Linear Dependence and Linear Independence 4.5 Linear Dependence and Linear Independence 267 32. {v 1, v 2 }, where v 1, v 2 are collinear vectors in R 3. 33. Prove that if S and S are subsets of a vector space V such that S is a subset of S, then

More information

The Basics of Graphical Models

The Basics of Graphical Models The Basics of Graphical Models David M. Blei Columbia University October 3, 2015 Introduction These notes follow Chapter 2 of An Introduction to Probabilistic Graphical Models by Michael Jordan. Many figures

More information

Quotient Rings and Field Extensions

Quotient Rings and Field Extensions Chapter 5 Quotient Rings and Field Extensions In this chapter we describe a method for producing field extension of a given field. If F is a field, then a field extension is a field K that contains F.

More information

5. Continuous Random Variables

5. Continuous Random Variables 5. Continuous Random Variables Continuous random variables can take any value in an interval. They are used to model physical characteristics such as time, length, position, etc. Examples (i) Let X be

More information

Introduction to Probability

Introduction to Probability Introduction to Probability EE 179, Lecture 15, Handout #24 Probability theory gives a mathematical characterization for experiments with random outcomes. coin toss life of lightbulb binary data sequence

More information

The sample space for a pair of die rolls is the set. The sample space for a random number between 0 and 1 is the interval [0, 1].

The sample space for a pair of die rolls is the set. The sample space for a random number between 0 and 1 is the interval [0, 1]. Probability Theory Probability Spaces and Events Consider a random experiment with several possible outcomes. For example, we might roll a pair of dice, flip a coin three times, or choose a random real

More information

1. (First passage/hitting times/gambler s ruin problem:) Suppose that X has a discrete state space and let i be a fixed state. Let

1. (First passage/hitting times/gambler s ruin problem:) Suppose that X has a discrete state space and let i be a fixed state. Let Copyright c 2009 by Karl Sigman 1 Stopping Times 1.1 Stopping Times: Definition Given a stochastic process X = {X n : n 0}, a random time τ is a discrete random variable on the same probability space as

More information

Bias in the Estimation of Mean Reversion in Continuous-Time Lévy Processes

Bias in the Estimation of Mean Reversion in Continuous-Time Lévy Processes Bias in the Estimation of Mean Reversion in Continuous-Time Lévy Processes Yong Bao a, Aman Ullah b, Yun Wang c, and Jun Yu d a Purdue University, IN, USA b University of California, Riverside, CA, USA

More information

Chapter 4, Arithmetic in F [x] Polynomial arithmetic and the division algorithm.

Chapter 4, Arithmetic in F [x] Polynomial arithmetic and the division algorithm. Chapter 4, Arithmetic in F [x] Polynomial arithmetic and the division algorithm. We begin by defining the ring of polynomials with coefficients in a ring R. After some preliminary results, we specialize

More information

Big Data - Lecture 1 Optimization reminders

Big Data - Lecture 1 Optimization reminders Big Data - Lecture 1 Optimization reminders S. Gadat Toulouse, Octobre 2014 Big Data - Lecture 1 Optimization reminders S. Gadat Toulouse, Octobre 2014 Schedule Introduction Major issues Examples Mathematics

More information

1 Teaching notes on GMM 1.

1 Teaching notes on GMM 1. Bent E. Sørensen January 23, 2007 1 Teaching notes on GMM 1. Generalized Method of Moment (GMM) estimation is one of two developments in econometrics in the 80ies that revolutionized empirical work in

More information

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical

More information

DRAFT. Algebra 1 EOC Item Specifications

DRAFT. Algebra 1 EOC Item Specifications DRAFT Algebra 1 EOC Item Specifications The draft Florida Standards Assessment (FSA) Test Item Specifications (Specifications) are based upon the Florida Standards and the Florida Course Descriptions as

More information