On The So-Called Huber Sandwich Estimator and Robust Standard Errors by David A. Freedman
Abstract

The Huber Sandwich Estimator can be used to estimate the variance of the MLE when the underlying model is incorrect. If the model is nearly correct, so are the usual standard errors, and robustification is unlikely to help much. On the other hand, if the model is seriously in error, the sandwich may help on the variance side, but the parameters being estimated by the MLE are likely to be meaningless, except perhaps as descriptive statistics.

Introduction

This paper gives an informal account of the so-called Huber Sandwich Estimator, for which Peter Huber is not to be blamed. We discuss the algorithm, and mention some of the ways in which it is applied. Although the paper is mainly expository, the theoretical framework outlined here may have some elements of novelty. In brief, under rather stringent conditions, the algorithm can be used to estimate the variance of the MLE when the underlying model is incorrect. However, the algorithm ignores bias, which may be appreciable. Thus, results are liable to be misleading.

To begin the mathematical exposition, let $i$ index observations whose values are $y_i$. Let $\theta \in R^p$ be a $p \times 1$ parameter vector. Let $y \mapsto f_i(y|\theta)$ be a positive density. If $y_i$ takes only the values 0 or 1, which is the chief case of interest here, then $f_i(0|\theta) > 0$, $f_i(1|\theta) > 0$, and $f_i(0|\theta) + f_i(1|\theta) = 1$. Some examples involve real- or vector-valued $y_i$, and the notation is set up in terms of integrals rather than sums. We assume $\theta \mapsto f_i(y|\theta)$ is smooth. (Other regularity conditions are elided.)

Let $Y_i$ be independent, with density $f_i(\cdot|\theta)$. Notice that the $Y_i$ are not identically distributed: $f_i$ depends on the subscript $i$. In typical applications, the $Y_i$ cannot be identically distributed, as will be explained below. The data are modeled as observed values of $Y_i$ for $i = 1, \ldots, n$. The likelihood function is $\prod_{i=1}^n f_i(Y_i|\theta)$, viewed as a function of $\theta$. The log likelihood function is therefore

$$L(\theta) = \sum_{i=1}^n \log f_i(Y_i|\theta). \tag{1}$$

The first and second partial derivatives of $L$ with respect to $\theta$ are given by

$$L'(\theta) = \sum_{i=1}^n g_i(Y_i|\theta), \qquad L''(\theta) = \sum_{i=1}^n h_i(Y_i|\theta). \tag{2}$$

To unpack the notation in (2), let $\phi'$ denote the derivative of the function $\phi$: differentiation is with respect to the parameter vector $\theta$. Then

$$g_i(y|\theta) = [\log f_i(y|\theta)]' = \frac{\partial}{\partial\theta} \log f_i(y|\theta), \tag{3}$$
a $1 \times p$ vector. Similarly,

$$h_i(y|\theta) = [\log f_i(y|\theta)]'' = \frac{\partial^2}{\partial\theta^2} \log f_i(y|\theta), \tag{4}$$

a symmetric $p \times p$ matrix. The quantity $-E_\theta\, h_i(Y_i|\theta)$ is called the Fisher information matrix. It may help to note that $-E_\theta\, h_i(Y_i|\theta) = E_\theta\big(g_i(Y_i|\theta)^T g_i(Y_i|\theta)\big) > 0$, where $T$ stands for transposition.

Assume for the moment that the model is correct, and $\theta_0$ is the true value of $\theta$. So the $Y_i$ are independent and the density of $Y_i$ is $f_i(\cdot|\theta_0)$. The log likelihood function can be expanded in a Taylor series around $\theta_0$:

$$L(\theta) = L(\theta_0) + L'(\theta_0)(\theta - \theta_0) + \frac{1}{2}(\theta - \theta_0)^T L''(\theta_0)(\theta - \theta_0) + \cdots \tag{5}$$

If we ignore higher-order terms and write $\doteq$ for "nearly equal" (this is an informal exposition), the log likelihood function is essentially a quadratic, whose maximum can be found by solving the likelihood equation $L'(\theta) = 0$. Essentially, the equation is

$$L'(\theta_0) + (\theta - \theta_0)^T L''(\theta_0) = 0. \tag{6}$$

So

$$\hat\theta - \theta_0 \doteq [-L''(\theta_0)]^{-1} L'(\theta_0)^T. \tag{7}$$

Then

$$\mathrm{cov}_{\theta_0}\,\hat\theta \doteq [-L''(\theta_0)]^{-1}\,[\mathrm{cov}_{\theta_0} L'(\theta_0)]\,[-L''(\theta_0)]^{-1}, \tag{8}$$

the covariance being a symmetric $p \times p$ matrix.

In the conventional textbook development, $L''(\theta_0)$ and $\mathrm{cov}_{\theta_0} L'(\theta_0)$ are computed approximately or exactly using Fisher information. Thus, $L''(\theta_0) \doteq \sum_{i=1}^n E_{\theta_0} h_i(Y_i)$. Furthermore, $\mathrm{cov}_{\theta_0} L'(\theta_0) = -\sum_{i=1}^n E_{\theta_0} h_i(Y_i)$.

The sandwich idea is to estimate $L''(\theta_0)$ directly from the sample data, as $L''(\hat\theta)$. Similarly, $\mathrm{cov}_{\theta_0} L'(\theta_0)$ is estimated as $\sum_{i=1}^n g_i(Y_i|\hat\theta)^T g_i(Y_i|\hat\theta)$. So (8) is estimated as

$$\hat V = (-A)^{-1} B (-A)^{-1}, \tag{9}$$

where

$$A = L''(\hat\theta) \tag{9a}$$

and

$$B = \sum_{i=1}^n g_i(Y_i|\hat\theta)^T g_i(Y_i|\hat\theta). \tag{9b}$$

The $\hat V$ in (9) is the Huber sandwich estimator. The square roots of the diagonal elements of $\hat V$ are "robust standard errors" or "Huber-White standard errors." The middle factor $B$ in (9) is not centered in any way. No centering is needed, because

$$E_\theta[g_i(Y_i|\theta)] = 0, \qquad \mathrm{cov}_\theta\big[g_i(Y_i|\theta)\big] = E_\theta\big[g_i(Y_i|\theta)^T g_i(Y_i|\theta)\big]. \tag{10}$$
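To fix ideas, here is a minimal computational sketch of (9), (9a), and (9b), specialized to a logit model with covariate rows $x_i$, so that $g_i = (Y_i - p_i)x_i$ and $h_i = -p_i(1-p_i)x_i^T x_i$, the standard logit expressions. The sketch is not from the paper; the function name and the choice of Python with numpy are illustrative only.

```python
import numpy as np

def logit_sandwich(X, y, theta_hat):
    """Huber sandwich estimator V-hat = (-A)^{-1} B (-A)^{-1} of (9), for a logit
    model: P(Y_i = 1) = p_i = 1 / (1 + exp(-x_i . theta)).  X is n x p, y is 0/1."""
    p = 1.0 / (1.0 + np.exp(-X @ theta_hat))    # fitted probabilities at theta-hat
    G = (y - p)[:, None] * X                    # n x p; row i is the score g_i(Y_i|theta-hat)
    A = -(X * (p * (1 - p))[:, None]).T @ X     # A = L''(theta-hat), the sum of the h_i, as in (9a)
    B = G.T @ G                                 # middle factor (9b); no centering, by (10)
    A_inv = np.linalg.inv(-A)
    return A_inv @ B @ A_inv                    # the sandwich (9)
```

The robust standard errors are then `np.sqrt(np.diag(logit_sandwich(X, y, theta_hat)))`, the square roots of the diagonal elements of $\hat V$.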
Indeed, (10) holds because

$$E_\theta[g_i(Y_i|\theta)] = \int g_i(y|\theta) f_i(y|\theta)\,dy = \int \frac{\partial}{\partial\theta} f_i(y|\theta)\,dy = \frac{\partial}{\partial\theta}\int f_i(y|\theta)\,dy = \frac{\partial}{\partial\theta}\, 1 = 0. \tag{11}$$

A derivative was passed through the integral sign in (11). Regularity conditions are needed to justify such maneuvers, but we finesse these mathematical issues.

If the motivation for the middle factor in (9) is still obscure, try this recipe. Let $U_i$ be independent $1 \times p$ vectors, with $E(U_i) = 0$. Now $\mathrm{cov}(\sum U_i) = \sum \mathrm{cov}(U_i) = \sum E(U_i^T U_i)$. Estimate $E(U_i^T U_i)$ by $U_i^T U_i$. Take $U_i = g_i(Y_i|\theta_0)$. Finally, substitute $\hat\theta$ for $\theta_0$.

The middle factor $B$ in (9) is quadratic. It does not vanish, although

$$\sum_{i=1}^n g_i(Y_i|\hat\theta) = 0. \tag{12}$$

Remember, $\hat\theta$ was chosen to solve the likelihood equation $L'(\theta) = \sum_{i=1}^n g_i(Y_i|\theta) = 0$, explaining (12).

In textbook examples, the middle factor $B$ in (9) will be of order $n$, being the sum of $n$ terms. Similarly, $L''(\theta_0) = \sum_{i=1}^n h_i(Y_i|\theta_0)$ will be of order $n$: see (2). Thus, (9) will be of order $1/n$. Under suitable regularity conditions, the strong law of large numbers will apply to $L''(\theta_0)$, so $-L''(\theta_0)/n$ converges to a positive constant; the central limit theorem will apply to $L'(\theta_0)$, so $L'(\theta_0)/\sqrt{n}$ converges in law to a multivariate normal distribution with mean 0. In particular, the randomness in $L'$ is of order $\sqrt{n}$. So is the randomness in $L''$, but that can safely be ignored when computing the asymptotic distribution of $[-L''(\theta_0)]^{-1} L'(\theta_0)^T$, because $L''(\theta_0)$ is of order $n$.

Robust standard errors

We turn now to the case where the model is wrong. We continue to assume the $Y_i$ are independent. The density of $Y_i$, however, is $\varphi_i$, which is not in our parametric family. In other words, there is specification error in the model, so the likelihood function is in error too. The sandwich estimator (9) is held to provide standard errors that are "robust to specification error." To make sense of the claim, we need the

Key Assumption. There is a common $\theta_0$ such that $f_i(\cdot|\theta_0)$ is closest, in the Kullback-Leibler sense of relative entropy defined in (14) below, to $\varphi_i$.
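To make the Key Assumption concrete, consider a hypothetical illustration, not from the paper, in which the $\varphi_i$ are all equal (so a common $\theta_0$ exists trivially): the true density is Gamma with shape 3 and scale 1, while the fitted family is $N(\theta, 1)$. For a fixed-variance normal family, the parameter minimizing the relative entropy (14) is the true mean, here $E_\varphi[Y] = 3$; the sketch below recovers this numerically with scipy.

```python
import numpy as np
from scipy import integrate, optimize, stats

phi = stats.gamma(a=3.0)   # true density: Gamma(shape=3, scale=1), outside the fitted family

def kl_to_normal(theta):
    """Kullback-Leibler relative entropy (14) from phi to f(.|theta) = N(theta, 1)."""
    f_theta = stats.norm(loc=theta, scale=1.0)
    integrand = lambda y: (phi.logpdf(y) - f_theta.logpdf(y)) * phi.pdf(y)
    val, _ = integrate.quad(integrand, 1e-9, 40.0)  # support of phi; avoid the log-singularity at 0
    return val

res = optimize.minimize_scalar(kl_to_normal, bounds=(0.0, 10.0), method="bounded")
print(res.x)   # ~3.0 = E[Y] under phi: the KL-closest theta_0
```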
(A possible extension will be mentioned below.) Equation (11) may look questionable in this new context. But

$$E_0[g_i(Y_i|\theta)] = \int \Big[\frac{\partial}{\partial\theta} f_i(y|\theta)\Big] \frac{1}{f_i(y|\theta)}\,\varphi_i(y)\,dy = 0 \quad \text{at } \theta = \theta_0. \tag{13}$$

This is because $\theta_0$ minimizes the Kullback-Leibler relative entropy,

$$\theta \mapsto \int \log\Big[\frac{\varphi_i(y)}{f_i(y|\theta)}\Big]\,\varphi_i(y)\,dy. \tag{14}$$

By the key assumption, we get the same $\theta_0$ for every $i$. Under suitable conditions, the MLE will converge to $\theta_0$. Furthermore, $\hat\theta - \theta_0$ will be asymptotically normal, with mean 0 and covariance $\hat V$ given by (9); that is,

$$\hat V^{-1/2}(\hat\theta - \theta_0) \to N(0_p, I_{p\times p}). \tag{15}$$

By definition, $\hat\theta$ is the $\theta$ that maximizes $\theta \mapsto \prod_i f_i(Y_i|\theta)$, although it is granted that $Y_i$ does not have the density $f_i(\cdot|\theta)$. In short, it is a pseudo-likelihood that is being maximized, not a true likelihood. The asymptotics in (15) therefore describe convergence to parameters of an incorrect model that is fitted to the data.

For some rigorous theory in the independent but not identically distributed case, see Amemiya (1985, Section 9.2.2) or Fahrmeir and Kaufmann (1985). For the more familiar IID (independent and identically distributed) case, see Rao (1973, Chapter 6) or Lehmann and Casella (2003, Chapter 6). Lehmann (1998, Chapter 7) and van der Vaart (1998) are less formal, more approachable. These references all use Fisher information rather than (9), and consider true likelihood functions rather than pseudo-likelihoods.

Why not assume IID variables?

The sandwich estimator is commonly used in logit, probit, or cloglog specifications. See, for instance, Gartner and Segura (2000), Jacobs and Carmichael (2002), Gould, Lavy, and Passerman (2004), Lassen (2005), or Schonlau (2006). Calculations are made conditional on the explanatory variables, which are left implicit here. Different subjects have different values for the explanatory variables. Therefore, the response variables have different conditional distributions. Thus, according to the model specification itself, the $Y_i$ are not IID. If the $Y_i$ are not IID, then $\theta_0$ exists only by virtue of the key assumption.

Even if the key assumption holds, bias should be of greater interest than variance, especially when the sample is large and causal inferences are based on a model that is incorrectly specified. Variances will be small, and bias may be large. Specifically, inferences will be based on the incorrect density $f_i(\cdot|\hat\theta) \doteq f_i(\cdot|\theta_0)$, rather than the correct density $\varphi_i$. Why do we care about $f_i(\cdot|\theta_0)$? If the model were correct, or nearly correct (that is, $f_i(\cdot|\theta_0) = \varphi_i$ or $f_i(\cdot|\theta_0) \doteq \varphi_i$), there would be no reason to use robust standard errors.
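A small Monte Carlo, again hypothetical and not from the paper, can illustrate (15) and the bias-versus-variance point. Take IID draws from the Gamma density of the previous sketch, fitted with the misspecified $N(\theta, 1)$ model. Then $\theta_0 = 3$, the MLE is the sample mean, $g_i = Y_i - \theta$, and $A = L''(\hat\theta) = -n$. The sandwich interval covers $\theta_0$ at roughly the nominal rate; the Fisher-information interval does not, since it assumes variance 1 when the true variance is 3. Neither interval says anything about the gap between $f_i(\cdot|\theta_0)$ and the correct density.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps, theta0 = 100, 2000, 3.0              # theta0 = E[Y] under the true Gamma(3, 1) density
z = np.empty(reps)
cover_model = cover_sand = 0

for r in range(reps):
    y = rng.gamma(shape=3.0, scale=1.0, size=n)   # true density, outside the N(theta, 1) family
    theta_hat = y.mean()                          # MLE of the misspecified N(theta, 1) model
    se_model = 1.0 / np.sqrt(n)                   # Fisher-information SE: assumes variance 1
    B = np.sum((y - theta_hat) ** 2)              # middle factor (9b); here g_i = y_i - theta
    se_sand = np.sqrt(B / n**2)                   # sandwich (9): (-A)^{-1} B (-A)^{-1} with A = -n
    z[r] = (theta_hat - theta0) / se_sand
    cover_model += abs(theta_hat - theta0) < 1.96 * se_model
    cover_sand += abs(theta_hat - theta0) < 1.96 * se_sand

print(z.mean(), z.std())      # ~0 and ~1, as (15) predicts
print(cover_model / reps)     # well below 0.95: the usual SE is too small (true variance is 3)
print(cover_sand / reps)      # ~0.95: the sandwich fixes the variance, not the bias
```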
A possible extension

Suppose the $Y_i$ are independent but not identically distributed, and there is no common $\theta_0$ such that $f_i(\cdot|\theta_0)$ is closest to $\varphi_i$. One idea is to choose $\theta_n$ to minimize the total relative entropy, that is, to minimize

$$\sum_{i=1}^n \int \log\Big[\frac{\varphi_i(y)}{f_i(y|\theta)}\Big]\,\varphi_i(y)\,dy. \tag{16}$$

Of course, $\theta_n$ would depend on $n$, and the MLE would have to be viewed as estimating this moving parameter. Many technical details remain to be worked out. For discussion along these lines, see White (1994).

Cluster samples

The sandwich estimator is often used for cluster samples. The idea is that clusters are independent, but subjects within a cluster are dependent. The procedure is to group the terms in (9), with one group for each cluster. If we denote cluster $j$ by $c_j$, the middle factor in (9) would be replaced by

$$\sum_j \Big[\sum_{i \in c_j} g_i(Y_i|\hat\theta)\Big]^T \Big[\sum_{i \in c_j} g_i(Y_i|\hat\theta)\Big]. \tag{17}$$

The two outside factors in (9) would remain the same. The results of the calculation are sometimes called "survey-corrected" variances, or variances "adjusted for clustering."

There is undoubtedly a statistical model for which the calculation gives sensible answers, because the quantity in (17) should estimate the variance of $\sum_j \sum_{i \in c_j} g_i(Y_i|\hat\theta)$ if clusters are independent and $\hat\theta$ is nearly constant. (Details remain to be elucidated.) It is quite another thing to say what is being estimated by solving the non-likelihood equation $\sum_{i=1}^n g_i(Y_i|\theta) = 0$. This is a non-likelihood equation because $\prod_i f_i(\cdot|\theta)$ does not describe the behavior of the individuals comprising the population. If it did, we would not be bothering with robust standard errors in the first place. The sandwich estimator for cluster samples presents exactly the same conceptual difficulty as before.

The linear case

The sandwich estimator is often conflated with the correction for heteroscedasticity in White (1980). Suppose $Y = X\beta + \epsilon$. We condition on $X$, assumed to be of full rank. Suppose the $\epsilon_i$ are independent with expectation 0, but not identically distributed. The OLS estimator is $\hat\beta_{OLS} = (X'X)^{-1}X'Y$. White proposed that the covariance matrix of $\hat\beta_{OLS}$ should be estimated as $(X'X)^{-1}X'\hat G X(X'X)^{-1}$, where $e = Y - X\hat\beta_{OLS}$ is the vector of residuals, $\hat G_{ij} = e_i^2$ if $i = j$, and $\hat G_{ij} = 0$ if $i \neq j$. Similar ideas can be used if the $\epsilon_i$ are independent in blocks.

White's method often gives good results, although $\hat G$ can be so variable that $t$-statistics are surprisingly non-$t$-like. Compare Beck, Katz, Alvarez, Garrett, and Lange (1993). The linear model is much nicer than other models, because $\hat\beta_{OLS}$ is unbiased even in the case we are considering, although OLS may of course be inefficient, and, more important, the usual SEs may be wrong. White's correction tries to fix the SEs.
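The two calculations just described, White's correction and the cluster-grouped middle factor (17), are short enough to state in code. The following sketch is illustrative, with hypothetical function names, and follows the formulas above with numpy.

```python
import numpy as np

def white_cov(X, Y):
    """White's (1980) estimator (X'X)^{-1} X' G-hat X (X'X)^{-1} for cov(beta-hat-OLS),
    where G-hat is diagonal with the squared residuals e_i^2 on the diagonal."""
    XtX_inv = np.linalg.inv(X.T @ X)
    e = Y - X @ (XtX_inv @ X.T @ Y)             # OLS residuals
    meat = (X * (e ** 2)[:, None]).T @ X        # X' G-hat X, without forming the n x n matrix
    return XtX_inv @ meat @ XtX_inv

def clustered_middle_factor(G, cluster_ids):
    """Cluster version (17) of the middle factor: the rows of G (n x p, row i = g_i)
    are summed within each cluster before taking outer products."""
    p = G.shape[1]
    B = np.zeros((p, p))
    for c in np.unique(cluster_ids):
        s = G[cluster_ids == c].sum(axis=0)     # sum of g_i over i in cluster c_j
        B += np.outer(s, s)
    return B
```

For the cluster version, the two outside factors of (9) are left unchanged, as noted above; only $B$ is replaced by the grouped sum.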
An example

Suppose there is one real-valued explanatory variable, $x$, with values $x_i$ spread fairly uniformly over the interval from 0 to 10. Given the $x_i$, the response variables $Y_i$ are independent, and

$$\mathrm{logit}\, P(Y_i = 1) = \alpha + \beta x_i + \gamma x_i^2, \tag{18}$$

where $\mathrm{logit}\, p = \log[p/(1-p)]$. Equation (18) is a logit model with a quadratic response. The sample size is moderately large. However, an unwitting statistician fits a linear logit model,

$$\mathrm{logit}\, P(Y_i = 1) = a + b x_i. \tag{19}$$

If $\gamma$ is nearly 0, for example, then $\hat a \doteq \alpha$, $\hat b \doteq \beta$, and all is well, with or without the robust SEs. Suppose, however, that $\alpha = 0$, $\beta = -3$, and $\gamma = 0.5$. (The parameters are chosen so the quadratic has a minimum at 3, and the probabilities spread out through the unit interval.) The unwitting statistician will get $\hat a \doteq -5$ and $\hat b \doteq 1$, concluding that, on the logit scale, a unit increase in $x$ makes the probability that $Y = 1$ go up by one, across the whole range of $x$. The only difference between the usual SEs and the robust SEs is the confidence one has in this absurd conclusion.

In truth, for $x$ near 0, a unit increase in $x$ makes the probability of a response go down, by 3 (probabilities are measured here on the logit scale). For $x$ near 3, increasing $x$ makes no difference. For $x$ near 10, a unit increase in $x$ makes the probability go up by 7. Could the specification error be detected by some kind of regression diagnostics? Perhaps, especially if we knew what kind of specification errors to look for. Keep in mind, however, that the robust SEs are designed for use when there is undetected specification error. (A simulation sketch of this example appears after the next section.)

What about Huber?

The usual applications of the so-called Huber sandwich estimator go far beyond the mathematics in Huber (1967), and our critical comments do not apply to his work. In free translation (this is no substitute for reading the paper), he assumes the $Y_i$ are IID, so $f_i \equiv f$, $g_i \equiv g$, and $h_i \equiv h$. He considers the asymptotics when the true density is $f_0$, not in the parametric family. Let

$$A = \int h(y|\theta_0) f_0(y)\,dy, \qquad B = \int g(y|\theta_0)^T g(y|\theta_0) f_0(y)\,dy.$$

Both are $p \times p$ symmetric matrices. Plainly, $L'(\theta_0) = \sum_{i=1}^n g(Y_i|\theta_0)$. Under regularity conditions discussed in the paper,

(i) $\hat\theta \to \theta_0$, which minimizes the distance between $f(\cdot|\theta)$ and $f_0$.

(ii) $n^{-1} L''(\theta_0) = n^{-1} \sum_{i=1}^n h(Y_i|\theta_0) \to A$.

(iii) $n^{-1/2} B^{-1/2} L'(\theta_0) \to N(0_p, I_{p\times p})$.

Asymptotic normality of the MLE follows:

$$C_n^{-1/2}(\hat\theta - \theta_0) \to N(0_{p\times 1}, I_{p\times p}), \tag{20a}$$

where

$$C_n = n^{-1}(-A)^{-1} B (-A)^{-1}. \tag{20b}$$

Thus, Huber's paper answers a question that (for a mathematical statistician) seems quite natural: what is the asymptotic behavior of the MLE when the model is wrong? Applying the algorithm to data, while ignoring the assumptions of the theorems and the errors in the models: that is not Peter Huber.
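Returning to the example in (18) and (19): the following simulation sketch, not from the paper, fits the misspecified linear logit by maximizing the pseudo-likelihood and computes the usual and robust standard errors from (9). The sample size and seed are arbitrary; the fitted coefficients should land near the values reported above, and the two sets of standard errors differ only in how confidently they certify the same misleading summary.

```python
import numpy as np
from scipy import optimize, special

rng = np.random.default_rng(1)
n = 5000
x = rng.uniform(0.0, 10.0, size=n)
p_true = special.expit(-3.0 * x + 0.5 * x**2)   # quadratic logit truth (18): alpha=0, beta=-3, gamma=0.5
y = rng.binomial(1, p_true)

X = np.column_stack([np.ones(n), x])            # design for the misspecified linear logit (19)

def negloglik(theta):
    eta = X @ theta
    # negative log pseudo-likelihood: -(sum_i [y_i * eta_i - log(1 + exp(eta_i))])
    return -(y @ eta - np.logaddexp(0.0, eta).sum())

theta_hat = optimize.minimize(negloglik, x0=np.zeros(2), method="BFGS").x
p = special.expit(X @ theta_hat)
G = (y - p)[:, None] * X                        # per-observation scores g_i
A = -(X * (p * (1 - p))[:, None]).T @ X         # A = L''(theta-hat)
V_usual = np.linalg.inv(-A)                     # inverse observed information
V_robust = V_usual @ (G.T @ G) @ V_usual        # sandwich (9)

print(theta_hat)                   # compare with the values reported in the text
print(np.sqrt(np.diag(V_usual)))   # usual SEs
print(np.sqrt(np.diag(V_robust)))  # robust SEs: about as small, for the same wrong summary
```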
Summary and conclusions

The sandwich algorithm, under stringent regularity conditions, yields variances for the MLE that are asymptotically correct even when the specification, and hence the likelihood function, are incorrect. However, it is quite another thing to ignore bias. It remains unclear why applied workers should care about the variance of an estimator for the wrong parameter. More particularly, inferences are based on a model that is admittedly incorrect. (If the model were correct, or nearly correct, there would be no need for sandwiches.) The chief issue, then, is the difference between the incorrect model that is fitted to the data and the process that generated the data. This is bias due to specification error. The algorithm does not take bias into account. Applied papers that use sandwiches rarely mention bias. There is room for improvement here.

See Koenker (2005) for additional discussion. On White's correction, see Greene (2003, p. 220). For a more general discussion of independence assumptions, see Berk and Freedman (2003) or Freedman (2005). The latter reference also discusses model-based causal inference in the social sciences.

References

Amemiya, T. (1985). Advanced Econometrics, Harvard University Press.

Beck, N., Katz, J. N., Alvarez, R. M., Garrett, G., and Lange, P. (1993). Government Partisanship, Labor Organization, and Macroeconomic Performance, American Political Science Review, 87.

Berk, R. A. and Freedman, D. A. (2003). Statistical Assumptions as Empirical Commitments, in T. G. Blomberg and S. Cohen, eds., Law, Punishment, and Social Control: Essays in Honor of Sheldon Messinger, 2nd ed., Aldine de Gruyter, New York.

Fahrmeir, L. and Kaufmann, H. (1985). Consistency and Asymptotic Normality of the Maximum Likelihood Estimator in Generalized Linear Models, The Annals of Statistics, 13.

Freedman, D. A. (2005). Statistical Models: Theory and Practice, Cambridge University Press.

Gartner, S. S. and Segura, G. M. (2000). Race, Casualties, and Opinion in the Vietnam War, Journal of Politics, 62.

Gould, E. D., Lavy, V., and Passerman, M. D. (2004). Immigrating to Opportunity: Estimating the Effect of School Quality Using a Natural Experiment on Ethiopians in Israel, Quarterly Journal of Economics, 119.

Greene, W. H. (2003). Econometric Analysis, 5th ed., Prentice Hall.

Huber, P. J. (1967). The Behavior of Maximum Likelihood Estimates under Nonstandard Conditions, Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. I.

Jacobs, D. and Carmichael, J. T. (2002). The Political Sociology of the Death Penalty, American Sociological Review, 67.

Koenker, R. (2005). Maximum Likelihood Asymptotics under Nonstandard Conditions: A Heuristic Introduction to Sandwiches, lecture notes, roger/courses/476/lectures/l10.pdf
Lassen, D. D. (2005). The Effect of Information on Voter Turnout: Evidence from a Natural Experiment, American Journal of Political Science, 49.

Lehmann, E. L. (1998). Elements of Large-Sample Theory, Springer.

Lehmann, E. L. and Casella, G. (2003). Theory of Point Estimation, 2nd ed., Springer.

Rao, C. R. (1973). Linear Statistical Inference and Its Applications, 2nd ed., Wiley.

van der Vaart, A. (1998). Asymptotic Statistics, Cambridge University Press.

White, H. (1980). A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity, Econometrica, 48.

White, H. (1994). Estimation, Inference, and Specification Analysis, Cambridge University Press.

Author's footnote. Dick Berk (Penn), Peter Westfall (Texas Tech), and Paul Ruud (Berkeley) made helpful comments.

Department of Statistics, UC Berkeley, CA. September.