# Modern Methods for Missing Data

Save this PDF as:

Size: px
Start display at page:

## Transcription

1 Modern Methods for Missing Data Paul D. Allison, Ph.D. Statistical Horizons LLC 1

2 Introduction Missing data problems are nearly universal in statistical practice. Last 25 years have seen revolutionary developments in missing data methods. Two major methods: multiple imputation (MI) and maximum likelihood (ML). MI more popular, but ML is superior. This talk: Why ML is better for missing data. How to do ML with SAS and Stata. 2

3 Assumptions Missing completely at random (MCAR) Suppose some data are missing on Y. These data are said to be MCAR if the probability that Y is missing is unrelated to Y or other variables X (where X is a vector of observed variables). Pr (Y is missing X,Y) = Pr(Y is missing) MCAR is the ideal situation. What variables must be in the X vector? Only variables in the model of interest. 3

4 Assumptions Missing at random (MAR) Data on Y are missing at random if the probability that Y is missing does not depend on the value of Y, after controlling for observed variables Pr (Y is missing X,Y) = Pr(Y is missing X) E.g., the probability of missing income depends on marital status, but within each marital status, the probability of missing income does not depend on income. Considerably weaker assumption than MCAR Only X s in the model must be considered. But, including other X s (correlated with Y) can make MAR more plausible. Can test whether missingness on Y depends on X Cannot test whether missingness on Y depends on Y 4

5 Ignorability The missing data mechanism is said to be ignorable if The data are missing at random and Parameters that govern the missing data mechanism are distinct from parameters to be estimated (unlikely to be violated) In practice, MAR and ignorable are used interchangeably If MAR but not ignorable (parameters not distinct), methods assuming ignorability would still be good, just not optimal. If missing data are ignorable, no need to model the missing data mechanism. Any general purpose method for handling missing data must assume that the missing data mechanism is ignorable. 5

6 Assumptions Not missing at random (NMAR) If the MAR assumption is violated, the missing data mechanism must be modeled to get good parameter estimates. Heckman s regression model for sample selection bias is a good example. Effective estimation for NMAR missing data requires very good prior knowledge about missing data mechanism. Data contain no information about what models would be appropriate No way to test goodness of fit of missing data model Results often very sensitive to choice of model Listwise deletion able to handle one important kind of NMAR 6

7 Conventional Imputation Any method that substitutes estimated values for missing values Replacement with means Regression imputation (replace with conditional means) Problems Often leads to biased parameter estimates (e.g., variances) Usually leads to standard error estimates that are biased downward Treats imputed data as real data, ignores inherent uncertainty in imputed values. 7

8 Multiple Imputation Three basic steps: 1. Introduce random variation into the imputation process and generate several data sets, each with different imputed values. 2. Perform the desired analysis on each data set. 3. Combine the results into a single set of parameter estimates, standard errors, and test statistics. 8

9 Multiple Imputation Properties If assumptions are met and MI is done properly, resulting estimators are 1. Consistent (implies approximately unbiased). 2. Asymptotically efficient (almost) minimum sampling variability 3. Asymptotically normal justifies uses of normal table for p-values and confidence intervals. 9

10 MCMC Method in SAS and Stata Assumptions: 1. Ignorability (->MAR) 2. Multivariate normality Multivariate normality implies All variables are normally distributed All conditional expectation functions are linear All conditional variance functions are homoscedastic. Multivariate normality also implies that optimal imputation can be done by linear regression. Example: a data set has three variables, X, Y, and Z. Suppose X and Y are fully observed, but Z has missing data for 20% of the cases. 10

11 Regression Imputation Conventional imputation: a regression of Z on X and Y for the complete cases yields the imputation equation Problem: Variance of imputed values is too small. Solution: where E is a random draw from a standard normal distribution and s is the root mean squared error from the regression. 11

12 Multiple Random Imputation Problems with single, random imputation Not fully efficient because of random variation. Reported standard errors are too low. Solution: Do it multiple times Produce multiple data sets, each with different imputed values. Estimate parameters on each data set. Averaging the parameter estimates dampens the variation. A few data sets yields close to full efficiency. Variability among the estimates provides information for correcting the standard errors. Yields accurate standard errors that fully account for the uncertainty about the missing values. 12

13 Complications To get the right amount of variability in the imputed values, the imputation parameters must be random draws from their posterior distribution. When more than one variable has missing data, imputation typically requires an iterative method of repeated imputations. Both of these issues can be solved by the MCMC algorithm available in many software packages. An alternative algorithm is the fully conditional specification (FCS), also known as multiple imputation by chained equations (MICE). 13

14 Maximum Likelihood Choose as parameter estimates those values which, if true, would maximize the probability of observing what has, in fact, been observed. Likelihood function: Expresses the probability of the data as a function of the data and the unknown parameter values. Example: Let f(y θ) be the probability density for y, given θ (a vector of parameters). For a sample of n independent observations, the likelihood function is 14

15 Properties of Maximum Likelihood To get ML estimates, we find the value of θ that maximizes the likelihood function. Under usual conditions, ML estimates have the following properties: Consistent (implies approximately unbiased in large samples) Asymptotically efficient Asymptotically normal 15

16 ML with Ignorable Missing Data Suppose we have n independent observations on k variables (y 1,y 2,,y k ). With no missing data, the likelihood function is Now suppose that data are missing for individual i for y 1 and y 2. The likelihood for that individual is just the probability (or density) of observing the remaining variables. If y 1 and y 2 are discrete that s found by 16

17 ML with Ignorable Missing Data If the missing variables are continuous, the likelihood for i is If there are m observations with complete data and n-m observations with data missing on y 1 and y 2, the likelihood function for the full data set becomes This can be maximized by standard methods to get ML estimates of θ. 17

18 Why ML is Better Than MI 1. ML is more efficient than MI Smaller true standard errors. Full efficiency for MI requires an infinite number of data sets. How many is enough? This may be important for small samples. 2. For a given data set, ML always gives the same results. MI gives a different result each time you use it. Can force MI to give the same results by setting the seed. (Computers only do pseudo-random numbers.) No reason to prefer one seed to any other seed. Reduce variability by using more data sets. How many? 18

19 Why ML is Better 3. MI requires many uncertain decisions. ML needs far fewer. MCMC vs. FCS? If FCS, what model for each variable? How many data sets, is it enough? How many iterations between data sets? Prior distributions? How to incorporate interactions and nonlinearities? Which method for multivariate testing? 19

20 Why ML is Better 4. With MI, there s always potential conflict between imputation model and analysis model. No conflict with ML because only one model. Imputation model involves choice of variables, specification of relationships, probability distributions. For results to be correct, analysis model must be compatible with imputation model. Some common sources of incompatibility: Analysis model contains variables that were not in the imputation model. Analysis model contains interactions and non-linearities, but imputation model was strictly linear. Most problems occur when imputation model is more restrictive than analysis model. 20

21 Why Ever Do Multiple Imputation? Once you ve imputed the missing values, you can use the data set with any other statistical software. No need to learn new software. Lots of possibilities ML requires specialized procedures. Nothing in SAS, Stata, SPSS or R when predictors are missing for logistic regression, Cox regression, Poisson regression, etc. 21

22 How to Do ML Scenario 1: Regression model with missing data on dependent variable only. Any kind of regression (linear, logistic, etc.) Assume MAR. Maximum likelihood reduces to listwise deletion (complete case analysis). Could possibly do better if there is a good auxiliary variable (one highly correlated with dependent variable but not in the model). If data are NMAR, could possibly do better with ML using Heckman s method for selection bias. But that s problematic. 22

23 Repeated Measures Scenario 2: Repeated measures regression with data missing on dependent variable only. Solution: Use ML (or REML) to estimate mixed model. Under MAR, estimates have the usual optimal properties. No need to do anything special. Example: 595 people surveyed annually for 7 years. So 4165 records in data set with these variables: LWAGE = log of hourly wage (dependent variable) FEM = 1 if female, 0 if male T = year of the survey, 1 to 7 ID = ID number for each person 23

24 Repeated Measures Example No missing data in original sample. SAS Estimated a linear model with main effects of FEM and T, and their interaction. Used robust standard errors to correct for dependence PROC SURVEYREG DATA=my.wages; MODEL lwage = fem t fem*t; CLUSTER id; Stata reg lwage fem t fem#c.t, cluster(id) 24

25 Repeated Measures (cont.) Estimated Regression Coefficients Parameter Estimate Standard Error t Value Pr > t Intercept <.0001 FEM <.0001 T <.0001 FEM*T

26 Repeated Measures With Dropout I deliberately made some data MAR. The probability of dropping out in year t+1, given response at time t was After 7 years, over half the people had dropped out. Reestimating the model with PROC SURVEYREG yielded: Estimated Regression Coefficients Parameter Estimate Standard Error t Value Pr > t Intercept <.0001 FEM <.0001 t <.0001 FEM*t

27 Mixed Model Let s estimate a random intercepts model: SAS: PROC MIXED DATA=my.wagemiss; MODEL lwage = fem t fem*t / SOLUTION; RANDOM INTERCEPT / SUBJECT=id; Stata: xtreg lwage fem t fem#c.t, i(id) mle 27

28 PROC MIXED Output Effect Estimate Standard Error DF t Value Pr > t Intercept <.0001 FEM <.0001 t <.0001 FEM*t Number of Observations Number of Observations Read 4165 Number of Observations Used 2834 Number of Observations Not Used

29 Binary Example Data source: National Drug Abuse Treatment Clinical Trials Network sponsored by National Institute on Drug Abuse Sample: 134 opioid-addicted youths, half randomly assigned to a new drug treatment over a 12-week period. Other half got standard detox therapy. Dependent variable = 1 if subject tested positive for opiates in each of 12 weeks, otherwise 0. A lot of missing data: Percentage missing at each week ranged from 10% to 63%. Model: Mixed logistic regression where p is probability of positive test, z is treatment indicator and t is time: 29

30 PROC GLIMMIX SAS: PROC GLIMMIX DATA=my.nidalong METHOD=QUAD(QPOINTS=5); CLASS usubjid; WHERE week NE 0; MODEL opiates = week week treat / D=B SOLUTION ; RANDOM INTERCEPT / SUBJECT=usubjid; Stata: xtlogit opiates c.week##c.week##treat if week~=0, i(usubjid) i(usubjid) 30

31 GLIMMIX Output Solutions for Fixed Effects Standard Effect Estimate Error DF t Value Pr > t Intercept treat week treat*week <.0001 week*week treat*week*week <

32 Figure 1. Predicted Probability of a Positive Test in Weeks 1 Through 12, for Treatment and Control Groups. 32

33 Missing Data on Predictors Scenario 3: Linear models with missing data on predictors. Mixed modeling software does nothing for cases with observed y but missing x s. Those cases are deleted, losing potentially useful information. Solution: Full information ML using structural equation modeling (SEM) software. 33

34 Full Information ML Also known as raw ML or direct ML Directly maximize the likelihood for the specified model Several computer packages can do this for any LISREL model Amos ( Mplus ( LISREL ( MX (freeware) (views.vcu.edu/mx) EQS ( PROC CALIS 9.22 (support.sas.com) Stata 12 ( 34

35 NLSY Example 581 children surveyed in 1990 as part of the National Longitudinal Survey of Youth. Variables: ANTI antisocial behavior, a scale ranging from 0 to 6. SELF self-esteem, a scale ranging from 6 to 24. POV poverty status of family, 1 if in poverty, else 0. BLACK 1 if child is black, otherwise 0 HISPANIC 1 if child is Hispanic, otherwise 0 CHILDAGE child s age in 1990 DIVORCE 1 if mother was divorced in 1990, otherwise 0 GENDER MOMAGE 1 if female, 0 if male mother s age at birth of child MOMWORK 1 if mother was employed in 1990, otherwise 0 Data missing on SELF, POV, BLACK, HISPANIC, MOMWORK 35

36 Listwise Deletion Goal is to estimate a linear regression with ANTI as the dependent variable. Listwise deletion removes 386 of the 581 cases: Parameter Estimates Variable DF Parameter Estimate Standard Error t Value Pr > t Intercept self pov black hispanic childage divorce gender momage momwork

37 FIML in SAS and Stata Like MCMC (in SAS and Stata), FIML assumes SAS Missing at random. Multivariate normality. PROC CALIS DATA=nlsymiss METHOD=FIML; PATH anti <- self pov black hispanic childage divorce gender momage momwork; Stata sem anti <- self pov black hispanic childage divorce gender divorce gender momage momwork, method(mlmv) method(mlmv) 37

38 CALIS Output PATH List Path Parameter Estimate Standard Error t Value anti <--- self _Parm anti <--- pov _Parm anti <--- black _Parm anti <--- hispanic _Parm anti <--- childage _Parm anti <--- divorce _Parm anti <--- gender _Parm anti <--- momage _Parm anti <--- momwork _Parm

39 Beyond Linear Regression. CALIS and sem do FIML for a wide class of linear models Non-recursive simultaneous equations Latent variables Fixed and random effects for longitudinal data. Mplus ( does the same, but it also does FIML for Logistic regression (binary, ordinal, multinomial). Poisson and negative binomial regression. Cox regression and piecewise exponential regression for survival data. Models in which data are not missing at random. 39

40 Data Not Missing at Random Sometimes we suspect that data are not missing at random: people with high incomes may be less likely to report their income people who ve been arrested may be less likely to report arrest status. If data are NMAR, correct inference requires that the missing data mechanism be modeled as part of the inference. IF the model is correct, both ML and MI can produce optimal inferences in NMAR situations. Problems: 1. Data contain no information to determine the correct model for the missing data mechanism. 2. Inference may be very sensitive to model choice. 40

41 Selection Models Heckman s model for selection bias Designed for NMAR data on the dependent variable in a linear regression. Linear model for Y: Y i = βx i + ε i where ε i is normal with mean 0, constant variance, and uncorrelated with X. Probit model for R: Pr(R i =1 Y i,x i ) = Φ(a 0 + a 1 Y i + a 2 X i ) where Φ(.) is the cumulative distribution function for a standard normal variable. If a 1 =0, data are MAR. If a 1 =a 2 = 0, data are MCAR 41

42 Heckman s Model (cont.) This model implies a likelihood function whose parameters are all identified. It can be maximized in SAS with PROC QLIM or in Stata with heckman command. Example: Women s labor force participation. Variable Label N Mean Minimum Maximum INLF NWIFEINC EDUC EXPER EXPERSQ AGE KIDSLT6 KIDSGE6 LWAGE =1 if in lab force, 1975 (faminc - wage*hours)/1000 years of schooling actual labor mkt exper exper**2 woman's age in yrs # kids < 6 years # kids 6-18 log(wage)

43 PROC QLIM DATA=my.mroz; MODEL inlf = nwifeinc educ exper expersq age kidslt6 kidsge6 /DISCRETE; MODEL lwage = educ exper expersq / SELECT(inlf=1); Parameter Estimates Parameter DF Estimate Standard Error t Value Approx Pr > t LWAGE.Intercept LWAGE.EDUC <.0001 LWAGE.EXPER LWAGE.EXPERSQ _Sigma.LWAGE <.0001 INLF.Intercept INLF.NWIFEINC INLF.EDUC <.0001 INLF.EXPER <.0001 INLF.EXPERSQ INLF.AGE <.0001 INLF.KIDSLT < INLF.KIDSGE _Rho

44 Conclusion Whenever possible, ML is preferable to MI for handling missing data. Deterministic result. No conflict between imputation and analysis model. Fewer uncertain decisions. Usually, the same assumptions. For many scenarios, ML is readily available in standard packages Listwise deletion for missing only on response variable. Mixed models for repeated measures with missing only on response variable. SEM software for missing predictors. Heckman models for NMAR on response variable. For other scenarios, use Mplus. 44

45 More information To get a pdf of the paper on which this talk is based, go to support.sas.com/resources/papers/ proceedings12/ pdf For information on short courses on a variety of statistical topics go to 45

### Missing Data: Part 1 What to Do? Carol B. Thompson Johns Hopkins Biostatistics Center SON Brown Bag 3/20/13

Missing Data: Part 1 What to Do? Carol B. Thompson Johns Hopkins Biostatistics Center SON Brown Bag 3/20/13 Overview Missingness and impact on statistical analysis Missing data assumptions/mechanisms Conventional

### Missing Data. Paul D. Allison INTRODUCTION

4 Missing Data Paul D. Allison INTRODUCTION Missing data are ubiquitous in psychological research. By missing data, I mean data that are missing for some (but not all) variables and for some (but not all)

### MISSING DATA TECHNIQUES WITH SAS. IDRE Statistical Consulting Group

MISSING DATA TECHNIQUES WITH SAS IDRE Statistical Consulting Group ROAD MAP FOR TODAY To discuss: 1. Commonly used techniques for handling missing data, focusing on multiple imputation 2. Issues that could

### Missing Data Techniques for Structural Equation Modeling

Journal of Abnormal Psychology Copyright 2003 by the American Psychological Association, Inc. 2003, Vol. 112, No. 4, 545 557 0021-843X/03/\$12.00 DOI: 10.1037/0021-843X.112.4.545 Missing Data Techniques

### Analyzing Structural Equation Models With Missing Data

Analyzing Structural Equation Models With Missing Data Craig Enders* Arizona State University cenders@asu.edu based on Enders, C. K. (006). Analyzing structural equation models with missing data. In G.

### Problem of Missing Data

VASA Mission of VA Statisticians Association (VASA) Promote & disseminate statistical methodological research relevant to VA studies; Facilitate communication & collaboration among VA-affiliated statisticians;

### A Basic Introduction to Missing Data

John Fox Sociology 740 Winter 2014 Outline Why Missing Data Arise Why Missing Data Arise Global or unit non-response. In a survey, certain respondents may be unreachable or may refuse to participate. Item

### Handling attrition and non-response in longitudinal data

Longitudinal and Life Course Studies 2009 Volume 1 Issue 1 Pp 63-72 Handling attrition and non-response in longitudinal data Harvey Goldstein University of Bristol Correspondence. Professor H. Goldstein

### Missing Data Dr Eleni Matechou

1 Statistical Methods Principles Missing Data Dr Eleni Matechou matechou@stats.ox.ac.uk References: R.J.A. Little and D.B. Rubin 2nd edition Statistical Analysis with Missing Data J.L. Schafer and J.W.

### Multiple Imputation for Missing Data: A Cautionary Tale

Multiple Imputation for Missing Data: A Cautionary Tale Paul D. Allison University of Pennsylvania Address correspondence to Paul D. Allison, Sociology Department, University of Pennsylvania, 3718 Locust

ALLISON 1 TABLE OF CONTENTS 1. INTRODUCTION... 3 2. ASSUMPTIONS... 6 MISSING COMPLETELY AT RANDOM (MCAR)... 6 MISSING AT RANDOM (MAR)... 7 IGNORABLE... 8 NONIGNORABLE... 8 3. CONVENTIONAL METHODS... 10

### Review of the Methods for Handling Missing Data in. Longitudinal Data Analysis

Int. Journal of Math. Analysis, Vol. 5, 2011, no. 1, 1-13 Review of the Methods for Handling Missing Data in Longitudinal Data Analysis Michikazu Nakai and Weiming Ke Department of Mathematics and Statistics

### A REVIEW OF CURRENT SOFTWARE FOR HANDLING MISSING DATA

123 Kwantitatieve Methoden (1999), 62, 123-138. A REVIEW OF CURRENT SOFTWARE FOR HANDLING MISSING DATA Joop J. Hox 1 ABSTRACT. When we deal with a large data set with missing data, we have to undertake

### Handling missing data in Stata a whirlwind tour

Handling missing data in Stata a whirlwind tour 2012 Italian Stata Users Group Meeting Jonathan Bartlett www.missingdata.org.uk 20th September 2012 1/55 Outline The problem of missing data and a principled

### When to Use Which Statistical Test

When to Use Which Statistical Test Rachel Lovell, Ph.D., Senior Research Associate Begun Center for Violence Prevention Research and Education Jack, Joseph, and Morton Mandel School of Applied Social Sciences

### 2. Making example missing-value datasets: MCAR, MAR, and MNAR

Lecture 20 1. Types of missing values 2. Making example missing-value datasets: MCAR, MAR, and MNAR 3. Common methods for missing data 4. Compare results on example MCAR, MAR, MNAR data 1 Missing Data

### CHAPTER 6 EXAMPLES: GROWTH MODELING AND SURVIVAL ANALYSIS

Examples: Growth Modeling And Survival Analysis CHAPTER 6 EXAMPLES: GROWTH MODELING AND SURVIVAL ANALYSIS Growth models examine the development of individuals on one or more outcome variables over time.

### CHAPTER 3 EXAMPLES: REGRESSION AND PATH ANALYSIS

Examples: Regression And Path Analysis CHAPTER 3 EXAMPLES: REGRESSION AND PATH ANALYSIS Regression analysis with univariate or multivariate dependent variables is a standard procedure for modeling relationships

### Sensitivity Analysis in Multiple Imputation for Missing Data

Paper SAS270-2014 Sensitivity Analysis in Multiple Imputation for Missing Data Yang Yuan, SAS Institute Inc. ABSTRACT Multiple imputation, a popular strategy for dealing with missing values, usually assumes

### Generalized Linear Models. Today: definition of GLM, maximum likelihood estimation. Involves choice of a link function (systematic component)

Generalized Linear Models Last time: definition of exponential family, derivation of mean and variance (memorize) Today: definition of GLM, maximum likelihood estimation Include predictors x i through

### Missing Data & How to Deal: An overview of missing data. Melissa Humphries Population Research Center

Missing Data & How to Deal: An overview of missing data Melissa Humphries Population Research Center Goals Discuss ways to evaluate and understand missing data Discuss common missing data methods Know

### Introduction to mixed model and missing data issues in longitudinal studies

Introduction to mixed model and missing data issues in longitudinal studies Hélène Jacqmin-Gadda INSERM, U897, Bordeaux, France Inserm workshop, St Raphael Outline of the talk I Introduction Mixed models

### SPSS TRAINING SESSION 3 ADVANCED TOPICS (PASW STATISTICS 17.0) Sun Li Centre for Academic Computing lsun@smu.edu.sg

SPSS TRAINING SESSION 3 ADVANCED TOPICS (PASW STATISTICS 17.0) Sun Li Centre for Academic Computing lsun@smu.edu.sg IN SPSS SESSION 2, WE HAVE LEARNT: Elementary Data Analysis Group Comparison & One-way

### Outline. Topic 4 - Analysis of Variance Approach to Regression. Partitioning Sums of Squares. Total Sum of Squares. Partitioning sums of squares

Topic 4 - Analysis of Variance Approach to Regression Outline Partitioning sums of squares Degrees of freedom Expected mean squares General linear test - Fall 2013 R 2 and the coefficient of correlation

### CHAPTER 9 EXAMPLES: MULTILEVEL MODELING WITH COMPLEX SURVEY DATA

Examples: Multilevel Modeling With Complex Survey Data CHAPTER 9 EXAMPLES: MULTILEVEL MODELING WITH COMPLEX SURVEY DATA Complex survey data refers to data obtained by stratification, cluster sampling and/or

### Combining Multiple Imputation and Inverse Probability Weighting

Combining Multiple Imputation and Inverse Probability Weighting Shaun Seaman 1, Ian White 1, Andrew Copas 2,3, Leah Li 4 1 MRC Biostatistics Unit, Cambridge 2 MRC Clinical Trials Unit, London 3 UCL Research

### Dealing with Missing Data

Res. Lett. Inf. Math. Sci. (2002) 3, 153-160 Available online at http://www.massey.ac.nz/~wwiims/research/letters/ Dealing with Missing Data Judi Scheffer I.I.M.S. Quad A, Massey University, P.O. Box 102904

### HURDLE AND SELECTION MODELS Jeff Wooldridge Michigan State University BGSE/IZA Course in Microeconometrics July 2009

HURDLE AND SELECTION MODELS Jeff Wooldridge Michigan State University BGSE/IZA Course in Microeconometrics July 2009 1. Introduction 2. A General Formulation 3. Truncated Normal Hurdle Model 4. Lognormal

### Imputing Missing Data using SAS

ABSTRACT Paper 3295-2015 Imputing Missing Data using SAS Christopher Yim, California Polytechnic State University, San Luis Obispo Missing data is an unfortunate reality of statistics. However, there are

### ECON 142 SKETCH OF SOLUTIONS FOR APPLIED EXERCISE #2

University of California, Berkeley Prof. Ken Chay Department of Economics Fall Semester, 005 ECON 14 SKETCH OF SOLUTIONS FOR APPLIED EXERCISE # Question 1: a. Below are the scatter plots of hourly wages

### Dealing with missing data: Key assumptions and methods for applied analysis

Technical Report No. 4 May 6, 2013 Dealing with missing data: Key assumptions and methods for applied analysis Marina Soley-Bori msoley@bu.edu This paper was published in fulfillment of the requirements

### 1 Logit & Probit Models for Binary Response

ECON 370: Limited Dependent Variable 1 Limited Dependent Variable Econometric Methods, ECON 370 We had previously discussed the possibility of running regressions even when the dependent variable is dichotomous

### ESTIMATING AVERAGE TREATMENT EFFECTS: IV AND CONTROL FUNCTIONS, II Jeff Wooldridge Michigan State University BGSE/IZA Course in Microeconometrics

ESTIMATING AVERAGE TREATMENT EFFECTS: IV AND CONTROL FUNCTIONS, II Jeff Wooldridge Michigan State University BGSE/IZA Course in Microeconometrics July 2009 1. Quantile Treatment Effects 2. Control Functions

### Lecture 27: Introduction to Correlated Binary Data

Lecture 27: Introduction to Correlated Binary Data Dipankar Bandyopadhyay, Ph.D. BMTRY 711: Analysis of Categorical Data Spring 2011 Division of Biostatistics and Epidemiology Medical University of South

### Introduction to Multilevel Modeling Using HLM 6. By ATS Statistical Consulting Group

Introduction to Multilevel Modeling Using HLM 6 By ATS Statistical Consulting Group Multilevel data structure Students nested within schools Children nested within families Respondents nested within interviewers

### Missing Data in Longitudinal Studies: To Impute or not to Impute? Robert Platt, PhD McGill University

Missing Data in Longitudinal Studies: To Impute or not to Impute? Robert Platt, PhD McGill University 1 Outline Missing data definitions Longitudinal data specific issues Methods Simple methods Multiple

### Econometrics II. Lecture 9: Sample Selection Bias

Econometrics II Lecture 9: Sample Selection Bias Måns Söderbom 5 May 2011 Department of Economics, University of Gothenburg. Email: mans.soderbom@economics.gu.se. Web: www.economics.gu.se/soderbom, www.soderbom.net.

### CHAPTER 8 EXAMPLES: MIXTURE MODELING WITH LONGITUDINAL DATA

Examples: Mixture Modeling With Longitudinal Data CHAPTER 8 EXAMPLES: MIXTURE MODELING WITH LONGITUDINAL DATA Mixture modeling refers to modeling with categorical latent variables that represent subpopulations

### Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus

Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus Tihomir Asparouhov and Bengt Muthén Mplus Web Notes: No. 15 Version 8, August 5, 2014 1 Abstract This paper discusses alternatives

### GETTING STARTED: STATA & R BASIC COMMANDS ECONOMETRICS II. Stata Output Regression of wages on education

GETTING STARTED: STATA & R BASIC COMMANDS ECONOMETRICS II Stata Output Regression of wages on education. sum wage educ Variable Obs Mean Std. Dev. Min Max -------------+--------------------------------------------------------

### Models for Count Data With Overdispersion

Models for Count Data With Overdispersion Germán Rodríguez November 6, 2013 Abstract This addendum to the WWS 509 notes covers extra-poisson variation and the negative binomial model, with brief appearances

### Missing data and net survival analysis Bernard Rachet

Workshop on Flexible Models for Longitudinal and Survival Data with Applications in Biostatistics Warwick, 27-29 July 2015 Missing data and net survival analysis Bernard Rachet General context Population-based,

### Multinomial and Ordinal Logistic Regression

Multinomial and Ordinal Logistic Regression ME104: Linear Regression Analysis Kenneth Benoit August 22, 2012 Regression with categorical dependent variables When the dependent variable is categorical,

### PS 271B: Quantitative Methods II. Lecture Notes

PS 271B: Quantitative Methods II Lecture Notes Langche Zeng zeng@ucsd.edu The Empirical Research Process; Fundamental Methodological Issues 2 Theory; Data; Models/model selection; Estimation; Inference.

### Handling missing data in large data sets. Agostino Di Ciaccio Dept. of Statistics University of Rome La Sapienza

Handling missing data in large data sets Agostino Di Ciaccio Dept. of Statistics University of Rome La Sapienza The problem Often in official statistics we have large data sets with many variables and

### Raul Cruz-Cano, HLTH653 Spring 2013

Multilevel Modeling-Logistic Schedule 3/18/2013 = Spring Break 3/25/2013 = Longitudinal Analysis 4/1/2013 = Midterm (Exercises 1-5, not Longitudinal) Introduction Just as with linear regression, logistic

### LMM: Linear Mixed Models and FEV1 Decline

LMM: Linear Mixed Models and FEV1 Decline We can use linear mixed models to assess the evidence for differences in the rate of decline for subgroups defined by covariates. S+ / R has a function lme().

### Overview. Longitudinal Data Variation and Correlation Different Approaches. Linear Mixed Models Generalized Linear Mixed Models

Overview 1 Introduction Longitudinal Data Variation and Correlation Different Approaches 2 Mixed Models Linear Mixed Models Generalized Linear Mixed Models 3 Marginal Models Linear Models Generalized Linear

### ECON Introductory Econometrics. Lecture 15: Binary dependent variables

ECON4150 - Introductory Econometrics Lecture 15: Binary dependent variables Monique de Haan (moniqued@econ.uio.no) Stock and Watson Chapter 11 Lecture Outline 2 The linear probability model Nonlinear probability

### problem arises when only a non-random sample is available differs from censored regression model in that x i is also unobserved

4 Data Issues 4.1 Truncated Regression population model y i = x i β + ε i, ε i N(0, σ 2 ) given a random sample, {y i, x i } N i=1, then OLS is consistent and efficient problem arises when only a non-random

### I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN

Beckman HLM Reading Group: Questions, Answers and Examples Carolyn J. Anderson Department of Educational Psychology I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN Linear Algebra Slide 1 of

### Chapter 13 Introduction to Nonlinear Regression( 非 線 性 迴 歸 )

Chapter 13 Introduction to Nonlinear Regression( 非 線 性 迴 歸 ) and Neural Networks( 類 神 經 網 路 ) 許 湘 伶 Applied Linear Regression Models (Kutner, Nachtsheim, Neter, Li) hsuhl (NUK) LR Chap 10 1 / 35 13 Examples

### Craig K. Enders Arizona State University Department of Psychology craig.enders@asu.edu

Craig K. Enders Arizona State University Department of Psychology craig.enders@asu.edu Topic Page Missing Data Patterns And Missing Data Mechanisms 1 Traditional Missing Data Techniques 7 Maximum Likelihood

### IAPRI Quantitative Analysis Capacity Building Series. Multiple regression analysis & interpreting results

IAPRI Quantitative Analysis Capacity Building Series Multiple regression analysis & interpreting results How important is R-squared? R-squared Published in Agricultural Economics 0.45 Best article of the

### Introduction to Fixed Effects Methods

Introduction to Fixed Effects Methods 1 1.1 The Promise of Fixed Effects for Nonexperimental Research... 1 1.2 The Paired-Comparisons t-test as a Fixed Effects Method... 2 1.3 Costs and Benefits of Fixed

### Web-based Supplementary Materials for Bayesian Effect Estimation. Accounting for Adjustment Uncertainty by Chi Wang, Giovanni

1 Web-based Supplementary Materials for Bayesian Effect Estimation Accounting for Adjustment Uncertainty by Chi Wang, Giovanni Parmigiani, and Francesca Dominici In Web Appendix A, we provide detailed

### Missing Data. A Typology Of Missing Data. Missing At Random Or Not Missing At Random

[Leeuw, Edith D. de, and Joop Hox. (2008). Missing Data. Encyclopedia of Survey Research Methods. Retrieved from http://sage-ereference.com/survey/article_n298.html] Missing Data An important indicator

### How To Use A Monte Carlo Study To Decide On Sample Size and Determine Power

How To Use A Monte Carlo Study To Decide On Sample Size and Determine Power Linda K. Muthén Muthén & Muthén 11965 Venice Blvd., Suite 407 Los Angeles, CA 90066 Telephone: (310) 391-9971 Fax: (310) 391-8971

### CHAPTER 12 EXAMPLES: MONTE CARLO SIMULATION STUDIES

Examples: Monte Carlo Simulation Studies CHAPTER 12 EXAMPLES: MONTE CARLO SIMULATION STUDIES Monte Carlo simulation studies are often used for methodological investigations of the performance of statistical

### SHORT COURSE ON MPLUS Getting Started with Mplus HANDOUT

SHORT COURSE ON MPLUS Getting Started with Mplus HANDOUT Instructor: Cathy Zimmer 962-0516, cathy_zimmer@unc.edu INTRODUCTION a) Who am I? Who are you? b) Overview of Course i) Mplus capabilities ii) The

Mplus Short Courses Topic 2 Regression Analysis, Eploratory Factor Analysis, Confirmatory Factor Analysis, And Structural Equation Modeling For Categorical, Censored, And Count Outcomes Linda K. Muthén

### SAS Software to Fit the Generalized Linear Model

SAS Software to Fit the Generalized Linear Model Gordon Johnston, SAS Institute Inc., Cary, NC Abstract In recent years, the class of generalized linear models has gained popularity as a statistical modeling

### Survival Analysis Using SPSS. By Hui Bian Office for Faculty Excellence

Survival Analysis Using SPSS By Hui Bian Office for Faculty Excellence Survival analysis What is survival analysis Event history analysis Time series analysis When use survival analysis Research interest

### Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C

### Challenges in Longitudinal Data Analysis: Baseline Adjustment, Missing Data, and Drop-out

Challenges in Longitudinal Data Analysis: Baseline Adjustment, Missing Data, and Drop-out Sandra Taylor, Ph.D. IDDRC BBRD Core 23 April 2014 Objectives Baseline Adjustment Introduce approaches Guidance

### Binary Logistic Regression

Binary Logistic Regression Main Effects Model Logistic regression will accept quantitative, binary or categorical predictors and will code the latter two in various ways. Here s a simple model including

### A General Approach to Variance Estimation under Imputation for Missing Survey Data

A General Approach to Variance Estimation under Imputation for Missing Survey Data J.N.K. Rao Carleton University Ottawa, Canada 1 2 1 Joint work with J.K. Kim at Iowa State University. 2 Workshop on Survey

### Regression with a Binary Dependent Variable

Regression with a Binary Dependent Variable Chapter 9 Michael Ash CPPA Lecture 22 Course Notes Endgame Take-home final Distributed Friday 19 May Due Tuesday 23 May (Paper or emailed PDF ok; no Word, Excel,

### Econometrics Simple Linear Regression

Econometrics Simple Linear Regression Burcu Eke UC3M Linear equations with one variable Recall what a linear equation is: y = b 0 + b 1 x is a linear equation with one variable, or equivalently, a straight

### C: LEVEL 800 {MASTERS OF ECONOMICS( ECONOMETRICS)}

C: LEVEL 800 {MASTERS OF ECONOMICS( ECONOMETRICS)} 1. EES 800: Econometrics I Simple linear regression and correlation analysis. Specification and estimation of a regression model. Interpretation of regression

### Wednesday PM. Multiple regression. Multiple regression in SPSS. Presentation of AM results Multiple linear regression. Logistic regression

Wednesday PM Presentation of AM results Multiple linear regression Simultaneous Stepwise Hierarchical Logistic regression Multiple regression Multiple regression extends simple linear regression to consider

### SUMAN DUVVURU STAT 567 PROJECT REPORT

SUMAN DUVVURU STAT 567 PROJECT REPORT SURVIVAL ANALYSIS OF HEROIN ADDICTS Background and introduction: Current illicit drug use among teens is continuing to increase in many countries around the world.

### Introduction to Multivariate Models: Modeling Multivariate Outcomes with Mixed Model Repeated Measures Analyses

Introduction to Multivariate Models: Modeling Multivariate Outcomes with Mixed Model Repeated Measures Analyses Applied Multilevel Models for Cross Sectional Data Lecture 11 ICPSR Summer Workshop University

### Multilevel Models for Longitudinal Data. Fiona Steele

Multilevel Models for Longitudinal Data Fiona Steele Aims of Talk Overview of the application of multilevel (random effects) models in longitudinal research, with examples from social research Particular

### Re-analysis using Inverse Probability Weighting and Multiple Imputation of Data from the Southampton Women s Survey

Re-analysis using Inverse Probability Weighting and Multiple Imputation of Data from the Southampton Women s Survey MRC Biostatistics Unit Institute of Public Health Forvie Site Robinson Way Cambridge

### Using the Delta Method to Construct Confidence Intervals for Predicted Probabilities, Rates, and Discrete Changes

Using the Delta Method to Construct Confidence Intervals for Predicted Probabilities, Rates, Discrete Changes JunXuJ.ScottLong Indiana University August 22, 2005 The paper provides technical details on

### , then the form of the model is given by: which comprises a deterministic component involving the three regression coefficients (

Multiple regression Introduction Multiple regression is a logical extension of the principles of simple linear regression to situations in which there are several predictor variables. For instance if we

Statistics Graduate Courses STAT 7002--Topics in Statistics-Biological/Physical/Mathematics (cr.arr.).organized study of selected topics. Subjects and earnable credit may vary from semester to semester.

### INTRODUCTORY STATISTICS

INTRODUCTORY STATISTICS FIFTH EDITION Thomas H. Wonnacott University of Western Ontario Ronald J. Wonnacott University of Western Ontario WILEY JOHN WILEY & SONS New York Chichester Brisbane Toronto Singapore

### Dealing with Missing Data

Dealing with Missing Data Roch Giorgi email: roch.giorgi@univ-amu.fr UMR 912 SESSTIM, Aix Marseille Université / INSERM / IRD, Marseille, France BioSTIC, APHM, Hôpital Timone, Marseille, France January

### Adequacy of Biomath. Models. Empirical Modeling Tools. Bayesian Modeling. Model Uncertainty / Selection

Directions in Statistical Methodology for Multivariable Predictive Modeling Frank E Harrell Jr University of Virginia Seattle WA 19May98 Overview of Modeling Process Model selection Regression shape Diagnostics

### Service courses for graduate students in degree programs other than the MS or PhD programs in Biostatistics.

Course Catalog In order to be assured that all prerequisites are met, students must acquire a permission number from the education coordinator prior to enrolling in any Biostatistics course. Courses are

### Introduction to Regression and Data Analysis

Statlab Workshop Introduction to Regression and Data Analysis with Dan Campbell and Sherlock Campbell October 28, 2008 I. The basics A. Types of variables Your variables may take several forms, and it

### Efficient and Practical Econometric Methods for the SLID, NLSCY, NPHS

Efficient and Practical Econometric Methods for the SLID, NLSCY, NPHS Philip Merrigan ESG-UQAM, CIRPÉE Using Big Data to Study Development and Social Change, Concordia University, November 2103 Intro Longitudinal

### Data Mining and Data Warehousing. Henryk Maciejewski. Data Mining Predictive modelling: regression

Data Mining and Data Warehousing Henryk Maciejewski Data Mining Predictive modelling: regression Algorithms for Predictive Modelling Contents Regression Classification Auxiliary topics: Estimation of prediction

### Best Practices for Missing Data Management in Counseling Psychology

Journal of Counseling Psychology 2010 American Psychological Association 2010, Vol. 57, No. 1, 1 10 0022-0167/10/\$12.00 DOI: 10.1037/a0018082 Best Practices for Missing Data Management in Counseling Psychology

### Poisson Models for Count Data

Chapter 4 Poisson Models for Count Data In this chapter we study log-linear models for count data under the assumption of a Poisson error structure. These models have many applications, not only to the

### Nominal and ordinal logistic regression

Nominal and ordinal logistic regression April 26 Nominal and ordinal logistic regression Our goal for today is to briefly go over ways to extend the logistic regression model to the case where the outcome

### HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION

HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION HOD 2990 10 November 2010 Lecture Background This is a lightning speed summary of introductory statistical methods for senior undergraduate

### MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS

MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS MSR = Mean Regression Sum of Squares MSE = Mean Squared Error RSS = Regression Sum of Squares SSE = Sum of Squared Errors/Residuals α = Level of Significance

### The Probit Link Function in Generalized Linear Models for Data Mining Applications

Journal of Modern Applied Statistical Methods Copyright 2013 JMASM, Inc. May 2013, Vol. 12, No. 1, 164-169 1538 9472/13/\$95.00 The Probit Link Function in Generalized Linear Models for Data Mining Applications

### From the help desk: Bootstrapped standard errors

The Stata Journal (2003) 3, Number 1, pp. 71 80 From the help desk: Bootstrapped standard errors Weihua Guan Stata Corporation Abstract. Bootstrapping is a nonparametric approach for evaluating the distribution

### Lecture - 32 Regression Modelling Using SPSS

Applied Multivariate Statistical Modelling Prof. J. Maiti Department of Industrial Engineering and Management Indian Institute of Technology, Kharagpur Lecture - 32 Regression Modelling Using SPSS (Refer

### Analyzing Intervention Effects: Multilevel & Other Approaches. Simplest Intervention Design. Better Design: Have Pretest

Analyzing Intervention Effects: Multilevel & Other Approaches Joop Hox Methodology & Statistics, Utrecht Simplest Intervention Design R X Y E Random assignment Experimental + Control group Analysis: t

### Missing data: the hidden problem

white paper Missing data: the hidden problem Draw more valid conclusions with SPSS Missing Data Analysis white paper Missing data: the hidden problem 2 Just about everyone doing analysis has some missing

### Basics of Statistical Machine Learning

CS761 Spring 2013 Advanced Machine Learning Basics of Statistical Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu Modern machine learning is rooted in statistics. You will find many familiar

### IBM SPSS Missing Values 22

IBM SPSS Missing Values 22 Note Before using this information and the product it supports, read the information in Notices on page 23. Product Information This edition applies to version 22, release 0,