# Imputing Missing Data using SAS

 To view this video please enable JavaScript, and consider upgrading to a web browser that supports HTML5 video
Save this PDF as:

Size: px
Start display at page:

## Transcription

2 Multiple imputation is the last strategy that will be discussed. Instead of attempting to estimate each value and using these estimates to predict the parameters, this method draws a random sample of the missing values from its distribution. This method involves 3 steps, creating multiple imputed data sets, carrying out the analysis model on each of the imputed data sets, and then combining parameter estimates to get a single set of parameter estimates. Steps 2 and 3 are fairly mechanical (see documentation), however, step 1 requires some planning. The method of choice depends on the patterns of missingness. For data sets with monotone continuous missing patterns, one can use stochastic regression as discussed earlier. For data sets with arbitrary missing patterns, it is suggested to use the Markov Chain Monte Carlo (MCMC) method ( Multiple Imputation in SAS: part 1 ). In short this is very similar to maximum likelihood. It estimates the missing values, obtains new parameter estimates and then uses those estimates to predict the missing values again. However, the parameter estimates are derived using Bayesian estimation of the mean vector and covariance matrix. When using multiple imputation, the number of imputed data sets must be specified and as few as three to five data sets can be adequate. However, the larger the percentage of missing data, the more imputations are necessary to get an accurate estimate. Unfortunately, there is not a simple criterion to determine when MCMC has converged. The technique that is commonly used is to visually inspect the parameters from successive iterations to determine if there is a clear trend. If there is no trend, this suggests that the model has reached a stationary distribution. In addition, autocorrelation may be possible between imputed datasets. It is best to make sure that the new dataset does not have this feature. This method requires a fairly large sample size. After the multiple imputation is complete, the predetermined model is then used on each set and the parameter estimates are combined to obtain a singular set of estimates. METHODS The primary goal of this paper is to compare and contrast the previously discussed methods of imputation for missing data. To do this, various versions of the same data set were simulated with random values set to missing. The data used for the simulations was based on a fabricated example data set that measured a group of subject s seizure occurrences within a time period of 2 weeks. Three predictor variables (treatment, baseline seizure count, and age) were recorded for 100 observations, where each had 4 time periods. For the sake of the simulations, no censoring was incorporated in the fabrication of the data set. The general trend of the example data set responses was linear, but this may not be the case for all datasets. The results of these simulations are obviously dependent on the dataset and may differ for other kinds of data. For example, if responses increase with time, then time points near the end and the beginning of the time will be drastically off course. The first step to comparing the methods was deciding on a model to use for the data. The 4 time period seizure counts were modeled using the 3 predictor variables: what treatment they were assigned, their baseline seizure count, and their age. After the model was fit, the parameter estimates were recorded as the true parameter values (as a baseline comparison). After that, a certain percentage of responses were dropped randomly at the following rates: 5%, 10%, 15%, 20%, and 25%. After imputing, the same model previously used was fit, and the new parameter estimates were record and then compared across data sets. This was done 1000 times for each of the 5 missing rates. Some of these imputation methods are easy to incorporate in SAS, demonstrating their appeal of convenience. List-wise deletion can be done simply by using the DELETE statement as shown in the following program. DATA ListWiseExample; SET test; ARRAY var[*] Response1 Response2 ; DO i=1 TO dim(var); IF missing(var(i)) THEN DELETE; END; 2

3 The remainder of the methods, stochastic regression, maximum likelihood estimation and multiple imputation can be all run using the MI procedure. PROC MI has the following structure: PROC MI <options> BY variables; CLASS variables; [Imputation method]; FREQ variable; TRANSFORM transform (variables </options>); VAR variables; The BY statement specifies groups to stratify imputations analysis. The CLASS statement lists which variables are categorical variables. If the variable exists in the data set, the FREQ statement specifies the frequency of occurrence. TRANSFORM specifies the variables to be transformed before imputing. The VAR statement specifies the numeric variables to be analyzed/imputed. To choose which imputation method you want, you have 4 options. If the data is missing at random, you would use EM (expectation maximization - MLE), FCS (fully conditional specification - Regression), or MCMC (Markov Chain Monte Carlo). If you know that your data has monotone missingness, you would use the MONOTONE statement to impute. The simulation data example is assumed to be missing at random and thus EM, FCS and MCMC are the options that are to be used. To impute data using stochastic regression, the FCS statement is used. Within the FCS statement, you can impute using a discriminant function, logistic regression, regression, or predictive mean matching. Including the FCS statement will automatically use regression for continuous variables and discriminant functions for categorical variables. For each imputed variable, all other variables in the VAR statement are used as covariates. An error term is added to each missing observation based on the observed variability of the data. As an option, you can specify the set of effect to impute the variables. PROC MI is a procedure designed to do multiple imputations; however, you can use it to do single imputations by specifying in the options that NIMPUTE (number of imputations) is equal to 1. In addition, in the FCS statement, make sure to include that NBITER (number of burn in iterations) equals 1. This will ensure 1 iteration of imputation and 1 set of imputed data. Lastly, to output the imputed data, use the OUT = option. The imputation for the example simulation data using stochastic regression is shown in the following code. PROC MI DATA = test NIMPUTE = 1 OUT = example3; CLASS Treat; FCS NBITER = 1; VAR Treat B_Count Age Response1 Response2; To use maximum likelihood estimation, instead of using the FCS statement, you would use the EM statement. In addition, specify the OUT = option to save the imputed data set. You can also output the mean and covariance matrix estimates from the MLE using the OUT = option. Do be aware that as of now, PROC MI does not utilize categorical data in MLE estimation, and a dummy variable needs to be created if you want to include that information. Use the following PROC MI step to utilize MLE: PROC MI DATA = test; EM OUT = example4; VAR Treat B_Count Age Response1 Response2; The last method is multiple imputation using MCMC. Even though there are various options for the MCMC statement, only a few were used as a focused for this paper. You can specify whether or not a single chain is used for all imputations or separate chains are used for each imputation. You can also 3

4 include a PLOTS option to view both the trace and autocorrelation function plots. This allows you to ensure convergence and independence in your iterations. You can specify the number of imputed data sets in the PROC MI statement using the NIMPUTE option. To output the imputed data sets, specify OUT = option in the PROC MI statement. The default number of burn-in iterations is 200. The MCMC imputation for the simulation example is shown in the following code. PROC MI DATA = test NIMPUTE = 6 OUT = example5; MCMC CHAIN = MULTIPLE PLOTS = TRACE PLOTS = ACF; VAR Treat B_Count Age Response1 Response2; Now that you have your missing data imputed multiple times, you can run your model and obtain your estimates for the model parameter estimates. For this paper, a multivariate regression model was used. You need to specify the OUTEST option so that a data set containing parameter estimates is output. Another way to output a data set of parameter estimates would be to use the ODS OUTPUT option. It is important to also include the COVOUT option for the next step, as shown in the following code. PROC REG DATA = example5 OUTEST = regout COVOUT; MODEL Response1 Response2 = Treat B_Count Age; BY _imputation_; For the multiple imputation a new procedure needs to be used, the MIANALYZE procedure. This procedure summarizes the multiple imputations and provides parameter estimates. PROC MIANALYZE has the following structure: PROC MIANALYZE <options>; BY variables; CLASS variables; MODELEFFECTS effects; <label:> TEST equation1<,...,<equationk>></options>; STDERR variables; The BY and CLASS statements have the standard uses. The required MODELEFFECTS statement lists the effects to be analyzed. The TEST statement tests linear hypothesis around the parameters. If the data analyzed includes both parameter estimates and standard errors as variables, use STERR statement to list the standard errors associated with effects. For the simulation in this paper, because there are multiple responses, some alterations are necessary for the MCMC imputation analysis to work. After PROC MI and PROC REG, you must use PROC MIANALYZE separately on each response variable. If no BY statement is used, an error statement about within-imputation COV matrix not being symmetric will appear. In addition, after PROC REG is finished, the dataset is sorted by imputation, not by the response variables. Therefore, a PROC SORT is also necessary as shown in the following code. PROC REG DATA = MCMCimp OUTEST = regmcmcout COVOUT; MODEL Response1 Response2 = Treat B_Count Age; BY _imputation_; PROC SORT DATA = regmcmcout; BY _DEPVAR_; PROC MIANALYZE DATA = regmcmcout; BY _DEPVAR_; MODELEFFECTS Treat B_Count Age; Run; 4

5 It is important to note that both PROC MI and PROC MIANALYZE are only appropriate with single-level observations. That is, if data has observations that are nested within clusters, these methods are not appropriate as they may result in variance estimates based toward zero and possible other biased parameters (Mistler, 2013). When comparing methods, two things were of interest for this paper. How far the simulated parameter estimates were from the true parameter estimates and how far the simulated standard deviation estimates were from the true standard deviation. Bias was calculated in the following way. Parameter Estimate Bias = β b i Standard Deviation Bias = σ β s bi In addition, the mean square error of the parameter estimate bias for each percentage and overall were investigated. MSE was calculated in the following way. 4 1,000 MSE for each percentage = [(b ij β) 2 j=1 i=1 ] /(4 x 1,000) 5 4 1,000 i=1 MSE Overall = [(b ijk β) 2 k=1 j=1 ]/(5 x 4 x 1,000) Where i is simulation iterations, j is dependent variable, and k is percentage. Lastly, how the methods were related to each was also of interest. Because each method was used on the same simulated missing dataset, methods that are similar to each will most likely both estimate either high or low on the same dataset. A scatterplot matrix of the parameter bias is provided to further explore this. RESULTS MEAN SQUARE ERROR After completing the simulations, the measured mean square error (MSE) was obtained for each method. Table 1 displays the MSE for each method by percentage of missing observations as well as overall MSE. Percentage List-Wise Stochastic Regression MLE MCMC Overall Table 1. Mean Square Error for Each Method at Various Percentages of Missing Data 5

6 For each of the methods investigated, the MSE increased as the percentage of missing observations increased. List-Wise had the highest MSE for each percentage and overall. In addition, MLE had the lowest MSE of the four methods for each percentage and overall. BIAS As previously mentioned, the different Bias Estimates for our data were also recorded. Table 2 displays the parameter estimate bias (while Table 3 displays the sample SD bias) for each method. In addition both tables provide 95% confidence intervals on the biases. If the confidence interval does not contain zero, the method at that percentage was considered to be biased either high or low, as shown in bold. That is, a biased high estimate will have a confidence interval with only positive bounds and vice versa. Otherwise, the method was considered to be unbiased. Missing Percentage List-Wise Deletion Stochastic Regression MLE MCMC Mean ( , ) ( , ) (0.0070, ) ( , ) ( , ) Mean (0.0067, ) (0.0010, ) (0.0151, ) ( , ) ( , ) Mean (0.0041, ) (0.0045, ) (0.0326, ) (0.0026, ) (0.0101, ) Mean (0.0046, ) (0.0012, ) (0.0321, ) (0.0003, ) (0.0091, ) Table 2. Parameter Estimate Bias and Confidence Interval for Each Method at Various Percentages of Missing Observations. In terms of parameter estimates, biases performed in the following way. List-Wise was in general unbiased, except for at 15% missing. This is most likely an artifact of the fact that the data was missing at random, and not due the particular missing percentage. Because the data were missing at random, we essentially have a smaller sample size when using this method and our expected parameter estimate would not change. However, for the other methods they were in general biased high. MLE and MCMC were biased high for all percentages while Stochastic Regression was biased high only for 5%, 10% and 15%. Missing Percentage List-Wise Deletion Mean ( , ) ( , ) ( , ) ( , ) ( , ) 6

7 Stochastic Regression MLE MCMC Mean ( , ) ( , ) ( , ) ( , ) ( , ) Mean (0.0144, ) (0.0315, ) (0.0461, ) (0.0731, ) (0.0948, ) Mean ( , ) ( , ) ( , ) ( , ) ( , ) Table 3. Standard Deviation Bias and Confidence Interval for Each Method at Various Percentages of Missing Observations. The last output provides a comparison of bias for each method. Figure one displays these comparisons. Figure 1. Scatterplot Matrix of the Parameter Estimate of each Method. 7

8 CONCLUSION As seen above, the MSE for MLE and MCMC were consistently the smallest out of the four methods. For each increase in missing percentage, list-wise deletion s MSE increased on average by Regression increased on average by 0.157, MLE by 0.095, and MCMC by Based on this, it appears MLE and MCMC deal with higher missing percentages better than the other methods. For the parameter estimates, List-Wise Deletion was unbiased the most. However this result cannot be expected across all data sets. The unbiasedness of the results is due to the fact that the simulation data was designed to be missing at random. However, the slightest relationship between observations and their completeness will skew the parameter estimates. The next best method based on the simulation is stochastic regression. If your data is linear with respect to both predictor variables and response variables, stochastic regression imputation will be the most precise method. However, unless this is known a priori, it is safer to use MLE and MCMC as they are more robust methods. For the standard deviation (SD) estimates, the List-Wise Deletion underestimated the standard deviation consistently and severely. Stochastic Regression underestimated for smaller percentages of missingness, but for 20% and 25%, the SD estimate was on point. MLE overestimated consistently for all missing rates and MCMC underestimated consistently. However, none of them are as drastic as the list-wise deletion method. The scatterplot matrix reveals a couple of interesting things. MCMC and MLE imputation methods appear to be very similar to each other. Regression also appears to be equally related to MLE and MCMC. However list-wise deletion is nothing like the other methods and is fairly inconsistent. For this analysis to be truly comprehensive, the same method should be used on various datasets with different attributes. In addition, as an improvement on the methods used here, a greater population of observations could be created that could be sampled from and then each sample would have random recordings removed. Plans for future analysis include exploring datasets with categorical responses and using different models to see how each method estimates. REFERENCES Mistler, Stephen A SAS Macro for Applying Multiple Imputation to Multilevel Data. SAS Global Forum 2013 Proceedings. Paper Multiple Imputation in SAS, Part 1. UCLA: Statistical Computing Seminars. January 15th, Available at Multiple Imputation in SAS, Part 2. UCLA: Statistical Computing Seminars. January 15th, Available at Yuan, Yang Multiple Imputation for Missing Data: Concepts and New Development. SAS Users Group International 25 Proceedings. P ACKNOWLEDGMENTS I would like to thank Rebecca Ottesen for motivating and encouraging me to write and submit this paper. Thank you to Dr. Andrew Schaffner for inspiring my work and guiding me with the design. Thank you to Dr. Louise Hadden for helping with editing and the general submission process. RECOMMENDED READING SAS Knowledge Base Focus Areas Multiple Imputation for Missing Data SAS Documentation: The MI Procedure SAS Documentation: the MIANALYZE Procedure 8

9 CONTACT INFORMATION Your comments and questions are valued and encouraged. Contact the author at: Christopher Yim SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. indicates USA registration. Other brand and product names are trademarks of their respective companies. 9

### MISSING DATA TECHNIQUES WITH SAS. IDRE Statistical Consulting Group

MISSING DATA TECHNIQUES WITH SAS IDRE Statistical Consulting Group ROAD MAP FOR TODAY To discuss: 1. Commonly used techniques for handling missing data, focusing on multiple imputation 2. Issues that could

### Problem of Missing Data

VASA Mission of VA Statisticians Association (VASA) Promote & disseminate statistical methodological research relevant to VA studies; Facilitate communication & collaboration among VA-affiliated statisticians;

### Missing Data: Part 1 What to Do? Carol B. Thompson Johns Hopkins Biostatistics Center SON Brown Bag 3/20/13

Missing Data: Part 1 What to Do? Carol B. Thompson Johns Hopkins Biostatistics Center SON Brown Bag 3/20/13 Overview Missingness and impact on statistical analysis Missing data assumptions/mechanisms Conventional

### Dealing with Missing Data for Credit Scoring

ABSTRACT Paper SD-08 Dealing with Missing Data for Credit Scoring Steve Fleming, Clarity Services Inc. How well can statistical adjustments account for missing data in the development of credit scores?

### Dealing with Missing Data

Res. Lett. Inf. Math. Sci. (2002) 3, 153-160 Available online at http://www.massey.ac.nz/~wwiims/research/letters/ Dealing with Missing Data Judi Scheffer I.I.M.S. Quad A, Massey University, P.O. Box 102904

### Handling attrition and non-response in longitudinal data

Longitudinal and Life Course Studies 2009 Volume 1 Issue 1 Pp 63-72 Handling attrition and non-response in longitudinal data Harvey Goldstein University of Bristol Correspondence. Professor H. Goldstein

### Sensitivity Analysis in Multiple Imputation for Missing Data

Paper SAS270-2014 Sensitivity Analysis in Multiple Imputation for Missing Data Yang Yuan, SAS Institute Inc. ABSTRACT Multiple imputation, a popular strategy for dealing with missing values, usually assumes

### A Basic Introduction to Missing Data

John Fox Sociology 740 Winter 2014 Outline Why Missing Data Arise Why Missing Data Arise Global or unit non-response. In a survey, certain respondents may be unreachable or may refuse to participate. Item

### Missing Data Dr Eleni Matechou

1 Statistical Methods Principles Missing Data Dr Eleni Matechou matechou@stats.ox.ac.uk References: R.J.A. Little and D.B. Rubin 2nd edition Statistical Analysis with Missing Data J.L. Schafer and J.W.

### Review of the Methods for Handling Missing Data in. Longitudinal Data Analysis

Int. Journal of Math. Analysis, Vol. 5, 2011, no. 1, 1-13 Review of the Methods for Handling Missing Data in Longitudinal Data Analysis Michikazu Nakai and Weiming Ke Department of Mathematics and Statistics

### Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model

Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model 1 September 004 A. Introduction and assumptions The classical normal linear regression model can be written

### Handling missing data in large data sets. Agostino Di Ciaccio Dept. of Statistics University of Rome La Sapienza

Handling missing data in large data sets Agostino Di Ciaccio Dept. of Statistics University of Rome La Sapienza The problem Often in official statistics we have large data sets with many variables and

### Missing Data. Paul D. Allison INTRODUCTION

4 Missing Data Paul D. Allison INTRODUCTION Missing data are ubiquitous in psychological research. By missing data, I mean data that are missing for some (but not all) variables and for some (but not all)

### 2. Making example missing-value datasets: MCAR, MAR, and MNAR

Lecture 20 1. Types of missing values 2. Making example missing-value datasets: MCAR, MAR, and MNAR 3. Common methods for missing data 4. Compare results on example MCAR, MAR, MNAR data 1 Missing Data

### A REVIEW OF CURRENT SOFTWARE FOR HANDLING MISSING DATA

123 Kwantitatieve Methoden (1999), 62, 123-138. A REVIEW OF CURRENT SOFTWARE FOR HANDLING MISSING DATA Joop J. Hox 1 ABSTRACT. When we deal with a large data set with missing data, we have to undertake

### Bayesian Statistics in One Hour. Patrick Lam

Bayesian Statistics in One Hour Patrick Lam Outline Introduction Bayesian Models Applications Missing Data Hierarchical Models Outline Introduction Bayesian Models Applications Missing Data Hierarchical

### CHAPTER 3 EXAMPLES: REGRESSION AND PATH ANALYSIS

Examples: Regression And Path Analysis CHAPTER 3 EXAMPLES: REGRESSION AND PATH ANALYSIS Regression analysis with univariate or multivariate dependent variables is a standard procedure for modeling relationships

### Web-based Supplementary Materials for Bayesian Effect Estimation. Accounting for Adjustment Uncertainty by Chi Wang, Giovanni

1 Web-based Supplementary Materials for Bayesian Effect Estimation Accounting for Adjustment Uncertainty by Chi Wang, Giovanni Parmigiani, and Francesca Dominici In Web Appendix A, we provide detailed

### Missing Data Techniques for Structural Equation Modeling

Journal of Abnormal Psychology Copyright 2003 by the American Psychological Association, Inc. 2003, Vol. 112, No. 4, 545 557 0021-843X/03/\$12.00 DOI: 10.1037/0021-843X.112.4.545 Missing Data Techniques

### Service courses for graduate students in degree programs other than the MS or PhD programs in Biostatistics.

Course Catalog In order to be assured that all prerequisites are met, students must acquire a permission number from the education coordinator prior to enrolling in any Biostatistics course. Courses are

### Imputing Values to Missing Data

Imputing Values to Missing Data In federated data, between 30%-70% of the data points will have at least one missing attribute - data wastage if we ignore all records with a missing value Remaining data

### Nonrandomly Missing Data in Multiple Regression Analysis: An Empirical Comparison of Ten Missing Data Treatments

Brockmeier, Kromrey, & Hogarty Nonrandomly Missing Data in Multiple Regression Analysis: An Empirical Comparison of Ten s Lantry L. Brockmeier Jeffrey D. Kromrey Kristine Y. Hogarty Florida A & M University

### Dealing with missing data: Key assumptions and methods for applied analysis

Technical Report No. 4 May 6, 2013 Dealing with missing data: Key assumptions and methods for applied analysis Marina Soley-Bori msoley@bu.edu This paper was published in fulfillment of the requirements

### APPLIED MISSING DATA ANALYSIS

APPLIED MISSING DATA ANALYSIS Craig K. Enders Series Editor's Note by Todd D. little THE GUILFORD PRESS New York London Contents 1 An Introduction to Missing Data 1 1.1 Introduction 1 1.2 Chapter Overview

### Multiple Imputation for Missing Data: A Cautionary Tale

Multiple Imputation for Missing Data: A Cautionary Tale Paul D. Allison University of Pennsylvania Address correspondence to Paul D. Allison, Sociology Department, University of Pennsylvania, 3718 Locust

### Age to Age Factor Selection under Changing Development Chris G. Gross, ACAS, MAAA

Age to Age Factor Selection under Changing Development Chris G. Gross, ACAS, MAAA Introduction A common question faced by many actuaries when selecting loss development factors is whether to base the selected

### How To Use A Monte Carlo Study To Decide On Sample Size and Determine Power

How To Use A Monte Carlo Study To Decide On Sample Size and Determine Power Linda K. Muthén Muthén & Muthén 11965 Venice Blvd., Suite 407 Los Angeles, CA 90066 Telephone: (310) 391-9971 Fax: (310) 391-8971

### Handling missing data in Stata a whirlwind tour

Handling missing data in Stata a whirlwind tour 2012 Italian Stata Users Group Meeting Jonathan Bartlett www.missingdata.org.uk 20th September 2012 1/55 Outline The problem of missing data and a principled

### Bayesian Methods. 1 The Joint Posterior Distribution

Bayesian Methods Every variable in a linear model is a random variable derived from a distribution function. A fixed factor becomes a random variable with possibly a uniform distribution going from a lower

### Missing Data in Longitudinal Studies: To Impute or not to Impute? Robert Platt, PhD McGill University

Missing Data in Longitudinal Studies: To Impute or not to Impute? Robert Platt, PhD McGill University 1 Outline Missing data definitions Longitudinal data specific issues Methods Simple methods Multiple

### Lab 8: Introduction to WinBUGS

40.656 Lab 8 008 Lab 8: Introduction to WinBUGS Goals:. Introduce the concepts of Bayesian data analysis.. Learn the basic syntax of WinBUGS. 3. Learn the basics of using WinBUGS in a simple example. Next

### Analyzing Structural Equation Models With Missing Data

Analyzing Structural Equation Models With Missing Data Craig Enders* Arizona State University cenders@asu.edu based on Enders, C. K. (006). Analyzing structural equation models with missing data. In G.

### IBM SPSS Missing Values 22

IBM SPSS Missing Values 22 Note Before using this information and the product it supports, read the information in Notices on page 23. Product Information This edition applies to version 22, release 0,

### ECON 142 SKETCH OF SOLUTIONS FOR APPLIED EXERCISE #2

University of California, Berkeley Prof. Ken Chay Department of Economics Fall Semester, 005 ECON 14 SKETCH OF SOLUTIONS FOR APPLIED EXERCISE # Question 1: a. Below are the scatter plots of hourly wages

### Comparison of Estimation Methods for Complex Survey Data Analysis

Comparison of Estimation Methods for Complex Survey Data Analysis Tihomir Asparouhov 1 Muthen & Muthen Bengt Muthen 2 UCLA 1 Tihomir Asparouhov, Muthen & Muthen, 3463 Stoner Ave. Los Angeles, CA 90066.

### HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION

HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION HOD 2990 10 November 2010 Lecture Background This is a lightning speed summary of introductory statistical methods for senior undergraduate

### Introduction to Fixed Effects Methods

Introduction to Fixed Effects Methods 1 1.1 The Promise of Fixed Effects for Nonexperimental Research... 1 1.2 The Paired-Comparisons t-test as a Fixed Effects Method... 2 1.3 Costs and Benefits of Fixed

### CHAPTER 6 EXAMPLES: GROWTH MODELING AND SURVIVAL ANALYSIS

Examples: Growth Modeling And Survival Analysis CHAPTER 6 EXAMPLES: GROWTH MODELING AND SURVIVAL ANALYSIS Growth models examine the development of individuals on one or more outcome variables over time.

### Introduction to mixed model and missing data issues in longitudinal studies

Introduction to mixed model and missing data issues in longitudinal studies Hélène Jacqmin-Gadda INSERM, U897, Bordeaux, France Inserm workshop, St Raphael Outline of the talk I Introduction Mixed models

### Multiple Imputation for Missing Data: Concepts and New Development (Version 9.0)

Multiple Imputation for Missing Data: Concepts and New Development (Version 9.0) Yang C. Yuan, SAS Institute Inc., Rockville, MD Abstract Multiple imputation provides a useful strategy for dealing with

### Missing Data. A Typology Of Missing Data. Missing At Random Or Not Missing At Random

[Leeuw, Edith D. de, and Joop Hox. (2008). Missing Data. Encyclopedia of Survey Research Methods. Retrieved from http://sage-ereference.com/survey/article_n298.html] Missing Data An important indicator

### Markov Chain Monte Carlo Simulation Made Simple

Markov Chain Monte Carlo Simulation Made Simple Alastair Smith Department of Politics New York University April2,2003 1 Markov Chain Monte Carlo (MCMC) simualtion is a powerful technique to perform numerical

### SPSS TRAINING SESSION 3 ADVANCED TOPICS (PASW STATISTICS 17.0) Sun Li Centre for Academic Computing lsun@smu.edu.sg

SPSS TRAINING SESSION 3 ADVANCED TOPICS (PASW STATISTICS 17.0) Sun Li Centre for Academic Computing lsun@smu.edu.sg IN SPSS SESSION 2, WE HAVE LEARNT: Elementary Data Analysis Group Comparison & One-way

ALLISON 1 TABLE OF CONTENTS 1. INTRODUCTION... 3 2. ASSUMPTIONS... 6 MISSING COMPLETELY AT RANDOM (MCAR)... 6 MISSING AT RANDOM (MAR)... 7 IGNORABLE... 8 NONIGNORABLE... 8 3. CONVENTIONAL METHODS... 10

### Combining information from different survey samples - a case study with data collected by world wide web and telephone

Combining information from different survey samples - a case study with data collected by world wide web and telephone Magne Aldrin Norwegian Computing Center P.O. Box 114 Blindern N-0314 Oslo Norway E-mail:

### CHAPTER 9 EXAMPLES: MULTILEVEL MODELING WITH COMPLEX SURVEY DATA

Examples: Multilevel Modeling With Complex Survey Data CHAPTER 9 EXAMPLES: MULTILEVEL MODELING WITH COMPLEX SURVEY DATA Complex survey data refers to data obtained by stratification, cluster sampling and/or

### CHAPTER 12 EXAMPLES: MONTE CARLO SIMULATION STUDIES

Examples: Monte Carlo Simulation Studies CHAPTER 12 EXAMPLES: MONTE CARLO SIMULATION STUDIES Monte Carlo simulation studies are often used for methodological investigations of the performance of statistical

### OLS is not only unbiased it is also the most precise (efficient) unbiased estimation technique - ie the estimator has the smallest variance

Lecture 5: Hypothesis Testing What we know now: OLS is not only unbiased it is also the most precise (efficient) unbiased estimation technique - ie the estimator has the smallest variance (if the Gauss-Markov

### Stephen du Toit Mathilda du Toit Gerhard Mels Yan Cheng. LISREL for Windows: PRELIS User s Guide

Stephen du Toit Mathilda du Toit Gerhard Mels Yan Cheng LISREL for Windows: PRELIS User s Guide Table of contents INTRODUCTION... 1 GRAPHICAL USER INTERFACE... 2 The Data menu... 2 The Define Variables

### Multiple Regression: What Is It?

Multiple Regression Multiple Regression: What Is It? Multiple regression is a collection of techniques in which there are multiple predictors of varying kinds and a single outcome We are interested in

### Lecture 3 : Hypothesis testing and model-fitting

Lecture 3 : Hypothesis testing and model-fitting These dark lectures energy puzzle Lecture 1 : basic descriptive statistics Lecture 2 : searching for correlations Lecture 3 : hypothesis testing and model-fitting

### Model Selection and Claim Frequency for Workers Compensation Insurance

Model Selection and Claim Frequency for Workers Compensation Insurance Jisheng Cui, David Pitt and Guoqi Qian Abstract We consider a set of workers compensation insurance claim data where the aggregate

### Geostatistics Exploratory Analysis

Instituto Superior de Estatística e Gestão de Informação Universidade Nova de Lisboa Master of Science in Geospatial Technologies Geostatistics Exploratory Analysis Carlos Alberto Felgueiras cfelgueiras@isegi.unl.pt

### Best Practices for Missing Data Management in Counseling Psychology

Journal of Counseling Psychology 2010 American Psychological Association 2010, Vol. 57, No. 1, 1 10 0022-0167/10/\$12.00 DOI: 10.1037/a0018082 Best Practices for Missing Data Management in Counseling Psychology

### EC 6310: Advanced Econometric Theory

EC 6310: Advanced Econometric Theory July 2008 Slides for Lecture on Bayesian Computation in the Nonlinear Regression Model Gary Koop, University of Strathclyde 1 Summary Readings: Chapter 5 of textbook.

: Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary

### ln(p/(1-p)) = α +β*age35plus, where p is the probability or odds of drinking

Dummy Coding for Dummies Kathryn Martin, Maternal, Child and Adolescent Health Program, California Department of Public Health ABSTRACT There are a number of ways to incorporate categorical variables into

### Bayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University caizhua@gmail.com

Bayesian Machine Learning (ML): Modeling And Inference in Big Data Zhuhua Cai Google Rice University caizhua@gmail.com 1 Syllabus Bayesian ML Concepts (Today) Bayesian ML on MapReduce (Next morning) Bayesian

### STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct

### IBM SPSS Missing Values 20

IBM SPSS Missing Values 20 Note: Before using this information and the product it supports, read the general information under Notices on p. 87. This edition applies to IBM SPSS Statistics 20 and to all

### Software Cost Estimation with Incomplete Data

890 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 27, NO. 10, OCTOBER 2001 Software Cost Estimation with Incomplete Data Kevin Strike, Khaled El Emam, and Nazim Madhavji AbstractÐThe construction of

### Problems with OLS Considering :

Problems with OLS Considering : we assume Y i X i u i E u i 0 E u i or var u i E u i u j 0orcov u i,u j 0 We have seen that we have to make very specific assumptions about u i in order to get OLS estimates

### Module 5: Multiple Regression Analysis

Using Statistical Data Using to Make Statistical Decisions: Data Multiple to Make Regression Decisions Analysis Page 1 Module 5: Multiple Regression Analysis Tom Ilvento, University of Delaware, College

### Analyzing Intervention Effects: Multilevel & Other Approaches. Simplest Intervention Design. Better Design: Have Pretest

Analyzing Intervention Effects: Multilevel & Other Approaches Joop Hox Methodology & Statistics, Utrecht Simplest Intervention Design R X Y E Random assignment Experimental + Control group Analysis: t

### Appendix A: Sampling Methods

Appendix A: Sampling Methods What is Sampling? Sampling is used in an @RISK simulation to generate possible values from probability distribution functions. These sets of possible values are then used to

### CHAPTER 8 EXAMPLES: MIXTURE MODELING WITH LONGITUDINAL DATA

Examples: Mixture Modeling With Longitudinal Data CHAPTER 8 EXAMPLES: MIXTURE MODELING WITH LONGITUDINAL DATA Mixture modeling refers to modeling with categorical latent variables that represent subpopulations

### Modern Methods for Missing Data

Modern Methods for Missing Data Paul D. Allison, Ph.D. Statistical Horizons LLC www.statisticalhorizons.com 1 Introduction Missing data problems are nearly universal in statistical practice. Last 25 years

### SAS Certificate Applied Statistics and SAS Programming

SAS Certificate Applied Statistics and SAS Programming SAS Certificate Applied Statistics and Advanced SAS Programming Brigham Young University Department of Statistics offers an Applied Statistics and

### Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus

Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus Tihomir Asparouhov and Bengt Muthén Mplus Web Notes: No. 15 Version 8, August 5, 2014 1 Abstract This paper discusses alternatives

### Getting Started with HLM 5. For Windows

For Windows August 2012 Table of Contents Section 1: Overview... 3 1.1 About this Document... 3 1.2 Introduction to HLM... 3 1.3 Accessing HLM... 3 1.4 Getting Help with HLM... 4 Section 2: Accessing Data

### Corporate Defaults and Large Macroeconomic Shocks

Corporate Defaults and Large Macroeconomic Shocks Mathias Drehmann Bank of England Andrew Patton London School of Economics and Bank of England Steffen Sorensen Bank of England The presentation expresses

### Multivariate Normal Distribution

Multivariate Normal Distribution Lecture 4 July 21, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2 Lecture #4-7/21/2011 Slide 1 of 41 Last Time Matrices and vectors Eigenvalues

### NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )

Chapter 340 Principal Components Regression Introduction is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates

### LOGISTIC REGRESSION. Nitin R Patel. where the dependent variable, y, is binary (for convenience we often code these values as

LOGISTIC REGRESSION Nitin R Patel Logistic regression extends the ideas of multiple linear regression to the situation where the dependent variable, y, is binary (for convenience we often code these values

### Official SAS Curriculum Courses

Certificate course in Predictive Business Analytics Official SAS Curriculum Courses SAS Programming Base SAS An overview of SAS foundation Working with SAS program syntax Examining SAS data sets Accessing

### Introduction to proc glm

Lab 7: Proc GLM and one-way ANOVA STT 422: Summer, 2004 Vince Melfi SAS has several procedures for analysis of variance models, including proc anova, proc glm, proc varcomp, and proc mixed. We mainly will

### Least Squares Estimation

Least Squares Estimation SARA A VAN DE GEER Volume 2, pp 1041 1045 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S Everitt & David

### MINITAB ASSISTANT WHITE PAPER

MINITAB ASSISTANT WHITE PAPER This paper explains the research conducted by Minitab statisticians to develop the methods and data checks used in the Assistant in Minitab 17 Statistical Software. One-Way

### Paper PO06. Randomization in Clinical Trial Studies

Paper PO06 Randomization in Clinical Trial Studies David Shen, WCI, Inc. Zaizai Lu, AstraZeneca Pharmaceuticals ABSTRACT Randomization is of central importance in clinical trials. It prevents selection

### Chapter 10. Key Ideas Correlation, Correlation Coefficient (r),

Chapter 0 Key Ideas Correlation, Correlation Coefficient (r), Section 0-: Overview We have already explored the basics of describing single variable data sets. However, when two quantitative variables

### Chapter 13 Introduction to Nonlinear Regression( 非 線 性 迴 歸 )

Chapter 13 Introduction to Nonlinear Regression( 非 線 性 迴 歸 ) and Neural Networks( 類 神 經 網 路 ) 許 湘 伶 Applied Linear Regression Models (Kutner, Nachtsheim, Neter, Li) hsuhl (NUK) LR Chap 10 1 / 35 13 Examples

### Statistical Analysis with Missing Data

Statistical Analysis with Missing Data Second Edition RODERICK J. A. LITTLE DONALD B. RUBIN WILEY- INTERSCIENCE A JOHN WILEY & SONS, INC., PUBLICATION Contents Preface PARTI OVERVIEW AND BASIC APPROACHES

### Data Cleaning and Missing Data Analysis

Data Cleaning and Missing Data Analysis Dan Merson vagabond@psu.edu India McHale imm120@psu.edu April 13, 2010 Overview Introduction to SACS What do we mean by Data Cleaning and why do we do it? The SACS

### Forecasting Hospital Bed Availability Using Simulation and Neural Networks

Forecasting Hospital Bed Availability Using Simulation and Neural Networks Matthew J. Daniels Michael E. Kuhl Industrial & Systems Engineering Department Rochester Institute of Technology Rochester, NY

### Reliability estimators for the components of series and parallel systems: The Weibull model

Reliability estimators for the components of series and parallel systems: The Weibull model Felipe L. Bhering 1, Carlos Alberto de Bragança Pereira 1, Adriano Polpo 2 1 Department of Statistics, University

### Chapter 15 Multiple Choice Questions (The answers are provided after the last question.)

Chapter 15 Multiple Choice Questions (The answers are provided after the last question.) 1. What is the median of the following set of scores? 18, 6, 12, 10, 14? a. 10 b. 14 c. 18 d. 12 2. Approximately

### How to report the percentage of explained common variance in exploratory factor analysis

UNIVERSITAT ROVIRA I VIRGILI How to report the percentage of explained common variance in exploratory factor analysis Tarragona 2013 Please reference this document as: Lorenzo-Seva, U. (2013). How to report

Statistics Graduate Courses STAT 7002--Topics in Statistics-Biological/Physical/Mathematics (cr.arr.).organized study of selected topics. Subjects and earnable credit may vary from semester to semester.

### INTRODUCTORY STATISTICS

INTRODUCTORY STATISTICS FIFTH EDITION Thomas H. Wonnacott University of Western Ontario Ronald J. Wonnacott University of Western Ontario WILEY JOHN WILEY & SONS New York Chichester Brisbane Toronto Singapore

### Imputing Attendance Data in a Longitudinal Multilevel Panel Data Set

Imputing Attendance Data in a Longitudinal Multilevel Panel Data Set April 2015 SHORT REPORT Baby FACES 2009 This page is left blank for double-sided printing. Imputing Attendance Data in a Longitudinal

### Testing for serial correlation in linear panel-data models

The Stata Journal (2003) 3, Number 2, pp. 168 177 Testing for serial correlation in linear panel-data models David M. Drukker Stata Corporation Abstract. Because serial correlation in linear panel-data

### Gamma Distribution Fitting

Chapter 552 Gamma Distribution Fitting Introduction This module fits the gamma probability distributions to a complete or censored set of individual or grouped data values. It outputs various statistics

### Chapter 1 Introduction. 1.1 Introduction

Chapter 1 Introduction 1.1 Introduction 1 1.2 What Is a Monte Carlo Study? 2 1.2.1 Simulating the Rolling of Two Dice 2 1.3 Why Is Monte Carlo Simulation Often Necessary? 4 1.4 What Are Some Typical Situations

### Using SAS PROC MCMC to Estimate and Evaluate Item Response Theory Models

Using SAS PROC MCMC to Estimate and Evaluate Item Response Theory Models Clement A Stone Abstract Interest in estimating item response theory (IRT) models using Bayesian methods has grown tremendously

### Distributions: Population, Sample and Sampling Distributions

119 Part 2 / Basic Tools of Research: Sampling, Measurement, Distributions, and Descriptive Statistics Chapter 9 Distributions: Population, Sample and Sampling Distributions In the three preceding chapters

### A Primer on Mathematical Statistics and Univariate Distributions; The Normal Distribution; The GLM with the Normal Distribution

A Primer on Mathematical Statistics and Univariate Distributions; The Normal Distribution; The GLM with the Normal Distribution PSYC 943 (930): Fundamentals of Multivariate Modeling Lecture 4: September

### Regression Estimation - Least Squares and Maximum Likelihood. Dr. Frank Wood

Regression Estimation - Least Squares and Maximum Likelihood Dr. Frank Wood Least Squares Max(min)imization Function to minimize w.r.t. b 0, b 1 Q = n (Y i (b 0 + b 1 X i )) 2 i=1 Minimize this by maximizing

### Parallelization Strategies for Multicore Data Analysis

Parallelization Strategies for Multicore Data Analysis Wei-Chen Chen 1 Russell Zaretzki 2 1 University of Tennessee, Dept of EEB 2 University of Tennessee, Dept. Statistics, Operations, and Management