Discussion. Seppo Laaksonen Introduction
|
|
|
- Jayson Douglas
- 9 years ago
- Views:
Transcription
1 Journal of Official Statistics, Vol. 23, No. 4, 2007, pp Discussion Seppo Laaksonen 1 1. Introduction Bjørnstad s article is a welcome contribution to the discussion on multiple imputation (MI) in survey estimation that is rarely used by national statistical institutes (NSI), as observed from for instance my own small-scale survey (Laaksonen, 2005). This survey, however, shows that MI methodology has been under discussion and investigation at some NSIs and, consequently, may actually be exploited some day. However, the big question is how to use it. I think that technical aspects are no longer its main obstacle in the sense that standard sample survey data files can be handled without problems with new computers, and even be reproduced several times by imputing. This obstacle was often mentioned in a discussion in the Journal of the American Statistical Association in 1996 (see Rubin, 1996 and its discussion). This will continue to be a problem with very large files. However, MI is not a general tool for handling missing data as Rubin would hope. I have understood especially from discussions on the web that MI is most popular among biometricists, but they focus on multivariate data analysis, not on estimating means, totals and other statistics. Are NSIs more conservative, or are there some other reasons behind this nonuse of MI? I argue that the main reasons lie in the difficulties of using MI well. This is not only due to MI itself but also, and more importantly, to the fact that NSI data files are complex and not well suited as current tools for MI. I am not here saying that MI is to blame, but the main reason is that, on the whole, it is difficult to find any good imputation method for many practical situations. It should be noted that MI is not (as it is often understood to be) a specific imputation method, but just an option for variance estimation, as Bjørnstad also points out. This, thus, means that before repeating imputations and creating several datasets, one has to concentrate on single imputation (SI) and hope to succeed in the best way at this stage. This fact, unfortunately, has often been forgotten by MI contributors. At the next stage, a good strategy should be found for repeating SIs so that variance estimates could be estimated as correctly as possible. At this stage, SI needs to be extended with a stochastic solution. This MI strategy does not work with many NSI data at all. In many practical situations it is best to only use a deterministic method and, hence, exclude MI (but still try to estimate the variances correctly). It is also difficult for me to encourage the application 1 Statistics Finland, Department of Statistics, P O Box 68, FIN Statistics Finland. [email protected] q Statistics Sweden
2 468 Journal of Official Statistics of MI in cases where sampling weights are equal to one, or small, as often occurs in business surveys, in particular. As Bjørnstad points out, the core question in imputations is, to use Rubin s term, requiring the method to be proper. This term is nice and illustrative, and in his book from 1987 and later on in several articles Rubin (1987) explains what it means. However, the properness requirements have been presented at quite a general level and it is not easy to see how they should be interpreted in practice. To me, imputation is proper if the missingness due to nonresponse and other things is traced back as well as possible from the point of view of each estimation task. This requirement also naturally concerns variation. Bjørnstad gives a more general interpretation, saying that imputations should be random draws from a posterior distribution when exploiting a Bayesian framework. Next, he concludes that the methods NSIs use for imputing for nonresponse very seldom, if ever, satisfy the requirement of being proper. This is not explicitly proven in his article, but in my opinion this argument is correct also outside NSI data files (see e.g., the discussion relating to Rubin 1996, Fay 1996, and Rao 1996). Thus, imputation results will obviously be improving as imputation strategies approach properness, whatever this term means. At the same time, the quality of point estimates will improve, too. As already mentioned, the first key requirement for good point estimates is success in SI. If several SIs have been performed, the main advantage of MI will be in the robustness of these point estimates, since some uncertainty will be reduced. However, this advantage is not expected to be substantial for simple point estimates such as averages and totals. Instead, certain quantile estimates for continuous variables could be more robust and reliable when using MI. (This comment is based on my experiment for estimating the poverty rate of Finland in the early 1990s (Laaksonen 1992). Poor households responded worse in this particular survey but I was able to use register data to impute disposable incomes for unit nonrespondents. My MI corresponds to Fay s (1996) fractionally weighted imputation. This technique was considered the best for estimating the number of poor households, but this was not the case in general. Properness is even more important for interval estimation. Rubin s well-known formula includes the two terms, which I here call the within-imputation-variance and the betweenimputation-variance. Both of these should naturally be based on best possible estimates using m completed datasets. Bjørnstad s basic point is to present an alternative approach to the calculation of the second term. He says that this result does not require any Bayesian framework and his formula is also in his opinion more applicable for NSIs. Bjørnstad develops formulas for different imputation strategies but the basic feature of each formula is essentially the same as that of the next. His coefficient for the between-imputationvariance is of type (k þ 1=m) in which k is the inverse of the response rate, whereas in Rubin s formula, k is fixed to be equal to one. Unless the between-imputation-variance is equal to zero, the new formula increases the variance estimate while the missingness rate increases. Respectively, the standard error and the confidence interval will be larger. Moreover, the imputations do not need to follow any Bayesian rule as would be the case if k ¼ 1. It is not easy to act as an adjudicator between these two parties, since both results are conditional on many things. As already discussed, the big question is how to make imputations properly concerning both SI and MI. Further I suppose that Bjørnstad s approach also requires strict rules for performing
3 Laaksonen: Discussion 469 his imputations so that the requirements of his approach will be met ideally. He does not discuss this question much, although his examples shed some light on it. I, thus, suggest that he should determine how to make imputations correctly (if not properly ) when using his non-bayesian framework. I will come back to these substantial questions in Section 3 after presenting some empirical tests in the next section. 2. Proper or Improper Imputations for Two NSI Datasets The title of this section stems from my problem in ascertaining whether a certain imputation strategy is proper or not. I hope that having seen the results from the following two patterns of examples the reader will be able to answer this question. The variable being imputed is simple and binary in the first pattern, whereas it is quite complex and continuous in the second. However, both examples are complex from the point of view of their missingness mechanisms. Awkwardly, the auxiliary variables are by no means ideal, as unfortunately is usually the case with the data of NSIs and other survey institutes. This means that it is difficult to develop proper adjustment strategies. In the first example this succeeds slightly better. I show results from several imputation approaches and from some weighting adjustments in order to illustrate the results in a broader perspective. The methods and results of these experiments are presented in the following two subsections Binary Variable Laaksonen and Chambers (2006) performed survey estimates based on several methods in a fairly simple follow-up situation. In the first phase of the survey, the response rate was about 67 per cent. Hence the follow-up of the nonrespondents was carried out based on a 40 per cent simple random sample. Due to the great efforts made in this survey, full responses were received to the simple question irrespective of whether the particular units (businesses) were innovative or not. The same question was, of course, included in the actual survey. The dataset gave several options for both point and interval estimation. In all cases, the responses of the second phase of the survey were exploited as auxiliary information. In the original article, two main strategies were used, weighting adjustments and the prediction approach. The weighting adjustments were performed in two ways, both with respondents of the first phase, but with different auxiliary data. This covered the first and second phase respondents in the first approach, but the whole initial sample in the second. The latter approach required imputation of missing values for 60 per cent of the first phase nonrespondents. These same imputed values were also exploited in the prediction approach. For this article, I performed these imputations eight times, and on each occasion the data for MI-based estimations were available. Our imputation strategy was simply to borrow each imputed value from a respondent of the second phase sample. This was done using a random draw with replacement (often called random hot-decking, but I do not like this term). The method corresponds to the strategy of the first example of Bjørnstad. Since all the draws in MIs are conducted independently of each other, I assume that the strategy is proper from both the Bayesian and the non-bayesian point of view.
4 470 Journal of Official Statistics Our weighting strategy was based on two steps: response probabilities were calculated first and rescaling was performed next so that the margins were correct. The variance estimation formula of this approach includes two additive terms. The first term is analogous to the relevant sampling variance but is based on adjusted weights. The second term shows specifically the impact of missingness and this thus increases with the missingness rate. It is analogous to the second term of the Bjørnstad formula and also resembles the variance formula of mean imputation. For details of all these formulas and the data see Laaksonen and Chambers (2006). This experiment was not real although efforts were made to make it as close to real life as possible. Consequently, we used simulations in our analysis. Hence, we were able to compare the results from different aspects but I will here only take a close look at the variance estimates. It should be noted, however, that the point estimates were fairly correct with all these strategies although not complete. The variance estimates varied more, although not dramatically. The largest variances (around 6,000) were obtained using the prediction approach and the imputation-based weighting approach. The first weighting approach gave a slightly lower value (5,700) than might have been considered logical since uncertainty due to imputations was absent. Both Bjørnstad s formula (5,800) and Rubin s formula (5,400) generate logically lower values than the corresponding SI methods. How much lower should these variances be? The value from the Rubin formula could be considered too low but I am not sure about this conclusion. The 95 per cent confidence interval coverage of this method is just below 95, whereas for Bjørnstad it is just above 95. All other methods give somewhat conservative estimates, the coverage varying from 95.2 to Continuous Variable Wages and incomes and their many components are typical NSI variables. Missing values often very awkwardly violate estimates based on these variables. I was here able to use the income data of the Euredit project (Charlton 2003). Some missingness was found within the data set, but afterwards we had the opportunity to compare these estimates against those computed from the cleaned, true data. The dataset of this experiment is very large; almost 200,000 persons after omitting persons with zero incomes. The relevant missingness rate was about 27 per cent. It was known that the missingness mechanism could not be ignored but the file contained a number of potential auxiliary variables for adjustment attempts. The problem was that, except for age, all these variables were categorical (such as gender, level of education, marital status, region, and employment status) and not well-related to incomes at the individual level (but much better at aggregate levels). This did not allow the construction of a well-fitting adjustment model. In terms of R-square, the maximum values were around 40 per cent. The fitting difficulty is a good thing here since it will encourage the development of a strategy for proper imputation or for another good adjustment (see also the results from the Euredit project in which a number of SI methods were tested without any magnificent success). I performed three different imputations using this dataset. One was based on the SAS procedure for MI (lately referred to as Proc MI Method and also as SAS ). In this
5 Laaksonen: Discussion 471 approach I used the same imputation model as with my second method. So, the variable being imputed was the dependent variable of the general linear model. The auxiliary variables were used maximally (as e.g., Rubin 1996 proposes) and exactly identically in all of the three imputation models, thus including the third model as well. However, the dependent variable of the third model was the binary missingness indicator. Consequently, this model was estimated using logistic regression. The Proc MI Method is fairly automatic and finds the imputed values when certain rules have been inserted into the procedure. I did not concern myself with how the SAS really produces the results. I was more like an ordinary user. However, I calculated the variance estimations myself from the multiply imputed datasets so that both Rubin s formula and Bjørnstad s formula were used and results calculated for both. In the second method, having first estimated the multivariate general linear model (GLM) I then calculated the predicted values plus the noise terms first and then used these metrics when searching for the nearest neighbour for each person with missing income data. The noise term was calculated using truncated, normally distributed random numbers with the zero as mean and with residual variance (see details of this approach in Laaksonen 2003, Euredit 2004 and Laaksonen 2005, also known as IMAI, which is short for Integrated Modelling Approach to Imputation). This IMAI strategy gives an easy opportunity to perform MIs, hoping that each SI also gives quite good point estimates. I suppose the method is fairly proper in all senses but further research is still needed. The third example is analogous to the second one (belongs to the IMAI framework as well, and is referred to as Logit ) so that each imputed value has been taken from a nearest donor without missingness and the metrics are based on the predicted values plus the noise term of the imputation model. Since these values are now probabilities, it is not automatically clear how the random noise should be performed. I deduced that it would be good to find each donor at random from the neighbourhood of a unit with missingness. This was made possible by adding to each deterministic probability a random number that was drawn from a uniform distribution of the interval (25%, þ 5%). I know that this is a slightly heuristic method like some other methods in the MI literature (see e.g., the strategy of the Solas software where estimated response probabilities are used to form a certain number of imputation cells and the donor is drawn randomly within each such cell). Apart from the above selections for SIs and MIs it is not automatically clear how each model should be specified. An advantage of the third approach is that the response variable is simple and needs no transformation. By contrast, the income variable is not as simple as this due to its skew distribution. The usual way in subject-matter research is to use its logarithm. This was attempted here, too, and it worked quite well with the second method. For a reason I do not know, this did not work with the first method when the average was estimated. Instead, it worked better for estimating the coefficient of variation (CV) that measures income differences. In the case of the second method, both models, linear and logarithmic, worked quite well for the average and the CV. The third method also gave reasonable point estimates (although, as usual in my experience, the general tendency was that the bias would not be reduced completely, and only a value approaching the true one would be obtained). Nevertheless, I did not use logarithms in any of the imputation models, since this strategy was not working well in the case of the CVs of the first method. Thus, although
6 472 Journal of Official Statistics I am not happy with all point estimates derived with the Proc MI method, I considered it fair to concentrate on the results in which all point estimates (concerning e.g., average income) are reasonable and compare the variance estimates with each other. For empirical tests I wanted to have different sizes of datasets. This could be easily achieved by trimming an initial dataset randomly. Likewise, I wanted to see what effects larger missingness would have. For this test, the trimming was performed to the 50 per cent response rate. As far as repeated SIs are concerned, the MI literature often states that a reasonable number could be in the range from five to seven. I first chose eight imputations for each experiment, but since some of the results were not at all logical, I extended this number to 16. The results in Table 1 are based on this number although I am still not completely satisfied with them. Altogether the table includes ¼ 84 experiments presented as standard errors, since these are smaller and more illustrative than variances. Table 1 allows a number of interpretations. We observe, for example, that the standard error is not always larger when the missingness is larger, as it logically should be (these discrepancies are in italics). This problem is especially present in small sample sizes with the Proc MI method (more imputations than 16 might have helped). The obvious reason for this too small variation in the remaining sample is that the method does not work properly. A more detailed look shows that the betweenimputation-variance is particularly small when the nonmissingness is 73 per cent of these samples and, consequently, the difference between Rubin s method and Bjørnstad s method is not significant. In general, these two alternatives give a different picture of the imputation uncertainty. In most cases, the Proc MI method gives the lowest standard errors, except for the full sample of 200,000 units, where all standard errors of this sort are relatively large, and those based on the Bjørnstad formula are naturally the largest. This is difficult to interpret but again illustrates that the properness of this automated method is problematic. The standard errors of my two methods, GLM and Logit, are fairly logical, the largest ones resulting from the Logit-based nearest neighbour methods. For this method the standard error between the two nonmissingness rates is also the largest. The same concerned adjustments from Rubin s formula to Bjørnstad s formula. The overall interpretation of Table 1 is that it is difficult to find an imputation method that could be viewed as proper, especially when the sample size is small and the missingess is large. The automated software does not give any guarantee for such a success. In this exercise, I was not able to make simulations and, consequently, get any ideal benchmark. Fortunately, the dataset gives the option to adopt an alternative perspective, that is, to construct reweights for the data and perform corresponding estimates using these. I thus created adjusted weights following the same methodology as in the case of the binary variable experiment (Laaksonen and Chambers 2006). The adjustment model was in this case exactly the same as in the third imputation method of this subsection but, naturally, the estimated response probabilities are now used without any noise term, and only for the respondents. Table 2 presents the analogous results with the imputations of Table 1. In addition to the overall standard errors, I include in the table its first component
7 Table 1. Estimated standard errors for average income based on three different multiple imputations for seven different gross sample sizes and two nonmissingness rates. Notation: SAS denotes the first method (Proc MI), GLM the second and Logit the third; R denotes Rubin s, B denotes Bjørnstad s formula Sample size MI methods R-SAS B-SAS R-GLM B-GLM R-Logit B-Logit 73% 50% 73% 50% 73% 50% 73% 50% 73% 50% 73% 50% 3,000 1,952 1,791 1,955 1,830 2,367 2,351 2,447 2,574 2,405 2,814 2,496 3,370 5,000 1,536 1,372 1,539 1,398 1,840 2,034 1,899 2,363 2,140 2,724 2,280 3,472 10,000 1,105 1,045 1,113 1,069 1,233 1,355 1,252 1,479 1,426 1,951 1,497 2,434 20, , , , ,546 50, , , Laaksonen: Discussion 473
8 474 Journal of Official Statistics Table 2. Estimated standard errors for average income based on the weighting of seven different gross sample sizes and two nonmissingness rates Sample size Based on both variance terms Based on the first term only Assuming simple random sampling (SRS) 73% 50% 73% 50% 73% 50% 3,000 4,480 5,627 3,964 4,536 2,500 3,012 5,000 3,296 4,086 2,894 3,196 1,945 2,301 10,000 2,310 2,960 2,033 2,318 1,380 1,727 20,000 1,594 2,047 1,389 1, ,194 50, , , , , and the simple random-based standard errors in order to gain a wider perspective for comparing the results with those presented in Table 1. First, it is important to note that the point estimates using these reweights were very accurate, both in respect of averages and of CVs. This method can thus be considered to be very reliable. Interestingly, the standard errors of the full formula (Columns 2 and 3 of the table) are always much larger than those from any comparable MI experiment. Furthermore, even if the incomplete formula (the second column) has been used, which is often the case in practice, the standard errors are still large, compared to MI-based ones. Finally, the SRS-based sampling errors are also quite large. Thus, this would appear to provide some evidence of the superiority of a good MI method if the only criterion was the estimated standard error, but I do not draw this conclusion. Instead, this is another new drawback of the SAS MI method, especially in the case of small sample sizes. On the other hand, these results are more favourable for Bjørnstad than for Rubin. 3. Further Comments What have we learned thus far from my experiments and Bjørnstad s article? The experiments lend some support to the argument that Bjørnstad s formula is safer, since it generates a larger standard error (and confidence interval). In some cases, when the between-imputationvariance is small, this formula does not help much, although the imputation model was badly fitting. This illogical result could be interpreted as a clear indication of the improperness of such an imputation method. Fortunately, the between-imputation-variance is more significant in most MI experiments and, consequently, the effect of the Bjørnstad formula is more essential. In these cases, the imputation method could be considered as the more proper one but I cannot, nevertheless, be convinced about the properness of these methods either. This annoyance is usual when handling NSI data. In my view, Bjørnstad is right in his key conclusions but, as he says, this topic requires further research. I would like to see imputation research concentrate on the properness of imputation, and more practically than what has been the case up to now. This remains a big problem both for Bayesians and for non-bayesians.
9 Laaksonen: Discussion References Charlton, J. (ed.) (2003). Towards Effective Statistical Editing and Imputation Strategies Findings of the Euredit project. Specifically the following chapters: Volume 2, Chapter Method 10: Integrated Modelling Approach to Imputation and Error Localisation (IMAI) written by S. Laaksonen. Technical Appendix: Descriptions of Datasets and Their Perturbations written by H. Wagstaff. Fay, R.E. (1996). Alternative Paradigms for the Analysis of Imputed Survey Data. Journal of the American Statistical Association, 91, Laaksonen, S. (1992). Adjustment Methods for Nonresponse and Their Application to Finnish Income Data. Statistical Journal of Economic Commission for Europe of United Nations, 9, Laaksonen, S. (2003). Alternative Imputation Techniques for Complex Metric Variables. Journal of Applied Statistics, Laaksonen, S. (2005). Integrated Modelling Approach to Imputation and Discussion on Imputation Variance. UNECE Work Session on Statistical Editing, Ottawa, May e.pdf. Laaksonen, S. and Chambers, R. (2006). Survey Estimation under Informative Nonresponse with Follow-up. Journal of Official Statistics, 22, Rao, J.N.K. (1996). On Variance Estimation with Imputed Survey Data. Journal of the American Statistical Association, 91, Rubin, D.B. (1987). Multiple Imputation for Nonresponse in Surveys. New York: Wiley. Rubin, D.B. (1996). Multiple Imputation after 18 þ Years (with Discussion). Journal of the American Statistical Association, 91, Received January 2006
Multiple Imputation for Missing Data: A Cautionary Tale
Multiple Imputation for Missing Data: A Cautionary Tale Paul D. Allison University of Pennsylvania Address correspondence to Paul D. Allison, Sociology Department, University of Pennsylvania, 3718 Locust
Workpackage 11 Imputation and Non-Response. Deliverable 11.2
Workpackage 11 Imputation and Non-Response Deliverable 11.2 2004 II List of contributors: Seppo Laaksonen, Statistics Finland; Ueli Oetliker, Swiss Federal Statistical Office; Susanne Rässler, University
MISSING DATA TECHNIQUES WITH SAS. IDRE Statistical Consulting Group
MISSING DATA TECHNIQUES WITH SAS IDRE Statistical Consulting Group ROAD MAP FOR TODAY To discuss: 1. Commonly used techniques for handling missing data, focusing on multiple imputation 2. Issues that could
Handling attrition and non-response in longitudinal data
Longitudinal and Life Course Studies 2009 Volume 1 Issue 1 Pp 63-72 Handling attrition and non-response in longitudinal data Harvey Goldstein University of Bristol Correspondence. Professor H. Goldstein
Problem of Missing Data
VASA Mission of VA Statisticians Association (VASA) Promote & disseminate statistical methodological research relevant to VA studies; Facilitate communication & collaboration among VA-affiliated statisticians;
Point and Interval Estimates
Point and Interval Estimates Suppose we want to estimate a parameter, such as p or µ, based on a finite sample of data. There are two main methods: 1. Point estimate: Summarize the sample by a single number
Imputing Missing Data using SAS
ABSTRACT Paper 3295-2015 Imputing Missing Data using SAS Christopher Yim, California Polytechnic State University, San Luis Obispo Missing data is an unfortunate reality of statistics. However, there are
A Basic Introduction to Missing Data
John Fox Sociology 740 Winter 2014 Outline Why Missing Data Arise Why Missing Data Arise Global or unit non-response. In a survey, certain respondents may be unreachable or may refuse to participate. Item
Missing Data. A Typology Of Missing Data. Missing At Random Or Not Missing At Random
[Leeuw, Edith D. de, and Joop Hox. (2008). Missing Data. Encyclopedia of Survey Research Methods. Retrieved from http://sage-ereference.com/survey/article_n298.html] Missing Data An important indicator
Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus
Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus Tihomir Asparouhov and Bengt Muthén Mplus Web Notes: No. 15 Version 8, August 5, 2014 1 Abstract This paper discusses alternatives
Reflections on Probability vs Nonprobability Sampling
Official Statistics in Honour of Daniel Thorburn, pp. 29 35 Reflections on Probability vs Nonprobability Sampling Jan Wretman 1 A few fundamental things are briefly discussed. First: What is called probability
CALCULATIONS & STATISTICS
CALCULATIONS & STATISTICS CALCULATION OF SCORES Conversion of 1-5 scale to 0-100 scores When you look at your report, you will notice that the scores are reported on a 0-100 scale, even though respondents
HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION
HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION HOD 2990 10 November 2010 Lecture Background This is a lightning speed summary of introductory statistical methods for senior undergraduate
Mode and Patient-mix Adjustment of the CAHPS Hospital Survey (HCAHPS)
Mode and Patient-mix Adjustment of the CAHPS Hospital Survey (HCAHPS) April 30, 2008 Abstract A randomized Mode Experiment of 27,229 discharges from 45 hospitals was used to develop adjustments for the
The Effects of Start Prices on the Performance of the Certainty Equivalent Pricing Policy
BMI Paper The Effects of Start Prices on the Performance of the Certainty Equivalent Pricing Policy Faculty of Sciences VU University Amsterdam De Boelelaan 1081 1081 HV Amsterdam Netherlands Author: R.D.R.
Statistics 2014 Scoring Guidelines
AP Statistics 2014 Scoring Guidelines College Board, Advanced Placement Program, AP, AP Central, and the acorn logo are registered trademarks of the College Board. AP Central is the official online home
Handling missing data in Stata a whirlwind tour
Handling missing data in Stata a whirlwind tour 2012 Italian Stata Users Group Meeting Jonathan Bartlett www.missingdata.org.uk 20th September 2012 1/55 Outline The problem of missing data and a principled
A Study to Predict No Show Probability for a Scheduled Appointment at Free Health Clinic
A Study to Predict No Show Probability for a Scheduled Appointment at Free Health Clinic Report prepared for Brandon Slama Department of Health Management and Informatics University of Missouri, Columbia
UNDERSTANDING THE TWO-WAY ANOVA
UNDERSTANDING THE e have seen how the one-way ANOVA can be used to compare two or more sample means in studies involving a single independent variable. This can be extended to two independent variables
Measuring investment in intangible assets in the UK: results from a new survey
Economic & Labour Market Review Vol 4 No 7 July 21 ARTICLE Gaganan Awano and Mark Franklin Jonathan Haskel and Zafeira Kastrinaki Imperial College, London Measuring investment in intangible assets in the
Missing Data in Longitudinal Studies: To Impute or not to Impute? Robert Platt, PhD McGill University
Missing Data in Longitudinal Studies: To Impute or not to Impute? Robert Platt, PhD McGill University 1 Outline Missing data definitions Longitudinal data specific issues Methods Simple methods Multiple
Missing Data: Part 1 What to Do? Carol B. Thompson Johns Hopkins Biostatistics Center SON Brown Bag 3/20/13
Missing Data: Part 1 What to Do? Carol B. Thompson Johns Hopkins Biostatistics Center SON Brown Bag 3/20/13 Overview Missingness and impact on statistical analysis Missing data assumptions/mechanisms Conventional
Missing data in randomized controlled trials (RCTs) can
EVALUATION TECHNICAL ASSISTANCE BRIEF for OAH & ACYF Teenage Pregnancy Prevention Grantees May 2013 Brief 3 Coping with Missing Data in Randomized Controlled Trials Missing data in randomized controlled
Introduction to Fixed Effects Methods
Introduction to Fixed Effects Methods 1 1.1 The Promise of Fixed Effects for Nonexperimental Research... 1 1.2 The Paired-Comparisons t-test as a Fixed Effects Method... 2 1.3 Costs and Benefits of Fixed
Chapter Seven. Multiple regression An introduction to multiple regression Performing a multiple regression on SPSS
Chapter Seven Multiple regression An introduction to multiple regression Performing a multiple regression on SPSS Section : An introduction to multiple regression WHAT IS MULTIPLE REGRESSION? Multiple
Dealing with Missing Data
Res. Lett. Inf. Math. Sci. (2002) 3, 153-160 Available online at http://www.massey.ac.nz/~wwiims/research/letters/ Dealing with Missing Data Judi Scheffer I.I.M.S. Quad A, Massey University, P.O. Box 102904
Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model
Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model 1 September 004 A. Introduction and assumptions The classical normal linear regression model can be written
ASSIGNMENT 4 PREDICTIVE MODELING AND GAINS CHARTS
DATABASE MARKETING Fall 2015, max 24 credits Dead line 15.10. ASSIGNMENT 4 PREDICTIVE MODELING AND GAINS CHARTS PART A Gains chart with excel Prepare a gains chart from the data in \\work\courses\e\27\e20100\ass4b.xls.
Data Cleaning and Missing Data Analysis
Data Cleaning and Missing Data Analysis Dan Merson [email protected] India McHale [email protected] April 13, 2010 Overview Introduction to SACS What do we mean by Data Cleaning and why do we do it? The SACS
INTERNATIONAL COMPARISONS OF PART-TIME WORK
OECD Economic Studies No. 29, 1997/II INTERNATIONAL COMPARISONS OF PART-TIME WORK Georges Lemaitre, Pascal Marianna and Alois van Bastelaer TABLE OF CONTENTS Introduction... 140 International definitions
Sample Size Issues for Conjoint Analysis
Chapter 7 Sample Size Issues for Conjoint Analysis I m about to conduct a conjoint analysis study. How large a sample size do I need? What will be the margin of error of my estimates if I use a sample
Comparison of Imputation Methods in the Survey of Income and Program Participation
Comparison of Imputation Methods in the Survey of Income and Program Participation Sarah McMillan U.S. Census Bureau, 4600 Silver Hill Rd, Washington, DC 20233 Any views expressed are those of the author
Statistical modelling with missing data using multiple imputation. Session 4: Sensitivity Analysis after Multiple Imputation
Statistical modelling with missing data using multiple imputation Session 4: Sensitivity Analysis after Multiple Imputation James Carpenter London School of Hygiene & Tropical Medicine Email: [email protected]
Chapter 6 Experiment Process
Chapter 6 Process ation is not simple; we have to prepare, conduct and analyze experiments properly. One of the main advantages of an experiment is the control of, for example, subjects, objects and instrumentation.
Handling missing data in large data sets. Agostino Di Ciaccio Dept. of Statistics University of Rome La Sapienza
Handling missing data in large data sets Agostino Di Ciaccio Dept. of Statistics University of Rome La Sapienza The problem Often in official statistics we have large data sets with many variables and
Consumer Price Indices in the UK. Main Findings
Consumer Price Indices in the UK Main Findings The report Consumer Price Indices in the UK, written by Mark Courtney, assesses the array of official inflation indices in terms of their suitability as an
Stupid Divisibility Tricks
Stupid Divisibility Tricks 101 Ways to Stupefy Your Friends Appeared in Math Horizons November, 2006 Marc Renault Shippensburg University Mathematics Department 1871 Old Main Road Shippensburg, PA 17013
Canonical Correlation Analysis
Canonical Correlation Analysis LEARNING OBJECTIVES Upon completing this chapter, you should be able to do the following: State the similarities and differences between multiple regression, factor analysis,
Appendix A: Survey Questions
Appendix A: Survey Questions World Values Survey : All things considered, how satisfied are you with your life as a whole these days? Please use this card to help with your answer. 1 Dissatisfied 2 3 4
A Mixed Model Approach for Intent-to-Treat Analysis in Longitudinal Clinical Trials with Missing Values
Methods Report A Mixed Model Approach for Intent-to-Treat Analysis in Longitudinal Clinical Trials with Missing Values Hrishikesh Chakraborty and Hong Gu March 9 RTI Press About the Author Hrishikesh Chakraborty,
An introduction to Value-at-Risk Learning Curve September 2003
An introduction to Value-at-Risk Learning Curve September 2003 Value-at-Risk The introduction of Value-at-Risk (VaR) as an accepted methodology for quantifying market risk is part of the evolution of risk
Kenken For Teachers. Tom Davis [email protected] http://www.geometer.org/mathcircles June 27, 2010. Abstract
Kenken For Teachers Tom Davis [email protected] http://www.geometer.org/mathcircles June 7, 00 Abstract Kenken is a puzzle whose solution requires a combination of logic and simple arithmetic skills.
Poverty among ethnic groups
Poverty among ethnic groups how and why does it differ? Peter Kenway and Guy Palmer, New Policy Institute www.jrf.org.uk Contents Introduction and summary 3 1 Poverty rates by ethnic group 9 1 In low income
Review of the Methods for Handling Missing Data in. Longitudinal Data Analysis
Int. Journal of Math. Analysis, Vol. 5, 2011, no. 1, 1-13 Review of the Methods for Handling Missing Data in Longitudinal Data Analysis Michikazu Nakai and Weiming Ke Department of Mathematics and Statistics
Betting on Volatility: A Delta Hedging Approach. Liang Zhong
Betting on Volatility: A Delta Hedging Approach Liang Zhong Department of Mathematics, KTH, Stockholm, Sweden April, 211 Abstract In the financial market, investors prefer to estimate the stock price
Simple Regression Theory II 2010 Samuel L. Baker
SIMPLE REGRESSION THEORY II 1 Simple Regression Theory II 2010 Samuel L. Baker Assessing how good the regression equation is likely to be Assignment 1A gets into drawing inferences about how close the
A Primer on Mathematical Statistics and Univariate Distributions; The Normal Distribution; The GLM with the Normal Distribution
A Primer on Mathematical Statistics and Univariate Distributions; The Normal Distribution; The GLM with the Normal Distribution PSYC 943 (930): Fundamentals of Multivariate Modeling Lecture 4: September
A General Approach to Variance Estimation under Imputation for Missing Survey Data
A General Approach to Variance Estimation under Imputation for Missing Survey Data J.N.K. Rao Carleton University Ottawa, Canada 1 2 1 Joint work with J.K. Kim at Iowa State University. 2 Workshop on Survey
CA200 Quantitative Analysis for Business Decisions. File name: CA200_Section_04A_StatisticsIntroduction
CA200 Quantitative Analysis for Business Decisions File name: CA200_Section_04A_StatisticsIntroduction Table of Contents 4. Introduction to Statistics... 1 4.1 Overview... 3 4.2 Discrete or continuous
CONTINGENCY TABLES ARE NOT ALL THE SAME David C. Howell University of Vermont
CONTINGENCY TABLES ARE NOT ALL THE SAME David C. Howell University of Vermont To most people studying statistics a contingency table is a contingency table. We tend to forget, if we ever knew, that contingency
USING LOGISTIC REGRESSION TO PREDICT CUSTOMER RETENTION. Andrew H. Karp Sierra Information Services, Inc. San Francisco, California USA
USING LOGISTIC REGRESSION TO PREDICT CUSTOMER RETENTION Andrew H. Karp Sierra Information Services, Inc. San Francisco, California USA Logistic regression is an increasingly popular statistical technique
Means, standard deviations and. and standard errors
CHAPTER 4 Means, standard deviations and standard errors 4.1 Introduction Change of units 4.2 Mean, median and mode Coefficient of variation 4.3 Measures of variation 4.4 Calculating the mean and standard
Types of Error in Surveys
2 Types of Error in Surveys Surveys are designed to produce statistics about a target population. The process by which this is done rests on inferring the characteristics of the target population from
Statistics courses often teach the two-sample t-test, linear regression, and analysis of variance
2 Making Connections: The Two-Sample t-test, Regression, and ANOVA In theory, there s no difference between theory and practice. In practice, there is. Yogi Berra 1 Statistics courses often teach the two-sample
QUANTITATIVE METHODS BIOLOGY FINAL HONOUR SCHOOL NON-PARAMETRIC TESTS
QUANTITATIVE METHODS BIOLOGY FINAL HONOUR SCHOOL NON-PARAMETRIC TESTS This booklet contains lecture notes for the nonparametric work in the QM course. This booklet may be online at http://users.ox.ac.uk/~grafen/qmnotes/index.html.
LOGIT AND PROBIT ANALYSIS
LOGIT AND PROBIT ANALYSIS A.K. Vasisht I.A.S.R.I., Library Avenue, New Delhi 110 012 [email protected] In dummy regression variable models, it is assumed implicitly that the dependent variable Y
Estimation and Confidence Intervals
Estimation and Confidence Intervals Fall 2001 Professor Paul Glasserman B6014: Managerial Statistics 403 Uris Hall Properties of Point Estimates 1 We have already encountered two point estimators: th e
Data Analysis: Analyzing Data - Inferential Statistics
WHAT IT IS Return to Table of ontents WHEN TO USE IT Inferential statistics deal with drawing conclusions and, in some cases, making predictions about the properties of a population based on information
D-optimal plans in observational studies
D-optimal plans in observational studies Constanze Pumplün Stefan Rüping Katharina Morik Claus Weihs October 11, 2005 Abstract This paper investigates the use of Design of Experiments in observational
Hedge Effectiveness Testing
Hedge Effectiveness Testing Using Regression Analysis Ira G. Kawaller, Ph.D. Kawaller & Company, LLC Reva B. Steinberg BDO Seidman LLP When companies use derivative instruments to hedge economic exposures,
Empirical Methods in Applied Economics
Empirical Methods in Applied Economics Jörn-Ste en Pischke LSE October 2005 1 Observational Studies and Regression 1.1 Conditional Randomization Again When we discussed experiments, we discussed already
Exact Nonparametric Tests for Comparing Means - A Personal Summary
Exact Nonparametric Tests for Comparing Means - A Personal Summary Karl H. Schlag European University Institute 1 December 14, 2006 1 Economics Department, European University Institute. Via della Piazzuola
Chapter 1 Introduction. 1.1 Introduction
Chapter 1 Introduction 1.1 Introduction 1 1.2 What Is a Monte Carlo Study? 2 1.2.1 Simulating the Rolling of Two Dice 2 1.3 Why Is Monte Carlo Simulation Often Necessary? 4 1.4 What Are Some Typical Situations
Missing Data. Paul D. Allison INTRODUCTION
4 Missing Data Paul D. Allison INTRODUCTION Missing data are ubiquitous in psychological research. By missing data, I mean data that are missing for some (but not all) variables and for some (but not all)
Sample Size and Power in Clinical Trials
Sample Size and Power in Clinical Trials Version 1.0 May 011 1. Power of a Test. Factors affecting Power 3. Required Sample Size RELATED ISSUES 1. Effect Size. Test Statistics 3. Variation 4. Significance
Study into average civil compensation in mesothelioma cases: statistical note
Study into average civil compensation in mesothelioma cases: statistical note Nick Coleman, John Forth, Hilary Metcalf, Pam Meadows, Max King and Leila Tufekci 23 April 2013 National Institute of Economic
3 Some Integer Functions
3 Some Integer Functions A Pair of Fundamental Integer Functions The integer function that is the heart of this section is the modulo function. However, before getting to it, let us look at some very simple
Farm Business Survey - Statistical information
Farm Business Survey - Statistical information Sample representation and design The sample structure of the FBS was re-designed starting from the 2010/11 accounting year. The coverage of the survey is
Poverty Among Migrants in Europe
EUROPEAN CENTRE EUROPÄISCHES ZENTRUM CENTRE EUROPÉEN Orsolya Lelkes is Economic Policy Analyst at the European Centre for Social Welfare Policy and Research, http://www.euro.centre.org/lelkes Poverty Among
Credit Scorecards for SME Finance The Process of Improving Risk Measurement and Management
Credit Scorecards for SME Finance The Process of Improving Risk Measurement and Management April 2009 By Dean Caire, CFA Most of the literature on credit scoring discusses the various modelling techniques
IB Math Research Problem
Vincent Chu Block F IB Math Research Problem The product of all factors of 2000 can be found using several methods. One of the methods I employed in the beginning is a primitive one I wrote a computer
SAMPLING DISTRIBUTIONS
0009T_c07_308-352.qd 06/03/03 20:44 Page 308 7Chapter SAMPLING DISTRIBUTIONS 7.1 Population and Sampling Distributions 7.2 Sampling and Nonsampling Errors 7.3 Mean and Standard Deviation of 7.4 Shape of
Principal components analysis
CS229 Lecture notes Andrew Ng Part XI Principal components analysis In our discussion of factor analysis, we gave a way to model data x R n as approximately lying in some k-dimension subspace, where k
A simple analysis of the TV game WHO WANTS TO BE A MILLIONAIRE? R
A simple analysis of the TV game WHO WANTS TO BE A MILLIONAIRE? R Federico Perea Justo Puerto MaMaEuSch Management Mathematics for European Schools 94342 - CP - 1-2001 - DE - COMENIUS - C21 University
6.4 Normal Distribution
Contents 6.4 Normal Distribution....................... 381 6.4.1 Characteristics of the Normal Distribution....... 381 6.4.2 The Standardized Normal Distribution......... 385 6.4.3 Meaning of Areas under
A Procedure for Classifying New Respondents into Existing Segments Using Maximum Difference Scaling
A Procedure for Classifying New Respondents into Existing Segments Using Maximum Difference Scaling Background Bryan Orme and Rich Johnson, Sawtooth Software March, 2009 Market segmentation is pervasive
Paid and Unpaid Labor in Developing Countries: an inequalities in time use approach
Paid and Unpaid Work inequalities 1 Paid and Unpaid Labor in Developing Countries: an inequalities in time use approach Paid and Unpaid Labor in Developing Countries: an inequalities in time use approach
A Review of Methods. for Dealing with Missing Data. Angela L. Cool. Texas A&M University 77843-4225
Missing Data 1 Running head: DEALING WITH MISSING DATA A Review of Methods for Dealing with Missing Data Angela L. Cool Texas A&M University 77843-4225 Paper presented at the annual meeting of the Southwest
Sensitivity Analysis in Multiple Imputation for Missing Data
Paper SAS270-2014 Sensitivity Analysis in Multiple Imputation for Missing Data Yang Yuan, SAS Institute Inc. ABSTRACT Multiple imputation, a popular strategy for dealing with missing values, usually assumes
Rubrics for Assessing Student Writing, Listening, and Speaking High School
Rubrics for Assessing Student Writing, Listening, and Speaking High School Copyright by the McGraw-Hill Companies, Inc. All rights reserved. Permission is granted to reproduce the material contained herein
IEOR 6711: Stochastic Models I Fall 2012, Professor Whitt, Tuesday, September 11 Normal Approximations and the Central Limit Theorem
IEOR 6711: Stochastic Models I Fall 2012, Professor Whitt, Tuesday, September 11 Normal Approximations and the Central Limit Theorem Time on my hands: Coin tosses. Problem Formulation: Suppose that I have
DEFINING AND COMPUTING EQUIVALENT INDUCTANCES OF GAPPED IRON CORE REACTORS
ISEF 20 - XV International Symposium on Electromagnetic Fields in Mechatronics, Electrical and Electronic Engineering Funchal, Madeira, September -3, 20 DEFINING AND COMPUTING EQUIVALENT INDUCTANCES OF
Appendix B Data Quality Dimensions
Appendix B Data Quality Dimensions Purpose Dimensions of data quality are fundamental to understanding how to improve data. This appendix summarizes, in chronological order of publication, three foundational
Variables Control Charts
MINITAB ASSISTANT WHITE PAPER This paper explains the research conducted by Minitab statisticians to develop the methods and data checks used in the Assistant in Minitab 17 Statistical Software. Variables
Time Series and Forecasting
Chapter 22 Page 1 Time Series and Forecasting A time series is a sequence of observations of a random variable. Hence, it is a stochastic process. Examples include the monthly demand for a product, the
Sensitivity Analysis 3.1 AN EXAMPLE FOR ANALYSIS
Sensitivity Analysis 3 We have already been introduced to sensitivity analysis in Chapter via the geometry of a simple example. We saw that the values of the decision variables and those of the slack and
Time Series Forecasting Techniques
03-Mentzer (Sales).qxd 11/2/2004 11:33 AM Page 73 3 Time Series Forecasting Techniques Back in the 1970s, we were working with a company in the major home appliance industry. In an interview, the person
The labour market, I: real wages, productivity and unemployment 7.1 INTRODUCTION
7 The labour market, I: real wages, productivity and unemployment 7.1 INTRODUCTION Since the 1970s one of the major issues in macroeconomics has been the extent to which low output and high unemployment
Decimals and other fractions
Chapter 2 Decimals and other fractions How to deal with the bits and pieces When drugs come from the manufacturer they are in doses to suit most adult patients. However, many of your patients will be very
Chapter Two. THE TIME VALUE OF MONEY Conventions & Definitions
Chapter Two THE TIME VALUE OF MONEY Conventions & Definitions Introduction Now, we are going to learn one of the most important topics in finance, that is, the time value of money. Note that almost every
Logit Models for Binary Data
Chapter 3 Logit Models for Binary Data We now turn our attention to regression models for dichotomous data, including logistic regression and probit analysis. These models are appropriate when the response
REFLECTIONS ON THE USE OF BIG DATA FOR STATISTICAL PRODUCTION
REFLECTIONS ON THE USE OF BIG DATA FOR STATISTICAL PRODUCTION Pilar Rey del Castillo May 2013 Introduction The exploitation of the vast amount of data originated from ICT tools and referring to a big variety
