# Approaches for Analyzing Survey Data: a Discussion

## Transcription

coincide with the researcher's target population. While the survey producer's target population is finite, like the survey population, these two populations usually differ, as seen in Figure 1. In the case of a household telephone survey, for example, the survey population would lack any individuals in households without a telephone, even though these people could be in the survey producer's target population. The survey producer usually provides weights in his data files to allow estimation of characteristics of his finite target population. These weights contain adjustments for known differences between the survey producer's survey and target populations. If the researcher's target population differs from the survey producer's target population, adjustments to the weights provided by the survey producer may be required to account for these differences.

Figure 1. Finite Target Population and Survey Population

An example of a research question related to a characteristic of a finite target population is the following: Was there a difference in 2002 between Ontario and Quebec organic farmers in average expenses per acre to grow tomatoes? To study such a question, the researcher might have access to the data from a 2002 cross-sectional survey of Canadian farmers where questions were asked about organic farming techniques used that year for various crops. The researcher's target population is a domain in the finite population targeted by the survey provider.

#### 2.2 Infinite Target Population

A researcher's target population is generally said to be infinite when the values of variables for this population are thought to have been generated by a statistical model. The quantities of interest to the researcher are characteristics of the model, such as the model parameters. Consider, for example, the problem of investigating whether obesity is a risk factor for arthritis, controlling for age and sex. In this case the researcher may have a logistic model in mind and be particularly interested in the coefficient of the obesity variable. The researcher is not confining his target population to any finite group at a fixed point in time, but may feel that the logistic model approximately describes the relationships among the variables involved during the past 15 years in western cultures, for example. Thus, his target population could be considered to be infinite.

Suppose the researcher had used a 1995 American health survey as his data source for fitting and testing his model. It would seem reasonable to presume that the researcher's logistic model could have generated the values of the variables involved for a finite population such as the finite population targeted by the providers of the data for that health survey. While the quantities of interest to the researcher are parameters of a model generating an infinite population, there are finite population parameters associated with these quantities of interest. In the case of the logistic model described above, the finite population parameters associated with the model coefficients could be the estimates of these coefficients when all the values from the full finite population are available. Such estimates are descriptive parameters of the finite population and frequently are useful summary statistics in their own right. In Figure 2 we illustrate the relationships among the various quantities when the target population is infinite. In this figure, $\theta_\xi$ represents the quantities of interest in the infinite target population, whereas $\theta_p$ represents the associated finite population quantities.

Figure 2. Infinite Target Population

### 3. Principles for Making Statistical Inference

For statistical inferences, a researcher is interested both in what he observed and in what he did not observe. Of primary interest is the distribution of estimates under hypothetical random repetitions.
The distribution of these estimates depends on whether or not a statistical model is presumed to have generated the values of a finite population, and the properties of the model. As well, the distribution of the estimates may or may not be affected by the sample design.

Consider, first of all, the case of a finite target population where no statistical model is presumed to have generated the finite population and where the only randomization is the design-based randomization. This case is illustrated in Figure 3. Here, the characteristic of interest is a descriptive parameter of the finite population represented by $\theta_p$. Through the sampling design for the survey, sample $i$ is selected and the estimate of $\theta_p$ derived from this sample is denoted by $\hat{\theta}_i$. However, it is possible that, under the sampling design used, a large number of samples different from sample $i$ could have been chosen, each of them leading to their specific estimate of $\theta_p$. The distribution of these different possible estimates is what may be called the design-based sampling distribution of the estimate. This is the basis for design-based inferences.

Figure 3. Design-based Randomization

Let us now turn to the case of an infinite target population where the values of variables for this population are described through a model and it is a characteristic of the model, say $\theta_\xi$, that is of primary interest to the researcher. Model-based inferences are based on the sampling distribution of the estimates of that characteristic due to different samples being drawn directly from that model. This is illustrated in Figure 4.

Figure 4. Model-based Randomization

The final case that we wish to present is still the case of the infinite target population where the values of variables for this population are thought to have been generated by a statistical model and it is the characteristics of the model that are of primary interest to the researcher. However, we want to explicitly account for the presumption that the model could have generated the values of the variables in the finite population from which the survey sample was drawn. In this situation, our focus is on the distribution of the estimates of the model parameters of interest, and we want to take account of the variability implied by the model as well as the variability implied by the survey design. This case is called model-design-based randomization and is illustrated in Figure 5. We feel that this is the randomization framework under which many questions related to appropriate analysis methods for survey data could be best explored. For a more rigorous treatment of the asymptotic theory in the design-model-based framework, see Rubin-Bleuer and Schiopu-Kratina (2005).

Figure 5. Model-design-based Randomization

In summary, if we let $\theta$ represent the characteristic of interest (which could be $\theta_\xi$ or $\theta_p$) and if we let $\hat{\theta}$ be its estimator, then the distribution of $\hat{\theta}$ is the distribution of the different conceptual values of this estimator, depending on the randomization assumptions that have been made: design-based, model-based or model-design-based. This implies, for example, that the expected value of the estimator is

$$E = \lim_{k \to \infty} \frac{1}{k} \sum_{i=1}^{k} \hat{\theta}_i,$$

where $\hat{\theta}_1, \hat{\theta}_2, \ldots, \hat{\theta}_k$ are $k$ independent draws from the distribution. The bias of $\hat{\theta}$ is then the difference between this expected value and the target parameter. Also, the variance of $\hat{\theta}$ is

$$V = \lim_{k \to \infty} \frac{1}{k} \sum_{i=1}^{k} \left(\hat{\theta}_i - E\right)^2.$$

Both the target population and the randomization assumptions matter when it comes to the values taken by these quantities.

#### 3.1 Informativeness and Ignorability

When variability due both to the model and to the survey design is being considered, two concepts encountered in the literature are informativeness and ignorability. See Pfeffermann (1993) for some discussion of these. The generation of the observed sample is actually a two-phase process, where at the first phase the finite population is generated according to the model and at the second phase the sample is drawn according to the survey design. When the sample can be assumed to have been generated directly from the model (without this affecting the distribution of the sample variable values), the sampling is said to be not informative. Otherwise it is informative. Simple random sampling designs are noninformative. For more complex sampling plans, whether or not the sampling is informative will depend on the validity of the model assumptions for the observed sample. The concept of informativeness is illustrated in Figure 6.

Next, consider a particular analysis of the data generated from this two-phase process. If a model-based method of inference for the analysis is valid under the two-phase model-design-based randomization process, the sampling is said to be ignorable for that analysis. Otherwise it is nonignorable.
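
The two-phase process and the repeated-draw definitions of $E$ and $V$ can be explored by simulation. The following is a minimal sketch under assumptions that are not in the paper: a normal model for a single variable, a simple random sample as the non-informative design, and a Poisson sample with inclusion probability proportional to the outcome as a deliberately informative design. The function name `simulate` and all parameter values are hypothetical.

```python
import numpy as np


def simulate(k=500, N=5000, n=200, mu=10.0, sigma=2.0, seed=0):
    """Approximate E and V of three estimators of mu over k two-phase repetitions."""
    rng = np.random.default_rng(seed)
    draws = {
        "srs_unweighted": [],
        "informative_unweighted": [],
        "informative_weighted": [],
    }
    for _ in range(k):
        # Phase 1: the (assumed) model generates the finite population values.
        y = rng.normal(mu, sigma, size=N)

        # Phase 2, non-informative design: simple random sampling without replacement.
        srs = rng.choice(N, size=n, replace=False)
        draws["srs_unweighted"].append(y[srs].mean())

        # Phase 2, informative design: Poisson sampling with inclusion probability
        # proportional to y itself, so selection depends on the outcome.
        size = np.clip(y, 0.1, None)
        pi = n * size / size.sum()
        chosen = rng.random(N) < pi
        y_s, w = y[chosen], 1.0 / pi[chosen]
        draws["informative_unweighted"].append(y_s.mean())
        # Weighted (ratio-type) mean using the design weights 1 / pi.
        draws["informative_weighted"].append(np.sum(w * y_s) / np.sum(w))

    # E and V as averages over the k draws (finite-k versions of the limits).
    return {name: (float(np.mean(v)), float(np.var(v))) for name, v in draws.items()}


if __name__ == "__main__":
    for name, (e, v) in simulate().items():
        print(f"{name:24s} E = {e:.3f}  bias = {e - 10.0:+.3f}  V = {v:.5f}")
```

Under these assumptions the unweighted mean from the informative design should sit noticeably above the model mean (by roughly $\sigma^2/\mu$ for this size-biased selection), while the simple random sample and the weighted estimate should both be close to it.
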
For example, when fitting a linear model using ordinary least squares regression estimation, if the actual model residuals are correlated within sampled clusters in a cluster sample, the sample design is nonignorable if the intra-cluster correlation is not properly taken into account. The concept of ignorability is illustrated in Figure 7 for inferences about the model parameter, $\theta_\xi$. It follows that noninformative sampling is ignorable for all analyses (Binder and Roberts, 2001). Some research has been done on diagnostics for ignorability (see, for example, Fuller (1984)).

Figure 6. Non-informative Sample Design

Figure 7. Ignorable Sampling

### 4. The Most Common Approaches to Analysis

The two approaches commonly used for analyzing survey data are the following:

(a) Design-based: This is the most commonly used approach for estimating finite population quantities for large-scale surveys, and is, as discussed below, also often appropriate when making inferences about model parameters. In this approach, the only source of randomness explicitly accounted for is that due to the survey design. Survey weighting is used to produce estimates of unknown finite population quantities which are the descriptive quantities of interest in the case of a finite target population and are related to the model quantities of interest in the case of an infinite target population. Design-based variance measures the variability among estimates from possible samples selected by the same design from the same finite population. There are a variety of methods for obtaining design-based variance estimates.

(b) Model-based: This approach, which is generally used when the quantities of interest are the parameters of a model, assumes that all randomness is expressed explicitly in the model. It is thus possible that a model for the infinite population will need modification so that it details the impact of the survey design on the variables being described in the sample taken. Classical non-survey approaches are used to fit the model, estimate variances and make inferences.

#### 4.1 Why Take a Design-based Approach

When the target population is infinite and the quantities of interest are parameters of a model generating values of the variables in a finite population, we contend that model-design-based randomization can serve to explain how the survey data were generated. However, we feel that, for a great number of problems studied by researchers, a pure design-based approach can still lead to valid inferences in the model-design-based randomization framework. There are several reasons for this.

First of all, under model-design randomization, a design-based approach gives valid inferences for model parameters when the mean model is approximately correct for the infinite population and when sampling fractions are small. Obviously,

$$\hat{\theta}_p - \theta_\xi = (\hat{\theta}_p - \theta_p) + (\theta_p - \theta_\xi).$$

Thus, if $E_p(\hat{\theta}_p) \approx \theta_p$ and $E_\xi(\theta_p) \approx \theta_\xi$, then $E_{\xi p}(\hat{\theta}_p - \theta_\xi) \approx 0$. Also,

$$V_{\xi p}(\hat{\theta}_p - \theta_\xi) \approx V_\xi(\theta_p) + E_\xi V_p(\hat{\theta}_p) = O(1/N) + O(1/n).$$
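
A worked special case may make these orders of magnitude concrete. As an illustrative assumption that is not taken from the paper, suppose the model generates the $N$ population values independently with mean $\mu$ and variance $\sigma^2$, the design is simple random sampling without replacement of size $n$, and the estimator is the sample mean $\bar{y}_n$ of the finite population mean $\bar{y}_N$. Writing $S^2$ for the finite population variance (with divisor $N-1$), so that $E_\xi(S^2) = \sigma^2$:

$$
\begin{aligned}
\theta_\xi &= \mu, \qquad \theta_p = \bar{y}_N, \qquad \hat{\theta}_p = \bar{y}_n, \\
V_\xi(\theta_p) &= \frac{\sigma^2}{N}, \\
E_\xi V_p(\hat{\theta}_p) &= E_\xi\!\left[\left(1 - \frac{n}{N}\right)\frac{S^2}{n}\right] = \left(1 - \frac{n}{N}\right)\frac{\sigma^2}{n}, \\
V_{\xi p}(\hat{\theta}_p - \theta_\xi) &= \frac{\sigma^2}{N} + \left(1 - \frac{n}{N}\right)\frac{\sigma^2}{n} = \frac{\sigma^2}{n}.
\end{aligned}
$$

In this special case the design-based component is the fraction $1 - n/N$ of the total variance: with $n/N = 0.01$ it falls short by about 1%, whereas with $n/N = 0.5$ it accounts for only half of the total.
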
If the sampling fraction, $n/N$, is small, $V_{\xi p}(\hat{\theta}_p - \theta_\xi) \approx E_\xi V_p(\hat{\theta}_p)$, and using $\hat{V}_p(\hat{\theta}_p)$ will give valid model-design-based inferences about $\theta_\xi$.

Secondly, researchers, particularly secondary users of the data, may not know enough about the design to completely model its impact. Even if a researcher does know the design well, suitable design variables may not exist on the data files provided for analysis for inclusion in a parsimonious model. Thus, appropriate modification of a model to explain the survey data may not be feasible and a design-based approach may make more sense.

Finally, a researcher may not want design variables in his model since inclusion of these variables could change the interpretation of other model parameters (see, for example, Chambers (1986)). Using the form of the model that generates the infinite population, plus design-based methods to implicitly account for the impact of the survey design on the model holding in the sample, thus may seem like a more palatable option.

It should be noted that a pure design-based approach would not be valid under model-design-based randomization when sampling fractions are not small. However, in this case, the model-design-based framework could point to appropriate corrections to the design-based variance estimates.

### 5. Applying These Principles and Approaches to Integrating Data From More Than One Survey

As data are being collected and are being made accessible to researchers from an increasing number of surveys, the researchers are noting that comparable variables of interest are available from more than one survey source. It is often the case that the sample sizes for the problem that they wish to study are small in each of the survey sources. Of interest to these researchers is whether and how to perform the analysis by integrating the data from more than one survey.

#### 5.1 Integrating When Target Population is Finite

Let us start with the situation where the quantity of interest is a descriptive parameter that is a characteristic of a finite population. The quantity of interest could be, for example, the prevalence rate of a disease or the proportion of smokers in a population. In Figure 8, we illustrate a complex case where teenagers were sampled in 1994, 1996, and 1998. However, the target population of interest to the researcher includes all teenagers in the years 1994 to 1998, so that teenagers in 1995 and 1997 are also part of the researcher's target population. Note that the population of all teenagers in the years 1994 to 1998 is a conceptual one, since it never exists at any single point in time. Note also that persons who were teenagers in more than one year are considered here as different units in the conceptual finite population.

Figure 8. Integrating with Finite Target Populations

In the case described here, and in many other situations, the question that arises is whether it makes sense to integrate the data from more than one survey. Such integration could be considered when either of the following two conditions applies:

(i) if the researcher's target population is the combination of the finite populations targeted by the survey producer for the different surveys (i.e., each finite population is like a super-stratum). In this case, the quantity of interest need not be assumed to be constant over the different super-strata, although whether or not this is true could influence the choice of approach to integration;

(ii) if the researcher's target population is a bigger population than the combined finite populations targeted by the survey producers, as in our example above. In this case, some assumptions about the relationship between the quantities of interest in the populations that were not sampled and the quantities of interest in the populations that were sampled would need to be made. For example, one might assume that for the population illustrated in Figure 8 the average smoking rate for teenagers in the years 1994 to 1998 is similar to the average over only the years 1994, 1996, and 1998. Alternatively, for some other characteristic, such as prevalence rate for some health condition, one might assume that the characteristic of interest is constant, or has a constant linear trend, over all the years in the researcher's target population.

In the next two subsections, we describe the two broad choices for integrating the data.

##### 5.1.1 Separate Approach to Integration

The first broad choice for integrating the data would be to estimate the parameter from each data source separately and then to combine the estimates through averaging. Before proceeding, the researcher should perform some preliminary work. First of all, he should check on the assumption of equality of the parameter across the different finite populations. This confirmatory work could involve some formal statistical testing and also background investigation into the subject matter. (The power of the statistical tests may not be high if the sample sizes from each survey are low.) Secondly, he should consider the meaning of the average of estimates if the parameters are unequal, and determine whether, in such a case, the average would have relevance to his research. As well, he should consider whether a weighted average, rather than a simple average, would have more advantages for his particular research. The large body of research into the topics of population-size-adjusted or design-effect-adjusted weighting could help with this decision. However, it is important to note that optimal methods for weight adjustments may depend on knowing the variances or design-effects of an estimate, and these variances are often estimated from data based on small sample sizes.

When the surveys are independent, it is usually feasible to construct estimates of the variances for the estimator using a separate approach. On the other hand, when the surveys are not independent, the correlation between surveys will need to be accounted for in the variance estimates.

##### 5.1.2 Pooling Approach to Integration

As a second approach to integration, the researcher could pool the data from the different surveys, considering the data from each as being from a different superstratum, and then treat the data as if from a single survey. However, before proceeding, there are again some things to consider. The researcher should do some confirmatory work regarding an assumption of equality of the parameter across the superstrata. He should consider the meaning of the pooled estimate if equality is not true. (For example, does he actually want an estimate of the prevalence rate in the pooled populations if the prevalence rates within the different populations are not the same?) He could also consider whether doing weight rescaling within each data source would be advantageous. For example, he could explore whether it leads to a more efficient estimate. However, in the situation of unequal parameters in the different finite populations, he would need to consider whether the rescaled estimate would make sense.

As in the case of a separate approach, it is usually feasible to construct estimates of variances when a pooled approach is used. It should be noted that only under specific conditions would the two approaches, pooled and combined, give the same point estimate (even when estimating the same quantity).

#### 5.2 Integrating When Target Population is Infinite

We now turn to the situation where the quantities of interest are parameters of a model describing an infinite population. It would seem feasible for a researcher to consider integrating the data from more than one survey if the statistical model (which describes an infinite population) could be presumed to have generated the values of each of the finite populations targeted by the survey producers for the different surveys under consideration for integration. Furthermore, the model could and probably should contain parameters particular to each finite population.

As is the case for a descriptive parameter of a finite population, either pooling or combining are possible approaches for integrating the data from the different surveys. However, for the infinite population, where modeling is involved, the pooling approach has some distinct advantages. When pooling, it is generally straightforward to allow for and to test for inequalities in parameters among the different finite populations presumed to have been generated by the model.

Consider, for example, the simple situation displayed in Figure 9, where three different surveys collected information on the same two variables and where the model of interest to the researcher posited a linear relationship between the two variables. If the researcher pooled the data from the three surveys and fitted a linear model without consideration of the source of each data point, his estimated line would have had a strong positive slope, as shown on the left of Figure 9. If, however, he allowed for different slopes and intercepts for the different data sources in his model for the pooled data, his estimated lines would have the form shown on the right of Figure 9. It appears as if the lines are parallel, but with a negative slope. Further investigation by the researcher reveals that the negative linear relationship between the two variables made sense and that the difference in the locations of the lines for the three finite populations presumed to have been generated by the model could be attributed to a survey effect, such as mode effect, of which the researcher had not been previously aware.

Figure 9. Fitting Linear Models Using Integrated Surveys

### 6. Conclusions

There is controversy about using a design-based approach for estimating model parameters. We feel that the issues raised in this controversy can be discussed and clarified in a model-design-based framework. As well, as shown in this paper, use of this framework will identify the situations where a pure design-based approach makes sense. In these discussions, the notion of the appropriate target population is important.

### References

Binder, David A. and Roberts, Georgia R. (2001), Can Informative Designs be Ignorable?, Newsletter of the Survey Research Methods Section, Issue 12, American Statistical Association.

Binder, David A. and Roberts, Georgia R. (2003), Design-based and Model-based Methods for Estimating Model Parameters, in Analysis of Survey Data (eds. R.L. Chambers and Chris Skinner), Wiley, Chichester.

Chambers, R.L. (1986), Design-Adjusted Parameter Estimation, Journal of the Royal Statistical Society, Series A, 149.

Fuller, Wayne A. (1984), Least Squares and Related Analyses for Complex Survey Designs, Survey Methodology, 10.

Graubard, Barry I. and Korn, Edward L. (2002), Inference for Superpopulation Parameters Using Sample Surveys, Statistical Science, 17.

Korn, Edward L. and Graubard, Barry I. (1995), Analysis of Large Health Surveys: Accounting for the Sampling Design, Journal of the Royal Statistical Society, Series A, 158.

Pfeffermann, Danny (1993), The Role of Sampling Weights When Modeling Survey Data, International Statistical Review, 61.

Rubin-Bleuer, Susana and Schiopu-Kratina, Ioana (2005), On the Two-Phase Framework for Joint Model and Design-Based Inference, Annals of Statistics, 33.
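
Illustration for Section 5.2. The Figure 9 scenario can be imitated with simulated data. The following is a minimal sketch, assuming three hypothetical surveys that share a within-survey slope of -1 but differ in intercept (standing in for a survey effect such as a mode effect) and in the range of the explanatory variable. All numeric values and the function names `make_surveys`, `naive_pooled_slope` and `source_specific_fit` are invented for illustration; the sketch ignores survey weights and design-based variance estimation.

```python
import numpy as np


def make_surveys(seed=0, n_per_survey=200):
    """Simulate three surveys with a common negative slope and shifted intercepts."""
    rng = np.random.default_rng(seed)
    x_centres = [0.0, 5.0, 10.0]    # each survey covers a different range of x
    intercepts = [2.0, 12.0, 22.0]  # survey-specific shift (hypothetical survey effect)
    slope = -1.0                    # common within-survey slope
    xs, ys, src = [], [], []
    for s, (centre, a) in enumerate(zip(x_centres, intercepts)):
        x = rng.normal(centre, 1.0, n_per_survey)
        y = a + slope * x + rng.normal(0.0, 0.5, n_per_survey)
        xs.append(x)
        ys.append(y)
        src.append(np.full(n_per_survey, s))
    return np.concatenate(xs), np.concatenate(ys), np.concatenate(src)


def naive_pooled_slope(x, y):
    """Fit one line to the pooled data, ignoring the source of each point."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]


def source_specific_fit(x, y, source):
    """Fit the pooled data with a separate intercept and slope for each source."""
    labels = np.unique(source)
    D = (source[:, None] == labels[None, :]).astype(float)  # one dummy per source
    X = np.column_stack([D, D * x[:, None]])                 # intercepts, then slopes
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    k = len(labels)
    return beta[:k], beta[k:]


if __name__ == "__main__":
    x, y, source = make_surveys()
    print("naive pooled slope:", naive_pooled_slope(x, y))
    intercepts, slopes = source_specific_fit(x, y, source)
    print("source-specific intercepts:", intercepts)
    print("source-specific slopes:", slopes)
```

With these invented values the naive pooled slope should come out positive while each source-specific slope is close to -1, which mirrors the contrast between the two panels described for Figure 9. Testing whether the slopes or intercepts differ between sources would, in practice, need variance estimates appropriate to the survey designs.
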

### Clarifying Some Issues in the Regression Analysis of Survey Data

Survey Research Methods (2007) http://w4.ub.uni-konstanz.de/srm Vol. 1, No. 1, pp. 11-18 c European Survey Research Association Clarifying Some Issues in the Regression Analysis of Survey Data Phillip

### Visualization of Complex Survey Data: Regression Diagnostics

Visualization of Complex Survey Data: Regression Diagnostics Susan Hinkins 1, Edward Mulrow, Fritz Scheuren 3 1 NORC at the University of Chicago, 11 South 5th Ave, Bozeman MT 59715 NORC at the University

### Marketing Mix Modelling and Big Data P. M Cain

1) Introduction Marketing Mix Modelling and Big Data P. M Cain Big data is generally defined in terms of the volume and variety of structured and unstructured information. Whereas structured data is stored

### Chapter 19 Statistical analysis of survey data. Abstract

Chapter 9 Statistical analysis of survey data James R. Chromy Research Triangle Institute Research Triangle Park, North Carolina, USA Savitri Abeyasekera The University of Reading Reading, UK Abstract

### Handling attrition and non-response in longitudinal data

Longitudinal and Life Course Studies 2009 Volume 1 Issue 1 Pp 63-72 Handling attrition and non-response in longitudinal data Harvey Goldstein University of Bristol Correspondence. Professor H. Goldstein

### New SAS Procedures for Analysis of Sample Survey Data

New SAS Procedures for Analysis of Sample Survey Data Anthony An and Donna Watts, SAS Institute Inc, Cary, NC Abstract Researchers use sample surveys to obtain information on a wide variety of issues Many

### A Composite Likelihood Approach to Analysis of Survey Data with Sampling Weights Incorporated under Two-Level Models

A Composite Likelihood Approach to Analysis of Survey Data with Sampling Weights Incorporated under Two-Level Models Grace Y. Yi 13, JNK Rao 2 and Haocheng Li 1 1. University of Waterloo, Waterloo, Canada

### Using Repeated Measures Techniques To Analyze Cluster-correlated Survey Responses

Using Repeated Measures Techniques To Analyze Cluster-correlated Survey Responses G. Gordon Brown, Celia R. Eicheldinger, and James R. Chromy RTI International, Research Triangle Park, NC 27709 Abstract

### Statistical methods for the comparison of dietary intake

Appendix Y Statistical methods for the comparison of dietary intake Jianhua Wu, Petros Gousias, Nida Ziauddeen, Sonja Nicholson and Ivonne Solis- Trapala Y.1 Introduction This appendix provides an outline

### Comparison of Estimation Methods for Complex Survey Data Analysis

Comparison of Estimation Methods for Complex Survey Data Analysis Tihomir Asparouhov 1 Muthen & Muthen Bengt Muthen 2 UCLA 1 Tihomir Asparouhov, Muthen & Muthen, 3463 Stoner Ave. Los Angeles, CA 90066.

### Multilevel Modeling of Complex Survey Data

Multilevel Modeling of Complex Survey Data Sophia Rabe-Hesketh, University of California, Berkeley and Institute of Education, University of London Joint work with Anders Skrondal, London School of Economics

### Introduction to Regression and Data Analysis

Statlab Workshop Introduction to Regression and Data Analysis with Dan Campbell and Sherlock Campbell October 28, 2008 I. The basics A. Types of variables Your variables may take several forms, and it

### Sampling solutions to the problem of undercoverage in CATI household surveys due to the use of fixed telephone list

Sampling solutions to the problem of undercoverage in CATI household surveys due to the use of fixed telephone list Claudia De Vitiis, Paolo Righi 1 Abstract: The undercoverage of the fixed line telephone

### South Carolina College- and Career-Ready (SCCCR) Probability and Statistics

South Carolina College- and Career-Ready (SCCCR) Probability and Statistics South Carolina College- and Career-Ready Mathematical Process Standards The South Carolina College- and Career-Ready (SCCCR)

### Design-Based Estimators for Snowball Sampling

Design-Based Estimators for Snowball Sampling Termeh Shafie Department of Statistics, Stockholm University SE-106 91 Stockholm, Sweden Abstract Snowball sampling, where existing study subjects recruit

### Cluster Sampling: Single stage cluster sampling

Chapter 6 Cluster Sampling: Single stage cluster sampling 6.1 Introduction Element sampling designs discussed in Chapter 3 and Chapter 4 are not always feasible when there is no sampling frame for the

### Inequality, Mobility and Income Distribution Comparisons

Fiscal Studies (1997) vol. 18, no. 3, pp. 93 30 Inequality, Mobility and Income Distribution Comparisons JOHN CREEDY * Abstract his paper examines the relationship between the cross-sectional and lifetime

### Systematic Reviews and Meta-analyses

Systematic Reviews and Meta-analyses Introduction A systematic review (also called an overview) attempts to summarize the scientific evidence related to treatment, causation, diagnosis, or prognosis of

### Department of Economics

Department of Economics On Testing for Diagonality of Large Dimensional Covariance Matrices George Kapetanios Working Paper No. 526 October 2004 ISSN 1473-0278 On Testing for Diagonality of Large Dimensional

### Multilevel modelling of complex survey data

J. R. Statist. Soc. A (2006) 169, Part 4, pp. 805 827 Multilevel modelling of complex survey data Sophia Rabe-Hesketh University of California, Berkeley, USA, and Institute of Education, London, UK and

### COURSES: 1. Short Course in Econometrics for the Practitioner (P000500) 2. Short Course in Econometric Analysis of Cointegration (P000537)

Get the latest knowledge from leading global experts. Financial Science Economics Economics Short Courses Presented by the Department of Economics, University of Pretoria WITH 2015 DATES www.ce.up.ac.za

### Survey Inference for Subpopulations

American Journal of Epidemiology Vol. 144, No. 1 Printed In U.S.A Survey Inference for Subpopulations Barry I. Graubard 1 and Edward. Korn 2 One frequently analyzes a subset of the data collected in a

### COMMON CORE STATE STANDARDS FOR

COMMON CORE STATE STANDARDS FOR Mathematics (CCSSM) High School Statistics and Probability Mathematics High School Statistics and Probability Decisions or predictions are often based on data numbers in

### Statistical Models in R

Statistical Models in R Some Examples Steven Buechler Department of Mathematics 276B Hurley Hall; 1-6233 Fall, 2007 Outline Statistical Models Structure of models in R Model Assessment (Part IA) Anova

### Teaching Business Statistics through Problem Solving

Teaching Business Statistics through Problem Solving David M. Levine, Baruch College, CUNY with David F. Stephan, Two Bridges Instructional Technology CONTACT: davidlevine@davidlevinestatistics.com Typical

### ANALYTIC AND REPORTING GUIDELINES

ANALYTIC AND REPORTING GUIDELINES The National Health and Nutrition Examination Survey (NHANES) Last Update: December, 2005 Last Correction, September, 2006 National Center for Health Statistics Centers

### 11. Analysis of Case-control Studies Logistic Regression

Research methods II 113 11. Analysis of Case-control Studies Logistic Regression This chapter builds upon and further develops the concepts and strategies described in Ch.6 of Mother and Child Health:

### Reflections on Probability vs Nonprobability Sampling

Official Statistics in Honour of Daniel Thorburn, pp. 29 35 Reflections on Probability vs Nonprobability Sampling Jan Wretman 1 A few fundamental things are briefly discussed. First: What is called probability

### Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics

Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics For 2015 Examinations Aim The aim of the Probability and Mathematical Statistics subject is to provide a grounding in

### Power and sample size in multilevel modeling

Snijders, Tom A.B. Power and Sample Size in Multilevel Linear Models. In: B.S. Everitt and D.C. Howell (eds.), Encyclopedia of Statistics in Behavioral Science. Volume 3, 1570 1573. Chicester (etc.): Wiley,

### Multilevel Modeling of Complex Survey Data

Multilevel Modeling of Complex Survey Data Tihomir Asparouhov 1, Bengt Muthen 2 Muthen & Muthen 1 University of California, Los Angeles 2 Abstract We describe a multivariate, multilevel, pseudo maximum

### From the help desk: Bootstrapped standard errors

The Stata Journal (2003) 3, Number 1, pp. 71 80 From the help desk: Bootstrapped standard errors Weihua Guan Stata Corporation Abstract. Bootstrapping is a nonparametric approach for evaluating the distribution

### Age to Age Factor Selection under Changing Development Chris G. Gross, ACAS, MAAA

Age to Age Factor Selection under Changing Development Chris G. Gross, ACAS, MAAA Introduction A common question faced by many actuaries when selecting loss development factors is whether to base the selected

### Supporting Online Material for

www.sciencemag.org/cgi/content/full/319/5862/414/dc1 Supporting Online Material for Application of Bloom's Taxonomy Debunks the MCAT Myth Alex Y. Zheng, Janessa K. Lawhorn, Thomas Lumley, Scott Freeman*

### Incentives for Improving Cybersecurity in the Private Sector: A Cost-Benefit Perspective

Incentives for Improving Cybersecurity in the Private Sector: A Cost-Benefit Perspective Testimony for the House Committee on Homeland Security's Subcommittee on Emerging Threats, Cybersecurity, and Science

### Random Effects Models for Longitudinal Survey Data

Analysis of Survey Data. Edited by R. L. Chambers and C. J. Skinner Copyright 2003 John Wiley & Sons, Ltd. ISBN: 0-471-89987-9 CHAPTER 14 Random Effects Models for Longitudinal Survey Data C. J. Skinner

### CHAPTER 9 EXAMPLES: MULTILEVEL MODELING WITH COMPLEX SURVEY DATA

Examples: Multilevel Modeling With Complex Survey Data CHAPTER 9 EXAMPLES: MULTILEVEL MODELING WITH COMPLEX SURVEY DATA Complex survey data refers to data obtained by stratification, cluster sampling and/or

### Fairfield Public Schools

Mathematics Fairfield Public Schools AP Statistics AP Statistics BOE Approved 04/08/2014 1 AP STATISTICS Critical Areas of Focus AP Statistics is a rigorous course that offers advanced students an opportunity

### Stat 9100.3: Analysis of Complex Survey Data

Stat 9100.3: Analysis of Complex Survey Data 1 Logistics Instructor: Stas Kolenikov, kolenikovs@missouri.edu Class period: MWF 1-1:50pm Office hours: Middlebush 307A, Mon 1-2pm, Tue 1-2 pm, Thu 9-10am.

### Evaluating Mode Effects in the Medicare CAHPS Fee-For-Service Survey

Evaluating Mode Effects in the Medicare Fee-For-Service Survey Norma Pugh, MS, Vincent Iannacchione, MS, Trang Lance, MPH, Linda Dimitropoulos, PhD RTI International, Research Triangle Park, NC 27709 Key

### What is the purpose of this document? What is in the document? How do I send Feedback?

This document is designed to help North Carolina educators teach the Common Core (Standard Course of Study). NCDPI staff are continually updating and improving these tools to better serve teachers. Statistics

### Statistics in Retail Finance. Chapter 6: Behavioural models

Statistics in Retail Finance 1 Overview > So far we have focussed mainly on application scorecards. In this chapter we shall look at behavioural models. We shall cover the following topics:- Behavioural

### Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model

Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model 1 September 004 A. Introduction and assumptions The classical normal linear regression model can be written

### A Basic Introduction to Missing Data

John Fox Sociology 740 Winter 2014 Outline Why Missing Data Arise Why Missing Data Arise Global or unit non-response. In a survey, certain respondents may be unreachable or may refuse to participate. Item

### U.C. Berkeley CS276: Cryptography Handout 0.1 Luca Trevisan January, 2009. Notes on Algebra

U.C. Berkeley CS276: Cryptography Handout 0.1 Luca Trevisan January, 2009 Notes on Algebra These notes contain as little theory as possible, and most results are stated without proof. Any introductory

### The SURVEYFREQ Procedure in SAS 9.2: Avoiding FREQuent Mistakes When Analyzing Survey Data ABSTRACT INTRODUCTION SURVEY DESIGN 101 WHY STRATIFY?

The SURVEYFREQ Procedure in SAS 9.2: Avoiding FREQuent Mistakes When Analyzing Survey Data Kathryn Martin, Maternal, Child and Adolescent Health Program, California Department of Public Health, ABSTRACT

### A General Approach to Variance Estimation under Imputation for Missing Survey Data

A General Approach to Variance Estimation under Imputation for Missing Survey Data J.N.K. Rao Carleton University Ottawa, Canada 1 2 1 Joint work with J.K. Kim at Iowa State University. 2 Workshop on Survey

### The Elasticity of Taxable Income: A Non-Technical Summary

The Elasticity of Taxable Income: A Non-Technical Summary John Creedy The University of Melbourne Abstract This paper provides a non-technical summary of the concept of the elasticity of taxable income,

### Least Squares Estimation

Least Squares Estimation SARA A VAN DE GEER Volume 2, pp 1041-1045 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S Everitt & David

### Robust Inferences from Random Clustered Samples: Applications Using Data from the Panel Survey of Income Dynamics

Robust Inferences from Random Clustered Samples: Applications Using Data from the Panel Survey of Income Dynamics John Pepper Assistant Professor Department of Economics University of Virginia 114 Rouss

### National Endowment for the Arts. A Technical Research Manual

2012 SPPA PUBLIC-USE DATA FILE USER'S GUIDE A Technical Research Manual Prepared by Timothy Triplett Statistical Methods Group Urban Institute September 2013 Table of Contents Introduction... 3 Section

### Appendix G STATISTICAL METHODS INFECTIOUS METHODS STATISTICAL ROADMAP. Prepared in Support of: CDC/NCEH Cross Sectional Assessment Study.

Appendix G STATISTICAL METHODS INFECTIOUS METHODS STATISTICAL ROADMAP Prepared in Support of: CDC/NCEH Cross Sectional Assessment Study Prepared by: Centers for Disease Control and Prevention National

### Table 1" Cancer Analysis No Yes Diff. S EE p-value OLS 62.8 61.3 1.5 0.6 0.013. Design- Based 63.6 62.7 0.9 0.9 0.29

Epidemiologic Studies Utilizing Surveys: Accounting for the Sampling Design Edward L. Korn, Barry I. Graubard Edward L. Korn, Biometric Research Branch, National Cancer Institute, EPN-739, Bethesda, MD

### Getting Correct Results from PROC REG

Getting Correct Results from PROC REG Nathaniel Derby, Statis Pro Data Analytics, Seattle, WA ABSTRACT PROC REG, SAS's implementation of linear regression, is often used to fit a line without checking

### Penalized regression: Introduction

Penalized regression: Introduction Patrick Breheny August 30 Patrick Breheny BST 764: Applied Statistical Modeling 1/19 Maximum likelihood Much of 20th-century statistics dealt with maximum likelihood

### 3. Data Analysis, Statistics, and Probability

3. Data Analysis, Statistics, and Probability Data and probability sense provides students with tools to understand information and uncertainty. Students ask questions and gather and use data to answer

### Multiple Linear Regression in Data Mining

Multiple Linear Regression in Data Mining Contents 2.1. A Review of Multiple Linear Regression 2.2. Illustration of the Regression Process 2.3. Subset Selection in Linear Regression 1 2 Chap. 2 Multiple

### DEPARTMENT OF PSYCHOLOGY UNIVERSITY OF LANCASTER MSC IN PSYCHOLOGICAL RESEARCH METHODS ANALYSING AND INTERPRETING DATA 2 PART 1 WEEK 9

DEPARTMENT OF PSYCHOLOGY UNIVERSITY OF LANCASTER MSC IN PSYCHOLOGICAL RESEARCH METHODS ANALYSING AND INTERPRETING DATA 2 PART 1 WEEK 9 Analysis of covariance and multiple regression So far in this course,

### Chapter XXI Sampling error estimation for survey data* Donna Brogan Emory University Atlanta, Georgia United States of America.

Chapter XXI Sampling error estimation for survey data* Donna Brogan Emory University Atlanta, Georgia United States of America Abstract Complex sample survey designs deviate from simple random sampling,

Microdata User Guide Survey of Principals 2004/05 December 2006 Table of Contents 1.0 Administration... 2.0 Authority... 3.0 Background... 4.0 Objectives... 5.0 Content... 6.0 Uses... 7.0 Data

### Survey Data Analysis in Stata

Survey Data Analysis in Stata Jeff Pitblado Associate Director, Statistical Software StataCorp LP Stata Conference DC 2009 J. Pitblado (StataCorp) Survey Data Analysis DC 2009 1 / 44 Outline 1 Types of

### Part 2: Analysis of Relationship Between Two Variables

Part 2: Analysis of Relationship Between Two Variables Linear Regression Linear correlation Significance Tests Multiple regression Linear Regression Y = a X + b Dependent Variable Independent Variable

### Predicting Successful Completion of the Nursing Program: An Analysis of Prerequisites and Demographic Variables

Predicting Successful Completion of the Nursing Program: An Analysis of Prerequisites and Demographic Variables Introduction In the summer of 2002, a research study commissioned by the Center for Student

### 2. Linear regression with multiple regressors

2. Linear regression with multiple regressors Aim of this section: Introduction of the multiple regression model OLS estimation in multiple regression Measures-of-fit in multiple regression Assumptions

### CONTENTS OF DAY 2. II. Why Random Sampling is Important 9 A myth, an urban legend, and the real reason NOTES FOR SUMMER STATISTICS INSTITUTE COURSE

1 2 CONTENTS OF DAY 2 I. More Precise Definition of Simple Random Sample 3 Connection with independent random variables 3 Problems with small populations 8 II. Why Random Sampling is Important 9 A myth,

### I. Introduction. II. Background. KEY WORDS: Time series forecasting, Structural Models, CPS

Predicting the National Unemployment Rate that the "Old" CPS Would Have Produced Richard Tiller and Michael Welch, Bureau of Labor Statistics Richard Tiller, Bureau of Labor Statistics, Room 4985, 2 Mass.

### CHAPTER 4 EXAMPLES: EXPLORATORY FACTOR ANALYSIS

Examples: Exploratory Factor Analysis CHAPTER 4 EXAMPLES: EXPLORATORY FACTOR ANALYSIS Exploratory factor analysis (EFA) is used to determine the number of continuous latent variables that are needed to

### INTRODUCTORY STATISTICS

INTRODUCTORY STATISTICS FIFTH EDITION Thomas H. Wonnacott University of Western Ontario Ronald J. Wonnacott University of Western Ontario WILEY JOHN WILEY & SONS New York Chichester Brisbane Toronto Singapore

### ANALYTICAL MODELING IN COMPLEX SURVEYS OF WORK PRACTICES

ANALYTICAL MODELING IN COMPLEX SURVEYS OF WORK PRACTICES JEROME P. REITER, ELAINE L. ZANUTTO, and LARRY W. HUNTER Jerome P. Reiter is Assistant Professor of the Practice of Statistics and Decision Sciences

### Univariate Regression

Univariate Regression Correlation and Regression The regression line summarizes the linear relationship between 2 variables Correlation coefficient, r, measures strength of relationship: the closer r is

### 10. Analysis of Longitudinal Studies Repeat-measures analysis

Research Methods II 99 10. Analysis of Longitudinal Studies Repeat-measures analysis This chapter builds on the concepts and methods described in Chapters 7 and 8 of Mother and Child Health: Research methods.

### Organizing Your Approach to a Data Analysis

Biost/Stat 578 B: Data Analysis Emerson, September 29, 2003 Handout #1 Organizing Your Approach to a Data Analysis The general theme should be to maximize thinking about the data analysis and to minimize

### Factor analysis. Angela Montanari

Factor analysis Angela Montanari 1 Introduction Factor analysis is a statistical model that allows to explain the correlations between a large number of observed correlated variables through a small number

### Interpretation of Somers D under four simple models

Interpretation of Somers D under four simple models Roger B. Newson 03 September, 04 Introduction Somers D is an ordinal measure of association introduced by Somers (96)[9]. It can be defined in terms

### IBM SPSS Complex Samples 22

IBM SPSS Complex Samples 22 Note Before using this information and the product it supports, read the information in Notices on page 51. Product Information This edition applies to version 22, release 0,

### MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS

MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS MSR = Mean Regression Sum of Squares MSE = Mean Squared Error RSS = Regression Sum of Squares SSE = Sum of Squared Errors/Residuals α = Level of Significance

### A Brief Introduction to Property Testing

A Brief Introduction to Property Testing Oded Goldreich Abstract. This short article provides a brief description of the main issues that underlie the study of property testing. It is meant to serve as

### Measurement in ediscovery

Measurement in ediscovery A Technical White Paper Herbert Roitblat, Ph.D. CTO, Chief Scientist Measurement in ediscovery From an information-science perspective, ediscovery is about separating the responsive

### Applications of R Software in Bayesian Data Analysis

Article International Journal of Information Science and System, 2012, 1(1): 7-23 International Journal of Information Science and System Journal homepage: www.modernscientificpress.com/journals/ijinfosci.aspx

### Introduction to Longitudinal Data Analysis

Introduction to Longitudinal Data Analysis Longitudinal Data Analysis Workshop Section 1 University of Georgia: Institute for Interdisciplinary Research in Education and Human Development Section 1: Introduction

### Teaching Multivariate Analysis to Business-Major Students

Teaching Multivariate Analysis to Business-Major Students Wing-Keung Wong and Teck-Wong Soon - Kent Ridge, Singapore 1. Introduction During the last two or three decades, multivariate statistical analysis

### APPLICATION OF LINEAR REGRESSION MODEL FOR POISSON DISTRIBUTION IN FORECASTING

APPLICATION OF LINEAR REGRESSION MODEL FOR POISSON DISTRIBUTION IN FORECASTING Sulaimon Mutiu O. Department of Statistics & Mathematics Moshood Abiola Polytechnic, Abeokuta, Ogun State, Nigeria. Abstract

### Weighting European Social Survey Data

Weighting European Social Survey Data 25th April 2014 http://www.europeansocialsurvey.org/ Contents II 1 Do analyses conducted with ESS data need to be weighted? 1 2 What weights are there to apply? 1

### IAB Evaluation Study of Methods Used to Assess the Effectiveness of Advertising on the Internet

IAB Evaluation Study of Methods Used to Assess the Effectiveness of Advertising on the Internet ARF Research Quality Council Paul J. Lavrakas, Ph.D. November 15, 2010 IAB Study of IAE The effectiveness

### SAS Software to Fit the Generalized Linear Model

SAS Software to Fit the Generalized Linear Model Gordon Johnston, SAS Institute Inc., Cary, NC Abstract In recent years, the class of generalized linear models has gained popularity as a statistical modeling

### Elementary Statistics

Elementary Statistics Chapter 1 Dr. Ghamsary Page 1 Elementary Statistics M. Ghamsary, Ph.D. Chap 01 Statistics: Statistics is the science of collecting,

### Functional Principal Components Analysis with Survey Data

First International Workshop on Functional and Operatorial Statistics. Toulouse, June 19-21, 2008 Functional Principal Components Analysis with Survey Data Hervé CARDOT, Mohamed CHAOUCH ( ), Camelia GOGA

### Is the Forward Exchange Rate a Useful Indicator of the Future Exchange Rate?

Is the Forward Exchange Rate a Useful Indicator of the Future Exchange Rate? Emily Polito, Trinity College In the past two decades, there have been many empirical studies both in support of and opposing

### Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus

Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus Tihomir Asparouhov and Bengt Muthén Mplus Web Notes: No. 15 Version 8, August 5, 2014 1 Abstract This paper discusses alternatives

### Annex 6 BEST PRACTICE EXAMPLES FOCUSING ON SAMPLE SIZE AND RELIABILITY CALCULATIONS AND SAMPLING FOR VALIDATION/VERIFICATION. (Version 01.

Page 1 BEST PRACTICE EXAMPLES FOCUSING ON SAMPLE SIZE AND RELIABILITY CALCULATIONS AND SAMPLING FOR VALIDATION/VERIFICATION (Version 01.1) I. Introduction 1. The clean development mechanism (CDM) Executive

### CORRELATED TO THE SOUTH CAROLINA COLLEGE AND CAREER-READY FOUNDATIONS IN ALGEBRA

We Can Early Learning Curriculum PreK Grades 8-12 INSIDE ALGEBRA, GRADES 8-12 CORRELATED TO THE SOUTH CAROLINA COLLEGE AND CAREER-READY FOUNDATIONS IN ALGEBRA April 2016 www.voyagersopris.com Mathematical

### Categorical Data Analysis

Richard L. Scheaffer University of Florida The reference material and many examples for this section are based on Chapter 8, Analyzing Association Between Categorical Variables, from Statistical Methods

### The Contextualization of Project Management Practice and Best Practice

The Contextualization of Project Management Practice and Best Practice Claude Besner PhD, University of Quebec at Montreal Brian Hobbs PhD, University of Quebec at Montreal Abstract This research aims

### Instructional Delivery Model Courses in the Ph.D. program are offered online.

Doctor of Philosophy in Education Doctor of Philosophy Mission Statement The Doctor of Philosophy (Ph.D.) is designed to support the mission of the Fischler School of Education. The program prepares individuals

### Math Review. for the Quantitative Reasoning Measure of the GRE revised General Test

Math Review for the Quantitative Reasoning Measure of the GRE revised General Test www.ets.org Overview This Math Review will familiarize you with the mathematical skills and concepts that are important

### CHAPTER 13 SIMPLE LINEAR REGRESSION. Opening Example. Simple Regression. Linear Regression

Opening Example CHAPTER 13 SIMPLE LINEAR REGRESSION SIMPLE LINEAR REGRESSION Simple Regression Linear Regression Simple Regression Definition A regression model is a mathematical equation that describes the

### GRADES 7, 8, AND 9 BIG IDEAS

Table 1: Strand A: BIG IDEAS: MATH: NUMBER Introduce perfect squares, square roots, and all applications Introduce rational numbers (positive and negative) Introduce the meaning of negative exponents for