A survey analysis example



Similar documents
INTRODUCTION TO SURVEY DATA ANALYSIS THROUGH STATISTICAL PACKAGES

Analysis of complex survey samples.

Multivariate Logistic Regression

Multiple Linear Regression

Lab 13: Logistic Regression

Generalized Linear Models

EDUCATION AND VOCABULARY MULTIPLE REGRESSION IN ACTION

2. Filling Data Gaps, Data validation & Descriptive Statistics

A Handbook of Statistical Analyses Using R. Brian S. Everitt and Torsten Hothorn

Outline. Dispersion Bush lupine survival Quasi-Binomial family

Package survey. February 20, 2015

Complex sampling and R.

Statistical Models in R

Logistic Regression (a type of Generalized Linear Model)

5. Linear Regression

Régression logistique : introduction

Psychology 205: Research Methods in Psychology

E(y i ) = x T i β. yield of the refined product as a percentage of crude specific gravity vapour pressure ASTM 10% point ASTM end point in degrees F

Exchange Rate Regime Analysis for the Chinese Yuan

MIXED MODEL ANALYSIS USING R

Comparing Nested Models

Descriptive Statistics. Purpose of descriptive statistics Frequency distributions Measures of central tendency Measures of dispersion

n + n log(2π) + n log(rss/n)

Final Exam Practice Problem Answers

High School Graduation Rates in Maryland Technical Appendix

Basic Statistics and Data Analysis for Health Researchers from Foreign Countries

Lecture 5: Model Checking. Prof. Sharyn O Halloran Sustainable Development U9611 Econometrics II

Failure to take the sampling scheme into account can lead to inaccurate point estimates and/or flawed estimates of the standard errors.

Lucky vs. Unlucky Teams in Sports

Nominal and Real U.S. GDP

How Far is too Far? Statistical Outlier Detection

Simple example of collinearity in logistic regression

Two-phase designs in epidemiology

Statistical Models in R

An Introduction to Spatial Regression Analysis in R. Luc Anselin University of Illinois, Urbana-Champaign

Wholesale Lending Non-Conforming LTV Matrix (JF30, JF15, JA51, JA71, JA10) Effective Date: 1/20/2015

Univariate Regression

MSwM examples. Jose A. Sanchez-Espigares, Alberto Lopez-Moreno Dept. of Statistics and Operations Research UPC-BarcelonaTech.

FISHER CAM-H300C-3F CAM-H650C-3F CAM-H1300C-3F

Lecture Notes Module 1

Tutorial 3: Graphics and Exploratory Data Analysis in R Jason Pienaar and Tom Miller

The Stock Market s Reaction to Accounting Information: The Case of the Latin American Integrated Market. Abstract

MAT12X Intermediate Algebra

Correlation and Simple Linear Regression

Elementary Statistics

>

Lets suppose we rolled a six-sided die 150 times and recorded the number of times each outcome (1-6) occured. The data is

First Midterm Exam (MATH1070 Spring 2012)

Using R for Linear Regression

Introduction to Data Visualization

The SURVEYFREQ Procedure in SAS 9.2: Avoiding FREQuent Mistakes When Analyzing Survey Data ABSTRACT INTRODUCTION SURVEY DESIGN 101 WHY STRATIFY?

Electronic Thesis and Dissertations UCLA

Applied Statistics. J. Blanchet and J. Wadsworth. Institute of Mathematics, Analysis, and Applications EPF Lausanne

Visualization of Complex Survey Data: Regression Diagnostics

1. What is the critical value for this 95% confidence interval? CV = z.025 = invnorm(0.025) = 1.96

Chapter 7 Section 1 Homework Set A

Chapter 11 Introduction to Survey Sampling and Analysis Procedures

Statistical Analysis of Life Insurance Policy Termination and Survivorship

Students' Opinion about Universities: The Faculty of Economics and Political Science (Case Study)

We extended the additive model in two variables to the interaction model by adding a third term to the equation.

Gamma Distribution Fitting

Exploratory data analysis (Chapter 2) Fall 2011

Statistics 522: Sampling and Survey Techniques. Topic 5. Consider sampling children in an elementary school.

DATA INTERPRETATION AND STATISTICS

Lab 5 Linear Regression with Within-subject Correlation. Goals: Data: Use the pig data which is in wide format:

Business Statistics. Successful completion of Introductory and/or Intermediate Algebra courses is recommended before taking Business Statistics.

DEPARTMENT OF PSYCHOLOGY UNIVERSITY OF LANCASTER MSC IN PSYCHOLOGICAL RESEARCH METHODS ANALYSING AND INTERPRETING DATA 2 PART 1 WEEK 9

Survey Data Analysis in Stata

Financial Risk Models in R: Factor Models for Asset Returns. Workshop Overview

Proposition 38. Tax for Education and Early Childhood Programs. Initiative Statute.

Biostatistics: DESCRIPTIVE STATISTICS: 2, VARIABILITY

Logistic regression (with R)

Data Mining: An Overview of Methods and Technologies for Increasing Profits in Direct Marketing. C. Olivia Rud, VP, Fleet Bank

ASSESSING FINANCIAL EDUCATION: EVIDENCE FROM BOOTCAMP. William Skimmyhorn. Online Appendix

1) Write the following as an algebraic expression using x as the variable: Triple a number subtracted from the number

Implementing Panel-Corrected Standard Errors in R: The pcse Package

2. Linear regression with multiple regressors

Data exploration with Microsoft Excel: analysing more than one variable

Stat 412/512 CASE INFLUENCE STATISTICS. Charlotte Wickham. stat512.cwick.co.nz. Feb

Statistics. Measurement. Scales of Measurement 7/18/2012

STATISTICA Formula Guide: Logistic Regression. Table of Contents

Descriptive Statistics

Modeling Lifetime Value in the Insurance Industry

Confidence Intervals for the Difference Between Two Means

Course Text. Required Computing Software. Course Description. Course Objectives. StraighterLine. Business Statistics

Analyzing Flow Cytometry Data with Bioconductor

Lecture 11: Confidence intervals and model comparison for linear regression; analysis of variance

Part 2: Analysis of Relationship Between Two Variables

Bowerman, O'Connell, Aitken Schermer, & Adcock, Business Statistics in Practice, Canadian edition

SPSS Guide: Regression Analysis

Transcription:

A survey analysis example Thomas Lumley November 20, 2013 This document provides a simple example analysis of a survey data set, a subsample from the California Academic Performance Index, an annual set of tests used to evaluate California schools. The API website, including the original data files are at http://api.cde.ca.gov. The subsample was generated as a teaching example by Academic Technology Services at UCLA and was obtained from http://www.ats.ucla.edu/stat/stata/library/svy_survey.htm. We have a cluster sample in which 15 school districts were sampled and then all schools in each district. This is in the data frame apiclus1, loaded with data(api). The two-stage sample is defined by the sampling unit (dnum) and the population size(fpc). Sampling weights are computed from the population sizes, but could be provided separately. > library(survey) > data(api) > dclus1 <- svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc) The svydesign function returns an object containing the survey data and metadata. > summary(dclus1) 1 - level Cluster Sampling design With (15) clusters. svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc) Probabilities: Min. 1st Qu. Median Mean 3rd Qu. Max. 0.02954 0.02954 0.02954 0.02954 0.02954 0.02954 Population size (PSUs): 757 Data variables: [1] "cds" "stype" "name" "sname" "snum" "dname" [7] "dnum" "cname" "cnum" "flag" "pcttest" "api00" [13] "api99" "target" "growth" "sch.wide" "comp.imp" "both" [19] "awards" "meals" "ell" "yr.rnd" "mobility" "acs.k3" [25] "acs.46" "acs.core" "pct.resp" "not.hsg" "hsg" "some.col" [31] "col.grad" "grad.sch" "avg.ed" "full" "emer" "" [37] "api.stu" "fpc" "pw" 1

We can compute summary statistics to estimate the mean, median, and quartiles of the Academic Performance Index in the year 2000, the number of elementary, middle, and high schools in the state, the total number of students, and the proportion who took the test. Each function takes a formula object describing the variables and a survey design object containing the data. > svymean(~api00, dclus1) mean SE api00 644.17 23.542 > svyquantile(~api00, dclus1, quantile=c(0.25,0.5,0.75), ci=true) $quantiles api00 551.75 652 717.5 $CIs,, api00 (lower 493.2835 564.3250 696.0000 upper) 622.6495 710.8375 761.1355 > svytotal(~stype, dclus1) stypee 4873.97 1333.32 stypeh 473.86 158.70 stypem 846.17 167.55 > svytotal(~, dclus1) 3404940 932235 > svyratio(~api.stu,~, dclus1) Ratio estimator: svyratio.survey.design2(~api.stu, ~, dclus1) Ratios= api.stu 0.8497087 SEs= api.stu 0.008386297 The ordinary R subsetting functions [ and subset work correctly on these survey objects, carrying along the metadata needed for valid standard errors. Here we compute the proportion of high school students who took the test 2

> svyratio(~api.stu, ~, design=subset(dclus1, stype=="h")) Ratio estimator: svyratio.survey.design2(~api.stu, ~, design = subset(dclus1, stype == "H")) Ratios= api.stu 0.8300683 SEs= api.stu 0.01472607 The warnings referred to in the output occured because several school districts have only one high school sampled, making the second stage standard error estimation unreliable. Specifying a large number of variables is made easier by the make.formula function > vars<-names(apiclus1)[c(12:13,16:23,27:37)] > svymean(make.formula(vars),dclus1,na.rm=true) mean SE api00 643.203822 25.4936 api99 605.490446 25.4987 sch.wideno 0.127389 0.0247 sch.wideyes 0.872611 0.0247 comp.impno 0.273885 0.0365 comp.impyes 0.726115 0.0365 bothno 0.273885 0.0365 bothyes 0.726115 0.0365 awardsno 0.292994 0.0397 awardsyes 0.707006 0.0397 meals 50.636943 6.6588 ell 26.891720 2.1567 yr.rndno 0.942675 0.0358 yr.rndyes 0.057325 0.0358 mobility 17.719745 1.4555 pct.resp 67.171975 9.6553 not.hsg 23.082803 3.1976 hsg 24.847134 1.1167 some.col 25.210191 1.4709 col.grad 20.611465 1.7305 grad.sch 6.229299 1.5361 avg.ed 2.621529 0.1054 full 87.127389 2.1624 emer 10.968153 1.7612 573.713376 46.5959 api.stu 487.318471 41.4182 3

Summary statistics for subsets can also be computed with svyby. Here we compute the average proportion of English language learners and of students eligible for subsidized school meals for elementary, middle, and high schools > svyby(~ell+meals, ~stype, design=dclus1, svymean) stype ell meals se.ell se.meals E E 29.69444 53.09028 1.411617 7.070399 H H 15.00000 37.57143 5.347065 5.912262 M M 22.68000 43.08000 2.952862 6.017110 Regression models show that these socieconomic variables predict API score and whether the school achieved its API target > regmodel <- svyglm(api00~ell+meals,design=dclus1) > logitmodel <- svyglm(i(sch.wide=="yes")~ell+meals, design=dclus1, family=quasibinomial()) > summary(regmodel) Call: svyglm(formula = api00 ~ ell + meals, design = dclus1) Survey design: svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc) Coefficients: Estimate Std. Error t value Pr(> t ) (Intercept) 817.1823 18.6709 43.768 1.32e-14 *** ell -0.5088 0.3259-1.561 0.144 meals -3.1456 0.3018-10.423 2.29e-07 *** --- Signif. codes: 0 ^aăÿ***^aăź 0.001 ^aăÿ**^aăź 0.01 ^aăÿ*^aăź 0.05 ^aăÿ.^aăź 0.1 ^aăÿ ^aăź 1 (Dispersion parameter for gaussian family taken to be 3161.207) Number of Fisher Scoring iterations: 2 > summary(logitmodel) Call: svyglm(formula = I(sch.wide == "Yes") ~ ell + meals, design = dclus1, family = quasibinomial()) Survey design: svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc) Coefficients: Estimate Std. Error t value Pr(> t ) (Intercept) 1.899557 0.509915 3.725 0.00290 ** 4

ell 0.039925 0.012443 3.209 0.00751 ** meals -0.019115 0.008825-2.166 0.05117. --- Signif. codes: 0 ^aăÿ***^aăź 0.001 ^aăÿ**^aăź 0.01 ^aăÿ*^aăź 0.05 ^aăÿ.^aăź 0.1 ^aăÿ ^aăź 1 (Dispersion parameter for quasibinomial family taken to be 0.9627734) Number of Fisher Scoring iterations: 5 We can calibrate the sampling using the statewide total for the previous year s API > gclus1 <- calibrate(dclus1, formula=~api99, population=c(6194, 3914069)) which improves estimation of some quantities > svymean(~api00, gclus1) mean SE api00 666.72 3.2959 > svyquantile(~api00, gclus1, quantile=c(0.25,0.5,0.75), ci=true) $quantiles api00 592.0652 681.181 736.5414 $CIs,, api00 (lower 553.0472 663.3175 721.1432 upper) 617.0915 696.8528 755.2959 > svytotal(~stype, gclus1) stypee 4881.77 302.15 stypeh 463.35 183.03 stypem 848.88 194.76 > svytotal(~, gclus1) 3357372 243227 > svyratio(~api.stu,~, gclus1) 5

Ratio estimator: svyratio.survey.design2(~api.stu, ~, gclus1) Ratios= api.stu 0.8506941 SEs= api.stu 0.008674888 6