
University of North Carolina Chapel Hill
Soci252-002 Data Analysis in Sociological Research
Spring 2013
Professor François Nielsen

Homework 5 Computer Handout

Readings

This handout covers computer issues related to Chapters 23, 24, 25 and 26 in De Veaux et al. 2012. Stats: Data and Models. 3e. Addison-Wesley. (STATSDM3)

Chapter 23 Inferences about Means

Calculating a CI for a Mean by hand

I illustrate the calculation of the one-sample t-interval for the mean with the Triphammer speed data from STATSDM3, pp. 557–8. I had earlier copied the data in CSV format from the textbook CD to my working directory. I use read.csv to read in the data and attach the dataframe. I then check for near-normality with a stem-and-leaf display and a normal quantile plot, and calculate the confidence interval.

> Speeds <- read.csv("ch23_triphammer_speeds.csv", sep=",", header=TRUE)
> attach(Speeds)
> speed
 [1] 29 34 34 28 30 29 38 31 29 34 32 31 27 37 29 26 24 34 36 31 34 36 21
> stem(speed)

  The decimal point is 1 digit(s) to the right of the |

  2 | 14
  2 | 6789999
  3 | 0111244444
  3 | 6678

> qqnorm(speed)                     # graph not shown
> qqline(speed)
> ybar <- mean(speed)
> s <- sd(speed)
> n <- length(speed)
> alpha <- .10                      # to get 90 pct interval
> tstar <- qt(1 - alpha/2, df=n-1)  # critical t value
> tstar
[1] 1.717144
> SE <- s/sqrt(n)
> ME <- tstar*SE
> c(ybar - ME, ybar + ME)
[1] 29.52257 32.56439
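The arithmetic above can be packaged in a small helper function so it is easy to redo with a different variable or confidence level. This is only a convenience sketch of my own, not something from the textbook or required for the homework; the function name t_interval is made up for illustration, and for real work you would normally just use t.test (introduced below).

t_interval <- function(y, conf.level = 0.95) {
  y <- y[!is.na(y)]                      # drop missing values, if any
  n <- length(y)
  ybar <- mean(y)
  SE <- sd(y) / sqrt(n)                  # standard error of the mean
  alpha <- 1 - conf.level
  tstar <- qt(1 - alpha/2, df = n - 1)   # critical t value
  ME <- tstar * SE                       # margin of error
  c(lower = ybar - ME, upper = ybar + ME)
}

For example, t_interval(speed, conf.level = 0.90) should reproduce the interval (29.52257, 32.56439) computed step by step above.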

We can be 90% confident that the mean speed of vehicles on Triphammer Road is between 29.5 and 32.6 miles per hour.

Calculating a t-test for the Mean by hand

Using the same data we test whether the mean speed of all cars exceeds the speed limit of 30 miles an hour. This is a directional hypothesis, with H0: µ = µ0 = 30 and HA: µ > µ0 = 30. All preliminary calculations are the same as above.

> # calculate the t-statistic
> mu0 <- 30
> t <- (ybar - mu0)/SE
> t
[1] 1.178114
> pval <- 1 - pt(t, n-1)
> pval
[1] 0.1256691

The p-value of 0.126 means that if the mean speed were actually 30 miles per hour, one would observe a sample mean as large as 31 mph or larger about 12.6% of the time. This is insufficient basis for rejecting the null hypothesis that the mean speed is 30 mph.

Calculating a CI and t-test Using t.test

Both one-sample t-intervals and one-sample t-tests are carried out with the t.test function in R. Depending on whether a CI or a test is desired, we may want to ignore the part of the output we don't need. I first replicate the calculation for the 90% CI for mean speed.

> t.test(speed, conf.level = 0.90)

        One Sample t-test

data:  speed
t = 35.0489, df = 22, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
90 percent confidence interval:
 29.52257 32.56439
sample estimates:
mean of x
 31.04348

The confidence interval is the same as calculated earlier by hand. We can ignore the p-value and the rest of the hypothesis-testing part of the output, as it tests by default the null hypothesis µ = 0, which is not meaningful here. Testing the hypothesis that the mean speed is greater than 30 mph is done as follows.

> t.test(speed, mu = 30, alternative = "greater")

        One Sample t-test

data:  speed
t = 1.1781, df = 22, p-value = 0.1257
alternative hypothesis: true mean is greater than 30
95 percent confidence interval:
 29.52257      Inf
sample estimates:
mean of x
 31.04348
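A practical aside, not part of the original session: t.test does not only print its results, it also returns them as a list of class "htest", so the pieces needed for a write-up can be saved and reused rather than retyped. A sketch, where the object name tt is simply a label I chose (output not shown):

> tt <- t.test(speed, mu = 30, alternative = "greater")
> tt$statistic   # the t statistic (1.1781)
> tt$p.value     # the one-sided p-value (0.1257)
> tt$conf.int    # the one-sided confidence interval
> tt$estimate    # the sample mean of speed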

The hypothesis-test part of the t.test output replicates what we calculated by hand earlier. Corresponding to the one-sided test, the output also includes a one-sided confidence interval for the mean speed, calculated as (29.5, ∞). We will not be using one-sided intervals in this course, so we can ignore this part of the output.

Chapter 24 Comparing Means

Comparisons of means for two independent samples are also carried out with the R function t.test. There are two ways to organize the data for input into t.test: as two separate vectors, or as a single variable with a factor identifying group membership.

Data as separate vectors

I use the example of a comparison of battery life of brand-name and generic batteries in STATSDM3 (pp. 585–587). In the first method the data are input as two separate vectors. I then create a side-by-side boxplot of the two samples (not shown), and then carry out the t-test. I specify var.equal = FALSE for clarity, but this is the default.

> # Data as separate vectors
> brandname <- c(194.0, 205.5, 199.2, 172.4, 184.0, 169.5)
> generic <- c(190.7, 203.5, 203.5, 206.5, 222.5, 209.4)
> boxplot(brandname, generic)
> t.test(brandname, generic, var.equal=FALSE)

        Welch Two Sample t-test

data:  brandname and generic
t = -2.5462, df = 8.986, p-value = 0.03143
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -35.097420  -2.069246
sample estimates:
mean of x mean of y
 187.4333  206.0167

The results are the same as in the textbook, except that the confidence interval for the difference in means is negative (−35.1 min, −2.1 min). We could have obtained positive values by entering generic first and brandname second.

Data as single vector with identifying factor

This method of entering data is more practical when working with data frames. The battery life for both samples is contained in a single variable Times, and group membership is identified by a factor Battery.Type. The t.test function is then called with a model formula Times ~ Battery.Type, meaning that battery life (on the left) is modelled by the factor battery type (on the right). Note that side-by-side boxplots are drawn with the plot() function, also using a model formula. (You can check for yourself that the side-by-side boxplots drawn with plot(Times ~ Battery.Type) are better labelled than those produced with boxplot().)

> # Data as single vector with group factor
> Batteries <- read.csv("ch24_battery_life.csv", sep=",", header=TRUE)
> Batteries
   Times Battery.Type
1  194.0   Brand Name
2  205.5   Brand Name
3  199.2   Brand Name
4  172.4   Brand Name
5  184.0   Brand Name
6  169.5   Brand Name
7  190.7      Generic
8  203.5      Generic
9  203.5      Generic
10 206.5      Generic
11 222.5      Generic
12 209.4      Generic
> attach(Batteries)
> plot(Times ~ Battery.Type)        # side-by-side boxplots, not shown
> t.test(Times ~ Battery.Type, var.equal = FALSE)

        Welch Two Sample t-test

data:  Times by Battery.Type
t = -2.5462, df = 8.986, p-value = 0.03143
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -35.097420  -2.069246
sample estimates:
mean in group Brand Name    mean in group Generic
                187.4333                 206.0167

We see that this method of data entry produces the same test and confidence interval as the use of separate vectors.

Pooled t-test

The pooled t-test is one in which the variances of the two populations can be assumed to be equal. The SE of the difference is then based on a variance estimate calculated from the pooled data. The pooled test is obtained with t.test by setting the option var.equal = TRUE. In the battery life example, the pooled t-test is obtained as follows (output is not shown).

> t.test(Times ~ Battery.Type, var.equal = TRUE)
> detach(Batteries)                 # clean up

t-intervals and t-tests from Summarized Data

The package BSDA provides an R function tsum.test for one-sample and two-sample tests of means based on summary data instead of the original data. The package is authored by Alan T. Arnholt as a companion to the book Kitchens, Larry J. 2002. Basic Statistics and Data Analysis. Duxbury Press. The package is available on CRAN; it must be installed on your system before you use it. Alternatively, the R script BSDAsumfunc.R with the relevant BSDA functions can be sourced from the course site (see the commands below).

The following illustration tests the difference in the mean proportion of women in state cabinets between 22 states with a Democratic governor (µ1) and 17 states with a Republican governor (µ2). (Note that the proportion of women in a state cabinet is treated here as a quantitative variable.) The options and defaults are the same as for t.test, so the following is a two-sided test.

> # t-tests and intervals from summarized data
> library(BSDA)
> # uncomment the next line if you do not have the BSDA package installed
> # source("http://www.unc.edu/%7enielsen/soci252/assign/bsdasumfunc.r")
> ybar1 <- 0.239864; s1 <- 0.135046; n1 <- 22
> ybar2 <- 0.171235; s2 <- 0.08224; n2 <- 17
> tsum.test(ybar1, s1, n1, ybar2, s2, n2)

        Welch Modified Two-Sample t-test

data:  Summarized x and y
t = 1.9594, df = 35.317, p-value = 0.058
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.00245477  0.13971277
sample estimates:
mean of x mean of y
 0.239864  0.171235

The p-value of .058 > .05 indicates that the two-sided test of the difference in mean proportions of women is not significant at the .05 level. But now let's do a one-sided test.

> tsum.test(ybar1, s1, n1, ybar2, s2, n2, alt = "greater")

        Welch Modified Two-Sample t-test

data:  Summarized x and y
t = 1.9594, df = 35.317, p-value = 0.029
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 0.009464452          NA
sample estimates:
mean of x mean of y
 0.239864  0.171235

The one-sided test has a p-value of .029, indicating a significantly greater proportion of women in the state cabinets of Democratic as compared to Republican governors. (Note that the one-sided p-value is half the two-sided p-value.)

Chapter 25 Paired Samples and Blocks

Paired t-test

In paired samples the observations are linked in pairs, such as measurements on the same individuals at two points in time, measurements upstream and downstream on a river, or measurements on the two siblings in a pair. In R a paired t-test is carried out with the function t.test by specifying paired = TRUE. As an example I use data that we collected in class earlier in the semester. For a paired t-test using a dataframe, the paired observations are contained in two different variables. Here the variable own is the total number of siblings in the student's family; uncle is the total number of children in the family of the student's oldest aunt or uncle.

> Famsize <- read.csv("famsize.csv", sep=",", header=TRUE)
> head(Famsize)                     # list first 6 cases
  own uncle
1   2     2
2   3     3
3   3     1
4   3     1
5   4     7
6   2     0
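> # Note added for clarity (not part of the original class session): a paired
> # t-test is equivalent to a one-sample t-test on the within-pair differences.
> # After running the paired test below, you can verify that
> #   t.test(Famsize$own - Famsize$uncle)
> # gives the same t statistic, df, p-value, and confidence interval.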

> attach(Famsize)
> t.test(own, uncle, paired = TRUE)

        Paired t-test

data:  own and uncle
t = 3.6304, df = 32, p-value = 0.0009768
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.5054239 1.7976064
sample estimates:
mean of the differences
               1.151515

> cor(own, uncle)
[1] 0.3113127
> detach(Famsize)                   # clean up

The test is highly significant: there is a very low probability that a difference between own and uncle family size as large as this would be produced by chance alone. We discussed earlier the kind of systematic bias that may inflate the own family size reported by students in class. Note that the sizes of the own and uncle families are in fact correlated (r = 0.311). What sociological mechanisms might explain the positive correlation?

Chapter 26 Comparing Counts

Goodness-of-Fit Test

The function chisq.test in R differs from its counterpart in many other statistical programs in that it will perform any goodness-of-fit test, even one that is not associated with a contingency table. The following shows how to enter the data for a comparison of the age distribution in a sample from the jury pool of a large municipal court district with the age distribution in the district as a whole, as given by the census, for the seven age categories 18–19, 20–24, 25–29, 30–39, 40–49, 50–64, and 65 and over (Koopmans 1987, pp. 413–417). Below I enter the sample counts into the variable obs, and the census proportions into the variable ps.

> # Goodness-of-fit test
> obs <- c(23, 96, 134, 293, 297, 380, 113)
> ps <- c(0.061, 0.150, 0.135, 0.217, 0.153, 0.182, 0.102)
> chisq.test(obs, correct=FALSE, p=ps)

        Chi-squared test for given probabilities

data:  obs
X-squared = 231.26, df = 6, p-value < 2.2e-16

> round(residuals(chisq.test(obs, correct=FALSE, p=ps)), digits = 3)
[1] -6.480 -7.375 -3.452  0.181  6.476  8.776 -1.994

The p-value of the test is very small, indicating a highly significant discrepancy between the counts in the various age categories in obs and the census proportions in ps. Thus we can reject the null hypothesis that the age distributions of the jury pool and of the district population are the same. The last command prints the standardized residuals (rounded to 3 decimal places). They show that the jury pool is significantly deficient in the younger age categories, overrepresented in the middle-aged categories, and deficient again in the 65-and-over category (why this pattern?).
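To see what chisq.test is doing here, the statistic can be computed directly from its definition, chi-square = sum((obs - exp)^2 / exp), where the expected counts exp are the census proportions multiplied by the total sample size. The few lines below are a sketch of this check using the obs and ps vectors created above (the object name exp.counts is my own); the standardized residuals printed above are simply (obs - exp) / sqrt(exp).

exp.counts <- sum(obs) * ps                          # expected counts under the null hypothesis
sum((obs - exp.counts)^2 / exp.counts)               # should match X-squared = 231.26 above
round((obs - exp.counts) / sqrt(exp.counts), 3)      # should match the residuals() output above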

Homogeneity and Independence Test from Summary Counts

The mechanics of homogeneity and independence tests for a contingency table are the same. However, count data may be available in different forms. In this section I show how to enter data as summary counts, as found in a book or published article. The data are the party affiliation of black and white respondents in the 2002 General Social Survey (Agresti 2007, p. 60). You can use the following commands as a template that can be adapted to analyse any two-way contingency table you want.

> # Entering pre-tabulated data for chi-square test
> partybyrace <- matrix(c(871, 444, 873, 302, 80, 43), nrow = 2, byrow = TRUE)
> rownames(partybyrace) <- c("White", "Black")
> colnames(partybyrace) <- c("Dem", "Indep", "Rep")
> names(dimnames(partybyrace)) <- c("Race", "Party")
> partybyrace
       Party
Race    Dem Indep Rep
  White 871   444 873
  Black 302    80  43
> round(100*prop.table(partybyrace, 1), digits = 1)
       Party
Race     Dem Indep  Rep
  White 39.8  20.3 39.9
  Black 71.1  18.8 10.1
> test <- chisq.test(partybyrace)
> test

        Pearson's Chi-squared test

data:  partybyrace
X-squared = 167.8457, df = 2, p-value < 2.2e-16

> residuals(test)
       Party
Race          Dem      Indep       Rep
  White -3.548581  0.2495696  3.826892
  Black  8.051632 -0.5662665 -8.683111

I first entered the contingency table as a matrix, specifying the number of rows (I could have specified the number of columns with ncol = 3 instead) and specifying with byrow = TRUE that the data vector should be distributed row by row (the default is by column). Then I added row names and column names, and names for the two dimensions Race and Party of the table. I then printed the contingency table and used the expression prop.table(partybyrace, 1) to show the conditional distributions of Party (dimension 2) given Race (dimension 1). (Use 2 instead of 1 to compute percentages by column.) We can see that blacks are much more likely than whites to identify as Democrats. I then carried out the chi-square test and saved the resulting object in test for later use. The tiny p-value of the chi-square test indicates that we can reject the hypothesis that Race and Party are independent. Finally, I generated the residuals. The pattern of residuals shows that Democratic identification is significantly stronger, and Republican identification significantly weaker, among black respondents than would be expected under independence.

Homogeneity or Independence Test from a Dataframe

I illustrate the chi-square test with data from the dataframe Chile in the package car. The data are from a public opinion survey of voting intentions carried out just before the 1988 Chile plebiscite. The plebiscite asked voters whether General Pinochet should continue to be President or should step down, so a yes (Y) and a no (N) vote represent support for and opposition to Pinochet, respectively; survey respondents could also specify whether they intended to abstain (A) or were undecided (U). I cross-tabulate voting intention with education, categorized as primary, secondary and post-secondary (college) education. I first want to rearrange the levels of the education variable so they are in the natural order P (= primary), S (= secondary), and PS (= post-secondary). Then I calculate the contingency table, the chi-square value, the conditional distributions of voting intentions by education, and the standardized residuals. (Some irrelevant output is omitted.)

> library(car)
...
> data(Chile)
> levels(Chile$education)           # education levels in wrong order
[1] "P"  "PS" "S"
> Chile$education <- factor(Chile$education, levels=c("P", "S", "PS"))
> levels(Chile$education)           # now education levels in right order
[1] "P"  "S"  "PS"
> attach(Chile)
...
> votebyed <- table(education, vote)
> votebyed
         vote
education   A   N   U   Y
       P   52 266 296 422
       S  103 397 237 311
       PS  32 224  52 130
> chisq.test(votebyed)

        Pearson's Chi-squared test

data:  votebyed
X-squared = 135.8485, df = 6, p-value < 2.2e-16

> round(100*prop.table(votebyed, 1), 1)
         vote
education    A    N    U    Y
       P   5.0 25.7 28.6 40.7
       S   9.8 37.9 22.6 29.7
       PS  7.3 51.1 11.9 29.7

We see that the association of voting intention with education is highly significant (chi-square = 135.85 with 6 df, p < .001). The conditional distributions show that respondents with only primary schooling are more likely to support Pinochet (Y is 40.7%, compared with 29.7% for respondents with either secondary or post-secondary education). Respondents with primary schooling are also more likely to be undecided (28.6%, compared to 22.6% and 11.9% for respondents with secondary or post-secondary education levels). Intention to vote No (against Pinochet) increases monotonically with level of education (25.7%, 37.9% and 51.1% for primary, secondary, and post-secondary levels respectively). To examine these patterns further it is useful to compute the residuals.

> residuals(chisq.test(votebyed))
         vote
education           A           N           U           Y
       P  -2.83150838 -5.15320625  3.59250661  3.58461538
       S   2.86931756  1.47995902 -0.39077765 -2.51431295
       PS -0.08363231  5.63613432 -4.92063527 -1.62374326

The patterns found in the conditional distributions of voting intention by education level are confirmed by the pattern of residuals. The two largest residuals in absolute value (−5.153 and 5.636) correspond to the lower-than-expected propensity of respondents with primary schooling, and the higher-than-expected propensity of post-secondary respondents, to intend a No vote, and so on.

Additional Analyses: Mosaic Plot and Measures of Association

The package vcd provides the mosaic command to produce a very useful visualization of the residuals of a contingency table. The function assocstats calculates common chi-square based measures of association, which can be printed with the summary function. I illustrate the use of mosaic and assocstats with the votebyed table calculated earlier from the Chile data.

> # Additional analysis with contingency tables
> library(vcd)
> mosaic(votebyed, shade = TRUE)
> summary(assocstats(votebyed))
Number of cases in table: 2522
Number of factors: 2
Test for independence of all factors:
        Chisq = 135.85, df = 6, p-value = 7.528e-27
                    X^2 df P(> X^2)
Likelihood Ratio 139.00  6        0
Pearson          135.85  6        0

Phi-Coefficient   : 0.232
Contingency Coeff.: 0.226
Cramer's V        : 0.164

The graph produced by mosaic(votebyed, shade = TRUE) (Figure 1) is a visual representation of the table of residuals of votebyed calculated earlier. Blue represents an excess (positive residual) and red a deficit (negative residual) relative to the expected frequency. Greater color saturation signifies greater significance (larger absolute values of the residual). For example, there are highly significant deficits in No votes for primary-school respondents and in Undecided votes for post-secondary respondents, and a highly significant excess of No votes for post-secondary respondents. Other, less significant discrepancies are marked with more desaturated color. The areas of the cells of the mosaic plot are proportional to the cell counts. The coefficients of association will be explained in class.

The package vcdExtra has a function GKgamma to calculate the Goodman-Kruskal γ coefficient for a contingency table involving two ordinal variables. I illustrate the use of GKgamma with the table partybyrace created earlier. The coefficient γ can vary between −1 and +1. The calculated value of −.575 indicates a strong negative association between race (White to Black, top to bottom) and party affiliation (Dem to Rep, left to right).

> # Gamma coefficient for ordered tables
> library(vcdExtra)
> GKgamma(partybyrace)
gamma        : -0.575
std. error   :  0.034
CI           : -0.642 -0.507
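The chi-square-based measures reported by assocstats can also be computed directly from the chi-square statistic, which may help demystify them. The lines below are a sketch of this check using the votebyed table (the object names X2, N and k are mine); they use the standard formulas phi = sqrt(X2/N), contingency coefficient = sqrt(X2/(X2 + N)), and Cramer's V = sqrt(X2/(N*(min(rows, cols) - 1))).

X2 <- chisq.test(votebyed)$statistic     # Pearson chi-square, 135.85
N  <- sum(votebyed)                      # total number of cases, 2522
k  <- min(dim(votebyed)) - 1             # min(number of rows, number of columns) minus 1
sqrt(X2 / N)                             # phi coefficient, about 0.232
sqrt(X2 / (X2 + N))                      # contingency coefficient, about 0.226
sqrt(X2 / (N * k))                       # Cramer's V, about 0.164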

[Figure 1 appears here: the shaded mosaic plot produced by mosaic(votebyed, shade = TRUE), with cells colored according to the sign and size of the Pearson residuals and the chi-square p-value (< 2.22e-16) shown in the legend.]

Figure 1: Mosaic plot of voting intentions in the 1988 Chile referendum by level of education. A = Abstain, N = No (against Pinochet), U = Undecided, Y = Yes (pro-Pinochet). Chile data set from John Fox's car package.
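If you want to include a graph like Figure 1 in your homework write-up, one convenient approach (a sketch, not something required by the assignment) is to write it to a PDF file with R's pdf() graphics device and then insert that file into your document. This assumes the vcd package is still loaded and votebyed still exists in your workspace; the file name is just an example.

pdf("mosaic_votebyed.pdf")          # open a PDF file as the graphics device
mosaic(votebyed, shade = TRUE)      # the plot is drawn into the file instead of the screen
dev.off()                           # close the device so the file is written completely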