Implementing Multiple Comparisons on Pearson Chi-square Test for an R C Contingency Table in SAS



Similar documents
SAS/STAT. 9.2 User s Guide. Introduction to. Nonparametric Analysis. (Book Excerpt) SAS Documentation

Guido s Guide to PROC FREQ A Tutorial for Beginners Using the SAS System Joseph J. Guido, University of Rochester Medical Center, Rochester, NY

Beginning Tutorials. PROC FREQ: It s More Than Counts Richard Severino, The Queen s Medical Center, Honolulu, HI OVERVIEW.

SAS Software to Fit the Generalized Linear Model

Multiple samples: Pairwise comparisons and categorical outcomes

Basic Statistical and Modeling Procedures Using SAS

Chapter 12 Nonparametric Tests. Chapter Table of Contents

INTERPRETING THE ONE-WAY ANALYSIS OF VARIANCE (ANOVA)

Bowerman, O'Connell, Aitken Schermer, & Adcock, Business Statistics in Practice, Canadian edition

Multiple-Comparison Procedures

EPS 625 INTERMEDIATE STATISTICS FRIEDMAN TEST

Comparing Multiple Proportions, Test of Independence and Goodness of Fit

Nonparametric Statistics

1 Overview. Fisher s Least Significant Difference (LSD) Test. Lynne J. Williams Hervé Abdi

TABLE OF CONTENTS. About Chi Squares What is a CHI SQUARE? Chi Squares Hypothesis Testing with Chi Squares... 2

Projects Involving Statistics (& SPSS)

Package dunn.test. January 6, 2016

Section 12 Part 2. Chi-square test

Chapter 19 The Chi-Square Test

1 Nonparametric Statistics

SPSS Tests for Versions 9 to 13

UNDERSTANDING THE TWO-WAY ANOVA

Descriptive Statistics

Paper PO-008. exceeds 5, where and (Stokes, Davis and Koch, 2000).

THE FIRST SET OF EXAMPLES USE SUMMARY DATA... EXAMPLE 7.2, PAGE 227 DESCRIBES A PROBLEM AND A HYPOTHESIS TEST IS PERFORMED IN EXAMPLE 7.

Tests for Two Proportions

Chapter 5 Analysis of variance SPSS Analysis of variance

Analysis of Variance ANOVA

The Kendall Rank Correlation Coefficient

SPSS Explore procedure

Crosstabulation & Chi Square

Analysis of categorical data: Course quiz instructions for SPSS

Bivariate Statistics Session 2: Measuring Associations Chi-Square Test

Lesson 1: Comparison of Population Means Part c: Comparison of Two- Means

Section 13, Part 1 ANOVA. Analysis Of Variance

Analysing Questionnaires using Minitab (for SPSS queries contact -)

LAB 4 INSTRUCTIONS CONFIDENCE INTERVALS AND HYPOTHESIS TESTING

Quick Start to Data Analysis with SAS Table of Contents. Chapter 1 Introduction 1. Chapter 2 SAS Programming Concepts 7

Statistical tests for SPSS

Is it statistically significant? The chi-square test

VI. Introduction to Logistic Regression

Introduction to Quantitative Methods

Chi-square test Fisher s Exact test

STATISTICA Formula Guide: Logistic Regression. Table of Contents

Lecture 14: GLM Estimation and Logistic Regression

Consider a study in which. How many subjects? The importance of sample size calculations. An insignificant effect: two possibilities.

Descriptive Analysis

This can dilute the significance of a departure from the null hypothesis. We can focus the test on departures of a particular form.

An Introduction to Statistical Tests for the SAS Programmer Sara Beck, Fred Hutchinson Cancer Research Center, Seattle, WA

Counting the Ways to Count in SAS. Imelda C. Go, South Carolina Department of Education, Columbia, SC

ABSORBENCY OF PAPER TOWELS

Two Correlated Proportions (McNemar Test)

Two-Sample T-Tests Assuming Equal Variance (Enter Means)

Package HHG. July 14, 2015

An introduction to IBM SPSS Statistics

Data analysis process

Class 19: Two Way Tables, Conditional Distributions, Chi-Square (Text: Sections 2.5; 9.1)

Creating Dynamic Reports Using Data Exchange to Excel

Labels, Labels, and More Labels Stephanie R. Thompson, Rochester Institute of Technology, Rochester, NY

SCHOOL OF HEALTH AND HUMAN SCIENCES DON T FORGET TO RECODE YOUR MISSING VALUES

Rank-Based Non-Parametric Tests

The Chi-Square Test. STAT E-50 Introduction to Statistics

Chi Squared and Fisher's Exact Tests. Observed vs Expected Distributions

Lecture 19: Conditional Logistic Regression

Using Stata for Categorical Data Analysis

Testing differences in proportions

Generalized Linear Models

Statistics for Sports Medicine

Statistics, Data Analysis & Econometrics

1 Overview and background

Survey, Statistics and Psychometrics Core Research Facility University of Nebraska-Lincoln. Log-Rank Test for More Than Two Groups

13: Additional ANOVA Topics. Post hoc Comparisons

Biostatistics: Types of Data Analysis

Independent t- Test (Comparing Two Means)

Analysis of Variance. MINITAB User s Guide 2 3-1

The Dummy s Guide to Data Analysis Using SPSS

Lecture 8. Confidence intervals and the central limit theorem

Two-Sample T-Tests Allowing Unequal Variance (Enter Difference)

How Does My TI-84 Do That

CHAPTER 11 CHI-SQUARE AND F DISTRIBUTIONS

Analysis of Data. Organizing Data Files in SPSS. Descriptive Statistics

Mind on Statistics. Chapter 15

Study Guide for the Final Exam

An introduction to using Microsoft Excel for quantitative data analysis

Paper PO06. Randomization in Clinical Trial Studies

Research Methods & Experimental Design

Final Exam Practice Problem Answers

Once saved, if the file was zipped you will need to unzip it. For the files that I will be posting you need to change the preferences.

ANOVA ANOVA. Two-Way ANOVA. One-Way ANOVA. When to use ANOVA ANOVA. Analysis of Variance. Chapter 16. A procedure for comparing more than two groups

Simulating Chi-Square Test Using Excel

Friedman's Two-way Analysis of Variance by Ranks -- Analysis of k-within-group Data with a Quantitative Response Variable

XPost: Excel Workbooks for the Post-estimation Interpretation of Regression Models for Categorical Dependent Variables

Categorical Variables

Recommend Continued CPS Monitoring. 63 (a) 17 (b) 10 (c) (d) 20 (e) 25 (f) 80. Totals/Marginal

Statistical Functions in Excel

Introduction to Analysis of Variance (ANOVA) Limitations of the t-test

Salary. Cumulative Frequency

Applying Customer Attitudinal Segmentation to Improve Marketing Campaigns Wenhong Wang, Deluxe Corporation Mark Antiel, Deluxe Corporation

Calculating, Interpreting, and Reporting Estimates of Effect Size (Magnitude of an Effect or the Strength of a Relationship)

Transcription:

Paper 1544-2014 Implementing Multiple Comparisons on Pearson Chi-square Test for an R C Contingency Table in SAS Man Jin, Forest Research Institute; Binhuan Wang, New York University School of Medicine ABSTRACT This paper illustrates a permutation method for implementing multiple comparisons on Pearson s Chi-square test for an R C contingency table, using the SAS procedure FREQ and a newly developed SAS macro called CHISQ_MC. This method is analogous to the Tukey-type multiple comparison method for one-way analysis of variance. INTRODUCTION A common statistical analysis for count data is to compare the frequency distributions of a multinomial outcome of interest for two or more groups. Assume that there are R groups and the categorical outcome variable has C nominal categories. The data for this analysis can be displayed in a contingency table with R rows and C columns. In SAS, the analysis can be performed using the FREQ procedure with /CHISQ or /Pearson option. The output provides the result of using Pearson s Chi-squared test for testing against the null hypothesis that the frequency distributions are the same across R groups. If the Chi-square test results in a p-value which is smaller than some significance level, say 0.05, then there is strong evidence to state that there is difference between at least two of the groups. In the case when there are only two groups being compared (2 C table), it is easy to conclude that the frequency distributions are different between the two groups. However, in the case where there are three or more groups, there is no follow-up test available in SAS to directly identify the significant differences between pairs of groups. In this paper, we propose a Tukey-type multiple comparison method for R C tables based on permutation. The method is analogous to the Tukey-type multiple comparison method for one-way analysis of variance (ANOVA; Tukey, 1953, by option TUKEY in PROC GLM), and methods for R 2 contingency tables (Elliott and Reisch, 2006, by SAS macro CompProp; Brown and Fears, 1981, by option Fisher in PROC MULTTEST). As pointed out by SAS Usage Note 22265 Performing multiple comparisons or tests on subtables following a significant Pearson chi-square (http://support.sas.com/kb/22/565.html), there are no multiple comparison methods directly available in PROC FREQ, but there are a number of approaches to consider, including the following three approaches. First, correspondence analysis is often used as a way to visualize the non-independence in a table. However, no statistical tests of comparisons are performed. Second, use the WHERE statement to conduct Pearson s Chi-square test on each pair of groups and then use PROC MULTTEST to adjust pairwise p-values. However, there is no SAS macro can implement these steps directly and the methods available are less powerful than the Tukey-type adjustment. Third, multiple comparisons can be done by fitting a Poisson loglinear model with PROC GENMOD, and then constructing custom CONTRAST statements to test pair comparisons. However, the method is model based and is less powerful than the Tukey-type adjustment. METHOD To illustrate the newly proposed method, we consider the same example used in the aforementioned SAS Usage Note 22265. The SAS output of the data Melanoma are printed in Table 1, where Type is the group variable (R=4) and Site the outcome variable (C=3). The significant chi-square test, produced by the CHISQ option in PROC FREQ, indicates lack of independence between TYPE and SITE. First, we develop a macro, which is called PAIRWISE_CHISQ and used as a subroutine in main macro CHISQ_MC, to calculate Chi-square test statistics and p-values for all pairs of groups. Because there are 4 groups, there are 4 3/2=6 pairs and consequently there are 6 comparisons. The output of PAIRWISE_CHISQ is shown in Table 2, where the p-values are unadjusted p-values. The most conservative multiple comparison is Bonferroni correction 1

(e.g., Abdi, 2007), which use significant level 0.05 6=0.3 instead of 0.05 to claim a significance, leading to familywise-error-rate (FWER) α=0.05. Obs Type site Count 1 Hutchinson's Head 22 2 Hutchinson's Trunk 2 3 Hutchinson's Extremities 10 4 Superficial Head 16 5 Superficial Trunk 54 6 Superficial Extremities 115 7 Nodular Head 19 8 Nodular Trunk 33 9 Nodular Extremities 73 10 Indeterminate Head 11 11 Indeterminate Trunk 17 12 Indeterminate Extremities 28 Table 1: Melanoma Data. Overall Chi-square Test Statistic=65.81 (df= 6; p<0.0001). Obs PAIR_COMPARED CHI_STAT DF P_VALUE 1 Hutchinson's vs Indeterminate 19.8430 2 0.00005 2 Hutchinson's vs Nodular 34.8196 2 0.00000 3 Hutchinson's vs Superficial 63.5139 2 0.00000 4 Indeterminate vs Nodular 1.1688 2 0.55743 5 Indeterminate vs Superficial 5.7295 2 0.05700 6 Nodular vs Superficial 3.2167 2 0.20022 Table 2: Output of Macro PAIRWISE_CHISQ. There are 6 Comparisons. To describe the Tukey-type adjustment, denote pairwise Chi-squared test statistics as number of pairs. Under the null hypothesis that there is no difference across groups, where m is the. Motivated by the idea of Tukey s honest significance test (HSD), we consider the distribution of under the null hypothesis. Denote the upper α 100% percentile of the null distribution of Q as and use it as critical value, then the family-wise-error-rate is controlled as. In ANOVA, individual test statistic is T-test statistic and Q follows the studentize range distribution under the null hypothesis. However, in the current case where individual test statistic is Chi-square test statistic and the null distribution of Q is unknown. Here we develop a permutation procedure to approximate the null distribution of Q. After we ungroup the grouped data Melanoma in Table 1, we randomly permute the group assignment. For each random permutation j, we execute macro PAIRWISE_CHISQ, obtain the corresponding pairwise Chi-square test statistics as in Table 2, and calculate the maximum, say, of these Chi-square test statistics. If we repeat such random permutation by many times, say J=1000, we can approximate the p-value for We call such p-values Tukey-type adjusted p-values. by 2

These steps are written into our main macro CHISQ_MC. This macro has 6 arguments: (1) grouped dataset; (2) group variable; (3) outcome variable; (4) frequency variable; (5) number of permutations; and (6) seed for random permutations. Table 3 is obtained by running: %Chisq_MC(data=melanoma, group=type, outcome=site, count=count, num_perm=5000, seed=100); In output such as Table 3, the original pairwise p-values are shown in the 5th column. The Bonferroni adjusted p- values, which are the multiplications of the original p-values by the number of pairs (and truncated by 1), are shown in the 6th column. The Tukey-type adjusted p-values, which are based on random permutation, are shown in the last column. We add the seed argument in the macro just for reproducibility. Both Bonferroni adjusted p-values and Tukey adjusted p-values can be compared with some α, say α=0.05, to control the FWER as α. Obs PAIR_COMPARED CHI_STAT DF P_VALUE P_BONFERRONI P_TUKEY 1 Hutchinson's vs Indeterminate 19.8430 2 0.00005 0.00029 0.0000 2 Hutchinson's vs Nodular 34.8196 2 0.00000 0.00000 0.0000 3 Hutchinson's vs Superficial 63.5139 2 0.00000 0.00000 0.0000 4 Indeterminate vs Nodular 1.1688 2 0.55743 1.00000 0.9442 5 Indeterminate vs Superficial 5.7295 2 0.05700 0.34199 0.2198 6 Nodular vs Superficial 3.2167 2 0.20022 1.00000 0.5906 Table 3: Final Output of Macro Chisq_MC. The Last Column is Based on 5,000 Permutations. MACROS The code for the macro PAIRWISE_CHISQ: %macro Pairwise_Chisq(data=, group=, outcome=, count=); proc freq data=&data noprint; table &group/out=out_group; data groupid; set out_group; GID=_n_; data _null_; set groupid nobs=nobs; call symput('num_group', compress(put(nobs, 11.))); stop; data ChiTest; *store pairwise Chisquare statistics to this; stop; %Do group1=1 %to &num_group-1; *for each pair obtain chisquare stat; %Do group2=&group1+1 %to &num_group; data group_pair; set groupid; if GID EQ &group1 then call symput('name_group1', type); if GID EQ &group2 then call symput('name_group2', type); proc freq data=&data noprint; output out=ctout PCHI; tables &group*&outcome /chisq cellchi2 nopercent; where &group in ("&name_group1", "&name_group2"); weight &count; 3

data CTout1; set CTout; PAIR_COMPARED="&name_group1" " vs " "&name_group2"; data ChiTest; set ChiTest CTout1; %End; %End; data ChiTest; *rename variables for final results; set ChiTest; CHI_STAT=_PCHI_; DF=DF_PCHI; P_VALUE=P_PCHI; keep pair_compared Chi_stat DF p_value; %mend Pairwise_Chisq; The code for macro CHISQ_MC: %macro Chisq_MC (data=, group=, outcome=, count=, num_perm=, seed=); options nosource nonotes; data data_ungrouped; *convert grouped data to ungrouped data; set &data; keep &group &outcome; do i=1 to &count; output; end; *** excute submacro pairwise_chisq on the original data; %pairwise_chisq(data=&data, group=&group, outcome=&outcome, count=&count); data ChiTest_original; *save the pairwise chisquare test on original data; set ChiTest; data ChiMax_all; *store max chisq stats from all permutations stop; *** repeart permutations; %Do perm=1 %to &num_perm; data outcome_var; *start to ermute the group assignment; set data_ungrouped; keep &outcome; data group_var; set data_ungrouped; random_number=ranuni(&seed); call symput('seed', &seed+1); keep &group random_number; proc sort data=group_var; *randomly permute group assignment; by random_number; data data_permuted; *obtain a randomly permuted data; merge group_var outcome_var; proc freq data=data_permuted noprint; *group ungrouped permuted data; table &group*&outcome/out=data_permuted_grouped; *execute the submacro pairwise_chisq on permuted data; %pairwise_chisq(data=data_permuted_grouped, group=&group, outcome=&outcome, count=count); Data ChiTest_perm; *save the pairwise chisq stat on a permuted data; set ChiTest; proc iml; *calculate the max of pairwise chisq on a permuted data; use ChiTest_perm; read all; Chi_max=max(Chi_stat); 4

create ChiMax_perm var {Chi_max &perm}; append; quit; data ChiMax_all; set ChiMax_all ChiMax_perm; %End; *end of all permutations; proc iml; use ChiTest_original; read all; use ChiMax_all; read all; n_pair=nrow(p_value);*number of pairs compared; *initinals for Bonferroni and Tukey ajusted p-values; p_bonferroni=p_value; p_tukey=p_value; *for each pair calculate adjusted p-values; do pair=1 to n_pair; p_bonferroni[pair]=min(p_value[pair]*n_pair,1); p_tukey[pair]=sum(chi_max>chi_stat[pair])/nrow(chi_max); end; create p_adjusted var{p_bonferroni p_tukey}; append; quit; data output_final; *THIS IS FINAL OUTPUT; merge ChiTest_original p_adjusted; proc print data=output_final; *PRINT FINAL OUTPUT; title "RESULTS of MACRO Chisq_MC"; options source notes; %mend Chisq_MC; CONCLUSION The proposed multiple comparison method for an R C contingency table analysis provides a post hoc test when the overall Chi-square test is significant. The proposed macro CHISQ_MC makes the interpretation of results easier and clearer. The proposed method can also be applied to arbitrary comparisons other than pairwise, and to other test statistics other than Chisq-test. REFERRENCES Abdi, H. (2007). Bonferroni and Šidák corrections for multiple comparisons. In N.J. Salkind (ed.). Encyclopedia of Measurement and Statistics. Thousand Oaks, CA: Sage. Brown, C.C. and Fears, T.R. (1981), Exact significance levels for multiple binomial testing with application to carcinogenicity screens, Biometrics 37: 763 774. Elliott, A.C. and Reisch, J.S. (2006). Implementing a multiple comparison test for proportions in a 2xC crosstabulation in SAS. Proceedings of the SAS User's Group International 31, San Francisco, CA, March 26 29, 2006. Paper 204-31. Mantel, N. (1966). Evaluation of survival data and two new rank order statistics arising in its consideration. Cancer Chemotherapy Reports 50: 163 70. Šidák, Z. (1967). Rectangular confidence regions for the means of multivariate normal distributions. Journal of the American Statistical Association 62: 626-633. Tukey, J.W. (1953). The problem of multiple comparisons. Dittoed manuscript of 396 pages, Department of Statistics, Princeton University. CONTACT INFORMATION Your comments and questions are valued and encouraged. Contact the author at: Man Jin Forest Research Institute Harborside Financial Center, PLAZA 5 Jersey City, NJ 07311 Phone: (201) 427-8568 Email: man.jin@frx.com 5

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. indicates USA registration. Other brand and product names are trademarks of their respective companies. 6