Paper PO-008. exceeds 5, where and (Stokes, Davis and Koch, 2000).



Similar documents
Beginning Tutorials. PROC FREQ: It s More Than Counts Richard Severino, The Queen s Medical Center, Honolulu, HI OVERVIEW.

STATISTICA Formula Guide: Logistic Regression. Table of Contents

Guido s Guide to PROC FREQ A Tutorial for Beginners Using the SAS System Joseph J. Guido, University of Rochester Medical Center, Rochester, NY

Lecture 19: Conditional Logistic Regression

Two Correlated Proportions (McNemar Test)

Permuted-block randomization with varying block sizes using SAS Proc Plan Lei Li, RTI International, RTP, North Carolina

Tests for Two Proportions

SAS/STAT. 9.2 User s Guide. Introduction to. Nonparametric Analysis. (Book Excerpt) SAS Documentation

SAS Software to Fit the Generalized Linear Model

Using An Ordered Logistic Regression Model with SAS Vartanian: SW 541

Chapter 12 Nonparametric Tests. Chapter Table of Contents

Basic Statistical and Modeling Procedures Using SAS

Paper D Ranking Predictors in Logistic Regression. Doug Thompson, Assurant Health, Milwaukee, WI

Use of the Chi-Square Statistic. Marie Diener-West, PhD Johns Hopkins University

VI. Introduction to Logistic Regression

ABSTRACT INTRODUCTION

Lecture 25. December 19, Department of Biostatistics Johns Hopkins Bloomberg School of Public Health Johns Hopkins University.

EXST SAS Lab Lab #4: Data input and dataset modifications

Data Mining: An Overview of Methods and Technologies for Increasing Profits in Direct Marketing. C. Olivia Rud, VP, Fleet Bank

Test Positive True Positive False Positive. Test Negative False Negative True Negative. Figure 5-1: 2 x 2 Contingency Table

Multivariate Logistic Regression

Simulating Chi-Square Test Using Excel

Modeling Lifetime Value in the Insurance Industry

How to set the main menu of STATA to default factory settings standards

Binary Diagnostic Tests Two Independent Samples

Weight of Evidence Module

Survey Analysis: Options for Missing Data

Multinomial and ordinal logistic regression using PROC LOGISTIC Peter L. Flom National Development and Research Institutes, Inc

Survey, Statistics and Psychometrics Core Research Facility University of Nebraska-Lincoln. Log-Rank Test for More Than Two Groups

Model Fitting in PROC GENMOD Jean G. Orelien, Analytical Sciences, Inc.

Generalized Linear Models

Topic 8. Chi Square Tests

Calculating P-Values. Parkland College. Isela Guerra Parkland College. Recommended Citation

4. CLASSICAL METHODS OF ANALYSIS OF GROUPED DATA

Creating Dynamic Reports Using Data Exchange to Excel

SUGI 29 Statistics and Data Analysis

How to Reduce the Disk Space Required by a SAS Data Set

Point Biserial Correlation Tests

This can dilute the significance of a departure from the null hypothesis. We can focus the test on departures of a particular form.

ln(p/(1-p)) = α +β*age35plus, where p is the probability or odds of drinking

SAMPLE SIZE TABLES FOR LOGISTIC REGRESSION

LOGISTIC REGRESSION ANALYSIS

Counting the Ways to Count in SAS. Imelda C. Go, South Carolina Department of Education, Columbia, SC

Analysis of categorical data: Course quiz instructions for SPSS

Quick Start to Data Analysis with SAS Table of Contents. Chapter 1 Introduction 1. Chapter 2 SAS Programming Concepts 7

Lecture 14: GLM Estimation and Logistic Regression

MEASURES OF LOCATION AND SPREAD

PREDICTION OF INDIVIDUAL CELL FREQUENCIES IN THE COMBINED 2 2 TABLE UNDER NO CONFOUNDING IN STRATIFIED CASE-CONTROL STUDIES

Final Exam Practice Problem Answers

Simulate PRELOADFMT Option in PROC FREQ Ajay Gupta, PPD, Morrisville, NC

Logistic Regression.

Chapter 19 The Chi-Square Test

Logistic (RLOGIST) Example #3

Paper Merges and Joins Timothy J Harrington, Trilogy Consulting Corporation

Tests for One Proportion

One problem > Multiple solutions; various ways of removing duplicates from dataset using SAS Jaya Dhillon, Louisiana State University

Paper Evaluation of methods to determine optimal cutpoints for predicting mortgage default Abstract Introduction

Labels, Labels, and More Labels Stephanie R. Thompson, Rochester Institute of Technology, Rochester, NY

Using the Magical Keyword "INTO:" in PROC SQL

Paper PO06. Randomization in Clinical Trial Studies

Abbas S. Tavakoli, DrPH, MPH, ME 1 ; Nikki R. Wooten, PhD, LISW-CP 2,3, Jordan Brittingham, MSPH 4

Data Presentation. Paper Using SAS Macros to Create Automated Excel Reports Containing Tables, Charts and Graphs

Common Univariate and Bivariate Applications of the Chi-square Distribution

Bivariate Statistics Session 2: Measuring Associations Chi-Square Test

MISSING DATA TECHNIQUES WITH SAS. IDRE Statistical Consulting Group

Overview Classes Logistic regression (5) 19-3 Building and applying logistic regression (6) 26-3 Generalizations of logistic regression (7)

Appendix G STATISTICAL METHODS INFECTIOUS METHODS STATISTICAL ROADMAP. Prepared in Support of: CDC/NCEH Cross Sectional Assessment Study.

Chi-square test Fisher s Exact test

Introduction to Proc SQL Steven First, Systems Seminar Consultants, Madison, WI

Constructing a Table of Survey Data with Percent and Confidence Intervals in every Direction

Chi Squared and Fisher's Exact Tests. Observed vs Expected Distributions

Nonparametric Statistics

New SAS Procedures for Analysis of Sample Survey Data

LAB 4 INSTRUCTIONS CONFIDENCE INTERVALS AND HYPOTHESIS TESTING

Chapter 23. Two Categorical Variables: The Chi-Square Test

Katie Minten Ronk, Steve First, David Beam Systems Seminar Consultants, Inc., Madison, WI

SPSS: Getting Started. For Windows

INTERPRETING THE ONE-WAY ANALYSIS OF VARIANCE (ANOVA)

Performing Queries Using PROC SQL (1)

PROC LOGISTIC: Traps for the unwary Peter L. Flom, Independent statistical consultant, New York, NY

ASSIGNMENT 4 PREDICTIVE MODELING AND GAINS CHARTS

Variables Control Charts

Logit Models for Binary Data

An Introduction to Statistical Tests for the SAS Programmer Sara Beck, Fred Hutchinson Cancer Research Center, Seattle, WA

Methods for Interaction Detection in Predictive Modeling Using SAS Doug Thompson, PhD, Blue Cross Blue Shield of IL, NM, OK & TX, Chicago, IL

Life Table Analysis using Weighted Survey Data

Paper RIV15 SAS Macros to Produce Publication-ready Tables from SAS Survey Procedures

Part 3. Comparing Groups. Chapter 7 Comparing Paired Groups 189. Chapter 8 Comparing Two Independent Groups 217

Chapter 3 RANDOM VARIATE GENERATION

Using Repeated Measures Techniques To Analyze Cluster-correlated Survey Responses

The entire SAS code for the %CHK_MISSING macro is in the Appendix. The full macro specification is listed as follows: %chk_missing(indsn=, outdsn= );

Leveraging Ensemble Models in SAS Enterprise Miner

From The Little SAS Book, Fifth Edition. Full book available for purchase here.

Instructions for Analyzing Data from CAHPS Surveys:

The Chi-Square Test. STAT E-50 Introduction to Statistics

3.4 Statistical inference for 2 populations based on two samples

Using Macros to Automate SAS Processing Kari Richardson, SAS Institute, Cary, NC Eric Rossland, SAS Institute, Dallas, TX

11. Analysis of Case-control Studies Logistic Regression

Calculating Effect-Sizes

Developing Risk Adjustment Techniques Using the System for Assessing Health Care Quality in the

Transcription:

Paper PO-008 A Macro for Computing the Mantel-Fleisś Criterion Brandon Welch MS, Rho, Inc., Chapel Hill, NC Jane Eslinger BS, Rho, Inc., Chapel Hill, NC Rob Woolson MS, Rho, Inc., Chapel Hill, NC ABSTRACT When using chi-square statistics to evaluate the relationship between two dichotomous variables, the test statistics approach the chi-square distribution as the sample size increases. In practice, the validity of the chi-square approximation depends on the expected cell counts exceeding five. For examining three-way tables using the Cochran-Mantel-Haenszel (CMH) statistic, the same rule does not apply. Mantel and Fleisś (1980) proposed an extension to the minimum of five rule for 2 x 2 x H tables. We propose a user-friendly macro to supplement an analysis that uses the CMH one-degree-of-freedom test. The macro outputs all calculated components of the criterion to a SAS data set and accompanying log. INTRODUCTION It is common in many areas of statistical practice to examine the relationship between a binary response variable and a binary predictor variable in the presence of a potentially confounding categorical factor. Data are typically presented in a series of partial 2 x 2 tables segmented by a stratification factor. Cochran (1954) first proposed a test of conditional independence in 2 x 2 x H tables, treating the rows in each 2 x 2 table as independent binomials. Mantel and Haenszel (1959) proposed a similar, but more generalized test based on the hypergeometric distribution (Agresti, 1996). The CMH method tests whether the conditional odds ratio between the response and predictor variables equals one in each partial 2 x 2 table. The approach conditions on the row and column totals in each partial 2 x 2 table and, without loss of generality, utilizes the cell in the first row and column (n h11) (Agresti, 1996). The test statistic (where is the expected value of n h11 and is its variance under the null hypothesis) has a large-sample chi-square distribution with one degree of freedom (Stokes, Davis and Koch, 2000). In general, the CMH statistic is valid with sparse data but requires a large sample size. Mantel and Fleisś showed that it is appropriate to compute the CMH p-value from a chi-square distribution with one degree of freedom if exceeds 5, where and (Stokes, Davis and Koch, 2000). When performing an analysis using the CMH test, PROC FREQ produces asymptotic p-values for the general association (CMH) test statistic. We propose supplementing this analysis with the %MantelFleiss macro presented in this article. TABLE STRUCTURE Visually, the table structure follows where X=2, Y=2 and h runs from 1 to q. 1

X X For h=1 Y n 111 n 112 n 11+ n 121 n 122 n 12+ n 1+1 n 1+2 n 1 For h=2 Y n 211 n 212 n 21+ n 221 n 222 n 22+ n 2+1 n 2+2 n 2... X For q th table Y n q11 n q12 n q1+ n q21 n q22 n q2+ n q+1 n q+2 n q MACRO MACRO CALL: %MantelFleiss(In,Strat,Var1,Var2,Count) The macro consists of five parameters. Parameters &IN, &STRAT, &VAR1 and &VAR2 are required, whereas &COUNT is optional. The &IN parameter denotes the input data set and requires unique observations. &STRAT, &VAR1 and &VAR2 represent the stratification, row, and column variables respectively. &COUNT issues the WEIGHT statement in PROC FREQ. If the &COUNT parameter is omitted, all observations in the input data set are assumed to have a weight of one. For illustration, we divide the criterion into four components. We compute each component using the following syntax: 1) Summing the expected values in each n h11 cell for h=1,2,...,q. SYNTAX: %*get expected counts and totals in each partial table; PROC FREQ data = &In; tables &Strat*&Var1*&Var2 / expected outexpect out = _expect; %if %nrbquote(&count.) ne %then weight &Count.;; ODS output CrossTabFreqs = _counts; %*get sum of expected cell counts in cell n11 in each partial table (assign to macro var SUMEXP); DATA _null_; set _expect end = eof; by &Strat; retain sumexp 0; if first.&strat then do; sumexp = sumexp + expected; if eof then call symput('sumexp',put(sumexp,8.4)); 2

2) Summing the maximum difference between the row one total and the column two total for each of the h strata. The calculations span two data steps, but we combine the syntax here for illustration. See APPENDIX for unified method. SYNTAX: retain n11_ n1_1 n1_2; if first.&strat._ then do; n11_ = 0; n1_1 = 0; n1_2 = 0; %*get each total; if &Var1._ = 1 then n11_ = frequency; %*for row one in the hth strata; if &Var2._ = 1 then n1_1 = frequency; %*for column one in the hth strata; if &Var2._ = 2 then n1_2 = frequency; %*for column two in the hth strata; %*define (nh11)l for final computation; max_ = max(0,n11_-n1_2); %***********************************************; %*get sum of maximum for each of the hth strata*; %***********************************************; retain mf_suml; if _n_ = 1 then do; mf_suml = 0; mf_suml = mf_suml + max_; 3) Summing the minimum of the column one total and the row one total across all strata. Except for using a different formula (for maximum), the abbreviated code below is nearly identical to step two. See APPENDIX for unified method. SYNTAX: %*define (nh11)u for final computation; min_ = min(n1_1,n11_); retain mf_sumu; if _n_ = 1 then do; mf_sumu = 0; mf_sumu = mf_sumu + min_; 4) Computing the criterion by combining all steps. Compare summed expected values (step one) to maximum (step two) and minimum (step three), then find the minimum. If minimum is greater than five, then the criterion is satisfied. 3

SYNTAX: if eof then do; mflendpt = &sumexp - mf_suml; mfrendpt = mf_sumu - &sumexp; mf_r = min(mflendpt,mfrendpt); put "Sum of lower bound (sum[(nh11)l]) = " mf_suml; put "Sum of upper bound (sum[(nh11)u]) = " mf_sumu; put "Left end-point: sum[mh11] - sum[(nh11)l] = " mflendpt; put "Right endpoint: sum[(nh11)u] - sum[mh11] = " mfrendpt; put "Mantel-Fleiss R = min[" mflendpt "," mfrendpt "] = " mf_r; if mf_r>5 then put "Mantel-Fleiss criterion satisfied (since R > 5)"; else put "Mantel-Fleiss criterion NOT satisfied (since R <= 5)"; output; EXAMPLE1 The following is a hypothetical scenario comparing treatments A and B within three sites (Arkansas, Indiana and Illinois). In this example, the criterion is not satisfied. Therefore the chi-square approximation is not valid: DATA SETUP: PROC FORMAT; value trtf 1 = 'TreatmentA' 2 = 'TreatmentB' ; value sitef 1 = 'Arkansas' 2 = 'Indiana' 3 = 'Illinois' ; value respf 1 = 'Present' 2 = 'Absent' ; DATA test; do site = 1 to 3; do treat = 1 to 2; do resp = 1 to 2; input counts @@; output; label site = "Site" treat = "Treatment" resp = ""; format treat trtf. site sitef. resp respf.; cards; 12 5 7 3 1 2 5 2 0 9 5 6 ; 4

TABLE STRUCTURE site=arkansas (h=1) A 12 5 17 B 7 3 10 Total 19 8 27 site=indiana (h=2) A 1 2 3 B 5 2 7 Total 6 4 10 site=illinois (h=3) A 0 9 9 B 5 6 11 Total 5 15 20 MACRO CALL: %MantelFleiss(test,site,treat,resp,counts) 1) Summing the expected values in each n h11 cell for h=1,2,...,q. Sum of expected values in cells nh11: sum(mh11) = 20.5130 2) Summing the maximum difference between the row one total and the column two total. Sum of lower bound (sum[(nh11)l]) = 9 3) Summing the minimum of the column one total and the row one total. Sum of upper bound (sum[(nh11)u]) = 25 4) Computing the criterion by combining steps two and three. Left end-point: sum[mh11] - sum[(nh11)l] = 11.513 Right endpoint: sum[(nh11)u] - sum[mh11] = 4.487 Mantel-Fleiss R = min[11.513,4.487 ] = 4.487 Mantel-Fleiss criterion NOT satisfied (since R <= 5) EXAMPLE2 The following example illustrates a case in which the criterion is satisfied: TABLE STRUCTURE site=arkansas A 12 5 17 B 7 3 10 Total 19 8 27 5

site=indiana A 1 2 3 B 5 2 7 Total 6 4 10 site=illinois A 5 9 14 B 5 6 11 Total 10 15 25 1) Summing the expected values in each n h11 cell for h=1,2,...,q. Sum of expected values in cells nh11: sum(mh11) = 19.3630 2) Summing the maximum difference between the row one total and the column two total. Sum of lower bound (sum[(nh11)l]) = 9 3) Summing the minimum of the column one total and the row one total. Sum of upper bound (sum[(nh11)u]) = 30 4) Computing the criterion by combining steps two and three. Left end-point: sum[mh11] - sum[(nh11)l] = 10.363 Right endpoint: sum[(nh11)u] - sum[mh11] = 10.637 Mantel-Fleiss R = min[10.363,10.637 ] = 10.363 Mantel-Fleiss criterion satisfied (since R > 5) CONCLUSION The macro presented in this paper determines the validity of the chi-square approximation for the CMH one-degreeof-freedom test. If the Mantel-Fleisś criterion is satisfied, the chi-square approximation is valid. However, if the criterion is not satisfied, other techniques are necessary. Techniques for producing exact p-values are illustrated by Agresti (1996). Both PROC LOGISTIC and PROC MULTTEST provide tools for obtaining exact CMH p-values. The LOGISTIC approach uses the STRATA and EXACT statements. For MULTTEST, the STRATA statement and PERM option will produce similar results. See http://support.sas.com/kb/32/711.html for detailed examples. REFERENCES Agresti, A. 1996. Categorical Data Analysis. Hoboken, NJ. John Wiley & Sons, Inc. Cochran, W. G. 1954. Some Methods for Strengthening the Common tests. Biometrics. 10: 417-451. Dmitrienko, A., Molenberghs, G., Chuang-Stein, C., and Offen, W. 2005. Analysis of clinical trials using SAS : a practical guide. Cary, NC: SAS Institute Inc. Kahn, H. and Sempos, C. 1989. Statistical Methods in Epidemiology. New York, NY. Oxford University Press. Mantel, N. and Joseph L Fleisś. 1980. Minimum Expected Cell Size Requirements For The Mantel-Haenszel One- Degree-Of-Freedom Chi-Square Test And A Related Rapid Procedure. American Journal of Epidemiology. 112: 129-134. Mantel, N. and W Haenszel. 1959. Statistical Aspects of the Analysis of Data from Retrospective Studies of Disease. Journal of The National Cancer Institute. 22(4): 719-748. 6

SAS Institute Inc, 2009. SAS usage note. Usage Note 32711: How can I get an exact CMH test? http://support.sas.com/kb/32/711.html. Stokes, Maura E., Davis, Charles S., Koch, Gary G. 2000. Categorical Data Analysis Using The SAS System. Cary, NC: SAS Institute Inc. CONTACT INFORMATION Brandon Welch Rho, Inc. Statistical Programming 6330 Quadrangle Dr., Ste. 500 Chapel Hill, NC 27517 Email: Brandon_Welch@rhoworld.com Phone: 919-595-6339 APPENDIX %macro MantelFleiss(In,Strat,Var1,Var2,Count); %*-----------------------------------------------------------; %* &In - input data set ; %* &Strat - stratification variable (e.g., site, race, etc.); %* &Var1 - row variable ; %* &Var2 - column variable ; %* &Count - optional weight variable ; %*-----------------------------------------------------------; %*reset output data sets; PROC DATASETS nodetails nolist; delete _counts _expect _mantel:; QUIT; %*determine number of strata levels; PROC SQL noprint; select count(distinct &strat) into: stratflag from &In; QUIT; %put **Number of strata levels &stratflag; %*get expected counts and totals in each partial table; PROC FREQ data = &In; tables &Strat*&Var1*&Var2 / expected outexpect out = _expect; %if %nrbquote(&count.) ne %then weight &Count.;; ODS output CrossTabFreqs = _counts; %*get sum of expected cell counts in cell n11 in each partial table (assign to macro var SUMEXP); DATA _null_; set _expect end = eof; by &Strat; retain sumexp 0; if first.&strat then do; sumexp = sumexp + expected; if eof then call symput('sumexp',put(sumexp,8.4)); %put Sum of expected values in cells nh11: sum(mh11) = %cmpres(&sumexp); PROC SORT data = _counts; by &Strat &Var1 &Var2; 7

%*Subset to the needed totals output by PROC FREQ number the &Strat, &Var1, and &Var2 for use below 0 - totals 1 - first &Strat/&Var1/&Var2 2 - second &Strat/&Var1/&Var2...; DATA _mantelf1; %*only include marginal totals from ODS; set _counts(where = (_type_ not in ('100' '111'))); by &Strat &Var1 &Var2; retain &Strat._ 0 &Var1._ &Var2._; %*reset for change in strata; if first.&strat then do; &Var1._ = 0; &Var2._ = 0; %*get numeric version of strata variable; if first.&strat then &Strat._ = &Strat._ + 1; %*define zero as total for any &Var1/&Var2; if missing(&var1) then &Var1._ = 0; else &Var1._ + 1; if missing(&var2) then &Var2._ = 0; else &Var2._ + 1; keep &Strat._ &Var1._ &Var2._ &Strat. &Var1. &Var2. frequency; DATA _mantelf2; set _mantelf1; by &Strat._; retain n11_ n1_1 n1_2; if first.&strat._ then do; n11_ = 0; n1_1 = 0; n1_2 = 0; %*get each total; if &Var1._ = 1 then n11_ = frequency; %*for row one in the hth strata; if &Var2._ = 1 then n1_1 = frequency; %*for column one in the hth strata; if &Var2._ = 2 then n1_2 = frequency; %*for column two in the hth strata; %*define (nh11)l and (nh11)u for final computation; max_ = max(0,n11_-n1_2); min_ = min(n1_1,n11_); if last.&strat._; %*sum acoss min and max to get lower/upper bounds of criterion; DATA _mantelf3(keep = mf_suml mf_sumu mflendpt mfrendpt mf_r); set _mantelf2 end = eof; retain mf_suml mf_sumu; if _n_ = 1 then do; mf_suml = 0; mf_sumu = 0; 8

mf_suml = mf_suml + max_; mf_sumu = mf_sumu + min_; if eof then do; mflendpt = &sumexp-mf_suml; mfrendpt = mf_sumu-&sumexp; mf_r = min(mflendpt,mfrendpt); put "Sum of lower bound (sum[(nh11)l]) = " mf_suml; put "Sum of upper bound (sum[(nh11)u]) = " mf_sumu; put "Left end-point: sum[mh11] - sum[(nh11)l] = " mflendpt; put "Right endpoint: sum[(nh11)u] - sum[mh11] = " mfrendpt; put "Mantel-Fleiss R = min[" mflendpt "," mfrendpt "] = " mf_r; if mf_r>5 then put "Mantel-Fleiss criterion satisfied (since R > 5)"; else put "Mantel-Fleiss criterion NOT satisfied (since R <= 5)"; output; label mf_suml = "Summation of Maximum (sum[(nh11)l])" mf_sumu = "Summation of Minumum (sum[(nh11)u])" mflendpt = "Left End-Point: sum[mh11] - sum[(nh11)l]" mfrendpt = "Right End-Point: sum[(nh11)u] - sum[mh11]" mf_r = "R Value: min[sum[mh11] - sum[(nh11)l],sum[(nh11)u] - sum[mh11]]"; %m 9