Paper SP Confidence Intervals for the Binomial Proportion with Zero Frequency



Similar documents
Confidence Intervals

STATISTICA Formula Guide: Logistic Regression. Table of Contents

Sensitivity Analysis in Multiple Imputation for Missing Data

ESTIMATING COMPLETION RATES FROM SMALL SAMPLES USING BINOMIAL CONFIDENCE INTERVALS: COMPARISONS AND RECOMMENDATIONS

SAS Software to Fit the Generalized Linear Model

Binary Diagnostic Tests Two Independent Samples

Two Correlated Proportions (McNemar Test)

Tests for Two Proportions

Using Stata for One Sample Tests

Basic Statistical and Modeling Procedures Using SAS

Beginning Tutorials. PROC FREQ: It s More Than Counts Richard Severino, The Queen s Medical Center, Honolulu, HI OVERVIEW.

Lecture 19: Conditional Logistic Regression

Counting the Ways to Count in SAS. Imelda C. Go, South Carolina Department of Education, Columbia, SC

Paper PO03. A Case of Online Data Processing and Statistical Analysis via SAS/IntrNet. Sijian Zhang University of Alabama at Birmingham

Guido s Guide to PROC FREQ A Tutorial for Beginners Using the SAS System Joseph J. Guido, University of Rochester Medical Center, Rochester, NY

Remove Voided Claims for Insurance Data Qiling Shi

SAS Certificate Applied Statistics and SAS Programming

Confidence Interval Calculation for Binomial Proportions

Week 4: Standard Error and Confidence Intervals

Constructing a Table of Survey Data with Percent and Confidence Intervals in every Direction

Tests for One Proportion

Effective Use of SQL in SAS Programming

Methods for Interaction Detection in Predictive Modeling Using SAS Doug Thompson, PhD, Blue Cross Blue Shield of IL, NM, OK & TX, Chicago, IL

5.1 Identifying the Target Parameter

HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION

Logit Models for Binary Data

Multinomial and Ordinal Logistic Regression

Two-Sample T-Tests Assuming Equal Variance (Enter Means)

Paper D Ranking Predictors in Logistic Regression. Doug Thompson, Assurant Health, Milwaukee, WI

Oh No, a Zero Row: 5 Ways to Summarize Absolutely Nothing

Chapter 6 INTERVALS Statement. Chapter Table of Contents

Foundations & Fundamentals. A PROC SQL Primer. Matt Taylor, Carolina Analytical Consulting, LLC, Charlotte, NC

SP10 From GLM to GLIMMIX-Which Model to Choose? Patricia B. Cerrito, University of Louisville, Louisville, KY

SAS/STAT. 9.2 User s Guide. Introduction to. Nonparametric Analysis. (Book Excerpt) SAS Documentation

ABSTRACT INTRODUCTION

How To Check For Differences In The One Way Anova

An macro: Exploring metadata EG and user credentials in Linux to automate notifications Jason Baucom, Ateb Inc.

PROC LOGISTIC: Traps for the unwary Peter L. Flom, Independent statistical consultant, New York, NY

NCSS Statistical Software

Instant Interactive SAS Log Window Analyzer

Confidence intervals for two sample binomial distribution

Two-Sample T-Tests Allowing Unequal Variance (Enter Difference)

Chapter 8 Hypothesis Testing Chapter 8 Hypothesis Testing 8-1 Overview 8-2 Basics of Hypothesis Testing

ing Automated Notification of Errors in a Batch SAS Program Julie Kilburn, City of Hope, Duarte, CA Rebecca Ottesen, City of Hope, Duarte, CA

VI. Introduction to Logistic Regression

Confidence Intervals for One Standard Deviation Using Standard Deviation

Paper PO-008. exceeds 5, where and (Stokes, Davis and Koch, 2000).

NCSS Statistical Software. One-Sample T-Test

Descriptive Statistics and Measurement Scales

A Method for Cleaning Clinical Trial Analysis Data Sets

Chapter 5 Analysis of variance SPSS Analysis of variance

How To Test For Significance On A Data Set

11. Analysis of Case-control Studies Logistic Regression

Biostatistics: DESCRIPTIVE STATISTICS: 2, VARIABILITY

Introduction to Quantitative Methods

Part 3. Comparing Groups. Chapter 7 Comparing Paired Groups 189. Chapter 8 Comparing Two Independent Groups 217

SPSS Explore procedure

Survey Analysis: Options for Missing Data

USING LOGISTIC REGRESSION TO PREDICT CUSTOMER RETENTION. Andrew H. Karp Sierra Information Services, Inc. San Francisco, California USA

Introduction to Algorithms March 10, 2004 Massachusetts Institute of Technology Professors Erik Demaine and Shafi Goldwasser Quiz 1.

The correlation coefficient

Confidence Intervals for Spearman s Rank Correlation

Overview Classes Logistic regression (5) 19-3 Building and applying logistic regression (6) 26-3 Generalizations of logistic regression (7)

Estimation of σ 2, the variance of ɛ

Multiple Imputation for Missing Data: A Cautionary Tale

Using Repeated Measures Techniques To Analyze Cluster-correlated Survey Responses

3 The spreadsheet execution model and its consequences

STA-201-TE. 5. Measures of relationship: correlation (5%) Correlation coefficient; Pearson r; correlation and causation; proportion of common variance

Point Biserial Correlation Tests

Predicting Customer Churn in the Telecommunications Industry An Application of Survival Analysis Modeling Using SAS

Labels, Labels, and More Labels Stephanie R. Thompson, Rochester Institute of Technology, Rochester, NY

Post Processing Macro in Clinical Data Reporting Niraj J. Pandya

Confidence Intervals for Cp

Multinomial and ordinal logistic regression using PROC LOGISTIC Peter L. Flom National Development and Research Institutes, Inc

Health Care and Life Sciences

Data Mining: An Overview of Methods and Technologies for Increasing Profits in Direct Marketing. C. Olivia Rud, VP, Fleet Bank

Opgaven Onderzoeksmethoden, Onderdeel Statistiek

Study Guide for the Final Exam

Paper PO06. Randomization in Clinical Trial Studies

Automating SAS Macros: Run SAS Code when the Data is Available and a Target Date Reached.

Dongfeng Li. Autumn 2010

Non-Inferiority Tests for Two Proportions

PharmaSUG Paper MS05

SAS Rule-Based Codebook Generation for Exploratory Data Analysis Ross Bettinger, Senior Analytical Consultant, Seattle, WA

A Hybrid Modeling Platform to meet Basel II Requirements in Banking Jeffery Morrision, SunTrust Bank, Inc.

Ordinal Regression. Chapter

ABSTRACT INTRODUCTION HOW WE RANDOMIZE PATIENTS IN SOME OF THE CLINICAL TRIALS. allpt.txt

Why? A central concept in Computer Science. Algorithms are ubiquitous.

From the help desk: Swamy s random-coefficients model

Statistics I for QBIC. Contents and Objectives. Chapters 1 7. Revised: August 2013

Generalized Linear Models

Analysis of categorical data: Course quiz instructions for SPSS

Poisson Models for Count Data

Categorical Data Analysis

Additional sources Compilation of sources:

Imputing Missing Data using SAS

Changing the Shape of Your Data: PROC TRANSPOSE vs. Arrays

UNTIED WORST-RANK SCORE ANALYSIS

STRUTS: Statistical Rules of Thumb. Seattle, WA

Transcription:

Paper SP10-009 Confidence Intervals for the Binomial Proportion with Zero Frequency Xiaomin He, ICON Clinical Research, North Wales, PA Shwu-Jen Wu, Biostatistical Consultant, Austin, TX ABSTRACT Estimating confidence interval for the binomial proportion is a challenge to statisticians and programmers when the proportion has zero frequency. The most widely used method based on Wald asymptotic statistics gives a degenerate interval, that is, (0, 0), in this case. This paper reviews the statistical methods used for estimating confidence intervals which are available in SAS version 9.. In practice, when calculating the frequency and intervals, SAS by default does not present the missing categorical level; this level has zero frequency but is no less important than other categorical levels. This paper also builds a macro to share tips on how to create confidence intervals with zero frequency. KEY WORDS Binomial Proportion, Confidence Intervals, Zero Frequency, Wilson (Score) Confidence Interval, SAS Macro. INTRODUCTION AND BACKGROUND In a clinical trial, assume one observation has several levels and the proportion of observations in the first variable level is your primary interest. A binary response is a typical example, which has 0 (non-response) and 1 (response). Define n 1 as the frequency of the first (or designated) level and n as the total frequency of the one-way table. The binomial proportion is computed as p ˆ = n1 / n. Denote by z α / the 100( 1 α / ) th percentile of the standard normal distribution. Several methods to estimate the confidence interval for the binomial proportion (we focus on two-sided intervals here) are as follows: Wald asymptotic confidence interval: ( p z ( 1 )/ n, + z ( 1 )/ n ) ˆ α / α /. Agresti-Coull confidence interval: ( p z p( p) n p z p( 1 α / 1 /, + α / p) / n ), where n 1 = n1 + zα / /, n = n + z α, and p = n / n 1. / Jeffreys confidence interval: β α, n + 1/, n n + 1/, β 1 α /, n + 1/, n n 1/, ( ( / 1 1 ) ( 1 1 + )) Where β ( α,b,c) is α th percentile of the beta distribution with shape parameters b and c. The lower bound is set to 0 when n 1 = 0, and the upper bound is set to 1 when n 1 = n. Exact (Clopper-Pearson) confidence interval: 1 1 n n 1 + 1 n n 1 1 + ( ( )), 1 + ( ) ( ( ) ( )), n1f 1 α /,n1, n n1 + 1 n1 + 1 F α /, n1 + 1, n n1 Where F ( α, b, c) is α th percentile of the F distribution with b and c degrees of freedom. The lower bound is set to 0 when n 1 = 0, and the upper bound is set to 1 when n 1 = n. Wilson (score) confidence interval: + zα / / + zα / ( α / )/( 4n) /( 1+ zα / / n), ( ) ( ) ( ) + z / 4n / 1+ z / n ( n) zα / ( 1 ) + z /( n) + z ( 1 ) α / 1 α / α /.

The literature (see Brown, Cai & DasGupta (001), Fleiss, Levin & Paik (003) for details) has compared these confidence intervals (and more). The coverage probability of the interval covering the true binomial proportion is the one of criteria of comparison. Generally, two aberrations of two-sided confidence intervals estimators for the binomial proportion were considered: overshoot and degeneracy. For zero frequency, n 1 = 0, the simplest and most widely used Wald asymptotic gives a degenerate interval, that is, (0, 0), which has the poorest coverage probability among all of these methods. Note that a continuity correction, 1 /( n), was suggested to adjust for the difference between the normal approximation and the binomial distribution, which improves the Wald asymptotic interval in some respects but is still very inadequate. The Agresti-Coull confidence interval is another adjusted Wald asymptotic interval that adds successes and failures ( z α / is close to for α = 0. 05 ). Jefferys confidence interval is an equal-tailed interval based on noninformative Jeffreys prior to a binomial proportion. Exact (Clopper-Pearson) confidence interval is constructed by inverting the equal-tailed test based on the binomial distribution. Due to the discrete property of binomial distribution, the exact (Clopper-Pearson) confidence interval is not exactly ( 1 α ) but is at least ( 1 α ), so it is conservative. Wilson (score) confidence interval is constructed by inverting the normal test that uses the null ˆ α /. proportion in the variance (the score test). The bounds are the roots of p p = z p( 1 p) / n Except the Wald asymptotic method, all other four methods are recommended for calculating confidence intervals for the binomial proportions. In the case of binomial proportion with zero frequency, Agresti-Coull always gives the longest confidence interval, while Jeffreys gives the shortest. From the anti-conservative and coverage consideration standpoint, we would recommend using the Wilson (score) confidence interval because it has been shown to have better performance than the exact (Clopper-Pearson) confidence interval. Prior to SAS 9.1.3, PROC FREQ procedure only provides the Wald asymptotic and exact (Clopper-Pearson) confidence intervals for the binomial proportion. In SAS 9., when you specify the BINOMIAL (ALL) option in the TABLES statement, then all of five confidence interval mentioned in this paper will be presented. You can also specify one or more types of binomial confidence intervals instead of ALL. The choices are AC (Agresti-Coull), EXACT (Clopper-Pearson), J (Jeffreys), W (Wilson score) and WALD (Wald asymptotic). SAS APPLICATION When using PROC FREQ to calculate the frequency and estimate confidence intervals, SAS by default doesn t include missing observations in the analysis. In this sense, the observations with zero frequency will be treated as missing and not presented in the output. However, as far as we know, observations with zero frequency are as important as other observations. A comprehensive summary including all categorical levels should be created. We will use the following sample data set to illustrate, and how to avoid, the problem when deriving the confidence interval for the binomial proportion with zero frequency. /* Group A has all binary levels of observations; Group B has the zero frequency */ /* at response=1; and Group C has zero frequency at response=0. */ data temp; do i = 1 to 0; group='a'; response=0; output; end; do i = 1 to 80; group='a'; response=1; output; end; do i = 1 to 100; group='b'; response=0; output; end; do i = 1 to 100; group='c'; response=1; output; end; ods select BinomialProp; proc freq data=temp; by group; tables response / binomial; Below is the output based on the PROC FREQ statement above. group=a The FREQ Procedure Binomial Proportion for response = 0 ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ

Proportion 0.000 ASE 0.0400 95% Lower Conf Limit 0.116 95% Upper Conf Limit 0.784 Exact Conf Limits 95% Lower Conf Limit 0.167 95% Upper Conf Limit 0.918 group=b The FREQ Procedure Binomial Proportion for response = 0 ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ Proportion 1.0000 ASE 0.0000 95% Lower Conf Limit 1.0000 Exact Conf Limits 95% Lower Conf Limit 0.9638 group=c The FREQ Procedure Binomial Proportion for response = 1 ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ Proportion 1.0000 ASE 0.0000 95% Lower Conf Limit 1.0000 Exact Conf Limits 95% Lower Conf Limit 0.9638 Note that unlike Groups A and B, the binomial proportion for Group C was calculated for response=1 because there is 0 observation for response=0. SAS by default reports the binomial proportion in the first non-missing variable level; or you can specify the variable level to be calculated, but it must be non-missing. Therefore, the exact (Clopper- Pearson) confidence intervals for Groups B and C are identical : (0.9638, 1.000). Also, SAS did not issue notes/warnings in the log, because there is not any algorithm error in programming. But this is not what we intended to derive. In this paper, we share a tip about how to correctly calculate and present the confidence interval for the binomial proportion with zero frequency by using SAS. The algorithm is as follows: Suppose the discrete variables in the raw dataset include response, group and other stratification; Define a weight variable from the raw dataset. If the raw dataset doesn t have a weight variable, then generate one and assign all weights to be 1; Create a dummy dataset which contains all categorical levels of discrete variables; Merge the dummy dataset with the raw dataset by discrete variables; For the observation of discrete variables with zero frequency, the value of weight variable is missing. Hardcode it as 0, such as the COALESCE function in PROC SQL; When using PROC FREQ, add the ZEROS option in the WEIGHT statement; If necessary, specify the response variable level for which to compute the proportion. Note that response could be not limited to a binary variable. It may have there or more categorical levels, which 3

makes our method be more generally and robustly used in practice. A SAS macro is shown here to illustrate the strategy: /*----------------------------------------------------------------------------------*/ /*- Macro Name: CI_BP /*- Description: Calculate confidence intervals for the binomial proportion of /*- response variables. /*- SAS Version: 9. /*- Data Input: indata = The input data. (Required) /*- Data Output: None or user defined /*- Parameters: group = Grouping or block variable. (Required) /*- response = Response or event variable for which to compute the /*- proportion. (Required) /*- weight = Frequency or weight variable. If it is not specified, /*- the default value is 1. (Optional) /*- by = Stratification variable. (Optional) /*- level = Formatted value of the response level in which the /*- confidence interval is calculated. By default the confidence /*- interval is for the first response level. (Optional) /*- method = Method used for calculating confidence intervals. /*- The default value is ALL, which requests all types of /*- confidence intervals introduced in the paper. The /*- alternatives are AC(Agresti-Coull), EXACT (Clopper-Pearson), /*- J(Jeffreys), WALD(Wald), and W (Wilson score) confidence /*- intervals. (Optional) /*----------------------------------------------------------------------------------*/ %macro CI_BP(indata=, group=, response=, weight=1, by=, level=, method=all); proc sql noprint; create table indata_ as select %if %length(&by) > 0 %then %do; &by as by, %end; &group as group, &response as response, &weight as weight from &indata; %if %length(&by) > 0 %then %do; create table by_ as select distinct(&by) as by from &indata; %end; create table group_ as select distinct(&group) as group from &indata; create table response_ as select distinct(&response) as response from &indata; create table dummy_ as select %if %length(&by) > 0 %then %do; a.by, %end; b.group, c.response from %if %length(&by) > 0 %then %do; by_ as a, %end; group_ as b, response_ as c; create table outdata_ as select %if %length(&by) > 0 %then %do; a.by, %end; a.group, a.response, coalesce(b.weight, 0) as weight from dummy_ as a left join indata_ as b on %if %length(&by) > 0 %then %do; a.by=b.by and %end; a.group=b.group and a.response=b.response; quit; proc freq data=outdata_; by %if %length(&by) > 0 %then %do; by %end; group; tables response / binomial (&method %if &level= %then level=1; %else level="&level";); weight weight /zeros; %mend CI_BP; %CI_BP(indata=temp, group=group, response=response); Note that the macro was written on SAS 9.. In SAS 9.0 and 9.1, the options after BINOMIAL in the TABLES statement should be removed; in that case only the Wald Asymptotic and exact (Clopper-Pearson) confidence intervals were calculated. For SAS version prior to 9.0, the algorithm doesn t work because ZEROS option in 4

WEIGHT statement is unavailable. Therefore, algorithm of using 1 minus the confidence interval of total compensative categorical levels is a more reasonable choice. We didn t build the macro using for early SAS versions due to limited use. You may specify any formatted value of the response level in which the confidence interval is calculated; and the macro can also be applied to one more stratification variable other than group variable. A more general example is as follows: data temp; do i=1 to 0; b=1; g='a'; r=0; output; end; do i=1 to 40; b=1; g='a'; r=1; output; end; do i=1 to 40; b=1; g='a'; r=; output; end; do i=1 to 100; b=1; g='b'; r=0; output; end; do i=1 to 30; b=1; g='c'; r=1; output; end; do i=1 to 70; b=1; g='c'; r=; output; end; do i=1 to 80; b=; g='a'; r=0; output; end; do i=1 to 0; b=; g='a'; r=1; output; end; do i=1 to 90; b=; g='b'; r=0; output; end; do i=1 to 10; b=; g='b'; r=; output; end; do i=1 to 50; b=; g='c'; r=1; output; end; do i=1 to 50; b=; g='c'; r=; output; end; * Calculate the exact confidence interval for the proportion of strongest response; %CI_BP(indata=temp, group=g, response=r, by=b, level=, method=exact); CONCLUSION This paper gives an overview on the statistical methods used for estimating confidence intervals, which are all available in SAS 9.. We recommend using the Wilson (score) confidence interval from the consideration of anticonservative and coverage when the binomial proportions have zero frequency. We also notice that confidence intervals generated from the PROC FREQ procedure by default have unexpected results for the binomial proportions with zero frequency. By creating a dummy dataset with all categorical levels and hard-coding the missing (or of zero frequency) observations, we build a macro to share the tips on how to create confidence intervals under this situation. REFERENCES SAS/STAT 9. User s Guide, The FREQ Procedure. Agresti,A. and Coull,B.A.(1998), Approximate is Better than Exact for Interval Estimation of Binomial Proportions, The American Statistician,5,119 16. Clopper,C.J.,and Pearson,E.S.(1934), The Use of Confidence or Fiducial Limits Illustrated in the Case of the Binomial, Biometrika 6, 404 413. Collett,D.(1991), Modelling Binary Data, London: Chapman & Hall. Wilson, E.B.(197), Probable Inference, the Law of Succession, and Statistical Inference, Journal of the American Statistical Association,, 09 1. Brown, L.D., Cai, T.T., & DasGupta, A. (001). Confidence intervals for a binomial proportion (with discussion), Statistical Science, 16, 101 133. Fleiss, J.L., Levin, B., and Paik, M.C. (003), Statistical Methods for Rates and Proportions, Third Edition, New York: John Wiley & Sons. ACKNOWLEDGMENTS We are grateful to Dr. Liz Morgenthien and Hima Bhatia from ICON Clinical Research and Robert Schechter from Octagon Research Solutions for their valuable suggestions and reviewing of this paper. CONTACT INFORMATION 5

Your comments and questions are valued and encouraged. Contact the author at: Name: Xiaomin He, Ph.D. Enterprise: ICON Clinical Research Address: 1700 Pennbrook Parkway City, State ZIP: North Wales, PA 19454 Work Phone: 15-616-6406 Fax: 15-616-8685 E-mail: Xiaomin.he@iconplc.com Web: www.iconplc.com SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. indicates USA registration. Other brand and product names are trademarks of their respective companies. 6