BMS 617 Statistical Techniques for the Biomedical Sciences
Lecture 11: Chi-Squared and Fisher's Exact Tests

Chi-Squared and Fisher's Exact Tests
- This lecture presents two similarly structured tests: the chi-squared test and Fisher's exact test
- Fisher's exact test is a computationally intensive test
- The chi-squared test provides a good approximation in many cases
- Both tests are for purely categorical data, i.e. they work with the count of values in given categories
- There are two distinct uses for these tests:
  - Testing whether data match an expected distribution
  - Comparing proportions among different groups

Observed vs Expected Distributions
- The chi-squared test can be used to test whether a discrete distribution of results follows a predicted or expected distribution
- Example (from Motulsky): a study from 2007 investigated heart disease among firefighters
  - Hypothesized that the risk of death from heart disease was related to the duty the firefighter was performing at the time
  - The null hypothesis is that the risk of death from heart disease is independent of the duty performed

Kales et al. study

| Duty              | Number of heart attacks observed | Proportion of time spent in duty |
|-------------------|----------------------------------|----------------------------------|
| Fire suppression  | 144                              | 2.0%                             |
| Alarm response    | 138                              | 16.0%                            |
| Physical training | 56                               | 8.0%                             |
| Other duties      | 111                              | 74.0%                            |
| Total             | 449                              | 100.0%                           |

- Under the null hypothesis, the number of heart attack deaths occurring during each duty would be proportional to the time spent in that duty
- So, for example, the number of deaths expected during fire suppression would be 2.0% of 449 = 9.0
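A minimal sketch of this expected-count calculation in Python (the use of numpy and the variable names are illustrative, not part of the lecture; the counts and proportions are the Kales et al. figures quoted above):

```python
import numpy as np

# Observed heart-attack deaths by duty (Kales et al., as quoted above)
duties = ["Fire suppression", "Alarm response", "Physical training", "Other duties"]
observed = np.array([144, 138, 56, 111])

# Proportion of time spent on each duty: the null-hypothesis distribution
time_proportions = np.array([0.02, 0.16, 0.08, 0.74])

# Expected counts under the null: total deaths allocated according to time spent
expected = observed.sum() * time_proportions

for duty, obs, exp in zip(duties, observed.tolist(), expected.tolist()):
    print(f"{duty:<18} observed = {obs:3d}, expected = {exp:6.1f}")
# Fire suppression: expected ≈ 9.0 deaths, against 144 observed
```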
Observed and Expected Values

| Duty              | Number of heart attacks observed | Expected number |
|-------------------|----------------------------------|-----------------|
| Fire suppression  | 144                              | 9.0             |
| Alarm response    | 138                              | 71.8            |
| Physical training | 56                               | 35.9            |
| Other duties      | 111                              | 332.3           |
| Total             | 449                              | 449.0           |

- The number of deaths from heart attack while on active duty, and particularly while actively working on fire suppression, is far higher than the expected number

The Chi-Squared Goodness-of-Fit Test
- The chi-squared goodness-of-fit test computes a p-value for the following question:
  - If the data were distributed as defined in the null hypothesis, what is the probability of seeing a discrepancy between the observed and expected values this large solely by random sample selection?
- In our example, the distribution under the null hypothesis is simply the one in which deaths from heart disease are distributed in proportion to the time spent on each duty
- The chi-squared statistic is defined as follows:
  - For each category:
    - Subtract the expected value from the observed value and square the difference
    - Divide the square by the expected value
  - Sum these up. This is the chi-squared (χ²) value:
    χ² = Σ (observed − expected)² / expected

The Chi-Squared Distribution
- Under the null hypothesis, the chi-squared value follows a known distribution that depends on the number of degrees of freedom
- The number of degrees of freedom in a goodness-of-fit test is the number of categories minus one
- In the firefighter example there are four categories, so three degrees of freedom
- The p-value is obtained by comparing the chi-squared value, at the appropriate number of degrees of freedom, to a table of χ² values, or by using statistical software
- For our example, χ² = 2245, d.f. = 3, and p < 10⁻⁶
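A short sketch of the same calculation in code, assuming scipy is available (the scipy call and variable names are illustrative, not from the lecture); it should reproduce χ² ≈ 2245 and a vanishingly small p-value:

```python
import numpy as np
from scipy import stats

# Observed deaths by duty and expected counts under the null hypothesis
observed = np.array([144, 138, 56, 111])
expected = observed.sum() * np.array([0.02, 0.16, 0.08, 0.74])

# Chi-squared statistic by hand: sum over categories of (O - E)^2 / E
chi2_manual = np.sum((observed - expected) ** 2 / expected)
print(f"chi-squared = {chi2_manual:.0f}")  # ≈ 2245

# scipy's goodness-of-fit test computes the same statistic and looks up
# the p-value with k - 1 = 3 degrees of freedom
chi2, p = stats.chisquare(f_obs=observed, f_exp=expected)
print(f"chi2 = {chi2:.0f}, p = {p:.3g}")  # p is far smaller than 1e-6
```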
Chi-Squared Goodness-of-Fit Test: Cautions
When using the chi-squared goodness-of-fit test:
- Always use actual count data for the observed numbers
  - Do not use percentages or normalized values
- The observed values should always be integers
- If the expected number is less than 5 in any category (or less than 10 if there are only two categories), the results may be invalid
  - For two categories, a binomial test may be used instead
  - Or the sample size can be increased to achieve the required expected values
- Do not confuse this test (the goodness-of-fit test) with the chi-squared test for comparing proportions (the test of independence)

Proportion Comparison Studies
- Many studies, particularly clinical studies, answer a question of the type "Does exposure to a risk factor (or a specific treatment, etc.) change the rate of disease?"
- Note that in these studies both the dependent variable (disease status) and the independent variable (treated vs untreated, etc.) are categorical

Types of Study
Some jargon:
- Incidence: rate of new cases of disease
- Prevalence: proportion of a sample which has the disease
- Cross-sectional study: a sample is chosen without control over how many are affected by the disease or how many were exposed to the risk factor. The subjects are divided by exposure to the risk factor and disease prevalence is compared.
- Prospective or longitudinal study: two groups are selected, one exposed to a risk factor (or treatment), one not. They are followed over the natural timeline of the disease, and disease incidence is compared.
- Experimental study: a single sample is chosen and randomly divided into two groups. One group is treated (or exposed to a risk factor), one is not. Incidence is compared between the two groups.
- Case-control or retrospective study: two groups are selected, one with the disease, one without. The number exposed to the risk factor (or treated, etc.) is compared between the two groups.

Contingency Tables
- Data from all these types of study may be summarized in a contingency table (see the code sketch below the table)
- Rows in the table represent exposure to the risk factor or treatment status
- Columns in the table represent disease status
- Cells in the table are counts of the number of subjects in that category

|                         | Disease | No disease | Total         |
|-------------------------|---------|------------|---------------|
| Exposed/Treated         | A       | B          | A + B         |
| Not exposed/Not treated | C       | D          | C + D         |
| Total                   | A + C   | B + D      | A + B + C + D |
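As an aside not in the lecture, a contingency table like this is typically built by cross-tabulating two categorical variables; the per-subject records below are invented purely for illustration:

```python
import pandas as pd

# Hypothetical per-subject records (invented for illustration only):
# exposure status and disease status for ten subjects
subjects = pd.DataFrame({
    "exposed": ["yes", "yes", "no", "no", "yes", "no", "no", "yes", "no", "no"],
    "disease": ["yes", "no", "no", "no", "yes", "yes", "no", "no", "no", "no"],
})

# Cross-tabulate exposure (rows) against disease status (columns),
# adding the row and column totals as margins
table = pd.crosstab(subjects["exposed"], subjects["disease"], margins=True)
print(table)
```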
Example Experimental Study
- Example study we saw previously: CABG vs PTCA in coronary artery disease patients
- Sample of 1829 patients with CAD: 914 randomly assigned to bypass (CABG), 915 to angioplasty (PTCA)
- Focus on five-year survival rates

|       | Survived 5 years | Did not survive 5 years | Total |
|-------|------------------|-------------------------|-------|
| CABG  | 542              | 372                     | 914   |
| PTCA  | 537              | 378                     | 915   |
| Total | 1079             | 750                     | 1829  |

- 40.7% of those receiving CABG died within five years, and 41.3% of those receiving PTCA died within five years
- Clearly little difference

Diabetes and CABG/PTCA
- The study also looked at survival rates for CABG vs PTCA among patients treated for diabetes
- Note this is still an experimental study, as the treatments were assigned and the outcome was measured subsequently
  - Even though the diabetes diagnosis was retrieved retrospectively
- Five-year survival among diabetic patients:

|       | Survived 5 years | Did not survive 5 years | Total |
|-------|------------------|-------------------------|-------|
| CABG  | 93               | 87                      | 180   |
| PTCA  | 69               | 104                     | 173   |
| Total | 162              | 191                     | 353   |

- Mortality rates were 48.3% for CABG-treated patients and 60.1% for PTCA-treated patients

Diabetes and CABG/PTCA: Data Analysis
- The aim is to know the extent to which these data generalize to the wider population of CAD patients with diabetes
- One way is to compute confidence intervals for these proportions (see the sketch below)
  - We saw how to do this in lecture 2
  - The 95% CI for the mortality rate is 41% to 56% under CABG and 53% to 67% under PTCA
- The CIs overlap, but this does not mean the difference is not statistically significant
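A hedged sketch of those interval estimates, using a normal-approximation (Wald) interval for a proportion; the helper function name is made up here, and this assumes the Wald interval is the method taught in lecture 2:

```python
import math
from scipy import stats

def proportion_ci(events, n, confidence=0.95):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    p = events / n
    z = stats.norm.ppf(1 - (1 - confidence) / 2)  # ≈ 1.96 for a 95% interval
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

# Five-year mortality among diabetic patients: (deaths, group size)
print("CABG mortality 95% CI:", proportion_ci(87, 180))   # ≈ (0.41, 0.56)
print("PTCA mortality 95% CI:", proportion_ci(104, 173))  # ≈ (0.53, 0.67)
```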
Attributable Risk and NNT
- The difference between the two proportions is called the attributable risk
  - The amount of risk which can be attributed to the treatment or to exposure to the risk factor
- For our example the attributable risk is 60.1% − 48.3% = 11.8%
  - This means that for diabetic patients, choosing PTCA over CABG is associated with an additional 11.8 percentage points of five-year mortality risk
- The reciprocal of the attributable risk is called the number needed to treat (NNT)
- For this example, NNT = 1/0.118 = 8.5
  - This means that if we give CAD patients with diabetes CABG instead of PTCA, then for every 8.5 patients treated, one will survive five years who would not have survived under PTCA

Relative Risk
- The relative risk is the ratio of the risks
- In this example it is 48.3%/60.1% = 0.80
  - This means that CAD patients with diabetes treated with CABG have 80% of the mortality risk of PTCA-treated patients
- The confidence interval of the relative risk can also be computed
  - For this example the 95% confidence interval is 66% to 98%
- Be careful with the percentages:
  - For attributable risk, the percentages represent percentages of subjects
  - For relative risk, the proportion 0.80 (80%) represents a relative probability: it is the proportion of risk assumed by one group relative to the other group

Attributable Risk or Relative Risk?
- Whether attributable risk (or NNT) or relative risk is more useful depends on the context
- When making a choice between treatments, the relative risk is often more intuitive
  - The risk under CABG is 80% of the risk under PTCA
- But on a population level, the relative risk can be misleading
  - Imagine a vaccine which reduces the occurrence of a disease by 60%
  - The relative risk of disease with the vaccine is 40% of that without the vaccine
  - The utility of the vaccine depends on the prevalence of the disease in the general population
    - If the disease is very rare, say a prevalence of 1 in 1 million, then administering the vaccine will save 6 in 10 million people
    - If the disease is common, say a prevalence of 1%, then the vaccine will save 6 in 1000 people

Fisher's Exact Test
- The data analysis can be supplemented with a p-value
  - Remember the p-value cannot be interpreted without a null hypothesis
- The null hypothesis is that the categories corresponding to the rows of the contingency table are independent of the categories corresponding to the columns
  - In this example, the null hypothesis is that the mortality rate does not depend on which treatment was used
- The p-value is the probability of seeing mortality rates this different under the assumption of the null hypothesis
- The best way to compute a p-value for contingency table data is with Fisher's exact test
  - For very large sample sizes (in the 100,000s and above), this test is computationally unwieldy, and a chi-squared test can be used as an approximation
- For the example data, p = 0.0325 by Fisher's exact test (a combined code sketch of these calculations follows below)
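A sketch, assuming scipy is installed, pulling the risk measures and both tests together for the diabetic subgroup; the variable names are illustrative, and the Fisher p-value should agree with the 0.0325 quoted above:

```python
from scipy import stats

# Diabetic subgroup: rows are treatments, columns are (survived, died) at 5 years
table = [[93, 87],    # CABG
         [69, 104]]   # PTCA

risk_cabg = 87 / 180          # five-year mortality under CABG ≈ 0.483
risk_ptca = 104 / 173         # five-year mortality under PTCA ≈ 0.601

attributable_risk = risk_ptca - risk_cabg       # ≈ 0.118
nnt = 1 / attributable_risk                     # ≈ 8.5
relative_risk = risk_cabg / risk_ptca           # ≈ 0.80
print(f"AR = {attributable_risk:.3f}, NNT = {nnt:.1f}, RR = {relative_risk:.2f}")

# Fisher's exact test: null hypothesis is that outcome is independent of treatment
odds_ratio, p_fisher = stats.fisher_exact(table)
print(f"Fisher's exact test: p = {p_fisher:.4f}")   # ≈ 0.0325

# The chi-squared test of independence is the large-sample approximation
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)
print(f"Chi-squared test: chi2 = {chi2:.2f}, d.f. = {dof}, p = {p_chi2:.4f}")
```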
Summary
- Chi-squared and Fisher's exact tests can be used in two scenarios:
  - Testing whether categorical data match an expected distribution
  - Testing for independence of two categorical variables, i.e. testing whether proportions are equivalent among two or more sets of counted data
- The chi-squared goodness-of-fit test is used to test whether categorical data match an expected distribution
  - It is an approximate test
  - Good when the expected values are at least 5 in each category (at least 10 if there are only two categories)
  - Must use actual counts, not proportions or normalized values
- Contingency tables tabulate subject counts in different categories
  - Usually the rows of the table are used for the independent (predictor) variable
  - Columns for the dependent (outcome) variable

Summary (continued)
- Attributable risk is the difference in proportions between treatment categories
- The number needed to treat (NNT) is the reciprocal of the attributable risk
- Relative risk is the ratio of proportions between treatment categories
- Fisher's exact test can be used to compute a p-value for the null hypothesis that there is no relationship between the dependent and independent variables (i.e. the variables are independent of each other)
  - Computationally prohibitive for very large data sets
  - The chi-squared test for independence can be used instead (though data sets large enough to require this are rare in practice)