Statistics for EES and MEME: Chi-square tests and Fisher's exact test


Statistics for EES and MEME
Chi-square tests and Fisher's exact test

Dirk Metzler

3 June 2013

Contents

1 X^2 goodness-of-fit test
2 X^2 test for homogeneity/independence
3 Fisher's exact test
4 X^2 test for fitted models with free parameters

1 X^2 goodness-of-fit test

Mendel's experiments with peas:

  green (recessive) vs. yellow (dominant)
  round (dominant) vs. wrinkled (recessive)

Expected frequencies when crossing double-hybrids:

             green   yellow
  wrinkled    1/16    3/16
  round       3/16    9/16

Observed in the experiment (n = 556):

             green   yellow
  wrinkled     32     101
  round       108     315

Do the observed frequencies agree with the expected ones?

Relative frequencies:

              green/wrink.  yell./wrink.  green/round  yell./round
  expected       0.0625        0.1875       0.1875       0.5625
  observed       0.0576        0.1942       0.1816       0.5665

Can these deviations be well explained by pure chance?

We measure the deviations by the X^2 statistic:

  X^2 = sum_i (O_i - E_i)^2 / E_i

where E_i is the expected number in class i and O_i is the observed number in class i.

Why scale (O_i - E_i)^2 by dividing by E_i = E(O_i)? Let n be the total sample size and p_i the probability (under the null hypothesis) that an individual contributes to class i. Under the null hypothesis, O_i is binomially distributed:

  Pr(O_i = k) = C(n,k) * p_i^k * (1 - p_i)^(n-k),

so

  E[(O_i - E_i)^2] = Var(O_i) = n * p_i * (1 - p_i).

If p_i is rather small, n * p_i * (1 - p_i) ≈ n * p_i, and

  E[(O_i - E_i)^2 / E_i] = Var(O_i) / E(O_i) = 1 - p_i ≤ 1.

By the way: the binomial distribution with small p and large n can be approximated by the Poisson distribution:

  C(n,k) * p^k * (1 - p)^(n-k) ≈ (λ^k / k!) * e^(-λ)   with λ = n * p.

A random variable Y with possible values 0, 1, 2, ... is Poisson distributed with parameter λ if

  Pr(Y = k) = (λ^k / k!) * e^(-λ).

Then E(Y) = Var(Y) = λ.
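This approximation can be checked numerically. A small Python sketch (not part of the original notes), comparing the binomial pmf with the Poisson pmf at λ = n·p, using Mendel's n = 556 and p = 1/16:

```python
# Sketch (my addition): binomial pmf with large n, small p vs. the
# Poisson pmf with lambda = n*p, for the green/wrinkled class.
from math import comb, exp, factorial

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    return lam**k / factorial(k) * exp(-lam)

n, p = 556, 1/16          # green/wrinkled class from the pea experiment
lam = n * p               # lambda = 34.75

for k in (25, 35, 45):
    b, po = binom_pmf(k, n, p), poisson_pmf(k, lam)
    print(f"k={k}: binomial {b:.5f}  Poisson {po:.5f}")
```

Near the mean the two pmfs agree to within a few percent; the agreement improves as p gets smaller.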

              g/w      y/w      g/r      y/r      sum
  theory      0.0625   0.1875   0.1875   0.5625
  expected    34.75    104.25   104.25   312.75   556
  observed    32       101      108      315      556
  O - E       -2.75    -3.25     3.75     2.25
  (O - E)^2    7.56    10.56    14.06     5.06
  (O-E)^2/E    0.22     0.10     0.13     0.02    0.47

X^2 = 0.47. Is a value of X^2 = 0.47 usual? The distribution of X^2 depends on the degrees of freedom (df). In this case, the sum of the observations must be n = 556: when the first three numbers 32, 101, 108 are given, the last one is determined by 315 = 556 - 32 - 101 - 108. Hence df = 3.

[Figure: density of the chi-square distribution with df = 3]

> pchisq(0.47, df=3)
[1] 0.07456892

p-value = 1 - 0.0746 = 92.5%

> obs <- c(32,101,108,315)
> prob <- c(0.0625,0.1875,0.1875,0.5625)
> chisq.test(obs, p=prob)
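The same computation can be done without R. A Python sketch (my addition), using the known closed form of the chi-square CDF for df = 3, F(x) = erf(sqrt(x/2)) - sqrt(2x/π)·e^(-x/2):

```python
# Sketch (my addition): X^2 statistic for Mendel's data and its p-value,
# using the closed-form chi-square CDF for df = 3.
from math import erf, exp, pi, sqrt

obs = [32, 101, 108, 315]
prob = [0.0625, 0.1875, 0.1875, 0.5625]
n = sum(obs)                                   # 556

expected = [n * q for q in prob]               # 34.75, 104.25, 104.25, 312.75
x2 = sum((o - e) ** 2 / e for o, e in zip(obs, expected))

def chi2_cdf_df3(x):
    # CDF of the chi-square distribution with 3 degrees of freedom
    return erf(sqrt(x / 2)) - sqrt(2 * x / pi) * exp(-x / 2)

p_value = 1 - chi2_cdf_df3(x2)
print(f"X^2 = {x2:.4f}, p-value = {p_value:.4f}")   # 0.4700, 0.9254
```

This reproduces the chisq.test output (X-squared = 0.47, p-value = 0.9254).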

	Chi-squared test for given probabilities

data:  obs
X-squared = 0.47, df = 3, p-value = 0.9254

2 X^2 test for homogeneity/independence

The cowbird is a brood parasite of the oropendola.

[Photo: Montezuma oropendola, http://commons.wikimedia.org/wiki/File:Montezuma_Oropendola.jpg, (c) J. Oldenettel]

References

[Smi68] N.G. Smith (1968) The advantage of being parasitized. Nature, 219(5155):690-694.

Cowbird eggs look very similar to oropendola eggs. Usually, oropendolas rigorously remove all eggs that are not very similar to their own. In some areas, however, cowbird eggs are quite different from oropendola eggs but are tolerated. Why?

A possible explanation: botfly (German: Dasselfliege) larvae often kill juvenile oropendolas, and nests containing cowbird eggs may be better protected against the botfly.

Numbers of nests affected by botflies:

                               no. of cowbird eggs
                                0     1     2
  affected by botflies         16     2     1
  not affected by botflies      2    11    16

Percentages of nests affected by botflies:

                               no. of cowbird eggs
                                0     1     2
  affected by botflies         89%   15%    6%
  not affected by botflies     11%   85%   94%

Apparently, affection by botflies is reduced when the nest contains cowbird eggs. Is this statistically significant?

Null hypothesis: the probability of a nest to be affected by botflies is independent of the presence of cowbird eggs.

Numbers of nests affected by botflies:

                               no. of cowbird eggs
                                0     1     2    sum
  affected by botflies         16     2     1     19
  not affected by botflies      2    11    16     29
  sum                          18    13    17     48

Which numbers of affected nests would we expect under the null hypothesis? The same ratio of 19/48 in each group.

Expected numbers of nests affected by botflies, given the row and column sums:

                               no. of cowbird eggs
                                0      1      2     sum
  affected by botflies         7.13   5.15   6.73    19
  not affected by botflies    10.88   7.85  10.27    29
  sum                         18     13     17       48

  18 · 19/48 = 7.13,   13 · 19/48 = 5.15

All other values are then determined by the sums.

Observed (O):
  affected          16      2      1     19
  not affected       2     11     16     29
                    18     13     17     48

Expected (E):
  affected          7.13   5.15   6.73   19
  not affected     10.88   7.85  10.27   29
                   18     13     17      48

O - E:
  affected          8.87  -3.15  -5.73    0
  not affected     -8.87   3.15   5.73    0
                     0      0      0      0

  X^2 = sum_i (O_i - E_i)^2 / E_i = 29.5544

Given the sums of rows and columns, two values in the table determine the rest, so df = 2 for a contingency table with 2 rows and 3 columns.
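The expected counts (row sum · column sum / total) and the X^2 statistic for this table can be computed directly. A Python sketch (not part of the original notes):

```python
# Sketch (my addition): expected counts under independence and the
# X^2 statistic for the oropendola/botfly table.
obs = [[16, 2, 1],
       [2, 11, 16]]

row_sums = [sum(row) for row in obs]            # 19, 29
col_sums = [sum(col) for col in zip(*obs)]      # 18, 13, 17
n = sum(row_sums)                               # 48

expected = [[r * c / n for c in col_sums] for r in row_sums]
x2 = sum((obs[i][j] - expected[i][j]) ** 2 / expected[i][j]
         for i in range(2) for j in range(3))

df = (len(obs) - 1) * (len(obs[0]) - 1)         # (2-1)*(3-1) = 2
print(f"X^2 = {x2:.4f}, df = {df}")             # X^2 = 29.5544, df = 2
```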

In general, for a table with n rows and m columns: df = (n - 1) · (m - 1).

[Figure: density of the chi-square distribution with df = 2]

> M <- matrix(c(16,2,2,11,1,16),nrow=2)
> M
     [,1] [,2] [,3]
[1,]   16    2    1
[2,]    2   11   16
> chisq.test(M)

	Pearson's Chi-squared test

data:  M
X-squared = 29.5544, df = 2, p-value = 3.823e-07

The p-value is based on the approximation by the χ² distribution. Rule of thumb: the χ² approximation is appropriate if all expected values are ≥ 5. Alternative: approximate the p-value by simulation:

> chisq.test(M, simulate.p.value=TRUE, B=50000)

	Pearson's Chi-squared test with simulated p-value (based on 50000 replicates)

data:  M
X-squared = 29.5544, df = NA, p-value = 2e-05
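The simulation idea can be sketched outside R as well. The following Python permutation version (my addition; it mimics, but is not, the algorithm behind chisq.test's simulate.p.value) shuffles the "affected" labels over the 48 nests, which keeps both margins fixed, and counts how often the simulated X^2 reaches the observed value:

```python
# Sketch (my addition): Monte Carlo p-value for the botfly table by
# permuting the affected/not-affected labels across nests.
import random

def x2_stat(tab):
    rows = [sum(r) for r in tab]
    cols = [sum(c) for c in zip(*tab)]
    n = sum(rows)
    return sum((tab[i][j] - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j] / n)
               for i in range(len(rows)) for j in range(len(cols)))

observed = [[16, 2, 1], [2, 11, 16]]
x2_obs = x2_stat(observed)

group = [0] * 18 + [1] * 13 + [2] * 17          # cowbird-egg class of each nest
labels = [0] * 19 + [1] * 29                    # 0 = affected, 1 = not affected

random.seed(1)
B = 10000
hits = 0
for _ in range(B):
    random.shuffle(labels)
    table = [[0, 0, 0], [0, 0, 0]]
    for g, l in zip(group, labels):
        table[l][g] += 1
    if x2_stat(table) >= x2_obs:
        hits += 1

p_sim = (hits + 1) / (B + 1)                    # add-one rule, as in chisq.test
print(f"simulated p-value ≈ {p_sim:.5f}")
```

With the true p-value around 4e-07, essentially no permutation reaches the observed X^2, so the simulated p-value is bounded below by 1/(B+1).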

3 Fisher's exact test

References

[McK91] J.H. McDonald, M. Kreitman (1991) Adaptive protein evolution at the Adh locus in Drosophila. Nature 351:652-654.

                 synonymous   replacement
  polymorphisms      43            2
  fixed              17            7

> McK <- matrix(c(43,17,2,7), 2,
+               dimnames=list(c("polymorph","fixed"),
+                             c("synon","replace")))
> McK
          synon replace
polymorph    43       2
fixed        17       7
> chisq.test(McK)

	Pearson's Chi-squared test with Yates' continuity correction

data:  McK
X-squared = 6.3955, df = 1, p-value = 0.01144

Warning message:
In chisq.test(McK) : Chi-squared approximation may be incorrect

> chisq.test(McK, simulate.p.value=TRUE, B=100000)

	Pearson's Chi-squared test with simulated p-value (based on 1e+05 replicates)

data:  McK
X-squared = 8.4344, df = NA, p-value = 0.00649

Fisher's exact test: for a 2x2 table

   a  b
   c  d

the null hypothesis is

  (E(a)/E(c)) / (E(b)/E(d)) = 1.

For 2x2 tables, exact p-values can be computed (no approximation, no simulation needed).

> fisher.test(McK)

	Fisher's Exact Test for Count Data

data:  McK
p-value = 0.006653
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
  1.437432 92.388001
sample estimates:
odds ratio 
  8.540913 

In our example:

  43    2 |  45
  17    7 |  24
  ------------
  60    9 |  69

Given the row sums and column sums of a table

   a   b | K
   c   d | M
  ----------
   U   V | N

and assuming independence, the probability of a is hypergeometric:

  Pr(a) = C(K,a) · C(M,c) / C(N,U),

which equals Pr(b) = C(K,b) · C(M,d) / C(N,V). (Here C(n,k) denotes the binomial coefficient "n choose k".)

p-value: Pr(b = 0) + Pr(b = 1) + Pr(b = 2)
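The hypergeometric probabilities are easy to compute directly. A Python sketch (not part of the original notes) for the McDonald-Kreitman table, with K = 45, M = 24, V = 9:

```python
# Sketch (my addition): hypergeometric probabilities Pr(b) and the
# one- and two-sided Fisher p-values for b = 2.
from math import comb

K, M, V = 45, 24, 9          # row sums and replacement-column sum
N = K + M                    # 69

def pr_b(b):
    # Pr(b) = C(K,b) * C(M, V-b) / C(N, V)
    return comb(K, b) * comb(M, V - b) / comb(N, V)

probs = [pr_b(b) for b in range(V + 1)]
one_sided = sum(probs[:3])                        # Pr(0)+Pr(1)+Pr(2)
two_sided = sum(q for q in probs if q <= probs[2])
print(f"one-sided p = {one_sided:.8f}, two-sided p = {two_sided:.8f}")
```

For b = 2 the two-sided sum happens to pick up the same three terms as the one-sided one, which is why fisher.test reports 0.006653 here.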

   a   b |  45
   c   d |  24
  ------------
  60   9 |  69

   b    Pr(b)
   0   0.000023
   1   0.00058
   2   0.00604
   3   0.0337
   4   0.1117
   5   0.2291
   6   0.2909
   7   0.2210
   8   0.0913
   9   0.0156

One-sided Fisher test:
  for b = 2: p-value = Pr(0) + Pr(1) + Pr(2) = 0.00665313
  for b = 3: p-value = Pr(0) + Pr(1) + Pr(2) + Pr(3) = 0.04035434

Two-sided Fisher test: sum up all probabilities Pr(k) that are smaller than or equal to Pr(b).
  for b = 2: p-value = Pr(0) + Pr(1) + Pr(2) = 0.00665313
  for b = 3: p-value = Pr(0) + Pr(1) + Pr(2) + Pr(3) + Pr(9) = 0.05599102

4 X^2 test for fitted models with free parameters

Consider a population in Hardy-Weinberg equilibrium and a gene locus with two alleles A and B with frequencies p and 1 - p. Genotype frequencies:

   AA      AB          BB
   p^2     2p(1-p)     (1-p)^2

Example: M/N blood types; sample: white Americans.

Observed:

    MM     MN     NN
   1787   3037   1305

Estimated allele frequency p of M:

  (2 · 1787 + 3037) / (2 · (1787 + 3037 + 1305)) = 0.5393

Expected:

    MM        MN         NN
    p^2       2p(1-p)    (1-p)^2
    0.291     0.497      0.212
   1782.7    3045.5     1300.7

[Figure: the simplex of possible observations, with corners labelled MM, MN, NN]

All possible observations (O_MM, O_MN, O_NN) lie on a triangle (simplex) between (n,0,0), (0,n,0) and (0,0,n).

The points representing the expected values (E_MM, E_MN, E_NN) depend on one parameter p between 0 and 1 and thus form a curve in the simplex. Under the null hypothesis, one of these values must be the true one.

[Figure: the observed point and its projection onto the curve of expected values]

The observed (O_MM, O_MN, O_NN) will deviate from the expected. We do not know the true expectation values, so we estimate (E_MM, E_MN, E_NN) by taking the closest point on the curve of possible values, i.e. we hit the curve at a right angle. Thus, deviations between our observations (O_MM, O_MN, O_NN) and our (E_MM, E_MN, E_NN) can only be in one dimension: perpendicular to the curve. Hence

  df = k - 1 - m

where k is the number of categories (here k = 3 genotypes) and m is the number of fitted model parameters (here m = 1 parameter p). In the blood type example: df = 3 - 1 - 1 = 1.

> p <- (2*1787 + 3037)/(2*(1787 + 3037 + 1305))
> probs <- c(p^2, 2*p*(1-p), (1-p)^2)
> X <- chisq.test(c(1787,3037,1305), p=probs)$statistic[[1]]
> p.value <- pchisq(X, df=1, lower.tail=FALSE)
> X
[1] 0.04827274
> p.value
[1] 0.8260966
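The R computation can be reproduced in Python (my sketch, not part of the original notes; for df = 1 the chi-square survival function has the closed form erfc(sqrt(x/2))):

```python
# Sketch (my addition): Hardy-Weinberg goodness-of-fit test with one
# fitted parameter, so df = k - 1 - m = 3 - 1 - 1 = 1.
from math import erfc, sqrt

obs = [1787, 3037, 1305]                 # MM, MN, NN
n = sum(obs)                             # 6129

p = (2 * obs[0] + obs[1]) / (2 * n)      # estimated frequency of M: 0.5393
probs = [p**2, 2 * p * (1 - p), (1 - p) ** 2]
expected = [n * q for q in probs]

x2 = sum((o - e) ** 2 / e for o, e in zip(obs, expected))
df = 3 - 1 - 1
p_value = erfc(sqrt(x2 / 2))             # = pchisq(x2, df=1, lower.tail=FALSE)
print(f"X^2 = {x2:.5f}, p-value = {p_value:.4f}")   # 0.04827, 0.8261
```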