1 COMP6053 lecture: The chi-squared test and distribution
2 Assumption of normal error term In all of the techniques we've covered so far, there has been a background assumption that the noise in the data generating process was normally distributed. Taking ANOVA as an example, we assume that each group has a characteristic mean value and that a normal random variate is added to that mean to get the actual data point.
3 Normality assumption in ANOVA
4 A reasonable assumption? Often the normal-error-term assumption is a safe one: the truth is close enough that our inferential logic works well. But sometimes it is obvious that the data generating process could not have been some systematic effect plus a normal error term.
5 Distribution of X across 2 groups Values in group A and B are clearly being generated by two very different -- and nonnormal -- processes.
6 Distribution of X across 2 groups If we were trying to test for a difference between the means of samples A and B, we couldn't really trust the answer we'd get from a 2-sample t-test. We need some statistical tests that don't make the normally-distributed-errors assumption.
7 Non-parametric statistics The term "non-parametric" refers to statistical tools that do not assume a particular distribution, such as the normal, for the error term of the generation process. There are non-parametric equivalents for most of the techniques we've discussed. Why not use them all the time? The mathematics is trickier, and we lose some power by dropping the normality assumption in cases where it would have held.
8 The chi-squared test This is a versatile non-parametric test. We can use it to ask whether a sample fits an expected distribution (of arbitrary shape). Consider throwing a die many times to see whether or not it's fair. We expect a uniform distribution: an equal number of 1s, 2s, 3s, etc.
9 The chi-squared test Most of the time we won't get a perfectly uniform distribution though. How far away from uniform would the results have to be before we would suspect the die was not fair? How might we measure deviation from the expected values?
10 Sample of 300 throws
11 Difference from what was expected? [Table: Obs., Exp., O - E, and (O-E)^2 for each face of the die]
12 Difference from what was expected? We're "expecting" to see each number 50 times. We can tally up how far from the expected value each observation is. We then try squaring those values to deal with negative numbers. We're on our way to some measure of deviation from the expected values.
13 Scaling our measure The particular size of these numbers depends on our sample size: if we'd thrown the die 30 times or 3000 times we'd have different numbers. We scale the "observed minus expected, squared" term by dividing through by the expected value.
14 Scaling our measure [Table: Obs., (O-E)^2, and (O-E)^2 / E for each face of the die]
15 The chi-squared test The final step is to sum the (O-E)^2 / E terms. The total value is the chi-squared statistic. In our example this number is 7.88. But what distribution should we expect this to have? How do we translate this number to a p-value?
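The procedure (observed minus expected, squared, divided by expected, then summed) can be sketched in a few lines of Python. The counts below are hypothetical, invented for illustration; they are not the lecture's 300 throws, so the result differs from the lecture's value.

```python
def chi_squared_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts for faces 1..6 from 300 throws (not the lecture's data).
observed = [43, 55, 48, 57, 46, 51]
expected = [sum(observed) / 6] * 6  # a fair die: 50 per face

statistic = chi_squared_statistic(observed, expected)
print(round(statistic, 2))  # 2.88
```

Dividing by the expected count is what makes the statistic comparable across sample sizes, which is the scaling step described above.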
16 Repeated sampling experiment Let's generate 500 samples of 300 die throws, and calculate the chi-squared statistic each time. That should give us an idea of how often extreme deviations from uniformity (and thus large values on the chi-squared statistic) come up by chance.
17 Repeated sampling experiment
18 Repeated sampling experiment In practice, a p-value of 0.05 (i.e., the topmost 5% of the distribution) equates to a chi-squared statistic of around 10. The exact value is 11.07. For p = 0.01, the critical chi-squared value is 15.09. Our observed value of 7.88 is not unusual.
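A minimal sketch of the repeated sampling experiment, using only the Python standard library. This is not the lecture's own program: the function names, the seed, and the way the cut-off is read from the sorted values are assumptions made for illustration.

```python
import random


def chi_squared_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def simulate_fair_die_statistics(n_samples=500, n_throws=300, seed=0):
    """Chi-squared statistic for each of n_samples runs of n_throws fair throws."""
    rng = random.Random(seed)
    expected = [n_throws / 6] * 6
    results = []
    for _ in range(n_samples):
        counts = [0] * 6
        for _ in range(n_throws):
            counts[rng.randrange(6)] += 1  # one throw of a fair die
        results.append(chi_squared_statistic(counts, expected))
    return results


simulated = sorted(simulate_fair_die_statistics())

# Empirical critical value: roughly 5% of simulated statistics lie above it.
empirical_95th = simulated[int(0.95 * len(simulated))]

# Share of fair-die samples at least as extreme as the observed 7.88.
share_above_observed = sum(s >= 7.88 for s in simulated) / len(simulated)

print(empirical_95th, share_above_observed)
```

With only 500 samples the empirical cut-off is noisy, which is one reason to want the theoretical distribution of the statistic.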
19 How is the chi-squared statistic distributed? Imagine that we were doing the same kind of test not with die-rolling but with coin tossing. If we toss a fair coin 100 times, clearly the expected values for heads and tails are 50 each. In reality we'll get combinations like 48/52, 53/47, 45/55, etc. What's the distribution of those numbers?
20 The binomial distribution The binomial distribution describes the number of heads we expect to get from throwing a coin multiple times. The binomial describes the outcome of multiple Bernoulli trials, i.e., any situation where there's a probability p of one outcome and 1-p for the other. What does it look like?
21 Binomial → normal distribution As N gets larger, the binomial approximates the normal distribution very closely.
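To see how close the approximation is for the coin example, the sketch below compares the exact binomial probabilities for 100 tosses with a normal density that has the same mean and standard deviation. The helper names are illustrative, not taken from the lecture.

```python
import math


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n Bernoulli trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


n, p = 100, 0.5                  # 100 tosses of a fair coin
mean = n * p                     # 50 heads expected
sd = math.sqrt(n * p * (1 - p))  # standard deviation of 5

for heads in (40, 45, 50, 55, 60):
    print(heads,
          round(binomial_pmf(heads, n, p), 4),
          round(normal_pdf(heads, mean, sd), 4))

# Largest absolute difference between the two curves over all outcomes.
largest_gap = max(
    abs(binomial_pmf(k, n, p) - normal_pdf(k, mean, sd)) for k in range(n + 1)
)
```

Re-running with a smaller n (say 10) shows a visibly larger gap, in line with the slide's point that the match improves as N grows.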
22 From binomial to chi-squared So if the number of heads we get in multiple coin tosses is binomially distributed... And the binomial is approximately normal... Then the chi-squared procedure of squaring the differences from the expected value, then dividing by the expected value, will give us what?
23 From binomial to chi-squared Each term in our chi-squared procedure is taking an approximately normal distribution and squaring it. The chi-squared distribution is in fact the sum of K squared-standard-normal deviates (K is the degrees of freedom of the test). Ironic that we've returned to the normal distribution in an effort to avoid assuming it was present.
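The claim that a chi-squared variable is a sum of K squared standard-normal deviates can be checked by simulation. The sketch below assumes K = 5 (the die example) and an arbitrary seed and sample size; the reference value of about 11.07 is the theoretical upper 5% point for 5 degrees of freedom.

```python
import random


def sum_of_squared_normals(k, rng):
    """One draw: the sum of k squared standard-normal deviates."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(k))


rng = random.Random(1)
k = 5  # degrees of freedom, as in the die example
draws = sorted(sum_of_squared_normals(k, rng) for _ in range(20000))

sample_mean = sum(draws) / len(draws)            # theory: the mean equals k
upper_5_percent = draws[int(0.95 * len(draws))]  # theory: about 11.07 for k = 5

print(round(sample_mean, 2), round(upper_5_percent, 2))
```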
24 Degrees of freedom? If we toss a coin 100 times, it's easy to see how many degrees of freedom there are: if there are 53 heads, there must be 47 tails. Thus there's 1 degree of freedom for the chi-squared test here. Similarly with the die: once we know how many 1s, 2s, 3s, 4s, and 5s there are, the number of 6s is determined. So there are 5 degrees of freedom.
25 The chi-squared distribution
26 Another use of chi-squared tests We've seen how to use the test to look at whether a one-dimensional distribution fits its expected values reasonably closely. What about two-dimensional distributions?
27 Recap: when to use which test? If we have a continuous outcome measure, and a binary predictor variable, we use a two-sample t-test. If we have a continuous outcome measure, and one or more categorical predictor variables, we use ANOVA. If we have a continuous outcome measure, and a single continuous predictor variable, we use simple linear regression.
28 Recap: when to use which test? If we have a continuous outcome measure and multiple continuous and categorical predictor variables, we use multiple regression. If we have a binary categorical outcome measure and multiple predictor variables, we use logistic regression.
29 When to use a chi-squared test? If we have a categorical outcome measure, and one (or more) categorical predictor variables, it turns out we can use a chi-squared test. We're asking whether the observed incidence of outcomes for each level of the predictor variable could have happened by chance.
30 Voting example Suppose we're asking whether sex has any relevance to predicting the way people will vote. We ask 100 randomly selected men and 100 randomly selected women which party they voted for at the last election. Note that both variables are categorical.
31 Voting example [Table: rows Men, Women; columns Labour, Tory, Lib Dem] Number of men and women voting for each party shown, and the marginal totals.
32 Voting example: expected values [Table: rows Men, Women; columns Labour, Tory, Lib Dem] Using the marginal totals, we can easily calculate the expected number of counts in each cell if there were no relationship between sex and voting preference.
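As a sketch of how expected counts follow from the marginal totals: under independence, each cell's expected count is its row total times its column total, divided by the grand total. The table below is hypothetical (100 men, 100 women, made-up party counts); it is not the lecture's data.

```python
def expected_counts(table):
    """Expected cell counts under independence:
    row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


# Hypothetical counts (not the lecture's table). Columns: Labour, Tory, Lib Dem.
observed = [
    [55, 30, 15],  # men
    [35, 45, 20],  # women
]

print(expected_counts(observed))
# [[45.0, 37.5, 17.5], [45.0, 37.5, 17.5]]
```

Because both rows have the same total here, the expected counts are identical for men and women, which is exactly what "no relationship" means.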
33 Voting example: chi-squared We now have six places to calculate "observed minus expected". We can use the same logic of the chi-squared test as outlined earlier. Sum of: (O - E)^2 / E.
34 How many degrees of freedom? How many degrees of freedom? You might think 5, because we have 6 observed counts. But there are not 5 numbers here that are free to vary if we are to hit the marginal totals. DF = num columns minus one, times num rows minus one. In this case: 2 x 1 = 2.
35 Voting example The calculation for our voting example gives a chi-squared value that we compare against the critical value for a chi-squared distribution with 2 DF at p = 0.05, which is 5.99. Our test statistic is larger than the critical value, so we reject the null hypothesis of no link between sex and voting (assuming we're happy with the p = 0.05 threshold).
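The whole two-way test can be sketched end to end as follows. The counts are hypothetical (not the lecture's table), and the p-value line relies on a special case: for exactly 2 degrees of freedom the upper-tail probability of the chi-squared distribution is exp(-x/2). Other DF values would need a proper chi-squared CDF.

```python
import math


def chi_squared_independence(table):
    """Return (statistic, degrees of freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (obs - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Hypothetical counts (not the lecture's table). Columns: Labour, Tory, Lib Dem.
observed = [
    [55, 30, 15],  # men
    [35, 45, 20],  # women
]
statistic, dof = chi_squared_independence(observed)

# Valid only because dof == 2: upper-tail probability is exp(-x / 2).
p_value = math.exp(-statistic / 2)
critical_value = -2 * math.log(0.05)  # about 5.99, the p = 0.05 cut-off

print(round(statistic, 2), dof, round(p_value, 4))
```

For these made-up counts the statistic exceeds the critical value, so the null hypothesis of independence would be rejected at the 0.05 level, mirroring the logic of the slide.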
36 How to do this in R In R the easiest way to run a chi-squared test is to first produce a table. For the example: chisq.test(table(sex, voting)) The output includes the test statistic, the degrees of freedom, and the associated p-value.
37 Further non-parametric tests No time to cover them all in this lecture, but there are many tests that do not assume a normally distributed error term in the data generation process. The most useful is the rank-sum test (also known as the Wilcoxon rank-sum test and the Mann-Whitney U test).
38 Further non-parametric tests We would use the rank-sum test to investigate whether these two samples come from distributions with the same mean.
39 Further non-parametric tests It's based on the logic that if distribution A has a higher mean than distribution B, then values from sample A should be in the top half of the joint sample more often than those from sample B. Used in the same situations as a two-sample t-test. In R: wilcox.test(avalues, bvalues)
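The ranking logic behind the rank-sum test can be sketched as follows: pool both samples, rank every value, and add up the ranks belonging to sample A. The sample values are hypothetical, and this sketch stops at the statistic; it does not compute a p-value.

```python
def average_ranks(values):
    """Rank values from 1 upward, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1  # mean of ranks start+1 .. end+1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def rank_sum_u(a, b):
    """Rank sum of sample a in the pooled data, and the matching U statistic."""
    ranks = average_ranks(list(a) + list(b))
    rank_sum_a = sum(ranks[: len(a)])
    u_a = rank_sum_a - len(a) * (len(a) + 1) / 2
    return rank_sum_a, u_a


# Hypothetical samples for illustration only.
avalues = [12.1, 15.3, 9.8, 20.4, 18.0]
bvalues = [7.2, 8.5, 11.0, 6.9, 10.3]

rank_sum_a, u_a = rank_sum_u(avalues, bvalues)
print(rank_sum_a, u_a)  # 38.0 23.0
```

Here U counts the (a, b) pairs in which the value from A is larger: 23 of the 25 possible pairs, so sample A sits mostly in the top half of the pooled data.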
40 Additional material There is no R script for this lecture; you just need the commands chisq.test, table, and wilcox.test. Here is the Python program for generating the histograms and sampling experiment.