Outline. Dispersion Bush lupine survival Quasi-Binomial family
- Bruce Ward
- 9 years ago
1 Outline
1. Three-way interactions
2. Overdispersion in logistic regression
   - Dispersion
   - Bush lupine survival
   - Quasi-Binomial family
3. Simulation for inference
   - Why simulations
   - Testing model fit: simulating the deviance's distribution
   - Testing overdispersion: simulating the dispersion's distribution
2 Interaction between 2 categorical predictors
Two-way interaction between $X_1$ ($k_1$ levels, think gender) and $X_2$ ($k_2$ levels, think food type): present when the differences in mean response (Y or log odds) between $X_1$ levels depend on the level of $X_2$.
- With no interaction, there are constraints on group means. $X_1$ requires $k_1 - 1$ extra coefficients (other than the intercept), and $X_2$ requires $k_2 - 1$ extra coefficients.
- In R, X1*X2 is a shortcut for X1 + X2 + X1:X2.
- The interaction X1:X2 requires $(k_1 - 1)(k_2 - 1)$ df, i.e. extra coefficients.
- With interaction, the total number of coefficients (including intercept) equals the total number of groups, $k_1 k_2$: no constraints on group means.
3 Degrees of freedom
For the model Y ~ X1*X2:

source           df
intercept        1
X1               k_1 - 1
X2               k_2 - 1
X1:X2            (k_1 - 1)(k_2 - 1)
total # coefs    k_1 k_2
residual         n - k_1 k_2

where n is the number of observations (rows) in the data.
4 Interactions between 3 categorical predictors
Two-way interactions between $X_1$, $X_2$ and $X_3$ ($k_3$ levels, think preterm):
- X1:X2 is the 2-way interaction between $X_1$ and $X_2$ when $X_3 = 0$ or reference level.
- X1:X3 is the 2-way interaction between $X_1$ and $X_3$ when $X_2 = 0$ or reference level.
- X2:X3 is the 2-way interaction between $X_2$ and $X_3$ when $X_1 = 0$ or reference level.
There is a three-way interaction X1:X2:X3 if the X1:X2 interaction coefficient(s) depend on the level of $X_3$, or, equivalently, if the X2:X3 interaction coefficient(s) depend on the level of $X_1$, or if the X1:X3 interaction coefficient(s) depend on the level of $X_2$.
With the 3-way interaction and all 2-way interactions, the total number of coefficients (including intercept) equals the total number of groups, $k_1 k_2 k_3$: no constraints on group means.
5 Degrees of freedom
For the model Y ~ X1*X2*X3:

source           df
intercept        1
X1               k_1 - 1
X2               k_2 - 1
X3               k_3 - 1
X1:X2            (k_1 - 1)(k_2 - 1)
X1:X3            (k_1 - 1)(k_3 - 1)
X2:X3            (k_2 - 1)(k_3 - 1)
X1:X2:X3         (k_1 - 1)(k_2 - 1)(k_3 - 1)
total # coefs    k_1 k_2 k_3
residual         n - k_1 k_2 k_3

where n is the number of observations (rows) in the data.
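The coefficient counts in the table can be checked numerically. The following is a minimal sketch, assuming a hypothetical balanced design with k1 = 2, k2 = 3 and k3 = 2 levels; the data frame `design` and the helper `n_coef` are illustrative names, not part of the lecture data.

```r
# Hypothetical full factorial design: 2 x 3 x 2 = 12 groups, 2 rows per group
k1 <- 2; k2 <- 3; k3 <- 2
design <- expand.grid(
  X1  = factor(seq_len(k1)),
  X2  = factor(seq_len(k2)),
  X3  = factor(seq_len(k3)),
  rep = 1:2
)

# The number of columns of the design matrix is the number of coefficients
n_coef <- function(formula) ncol(model.matrix(formula, data = design))

n_additive <- n_coef(~ X1 + X2 + X3)      # intercept + main effects
n_twoway   <- n_coef(~ (X1 + X2 + X3)^2)  # adds the three 2-way interactions
n_full     <- n_coef(~ X1 * X2 * X3)      # adds the 3-way interaction

print(c(additive = n_additive, twoway = n_twoway, full = n_full))
print(n_full == k1 * k2 * k3)  # full model: one coefficient per group
```

Only the full model has as many coefficients as groups, which is why it places no constraints on the group means.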
6 [Figure: six panels showing mean Y (or log odds) against x2cat (0 = no, 1 = yes), with one line per combination of x1 (0/1) and x3 (0/1) and the coefficients b1, b3 (and b2 where X2 is in the model) marked. Panels: X1 + X3; X1 + X2 + X3; X1*X2 + X3; X1*X2 + X3 + X1:X3; X1*X2 + X3 + X1:X3 + X2:X3; X1*X2*X3.]
7 Notations in R (and many other programs)
These are equivalent notations for model formulas:
- (X1 + X2)^2, and X1*X2, and X1 + X2 + X1:X2.
- Main effects and one 2-way interaction: X1*X2 + X3, and X1 + X2 + X3 + X1:X2.
- Main effects and two 2-way interactions: X1*X2 + X3 + X2:X3, and X1 + X2 + X3 + X1:X2 + X2:X3, and X1 + X2*X3 + X1:X2.
- Main effects and all 2-way interactions: X1*X2 + X3 + X2:X3 + X1:X3, and (X1 + X2 + X3)^2.
- All: X1*X2*X3, and X1 + X2 + X3 + X1:X2 + X1:X3 + X2:X3 + X1:X2:X3, and (X1 + X2 + X3)^3 (and many others).
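One way to check that two formulas describe the same model is to compare the terms R expands them into. This is a small sketch using `terms()` from base R; the helper names `term_set` and `same_model` are made up for this illustration.

```r
# Expand a formula into its sorted set of term labels
term_set <- function(f) sort(attr(terms(f), "term.labels"))

# Two formulas specify the same model if they expand to the same terms
same_model <- function(f1, f2) identical(term_set(f1), term_set(f2))

eq_two   <- same_model(y ~ X1 * X2, y ~ (X1 + X2)^2)
eq_mixed <- same_model(y ~ X1 * X2 + X3 + X2:X3, y ~ X1 + X2 * X3 + X1:X2)
eq_all2  <- same_model(y ~ X1 * X2 + X3 + X2:X3 + X1:X3, y ~ (X1 + X2 + X3)^2)
eq_full  <- same_model(y ~ X1 * X2 * X3, y ~ (X1 + X2 + X3)^3)

# The full model is NOT the same as "all 2-way interactions":
# it also contains the X1:X2:X3 term
neq_full <- same_model(y ~ X1 * X2 * X3, y ~ (X1 + X2 + X3)^2)

print(c(eq_two, eq_mixed, eq_all2, eq_full, neq_full))
print(term_set(y ~ X1 * X2 * X3))
```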
8 Hierarchy principle
Include all lower-level interactions.
- If we include X1:X2, then we must also include X1 and X2. If we allow $X_1$ to have an effect that depends on the level of $X_2$, it makes little sense to assume that $X_1$ has no effect when $X_2 = 0$ or reference level.
- If we include X1:X2:X3, then we must also include all three 2-way interactions (X1:X2, X1:X3, X2:X3) and all three main effects (X1, X2, X3).
- A predictor $X_1$ is said to have an effect as soon as one interaction with $X_1$ is present.
9 Hierarchy principle
Example: if breastfeeding ($X_1$) has no effect except on preterm baby girls, and if the reference levels are bottle feeding, boys, full term, then we may find a table (source, df, coef, p-value) with rows breast, girl, preterm, breast:girl, breast:preterm, girl:preterm and breast:girl:preterm, in which the breast:girl:preterm row has p-value $< 10^{-5}$.
Then there is evidence that breastfeeding has an effect.
10 Interpretation
When we find complex interactions, we often break down the data into smaller data sets, for ease of interpretation.
If there is evidence that breastfeeding affects the probability of respiratory disease differently in boys/girls and in preterm/full term babies, then we may break down the data into 2 sets of data: preterm babies only, and full term babies only.
- cons: not one, but 2 analyses are required.
- pros: no need to consider 3-way interactions, and simpler interpretation: only one two-way interaction (breast/girl) in each analysis.
12 Dispersion
The binomial distribution assumes that $\mathrm{var}(y_i) = n_i p_i (1 - p_i)$.
In real situations, the variance could differ from $n_i p_i (1 - p_i)$ if the $n_i$ individuals from observation $i$ are not independent (all from the same inoculation batch? all from the same plate? etc.)
Could it be that $\mathrm{var}(y_i) = \sigma^2 n_i p_i (1 - p_i)$ for some $\sigma^2$, violating the binomial assumption? $\sigma^2$ is called the dispersion.
- $\sigma^2 < 1$: underdispersion (rare)
- $\sigma^2 = 1$: binomial assumption not violated
- $\sigma^2 > 1$: overdispersion
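To see how non-independence within an observation can inflate the variance, here is a small simulation sketch. It assumes a beta-binomial mechanism (each batch gets its own survival probability), which is just one possible source of overdispersion; the sample size, the Beta shape values and all variable names are illustrative choices, not from the lecture.

```r
set.seed(1)
n_i   <- 15      # individuals per observation
p     <- 0.6     # common survival probability
n_obs <- 5000    # number of simulated observations

# Independent individuals: plain binomial counts
y_indep <- rbinom(n_obs, size = n_i, prob = p)

# Non-independent individuals: each observation (batch) shares its own
# probability, drawn from a Beta distribution with mean 6 / (6 + 4) = 0.6
p_batch <- rbeta(n_obs, shape1 = 6, shape2 = 4)
y_batch <- rbinom(n_obs, size = n_i, prob = p_batch)

# Ratio of the empirical variance to the binomial variance n p (1 - p)
binom_var  <- n_i * p * (1 - p)
disp_indep <- var(y_indep) / binom_var   # should be close to 1
disp_batch <- var(y_batch) / binom_var   # should be clearly above 1

print(c(independent = disp_indep, batch_effect = disp_batch))
```

Both sets of counts have the same mean, but the shared batch probability makes the counts more variable than the binomial variance allows.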
13 Dispersion
We can estimate the dispersion in two (non-equivalent) ways.
- From the Pearson residuals, used in summary and anova:
  $$\hat\sigma^2 = \frac{\sum_{i=1}^{n} (r_i^P)^2}{n - k}$$
  This is the most appropriate estimate because the Pearson residuals are standardized by their estimated SD:
  $$r_i^P = \frac{y_i - n_i \hat p_i}{\sqrt{n_i \hat p_i (1 - \hat p_i)}}$$
- From the deviance residuals, used in drop1, technically easier:
  $$\hat\sigma_D^2 = \frac{\sum_{i=1}^{n} (r_i^D)^2}{n - k} = \frac{\text{residual } D}{\text{residual df}}$$
Here n is the number of observations and k the number of coefficients, so n - k is the residual df.
Often, these two estimates are very similar. If the data truly follow a binomial distribution, the dispersion should be 1.
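The two estimates can be computed side by side. The sketch below uses a small hypothetical data set (`toy`, 8 groups of 15, simulated for illustration only; it is not the lupine data) and also recomputes the Pearson residuals by hand from the formula above.

```r
set.seed(42)
# Hypothetical grouped binomial data, for illustration only
toy <- data.frame(
  dose  = rep(c(0, 1, 2, 4), times = 2),
  treat = factor(rep(c("absent", "present"), each = 4)),
  n     = 15
)
eta   <- 1.5 - 0.5 * toy$dose + 0.8 * (toy$treat == "present")
toy$y <- rbinom(nrow(toy), size = toy$n, prob = plogis(eta))

fit <- glm(y / n ~ dose + treat, family = binomial, weights = n, data = toy)

# Pearson residuals by hand: (y - n p) / sqrt(n p (1 - p))
p_hat     <- fitted(fit)
r_pearson <- (toy$y - toy$n * p_hat) / sqrt(toy$n * p_hat * (1 - p_hat))
by_hand_ok <- isTRUE(all.equal(unname(r_pearson),
                               unname(residuals(fit, type = "pearson"))))

# Two dispersion estimates, both divided by the residual df (n - k = 8 - 3)
disp_pearson  <- sum(r_pearson^2) / fit$df.residual
disp_deviance <- sum(residuals(fit, type = "deviance")^2) / fit$df.residual

print(by_hand_ok)
print(c(pearson = disp_pearson, deviance = disp_deviance))
```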
14 Bush lupine seedling survival
Do entomopathogenic nematodes protect plants indirectly, by killing root-feeding ghost moth caterpillars? (Strong et al., 1999).
Field experiment with 120 bush lupine seedlings, randomly assigned to treatments (15 per treatment): with 0 or up to 32 hatchlings of ghost moth caterpillars, and with/without nematodes.
> dat = read.table("lupine.txt", header=TRUE)
> dat$n = 15
> dat
[Output: data frame with 8 rows, one per treatment, and columns caterpillars, nematodes (absent/present), lupine, n.]
15 Bush lupine seedling survival
> fit = glm(lupine/n ~ caterpillars+nematodes, family=binomial,
+           weights=n, data=dat)
> summary(fit)
...
Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
(Intercept)           ...        ...     ...      ... ***
caterpillars          ...        ...     ...      ... ***
nematodespresent      ...        ...     ...      ... **
(Dispersion parameter for binomial family taken to be 1)
Residual deviance: ... on 5 degrees of freedom
> resp = residuals(fit, type="pearson")
> resp
...
16 Bush lupine seedling survival
> sum(resp^2)/(8-3)                            # based on Pearson's residuals: better
[1] ...
> sum(residuals(fit, type="deviance")^2)/5     # based on deviance
[1] ...
> fit$deviance
[1] ...
> fit$df.residual
[1] 5
> fit$deviance / fit$df.residual   # easier way to estimate the dispersion based on deviance residuals
[1] ...
Here, the dispersion parameter is estimated to be 1.12, more than 1: slight overdispersion. Can this difference be due to chance alone?
17 Bush lupine seedling survival
Is overdispersion significant? Test for lack of fit: compare the residual deviance to a $\chi^2$ distribution with df = residual df. But many other reasons can result in a lack of fit.
> pchisq(fit$deviance, df=5, lower.tail=FALSE)
[1] ...
Here: no significant lack of fit, interpreted as no significant overdispersion. More on this later, with a simulation.
If we really need to, how do we account for overdispersion?
18 Quasi-Binomial analysis
We can account for overdispersion (or underdispersion) using a quasi-binomial analysis.
> fitquasi = glm(lupine/n ~ caterpillars+nematodes, family=quasibinomial,
+                weights=n, data=dat)
> summary(fitquasi)
...
Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)           ...        ...     ...      ... *
caterpillars          ...        ...     ...      ... *
nematodespresent      ...        ...     ...      ... *
(Dispersion parameter for quasibinomial family taken to be ...)
Null deviance: ... on 7 degrees of freedom
Residual deviance: ... on 5 degrees of freedom
19 Quasi-Binomial analysis
> summary(fit)   # was standard logistic regression
...
                 Estimate Std. Error z value Pr(>|z|)
(Intercept)           ...        ...     ...      ... ***
caterpillars          ...        ...     ...      ... ***
nematodespresent      ...        ...     ...      ... **
> ... * sqrt(...)   # binomial SE times the square root of the estimated dispersion
[1] ...
> pt(3.630, df=5, lower.tail=FALSE) * 2
[1] ...
Compared with the standard binomial (logistic) regression:
- Same deviance residuals, same null/residual deviance, same df's.
- Same estimated coefficients.
- Different SE for coefficients: they were multiplied by $\hat\sigma$.
- Wald z-tests in logistic regression become t-tests in quasi-binomial regression, on df = residual df.
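The comparison between the two fits can be verified directly. This is a sketch on hypothetical grouped data (`toy`, simulated here for illustration; not the lupine data), fitting the same model with both families and checking the relationships listed above.

```r
set.seed(7)
# Hypothetical grouped binomial data (8 groups of 15), for illustration only
toy <- data.frame(x = rep(0:3, 2),
                  g = factor(rep(c("a", "b"), each = 4)),
                  n = 15)
toy$y <- rbinom(8, size = 15,
                prob = plogis(1 - 0.6 * toy$x + 0.7 * (toy$g == "b")))

fit_b <- glm(y / n ~ x + g, family = binomial,      weights = n, data = toy)
fit_q <- glm(y / n ~ x + g, family = quasibinomial, weights = n, data = toy)

sigma2_hat <- summary(fit_q)$dispersion   # Pearson-based estimate
se_b <- coef(summary(fit_b))[, "Std. Error"]
se_q <- coef(summary(fit_q))[, "Std. Error"]

same_coef <- isTRUE(all.equal(coef(fit_b), coef(fit_q)))
same_dev  <- isTRUE(all.equal(fit_b$deviance, fit_q$deviance))
scaled_se <- isTRUE(all.equal(se_q, se_b * sqrt(sigma2_hat)))

# Wald t-tests on the residual df, computed by hand
t_val <- coef(fit_q) / se_q
p_val <- 2 * pt(abs(t_val), df = fit_q$df.residual, lower.tail = FALSE)
same_p <- isTRUE(all.equal(p_val, coef(summary(fit_q))[, "Pr(>|t|)"]))

print(c(same_coef = same_coef, same_dev = same_dev,
        scaled_se = scaled_se, same_p = same_p))
```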
20 Model comparison
Analysis of Deviance in logistic regression (which uses the $\chi^2$ distribution) becomes ANOVA in a quasi-binomial analysis: it uses the F distribution as in standard regression, where Sums of Squares are replaced by deviances and MSError = $\hat\sigma^2$ is replaced by the dispersion.
> anova(fitquasi, test="F")
Terms added sequentially (first to last)
             Df Deviance Resid.Df Resid.Dev   F Pr(>F)
NULL                          ...       ...
caterpillars ...     ...      ...       ... ...    ... *
nematodes    ...     ...      ...       ... ...    ... *
> (... - ...)/(7-6) / ...   # (drop in deviance / drop in df) / dispersion
[1] ...
> pf(10.865, df1=1, df2=5, lower.tail=FALSE)   # df1=7-6=1
[1] ...
> (... - ...)/(6-5) / ...
[1] ...
> pf(11.318, df1=1, df2=5, lower.tail=FALSE)
[1] ...
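The F statistic in the table can be recomputed by hand as (drop in deviance / drop in df) / dispersion. The sketch below does this on hypothetical data (`toy`, simulated for illustration only) and compares the result with what `anova()` reports for the last term added.

```r
set.seed(11)
# Hypothetical grouped binomial data (8 groups of 15), for illustration only
toy <- data.frame(x = rep(0:3, 2),
                  g = factor(rep(c("a", "b"), each = 4)),
                  n = 15)
toy$y <- rbinom(8, size = 15,
                prob = plogis(1 - 0.6 * toy$x + 0.7 * (toy$g == "b")))

fit_q <- glm(y / n ~ x + g, family = quasibinomial, weights = n, data = toy)
tab   <- anova(fit_q, test = "F")

sigma2_hat <- summary(fit_q)$dispersion   # dispersion of the full model
resid_dev  <- tab[["Resid. Dev"]]         # rows: NULL, + x, + g
resid_df   <- tab[["Resid. Df"]]

# F for adding g after x
df1 <- resid_df[2] - resid_df[3]
F_g <- ((resid_dev[2] - resid_dev[3]) / df1) / sigma2_hat
p_g <- pf(F_g, df1 = df1, df2 = fit_q$df.residual, lower.tail = FALSE)

print(c(F_manual = F_g, F_anova = tab[["F"]][3]))
print(c(p_manual = p_g, p_anova = tab[["Pr(>F)"]][3]))
```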
22 Why use simulations?
To assess uncertainty in predictions, in estimated regression coefficients, etc. This is in contrast to deriving formulas for standard errors.
pros:
- we do not need to learn how to derive formulas.
- we can use simulations even when formulas are only approximations.
- we can assess uncertainty in quantities for which there are no formulas.
cons:
- we will learn how to write small programs in R.
We will do parametric bootstrapping.
23 Recall runoff data
We explained runoff events with total amount of precipitation (inches) and maximum intensity at 10 minutes (in/min). We used logistic regression (why?)
> fit2 = glm(RunoffEvent ~ Precip+MaxIntensity10, family=binomial, data=runoff)
> summary(fit2)
...
Residual deviance: 116.11 on 228 degrees of freedom
> fit2$df.residual
[1] 228
> fit2$deviance            # seems too low
[1] ...
> pchisq(116.11, df=228)   # P(X2 < 116.11)
[1] ...e-11
> curve(dchisq(x,df=228), from=100,to=300, xlab="x2")
> points(116.11, 0, col="red", pch=16)
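24 Is there real underdispersion?
[Figure: $\chi^2$ density on 228 df (y axis: dchisq(x, df = 228); x axis: x2), with the observed deviance marked by a red point.]
Why is the deviance so low? Is the estimated dispersion $\hat\sigma^2$ a sign of underdispersion?
> sum( residuals(fit2, type="pearson")^2 )/228
[1] ...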
25 Simulating the distribution of the residual deviance
"Formula": when sample sizes are large and if the model is correct, then the residual deviance is approximately $\chi^2$ distributed, on residual df.
Here: deviance = 116.11 on df = 228. The formula gives us a very low p-value and suggests underdispersion. But what can we trust?
To see what values the deviance takes just by chance, we will
- simulate new data sets under our hypothesized model: use our estimated model and repeat many experiments in silico. Our model will be correct ($H_0$ true) for these in silico experiments.
- apply the treatment we applied to our original data set,
- calculate the deviance of each of these new data sets,
- repeat many times, and summarize the deviances.
26 Simulating one new data set
> dim(runoff)
[1] 231 ...
> mu = predict(fit2, type="response")   # estimated probabilities
> mu                                    # of runoff events for each storm
...
> sim.events = rbinom(231, size=1, prob=mu)   # One experiment
> sim.events                                  # was simulated
...
> sim.data = data.frame(
+   Precip = runoff$Precip,
+   MaxIntensity10 = runoff$MaxIntensity10,
+   RunoffEvent = sim.events
+ )
27 Simulating one new data set
> sim.data   # just to check the new data set
    Precip MaxIntensity10 RunoffEvent
...
# now apply the same treatment
> sim.fit = glm(RunoffEvent ~ Precip + MaxIntensity10,
+               data=sim.data, family=binomial)
> sim.fit$deviance
[1] ...   # this is random: from one experiment.
We got this deviance "just by chance": the model (binomial, Precip + MaxIntensity10) was true.
28 Simulating many new data sets
First define a function to make and analyze a single data set:
> simulate.deviance = function(){
+   sim.data = data.frame(
+     Precip = runoff$Precip,
+     MaxIntensity10 = runoff$MaxIntensity10,
+     RunoffEvent = rbinom(231, size=1, prob=mu)
+   )
+   sim.fit = glm(RunoffEvent ~ Precip + MaxIntensity10,
+                 data=sim.data, family=binomial)
+   return( sim.fit$deviance )
+ }
> simulate.deviance()   # check that this function works
[1] ...                 # this is random
> simulate.deviance()
[1] ...                 # new random deviance from new expt
> simulate.deviance()
[1] ...                 # from another new experiment
29 Simulating many new data sets
We replicate this simulation many times (1000 is usually enough).
replicate(n, function): needs a number n of times and a function call to be repeated.
> sim1000 = replicate(1000, simulate.deviance())
> sim1000
[1] ...
...
[997] ...
30 Summarizing simulated deviances
Now we summarize the 1000 deviance values, which were obtained by chance under our hypothesized model:
> hist(sim1000)
Overlay the simulated distribution with the theoretical $\chi^2$ distribution... which we know is a bad approximation because all sample sizes are 1.
> hist(sim1000, xlim=c(70,300))
> curve(1000*10*dchisq(x,df=228), add=TRUE)
# I used 1000 * 10 * chi-square_density in order to match
# the area under the curve (1000 * 10 * 1) with
# the area of the histogram: 1000 points * width 10 of each bin
> text("x2 distribution on 228 df", x=228, y=200)
> points(116.11, 0, col="red", pch=16)
31 Summarizing simulated deviances
[Figure: histogram of sim1000 (y axis: Frequency), overlaid with the curve labeled "X2 distribution on 228 df".]
No sign of lack of fit: so far, it looks like our binomial model is adequate.
32 Testing lack of fit with simulations
Conclusions:
- In the runoff experiment, the distribution of the deviance under $H_0$: "our model is correct" is very far from a $\chi^2$ distribution on df = residual df.
- In this case (all $n_i = 1$) we should not trust the p-value obtained from comparing the residual deviance to the $\chi^2$ distribution on residual df (it was of order $10^{-11}$).
- Instead, we should test $H_0$: "the model is correct" versus "lack of fit" with simulations. How often is the deviance even lower than 116.11 just by chance?
- We obtain p-value = 0.57: no lack of fit, no evidence of underdispersion.
33 Testing lack of fit with simulations
> sim1000 < 116.11
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE ...
...
[997] TRUE FALSE TRUE FALSE
> sum( (sim1000 < 116.11) )
# true=1 false=0, so the sum of the true/false values
# will be the number of true's.
[1] 573
> sum( (sim1000 < 116.11) ) / 1000
[1] 0.573
p-value = 0.57 = probability of observing a deviance of 116.11 or smaller, when the model is really true. Obtained by parametric bootstrapping.
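The whole procedure fits in a few lines. Below is a self-contained sketch on hypothetical binary data (the data frame `toy`, its coefficients and the 500 replicates are illustrative assumptions, not the runoff data), contrasting the $\chi^2$ "formula" p-value with the simulated one.

```r
set.seed(123)
n <- 200
# Hypothetical binary data generated from a logistic model, for illustration
toy   <- data.frame(x = rnorm(n))
toy$y <- rbinom(n, size = 1, prob = plogis(-2 + 1.5 * toy$x))

fit <- glm(y ~ x, family = binomial, data = toy)
mu  <- fitted(fit)   # estimated probabilities under the fitted model

# One in silico experiment: new responses from the fitted model, refit,
# and return the residual deviance
simulate_deviance <- function() {
  sim <- data.frame(x = toy$x, y = rbinom(n, size = 1, prob = mu))
  glm(y ~ x, family = binomial, data = sim)$deviance
}
sim_dev <- replicate(500, simulate_deviance())

# Lower-tail p-values: chi-square approximation vs. parametric bootstrap
p_chisq <- pchisq(fit$deviance, df = fit$df.residual)
p_sim   <- mean(sim_dev < fit$deviance)

print(c(deviance = fit$deviance, df = fit$df.residual))
print(c(p_chisq = p_chisq, p_sim = p_sim))
```

Because the data were generated from the fitted model family, the simulated p-value is the one to trust here; the $\chi^2$ value only reflects how poor the approximation is when every $n_i = 1$.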
34 Simulation to test for overdispersion: lupine seedlings
We found a dispersion of 1.12, slightly over 1. Is this overdispersion?
> fit = glm(lupine/15 ~ caterpillars+nematodes, weights=size,
+           data=lup, family=binomial)
> fit$df.residual
[1] 5
> sum( residuals(fit, type="pearson")^2 ) / 5
[1] ...   # this is the estimated dispersion, sigma2 hat.
We will simulate many in silico new data sets, for which the model with caterpillars + nematodes and the binomial distribution will be true. Then we will see what the distribution of the estimated dispersion is under this null hypothesis.
35 Simulation to test for overdispersion: lupine seedlings
First, write a function to simulate one new experiment, analyze the data, and extract the estimated dispersion.
probs = fitted(fit)                     # estimated probs of lupine survival
probs = predict(fit, type="response")   # equivalent
simulate.dispersion = function(){
  sim.data = data.frame(
    caterpillars = lup$caterpillars,
    nematodes = lup$nematodes,
    size = lup$size,
    lupine = rbinom(8, size=15, prob=probs)
  )
  sim.fit = glm(lupine/15 ~ caterpillars+nematodes, weights=size,
                data=sim.data, family=binomial)
  sim.s2 = sum( residuals(sim.fit, type="pearson")^2 )/5
  return(sim.s2)
}
36 Simulate many new $\hat\sigma^2$ under $H_0$: no overdispersion
Now apply the function, a few times to check that it works, then replicate many times (and save the values):
> simulate.dispersion()
[1] ...
> simulate.dispersion()
[1] ...
> simulate.dispersion()
[1] ...
> simulate.dispersion()
[1] ...
> sim1000s2 = replicate(1000, simulate.dispersion())
> sim1000s2
[1] ...
...
[997] ...
37 Summarize the simulated dispersion values
The null distribution of the estimated dispersion is skewed, with a long right tail:
> summary(sim1000s2)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    ...     ...     ...     ...     ...     ...
> hist(sim1000s2, breaks=30)
# I asked for 30 breakpoints in the histogram
> points(x=1.1212, y=0, col="red", pch=16)
# placed a point at the dispersion for the original data
> text("1.1212", x=1.1212, y=10)
> sum( sim1000s2 > 1.1212 ) / 1000
[1] ...   # proportion of simulated s2 that are > 1.1212
p-value = 0.36 to test the null hypothesis that our model is correct with no overdispersion. We do not reject it.
38 Visualize the simulated dispersion values
[Figure: histogram of sim1000s2 (y axis: Frequency), with the dispersion of the original data marked by a red point.]
Recall this chart that showed how most of our course would be organized:
Chapter 4 One-Way ANOVA Recall this chart that showed how most of our course would be organized: Explanatory Variable(s) Response Variable Methods Categorical Categorical Contingency Tables Categorical
ANOVA. February 12, 2015
ANOVA February 12, 2015 1 ANOVA models Last time, we discussed the use of categorical variables in multivariate regression. Often, these are encoded as indicator columns in the design matrix. In [1]: %%R
STA 130 (Winter 2016): An Introduction to Statistical Reasoning and Data Science
STA 130 (Winter 2016): An Introduction to Statistical Reasoning and Data Science Mondays 2:10 4:00 (GB 220) and Wednesdays 2:10 4:00 (various) Jeffrey Rosenthal Professor of Statistics, University of Toronto
II. DISTRIBUTIONS distribution normal distribution. standard scores
Appendix D Basic Measurement And Statistics The following information was developed by Steven Rothke, PhD, Department of Psychology, Rehabilitation Institute of Chicago (RIC) and expanded by Mary F. Schmidt,
business statistics using Excel OXFORD UNIVERSITY PRESS Glyn Davis & Branko Pecar
business statistics using Excel Glyn Davis & Branko Pecar OXFORD UNIVERSITY PRESS Detailed contents Introduction to Microsoft Excel 2003 Overview Learning Objectives 1.1 Introduction to Microsoft Excel
Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm
Mgt 540 Research Methods Data Analysis 1 Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm http://web.utk.edu/~dap/random/order/start.htm
Section 13, Part 1 ANOVA. Analysis Of Variance
Section 13, Part 1 ANOVA Analysis Of Variance Course Overview So far in this course we ve covered: Descriptive statistics Summary statistics Tables and Graphs Probability Probability Rules Probability
KSTAT MINI-MANUAL. Decision Sciences 434 Kellogg Graduate School of Management
KSTAT MINI-MANUAL Decision Sciences 434 Kellogg Graduate School of Management Kstat is a set of macros added to Excel and it will enable you to do the statistics required for this course very easily. To
A Handbook of Statistical Analyses Using R. Brian S. Everitt and Torsten Hothorn
A Handbook of Statistical Analyses Using R Brian S. Everitt and Torsten Hothorn CHAPTER 6 Logistic Regression and Generalised Linear Models: Blood Screening, Women s Role in Society, and Colonic Polyps
Course Text. Required Computing Software. Course Description. Course Objectives. StraighterLine. Business Statistics
Course Text Business Statistics Lind, Douglas A., Marchal, William A. and Samuel A. Wathen. Basic Statistics for Business and Economics, 7th edition, McGraw-Hill/Irwin, 2010, ISBN: 9780077384470 [This
Overview Classes. 12-3 Logistic regression (5) 19-3 Building and applying logistic regression (6) 26-3 Generalizations of logistic regression (7)
Overview Classes 12-3 Logistic regression (5) 19-3 Building and applying logistic regression (6) 26-3 Generalizations of logistic regression (7) 2-4 Loglinear models (8) 5-4 15-17 hrs; 5B02 Building and
LAB 4 INSTRUCTIONS CONFIDENCE INTERVALS AND HYPOTHESIS TESTING
LAB 4 INSTRUCTIONS CONFIDENCE INTERVALS AND HYPOTHESIS TESTING In this lab you will explore the concept of a confidence interval and hypothesis testing through a simulation problem in engineering setting.
Statistiek II. John Nerbonne. October 1, 2010. Dept of Information Science [email protected]
Dept of Information Science [email protected] October 1, 2010 Course outline 1 One-way ANOVA. 2 Factorial ANOVA. 3 Repeated measures ANOVA. 4 Correlation and regression. 5 Multiple regression. 6 Logistic
Chi Square Tests. Chapter 10. 10.1 Introduction
Contents 10 Chi Square Tests 703 10.1 Introduction............................ 703 10.2 The Chi Square Distribution.................. 704 10.3 Goodness of Fit Test....................... 709 10.4 Chi Square
Biostatistics: Types of Data Analysis
Biostatistics: Types of Data Analysis Theresa A Scott, MS Vanderbilt University Department of Biostatistics [email protected] http://biostat.mc.vanderbilt.edu/theresascott Theresa A Scott, MS
Simple linear regression
Simple linear regression Introduction Simple linear regression is a statistical method for obtaining a formula to predict values of one variable from another where there is a causal relationship between
Correlation and Simple Linear Regression
Correlation and Simple Linear Regression We are often interested in studying the relationship among variables to determine whether they are associated with one another. When we think that changes in a
Chapter 7. One-way ANOVA
Chapter 7 One-way ANOVA One-way ANOVA examines equality of population means for a quantitative outcome and a single categorical explanatory variable with any number of levels. The t-test of Chapter 6 looks
12.5: CHI-SQUARE GOODNESS OF FIT TESTS
125: Chi-Square Goodness of Fit Tests CD12-1 125: CHI-SQUARE GOODNESS OF FIT TESTS In this section, the χ 2 distribution is used for testing the goodness of fit of a set of data to a specific probability
Binary Diagnostic Tests Two Independent Samples
Chapter 537 Binary Diagnostic Tests Two Independent Samples Introduction An important task in diagnostic medicine is to measure the accuracy of two diagnostic tests. This can be done by comparing summary
BIOL 933 Lab 6 Fall 2015. Data Transformation
BIOL 933 Lab 6 Fall 2015 Data Transformation Transformations in R General overview Log transformation Power transformation The pitfalls of interpreting interactions in transformed data Transformations
3. Analysis of Qualitative Data
3. Analysis of Qualitative Data Inferential Stats, CEC at RUPP Poch Bunnak, Ph.D. Content 1. Hypothesis tests about a population proportion: Binomial test 2. Chi-square testt for goodness offitfit 3. Chi-square
Descriptive Statistics. Purpose of descriptive statistics Frequency distributions Measures of central tendency Measures of dispersion
Descriptive Statistics Purpose of descriptive statistics Frequency distributions Measures of central tendency Measures of dispersion Statistics as a Tool for LIS Research Importance of statistics in research
The F distribution and the basic principle behind ANOVAs. Situating ANOVAs in the world of statistical tests
Tutorial The F distribution and the basic principle behind ANOVAs Bodo Winter 1 Updates: September 21, 2011; January 23, 2014; April 24, 2014; March 2, 2015 This tutorial focuses on understanding rather
Testing Group Differences using T-tests, ANOVA, and Nonparametric Measures
Testing Group Differences using T-tests, ANOVA, and Nonparametric Measures Jamie DeCoster Department of Psychology University of Alabama 348 Gordon Palmer Hall Box 870348 Tuscaloosa, AL 35487-0348 Phone:
1.5 Oneway Analysis of Variance
Statistics: Rosie Cornish. 200. 1.5 Oneway Analysis of Variance 1 Introduction Oneway analysis of variance (ANOVA) is used to compare several means. This method is often used in scientific or medical experiments
1. The parameters to be estimated in the simple linear regression model Y=α+βx+ε ε~n(0,σ) are: a) α, β, σ b) α, β, ε c) a, b, s d) ε, 0, σ
STA 3024 Practice Problems Exam 2 NOTE: These are just Practice Problems. This is NOT meant to look just like the test, and it is NOT the only thing that you should study. Make sure you know all the material
Chapter 26: Tests of Significance
Chapter 26: Tests of Significance Procedure: 1. State the null and alternative in words and in terms of a box model. 2. Find the test statistic: z = observed EV. SE 3. Calculate the P-value: The area under
Exploratory data analysis (Chapter 2) Fall 2011
Exploratory data analysis (Chapter 2) Fall 2011 Data Examples Example 1: Survey Data 1 Data collected from a Stat 371 class in Fall 2005 2 They answered questions about their: gender, major, year in school,
Two Correlated Proportions (McNemar Test)
Chapter 50 Two Correlated Proportions (Mcemar Test) Introduction This procedure computes confidence intervals and hypothesis tests for the comparison of the marginal frequencies of two factors (each with
Experimental Design. Power and Sample Size Determination. Proportions. Proportions. Confidence Interval for p. The Binomial Test
Experimental Design Power and Sample Size Determination Bret Hanlon and Bret Larget Department of Statistics University of Wisconsin Madison November 3 8, 2011 To this point in the semester, we have largely
UNDERSTANDING THE INDEPENDENT-SAMPLES t TEST
UNDERSTANDING The independent-samples t test evaluates the difference between the means of two independent or unrelated groups. That is, we evaluate whether the means for two independent groups are significantly
Section 12 Part 2. Chi-square test
Section 12 Part 2 Chi-square test McNemar s Test Section 12 Part 2 Overview Section 12, Part 1 covered two inference methods for categorical data from 2 groups Confidence Intervals for the difference of
MIXED MODEL ANALYSIS USING R
Research Methods Group MIXED MODEL ANALYSIS USING R Using Case Study 4 from the BIOMETRICS & RESEARCH METHODS TEACHING RESOURCE BY Stephen Mbunzi & Sonal Nagda www.ilri.org/rmg www.worldagroforestrycentre.org/rmg
Principles of Hypothesis Testing for Public Health
Principles of Hypothesis Testing for Public Health Laura Lee Johnson, Ph.D. Statistician National Center for Complementary and Alternative Medicine [email protected] Fall 2011 Answers to Questions
Data Mining Techniques Chapter 5: The Lure of Statistics: Data Mining Using Familiar Tools
Data Mining Techniques Chapter 5: The Lure of Statistics: Data Mining Using Familiar Tools Occam s razor.......................................................... 2 A look at data I.........................................................
