ANOVA. February 12, 2015


1 ANOVA models

Last time, we discussed the use of categorical variables in multivariate regression. Often, these are encoded as indicator columns in the design matrix.

In [1]: %%R
url = 
salary.table = read.table(url, header=T)
salary.table$E = factor(salary.table$E)
salary.table$M = factor(salary.table$M)
salary.lm = lm(S ~ X + E + M, salary.table)
head(model.matrix(salary.lm))

(Intercept) X E2 E3 M

Often, especially in experimental settings, we record only categorical variables. Such models are often referred to as ANOVA (Analysis of Variance) models. They are generalizations of our favorite example, the two-sample t-test.

1.1 Example: recovery time

Suppose we want to understand the relationship between recovery time after surgery and a patient's prior fitness. We group patients into three fitness levels: below average, average, above average. If you are in better shape before surgery, does it take less time to recover?

In [2]: %%R
url = 
rehab.table = read.table(url, header=T, sep=',')
rehab.table$Fitness <- factor(rehab.table$Fitness)
head(rehab.table)
Fitness Time

In [3]: %%R -h 800 -w 800
attach(rehab.table)
boxplot(Time ~ Fitness, col=c('red', 'green', 'blue'))
1.2 One-way ANOVA

First generalization of the two-sample t-test: more than two groups. Observations are broken up into r groups, with n_i, 1 ≤ i ≤ r, observations per group.

Model:

    Y_ij = µ + α_i + ε_ij,   ε_ij ~ N(0, σ²).

Constraint: Σ_{i=1}^{r} α_i = 0. This constraint is needed for identifiability. It is equivalent to adding only r − 1 columns to the design matrix for this qualitative variable. This is not the same parameterization we get when adding r columns, but it gives the same model. The estimates of the α's can be obtained from the estimates of β under R's default parameterization. For a more detailed exploration of how R creates design matrices, try reading a tutorial on design matrices.

1.3 Remember, it's still a model (i.e. a plane)

1.4 Fitting the model

The model is easy to fit:

    Ŷ_ij = (1/n_i) Σ_{j=1}^{n_i} Y_ij = Ȳ_i·

If an observation is in the ith group, its predicted mean is just the sample mean of the observations in the ith group.

Simplest question: is there any group (main) effect?

    H₀: α₁ = ... = α_r = 0

The test is based on an F test of the full model against the reduced model, which has just an intercept.

Other questions: is the effect the same in groups 1 and 2, i.e. H₀: α₁ = α₂?

In [4]: %%R
rehab.lm <- lm(Time ~ Fitness)
summary(rehab.lm)

Call:
lm(formula = Time ~ Fitness)

Residuals:
    Min      1Q  Median      3Q     Max

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                              < 2e-16 ***
Fitness2                                         **
Fitness3                                    e-06 ***
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: on 21 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 2 and 21 DF, p-value: 4.129e-05

In [5]: %%R
print(predict(rehab.lm, list(Fitness=factor(c(1,2,3)))))
c(mean(Time[Fitness == 1]), mean(Time[Fitness == 2]), mean(Time[Fitness == 3]))

[1]

Recall that the rows of the Coefficients table above do not correspond to the α parameters. For one thing, we would see three α's, and their sum would have to equal 0. Also, the design matrix uses the indicator coding we saw last time.

In [6]: %%R
head(model.matrix(rehab.lm))

(Intercept) Fitness2 Fitness3

There are ways to get different design matrices by using the contrasts argument. This is a bit above our pay grade at the moment. Upon inspection of the design matrix above, we see that the (Intercept) coefficient corresponds to the mean in the Fitness==1 group, while the Fitness2 coefficient corresponds to the difference between the Fitness==2 and Fitness==1 groups.

1.5 ANOVA table

Much of the information in an ANOVA model is contained in the ANOVA table.

In [8]: make_table(anova_oneway)
        apply_theme('basic')

Out[8]: <ipy_table.IpyTable at 0x107d8c250>

In [9]: %%R
anova(rehab.lm)
Analysis of Variance Table

Response: Time
          Df Sum Sq Mean Sq F value Pr(>F)
Fitness    2                          e-05 ***
Residuals 21

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Note that MSTR measures the variability of the cell means. If there is a group effect, we expect this to be large relative to MSE. We see that under H₀: α₁ = ... = α_r = 0, the expected value of both MSTR and MSE is σ². This tells us how to test H₀ using a ratio of mean squares, i.e. an F test.

1.6 Testing for any main effect

Rows in the ANOVA table are, in general, independent. Therefore, under H₀,

    F = MSTR / MSE = (SSTR / df_TR) / (SSE / df_E) ~ F_{df_TR, df_E},

where the degrees of freedom come from the df column of the table above. Reject H₀ at level α if F > F_{1−α, df_TR, df_E}.

In [10]: %%R
MSTR = anova(rehab.lm)$"Mean Sq"[1]
MSE = anova(rehab.lm)$"Mean Sq"[2]
F = MSTR / MSE
pval = 1 - pf(F, 2, 21)
print(data.frame(F, pval))

F pval

1.7 Inference for linear combinations

Suppose we want to infer something about

    Σ_{i=1}^{r} a_i µ_i,

where µ_i = µ + α_i is the mean in the ith group. For example: H₀: µ₁ − µ₂ = 0 (same as H₀: α₁ − α₂ = 0)? That is, is there a difference between the below average and average groups in terms of rehab time?
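The F ratio of Section 1.6 is easy to compute from scratch. Here is an illustrative sketch in Python rather than R, on small hypothetical recovery times (the real rehab data are not reproduced here): it computes the group means, SSTR, SSE, and the resulting F statistic and p-value.

```python
# One-way ANOVA F test by hand on hypothetical data (Python sketch;
# the notes do the equivalent in R with anova() and pf()).
import numpy as np
from scipy import stats

# hypothetical recovery times for three fitness groups
groups = [
    np.array([29.0, 33.0, 39.0, 35.0]),   # below average
    np.array([30.0, 31.0, 28.0, 27.0]),   # average
    np.array([24.0, 22.0, 25.0, 21.0]),   # above average
]
r = len(groups)
N = sum(len(g) for g in groups)
grand = np.concatenate(groups).mean()

# SSTR: variability of the cell means; SSE: within-group variability
sstr = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
sse = sum(((g - g.mean()) ** 2).sum() for g in groups)
df_tr, df_e = r - 1, N - r

F = (sstr / df_tr) / (sse / df_e)        # MSTR / MSE
pval = 1 - stats.f.cdf(F, df_tr, df_e)   # same as 1 - pf(F, df_tr, df_e) in R
print(F, pval)
```

The same F and p-value come out of any packaged one-way ANOVA routine; the point is only that the ratio of mean squares is all there is to it.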
We need to know

    Var(Σ_{i=1}^{r} a_i Ȳ_i·) = σ² Σ_{i=1}^{r} a_i² / n_i.

After this, the usual confidence intervals and t-tests apply.

In [11]: %%R
head(model.matrix(rehab.lm))

(Intercept) Fitness2 Fitness3

This means that the coefficient Fitness2 is the estimated difference between the two groups.

In [12]: %%R
detach(rehab.table)

1.8 Two-way ANOVA

Often, we will have more than one variable we are changing.

1.8.1 Example

After kidney failure, we suppose that the time of stay in hospital depends on weight gain between treatments and duration of treatment. We will model the log number of days as a function of the other two factors.

In [14]: make_table(desc)
         apply_theme('basic')

Out[14]: <ipy_table.IpyTable at 0x107d8cd90>

In [15]: %%R
url = 
kidney.table = read.table(url, header=T)
kidney.table$D = factor(kidney.table$Duration)
kidney.table$W = factor(kidney.table$Weight)
kidney.table$logdays = log(kidney.table$Days + 1)
attach(kidney.table)
head(kidney.table)

Days Duration Weight ID D W logdays
1.8.2 Two-way ANOVA model

Second generalization of the t-test: more than one grouping variable.

Two-way ANOVA model: r groups in the first factor, m groups in the second factor, and n_ij observations in each combination of factor levels.

Model:

    Y_ijk = µ + α_i + β_j + (αβ)_ij + ε_ijk,   ε_ijk ~ N(0, σ²).

In the kidney example, r = 3 (weight gain), m = 2 (duration of treatment), and n_ij = 10 for all (i, j).

1.8.3 Questions of interest

Two-way ANOVA: the main questions of interest are whether there are main effects for the grouping variables,

    H₀: α₁ = ... = α_r = 0,   H₀: β₁ = ... = β_m = 0,

and whether there are interaction effects,

    H₀: (αβ)_ij = 0, 1 ≤ i ≤ r, 1 ≤ j ≤ m.

Interactions between factors

We've already seen these interactions in the IT salary example. An additive model says that the effects of the two factors occur additively; such a model has no interactions. An interaction is present whenever the additive model does not hold.

1.8.4 Interaction plot

In [16]: %%R -h 800 -w 800
interaction.plot(W, D, logdays, type='b', col=c('red', 'blue'), lwd=2, pch=c(23,24))
When these broken lines are not parallel, there is evidence of an interaction. The one thing missing from this plot is error bars. The broken lines above are clearly not parallel, but there is measurement error. If the error bars were large, we might consider there to be no interaction; otherwise, we might not.

1.8.5 Parameterization

Many constraints are needed, again for identifiability. Let's not worry too much about the details.

Constraints:

    Σ_{i=1}^{r} α_i = 0
    Σ_{j=1}^{m} β_j = 0
    Σ_{j=1}^{m} (αβ)_ij = 0, 1 ≤ i ≤ r
    Σ_{i=1}^{r} (αβ)_ij = 0, 1 ≤ j ≤ m.

We should convince ourselves that we now have exactly r · m free parameters.
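The parameter count can be checked numerically: under the constraints, the free parameters number 1 + (r − 1) + (m − 1) + (r − 1)(m − 1) = r·m, which is also the rank of the (deliberately over-parameterized) full indicator design matrix. A small illustrative sketch in Python, using the kidney example's r = 3, m = 2:

```python
# Sanity check: the two-way ANOVA mean structure has rank r*m.
# We build the over-parameterized design (intercept, all A indicators,
# all B indicators, all interaction indicators) and compute its rank.
import numpy as np

r, m, n = 3, 2, 10
rows = []
for i in range(r):
    for j in range(m):
        for _ in range(n):
            a = np.zeros(r); a[i] = 1.0          # indicator for factor A level i
            b = np.zeros(m); b[j] = 1.0          # indicator for factor B level j
            ab = np.outer(a, b).ravel()          # all r*m interaction indicators
            rows.append(np.concatenate(([1.0], a, b, ab)))
X = np.array(rows)                               # 1 + r + m + r*m = 12 columns
print(np.linalg.matrix_rank(X))                  # rank is r*m = 6, not 12
```

The rank equals 6 = 1 + 2 + 1 + 2, matching the count of free parameters once the sum-to-zero constraints are imposed.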
1.8.7 Fitting the model

The model is easy to fit when n_ij = n (balanced):

    Ŷ_ijk = Ȳ_ij· = (1/n) Σ_{k=1}^{n} Y_ijk.

Inference for combinations:

    Var(Σ_{i=1}^{r} Σ_{j=1}^{m} a_ij Ȳ_ij·) = (σ²/n) Σ_{i=1}^{r} Σ_{j=1}^{m} a_ij².

The usual t-tests and confidence intervals apply.

In [17]: %%R
kidney.lm = lm(logdays ~ D*W)
summary(kidney.lm)

Call:
lm(formula = logdays ~ D * W)

Residuals:
    Min      1Q  Median      3Q     Max

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                                 e-05 ***
D2
W2                                               *
W3                                          e-05 ***
D2:W2
D2:W3

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: on 54 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 5 and 54 DF, p-value: 2.301e

Example

Suppose we are interested in comparing the means in the (D = 1, W = 3) and (D = 2, W = 2) groups. The difference is E(Ȳ_13· − Ȳ_22·). By independence, its variance is

    Var(Ȳ_13·) + Var(Ȳ_22·) = 2σ²/n.
In [18]: %%R
estimates = predict(kidney.lm, list(D=factor(c(1,2)), W=factor(c(3,2))))
print(estimates)
sigma.hat = summary(kidney.lm)$sigma  # residual standard error from the table above
n = 10  # ten observations per group
fit = estimates[1] - estimates[2]
upper = fit + qt(0.975, 54) * sqrt(2 * sigma.hat^2 / n)
lower = fit - qt(0.975, 54) * sqrt(2 * sigma.hat^2 / n)
data.frame(fit, lower, upper)

fit lower upper

In [19]: %%R
head(model.matrix(kidney.lm))

(Intercept) D2 W2 W3 D2:W2 D2:W3

Finding predicted values

The most direct way to compute predicted values is with the predict function.

In [20]: %%R
predict(kidney.lm, list(D=factor(1), W=factor(1)), interval='confidence')

fit lwr upr

ANOVA table

In the balanced case, everything can again be summarized in the ANOVA table.

In [22]: make_table(anova_twoway)
         apply_theme('basic')

Out[22]: <ipy_table.IpyTable at 0x107d8c890>

Tests using the ANOVA table

Rows of the ANOVA table can be used to test the various hypotheses we started out with. For instance, we see that under H₀: (αβ)_ij = 0 for all i, j, the expected value of both MSAB and MSE is σ²; we use these for an F test for an interaction.
Under H₀,

    F = MSAB / MSE = (SSAB / ((m−1)(r−1))) / (SSE / ((n−1)mr)) ~ F_{(m−1)(r−1), (n−1)mr}.

In [23]: %%R
anova(kidney.lm)

Analysis of Variance Table

Response: logdays
          Df Sum Sq Mean Sq F value Pr(>F)
D          1                               *
W          2                          e-06 ***
D:W        2
Residuals 54

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can also test for interactions using our usual approach:

In [24]: %%R
anova(lm(logdays ~ D + W, kidney.table), kidney.lm)

Analysis of Variance Table

Model 1: logdays ~ D + W
Model 2: logdays ~ D * W
  Res.Df RSS Df Sum of Sq F Pr(>F)

Some caveats about R formulae

While it is straightforward to form the interaction test using our usual anova function approach, we generally cannot test for main effects this way.

In [25]: %%R
lm_no_main_weight = lm(logdays ~ D + W:D)
anova(lm_no_main_weight, kidney.lm)

Analysis of Variance Table

Model 1: logdays ~ D + W:D
Model 2: logdays ~ D * W
  Res.Df RSS Df Sum of Sq F Pr(>F)
                              e-15

In fact, these models are identical in terms of their planes and their fitted values. What has happened is that R has formed a different design matrix using its rules for formula objects.
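The claim that logdays ~ D + W:D and logdays ~ D * W give identical fitted values amounts to their design matrices spanning the same column space. A Python sketch checks this by comparing ranks; the indicator codings below are illustrative reconstructions of R's treatment coding, not its exact output.

```python
# Two design matrices span the same column space iff stacking them
# side by side does not increase the rank. Illustrated for factors
# D (2 levels) and W (3 levels), balanced with n per cell.
import numpy as np

D_levels, W_levels, n = 2, 3, 4
cells = [(d, w) for d in range(D_levels) for w in range(W_levels) for _ in range(n)]

def cols_D_star_W(d, w):
    # coding in the spirit of R's D * W: intercept, D2, W2, W3, D2:W2, D2:W3
    D2, W2, W3 = float(d == 1), float(w == 1), float(w == 2)
    return [1.0, D2, W2, W3, D2 * W2, D2 * W3]

def cols_D_plus_W_in_D(d, w):
    # coding in the spirit of R's D + W:D: intercept, D2, W-within-D indicators
    D2 = float(d == 1)
    nested = [float(d == i and w == j) for i in range(D_levels) for j in (1, 2)]
    return [1.0, D2] + nested

X1 = np.array([cols_D_star_W(d, w) for d, w in cells])
X2 = np.array([cols_D_plus_W_in_D(d, w) for d, w in cells])

r1 = np.linalg.matrix_rank(X1)
r2 = np.linalg.matrix_rank(X2)
rb = np.linalg.matrix_rank(np.hstack([X1, X2]))
print(r1, r2, rb)   # all three ranks are equal: same plane, same fitted values
```

Because the combined rank equals each individual rank, any least squares fit with one design matrix produces the same fitted values as the other; only the coefficients (and hence the sequential tests) differ.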
In [26]: %%R
lm1 = lm(logdays ~ D + W:D)
lm2 = lm(logdays ~ D + W:D + W)
anova(lm1, lm2)

Analysis of Variance Table

Model 1: logdays ~ D + W:D
Model 2: logdays ~ D + W:D + W
  Res.Df RSS Df Sum of Sq F Pr(>F)

1.9 ANOVA tables in general

So far, we have used anova to compare two models. In this section, we produce tables for just one model. This also works for any regression model, though we have to be a little careful about interpretation. Let's revisit the job aptitude test data from the last section.

In [27]: %%R
url = 
jobtest.table <- read.table(url, header=T)
jobtest.table$ETHN <- factor(jobtest.table$ETHN)
jobtest.lm = lm(JPERF ~ TEST * ETHN, jobtest.table)
summary(jobtest.lm)

Call:
lm(formula = JPERF ~ TEST * ETHN, data = jobtest.table)

Residuals:
    Min      1Q  Median      3Q     Max

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)
TEST
ETHN
TEST:ETHN

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: on 16 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 3 and 16 DF, p-value:

Now, let's look at the anova output. We'll see the results don't match.

In [28]: %%R
anova(jobtest.lm)
Analysis of Variance Table

Response: JPERF
          Df Sum Sq Mean Sq F value Pr(>F)
TEST                                       ***
ETHN
TEST:ETHN
Residuals

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The difference is in how the Sum Sq column is created. In the anova output, terms are added to the model sequentially. We can see this by comparing these two models directly. The F statistic doesn't agree because the MSE above is computed in the fullest model, but the Sum of Sq is correct.

In [29]: %%R
anova(lm(JPERF ~ TEST, jobtest.table), lm(JPERF ~ TEST + ETHN, jobtest.table))

Analysis of Variance Table

Model 1: JPERF ~ TEST
Model 2: JPERF ~ TEST + ETHN
  Res.Df RSS Df Sum of Sq F Pr(>F)

Similarly, the first Sum Sq entry in anova can be found by:

In [30]: %%R
anova(lm(JPERF ~ 1, jobtest.table), lm(JPERF ~ TEST, jobtest.table))

Analysis of Variance Table

Model 1: JPERF ~ 1
Model 2: JPERF ~ TEST
  Res.Df RSS Df Sum of Sq F Pr(>F)
                                   ***

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

There are ways to produce an ANOVA table whose p-values agree with summary. This is done with an ANOVA table that uses Type III sums of squares.

In [31]: %%R
library(car)
Anova(jobtest.lm, type=3)
Anova Table (Type III tests)

Response: JPERF
            Sum Sq Df F value Pr(>F)
(Intercept)
TEST
ETHN
TEST:ETHN
Residuals

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In [32]: %%R
summary(jobtest.lm)

Call:
lm(formula = JPERF ~ TEST * ETHN, data = jobtest.table)

Residuals:
    Min      1Q  Median      3Q     Max

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)
TEST
ETHN
TEST:ETHN

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: on 16 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 3 and 16 DF, p-value:

2 Fixed and random effects

In the kidney and rehab examples, the categorical variables are well-defined categories: below average fitness, long duration, etc. In some designs, the categorical variable is subject. The simplest example is repeated measures, where more than one (identical) measurement is taken on the same individual. In this case, the group effect α_i is best thought of as random, because we only sample a subset of the entire population.
2.0.1 When to use random effects?

A group effect is random if we can think of the levels we observe in that group as samples from a larger population.

Example: if collecting data from different medical centers, center might be thought of as random.
Example: if surveying students on different campuses, campus may be a random effect.

2.0.2 Example: sodium content in beer

How much sodium is there in North American beer? How much does this vary by brand? Observations: for 6 brands of beer, we recorded the sodium content of 8 12-ounce bottles. Questions of interest: what is the grand mean sodium content? How much variability is there from brand to brand? Individuals in this case are brands; the repeated measures are the 8 bottles.

In [33]: %%R
url = 
sodium.table = read.table(url, header=T)
sodium.table$brand = factor(sodium.table$brand)
sodium.lm = lm(sodium ~ brand, sodium.table)
anova(sodium.lm)

Analysis of Variance Table

Response: sodium
          Df Sum Sq Mean Sq F value Pr(>F)
brand      5                        < 2.2e-16 ***
Residuals 42

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

2.0.3 One-way random effects model

Assume cell sizes are the same, i.e. equal observations for each subject (brand of beer). Observations:

    Y_ij = µ + α_i + ε_ij,  1 ≤ i ≤ r, 1 ≤ j ≤ n
    ε_ij ~ N(0, σ²_ε),      1 ≤ i ≤ r, 1 ≤ j ≤ n
    α_i ~ N(0, σ²_α),       1 ≤ i ≤ r.

Parameters: µ is the population mean; σ²_ε is the measurement variance (i.e. how variable are the readings from the machine that reads the sodium content?); σ²_α is the population variance (i.e. how variable is the sodium content of beer across brands).
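Under this model, two bottles from the same brand are correlated through the shared α_i: each observation has variance σ²_α + σ²_ε, while a pair from the same brand has covariance σ²_α. A Monte Carlo sketch in Python with hypothetical parameter values (the beer data themselves are not used here):

```python
# Monte Carlo check of the implied variance and within-group covariance
# of the one-way random effects model. Hypothetical sigma values.
import numpy as np

rng = np.random.default_rng(0)
sigma_alpha, sigma_eps = 2.0, 1.0
reps = 200_000

# two observations from the same group share one alpha draw
alpha = rng.normal(0.0, sigma_alpha, size=reps)
y1 = alpha + rng.normal(0.0, sigma_eps, size=reps)
y2 = alpha + rng.normal(0.0, sigma_eps, size=reps)

print(np.var(y1))            # ~ sigma_alpha^2 + sigma_eps^2 = 5
print(np.cov(y1, y2)[0, 1])  # ~ sigma_alpha^2 = 4
```

The shared draw of alpha is what makes the within-group covariance equal σ²_α; observations from different groups would use independent alpha draws and be uncorrelated.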
2.0.4 Modelling the variance

In the random effects model, the observations are no longer independent (even if the ε's are):

    Cov(Y_ij, Y_i'j') = (σ²_α + σ²_ε δ_{j,j'}) δ_{i,i'}.

In more complicated models, this makes maximum likelihood estimation more complicated: least squares is no longer the best solution. It's no longer a plane! This model has a very simple model for the mean; it just has a slightly more complex model for the variance. Shortly we'll see other, more complex models of the variance: weighted least squares and correlated errors.

Fitting the model

The MLE (maximum likelihood estimator) is found by minimizing

    −2 log l(µ, σ²_ε, σ²_α | Y) = Σ_{i=1}^{r} [ (Y_i − µ)ᵀ (σ²_ε I_{n_i×n_i} + σ²_α 11ᵀ)⁻¹ (Y_i − µ) + log det(σ²_ε I_{n_i×n_i} + σ²_α 11ᵀ) ].

The function l(µ, σ²_ε, σ²_α) is called the likelihood function.

Fitting the model in a balanced design

There is only one parameter in the mean function, µ. When cell sizes are the same (balanced),

    µ̂ = Ȳ·· = (1/(nr)) Σ_{i,j} Y_ij.

For unbalanced models, use a numerical optimizer. This also changes the estimate of σ²_ε; see the ANOVA table. We might guess that df = nr − 1 and

    σ̂² = (1/(nr − 1)) Σ_{i,j} (Y_ij − Ȳ··)².

This is not correct.

In [34]: %%R
library(nlme)
sodium.lme = lme(fixed=sodium~1, random=~1|brand, data=sodium.table)
summary(sodium.lme)

Linear mixed-effects model fit by REML
Data: sodium.table
     AIC    BIC  logLik

Random effects:
Formula: ~1 | brand
        (Intercept) Residual
StdDev:

Fixed effects: sodium ~ 1
            Value Std.Error DF t-value p-value
(Intercept)

Standardized Within-Group Residuals:
   Min     Q1    Med     Q3    Max

Number of Observations: 48
Number of Groups: 6

For reasons I'm not sure of, the degrees of freedom don't agree with our ANOVA, though we do find the correct SE for our estimate of µ:

In [35]: %%R
MSTR = anova(sodium.lm)$"Mean Sq"[1]
sqrt(MSTR / 48)

[1]

The intervals formed by lme use the 42 degrees of freedom, but are otherwise the same:

In [36]: %%R
intervals(sodium.lme)

Approximate 95% confidence intervals

Fixed effects:
            lower est. upper
(Intercept)
attr(,"label")
[1] "Fixed effects:"

Random Effects:
  Level: brand
                lower est. upper
sd((Intercept))

Within-group standard error:
lower  est. upper

In [37]: %%R
center = mean(sodium.table$sodium)
lwr = center - sqrt(MSTR / 48) * qt(0.975, 42)
upr = center + sqrt(MSTR / 48) * qt(0.975, 42)
data.frame(lwr, center, upr)
lwr center upr

Using 7 as our degrees of freedom yields slightly wider intervals:

In [38]: %%R
center = mean(sodium.table$sodium)
lwr = center - sqrt(MSTR / 48) * qt(0.975, 7)
upr = center + sqrt(MSTR / 48) * qt(0.975, 7)
data.frame(lwr, center, upr)

lwr center upr

2.0.6 ANOVA table

Again, the information needed can be summarized in an ANOVA table.

In [40]: make_table(anova_oneway)
         apply_theme('basic')

Out[40]: <ipy_table.IpyTable at 0x107d8c990>

The ANOVA table is still useful for setting up tests: the same F statistics used for fixed effects also work here. The test for the random effect, H₀: σ²_α = 0, is based on

    F = MSTR / MSE ~ F_{r−1, (n−1)r} under H₀.

Inference for µ

It is easy to check that

    E(Ȳ··) = µ,   Var(Ȳ··) = (σ²_ε + nσ²_α) / (rn).

To come up with a t statistic that we can use for tests and CIs, we need to find an estimate of Var(Ȳ··). The ANOVA table says E(MSTR) = nσ²_α + σ²_ε, which suggests

    (Ȳ·· − µ) / sqrt(MSTR / (rn)) ~ t_{r−1}.

Degrees of freedom

Why r − 1 degrees of freedom? Imagine we could record an infinite number of observations for each individual, so that Ȳ_i· → µ + α_i. To learn anything about µ, we still only have r observations (µ₁, ..., µ_r). Sampling more within an individual cannot narrow the CI for µ.
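The claim that (Ȳ·· − µ)/sqrt(MSTR/(rn)) has a t distribution with r − 1 degrees of freedom can be checked by simulation: the corresponding 95% interval should cover µ about 95% of the time, no matter how large n is. A Python sketch with hypothetical parameter values:

```python
# Coverage check for the t_{r-1} interval based on MSTR/(r*n).
# Hypothetical mu, sigma_alpha, sigma_eps; r groups of n observations.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
r, n = 6, 8                       # e.g. 6 brands, 8 bottles each
mu, s_alpha, s_eps = 18.0, 3.0, 1.0
tcrit = stats.t.ppf(0.975, r - 1)

covered = 0
trials = 2000
for _ in range(trials):
    alpha = rng.normal(0.0, s_alpha, size=r)
    Y = mu + alpha[:, None] + rng.normal(0.0, s_eps, size=(r, n))
    grand = Y.mean()
    # MSTR: n times the sample variance of the group means
    mstr = n * ((Y.mean(axis=1) - grand) ** 2).sum() / (r - 1)
    half = tcrit * np.sqrt(mstr / (r * n))
    covered += (grand - half <= mu <= grand + half)
print(covered / trials)           # close to 0.95
```

Increasing n while holding r fixed does not improve the coverage-width tradeoff in µ, which is the point of the degrees-of-freedom discussion above.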
Estimating σ²_α

We have seen estimates of µ and σ²_ε. Only one parameter remains. Based on the ANOVA table, we see that

    σ²_α = (1/n)(E(MSTR) − E(MSE)).

This suggests the estimate

    σ̂²_α = (1/n)(MSTR − MSE).

However, this estimate can be negative! Many such computational difficulties arise in random (and mixed) effects models.

2.1 Mixed effects model

The one-way random effects ANOVA is a special case of a so-called mixed effects model:

    Y_{n×1} = X_{n×p} β_{p×1} + Z_{n×q} γ_{q×1},   γ ~ N(0, Σ).

Various models also consider restrictions on Σ (e.g. diagonal, unrestricted, block diagonal, etc.). Our multiple linear regression model is a (very simple) mixed effects model with q = n, Z = I_{n×n}, and Σ = σ² I_{n×n}.
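The method-of-moments estimate σ̂²_α = (MSTR − MSE)/n is easy to compute by hand. A Python sketch on simulated data with hypothetical parameter values, clamping the estimate at zero when MSTR < MSE (one common, if crude, fix for the negativity problem noted above):

```python
# Method-of-moments variance component estimate for the one-way
# random effects model, on simulated data (hypothetical sigmas).
import numpy as np

rng = np.random.default_rng(2)
r, n = 6, 8
s_alpha, s_eps = 2.0, 1.0
Y = 10.0 + rng.normal(0.0, s_alpha, size=(r, 1)) \
         + rng.normal(0.0, s_eps, size=(r, n))

grand = Y.mean()
group_means = Y.mean(axis=1)
mstr = n * ((group_means - grand) ** 2).sum() / (r - 1)   # between-group MS
mse = ((Y - group_means[:, None]) ** 2).sum() / (r * (n - 1))  # within-group MS

sigma_alpha_hat2 = (mstr - mse) / n
print(max(sigma_alpha_hat2, 0.0))   # estimate of sigma_alpha^2 (true value 4)
```

With only r = 6 groups the estimate is quite variable, and when σ²_α is small relative to σ²_ε the raw estimate really does land below zero with non-trivial probability; that is what motivates REML and likelihood-based fits such as lme.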
More informationAn Sweave Demo. Charles J. Geyer. July 27, latex
An Sweave Demo Charles J. Geyer July 27, 2010 This is a demo for using the Sweave command in R. To get started make a regular L A TEX file (like this one) but give it the suffix.rnw instead of.tex and
More informationTesting for Lack of Fit
Chapter 6 Testing for Lack of Fit How can we tell if a model fits the data? If the model is correct then ˆσ 2 should be an unbiased estimate of σ 2. If we have a model which is not complex enough to fit
More information" Y. Notation and Equations for Regression Lecture 11/4. Notation:
Notation: Notation and Equations for Regression Lecture 11/4 m: The number of predictor variables in a regression Xi: One of multiple predictor variables. The subscript i represents any number from 1 through
More informationSimple Linear Regression
Inference for Regression Simple Linear Regression IPS Chapter 10.1 2009 W.H. Freeman and Company Objectives (IPS Chapter 10.1) Simple linear regression Statistical model for linear regression Estimating
More informationStatistics in Geophysics: Linear Regression II
Statistics in Geophysics: Linear Regression II Steffen Unkel Department of Statistics LudwigMaximiliansUniversity Munich, Germany Winter Term 2013/14 1/28 Model definition Suppose we have the following
More informationChapter 11: Linear Regression  Inference in Regression Analysis  Part 2
Chapter 11: Linear Regression  Inference in Regression Analysis  Part 2 Note: Whether we calculate confidence intervals or perform hypothesis tests we need the distribution of the statistic we will use.
More informationData Analysis Tools. Tools for Summarizing Data
Data Analysis Tools This section of the notes is meant to introduce you to many of the tools that are provided by Excel under the Tools/Data Analysis menu item. If your computer does not have that tool
More informationANOVA Designs  Part II. Nested Designs. Nested Designs. Nested Designs (NEST) Design Linear Model Computation
ANOVA Designs  Part II Nested Designs (NEST) Design Linear Model Computation Example NCSS s (FACT) Design Linear Model Computation Example NCSS RCB Factorial (Combinatorial Designs) Nested Designs A nested
More information, then the form of the model is given by: which comprises a deterministic component involving the three regression coefficients (
Multiple regression Introduction Multiple regression is a logical extension of the principles of simple linear regression to situations in which there are several predictor variables. For instance if we
More informationStatistics 112 Regression Cheatsheet Section 1B  Ryan Rosario
Statistics 112 Regression Cheatsheet Section 1B  Ryan Rosario I have found that the best way to practice regression is by brute force That is, given nothing but a dataset and your mind, compute everything
More informationPart II. Multiple Linear Regression
Part II Multiple Linear Regression 86 Chapter 7 Multiple Regression A multiple linear regression model is a linear model that describes how a yvariable relates to two or more xvariables (or transformations
More informationEXPECTED MEAN SQUARES AND MIXED MODEL ANALYSES. This will become more important later in the course when we discuss interactions.
EXPECTED MEN SQURES ND MIXED MODEL NLYSES Fixed vs. Random Effects The choice of labeling a factor as a fixed or random effect will affect how you will make the Ftest. This will become more important
More informationSimple Linear Regression Inference
Simple Linear Regression Inference 1 Inference requirements The Normality assumption of the stochastic term e is needed for inference even if it is not a OLS requirement. Therefore we have: Interpretation
More informationMultiple Linear Regression in Data Mining
Multiple Linear Regression in Data Mining Contents 2.1. A Review of Multiple Linear Regression 2.2. Illustration of the Regression Process 2.3. Subset Selection in Linear Regression 1 2 Chap. 2 Multiple
More informationMath 141. Lecture 24: Model Comparisons and The Ftest. Albyn Jones 1. 1 Library jones/courses/141
Math 141 Lecture 24: Model Comparisons and The Ftest Albyn Jones 1 1 Library 304 jones@reed.edu www.people.reed.edu/ jones/courses/141 Nested Models Two linear models are Nested if one (the restricted
More informationQuantitative Understanding in Biology Module II: Model Parameter Estimation Lecture I: Linear Correlation and Regression
Quantitative Understanding in Biology Module II: Model Parameter Estimation Lecture I: Linear Correlation and Regression Correlation Linear correlation and linear regression are often confused, mostly
More informationSTATISTICA Formula Guide: Logistic Regression. Table of Contents
: Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 SigmaRestricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary
More informationRegression stepbystep using Microsoft Excel
Step 1: Regression stepbystep using Microsoft Excel Notes prepared by Pamela Peterson Drake, James Madison University Type the data into the spreadsheet The example used throughout this How to is a regression
More informationTwoVariable Regression: Interval Estimation and Hypothesis Testing
TwoVariable Regression: Interval Estimation and Hypothesis Testing Jamie Monogan University of Georgia Intermediate Political Methodology Jamie Monogan (UGA) Confidence Intervals & Hypothesis Testing
More information(d) True or false? When the number of treatments a=9, the number of blocks b=10, and the other parameters r =10 and k=9, it is a BIBD design.
PhD Qualifying exam Methodology Jan 2014 Solutions 1. True or false question  only circle "true " or "false" (a) True or false? Fstatistic can be used for checking the equality of two population variances
More informationLecture Outline (week 13)
Lecture Outline (week 3) Analysis of Covariance in Randomized studies Mixed models: Randomized block models Repeated Measures models Pretestposttest models Analysis of Covariance in Randomized studies
More informationFinal Exam Practice Problem Answers
Final Exam Practice Problem Answers The following data set consists of data gathered from 77 popular breakfast cereals. The variables in the data set are as follows: Brand: The brand name of the cereal
More informationInferential Statistics
Inferential Statistics Sampling and the normal distribution Zscores Confidence levels and intervals Hypothesis testing Commonly used statistical methods Inferential Statistics Descriptive statistics are
More informationLecture 7: Binomial Test, Chisquare
Lecture 7: Binomial Test, Chisquare Test, and ANOVA May, 01 GENOME 560, Spring 01 Goals ANOVA Binomial test Chi square test Fisher s exact test Su In Lee, CSE & GS suinlee@uw.edu 1 Whirlwind Tour of One/Two
More informationdata visualization and regression
data visualization and regression Sepal.Length 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0 I. setosa I. versicolor I. virginica I. setosa I. versicolor I. virginica Species Species
More informationn + n log(2π) + n log(rss/n)
There is a discrepancy in R output from the functions step, AIC, and BIC over how to compute the AIC. The discrepancy is not very important, because it involves a difference of a constant factor that cancels
More informationChapter 5 Analysis of variance SPSS Analysis of variance
Chapter 5 Analysis of variance SPSS Analysis of variance Data file used: gss.sav How to get there: Analyze Compare Means Oneway ANOVA To test the null hypothesis that several population means are equal,
More informationExperimental Designs (revisited)
Introduction to ANOVA Copyright 2000, 2011, J. Toby Mordkoff Probably, the best way to start thinking about ANOVA is in terms of factors with levels. (I say this because this is how they are described
More informationRegression in SPSS. Workshop offered by the Mississippi Center for Supercomputing Research and the UM Office of Information Technology
Regression in SPSS Workshop offered by the Mississippi Center for Supercomputing Research and the UM Office of Information Technology John P. Bentley Department of Pharmacy Administration University of
More informationCS 147: Computer Systems Performance Analysis
CS 147: Computer Systems Performance Analysis OneFactor Experiments CS 147: Computer Systems Performance Analysis OneFactor Experiments 1 / 42 Overview Introduction Overview Overview Introduction Finding
More informationAnalysis of Variance. MINITAB User s Guide 2 31
3 Analysis of Variance Analysis of Variance Overview, 32 OneWay Analysis of Variance, 35 TwoWay Analysis of Variance, 311 Analysis of Means, 313 Overview of Balanced ANOVA and GLM, 318 Balanced
More informationSimple Predictive Analytics Curtis Seare
Using Excel to Solve Business Problems: Simple Predictive Analytics Curtis Seare Copyright: Vault Analytics July 2010 Contents Section I: Background Information Why use Predictive Analytics? How to use
More informationThe F distribution and the basic principle behind ANOVAs. Situating ANOVAs in the world of statistical tests
Tutorial The F distribution and the basic principle behind ANOVAs Bodo Winter 1 Updates: September 21, 2011; January 23, 2014; April 24, 2014; March 2, 2015 This tutorial focuses on understanding rather
More informationMultivariate Logistic Regression
1 Multivariate Logistic Regression As in univariate logistic regression, let π(x) represent the probability of an event that depends on p covariates or independent variables. Then, using an inv.logit formulation
More informationANALYSING LIKERT SCALE/TYPE DATA, ORDINAL LOGISTIC REGRESSION EXAMPLE IN R.
ANALYSING LIKERT SCALE/TYPE DATA, ORDINAL LOGISTIC REGRESSION EXAMPLE IN R. 1. Motivation. Likert items are used to measure respondents attitudes to a particular question or statement. One must recall
More informationThe scatterplot indicates a positive linear relationship between waist size and body fat percentage:
STAT E150 Statistical Methods Multiple Regression Three percent of a man's body is essential fat, which is necessary for a healthy body. However, too much body fat can be dangerous. For men between the
More informationStatistical Functions in Excel
Statistical Functions in Excel There are many statistical functions in Excel. Moreover, there are other functions that are not specified as statistical functions that are helpful in some statistical analyses.
More informationCHAPTER 13. Experimental Design and Analysis of Variance
CHAPTER 13 Experimental Design and Analysis of Variance CONTENTS STATISTICS IN PRACTICE: BURKE MARKETING SERVICES, INC. 13.1 AN INTRODUCTION TO EXPERIMENTAL DESIGN AND ANALYSIS OF VARIANCE Data Collection
More informationT O P I C 1 2 Techniques and tools for data analysis Preview Introduction In chapter 3 of Statistics In A Day different combinations of numbers and types of variables are presented. We go through these
More informationLesson 1: Comparison of Population Means Part c: Comparison of Two Means
Lesson : Comparison of Population Means Part c: Comparison of Two Means Welcome to lesson c. This third lesson of lesson will discuss hypothesis testing for two independent means. Steps in Hypothesis
More informationSection 13, Part 1 ANOVA. Analysis Of Variance
Section 13, Part 1 ANOVA Analysis Of Variance Course Overview So far in this course we ve covered: Descriptive statistics Summary statistics Tables and Graphs Probability Probability Rules Probability
More informationMODEL I: DRINK REGRESSED ON GPA & MALE, WITHOUT CENTERING
Interpreting Interaction Effects; Interaction Effects and Centering Richard Williams, University of Notre Dame, http://www3.nd.edu/~rwilliam/ Last revised February 20, 2015 Models with interaction effects
More informationLinear Regression with One Regressor
Linear Regression with One Regressor Michael Ash Lecture 10 Analogy to the Mean True parameter µ Y β 0 and β 1 Meaning Central tendency Intercept and slope E(Y ) E(Y X ) = β 0 + β 1 X Data Y i (X i, Y
More informationKSTAT MINIMANUAL. Decision Sciences 434 Kellogg Graduate School of Management
KSTAT MINIMANUAL Decision Sciences 434 Kellogg Graduate School of Management Kstat is a set of macros added to Excel and it will enable you to do the statistics required for this course very easily. To
More informationIntroduction to Regression and Data Analysis
Statlab Workshop Introduction to Regression and Data Analysis with Dan Campbell and Sherlock Campbell October 28, 2008 I. The basics A. Types of variables Your variables may take several forms, and it
More informatione = random error, assumed to be normally distributed with mean 0 and standard deviation σ
1 Linear Regression 1.1 Simple Linear Regression Model The linear regression model is applied if we want to model a numeric response variable and its dependency on at least one numeric factor variable.
More informationDidacticiel  Études de cas
1 Topic Regression analysis with LazStats (OpenStat). LazStat 1 is a statistical software which is developed by Bill Miller, the father of OpenStat, a wellknow tool by statisticians since many years. These
More informationSAS Syntax and Output for Data Manipulation:
Psyc 944 Example 5 page 1 Practice with Fixed and Random Effects of Time in Modeling WithinPerson Change The models for this example come from Hoffman (in preparation) chapter 5. We will be examining
More information