Introduction to proc glm




Lab 7: Proc GLM and one-way ANOVA
STT 422: Summer, 2004
Vince Melfi

SAS has several procedures for analysis of variance models, including proc anova, proc glm, proc varcomp, and proc mixed. We will mainly use proc glm and proc mixed, which the SAS manual terms the "flagship" procedures for analysis of variance. In this lab we'll learn about proc glm and see how to use it to fit one-way analysis of variance models.

Introduction to proc glm

The "glm" in proc glm stands for "general linear models." Included in this category are multiple linear regression models and many analysis of variance models. In fact, we'll start by using proc glm to fit an ordinary multiple regression model. Here is a description of the data we'll use, which is taken from the SAS manual:

*-------------------Data on Physical Fitness-------------------*
These measurements were made on men involved in a physical
fitness course at N.C. State Univ. The variables are Age
(years), Weight (kg), Oxygen intake rate (ml per kg body
weight per minute), time to run 1.5 miles (minutes), heart
rate while resting, heart rate while running (same time
Oxygen rate measured), and maximum heart rate recorded
while running.
***Certain values of MaxPulse have been changed.
*--------------------------------------------------------------*;

The following SAS program reads in the data, fits a regression model using proc reg with Oxygen as the response and RunTime and Weight as predictors, and then fits the same model using proc glm.[1] Look at the output of both.

data oxygen;
   infile 'u:\msu\course\stt\422\summer04\oxy.dat';
   input Age Weight Oxygen RunTime RestPulse RunPulse MaxPulse;
run;

proc reg data = oxygen;
   model Oxygen = RunTime Weight;
run;

proc glm data = oxygen;
   model Oxygen = RunTime Weight;
run;

[1] You may be wondering why we didn't use proc glm to fit regression models throughout. The proc reg procedure is more convenient and more powerful for the usual multiple regression model.
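Both procedures solve the same least-squares problem, which is why their fits agree. As a language-neutral illustration of that point, here is a minimal sketch in Python with a few made-up observations standing in for the fitness data (these are not the real values); any routine that minimizes the same sum of squared residuals returns the same coefficients.

```python
import numpy as np

# Made-up stand-ins for the fitness data: Oxygen regressed on
# RunTime and Weight (illustrative values only).
runtime = np.array([11.4, 10.1, 12.9, 9.2, 10.8])
weight = np.array([89.5, 75.1, 82.8, 68.2, 79.4])
oxygen = np.array([44.6, 49.9, 39.4, 52.3, 46.7])

# Design matrix with an intercept column. proc reg and proc glm both
# minimize ||oxygen - X @ beta||^2, so they produce identical fits.
X = np.column_stack([np.ones_like(runtime), runtime, weight])
beta, *_ = np.linalg.lstsq(X, oxygen, rcond=None)

print(beta)  # intercept, RunTime slope, Weight slope
```

The defining property of the least-squares solution is that the residuals are orthogonal to every column of the design matrix, regardless of which procedure computed it.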

Recall that it is possible to cast the analysis of variance model in the context of multiple regression by using indicator variables as the predictors. This is the approach taken by proc glm. We won't need to be overly concerned with the details of how proc glm fits analysis of variance models, but rather with understanding how to specify models and interpret the output. Later in this lab, though, we'll learn a bit about the regression approach to analysis of variance.

One-way analysis of variance using proc glm

We'll investigate one-way analysis of variance using Example 12.6 from the text. The data give the scores of students on a reading comprehension test. Students were taught using one of three teaching methods, called Basal, DRTA, and Strat. The goal is to investigate whether there are differences in scores among the three groups and, if so, what the differences are. First read in the data and look at it.

data reading;
   infile 'u:\msu\course\stt\422\summer04\reading.dat';
   input group $ score;
run;

proc print data = reading;
run;

Next, look at some plots and basic descriptive statistics to investigate the data. The plots are particularly uninformative in this example, in part because the score variable has so few possible values. None of the output suggests a big difference between the scores for the three treatments.

proc boxplot data = reading;
   plot score * group;
run;

proc gplot data = reading;
   plot score * group;
run;

proc means data = reading;
   var score;
   by group;
run;

Next fit a one-way analysis of variance model using proc glm. First we must tell SAS which variable is the classification variable, i.e., the variable that indicates the teaching method. Then the model is specified, similar to a regression model. The means statement asks SAS to print the sample sizes, means, and standard deviations separately for each of the three groups.

proc glm data = reading;
   class group;
   model score = group;
   means group;
run;

The F test with p-value of 0.3288 confirms the suspicion, based on the plots and descriptive statistics, that there isn't any difference in the mean score for the three treatments.

The regression approach to ANOVA; constraints

There are many ways to write the one-way analysis of variance model as a multiple regression model. For many purposes it doesn't matter which of these is chosen, but it's worth thinking a bit about the various options, since the choice has an effect on how we interpret the parameters of the model. For the sake of concreteness, consider the reading data set from above, for which the teaching method variable has three levels. Since in our data there are three levels and 22 observations per level, we can write the one-way analysis of variance model as

   y_ij = μ + τ_i + ε_ij,   i = 1, 2, 3;   j = 1, ..., 22.

This model is equivalent to a multiple regression model with predictors x1, x2, x3 which are dummy variables for the treatments. For example, x1 would be 1 when the data come from treatment 1, and 0 otherwise. Since the model has four parameters (μ, τ_1, τ_2, τ_3 in the ANOVA terminology; β_0, β_1, β_2, β_3 in the regression notation) but there are only three populations of interest, the model is said to be overspecified. One can just live with this fact (this is what SAS does when we fit an analysis of variance model using proc glm), or one can introduce constraints on the model to reduce the number of parameters.

Setting one parameter to 0

Conceptually, the simplest constraint is to set one of the τ parameters to 0. For simplicity, we'll set τ_3 = 0. In the regression setting this leaves the predictors x1 and x2. We'll fit this model in SAS. First we create the dummy variables and view them. Make sure you look at the output of the print statement to see what the variables are. Note that we're calling the new data set reading2 to distinguish it from the original data set.
data reading2;
   set reading;
   x1 = (group = 'Basal');
   x2 = (group = 'DRTA');
run;

proc print data = reading2;
run;

Now we fit the regression model with x1 and x2 as predictors. We're also computing the means of the scores for the three teaching methods, for reasons that will be apparent soon.

proc reg data = reading2;
   model score = x1 x2;
run;

proc means data = reading2;
   var score;
   by group;
run;

Look at the output of proc reg carefully. Note that the sums of squares and the F test are exactly the same as we got from proc glm above. What about the parameter estimates? Compare them with the separate group means. You'll find that the intercept estimate equals the mean of the scores for the third teaching method; the slope estimate for x1 is the difference between the mean score for the first teaching method and the mean score for the third; and the slope estimate for x2 is the difference between the mean score for the second teaching method and the mean score for the third. Symbolically, writing ȳ_i. for the sample mean of group i,

   b_0 = ȳ_3.
   b_1 = ȳ_1. − ȳ_3.
   b_2 = ȳ_2. − ȳ_3.

The method we've just investigated is equivalent to the method that SAS uses in proc glm, although it's hard to discover this from the SAS documentation. The solution option in the model statement of proc glm asks for parameter estimates, as in the following program. Compare the parameter estimates from proc glm to the parameter estimates from proc reg above.

proc glm data = reading;
   class group;
   model score = group / solution;
run;

Setting the sum of the parameters to 0

A constraint that many people find more appealing conceptually is to set the sum of the τ parameters to 0: τ_1 + τ_2 + ... + τ_I = 0. In terms of the analysis of variance model, this makes μ the mean of all the individual group means, and makes τ_i the deviation of the ith group's mean from the overall mean. Let's again consider this in the context of the reading data. If τ_1 + τ_2 + τ_3 = 0, then τ_3 = −τ_1 − τ_2, so data from the third group can be represented as

   y_3j = μ − τ_1 − τ_2 + ε_3j.

In terms of dummy variables, we'll want to set

   x1 =  1 if the data come from group 1;
         0 if the data come from group 2;
        −1 if the data come from group 3;

and

   x2 =  0 if the data come from group 1;
         1 if the data come from group 2;
        −1 if the data come from group 3.

The SAS program below creates the dummy variables in the data set reading3, prints the data, and then fits the regression of y on x1 and x2.

data reading3;
   set reading;
   x1 = (group = 'Basal');
   x2 = (group = 'DRTA');
   x3 = (group = 'Strat');
   x1 = x1 - x3;
   x2 = x2 - x3;
run;

proc print data = reading3;
run;

proc reg data = reading3;
   model score = x1 x2;
run;

Again, note that the sums of squares and the F test are the same as those from proc glm. In this case the parameter estimates are

   b_0 = (1/3)(ȳ_1. + ȳ_2. + ȳ_3.)
   b_1 = ȳ_1. − ȳ_..
   b_2 = ȳ_2. − ȳ_..

where ȳ_.. is the overall mean.
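The two constraint choices above can be checked numerically. The following sketch (Python with small made-up balanced data, not the real reading scores) fits both codings by least squares: the reference-cell coding recovers (ȳ_3., ȳ_1. − ȳ_3., ȳ_2. − ȳ_3.), the sum-to-zero coding recovers the mean of the group means and the deviations from it, and both codings give the same fitted values, which is why the sums of squares and the F test are unchanged.

```python
import numpy as np

# Made-up balanced data: three observations per "teaching method".
y = np.array([10.0, 12.0, 11.0, 9.0, 14.0, 13.0, 8.0, 7.0, 9.0])
group = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3])
m1, m2, m3 = (y[group == g].mean() for g in (1, 2, 3))

def ols(X, y):
    """Least-squares coefficients for design matrix X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Coding 1: tau_3 = 0 (reference-cell coding, the parameterization
# reported by proc glm's 'solution' option). Plain 0/1 dummies.
x1 = (group == 1).astype(float)
x2 = (group == 2).astype(float)
X_ref = np.column_stack([np.ones_like(y), x1, x2])
b_ref = ols(X_ref, y)  # (m3, m1 - m3, m2 - m3)

# Coding 2: tau_1 + tau_2 + tau_3 = 0 (sum-to-zero coding, as in the
# reading3 data set): subtract the group-3 dummy from x1 and x2.
x3 = (group == 3).astype(float)
X_sum = np.column_stack([np.ones_like(y), x1 - x3, x2 - x3])
b_sum = ols(X_sum, y)  # (mean of means, m1 - b0, m2 - b0)

print(b_ref, b_sum)
```

Because the two design matrices span the same column space (each can represent an arbitrary set of three group means), the fitted values, residuals, and hence all sums of squares are identical; only the interpretation of the coefficients changes.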