# Statistical Models in R


## Transcription

1 Statistical Models in R Some Examples Steven Buechler Department of Mathematics 276B Hurley Hall; Fall, 2007

2 Outline Statistical Models Linear Models in R

3 Regression Regression analysis is the appropriate statistical method when the response variable and all explanatory variables are continuous. Here, we only discuss linear regression, the simplest and most common form. Remember that a statistical model attempts to approximate the response variable Y as a mathematical function of the explanatory variables X_1, ..., X_n. This mathematical function may involve parameters. Regression analysis attempts to use sample data to find the parameters that produce the best model.

4 Linear Models The simplest such model is a linear model with a single explanatory variable, which takes the following form: ŷ = a + bx. Here, y is the response variable vector, x the explanatory variable, ŷ the vector of fitted values, and a (intercept) and b (slope) are real numbers. Plotting y versus x, this model represents a line through the points. For a given index i, ŷ_i = a + b x_i approximates y_i. Regression amounts to finding the a and b that give the best fit.
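A minimal sketch of fitting this one-variable model in R. The data are simulated for illustration only: the vectors `x` and `y`, the seed, and the true values a = -2 and b = 2 are assumptions, not the test data used in the slides.

```r
# Simulated data: the true line is y = -2 + 2x plus noise (illustrative values).
set.seed(1)
x <- seq(0, 4, length.out = 20)
y <- -2 + 2 * x + rnorm(20, sd = 0.5)

# Fit y-hat = a + b*x by least squares.
fit <- lm(y ~ x)
a <- unname(coef(fit)[1])   # estimated intercept
b <- unname(coef(fit)[2])   # estimated slope

# Fitted values: y-hat_i = a + b*x_i for each index i.
y_hat <- a + b * x
```

The hand-computed `y_hat` should agree with `fitted(fit)`, which is the vector R stores for the same quantity.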

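5 Linear Model with 1 Explanatory Variable [Figure: scatter plot of y versus x with the fitted line; at x = 2 the observed value y and the fitted value y-hat are marked.]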

6 Plotting Commands for the record The plot was generated with test data xr, yr with:

    > plot(xr, yr, xlab = "x", ylab = "y")
    > abline(v = 2, lty = 2)
    > abline(a = -2, b = 2, col = "blue")
    > points(c(2), yr[9], pch = 16, col = "red")
    > points(c(2), c(2), pch = 16, col = "red")
    > text(2.5, -4, "x=2", cex = 1.5)
    > text(1.8, 3.9, "y", cex = 1.5)
    > text(2.5, 1.9, "y-hat", cex = 1.5)

7 Linear Regression = Minimize RSS Least Squares Fit In linear regression the best fit is found by minimizing

$$\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \bigl(y_i - (a + b x_i)\bigr)^2.$$

This is a Calculus I problem. There is a unique minimum and unique a and b achieving the minimum.
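To make the calculus concrete, here is a small sketch that solves the two first-order conditions (the partial derivatives of RSS with respect to a and b set to zero) in closed form and compares the result with `lm`. The simulated data and the helper `rss()` are illustrative assumptions.

```r
# Simulated data (illustrative values only).
set.seed(2)
x <- runif(30, 0, 10)
y <- 1 + 0.5 * x + rnorm(30)

# Closed-form least squares solution for one explanatory variable.
b <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
a <- mean(y) - b * mean(x)

# RSS as a function of the two parameters (hypothetical helper).
rss <- function(a, b) sum((y - (a + b * x))^2)

# lm() should find the same a and b.
fit <- lm(y ~ x)
```

Moving either parameter away from the solution, for example `rss(a + 0.1, b)`, can only increase RSS.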

8 No Difference with Multiple Explanatory Variables Suppose there are k continuous explanatory variables, x_1, ..., x_k. A linear model of y in these variables is of the form

$$\hat{y} = a + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k.$$

The multiple linear regression problem is to find a, b_1, ..., b_k that minimize RSS. With the mild assumption of linear independence among x_1, ..., x_k, there is again a unique solution. It's just a problem in matrix algebra rather than high school algebra.
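The matrix algebra can be sketched as follows, using the normal equations (X'X) beta = X'y. The data are simulated and the variable names are assumptions; note that `lm` itself uses a QR decomposition rather than this direct solve, so the sketch illustrates the problem, not R's internal method.

```r
# Simulated data with three explanatory variables (illustrative values only).
set.seed(3)
n  <- 20
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + 0.5 * x3 + rnorm(n, sd = 0.3)

# Design matrix: a column of 1s for the intercept a, then the variables.
X <- cbind(1, x1, x2, x3)

# Normal equations (X'X) beta = X'y. The solution is unique when the
# columns of X are linearly independent.
beta <- solve(crossprod(X), crossprod(X, y))

# Same coefficients from lm().
fit <- lm(y ~ x1 + x2 + x3)
```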

9 Is the Model Predictive? No assumptions have yet been made about the distribution of y or any other statistical properties. In modeling we want to calculate a model (a and b_1, ..., b_k) from the sample data and claim the same relationship holds for other data, within a certain error. Is it reasonable to assume the linear relationship generalizes? Given that the variables represent sample data there is some uncertainty in the coefficients. With other sample data we may get other coefficients. What is the error in estimating the coefficients? Both of these issues can be addressed with some additional assumptions.


12 Assumptions in Linear Models Given a linear model ŷ = a + b_1 x_1 + ... + b_k x_k of the response variable y, the validity of the model depends on the following assumptions. Recall: the residual vector is y - ŷ.

Homoscedasticity (Constant Variance): The variance of the residuals is constant across the indices. The points should be evenly distributed around the mean. Plotting residuals versus fitted values is a good test.

Normality of Errors: The residuals are normally distributed.


15 Assessment Methods These conditions are checked for R linear fit models with plots, illustrated later. If a plot of residuals versus fitted values shows a dependence pattern then a linear model is likely invalid. Try transforming the variables, e.g., fit log(y) instead of y, or include more complicated explanatory variables, like x_1^2 or x_1 x_2. With normality of residuals, RSS, scaled by the error variance, follows a chi-squared distribution. This can be used as a measure of the model's quality and to compare linear models with different sets of explanatory variables.


18 Linear Models in R Given: A response variable Y and explanatory variables X1, X2, ..., Xk from continuous random variables. A linear regression of Y on X1, X2, ..., Xk is executed by the following command.

    > lmfit <- lm(Y ~ X1 + ... + Xk)

The values of the estimated coefficients and statistics measuring the goodness of fit are revealed through summary(lmfit).

19 Example Problem There is one response variable yy and five explanatory variables x1, x2, x3, x4, x5, all of length 20. The linear fit is executed by

    > lmfit1 <- lm(yy ~ x1 + x2 + x3 + x4 + x5)

20 Results of the Linear Fit

    > summary(lmfit1)
    Call:
    lm(formula = yy ~ x1 + x2 + x3 + x4 + x5)

[Output: a Residuals block (Min, 1Q, Median, 3Q, Max) and a Coefficients table with columns Estimate, Std. Error, t value, Pr(>|t|) and rows (Intercept), x1, x2, x3, x4, x5; numeric values not recoverable.]

21 Results of the Linear Fit continued

    (Intercept) ***
    x1          *
    x2          ***
    x3          **
    x4          .
    x5          ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: [value missing] on 14 degrees of freedom
    Multiple R-Squared: 0.974, Adjusted R-squared: [value missing]
    F-statistic: 106 on 5 and 14 DF, p-value: 1.30e-10

22 What Class is lmfit1?

    > class(lmfit1)
    [1] "lm"
    > names(lmfit1)
     [1] "coefficients"  "residuals"
     [3] "effects"       "rank"
     [5] "fitted.values" "assign"
     [7] "qr"            "df.residual"
     [9] "xlevels"       "call"
    [11] "terms"         "model"

These can be used to extract individual components, e.g., lmfit1$fitted.values is the vector of fitted values, the hat vector.
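A short sketch of pulling components out of an `lm` object. The object `fit` and its simulated data are assumptions standing in for lmfit1, whose data are not given here.

```r
# Simulated stand-in for the slides' model (illustrative values only).
set.seed(4)
x <- 1:20
y <- 3 + 0.7 * x + rnorm(20)
fit <- lm(y ~ x)

# Components by name; coef(), fitted(), and residuals() are the
# accessor functions that return the same values.
fit$coefficients
head(fit$fitted.values)
head(fit$residuals)

# Residual degrees of freedom: 20 observations - 2 estimated parameters.
fit$df.residual
```

Fitted values plus residuals reproduce the response, since the residual vector is y minus the hat vector.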

23 Explanation of Coefficients The Estimate column gives the model's estimate of a (Intercept) and b1, b2, b3, b4, b5. The vector of fitted values is yy-hat = a + b1*x1 + b2*x2 + b3*x3 + b4*x4 + b5*x5, with the estimates substituted for the coefficients. From the assumed normal distribution of the residuals it's possible to estimate the error in the coefficients (see the second column). The t test is a test of the null hypothesis that the coefficient is 0. If the p-value in the fourth column is < 0.05 then the variable is significant enough to be included in the model.

24 Measures of Fit Quality Several parameters are given that measure the quality of the fit. The distribution of values of the residuals is given. The model degrees of freedom, df, is the length of yy minus the number of parameters calculated in the model. Here this is 20 - 6 = 14. By definition the residual standard error is

$$\sqrt{\frac{\mathrm{RSS}}{df}}.$$

Clearly, it's good when this is small.
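As a check on the definition, the residual standard error can be computed by hand and compared with the value `summary` reports in its `sigma` component. The data below are simulated and purely illustrative.

```r
# Simulated data with two explanatory variables (illustrative values only).
set.seed(5)
n  <- 20
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + x1 + 2 * x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

rss <- sum(residuals(fit)^2)
df  <- n - length(coef(fit))   # 20 - 3 parameters = 17
rse <- sqrt(rss / df)          # residual standard error

# R's own value for comparison.
summary(fit)$sigma
```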

25 Measures of Fit Quality A quantity frequently reported in a model is R^2. Given the y values y_1, ..., y_n, the mean of y, ȳ, and the fitted values ŷ_1, ..., ŷ_n,

$$R^2 = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}.$$

This is a number between 0 and 1. The quality of fit increases with R^2. The adjusted R^2 does some adjustment for degrees of freedom. In our example R^2 is 0.974, which is very high.
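The formula can be verified directly against `summary`. This sketch uses simulated data; the adjusted version shown is the standard formula 1 - (1 - R^2)(n - 1)/(n - p - 1) for a model with an intercept and p explanatory variables, stated here as background rather than taken from the slides.

```r
# Simulated data (illustrative values only).
set.seed(6)
x <- runif(25)
y <- 2 + 3 * x + rnorm(25, sd = 0.5)
fit <- lm(y ~ x)

y_hat <- fitted(fit)
y_bar <- mean(y)

# R^2 as the ratio of explained to total sum of squares.
r2 <- sum((y_hat - y_bar)^2) / sum((y - y_bar)^2)

# Adjusted R^2 with n observations and p explanatory variables.
n <- length(y)
p <- 1
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
```

Both values should match `summary(fit)$r.squared` and `summary(fit)$adj.r.squared`.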

26 Plots to Assess the Model Remember the assumptions on the residuals needed to consider the linear model valid. We need an even scatter of residuals when plotted versus the fitted values, and a normal distribution of residuals. R produces 4 plots we can use to judge the model. The following code generates the 4 plots in one figure, then resets the original graphic parameters.

    > oldpar <- par(mfrow = c(2, 2))
    > plot(lmfit1)
    > par(oldpar)

27 Plots of lmfit1 [Figure: four diagnostic panels. Residuals vs Fitted (residuals against fitted values); Normal Q-Q (standardized residuals against theoretical quantiles); Scale-Location (standardized residuals against fitted values); Residuals vs Leverage (standardized residuals against leverage, with Cook's distance).]

28 Using R^2 to Compare Models? A problem with R^2, though, is that it doesn't follow a distribution. We can't compare the R^2 values of two models and know when one is meaningfully better. Just as an F statistic assessed the significance of an anova model, we use a statistic that follows an F distribution to compare two linear models, and to compare a single model to the null model.

29 Comparing Linear Models A typical concern in a linear modeling problem is whether leaving a variable out meaningfully diminishes the quality of the model. The disadvantage is that RSS may increase somewhat in the smaller model; the advantage is that a model with fewer variables is simpler, always a plus. We need to measure the trade-off.

30 The F Statistic Suppose the data contain N samples (N = length of y). Consider two linear models M_0, M_1. M_0 has p_0 variables and RSS value RSS_0. Model M_1 has p_1 > p_0 variables, the variables in M_0 are included in those used in M_1, and the RSS value is RSS_1. Let F be the number defined as

$$F = \frac{(\mathrm{RSS}_0 - \mathrm{RSS}_1)/(p_1 - p_0)}{\mathrm{RSS}_1/(N - p_1 - 1)}.$$

Under the assumption that the residuals are normally distributed, F satisfies an F distribution with p_1 - p_0 and N - p_1 - 1 degrees of freedom.
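The F statistic above can be sketched by hand and checked against the `anova` function. The data, the models `m0` and `m1`, and the name `F_stat` are illustrative assumptions.

```r
# Simulated data; x3 has no real effect (illustrative values only).
set.seed(7)
N  <- 30
x1 <- rnorm(N); x2 <- rnorm(N); x3 <- rnorm(N)
y  <- 1 + 2 * x1 + 0.5 * x2 + rnorm(N)

m0 <- lm(y ~ x1)             # smaller model, p0 = 1 variable
m1 <- lm(y ~ x1 + x2 + x3)   # larger model,  p1 = 3 variables
p0 <- 1
p1 <- 3

rss0 <- sum(residuals(m0)^2)
rss1 <- sum(residuals(m1)^2)

# F statistic and its upper-tail p-value under the F distribution.
F_stat <- ((rss0 - rss1) / (p1 - p0)) / (rss1 / (N - p1 - 1))
p_val  <- pf(F_stat, p1 - p0, N - p1 - 1, lower.tail = FALSE)

# The same comparison done by R.
tab <- anova(m0, m1)
```

The second row of `tab` holds the F value and p-value for the comparison, which should equal `F_stat` and `p_val`.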

31 Tested with Anova There is a simple way to execute this test in R. If fit1 and fit2 are the objects returned by lm for the two nested models, the test is executed by

    > compmod <- anova(fit1, fit2)

This is not aov, which models a continuous variable against a factor. The similarity is that both use the F distribution to measure the statistic; all such tests are an analysis of variance in some form.

32 Test One Model Against the Null In the summary(lmfit1) output the last line reports an F statistic. This is a comparison between the model and the null model, which sets all coefficients to 0 except the intercept. The p-value for this statistic can be > 0.05 when y has no dependence on the explanatory variables.
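A sketch showing that the F statistic in the summary is the nested-model comparison against the intercept-only model. The data and object names are assumptions for illustration.

```r
# Simulated data; x2 has no real effect (illustrative values only).
set.seed(8)
x1 <- rnorm(20); x2 <- rnorm(20)
y  <- 1 + x1 + rnorm(20)

fit  <- lm(y ~ x1 + x2)
null <- lm(y ~ 1)            # null model: intercept only

# Named vector with elements "value", "numdf", "dendf".
f_summary <- summary(fit)$fstatistic

# Explicit comparison of the null model with the full model.
tab <- anova(null, fit)
```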

33 Remove Variable from lmfit1 and Test the Result The variable judged least significant in lmfit1 is x4. Its p-value is 0.08, which is above the 0.05 threshold. Generate another model without it.

    > lmfit2 <- lm(yy ~ x1 + x2 + x3 + x5)

34 Inspect lmfit2

    > summary(lmfit2)
    Call:
    lm(formula = yy ~ x1 + x2 + x3 + x5)

[Output: a Residuals block (Min, 1Q, Median, 3Q, Max) and a Coefficients table with columns Estimate, Std. Error, t value, Pr(>|t|) and rows (Intercept), x1, x2, x3, x5; numeric values not recoverable.]

35 Inspect lmfit2 continued

    (Intercept) **
    x1          ***
    x2          ***
    x3          ***
    x5          ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: [value missing] on 15 degrees of freedom
    Multiple R-Squared: 0.968, Adjusted R-squared: [value missing]
    F-statistic: 113 on 4 and 15 DF, p-value: 5.34e-11

36 Compare the Two Models with Anova

    > compfit1fit2 <- anova(lmfit2, lmfit1)
    > compfit1fit2
    Analysis of Variance Table

    Model 1: yy ~ x1 + x2 + x3 + x5
    Model 2: yy ~ x1 + x2 + x3 + x4 + x5

[Output: table with columns Res.Df, RSS, Df, Sum of Sq, F, Pr(>F); numeric values not recoverable.]

    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

37 Use lmfit2 In Place of lmfit1 Since the p-value is 0.082, which is > 0.05, we do not reject the null hypothesis that the model using 5 variables (lmfit1) is not significantly better than the model using 4 variables (lmfit2). In this situation we prefer lmfit2 over lmfit1.

38 Variable Selection: a simple but typical case The following steps yield a model with the fewest number of variables that is statistically as meaningful as a larger model.

1. Generate an initial model using all reasonable explanatory variables.
2. Identify the least significant variable, i.e., the one with the largest p-value.
3. Compute a linear model without that variable.
4. Compute an anova for the two models. If the p-value is < 0.05 then the larger model is significantly better than the smaller. We accept the larger model as optimal. Otherwise, repeat steps 2-4 starting from the smaller model.
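The selection procedure can be sketched as a loop. This is one possible reading of the steps, not the author's code: the data frame `dat`, the variable names, and the 0.05 cutoff applied inside the loop are assumptions, and the loop drops the least significant variable (largest p-value) at each pass, as in the x4 example.

```r
# Simulated data; x3 has no real effect on y (illustrative values only).
set.seed(9)
n   <- 40
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - 1.5 * dat$x2 + rnorm(n, sd = 0.5)

# Step 1: initial model with all candidate variables.
vars <- c("x1", "x2", "x3")
fit  <- lm(reformulate(vars, response = "y"), data = dat)

while (length(vars) > 1) {
  # Step 2: the least significant variable has the largest p-value.
  pvals <- summary(fit)$coefficients[vars, "Pr(>|t|)"]
  worst <- names(which.max(pvals))

  # Step 3: refit without that variable.
  smaller <- lm(reformulate(setdiff(vars, worst), response = "y"), data = dat)

  # Step 4: compare the nested models with an F test.
  p <- anova(smaller, fit)[2, "Pr(>F)"]
  if (p < 0.05) break          # larger model is significantly better: keep it

  # Otherwise adopt the smaller model and repeat.
  vars <- setdiff(vars, worst)
  fit  <- smaller
}

vars   # variables retained in the selected model
```

The loop stops before removing the last variable; comparing a one-variable model with the null model is the overall F test already reported by `summary`.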


42 Linear Models are Broadly Applicable More complicated models can be generated by transforming a variable or including interactions between variables. Instead of fitting y to a + b_1 x_1 + b_2 x_2 it may be more meaningful to fit log(y) to a + c_1 x_1 + c_2 x_1^2 + c_3 x_2 + c_4 x_1 x_2. This is still considered a linear model since it is linear in the parameters. R's handling of generalized linear models is applicable here.
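A sketch of such a model in R's formula notation, fitting log(y) on x1, its square, x2, and the x1-by-x2 interaction. The data and coefficient values are simulated assumptions; `I(x1^2)` protects the square from the formula meaning of `^`, and `x1:x2` is the interaction term.

```r
# Simulated positive response so that log(y) is defined (illustrative values).
set.seed(10)
n  <- 50
x1 <- runif(n, 1, 3)
x2 <- runif(n, 1, 3)
y  <- exp(0.5 + 0.3 * x1 + 0.1 * x1^2 + 0.2 * x2 + 0.1 * x1 * x2 +
          rnorm(n, sd = 0.1))

# Still a linear model: log(y) is linear in the five parameters.
fit <- lm(log(y) ~ x1 + I(x1^2) + x2 + x1:x2)

names(coef(fit))
```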

### Multiple Linear Regression

Multiple Linear Regression A regression with two or more explanatory variables is called a multiple regression. Rather than modeling the mean response as a straight line, as in simple regression, it is

### Statistical Models in R

Statistical Models in R Some Examples Steven Buechler Department of Mathematics 276B Hurley Hall; 1-6233 Fall, 2007 Outline Statistical Models Structure of models in R Model Assessment (Part IA) Anova

### Stat 5303 (Oehlert): Tukey One Degree of Freedom 1

Stat 5303 (Oehlert): Tukey One Degree of Freedom 1 > catch

### Comparing Nested Models

Comparing Nested Models ST 430/514 Two models are nested if one model contains all the terms of the other, and at least one additional term. The larger model is the complete (or full) model, and the smaller

### E(y i ) = x T i β. yield of the refined product as a percentage of crude specific gravity vapour pressure ASTM 10% point ASTM end point in degrees F

Random and Mixed Effects Models (Ch. 10) Random effects models are very useful when the observations are sampled in a highly structured way. The basic idea is that the error associated with any linear,

### We extended the additive model in two variables to the interaction model by adding a third term to the equation.

Quadratic Models We extended the additive model in two variables to the interaction model by adding a third term to the equation. Similarly, we can extend the linear model in one variable to the quadratic

### Correlation and Simple Linear Regression

Correlation and Simple Linear Regression We are often interested in studying the relationship among variables to determine whether they are associated with one another. When we think that changes in a

### Basic Statistics and Data Analysis for Health Researchers from Foreign Countries

Basic Statistics and Data Analysis for Health Researchers from Foreign Countries Volkert Siersma siersma@sund.ku.dk The Research Unit for General Practice in Copenhagen Dias 1 Content Quantifying association

### Week 5: Multiple Linear Regression

BUS41100 Applied Regression Analysis Week 5: Multiple Linear Regression Parameter estimation and inference, forecasting, diagnostics, dummy variables Robert B. Gramacy The University of Chicago Booth School

### Psychology 205: Research Methods in Psychology

Psychology 205: Research Methods in Psychology Using R to analyze the data for study 2 Department of Psychology Northwestern University Evanston, Illinois USA November, 2012 1 / 38 Outline 1 Getting ready

### Using R for Linear Regression

Using R for Linear Regression In the following handout words and symbols in bold are R functions and words and symbols in italics are entries supplied by the user; underlined words and symbols are optional

### EDUCATION AND VOCABULARY MULTIPLE REGRESSION IN ACTION

EDUCATION AND VOCABULARY MULTIPLE REGRESSION IN ACTION EDUCATION AND VOCABULARY 5-10 hours of input weekly is enough to pick up a new language (Schiff & Myers, 1988). Dutch children spend 5.5 hours/day

### DEPARTMENT OF PSYCHOLOGY UNIVERSITY OF LANCASTER MSC IN PSYCHOLOGICAL RESEARCH METHODS ANALYSING AND INTERPRETING DATA 2 PART 1 WEEK 9

DEPARTMENT OF PSYCHOLOGY UNIVERSITY OF LANCASTER MSC IN PSYCHOLOGICAL RESEARCH METHODS ANALYSING AND INTERPRETING DATA 2 PART 1 WEEK 9 Analysis of covariance and multiple regression So far in this course,

### 5. Linear Regression

5. Linear Regression Outline.................................................................... 2 Simple linear regression 3 Linear model............................................................. 4

### Testing for Lack of Fit

Chapter 6 Testing for Lack of Fit How can we tell if a model fits the data? If the model is correct then ˆσ 2 should be an unbiased estimate of σ 2. If we have a model which is not complex enough to fit

### NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )

Chapter 340 Principal Components Regression Introduction is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates

### Lucky vs. Unlucky Teams in Sports

Lucky vs. Unlucky Teams in Sports Introduction Assuming gambling odds give true probabilities, one can classify a team as having been lucky or unlucky so far. Do results of matches between lucky and unlucky

### ANOVA. February 12, 2015

ANOVA February 12, 2015 1 ANOVA models Last time, we discussed the use of categorical variables in multivariate regression. Often, these are encoded as indicator columns in the design matrix. In [1]: %%R

### 1. What is the critical value for this 95% confidence interval? CV = z.025 = invnorm(0.025) = 1.96

1 Final Review 2 Review 2.1 CI 1-propZint Scenario 1 A TV manufacturer claims in its warranty brochure that in the past not more than 10 percent of its TV sets needed any repair during the first two years

### Simple Linear Regression Inference

Simple Linear Regression Inference 1 Inference requirements The Normality assumption of the stochastic term e is needed for inference even if it is not a OLS requirement. Therefore we have: Interpretation

### Lecture 11: Confidence intervals and model comparison for linear regression; analysis of variance

Lecture 11: Confidence intervals and model comparison for linear regression; analysis of variance 14 November 2007 1 Confidence intervals and hypothesis testing for linear regression Just as there was

### Part 2: Analysis of Relationship Between Two Variables

Part 2: Analysis of Relationship Between Two Variables Linear Regression Linear correlation Significance Tests Multiple regression Linear Regression Y = a X + b Dependent Variable Independent Variable

### BIOL 933 Lab 6 Fall 2015. Data Transformation

BIOL 933 Lab 6 Fall 2015 Data Transformation Transformations in R General overview Log transformation Power transformation The pitfalls of interpreting interactions in transformed data Transformations

### Unit 31 A Hypothesis Test about Correlation and Slope in a Simple Linear Regression

Unit 31 A Hypothesis Test about Correlation and Slope in a Simple Linear Regression Objectives: To perform a hypothesis test concerning the slope of a least squares line To recognize that testing for a

### 2. Regression and Correlation. Simple Linear Regression Software: R

2. Regression and Correlation Simple Linear Regression Software: R Create txt file from SAS data set data _null_; file 'C:\Documents and Settings\sphlab\Desktop\slr1.txt'; set temp; put input day:date7.

### Lets suppose we rolled a six-sided die 150 times and recorded the number of times each outcome (1-6) occured. The data is

In this lab we will look at how R can eliminate most of the annoying calculations involved in (a) using Chi-Squared tests to check for homogeneity in two-way tables of catagorical data and (b) computing

### Chapter 7: Simple linear regression Learning Objectives

Chapter 7: Simple linear regression Learning Objectives Reading: Section 7.1 of OpenIntro Statistics Video: Correlation vs. causation, YouTube (2:19) Video: Intro to Linear Regression, YouTube (5:18) -

### N-Way Analysis of Variance

N-Way Analysis of Variance 1 Introduction A good example when to use a n-way ANOVA is for a factorial design. A factorial design is an efficient way to conduct an experiment. Each observation has data

### 2. Simple Linear Regression

Research methods - II 3 2. Simple Linear Regression Simple linear regression is a technique in parametric statistics that is commonly used for analyzing mean response of a variable Y which changes according

### Univariate Regression

Univariate Regression Correlation and Regression The regression line summarizes the linear relationship between 2 variables Correlation coefficient, r, measures strength of relationship: the closer r is

### Simple Linear Regression

Chapter Nine Simple Linear Regression Consider the following three scenarios: 1. The CEO of the local Tourism Authority would like to know whether a family s annual expenditure on recreation is related

### Regression Analysis: A Complete Example

Regression Analysis: A Complete Example This section works out an example that includes all the topics we have discussed so far in this chapter. A complete example of regression analysis. PhotoDisc, Inc./Getty

### MIXED MODEL ANALYSIS USING R

Research Methods Group MIXED MODEL ANALYSIS USING R Using Case Study 4 from the BIOMETRICS & RESEARCH METHODS TEACHING RESOURCE BY Stephen Mbunzi & Sonal Nagda www.ilri.org/rmg www.worldagroforestrycentre.org/rmg

### Two-way ANOVA and ANCOVA

Two-way ANOVA and ANCOVA In this tutorial we discuss fitting two-way analysis of variance (ANOVA), as well as, analysis of covariance (ANCOVA) models in R. As we fit these models using regression methods

### Multiple Regression in SPSS This example shows you how to perform multiple regression. The basic command is regression : linear.

Multiple Regression in SPSS This example shows you how to perform multiple regression. The basic command is regression : linear. In the main dialog box, input the dependent variable and several predictors.

### SPSS Guide: Regression Analysis

SPSS Guide: Regression Analysis I put this together to give you a step-by-step guide for replicating what we did in the computer lab. It should help you run the tests we covered. The best way to get familiar

### KSTAT MINI-MANUAL. Decision Sciences 434 Kellogg Graduate School of Management

KSTAT MINI-MANUAL Decision Sciences 434 Kellogg Graduate School of Management Kstat is a set of macros added to Excel and it will enable you to do the statistics required for this course very easily. To

### Final Exam Practice Problem Answers

Final Exam Practice Problem Answers The following data set consists of data gathered from 77 popular breakfast cereals. The variables in the data set are as follows: Brand: The brand name of the cereal

### A Handbook of Statistical Analyses Using R. Brian S. Everitt and Torsten Hothorn

A Handbook of Statistical Analyses Using R Brian S. Everitt and Torsten Hothorn CHAPTER 6 Logistic Regression and Generalised Linear Models: Blood Screening, Women s Role in Society, and Colonic Polyps

### Getting Correct Results from PROC REG

Getting Correct Results from PROC REG Nathaniel Derby, Statis Pro Data Analytics, Seattle, WA ABSTRACT PROC REG, SAS s implementation of linear regression, is often used to fit a line without checking

### Logistic Regression (a type of Generalized Linear Model)

Logistic Regression (a type of Generalized Linear Model) 1/36 Today Review of GLMs Logistic Regression 2/36 How do we find patterns in data? We begin with a model of how the world works We use our knowledge

### MSwM examples. Jose A. Sanchez-Espigares, Alberto Lopez-Moreno Dept. of Statistics and Operations Research UPC-BarcelonaTech.

MSwM examples Jose A. Sanchez-Espigares, Alberto Lopez-Moreno Dept. of Statistics and Operations Research UPC-BarcelonaTech February 24, 2014 Abstract Two examples are described to illustrate the use of

### Chicago Booth BUSINESS STATISTICS 41000 Final Exam Fall 2011

Chicago Booth BUSINESS STATISTICS 41000 Final Exam Fall 2011 Name: Section: I pledge my honor that I have not violated the Honor Code Signature: This exam has 34 pages. You have 3 hours to complete this

### Data Mining and Data Warehousing. Henryk Maciejewski. Data Mining Predictive modelling: regression

Data Mining and Data Warehousing Henryk Maciejewski Data Mining Predictive modelling: regression Algorithms for Predictive Modelling Contents Regression Classification Auxiliary topics: Estimation of prediction

### Regression step-by-step using Microsoft Excel

Step 1: Regression step-by-step using Microsoft Excel Notes prepared by Pamela Peterson Drake, James Madison University Type the data into the spreadsheet The example used throughout this How to is a regression

### Exchange Rate Regime Analysis for the Chinese Yuan

Exchange Rate Regime Analysis for the Chinese Yuan Achim Zeileis Ajay Shah Ila Patnaik Abstract We investigate the Chinese exchange rate regime after China gave up on a fixed exchange rate to the US dollar

### International Statistical Institute, 56th Session, 2007: Phil Everson

Teaching Regression using American Football Scores Everson, Phil Swarthmore College Department of Mathematics and Statistics 5 College Avenue Swarthmore, PA198, USA E-mail: peverso1@swarthmore.edu 1. Introduction

### Recall this chart that showed how most of our course would be organized:

Chapter 4 One-Way ANOVA Recall this chart that showed how most of our course would be organized: Explanatory Variable(s) Response Variable Methods Categorical Categorical Contingency Tables Categorical

### 17. SIMPLE LINEAR REGRESSION II

17. SIMPLE LINEAR REGRESSION II The Model In linear regression analysis, we assume that the relationship between X and Y is linear. This does not mean, however, that Y can be perfectly predicted from X.

### Outline. Dispersion Bush lupine survival Quasi-Binomial family

Outline 1 Three-way interactions 2 Overdispersion in logistic regression Dispersion Bush lupine survival Quasi-Binomial family 3 Simulation for inference Why simulations Testing model fit: simulating the

### Section 14 Simple Linear Regression: Introduction to Least Squares Regression

Slide 1 Section 14 Simple Linear Regression: Introduction to Least Squares Regression There are several different measures of statistical association used for understanding the quantitative relationship

### Elementary Statistics Sample Exam #3

Elementary Statistics Sample Exam #3 Instructions. No books or telephones. Only the supplied calculators are allowed. The exam is worth 100 points. 1. A chi square goodness of fit test is considered to

### Factors affecting online sales

Factors affecting online sales Table of contents Summary... 1 Research questions... 1 The dataset... 2 Descriptive statistics: The exploratory stage... 3 Confidence intervals... 4 Hypothesis tests... 4

### Business Statistics. Successful completion of Introductory and/or Intermediate Algebra courses is recommended before taking Business Statistics.

Business Course Text Bowerman, Bruce L., Richard T. O'Connell, J. B. Orris, and Dawn C. Porter. Essentials of Business, 2nd edition, McGraw-Hill/Irwin, 2008, ISBN: 978-0-07-331988-9. Required Computing

### n + n log(2π) + n log(rss/n)

There is a discrepancy in R output from the functions step, AIC, and BIC over how to compute the AIC. The discrepancy is not very important, because it involves a difference of a constant factor that cancels

### 2013 MBA Jump Start Program. Statistics Module Part 3

2013 MBA Jump Start Program Module 1: Statistics Thomas Gilbert Part 3 Statistics Module Part 3 Hypothesis Testing (Inference) Regressions 2 1 Making an Investment Decision A researcher in your firm just

Lecture 16: Generalized Additive Models Regression III: Advanced Methods Bill Jacoby Michigan State University http://polisci.msu.edu/jacoby/icpsr/regress3 Goals of the Lecture Introduce Additive Models

### Class 19: Two Way Tables, Conditional Distributions, Chi-Square (Text: Sections 2.5; 9.1)

Spring 204 Class 9: Two Way Tables, Conditional Distributions, Chi-Square (Text: Sections 2.5; 9.) Big Picture: More than Two Samples In Chapter 7: We looked at quantitative variables and compared the

### Chapter 13 Introduction to Linear Regression and Correlation Analysis

Chapter 13 Student Lecture Notes Chapter 13 Introduction to Linear Regression and Correlation Analysis Fall 2006 Fundamentals of Business Statistics Chapter Goals To understand the methods for displaying

### Stat 412/512 CASE INFLUENCE STATISTICS. Charlotte Wickham. stat512.cwick.co.nz. Feb 2 2015

Stat 412/512 CASE INFLUENCE STATISTICS Feb 2 2015 Charlotte Wickham stat512.cwick.co.nz Regression in your field See website. You may complete this assignment in pairs. Find a journal article in your field

### " Y. Notation and Equations for Regression Lecture 11/4. Notation:

Notation: Notation and Equations for Regression Lecture 11/4 m: The number of predictor variables in a regression Xi: One of multiple predictor variables. The subscript i represents any number from 1 through

### Premaster Statistics Tutorial 4 Full solutions

Premaster Statistics Tutorial 4 Full solutions Regression analysis Q1 (based on Doane & Seward, 4/E, 12.7) a. Interpret the slope of the fitted regression = 125,000 + 150. b. What is the prediction for

### Exercises on using R for Statistics and Hypothesis Testing Dr. Wenjia Wang

Exercises on using R for Statistics and Hypothesis Testing Dr. Wenjia Wang School of Computing Sciences, UEA University of East Anglia Brief Introduction to R R is a free open source statistics and mathematical

### Q = aK³L³ + bK²L². 2. The properties of a short-run cubic production function (Q = AL³ + BL²)

Learning Objectives After reading Chapter 10 and working the problems for Chapter 10 in the textbook and in this Student Workbook, you should be able to: Specify and estimate a short-run production function

### Generalized Linear Models

Generalized Linear Models We have previously worked with regression models where the response variable is quantitative and normally distributed. Now we turn our attention to two types of models where the

### Multiple Regression: What Is It?

Multiple Regression Multiple Regression: What Is It? Multiple regression is a collection of techniques in which there are multiple predictors of varying kinds and a single outcome We are interested in

### Course Text. Required Computing Software. Course Description. Course Objectives. StraighterLine. Business Statistics

Course Text Business Statistics Lind, Douglas A., Marchal, William A. and Samuel A. Wathen. Basic Statistics for Business and Economics, 7th edition, McGraw-Hill/Irwin, 2010, ISBN: 9780077384470 [This

### Module 5: Multiple Regression Analysis

Using Statistical Data to Make Decisions: Multiple Regression Analysis Module 5: Multiple Regression Analysis Tom Ilvento, University of Delaware, College

### Tutorial 3: Graphics and Exploratory Data Analysis in R Jason Pienaar and Tom Miller

Tutorial 3: Graphics and Exploratory Data Analysis in R Jason Pienaar and Tom Miller Getting to know the data An important first step before performing any kind of statistical analysis is to familiarize

### GLM I An Introduction to Generalized Linear Models

GLM I An Introduction to Generalized Linear Models CAS Ratemaking and Product Management Seminar March 2009 Presented by: Tanya D. Havlicek, Actuarial Assistant ANTITRUST Notice The Casualty Actuarial

### SAS Software to Fit the Generalized Linear Model

SAS Software to Fit the Generalized Linear Model Gordon Johnston, SAS Institute Inc., Cary, NC Abstract In recent years, the class of generalized linear models has gained popularity as a statistical modeling

### Outline. Topic 4 - Analysis of Variance Approach to Regression. Partitioning Sums of Squares. Total Sum of Squares. Partitioning sums of squares

Topic 4 - Analysis of Variance Approach to Regression Outline Partitioning sums of squares Degrees of freedom Expected mean squares General linear test - Fall 2013 R 2 and the coefficient of correlation

### Introduction to Hierarchical Linear Modeling with R

Introduction to Hierarchical Linear Modeling with R [Figure: grid of 16 numbered panels; axis label: SCIENCE]

### Regression and Correlation

Regression and Correlation Topics Covered: Dependent and independent variables. Scatter diagram. Correlation coefficient. Linear Regression line. by Dr.I.Namestnikova 1 Introduction Regression analysis

### Time Series Analysis

Time Series 1 April 9, 2013 Time Series Analysis This chapter presents an introduction to the branch of statistics known as time series analysis. Often the data we collect in environmental studies is collected

Table of Contents: Overview of Model; Dispersion; Parameterization; Sigma-Restricted Model; Overparameterized Model; Reference Coding; Model Summary (Summary Tab); Summary

### Please follow the directions once you locate the Stata software in your computer. Room 114 (Business Lab) has computers with Stata software

STATA Tutorial Professor Erdinç Please follow the directions once you locate the Stata software in your computer. Room 114 (Business Lab) has computers with Stata software 1.Wald Test Wald Test is used

### Predictor Coef StDev T P Constant 970667056 616256122 1.58 0.154 X 0.00293 0.06163 0.05 0.963. S = 0.5597 R-Sq = 0.0% R-Sq(adj) = 0.

Statistical analysis using Microsoft Excel Microsoft Excel spreadsheets have become somewhat of a standard for data storage, at least for smaller data sets. This, along with the program often being packaged

### MULTIPLE REGRESSION EXAMPLE

MULTIPLE REGRESSION EXAMPLE For a sample of n = 166 college students, the following variables were measured: Y = height, X1 = mother's height ("momheight"), X2 = father's height ("dadheight"), X3 = 1 if

### 11. Analysis of Case-control Studies Logistic Regression

Research methods II 113 11. Analysis of Case-control Studies Logistic Regression This chapter builds upon and further develops the concepts and strategies described in Ch.6 of Mother and Child Health:

### Applied Statistics. J. Blanchet and J. Wadsworth. Institute of Mathematics, Analysis, and Applications EPF Lausanne

Applied Statistics J. Blanchet and J. Wadsworth Institute of Mathematics, Analysis, and Applications EPF Lausanne An MSc Course for Applied Mathematicians, Fall 2012 Outline 1 Model Comparison 2 Model

### Exercise 1.12 (Pg. 22-23)

Individuals: The objects that are described by a set of data. They may be people, animals, things, etc. (Also referred to as Cases or Records) Variables: The characteristics recorded about each individual.

### STAT 350 Practice Final Exam Solution (Spring 2015)

PART 1: Multiple Choice Questions: 1) A study was conducted to compare five different training programs for improving endurance. Forty subjects were randomly divided into five groups of eight subjects

### 2. What is the general linear model to be used to model linear trend? (Write out the model) = + + + or

Simple and Multiple Regression Analysis Example: Explore the relationships among Month, Adv.\$ and Sales \$: 1. Prepare a scatter plot of these data. The scatter plots for Adv.\$ versus Sales, and Month versus

### The importance of graphing the data: Anscombe's regression examples

The importance of graphing the data: Anscombe's regression examples Bruce Weaver Northern Health Research Conference Nipissing University, North Bay May 30-31, 2008 The Objective

### 1 Simple Linear Regression I Least Squares Estimation

Simple Linear Regression I Least Squares Estimation Textbook Sections: 8. 8.3 Previously, we have worked with a random variable x that comes from a population that is normally distributed with mean µ and

### DATA INTERPRETATION AND STATISTICS

PholC60 September 001 DATA INTERPRETATION AND STATISTICS Books An easy and systematic introductory text is Essentials of Medical Statistics by Betty Kirkwood, published by Blackwell at about 14. DESCRIPTIVE

### Lab 13: Logistic Regression

Lab 13: Logistic Regression Spam Emails Today we will be working with a corpus of emails received by a single gmail account over the first three months of 2012. Just like any other email address this account

### Multivariate Logistic Regression

1 Multivariate Logistic Regression As in univariate logistic regression, let π(x) represent the probability of an event that depends on p covariates or independent variables. Then, using an inv.logit formulation

### R: A Free Software Project in Statistical Computing

R: A Free Software Project in Statistical Computing Achim Zeileis Institut für Statistik & Wahrscheinlichkeitstheorie http://www.ci.tuwien.ac.at/~zeileis/ Acknowledgments Thanks: Alex Smola & Machine Learning

### Regression Analysis (Spring, 2000)

Regression Analysis (Spring, 2000) By Wonjae Purposes: a. Explaining the relationship between Y and X variables with a model (Explain a variable Y in terms of Xs) b. Estimating and testing the intensity

### This can dilute the significance of a departure from the null hypothesis. We can focus the test on departures of a particular form.

One-Degree-of-Freedom Tests The test for group × occasion interactions has (number of groups − 1) × (number of occasions − 1) degrees of freedom. This can dilute the significance of a departure from the null hypothesis.

### Multiple Linear Regression in Data Mining

Multiple Linear Regression in Data Mining Contents 2.1. A Review of Multiple Linear Regression 2.2. Illustration of the Regression Process 2.3. Subset Selection in Linear Regression 1 2 Chap. 2 Multiple

### Multi Factors Model. Daniel Herlemont. March 31, 2009. 2 Estimating using Ordinary Least Square regression 3

Multi Factors Model Daniel Herlemont March 31, 2009 Contents: 1 Introduction; 2 Estimating using Ordinary Least Square regression; 3 Multicollinearity; 4 Estimating Fundamental Factor Models by Orthogonal

### MULTIPLE LINEAR REGRESSION ANALYSIS USING MICROSOFT EXCEL. by Michael L. Orlov Chemistry Department, Oregon State University (1996)

MULTIPLE LINEAR REGRESSION ANALYSIS USING MICROSOFT EXCEL by Michael L. Orlov Chemistry Department, Oregon State University (1996) INTRODUCTION In modern science, regression analysis is a necessary part

### Chapter 7 Section 1 Homework Set A

Chapter 7 Section 1 Homework Set A 7.15 Finding the critical value t *. What critical value t * from Table D (use software, go to the web and type t distribution applet) should be used to calculate the

### An analysis appropriate for a quantitative outcome and a single quantitative explanatory. 9.1 The model behind linear regression

Chapter 9 Simple Linear Regression An analysis appropriate for a quantitative outcome and a single quantitative explanatory variable. 9.1 The model behind linear regression When we are examining the relationship