Simple Regression and Correlation


14 Simple Regression and Correlation

We are going to study the relation between two variables. Let us label the first as X and the second as Y. We will observe values for these two variables in paired form (X, Y). We are interested in finding the relation between these two variables, and to determine it we have to know the meaning of each of them. The following are some examples of X and Y and the theoretical relation between them:

X                                     Y                             Theoretical relation
Weight of a person                    Height of a person            Positive relation
Higher temperature                    Electricity consumption       Positive relation
KD exchange rate against the Dollar   Temperature in Japan          No relation
Price of a commodity                  Quantity of items purchased   Negative relation

Simple Regression and Correlation

For these variables we could be interested in answers to the following questions:
a) Is there any relation between the two variables?
b) Does the relation (if any) take a linear form?
c) What is the direction of the relation (positive, negative)?
d) What is the magnitude of this relation (weak, medium or strong)?

The answers to these questions can be obtained in two ways.

1) Partially, by plotting the scatter plot of X and Y. The following are the general forms of scatter plots one could obtain:

[Scatter plots: complete positive linear relation (r = 1) and complete negative linear relation (r = -1); only the axis scales survive from the original figures.]

Simple Regression and Correlation: Scatter Plot

[Scatter plots: positive incomplete linear relation (0 < r < 1) and negative incomplete linear relation (-1 < r < 0).]

[Scatter plot: no linear relation between X and Y (r = 0).]

Simple Regression and Correlation: Scatter Plot in MINITAB

Assuming that the variable X is in column C1 and Y is in column C2 of the MINITAB worksheet, we can get a high-resolution form of the scatter plot using the command:

MTB > plot C2*C1

The general form of the plot command is:

MTB > plot <column of Y> * <column of X>

Linear Correlation Coefficient (r)

The second and more precise method to answer the previous four questions is to use a numerical measure. Karl Pearson developed a measure of the strength of the linear relation between two variables. This measure is still widely used and is known as Pearson's linear correlation coefficient, or simply the linear correlation coefficient, and is denoted by "r". The coefficient "r" is a scale that runs from -1 to +1. The magnitude of "r" indicates the strength of the linear relation between the two variables X and Y, and its sign indicates the direction (negative or positive) of the linear relation:

-1.0        -0.75       -0.5        0         +0.5        +0.75       +1.0
strong      medium      weak                  weak        medium      strong
(perfect negative)            (no linear relation)            (perfect positive)

Linear Correlation Coefficient

As an example, the following values of the correlation coefficient "r" are interpreted as follows:

If r = -0.81: strong negative linear relation
If r = 0.34: weak positive linear relation
If r = 0.69: medium positive linear relation
If r = 1.00: perfect positive linear relation
If r = -1.00: perfect negative linear relation
If r = -0.71: medium negative linear relation
If r = 0: no linear relation

The formula for computing Pearson's coefficient is as follows:

r = Σ(X - X̄)(Y - Ȳ) / sqrt[ Σ(X - X̄)² · Σ(Y - Ȳ)² ]   (definition form)

r = [ΣXY - (ΣX)(ΣY)/n] / sqrt[ (ΣX² - (ΣX)²/n) · (ΣY² - (ΣY)²/n) ] = SS_xy / sqrt(SS_xx · SS_yy)   (computation form)

So, to compute "r" we need the values of the following sums: n, ΣX, ΣY, ΣX², ΣY² and ΣXY.

Linear Correlation Coefficient: Example

For the two variables X and Y we have the following 5 observations. Determine the strength and direction of the linear relation (if any) between them.

      X     Y
      70    10
      50    32
      45    29
      30    50
      10    55
Sum   205   176

To compute Pearson's coefficient "r" we need to find the values of the required sums:

      X     Y     X²      Y²     XY
      70    10    4900    100    700
      50    32    2500    1024   1600
      45    29    2025    841    1305
      30    50    900     2500   1500
      10    55    100     3025   550
Sum   205   176   10425   7490   5655

n = 5, ΣX = 205, ΣY = 176, ΣX² = 10425, ΣY² = 7490, ΣXY = 5655
SS_xx = 2020, SS_yy = 1294.8, SS_xy = -1561

Linear Correlation Coefficient: Example

n = 5, ΣX = 205, ΣY = 176, ΣX² = 10425, ΣY² = 7490, ΣXY = 5655
SS_xx = 2020, SS_yy = 1294.8, SS_xy = -1561

r = SS_xy / sqrt(SS_xx · SS_yy)
  = [5655 - (205)(176)/5] / sqrt[ (10425 - 205²/5) · (7490 - 176²/5) ]
  = -1561 / sqrt(2020 · 1294.8)
  = -1561 / 1617.25
  = -0.965

which means that there is a strong negative linear relation between the two variables X and Y.

Computing the correlation coefficient in MINITAB: assuming that the data for X are in column C1 and the data for Y are in column C2, we can use the following MINITAB command to calculate "r":

MTB > corr C1 C2
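As a cross-check of the hand computation (outside the original MINITAB session), a minimal Python sketch that reproduces r from the same five observations:

```python
import math

# The five observations from the example.
xs = [70, 50, 45, 30, 10]
ys = [10, 32, 29, 50, 55]
n = len(xs)

# The sums required by the computation form of r.
sum_x, sum_y = sum(xs), sum(ys)                     # 205, 176
sum_x2 = sum(x * x for x in xs)                     # 10425
sum_y2 = sum(y * y for y in ys)                     # 7490
sum_xy = sum(x * y for x, y in zip(xs, ys))         # 5655

ss_xx = sum_x2 - sum_x ** 2 / n                     # 2020.0
ss_yy = sum_y2 - sum_y ** 2 / n                     # 1294.8
ss_xy = sum_xy - sum_x * sum_y / n                  # -1561.0

r = ss_xy / math.sqrt(ss_xx * ss_yy)
print(round(r, 3))                                  # -0.965
```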

Linear Correlation Coefficient: MINITAB Example

MTB > print c1 c2
Row   X    Y
1     70   10
2     50   32
3     45   29
4     30   50
5     10   55

MTB > let c3=c1*c1
MTB > let c4=c2*c2
MTB > let c5=c1*c2
MTB > print c1-c5
Row   X    Y    X^2    Y^2    XY
1     70   10   4900   100    700
2     50   32   2500   1024   1600
3     45   29   2025   841    1305
4     30   50   900    2500   1500
5     10   55   100    3025   550

MTB > count c1
Total number of observations in X = 5
MTB > sum c1
Sum of X = 205.00
MTB > sum c2
Sum of Y = 176.00
MTB > sum c3
Sum of X^2 = 10425
MTB > sum c4
Sum of Y^2 = 7490.0
MTB > sum c5
Sum of XY = 5655.0

MTB > gstd
MTB > plot c2 c1
[Character scatter plot of Y versus X; the five points fall along a decreasing line.]

MTB > corr c1 c2
Correlations: X, Y
Pearson correlation of X and Y = -0.965
P-Value = 0.008

Linear Correlation Coefficient: Hypothesis Testing

To test H0: ρ = 0 (no linear correlation) against H1: ρ ≠ 0, we use the test statistic

t = r · sqrt(n - 2) / sqrt(1 - r²),

which follows the t-distribution with n - 2 degrees of freedom. For the example above:

t = -0.965 · sqrt(5 - 2) / sqrt(1 - 0.9316) = -6.39 ~ t(3)

So for α = .05 the critical value of t (with 3 df) is 3.182.

Decision: reject H0 with 95% confidence.
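The same test can be sketched in Python; scipy is an assumption here (the slides read the critical value from a t-table):

```python
import math
from scipy import stats

n = 5
r = -1561 / math.sqrt(2020 * 1294.8)      # unrounded r from the sums above

# Test statistic for H0: rho = 0 versus H1: rho != 0.
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
print(round(t, 2))                        # -6.39

# Two-sided critical value for alpha = 0.05 with n - 2 = 3 df.
t_crit = stats.t.ppf(1 - 0.05 / 2, n - 2)
print(round(t_crit, 3))                   # 3.182

print(abs(t) > t_crit)                    # True -> reject H0
```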

Regression

We have learned from the previous section how to examine the strength and direction of a linear relation that could link two variables X and Y. In linear regression, we are interested in forming or estimating the best linear function that ties the two variables together. The general form of a linear equation is:

Y = a + b X    or    Y = b0 + b1 X

An example of such a linear equation is Y = 4 + 3 X. Here a = 4 and b = 3. What is the interpretation of "a" and "b" in the general form of the linear equation?

To understand the interpretation of the linear equation, let us take the previous equation as an example, Y = 4 + 3 X, and find the values of Y for different values of X:

X    Y     Change in X        Change in Y
0    4                        Y = 4 (the value of "a")
1    7     ΔX = 1 - 0 = 1     ΔY = 7 - 4 = 3
2    10    ΔX = 2 - 1 = 1     ΔY = 10 - 7 = 3
3    13    ΔX = 3 - 2 = 1     ΔY = 13 - 10 = 3
4    16    ΔX = 4 - 3 = 1     ΔY = 16 - 13 = 3

From the above example we see that whenever X changes by 1 unit, Y changes by 3 (the value of b), and when X equals 0 the value of Y equals 4 (the value of a in the general form of the linear equation).

Regression

Y = a + b X

a = the starting or initial value of Y (i.e. the value of Y when X = 0).

b = (1) the rate of change in Y when X changes by 1 unit; or (2) the rate of change in Y divided by the rate of change in X (b = ΔY/ΔX); (3) b is also interpreted as the slope of the linear equation Y = a + b X, or the tangent (tan) of the angle between that line and the horizontal line.

For the above general form of the linear equation, X is known as the independent variable, whereas Y is the dependent variable, as its value is determined by the values of X.

[Graph of the line Y = a + b X illustrating the intercept a and the slope b; no text survives.]

Regression

Y = a + b X

How do we estimate the best linear equation for Y on X? We usually start with values for the two variables X and Y. Here we should specify which of them we are going to consider as the independent variable and which as the dependent variable (the one we need to explain by the other). We start with a similar data layout to the one we had in the correlation coefficient case, that is, pairs (X, Y). Estimating the linear equation with Y as the dependent variable and X as the independent variable, from a set of data, is known as estimating the linear regression line of Y on X.

One way to help us estimate the linear regression of Y on X is to plot the scatter plot of the values of X and Y, and to compute the correlation coefficient as discussed before. If the scatter plot of Y and X takes exactly a linear form (with positive or negative slope), then estimating the best line that fits the data is easy and straightforward. But if the scatter plot does not perfectly follow a linear pattern, then there are a number of ways to define and estimate a linear equation that would represent the data. The least squares method is one of these methods, and it is widely used.

The idea of the least squares method is to fit a line for the linear relation Y = a + b X that passes through the middle of the data set. To achieve this, the method searches for the line that has the least squared error, that is, the estimated line with the lowest possible value of Σ(Y - Ŷ)², where:

Y is the observed value of the dependent variable;
Ŷ is the estimated value of Y using the estimated regression line.

The quantity (Y - Ŷ) is known as the error term, residual, or deviation.

Regression

[Two scatter plots of X and Y; the second marks an observed point Y, its fitted value Ŷ on the line, and the deviation (Y - Ŷ).]

Since the data do not exactly follow a linear form, we can say the linear form fits the data with some error. That is:

Y = a + b X + e

where "a" and "b" are unknown constants and "e" is an unobservable error term (deviation from the line). We can then write the error term as (Y - Ŷ), or (Y - (a + b X)); its square,

(Y - (a + b X))²,

is the squared deviation.

Regression

The sum of all squared deviations over all values of Y is:

Σ(Y - a - b X)²

To obtain the values of "a" and "b" that minimize the sum of squared errors, we partially differentiate the sum of squared deviations, once with respect to "a" and once with respect to "b", and equate both resulting functions to zero, to obtain what are known as the normal equations:

ΣY = a·n + b·ΣX
ΣXY = a·ΣX + b·ΣX²

Solving these two equations for "a" and "b" we have:

b̂ = [ΣXY - (ΣX)(ΣY)/n] / [ΣX² - (ΣX)²/n] = SS_xy / SS_xx

â = Ȳ - b̂·X̄

so the estimated equation is then:

Ŷ = â + b̂·X

Regression: Example

From the previous example, n = 5:

      X     Y     X²      Y²     XY
      70    10    4900    100    700
      50    32    2500    1024   1600
      45    29    2025    841    1305
      30    50    900     2500   1500
      10    55    100     3025   550
Sum   205   176   10425   7490   5655

ΣX = 205, ΣY = 176, ΣX² = 10425, ΣY² = 7490, ΣXY = 5655
SS_xx = 2020, SS_yy = 1294.8, SS_xy = -1561

b̂ = [5655 - (205·176)/5] / [10425 - 205²/5] = -1561/2020 = -0.773

â = Ȳ - b̂·X̄ = 176/5 - (-0.773)(205/5) = 35.2 + 0.773·41 = 66.893

The final least squares estimated linear regression of Y on X is:

Ŷ = 66.893 - 0.773·X

Note: "b" and "r" always have the same sign (+ or -), which is determined by the value of their common numerator SS_xy.
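A minimal Python sketch of the same least squares computation (not part of the original slides); note that MINITAB's intercept of 66.884 and the slide's 66.893 differ only because the slide rounds the slope to -0.773 before computing â:

```python
xs = [70, 50, 45, 30, 10]
ys = [10, 32, 29, 50, 55]
n = len(xs)

x_bar, y_bar = sum(xs) / n, sum(ys) / n                             # 41.0, 35.2
ss_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n                   # 2020.0
ss_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n  # -1561.0

b_hat = ss_xy / ss_xx              # -0.7728
a_hat = y_bar - b_hat * x_bar      # 66.884 (66.893 if b_hat is first rounded to -0.773)
print(round(b_hat, 4), round(a_hat, 3))
```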

Regression: Example

Ŷ = 66.893 - 0.773·X

Uses of the estimated linear equation:
1) To further explain the relation between X and Y.
2) To predict values of Y (the dependent variable) for given values of X (the independent variable).

Explaining the relation in the previous example: the estimated slope is b̂ = -0.773 = ΔY/ΔX, which indicates that when X increases by 1 unit the value of Y decreases by 0.773. The initial value of Y (the value of Y when X = 0) is â = 66.893.

Prediction: what is the estimated (predicted) value of Y when X = 30? That is, what is Ŷ when X = 30?

Ŷ(X=30) = 66.893 - 0.773·30 = 43.703

The observed value at X = 30 is Y = 50, so the error (deviation) is Y - Ŷ = 50 - 43.70 = 6.30.

Similarly, the predicted value of Y when X = 80 is:

Ŷ(X=80) = 66.893 - 0.773·80 = 5.053
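The prediction step as a small Python sketch; the helper name predict is ours, not from the slides:

```python
a_hat, b_hat = 66.893, -0.773      # rounded estimates from the slide

def predict(x):
    """Predicted Y for a given X from the fitted line."""
    return a_hat + b_hat * x

print(round(predict(30), 3))       # 43.703
print(round(predict(80), 3))       # 5.053
```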

Regression: MINITAB Example

Assuming that the data for X are in column C1 and the data for Y are in column C2, we can use the following MINITAB command to estimate the values of "a" and "b":

MTB > regr C2 1 C1

The general form of the regression command is:

MTB > regr <column of Y> <number of X variables> <column(s) of X> [<column for standardized errors> <column for Ŷ>]

where the last two columns are optional. Examples:

MTB > regr C2 1 C1
MTB > regr C2 1 C1 C5 C6

MTB > print c1-c2
Row   X    Y
1     70   10
2     50   32
3     45   29
4     30   50
5     10   55

MTB > gstd
MTB > plot c2 c1
[Character scatter plot of Y versus X omitted.]

MTB > corr c1 c2
Correlations: X, Y
Pearson correlation of X and Y = -0.965

Regression: MINITAB Example

MTB > regr c2 1 c1

Regression Analysis: Y versus X
The regression equation is
Y = 66.9 - 0.773 X

Predictor    Coef      SE Coef   T       P
Constant     66.884    5.518     12.12   0.001
X            -0.7728   0.1208    -6.39   0.008

S = 5.431    R-Sq = 93.2%    R-Sq(adj) = 90.9%

Analysis of Variance
Source          DF   SS       MS       F       P
Regression      1    1206.3   1206.3   40.89   0.008
Residual Error  3    88.5     29.5
Total           4    1294.8

MTB > regr c2 1 c1 c5 c6
MTB > print c1 c2 c5 c6

Data Display
Row   X    Y    st.error    Yhat
1     70   10   -0.82918    12.7896
2     50   32   0.79306     28.2450
3     45   29   -0.64314    32.1089
4     30   50   1.34817     43.7005
5     10   55   -1.34378    59.1559
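Readers without MINITAB can reproduce this table with Python's statsmodels package (an assumption, not the course software); a minimal sketch:

```python
import numpy as np
import statsmodels.api as sm

x = np.array([70, 50, 45, 30, 10])
y = np.array([10, 32, 29, 50, 55])

X = sm.add_constant(x)             # adds the intercept column for "a"
model = sm.OLS(y, X).fit()         # ordinary least squares fit of Y on X

print(model.params)                # approx. [66.8837, -0.7728] -> a_hat, b_hat
print(model.rsquared)              # approx. 0.9316 -> R-Sq = 93.2%
print(model.fittedvalues)          # matches MINITAB's Yhat column
```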

Regression: Assumptions of the Regression Model

Consider the population regression model Y = a + bX + ε. We make four assumptions when estimating this model from a sample:

Assumption 1: The random error term ε has a mean equal to zero for each x.
Assumption 2: The errors associated with different observations are independent.
Assumption 3: For any given x, the distribution of errors is normal.
Assumption 4: The distribution of the population errors for each x has the same constant standard deviation, denoted σ_ε. This assumption indicates that the spread of points around the regression line is similar for all x values.

If we estimate a regression of Y on X using the least squares method, how do we know if this estimate is good, and how do we know whether it is a reliable estimate? We need to test the validity of the estimated model for:

Y = a + b X + ε

where:
X is the independent variable;
ε is a random variable that is normally distributed with mean 0 and standard deviation σ_ε, i.e. ε ~ N(0, σ_ε);
Y is the dependent random variable, Y ~ N(a + bX, σ_ε).

Regression

Steps for validating the estimated model:

(1) Coefficient of determination (r²):

r² is a numerical measure that takes a value between 0 and 1 (i.e. 0 ≤ r² ≤ 1) and represents the percentage of the total variation in the dependent variable (Y) that is explained by the estimated linear model Ŷ = â + b̂X. r² is computed by the formula:

r² = Explained variation / Total variation in Y = Regression sum of squares / Total sum of squares = RSS / TSS

The larger the value of r², the stronger the estimated model. So r² is one indication or measure of the strength of the estimated model.

Example: if we compute r² and find that it equals 0.83, this is interpreted as: the estimated model (using the least squares method) has succeeded in explaining 83% of the total variation in the dependent variable Y. The remaining 17% is not explained by the estimated model; it could be pure error, or some lack or deficiency in the estimated model.
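For instance, plugging the ANOVA numbers from the first example's MINITAB output into this formula reproduces the R-Sq = 93.2% reported there; a one-line check in Python:

```python
# RSS and TSS taken from the ANOVA table of the first example.
rss = 1206.3     # regression sum of squares
tss = 1294.8     # total sum of squares

r_sq = rss / tss
print(round(r_sq, 3))     # 0.932 -> the model explains 93.2% of the variation in Y
```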

Regression

Notes:
1. TSS = SS_yy (Total Sum of Squares)
2. RSS = b̂ · SS_xy (Regression Sum of Squares)
3. ESS = TSS - RSS (Error Sum of Squares)
4. TSS = RSS + ESS, i.e. Σ(Y - Ȳ)² = Σ(Ŷ - Ȳ)² + Σ(Y - Ŷ)²

Regression

(2) Test for the overall validity of the estimated model:

Given the assumptions of normality of the error term and the dependent variable Y, we can test the adequacy of the estimated model in representing the linear relation between Y and X. The procedure for this test follows the same steps we have used in all previous hypothesis-testing procedures. We start by formulating the two hypotheses, namely H0 and H1. In our case:

H0: The estimated model does not fit the data (the model is not adequate).
H1: The estimated model fits the data (the model is adequate).

Regression

To test the above hypotheses we compute the value of the test statistic through the analysis of variance (ANOVA) table:

Source       DF       SS           MS                   F (test stat.)
Regression   k        Σ(Ŷ - Ȳ)²    Σ(Ŷ - Ȳ)²/k          Regression MS / Error MS
Error        n-k-1    Σ(Y - Ŷ)²    Σ(Y - Ŷ)²/(n-k-1)
Total        n-1      Σ(Y - Ȳ)²

The test statistic is F = Regression MS / Error MS ~ F(k, n-k-1) (a code sketch of this computation follows this slide), where:

n is the number of observations used in estimating the model;
k is the number of independent variables (X) used in the model;
Ŷ is the predicted value of Y for each value of X;
Ȳ is the mean of the dependent variable Y.

(3) Test for the model parameters:

The estimated model Ŷ = â + b̂X has two parameters, â and b̂. We will test the hypothesis that b (the true value estimated by b̂) equals 0. The general form of the hypothesis in this case is:

H0: b = b0   versus   H1: b ≠ b0

Similarly, we will test the hypothesis that a (the true value estimated by â) equals 0. The general form of the hypothesis in this case is:

H0: a = a0   versus   H1: a ≠ a0

If b0 equals zero and the test shows that the true value of b could indeed be zero, then X does not contribute to the equation and is not needed in the model; in that case we can drop X from the model. Similarly, if a0 = 0 and the result of the test shows that this could be accepted, then a is dropped from the estimated model.
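A sketch of the F computation from the ANOVA table above, in Python with scipy for the p-value (the helper name regression_f is ours):

```python
from scipy import stats

def regression_f(rss, ess, n, k=1):
    """Return (F statistic, p-value) from the regression and error sums of squares."""
    ms_reg = rss / k                    # regression mean square
    ms_err = ess / (n - k - 1)          # error mean square
    f = ms_reg / ms_err
    p = stats.f.sf(f, k, n - k - 1)     # upper-tail probability
    return f, p

# First example: RSS = 1206.3, ESS = 88.5, n = 5 observations.
f, p = regression_f(1206.3, 88.5, 5)
print(round(f, 2), round(p, 3))         # 40.89 0.008 -> matches the MINITAB ANOVA
```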

Regression

The test statistics are:

t = (â - a0) / SE(â)    and    t = (b̂ - b0) / SE(b̂)

each with the same DF as the error in the ANOVA; that is, the two statistics have a t-distribution with (n-2) df. The standard errors of â and b̂ are not easy to compute by hand, and we are going to rely on the MINITAB output to estimate them. We can estimate the confidence intervals for a and b from the formulas:

b̂ ± t(α/2, n-2) · SE(b̂)    and    â ± t(α/2, n-2) · SE(â)

where

SE(b̂) = s / sqrt(SS_xx),    SE(â) = s · sqrt( ΣX² / (n · SS_xx) ),    s = sqrt(Error MS)
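Under these formulas, the standard errors in the first example's MINITAB output can be reproduced by hand; a minimal Python sketch:

```python
import math

# Quantities from the first example.
n = 5
ess = 88.5              # error sum of squares from the ANOVA table
ss_xx = 2020.0
sum_x2 = 10425.0        # sum of X^2

s = math.sqrt(ess / (n - 2))                  # sqrt(Error MS) = 5.431
se_b = s / math.sqrt(ss_xx)                   # 0.1208, as in the MINITAB output
se_a = s * math.sqrt(sum_x2 / (n * ss_xx))    # 5.518, as in the MINITAB output
print(round(s, 3), round(se_b, 4), round(se_a, 3))
```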

Regression

Example: suppose we have the following values for the variables X and Y:

X:  32  36  55  47  38  60  66  44  70  50
Y:  16  17  26  24  22  21  32  18  30  20

n = 10, Σx = 498, Σy = 226, Σxy = 11782, Σx² = 26290, Σy² = 5370
SS_xx = 1489.6, SS_yy = 262.4, SS_xy = 527.2

If we input the values of these variables in columns 1 and 2 of the MINITAB worksheet and use the print and describe commands, we have:

Regression

MTB > print c1 c2
Row   X    Y
1     32   16
2     36   17
3     55   26
4     47   24
5     38   22
6     60   21
7     66   32
8     44   18
9     70   30
10    50   20

MTB > desc c1 c2
Descriptive Statistics: X, Y

Variable   N    Mean    Median   TrMean   StDev   SE Mean
X          10   49.80   48.50    49.50    12.87   4.07
Y          10   22.60   21.50    22.25    5.40    1.71

Variable   Minimum   Maximum   Q1      Q3
X          32.00     70.00     37.50   61.50
Y          16.00     32.00     17.75   27.00

To see the relation between the two variables X and Y we plot the scatter plot between them using the MINITAB Plot command:

MTB > GStd.
MTB > Plot 'Y' 'X';
SUBC> Symbol '*'.

[Character scatter plot omitted; the points trend upward from lower-left to upper-right.]

The scatter plot indicates that there is a non-perfect but positive correlation between the two variables. This can be further confirmed by computing Pearson's linear correlation coefficient (r) using the MINITAB corr command:

Regression

MTB > corr c1 c2
Correlations: X, Y
Pearson correlation of X and Y = 0.843
P-Value = 0.002

The value of the correlation coefficient confirms that there is a strong positive linear correlation between X and Y. To estimate the linear regression line of Y on X (in which Y is the dependent variable and X the independent one) we will use the MINITAB command regress c2 1 c1.

To additionally store the values of the standardized errors in column C3 and the predicted values of Y in column C4, we use the command regress c2 1 c1 c3 c4. The following is the output of that command:

MTB > regr c2 1 c1 c3 c4

Regression Analysis: Y versus X
The regression equation is
Y = 4.97 + 0.354 X

Predictor    Coef     SE Coef   T      P
Constant     4.975    4.090     1.22   0.258
X            0.3539   0.07976   4.44   0.002

S = 3.078    R-Sq = 71.1%    R-Sq(adj) = 67.5%

Analysis of Variance
Source          DF   SS       MS       F       P
Regression      1    186.59   186.59   19.69   0.002
Residual Error  8    75.81    9.48
Total           9    262.40

Regression

MTB > print c1 c2 c3 c4

Row   X    Y    SE (st. resid.)   Yhat
1     32   16   -0.1176           16.3002
2     36   17   -0.2647           17.7159
3     55   26   0.5395            24.4404
4     47   24   0.8212            21.6090
5     38   22   1.2936            18.4237
6     60   21   -1.8575           26.2100
7     66   32   1.3999            28.3335
8     44   18   -0.8834           20.5473
9     70   30   0.1030            29.7492
10    50   20   -0.9145           22.6708

From the above results, the best linear estimated regression line using the least squares method is:

Ŷ = 4.975 + 0.3539·X

To test the validity of the model we will use the previous MINITAB output.

(1) Coefficient of determination (r²):

The value of r² is presented in the above result: r² = 0.711, or 71.1%. To show how this value is computed we will use the values of Y, Yhat and Ȳ. Through MINITAB we will compute r² using the formula:

r² = Σ(Ŷ - Ȳ)² / Σ(Y - Ȳ)² = RSS / TSS = 186.59 / 262.4 = 0.7111

or

r² = b̂·SS_xy / SS_yy = 0.3539 · 527.2 / 262.4 = 0.7111

or

r² = (SS_xy)² / (SS_xx · SS_yy) = (527.2)² / (1489.6 · 262.4) = 0.7111,  so r = sqrt(0.7111) = 0.8433
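A quick Python check (ours, not from the slides) that the three formulas give the same r²:

```python
import math

ss_xx, ss_yy, ss_xy = 1489.6, 262.4, 527.2
b_hat = ss_xy / ss_xx                        # 0.3539

r_sq_1 = 186.59 / 262.40                     # RSS / TSS from the ANOVA table
r_sq_2 = b_hat * ss_xy / ss_yy               # b_hat * SS_xy / SS_yy
r_sq_3 = ss_xy ** 2 / (ss_xx * ss_yy)        # SS_xy^2 / (SS_xx * SS_yy)
print(round(r_sq_1, 4), round(r_sq_2, 4), round(r_sq_3, 4))   # all about 0.7111

print(round(math.sqrt(r_sq_3), 3))           # 0.843 = Pearson's r
```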

Regression

MTB > let c5=c4-mean(c2)
MTB > name c5 'Yh-YB'
MTB > let c6=c5**2
MTB > name c6 '(Yh-YB)2'
MTB > let c7 = (c2-mean(c2))**2
MTB > name c7 '(Y-YB)2'
MTB > print c1 c2 c4 c5-c7

Row   X    Y    Yhat      Yh-YB      (Yh-YB)2   (Y-YB)2
1     32   16   16.3002   -6.29979   39.6873    43.56
2     36   17   17.7159   -4.88410   23.8545    31.36
3     55   26   24.4404   1.84039    3.3870     11.56
4     47   24   21.6090   -0.99098   0.9820     1.96
5     38   22   18.4237   -4.17626   17.4412    0.36
6     60   21   26.2100   3.60999    13.0320    2.56
7     66   32   28.3335   5.73351    32.8731    88.36
8     44   18   20.5473   -2.05274   4.2137     21.16
9     70   30   29.7492   7.14919    51.1110    54.76
10    50   20   22.6708   0.07078    0.0050     6.76

MTB > let k1=sum(c6)
MTB > let k2=sum(c7)
MTB > let k3=k1/k2
MTB > print k1-k3
K1   186.587
K2   262.400
K3   0.711078    (this is r²)

r² = 0.711, or 71.1%, means that the estimated regression model of Y on X (i.e. Ŷ = 4.975 + 0.3539·X) managed to explain 71.1% of the total variation in the dependent variable Y. The remaining 28.9% is not explained by the model, either because the model used is not adequate or because that part is a pure error term that cannot be explained.

We can also compute the same value directly from the results of the ANOVA table presented above, by dividing the regression sum of squares by the total sum of squares:

r² = 186.59 / 262.40 = 0.711

Regression

The same value is presented in the above output:

S = 3.078    R-Sq = 71.1%    R-Sq(adj) = 67.5%

It is worth mentioning here that for the simple regression model the coefficient of determination (r²) equals the square of Pearson's coefficient (r):

r² = (0.843)² = 0.711

Conclusion: 71.1% is a good and acceptable ratio, so the model is considered adequate enough.

(2) Testing the validity of the overall estimated model:

We will test the following two hypotheses using the results from the ANOVA presented above:

H0: The estimated model does not fit the data (the estimated model is not good or accepted).
H1: The estimated model fits the data (the estimated model is good and accepted).

The number of independent variables in the model is k = 1. The ANOVA results presented in the MINITAB output above are:

Analysis of Variance
Source          DF   SS       MS       F       P
Regression      1    186.59   186.59   19.69   0.002
Residual Error  8    75.81    9.48
Total           9    262.40

Regression

So, the test statistic for testing the above hypotheses equals:

F(calculated) = 19.69, which follows the F distribution with 1 and 8 DF. The P-value = 0.002.

The tabulated F value for α = 0.05 with 1 and 8 DF equals 5.318.

Conclusion: reject H0 (that the estimated model does not fit the data) with 95% confidence. This means that we have statistical evidence that the estimated model is accepted and is good for fitting and representing the linear relation between the two variables.

(3) Testing each component of the estimated model:

If the validity of the model (as a whole) is accepted in the second step above, we can go ahead and test the importance, and the necessity of keeping, each of the elements that form the estimated model. If the true model we are estimating is:

Y = a + b X + e

and the estimated model is:

Ŷ = â + b̂ X

then we will test whether a = 0 (i.e. whether we can eliminate a from the estimated model, in which case the estimated model becomes Ŷ = b̂ X).
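The tabulated value and the decision can be checked with scipy (an assumption; the slides use printed F-tables):

```python
from scipy import stats

f_calc = 19.69                        # from the ANOVA table
f_crit = stats.f.ppf(0.95, 1, 8)      # tabulated F for alpha = 0.05 with (1, 8) df
print(round(f_crit, 3))               # 5.318

print(f_calc > f_crit)                # True -> reject H0; the model fits the data
```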

Regression

Similarly, we will test whether b = 0 (i.e. whether we can eliminate b and the contribution of X, and estimate the model with the contribution of a alone; a here would be the mean of the dependent variable Y). The estimated model would then be Ŷ = â, or simply Ŷ = Ȳ.

For that, we will test the hypotheses:

H0: b = 0   versus   H1: b ≠ 0
H0: a = 0   versus   H1: a ≠ 0

For these tests we use the quantities computed earlier:

S = 3.078    R-Sq = 71.1%    R-Sq(adj) = 67.5%
SS_xx = 1489.6    SS_yy = 262.4    SS_xy = 527.2

Analysis of Variance
Source          DF   SS       MS       F       P
Regression      1    186.59   186.59   19.69   0.002
Residual Error  8    75.81    9.48
Total           9    262.40

Regression

The regression equation is
Y = 4.97 + 0.354 X

Predictor    Coef     SE Coef   T      P
Constant     4.975    4.090     1.22   0.258
X            0.3539   0.07976   4.44   0.002

Based on the above results, we can conclude the following:

We reject H0: b = 0 (T = 4.44, P = 0.002). This means that the contribution of X is not negligible and it should be included in the model.

The effect of a is not significant (T = 1.22, P = 0.258) and a can be eliminated from the model. Even so, if we leave it in the model it will not harm the model, as its role and contribution are not of significant importance.

Regression

Confidence intervals for model parameters:

As we have shown before, one can estimate the confidence interval for b using the equation:

P[ b̂ - t(α/2, n-k-1)·SE(b̂) ≤ b ≤ b̂ + t(α/2, n-k-1)·SE(b̂) ] = 1 - α

and the confidence interval for a:

P[ â - t(α/2, n-k-1)·SE(â) ≤ a ≤ â + t(α/2, n-k-1)·SE(â) ] = 1 - α

To estimate these two confidence intervals, we will use the MINITAB output, mainly the estimated values of the two parameters and their standard errors shown in the output below:

The regression equation is
Y = 4.97 + 0.354 X

Predictor    Coef     SE Coef   T      P
Constant     4.975    4.090     1.22   0.258
X            0.3539   0.07976   4.44   0.002

S = 3.078    R-Sq = 71.1%    R-Sq(adj) = 67.5%

For α = 0.05 the tabulated value of t(0.05/2, 8) = 2.306, and the estimated confidence interval for b is:

P[ 0.3539 - 2.306·0.07976 ≤ b ≤ 0.3539 + 2.306·0.07976 ] = 0.95
P[ 0.1699 ≤ b ≤ 0.5378 ] = 0.95

And the estimated confidence interval for a is:

P[ 4.975 - 2.306·4.090 ≤ a ≤ 4.975 + 2.306·4.090 ] = 0.95
P[ -4.4565 ≤ a ≤ 14.4065 ] = 0.95
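The same intervals in Python, with scipy supplying the t value (an assumption; the slides read it from a table):

```python
from scipy import stats

# Estimates and standard errors from the MINITAB output; 8 error df.
b_hat, se_b = 0.3539, 0.07976
a_hat, se_a = 4.975, 4.090

t = stats.t.ppf(1 - 0.05 / 2, 8)            # 2.306

print(b_hat - t * se_b, b_hat + t * se_b)   # about (0.1699, 0.5378)
print(a_hat - t * se_a, a_hat + t * se_a)   # about (-4.4565, 14.4065)
```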