Introduction to Regression and Data Analysis




Statlab Workshop: Introduction to Regression and Data Analysis, with Dan Campbell and Sherlock Campbell, October 28, 2008

I. The basics

A. Types of variables

Your variables may take several forms, and it will be important later that you are aware of, and understand, the nature of your variables. The following variables are those which you are most likely to encounter in your research.

Categorical variables. These include anything that is qualitative or otherwise not amenable to actual quantification. There are a few subclasses of such variables:

Dummy variables take only two possible values, 0 and 1. They signify conceptual opposites: war vs. peace, fixed exchange rate vs. floating exchange rate, etc.

Nominal variables can be coded with any number of non-negative integers. They signify conceptual categories that have no inherent relationship to one another: red vs. green vs. black, Christian vs. Jewish vs. Muslim, etc.

Ordinal variables are like nominal variables, except that there is an ordered relationship among the categories: no vs. maybe vs. yes, etc.

Numerical variables. These describe data that can be readily quantified. As with categorical variables, there are a few relevant subclasses:

Continuous variables can appear as fractions; in principle, they can take an infinite number of values. Examples include temperature, GDP, etc.

Discrete variables can only take the form of whole numbers. Most often, these appear as count variables, signifying the number of times that something occurred: the number of firms invested in a country, the number of hate crimes committed in a county, etc.

A useful starting point is to get a handle on your variables. How many are there? Are they qualitative or quantitative? If they are quantitative, are they discrete or continuous? Another useful practice is to explore how your data are distributed. Do your variables all cluster around the same value, or do you have a large amount of variation? Are they normally distributed?
Plots are extremely useful at this introductory stage of data analysis: histograms for single variables, scatter plots for pairs of continuous variables, and box-and-whisker plots for a continuous variable vs. a categorical variable. This preliminary data analysis will help you decide upon the appropriate tool for your data.
http://www.yale.edu/statlab
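As a minimal sketch of this preliminary step (the sample values below are hypothetical, and a plotting library would produce the actual histograms), Python's standard statistics module can compute the basic summaries:

```python
import statistics

# Hypothetical continuous variable: annual GDP growth rates (percent)
growth = [2.1, 3.4, 1.8, 2.9, 3.1, 2.5, 4.0, 2.2]

center = statistics.mean(growth)    # do the values cluster around one point?
spread = statistics.stdev(growth)   # how much variation is there?
middle = statistics.median(growth)  # robust alternative to the mean

print(center, round(spread, 3), middle)
```

If the mean and median differ markedly, or the spread is large relative to the center, that is a first hint that the distribution is skewed or highly variable and worth plotting.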

If you are interested in whether one variable differs among possible groups, for instance, then regression isn't necessarily the best way to answer that question. Often you can find your answer by doing a t-test or an ANOVA. The flow chart shows you the types of questions you should ask yourself to determine what type of analysis you should perform. Regression will be the focus of this workshop, because it is very commonly used and is quite versatile, but if you need information or assistance with any other type of analysis, the consultants at the Statlab are here to help.

II. Regression: An introduction

A. What is regression?

Regression is a statistical technique for determining the linear relationship between two or more variables. It is primarily used for prediction and causal inference. In its simplest (bivariate) form, regression shows the relationship between one independent variable (X) and a dependent variable (Y), as in the formula below:

Y = β0 + β1X + u

The magnitude and direction of that relation are given by the slope parameter (β1), and the status of the dependent variable when the independent variable is absent is given by the intercept parameter (β0). An error term (u) captures the amount of variation not predicted by the slope and intercept terms. The coefficient of determination (R²) shows how well the model fits the data. Regression thus shows us how variation in one variable co-occurs with variation in another. What regression cannot show is causation; causation is only demonstrated analytically, through substantive theory. For example, a regression with shoe size as an independent variable and foot size as a dependent variable would show a very high R² and highly significant parameter estimates, but we should not conclude that larger shoe size causes larger foot size. All that the mathematics can tell us is whether or not they are correlated, and if so, by how much.
It is important to recognize that regression analysis is fundamentally different from ascertaining the correlations among different variables. Correlation determines the strength of the relationship between variables, while regression attempts to describe that relationship in more detail.

B. The linear regression model (LRM)

The simple (or bivariate) LRM is designed to study the relationship between a pair of variables that appear in a data set. The multiple LRM is designed to study the relationship between one variable and several other variables. In both cases, the sample is considered a random sample from some population. The two variables, X and Y, are two measured outcomes for each observation in the data set. For

example, let's say that we had data on the prices of homes on sale and the actual number of sales of homes:

Price (thousands of $)   Sales of new homes
          x                      y
         160                    126
         180                    103
         200                     82
         220                     75
         240                     82
         260                     40
         280                     20

These data are found in the file house sales.jmp. We want to know the relationship between X and Y. Well, what do our data look like? We will use the program JMP (pronounced "jump") for our analyses today. Start JMP, look in the JMP Starter window, and click on the Open Data Table button. Navigate to the file and open it. In the JMP Starter, click on Basic in the category list on the left. Now click on Bivariate in the lower section of the window. Click on y in the left window and then click the Y, Response button. Put x in the X, Regressor box, as illustrated above. Now click OK to display the scatterplot.
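Before fitting a line, it is worth quantifying the association numerically. As a sketch in plain Python (outside JMP), the Pearson correlation for the table above can be computed from the corrected sums of squares:

```python
import math

x = [160, 180, 200, 220, 240, 260, 280]  # price (thousands of $)
y = [126, 103, 82, 75, 82, 40, 20]       # sales of new homes
n = len(x)

# Corrected sums of squares and cross-products
sxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
sxx = sum(a * a for a in x) - sum(x) ** 2 / n
syy = sum(b * b for b in y) - sum(y) ** 2 / n

r = sxy / math.sqrt(sxx * syy)
print(round(r, 3))  # -0.954: a strong negative association
```

The strongly negative r already suggests a downward-sloping line, which the scatterplot then confirms visually.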

We need to specify the population regression function, the model we specify to study the relationship between X and Y. This can be written in any number of ways, but we will specify it as:

Y = β1 + β2X + u

where Y is an observed random variable (also called the response variable or the left-hand-side variable); X is an observed non-random or conditioning variable (also called the predictor or right-hand-side variable); β1 is an unknown population parameter, known as the constant or intercept term; β2 is an unknown population parameter, known as the coefficient or slope parameter; and u is an unobserved random variable, known as the error or disturbance term.

Once we have specified our model, we can accomplish two things:

Estimation: How do we get good estimates of β1 and β2? What assumptions about the LRM make a given estimator a good one?

Inference: What can we infer about β1 and β2 from sample information? That is, how do we form confidence intervals for β1 and β2 and/or test hypotheses about them?

The answers to these questions depend upon the assumptions that the linear regression model makes about the variables. The Ordinary Least Squares (OLS) regression procedure will compute the values of the parameters β1 and β2 (the intercept and slope) that best fit the observations.

Obviously, no straight line can run exactly through all of the points. The vertical distance between each observation and the line that fits best (the regression line) is called the residual or error. The OLS procedure calculates our parameter values by minimizing the sum of the squared errors for all observations. Why OLS? It has some very nice mathematical properties, and it is compatible with normally distributed errors, a very common situation in practice. However, it requires certain assumptions to be valid.

C. Assumptions of the linear regression model

1. The proposed linear model is the correct model. Violations: omitted variables; nonlinear effects of X on Y (e.g., area of circle = π·radius²).

2. The mean of the error term (i.e., the unobservable variable) does not depend on the observed X variables.

3. The error terms are uncorrelated with each other and exhibit constant variance that does not depend on the observed X variables. Violations: variance that increases as X or Y increases (heteroskedasticity); errors that are positive or negative in bunches (serial correlation).

4. No independent variable exactly predicts another. Violation: including monthly precipitation for all 12 months and annual precipitation in the same model.

5. Independent variables are either random or fixed in repeated sampling.

If the five assumptions listed above are met, then the Gauss-Markov Theorem states that the Ordinary Least Squares estimator of the coefficients of the model is the Best Linear Unbiased Estimator of the effect of X on Y. Essentially this means that it is the most accurate estimate of the effect of X on Y.

III. Deriving OLS estimators

The point of the regression equation is to find the best-fitting line relating the variables to one another. In this enterprise, we wish to minimize the sum of the squared deviations (residuals) from this line. OLS will do this better than any other process as long as these conditions are met.
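The "least squares" idea can be checked concretely: any other line has a larger sum of squared residuals than the OLS line. A sketch on the house-sales data, where the OLS coefficients are the estimates computed in the next section and the alternative lines are chosen arbitrarily for illustration:

```python
x = [160, 180, 200, 220, 240, 260, 280]
y = [126, 103, 82, 75, 82, 40, 20]

def sse(m, b):
    """Sum of squared vertical deviations from the line y = m*x + b."""
    return sum((yi - (m * xi + b)) ** 2 for xi, yi in zip(x, y))

ols = sse(-0.7928571, 249.85714)   # the OLS line (derived below)
alt1 = sse(-0.9, 249.85714)        # steeper slope, same intercept
alt2 = sse(-0.7928571, 260.0)      # same slope, shifted intercept

print(ols < alt1 and ols < alt2)  # True: OLS minimizes the squared error
```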

The best-fit line associated with the n points (x1, y1), (x2, y2), ..., (xn, yn) has the form

y = mx + b

where

slope = m = (n·Σxy − Σx·Σy) / (n·Σx² − (Σx)²)
intercept = b = (Σy − m·Σx) / n

So we can take our data from above and substitute in to find our parameters:

Price (thousands of $)   Sales of new homes
        x          y          xy         x²
       160        126       20,160     25,600
       180        103       18,540     32,400
       200         82       16,400     40,000
       220         75       16,500     48,400
       240         82       19,680     57,600
       260         40       10,400     67,600
       280         20        5,600     78,400
Sum  1,540        528      107,280    350,000

Slope = m = (7·107,280 − 1,540·528) / (7·350,000 − 1,540²) = −62,160 / 78,400 ≈ −0.7929
Intercept = b = (528 − (−0.7929)·1,540) / 7 ≈ 249.86
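The substitution above can be checked in a few lines of Python; the sums reproduce the table, and the slope and intercept match the fitted line JMP reports:

```python
x = [160, 180, 200, 220, 240, 260, 280]
y = [126, 103, 82, 75, 82, 40, 20]
n = len(x)

sum_x, sum_y = sum(x), sum(y)              # 1540, 528
sum_xy = sum(a * b for a, b in zip(x, y))  # 107280
sum_x2 = sum(a * a for a in x)             # 350000

m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b = (sum_y - m * sum_x) / n

print(round(m, 7), round(b, 5))  # -0.7928571 249.85714
```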

To have JMP calculate this for you, click on the small red triangle just to the left of Bivariate Fit of y By x. From the drop-down menu, choose Fit Line to produce the output. Thus our least squares line is

y = 249.85714 − 0.7928571x
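Ahead of interpreting the JMP output, the key summary statistics for this line (R² and the slope's t-ratio) can be reproduced from first principles; a sketch, with the values computed from the data rather than quoted from JMP:

```python
import math

x = [160, 180, 200, 220, 240, 260, 280]
y = [126, 103, 82, 75, 82, 40, 20]
n = len(x)
m, b = -0.7928571, 249.85714  # the least squares line from above

resid = [yi - (m * xi + b) for xi, yi in zip(x, y)]
sse = sum(e * e for e in resid)                # residual sum of squares
sst = sum((yi - sum(y) / n) ** 2 for yi in y)  # total sum of squares

r2 = 1 - sse / sst                             # proportion of variation explained
sxx = sum(xi * xi for xi in x) - sum(x) ** 2 / n
se_m = math.sqrt(sse / (n - 2) / sxx)          # standard error of the slope
t = m / se_m                                   # t-ratio for the slope

print(round(r2, 3), round(t, 2))  # 0.911 -7.14
```

The large |t| is consistent with the very small p-value discussed next, and R² ≈ 0.91 says the line explains about 91% of the variation in sales.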

IV. Interpreting data

Coefficient (Parameter Estimate): for every one-unit ($1,000) increase in the price of a house, about 0.79 fewer houses are sold (the estimate is −0.793).

Std Error: if the regression were performed repeatedly on different data sets (that contained the same variables), this would represent the standard deviation of the estimated coefficients.

t-ratio: the coefficient divided by the standard error, which tells us how large the coefficient is relative to how much it varies in repeated sampling. If the coefficient varies a lot in repeated sampling, then its t-statistic will be smaller, and if it varies little in repeated sampling, then its t-statistic will be larger.

Prob>|t|: the p-value is the result of the test of the following null hypothesis: in repeated sampling, the mean of the estimated coefficient is zero. E.g., if p = 0.001, the probability of observing an estimate of β that is at least as extreme as the observed estimate is 0.001, if the true value of β is zero. In general, a p-value less than some threshold, like 0.05 or 0.01, will mean that the coefficient is statistically significant.

Confidence interval: the 95% confidence interval is the set of values that lie within 1.96 standard errors of the estimated coefficient.

RSquare: this statistic represents the proportion of variation in the dependent variable that is explained by the model (the remainder represents the proportion left in the error). It is also the square of the correlation coefficient (hence the name).

V. Multivariate regression models

Now let's do it again for a multivariate regression, where we add in the number of red cars in the neighborhood as an additional predictor of housing sales:

Price (thousands of $)   Sales of new homes   Number of red cars
          x                      y                   cars
         160                    126                    0
         180                    103                    9
         200                     82                   19
         220                     75                    5
         240                     82                   25
         260                     40                    1
         280                     20                   20

Our general regression equation this time is:

Y = β1 + β2X1 + β3X2 + u

To calculate a multiple regression, go to the JMP Starter window and select the Model category. Click on the top button, Fit Model, for the following dialog: Click on y in the left window, then on the Y button. Now click on x and then Add in the Construct Model Effects section. Repeat with cars, then click on Run Model to produce the results window. (I've hidden a few results to save space. Click on the blue triangles to toggle between displayed and hidden.)
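The same multivariate fit can be reproduced outside JMP by solving the OLS normal equations (XᵀX)β = Xᵀy directly. A sketch using only the standard library (hand-rolled Gaussian elimination on the 3×3 system); the coefficients come out very close to the JMP estimates:

```python
x1 = [160, 180, 200, 220, 240, 260, 280]  # price (thousands of $)
x2 = [0, 9, 19, 5, 25, 1, 20]             # number of red cars
y = [126, 103, 82, 75, 82, 40, 20]        # sales of new homes

# Design-matrix columns: intercept, x1, x2
cols = [[1.0] * len(y), [float(v) for v in x1], [float(v) for v in x2]]
A = [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]  # X'X
v = [sum(a * b for a, b in zip(ci, y)) for ci in cols]                    # X'y

# Gaussian elimination with partial pivoting, then back-substitution
n = len(A)
for k in range(n):
    p = max(range(k, n), key=lambda i: abs(A[i][k]))
    A[k], A[p] = A[p], A[k]
    v[k], v[p] = v[p], v[k]
    for i in range(k + 1, n):
        f = A[i][k] / A[k][k]
        for j in range(k, n):
            A[i][j] -= f * A[k][j]
        v[i] -= f * v[k]
beta = [0.0] * n
for i in reversed(range(n)):
    beta[i] = (v[i] - sum(A[i][j] * beta[j] for j in range(i + 1, n))) / A[i][i]

print([round(c, 4) for c in beta])  # close to 252.8596, -0.8249, 0.3593
```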

The regression equation we get from this model is

y = 252.85965 − 0.824935x1 + 0.3592748x2

In this case, our regression model says that the number of houses sold is a linear function of both the house price and the number of red cars in the neighborhood. The coefficient of each predictor variable is the effect of that variable for a given value of the other. Let's look at our t-ratio on red cars. We can see that the t is quite low, and its associated p-value is much larger than 0.05. Thus, we cannot reject the null hypothesis that this coefficient is equal to 0. Since a coefficient of 0 essentially erases the contribution of that variable in the regression equation, we might be better off leaving it out of our model entirely. Be careful, though: there is a difference between statistical significance and actual significance. Statistical significance tells us how sure we are about the coefficient of the model, based on the data. Actual significance tells us the importance of the variable: a coefficient for X could be very close to 0 with very high statistical significance, but that could mean it contributes very little to Y. Conversely, a variable could be very important to the model, but its statistical significance could be quite low because you don't have enough data. What happened to our RSquare in this case? It increased from the previous model, which is good, right? We explained more of the variation in Y by adding this variable. However, you can ALWAYS increase the R² by adding more and more variables, but this increase might be so small that increasing the complexity of the model isn't worth it. Testing the significance of each coefficient through p-values is the way to decide whether a predictor variable should be included or not.

VI. When linear regression does not work

Linear regression will fail to give good results when any of the assumptions are not valid. Some violations are more serious problems for linear regression than others, though.
Consider an instance where you have more independent variables than observations. In such a case, in order to run linear regression, you must simply gather more observations. Similarly, in a case where you have two variables that are very highly correlated (say, GDP per capita and GDP), you may omit one of these variables from your regression equation. If the expected value of your disturbance term is not zero, then there is another independent variable that is systematically determining your dependent variable. Finally, in a case where your theory indicates that you need a number of independent variables, you may not have access to all of them. In this case, to run the linear regression, you must either find alternate measures of your independent variables or find another way to investigate your research question.
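The highly-correlated-predictors problem can be pushed to its limit numerically: when one predictor is an exact linear function of another, the matrix OLS must invert becomes singular and no unique solution exists. A minimal sketch with made-up numbers:

```python
# x2 is exactly twice x1: an exact linear dependence, like including
# annual precipitation together with the monthly values that sum to it
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [2.0, 4.0, 6.0, 8.0]

# 2x2 Gram matrix [[sum x1^2, sum x1*x2], [sum x1*x2, sum x2^2]]
a = sum(v * v for v in x1)
b = sum(v * w for v, w in zip(x1, x2))
d = sum(w * w for w in x2)

det = a * d - b * b
print(det)  # 0.0 -- the normal equations have no unique solution
```

With a zero determinant the normal equations cannot be solved, which is why one of the two redundant variables must be dropped.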

Other violations of the regression assumptions are addressed below. In all of these cases, be aware that Ordinary Least Squares regression, as we have discussed it today, gives biased estimates of parameters and/or standard errors.

Non-linear relationship between X and Y: use Non-linear Least Squares or Maximum Likelihood Estimation.
Endogeneity: use Two-Stage Least Squares or a lagged dependent variable.
Autoregression: use a time-series estimator (e.g., ARIMA).
Serial correlation: use Generalized Least Squares.
Heteroskedasticity: use Generalized Least Squares.

How will you know if the assumptions are violated? You usually will NOT find the answer by looking at your regression results alone. Consider the four scatter plots below:

Clearly they show widely different relationships between x and y. However, in each of these four cases, the regression results are exactly the same: the same best-fit line, the same t-statistics, the same R².
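Part of the reason the summary output cannot flag these problems is that OLS residuals sum to zero and are uncorrelated with x by construction, no matter how badly the model fits. A sketch on the house-sales data demonstrating these built-in properties:

```python
x = [160, 180, 200, 220, 240, 260, 280]
y = [126, 103, 82, 75, 82, 40, 20]
n = len(x)

# Refit by the textbook formulas
m = (n * sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y)) / \
    (n * sum(a * a for a in x) - sum(x) ** 2)
b = (sum(y) - m * sum(x)) / n

resid = [yi - (m * xi + b) for xi, yi in zip(x, y)]

# Both hold for ANY data fed to OLS, so neither can reveal a violation
print(abs(sum(resid)) < 1e-9)                              # residuals sum to 0
print(abs(sum(e * xi for e, xi in zip(resid, x))) < 1e-6)  # uncorrelated with x
```

Because these diagnostics are mechanically zero, you must plot the residuals against x and look at the pattern: a curve or a fan shape signals a violated assumption.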

In the first case, the assumptions are satisfied, and linear regression does what we would expect it to. In the second case, we clearly have a non-linear (in fact, a quadratic) relationship. In the third and fourth cases, we have heteroskedastic errors: the errors are much greater for some values of x than for others. The point is, the assumptions are crucial, because they determine how well the regression model fits the data. If you don't look closely at your data and make sure a linear regression is appropriate for them, then the regression results will be meaningless.

VII. Additional resources

If you believe that the nature of your data will force you to use a more sophisticated estimation technique than Ordinary Least Squares, you should first consult the resources listed on the Statlab web page at http://statlab.stat.yale.edu/links/index.jsp. You may also find that Google searches for these regression techniques often turn up simple tutorials for the methods that you must use, complete with help on estimation, programming, and interpretation. Finally, you should also be aware that the Statlab consultants are available throughout regular Statlab hours to answer your questions (see the areas of expertise for Statlab consultants at http://statlab.stat.yale.edu/people/showconsultants.jsp).

Figure 9. Choosing an appropriate statistical procedure (source: Corston and Colman, 2000). The original is a decision chart that starts from the purpose of the analysis and leads to a procedure:

Summarizing univariate data: descriptive statistics (mean, standard deviation, variance, etc.).
Testing significance of differences, one group: one-sample t-test (mean compared to a specified value) or chi-square goodness-of-fit test (frequencies compared to a theoretical distribution).
Testing significance of differences, two groups: independent-samples t-test (interval data) or Mann-Whitney U test (ordinal data) for independent samples; paired-samples t-test or Wilcoxon matched-pairs test for related samples.
Testing significance of differences, multiple groups: one-way ANOVA (one independent variable), multifactorial ANOVA (multiple independent variables), or repeated-measures ANOVA (related samples).
Exploring relationships, two variables: chi-square test for association (frequencies); Pearson's correlation coefficient (interval data) or Spearman's rho (ordinal data) for degree of relationship.
Exploring relationships, multiple variables: multiple regression (effect of 2+ predictors on a dependent variable).