1 Correlation and Regression Analysis




In this section we investigate the relationship between two continuous variables, such as height and weight, the concentration of an injected drug and heart rate, or the consumption level of some nutrient and weight gain. The tools used to explore this relationship are regression and correlation analysis. They can be used to find out whether the outcome of one variable depends on the value of the other, and to describe the nature and strength of the relationship between two continuous variables.

1.1 Scatterplot

The first step in investigating the relationship between two continuous variables is a scatterplot. Create a scatterplot for the two variables and evaluate the quality of the relationship.

Example: Does the number of years invested in schooling pay off in the job market? Apparently so: the better educated you are, the more money you will earn. The data in the following table give the median annual income of full-time workers age 25 or older by the number of years of schooling completed.

x = Years of schooling    y = Salary (dollars)
 8                        18,000
10                        20,500
12                        25,000
14                        28,100
16                        34,500
19                        39,700

Start off by creating a scatterplot for x and y.

The scatterplot shows a strong, positive, linear association between years and salary.

Questions to be answered with the help of the scatterplot:

1. Does a relationship exist that can be described by a straight line (that is, is there a linear relationship)?
2. Is there a relationship that is not linear?
3. If the scatterplot of the variables looks like a cloud, there is no relationship between the two variables and one would stop at this point.

1.2 Correlation

If the scatterplot shows a reasonably linear relationship (straight line), calculate Pearson's correlation coefficient to evaluate the strength of the linear relationship.

Notation: Let \((x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\) denote a sample of \((x, y)\) pairs.

Definition: Given the following sums of squares

\[ S_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n}, \qquad S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n}, \qquad S_{yy} = \sum y^2 - \frac{(\sum y)^2}{n} \]

Pearson's correlation coefficient can be calculated as

\[ r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} \]

Pearson's correlation coefficient (named after Karl Pearson, 1857-1936) is a number between -1 and 1 that measures the strength of a linear relationship between two continuous variables. The absolute value of the coefficient measures how closely the variables are related: the closer it is to 1, the closer the relationship. A correlation coefficient above 0.8 in absolute value indicates a strong correlation between the variables.

[Figure: Data patterns and Pearson's correlation coefficient]
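The shortcut formulas above translate almost line by line into code. The following is a minimal sketch in plain Python; the function names `sums_of_squares` and `pearson_r` and the small data set are made up for illustration and do not come from the text.

```python
from math import sqrt


def sums_of_squares(xs, ys):
    """Return (S_xy, S_xx, S_yy) computed with the shortcut formulas."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length with n >= 2")
    sum_x, sum_y = sum(xs), sum(ys)
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    s_yy = sum(y * y for y in ys) - sum_y ** 2 / n
    return s_xy, s_xx, s_yy


def pearson_r(xs, ys):
    """Pearson's r = S_xy / sqrt(S_xx * S_yy).

    Undefined (division by zero) if either variable is constant.
    """
    s_xy, s_xx, s_yy = sums_of_squares(xs, ys)
    return s_xy / sqrt(s_xx * s_yy)


# Illustrative data only: points lying exactly on a line
xs = [1, 2, 3, 4, 5]
print(pearson_r(xs, [2, 4, 6, 8, 10]))  # increasing line, r at the upper limit
print(pearson_r(xs, [10, 8, 6, 4, 2]))  # decreasing line, r at the lower limit
```

Points that lie exactly on an increasing line give r = 1, and points on a decreasing line give r = -1, which is why the sign of r is read as the direction of the trend.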

The sign of the correlation coefficient tells you the direction of the trend in the relationship. A positive (negative) coefficient means that one variable increases (decreases) when the other increases.

Continue Example: Calculate Pearson's correlation coefficient for years and salary. First find \(\bar{x} = 13.17\), \(s_x = 4.02\) and \(\bar{y} = 27633\), \(s_y = 8290\).

x_i = Years of schooling   y_i = Salary (dollars)   x_i y_i   x_i^2   y_i^2
 8                         18,000                   144,000      64     324,000,000
10                         20,500                   205,000     100     420,250,000
12                         25,000                   300,000     144     625,000,000
14                         28,100                   393,400     196     789,610,000
16                         34,500                   552,000     256   1,190,250,000
19                         39,700                   754,300     361   1,576,090,000

So that \(\sum x_i = 79\), \(\sum y_i = 165800\), \(\sum x_i y_i = 2348700\), \(\sum x_i^2 = 1121\), \(\sum y_i^2 = 4{,}925{,}200{,}000\).

This leads to

\[ S_{xy} = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n} = 2348700 - \frac{(79)(165800)}{6} = 165666.67 \]

\[ S_{xx} = \sum x_i^2 - \frac{(\sum x_i)^2}{n} = 1121 - \frac{(79)^2}{6} = 80.8333 \]

\[ S_{yy} = \sum y_i^2 - \frac{(\sum y_i)^2}{n} = 4{,}925{,}200{,}000 - \frac{(165800)^2}{6} = 343593333.333 \]

So that

\[ r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{165666.67}{\sqrt{80.8333 \cdot 343593333.333}} = 0.994. \]
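The hand calculation can be checked with a few lines of Python. This is only a sketch: the variable names are chosen for illustration, and the fifth x value is taken to be 16, which is what the stated sum of 79 implies.

```python
from math import sqrt

# Data from the schooling example (fifth x value assumed to be 16)
years = [8, 10, 12, 14, 16, 19]
salary = [18000, 20500, 25000, 28100, 34500, 39700]
n = len(years)

# Column sums of the working table
sum_x = sum(years)
sum_y = sum(salary)
sum_xy = sum(x * y for x, y in zip(years, salary))
sum_x2 = sum(x * x for x in years)
sum_y2 = sum(y * y for y in salary)

# Sums of squares and Pearson's correlation coefficient
s_xy = sum_xy - sum_x * sum_y / n
s_xx = sum_x2 - sum_x ** 2 / n
s_yy = sum_y2 - sum_y ** 2 / n
r = s_xy / sqrt(s_xx * s_yy)

print("sums:", sum_x, sum_y, sum_xy, sum_x2, sum_y2)
print("S_xy, S_xx, S_yy:", round(s_xy, 2), round(s_xx, 4), round(s_yy, 3))
print("r:", round(r, 3))
```

Because the code keeps full floating-point precision, its intermediate values can differ in the last digits from a calculation done with rounded numbers.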

The Pearson correlation coefficient of years of schooling and salary is r = 0.994. A correlation of 0.994 is very high and shows a strong, positive, linear association between years of schooling and salary.

1.3 Linear Regression

In the example we might want to predict the expected salary for different lengths of schooling, or calculate the increase in salary for every year of schooling. For this purpose we can do a regression analysis.

Terms and Definitions: If we want to use a variable x to draw conclusions concerning a variable y:

y is called the dependent or response variable.
x is called the independent, predictor, or explanatory variable.

If the relationship between two variables is linear, it can be summarized by a straight line. A straight line can be described by an equation:

\[ y = a + bx \]

a is called the intercept and b the slope of the equation. The slope is the amount by which y increases when x increases by 1 unit.

Fitting a straight line

Given data points \((x_i, y_i)\), a and b shall now be chosen in such a way that the corresponding line has the best fit for the given data. The criterion for best fit used in regression analysis is the sum of the squared differences between the data points and the line itself, that is, the y deviations. For data points \((x_i, y_i)\), \(1 \le i \le n\), this can be written as

\[ \min_{a,b} \sum_{i=1}^{n} \bigl(y_i - (a + b x_i)\bigr)^2 \]

In words: minimize the sum by choosing the appropriate parameters a and b. The resulting line is called the least squares line or sample regression line.

Once the problem is stated, it can be solved mathematically. The result is a pair of formulas for the best parameters:

\[ b = \frac{S_{xy}}{S_{xx}} \quad \text{and} \quad a = \bar{y} - b\bar{x}. \]

Write the equation of the least squares line as

\[ \hat{y} = a + bx \]

\(\hat{y}\) gives an estimate for y for a given value of x.
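The text states the formulas for a and b without deriving them. One standard route, sketched here as an assumption about how they are obtained, is to call the sum of squared deviations Q(a, b), set both partial derivatives to zero, and solve the two resulting equations.

```latex
\begin{aligned}
Q(a,b) &= \sum_{i=1}^{n} \bigl(y_i - (a + b x_i)\bigr)^2 \\[4pt]
\frac{\partial Q}{\partial a} &= -2 \sum_{i=1}^{n} (y_i - a - b x_i) = 0
  \;\Longrightarrow\; a = \bar{y} - b\,\bar{x} \\[4pt]
\frac{\partial Q}{\partial b} &= -2 \sum_{i=1}^{n} x_i (y_i - a - b x_i) = 0
  \;\Longrightarrow\; \sum x_i y_i - a \sum x_i - b \sum x_i^2 = 0 \\[4pt]
\text{substituting } a = \bar{y} - b\,\bar{x}: \quad
\sum x_i y_i - n\,\bar{x}\,\bar{y} &= b \Bigl( \sum x_i^2 - n\,\bar{x}^2 \Bigr)
  \;\Longrightarrow\; b = \frac{S_{xy}}{S_{xx}}
\end{aligned}
```

The last step uses \(n\bar{x}\bar{y} = (\sum x_i)(\sum y_i)/n\) and \(n\bar{x}^2 = (\sum x_i)^2/n\), so both sides match the definitions of \(S_{xy}\) and \(S_{xx}\).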

Continue Example: Since the salary and the years of schooling show such a strong linear relationship, and the salary can be viewed as depending on the years of schooling, do a linear regression analysis with the salary as the response variable and the years of schooling as the predictor variable. Calculate

\[ b = \frac{S_{xy}}{S_{xx}} = \frac{165666.67}{80.8333} \approx 2049.48 \]

and

\[ a = \bar{y} - b\bar{x} = 27633.33 - 2049.4845 \cdot \frac{79}{6} \approx 648.45 \]

Our result is the least squares line

\[ \hat{y} = a + bx = 648.45 + 2049.48\,x \]

The slope equals $2049.48, that is, for every additional year of schooling the average salary increases by this amount. To estimate the average salary after 18 years of schooling we calculate \(\hat{y}\) with x = 18:

\[ \hat{y} = 648.45 + 2049.4845 \cdot 18 \approx 37539.17 \text{ dollars} \]

Don't use the regression line for values outside the range of the observed values. The model has been shown to be valid only for the given range.

Properties of the regression or least squares line

1. The least squares line always passes through the balance point \((\bar{x}, \bar{y})\) of the data set.
2. The regression line of y on x should not be used to predict x, since it is not the line that minimizes the sum of squared x deviations.

Assessing the fit of a line

Once the least squares line has been obtained, it is natural to examine how effectively the line summarizes the relationship between x and y. The first question to answer is whether the line is an appropriate way to summarize the relationship. To answer it, we calculate the coefficient of determination \(r^2\).

Definition: The coefficient of determination for the regression of y on x is

\[ r^2 = \frac{S_{xy}^2}{S_{xx} S_{yy}} \]

the square of Pearson's correlation coefficient. It gives the proportion of variation in y that can be attributed to a linear relationship between x and y.

If \(r^2\) is greater than 0.8, the model has a good fit and can be used to calculate reliable predictions of the dependent variable from the independent variable.

In the example, the variable Years of Schooling explains \(r^2 = 98.8\%\) of the variation in the variable Salary, which is very high. The plot showed that the data points lie almost on a straight line.

Use the least squares line for predicting the annual salary of a person with 13 years of schooling.

\[ \hat{y}(13) = a + b \cdot 13 = 648.45 + 2049.4845 \cdot 13 \approx 27291.75 \text{ dollars} \]

This is just an estimate. From the other parts of the class we know that a confidence interval can be found that gives more information.
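The whole regression example can be reproduced with a short script. The sketch below uses plain Python; the names `fit_least_squares` and `predict` are hypothetical helpers introduced for illustration, and the fifth x value is assumed to be 16, as implied by the stated sum of 79. Results computed without rounding intermediate values can differ slightly from a hand calculation.

```python
def fit_least_squares(xs, ys):
    """Return (a, b, r_squared) for the least squares line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    s_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    s_yy = sum(y * y for y in ys) - sum(ys) ** 2 / n
    b = s_xy / s_xx               # slope
    a = y_bar - b * x_bar         # intercept
    r_squared = s_xy ** 2 / (s_xx * s_yy)
    return a, b, r_squared


# Data from the schooling example (fifth x value assumed to be 16)
years = [8, 10, 12, 14, 16, 19]
salary = [18000, 20500, 25000, 28100, 34500, 39700]

a, b, r2 = fit_least_squares(years, salary)


def predict(x):
    """Estimated salary for x years of schooling (use only for 8 <= x <= 19)."""
    return a + b * x


print(f"slope b = {b:.2f}, intercept a = {a:.2f}, r^2 = {r2:.3f}")
print(f"prediction at x = 13: {predict(13):.2f}")
print(f"prediction at x = 18: {predict(18):.2f}")

# Property 1: the line passes through the balance point (x_bar, y_bar)
x_bar = sum(years) / len(years)
y_bar = sum(salary) / len(salary)
print("passes through (x_bar, y_bar):", abs(predict(x_bar) - y_bar) < 1e-6)
```

Returning `r_squared` together with the coefficients keeps the fit and its quality measure in one place, so the 0.8 rule of thumb can be checked before any prediction is used.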