1. Measuring association using correlation and regression




How to measure association I: Correlation

1. Measuring association using correlation and regression

We often would like to know how one variable, such as a mother's weight, is related to another variable, such as a baby's birthweight. We might be interested in the relationship between a patient's blood pressure and the amount of drug the patient takes per day. Suppose that we have data on blood pressure and drug dose such as in Table <BP-drug dose>.

Table <BP-drug dose>.

Drug dose (mg per day)    Blood pressure
5                         151
5                         145
5                         136
10                        137
10                        124
10                        124
15                        111
15                        105
20                        110
20                        98

Figure: Drug dose versus blood pressure, R = -0.92 (x-axis: drug dose in mg per day; y-axis: blood pressure).

Two questions we may ask are:

1. If I know how much drug the patient was given, how well can I predict their blood pressure? Put another way, we can ask how much of the variability in blood pressure can be explained by differences in the amount of drug the patient takes. To answer this question, we use correlation, which we discuss in this chapter.

2. For a unit change in the amount of drug given, how much change in blood pressure do we expect? To answer this question, we use regression, which we discuss in the next chapter.

As another example of where we use correlation or regression, suppose that we are interested in babies who are born with low birthweights, and want to examine factors that affect birthweight. We might have data on mother's weight and baby's birthweight as in Table <Birthweights>.

Table <Birthweights>.

Mother's weight    Child's birthweight
100                5.10
115                4.50
120                6.00
125                6.80
140                7.50
150                8.10
155                7.50
160                8.10
180                9.00
200                11.00

Figure: Child's weight vs. mother's weight, R = 0.96 (x-axis: mother's weight; y-axis: child's weight).

Again, two questions we may ask are:

1. If I know the mother's weight, how well can I predict the baby's weight? Put another way, we can ask how much of the variability in baby's weight can be explained by differences in the mother's weight. To answer this question, we use correlation.

2. For a unit change in the mother's weight (a one-pound increase), how much change in baby's weight do we expect? To answer this question, we use regression.

You may notice that all the variables we are considering (blood pressure, weight, dose) are measured on a continuous scale, and these are suitable for correlation and regression. If we want to measure association between categorical variables (such as male/female, Republican/Democrat, pass/fail, yes/no, and so on) we use statistics such as the chi-square test, which we'll look at in a later chapter.

We are going to focus mainly on the most widely used correlation measure, which is R, the Pearson linear correlation coefficient. Later on, we'll look at another correlation measure, the Spearman rank correlation coefficient, which is sometimes better to use than the Pearson measure.

2. Correlation can be positive, zero, or negative (ranging from 1.0 to -1.0)

Correlation can be positive, as in the birthweight example, or negative, as in the drug/blood pressure example. By definition, using the formula we'll see in the next section, the maximum (positive) correlation is 1.0. In the birthweight example, correlation was nearly perfect at R = 0.96. The minimum possible (negative) correlation is -1.0. In the drug versus blood pressure example, correlation was strongly negative with R = -0.92. Correlation can also be near zero, as shown in Table <Scrambled birthweights>, where we have scrambled the children's birthweights, and see R = 0.03.

Table <Scrambled birthweights>.

Mother's weight    Child's weight
100                8.10
115                7.50
120                6.00
125                6.80
140                7.50
150                8.10
155                11.00
160                4.50
180                5.10
200                9.00

Figure: Child's weight (scrambled) vs. mother's weight, R = 0.03 (x-axis: mother's weight; y-axis: child's weight).

3. How to calculate the Pearson linear correlation coefficient

We'll first define the Pearson linear correlation coefficient, and then look at how to interpret it. Recall the formula for variance from the chapter on descriptive statistics. Variance describes variability around the mean value.

$$\mathrm{Variance}(x) = \frac{\sum (x - \bar{x})^2}{N}$$

Covariance has a formula similar to that for the variance.

$$\mathrm{Covariance}(x, y) = \frac{\sum (x - \bar{x})(y - \bar{y})}{N}$$

Correlation uses the covariance of two variables. The correlation of two variables, x and y, is equal to the covariance of x and y divided by a number that makes correlation be between -1.0 and 1.0.

$$\mathrm{Correlation}(x, y) = R = \frac{\mathrm{Covariance}(x, y)}{\sqrt{\mathrm{Var}(x) \cdot \mathrm{Var}(y)}}$$

The term in the denominator, the square root of Var(x) * Var(y), just forces the correlation coefficient to be between -1.0 and 1.0; it doesn't affect how we interpret the correlation coefficient, so we won't look at it any further.
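The three formulas above translate almost line for line into code. The following is a minimal Python sketch, not a reference implementation: the function names (`mean`, `covariance`, `variance`, `pearson_r`) and variable names are chosen here for illustration, and the data are copied from Table <BP-drug dose> and Table <Birthweights>.

```python
from math import sqrt


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def covariance(xs, ys):
    """Covariance as defined in the text: sum of (x - mean_x)(y - mean_y), divided by N."""
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / len(xs)


def variance(xs):
    """Variance is the covariance of a variable with itself."""
    return covariance(xs, xs)


def pearson_r(xs, ys):
    """Covariance divided by the square root of Var(x) * Var(y)."""
    return covariance(xs, ys) / sqrt(variance(xs) * variance(ys))


# Data from Table <BP-drug dose>
dose = [5, 5, 5, 10, 10, 10, 15, 15, 20, 20]
blood_pressure = [151, 145, 136, 137, 124, 124, 111, 105, 110, 98]

# Data from Table <Birthweights>
mother_weight = [100, 115, 120, 125, 140, 150, 155, 160, 180, 200]
child_weight = [5.10, 4.50, 6.00, 6.80, 7.50, 8.10, 7.50, 8.10, 9.00, 11.00]

print(round(pearson_r(dose, blood_pressure), 2))        # -0.92
print(round(pearson_r(mother_weight, child_weight), 2))  # 0.96
```

Dividing by N in both the covariance and the variances means the N cancels in the ratio, so the same R results if a package divides by N - 1 instead.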

4. How to interpret the correlation coefficient

Let's look at what the correlation coefficient tells us. We'll start with just four points, one from each quadrant, as shown in Table <Points in 4 quadrants>. Quadrant 1 is labeled here as (1,1), quadrant 2 is labeled (-1,1), quadrant 3 is labeled (-1,-1), and quadrant 4 is labeled (1,-1). For any data set, we can force the mean to be at (0,0) by subtracting the mean of all the x values from the x value for each point and the mean of all the y values from the y value for each point. For these "mean corrected" values, the mean is now at (0,0), and every point must fall into one of the four quadrants relative to the mean.

Table <Points in 4 quadrants>.

x value    y value
1          1
-1         1
1          -1
-1         -1

Figure <Points in 4 quadrants>: the four points (1,1), (-1,1), (-1,-1), and (1,-1), one in each quadrant around the origin.

Now, let's look again at the formula for covariance.

$$\mathrm{Covariance}(x, y) = \frac{\sum (x - \bar{x})(y - \bar{y})}{N}$$

We've specified that we subtract the means, so the new mean value of x is zero and the new mean value of y is zero, and the formula for covariance then simplifies as follows.

$$\mathrm{Covariance}(x, y) = \frac{\sum (x)(y)}{N}$$

Consider a point in quadrant 1 in Figure <Points in 4 quadrants>, such as the point (1,1). In the formula for covariance, we put the point (1,1) into the term (x)(y), and we get 1 * 1 = 1, which is a positive number. For the term (x)(y), every point in quadrant 1 will give a positive value, because we are multiplying two positive numbers.

Next, consider a point in quadrant 3, such as (-1,-1). In the formula for covariance, we put the point (-1,-1) into the term (x)(y), which gives us -1 * -1 = 1, which is again a positive number. For the term (x)(y), every point in quadrant 3 will give a positive value, because we are multiplying two negative numbers.

Points in quadrants 2 and 4 will give us negative values for the term (x)(y). In quadrant 2, we see that -1 * 1 = -1, and in quadrant 4, we see that 1 * -1 = -1.

If all the points in our data set fall into quadrant 1 or quadrant 3 with respect to the mean, then every point will contribute a positive value to the covariance, which will in turn give us a large positive correlation. In contrast, if all the points in our data set fall into quadrant 2 or quadrant 4 with respect to the mean, then every point will contribute a negative value to the covariance, which will in turn give us a large negative correlation. If points are scattered across all four quadrants, we will get a mixture of positive and negative terms that tend to cancel each other out, giving a correlation near zero.
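As a small worked check of the cancellation argument, we can put the four points from Table <Points in 4 quadrants> (one per quadrant, already centered on a mean of (0,0)) into the simplified covariance formula. The two positive products from quadrants 1 and 3 are exactly offset by the two negative products from quadrants 2 and 4:

```latex
\mathrm{Covariance}(x, y)
  = \frac{(1)(1) + (-1)(1) + (-1)(-1) + (1)(-1)}{4}
  = \frac{1 - 1 + 1 - 1}{4}
  = 0
```

With a covariance of zero, the correlation for these four points is also zero.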

5. Potential problems with Pearson linear correlation

The Pearson linear correlation coefficient can be greatly affected by a single observation. In particular, a single point (an outlier) that falls a long way from other points in the x-y plane can greatly increase or decrease the Pearson R. For example, let's look again at the data on drug dose versus blood pressure, but suppose that the last patient, instead of having a blood pressure measurement of 98, has a value of 150, as in Table <Outlier in BP-drug dose>. For these data, the Pearson correlation coefficient is R = -0.47, which is a large change from the R = -0.92 we had before changing this single point. When we see an individual point that is so influential in determining the value of our statistic, we should consider the possibility that there was an error in the measurement, and make sure that we are not being misled.

Table <Outlier in BP-drug dose>.

Drug dose (mg per day)    Blood pressure
5                         151
5                         145
5                         136
10                        137
10                        124
10                        124
15                        111
15                        105
20                        110
20                        150

Figure: An outlier in blood pressure measurement at (20,150) (x-axis: drug dose; y-axis: blood pressure).
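The effect of changing that one measurement can be checked directly. This is a minimal Python sketch under the same definitions as the text; the helper name `pearson_r` and the variable names are illustrative choices, not part of the original chapter.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson R from sums of mean-corrected products (the N terms cancel)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


dose = [5, 5, 5, 10, 10, 10, 15, 15, 20, 20]
bp_original = [151, 145, 136, 137, 124, 124, 111, 105, 110, 98]

# Replace only the last patient's reading (98) with the outlying value 150.
bp_outlier = bp_original[:-1] + [150]

r_original = pearson_r(dose, bp_original)
r_outlier = pearson_r(dose, bp_outlier)
print(f"original: {r_original:.2f}, with outlier: {r_outlier:.2f}")
# original: -0.92, with outlier: -0.47
```

Only one of the ten y values differs between the two lists, yet the magnitude of R falls by roughly half.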

A single outlier can also make a weak correlation appear much stronger. For the data in Table <No-outlier>, the correlation coefficient is quite small, R = -0.05.

Table <No-outlier>.

x value    y value
1          4
1          1
2          3
2          3
3          1
3          4
4          3
4          2

Figure: No outlier, R = -0.05 (x-axis: X; y-axis: Y).

However, if we add a single observation at (10, 10), as shown in Table <Single-outlier> and Figure <Single-outlier>, we change the correlation coefficient from R = -0.05 to R = 0.82.

Table <Single-outlier>.

x value    y value
1          4
1          1
2          3
2          3
3          1
3          4
4          3
4          2
10         10

Figure <Single-outlier>: Single outlier at (10,10), R = 0.8 (x-axis: X; y-axis: Y).

So we see that the Pearson linear correlation coefficient may be very sensitive to a single point. In such situations, we may choose to use an alternative association measure, the Spearman rank correlation coefficient, which we'll look at shortly.

The Pearson linear correlation coefficient is not good at detecting genuine but non-linear associations between variables. Suppose that we have values such as those in Table <Non-linear relation> and Figure <Non-linear relation>. Although there is clearly a relationship between x and y (here, y is the square of x), the correlation coefficient is R = 0.0. This example shows that it is always a good idea to graph your data, and not to rely completely on a statistic.

Table <Non-linear relation>.

x value    y value
-5         25
-4         16
-3         9
-2         4
-1         1
0          0
1          1
2          4
3          9
4          16
5          25

Figure <Non-linear relation>: Non-linear association with R = 0 (x-axis: X; y-axis: Y).
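To see why R comes out at exactly zero for these data, we can work through the covariance by hand. The sketch below assumes only what the table shows: the eleven x values run from -5 to 5, so their mean is 0, and each y value equals x squared, so the mean of y is 110 / 11 = 10.

```latex
\sum (x - \bar{x})(y - \bar{y})
  = \sum x \,(x^2 - 10)
  = \sum x^3 - 10 \sum x
  = 0 - 10 \cdot 0
  = 0
```

Both sums vanish because every positive x is matched by a negative x of the same size. The covariance is therefore zero, and so is R, even though y is completely determined by x.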

6. Spearman rank correlation: an alternative to Pearson correlation

We saw that the Pearson correlation coefficient may be greatly affected by single influential points (outliers). Sometimes we would like to have a measure of association that is not so sensitive to single points, and at those times we can use Spearman rank correlation. Recall that, when we calculate the mean of a set of numbers, a single extreme value can greatly increase the mean. But when we calculate the median, which is based on ranks, extreme values have very little influence. The same idea applies to Pearson and Spearman correlation. Pearson uses the actual values of the observations, while Spearman uses only the ranks of the observations, and thus, like the median, is not much affected by outliers.

Most statistics packages will calculate either Pearson or Spearman, but Excel will only do Pearson. The easiest way to get Spearman is to replace each observation by its rank, and then calculate the Pearson coefficient using the ranks.

For the outlier examples, recall that the Pearson correlation is R = -0.05 excluding the outlier and R = 0.82 including the outlier. For these data, the Spearman rank correlation is R_s = -0.10 excluding the outlier and R_s = 0.24 including the outlier. Let's do the calculations.

Here's the data excluding the single outlier. I've assigned the rank to each value, with ties given the average rank.

x value    x rank    y value    y rank
1          1.5       4          7.5
1          1.5       1          1.5
2          3.5       3          5
2          3.5       3          5
3          5.5       1          1.5
3          5.5       4          7.5
4          7.5       3          5
4          7.5       2          3

We can extract the ranks, and calculate the Pearson coefficient for the ranks, getting R_s = -0.10 excluding the outlier.

x rank    y rank
1.5       7.5
1.5       1.5
3.5       5
3.5       5
5.5       1.5
5.5       7.5
7.5       5
7.5       3

Here's the data with the single outlier included. Again, I've assigned the rank to each value, with ties given the average rank.

x value    x rank    y value    y rank
1          1.5       4          7.5
1          1.5       1          1.5
2          3.5       3          5
2          3.5       3          5
3          5.5       1          1.5
3          5.5       4          7.5
4          7.5       3          5
4          7.5       2          3
10         9         10         9

We can extract the ranks, and calculate the Pearson coefficient for the ranks, getting R_s = 0.24 with the outlier included.

x rank    y rank
1.5       7.5
1.5       1.5
3.5       5
3.5       5
5.5       1.5
5.5       7.5
7.5       5
7.5       3
9         9

The Spearman coefficient is much less affected by the single influential point than is the Pearson correlation coefficient.
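The rank-then-Pearson recipe described in this section can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a definitive implementation: the function names (`average_ranks`, `spearman_r`, `pearson_r`) are hypothetical, ties receive the average rank as in the tables above, and the data are the eight (x, y) pairs from the rank tables, with the outlier (10, 10) appended for the second case.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson R from sums of mean-corrected products."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_r(xs, ys):
    """Spearman rank correlation: Pearson R computed on the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))


x = [1, 1, 2, 2, 3, 3, 4, 4]
y = [4, 1, 3, 3, 1, 4, 3, 2]
x_out = x + [10]  # add the single outlier at (10, 10)
y_out = y + [10]

print(average_ranks(y))                      # [7.5, 1.5, 5.0, 5.0, 1.5, 7.5, 5.0, 3.0]
print(round(spearman_r(x, y), 2))            # -0.1
print(round(spearman_r(x_out, y_out), 2))    # 0.24
# Pearson on the raw values, for comparison with the Spearman results above.
print(round(pearson_r(x, y), 2), round(pearson_r(x_out, y_out), 2))
```

Because the outlier can only ever take the top rank (9 out of 9), its pull on the rank-based coefficient is bounded, no matter how far away the raw value lies.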