PROPERTIES OF THE SAMPLE CORRELATION OF THE BIVARIATE LOGNORMAL DISTRIBUTION



Similar documents
Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics

Module 3: Correlation and Covariance

Simple linear regression

How To Test For Significance On A Data Set

Assumptions. Assumptions of linear models. Boxplot. Data exploration. Apply to response variable. Apply to error terms from linear model

Permutation Tests for Comparing Two Populations

Fairfield Public Schools

Data exploration with Microsoft Excel: analysing more than one variable

Geostatistics Exploratory Analysis

Univariate Regression

Data Preparation and Statistical Displays

DESCRIPTIVE STATISTICS. The purpose of statistics is to condense raw data to make it easier to answer specific questions; test hypotheses.

How To Write A Data Analysis

business statistics using Excel OXFORD UNIVERSITY PRESS Glyn Davis & Branko Pecar

Lecture 2: Descriptive Statistics and Exploratory Data Analysis

Diagrams and Graphs of Statistical Data

2. Simple Linear Regression

Statistics. Measurement. Scales of Measurement 7/18/2012

Homework 11. Part 1. Name: Score: / null

Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model

Lecture 2. Summarizing the Sample

START Selected Topics in Assurance

UNDERSTANDING THE TWO-WAY ANOVA

Additional sources Compilation of sources:

Descriptive Statistics. Purpose of descriptive statistics Frequency distributions Measures of central tendency Measures of dispersion

Week 1. Exploratory Data Analysis

Lecture Notes Module 1

The Dummy s Guide to Data Analysis Using SPSS

Enhancing the Teaching of Statistics: Portfolio Theory, an Application of Statistics in Finance

UNDERSTANDING THE INDEPENDENT-SAMPLES t TEST

Section 3 Part 1. Relationships between two numerical variables

Curriculum Map Statistics and Probability Honors (348) Saugus High School Saugus Public Schools

A logistic approximation to the cumulative normal distribution

Session 7 Bivariate Data and Analysis

Basic Statistics and Data Analysis for Health Researchers from Foreign Countries

Summary of Formulas and Concepts. Descriptive Statistics (Ch. 1-4)

Descriptive Statistics

Descriptive Statistics

INTERPRETING THE ONE-WAY ANALYSIS OF VARIANCE (ANOVA)

SPSS Tests for Versions 9 to 13

Correlation. What Is Correlation? Perfect Correlation. Perfect Correlation. Greg C Elvers

Point Biserial Correlation Tests

Correlation key concepts:

Correlation Coefficient The correlation coefficient is a summary statistic that describes the linear relationship between two numerical variables 2

MTH 140 Statistics Videos

South Carolina College- and Career-Ready (SCCCR) Probability and Statistics

II. DISTRIBUTIONS distribution normal distribution. standard scores

UNIVERSITY OF NAIROBI

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012

Introduction to Quantitative Methods

Vertical Alignment Colorado Academic Standards 6 th - 7 th - 8 th

Finance Theory

Chapter 1 Introduction. 1.1 Introduction

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )

Normality Testing in Excel

Chapter G08 Nonparametric Statistics

Normal Probability Plots and Tests for Normality


LAGUARDIA COMMUNITY COLLEGE CITY UNIVERSITY OF NEW YORK DEPARTMENT OF MATHEMATICS, ENGINEERING, AND COMPUTER SCIENCE

2. Descriptive statistics in EViews

What Does the Normal Distribution Sound Like?

BNG 202 Biomechanics Lab. Descriptive statistics and probability distributions I

STATS8: Introduction to Biostatistics. Data Exploration. Babak Shahbaba Department of Statistics, UCI

STAT355 - Probability & Statistics

Probability and Statistics Vocabulary List (Definitions for Middle School Teachers)

The Dangers of Using Correlation to Measure Dependence

Multivariate Normal Distribution

Probability Calculator

Describing, Exploring, and Comparing Data

Risk and return (1) Class 9 Financial Management,

DOES IT PAY TO HAVE FAT TAILS? EXAMINING KURTOSIS AND THE CROSS-SECTION OF STOCK RETURNS

EXPLORATORY DATA ANALYSIS: GETTING TO KNOW YOUR DATA

Algebra 1 Course Information

An introduction to using Microsoft Excel for quantitative data analysis

Algebra Academic Content Standards Grade Eight and Grade Nine Ohio. Grade Eight. Number, Number Sense and Operations Standard

A Correlation of. to the. South Carolina Data Analysis and Probability Standards

A Primer on Mathematical Statistics and Univariate Distributions; The Normal Distribution; The GLM with the Normal Distribution

A full analysis example Multiple correlations Partial correlations

Means, standard deviations and. and standard errors

Chapter 3 RANDOM VARIATE GENERATION

We are often interested in the relationship between two variables. Do people with more years of full-time education earn higher salaries?

ESTIMATING THE DISTRIBUTION OF DEMAND USING BOUNDED SALES DATA

STAT 350 Practice Final Exam Solution (Spring 2015)

Java Modules for Time Series Analysis

How To Check For Differences In The One Way Anova

Econometrics Simple Linear Regression

Interpretation of Somers D under four simple models

Confidence Intervals for Spearman s Rank Correlation

Exploratory data analysis (Chapter 2) Fall 2011

Section 14 Simple Linear Regression: Introduction to Least Squares Regression

Data analysis process

CALCULATIONS & STATISTICS

Standard Deviation Estimator

Chapter 13 Introduction to Linear Regression and Correlation Analysis

Quantitative Methods for Finance

Investment Statistics: Definitions & Formulas

Module 5: Multiple Regression Analysis

The right edge of the box is the third quartile, Q 3, which is the median of the data values above the median. Maximum Median

Foundation of Quantitative Data Analysis

Transcription:

PROPERTIES OF THE SAMPLE CORRELATION OF THE BIVARIATE LOGNORMAL DISTRIBUTION Chin-Diew Lai, Department of Statistics, Massey University, New Zealand John C W Rayner, School of Mathematics and Applied Statistics, University of Wollongong,, Australia T P Hutchinson, School of Behavioural Sciences, Macquarie University, Australia Most statistics students know that the sample correlation coefficient R is used to estimate the population correlation coefficient ρ. If the pair (X, Y) has a bivariate normal distribution, this would not cause any trouble. However, if the marginals are nonnormal, particularly if they have high skewness and kurtosis, the estimated value from a sample may be quite different from the population correlation coefficient ρ. Our simulation analysis indicates that for the bivariate lognormal, the bias in estimating ρ can be very large and it can be substantially reduced only after a large number (3-4 million) of observations. This example could serve as an exercise for the statistics students to realise some of the pitfalls in using the sample correlation coefficient to estimate ρ. INTRODUCTION The Pearson product-moment correlation coefficient ρ is a measure of linear dependence between a pair of random variables (X,Y). The sample (product-moment) correlation coefficient R, derived from n observations of the pair (X, Y), is normally used to estimate ρ. Historically, R has been studied and applied extensively. The distribution of R has been thoroughly reviewed in Chapter 32 of Johnson et al. (1995). While the properties of R for the bivariate normal are clearly understood, the same cannot be said about the nonnormal bivariate populations. Cook (1951), Gayen (1951) and Nakagawa and Niki (1992) obtained expressions for the first four moments of R in terms of the cumulants and cross-cumulants of the parent population. However, the size of the bias and the variance of R are still rather hazy for general bivariate nonnormal populations when ρ 0 since the cross-cumulants are difficult to quantify in general. Although various specific nonnormal populations have been investigated, the messages on the robustness of R are conflicting. The bivariate lognormal distribution is very well known. It arises from transforming the marginals of the bivariate normal distribution by the exponential function. It has several applications in the literature. For example, Mielke et al.(1977) use this bivariate distribution for analyses of treatment (clouding) effects on measurements (precipitation amounts) when appropriate covariates (related controlled 312

area measurements) are available. The correlation coefficient r for the bivariate lognormal population can be obtained easily and it has been given in several books, e.g., pp. 20 of Johnson and Kotz (1972). Results of our simulations indicate that the sample correlation R for the bivariate lognormal with skewed marginals and ρ 0 has a large bias and a large variance for smaller sample sizes. One requires several million observations in order to reduce the bias and variance significantly. When we did these simulations, we got a surprise and that we still do not fully understand what is happening. Various histograms of the sample correlation coefficient R based on our simulation results are plotted below. Tables of summary statistics are also provided. The paper concludes with a cautionary note for students and teachers of statistics regarding the sample correlation as an estimate for ρ. ELEMENTARY PITFALLS IN INTERPRETING THE SAMPLE CORRELATION Let R denote the sample correlation which is an estimate of r and let r be an observed value of R. The sample correlation R is only a summary statistic; there are several pitfalls in interpreting it and therefore it is worthwhile emphasising these points. One should not confuse correlation with causation. r = 0.0 does not mean that there is no relationship between two marginals. A scatterplot might reveal a clear (though nonlinear ) relationship. Even if the correlation is close to 1, the relationship may be obviously nonlinear. Many different-looking sets of points can all produce the same value of r (see Chambers et al., 1983, section 4.2, for eight scatterplots, all having r =.7). The well know Anscombe (1973) data has four scattered plots. All have r =.816. The value of r calculated from a small sample may be totally misleading if not viewed in the context of its likely sampling error. There are several other measures of statistical dependence. These include rank correlation coefficients. Students likely have some experience in simulating sample statistics, e.g., sample mean from normal, sample mean from Cauchy and R from bivariate normal. They would taste a disaster in the second example. Here, we wish to alert them another potential difficulty when sampling R from nonnormal bivariate populations. SAMPLE CORRELATIONS OF THE BIVARIATE LOGNORMAL 313

Let (X, Y) denote a pair of bivariate lognormal random variables with correlation coefficient r; derived from the bivariate normal with marginal means ζ 1, ζ 2, standard deviations σ 1, σ 2, and correlation coefficient ρ N It is well known that if we start with a bivariate normal distribution, and apply any nonlinear transformations to the marginals, Pearson s product moment correlation coefficient is smaller (in absolute magnitude) in the resulting distribution than the original bivariate normal one (of course, rank correlation coefficients are unaltered provided the transformations are monotonic). The expression for the correlation coefficient of the bivariate lognormal expression can be found in page 20 of Johnson and Kotz (1972): exp(ρ ρ= N σ 1 σ 2 ) 1 (1) { exp(σ 2 1 ) 1}exp(σ { 2 2 ) 1} Eq (1) indicates that ρ is independent of ζ 1 andζ 2, we therefore set them both to zero for convenience sake. We note that the σ ' s measure the skewness of the lognormal marginals :α 3 = β 1 = ( ω 1) 1 / 2 (ω + 2),ω = exp(σ 2 ); see pp. 212 of Johnson et al. (1994). Clearly, ρ increases as ρ N increases. For example, for σ 1 = 1and σ 2 = 2, we have 0, if ρ N = 0 ρ= 0.179, if ρ N =0.5 0.666, if ρ N =1 (2) We note the skewness coefficients for σ 1 = 1and σ 2 = 2 are 6.18 and 429, respectively. The correlation ρ for the bivariate lognormal may not be very meaningful if one or both of the marginals are skewed. Consider the case for which σ 1 = 1 and σ 2 = 4. By setting ρ N = 1 and ρ N =1 in (2), respectively, we work out the lower and upper limits for correlation between X and Y to be -0.000251 and 0.0312. As Romano and Siegel (1986, section 4.22) say, Such a result raises a serious question in practice about how to interpret the correlation between lognormal random variables. Clearly, small correlations may be very misleading because a correlation of 0.01372 indicates, in fact, X and Y are perfectly functionally (but nonlinearly) related. The distribution of R, when (X, Y) has a bivariate normal distribution is well known and it has been well documented in Chapter 32 of Johnson and et al. (1995). The bias ( E(R) ρ) and the variance of R are both of O(n 1 ) and therefore ρ can be successfully estimated from samples or simulations. For nonnormal populations, the 314

moments of R may be obtained from the bivariate Edgeworth expansion which involves cross-cumulant ratios of the parent population. SIMULATION RESULTS In order to study the sampling distribution of R and assess its performance as an estimator of ρ, we carried out a large-scale simulation exercise. In our simulation procedure, we use the following steps: Step 1: Generate n observations from each of the pair of independent unit normals (U, V). Step 2: Obtain the bivariate normal (X *,Y * ) through the relationship: X * = σ 1 U +ζ 1, Y * = σ 2 ρ N U + σ 2 (1 ρ N 2 ) 1 / 2 V +ζ 2 (3) Step 3: Set X = exp(x * ) andy = exp(y * ). Then (X,Y) has a bivariate lognormal distribution with correlation coefficient given by (2). As we are only interested in the correlation coefficient, we set both ζ 1 and ζ 2 to zero. All the simulations and plottings are carried out using MINITAB commands: (i) Simulate U C1, V C2, (ii)c1 σ 1 C3, σ 2 ρ N C1 + σ 2 1 ρ N 2 C2 C4 (iii) exp (C3) C5,exp (C4) C6 and (iv) Corr C5- C6 M1 (Here, C stands for column whereas M stands for matrix ). Three cases are considered: (i) ρ N = 1, (ii) ρ N = 0.5 and (iii) ρ N = 0, each with σ 1 = 1and σ 2 = 2; and their corresponding correlation coefficients of the bivariate lognormal population are (i) ρ = 0.666, (ii) ρ = 0.179 and (iii) ρ = 0, respectively. The following histograms are plotted for the three cases considered: Fig 1: 50 Samples of 4 Million (with rho = 1) Fig 2: 100 Samples of 3 million (rho =0.5) Fig 3: 100 Samples of 3 Million (rho = 0) 20 70 40 60 Frequency 10 50 40 Frquency 30 30 Frequency 20 20 10 10 0 0 0.580.605.630.655.680.705.730.09.11.13.15.17.19.21 -.0010 -.0005.0000.0005.0010.0015 Correlation Coefficient Correlation Coefficient Correlation Coefficient Table 1. Summary of Simulations ρ Sample Size No of Samples Mean Standard Deviation Fig 1 0.666 4 Million 50 0.68578 0.029979 315

Fig 2 0.179 3 Million 100 0.18349 0.015889 Fig 3 0 3 Million 100 0.00006 0.000526 The plots displayed above indicate that the distributions of R are skewed to the left, and, except for the case ρ = 0, they have quite large variances even for such large sample sizes. We have also calculated the asymptotic expansions for both the bias and the variance of R, and found, except when ρ = 0, the leading coefficients in each case to be very large. So there is sound theory behind the simulation demonstrations. For the bivariate normal, the bias in R as an estimate of ρ is approximately ρ(1 ρ 2 )n 1 /2 and that var(r) ( 1-ρ 2 ) 2 n 1 (pp. 556, Johnson et al., 1995). So for these values of r (i.e., r = 0.666, 0.179 and 0), we would expect the standard errors to be 0.0003, 0.0006, and 0.0005, respectively. In order to reassure the readers that 50 or 100 samples is sufficient, we now let the number of simulations, k say, varies, but fix the sample size to n=100,000 of case (ii). The following histograms indicate that the shape changes very little as k varies; all are skewed to the left. Figure 4. Histograms of R with ρ N = 0.5, n =100,000 (ρ = 0.179, σ 1 =1,σ 2 =2) 0.12 0.16 0.20 0.24 0.2 0.3 0.1 0.0 0.1 0.2 0.3 (a) k = 50 (b) k = 100 (b) k = 500 (c) k = 1000 Table 2. Summary Statistics from 4 values of k (n = 100,000) k Mean Median StDev Min Max Q1 Q3 50 0.20644 0.20875 0.01978 0.12860 0.25300 0.19606 0.21871 100 0.19779 0.20372 0.03118 0.11145 0.25910 0.18012 0.22023 500 0.19582 0.20131 0.03460 0.04924 0.28928 0.18159 0.21855 1000 0.19931 0.20418 0.02965 0.04472 0.30495 0.18370 0.21913 On the other hand, if we fix k = 100 but allow n to varies from n = 50 to n = 1000000, we then have the following box-plots together with a table of summary statistics. Figure 5. Box-plots of various n, k=100 Table 3. Summary (ρ = 0.179, σ 1 =1,σ 2 =2) 316

1.0 0.5 0.0 Sample Size Mean Median StDev 50 0.3379 0.3289 0.2237 100 0.3258 0.2928 0.1506 1000 0.2613 0.2421 0.0960 10000 0.2195 0.2140 0.0501 100000 0.1979 0.2015 0.0266 1000000 0.1849 0.1905 0.0214 50 100 1000 10000 100000 1000000 The last column of the preceding table indicates that the stand error is not proportional to n 1/ 2 as one would probably expect should this be a well-behaved bivariate distribution.. If the skewness of the marginals is reduced, R seems to become more robust. For example, consider the case when σ 1 = σ 2 =.5 and ρ N = 0.5; so that ρ =.4688 (recall, σ s measure the skewness of the lognormals). 100 samples of (a) 100,000 and (b) 1million observations were simulated, and their results are now summarized as follows: Figure 6. Histograms based on 100 Samples (with σ 1 = σ 2 =.5 and ρ =.4688) Mean 0.469 Std Dev 0.003 Minimum 0.461 0.460 0.465 0.470 0.475 1st Quartile Median 3rd Quartile Maximum 0.467 0.469 0.471 0.475 0.4665 0.4685 0.4695 0.4715 (a) Sample size : 100,000 (b) Sample size:1 million Table 4: Summary Statistics of Two Different Sample Sizes (k =100) Mean StDev Minimum 1st Quartile Median 3rd Quartile Maximum Fig 4a 0.4689 0.0027 0.4611 0.4672 0.4689 0.4710 0.4747 Fig 4b 0.4689 0.0009 0.4667 0.4682 0.4688 0.4695 0.4714 We see that the normals fit the above data well. Indeed, formal goodness of fit tests show almost perfect fit. It seems that the skewness of the marginals affect the skewness of R. CONCLUSION 317

Many non-normal bivariate distributions are of concern in engineering, geology, and meteorology, as well as in psychology. Often, it is necessary to estimate the correlation coefficient from the sample correlation R. In most cases the sample sizes concerned are in the order of hundreds instead of thousands or millions for obvious reasons. So the bias may be quite significant in some cases, especially if ρ is not close to zero. By using an easily understood example we have illustrated the problem in estimating the population correlation and thereby we lend support to the claim of the nonrobustness of R. To our knowledge, most elementary texts do not discuss or highlight this important issue. It is our view that undergraduates in statistics should be adequately cautioned about this problem and be encouraged to check for the underlying assumptions on the populations before reporting their findings on the correlation. REFERENCE Anscombe, F. J. (1973). Graphs in Statistical Analysis. The American Statistician, 27, 17-20 Chambers, J. M., Cleveland, W S, Kleenex, B and Tukey, P A (1983). Graphical Methods for Data Analysis.. Belmont, California: Wadsworth, and Boston: Duxbury. Cook, M. B. (1951) Bi-variate k-statistics and Cumulants of Their Joint Sampling Distribution. Biometrika, 38, 179-195 Gayen, A. K. (1951) The Frequency Distribution of the Product-Moment Correlation Coefficient in Random Samples of Any Size from Non-Normal Universe. Biometrika, 38, 219-247. Johnson, N. L. and Kotz, S. and Balakrishnan, N. (1994). Distributions in Statistics: Continuous Univariate Distributions, Vol 1, Second Edition. New York, Wiley. Johnson, N. L. and Kotz, S. and Balakrishnan, N. (1995). Distributions in Statistics: Continuous Univariate Distributions, Vol 2, Second Edition. New York, Wiley. Johnson, N. L. and Kotz, S. (1972). Distributions in Statistics: Continuous Multivariate Distributions.. New York, Wiley. Mielke, P. W., Williams, J. S. and Wu, S. C. (1977). Covariance Analysis Techniques Based on Bivariate Log-Normal Distribution with Weather Modification Applications. Journal of Applied Meteorology, 16, 183-187. Nakagawa, S. and Niki, N. (1992). Distribution of Sample Correlation Coefficient for Nonnormal Populations. Journal of Japanese Society of Computational Statistics, 5, 1-19. Romano, J. P. and Siegel A. F. (1986). Counter Examples in Probability and Statistics.. Monterey, California: Wadsworth and Brooks/Cole. 318