Data Transforms: Natural Logarithms and Square Roots



Parametric statistics are in general more powerful than non-parametric statistics, because the former work on ratio level data (real values) whereas the latter work on ranked or ordinal level data. Of course, non-parametric tests are extremely useful: sometimes our data are highly non-normal, so comparing means can be highly misleading and lead to erroneous results. Non-parametric statistics allow us to detect statistical patterning even when the data are heavily skewed one way or another. However, by converting the data values into relative ranks rather than working with the actual differences between the raw values, we lose a certain degree of power. The take-home point is that we use parametric statistics wherever possible, and resort to non-parametrics only when we are sure parametrics would be misleading.

Parametric statistics work on ratio level data, that is, data with a true zero value (where zero means absence of value) and consistent intervals between data points, independent of the values themselves. The obvious case in point is the real number line we use for everyday counting: {..., -4, -3, -2, -1, 0, 1, 2, 3, 4, ...}. However, these are not the only values that constitute ratio level data. Alternatives are logged data or square-rooted data, where the intervals between the data points are still consistent and a true zero value exists. The possibility of transforming data to an alternative ratio scale is particularly useful with skewed data, as in some cases the transformation will normalize the data distribution. If the transform normalizes the data, we can go ahead and continue to use parametric statistics in exactly the same way, and the results we get (p values etc.) are just as valid as before.

This works because both the natural logarithm and the square root are mathematical functions: they produce curves that reshape the data we want to transform in a particular way. The shapes of these curves normalize the data (if they work) by altering the shape of the distribution as the data pass through them. For example, look at the figures below.

Mathematically, taking the natural logarithm of a number is written in a couple of ways:

X' = ln(x), or X' = log_e(x)

And taking the square root is written:

X' = sqrt(x)
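To make the two transforms concrete, here is a minimal Python sketch (not part of the original handout; the simulated lognormal sample and variable names are assumptions for illustration) showing how each transform reduces the skewness of a right-skewed sample:

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
raw = rng.lognormal(mean=3.0, sigma=0.5, size=227)  # right-skewed, like group-size data

log_data = np.log(raw)     # natural log transform, X' = ln(x)
sqrt_data = np.sqrt(raw)   # square root transform, X' = sqrt(x)

# Skewness near zero suggests a more symmetric, normal-like distribution
for name, x in [("raw", raw), ("ln", log_data), ("sqrt", sqrt_data)]:
    print(f"{name:5s} skewness = {stats.skew(x):6.3f}")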

[Figure: ln(x) and sqrt(x) plotted against x from 0 to 100, showing an outlier far out on the X axis compressed on the Y axis; inset: the same curves for x from 0 to 5, showing ln(x) dropping below zero for x < 1.]

Looking at the top figure we can see that the presence of any outliers on the X axis will be reduced on the Y axis due to the shape of the curves. This effect is strongest with the log function, as opposed to the square root function. We can extrapolate: given the curve of the log function, the more extreme the outlier, the greater the effect of log transforming.

Looking at the inset figure we can see that logging values that are less than 1 on the X axis will result in negative log values; even though this may seem intuitively to be a problem, it is not. This is because ln(1) = 0, and therefore ln(x) < 0 for x < 1. In fact ln(0) is undefined, meaning that the log function approaches the Y axis asymptotically but never reaches it. A usual method of dealing with raw data where many of the values are less than 1 is to add an arbitrary constant to the entire data set and then log transform; in this way we avoid dealing with negative numbers.

What does all this mean? Transforming data sets works most effectively for distributions that are skewed to the right by the presence of outliers. However, transforming the data does not always work, as success ultimately depends on the specific values involved. In general, it is best to attempt a log transform first; if that doesn't work, try a square root transform, and if that doesn't work either, go with a non-parametric test (see the sketch below).
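That try-log-then-square-root-then-nonparametric recipe can be written out as a short workflow. A minimal sketch, assuming SciPy's Shapiro-Wilk test as the normality check (the handout itself uses MINITAB's Anderson-Darling test, for which SciPy reports critical values rather than a p value) and a hypothetical choose_transform helper:

import numpy as np
from scipy import stats

def choose_transform(x, alpha=0.05):
    """Try ln, then sqrt; report which (if either) yields normal-looking data.

    Hypothetical helper, not from the handout.
    """
    x = np.asarray(x, dtype=float)
    # Shift by a constant if any values are < 1, to avoid negative logs
    shift = 1.0 - x.min() if x.min() < 1 else 0.0
    candidates = [("ln", np.log(x + shift)), ("sqrt", np.sqrt(x + shift))]
    for name, transformed in candidates:
        if stats.shapiro(transformed).pvalue > alpha:  # fail to reject normality
            return name, transformed
    return "non-parametric", x  # neither transform normalized the data

name, data = choose_transform(np.random.default_rng(1).lognormal(3, 0.5, 200))
print("use:", name)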

MINITAB EXAMPLE

It is very easy to transform data in either EXCEL or MINITAB (I usually use EXCEL). In EXCEL the formula is simply =LN(X), where X is the cell holding your data, and you can click and drag the formula down a whole column of data. In MINITAB you can use the CALCULATOR function under CALC on the toolbar and store the transformed variables in a new column.

An example comes from Binford (2001), using data on hunter-gatherer group sizes (N=227); I won't bother to list all 227 data points. Reading the data into MINITAB, to look at the normality of the data we need to run the descriptive stats, do a normality test, and look at the distribution. For the descriptive stats, the MINITAB procedure is:

>STAT >BASIC STATISTICS >DESCRIPTIVE STATISTICS
>Double click on the column your data is entered in
>GRAPHS: choose BOXPLOT and GRAPHICAL SUMMARY
>OK >OK

The output reads:

Variable     N     Mean   Median   TrMean   StDev   SE Mean
GROUP1     227   17.436   16.000   16.358   9.508     0.631

Variable   Minimum   Maximum       Q1       Q3
GROUP1       5.600    70.000   11.000   19.700

With the two graphics:

[Graphical summary for GROUP1: Anderson-Darling A-Squared = 11.085, p = 0.000; Mean = 17.4357, StDev = 9.5080, Variance = 90.4022, Skewness = 2.37984, Kurtosis = 8.12061, N = 227; Min = 5.6000, Q1 = 11.0000, Median = 16.0000, Q3 = 19.7000, Max = 70.0000; 95% CIs: mean (16.1922, 18.6792), sigma (8.7064, 10.4734), median (15.0000, 17.0000). Histogram and boxplot of GROUP1 show a heavily right-skewed distribution with outliers.]

From the descriptive stats output we can see that the mean and median are different, especially considering the standard error. We also see from the graphical output that the boxplot shows a bunch of outliers and a heavily skewed distribution. The Anderson-Darling result on the graphical summary gives p=0.000, meaning that the data is very non-normal. Given the skewness of the data and the presence of outliers, log transforming is at least worth trying.
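For readers without MINITAB, the same checks can be reproduced in Python. A minimal sketch (the group_sizes array is a simulated stand-in for the 227 Binford values, which are not listed in the handout); note that SciPy's Anderson-Darling routine reports the A-squared statistic against critical values rather than a p value:

import numpy as np
from scipy import stats

# Placeholder data: the 227 Binford (2001) group sizes are not listed in the handout
group_sizes = np.random.default_rng(0).lognormal(mean=2.8, sigma=0.5, size=227)

# Descriptive statistics comparable to MINITAB's output
print(f"N       = {group_sizes.size}")
print(f"Mean    = {group_sizes.mean():.3f}")
print(f"Median  = {np.median(group_sizes):.3f}")
print(f"StDev   = {group_sizes.std(ddof=1):.3f}")
print(f"SE Mean = {stats.sem(group_sizes):.3f}")

# Anderson-Darling normality test: compare A-squared to the critical values
result = stats.anderson(group_sizes, dist='norm')
print("A-squared =", round(result.statistic, 3))
print("critical values:", result.critical_values)
print("significance levels (%):", result.significance_level)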

So, logging the data in EXCEL and transferring it into MINITAB, we run the same set of procedures, leading to the following outputs:

Variable      N     Mean   Median   TrMean    StDev   SE Mean
LN Group    227   2.7470   2.7726   2.7339   0.4567    0.0303

Variable   Minimum   Maximum       Q1       Q3
LN Group    1.7228    4.2485   2.3979   2.9806

[Graphical summary for LN Group: Anderson-Darling A-Squared = 1.387, p = 0.001; Mean = 2.74704, StDev = 0.45666, Variance = 0.208536, Skewness = 0.418019, Kurtosis = 0.539105, N = 227; Min = 1.72277, Q1 = 2.39790, Median = 2.77259, Q3 = 2.98062, Max = 4.24850; 95% CIs: mean (2.68731, 2.80676), sigma (0.41816, 0.50302), median (2.70805, 2.83321). Boxplot of LN Group still shows a few outliers.]

Well, while it was a good idea to try a log transform, and we see from the descriptive statistics that the mean and median are very close, the Anderson-Darling result still tells us that the data is non-normal. We see from the boxplot that we still have a few stubborn outliers. We have made the data roughly symmetrical, but unfortunately it is still non-normal: we have to go ahead and use non-parametric statistics from here if we want to use this data statistically.

Let's try a second example. We'll take some more data from Binford (2001), this time referring to the mean annual aggregation size of terrestrial hunter-gatherers (N=181). Following the same procedures as above, we find the following for the raw data:

Variable     N    Mean   Median   TrMean   StDev   SE Mean
GROUP2     181   40.13    36.00    38.86   15.66      1.16

Variable   Minimum   Maximum      Q1      Q3
GROUP2       19.50    105.00   29.50   50.00

And:

[Graphical summary for GROUP2: Anderson-Darling A-Squared = 4.348, p = 0.000; Mean = 40.1348, StDev = 15.6625, Variance = 245.313, Skewness = 1.23473, Kurtosis = 1.73967, N = 181; Min = 19.500, Q1 = 29.500, Median = 36.000, Q3 = 50.000, Max = 105.000; 95% CIs: mean (37.838, 42.432), sigma (14.198, 17.466), median (34.000, 40.000). Histogram and boxplot show a right-skewed distribution.]

We see that the median and mean are not equal, and the Anderson-Darling statistic is significant (p = 0.000, rejecting normality), so logging the data and putting it into MINITAB we get:

Variable      N     Mean   Median   TrMean    StDev   SE Mean
lngroup2    181   3.6248   3.5835   3.6147   0.3616    0.0269

Variable   Minimum   Maximum       Q1       Q3
lngroup2    2.9704    4.6540   3.3842   3.9120

And:

[Graphical summary for lngroup2: Anderson-Darling A-Squared = 0.931, p = 0.018; Mean = 3.62483, StDev = 0.36164, Variance = 0.130780, Skewness = 0.347091, Kurtosis = -0.46, N = 181; Min = 2.97041, Q1 = 3.38425, Median = 3.58352, Q3 = 3.91202, Max = 4.65396; 95% CIs: mean (3.57179, 3.67787), sigma (0.32782, 0.40328), median (3.52636, 3.68888). Boxplot shows no outliers.]

In this case we see that the mean and median are now very similar, and the boxplot shows no outliers. The Anderson-Darling test gives p of roughly 0.02; while this is below the usual α level of 0.05, so the test formally rejects normality, the deviation is slight. And here we come up against the subjectivity of statistics: it is up to the observer to decide whether this data is normal enough for parametric statistics. Most would argue that it is, given that, in reality, the Anderson-Darling test is very conservative, detecting the slightest deviation from normality, and that parametric statistics are remarkably robust, only being dramatically affected by highly non-normal data. I would accept the log-transformed data as close enough to normal to use parametric statistics.
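That borderline call can be made more concrete by comparing the A-squared statistic against the critical values at several significance levels rather than a single cutoff. A minimal sketch (the log_agg array is a simulated stand-in for the 181 log-transformed aggregation sizes, which are not listed in the handout):

import numpy as np
from scipy import stats

# Stand-in for the 181 log-transformed aggregation sizes (not listed in the handout)
log_agg = np.log(np.random.default_rng(7).lognormal(3.6, 0.36, 181))

result = stats.anderson(log_agg, dist='norm')
print(f"A-squared = {result.statistic:.3f}")
for crit, sl in zip(result.critical_values, result.significance_level):
    verdict = "reject normality" if result.statistic > crit else "cannot reject"
    print(f"  at {sl:4.1f}% significance: critical value {crit:.3f} -> {verdict}")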