Regression III: Advanced Methods



Lecture 4: Transformations
Regression III: Advanced Methods
William G. Jacoby, Michigan State University

Goals of the lecture
- The Ladder of Roots and Powers
- Changing the shape of distributions
- Transforming for linearity

Why transform data?
1. In some instances a transformation can help us better examine a distribution.
2. Many statistical models are based on the mean and thus require that the mean is an appropriate measure of central tendency (i.e., that the distribution is approximately normal).
3. Linear least-squares regression assumes that the relationship between two variables is linear. Often we can straighten a nonlinear relationship by transforming one or both of the variables.

Often transformations will fix problematic distributions so that we can use least-squares regression. When transformations fail to remedy these problems, another option is nonparametric regression, which makes fewer assumptions about the data.

Power transformations for quantitative variables
Although there are an infinite number of functions f(X) that could be used to transform a distribution, in practice only a relatively small number are regularly used. For quantitative variables one can usually rely on the family of powers and roots, X -> X^p:
- When p is negative, the transformation is an inverse power: X^(-1) = 1/X, X^(-2) = 1/X^2, and so on.
- When p is a fraction, the transformation represents a root: X^(1/2) = sqrt(X), X^(1/3) = cube root of X.
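As a rough sketch (not from the lecture), the family of powers and roots can be collected into a small helper; the function name `power_transform` and the convention of returning log X at p = 0 (anticipating the next slide) are my own choices for illustration:

```python
import math

def power_transform(x, p):
    # Family of powers and roots from the slide:
    #   p > 0        : x ** p   (e.g. p = 2 gives the square)
    #   p < 0        : an inverse power, e.g. p = -1 gives 1/x
    #   fractional p : a root,  e.g. p = 1/2 gives the square root
    #   p = 0        : log(x) by convention (see the log-transformation slide)
    if x <= 0:
        raise ValueError("power transformations need positive values")
    if p == 0:
        return math.log(x)
    return x ** p  # Python handles negative and fractional exponents directly

print(power_transform(4, 0.5))  # square root -> 2.0
print(power_transform(4, -1))   # inverse power -> 0.25
```

Note how p = 0.5 and p = -1 both descend the ladder from p = 1, compressing large values relative to small ones.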

Log transformations
The power transformation X^0 should not be used literally, because it changes all values to 1 (in other words, it makes the variable a constant). Instead we treat p = 0 as shorthand for the log transformation, log_e X, where e ≈ 2.718 is the base of the natural logarithms. In practice most people prefer log_10 X because it is easier to interpret: increasing log_10 X by 1 is the same as multiplying X by 10. In terms of result it matters little which base is used, because changing base is equivalent to multiplying the transformed values by a constant.
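Both claims above can be checked in a couple of lines; the value of x here is arbitrary, used only to verify the identities:

```python
import math

x = 250.0

# Increasing log10(X) by 1 is the same as multiplying X by 10:
assert math.isclose(10 ** (math.log10(x) + 1), 10 * x)

# Changing base only rescales the transformed values by a constant:
# log10(X) = ln(X) / ln(10) for every positive X
assert math.isclose(math.log10(x), math.log(x) / math.log(10))

print("both identities hold for x =", x)
```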

Cautions: Power Transformations (1)
- Descending the ladder of powers and roots compresses the large values of X and spreads out the small values.
- As p moves away from 1 in either direction, the transformation becomes more powerful.
- Power transformations are sensible ONLY when all the X values are POSITIVE. Some transformations (e.g., log and square root) are undefined for 0 and negative numbers; other power transformations will not be monotone over negative values, thus changing the order of the data. If some values are not positive, this can be solved by adding a start value (a constant) to every observation before transforming.

Cautions: Power Transformations (2)
Power transformations are only effective if the ratio of the largest data value to the smallest is large; if the ratio is very close to 1, the transformation will have little effect. General rule: if the ratio is less than about 5, a negative start value should be considered to increase the ratio.
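A quick illustration of the ratio rule with invented numbers (these data are not from the lecture): a negative start value of -1000 raises the largest-to-smallest ratio from about 1.03 to 8, giving a power transformation something to work with:

```python
data = [1005, 1010, 1020, 1040]  # invented values clustered far from 0

ratio = max(data) / min(data)    # ~1.03: a log or power barely changes shape

shifted = [x - 1000 for x in data]            # apply a start value of -1000
shifted_ratio = max(shifted) / min(shifted)   # 40 / 5 = 8

print(round(ratio, 3), shifted_ratio)
```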

Transforming Skewed Distributions
The example below shows how a log_10 transformation can fix a positive skew. The density estimate for average income of occupations from the Canadian Prestige data is shown on top; the bottom shows the density estimate of the transformed income.

[Figure: probability density of income (0 to 30000, right-skewed) above the probability density of log.inc (2.5 to 4.5, roughly symmetric)]

Transforming Nonlinearity: When is it possible?
An important use of transformations is to straighten the relationship between two variables. This is possible only when the nonlinear relationship is simple and monotone:
- Simple implies that the curvature does not change direction; there is only one bend in the curve.
- Monotone implies that the slope of the curve is always positive or always negative.
In the accompanying figure, (a) can be transformed; (b) and (c) cannot.

The Bulging Rule for transformations
Mosteller and Tukey's bulging rule provides a starting point for choosing transformations to correct nonlinearity. Normally we should try to transform the explanatory variables rather than the response variable Y, since a transformation of Y will affect the relationship of Y with all of the Xs, not just the one with the nonlinear relationship. If, however, the response variable is highly skewed, it makes sense to transform it instead.

Transforming relationships: Income and Infant Mortality (1)
- Leinhardt's data from the car library.
- Robust local regression in the plot shows serious nonlinearity.
- The bulging rule suggests that both Y and X can be transformed down the ladder of powers.
- I tried taking the log of income only, but significant nonlinearity still remained. In the end, I took the log_10 of both income and infant mortality.

[Figure: scatterplot of infant (0 to 600) against income (0 to 5000) with a robust local regression line]

Income and Infant Mortality (2)
A linear model fits well here. Since both variables are transformed by log_10, the coefficient is easy to interpret: an increase in income of 1% is associated, on average, with a 0.51% decrease in infant mortality.

[Figure: scatterplot of log.infant (1.0 to 2.5) against log.income (2.0 to 3.5) with the fitted line]
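The elasticity interpretation can be sketched on synthetic data (simulated here; this is not Leinhardt's actual dataset): generate a log-log linear relation with slope -0.51 and recover it with a hand-rolled least-squares fit:

```python
import math
import random

# Simulate log10(infant) = a + b * log10(income) + noise, with b = -0.51
random.seed(0)
a, b = 3.0, -0.51
xs = [math.log10(random.uniform(100, 5000)) for _ in range(200)]
ys = [a + b * x + random.gauss(0, 0.05) for x in xs]

# Ordinary least-squares slope and intercept
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
b_hat = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
a_hat = my - b_hat * mx

# The slope of a log-log fit is an elasticity: a 1% change in income is
# associated with roughly a b_hat % change in infant mortality
print(round(b_hat, 2))
```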

Transforming Proportions
Power transformations will not work for proportions (including percentages and rates) if the data values approach the boundaries of 0 and 1. Instead, we can use the logit or probit transformations for skewed proportion distributions; if their scales are equated, these two are practically indistinguishable. The logit transformation (a) removes the boundaries of the scale, (b) spreads out the tails of the distribution, and (c) makes the distribution symmetric about 0. It takes the following form:

logit(P) = log_e [ P / (1 - P) ]

Logit Transformation of a Proportion
Notice that the transformation is nearly linear for proportions between .20 and .80. Values close to 0 and 1 are spread out at an increasing rate, however. Finally, the transformed variable is now centered at 0 rather than .5.
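A minimal sketch of the logit transform's behavior (the helper name `logit` is my own; the probe values are arbitrary):

```python
import math

def logit(p):
    # Log odds of a proportion p; undefined at the boundaries 0 and 1
    return math.log(p / (1 - p))

print(logit(0.5))                  # centered: logit(.5) = 0.0
print(logit(0.2), logit(0.8))      # symmetric: equal magnitude, opposite signs
print(logit(0.99) - logit(0.98))   # near the boundary, equal steps in p...
print(logit(0.51) - logit(0.50))   # ...move the logit much further than mid-scale
```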

Next Topics: The Basics of Least Squares Regression
- Least-squares fit
- Properties of the least-squares estimator
- Statistical inference
- Regression in matrix form
- The vector representation of the regression model