Linear Regression Models with Logarithmic Transformations




Kenneth Benoit
Methodology Institute, London School of Economics
kbenoit@lse.ac.uk

March 17, 2011

(With valuable input and edits from Jouni Kuha.)

1 Logarithmic transformations of variables

Considering the simple bivariate linear model Y_i = α + βX_i + ε_i, there are four possible combinations of transformations involving logarithms: the linear case with no transformations, the linear-log model, the log-linear model, and the log-log model. (The bivariate case is used here for simplicity only, as the results generalize directly to models involving more than one X variable, although we would then need to add the caveat that all other variables are held constant. Note also that the term "log-linear model" is used in other contexts to refer to some types of models for other kinds of response variables Y; those are different from the log-linear models discussed here.)

                 X                                    log X
    Y            linear:      Ŷ_i = α + βX_i          linear-log:  Ŷ_i = α + β log X_i
    log Y        log-linear:  log Ŷ_i = α + βX_i      log-log:     log Ŷ_i = α + β log X_i

            Table 1: Four varieties of logarithmic transformations

Remember that we are using natural logarithms, where the base is e ≈ 2.71828. Logarithms may have other bases, for instance the decimal logarithm with base 10. (The base-10 logarithm is used in the definition of the Richter scale, for instance, which measures the intensity of earthquakes as Richter = log10(intensity). This is why an earthquake of magnitude 9 is 100 times more powerful than an earthquake of magnitude 7: 10^9/10^7 = 10^2, and log10(10^2) = 2.)

Some properties of logarithms and exponential functions that you may find useful include:

1. log(e) = 1
2. log(1) = 0
3. log(x^r) = r log(x)
4. log(e^A) = A

5. e^(log A) = A
6. log(AB) = log A + log B
7. log(A/B) = log A − log B
8. e^(AB) = (e^A)^B
9. e^(A+B) = e^A e^B
10. e^(A−B) = e^A / e^B

2 Why use logarithmic transformations of variables

Logarithmically transforming variables in a regression model is a very common way to handle situations where a non-linear relationship exists between the independent and dependent variables. Using the logarithm of one or more variables instead of the un-logged form makes the effective relationship non-linear, while still preserving the linear model. (The other transformation we have learned is the quadratic form, which involves adding the term X^2 to the model. The quadratic produces curvature that can reverse the direction of the relationship, something the logarithmic transformation cannot do: the logarithmic transformation is what is known as a monotone transformation, meaning it preserves the ordering between x and f(x).)

Logarithmic transformations are also a convenient means of transforming a highly skewed variable into one that is more approximately normal. (In fact, there is a distribution called the log-normal distribution, defined as a distribution whose logarithm is normally distributed but whose untransformed scale is skewed.) For instance, if we plot the histogram of expenses (from the MI452 course pack example), we see a significant right skew in this data, meaning the mass of cases is bunched at lower values:

[Histogram of Expenses, showing a pronounced right skew]

If we plot the histogram of the logarithm of expenses, however, we see a distribution that looks much more like a normal distribution:

[Histogram of Log(Expenses), looking approximately normal]

3 Interpreting coefficients in models with logarithmic transformations

3.1 Linear model: Y_i = α + βX_i + ε_i

Recall that in the linear regression model Y_i = α + βX_i + ε_i, the coefficient β gives us directly the change in Y for a one-unit change in X. No additional interpretation is required beyond the estimate β̂ of the coefficient itself. This literal interpretation will still hold when variables have been logarithmically transformed, but it usually makes sense to interpret the changes not in log-units but rather in percentage changes. Each logarithmically transformed model is discussed in turn below.

3.2 Linear-log model: Y_i = α + β log X_i + ε_i

In the linear-log model, the literal interpretation of the estimated coefficient β̂ is that a one-unit increase in log X will produce an expected increase in Y of β̂ units. To see what this means in terms of changes in X, we can use the result that

    log X + 1 = log X + log e = log(eX),

which follows from properties 1 and 6 of logarithms and exponential functions listed in Section 1. In other words, adding 1 to log X means multiplying X itself by e ≈ 2.72. A proportional change like this can be converted to a percentage change by subtracting 1 and multiplying by 100. So another way of stating "multiplying X by 2.72" is to say that X increases by 172% (since 100 × (2.72 − 1) = 172). So in terms of a change in X (unlogged):

- β̂ is the expected change in Y when X is multiplied by e.
- β̂ is the expected change in Y when X increases by 172%.
- For other percentage changes in X we can use the following result: the expected change in Y associated with a p% increase in X can be calculated as β̂ × log([100 + p]/100). To work out the expected change associated with a 10% increase in X, for example, multiply β̂ by log(110/100) = log(1.1) = .095. In other words, 0.095 × β̂ is the expected change in Y when X is multiplied by 1.1, i.e. increases by 10%.
- For small p, log([100 + p]/100) ≈ p/100. For p = 1, this means that β̂/100 can be interpreted approximately as the expected increase in Y from a 1% increase in X.

3.3 Log-linear model: log Y_i = α + βX_i + ε_i

In the log-linear model, the literal interpretation of the estimated coefficient β̂ is that a one-unit increase in X will produce an expected increase in log Y of β̂ units. In terms of Y itself, this means that the expected value of Y is multiplied by e^β̂. So in terms of effects of changes in X on Y (unlogged):

- Each 1-unit increase in X multiplies the expected value of Y by e^β̂.
- To compute the effect on Y of a change in X of some amount c other than one unit, we need to include c in the exponent: the effect of a c-unit increase in X is to multiply the expected value of Y by e^(cβ̂). So the effect of a 5-unit increase in X would be e^(5β̂).
- For small values of β̂, e^β̂ ≈ 1 + β̂. We can use this approximation for a quick interpretation of the coefficients: 100 × β̂ is the expected percentage change in Y for a unit increase in X. For instance, for β̂ = .06, e^.06 ≈ 1.06, so a 1-unit change in X corresponds to (approximately) an expected increase in Y of 6%.

3.4 Log-log model: log Y_i = α + β log X_i + ε_i

In instances where both the dependent variable and the independent variable(s) are log-transformed, the interpretation is a combination of the linear-log and log-linear cases above.
In other words, the interpretation is given as an expected percentage change in Y when X increases by some percentage. Such relationships, where both Y and X are log-transformed, are commonly referred to as elastic in econometrics, and the coefficient of log X is referred to as an elasticity. So in terms of effects of changes in X on Y (both unlogged):

- Multiplying X by e will multiply the expected value of Y by e^β̂.
- To get the proportional change in Y associated with a p percent increase in X, calculate a = log([100 + p]/100) and take e^(aβ̂).
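The interpretation rules from Sections 3.2 to 3.4 can be collected into a few small helper functions. The sketch below is illustrative only: the function names are ours (not from the handout), and it uses Python rather than the Stata used later in the examples.

```python
import math

def linear_log_effect(b, pct_increase_in_x):
    """Linear-log model: expected change in Y when X rises by pct%."""
    return b * math.log((100 + pct_increase_in_x) / 100)

def log_linear_effect(b, change_in_x):
    """Log-linear model: multiplicative effect on Y of a c-unit rise in X."""
    return math.exp(b * change_in_x)

def log_log_effect(b, pct_increase_in_x):
    """Log-log model: multiplicative effect on Y when X rises by pct%."""
    a = math.log((100 + pct_increase_in_x) / 100)
    return math.exp(a * b)

# Quantities matching the text:
print(round(linear_log_effect(1.0, 10), 3))  # log(1.1) ≈ 0.095 per unit of beta
print(round(log_linear_effect(0.06, 1), 2))  # e^0.06 ≈ 1.06, i.e. about +6%
```

These are the same calculations done by hand in the worked examples that follow.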

4 Examples

Linear-log. Consider the regression of the percentage urban population in 1995 on per capita GNP:

[Scatterplot: % urban 1995 (World Bank) against United Nations per capita GDP]

This doesn't look too good. The distribution of per capita GDP is badly skewed, creating a non-linear relationship between X and Y. To control the skew and counter problems of heteroskedasticity, let's transform per capita GNP by taking its logarithm. This produces the following plot:

[Scatterplot: % urban 1995 (World Bank) against lpcgdp95 (logged per capita GDP)]

That looked pretty good. Now let's quantify the association between percentage urban and logged per capita income:

. regress urb95 lpcgdp95

      Source |       SS       df       MS              Number of obs =     132
-------------+------------------------------           F(  1,   130) =  158.73
       Model |  38856.2103     1  38856.2103           Prob > F      =  0.0000
    Residual |  31822.7215   130  244.790165           R-squared     =  0.5498
-------------+------------------------------           Adj R-squared =  0.5463
       Total |  70678.9318   131  539.533831           Root MSE      =  15.646

------------------------------------------------------------------------------
       urb95 |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    lpcgdp95 |   10.43004   .8278521    12.599   0.000     8.792235   12.06785
       _cons |  -24.42095   6.295892    -3.879   0.000    -36.87662  -11.96528
------------------------------------------------------------------------------

To interpret the coefficient of 10.43004 on lpcgdp95 (the log of GNP/capita), we can make the following statements:

- Directly from the coefficient: An increase of 1 in the log of GNP/capita will increase Y by 10.43004. (This is not extremely interesting, however, since few people are sure how to interpret the natural logarithm of GDP/capita.)
- Multiplicative changes in e: Multiplying GNP/capita by e (roughly 2.71828) will increase Y by 10.43004 percentage points.
- A 1% increase in X: A 1% increase in GNP/capita will increase Y by approximately 10.43004/100 = .1043 percentage points.
- A 10% increase in X: A 10% increase in GNP/capita will increase Y by 10.43004 × log(1.10) = 10.43004 × .09531 ≈ 0.994 percentage points.

Log-linear. What about the situation where the dependent variable is logged? What if we reverse X and Y from the above example, so that we regress the log of GNP/capita on %urban? In this case, the logarithmically transformed variable is the Y variable. We could just as easily have considered the "effect" on logged per capita income of increasing urbanization.
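As a quick sanity check, the last two statements above can be reproduced from the estimated coefficient. This snippet is illustrative Python (not part of the original Stata session):

```python
import math

# Estimated slope on lpcgdp95 from the regression output above.
b = 10.43004

# Expected change in % urban from a 10% increase in GNP/capita:
print(round(b * math.log(1.10), 3))   # 0.994 percentage points

# Approximate expected change from a 1% increase (the b/100 rule):
print(round(b / 100, 4))              # 0.1043 percentage points
```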
This leads to the following plot (which is just the transpose of the previous one; this is only an example!):

[Scatterplot: lpcgdp95 (logged per capita GDP) against % urban 1995 (World Bank)]

with the following regression results:

. regress lpcgdp95 urb95

      Source |       SS       df       MS              Number of obs =     132
-------------+------------------------------           F(  1,   130) =  158.73
       Model |  196.362646     1  196.362646           Prob > F      =  0.0000
    Residual |  160.818406   130  1.23706466           R-squared     =  0.5498
-------------+------------------------------           Adj R-squared =  0.5463
       Total |  357.181052   131  2.72657291           Root MSE      =  1.1122

------------------------------------------------------------------------------
    lpcgdp95 |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       urb95 |    .052709   .0041836    12.599   0.000     .0444322   .0609857
       _cons |   4.630287   .2420303    19.131   0.000     4.151459   5.109115
------------------------------------------------------------------------------

To interpret the coefficient of .052709 on urb95, we can make the following statements:

- Directly from the coefficient, transformed Y: Each one-unit increase in urb95 increases lpcgdp95 by .052709. (Once again, this is not particularly useful, as we still have trouble thinking in terms of the natural logarithm of GDP/capita.)
- Directly from the coefficient, untransformed Y: Each one-unit increase in urb95 multiplies untransformed GDP/capita by e^.052709 = 1.054, i.e. a 5.4% increase. (This is very close to the 5.3% increase we get using the quick approximate rule described above, which interprets the coefficient .053 as yielding a 5.3% increase for a one-unit change in X.)

Log-log. Here we consider a regression of the logarithm of the infant mortality rate (limr) on the log of GDP/capita. The plot looks like this:

[Scatterplot: limr (logged infant mortality rate) against lpcgdp95 (logged per capita GDP)]
. regress limr lpcgdp95

      Source |       SS       df       MS              Number of obs =     194
-------------+------------------------------           F(  1,   192) =  404.52
       Model |  131.035233     1  131.035233           Prob > F      =  0.0000
    Residual |  62.1945021   192  .323929698           R-squared     =  0.6781
-------------+------------------------------           Adj R-squared =  0.6765
       Total |  193.229735   193  1.00119034           Root MSE      =  .56915

------------------------------------------------------------------------------
        limr |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    lpcgdp95 |  -.4984531   .0247831   -20.113   0.000    -.5473352   -.449571
       _cons |   7.088676   .1908519    37.142   0.000      6.71224   7.465111
------------------------------------------------------------------------------
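The summary statistics in this table are internally consistent, which makes a useful arithmetic check when reading Stata output: MS = SS/df, R² = Model SS / Total SS, F = Model MS / Residual MS, and Root MSE = √(Residual MS). An illustrative Python verification (variable names are ours):

```python
import math

# ANOVA quantities from the regression table above (limr on lpcgdp95).
ss_model, ss_resid = 131.035233, 62.1945021
df_model, df_resid = 1, 192

ms_model = ss_model / df_model
ms_resid = ss_resid / df_resid

r2 = ss_model / (ss_model + ss_resid)   # R-squared
f_stat = ms_model / ms_resid            # F(1, 192)
root_mse = math.sqrt(ms_resid)          # Root MSE

print(round(r2, 4))        # 0.6781, as reported
print(round(f_stat, 2))    # 404.52
print(round(root_mse, 5))  # 0.56915
```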

To interpret the coefficient of -.4984531 on lpcgdp95, we can make the following statements:

- Directly from the coefficient, transformed Y: Each one-unit increase in lpcgdp95 changes limr by -.4984531. (Since we cannot think directly in natural-log units, once again this is not particularly useful.)
- Multiplicative changes in both X and Y: Multiplying X (GNP/capita) by e ≈ 2.72 multiplies Y by e^(-.4984531) = 0.607, i.e. reduces the expected IMR by about 39.3%.
- A 1% increase in X: A 1% increase in GNP/capita multiplies IMR by e^(-.4984531 × log(1.01)) = 0.9950525. So a 1% increase in GNP/capita reduces IMR by about 0.5%.
- A 10% increase in X: A 10% increase in GNP/capita multiplies IMR by e^(-.4984531 × log(1.1)) ≈ .954. So a 10% increase in GNP/capita reduces IMR by about 4.6%.
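To close, the key interpretation numbers from all three worked examples can be reproduced in a few lines. The Python below is illustrative only; the handout's own computations were done in Stata.

```python
import math

# Linear-log: urb95 on lpcgdp95, slope b1 = 10.43004
b1 = 10.43004
print(round(b1 * math.log(1.10), 3))            # 0.994 points per 10% GNP rise

# Log-linear: lpcgdp95 on urb95, slope b2 = .052709
b2 = 0.052709
print(round(math.exp(b2), 3))                   # 1.054: each urban point -> +5.4% income

# Log-log: limr on lpcgdp95, slope b3 = -.4984531
b3 = -0.4984531
print(round(math.exp(b3), 3))                   # 0.607: e-fold income rise cuts IMR ~39%
print(round(math.exp(b3 * math.log(1.10)), 3))  # 0.954: 10% income rise -> IMR -4.6%
```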