Proportions as dependent variable

Similar documents
Standard errors of marginal effects in the heteroskedastic probit model

Marginal Effects for Continuous Variables Richard Williams, University of Notre Dame, Last revised February 21, 2015

2. Linear regression with multiple regressors

Multinomial and Ordinal Logistic Regression

Handling missing data in Stata a whirlwind tour

HURDLE AND SELECTION MODELS Jeff Wooldridge Michigan State University BGSE/IZA Course in Microeconometrics July 2009

LOGIT AND PROBIT ANALYSIS

Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model

ESTIMATING AVERAGE TREATMENT EFFECTS: IV AND CONTROL FUNCTIONS, II Jeff Wooldridge Michigan State University BGSE/IZA Course in Microeconometrics

Sample Size Calculation for Longitudinal Studies

Lab 5 Linear Regression with Within-subject Correlation. Goals: Data: Use the pig data which is in wide format:

Regression with a Binary Dependent Variable

A Primer on Mathematical Statistics and Univariate Distributions; The Normal Distribution; The GLM with the Normal Distribution

Institut für Soziologie Eberhard Karls Universität Tübingen

Overview Classes Logistic regression (5) 19-3 Building and applying logistic regression (6) 26-3 Generalizations of logistic regression (7)

Generalized Linear Models

Prediction for Multilevel Models

13. Poisson Regression Analysis

MISSING DATA TECHNIQUES WITH SAS. IDRE Statistical Consulting Group

Logistic Regression (a type of Generalized Linear Model)

IAPRI Quantitative Analysis Capacity Building Series. Multiple regression analysis & interpreting results

Statistics 104 Final Project A Culture of Debt: A Study of Credit Card Spending in America TF: Kevin Rader Anonymous Students: LD, MH, IW, MY

is paramount in advancing any economy. For developed countries such as

From the help desk: hurdle models

Logistic Regression (1/24/13)

Corporate Defaults and Large Macroeconomic Shocks

Java Modules for Time Series Analysis

Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus

Outline. Topic 4 - Analysis of Variance Approach to Regression. Partitioning Sums of Squares. Total Sum of Squares. Partitioning sums of squares

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Nonlinear Regression:

Correlated Random Effects Panel Data Models

ECON 142 SKETCH OF SOLUTIONS FOR APPLIED EXERCISE #2

From the help desk: Bootstrapped standard errors

SAS Software to Fit the Generalized Linear Model

Introduction to General and Generalized Linear Models

X X X a) perfect linear correlation b) no correlation c) positive correlation (r = 1) (r = 0) (0 < r < 1)

Department of Economics Session 2012/2013. EC352 Econometric Methods. Solutions to Exercises from Week (0.052)

Simple linear regression

Introduction to Quantitative Methods

GLM I An Introduction to Generalized Linear Models

11. Analysis of Case-control Studies Logistic Regression

CHAPTER 3 EXAMPLES: REGRESSION AND PATH ANALYSIS

Comparing Features of Convenient Estimators for Binary Choice Models With Endogenous Regressors

MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS

Simple Linear Regression Inference

Multivariate Logistic Regression

Regression III: Advanced Methods

Module 5: Multiple Regression Analysis

Univariate Regression

HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION

Risk Decomposition of Investment Portfolios. Dan dibartolomeo Northfield Webinar January 2014

From the help desk: Swamy s random-coefficients model

DETERMINANTS OF CAPITAL ADEQUACY RATIO IN SELECTED BOSNIAN BANKS

Poisson Models for Count Data

Linda K. Muthén Bengt Muthén. Copyright 2008 Muthén & Muthén Table Of Contents

Local classification and local likelihoods

A Basic Introduction to Missing Data

Analyses on Hurricane Archival Data June 17, 2014

Ordinal Regression. Chapter

UNIVERSITY OF WAIKATO. Hamilton New Zealand

Stata Walkthrough 4: Regression, Prediction, and Forecasting

Nominal and ordinal logistic regression

The Probit Link Function in Generalized Linear Models for Data Mining Applications

Regression Modeling Strategies

Correlation and Regression

Discussion Section 4 ECON 139/ Summer Term II

An introduction to GMM estimation using Stata

INDIRECT INFERENCE (prepared for: The New Palgrave Dictionary of Economics, Second Edition)

Please follow the directions once you locate the Stata software in your computer. Room 114 (Business Lab) has computers with Stata software

VI. Introduction to Logistic Regression

2013 MBA Jump Start Program. Statistics Module Part 3

Lecture 15. Endogeneity & Instrumental Variable Estimation

Data Mining and Data Warehousing. Henryk Maciejewski. Data Mining Predictive modelling: regression

A Statistical Analysis of the Prices of. Personalised Number Plates in Britain

Regression III: Advanced Methods

Econometrics Problem Set #2

Multicollinearity Richard Williams, University of Notre Dame, Last revised January 13, 2015

Using Repeated Measures Techniques To Analyze Cluster-correlated Survey Responses

Lecture 14: GLM Estimation and Logistic Regression

Outline. Dispersion Bush lupine survival Quasi-Binomial family

STA 4273H: Statistical Machine Learning

ANNUITY LAPSE RATE MODELING: TOBIT OR NOT TOBIT? 1. INTRODUCTION

Financial Vulnerability Index (IMPACT)

Additional sources Compilation of sources:

10. Comparing Means Using Repeated Measures ANOVA

Some Essential Statistics The Lure of Statistics

Modeling Health Care Costs and Counts

Simple Regression Theory II 2010 Samuel L. Baker

Nonlinear Regression Functions. SW Ch 8 1/54/

Multilevel Modeling of Complex Survey Data

Credit Scoring Modelling for Retail Banking Sector.

CHAPTER 9 EXAMPLES: MULTILEVEL MODELING WITH COMPLEX SURVEY DATA

The Stata Journal. Editor Nicholas J. Cox Department of Geography Durham University South Road Durham City DH1 3LE UK

From the help desk: Demand system estimation

What s New in Econometrics? Lecture 8 Cluster and Stratified Sampling

Example: Boats and Manatees

Failure to take the sampling scheme into account can lead to inaccurate point estimates and/or flawed estimates of the standard errors.

Transcription:

Proportions as dependent variable Maarten L. Buis Vrije Universiteit Amsterdam Department of Social Research Methodology http://home.fsw.vu.nl/m.buis Proportions as dependent variable p. 1/42

Outline Problems with using regress for proportions as dependent variable Methods for dealing with a single proportion Methods for dealing with multiple proportions Caveat: Ecological Fallacy Proportions as dependent variable p. 2/42

Example Explaining proportion of Dutch city budgets spent on administration and government with: Size of budget (natural logarithm of budget in 10s of millions euros) Average house price (in 100,000s of euros) Population density (in 1000s of persons per square km) Political orientation of city government (either no left parties in city government, left parties are a minority in city government, or left parties are a majority in city government) Proportions as dependent variable p. 3/42

OLS results b se lntot -0.030 (0.002) houseval 0.013 (0.004) popdens 0.008 (0.002) noleft -0.001 (0.005) minorityleft -0.007 (0.004) constant 0.109 (0.008) R 2 0.499 Proportions as dependent variable p. 4/42

Non linear effects due to floor.4 non linearity due to floor proportion budget to administration and government.3.2.1 0.5 1 5 10 50 100 500 budget in 10s of milions of euros Proportions as dependent variable p. 5/42

Residuals versus fitted values.2 Residuals versus fitted values.1 residuals 0.1.2.05 0.05.1.15.2 Fitted values Proportions as dependent variable p. 6/42

Floor observed = f itted + residual observed 0 (and 1) f itted + residual 0 residual f itted Proportions as dependent variable p. 7/42

Residuals versus fitted values.2 heteroscedasticity due to floor.1 residuals 0.1.2.05 0.05.1.15.2 Fitted values Proportions as dependent variable p. 8/42

Problems with regress Impossible predictions. Non-normal errors. Heteroscedasticity. Non-linear effects. Proportions as dependent variable p. 9/42

Outline Problems with using regress for proportions as dependent variable Methods for dealing with a single proportion Methods for dealing with multiple proportions Caveat: Ecological Fallacy Proportions as dependent variable p. 10/42

A solution: betafit Assumes that the proportion follows a beta distribution. The beta distribution is bounded between 0 and 1 (but does not include either 0 or 1). The beta distribution models heteroscedasticity in such a way that the variance is largest when the average proportion is near 0.5. Proportions as dependent variable p. 11/42

Two parameterizations the conventional parametrization with two shape parameters (α and β) Corresponds to the formulas of the beta distribution in textbooks. Does not correspond to conventions of Generalized Linear Models where one models how the mean of the distribution of the dependent variable changes as the explanatory variables change. the alternative parametrization with one location and one scale parameter (µ and φ) Does not correspond to textbook formulas of the beta distribution but does correspond to the GLM convention. Proportions as dependent variable p. 12/42

Two parameterizations conventional parametrization f(y α,β) E(y) = V ar(y) = y α 1 (y 1) β 1 α α + β αβ (α + β) 2 (α + β + 1) alternative parametrization f(y µ,φ) E(y) = µ V ar(y) = µ(1 µ) y µφ 1 (y 1) (1 µ)φ 1 1 1 + φ Proportions as dependent variable p. 13/42

different µ fixed φ alpha = 5 and beta = 5 mu =.5 and phi = 10, var =.091 alpha = 4 and beta = 6 mu =.4 and phi = 10, var =.061 density density 0.2.4.6.8 1 y 0.2.4.6.8 1 y alpha = 3 and beta = 7 mu =.3 and phi = 10, var =.039 alpha = 2 and beta = 8 mu =.2 and phi = 10, var =.023 density density 0.2.4.6.8 1 y 0.2.4.6.8 1 y Proportions as dependent variable p. 14/42

different φ fixed µ alpha = 2 and beta = 8 mu =.2 and phi = 10, var =.023 alpha = 4 and beta = 16 mu =.2 and phi = 20, var =.012 density density 0.2.4.6.8 1 y 0.2.4.6.8 1 y alpha = 8 and beta = 32 mu =.2 and phi = 40, var =.006 alpha = 16 and beta = 64 mu =.2 and phi = 80, var =.003 density density 0.2.4.6.8 1 y 0.2.4.6.8 1 y Proportions as dependent variable p. 15/42

Modeling the mean We allow different cities to have different µs depending on their values of the explanatory variables. µ i = f(b 0 + b 1 x 1i + b 2 x 2i ) The logistic transformation is used to ensure µ i remains between 0 and 1. µ i = eb 0 +b 1 x 1i +b 2 x 2i 1+e b 0 +b 1 x 1i +b 2 x 2i which is the same as: ln( µ 1 µ ) = b 0 + b 1 x 1i + b 2 x 2i Proportions as dependent variable p. 16/42

output of betafit. betafit gov, mu(lntot houseval popdens noleft minorityleft ) nolog ML fit of beta (mu, phi) Number of obs = 394 Wald chi2(5) = 473.19 Log likelihood = 887.97456 Prob > chi2 = 0.0000 ------------------------------------------------------------ Coef. se z P> z [ 95% CI ] -------------+---------------------------------------------- lntot -.3999.0227-17.58 0.000 -.4445 -.3553 houseval.1138.0385 2.96 0.003.0384.1892 popdens.0830.0216 3.85 0.000.0408.1253 noleft.0185.0445 0.42 0.677 -.0686.1057 minorityleft -.0080.0450-0.18 0.859 -.0962.0802 _cons -2.0545.0707-29.06 0.000-2.1931-1.9160 -------------+---------------------------------------------- /ln_phi 4.7968.0715 67.13 0.000 4.6568 4.9368 -------------+---------------------------------------------- phi 121.1 8.6545 105.3 139.3 ------------------------------------------------------------ Proportions as dependent variable p. 17/42

interpretation using dbetafit. dbetafit, at(noleft 0 minorityleft 0) ---------------------------------------------------------------- discrete Min --> Max +-SD/2 +-1/2 change coef. se coef. se coef. se --------------+------------------------------------------------- lntot -.2116.0122 -.0344.002 -.033.0019 houseval.0291.0105.0037.0013.0093.0032 popdens.0447.0133.0063.0016.0068.0018 noleft.0015.0037 minorityleft -6.6e-04.0037 ---------------------------------------------------------------- Proportions as dependent variable p. 18/42

discrete changes in lntot.25.2 predicted proportion.15.1.05 0 2 0 2 4 6 ln(total budget) Proportions as dependent variable p. 19/42

marginal effects ---------------------------------------------------- Marginal MFX at x Max MFX Effects coef. se coef. se --------------+------------------------------------- lntot -.0328.0019 -.1.0057 houseval.0093.0032.0284.0096 popdens.0068.0018.0208.0054 ---------------------------------------------------- Proportions as dependent variable p. 20/42

marginal effects of lntot 1.8 predicted proportion.6.4.2 0 10 5 0 5 ln(total budget) Proportions as dependent variable p. 21/42

Fractional logit Although the implied variance in betafit makes sense, it is still an assumption and some think it is too restrictive. The fractional logit has been proposed as an alternative by Papke and Wooldridge (1996). Fractional logit can handle proportions of exactly 0 or 1, unlike betafit. This model can be estimated by typing: glm varlist, family(binomial) link(logit) robust. Marginal effects like those from dbetafit can be obtained with mfx, predict(mu). Proportions as dependent variable p. 22/42

Does it matter? OLS betafit glm dy/dx se dy/sx se dy/dx se lntot -.0296.0027 -.0328.0019 -.0330.0026 houseval.0135.0051.0093.0032.0105.0036 popdens.0078.0019.0068.0018.0071.0018 noleft -.0010.0056.0015.0037.0008.0046 minorityleft -.0065.0047 -.0007.0037 -.0019.0042 dy/dx is for discrete change of dummy variable from 0 to 1 Proportions as dependent variable p. 23/42

Outline Problems with using regress for proportions as dependent variable Methods for dealing with a single proportion Methods for dealing with multiple proportions Caveat: Ecological Fallacy Proportions as dependent variable p. 24/42

Multiple proportions Cities also spent money on other categories: Safety (which includes public health, fire department, and the police department) Education (mostly primary and secondary schools) recreation (which includes sport facilities and culture) social (which includes social work and some social security benefits) urbanplanning (which includes roads and houses) Proportions as dependent variable p. 25/42

Multiple proportions The proportions spent on each category should remain between 0 and 1, and the proportions should add up to 1. The proportions could be modeled with separate betafit models. This would ensure the first condition is met, but it would ignore the second condition. Proportions as dependent variable p. 26/42

A solution: dirifit Assumes that the proportions follow a Dirichlet distribution. The Dirichlet distribution is the multivariate generalization of the beta distribution. It ensures that the proportions remain between 0 and 1, and that they add up to 1. Proportions as dependent variable p. 27/42

Two parameterizations the conventional parametrization with one shape parameters for each proportion (α 1, α 2,..., α k ) Corresponds to the formulas of the Dirichlet distribution in textbooks. Does not correspond to conventions of Generalized Linear Models where one models how the mean of the distribution of the dependent variable changes as the explanatory variables change. the alternative parametrization with on location location parameter for each proportion and one scale parameter (µ 1, µ 2,..., µ k, and φ) Does not correspond to textbook formulas of the Dirichlet distribution but does correspond to the GLM convention. One location parameter is redundant: µ 1 = 1 (µ 2 + µ 3 +... + µ k ). Proportions as dependent variable p. 28/42

Modeling the mean We allow different cities to have different µ j s depending on their values of the explanatory variables. The multinomial logistic transformation is used to ensure the µ j s remain between 0 and 1 and add up to 1. Proportions as dependent variable p. 29/42

output of dirifit. dirifit gov-urban, mu(lntot houseval popdens noleft minorityleft ) nolog ---------------------------------------------------------------- Coef. se z P> z [ 95% CI ] -------------+-------------------------------------------------- mu2 lntot.1445.0406 3.56 0.000.0649.2240 houseval -.0518.0718-0.72 0.471 -.1924.0889 popdens -.0700.0390-1.79 0.073 -.1465.0065 noleft.0817.0827 0.99 0.323 -.0805.2439 minorityleft.1043.0826 1.26 0.207 -.0577.2662 _cons.5274.1318 4.00 0.000.2690.7858 -------------+-------------------------------------------------- mu3 lntot.4123.0423 9.74 0.000.3293.4952 <snip> -------------+-------------------------------------------------- phi 45.01 1.407 42.33 47.85 ---------------------------------------------------------------- mu2 = safety mu4 = recreation mu6 = urbanplanning Proportions as dependent variable p. 30/42 mu3 = education mu5 = social base outcome = governing

Marginal effects obtained with ddirifit governing safety education recreation social urban planning lntot -.0320 -.0314.0115 -.0067.0265.0321 houseval.0132.0143 -.0321.0065 -.0496.0477 popdens.0074.0009 -.0067.0002.0072 -.0090 noleft.0006.0161 -.0266.0048 -.0168.0219 minorityleft -.0019.0154 -.0164.0085 -.0105.0049 discrete change of dummy variable from 0 to 1 significant at 5% level Proportions as dependent variable p. 31/42

Variance and covariance of y in dirifit The variance of y i is µ i (1 µ i ) 1 1+φ The covariance of y i and y j implicit in dirifit is µ i µ j 1 1+φ It depends on the means in a similar fashion as the multinomial distribution, and on a precision parameter φ. Covariance is forced to be negative. This makes sense in that there is less room for other categories if the fraction in one category increases. Proportions as dependent variable p. 32/42

Variance Covariance structure too restrictive? Though the implied variances and covariances make sense, they do not have to be true. Alternatives have been proposed for cases where this structure is violated. For dirifit a multivariate normal model for logit transformed dependent variables has been proposed by Aitcheson (2003). Proportions as dependent variable p. 33/42

Variance Covariance structure too restrictive? This model can be estimated by typing: gen logity1 = logit(y1) gen logity2 = logit(y2).. gen logityk = logit(yk) mvreg logity1 - logityk = indepvars, corr Proportions as dependent variable p. 34/42

Outline Problems with using regress for proportions as dependent variable Methods for dealing with a single proportion Methods for dealing with multiple proportions Caveat: Ecological Fallacy Proportions as dependent variable p. 35/42

Ecological Fallacy Sometimes one wants to study behavior of individuals but one only has information on a aggregate level. This aggregate information is often in the form of proportions. One might be tempted to use the methods discussed previously to analyze this data. Example from Robinson (1950): Relationship between immigrant status and literacy in the 1930 US census. Proportions as dependent variable p. 36/42

Individual level analysis illiterate immigrant literate illiterate Total native born 96.72 3.28 100.00 foreign born 90.75 9.25 100.00 Total 95.87 4.13 100.00 Proportions as dependent variable p. 37/42

State level analysis 15 10 % illiterate 5 0 0 10 20 30 % immigrants Proportions as dependent variable p. 38/42

Ecological Fallacy Aggregate level relationships can be completely different from individual level relationships. If it is remotely possible to use individual level data, do so! If that is not possible start reading up on Ecological Inference. A good place to start is Gary King (1997) Ecol package from Department of Political Science, Aarhus University, Denmark: http://www.ps.au.dk/stata/ Proportions as dependent variable p. 39/42

Summary (1) The constraint that a proportion must remain between 0 and 1 causes problems with regress. betafit is one possible solution. Multiple proportions have the additional constraint that they must add up to 1. dirifit is one possible solution. Proportions as dependent variable p. 40/42

Summary (2) Both betafit and dirifit make assumptions about the variance (covariance) structure of the dependent variable that does make sense but that some find too restrictive. Fractional logit and multivariate regression have been proposed as alternatives. None of these techniques are appropriate for studying individual behavior from aggregate data. Proportions as dependent variable p. 41/42

References Aitcheson, John. 2003. The Statistical Analysis of Compositional Data. Blackburn Press. King, Gary. 1997. A solution to the Ecological Inference Problem, Reconstructing Individual Behavior from Aggregate Data. Princeton University Press. Papke, Leslie E. and Jeffrey M. Wooldridge. 1996. Econometric Methods for Fractional Response Variables with an Application to 401(k) Plan Participation Rates. Journal of Applied Econometrics 11(6):619 632. Robinson, W.S. 1950. Ecological Correlations and the Behavior of Individuals. Amercian Sociological Review 15(3):351 357. Proportions as dependent variable p. 42/42