Iatrogenic Specification Error: A Cautionary Tale of Cleaning Data




DISCUSSION PAPER SERIES

IZA DP No. 1093

Iatrogenic Specification Error: A Cautionary Tale of Cleaning Data

Christopher R. Bollinger
Amitabh Chandra

March 2004

Forschungsinstitut zur Zukunft der Arbeit
Institute for the Study of Labor

Iatrogenic Specification Error: A Cautionary Tale of Cleaning Data

Christopher R. Bollinger
University of Kentucky

Amitabh Chandra
Dartmouth College, NBER and IZA Bonn

Discussion Paper No. 1093
March 2004

IZA
P.O. Box 7240
53072 Bonn
Germany
Phone: +49-228-3894-0
Fax: +49-228-3894-180
Email: iza@iza.org

Any opinions expressed here are those of the author(s) and not those of the institute. Research disseminated by IZA may include views on policy, but the institute itself takes no institutional policy positions.

The Institute for the Study of Labor (IZA) in Bonn is a local and virtual international research center and a place of communication between science, politics and business. IZA is an independent nonprofit company supported by Deutsche Post World Net. The center is associated with the University of Bonn and offers a stimulating research environment through its research networks, research support, and visitors and doctoral programs. IZA engages in (i) original and internationally competitive research in all fields of labor economics, (ii) development of policy concepts, and (iii) dissemination of research results and concepts to the interested public.

IZA Discussion Papers often represent preliminary work and are circulated to encourage discussion. Citation of such a paper should account for its provisional character. A revised version may be available on the IZA website (www.iza.org) or directly from the author.

ABSTRACT

In empirical research it is common practice to use sensible rules of thumb for cleaning data. Measurement error is often the justification for removing (trimming) or recoding (winsorizing) observations whose values lie outside a specified range. We consider a general measurement error process that nests many plausible models. Analytic results demonstrate that winsorizing and trimming are only solutions for a narrow class of measurement error processes. Indeed, for the measurement error processes found in most social-science data, such procedures can induce or exacerbate bias, and even inflate the variance estimates. We term this source of bias Iatrogenic (or econometrician induced) error. Monte Carlo simulations and empirical results from the Census PUMS data and 2001 CPS data demonstrate the fragility of trimming and winsorizing as solutions to measurement error in the dependent variable. Even on asymptotic variance and RMSE criteria, we are unable to find generalizable justifications for commonly used cleaning procedures.

JEL Classification: C1, J1

Keywords: measurement error models, trimming, winsorizing

Corresponding author:
Amitabh Chandra
Department of Economics
Dartmouth College
6106 Rockefeller Hall
Hanover, NH 03755
USA
Email: amitabh.chandra@dartmouth.edu

1 Introduction

Empirical researchers frequently use simple rules of thumb to clean data on the basis of the dependent variable. As an example, researchers analyzing survey reports of wages and salaries often remove observations whose value for the hourly wage is below the minimum wage or above some prespecified cutoff: sample exclusions based on wages can be found in Katz and Murphy (1992), Card and Krueger (1992), Bound and Freeman (1992), Juhn, Murphy, and Pierce (1993), and Buchinsky (1994). We cite these authors to illustrate the endorsement of this practice by leading scholars in the field. As we demonstrate in this paper, the intuitively appealing strategy of discarding certain observations is not costless and can introduce specification error in cases where no error previously existed. Given the fact that the inconsistency is exacerbated by the analyst's actions, we borrow a term from the medical literature and term this form of bias iatrogenic specification error. In the medical literature an iatrogenic event is an adverse reaction to a well-intentioned treatment initiated by a physician, and we believe that parameter inconsistency caused by the analyst's well-intentioned actions shares the features of physician-induced complications.

Given the widespread acceptance of this practice, the topic of robust estimation has received the attention of both economists and statisticians. In one of the earliest formal examinations, Stigler (1977) poses an interesting question: how much have methods such as trimming, winsorizing, the Edgeworth average, or Tukey's Biweight reduced the bias in the laboratory estimation of physical constants such as the speed of light or the density of the earth? Stigler concludes that the 10 percent trimmed mean, the smallest trimming amount considered in his study, is the most reliable estimator. In this he echoes the famous mathematician Legendre, who recommended deleting those observations with errors too large to be admissible.
Stigler looks at the role of measurement error in the physical sciences; the error process may be vastly different in the social sciences, where economic agents may have strategic or cultural incentives to inflate or deflate their reports.

In the econometrics literature, Angrist and Krueger (2000) apply trimming and winsorizing techniques to the matched employer-employee data from Mellow and Sider (1983). When they trim both the employer and employee wage data, they find that the correlation between the two measures improves. Interestingly, this result does not hold for reports of hours worked. On the basis of this finding they conclude that a small amount of trimming could be beneficial. Their prescription, which summarizes the intuition and current practice of most analysts, may be summarized as:

"Loosely speaking, winsorizing the data is desirable if the extreme values are exaggerated versions of the true values, but the true values still lie in the tails. Truncating the sample is more desirable if the extremes are mistakes that bear no resemblance to the true values." (p. 1349)

We examine this practice in detail here, using wages and earnings as a motivating example, though the results are likely to apply to other errors of measurement in survey data. We posit a general model of response error in the dependent variable of a linear regression model and characterize the effect of different cleaning techniques on the estimated coefficients. We demonstrate, both analytically and through the use of simulations, that in general there is no reason to believe that removing obvious errors in the dependent variable reduces bias. This is similar in spirit to the finding of Hyslop and Imbens (2001), who examine instrumental variables approaches to solving the measurement error problem and find that they only apply to very specific measurement error processes.

Our work is most closely related to that of MacDonald and Robinson (1985), who consider Bayesian estimation of an error components model in panel data when one of the error components is measurement error. They explicitly show that trimming can be thought of as an extremely dogmatic prior belief. Our paper differs in that we explicitly consider a general error process and we do not consider a panel setting. Moreover, we discuss an optimal trimming approach and other classical approaches to estimation. We demonstrate that the results in Stigler (1977) do not necessarily carry over to a regression framework. Indeed, trimming or winsorizing can bias coefficient estimates by as much as 10-30 percent, and in many cases either induces bias that did not previously exist, or exacerbates the bias due to measurement error. The intuition for our result is simple: assuming that the researcher trims or winsorizes the data based on a lower bound of c or an upper bound of C, it will be shown that cleaning creates selection bias, and this is generally worse than the effects of measurement error in the dependent variable.

Our paper is organized as follows: Section 2 describes identification with general measurement error in the dependent variable. We use the linear projection of the mismeasured dependent variable onto the covariates to derive analytical results. In Section 3, we generalize the use of this projection to consider three specific models of measurement error (additive white noise, linear transformation and the contaminated data process) that are found in social-science data. Section 4 examines the theoretical implications of trimming the data on bias as well as the asymptotic variances of the coefficients.
We prove that only in highly specialized cases, unlikely to be found in social-science data, does cleaning reduce bias. In these cases, we demonstrate that the information necessary to reduce bias leads to a simpler correction that requires fewer assumptions. This section also demonstrates that trimming will not necessarily reduce standard errors. Section 5 presents simulation results for the cases considered analytically. We generate quasi-simulated data from the 1990 US Decennial Census to study the properties of winsorizing and trimming in a multivariate context. Finally, we present an empirical example from the March 2001 Current Population Survey (CPS). These simulations and examples support the results of the earlier two sections. Section 6 provides concluding comments. The Appendix to this paper provides detailed mathematical proofs and also considers the effects of winsorizing on bias and efficiency.

2 General Measurement Error in the Dependent Variable

To evaluate the widespread practice of cleaning data as described above, we consider a general model for measurement error processes in the dependent variable. To keep the analysis simple, we focus on a linear regression model as the underlying structural model of interest to the researcher. Assume that the relationship between the true dependent variable and the covariates is described by:

\[ y_i^* = x_i'\beta + u_i. \tag{1} \]

Our maintained assumption is that the analyst is interested in estimates of \(\beta\).[1] We assume a general process that relates the true value \(y_i^*\) to the observed value \(y_i\):

\[ y_i = h(y_i^*, \varepsilon_i). \tag{2} \]

There are six assumptions that are made for identification of the vector \(\beta\) and its associated covariance matrix:[2]

A1: \(E[u_i \mid x_i] = 0\)
A2: \(x_i\) is a vector random variable with mean 0 and full rank second moment matrix \(V_x\)
A3: Random sampling
A4: \(h(\cdot, \cdot)\) has finitely many discontinuities
A5: \(\varepsilon_i\) is independent of \((y_i^*, x_i, u_i)\)
A6: \(Cov(y_i, y_i^*) > 0\)

Regardless of the process in equation 2, one summary of the joint distribution of \(y_i\) and \(y_i^*\) is the population linear projection of \(y_i\) on \(y_i^*\):

\[ y_i = \alpha + \gamma y_i^* + e_i. \tag{3} \]

Here, \(\alpha = E[y_i]\), \(\gamma = Cov(y_i, y_i^*)/V(y_i^*)\), and \(E[e_i] = E[e_i y_i^*] = 0\). The linear projection is not a statement about the data generating process, but rather a summary measure of the joint distribution of \((y_i, y_i^*)\). The actual measurement process, as defined by \(h(y_i^*, \varepsilon_i)\), may be substantially more complicated. Assumption A6 ensures that \(\gamma > 0\). The researcher is only able to observe \((y_i, x_i)\). Substituting equation 1 into equation 3 yields:

\[ y_i = \alpha + x_i'\beta\gamma + \gamma u_i + e_i. \tag{4} \]

Assumption A5 ensures that \(Cov(x_i, \gamma u_i + e_i) = 0\) and \(E[\gamma u_i + e_i] = 0\). This defines the population linear projection of \(y_i\) on \(x_i\):

\[ y_i = a + x_i'b + \eta_i, \tag{5} \]

where \(b = \gamma\beta\), \(a = \alpha\), and \(\eta_i = \gamma u_i + e_i\).

[1] If the analyst is not interested in \(\beta\) per se, but in other features of the joint distribution between y and x such as cov(y, x) or var(y|x), then it is possible that our results do not apply. Further analysis is required to understand the applicability of trimming, winsorizing or even rescaling for this class of problems. We thank an anonymous referee for suggesting this caveat.
[2] The mean independence assumption is stronger than necessary for identification of the \(\beta\) vector, but allows for a simpler analysis below. The zero mean for \(x_i\) is the usual normalization. The fourth assumption is necessary for moments to be well defined, and A6 simply ensures that the measurement error process is not so perverse that \(y_i\) is uninformative about \(y_i^*\) (covariance of zero), or that \(y_i\) and \(y_i^*\) are negatively related. Indeed, the necessary condition would simply be that the sign of the covariance were known and that the covariance is not zero. The fifth assumption is the strongest one. It implies that the measurement error process is independent of \(x_i\) and \(u_i\) except through \(y_i^*\), and ensures that \(f(y_i \mid y_i^*) = f(y_i \mid y_i^*, x_i, u_i)\).

Therefore, the OLS regression of \(y_i\) on \(x_i\) yields a consistent estimate of \(b\), which is proportional to \(\beta\). The parameters of interest are identified up to an unknown scaling constant. This implies that estimates of ratios of the parameters are consistent. In some settings, identification up to scale is considered sufficient. For example, in wage regressions the coefficients on years of education and years of labor market experience can be combined to consistently identify the relative return of experience to education, a fact that might be sufficient for estimation of schooling choices. In general, however, we assume the researcher is interested in recovering the \(\beta\) parameters. This suggests two important identification approaches: obtain information about the scaling constant \(\gamma\), or obtain information about one of the elements in \(\beta\). While it may be possible to obtain some consistent estimate of one element in \(\beta\) from auxiliary regressions or economic theory, the use of validation data may permit estimation of \(\gamma\). Bound and Krueger (1991) and Bollinger (1998) have examined the structure of response error when y is the natural log of annual labor market earnings, using Social Security Income data matched to the Current Population Survey. Their point estimate of \(\gamma\) is 0.90. This estimate could be used in log wage models to rescale slope coefficients to account for measurement error.[3]

3 Specific Measurement Error Models

The above analysis holds for general examples of measurement error. In this section, we present three special cases of the above model. These cases are chosen because they are commonly examined or supported in the literature, or because they lead to results that are of interest in the context of this paper. Assumptions 1-6 are maintained; additional assumptions are also imposed.

3.1 Additive White Noise

The classical measurement error process is often assumed: \(y_i = y_i^* + \varepsilon_i\), and \(E[\varepsilon_i] = 0\). Indeed, the error term in regression models is often motivated as measurement error.
The parameters of the linear projection of \(y\) on \(y^*\) are \(\gamma = 1\), \(\alpha = a = 0\), and the least squares estimates are consistent for the parameters of interest \(\beta\). In this model, if \(y_i\) were hourly wages, it would be possible to have observations less than the minimum wage (or for that matter even negative observations) and observations above whatever threshold is deemed a maximum. While it may be true that observations outside the acceptable region are measured with error, observations within the acceptable region are also measured with error. However, as is well known, classical measurement error does not lead to any bias: the estimated standard errors are inflated, but all statistical tests remain valid. Researchers will often point out that standard errors are too large because of the additional measurement error. Standard errors are meant to capture the variation in estimates due to differences across samples. As long as the data generating process does not change, the sampling variation of the estimated coefficients will depend on the variation in both the structural model and the error model. Hence, estimates of the standard error are not biased, but rather reflect the variation across samples for this data generating process.

[3] The results in Bound and Krueger (1991) and Bollinger (1998) rely on estimates from the 1977 and 1978 CPS-SSA matched files. It is possible that the structure of measurement error has changed over time, thereby reducing the applicability of the rescaling option, since it hinges critically on knowledge of the correct \(\gamma\). Estimation of \(\gamma\) is further complicated by the fact that low earning respondents in the SSA data may be reporting their CPS earnings correctly. Examining these hypotheses is an important avenue for future research.

3.2 Linear Measurement Error

A second case is where the data generating process is linear: \(y_i = d + g y_i^* + \varepsilon_i\). Here, the parameters in the linear projection of \(y\) on \(y^*\) are \(\gamma = g\) and \(\alpha = d\), and the model can have either \(\gamma > 1\) or \(\gamma < 1\). The data generating process can lead to observations outside the acceptable range. Because of the values of \(\alpha\) and the distribution of \(\varepsilon_i\), even if \(\gamma < 1\), it is quite possible to have both observations that are too high and observations that are too low. Empirical work by Bollinger (1998) and Bound and Krueger (1991) supports the possibility that \(\gamma < 1\). For example, using non-parametric regression on the 1978 CPS-SSA matched data, Bollinger (1998) estimates that \(\gamma\) is equal to 0.91 for men and 0.97 for women. He estimates the intercepts to be $1,364 and $211 respectively. Cognitive psychologists have noted that this model, with \(\gamma < 1\), will arise when respondents exhibit regression to the mean. If survey respondents give answers that try to make them appear average, then those below the mean report higher values, on average, while those above the mean report lower values, on average. Similarly, the hot deck procedure used by Census to impute earnings can also lead to a regression to the mean (Hirsch and Schumacher, 2001). To our knowledge, no study has found any variable with \(\gamma > 1\).

3.3 Contaminated Data

A third example is a simple contaminated sample: \(y_i = y_i^* \, 1[\varepsilon_{1i} > \tau] + (d + \varepsilon_{2i}) \, 1[\varepsilon_{1i} < \tau]\). The term \(1[\cdot]\) is the indicator function and \((\varepsilon_{1i}, \varepsilon_{2i})\) are mean zero and mutually independent.
This model produces a mixture: with probability \(p = \Pr[\varepsilon_{1i} > \tau]\) we observe the true variable \(y_i^*\), while with probability \((1 - p)\) we observe only noise, \(d + \varepsilon_{2i}\). This leads to a model where we have some correctly measured observations and some observations where the observed y has no relationship to the actual \(y^*\). In this model \(\gamma = p\) and \(\alpha = d(1 - p)\). Again, some observations may fall outside a given range, depending on the distribution of \(\varepsilon_{2i}\) and the value of d. An important implication of this model is that estimates of the slope parameters can be obtained if an estimate of p is available. Horowitz and Manski (1995) note that the expectation of \(y^*\) given x cannot be bounded unless information about d is available. Our analysis does not contradict this, but rather points out that in a linear model, the slopes can be identified up to the contamination rate. In many cases researchers have a priori bounds for the contamination rate. The bounds on the contamination rate will yield trivial bounds for the slope coefficients. If \(\underline{p} < p < \overline{p}\), then the elements of \(\beta\), \(\beta_j\), are bounded by \([b_j/\overline{p},\ b_j/\underline{p}]\).
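The three processes in this section can be compared in a short simulation. The sketch below is illustrative only and is not taken from the paper: the sample size, the parameter values (beta, g, d, p), the error variances, and the function names are all assumptions, and the contamination indicator is drawn as a Bernoulli(p) variable, which plays the role of the event that the first error exceeds its threshold. Under these assumptions, the least squares slope on the observed y should be close to gamma times beta, with gamma equal to 1, g, and p respectively.

```python
import numpy as np


def ols_slope(x, y):
    """Slope from a least squares projection of y on a constant and scalar x."""
    xc = x - x.mean()
    return float(xc @ (y - y.mean()) / (xc @ xc))


def simulate_slopes(n=200_000, beta=1.0, g=0.9, d=0.5, p=0.8, seed=0):
    """Return OLS slopes under three measurement error processes.

    All parameter values are hypothetical choices made for illustration.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)                # mean-zero covariate (A2)
    u = rng.normal(size=n)                # structural error (A1)
    y_star = beta * x + u                 # true dependent variable, equation (1)
    eps = rng.normal(scale=0.5, size=n)   # measurement error, independent (A5)

    additive = y_star + eps               # additive white noise: gamma = 1
    linear = d + g * y_star + eps         # linear measurement error: gamma = g
    correct = rng.random(n) < p           # report is correct with probability p
    contaminated = np.where(correct, y_star, d + eps)  # contaminated: gamma = p

    return {
        "true": ols_slope(x, y_star),
        "additive": ols_slope(x, additive),
        "linear": ols_slope(x, linear),
        "contaminated": ols_slope(x, contaminated),
    }
```

Dividing the "linear" or "contaminated" slope by the corresponding gamma (here g or p) recovers a value close to beta, which is the rescaling idea discussed in Section 2.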

4 Effect of Cleaning

We assume that the researcher truncates above and below the mean. The cleaning approaches we consider are defined by

\[ \{ y_i, x_i \mid c \le y_i \le C \} \tag{6} \]

for known constants \((c, C)\) such that \(c < E[y] < C\). We compare the slopes obtained from a least squares projection of the cleaned \(y_i\) on \(x_i\) to those obtained from the uncleaned data regressed on the covariates (that is, relative to \(b\), the biased estimate of \(\beta\) from the uncensored data). Since the choice for the researcher is to clean or not clean, this is the relevant comparison. As noted in the section above, the slope \(b\) may be larger or smaller in magnitude than the true slope \(\beta\).

4.1 Analytic Results: Trimmed Data

We first derive analytic results under the additional assumption of joint normality.

A7: \((y_i, x_i)\) are jointly normal.

As Goldberger (1981) demonstrates, the slope vector from the least squares projection of \(y_i\) on \(x_i\) in the truncated sample (where observations above C and below c are discarded) is given by:

Proposition 1 Under assumptions 1-7, the truncated slope \(b^*\) is attenuated relative to the slope \(b\) from the least squares projection in the full sample:

\[ b^* = \frac{\theta}{1 - (1 - \theta)\rho^2}\, b \tag{7} \]

with

\[ \theta = \frac{V(y_i \mid c \le y_i \le C)}{\sigma_y^2} \tag{8} \]

and

\[ \rho^2 = \frac{b^2 \sigma_x^2}{\sigma_y^2}, \tag{9} \]

where \(\sigma_y^2 = V(y_i)\) and \(\sigma_x^2 = V(x_i)\).

Proof: Goldberger (1981).

As Goldberger notes, \(0 \le \theta / (1 - (1 - \theta)\rho^2) \le 1\).[4] Clearly, if \(\gamma \le 1\), then the attenuation bias of the measurement error is exacerbated by the attenuation bias of the sample truncation. Hence, only if the researcher is certain that \(\gamma > 1\) can truncation alleviate bias from measurement error. For this case, the optimal level is determined by finding \((c, C)\) such that \(\gamma\theta / (1 - (1 - \theta)\rho^2) = 1\). With two unknown terms and only one restriction, there are many solutions. Too little truncation will fail to fully correct for the bias, while too much will overcorrect.
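For intuition, the condition that gamma times the attenuation factor theta / (1 - (1 - theta) rho^2) equals one can be solved numerically once a second restriction on the truncation points is added. The sketch below is an illustration under stated assumptions rather than the paper's procedure: it assumes truncation points symmetric about E[y] at z standard deviations, uses the standard variance ratio of a symmetrically truncated normal, and finds z by bisection. The function names and the values of gamma and rho^2 in the usage note are hypothetical.

```python
import math


def theta_symmetric(z):
    """V(y | |y - E[y]| <= z*sd) / V(y) for a normally distributed y."""
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    mass = math.erf(z / math.sqrt(2.0))   # equals 2*Phi(z) - 1
    return 1.0 - 2.0 * z * pdf / mass


def attenuation(theta, rho2):
    """Goldberger's attenuation factor theta / (1 - (1 - theta) * rho^2)."""
    return theta / (1.0 - (1.0 - theta) * rho2)


def optimal_z(gamma, rho2, lo=1e-3, hi=10.0, tol=1e-10):
    """Bisection for z such that gamma * attenuation(theta(z), rho2) = 1.

    Only meaningful for gamma > 1; with gamma <= 1 trimming cannot offset
    the measurement error bias, so an error is raised.
    """
    if gamma <= 1.0:
        raise ValueError("an optimal trimming rule requires gamma > 1")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        # The attenuation factor rises with z, so move toward the root.
        if gamma * attenuation(theta_symmetric(mid), rho2) < 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For example, `optimal_z(1.1, 0.3)` and `optimal_z(1.5, 0.3)` return the trimming half-widths (in standard deviations of y) for two hypothetical values of gamma; the second is smaller, so a larger gamma calls for heavier truncation.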
Selecting the trimming bounds on the basis of a priori values for the supports of y (as proxied by the 1st and 99th percentiles of the wage distribution, or trimming at the minimum wage) will therefore not necessarily correspond to the optimal trimming rule. Other solutions exist as well. For example, if the analyst chooses \(c = E[y] - \kappa\) and \(C = E[y] + \kappa\), only the term \(\kappa\) needs to be found in order to devise an optimal trimming rule. It is implicitly described in the next proposition:

Proposition 2 Under assumptions 1-7 and \(\gamma > 1\), an optimal trimming rule of the form \(\{ y_i, x_i \mid c \le y_i \le C \}\) with \(c = E[y] - \kappa\) and \(C = E[y] + \kappa\) may be derived implicitly as:

\[ \frac{2 (\kappa/\sigma_y)\, \phi(\kappa/\sigma_y)}{2\Phi(\kappa/\sigma_y) - 1} = \frac{\gamma - 1}{\gamma - \rho^2}. \tag{10} \]

Proof: see Appendix.

Because the solution involves the cdf of the standard normal distribution, there is no closed form expression. The optimal cleaning depends on the variance of the observed y, the correlation between y and x, and \(\gamma\). The right hand side of (10) is increasing in \(\gamma\) and the left hand side of (10) is decreasing in \(\kappa\). Therefore, as \(\gamma\) increases, the truncation points must move closer to the mean and the data must be truncated more heavily.

In order to use this approach a number of highly restrictive assumptions must be met. First, the data must be jointly normally distributed. Any discrete variables in \(x_i\) will violate this assumption. Second, the measurement error process must result in a projection equation for \(y_i\) on \(y_i^*\) where \(\gamma > 1\). Finally, specific information on \(\gamma\) must be obtained in order to arrive at a truncation rule. Ad hoc cleaning approaches that ignore the strong nature of these assumptions may be of little value in reducing the bias in \(b\).

The variance of the measurement error is often used as a measure of the severity of the error. Since \(b^* = \frac{\theta}{1 - (1 - \theta)\rho^2}\, b\) under trimming, the derivative of \(\frac{\theta}{1 - (1 - \theta)\rho^2}\) with respect to \(\sigma_\varepsilon^2\) reveals how the truncation bias is affected by the measurement error. This result motivates the next proposition:

Proposition 3 Under assumptions 1-7, the absolute value of the difference between elements of \(b^*\) and \(b\) becomes larger as \(\sigma_\varepsilon^2\) increases, since

\[ \frac{\partial}{\partial \sigma_\varepsilon^2} \left[ \frac{\theta}{1 - (1 - \theta)\rho^2} \right] < 0. \]

The above proposition may appear prima facie counterintuitive, but the intuition behind it is simple. As the measurement error becomes more severe (as measured by its variance), trimming is more likely to result in deleting observations based on the regression error instead of the measurement error, thereby causing sample-selection bias.

4.1.1 Does Trimming Reduce Standard Errors?

A second reason sometimes cited for trimming is the reduction of standard errors. To examine this procedure more rigorously we begin by noting that the truncation will introduce heteroskedasticity by reducing the variance of errors in the tails of the distribution. Therefore, the asymptotic variance of the estimated slope from the truncated data can be derived from the expression

\[ AV(\hat{b}^*) = Q^{-1} E\left[ (y_i - x_i' b^*)^2 x_i x_i' \mid c \le y_i \le C \right] Q^{-1}, \]

where \(Q = E[x_i x_i' \mid c \le y_i \le C]\). In the appendix we demonstrate that this expression for asymptotic variance may be written as the sum of two terms:

\[ AV(\hat{b}^*) = \frac{\theta (1 - \rho^2) \sigma_y^2}{1 - (1 - \theta)\rho^2}\, Q^{-1} + Q^{-1} E\left[ (x_i' b^* - m(x_i))^2 x_i x_i' \mid c \le y_i \le C \right] Q^{-1}. \tag{11} \]

The size of the leading term is indeterminate relative to its full-sample OLS counterpart.[5] This fact is in contrast to similar comparisons for the mean (and consequently the results in Stigler (1977)). For the trimmed mean, the term \(\theta\sigma_y^2 < \sigma_y^2\) and therefore trimming necessarily reduces the variance in the leading term. Here, the comparison is not so straightforward. The second term is due to heteroskedasticity from the trimming and is necessarily positive definite. Hence, even if the leading term is smaller than the variance of the OLS estimate on the full sample (in a positive-definite sense), the second term may reverse, or at least mitigate, that difference.

Finally, an often overlooked reason why trimming may not reduce standard errors is the effect of truncation on sample size. The finite sample variance of \(\hat{b}^*\) is given by

\[ V(\hat{b}^*) = \frac{AV(\hat{b}^*)}{N \left[ \Phi\left( \frac{C - E[y_i]}{\sqrt{V(y_i)}} \right) - \Phi\left( \frac{c - E[y_i]}{\sqrt{V(y_i)}} \right) \right]}. \tag{12} \]

The term \(\Phi\left( \frac{C - E[y_i]}{\sqrt{V(y_i)}} \right) - \Phi\left( \frac{c - E[y_i]}{\sqrt{V(y_i)}} \right) < 1\) is the proportion of the sample retained under the truncation rule; discarding a larger share of the sample lowers it and raises estimates of the finite sample variance.

In conclusion, we find that comparisons between the variance of the truncated estimates and the variance of the full sample estimates are complicated and depend on the underlying parameters of the joint distribution. It is not possible to sign this difference, even under normality. In fact, the simulations below show little or no effect of trimming on standard errors.

4.2 Analytic Results: Winsorized Data

Rather than truncation, winsorized data are censored at the points c and C. Here, no observations are removed, but values of \(y_i\) outside of the region \((c, C)\) are transformed as:

\[ y_i^w = \begin{cases} C & \text{if } y_i \ge C \\ y_i & \text{if } c < y_i < C \\ c & \text{if } y_i \le c. \end{cases} \tag{13} \]

[4] Since \(y_i\) is normally distributed, the variance of the doubly truncated distribution can be expressed (see Maddala, 1983) as
\[ V(y_i \mid c \le y_i \le C) = V(y_i) \left[ 1 + \frac{z_c \phi(z_c) - z_C \phi(z_C)}{\Phi(z_C) - \Phi(z_c)} - \left( \frac{\phi(z_c) - \phi(z_C)}{\Phi(z_C) - \Phi(z_c)} \right)^2 \right], \]
where \(z_c = (c - E[y_i])/\sqrt{V(y_i)}\) and \(z_C = (C - E[y_i])/\sqrt{V(y_i)}\).

[5] The leading term is comparable to the asymptotic variance expression for the OLS estimate in the full sample: \(\sigma_y^2 (1 - \rho^2) E[x_i x_i']^{-1}\). Furthermore, \(\frac{\theta}{1 - (1 - \theta)\rho^2} \le 1\). However, the difference \(E[x_i x_i' \mid c \le y_i \le C]^{-1} - E[x_i x_i']^{-1}\) is necessarily positive semi-definite, since the variance of \(x_i\) in the truncated sample cannot be larger than the variance in the full sample (a sufficient condition here is joint normality; see Goldberger (1981)). Hence, a comparison of the leading terms is indeterminate.
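The attenuation result for trimming and the winsorizing rule in equation (13) can be illustrated on simulated data. The sketch below is an assumption-laden illustration, not the paper's code: it draws jointly normal (y, x) with no measurement error, cleans y at its 5th and 95th sample percentiles (hypothetical cutoffs), and compares the trimmed slope with the value implied by Goldberger's factor, where theta and rho^2 are estimated from the simulated sample. All function names are made up for this example.

```python
import numpy as np


def ols_slope(x, y):
    """Slope from a least squares projection of y on a constant and scalar x."""
    xc = x - x.mean()
    return float(xc @ (y - y.mean()) / (xc @ xc))


def trim(x, y, c, C):
    """Discard observations whose y lies outside [c, C]."""
    keep = (y >= c) & (y <= C)
    return x[keep], y[keep]


def winsorize(y, c, C):
    """Recode y below c to c and y above C to C, keeping every observation."""
    return np.clip(y, c, C)


def cleaning_demo(n=400_000, b=1.0, lower_q=0.05, upper_q=0.95, seed=1):
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    y = b * x + rng.normal(size=n)        # jointly normal (y, x), as in A7
    c, C = np.quantile(y, [lower_q, upper_q])

    full = ols_slope(x, y)
    x_t, y_t = trim(x, y, c, C)
    trimmed = ols_slope(x_t, y_t)
    winsorized = ols_slope(x, winsorize(y, c, C))

    theta = y_t.var() / y.var()           # sample analogue of equation (8)
    rho2 = np.corrcoef(x, y)[0, 1] ** 2   # sample analogue of equation (9)
    goldberger = theta / (1.0 - (1.0 - theta) * rho2) * full

    return {"full": full, "trimmed": trimmed,
            "winsorized": winsorized, "goldberger": goldberger}
```

In this design the trimmed slope should track the "goldberger" value closely and lie below the winsorized slope, which in turn lies below the full-sample slope.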

In the appendix we show that the analytical results from winsorizing are similar to those obtained for trimming. The empirical results suggest winsorizing has less of an impact on the slope coefficients (relative to OLS) than truncation. Again, if \(\gamma > 1\), an optimal choice of winsorizing points is available, but of course \(\gamma\) is unknown. Ascertaining the effect of winsorizing on the size of the standard errors is conceptually similar to the effect of trimming on standard errors, but algebraically more difficult. We provide a discussion of this point in the appendix. One advantage winsorizing has over trimming is that the penalty of lost data does not affect the expression for the finite sample variance. But overall, winsorizing and trimming have similar effects on both slope estimates and asymptotic variance.

4.3 Cleaning Data in the General Case

The results derived in the previous section rely heavily upon normality. However, as Goldberger (1981) demonstrates, without normality some coefficients may be attenuated by truncation, while others may be inflated. Clearly, as a theoretical matter, truncation or winsorizing cannot be relied upon to adjust slope coefficients for the bias in general. In contrast, the results from Section 2 were derived under much weaker conditions.

5 Results

To suggest more general results, we present a set of Monte Carlo simulations. We draw data from the 1990 PUMS and estimate the returns to schooling, treating estimates from the full PUMS file as population parameters. The results of different cleaning procedures are gauged against these parameters. This allows for a complicated measurement model, with relationships similar to those found in typical economic data.[6]

5.1 Evidence from U.S. Decennial Census Data

We begin with evidence from quasi-simulated data drawn from the PUMS samples of the 1990 US Decennial Census. The advantage of these data is that they provide a complex multivariate distribution for analysis.
We study the problem of estimating the returns to schooling, using a standard Mincerian specification (that is, $\ln wage = \beta_0 + \beta_1 Schooling + \beta_2 Experience + \beta_3 Experience^2 + \beta_4 Black + u$) to describe the relationship between hourly wages, years of schooling, race and potential experience. We first select prime-aged men who are working full time in a non-agricultural industry. We remove individuals who earn less than the minimum wage. The resulting 346,900 observations constitute a sample that closely simulates the ideal population distribution assumed by many researchers. Column 1 of Table 1 presents mean and standard deviation parameters for this pseudo-population. Black men comprise 8.3 percent of the population, the mean years of potential experience is 17.67, and the mean years of schooling is 13.37. The regression parameters generated by an OLS regression on all 346,900 observations are reported in Column 2 of Table 1.

⁶ Our working paper [Bollinger and Chandra (2003)] provides more detailed Monte Carlo simulations for the univariate and multivariate case, and data from the normal, uniform, and log-normal distributions. Of note are the results for the realistic case of $\delta = 0.9$: cleaning procedures are always dominated by not cleaning the data. Even though the bias from not cleaning the data is a little over 10 percent, the bias from trimming is uniformly greater. 1% winsorizing or the use of median regression are neutral rules with respect to point estimates, but both procedures are dominated by not cleaning the data on the basis of an RMSE criterion. When $\delta > 1$, a 1 percent trimming rule clearly dominates not cleaning the data. Before this conclusion is embraced too quickly by practitioners, we raise two important caveats. First, even though 1% trimming works, 5% trimming is much worse than not cleaning the data; the optimal trimming rule is therefore not a known constant, and small perturbations from the optimal truncation will generate large biases relative to not cleaning the data. In fact, the best rule for the case of $\delta > 1$ would be to use 5% winsorizing. Second, we reiterate the difficulty in being able to justify a behavioral model for why $\delta$ would exceed one.

For our simulations, we randomly draw samples of size 1000 from the pseudo-population of 346,900 observations. Ordinary least squares estimates from these data (without any cleaning) are reported in the third column of the table. In Columns 4-8, we report the effect of alternative cleaning procedures when no measurement error has been added to the PUMS data.⁷ The idea of cleaning data with no error might strike the reader as a peculiar exercise. Our motivation for doing so is to demonstrate that cleaning procedures are not benign and can introduce significant bias when they are not required; alternatively, if the degree of contamination is low, the iatrogenic error from cleaning data may be substantial. In this situation, the cleaning procedures do not generally perform better than not cleaning the data. In general, the RMSE from the cleaning procedures (including median regression) is greater than that from doing nothing. Whereas a 1% trimming rule improves the estimation of the coefficients on experience and experience squared, it is inferior to not cleaning the data as regards the estimation of the coefficients on schooling and race. Together these results confirm those from the univariate case: no cleaning procedure is neutral when applied to already clean data.

Measurement error is added to the data in Table 2.
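The simulation design described above (a fixed pseudo-population, repeated samples of 1,000, and each cleaning rule scored against the population coefficient by its mean, standard error and RMSE) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the pseudo-population is synthetic and univariate, its parameter values only loosely echo Table 1, the `share` argument is taken to be the total fraction of the sample in the two tails, and the optional `delta` and `err_sd` arguments are hypothetical names for a scaled-plus-noise error model.

```python
import random
import statistics


def ols_slope(x, y):
    """Slope from a univariate OLS regression of y on x (with intercept)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def clean(x, y, rule, share):
    """Apply a cleaning rule to y; share/2 of the sample lies in each tail."""
    if rule == "none":
        return x, y
    ranked = sorted(y)
    k = int(len(ranked) * share / 2)
    c, C = ranked[k], ranked[-1 - k]
    if rule == "trim":        # drop pairs whose y falls outside [c, C]
        kept = [(a, b) for a, b in zip(x, y) if c <= b <= C]
        return [a for a, _ in kept], [b for _, b in kept]
    if rule == "winsorize":   # recode y outside [c, C] to the nearest cutoff
        return x, [min(max(b, c), C) for b in y]
    raise ValueError(f"unknown rule: {rule}")


def monte_carlo(n_pop=20_000, n=1_000, reps=200, share=0.05,
                delta=1.0, err_sd=0.0, seed=0):
    rng = random.Random(seed)
    # Hypothetical pseudo-population (not the PUMS data).
    pop_x = [rng.gauss(13.4, 2.2) for _ in range(n_pop)]
    true_y = [0.86 + 0.092 * a + rng.gauss(0.0, 0.5) for a in pop_x]
    target = ols_slope(pop_x, true_y)  # treated as the population parameter
    # Observed outcome: delta * true value + noise (no error by default).
    pop_y = [delta * v + rng.gauss(0.0, err_sd) for v in true_y]
    rules = ("none", "trim", "winsorize")
    draws = {r: [] for r in rules}
    for _ in range(reps):
        idx = rng.sample(range(n_pop), n)
        x = [pop_x[i] for i in idx]
        y = [pop_y[i] for i in idx]
        for r in rules:
            cx, cy = clean(x, y, r, share)
            draws[r].append(ols_slope(cx, cy))
    summary = {}
    for r, est in draws.items():
        mse = statistics.fmean([(e - target) ** 2 for e in est])
        summary[r] = {"mean": statistics.fmean(est),
                      "se": statistics.pstdev(est),
                      "rmse": mse ** 0.5}
    return target, summary


target, summary = monte_carlo()
print(f"population slope: {target:.4f}")
for rule, stats in summary.items():
    print(rule, {name: round(value, 4) for name, value in stats.items()})
```

Calling `monte_carlo(delta=0.9, err_sd=0.3)` or `monte_carlo(delta=1.1, err_sd=0.3)` gives a rough analogue of contaminated data; the numbers will not reproduce the tables, since the design here is far simpler than the multivariate PUMS exercise.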
We select two values for the variance of this error using the results of Bound and Krueger (1991), who note that $Var(\ln y) = 0.458$ and $0.529$, with corresponding error variances of 0.083 and 0.116. This implies that the variance of the error is 18% and 22% of the total variation in $\ln y$. Rodgers, Brown and Duncan (1993) find even higher implied estimates of the variance of the measurement error. Therefore, to study empirically relevant cases we simulate measurement error whose variance is $0.1\,Var(wage)$ and $0.3\,Var(wage)$. In order to keep the reported number of results manageable we only report results from the latter simulation, but note that the results from the former are quantitatively similar [see Bollinger and Chandra (2003) for details].

In the case of additive white noise we find that trimming is once again dominated by not cleaning the data. A case can be made for a 1% winsorizing rule over not cleaning the data, but it is important to note that significant bias is introduced when the censoring rate is increased to 5%. For this case, least squares is found to be superior to median regression. When $\delta = 0.9$ there is no cleaning procedure that strictly dominates OLS. A 1% winsorizing rule provides superior estimates on an RMSE criterion for many coefficients but simultaneously raises the bias on others. For example, the coefficients on Exp, Exp-Sq and Black all have lower RMSE when a 1% winsorizing rule is applied, but the coefficient on schooling has a larger RMSE at the same time. When $\delta = 1.1$, winsorizing at 1% and 5% are preferred to doing nothing. In this case, trimming procedures dominate not cleaning the data on an RMSE criterion, but can be worse in terms of the bias component.

⁷ In the last column of the table, we report results from performing median regression. While not explicitly studied in our paper, we include these estimates because several readers of our paper argued that it may be viewed as an alternative to trimming and winsorizing.

5.1.1 Comparing rescaling approaches to trimming approaches

As noted in previous sections, another identification approach is to rescale the estimates. The optimal trimming rule derived in Section 4 requires both information about $\delta$ and normality. Clearly, if $\delta < 1$, trimming or winsorizing will be dominated by the rescaling approach. Even when $\delta > 1$, the amount of trimming or winsorizing necessary depends upon the variance of the errors (see Bollinger and Chandra (2003)). Examining the first column of Table 2 shows that knowledge of $\delta$ alone will be sufficient to arrive at consistent estimates. Even if $\delta$ is not known, the rescaling results can be used to perform sensitivity analysis. For example, researchers might ask: how sensitive to different values of $\delta$ are the conclusions we draw from our OLS estimates? The robustness of these conclusions may be examined either by placing bounds on $\delta$, as suggested in Manski (1995), or alternatively by asking what values of $\delta$ support the conclusions typically drawn (an approach suggested in a similar context by Bollinger (2003) and Bollinger (2001)). Further, researchers may not have detailed information about $\delta$ but may have information about the likely range of $\delta$. It is difficult to use that information for trimming and winsorizing, but it can be trivially used in a rescaling approach.

5.2 Empirical example from the CPS

We also examine data cleaning approaches using the March 2001 Current Population Survey. There are two measures of hourly wage that a researcher could exploit in these data. One is the hourly wage constructed from the reported annual earnings, weeks worked and usual hours worked. The CPS also asks the actual hourly wage for workers who are paid hourly. Most researchers do not use this variable, as the resulting sample is smaller and only represents hourly wage workers. In this context, the two measures provide an interesting comparison to examine the implications of trimming as is typically practiced.
Our sample consists of males, working full time, year round in non-agricultural positions who are not self-employed. In March 2001, we find 2,626 men who are full time, year round non-agricultural hourly wage workers. The first column of Table 3 presents the log wage regression on the reported hourly wage, while the second column uses the constructed hourly wage. We restricted the constructed hourly wage sample to contain only those workers who also reported an hourly wage, hence any difference between the two columns reflects only differences in the measurement of the hourly wage, rather than sample differences.

One perspective with these results is that the first column represents the "true coefficients," while the results in column 2 are biased due to response error. Interestingly, most of the coefficients in column 2 are larger in magnitude than their corresponding coefficients in column 1. The exception to this is the coefficient on Bachelors degree. This is a case where trimming might be useful. However, the fact that the coefficient on Bachelors is the exception demonstrates that it is difficult to find perfect cases. If column 1 represents the "true coefficients," then columns 3 and 4 represent attempts to correct column 2. Trimming at about 1/2 of the minimum wage is a common practice (see Angrist and Krueger, 2000). Column 3 represents this approach. Another logical correction is to trim at the minimum wage; column 4 represents this approach.

Comparing column 3 with columns 1 and 2, we find that the coefficients on experience, experience squared and black are largely unaffected by the trimming, and are still "too large." The coefficient on less than high school has actually increased in magnitude, making the bias worse. The coefficients on associates degree and graduate degree have declined in magnitude, reducing the bias but not eliminating it. The coefficient on bachelors degree has decreased in magnitude, increasing the bias in this coefficient.

Trimming at the minimum wage, represented by column 4, improves some coefficients but not others. The coefficients on experience and experience squared are now both smaller in magnitude and closer to the "ideal" column 1. The coefficient on less than high school has now declined in magnitude, but it is still somewhat larger than the coefficient in column 1. The coefficient on associates degree has declined and is biased relative to the target in column 1; it now underestimates the magnitude. The coefficient on graduate degree has not changed any further and still overstates the target in column 1. The coefficient on black has declined in magnitude and is now closer to the coefficient in column 1, but is still larger in magnitude. The coefficient on bachelors degree has declined further in magnitude, increasing the bias still further.

The conclusion we take from this is that there is no clear advantage to trimming. While it certainly may reduce the bias for some estimates, it simultaneously makes other estimates worse. Since it is rare to have a target (as we do in this case), it is only through serendipity that one will pick the right trimming rule, even if the researcher is only interested in one specific coefficient.

A second perspective on the estimates in columns 1 and 2 is that they both contain measurement error in the dependent variable.
One would expect that trimmed versions of the two regressions would converge to some set of correct estimates. Columns 5 and 6 are trimmed versions of column 1. As one might expect, the reported wage has fewer observations below the trimming points than the constructed wage. In column 5 there is very little change in the coefficients. In column 6 there is little change in the coefficients on experience, experience squared, or less than High School. The coefficient on Bachelors degree increases in magnitude. This is in sharp contrast to trimming the constructed wage, where the coefficient decreased in magnitude. The coefficients on graduate degree and Black both increase in magnitude slightly. We conclude that it appears only serendipitous if any coefficients converge with trimming. If one is interested in the return to a college degree, trimming is likely to be undesirable, while if one is interested in the Black-white wage gap, trimming at an even higher threshold may be desirable. The effects of trimming are unclear since we cannot even predict in which direction the slope coefficients will change when we trim. Without a priori information, it is difficult or impossible to know if trimming has reduced bias, increased bias, or some of both.

6 Conclusions

The common practice of cleaning data by removing observations where the dependent variable is larger or smaller than some threshold is used in the hope of reducing the impact of measurement error. While this sounds sensible, it may make matters worse. Analytical results using normality demonstrate that cleaning strategies using truncation or winsorizing work only in the case where measurement error in the dependent variable results in an upward bias on the magnitude of coefficients. This case is not empirically supported by investigations of response error in earnings data. Under certain circumstances it may be possible to achieve an optimal cleaning strategy, but if the information necessary for that result were available, a simpler approach based on rescaling the estimated coefficients would work too. Our empirical results demonstrate that a 1 percent winsorizing rule does not alter the results in a meaningful manner, but we note that this policy is generally dominated by doing nothing to the data. Still, winsorizing is clearly better than truncation. We caution, however, that small increments to this rule (for example to 5 percent) can dramatically increase bias and render the cleaning undesirable on MSE grounds.

Two important extensions we hope to address in future work are cleaning based on covariates, and cleaning based upon panel data where the researcher has multiple observations on the dependent variable. In both cases, it may be possible to develop cleaning procedures that exploit other information.

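7 Appendix

Derivation of equation (10)

To solve for the expression in (10), begin by noting that $y$ has mean $E[y]$ and variance $\sigma_y^2 = \delta^2\beta^2\sigma_x^2 + \delta^2\sigma_u^2 + \sigma_\varepsilon^2$. Additionally, we simplify the analysis by only considering a symmetric truncation scheme where $c = E[y] - \kappa$ and $C = E[y] + \kappa$, so that only $\kappa$ need be found. Consider first the expression for $\theta$ in this case:

\[
\theta = 1 + \left[ \frac{ \frac{c - E[y]}{\sigma_y}\,\phi\!\left(\frac{c - E[y]}{\sigma_y}\right) - \frac{C - E[y]}{\sigma_y}\,\phi\!\left(\frac{C - E[y]}{\sigma_y}\right) }{ \Phi\!\left(\frac{C - E[y]}{\sigma_y}\right) - \Phi\!\left(\frac{c - E[y]}{\sigma_y}\right) } \right] - \left[ \frac{ \phi\!\left(\frac{c - E[y]}{\sigma_y}\right) - \phi\!\left(\frac{C - E[y]}{\sigma_y}\right) }{ \Phi\!\left(\frac{C - E[y]}{\sigma_y}\right) - \Phi\!\left(\frac{c - E[y]}{\sigma_y}\right) } \right]^2 .
\]

Substituting the symmetric expressions for $c$ and $C$ yields

\[
\theta = 1 - 2\left( \frac{ \frac{\kappa}{\sigma_y}\,\phi\!\left(\frac{\kappa}{\sigma_y}\right) }{ 2\Phi\!\left(\frac{\kappa}{\sigma_y}\right) - 1 } \right).
\]

Next, noting that $\rho^2$ is observable and $\delta$ is assumed to be known, solve

\[
\delta\left[ \frac{\theta}{1 - (1-\theta)\rho^2} \right] = 1
\]

for $\theta$ in terms of $\delta$ and $\rho^2$:

\[
\theta = \frac{1 - \rho^2}{\delta - \rho^2}.
\]

Substituting for $\theta$ and solving yields the implicit relationship expressed in equation (10):

\[
1 - 2\left( \frac{ \frac{\kappa}{\sigma_y}\,\phi\!\left(\frac{\kappa}{\sigma_y}\right) }{ 2\Phi\!\left(\frac{\kappa}{\sigma_y}\right) - 1 } \right) = \frac{1 - \rho^2}{\delta - \rho^2}.
\]

Proof of Proposition 3

To show that the bias gets worse with more variance in $\varepsilon$, differentiate the bias term with respect to the variance of the measurement error:

\[
\frac{\partial}{\partial \sigma_\varepsilon^2}\left[ \frac{\theta}{1 - (1-\theta)\rho^2} \right]
= \frac{ \frac{\partial\theta}{\partial\sigma_\varepsilon^2}\left(1 - (1-\theta)\rho^2\right) - \theta\left( \rho^2\,\frac{\partial\theta}{\partial\sigma_\varepsilon^2} - (1-\theta)\,\frac{\partial\rho^2}{\partial\sigma_\varepsilon^2} \right) }{ \left(1 - (1-\theta)\rho^2\right)^2 }.
\]

This term is negative if and only if the numerator is negative. Considering only the numerator and grouping similar derivatives yields

\[
\frac{\partial\theta}{\partial\sigma_\varepsilon^2}\,(1 - \rho^2) + \theta(1-\theta)\,\frac{\partial\rho^2}{\partial\sigma_\varepsilon^2}.
\]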

As noted in Goldberger, both $\rho^2$ and $\theta$ are bounded in the unit interval. Inspection of the definition of $\rho^2$ clearly demonstrates that $\partial\rho^2/\partial\sigma_\varepsilon^2 < 0$. Now, consider the definition of $\theta$: inspection reveals that this has the truncation points standardized by the mean and variance of $y$. Hence increasing $\sigma_\varepsilon^2$ is equivalent to increasing $c$ and decreasing $C$ for a truncated standard normal random variable. Since $\theta$ is also the ratio of the variance of the truncated standard normal to the variance of the untruncated standard normal (see Goldberger), increasing $c$ and decreasing $C$ will result in a lower variance for the truncated distribution and thus a lower $\theta$. Hence, by inspection, $\partial\theta/\partial\sigma_\varepsilon^2 < 0$. Combined with the bounds on $\rho^2$ and $\theta$, the result is established.

Does Trimming Reduce Standard Errors

Under assumptions A1-A7, the conditional distribution of $y_i \mid x_i$ is $N\!\left(x_i' b,\; V(y_i)(1-\rho^2)\right)$. We can use the results in Goldberger (1981) to obtain an expression for the truncated second moment matrix as:

\[
Q = E[x_i x_i' \mid c < y_i < C] = E[x_i x_i'] - (1-\theta)\,E[x_i x_i']\,bb'\,E[x_i x_i'].
\]

The term $E\!\left[(y_i - x_i' b^*)^2 x_i x_i' \mid c < y_i < C\right]$ is considered by using the law of iterated expectations. The inner expectation $E\!\left[(y_i - x_i' b^*)^2 \mid x_i,\, c < y_i < C\right]$ can be decomposed into the variation of $y_i$ around the conditional mean in the truncated distribution, and the squared difference between the conditional mean and the linear projection term $x_i' b^*$. Doing so yields:

\[
E\!\left[(y_i - x_i' b^*)^2 \mid x_i,\, c < y_i < C\right] = \sigma_T^2 + \left(x_i' b^* - m(x_i)\right)^2,
\]

where $\sigma_T^2$, a function of $\theta$ and $\rho^2$, denotes the variation of $y_i$ around its conditional mean in the truncated distribution, and $m(x_i)$ is the conditional mean of $y_i$ given $x_i$ in the truncated sample. Combining these terms produces the equation in the paper:

\[
AV(\hat b^*) = \sigma_T^2\, Q^{-1} + Q^{-1}\, E\!\left[\left(x_i' b^* - m(x_i)\right)^2 x_i x_i' \mid c < y_i < C\right] Q^{-1}.
\]

Analytic Results: Winsorized Data

When data are winsorized, no observations are removed, but values of $y_i$ outside of the region $(c, C)$ are transformed as follows:

\[
y_i^w = \begin{cases} C & \text{if } y_i \ge C \\ y_i & \text{if } c < y_i < C \\ c & \text{if } y_i \le c. \end{cases}
\]

As in the section on trimming, let $b$ represent the vector of full-sample (uncleaned) coefficients.
Under bivariate normality, the winsorized coefficient vector, $b^w$, is given by:

\[
b^w = \left[ \Phi\!\left( \frac{C - E[y_i]}{\sqrt{V(y_i)}} \right) - \Phi\!\left( \frac{c - E[y_i]}{\sqrt{V(y_i)}} \right) \right] b.
\]

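Does Winsorizing Reduce Standard Errors?

The effects of winsorizing on the variance are derived similarly to the results for trimming. We start by noting that

\[
AV(\hat b^w) = E[x_i x_i']^{-1}\, E\!\left[\left(y_i^W - x_i' b^w\right)^2 x_i x_i'\right] E[x_i x_i']^{-1}.
\]

Here, the term $E\!\left[\left(y_i^W - x_i' b^w\right)^2 \mid x_i\right]$ can be broken into three terms. Writing $\Phi_c = \Phi\!\left((c - E[y_i])/\sqrt{V(y_i)}\right)$ and $\Phi_C = \Phi\!\left((C - E[y_i])/\sqrt{V(y_i)}\right)$,

\[
E\!\left[\left(y_i^W - x_i' b^w\right)^2 \mid x_i\right] = \Phi_c \left(c - x_i' b^w\right)^2 + \left(1 - \Phi_C\right)\left(C - x_i' b^w\right)^2 + \left(\Phi_C - \Phi_c\right) E\!\left[\left(y_i - x_i' b^w\right)^2 \mid x_i,\, c < y_i < C\right].
\]

Combined with results from the previous section, we obtain

\[
\begin{aligned}
AV(\hat b^w) ={}& \left(\Phi_C - \Phi_c\right) \sigma_T^2\, E[x_i x_i']^{-1}\, Q\, E[x_i x_i']^{-1} \\
&+ E[x_i x_i']^{-1}\, E\!\left[\left(x_i' b^w - m^W(x_i)\right)^2 x_i x_i' \mid c < y_i < C\right] E[x_i x_i']^{-1} \\
&+ \Phi_c\, E[x_i x_i']^{-1}\, E\!\left[\left(c - x_i' b^w\right)^2 x_i x_i' \mid y_i \le c\right] E[x_i x_i']^{-1} \\
&+ \left(1 - \Phi_C\right) E[x_i x_i']^{-1}\, E\!\left[\left(C - x_i' b^w\right)^2 x_i x_i' \mid y_i \ge C\right] E[x_i x_i']^{-1},
\end{aligned}
\]

where $\sigma_T^2$ is the variation of $y_i$ around its conditional mean in the truncated distribution, as in the trimming case. Again, the comparison is difficult. Here, the first term will necessarily be smaller than the OLS expression. However, the second, third and fourth terms are all positive definite. As in the trimming case, the impact on standard errors depends upon the parameters of the model.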
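The symmetric-truncation argument in this appendix can be checked numerically. The sketch below is illustrative only and rests on assumptions introduced here: `k` is the half-width of the retained region in standard-deviation units of $y$, the trimming factor is the Goldberger (1981) attenuation $\theta / (1 - (1-\theta)\rho^2)$, and the winsorizing factor is taken to be the probability mass inside the cutoffs, which holds under bivariate normality. The function names and the example values of `delta` and `rho2` are hypothetical.

```python
import math


def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def central_mass(k):
    """P(-k < Z < k) for a standard normal Z."""
    return math.erf(k / math.sqrt(2.0))


def theta(k):
    """Variance of a standard normal truncated to (-k, k), relative to 1."""
    return 1.0 - 2.0 * k * phi(k) / central_mass(k)


def trimming_factor(k, rho2):
    """Attenuation of the slope when y is truncated symmetrically at +/- k."""
    t = theta(k)
    return t / (1.0 - (1.0 - t) * rho2)


def winsorizing_factor(k):
    """Attenuation of the slope when y is winsorized at +/- k (normality assumed)."""
    return central_mass(k)


def solve_increasing(f, target, lo=1e-6, hi=10.0, tol=1e-12):
    """Bisection for f(k) = target, where f is increasing on [lo, hi]."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def optimal_trim_half_width(delta, rho2):
    """Half-width k at which trimming exactly offsets a scale factor delta > 1."""
    if delta <= 1.0:
        raise ValueError("an attenuating rule can only offset delta > 1")
    return solve_increasing(lambda k: trimming_factor(k, rho2), 1.0 / delta)


def optimal_winsor_half_width(delta):
    """Half-width k at which winsorizing exactly offsets delta > 1."""
    if delta <= 1.0:
        raise ValueError("an attenuating rule can only offset delta > 1")
    return solve_increasing(winsorizing_factor, 1.0 / delta)


delta, rho2 = 1.1, 0.3  # hypothetical values for illustration
k_trim = optimal_trim_half_width(delta, rho2)
k_wins = optimal_winsor_half_width(delta)
print(f"trim at +/- {k_trim:.3f} sd, removing {1 - central_mass(k_trim):.2%} of the data")
print(f"winsorize at +/- {k_wins:.3f} sd, recoding {1 - central_mass(k_wins):.2%} of the data")
print(f"rescaling instead divides the OLS slope by delta = {delta}")
```

The sketch makes the practical point of the text visible: the offsetting cutoff moves with both `delta` and `rho2`, whereas the rescaling alternative needs only `delta`.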

References

[1] Angrist, Joshua D. and Alan B. Krueger, 2000. Empirical Strategies in Labor Economics, in Orley Ashenfelter and David Card (Eds.) Handbook of Labor Economics, Vol 3A (Elsevier Science).
[2] Black, Dan A., Mark C. Berger and Frank A. Scott, 2000. Bounding Parameter Estimates with Nonclassical Measurement Error, Journal of the American Statistical Association 95: 739-48.
[3] Bollinger, Christopher R., 1996. Bounding Mean Regressions When A Binary Regressor is Mismeasured, Journal of Econometrics 73: 387-399.
[4] Bollinger, Christopher R., 2003. Measurement Error in Human Capital and the Black-White Wage Differential, Review of Economics and Statistics 85: 578-587.
[5] Bollinger, Christopher and Martin H. David, 1997. Modeling Food Stamp Participation in the Presence of Reporting Errors, Journal of the American Statistical Association 92: 827-35.
[6] Bollinger, Christopher, 1998. Measurement Error in the Current Population Survey: A Nonparametric Look, Journal of Labor Economics 16(3): 57-71.
[7] Bollinger, Christopher and Amitabh Chandra, 2003. "Iatrogenic Specification Error," NBER Technical Working Paper 289, Cambridge, MA.
[8] Bound, John and Alan B. Krueger, 1991. The Extent of Measurement Error in Longitudinal Earnings Data: Do Two Wrongs Make a Right? Journal of Labor Economics 9: 1-24.
[9] Bound, John and Richard Freeman, 1992. What Went Wrong? The Erosion of Relative Earnings and Employment Among Black Men in the 1980s, Quarterly Journal of Economics 107(1), February: 201-32.
[10] Bound, John, Charles Brown, Greg J. Duncan and Willard L. Rodgers, 1994. Evidence on the Validity of Cross-Sectional and Longitudinal Labor Market Data, Journal of Labor Economics 12: 345-68.
[11] Buchinsky, Moshe, 1994. Changes in the U.S. Wage Structure 1963-1987: Application of Quantile Regression, Econometrica 62: 405-58.
[12] Card, David and Alan B. Krueger, 1992a. School Quality and Black-White Relative Earnings: A Direct Assessment, Quarterly Journal of Economics 107, February: 151-200.
[13] Fuller, Wayne A., 1987. Measurement Error Models. John Wiley and Sons (New York, NY).
[14] Goldberger, Arthur S., 1981. Linear Regression after Selection, Journal of Econometrics 15(3): 357-66.
[15] Hirsch, Barry T. and Edward J. Schumacher, 2001. Match Bias in Wage Gap Estimates Due to Earnings Imputations, unpublished manuscript.
[16] Horowitz, Joel L. and Charles F. Manski, 1995. Identification and Robustness with Contaminated and Corrupted Data, Econometrica 63(2): 281-302.
[17] Hyslop, Dean R. and Guido W. Imbens, 2001. Bias From Classical and Other Forms of Measurement Error, Journal of Business and Economic Statistics 19(4): 475-481.
[18] Juhn, Chinhui, Kevin M. Murphy and Brooks Pierce, 1993. Wage Inequality and the Rise in the Returns to Skill, Journal of Political Economy 101: 410-42.
[19] Katz, Lawrence and Kevin M. Murphy, 1992. Changes in Relative Wages 1963-1987, Quarterly Journal of Economics 107(1): 35-78.
[20] MacDonald, Glenn M. and Chris Robinson, 1985. Cautionary Tales About Arbitrary Deletion of Observations; or, Throwing the Variance out with the Bathwater, Journal of Labor Economics 3(2): 124-52.
[21] Maddala, G. S., 1983. Limited Dependent and Qualitative Variables in Econometrics (Cambridge University Press).
[22] Manski, Charles F., 1995. Identification Problems in the Social Sciences (Harvard University Press).
[23] Mellow, Wesley and Hal Sider, 1983. Accuracy of Response in Labor Market Surveys: Evidence and Implications, Journal of Labor Economics 1: 331-44.
[24] Rodgers, Willard L., Charles C. Brown and Greg J. Duncan, 1993. Errors in Survey Reports of Earnings, Hours Worked and Hourly Wages, Journal of the American Statistical Association 88: 1208-18.
[25] Stigler, Stephen M., 1977. Do Robust Estimators Work with Real Data? Annals of Statistics 5(6): 1055-98.

Table 1: Effect of Cleaning Procedures on Uncorrupted Data, Evidence from the Returns to Schooling in 1990 PUMS Data

                           Mean  Population b  No Cleaning  Trim 1%  Trim 5%  Wins 1%  Wins 5%   Median
Schooling               13.3741         0.092       0.0918   0.0878   0.0704   0.0910   0.0856   0.1001
SE (Schooling)           2.2189             -       0.0078   0.0072   0.0068   0.0075   0.0070   0.0086
RMSE                          -             -       0.0078   0.0083   0.0227   0.0076   0.0095   0.0118
Potential Experience    17.6700        0.0374       0.0375   0.0361   0.0294   0.0372   0.0353   0.0402
SE (Pot. Exp)            8.5976             -       0.0084   0.0079   0.0069   0.0082   0.0076   0.0095
RMSE                          -             -       0.0084   0.0080   0.0106   0.0082   0.0079   0.0099
Pot. Exp. Sq /100        3.8639       -0.0535      -0.0538  -0.0522  -0.0419  -0.0535  -0.0508  -0.0565
SE (Pot. Exp. Sq)        3.3643             -       0.0218   0.0202   0.0175   0.0211   0.0196   0.0246
RMSE                          -             -       0.0218   0.0203   0.0210   0.0211   0.0198   0.0248
Black (1 = yes)          0.0831       -0.1419      -0.1416  -0.1374  -0.1072  -0.1417  -0.1339  -0.1640
SE (Black)               0.2762             -       0.0597   0.0544   0.0492   0.0579   0.0531   0.0697
RMSE                          -             -       0.0597   0.0545   0.0602   0.0579   0.0538   0.0731
Constant                      -        0.8608       0.8639   0.9293   1.2376   0.8746   0.9649   0.7280
SE                            -             -       0.1275   0.1195   0.1112   0.1229   0.1150   0.1430

Dependent variable is ln hourly wage. PUMS data are restricted to white (non-Hispanic) and black men in the 1990 PUMS files of the Decennial Census who are aged 25-55 during the census reference week. Nonworkers and respondents with hourly wages less than $3.35 in 1989 (the nominal value of the minimum wage) are deleted from the analysis. Column (1) reports means and standard deviations (in the SE rows) for this sample of 346,900 individuals, and column 2 reports the parameters from estimating the model ln wage = b_0 + b_1 Schooling + b_2 Exp + b_3 Exp^2 + b_4 Black + u on this sample. Reported estimates in other columns are empirical sample moments from 1,000 replications each with a sample size of 1,000.

Table 2: Effect of Cleaning Procedures on Corrupted Data, Evidence from the 1990 PUMS

                      No Cleaning  Trim 1%  Trim 5%  Wins 1%  Wins 5%   Median

Error Model: ln wage = ln wage* + e; var(e) = 0.3 x var(wage)
Schooling                  0.0925   0.0883   0.0704   0.0916   0.0859   0.1001
SE (Schooling)             0.0081   0.0074   0.0068   0.0078   0.0072   0.0091
RMSE                       0.0082   0.0084   0.0227   0.0078   0.0095   0.0121
Pot. Exp                   0.0377   0.0363   0.0294   0.0374   0.0354   0.0402
SE (Pot. Exp)              0.0084   0.0078   0.0070   0.0081   0.0075   0.0096
RMSE                       0.0084   0.0079   0.0107   0.0081   0.0078   0.0100
Pot. Exp. Sq /100         -0.0539  -0.0524  -0.0419  -0.0537  -0.0510  -0.0566
SE (Pot. Exp. Sq)          0.0215   0.0201   0.0178   0.0209   0.0193   0.0248
RMSE                       0.0215   0.0202   0.0213   0.0209   0.0195   0.0250
Black (1 = yes)           -0.1426  -0.1394  -0.1097  -0.1426  -0.1346  -0.1616
SE (Black)                 0.0639   0.0578   0.0513   0.0616   0.0564   0.0730
RMSE                       0.0639   0.0578   0.0606   0.0616   0.0569   0.0756
Constant                   0.8535   0.9222   1.2396   0.8667   0.9621   0.7281
SE                         0.1345   0.1237   0.1147   0.1289   0.1203   0.1502

Error Model: ln wage = 0.9 ln wage* + e; var(e) = 0.3 x var(wage)
Schooling                  0.0833   0.0795   0.0634   0.0825   0.0773   0.0901
SE (Schooling)             0.0074   0.0067   0.0062   0.0071   0.0066   0.0082
RMSE                       0.0115   0.0142   0.0294   0.0119   0.0161   0.0084
Pot. Exp                   0.0339   0.0327   0.0264   0.0337   0.0318   0.0360
SE (Pot. Exp)              0.0075   0.0071   0.0061   0.0073   0.0068   0.0087
RMSE                       0.0083   0.0086   0.0127   0.0083   0.0088   0.0088
Pot. Exp. Sq /100         -0.0485  -0.0471  -0.0375  -0.0483  -0.0459  -0.0502
SE (Pot. Exp. Sq)          0.0195   0.0182   0.0158   0.0190   0.0175   0.0225
RMSE                       0.0201   0.0193   0.0225   0.0197   0.0191   0.0227
Black (1 = yes)           -0.1276  -0.1250  -0.0976  -0.1276  -0.1205  -0.1439
SE (Black)                 0.0576   0.0518   0.0458   0.0554   0.0507   0.0641
RMSE                       0.0594   0.0545   0.0637   0.0572   0.0551   0.0641
Constant                   0.7672   0.8295   1.1165   0.7789   0.8651   0.6558
SE                         0.1208   0.1108   0.1020   0.1158   0.1075   0.1351

Error Model: ln wage = 1.1 ln wage* + e; var(e) = 0.3 x var(wage)
Schooling                  0.1018   0.0972   0.0775   0.1008   0.0945   0.1104
SE (Schooling)             0.0090   0.0082   0.0075   0.0087   0.0080   0.0100
RMSE                       0.0132   0.0097   0.0164   0.0123   0.0084   0.0208
Pot. Exp                   0.0415   0.0400   0.0325   0.0412   0.0390   0.0441
SE (Pot. Exp)              0.0092   0.0087   0.0075   0.0090   0.0083   0.0106
RMSE                       0.0101   0.0090   0.0089   0.0097   0.0085   0.0125
Pot. Exp. Sq /100         -0.0594  -0.0577  -0.0466  -0.0591  -0.0562  -0.0616
SE (Pot. Exp. Sq)          0.0239   0.0224   0.0192   0.0232   0.0214   0.0275
RMSE                       0.0246   0.0227   0.0205   0.0238   0.0216   0.0286
Black (1 = yes)           -0.1560  -0.1527  -0.1204  -0.1562  -0.1478  -0.1789
SE (Black)                 0.0702   0.0635   0.0554   0.0675   0.0618   0.0803
RMSE                       0.0716   0.0644   0.0594   0.0690   0.0621   0.0884
Constant                   0.9380   1.0126   1.3612   0.9522   1.0567   0.7982
SE                         0.1480   0.1352   0.1238   0.1418   0.1321   0.1656

Dependent variable is ln hourly wage. PUMS data are restricted to white (non-Hispanic) and black men in the 1990 PUMS files of the Decennial Census who are aged 25-55 during the census reference week (n = 346,900). Nonworkers and respondents with hourly wages less than $3.35 in 1989 (the nominal value of the minimum wage) are deleted from the analysis, and measurement error is added to observed ln wage using the specified error models. Reported estimates are empirical sample moments from 1,000 replications each with a sample size of 1,000. The variance of ln(wage) = 0.3144.

Table 3: Effect of Cleaning Procedures: Evidence from the March 2001 CPS

Columns:
(1) Baseline regression, reported hourly wage
(2) Baseline regression, constructed hourly wage
(3) Constructed hourly wage, trim at 1/2 min wage
(4) Constructed hourly wage, trim at min wage
(5) Reported hourly wage, trim at 1/2 min wage
(6) Reported hourly wage, trim at min wage
(7) Constructed hourly wage, wins at 1/2 min wage
(8) Constructed hourly wage, wins at min wage
(9) Reported hourly wage, wins at 1/2 min wage
(10) Reported hourly wage, wins at min wage

                            (1)      (2)      (3)      (4)      (5)      (6)      (7)      (8)      (9)     (10)
Potential Exp             0.028    0.033    0.033    0.030    0.028    0.028    0.033    0.032    0.028    0.028
                        (0.002)  (0.003)  (0.003)  (0.003)  (0.002)  (0.002)  (0.003)  (0.003)  (0.002)  (0.002)
Potential Exp²           -0.045   -0.051   -0.051   -0.046   -0.046   -0.045   -0.052   -0.049   -0.045   -0.045
                        (0.005)  (0.006)  (0.006)  (0.005)  (0.005)  (0.004)  (0.006)  (0.005)  (0.005)  (0.005)
Less than HS (1=Yes)     -0.261   -0.287   -0.292   -0.280   -0.252   -0.253   -0.290   -0.285   -0.258   -0.256
                        (0.024)  (0.031)  (0.029)  (0.028)  (0.024)  (0.023)  (0.030)  (0.028)  (0.024)  (0.023)
Associates Deg (1=Yes)    0.170    0.191    0.187    0.166    0.168    0.168    0.191    0.184    0.170    0.169
                        (0.028)  (0.037)  (0.034)  (0.032)  (0.028)  (0.027)  (0.035)  (0.034)  (0.028)  (0.028)
Bachelors Deg (1=Yes)     0.254    0.242    0.235    0.211    0.253    0.270    0.238    0.232    0.254    0.258
                        (0.028)  (0.037)  (0.034)  (0.032)  (0.028)  (0.027)  (0.035)  (0.034)  (0.028)  (0.028)
Graduate Deg (1=Yes)      0.377    0.590    0.575    0.575    0.375    0.450    0.584    0.577    0.377    0.398
                        (0.058)  (0.076)  (0.070)  (0.066)  (0.057)  (0.056)  (0.072)  (0.069)  (0.058)  (0.056)
Black (1=Yes)            -0.087   -0.135   -0.136   -0.113   -0.090   -0.093   -0.138   -0.130   -0.088   -0.090
                        (0.025)  (0.033)  (0.031)  (0.030)  (0.025)  (0.024)  (0.032)  (0.030)  (0.025)  (0.025)
Constant                  2.240    2.197    2.220    2.282    2.238    2.241    2.203    2.230    2.240    2.240
                        (0.025)  (0.032)  (0.030)  (0.029)  (0.024)  (0.024)  (0.031)  (0.029)  (0.025)  (0.024)
Observations               2626     2626     2612     2534     2622     2605     2626     2626     2626     2626

Standard errors are in parentheses. Sample is drawn from the March 2001 CPS. The sample consists of males who are not self-employed, working full time, year round in non-agricultural positions, who were paid hourly. Reported hourly wage refers to the respondent's report of hourly wage (true wage); constructed hourly wage is constructed using annual earnings, hours and weeks worked (and therefore constitutes a noisy measure of wage). The correlation between the two wage measures is 0.39 (and 0.54 in logs). The standard deviations for actual and constructed hourly pay are 7.4 and 12.5 respectively (0.45 and 0.58 in logs). Reported hourly pay ranged from $1 to $99, whereas constructed hourly wages ranged from $0.02 to $184.13.