Ordinary Least Squares: the univariate case

Majeure Economie, September 2011

1 Introduction
2 The OLS method: Objective and principles of OLS; Deriving the OLS estimates; Do OLS keep their promises?
3 The linear causal model: Assumptions; Identification and estimation; Limits
4 A simulation & applications: OLS do not always yield good estimates...; But things can be improved...; Empirical applications
5 Conclusion and exercises

Objectives
Objective 1: to make the best possible guess about a variable Y based on X. Find a function of X which yields good predictions for Y. Example: given cigarette prices, what will cigarette sales be in September 2010 in France?
Objective 2: to determine the causal mechanism by which X influences Y. This is a ceteris paribus type of analysis: everything else being equal, how does a change in X affect Y? By how much does one more year of education increase an individual's wage? By how much would hiring 1,000 more policemen decrease the crime rate in Paris?
The tool we use: a data set, in which we observe the wages and number of years of education of N individuals.

Objective and principles of OLS: What we have and what we want
For each individual in our data set we observe his wage and his number of years of education. Assume we have a graph such as the one below. The relationship between the two variables seems to be linear. We want to find the line which best describes the relationship between these variables.
[Figure: scatter plot of Wage (0 to 4000) against Years of Schooling (8 to 20)]

Objective and principles of OLS: The principle of OLS
A line is characterized by an intercept and a slope, which we denote $\alpha$ and $\beta$. Idea: choose for $\alpha$ and $\beta$ the values which minimize $\sum_{i=1}^{N} (Y_i - \alpha - \beta X_i)^2$. These values are the estimates, denoted $\hat{\alpha}$ and $\hat{\beta}$.
Let us denote $\hat{Y}_i = \hat{\alpha} + \hat{\beta} X_i$. It represents the wage of individual i as predicted by our model.
We also denote $\hat{\varepsilon}_i = Y_i - \hat{Y}_i$. The $\hat{\varepsilon}_i$ are called the estimated residuals and represent the mistake made by our model when predicting individual i's wage based on his number of years of schooling.
=> the principle of OLS is merely to minimize the sum of the squared mistakes we make when we use an affine function of $X_i$ to predict $Y_i$.
Why do we take the square of $\hat{\varepsilon}_i$? Could we have used another function?

Objective and principles of OLS: A graphical example
[Figure: scatter plot of Wage (0 to 4000) against Years of Schooling (8 to 20)]

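Deriving the OLS estimates: Finding $\hat{\alpha}$ and $\hat{\beta}$ (Theorem 1.1)
We denote $\bar{Y} = \frac{1}{N}\sum_i Y_i$ the empirical mean of $(Y_i)$, $\bar{X}$ the empirical mean of $(X_i)$, $V_e(X) = \frac{1}{N}\sum_i X_i^2 - \left(\frac{1}{N}\sum_i X_i\right)^2$ the empirical variance of $(X_i)$ and finally $\operatorname{cov}_e(X, Y) = \frac{1}{N}\sum_i X_i Y_i - \bar{X}\bar{Y}$ the empirical covariance of $(X_i)$ and $(Y_i)$.
We want to minimize $f(\alpha, \beta) = \sum_i (Y_i - \alpha - \beta X_i)^2$.
Solution: $\hat{\beta} = \frac{\operatorname{cov}_e(X, Y)}{V_e(X)}$ and $\hat{\alpha} = \bar{Y} - \frac{\operatorname{cov}_e(X, Y)}{V_e(X)}\bar{X}$.
Can we compute $\hat{\beta}$ from the sample? Any problem with the computations? Any idea how to interpret this result?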
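The solution comes from the two first-order conditions of the minimization problem. The derivation below is a sketch using the notation defined above. The labels FOC1 and FOC2 are chosen to match the way the conditions are cited later, in the slide on the $R^2$.

```latex
\begin{align*}
\frac{\partial f}{\partial \alpha}(\hat{\alpha}, \hat{\beta}) = 0
  &\iff \sum_i (Y_i - \hat{\alpha} - \hat{\beta} X_i) = 0
  \iff \hat{\alpha} = \bar{Y} - \hat{\beta}\bar{X}
  && \text{(FOC1)} \\
\frac{\partial f}{\partial \beta}(\hat{\alpha}, \hat{\beta}) = 0
  &\iff \sum_i X_i (Y_i - \hat{\alpha} - \hat{\beta} X_i) = 0
  && \text{(FOC2)}
\end{align*}
Substituting FOC1 into FOC2 and dividing by $N$:
\begin{align*}
\frac{1}{N}\sum_i X_i Y_i - \bar{X}\bar{Y}
  - \hat{\beta}\left(\frac{1}{N}\sum_i X_i^2 - \bar{X}^2\right) = 0
  \iff \operatorname{cov}_e(X, Y) - \hat{\beta}\, V_e(X) = 0
  \iff \hat{\beta} = \frac{\operatorname{cov}_e(X, Y)}{V_e(X)} .
\end{align*}
```

The last step divides by $V_e(X)$, so it requires that the $X_i$ are not all equal.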

Deriving the OLS estimates: An example
Compute $\hat{\beta}$ in this simple example:

Individual   Years of Schooling   Wage
1            5                    1000
2            5                    1500
3            10                   1000
4            15                   2000
5            15                   2500
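To check a hand computation, the formulas of Theorem 1.1 can be coded directly. This is a minimal sketch in plain Python. The function name `ols_fit` and the variable names are illustrative choices and do not come from the course material.

```python
def ols_fit(x, y):
    """Return (alpha_hat, beta_hat) using the empirical variance and covariance."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # V_e(X) = (1/N) sum X_i^2 - X_bar^2
    var_x = sum(xi * xi for xi in x) / n - x_bar ** 2
    # cov_e(X, Y) = (1/N) sum X_i Y_i - X_bar * Y_bar
    cov_xy = sum(xi * yi for xi, yi in zip(x, y)) / n - x_bar * y_bar
    if var_x == 0:
        raise ValueError("all X_i are equal: the slope is not defined")
    beta_hat = cov_xy / var_x
    alpha_hat = y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat


# Data from the five-individual example above
schooling = [5, 5, 10, 15, 15]
wage = [1000, 1500, 1000, 2000, 2500]

alpha_hat, beta_hat = ols_fit(schooling, wage)
print("alpha_hat =", alpha_hat, "beta_hat =", beta_hat)
```

The explicit check on `var_x` mirrors the question raised above about possible problems with the computations: the slope cannot be computed when every individual has the same number of years of schooling.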

Do OLS keep their promises? Do OLS attain objectives 1 and 2?
Objective 1: find the best prediction for Y based on X, i.e. find a function $P(X_i)$ of $X_i$ which yields good predictions for $Y_i$.
Objective 2: determine the causal mechanism by which X influences Y.

Do OLS keep their promises? OLS partially reach objective 1.
Once we agree that a good prediction is one which minimizes the sum of squared errors, OLS yield by construction the best prediction function for Y among all affine functions of X. But:
First, the criterion can be challenged: one could minimize $\sum_i |\hat{\varepsilon}_i|$ instead of $\sum_i \hat{\varepsilon}_i^2$. This is not so big an issue: quantile (median) regression models minimize $\sum_i |\hat{\varepsilon}_i|$ and their results are usually close to those of OLS.
Second, even if the criterion is accepted, OLS yield the best prediction function among all affine functions of X, not among all functions of X. There might for instance exist a polynomial function of X, $\alpha + \beta X + \gamma X^2$, which yields errors $\varepsilon'_i$ such that $\sum_i (\varepsilon'_i)^2 < \sum_i \hat{\varepsilon}_i^2$. This is not so big an issue either, see next chapter.
How can we measure the extent to which Objective 1 is reached?

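Do OLS keep their promises? The $R^2$: a measure of the quality of our predictions
$SST = \sum_i (Y_i - \bar{Y})^2$: the dispersion of wages.
$SSE = \sum_i (\hat{Y}_i - \bar{Y})^2$: the dispersion of predicted wages.
$SSR = \sum_i (Y_i - \hat{Y}_i)^2$: the sum of the squared errors.
$SST = \sum_i (Y_i - \bar{Y})^2 = \sum_i (Y_i - \hat{Y}_i + \hat{Y}_i - \bar{Y})^2 = \sum_i (Y_i - \hat{Y}_i)^2 + \sum_i (\hat{Y}_i - \bar{Y})^2 + 2\sum_i \hat{\varepsilon}_i (\hat{Y}_i - \bar{Y}) = SSE + SSR + 2\hat{\alpha}\sum_i \hat{\varepsilon}_i + 2\hat{\beta}\sum_i \hat{\varepsilon}_i X_i - 2\bar{Y}\sum_i \hat{\varepsilon}_i$.
According to FOC1 (the first-order condition with respect to $\alpha$), $\sum_i \hat{\varepsilon}_i = 0$; according to FOC2 (the first-order condition with respect to $\beta$), $\sum_i \hat{\varepsilon}_i X_i = 0$. Therefore, $SST = SSE + SSR$.
$R^2 = \frac{SSE}{SST}$. The $R^2$ always lies between 0 and 1 (why?). It measures the share of the variance observed in the sample that our model is able to account for, and hence the quality of our predictions for Y based on X. However, a model with a low $R^2$ can still be helpful, and a model with a high $R^2$ can be useless.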
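The decomposition SST = SSE + SSR can be checked numerically. The sketch below uses the five-individual example from the section on deriving the OLS estimates. The function name `r_squared_decomposition` is an illustrative choice. The slope is written in its centered form, which is algebraically the same as $\operatorname{cov}_e(X, Y) / V_e(X)$.

```python
def r_squared_decomposition(x, y):
    """Return (SST, SSE, SSR, R2) for the OLS fit of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Centered form of cov_e(X, Y) / V_e(X); the 1/N factors cancel out.
    beta_hat = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
                / sum((xi - x_bar) ** 2 for xi in x))
    alpha_hat = y_bar - beta_hat * x_bar
    y_hat = [alpha_hat + beta_hat * xi for xi in x]

    sst = sum((yi - y_bar) ** 2 for yi in y)                 # dispersion of Y
    sse = sum((yh - y_bar) ** 2 for yh in y_hat)             # dispersion of predictions
    ssr = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))    # sum of squared residuals
    return sst, sse, ssr, sse / sst


schooling = [5, 5, 10, 15, 15]
wage = [1000, 1500, 1000, 2000, 2500]

sst, sse, ssr, r2 = r_squared_decomposition(schooling, wage)
print("SST =", sst, "SSE + SSR =", sse + ssr, "R2 =", r2)
```

Because SSE and SSR are both sums of squares and add up to SST, the ratio SSE / SST cannot leave the interval [0, 1], which is one way to answer the "why?" above.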

Do OLS keep their promises? But OLS do not necessarily reach objective 2.
[Figure: scatter plot of Wage (0 to 4000) against Years of Schooling (8 to 20)]
Individuals with more schooling have higher wages. Does it imply that schooling has a causal impact on wages?

Do OLS keep their promises? But OLS do not necessarily reach objective 2.
Reverse causality: the relationship can be read the other way round, with causality going in the other direction. Here, this is not an issue: higher wages cannot cause longer education because schooling takes place before labor market participation.
Omitted variable bias: individuals with many years of schooling make more money than those with few years of schooling. But do those two groups differ only in their number of years of schooling? Probably not. For instance, those with more years of schooling might have richer parents, or might also be more clever. Is this correlation between wages and education due only to the effect of education on wages, or also to the fact that those with more education are more clever and have richer parents?

Do OLS keep their promises? A causal framework
[Diagram: three cells (Parents' wage, Children's education, Children's wage) linked by arrows]
Parents' wage => Children's education: well-paid parents can afford to send their children to school, then to college and finally to university.
Parents' wage => Children's wage: well-paid parents have good networking skills and know how to get good positions => they can help their children.
Children's education => Children's wage (green cell): education increases children's productivity and their ability to find a well-paid job (signalling theory).
True causal impact of education on wages = green cell. If this framework is true, does $\hat{\beta}$, i.e. the correlation between children's education and wage, measure the green cell only? Does it overestimate or underestimate the green cell?

Assumptions: Positing a linear causal model
We assume that for every individual, his income is generated according to the following model:
Income = $\alpha$ + $\beta$ × Number of Years of Education + $\varepsilon$
More formally: $Y_i = \alpha + \beta X_i + \varepsilon_i$. $Y_i$ is the dependent variable, $X_i$ the explanatory variable, and $\varepsilon_i$ the error term: all other determinants of income (cleverness, gender...).
Assumption 1. $\beta$ measures by how much the wage changes when the education of an individual increases by one year and all the other determinants of income ($\varepsilon$) remain unchanged (ceteris paribus impact of education), i.e. the causal impact of education on income.
Assuming that education has an influence on income does not seem to be too big an assumption. However, we assume that this influence is linear: when the number of years of education is increased by 1, the wage increases by $\beta$. Realistic? Moreover, we assume that this influence is the same for everyone: $\beta$ does not depend on i. Realistic?

Assumptions: Why is linearity not so stupid an assumption...
If the relationship between the data does not look linear at all, you can try to estimate a different equation: $Y_i = \alpha + \beta X_i^2 + \varepsilon_i$ for instance if the relationship is quadratic. If the data looks as in the graph below, which relationship do you want to estimate?
[Figure: scatter plot with vertical axis from 0 to 3500 and horizontal axis from 0 to 20]

Assumptions: Other assumptions
Assumption 2: random sampling. $(X_i, \varepsilon_i)$ is independent of $(X_j, \varepsilon_j)$ for $i \neq j$. This amounts to saying that the number of years of education completed by Mr Dupont, or his marital status, is not related to those of Mr Duchamp, who lives fifty kilometers from him and whom he does not know. This seems fairly credible.
Assumption 3: sample variation. In our sample, not all the $X_i$ are equal. Trivial assumption: if it is not verified, that is to say if all the individuals in our sample have the same number of years of education, it is impossible to determine the impact of education on wage from our data. This implies that $V_e(X) > 0$.
Assumption 4: $\varepsilon_i \perp X_i$, i.e. the error term is unrelated to the explanatory variable, so that $\operatorname{cov}(\varepsilon_i, X_i) = 0$.
Question: in our example of wage and education, do you believe that $\varepsilon_i \perp X_i$?

Identification and estimation: What is identification?
Identification amounts to finding a formula relating an unknown parameter (here this unknown parameter will be $\beta$, the causal impact of education on wages) to quantities that we can estimate from the data.

Identification and estimation: Identification of the linear model
Theorem: under assumptions 1 to 4, $\beta$ is identified.
Proof:
$\operatorname{cov}(Y_i, X_i) = \operatorname{cov}(\alpha + \beta X_i + \varepsilon_i, X_i)$ according to assumption 1
$= \operatorname{cov}(\alpha, X_i) + \beta \operatorname{cov}(X_i, X_i) + \operatorname{cov}(\varepsilon_i, X_i)$ according to the properties of covariance
$= \beta V(X_i)$ since $\operatorname{cov}(\alpha, X_i) = 0$ and $\operatorname{cov}(\varepsilon_i, X_i) = 0$ according to assumption 4.
Therefore, $\beta = \frac{\operatorname{cov}(Y_i, X_i)}{V(X_i)}$.

Identification and estimation: How to estimate $\beta$?
As shown above, $\beta = \frac{\operatorname{cov}(Y_i, X_i)}{V(X_i)}$. Any idea for a good estimator $\hat{\beta}$?

Identification and estimation: Consistency of $\hat{\beta}$
$\hat{\beta} = \frac{\operatorname{cov}_e(Y_i, X_i)}{V_e(X_i)}$.
Law of large numbers: $\operatorname{cov}_e(Y_i, X_i) \to \operatorname{cov}(Y_i, X_i)$ and $V_e(X_i) \to V(X_i)$.
Therefore: $\hat{\beta} \to \beta = \frac{\operatorname{cov}(Y_i, X_i)}{V(X_i)}$ when the number of observations in the sample goes to infinity.

Identification and estimation: Asymptotic normality of $\hat{\beta}$
The OLS estimators are asymptotically normal, in the sense that $\sqrt{N}(\hat{\beta} - \beta) \to N\left(0, \frac{\sigma^2}{V(X)}\right)$ in distribution (central limit theorem).
This means that when the sample is large, $\sqrt{N}(\hat{\beta} - \beta)$ is approximately normally distributed. The proof is on page 177 of your textbook.
This result is important for building confidence intervals for $\beta$.

Identification and estimation: Variance of $\hat{\beta}$
Let us denote $\sigma^2 = V(\varepsilon_i)$. The variance of $\hat{\beta}$ is equal to $\frac{\sigma^2}{\sum_i (X_i - \bar{X})^2}$ (you can find a proof on page 55 of the textbook).
It is increasing in $\sigma^2$. The more spread out the error term is, the harder it is to estimate $\beta$ precisely. For instance, assume that unobserved determinants of wage (ambition, ability, age...) play an important role in wage setting. For some individuals, $\varepsilon_i$ will take very high positive values, and for others it will take very low negative values. We are therefore likely to face individuals with low levels of education and high wages, and conversely, which makes the estimation of $\beta$ difficult.
The more volatile $X_i$ is in our sample, the more precisely we estimate $\beta$.
Finally, $\sum_i (X_i - \bar{X})^2$ is increasing in N, the number of people in our sample, so the variance of $\hat{\beta}$ shrinks as the sample grows.
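The variance formula can be illustrated by simulation. The sketch below holds the $X_i$ fixed, redraws the error terms many times, and compares the spread of the resulting estimates with $\sigma^2 / \sum_i (X_i - \bar{X})^2$. The data-generating process (normal errors, $\alpha = 1500$, $\beta = 100$, $\sigma = 500$) and the function names are assumptions made for the illustration only.

```python
import random


def beta_hat(x, y):
    """OLS slope, written in centered form."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
            / sum((xi - x_bar) ** 2 for xi in x))


def simulated_variance(x, sigma, n_rep=20000, seed=0):
    """Empirical variance of beta_hat over n_rep simulated samples, X held fixed."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_rep):
        # Assumed model: Y = 1500 + 100 X + eps, with eps ~ N(0, sigma^2)
        y = [1500 + 100 * xi + rng.gauss(0, sigma) for xi in x]
        draws.append(beta_hat(x, y))
    mean = sum(draws) / n_rep
    return sum((b - mean) ** 2 for b in draws) / n_rep


x = [5, 5, 10, 15, 15]
sigma = 500.0
x_bar = sum(x) / len(x)

theoretical = sigma ** 2 / sum((xi - x_bar) ** 2 for xi in x)
simulated = simulated_variance(x, sigma)
print("theoretical variance:", theoretical)
print("simulated variance:  ", simulated)
```

Changing `sigma` or spreading out the values in `x` shows the two comparative statics described above.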

Identification and estimation: Estimating $\sigma^2$
In the next session we will need an estimator of the variance of the error term. Usually, to estimate a theoretical mean for instance, we use the empirical one. Here, we use the same idea: to estimate the variance of the error term, a natural candidate is the empirical variance of the estimated residuals, $\frac{1}{N}\sum_i \hat{\varepsilon}_i^2$. This estimator indeed converges to $\sigma^2$ (LLN). However, it is biased: one can show that $E\left(\frac{1}{N}\sum_i \hat{\varepsilon}_i^2\right) = \frac{N-2}{N}\sigma^2$. Thus, we prefer to use the following unbiased estimator: $\hat{\sigma}^2 = \frac{1}{N-2}\sum_i \hat{\varepsilon}_i^2$. It is easy to show that this estimator also converges to $\sigma^2$.
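The bias of the naive estimator is easiest to see in a very small sample. The sketch below averages both estimators over many simulated samples of size N = 5, under an assumed model with normal errors and $\sigma^2 = 1$. The function names and the parameter values are illustrative.

```python
import random


def residual_sum_of_squares(x, y):
    """Sum of squared OLS residuals for the regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
    a = y_bar - b * x_bar
    return sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))


def average_estimates(x, sigma, n_rep=20000, seed=0):
    """Average of the naive (1/N) and corrected (1/(N-2)) variance estimators."""
    rng = random.Random(seed)
    n = len(x)
    naive_total = 0.0
    unbiased_total = 0.0
    for _ in range(n_rep):
        # Assumed model: Y = 1500 + 100 X + eps, with eps ~ N(0, sigma^2)
        y = [1500 + 100 * xi + rng.gauss(0, sigma) for xi in x]
        rss = residual_sum_of_squares(x, y)
        naive_total += rss / n
        unbiased_total += rss / (n - 2)
    return naive_total / n_rep, unbiased_total / n_rep


x = [5, 5, 10, 15, 15]
naive_mean, unbiased_mean = average_estimates(x, sigma=1.0)
print("average of (1/N) sum of squared residuals:    ", naive_mean)
print("average of (1/(N-2)) sum of squared residuals:", unbiased_mean)
```

With N = 5 the factor $(N-2)/N$ equals 0.6, so the naive average should land well below the true $\sigma^2 = 1$ while the corrected one should sit close to it.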

Limits: Link with OLS
In the linear model, $\beta$ represents the causal impact of X on Y. Under various (very strong) assumptions, one can show that $\beta = \frac{\operatorname{cov}(Y_i, X_i)}{V(X_i)}$, which can be estimated from the sample by the quantity $\hat{\beta} = \frac{\operatorname{cov}_e(X, Y)}{V_e(X)}$.
As you may have noticed, this estimator $\hat{\beta}$ is the same as the quantity we derived in section 2 with the OLS method.
=> if the linear model assumptions are verified, then predictions based on OLS are not only the best predictions for Y based on X, but $\hat{\beta}$ also describes the causal impact of X on Y. But are the linear model assumptions credible?

Limits: Review of the assumptions of the linear model
Assumption 1: fairly credible, up to the linear approximation (the impact of education on wage might not be linear) and to the constant effect assumption.
Assumptions 2 and 3: credible.
Assumption 4: extremely strong assumption. It amounts to stating that X is uncorrelated with all the other determinants of Y. Credible in the wage / education example?

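Limits: What happens if assumption 4 is not verified?
Theorem: if assumption 4 is not verified, then the OLS estimator $\hat{\beta}$ is not a consistent estimator of $\beta$, the causal impact of X on Y.
Proof: $\operatorname{cov}(Y_i, X_i) = \operatorname{cov}(\alpha + \beta X_i + \varepsilon_i, X_i) = \beta \operatorname{cov}(X_i, X_i) + \operatorname{cov}(\varepsilon_i, X_i)$. Therefore, $\beta = \frac{\operatorname{cov}(Y_i, X_i)}{V(X_i)} - \frac{\operatorname{cov}(\varepsilon_i, X_i)}{V(X_i)}$. Since $\hat{\beta} \to \frac{\operatorname{cov}(Y_i, X_i)}{V(X_i)}$, which differs from $\beta$ whenever $\operatorname{cov}(\varepsilon_i, X_i) \neq 0$, $\hat{\beta}$ is not consistent.
The asymptotic bias, that is to say the difference between the limit of $\hat{\beta}$ and $\beta$, is equal to $\frac{\operatorname{cov}(\varepsilon_i, X_i)}{V(X_i)}$: the stronger the correlation between $\varepsilon$ and X, the larger the bias. If X and $\varepsilon$ are positively (resp. negatively) related, $\hat{\beta}$ overestimates (resp. underestimates) $\beta$.
In the wage / education example, do you think $\hat{\beta}$ over- or underestimates $\beta$?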
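A simulation makes the asymptotic bias concrete. In the sketch below, a hypothetical unobserved variable called `ability` raises both schooling and the error term, so assumption 4 fails by construction. Every number in the design (the coefficients 8, 2, 600 and 1000, and the true $\beta = 100$) is an assumption chosen for the illustration, not something taken from the course data.

```python
import random


def beta_hat(x, y):
    """OLS slope, written in centered form."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
            / sum((xi - x_bar) ** 2 for xi in x))


def simulate_omitted_variable(n=50000, seed=0):
    """Estimate beta when an unobserved 'ability' drives both X and eps."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        ability = rng.random()                      # unobserved, uniform on [0, 1)
        x = 10 + 8 * ability + 2 * rng.random()     # schooling rises with ability
        eps = 600 * (ability - 0.5) + 1000 * (rng.random() - 0.5)  # so does the error
        xs.append(x)
        ys.append(1500 + 100 * x + eps)             # true causal effect: beta = 100
    return beta_hat(xs, ys)


# Asymptotic bias cov(eps, X) / V(X) implied by the assumed design.
# A uniform variable on [0, 1) has variance 1/12, hence:
# cov(eps, X) = 600 * 8 / 12 = 400 and V(X) = (8**2 + 2**2) / 12.
asymptotic_bias = 400 / ((8 ** 2 + 2 ** 2) / 12)

estimate = simulate_omitted_variable()
print("OLS estimate:          ", estimate)
print("beta + asymptotic bias:", 100 + asymptotic_bias)
```

Because `ability` and schooling move together here, the covariance term is positive and the estimate overshoots the true effect, even with a very large sample.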

OLS do not always yield good estimates...: Generating 18 random pairs for wage and education (1/2)
Open an Excel file and write in cells A1 to A18: =2000*(ALEA()-0,5) if you have the French version of Excel (ALEA() is called RAND() in the English version). The 18 random numbers you have generated stand for the $\varepsilon_i$ in our model. They are supposed to be independent. Do they verify the other assumptions we made on the $\varepsilon_i$? What kind of distribution do they follow? What are their expectation and their variance?
Then, write in cells B1 to B18: =ENT(10+ALEA()*10) (ENT() is called INT() in the English version). These 18 random numbers stand for the number of schooling years. Do they verify the assumptions we imposed on the $X_i$?
Finally, write in cell C1: =1500+100*B1+A1, and extend this formula down to C18. What do these 18 numbers stand for? Do the $X_i$ truly have a causal impact on the $Y_i$ here? In this experiment, what are the true values of $\alpha$ and $\beta$?

OLS do not always yield good estimates...: Generating 18 random pairs for wage and education (2/2)
Select cells B1 to C18, go to the "assistant graphique" (chart wizard) and make a graph, choosing the option "nuage de points" (scatter plot). Once this is done, select your graph, go to the graph menu and select the option "Ajouter une courbe de tendance" (add a trendline). Choose the linear type of curve and go to options. Select "Afficher l'équation sur le graphique" (display the equation on the graph) and "Afficher le coefficient de détermination sur le graphique" (display the $R^2$ on the graph).
Once this is done, write down on a sheet of paper the value of $\hat{\beta}$ that appears on the graph. Is it close to the true $\beta$? Any idea why this is the case?
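For readers without Excel, the same experiment can be sketched in plain Python. `rng.random()` plays the role of ALEA() and `int()` the role of ENT(). The function names, the default arguments and the seed are illustrative choices. The code is meant as a rough equivalent of the spreadsheet, not a reproduction of the numbers shown on the slides.

```python
import random


def ols_fit(x, y):
    """Return (alpha_hat, beta_hat) from the empirical covariance and variance."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    var_x = sum(xi * xi for xi in x) / n - x_bar ** 2
    cov_xy = sum(xi * yi for xi, yi in zip(x, y)) / n - x_bar * y_bar
    beta_hat = cov_xy / var_x
    return y_bar - beta_hat * x_bar, beta_hat


def simulate_sample(n=18, noise_scale=2000, seed=None):
    """Generate n (schooling, wage) pairs as in the spreadsheet experiment."""
    rng = random.Random(seed)
    eps = [noise_scale * (rng.random() - 0.5) for _ in range(n)]    # column A
    schooling = [int(10 + rng.random() * 10) for _ in range(n)]     # column B
    wage = [1500 + 100 * s + e for s, e in zip(schooling, eps)]     # column C
    return schooling, wage


schooling, wage = simulate_sample(n=18, seed=2011)
alpha_hat, beta_hat = ols_fit(schooling, wage)
print("alpha_hat =", alpha_hat, "beta_hat =", beta_hat)
```

Rerunning with different values of `seed` gives a feel for how much $\hat{\beta}$ moves from one sample of 18 individuals to the next.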

OLS do not always yield good estimates...: What I get...
[Figure: scatter plot of Wage (0 to 4000) against Years of Schooling (8 to 20), with fitted line y = 32.612x + 2636.6 and $R^2$ = 0.0369]

But things can be improved...: Illustrating some points of the course
In the first column, write =200*(ALEA()-0,5) instead of =2000*(ALEA()-0,5). Is your new estimate $\hat{\beta}$ closer to the true $\beta$? What is your intuition to explain this result?
Now write =4000*(ALEA()-0,5) in cell A1 and extend the formulas in cells A1, B1 and C1 down to A200, B200 and C200. Draw a new graph similar to the previous one but selecting cells B1 to C200. Is your new estimate $\hat{\beta}$ closer to the true $\beta$? What is your intuition to explain this result?
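A single draw can be lucky or unlucky, so one way to build intuition is to repeat each experiment many times and look at the spread of $\hat{\beta}$ around the true value. The sketch below does this for the three designs discussed in the simulation. The function names, the number of repetitions and the seed are illustrative assumptions.

```python
import random


def beta_hat(x, y):
    """OLS slope, written in centered form."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
            / sum((xi - x_bar) ** 2 for xi in x))


def spread_of_beta_hat(n, noise_scale, n_rep=500, seed=0):
    """Standard deviation of beta_hat across n_rep simulated samples of size n."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_rep):
        schooling = [int(10 + rng.random() * 10) for _ in range(n)]
        wage = [1500 + 100 * s + noise_scale * (rng.random() - 0.5)
                for s in schooling]
        estimates.append(beta_hat(schooling, wage))
    mean = sum(estimates) / n_rep
    return (sum((b - mean) ** 2 for b in estimates) / n_rep) ** 0.5


# (sample size, scale of the error term) for the three experiments
for n, scale in [(18, 2000), (18, 200), (200, 4000)]:
    print("N =", n, "noise scale =", scale,
          "spread of beta_hat =", spread_of_beta_hat(n, scale))
```

The comparison lines up with the variance formula $\sigma^2 / \sum_i (X_i - \bar{X})^2$: a smaller error term or a larger sample both shrink the spread.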

But things can be improved...: What I get...
[Figure: scatter plot of Wage (0 to 4000) against Years of Schooling (8 to 20), with fitted line y = 101.08x + 1490.8 and $R^2$ = 0.9661]

But things can be improved...: What I get...
[Figure: scatter plot of Wage (0 to 6000) against Years of Schooling (8 to 20), with fitted line y = 97.233x + 1560.2 and $R^2$ = 0.0507]

Empirical applications: Consequences of smoking when pregnant
In a sample of 1,388 American mothers who gave birth to a child in 1988, we estimate the following relationship:
weight of the child in grams = $\alpha$ + $\beta$ × daily cigarettes smoked by mother during pregnancy + $\varepsilon$.
Results: $\hat{\alpha} = 3395$, $\hat{\beta} = -14.57$. How should we interpret $\hat{\beta}$? In your view, are the various assumptions needed for OLS to be unbiased etc. verified here?

Empirical applications: Consequences of attending a class on exam grade
Assume we want to estimate the following model among students attending an econometrics course:
final grade = $\alpha$ + $\beta$ × number of classes attended + $\varepsilon$.
Do you think that the estimated value $\hat{\beta}$ would properly estimate the true causal impact of attendance on final grade?

Conclusion
Today, we have seen the OLS technique to make a prediction for Y based on X. We have seen that up to two small limitations, this prediction is the best we can make => our first goal was reached.
However, we have seen that OLS estimators describe the causal impact of X on Y if and only if a very restrictive assumption is made, which is that X is uncorrelated with all the other determinants of Y. In many situations this is unlikely to hold => in most cases we will not be able to achieve our second goal with OLS.
Finally, we have seen with some simulations that even in situations where all OLS assumptions are verified (which we can be sure of because we used data generated by the computer), OLS estimators can be far from the true values when the sample size is small. => do not do statistics with small samples!
References for this chapter: chapters 2 and 5 of your textbook.
Clément de Chaisemartin, Ordinary Least Squares