2013 MBA Jump Start Program
Module 1: Statistics
Thomas Gilbert
Part 3

Statistics Module Part 3
Hypothesis Testing (Inference)
Regressions
Making an Investment Decision
A researcher in your firm just invented a new flavor of ice cream
Given the short Seattle spring, you only had the opportunity to ask ten people about the taste
Six liked it and four hated it
After a quick meeting with your co-founder, you have decided to abandon the last year of R&D that culminated in this amazingly different ice cream
Is this a reasonable decision?

The Power of Statistics
After sitting down with your consultants, you established that your target market comprises 25 million DINKs
To be profitable, you need to sell your ice cream to 30% of that market over the course of the summer
How many people do you need to sample in order to be 95% confident that at least 30% of that market will like the ice cream?
This is something you will be able to answer by the end of the winter quarter!
Estimating Parameters
Goal
Parameter: a characteristic of the population (e.g. μ)
  Feature of the data generating process
Statistic: an observed characteristic of a sample (e.g. ȳ)
To estimate is to use a statistic to approximate a parameter

Sampling Variation
Sampling variation is the variability in the value of a statistic from sample to sample
It is the price we pay for working with a sample rather than the population
Example: Average exam grade of a class
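Sampling variation can be seen directly by drawing several samples from one population and comparing their means. The sketch below is only an illustration: the population of exam grades is simulated (mean 75, standard deviation 10 are assumptions, not data from the course), and the helper name `sample_means` is made up for this example.

```python
import random
import statistics

def sample_means(population, sample_size, num_samples, seed=0):
    """Draw repeated simple random samples and return the mean of each one."""
    rng = random.Random(seed)
    return [
        statistics.mean(rng.sample(population, sample_size))
        for _ in range(num_samples)
    ]

# Hypothetical population of 1,000 exam grades, clipped to the range 0..100.
grade_rng = random.Random(42)
grades = [min(100.0, max(0.0, grade_rng.gauss(75, 10))) for _ in range(1000)]

# Five samples of 30 students each: every sample gives a different average.
means = sample_means(grades, sample_size=30, num_samples=5)
print("Population mean:", round(statistics.mean(grades), 2))
print("Sample means:   ", [round(m, 2) for m in means])
```

Each sample mean is a statistic; the population mean is the parameter they all try to approximate.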
From Data to Probability
Over the long run (with enough data), the accumulated relative frequency converges to a constant (probability)
The Law of Large Numbers: The relative frequency of an outcome converges to a number, the probability of the outcome, as the number of observed outcomes increases

GDP Growth
What has been the average annual Gross Domestic Product (GDP) growth in the U.S. since 1947?
In Excel, you have annualized quarterly real GDP growth
Is this the true average GDP growth?
Is this next quarter's expected GDP growth?
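The Law of Large Numbers stated on the From Data to Probability slide can be illustrated with a short simulation. This is a sketch under assumptions: the outcome probability of 0.6 is made up for the example, and `running_relative_frequency` is a hypothetical helper name.

```python
import random

def running_relative_frequency(num_trials, probability, seed=0):
    """Simulate yes/no outcomes and track the relative frequency of 'yes' so far."""
    rng = random.Random(seed)
    successes = 0
    frequencies = []
    for trial in range(1, num_trials + 1):
        if rng.random() < probability:
            successes += 1
        frequencies.append(successes / trial)
    return frequencies

# Assumed probability of the outcome: 0.6 (illustrative only).
freqs = running_relative_frequency(num_trials=100_000, probability=0.6)
for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"after {n:>7} trials: relative frequency = {freqs[n - 1]:.4f}")
```

The early frequencies wander, while the later ones settle close to the assumed probability.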
Normal Models
Sample means are normally distributed (bell-shaped curve) if the individual values are normally distributed
We never have exact normal distributions
The Central Limit Theorem shows that the sampling distribution of averages is approximately normal even if the underlying population is not normally distributed
Sample size needs to be large enough for averaging to smooth away deviations from normality

Standard Error of the Mean
The standard error of the mean of a simple random sample of n measurements from a process or population with standard deviation σ is:
$SE(\bar{x}) = \frac{\sigma}{\sqrt{n}}$
The larger the sample size, the smaller the sampling variation from sample to sample
What is the standard error in our average GDP growth estimate?
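The standard error formula can be checked against a simulation. The sketch below assumes a skewed (exponential) population with σ = 1, chosen only to show that the formula still describes the spread of sample means when the population is not normal; the sample size of 50 and the 5,000 repetitions are arbitrary choices.

```python
import math
import random
import statistics

def standard_error(sigma, n):
    """Standard error of the mean: sigma divided by the square root of n."""
    return sigma / math.sqrt(n)

# Assumed population: exponential with rate 1, so its standard deviation is 1.
rng = random.Random(1)
n = 50
means = [
    statistics.mean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(5000)
]

print("Formula:    ", round(standard_error(1.0, n), 4))
print("Simulation: ", round(statistics.stdev(means), 4))
```

The standard deviation of the simulated sample means should land close to the value given by the formula.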
Concept of Statistical Test
We estimated the average annualized GDP growth at 3.3%. Is it different from 4%?
Use a statistical test to answer this question
Consider the plausibility of a specific claim
Claims are called hypotheses

Concept of Statistical Test
Statistical hypothesis: claim about a parameter of a population
Null hypothesis (H0): specifies a default course of action, preserves the status quo
Alternative hypothesis (Ha): contradicts the assertion of the null hypothesis
Hypotheses
Is average GDP growth (3.3%) different from 4%?
H0: μ = 4%
Ha: μ ≠ 4%

Types of Errors
Type I error: Reject H0 incorrectly
  Believe that GDP growth is not 4% even though it is
  False positive
Type II error: Accept H0 incorrectly
  Believe that GDP growth is 4% even though it is not
  False negative
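The GDP growth question from the Hypotheses slide can be sketched as a two-standard-error check of H0: μ = 4% against Ha: μ ≠ 4%. The inputs are assumptions: 3.3 is the estimate quoted in the slides, and the standard error of 0.2575 is backed out of the 2.75% to 3.78% interval reported later in the deck, as (3.78 - 2.75) / 4. The function name `two_se_test` is hypothetical.

```python
def two_se_test(sample_mean, standard_error, hypothesized_mean):
    """Return the test statistic and whether H0 is rejected at roughly the 5% level.

    H0 is rejected when the sample mean lies more than two standard errors
    away from the hypothesized mean.
    """
    z = (sample_mean - hypothesized_mean) / standard_error
    return z, abs(z) > 2

# Assumed inputs, in percentage points (see the paragraph above).
z, reject = two_se_test(sample_mean=3.3, standard_error=0.2575, hypothesized_mean=4.0)
print(f"z = {z:.2f}, reject H0: {reject}")
```

Rejecting H0 here would be a Type I error if true growth really were 4%; failing to reject would be a Type II error if it were not.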
Confidence Interval
In order to estimate the long-term tax revenue from closing a tax loophole, the White House needs to know what future GDP growth will be
It can use past GDP growth as a basis for long-term planning
Use confidence intervals to answer such questions
Confidence intervals convey information about the precision of the estimates

Ranges for Parameters
A confidence interval is a range of plausible values for a parameter based on a sample
Constructing confidence intervals relies on the sampling distribution of the statistic
We will assume a normal model based on the Central Limit Theorem
Confidence Interval for the Mean
We will use the estimated standard error of the mean, where s is the sample standard deviation:
$SE(\bar{x}) = \frac{s}{\sqrt{n}}$
Based on the normal distribution, random samples have the following property:
The sample statistic in 95% of samples lies within about two standard errors of the population parameter

Confidence Interval for the Mean
As a result, the confidence interval (at 95% confidence) for the mean is
$\bar{x} - 2\,SE(\bar{x})$ to $\bar{x} + 2\,SE(\bar{x})$
What is your 95% confidence interval on annual GDP growth in the U.S. over the last 65 years?
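The interval above can be computed in a few lines. This is a minimal sketch using the two-standard-error rule from the slides; the growth figures are made-up numbers for illustration, not the actual GDP series, and `confidence_interval_95` is a hypothetical helper name.

```python
import math
import statistics

def confidence_interval_95(values):
    """Approximate 95% confidence interval for the mean: mean +/- 2 standard errors."""
    n = len(values)
    mean = statistics.mean(values)
    # Estimated standard error: sample standard deviation over sqrt(n).
    se = statistics.stdev(values) / math.sqrt(n)
    return mean - 2 * se, mean + 2 * se

# Hypothetical annual growth rates in percent (illustrative only).
growth = [2.0, 4.0, 3.0, 5.0, 1.0, 3.0, 4.0, 2.0, 3.0]
lower, upper = confidence_interval_95(growth)
print(f"95% CI for the mean: {lower:.2f}% to {upper:.2f}%")
```

With the real data, the same calculation would be applied to the column of annualized growth rates in the Excel file.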
Interpreting the Confidence Interval
What does this mean?
We are 95% confident that μ (true GDP growth) lies between 2.75% and 3.78%
Might μ be 2%?
It could be, but it is unlikely given the sample results

Common Confusion
Wrong interpretations!
95% of years witness a GDP growth between 2.75% and 3.78%
The average GDP growth is between 2.75% and 3.78%
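One way to read "95% confident" is that the procedure, repeated on many samples, produces intervals that contain μ about 95% of the time. The simulation below sketches that idea under assumptions: the true mean of 3.3, the standard deviation of 4.0, and the sample size of 260 (65 years of quarterly data) are illustrative choices, and the population is taken to be normal.

```python
import math
import random
import statistics

def coverage_rate(true_mean, true_sd, n, num_samples, seed=0):
    """Fraction of simulated samples whose mean +/- 2 SE interval contains true_mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_samples):
        sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
        mean = statistics.mean(sample)
        se = statistics.stdev(sample) / math.sqrt(n)
        if mean - 2 * se <= true_mean <= mean + 2 * se:
            hits += 1
    return hits / num_samples

# Assumed population parameters (illustrative only).
rate = coverage_rate(true_mean=3.3, true_sd=4.0, n=260, num_samples=2000)
print(f"Share of intervals containing the true mean: {rate:.3f}")
```

The statement is about the interval-building procedure, not about individual years or about the sample average itself.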
Practice Quiz #5 and Break
Please take a few minutes and complete the practice quiz on the next page
Hypothesis testing for the mean return of PCAR
Then take a 10 min break to stretch your legs!

Statistics Module Part 3
Hypothesis Testing (Inference)
Regressions
Regression
We are interested in understanding how changes in one variable can be explained by movements in one or more other variables
A response variable in a dataset measures the outcome of a study
An explanatory variable explains or influences changes in a response variable
A regression line is a straight line that describes how a response variable y changes as an explanatory variable x changes: y = f(x)

Regression
We use regression lines to predict y as a function of x: y = a + b*x
How do we estimate a and b?
How do we find the best line to fit between y and x?
An ordinary least squares (OLS) regression line of y on x is the line that minimizes the sum of the squares of the vertical distances between the data points and the line
Graphical Explanation

OLS Regression
The slope coefficient is given by:
$\hat{b} = \frac{Covar(y, x)}{Var(x)}$
This is actually an estimated slope (b hat) and we also have an estimated intercept (a hat):
$\hat{a} = \bar{y} - \hat{b}\,\bar{x}$
Using these estimates, we can calculate some predicted values of y given the values of x:
$\hat{y} = \hat{a} + \hat{b}\,x$
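The slope and intercept formulas translate directly into code. The sketch below uses a small made-up data set; `ols_fit` is a hypothetical helper name, and the sample (n - 1) versions of covariance and variance are used, although the choice of divisor cancels in the ratio.

```python
import statistics

def ols_fit(x, y):
    """Estimate intercept a_hat and slope b_hat of the OLS line y = a + b*x."""
    n = len(x)
    x_bar = statistics.mean(x)
    y_bar = statistics.mean(y)
    covar = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    var_x = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
    b_hat = covar / var_x            # slope: Covar(y, x) / Var(x)
    a_hat = y_bar - b_hat * x_bar    # intercept: y_bar - b_hat * x_bar
    return a_hat, b_hat

# Made-up data for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.0]
a_hat, b_hat = ols_fit(x, y)
predicted = [a_hat + b_hat * xi for xi in x]  # y_hat for each x
print(f"a_hat = {a_hat:.2f}, b_hat = {b_hat:.2f}")
```

The same estimates are what Excel's regression tool reports as the Intercept and X Variable 1 coefficients.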
Regression in Excel
Let's regress PCAR returns on market returns
Go to Data Analysis
Select Regression
Highlight the y and x variables and press OK
Note the many options:
  Labels
  No intercepts: y = b*x
  Confidence intervals
  Residuals and residual plots

Regression Output

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.664232549
R Square            0.441204879
Adjusted R Square   0.435933227
Standard Error      0.066815657
Observations        108

ANOVA
             df    SS            MS            F             Significance F
Regression   1     0.373637157   0.373637157   83.69385395   4.62782E-15
Residual     106   0.4732192     0.004464332
Total        107   0.846856357

               Coefficients   Standard Error   t Stat        P-value       Lower 95%     Upper 95%     Lower 99.0%   Upper 99.0%
Intercept      0.019298413    0.006429671      3.001461975   0.003350688   0.006550965   0.032045861   0.002433332   0.036163495
X Variable 1   1.274361693    0.139298335      9.148434508   4.62782E-15   0.998189203   1.550534182   0.90898099    1.639742395
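Several entries in the regression output are linked by simple arithmetic, which is a useful way to read the table. The sketch below copies numbers from the output above and recomputes R Square, the F statistic, and the t statistics; the variable names are made up for this example.

```python
# Values copied from the Excel regression output above.
ss_regression = 0.373637157
ss_total = 0.846856357
ms_regression = 0.373637157
ms_residual = 0.004464332
slope, slope_se = 1.274361693, 0.139298335
intercept, intercept_se = 0.019298413, 0.006429671

# R Square is the share of total variation explained by the regression.
r_square = ss_regression / ss_total

# F is the ratio of the regression and residual mean squares.
f_stat = ms_regression / ms_residual

# Each t Stat is the coefficient divided by its standard error.
t_slope = slope / slope_se
t_intercept = intercept / intercept_se

print(f"R Square    = {r_square:.6f}")
print(f"F           = {f_stat:.4f}")
print(f"t slope     = {t_slope:.4f}")
print(f"t intercept = {t_intercept:.4f}")
```

With a single explanatory variable, the squared t statistic of the slope matches the F statistic, which is why both rows show the same p-value.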
Alternative Visualization
Build scatter plot of the returns (need to reverse the columns in order to have PCAR vertically and the market horizontally)
Add linear trendline (highlight data on chart and right-click)
Note: Can also use the intercept() and slope() functions

Interpreting the Fitted Line
Interpreting the slope
The slope estimates the marginal PCAR return per unit of market return
While tempting, it is not correct to describe the slope as the change in y caused by changing x
Question: Is the slope statistically different from 0?
Explaining Variation
R-squared (R²)
It is the squared correlation between x and y
It is the fraction of the variation in y accounted for by the variation in x
In our example, 44% of the variation in PCAR returns can be explained by variation in market returns

Regression Example
Relationship between age and blood pressure
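The two descriptions of R² on the Explaining Variation slide (squared correlation, and fraction of variation explained) give the same number in a simple regression. The sketch below checks this on a small made-up data set and on the Multiple R and R Square values from the Excel output; the helper names are hypothetical.

```python
import statistics

def correlation(x, y):
    """Sample correlation between x and y."""
    x_bar, y_bar = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def r_squared(x, y):
    """Fraction of the variation in y explained by the OLS line of y on x."""
    x_bar, y_bar = statistics.mean(x), statistics.mean(y)
    b_hat = (sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
             / sum((a - x_bar) ** 2 for a in x))
    a_hat = y_bar - b_hat * x_bar
    ss_resid = sum((b - (a_hat + b_hat * a)) ** 2 for a, b in zip(x, y))
    ss_total = sum((b - y_bar) ** 2 for b in y)
    return 1 - ss_resid / ss_total

# Made-up data for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.6, 3.8, 5.1, 5.9]
print("correlation squared:", round(correlation(x, y) ** 2, 6))
print("R squared:          ", round(r_squared(x, y), 6))

# From the Excel output: Multiple R squared should equal R Square.
print("Excel check:", round(0.664232549 ** 2, 6), "vs", 0.441204879)
```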
Regression Example (2)
Explaining home selling prices using multiple explanatory variables

Caution!
Association (or correlation) does not imply causation!
Must use common sense!
It could be collinearity
It could be a missing variable: need to control for it
It could be a variable that is not independent
Example: Someone says, "There is a strong positive correlation between the number of firefighters at a fire and the amount of damage the fire does. So sending lots of firefighters just causes more damage."
Explain why this reasoning is wrong
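The missing-variable problem can be made concrete with a simulation. The model below is entirely made up for illustration: fire size drives both the number of firefighters sent and the damage, and each extra firefighter is assumed to reduce damage. The coefficients and the helper name `correlation` are assumptions.

```python
import random
import statistics

def correlation(x, y):
    """Sample correlation between x and y."""
    x_bar, y_bar = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(7)

# Hypothetical model: fire size is the missing variable.
fire_size = [rng.uniform(1, 10) for _ in range(500)]
# Bigger fires get more firefighters.
firefighters = [3 * s + rng.gauss(0, 1) for s in fire_size]
# Bigger fires do more damage; each firefighter REDUCES damage by 2 units.
damage = [10 * s - 2 * f + rng.gauss(0, 1) for s, f in zip(fire_size, firefighters)]

print("corr(firefighters, damage):", round(correlation(firefighters, damage), 3))
```

In this made-up world firefighters lower damage by construction, yet the raw correlation is strongly positive because fire size moves both variables. Controlling for fire size would reveal the negative effect.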
Summary of Part 3
Hypothesis testing is the cornerstone of inference in statistics
  Attempt to reject (or fail to reject) a null
  Standard errors of parameter estimates are key to the answer
Regressions are powerful tools to understand relations between variables
  Simple regression is very similar to a correlation
  Multiple right-hand-side variables allow decomposition of effects (or controls)

Summary of Statistics Module
Part 1
  Review of basic data analysis, such as means and standard deviation
  Histograms and distributions
Part 2
  Review of covariation analysis (covariance and correlation)
  Working with random variables
Part 3
  Inference and regressions
Additional Problem Set at the end contains more problems on (almost) all topics
Good statistics, and an understanding of it, is too often overlooked, which leads to poor decision making!