Statistics 104: Section 6!

TF: Deirdre (say: Dear-dra) Bloome
Email: dbloome@fas.harvard.edu
Section Times: Thursday 2pm-3pm in SC 109, Thursday 5pm-6pm in SC 705
Office Hours: Thursday 6pm-7pm in SC 705, Friday 12pm-1pm in SC 601

Section Outline
1. Reminders
2. Comments on HW #5
3. Week in review
   a. Sampling distributions of x̄ and p̂
   b. Confidence intervals for x̄ and p̂
4. Practice problems and Midterm review!

Reminders

Midterm exam: Monday, October 19th, 8-9pm, Science Center C and D. The exam is a full hour exam, so please arrive slightly before 8pm. You need a calculator. You may bring 1 double-sided page of notes. The exam covers lectures 1-17.

Midterm review: Head TF Kevin Rader will hold an optional midterm review session this Saturday, October 17th, 1-2:30pm in Science Center Lecture Hall D. He will go through about an hour of general review and 30 minutes of exam-like problems. The PowerPoint slides and the problems discussed (with solutions) will be posted online shortly after the review. This review session will also be videotaped.

Midterm practice: Try to complete the Fall 2008 Statistics 104 midterm (posted on our class website) as a practice exam. Write up your cheat sheet, and sit with a calculator for 1 hour completing that exam.

Comments on HW #5

Good work. A few notes:

When looking for the variance of a sum of random variables, remember: adding n random variables is not the same as multiplying a single random variable by n! If X₁, X₂, …, Xₙ are iid (independent and identically distributed) and Var(Xᵢ) = σ², then Var(nX₁) = n²Var(X₁) = n²σ², but Var(ΣXᵢ) = ΣVar(Xᵢ) = nσ². The variance of the sum is lower than the variance of a single random variable multiplied by n, because it is far less likely for all of the Xᵢ's to be very high or very low at once than it is for a single Xᵢ to be very high or very low.

Two random variables X and Y are not necessarily independent if their correlation (or covariance) equals zero! Why not? Correlation measures only linear relationships. To establish independence, check a formulation like P(X = x and Y = y) = P(X = x)P(Y = y), which must hold for every pair of values x and y.

When dealing with normal approximations to binomial probabilities, it is always a good idea to use the continuity correction. One way of remembering it is to write the probability you're trying to calculate in two ways. For example, if you want P(X < 100), where X is binomially distributed, you rewrite this as P(X < 100) = P(X ≤ 99), so the number you use for the normal approximation is ½(100 + 99) = 99.5.
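The three HW #5 notes above can be checked numerically. The following is a minimal Python sketch, not part of the homework solutions. The specific distributions are illustrative assumptions chosen so that everything can be computed exactly: a fair die with n = 4 for the variance comparison, X uniform on {-1, 0, 1} with Y = X² for the correlation example, and a Binomial(200, 0.5) variable for the continuity correction.

```python
from fractions import Fraction
from math import comb, sqrt
from statistics import NormalDist


def variance(pmf):
    """Variance of a discrete distribution given as {value: probability}."""
    mean = sum(v * pr for v, pr in pmf.items())
    return sum((v - mean) ** 2 * pr for v, pr in pmf.items())


def convolve(pmf_a, pmf_b):
    """Distribution of A + B when A and B are independent."""
    out = {}
    for a, pa in pmf_a.items():
        for b, pb in pmf_b.items():
            out[a + b] = out.get(a + b, 0) + pa * pb
    return out


# Note 1: Var(n*X1) versus Var(X1 + ... + Xn) for iid rolls of a fair die.
die = {face: Fraction(1, 6) for face in range(1, 7)}
n = 4                                            # illustrative choice
scaled = {n * v: pr for v, pr in die.items()}    # distribution of n * X1
total = die
for _ in range(n - 1):                           # distribution of X1 + ... + Xn
    total = convolve(total, die)

var_single = variance(die)       # sigma^2
var_scaled = variance(scaled)    # equals n^2 * sigma^2
var_sum = variance(total)        # equals n * sigma^2

# Note 2: zero covariance without independence (X uniform on {-1, 0, 1}, Y = X^2).
x_pmf = {-1: Fraction(1, 3), 0: Fraction(1, 3), 1: Fraction(1, 3)}
e_x = sum(x * pr for x, pr in x_pmf.items())
e_y = sum(x ** 2 * pr for x, pr in x_pmf.items())
e_xy = sum(x * x ** 2 * pr for x, pr in x_pmf.items())
covariance = e_xy - e_x * e_y
p_joint = x_pmf[0]                  # P(X = 0 and Y = 0): Y = 0 exactly when X = 0
p_product = x_pmf[0] * x_pmf[0]     # P(X = 0) * P(Y = 0)

# Note 3: continuity correction for P(X < 100) = P(X <= 99), X ~ Binomial(200, 0.5).
n_trials, p_success = 200, 0.5
mu = n_trials * p_success
sd = sqrt(n_trials * p_success * (1 - p_success))
exact = sum(
    comb(n_trials, k) * p_success ** k * (1 - p_success) ** (n_trials - k)
    for k in range(100)             # k = 0, 1, ..., 99
)
normal = NormalDist(mu, sd)
corrected = normal.cdf(99.5)        # with the continuity correction
uncorrected = normal.cdf(100)       # without it

print(f"Var(X1) = {var_single}, Var(nX1) = {var_scaled}, Var(sum) = {var_sum}")
print(f"Cov(X, Y) = {covariance}, joint = {p_joint}, product = {p_product}")
print(f"exact = {exact:.4f}, corrected = {corrected:.4f}, uncorrected = {uncorrected:.4f}")
```

In this setup the covariance is zero even though Y is completely determined by X, and the normal approximation evaluated at 99.5 lands much closer to the exact binomial probability than the approximation evaluated at 100.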

Week in Review

1. Sampling Distributions

You all already know a lot about sampling distributions! Let's review them one more time:

We have collected information on characteristic X from each unit in a sample of size n. We would like to know about the mean value of X in the population (if X is continuous, meaning that it can take on an infinite number of values, we denote this population mean µ; if X is binomial, meaning that it is the sum of iid¹ binary random variables, we denote this population mean p, and we interpret it as a proportion). What can we do?

We could just sample our entire population and then calculate the mean based on the data from every unit in the population. But this is too expensive!

Instead, we could draw multiple samples of size n, calculate the sample mean (denoted x̄ for continuous variables and p̂ for binomial variables) in each of these samples, and create a sampling distribution for the sample mean. We could then use the mean and standard deviation of this sampling distribution to draw inferences about the population mean. But this is also too expensive!

Instead, we must rely on theoretical results regarding the sampling distribution of the sample mean. There are two important theoretical underpinnings to our inferential procedures:

1. Law of Large Numbers: If an independent sample is drawn from a population with mean µ, then as the number of observations in the sample increases (i.e., as n increases) the sample mean eventually becomes very close to (and stays close to) the population mean µ. This suggests that for large samples, the center of our sampling distribution will be the true population mean.

2. Central Limit Theorem: Loosely, for a sequence of n iid¹ random variables X₁, X₂, …, Xₙ with finite mean µ and variance σ² > 0, as n increases, the distribution of the sample average of these random variables approaches the normal distribution with mean µ and variance σ²/n. This suggests that the sampling distribution of the sample mean is approximately normal for large n, and that as our sample size increases the variability of our sampling distribution decreases. Note that this asymptotic (n → ∞) normal distribution holds for the mean even if the underlying variable X is far from normally distributed!

How large is large enough? For x̄ the rule of thumb is n > 30; for p̂ the rule of thumb is np > 10 and n(1 − p) > 10.

So, that's the theory. Here are the formulations we will use. For large enough n (check for yourself: what are the rules of thumb on sample size again?), with the second argument of N( , ) giving the standard deviation:

For X with population mean µ and standard deviation σ, the mean x̄ is distributed
   x̄ ~ N(µ, σ/√n)

For binomial X with population probability of success p in n trials, the proportion of successes p̂ is distributed
   p̂ ~ N(p, √(p(1 − p)/n))
[note that this is just a special case of the rule stated above]

¹ Recall: iid stands for independent and identically distributed. In the binomial case, we mean that each of the n trials is independent of all of the others and each trial has probability p of success; in general, we mean that each Xᵢ is independent of all the others and they all have the same mean and variance and family of distribution (e.g., Normal).
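The "too expensive" repeated-sampling experiment described above is cheap to run on a computer, which makes it a handy way to see the two theoretical results at work. The following is a rough Python sketch, not course material: the population (an Exponential distribution with mean 1 and standard deviation 1, which is strongly right-skewed), the sample size n = 50, the 2000 repetitions, and the seed are all assumptions chosen for illustration.

```python
import random
from math import sqrt
from statistics import mean, stdev


def sample_means(draw, n, reps, rng):
    """Return `reps` sample means, each computed from `n` independent draws."""
    return [mean(draw(rng) for _ in range(n)) for _ in range(reps)]


rng = random.Random(104)      # fixed seed so the run is repeatable
mu, sigma = 1.0, 1.0          # Exponential(rate=1) has mean 1 and sd 1
n, reps = 50, 2000            # illustrative choices; n > 30 per the rule of thumb

means = sample_means(lambda r: r.expovariate(1.0), n, reps, rng)

center = mean(means)               # Law of Large Numbers: should be near mu
spread = stdev(means)              # CLT: should be near sigma / sqrt(n)
theory_spread = sigma / sqrt(n)

# If the sampling distribution is roughly Normal, about 95% of the sample
# means should fall within 1.96 standard deviations of mu.
within = sum(abs(m - mu) <= 1.96 * theory_spread for m in means) / reps

print(f"mean of sample means: {center:.3f} (theory: {mu})")
print(f"sd of sample means:   {spread:.3f} (theory: {theory_spread:.3f})")
print(f"share within 1.96 sd: {within:.3f} (theory: about 0.95)")
```

Even though individual draws from this population are far from Normal, the simulated sample means should center near µ with spread near σ/√n.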

Week in Review continued

2. Confidence Intervals

The Law of Large Numbers (introduced in the previous section) tells us that our sample mean and sample proportion are good estimates of our population mean and population proportion (at least when our sample size is large). However, when we want to make inferences from our sample data to our population parameter of interest, we often prefer to report a reasonable range of values that we think may contain the true population value, rather than reporting a single sample value that almost certainly does not exactly equal the true population value.

How do we determine what this range of values should be? We use the sampling distribution of our statistic, of course! Because the sampling distribution of the sample mean is approximately Normal, we know that 95% of sample means will lie within 1.96 standard deviations of the population mean. Equivalently, in 95% of samples the population mean lies within 1.96 standard deviations of the sample mean. We therefore center our interval at our point estimate (this is x̄ for the sample mean and p̂ for the sample proportion). The values within 1.96 standard deviations of it represent other values that we think would be reasonable estimates of our population parameter.

Note: 1.96 is the value we use for a 95% confidence interval. For other levels of confidence, we would use a different critical value. For example, for a 99% confidence interval, our critical value would be 2.575 instead of 1.96, because for the Normal distribution 99% of the area is contained within 2.575 standard deviations of the mean.

We calculate a 95% confidence interval (for a sample with large enough n) as:

   (x̄ − 1.96·σ/√n, x̄ + 1.96·σ/√n)
or
   (p̂ − 1.96·√(p̂(1 − p̂)/n), p̂ + 1.96·√(p̂(1 − p̂)/n))

Here x̄ (or p̂) is the point estimate, and 1.96 is the critical value from the Normal distribution: here, the 97.5th percentile. We allow an α = 5% error rate in a 95% interval, which translates to a 100(1 − α/2)th percentile cut-off, denoted Z_{α/2}.

Question: what is the error rate for a 99% interval and what would be the critical value?
The quantity σ/√n (or √(p̂(1 − p̂)/n) for a proportion) is the standard deviation of the point estimate, and

   Margin of error = Critical value × Standard deviation of the point estimate

Note that we have plugged the sample estimate p̂ into the standard error formula for the proportion, but we have kept the population parameter σ in the formula for the sample mean x̄. Why? We introduce extra uncertainty when we estimate σ (note that there are two parameters to be estimated in the continuous variable case, µ and σ, but only one for the proportion, p). We will learn how to take this extra uncertainty into account soon!

Once we calculate our confidence interval, how do we interpret it? Say we find that our 95% confidence interval for p from some sample is (10%, 16%). Can we say that there is a 95% probability that our true population parameter p is between 10% and 16%? No!! Since p is a fixed (although unknown) value, it either is between 10% and 16% or it is not. There is no probability associated with a constant (other than 0 or 1); it either is within a certain range or it is not.

So, how can we interpret our interval? Note that although p is fixed, our interval is random because it changes based on the random sample we draw! Thus, we can associate probabilities with the interval. If we find that our 95% confidence interval for p from some sample is (10%, 16%), we can say that it was produced by a procedure that has a 95% probability of covering the true population parameter p. In other words, if we repeatedly draw samples of size n and calculate a 95% CI from each sample, then about 95% of those CIs will cover the unknown population parameter (here, p).
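The repeated-sampling interpretation above can be tried out directly. Below is a minimal Python sketch under stated assumptions: the function names `proportion_ci` and `coverage` are hypothetical helpers written for this illustration, and the true proportion (0.3), sample size (200), number of repetitions (1000), and seed are arbitrary choices that satisfy the np > 10 and n(1 − p) > 10 rule of thumb.

```python
import random
from math import sqrt
from statistics import NormalDist


def proportion_ci(successes, n, confidence=0.95):
    """Large-sample interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)    # critical value, about 1.96 for 95%
    p_hat = successes / n                      # point estimate
    margin = z * sqrt(p_hat * (1 - p_hat) / n) # margin of error
    return p_hat - margin, p_hat + margin


def coverage(true_p, n, reps, rng, confidence=0.95):
    """Share of `reps` simulated intervals that contain the fixed true_p."""
    covered = 0
    for _ in range(reps):
        successes = sum(rng.random() < true_p for _ in range(n))
        low, high = proportion_ci(successes, n, confidence)
        covered += low <= true_p <= high       # p is fixed; the interval is random
    return covered / reps


rng = random.Random(2009)                      # fixed seed so the run is repeatable
rate = coverage(true_p=0.3, n=200, reps=1000, rng=rng)
print(f"share of 95% intervals covering the true p: {rate:.3f}")
```

Any single interval either contains the true p or it does not; the 95% describes how often the procedure succeeds across many samples, which is what `rate` estimates.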

Week in Review continued

3. Midterm Review

Today we will complete several practice problems based on previous midterms. Due to time limitations, we will not complete a general review of topics you may encounter on the exam. However, I want to provide a few notes for you:

See Lecture 17, Slides 13-23, for a review of the general topics that may be covered on the midterm.

There are a number of previous midterms posted on the class website. These can be useful for practice. I particularly recommend that you take the Fall 2008 midterm as a practice exam. Write your cheat sheet (1 double-sided page of notes), then sit down with it and a calculator for 1 hour and take the exam.

When looking at midterms prior to Fall 2008, be aware:
 o More attention has been paid recently to conditional probabilities, expectations, variances, and Bayes' Rule.
 o Don't forget about the material we saw at the beginning of the term: skewness and relationships between the mean and median; transformations; normal quantile plots; using the Normal table; regression (including the line, predictions, residuals, and R²); correlation (and covariance); study design.

Areas of particular student difficulty on past exams include:
 o Probability and conditional probability (independence of events and of random variables).
 o Sums and differences of random variables (use the rules we discussed last week). Remember that a linear combination of independent normal random variables is itself normally distributed.
 o Binomial random variables. Don't forget to check the sample size conditions before using the normal approximation, and don't forget the continuity correction. When the sample size conditions are not met, be ready to use the actual binomial probabilities.

Practice Problems!

1. Confidence Intervals: Financial Gains and Losses

Note: We will use the normal distribution for this problem, but there is evidence that financial fluctuations are not normally distributed. Why might that be? Does the fact that these fluctuations are not normally distributed impact our confidence interval calculations?

We are interested in measuring how the S&P 500 stock index fluctuates from day to day. We have the entire population over the course of the last 20+ years in a dataset, and our variable of interest is percent daily change = 100*(today's price − yesterday's price)/(yesterday's price).

A. If we were to sample individual daily stock fluctuations, where would 95% of our observations fall?

B. If we were to sample 50 observations at a time, where would 95% of our sample means (x̄) fall?

C. Create a sample of about n = 50 observations (we will do this in Stata). What is the 95% confidence interval for the population mean, µ, using this sample? Do we cover the true population mean?

D. Now, create 100 samples of size n = 50, and calculate the confidence interval for each of the samples. How many of these confidence intervals do we expect to contain the true mean? How many actually do?

E. What is the interpretation of your confidence interval from part (C) above? (Keep in mind what is truly random, because only random things can have probabilities other than zero or one!)

F. The S&P 500 has increased on 51.6% of the days over the course of the last year (258 days). Assuming this is a good random sample, what is the 95% confidence interval for the overall proportion of days on which the S&P 500 will increase?

Practice Problems continued

2. Past exam question I: LASIK Gone Wrong?

According to a recent study, 1% of all patients who undergo laser eye surgery (i.e., LASIK) to correct their vision have serious post-laser vision problems (All About Vision, 2006).

A. (9 points) A doctor has recently started treating patients with LASIK surgery. After treating three patients, it was observed that two had serious post-laser vision problems. If we assume that the true rate of these problems is 1% and that these three patients can be treated as a random sample, what is the probability that two or three (i.e., ≥ 2) of these patients should have been observed to have serious post-laser problems?

B. (1 point) Considering your calculation in part (A), what conclusion would you make regarding this doctor?

C. (10 points) In a random sample of 1600 LASIK patients, what is the probability that more than 25 experienced serious post-laser vision problems if we assume that the true rate of these problems is 1%?

3. Past exam question II: Stranger at the door

Maria's dog Rio often barks when people are at the front door. If the person at the front door is a stranger, Rio barks 90% of the time. If the person at the front door is Maria's friend, Rio barks only 20% of the time. About 75% of people who come to the front door are Maria's friends. (Note: for this problem, everyone is either Maria's friend or a stranger.)

A. (7 points) What is the probability that Rio barks at the next person at the front door?

B. (7 points) If Rio is barking at someone at the front door, what is the probability that person is Maria's friend?

4. Past exam question III: Hold the door?

There is a sign in the university library elevator indicating a 16-person limit as well as a weight limit of 2500 pounds. Suppose that the weights of students, faculty, and staff are approximately normally distributed with a mean weight of 150 pounds and a standard deviation of 25 pounds. What is the probability that a random sample of 16 people in the elevator will exceed the weight limit?

Practice Problems continued

5. Past exam question IV: Political opinions

The table below shows the political affiliation of American voters and the proportion favoring or opposing the death penalty within the 6 categories defined by three values of party affiliation and 2 opinions.

                 Death Penalty Opinion
   Party         Favor      Oppose
   Republican    0.26       0.04
   Democrat      0.12       0.24
   Other         0.24       0.10

A. (6 points) What is the probability that a randomly chosen voter favors the death penalty? What is the probability that a different randomly chosen voter is a Republican?

B. (6 points) Suppose you know that a randomly chosen voter is a Republican. What is the probability that he or she favors the death penalty?

C. (7 points) Are the events "choosing a Republican" and "choosing someone who favors the death penalty" independent? Justify your answer.

6. Past exam question V: Gas taxes

An investigator is interested in determining the relationship between gasoline prices and gasoline consumption in the US. She collects data on all 50 states and DC (n = 51) and measures the following variables:

   price: price per gallon of gasoline (in US dollars in June, 2006)
   usage: gasoline consumption (in barrels, per capita in 2006)

Below is the relevant output from Stata on the state-by-state data she collected (summary statistics, correlation table, and scatterplot). Use this output to answer the following questions.

   . summarize price usage

       Variable |   Obs       Mean    Std. Dev.       Min       Max
      ----------+--------------------------------------------------
          price |    51    2.91843     .123067       2.72      3.38
          usage |    51    11.0581    1.619653      6.945    15.906

   . corr price usage
   (obs=51)

                |    price    usage
      ----------+------------------
          price |   1.0000
          usage |  -0.6908   1.0000

   [Scatterplot: usage (vertical axis, 6 to 16) against price (horizontal axis, 2.6 to 3.4)]

Practice Problems continued

6. Past exam question V: Gas taxes continued

A. (5 points) Calculate the equation for the least squares regression line for predicting gasoline usage from gasoline price.

B. (5 points) Some politicians are hoping to add a $0.25 tax on every gallon of gasoline (which will in effect raise the average gasoline price exactly $0.25). Based on this regression model, how much would the average gasoline consumption change with this new tax?

C. (6 points) Below is the histogram of the residuals from the regression model you found in part (A). Please give an estimate for the mean and median of this distribution of residuals, and also state (with brief reasoning) whether the distribution of residuals appears to be symmetric, left-skewed, or right-skewed:

   Mean:
   Median:
   Skewness (circle one): a) Left-skewed  b) Symmetric  c) Right-skewed
   Reasoning for skewness answer:

   [Histogram of residuals: Percent (vertical axis, 0 to 30) against Residuals (horizontal axis, -4 to 4)]