1. Multiple Choice: require no justification. Note: these parts are not related.


Statistics E100 Final Exam Extra Practice Problems: Solutions

1. Multiple Choice: require no justification. Note: these parts are not related.

a) A magazine states the following hypotheses about the average age of its subscribers: H0: µ = 8 years vs. Ha: µ > 8 years. Making a Type I error with this test means that:
   a) The sample result gives little evidence to conclude that the average age of the subscribers is greater than 8 years when in fact the average age is 8 years.
   b) The sample result gives little evidence to conclude that the average age of the subscribers is greater than 8 years when in fact the average age is much greater than 8 years.
   c) The sample result gives strong evidence that the average age is greater than 8 years when in fact the average age IS 8 years.
   d) The sample result gives strong evidence that the average age is greater than 8 years when in fact the average age is much greater than 8 years.

b) The business college computing center wants to determine the proportion of business students who have personal computers at home. If the proportion differs from 25%, then the lab will modify a proposed enlargement of its facilities. Suppose data are collected from 100 randomly chosen students of the business college and the sample proportion is found to be 34%. What is the test statistic for testing H0: π = 0.25 versus HA: π > 0.25?
   a) 1.65 b) 2.08 c) 1.90 d) 1.78

c) What is the result of the hypothesis test in Problem (b) above?
   a) Reject the null hypothesis
   b) Fail to reject the null hypothesis

d) As the degrees of freedom for the t distribution increase, the distribution approaches
   1) The value of zero for the mean.
   2) The t distribution.
   3) The normal distribution.
   4) The binomial distribution.

e) Which statement is NOT true about hypothesis tests?
   a) Hypothesis tests are only valid when the sample is representative of the population for the question of interest.
   b) Hypotheses are statements about the population represented by the samples.
   c) Hypotheses are statements about the sample (or samples) from the population.
   d) Conclusions are statements about the population represented by the samples.

f) In regression analysis, if the coefficient of determination (R²) is 1.0, then:
   a. SSE (error sum of squares) must be 1.0
   b. SSR (regression sum of squares) must be 1.0
   c. SSE must be 0.0
   d. SSR must be 0.0

g) A sample of 00 light bulbs was tested, and 11 were found to be defective. What is the 95% confidence interval around this sample proportion?
   a) ± b) ± c) ± d) ±

h) You wish to estimate the proportion of shoppers that use credit cards. Determine the sample size needed if the margin of error should be at most 0.01 (that is, we want the confidence interval to be +/- 0.01) and the confidence level is 95%.
   a) 8,98 b) 3,050 c) 15,914 d) 9,604

i) Suppose individuals with a certain gene have a 0.4 probability of eventually contracting a particular disease. If 15 individuals with the gene participate in a lifetime study, what is the distribution of the random variable X describing the number of these individuals who will contract the disease?
   a) X is a binomial random variable with n=6 and p=1.897
   b) X is a normally distributed random variable with mean 15 and variance 0.4
   c) X is a binomial random variable with n=15 and p=0.4
   d) X is a normally distributed random variable with mean 6 and variance
   e) None of the above

j) After performing a simple linear regression, we calculated the residuals and obtained the residual plot shown below. Does the plot indicate any potential problems with the regression?

   a) The plot indicates the residuals are not normally distributed.
   b) The plot shows curvature. Hence, a linear model is not appropriate.
   c) The plot indicates that the error variance is not constant.
   d) All of the above
   e) There are no apparent problems.

k) What do residuals represent in the simple linear regression model?
   a) The difference between the actual Y values and the mean of Y.
   b) The difference between the actual Y values and the predicted Y values.
   c) The square root of the slope.
   d) The predicted value of Y for the average X value.
   e) None of the above.

l) The probability that a region prone to hurricanes will be hit by a hurricane in any single year is 0.1 and independent of other years. What is the probability of a hurricane hit at least once in the next 5 years?
   a) b) c) 0.5 d) e) None of the above

m) What is the expected number of hurricanes to hit the area described above in the next 90 years?
   a) 9 b) 3 c) 8.1 d) .85 e) None of the above

q) For which of the following hypothesis tests would the p-value be the same whether the sample mean is 44 or 46 (see table to the right)?
   a) I. b) I. and IV. c) II. and III. d) IV. e) Stop bothering me with these silly questions.

r) (3 points) We are told that a 95% prediction interval for a response variable, y, is (3., 35.6) from a simple regression on a sample of n = 100 observations at x* = 10. Which of the following is a reasonable estimate for the confidence interval for µ_y at x* = 10?
   a. (13., 45.6) b. (4., 7.8) c. (8.6, 30.) d. (33., 45.6)
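The arithmetic behind parts (b), (l) and (m) can be sketched in a few lines of Python. This is only an illustrative check, not part of the original key: it assumes part (b) tests a null proportion of 0.25 against a sample proportion of 0.34 with n = 100, and the function names are made up for this sketch.

```python
import math

def one_proportion_z(p_hat, p0, n):
    """z statistic for H0: pi = p0, using the standard error under the null."""
    return (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)

def prob_at_least_one(p, trials):
    """P(at least one success) in independent trials with success probability p."""
    return 1 - (1 - p) ** trials

# Part (b): assumed values p_hat = 0.34, p0 = 0.25, n = 100
z = one_proportion_z(0.34, 0.25, 100)

# Part (l): at least one hurricane hit in 5 independent years, 0.1 chance per year
hit_in_5_years = prob_at_least_one(0.1, 5)

# Part (m): expected number of hits in 90 years is n * p
expected_in_90_years = 90 * 0.1

print(round(z, 2), round(hit_in_5_years, 4), round(expected_in_90_years, 1))
```

The z statistic uses the null value 0.25 in the standard error, which is the usual choice for a one-proportion test.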

2. (5 points total) Kellogg's wants to increase sales of its Fruit Loops cereal and decides to run an experiment at Stop & Shop stores in New England. They randomize which shelf (bottom vs. middle vs. top) Fruit Loops is placed on at a total of 150 stores (50 on each shelf). The variables collected were then:
   sales: number of boxes sold at the store in one day
   middle: a 0/1 binary variable to indicate if Fruit Loops was on the middle shelf
   top: a 0/1 binary variable to indicate if Fruit Loops was on the top shelf
A regression was run in SPSS, and the results are shown below:

a) (7 points) Is there any evidence that sales varied across the 3 shelf locations? Perform a formal hypothesis test to determine this: be sure to include your hypotheses, the test statistic, the degrees of freedom, the p-value, and your conclusion in context of the problem.
H0: β1 = β2 = 0
HA: at least one βj ≠ 0
F = 3.76 (directly from the table)
p-value = (directly from the table)
Since p-value < 0.05, we can reject the null hypothesis. There is evidence that sales truly vary across the 3 shelf locations in the population of all stores.

This coefficient is estimating the mean difference in sales of stores where Fruit Loops was sold on the middle shelf in comparison to the bottom shelf (the reference group). There were on average 3.38 more boxes sold on the middle shelf than the bottom shelf.

c) (6 points) Calculate the 95% confidence interval for the middle variable in the above regression model.
b1 ± t* × SE(b1), which gives (0.93, 5.83)

d) (4 points) Which shelf location is predicted to have the most sales? Which shelf location is predicted to have the fewest sales? Please justify.
Highest: middle shelf
Lowest: bottom shelf
The middle shelf is predicted to have the highest average sales ( = 6.18) since it has the largest difference compared to the reference group (and it is positive), while the bottom shelf is predicted to have the fewest sales, as it is the reference group and the other two groups are estimated to have higher means (since both slope estimates are positive).

e) (4 points) What percent of total variability in sales can be predicted by this model?
R² = SSM/SST. So 4.87% of the variability in sales can be predicted from the model on shelf location.

3. (15 points total) Suppose that past history shows that 35% of college students prefer Pepsi over Coca-Cola.

a. (4 points) A sample of 5 students is selected. What is the probability that at least 1 prefers Pepsi?
Let X = count of students who prefer Pepsi in a random sample of 5.
X ~ Binomial(n = 5, p = 0.35), P(X ≥ 1) = 1 - P(X = 0) = 1 - (0.65)^5 = 0.884

b. (6 points) A sample of size 50 is collected. What are the mean and standard deviation for the number of students who prefer Pepsi in this sample?
Y = number of students who prefer Pepsi, Y ~ Binomial(50, 0.35)
Mean of Y = np = 50*0.35 = 17.5
Standard deviation of Y = sqrt(50*0.35*0.65) = 3.373

c. (5 points) In this sample of 50 students, what is the probability that the majority (strictly more than half) of the students selected prefer Pepsi to Coke?
Standardized: Z = (Y - 17.5)/3.373 is approximately standard normal.
P(Y ≥ 26) = P(Z ≥ (26 - 17.5)/3.373) = P(Z ≥ 2.52) = 0.0059
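The three parts of Problem 3 can be reproduced with the Python standard library. The sketch below is illustrative only: it assumes "majority" means at least 26 of 50, applies the normal approximation without a continuity correction (as the solution appears to), and adds an exact binomial tail for comparison, which is not part of the original key.

```python
import math
from statistics import NormalDist

p = 0.35  # proportion of students who prefer Pepsi

# Part (a): P(at least 1 of 5 prefers Pepsi) = 1 - P(none do)
p_at_least_one = 1 - (1 - p) ** 5

# Part (b): mean and standard deviation of a Binomial(50, 0.35) count
n = 50
mean_y = n * p
sd_y = math.sqrt(n * p * (1 - p))

# Part (c): normal approximation to P(Y >= 26), no continuity correction
z = (26 - mean_y) / sd_y
p_majority_normal = 1 - NormalDist().cdf(z)

# Exact binomial tail, shown only as a sanity check on the approximation
p_majority_exact = sum(
    math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(26, n + 1)
)

print(round(p_at_least_one, 3), mean_y, round(sd_y, 3), round(z, 2))
print(p_majority_normal, p_majority_exact)
```

The exact tail is somewhat larger than the approximation here, which is typical when no continuity correction is used.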

4. (29 points total) An investigator is interested in modeling the progression over time in the Men's 100 meter run in the Olympics. He measures two variables:
   time: the winning time in the men's 100 meter sprint, in seconds
   year: the year of the Olympics (from 1900 to 2008)
Some relevant SPSS output is shown below:

a) (7 points) Are time and year significantly associated at α = 0.05? Be sure to include the hypotheses, the test statistic, the degrees of freedom, the p-value, and your conclusion in the context of the problem.
Solution: Test for the significance of the slope in the regression.
   a) Hypotheses. H0: β1 = 0 vs. Ha: β1 ≠ 0
   b) Test statistic (from SPSS): t = b1 / SE(b1), with n - 2 degrees of freedom
   c) p-value < 0.05
   d) We can reject the null hypothesis, and therefore there is statistical evidence of an association between time and year in the Olympics. In fact, winning times are decreasing.

b) (4 points) What is the estimated correlation between time and year?
r = -sqrt(R²) (must be negative since the slope is negative)

c) (3 points) What is the estimated standard deviation of winning times within Olympic year (aka, standard deviation of the residuals)?
Std. Error of the Estimate =

d) (4 points) Based on this model, in what year will the winning times be forecasted to drop below 9 seconds (please round up to the nearest year)?
8.99 = b0 + b1 * x, so x = (8.99 - b0) / b1. In the year 2070.

e) (4 points) In 1 or 2 sentences, please comment on the validity of your forecast in part (d) above.
Since the values of the x-variable range from 1900 to 2008, using the model to predict the time in 2070 is extrapolation, and who knows if this pattern will continue in such a linear fashion.

f) (7 points) Above are the histogram of residuals along with the scatterplot of the residuals vs. the x-variable, year. Please comment on the validity of the assumptions for this regression model. (I've commented on one; please list it and the others, and comment on the other 3.)
Assumptions:
1. Independence of observations: cannot be checked here.
2. Normally distributed residuals: From the histogram we can conclude that the residuals seem to be fairly normally distributed.
3. Linear relationship between predictors and response: There is no visible pattern (i.e. curvature) in the residual vs. fitted plot, and therefore we can assume that the linear trend was captured by the regression.
4. Residuals have constant variance: From the residuals vs. fitted plot, this assumption seems to hold true since there is no pattern or fanning out.

5. (16 points total) Kevin is flying directly to Philadelphia on Saturday for a friend's wedding next weekend. He has his flight booked through US Airways. US Airways reports that whether his flight is on time or not depends on the weather in Boston. If it is raining in Boston, the flight will be late 50% of the time. If it is not raining in Boston, the flight will be late only 10% of the time. There is a forecasted 25% chance for rain on Saturday (assume that this forecast is correct).

a. (8 points) What is the overall probability that Kevin's flight will be delayed? [If Kevin arrives late, he will miss the beginning of the wedding].
Recall: P(A) = P(A and B) + P(A and B^C)
P(late) = P(late and rain) + P(late and no rain) = P(late | rain)P(rain) + P(late | no rain)P(no rain) = 0.50*0.25 + 0.10*0.75 = 0.20

b. (8 points) Saturday rolls around and Kevin's friend notices Kevin has not arrived at the wedding on time because his flight was delayed. What is the conditional probability that it actually was raining up in Boston given the fact that Kevin's flight was delayed?
P(rain | late) = P(late and rain) / P(late) = P(late | rain)P(rain) / P(late) = 0.50*0.25 / 0.20 = 0.625

6. An elevator serving a hospital is designed to hold up to 15 passengers and has a maximum safe capacity of 2440 pounds. The weight of passengers who use the elevator is normally distributed with an average of 149 pounds and a standard deviation of 20 pounds.

a) What is the probability that a single passenger on the elevator weighs between 140 and 150 pounds?
Z = (X - μ)/σ = (150 - 149)/20 = 0.05
Z = (X - μ)/σ = (140 - 149)/20 = -0.45
P(140 < X < 150) = P(X < 150) - P(X < 140) = P(Z < 0.05) - P(Z < -0.45) = 0.5199 - 0.3264 = 0.1935

b) What is the probability that a single passenger on the elevator weighs more than 200 pounds?
Z = (X - μ)/σ = (200 - 149)/20 = 2.55
P(X > 200) = 1 - P(X < 200) = 1 - P(Z < 2.55) = 1 - 0.9946 = 0.0054

c) If five passengers enter the elevator together, what is the probability that all five of them weigh 200 pounds or less?
P(all 5 less than 200) = (0.9946)^5 = 0.9733

d) What is the probability that the elevator's safe capacity is exceeded by a full load of 15 passengers?
Let T = total weight of 15 passengers. Then X-bar = T/15. We know X-bar ~ N(μ = 149, σ = 20/sqrt(15) = 5.164).
Z = (162.67 - 149) / 5.164 = 2.65
So P(T > 2440) = P(X-bar > 2440/15) = P(X-bar > 162.67) = P(Z > 2.65) = 0.0040
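Problems 5 and 6 are straightforward to verify numerically. The sketch below is an illustrative check rather than the author's own working. It assumes a 25% chance of rain (whose complement 0.75 appears in the solution), a passenger standard deviation of 20 pounds, and a safe capacity of 2440 pounds, since those figures are damaged in the transcription.

```python
from statistics import NormalDist

# Problem 5: law of total probability, then Bayes' rule
p_rain = 0.25                      # assumed forecast probability
p_late_given_rain = 0.50
p_late_given_dry = 0.10
p_late = p_late_given_rain * p_rain + p_late_given_dry * (1 - p_rain)
p_rain_given_late = p_late_given_rain * p_rain / p_late

# Problem 6: single passenger weight ~ Normal(149, 20) (sd is an assumed value)
weight = NormalDist(mu=149, sigma=20)
p_between = weight.cdf(150) - weight.cdf(140)   # part (a)
p_over_200 = 1 - weight.cdf(200)                # part (b)
p_all_five_under = weight.cdf(200) ** 5         # part (c), independent passengers

# Part (d): the mean weight of 15 passengers has sd 20 / sqrt(15)
mean_of_15 = NormalDist(mu=149, sigma=20 / 15 ** 0.5)
p_overload = 1 - mean_of_15.cdf(2440 / 15)      # capacity 2440 is an assumed value

print(p_late, p_rain_given_late)
print(p_between, p_over_200, p_all_five_under, p_overload)
```

Working with `NormalDist` directly avoids rounding the z-scores to two decimals, so the results can differ from table lookups in the fourth decimal place.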

7. (32 points total) GPAs at Harvard are known to be approximately Normally distributed with a mean of µ = 3.25 and a standard deviation σ = 0.30.

a. (6 points) Show that 20.33% of Harvard students have a GPA below 3.00.
z = (x - µ)/σ = (3.00 - 3.25)/0.30 = -0.83
P(X < 3.00) = P(Z < -0.83) = 0.203

b. (6 points) There are 6 students living in a suite in a Harvard house. If we assume their GPAs to be independent, what is the probability that at least one of them has a GPA below 3.00?
Let X = # students below 3.0 out of 6. Then X ~ Bin(n = 6, π = 0.203). We want:
P(X ≥ 1) = 1 - P(X = 0) = 1 - (1 - 0.203)^6 ≈ 0.744

c. (6 points) A random sample of 50 Harvard students was taken. Assuming their GPAs are independent, what is the probability that at least 20 of them have a GPA below 3.00?
Let X = # students below 3.0 out of 50. Then X ~ Bin(n = 50, π = 0.203). We want P(X ≥ 20). We can do this with the Normal approximation to the Binomial. So we know that, approximately, X ~ N(µ = nπ = 50(0.203) ≈ 10.1, σ = sqrt(nπ(1 - π)) = sqrt(50(0.202)(0.798)) = 2.84). Then:
z = (x - µ)/σ = (20 - 10.1)/2.84 = 3.49
P(X ≥ 20) = P(Z ≥ 3.49) = 1 - P(Z < 3.49) ≈ 0.0002

d. (6 points) What is the probability that the average GPA for these 50 randomly sampled students is below 3.00?
z = (x-bar - µ)/(σ/sqrt(n)) = (3.00 - 3.25)/(0.30/sqrt(50)) = -5.89
P(X-bar < 3.00) = P(Z < -5.89) ≈ 0

e. (8 points) A random sample of 50 Harvard athletes had a mean of x-bar = 3.12 and standard deviation of s = 0.37 (there is no reason to suspect athletes have the same standard deviation as the general Harvard population). Perform a formal hypothesis test to determine whether Harvard athletes have a different mean GPA than all Harvard students. Be sure to include your hypotheses, the test statistic, the degrees of freedom (if applicable), an estimate of the p-value, and your conclusion in context of the problem.
H0: µ = 3.25
HA: µ ≠ 3.25

t = (x-bar - µ0)/(s/sqrt(n)) = (3.12 - 3.25)/(0.37/sqrt(50)) = -2.48
This test has df = n - 1 = 49 (use df = 40 in the table).
p-value = 2 * P(t < -2.48). In the table, we see that 2.48 falls between 2.704 and 2.423, so our p-value is somewhere between 2(0.005) and 2(0.01). So it's between 0.01 and 0.02.
Since our p-value < 0.05, we can reject the null hypothesis. It looks like Harvard athletes do have a different average GPA than the rest of Harvard students; in fact, it is lower.

8. (3 points total) Over the last twenty years, the daily change (in decimal form) of a mutual fund based on the S&P 500 Index fund is known to follow a normal distribution with a mean of μ = and a sd of σ =

a. (8 points) What is the probability that this mutual fund loses money in any one day?
Let X be the random variable representing the daily change for this mutual fund. From the opening paragraph, we know X ~ N(μ, σ). Thus:
P(X < 0) = P(Z < (0 - μ)/σ) = P(Z < -0.24) = 0.4052

b. (8 points) What is the probability that this mutual fund loses money in at least one day over the next week (5 days) assuming days are independent?
Let Y be the random variable for the number of days out of 5 that the fund loses money. Thus: Y ~ Bin(n = 5, p = 0.4052). So,
P(Y ≥ 1) = 1 - P(Y = 0) = 1 - (0.5948)^5 = 0.9256

c. (8 points) What is the approximate probability that this mutual fund will lose money in at least 15 of the next 30 days assuming days are independent?
Let V be the random variable for the number of days out of 30 that the fund loses money. Thus: V ~ Bin(n = 30, p = 0.4052). Based on the fact that np > 10 and n(1-p) > 10, we know that V is also approximately Normal:
V ~ N(μ_V = np = 30(0.4052) = 12.156, σ_V = sqrt(np(1 - p)) = sqrt(30(0.4052)(0.5948)) = 2.689)
P(V ≥ 15) = P(Z ≥ (15 - 12.156)/2.689) = P(Z ≥ 1.06) = 1 - P(Z < 1.06) = 0.1446
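Parts (a) through (c) of Problem 8 chain together, so a short numerical sketch may help. It is illustrative only: the fund's mean and standard deviation are missing from the transcription, so the code starts from an assumed z-score of -0.24 for a zero daily change, which matches the loss probability of about 0.405 that the solution uses. The exact binomial tail is an added comparison, not part of the key.

```python
import math
from statistics import NormalDist

std_normal = NormalDist()

# Part (a): daily loss probability, from the assumed z-score of -0.24
p_loss = std_normal.cdf(-0.24)

# Part (b): at least one losing day in 5 independent days
p_week = 1 - (1 - p_loss) ** 5

# Part (c): normal approximation to P(V >= 15) for V ~ Binomial(30, p_loss)
n = 30
mean_v = n * p_loss
sd_v = math.sqrt(n * p_loss * (1 - p_loss))
z = (15 - mean_v) / sd_v
p_15_normal = 1 - std_normal.cdf(z)

# Exact binomial tail for comparison with the approximation
p_15_exact = sum(
    math.comb(n, k) * p_loss**k * (1 - p_loss) ** (n - k) for k in range(15, n + 1)
)

print(round(p_loss, 4), round(p_week, 4), round(z, 2))
print(p_15_normal, p_15_exact)
```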

d. (8 points) Let X-bar be a random variable to represent the average daily change across 250 days (which is essentially a full year of business days). If you assume each day is independent, what is the probability that your investment will have an average change below zero (essentially meaning the fund lost money during the year)?
From the central limit theorem, we know that: X-bar ~ N(μ, σ/sqrt(n)). So:
P(X-bar < 0) = P(Z < (0 - μ)/(σ/sqrt(n))) = P(Z < -3.83)
If days are independent, then the mutual fund has almost no chance (about 1/5000 chance) of losing money (but in real life, days are not independent, which leads us to the next question).

e. (4 points) Now assume instead that this mutual fund's daily change is not independent from day to day, but actually has a positive correlation from one day to the next. Would the probability of losing money increase, decrease or stay the same from your answer in part (d)? Please justify your answer.
This probability would definitely increase if there was a positive correlation from one day to the next. The variance of X-bar would increase with the positive correlation (remember: Var(X1 + X2) = Var(X1) + Var(X2) + 2ρσ_X1σ_X2), which means the z-score calculated in part (d) would be based on a larger denominator, leading to a z-score not as far out in the left tail, so the probability of falling below that would increase.

9. (26 points total) The table and graph below show numerical and graphical summaries of the monthly precipitation (in inches) over the last 60 months in Cambridge, MA.

a. (8 points) Is this distribution left-skewed, right-skewed, or symmetric? Briefly justify your answer.

The distribution is right-skewed. This can be seen in the summary statistics since the mean is larger than the median, and also in the histogram since the right tail is longer than the left, pulling the mean up towards the right tail.

b. (10 points) Identify any suspected high outliers in the data using the quantitative methods discussed in class. Show your work.
The rule for outlier detection is the 1.5*IQR rule. So a value will be designated a high outlier if it lies above Q3 + 1.5(IQR). We see that Q1 = 2.112 and Q3 = 4.981, so the boundary is:
Q3 + 1.5(IQR) = 4.981 + 1.5(4.981 - 2.112) = 4.981 + 1.5(2.869) = 9.285
From the summary statistics table, we see there are 3 outliers at the values 9.57, 9.976, and

c. (8 points) Calculate the mean and standard deviation of monthly precipitation in centimeters (1 inch = 2.54 cm).
This is simply a linear transformation from inches to cm (it has the form y = a + bx). So if X is the variable for rainfall in inches and Y is the variable representing rainfall in cm, then:
y-bar = a + b(x-bar) = 0 + 2.54(x-bar) = 2.54(3.894) = 9.89
s_y = |b| s_x = 2.54(2.411) = 6.124

10. (20 points total) Below are the summary statistics for two variables measured on the top 10 grossing box office movies so far in 2011: how much revenue they generated in US markets and the amount of revenue generated in all international markets combined (both in millions of US dollars), along with the correlation table between the two, and the related scatterplot with international revenue on the y-axis and US revenue on the x-axis:

a. (7 points) What is the formula for the least-squares regression line to predict international revenue based on US revenue?
b1 = r(s_y/s_x), b0 = y-bar - b1(x-bar), and the fitted line is y-hat = b0 + b1 x

b. (4 points) What is the predicted amount of international revenue for a movie that generated 16 million dollars in the US?
y-hat = b0 + b1 x, evaluated at x = 16

c. (4 points) Kung Fu Panda 2 made 16 million dollars in the US and 614 million dollars internationally. What is Kung Fu Panda 2's estimated residual?
e = y - y-hat

d. (5 points) What percentage of variability in international revenue can be explained by US revenue?
R² = r². About 59.1% of the total variability in international revenue can be explained by US revenue.

10. (30 points total) Each part of this problem requires a short response with a brief explanation (simply yes or no will not suffice). Note: these parts are not related.

a. (6 points) In a study of cold symptoms, every one of the 50 study subjects with a cold was found to be improved 2 weeks after taking ginger pills. The authors concluded that ginger pills cure colds. What is the major flaw in this study?
The major problem is that there is no control/comparison group. These subjects most likely would have improved within two weeks had they received no treatments whatsoever (the flu usually just takes a few days to run its course).

b. (6 points) Let H be the event that the Democrats win the majority of the seats in the House of Representatives, and let S be the event that the Democrats win the majority of the seats in the Senate. Let P(H) = 0.5, P(S) = 0.6, and P(H or S) = 0.7. Are H and S independent?
Solution: No, since P(H and S) = P(H) + P(S) - P(H or S) = 0.5 + 0.6 - 0.7 = 0.4, which is different from P(H) * P(S) = 0.3.
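The independence check in part (b) is just two lines of arithmetic: recover the joint probability from the addition rule, then compare it with the product of the marginals. A minimal sketch follows, with hypothetical helper names chosen for this illustration.

```python
def joint_from_union(p_a, p_b, p_a_or_b):
    """Addition rule rearranged: P(A and B) = P(A) + P(B) - P(A or B)."""
    return p_a + p_b - p_a_or_b

def are_independent(p_a, p_b, p_a_and_b, tol=1e-9):
    """Events are independent exactly when P(A and B) equals P(A) * P(B)."""
    return abs(p_a_and_b - p_a * p_b) <= tol

p_h, p_s, p_h_or_s = 0.5, 0.6, 0.7
p_h_and_s = joint_from_union(p_h, p_s, p_h_or_s)
independent = are_independent(p_h, p_s, p_h_and_s)

print(round(p_h_and_s, 2), independent)
```

The tolerance guards against floating-point rounding when the comparison is made on decimal inputs.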

c. (5 points) The sensitivity for a diagnostic test, P(positive test | disease), is 0.85 and the specificity of the same test, P(negative test | no disease), is also 0.85. Are the two events, (A = having the disease) and (B = receiving a positive test), independent? Show your work.
No, these events are not independent; they are dependent, since:
P(B | A) = 0.85 ≠ P(B | A^C) = 0.15

d. (6 points) It is known that 30% of young girls' favorite color is blue while 70% of young boys' favorite color is blue (you can also assume the population is split evenly into 50% boys and 50% girls). Are the two events (being a boy) and (favorite color is blue) independent?
No, these events are not independent. The simplest way to show this is to show that:
P(blue | boy) ≠ P(blue | girl)

e. (6 points) Suppose that A and B are two disjoint events within the same sample space. In addition, let P(A) = 1/8 and P(B) = 1/4. Are events A and B independent? Explain or show your calculations.
Events A and B must be dependent since they are disjoint. Since these events are disjoint, we know there is no overlap or intersection, so P(A and B) = 0. Thus:
P(A and B) = 0 ≠ 1/32 = (1/8)(1/4) = P(A)P(B)

f. (6 points) In 1990 a research organization sent questionnaires to all of the approximately 15,000 high school systems in the United States. These questionnaires asked about computer usage in the school system. As many as 3,600 school systems returned answers. Of these 3,600, 60% indicated that some of their students used computers. In a speech shortly thereafter, an authority on the use of computers in high school education cited this study as evidence that "students in 60% of the high school systems in the United States use computers during their high school careers." Do you regard 60% as a trustworthy estimate of the proportion of school systems providing computer access in 1990? In two sentences or fewer, explain your answer.
Since only 3600/15000 = 24% of the schools responded, this allows for the potential of nonresponse bias. It could be that the schools that chose to respond used computers more than those schools that chose not to respond.

g. (6 points) A company in Hawaii builds bridges for married couples to walk over during their weddings. There are 3 islands in Hawaii that each have the same mean and variance of husbands' weights and of wives' weights. However, the relationship of weights within couples is different on the 3 islands: in Inde, the weights within couples are independent; in Posi, they are positively correlated; and in Nega, the weights are negatively correlated. On which island should the company build the strongest bridge? Defend your answer in 2 sentences or less.

Solution: Since the variability of the sum of weights will be higher if the weights within a couple are positively correlated, there is greater potential (chance) for a heavier couple. Thus, the strongest bridge should be built on the island of Posi.

11. (35 points total) A researcher is investigating variables that might be associated with death rates in the US states. He examined data from 2008 for each of the 50 states plus Washington, D.C. The data included information on the following variables:
   deathrate: the annual death rate per one million inhabitants
   smokers: the percent of inhabitants who smoke heavily, in percentage points
   college: the percent of inhabitants that have a bachelor's degree, in percentage points
As part of his investigation, he ran the following multiple regression model:
deathrate = β0 + β1(smokers) + β2(college) + ε
This model was fit to the data using the method of least squares. The following results were obtained from statistical software:

a. (4 points) What is the estimated standard deviation of the residuals?
s_e = std. error of the estimate = 7.313

b. (6 points) Suppose we wish to test the hypotheses H0: β1 = β2 = 0 versus Ha: at least one of the βj is not 0. What is the value of the appropriate test statistic, the p-value, and conclusion to this test?
F = 1.4
p-value =
Since p-value < 0.05, we can reject the null hypothesis. There is evidence that either smokers or college (or both) is an important predictor of death rate among the 50 states and DC.

c. (6 points) What is the interpretation of the value for b1, the estimated coefficient for the variable smokers?
For every additional percentage point of smokers in a state, we expect a 17.47 per million people per year increase in the death rate in that state, holding the percentage of college graduates in that state constant.

d. (7 points) Calculate the 95% confidence interval for β1, the coefficient for the variable smokers.
b1 ± t* se(b1) = 17.47 ± 2.021(4.500) = (8.38, 26.56)
Note: t* = 2.021 is the critical value from a t-distribution with df = n - p - 1 = 51 - 3 = 48 that puts 2.5% in each tail. We rounded down and used df = 40 in the table.

e. (6 points) Briefly comment on the residual diagnostic plot for this model shown below. Please be specific and limit your response to 3 sentences or bullet points.
We can comment on two assumptions with this graph:
1) Constant Variance: since there is no fanning out around the y = 0 line (that is, more vertical spread on one side of the plot compared to the other), this appears to be a safe assumption.
2) Linearity: since there is no curvature in the scatterplot of points, there is no evidence of non-linearity. This assumption also appears to be safe.
Note: if you are REALLY good, you could try to make an argument that the residuals are normally distributed as well. This can be seen by the fact that most of the points in the vertical direction are close to zero, and they tail off both above and below this middle (there are fewer and fewer observations as you go further away from the y = 0 line in both directions, up and down).
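The interval in part (d) follows the usual estimate ± t* × standard error pattern. The sketch below is illustrative: the point estimate 17.47 is an assumption, taken as the midpoint of the reported interval, and t* = 2.021 is assumed to be the df = 40 table value mentioned in the note. The function name is made up for this example.

```python
def coefficient_ci(estimate, std_error, t_star):
    """Confidence interval for a regression coefficient: estimate +/- t* * SE."""
    margin = t_star * std_error
    return estimate - margin, estimate + margin

# Assumed inputs: b1 = 17.47 (midpoint of the reported interval),
# se(b1) = 4.500, t* = 2.021 (df = 40 row of a t-table, 2.5% in each tail)
low, high = coefficient_ci(17.47, 4.500, 2.021)

print(round(low, 2), round(high, 2))
```

Because the whole interval sits above zero, it agrees with the F-test conclusion that smokers carries predictive information about the death rate.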
Another researcher, using the same data, ran the following simple linear regression model:

deathrate = β0 + β1(college) + ε
The following results were obtained from statistical software:

f. (6 points) The second researcher concluded that because the coefficient for the variable college was negative in his results, spending additional money on education to have more college graduates would decrease the death rate in his state. This researcher therefore recommended more money be spent on education. The first researcher concluded that because the coefficient for the college variable was positive in his results, spending additional money on students would increase the death rate. This researcher therefore recommended less money be spent on education. Why are these two conclusions different even though the researchers used the same data? Explain using a few concise sentences.
This is not surprising, actually. Most likely college is correlated with smokers. Since the first model included smokers, college no longer had predictive ability, and had a slight positive relationship with death rate when controlling for smokers. Without adjusting for smokers (since it's not in the second model), college has a strong negative relationship with death rate.

12. (0 points total) It is known that 20% of all Harvard students are varsity athletes. 50% of varsity athletes eat breakfast on any particular weekday, while only 25% of all other Harvard students eat breakfast on any particular weekday. Define the events:
   A: the event that a student is a varsity athlete
   B: the event that a student eats breakfast on a particular weekday

a. (5 points) Are the events A and B independent? Please briefly justify.
We know P(B | A) = 0.50, and P(B | A^C) = 0.25. Since these are not equal, we know that they are not independent; they are dependent.

b. (5 points) Are the events A and B disjoint? Please briefly justify.
P(A and B) = P(B | A)*P(A) = 0.5*0.2 = 0.1. Since the probability of their intersection is not zero, they are NOT disjoint.
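The two checks in parts (a) and (b) can be written out directly. This is a small illustrative sketch, assuming the damaged percentages are 20% varsity athletes and 25% breakfast among non-athletes, which are the values consistent with the arithmetic in the solution.

```python
import math

p_a = 0.20               # P(varsity athlete), assumed reconstruction
p_b_given_a = 0.50       # P(breakfast | athlete)
p_b_given_not_a = 0.25   # P(breakfast | not athlete), assumed reconstruction

# Part (a): independence would require P(B | A) == P(B | not A)
independent = math.isclose(p_b_given_a, p_b_given_not_a)

# Part (b): disjoint events would have P(A and B) == 0
p_a_and_b = p_b_given_a * p_a
disjoint = p_a_and_b == 0

print(independent, round(p_a_and_b, 2), disjoint)
```

Comparing the two conditional probabilities is equivalent to comparing P(A and B) with P(A) × P(B), but it needs fewer steps when the conditionals are given.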

c. (5 points) Find the overall proportion of students that eat breakfast on any particular weekday.
P(B) = P(A and B) + P(A^C and B) = P(B | A)*P(A) + P(B | A^C)*P(A^C) = 0.5*0.2 + 0.25*0.8 = 0.30

d. (5 points) Given a student is eating breakfast on a particular weekday, what is the probability that that student is a varsity athlete?
P(A | B) = P(A and B) / P(B) = 0.10/0.30 = 0.333

e. (4 points) In actuality the non-varsity-athlete students are comprised of two further subgroups: 30% of them are club athletes and 70% are nonathletes [so there are actually 3 distinct groups in the Harvard student population: varsity athletes, club athletes, and nonathletes]. Of the club athletes, 40% eat breakfast. Let NA denote non-athletes, CA denote club athletes, and VA denote varsity athletes.
i) P(B | NA) = P(B and NA)/P(NA) = 0.104/(0.8*0.7) = 0.186
Since P(B and NA) = P(B) - P(B and VA) - P(B and CA) = 0.30 - 0.10 - 0.096 = 0.104
Since P(B and CA) = P(A^C)*P(CA | A^C)*P(B | CA and A^C) = 0.80*0.30*0.40 = 0.096
ii) P(NA | B) = P(NA and B)/P(B) = 0.104/0.30 = 0.347

13. (55 points total) A study was conducted to determine the association between the maximum distance at which a highway sign can be read (in feet) and the age of the driver (in years). Forty drivers of various ages were studied. The summary statistics for distance and age are shown below in a table from Stata:

a. (8 points) The correlation coefficient between distance and age in this sample is r = Calculate a and b of the least-squares regression equation that would predict the distance at which a highway sign can be read given the age of the driver.
b1 = r(s_y/s_x)
b0 = y-bar - b1(x-bar) = y-bar - (-3.863)*(46.1)

b. (10 points) The standard error of b was calculated to be from SPSS. Is age a significant predictor of distance in this linear model? Conduct this statistical test of H0: β = 0 using α = 0.05. Be sure to include your hypotheses, test statistic, degrees of freedom if appropriate, either the p-value or critical value, and your conclusion in terms of the problem.
H0: β1 = 0

HA: β1 ≠ 0
t = (b1 - 0)/SE(b1)
For this t-test for regression, we have df = n - p - 1 = 40 - 1 - 1 = 38. We round down to df = 30 in the t-table. Our p-value is P(t < -4.2) + P(t > 4.2) = 2*P(t < -4.2) [since this is a two-sided test]. With df = 30, we see that the largest t* in the t-table is 3.385, and our t-statistic is farther out in the tail than that, so P(t < -4.2) < 0.001. Thus our p-value < 2(0.001) = 0.002.
Since our p-value < α = 0.05, we can reject the null hypothesis and conclude that the distance at which someone is able to read a sign while driving is associated with the age of the driver.

c. (4 points) What is the predicted distance that a sign can be read for someone who is 40 years old?
y-hat = b0 + b1 x, evaluated at x = 40

d. (6 points) What is R² for this regression model? What is the interpretation of R² here?
R² = r² = 0.3185
This means that 31.85% of the variability to begin with in distance (the response) can be explained by using age as a predictor in this linear model. The variance of the residuals in this model is 31.85% less than the overall variance of y (distance) ignoring x (age).

The investigators also decided to look at whether wearing glasses had an effect on the distance at which a driver could read a sign. Below is the binary-predictor regression output, labeled as Model A, of the distance someone was able to read the sign predicted from whether or not that person wore glasses (which has value 1 for those wearing glasses or contact lenses, 0 otherwise):
Model A:

e. (3 points) What is the reference group in this model?
The reference group is the group for which the binary predictor variable (glasses) takes the value zero. That means the reference group is the group of people who did not wear glasses or contact lenses during the test.
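Parts (b) and (d) are linked: in simple linear regression, the slope's t statistic can be recovered from the correlation alone through the identity t = r × sqrt(n − 2) / sqrt(1 − r²). The sketch below uses that identity as a cross-check. It is an added illustration, not the method used in the key (which divides b by its standard error), and it assumes r is negative because reading distance falls with age.

```python
import math

r_squared = 0.3185   # R² reported in part (d)
n = 40               # forty drivers

# Sign is an assumption: distance is taken to decrease with age
r = -math.sqrt(r_squared)

# Slope t statistic implied by the correlation, with df = n - 2
t = r * math.sqrt(n - 2) / math.sqrt(1 - r_squared)
df = n - 2

print(round(r, 3), round(t, 2), df)
```

The implied statistic lies beyond the largest df = 30 table value of 3.385 quoted in the solution, which is consistent with the stated conclusion that the p-value is below 0.002.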

f. (4 points) What is the predicted distance that a sign can be read for someone who wears glasses based on this model?
yhat = b0 + b1*x = b0 + b1*(1) = […]

Below is the Stata output of a multiple regression, labeled as Model B, of the distance someone was able to read the sign predicted from age and glasses (again, which has value 1 for individuals wearing glasses or contact lenses, 0 otherwise):
Model B:

g. (5 points) What is the interpretation of the value […] in this regression model?
This value represents the estimated difference in distance between people who wear glasses vs. those who do not, adjusting for age. In essence, if you have two people that are the same age, one with glasses and the other without, the person who wears glasses will need to be about 5.68 feet closer to the sign in order to read it than the person without glasses.

h. (8 points) Compare the results of the two regressions, Model A and Model B, above. Specifically mention any signs or significance that are different between the two models. Why do you suspect this is the case?
Comparing the two models, we see that glasses goes from being a significant predictor (p = 0.09) to a clearly insignificant one (p = 0.453). This can be explained by the fact that age and glasses are themselves correlated (older people wear glasses more often), and if you adjust for age in the model, glasses no longer has the apparent effect that it did in Model A. Age was confounding the significant result between glasses and distance we saw in Model A.

i. (7 points) Above are the residual vs. fitted scatterplot and histogram of the residuals for the multiple regression model (Model B) above. Use these plots to comment on whether the assumptions for this model seem to be valid. Be specific.
We can check 3 assumptions with these plots:
(i) Normally distributed residuals: this seems to be OK based on the histogram to the right. The residuals follow a general bell shape, but there may be evidence of a bimodal distribution.
(ii) Constant variance of the residuals: this assumption seems perfectly appropriate. In the residual scatterplot to the left, we see the spread of the points in the vertical direction seems pretty consistent no matter where you are along the x-axis.
(iii) Residuals are centered at zero (for any values of the X's): this assumption seems just fine. In the scatterplot on the left, the points don't show any curvature and appear to be centered at the zero line no matter where you are along the x-axis.
*Note: we cannot check the independence assumption based on these graphs.

14. (5 points) As part of a study on student loan debt, a national agency that underwrites student loans is examining the differences in student loan debt for undergraduate students. One question the agency would like to address specifically is whether the mean undergraduate debt of Hispanic students graduating in 2009 is less than the mean undergraduate debt of Asian-American students graduating in 2009. To conduct the study, a random sample of 92 Hispanic students and a random sample of 110 Asian-American students who completed an undergraduate degree in 2009 were taken. The undergraduate debt incurred for financing college for each sampled student was collected. Let µH denote the population average student loan debt for Hispanic students, and µA the population average student loan debt for Asian-American students. Using the summary statistics below, test the hypothesis H0: µH = µA vs. Ha: µH ≠ µA. Clearly interpret your results.

Group            n      mean    Std. Dev.
Hispanic         […]    […]     […]
Asian-American   […]    […]     […]
Total            […]    […]     […]

H0: µH = µA
HA: µH ≠ µA
t = (xbar1 - xbar2)/sqrt(s1²/n1 + s2²/n2) = […]
df = min(n1 - 1, n2 - 1) = 91
We can get the p-value from the t-table (rounding down to df = 80), and we see that it is greater than 2*(0.05) = 0.10. Since our p-value is greater than 0.05, we are unable to reject the null hypothesis. There is not enough evidence to say that the average student loans for Hispanic vs. Asian-American students are different. The true averages in the population may actually be the same.

15. (33 points total) An investigator is trying to determine what factors are important in determining the graduation rate at US colleges. She collects a random sample of 53 four-year colleges, and records three variables:
GradRate: the graduation rate for the class of 2007 (as a percentage of all entering students that were full-time, in percentage points: 0 to 100)
Tuition: the tuition in […] (in thousands of dollars)
SATMath: the median SAT math score for entering freshmen in […]
Below is SPSS's regression output of predicting GradRate based on Tuition and median SATMath scores.

a. (5 points) Based on the above model, what is the predicted graduation rate for a college with tuition of $35 thousand and median math SAT score of 750 (aka, Harvard)?
yhat = b0 + b1*x1 + b2*x2 = […] + […](tuition) + 0.184(satmath) = […] + […](35) + 0.184(750) = […]

b. (5 points) In words, what is the interpretation of the coefficient for SATMath (which has value 0.184) in the above table?
This is the estimated change in graduation rate for every extra point in the median SAT math score when holding tuition constant (that's the key: since this is a multiple regression, this is the effect of SAT math on graduation rate while adjusting for tuition). If we compared schools that had the same tuition and one of the schools had a median SAT math score 10 points higher than the other, we'd expect that school to have a graduation rate about 1.8 percentage points higher.

c. (4 points) What is the proportion of total variability in graduation rate that can be explained by this model?
This is simply R² = […]/[…] = […], which is […].

d. (6 points) Perform a single hypothesis test to determine whether any of the variables are associated with graduation rate. Be sure to state your hypotheses, test statistic, degrees of freedom (if applicable), p-value, and conclusion.
H0: β_tuition = β_satmath = 0
HA: Either β_tuition ≠ 0 or β_satmath ≠ 0
This can be tested via an F-test. We see the F-statistic is […] (from SPSS) and has df = 2, […], which leads to p-value = […]. Since this p-value < α = 0.05, we can reject the null hypothesis. It looks like at least one of our predictors is significantly associated with graduation rate.

e. (7 points) Perform a hypothesis test to determine whether specifically tuition is associated with graduation rate in the above model. Be sure to state your hypotheses, test statistic, degrees of freedom (if applicable), p-value, and conclusion.
H0: β_tuition = 0
HA: β_tuition ≠ 0
t = 1.14
This t-statistic has df = n - k - 1 = […]. The p-value estimate is > 0.10 based on the table.
Since this p-value > α = 0.05, we cannot reject the null hypothesis. Tuition may not truly be associated with graduation rate (when also adjusting for median SAT math score).

f. (6 points) The dean at a college sees these results and suggests to his board of trustees that they raise their tuition in order to improve their graduation rate. What is the major mistake the dean is making in concluding from these data that raising their tuition will lead to a higher graduation rate?
The major mistake he is making is in thinking this is a causal relationship. Since these data come from a survey and not an experiment, there is no guarantee that raising tuition will lead to an

improvement in graduation rate (in fact, I would argue that it's the best schools, those that have a high graduation rate, that are able to charge a high tuition because of their reputation: a reverse causation).

16. (1 points) A survey of male and female university students asked which popular musical artist they preferred. The survey focused on Lady Gaga and Justin Bieber but allowed for other artists as well. Some of the values from the two-way table are missing, but you can determine what they are and answer the given questions.

a) What is the value of a?

Artist    Lady Gaga   Justin Bieber   Other   Total
Male      a           […]             100     […]
Female    […]         […]             […]     […]
Total     […]         […]             […]     […]

a = (# Lady Gaga) - (# Lady Gaga and Female) = ([…]) - 50 = 20
Here is the whole completed table:

Artist    Lady Gaga   Justin Bieber   Other   Total
Male      20          30              100     150
Female    50          60              40      150
Total     70          90              140     300

b) What is the probability that a randomly chosen student will prefer Justin Bieber?
P(Bieber) = 90/300 = 0.30

c) Given a student prefers Justin Bieber, what is the probability that they are female?
P(Female | Bieber) = (# Female and Bieber)/(# Bieber) = 60/90 = 0.667

d) Are gender and artist preference dependent or independent events? (do not use a chi-square test)
They are definitely dependent. In part (c) we showed that P(Female | Bieber) = 0.667. However, P(Female) = 150/300 = 0.50 overall. Since these probabilities are not equal, we can say that the two events are dependent.

17. (24 points total) Your younger sister and brother are strong believers in the tooth fairy. Whenever a baby tooth falls out, your sibling places it under his/her pillow before going to sleep, and in the morning the tooth fairy replaces it with cash. You observe this week that their baby teeth are close to falling out. Let A be the event that one of your sister's teeth will fall out today, and let B

be the event that one of your brother's teeth will fall out today. You estimate that P(A) = 0.3 and P(B) = 0.2. Assume that for each sibling at most one tooth will fall out.

a. (6 points) Assuming whether your brother's tooth falls out is independent of whether your sister's tooth falls out, the probability that neither falls out today is 0.56. Demonstrate with appropriate calculations why this is true.
P(A^C and B^C) = P(A^C)*P(B^C) (since independent) = (1 - 0.3)*(1 - 0.2) = (0.7)*(0.8) = 0.56

b. (6 points) Under the assumption of independence as in part (a), what is the probability that exactly one (i.e., not both) of your siblings' teeth falls out today?
Let the random variable X = the count of teeth that fall out. Then we want
P(X = 1) = 1 - [P(X = 0) + P(X = 2)] = 1 - [0.56 + 0.2*0.3] = 1 - [0.62] = 0.38
Or this can be solved by thinking of it as the union of two disjoint events:
P[(brother loses a tooth and sister does not) or (sister loses a tooth and brother does not)]
= P(brother loses and sister doesn't) + P(sister loses and brother doesn't)
= P(brother loses)*P(sister doesn't) + P(sister loses)*P(brother doesn't) (these multiply since independent)
= 0.2*(1 - 0.3) + 0.3*(1 - 0.2) = 0.2*0.7 + 0.3*0.8 = 0.38

c. (6 points) Describe a scenario involving your younger siblings where A and B are clearly not independent events. Be sure to state this scenario in context of this problem (do not just give the definition of dependence).
There are lots of answers here. One is the possibility of the two getting into a fight and punching each other (some days they do, some days they don't). Or maybe both eating something sticky (like peanut brittle). Or both deciding to play ice hockey without a helmet. The list goes on and on.

d. (6 points) The tooth fairy replaces a tooth with cash with probability 0.5 independently from child to child. On a given night, 10 children in a town have placed teeth that have fallen out under their pillows.
What is the probability that at least 1 of these 10 children is visited by the tooth fairy?
Let the random variable X = # children visited by the tooth fairy. Hence, X ~ Bin(n = 10, π = 0.5). The question being asked is P(X ≥ 1). Then,
P(X ≥ 1) = 1 - P(X = 0) = 1 - (1 - π)^10 = […]
Note, P(X = 0) is the probability that no child is visited by the tooth fairy, which will lead you to the same calculation.

18. (16 points total) With the popularity of traditional lotteries waning across the US, many states are turning to instant games, called scratch-off tickets, to lure new players and raise revenue. However, many critics are concerned that instant gratification scratch-off tickets are more likely to

contribute to gambling addiction and take particular advantage of the poor members of society. A survey of 100 randomly selected gamblers with below median incomes was conducted in the El Paso area of Texas to study the association between gambling addiction and the primary type of gambling (traditional state lottery versus scratch-off tickets). The results are given below.

Primary type of gambling   Diagnosed with a gambling addiction   No gambling addiction   Total
Scratch-off tickets        11                                    39                      50
Traditional lottery        2                                     48                      50
Total                      13                                    87                      100

a. (8 points) Is this significant evidence that the primary type of gambling affects the risk of a gambling addiction? Test at level α = 0.05 and include the null and alternative hypotheses, the test statistic, the rejection region, an estimate of the P-value, a statement of whether or not you reject the null hypothesis, and a sentence summarizing your conclusion.
There are two ways to do this problem: using a χ² test or using a z-test comparing π1 and π2. The χ² test method is given below.
H0: A focus on scratch-off tickets and gambling addiction are independent
Ha: A focus on scratch-off tickets and gambling addiction are dependent
χ² = Σ (Obs - Exp)²/Exp = (11 - 6.5)²/6.5 + (39 - 43.5)²/43.5 + (2 - 6.5)²/6.5 + (48 - 43.5)²/43.5 = 7.16
Reject H0 if χ² > χ²*(df = 1, 0.05) = 3.84. The p-value is between 0.005 and 0.01 (from the χ² table). So we should reject H0. We can conclude that gambling primarily with scratch-off tickets is associated with a higher rate of gambling addiction.

b. (3 points) Find the difference in proportions of a gambling addiction comparing scratch-off ticket users to traditional lottery users.
phat1 = 11/50 = 0.22, phat2 = 2/50 = 0.04
phat1 - phat2 = 0.22 - 0.04 = 0.18

c. (5 points) Find the 95% confidence interval for the difference in gambling addiction for scratch-off ticket users vs. traditional lottery users.
95% Confidence Interval for a difference in proportions:
(phat1 - phat2) ± z* sqrt(phat1(1 - phat1)/n1 + phat2(1 - phat2)/n2)
= (0.22 - 0.04) ± 1.96*sqrt(0.22(0.78)/50 + 0.04(0.96)/50)
= (0.053, 0.307)
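The χ² statistic in part a and the interval in part c can be checked numerically with a short sketch. The cell counts used here (11 of 50 scratch-off players and 2 of 50 traditional lottery players diagnosed) are an assumption inferred from the worked solution, which shows 11, the expected counts 6.5 and 43.5, and the interval (0.053, 0.307). The function names are illustrative.

```python
import math

# Assumed 2x2 counts, inferred from the worked solution:
# rows = scratch-off, traditional; columns = addiction, no addiction.
observed = [[11, 39],
            [2, 48]]

def chi_square_2x2(table):
    """Pearson chi-square statistic with expected = row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (obs - expected) ** 2 / expected
    return stat

def diff_proportion_ci(x1, n1, x2, n2, z_star=1.96):
    """Large-sample confidence interval for p1 - p2 using the unpooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - z_star * se, diff + z_star * se

chi2 = chi_square_2x2(observed)
low, high = diff_proportion_ci(11, 50, 2, 50)
print(f"chi-square = {chi2:.2f}")
print(f"95% CI = ({low:.3f}, {high:.3f})")
```

With df = 1, the statistic is compared against the 0.05 critical value of 3.84; the interval excluding zero points to the same conclusion as the test.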

19. The mean length of stay in a hospital is useful for planning purposes. Suppose that the following is the distribution of the length of stay in a hospital after a minor operation.

Number of Days   2     3     4
Probability      0.2   0.3   0.5

a) What is the mean (expected value) length of stay?
E(X) = µX = Σ x*P(X = x) = 2(0.2) + 3(0.3) + 4(0.5) = 3.3

b) What is the variance of length of stay?
Var(X) = σX² = Σ (x - µX)²*P(X = x) = (2 - 3.3)²(0.2) + (3 - 3.3)²(0.3) + (4 - 3.3)²(0.5) = 0.61
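The two sums above follow the same pattern for any discrete distribution, so they can be written as a small sketch. The values 2, 3, 4 and probabilities 0.2, 0.3, 0.5 are those implied by the worked sums; the variable and function names are illustrative.

```python
import math

def mean_and_variance(values, probs):
    """Expected value and variance of a discrete random variable."""
    # Probabilities of a valid distribution must sum to 1.
    assert math.isclose(sum(probs), 1.0), "probabilities must sum to 1"
    mean = sum(x * p for x, p in zip(values, probs))
    variance = sum((x - mean) ** 2 * p for x, p in zip(values, probs))
    return mean, variance

days = [2, 3, 4]
probabilities = [0.2, 0.3, 0.5]

mean_stay, var_stay = mean_and_variance(days, probabilities)
print(f"E(X) = {mean_stay:.1f}, Var(X) = {var_stay:.2f}")
```

The standard deviation, if needed, is the square root of the variance returned here.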