1. Multiple Choice: require no justification. Note: these parts are not related.
- Stuart Blake
- 9 years ago
Statistics E100 Final Exam Extra Practice Problems Solutions

1. Multiple Choice: require no justification. Note: these parts are not related.

a) A magazine states the following hypotheses about the average age of their subscribers: Ho: µ = 8 years vs. Ha: µ > 8 years. Making a Type I error with this test means that:
a) The sample result gives little evidence to conclude that the average age of the subscribers is greater than 8 years when in fact the average age is 8 years.
b) The sample result gives little evidence to conclude that the average age of the subscribers is greater than 8 years when in fact the average age is much greater than 8 years.
c) The sample result gives strong evidence that the average age is greater than 8 years when in fact the average age IS 8 years.
d) The sample result gives strong evidence that the average age is greater than 8 years when in fact the average age is much greater than 8 years.

b) The business college computing center wants to determine the proportion of business students who have personal computers at home. If the proportion differs from 25%, then the lab will modify a proposed enlargement of its facilities. Suppose data is collected from 100 randomly chosen students of the business college and the sample proportion is found to be 34%. What is the test statistic for testing H0: π = 0.25 versus HA: π > 0.25?
a) 1.65 b) 2.08 c) 1.90 d) 1.78

c) What is the result of the hypothesis test in Problem (b) above?
a) Reject the null hypothesis b) Fail to reject the null hypothesis

d) As the degrees of freedom for the t distribution increase, the distribution approaches
1) The value of zero for the mean. 2) The t distribution. 3) The normal distribution. 4) The binomial distribution.

e) Which statement is NOT true about hypothesis tests?
a) Hypothesis tests are only valid when the sample is representative of the population for the question of interest.
b) Hypotheses are statements about the population represented by the samples.
c) Hypotheses are statements about the sample (or samples) from the population. d) Conclusions are statements about the population represented by the samples.
f) In regression analysis, if the coefficient of determination (R²) is 1.0, then:
a. SSE (error sum of squares) must be 1.0
b. SSR (regression sum of squares) must be 1.0
c. SSE must be 0.0
d. SSR must be 0.0

g) A sample size of 00 light bulbs was tested and found that 11 were defective. What is the 95% confidence interval around this sample proportion?
a) ± b) ± c) ± d) ±

h) You wish to estimate the proportion of shoppers that use credit cards. Determine the sample size needed if the margin of error should be at most 0.01 (that is, we want the confidence interval to be +/- 0.01) and the confidence level is 95%.
a) 8,98 b) 3,050 c) 15,914 d) 9,604

i) Suppose individuals with a certain gene have a 0.4 probability of eventually contracting a particular disease. If 15 individuals with the gene participate in a lifetime study, what is the distribution of the random variable X describing the number of these individuals who will contract the disease?
a) X is a binomial random variable with n=6 and p=1.897
b) X is a normally distributed random variable with mean 15 and variance 0.4
c) X is a binomial random variable with n=15 and p=0.4
d) X is a normally distributed random variable with mean 6 and variance
e) None of the above

j) After performing a simple linear regression, we calculated the residuals and obtained the residual plot shown below. Does the plot indicate any potential problems with the regression?
a) The plot indicates the residuals are not normally distributed.
b) The plot shows curvature. Hence, a linear model is not appropriate.
c) The plot indicates that the error variance is not constant.
d) All of the above
e) There are no apparent problems.

k) What do residuals represent in the simple linear regression model?
a) The difference between the actual Y values and the mean of Y.
b) The difference between the actual Y values and the predicted Y values.
c) The square root of the slope.
d) The predicted value of Y for the average X value
e) None of the above.

l) The probability that a region prone to hurricanes will be hit by a hurricane in any single year is 0.1 and independent of other years. What is the probability of a hurricane hit at least once in the next 5 years?
a) b) c) 0.5 d) e) None of the above

m) What is the expected number of hurricanes to hit the area described above in the next 90 years?
a) 9 b) 3 c) 8.1 d) .85 e) None of the above

q) For which of the following hypotheses tests would the p-value be the same whether the sample mean is 44 or 46 (see table to the right)
a) I. b) I. and IV. c) II. and III. d) IV. e) Stop bothering me with these silly questions.

r) (3 points) We are told that a 95% prediction interval for a response variable, y, is (3., 35.6) from a simple regression on a sample of n = 100 observations at x* = 10. Which of the following is a reasonable estimate for the confidence interval for µ_y at x* = 10?
a. (13., 45.6) b. (4., 7.8) c. (8.6, 30.) d. (33., 45.6)
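Parts (h), (l), and (m) above are direct formula applications, so they can be checked numerically. The sketch below is only an illustration: the helper names are hypothetical, and it assumes the conservative choice p = 0.5 for the sample-size formula n = (z*)² p(1 - p) / m², which the problem does not state explicitly.

```python
import math
from statistics import NormalDist


def sample_size_for_proportion(margin, confidence=0.95, p=0.5):
    """Smallest n with z* * sqrt(p(1-p)/n) <= margin.

    p = 0.5 is an assumption (the worst case when no prior estimate exists).
    """
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z_star ** 2 * p * (1 - p) / margin ** 2)


def prob_at_least_one(p_single, trials):
    """P(at least one success in independent trials) = 1 - P(no successes)."""
    return 1 - (1 - p_single) ** trials


# Part (h): margin of error 0.01 at 95% confidence
n_needed = sample_size_for_proportion(0.01)

# Part (l): at least one hurricane in 5 independent years, p = 0.1 per year
p_hit = prob_at_least_one(0.1, 5)

# Part (m): expected number of hits in 90 years is n * p
expected_hits = 90 * 0.1

print(n_needed, round(p_hit, 4), expected_hits)
```

The sample-size result agrees with option (d) of part (h), and the expected count agrees with option (a) of part (m).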
2. (25 points total) Kellogg's wants to increase sales of its Fruit Loops cereal, and decides to run an experiment at Stop & Shop stores in New England. They randomize which shelf (bottom vs. middle vs. top) Fruit Loops is placed on at a total of 150 stores (50 on each shelf). The variables collected were then:
sales: number of boxes sold at the store in one day
middle: a 0/1 binary variable to indicate if Fruit Loops was on the middle shelf
top: a 0/1 binary variable to indicate if Fruit Loops was on the top shelf
A regression was run in SPSS, and the results are shown below:

a) (7 points) Is there any evidence that sales varied across the 3 shelf locations? Perform a formal hypothesis test to determine this: be sure to include your hypotheses, the test statistic, the degrees of freedom, the p-value, and your conclusion in context of the problem.
H0: β1 = β2 = 0
HA: At least one βj ≠ 0
F = 3.76 (directly from the table)
p-value = (directly from the table)
Since p-value < 0.05, we can reject the null hypothesis. There is evidence that sales truly vary across the 3 shelf locations in the population of all stores.
This coefficient is estimating the mean difference in sales of stores where Fruit Loops was sold on the middle shelf in comparison to the bottom shelf (the reference group). There were on average 3.38 more boxes sold on the middle shelf than the bottom shelf.

c) (6 points) Calculate the 95% confidence interval for the middle variable in the above regression model.
b1 ± t* × SE(b1) = 3.38 ± t*(1.234) = (0.93, 5.83)

d) (4 points) Which shelf location is predicted to have the most sales? Which shelf location is predicted to have the fewest sales? Please justify.
Highest: middle shelf
Lowest: bottom shelf
The middle shelf is predicted to have the highest average sales ( = 6.18) since it has the highest difference compared to the reference group (and it is positive), while the bottom shelf is predicted to have the fewest sales as it is the reference group and the other two groups are estimated to have higher means (since both slope estimates are positive).

e) (4 points) What percent of total variability in sales can be predicted by this model?
R² = SSM/SST = 0.0487. So 4.87% of the variability in sales can be predicted from the model on shelf location.

3. (15 points total) Suppose that past history shows that 35% of college students prefer Pepsi over Coca-Cola.

a. (4 points) A sample of 5 students is selected. What is the probability that at least 1 prefers Pepsi?
Let X = count of students who prefer Pepsi in a random sample of 5.
X ~ Binomial(n = 5, p = 0.35), P(X ≥ 1) = 1 - P(X = 0) = 1 - (0.65)^5 = 0.884, or 88.4%

b. (6 points) A sample of size 50 is collected. What are the mean and standard deviation for the number of students who prefer Pepsi in this sample?
Y = number of students who prefer Pepsi
Y ~ Binomial(50, 0.35)
Mean of Y = np = 50*0.35 = 17.5
Standard deviation of Y = sqrt(50*0.35*0.65) = 3.373

c. (5 points) In this sample of 50 students, what is the probability that the majority (strictly more than half) of the students selected prefer Pepsi to Coke?
Standardized: Z = (Y - 17.5)/3.373 is approximately standard normal.
P(Y ≥ 26) = P(Z ≥ (26 - 17.5)/3.373) = P(Z ≥ 2.52) = 0.0059
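The three parts of the Pepsi problem can be reproduced with a few lines of Python. This is a rough check, not part of the original solutions: `binom_pmf` is a hypothetical helper, and the exact binomial tail is included only as a comparison against the normal approximation (no continuity correction, matching the solution above).

```python
import math
from statistics import NormalDist


def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


p = 0.35

# Part (a): at least one of 5 students prefers Pepsi
p_at_least_one = 1 - binom_pmf(0, 5, p)

# Part (b): mean and standard deviation of the count in a sample of 50
n = 50
mean_y = n * p
sd_y = math.sqrt(n * p * (1 - p))

# Part (c): P(Y >= 26) via the normal approximation used in the solution
z = (26 - mean_y) / sd_y
p_majority_normal = 1 - NormalDist().cdf(z)

# Exact binomial tail, for comparison with the approximation
p_majority_exact = sum(binom_pmf(k, n, p) for k in range(26, n + 1))

print(round(p_at_least_one, 3), mean_y, round(sd_y, 3), round(z, 2))
print(round(p_majority_normal, 4), round(p_majority_exact, 4))
```

Comparing the two tail probabilities gives a feel for how rough the normal approximation is this far out in the tail.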
4. (29 points total) An investigator is interested in modeling the progression over time in the Men's 100 meter run in the Olympics. He measures variables:
time: the winning time in the men's 100 meter sprint, in seconds
year: the year of the Olympics (from 1900 to 2008)
Some relevant SPSS output is shown below:

a) (7 points) Are time and year significantly associated at α = 0.05? Be sure to include the hypotheses, the test statistic, the degrees of freedom, the p-value, and your conclusion in the context of the problem.
Solution: Test for the significance of the slope in the regression.
a) Hypotheses. Ho: β1 = 0 vs. Ha: β1 ≠ 0
b) Test statistic (from SPSS): t = / = , 4 degrees of freedom
c) p-value < < 0.05
d) We can reject the null hypothesis, and therefore there is statistical evidence of an association between time and year in the Olympics. In fact, winning times are decreasing.

b) (4 points) What is the estimated correlation between time and year?
r = -sqrt(R²) = = (must be negative since the slope is negative)

c) (3 points) What is the estimated standard deviation of winning times within Olympic year (aka, standard deviation of the residuals)?
Std. Error of the Estimate =
d) (4 points) Based on this model, in what year will the winning times be forecasted to drop below 9 seconds (please round up to the nearest year)?
Set 8.99 = b0 + b1 * x, where b0 and b1 are the fitted intercept and slope, and solve for x: x = (8.99 - b0) / b1. In the year 2070.

e) (4 points) In 1 or 2 sentences, please comment on the validity of your forecast in part (d) above.
Since the values of the x-variable range from 1900 to 2008, using the model to predict what the time will be in 2070 is extrapolation, and who knows if this pattern will continue in such a linear fashion.

f) (7 points) Above are the histogram of residuals along with the scatterplot of the residuals vs. the x-variable, year. Please comment on the validity of the assumptions for this regression model. (I've commented on one; please list it and the others, and comment on the other 3.)
Assumptions:
1. Independence of observations: cannot be checked here.
2. Normally distributed residuals: From the histogram we can conclude that the residuals seem to be fairly normally distributed.
3. Linear relationship between predictors and response: There is no visible pattern (i.e. curvature) in the residual vs fitted plot, and therefore we can assume that the linear trend was captured by the regression.
4. Residuals have constant variance: From the residuals vs fitted plot, this assumption seems to hold true since there is no pattern or fanning out.
5. (16 points total) Kevin is flying directly to Philadelphia on Saturday for a friend's wedding next weekend. He has his flight booked through US Airways. US Airways reports that whether his flight is on time or not depends on the weather in Boston. If it is raining in Boston, the flight will be late 50% of the time. If it is not raining in Boston, the flight will be late only 10% of the time. There is a forecasted 25% chance for rain on Saturday (assume that this forecast is correct).

a. (8 points) What is the overall probability that Kevin's flight will be delayed? [If Kevin arrives late, he will miss the beginning of the wedding].
Recall: P(A) = P(A and B) + P(A and B^C)
P(late) = P(late and rain) + P(late and no rain)
= P(late | rain)P(rain) + P(late | no rain)P(no rain)
= 0.50*0.25 + 0.10*0.75 = 0.20

b. (8 points) Saturday rolls around and Kevin's friend notices Kevin has not arrived at the wedding on time because his flight was delayed. What is the conditional probability that it actually was raining up in Boston given the fact that Kevin's flight was delayed?
P(rain | late) = P(late and rain) / P(late) = P(late | rain)P(rain) / P(late) = 0.50*0.25 / 0.20 = 0.625

6. An elevator serving a hospital is designed to hold up to 15 passengers and has a maximum safe capacity of 2440 pounds. The weight of passengers who use the elevator is normally distributed with an average of 149 pounds and a standard deviation of 20 pounds.

a) What is the probability that a single passenger on the elevator weighs between 140 and 150 pounds?
Z = (X - μ)/σ = (150 - 149)/20 = 0.05
Z = (X - μ)/σ = (140 - 149)/20 = -0.45
P(140 < X < 150) = P(X < 150) - P(X < 140) = P(Z < 0.05) - P(Z < -0.45) = 0.5199 - 0.3264 = 0.1935

b) What is the probability that a single passenger on the elevator weighs more than 200 pounds?
Z = (X - μ)/σ = (200 - 149)/20 = 2.55
P(X > 200) = 1 - P(X < 200) = 1 - P(Z < 2.55) = 1 - 0.9946 = 0.0054

c) If five passengers enter the elevator together, what is the probability that all five of them weigh 200 pounds or less?
P(all 5 less than 200) = (0.9946)^5 = 0.9733

d) What is the probability that the elevator's safe capacity is exceeded by a full load of 15 passengers?
Let T = total weight of 15 passengers. Then X-bar = T/15. We know X-bar ~ N(μ = 149, σ = 20/√15 = 5.164).
Z = (162.67 - 149) / 5.164 = 2.65
So P(T > 2440) = P(X-bar > 2440/15) = P(X-bar > 162.67) = P(Z > 2.65) = 0.0040
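The elevator calculations are easy to verify with Python's `statistics.NormalDist`. The sketch below assumes a standard deviation of 20 pounds, a 200-pound threshold, and a 2440-pound capacity; these are the values consistent with the z-scores 2.55 and 2.65 and with σ/√15 = 5.164 in the solution. Variable names are illustrative only.

```python
import math
from statistics import NormalDist

# Assumed inputs: mean 149 lb, sd 20 lb, capacity 2440 lb, 15 passengers
weight = NormalDist(mu=149, sigma=20)

# (a) one passenger weighs between 140 and 150 pounds
p_between = weight.cdf(150) - weight.cdf(140)

# (b) one passenger weighs more than 200 pounds
p_over_200 = 1 - weight.cdf(200)

# (c) five independent passengers all weigh 200 pounds or less
p_all_five_under = weight.cdf(200) ** 5

# (d) total weight of 15 passengers exceeds 2440 pounds, via the
#     sampling distribution of the mean: X-bar ~ N(149, 20 / sqrt(15))
mean_weight = NormalDist(mu=149, sigma=20 / math.sqrt(15))
p_overload = 1 - mean_weight.cdf(2440 / 15)

for name, value in [("between", p_between), ("over 200", p_over_200),
                    ("all five under", p_all_five_under),
                    ("overload", p_overload)]:
    print(f"{name}: {value:.4f}")
```

Small differences from table-based answers are expected, since the software does not round the z-scores to two decimals first.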
7. (32 points total) GPA's at Harvard are known to be approximately Normally distributed with a mean of µ = 3.25 and a standard deviation σ = 0.30.

a. (6 points) Show that 20.33% of Harvard students have a GPA below 3.00.
z = (x - µ)/σ = (3.00 - 3.25)/0.30 = -0.83
P(X < 3.00) = P(Z < -0.83) = 0.203

b. (6 points) There are 6 students living in a suite in a Harvard house. If we assume their GPA's to be independent, what is the probability that at least one of them has a GPA below 3.00?
Let X = # students below 3.0 out of 6. Then X ~ Bin(n = 6, π = 0.203). We want:
P(X ≥ 1) = 1 - P(X = 0) = 1 - (1 - 0.203)^6 ≈ 0.744

c. (6 points) A random sample of 50 Harvard students was taken. Assuming their GPA's are independent, what is the probability that at least 20 of them have a GPA below 3.00?
Let X = # students below 3.0 out of 50. Then X ~ Bin(n = 50, π = 0.203). We want P(X ≥ 20). We can do this with the Normal approximation to the Binomial. So we know that approximately X ~ N(µ = nπ = 50(0.203) = 10.1, σ = √(nπ(1 - π)) = √(50(0.202)(0.798)) = 2.84). Then:
z = (x - µ)/σ = (20 - 10.1)/2.84 = 3.49
P(X ≥ 20) = P(Z ≥ 3.49) = 1 - P(Z < 3.49) = 0.0002

d. (6 points) What is the probability that the average GPA for these 50 randomly sampled students is below 3.00?
z = (x̄ - µ)/(σ/√n) = (3.00 - 3.25)/(0.30/√50) = -5.89
P(X̄ < 3.00) = P(Z < -5.89) ≈ 0

e. (8 points) A random sample of 50 Harvard athletes had a mean of x̄ = 3.12 and standard deviation of s = 0.37 (there is no reason to suspect athletes have the same standard deviation as the general Harvard population). Perform a formal hypothesis test to determine whether Harvard athletes have a different mean GPA than all Harvard students. Be sure to include your hypotheses, the test statistic, the degrees of freedom (if applicable), an estimate of the p-value, and your conclusion in context of the problem.
H0: µ = 3.25
HA: µ ≠ 3.25
t = (x̄ - µ0)/(s/√n) = (3.12 - 3.25)/(0.37/√50) = -2.48
This test has df = n - 1 = 49 (use df = 40 in the table).
p-value = 2 × P(t < -2.48). In the table, we see that 2.48 falls between 2.704 and 2.423, so our p-value is somewhere between 2(0.005) and 2(0.01). So it's between 0.01 and 0.02.
Since our p-value < 0.05, we can reject the null hypothesis. It looks like Harvard athletes do have a different average GPA than the rest of Harvard students; in fact, it is lower.

8. (3 points total) Over the last twenty years, the daily change (in decimal form) of a mutual fund based on the S&P 500 Index fund is known to follow a normal distribution with a mean of μ = and a sd of σ =

a. (8 points) What is the probability that this mutual fund loses money in any one day?
Let X be the random variable representing the daily change for this mutual fund. From the opening paragraph, we know X ~ N(µ, σ). Thus:
P(X < 0) = P(Z < (0 - µ)/σ) = P(Z < -0.24) = 0.4052

b. (8 points) What is the probability that this mutual fund loses money in at least one day over the next week (5 days) assuming days are independent?
Let Y be the random variable for the number of days out of 5 that the fund loses money. Thus: Y ~ Bin(n = 5, p = 0.4052). So,
P(Y ≥ 1) = 1 - P(Y = 0) = 1 - (0.5948)^5 ≈ 0.926

c. (8 points) What is the approximate probability that this mutual fund will lose money in at least 15 of the next 30 days assuming days are independent?
Let V be the random variable for the number of days out of 30 that the fund loses money. Thus: V ~ Bin(n = 30, p = 0.4052). Based on the fact that np > 10 and n(1 - p) > 10, we know that V is also approximately Normal:
V ~ N(µ_V = np = 30(0.4052) = 12.156, σ_V = √(np(1 - p)) = √(30(0.4052)(0.5948)) = 2.689)
P(V ≥ 15) = P(Z ≥ (15 - 12.156)/2.689) = P(Z ≥ 1.06) = 1 - P(Z < 1.06) = 0.1446
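Parts (b) and (c) of the mutual fund problem follow the same binomial and normal-approximation pattern, so a short numeric check is possible. The sketch assumes the one-day loss probability is p = 1 - 0.5948 = 0.4052, taken from the factor 0.5948 that appears in the variance calculation; the fund's actual mean and standard deviation are not needed for these parts.

```python
import math
from statistics import NormalDist

# Assumed one-day loss probability, recovered from the 0.5948 factor above
p_loss = 1 - 0.5948

# (b) the fund loses money on at least one of 5 independent days
p_week = 1 - (1 - p_loss) ** 5

# (c) normal approximation to V ~ Bin(30, p_loss) for P(V >= 15),
#     without a continuity correction, as in the solution
n = 30
mu_v = n * p_loss
sigma_v = math.sqrt(n * p_loss * (1 - p_loss))
z = (15 - mu_v) / sigma_v
p_half_or_more = 1 - NormalDist().cdf(z)

print(round(p_week, 4), round(mu_v, 3), round(sigma_v, 3),
      round(z, 2), round(p_half_or_more, 4))
```

The software value for part (c) differs slightly from the table answer because the table lookup rounds z to 1.06 first.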
d. (8 points) Let X̄ be a random variable to represent the average daily change across 250 days (which is essentially a full year of business days). If you assume each day is independent, what is the probability that your investment will have an average change below zero (essentially meaning the fund lost money during the year)?
From the central limit theorem, we know that: X̄ ~ N(µ, σ/√n). So:
P(X̄ < 0) = P(Z < (0 - µ)/(σ/√n)) = P(Z < -3.83) =
If days are independent, then the mutual fund has almost no chance (about 1/5000 chance) of losing money (but in real life, days are not independent... which leads us to the next question...).

e. (4 points) Now assume instead that this mutual fund's daily change is not independent from day to day, but actually has a positive correlation from one day to the next. Would the probability of losing money increase, decrease or stay the same from your answer in part (d)? Please justify your answer.
This probability would definitely increase if there was a positive correlation from one day to the next. The variance of X̄ would increase with the positive correlation (remember: Var(X1 + X2) = Var(X1) + Var(X2) + 2ρσ_X1σ_X2), which means the z-score calculated in part (d) would be based on a larger denominator, leading to a z-score not as far out in the left tail, so the probability of falling below that would increase.

9. (26 points total) The table and graph below show numerical and graphical summaries of the monthly precipitation (in inches) over the last 60 months in Cambridge, MA.

a. (8 points) Is this distribution left-skewed, right-skewed, or symmetric? Briefly justify your answer.
The distribution is right-skewed. This can be seen in the summary statistics since the mean is larger than the median, and also in the histogram since the right tail is longer than the left, pulling the mean up towards the right tail.

b. (10 points) Identify any suspected high outliers in the data using the quantitative methods discussed in class. Show your work.
The rule for outlier detection is the 1.5*IQR rule. So a value will be designated a high outlier if it lies above Q3 + 1.5(IQR). We see that Q1 = 2.112 and Q3 = 4.981, so the boundary is:
Q3 + 1.5(IQR) = 4.981 + 1.5(4.981 - 2.112) = 4.981 + 1.5(2.869) = 9.285
From the summary statistics table, we see there are 3 outliers at the values 9.57, 9.976, and

c. (8 points) Calculate the mean and standard deviation of monthly precipitation in centimeters (1 inch = 2.54 cm).
This is simply a linear transformation from inches to cm (it has the form y = a + bx). So if X is the variable for rainfall in inches and Y is the variable representing rainfall in cm, then:
ȳ = a + b x̄ = 0 + 2.54 x̄ = 2.54(3.894) = 9.89
s_y = |b| s_x = 2.54(2.411) = 6.124

10. (20 points total) Below are the summary statistics for two variables measured on the top 10 grossing box office movies so far in 2011: how much revenue they generated in US markets and the amount of revenue generated in all international markets combined (both in millions of US dollars), along with the correlation table between the two, and the related scatterplot with international revenue on the y-axis, and US revenue on the x-axis:
a. (7 points) What is the formula for the least-squares regression line to predict international revenue based on US revenue?
b1 = r(s_y/s_x) = .31
b0 = ȳ - b1 x̄ = - (0.58) = 57.7
ŷ = b0 + b1 x = + ( x)

b. (4 points) What is the predicted amount of international revenue for a movie that generated 16 million dollars in the US?
ŷ = b0 + b1 x = b0 + b1(16)

c. (4 points) Kung Fu Panda made 16 million dollars in the US and 614 million dollars internationally. What is Kung Fu Panda's estimated residual?
e = y - ŷ

d. (5 points) What percentage of variability in international revenue can be explained by US revenue?
R² = 0.591. About 59.1% of the total variability in international revenue can be explained by US revenue.

10. (30 points total) Each part of this problem requires a short response with a brief explanation (simply yes or no will not suffice). Note: these parts are not related.

a. (6 points) In a study of cold symptoms, every one of the 50 study subjects with a cold was found to be improved 2 weeks after taking ginger pills. The authors concluded that ginger pills cure colds. What is the major flaw in this study?
The major problem is that there is no control/comparison group. These subjects most likely would have improved within two weeks had they received no treatments whatsoever (the flu usually just takes a few days to run its course).

b. (6 points) Let H be the event that the Democrats win the majority of the seats in the House of Representatives, and let S be the event that the Democrats win the majority of the seats in the Senate. Let P(H) = 0.5, P(S) = 0.6, and P(H or S) = 0.7. Are H and S independent?
Solution: No, since P(H and S) = P(H) + P(S) - P(H or S) = 0.5 + 0.6 - 0.7 = 0.4, which is different from P(H) * P(S) = 0.3.
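The independence check in part (b) combines the addition rule with the product rule, and it can be written as a tiny function. This is an illustrative sketch only; `check_independence` is a hypothetical helper name, and a small tolerance is used because the probabilities are floating-point numbers.

```python
def check_independence(p_a, p_b, p_a_or_b, tol=1e-9):
    """Return (P(A and B), P(A)P(B), independent?) from the addition rule.

    P(A and B) = P(A) + P(B) - P(A or B); A and B are independent
    exactly when P(A and B) equals P(A) * P(B).
    """
    p_a_and_b = p_a + p_b - p_a_or_b
    product = p_a * p_b
    return p_a_and_b, product, abs(p_a_and_b - product) <= tol


# Part (b): P(H) = 0.5, P(S) = 0.6, P(H or S) = 0.7
joint, product, independent = check_independence(0.5, 0.6, 0.7)
print(round(joint, 3), round(product, 3), independent)
```

For comparison, the same function reports independence when P(H or S) is 0.8, because then P(H and S) = 0.3 = P(H)P(S).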
c. (5 points) The sensitivity for a diagnostic test, P(positive test | disease), is 0.85 and the specificity of the same test, P(negative test | no disease), is also 0.85. Are the two events, (A = having the disease) and (B = receiving a positive test), independent? Show your work.
No, these events are not independent, they are dependent, since:
P(B | A) = 0.85 ≠ P(B | A^C) = 0.15

d. (6 points) It is known that 30% of young girls' favorite color is blue while 70% of young boys' favorite color is blue (you can also assume the population is split evenly into 50% boys and 50% girls). Are the two events (being a boy) and (favorite color is blue) independent?
No, these events are not independent. The simplest way to show this is to show that: P(blue | boy) ≠ P(blue | girl)

e. (6 points) Suppose that A and B are two disjoint events within the same sample space. In addition, let P(A) = 1/8 and P(B) = 1/4. Are events A and B independent? Explain or show your calculations.
Events A and B must be dependent since they are disjoint. Since these events are disjoint, we know there is no overlap or intersection, so P(A and B) = 0. Thus:
P(A and B) = 0 ≠ 1/32 = (1/8)(1/4) = P(A)P(B)

f. (6 points) In 1990 a research organization sent questionnaires to all of the approximately 15,000 high school systems in the United States. These questionnaires asked about computer usage in the school system. As many as 3,600 school systems returned answers. Of these 3,600, 60% indicated that some of their students used computers. In a speech shortly thereafter, an authority on the use of computers in high school education cited this study as evidence that "students in 60% of the high school systems in the United States use computers during their high school careers." Do you regard 60% as a trustworthy estimate of the proportion of school systems providing computer access in 1990? In two sentences or fewer, explain your answer.
Since only 3600/15000 = 24% of the schools responded, this allows for the potential of nonresponse bias. It could be that the schools that chose to respond used computers more than those schools that chose not to respond.

g. (6 points) A company in Hawaii builds bridges for married couples to walk over during their weddings. There are 3 islands in Hawaii that each have the same mean and variance of the husbands' weights and the wives' weights. However, the relationship of weights within couples is different on the 3 islands: in Inde, the weights within couples are independent; in Posi, they are positively correlated; and in Nega, the weights are negatively correlated. On which island should the company build the strongest bridge? Defend your answer in 2 sentences or less.
Solution: since the variability of the sum of weights will be higher if the weights within a couple are positively correlated, there is greater potential (chance) for a heavier couple. Thus, the strongest bridge should be built on the Island of Posi.

11. (35 points total) A researcher is investigating variables that might be associated with death rates in the US states. He examined data from 2008 for each of the 50 states plus Washington, D.C. The data included information on the following variables:
deathrate: The annual death rate per one million inhabitants
smokers: The percent of inhabitants who smoke heavily, in percentage points
college: The percent of inhabitants that have a bachelor's degree, in percentage points
As part of his investigation, he ran the following multiple regression model:
deathrate = β0 + β1(smokers) + β2(college) + ε
This model was fit to the data using the method of least squares. The following results were obtained from statistical software:

a. (4 points) What is the estimated standard deviation of the residuals?
s_e = std. error of the estimate = 7.313
b. (6 points) Suppose we wish to test the hypotheses H0: β1 = β2 = 0 versus Ha: at least one of the βj is not 0. What is the value of the appropriate test statistic, the p-value, and conclusion to this test?
F = 1.4
p-value =
Since p-value < 0.05, we can reject the null hypothesis. There is evidence that either smokers or college (or both) is an important predictor of death rate among the 50 states and DC.

c. (6 points) What is the interpretation of the value for b1, the estimated coefficient for the variable smokers?
For every additional percentage point of smokers in a state, we expect a 17.47 per million people per year increase in the death rate in that state, holding the percentage of college graduates in that state constant.

d. (7 points) Calculate the 95% confidence interval for β1, the coefficient for the variable smokers.
b1 ± t* se(b1) = 17.47 ± 2.021(4.500) = (8.38, 26.56)
Note: t* = 2.021 is the critical value from a t-distribution with df = n - p - 1 = 51 - 3 = 48 that puts 2.5% in each tail. We rounded down and used df = 40 in the table.

e. (6 points) Briefly comment on the residual diagnostic plot for this model shown below. Please be specific and limit your response to 3 sentences or bullet points.
We can comment on two assumptions with this graph:
1) Constant Variance: since there is no fanning out around the y = 0 line (that is, more vertical spread on one side of the plot compared to the other), this appears to be a safe assumption.
2) Linearity: since there is no curvature in the scatterplot of points, there is no evidence of non-linearity. This assumption also appears to be safe.
Note: if you are REALLY good, you could try to make an argument that the residuals are normally distributed as well. This can be seen by the fact that most of the points in the vertical direction are close to zero, and they tail off both above and below this middle (there are fewer and fewer observations as you go further away from the y = 0 line in both directions, up and down).
Another researcher, using the same data, ran the following simple linear regression model:
deathrate = β0 + β1(college) + ε
The following results were obtained from statistical software:

f. (6 points) The second researcher concluded that because the coefficient for the variable college was negative in his results, spending additional money on education to have more college graduates would decrease the death rate in his state. This researcher therefore recommended more money be spent on education. The first researcher concluded that because the coefficient for the college variable was positive in his results, spending additional money on students would increase the death rate. This researcher therefore recommended less money be spent on education. Why are these two conclusions different even though the researchers used the same data? Explain using a few concise sentences.
This is not surprising, actually. Most likely college is correlated with smokers. Since the first model included smokers, college no longer had predictive ability, and had a slight positive relationship with death rate when controlling for smokers. Without adjusting for smokers (since it's not in the second model), college has a strong negative relationship with death rate.

12. (20 points total) It is known that 20% of all Harvard students are varsity athletes. 50% of varsity athletes eat breakfast on any particular weekday, while only 25% of all other Harvard students eat breakfast on any particular weekday. Define the events:
A: the event that a student is a varsity athlete
B: the event that a student eats breakfast on a particular weekday

a. (5 points) Are the events A and B independent? Please briefly justify.
We know P(B | A) = 0.50, and P(B | A^C) = 0.25. Since these are not equal, we know that they are not independent; they are dependent.

b. (5 points) Are the events A and B disjoint? Please briefly justify.
P(A and B) = P(B | A)*P(A) = 0.5*0.2 = 0.1. Since the probability of their intersection is not zero, they are NOT disjoint.
c. (5 points) Find the overall proportion of students that eat breakfast on any particular weekday.
P(B) = P(A and B) + P(A^C and B) = P(B | A)*P(A) + P(B | A^C)*P(A^C) = 0.5*0.2 + 0.25*0.8 = 0.30

d. (5 points) Given a student is eating breakfast on a particular weekday, what is the probability that that student is a varsity athlete?
P(A | B) = P(A and B) / P(B) = 0.10/0.30 = 0.333

e. (4 points) In actuality the non-varsity-athlete students are comprised of two further subgroups: 30% of them are club athletes and 70% are nonathletes [so there are actually 3 distinct groups in the Harvard student population: varsity athletes, club athletes, and nonathletes]. Of the club athletes, 40% eat breakfast. Let NA denote non-athletes, CA denote club athletes, and VA denote varsity athletes.
i) P(B | NA) = P(B and NA)/P(NA) = 0.104/(0.8*0.7) = 0.186
Since P(B and NA) = P(B) - P(B and VA) - P(B and CA) = 0.30 - 0.10 - 0.096 = 0.104
Since P(B and CA) = P(A^C)*P(CA | A^C)*P(B | CA and A^C) = 0.80*0.30*0.40 = 0.096
ii) P(NA | B) = P(NA and B)/P(B) = 0.104/0.30 = 0.347

13. (55 points total) A study was conducted to determine the association between the maximum distance at which a highway sign can be read (in feet) and the age of the driver (in years). Forty drivers of various ages were studied. The summary statistics for distance and age are shown below in a table from Stata:

a. (8 points) The correlation coefficient between distance and age in this sample is r = . Calculate a and b of the least-squares regression equation that would predict the distance at which a highway sign can be read given the age of the driver.
b1 = r(s_y/s_x) =
b0 = ȳ - b1 x̄ = - (-3.863)*(46.1) =

b. (10 points) The standard error of b was calculated to be from SPSS. Is age a significant predictor of distance in this linear model? Conduct this statistical test of H0: β = 0 using α = 0.05. Be sure to include your hypotheses, test statistic, degrees of freedom if appropriate, either the p-value or critical value, and your conclusion in terms of the problem.
H0: β1 = 0
HA: β1 ≠ 0
t = (b1 - 0)/SE(b1) = -4.2
For this t-test for regression, we have df = n - p - 1 = 40 - 1 - 1 = 38. We round down to df = 30 in the t-table. Our p-value is P(t < -4.2) + P(t > 4.2) = 2 × P(t < -4.2) [since this is a two-sided test]. With df = 30, we see that the largest t* in the t-table is 3.385, and our t-statistic is farther out in the tail than that, so P(t < -4.2) < 0.001. Thus our p-value < 2(0.001) = 0.002.
Since our p-value < α = 0.05, we can reject the null hypothesis and conclude that the distance at which someone is able to read a sign while driving is associated with the age of the driver.

c. (4 points) What is the predicted distance that a sign can be read for someone who is 40 years old?
ŷ = b0 + b1 x = b0 + b1(40)

d. (6 points) What is R² for this regression model? What is the interpretation of R² here?
R² = r² = 0.3185
This means that 31.85% of the variability to begin with in distance (the response) can be explained by using age as a predictor in this linear model. The variance of the residuals in this model is 31.85% less than the overall variance of y (distance) ignoring x (age).

The investigators also decided to look at whether wearing glasses had an effect on the distance at which a driver could read a sign. Below is the binary-predictor regression output, labeled as Model A, of the distance at which someone was able to read the sign predicted from whether or not that person wore glasses (which has value 1 for those wearing glasses or contact lenses, 0 otherwise):
Model A:

e. (3 points) What is the reference group in this model?
The reference group is the group for which the binary predictor variable (glasses) takes the value zero. That means the reference group is the group of people who did not wear glasses or contact lenses during the test.
f. (4 points) What is the predicted distance that a sign can be read for someone who wears glasses based on this model?

$\hat{y} = b_0 + b_1 x = b_0 + b_1(1)$

Below is the Stata output of a multiple regression, labeled as Model B, of the distance at which someone was able to read the sign predicted from age and glasses (again, which has value 1 for individuals wearing glasses or contact lenses, 0 otherwise):

Model B:

g. (5 points) What is the interpretation of the value in this regression model?

This value represents the estimated difference in distance between people who wear glasses and people who do not, adjusting for age. In essence, if you have two people of the same age, one with glasses and the other without, the person who wears glasses will need to be about 5.68 feet closer to the sign in order to read it than the person without glasses.

h. (8 points) Compare the results of the two regressions, Model A and Model B, above. Specifically mention any signs or significance that are different between the two models. Why do you suspect this is the case?

When doing this, we see that glasses goes from being a significant predictor (p = 0.09) to a clearly insignificant one (p = 0.453). This can be explained by the fact that age and glasses are themselves correlated (older people wear glasses more often), and if you adjust for age in the model, glasses no longer have the apparent effect that they did in Model A. Age was confounding the significant relationship between glasses and distance that we saw in Model A.
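To make the reference-group idea in Model A concrete: with a single 0/1 predictor, the least-squares intercept equals the mean response of the reference group and the slope equals the difference between the two group means. The sketch below checks this with made-up distances; the data and the function name are hypothetical and are not the study's values.

```python
from statistics import mean


def fit_binary_predictor(y, x):
    """Least-squares intercept b0 and slope b1 for a single predictor x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical data: 1 = wears glasses or contacts, 0 = does not.
glasses = [0, 0, 0, 1, 1, 1]
distance = [520, 480, 500, 400, 430, 370]  # feet, invented for illustration

b0, b1 = fit_binary_predictor(distance, glasses)
mean_without = mean([d for d, g in zip(distance, glasses) if g == 0])
mean_with = mean([d for d, g in zip(distance, glasses) if g == 1])

print(b0, mean_without)            # intercept vs. reference-group mean
print(b0 + b1 * 1, mean_with)      # prediction for glasses = 1 vs. that group's mean
```

This is why the prediction in part (f) is simply the fitted line evaluated at x = 1.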
i. (7 points) Above are the residual vs. fitted scatterplot and histogram of the residuals for the multiple regression model (Model B) above. Use these plots to comment on whether the assumptions for this model seem to be valid. Be specific.

We can check 3 assumptions with these plots:
(i) Normally distributed residuals: this seems to be OK based on the histogram to the right. The bars follow the general bell shape, but there may be evidence of a bimodal distribution.
(ii) Constant variance of the residuals: this assumption seems perfectly appropriate. In the residual scatterplot to the left, the spread of the points in the vertical direction seems pretty consistent no matter where you are along the x-axis.
(iii) Residuals are centered at zero (for any values of the X's): this assumption seems just fine. In the scatterplot on the left, the points don't show any curvature and appear to be centered at the zero line no matter where you are along the x-axis.
*Note: we cannot check the independence assumption based on these graphs.

14. (5 points) As part of a study on student loan debt, a national agency that underwrites student loans is examining the differences in student loan debt for undergraduate students. One question the agency would like to address specifically is whether the mean undergraduate debt of Hispanic students graduating in 2009 is less than the mean undergraduate debt of Asian-American students graduating in 2009. To conduct the study, a random sample of 92 Hispanic students and a random sample of 110 Asian-American students who completed an undergraduate degree in 2009 were taken. The undergraduate debt incurred for financing college for each sampled student was collected. Let $\mu_H$ denote the population average student loan debt for Hispanic students, and $\mu_A$ the population average student loan debt for Asian-American students. Using the summary statistics below, test the hypotheses $H_0: \mu_H = \mu_A$ versus $H_a: \mu_H \neq \mu_A$. Clearly interpret your results.

Group            n    mean   Std. Dev.
Hispanic
Asian-American
Total
$H_0: \mu_H = \mu_A$
$H_A: \mu_H \neq \mu_A$

$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$

df = min(n_1 - 1, n_2 - 1) = 91

We can get the p-value from the t-table (rounding down to df = 80), and we see that it is greater than 2(0.05) = 0.10. Since our p-value is greater than 0.05, we are unable to reject the null hypothesis. There is not enough evidence to say that the average student loan debts for Hispanic and Asian-American students are different. The true averages in the population may actually be the same.

15. (33 points total) An investigator is trying to determine what factors are important in determining the graduation rate at US colleges. She collects a random sample of 53 four-year colleges, and records three variables:

GradRate: the graduation rate for the class of 2007 (as a percentage of all entering students that were full-time, in percentage points: 0 to 100)
Tuition: the tuition (in thousands of dollars)
SATMath: the median SAT math score for entering freshmen

Below is SPSS's regression output of predicting GradRate based on Tuition and median SATMath scores.
a. (5 points) Based on the above model, what is the predicted graduation rate for a college with tuition of $35 thousand and median math SAT score of 750 (aka, Harvard)?

$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 = b_0 + b_1(\text{tuition}) + 0.184(\text{satmath}) = b_0 + b_1(35) + 0.184(750)$

b. (5 points) In words, what is the interpretation of the coefficient for SATMath (which has value 0.184) in the above table?

This is the estimated change in graduation rate for every extra point in the median SAT math score when holding tuition constant (that's the key: since this is a multiple regression, this is the effect of SAT math on graduation rate while adjusting for tuition). If we compared schools that had the same tuition and one of the schools had a median SAT math score 10 points higher than the other, we'd expect that school to have a graduation rate about 1.8 percentage points higher.

c. (4 points) What is the proportion of total variability in graduation rate that can be explained by this model?

This is simply $R^2$, the ratio of the model (regression) sum of squares to the total sum of squares.

d. (6 points) Perform a single hypothesis test to determine whether any of the variables are associated with graduation rate. Be sure to state your hypotheses, test statistic, degrees of freedom (if applicable), p-value, and conclusion.

$H_0: \beta_{tuition} = \beta_{satmath} = 0$
$H_A:$ Either $\beta_{tuition} \neq 0$ or $\beta_{satmath} \neq 0$

This can be tested via an F-test. The F-statistic is reported by SPSS, with 2 numerator degrees of freedom and n - k - 1 denominator degrees of freedom. Since the resulting p-value < α = 0.05, we can reject the null hypothesis. It looks like at least one of our predictors is significantly associated with graduation rate.

e. (7 points) Perform a hypothesis test to determine whether specifically tuition is associated with graduation rate in the above model. Be sure to state your hypotheses, test statistic, degrees of freedom (if applicable), p-value, and conclusion.

$H_0: \beta_{tuition} = 0$
$H_A: \beta_{tuition} \neq 0$

$t = 1.14$

This t-statistic has df = n - k - 1. The p-value estimate is > 0.10 based on the table.
Since this p-value > α = 0.05, we cannot reject the null hypothesis. Tuition may not truly be associated with graduation rate (when also adjusting for median SAT math score).

f. (6 points) The dean at a college sees these results and suggests to his board of trustees that they raise their tuition in order to improve their graduation rate. What is the major mistake the dean is making in concluding from these data that raising their tuition will lead to a higher graduation rate?

The major mistake he is making is in thinking this is a causal relationship. Since these data come from a survey and not an experiment, there is no guarantee that raising tuition will lead to an improvement in graduation rate. (In fact, I would argue that it's the best schools, the ones with a high graduation rate, that are able to charge a high tuition because of their reputation: a reverse causation.)

16. (1 points) A survey of male and female university students asked which popular musical artist they preferred. The survey focused on Lady Gaga and Justin Bieber but allowed for other artists as well. Some of the values from the two-way table are missing, but you can determine what they are and answer the given questions.

Artist    Lady Gaga   Justin Bieber   Other   Total
Male      a                           100
Female
Total

a) What is the value of a?

a = (Lady Gaga total) - (Lady Gaga and Female) = 70 - 50 = 20

Here is the whole completed table:

Artist    Lady Gaga   Justin Bieber   Other   Total
Male      20          30              100     150
Female    50          60              40      150
Total     70          90              140     300

b) What is the probability that a randomly chosen student will prefer Justin Bieber?

P(Bieber) = 90/300 = 0.30

c) Given a student prefers Justin Bieber, what is the probability that they are female?

P(Female | Bieber) = (#Female and Bieber)/(#Bieber) = 60/90 = 0.667

d) Are gender and artist preference dependent or independent events? (Do not use a chi-square test.)

They are definitely dependent. In part (c) we showed that P(Female | Bieber) = 0.667. However, P(Female) = 150/300 = 0.50 overall. Since these probabilities are not equal, we can say that the two events are dependent.

17. (24 points total) Your younger sister and brother are strong believers in the tooth fairy. Whenever a baby tooth falls out, your sibling places it under his/her pillow before going to sleep, and in the morning the tooth fairy replaces it with cash. You observe this week that their baby teeth are close to falling out. Let A be the event that one of your sister's teeth will fall out today, and let B
be the event that one of your brother's teeth will fall out today. You estimate that P(A) = 0.3 and P(B) = 0.2. Assume that for each sibling at most one tooth will fall out.

a. (6 points) Assuming whether your brother's tooth falls out is independent of whether your sister's tooth falls out, the probability that neither falls out today is 0.56. Demonstrate with appropriate calculations why this is true.

P(A^C and B^C) = P(A^C)*P(B^C) (since independent) = (1 - 0.3)*(1 - 0.2) = (0.7)*(0.8) = 0.56

b. (6 points) Under the assumption of independence as in part (a), what is the probability that exactly one (i.e., not both) of your siblings' teeth falls out today?

Let the random variable X = the count of teeth that fall out. Then we want

P(X = 1) = 1 - [P(X = 0) + P(X = 2)] = 1 - [0.56 + 0.2*0.3] = 1 - [0.62] = 0.38

Or this can be solved by thinking of it as the union of two disjoint events:

P[(brother loses a tooth and sister does not) or (sister loses a tooth and brother does not)]
= P(brother loses and sister doesn't) + P(sister loses and brother doesn't)
= P(brother loses)*P(sister doesn't) + P(sister loses)*P(brother doesn't) (these multiply since independent)
= 0.2*(1 - 0.3) + 0.3*(1 - 0.2) = 0.2*0.7 + 0.3*0.8 = 0.38

c. (6 points) Describe a scenario involving your younger siblings where A and B are clearly not independent events. Be sure to state this scenario in context of this problem (do not just give the definition of dependence).

There are lots of answers here. One possibility is the two getting into a fight and punching each other (some days they do, some days they don't). Or maybe both eating something sticky (like peanut brittle). Or both deciding to play ice hockey without a helmet. The list goes on and on.

d. (6 points) The tooth fairy replaces a tooth with cash with probability 0.5 independently from child to child. On a given night, 10 children in a town have placed teeth that have fallen out under their pillows. What is the probability that at least 1 of these 10 children is visited by the tooth fairy?

Let the random variable X = # children visited by the tooth fairy. Hence, X ~ Bin(n = 10, π = 0.5). The question being asked is P(X ≥ 1). Then,

$P(X \geq 1) = 1 - P(X = 0) = 1 - (1 - \pi)^{10}$

Note, P(X = 0) is the probability that no child is visited by the tooth fairy, which will lead you to the same calculation.

18. (16 points total) With the popularity of traditional lotteries waning across the US, many states are turning to instant games, called scratch-off tickets, to lure new players and raise revenue. However, many critics are concerned that instant gratification scratch-off tickets are more likely to
contribute to gambling addiction and take particular advantage of the poor members of society. A survey of 100 randomly selected gamblers with below median incomes was conducted in the El Paso area of Texas to study the association between gambling addiction and the primary type of gambling (traditional state lottery versus scratch-off tickets). The results are given below.

Primary type of gambling   Diagnosed with a gambling addiction   No gambling addiction   Total
Scratch-off tickets        11                                    39                      50
Traditional lottery        2                                     48                      50
Total                      13                                    87                      100

a. (8 points) Is this significant evidence that the primary type of gambling affects the risk of a gambling addiction? Test at level α = 0.05 and include the null and alternative hypotheses, the test statistic, the rejection region, an estimate of the P-value, a statement of whether or not you reject the null hypothesis, and a sentence summarizing your conclusion.

There are two ways to do this problem: using a $\chi^2$ test or using a z-test comparing $\pi_1$ and $\pi_2$. The $\chi^2$ test method is given below.

H_0: A focus on scratch-off tickets and gambling addiction are independent
H_a: A focus on scratch-off tickets and gambling addiction are dependent

$\chi^2 = \sum \frac{(Obs - Exp)^2}{Exp} = \frac{(11 - 6.5)^2}{6.5} + \frac{(39 - 43.5)^2}{43.5} + \frac{(2 - 6.5)^2}{6.5} + \frac{(48 - 43.5)^2}{43.5} \approx 7.16$

Reject H_0 if $\chi^2 > \chi^2$(df = 1, 0.05) = 3.84. The p-value is less than 0.01 (from the $\chi^2$ table). So we should reject H_0. We can conclude that gambling primarily with scratch-off tickets is associated with a higher rate of gambling addiction.

b. (3 points) Find the difference in proportions of a gambling addiction comparing scratch-off ticket users to traditional lottery users.

$\hat{p}_1 = \frac{11}{50} = 0.22, \quad \hat{p}_2 = \frac{2}{50} = 0.04, \quad \hat{p}_1 - \hat{p}_2 = 0.18$

c. (5 points) Find the 95% confidence interval for the difference in gambling addiction for scratch-off ticket users vs. traditional lottery users.

95% Confidence Interval for a difference in proportions:

$(\hat{p}_1 - \hat{p}_2) \pm z^* \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}} = (0.22 - 0.04) \pm 1.96 \sqrt{\frac{0.22(0.78)}{50} + \frac{0.04(0.96)}{50}} = (0.053, 0.307)$
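The arithmetic in this problem can be reproduced with the standard library alone. The sketch below is one way to do it, assuming the 2x2 counts implied by the solution (11 and 2 addicted gamblers out of 50 in each group); the function names are hypothetical. The p-value line relies on the fact that a chi-square variable with 1 degree of freedom is a squared standard normal, so it is only valid for a 2x2 table.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value for a 2x2 table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # df = 1 only: P(chi2_1 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


def diff_proportion_ci(x1, n1, x2, n2, z_star=1.96):
    """Large-sample confidence interval for p1 - p2 (unpooled standard error)."""
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - z_star * se, diff + z_star * se


# Rows: scratch-off, traditional; columns: addiction, no addiction.
counts = [[11, 39], [2, 48]]
chi2, p = chi_square_2x2(counts)
low, high = diff_proportion_ci(11, 50, 2, 50)

print(round(chi2, 2))                 # 7.16
print(round(low, 3), round(high, 3))  # 0.053 0.307
```

The interval excludes zero, which agrees with rejecting independence at the 5% level.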
19. The mean length of stay in a hospital is useful for planning purposes. Suppose that the following is the distribution of the length of stay in a hospital after a minor operation.

Number of Days   2     3     4
Probability      0.2   0.3   0.5

a) What is the mean (expected value) length of stay?

$E(X) = \mu_X = \sum x \, P(X = x) = 2(0.2) + 3(0.3) + 4(0.5) = 3.3$

b) What is the variance of length of stay?

$Var(X) = \sigma_X^2 = \sum (x - \mu_X)^2 P(X = x) = (2 - 3.3)^2(0.2) + (3 - 3.3)^2(0.3) + (4 - 3.3)^2(0.5) = 0.61$
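The two sums above translate directly into code. This is a minimal sketch using the length-of-stay distribution from the problem; the function name is hypothetical.

```python
def mean_and_variance(distribution):
    """Mean and variance of a discrete random variable.

    `distribution` maps each possible value x to its probability P(X = x).
    """
    mu = sum(x * p for x, p in distribution.items())
    variance = sum((x - mu) ** 2 * p for x, p in distribution.items())
    return mu, variance


# Length of stay (days) after a minor operation, from the problem.
length_of_stay = {2: 0.2, 3: 0.3, 4: 0.5}
mu, variance = mean_and_variance(length_of_stay)

print(round(mu, 2), round(variance, 2))  # 3.3 0.61
```

The standard deviation, if needed, is the square root of the variance returned here.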
Part 2: Analysis of Relationship Between Two Variables Linear Regression Linear correlation Significance Tests Multiple regression Linear Regression Y = a X + b Dependent Variable Independent Variable
Chapter 7 Section 7.1: Inference for the Mean of a Population
Chapter 7 Section 7.1: Inference for the Mean of a Population Now let s look at a similar situation Take an SRS of size n Normal Population : N(, ). Both and are unknown parameters. Unlike what we used
Psychology 60 Fall 2013 Practice Exam Actual Exam: Next Monday. Good luck!
Psychology 60 Fall 2013 Practice Exam Actual Exam: Next Monday. Good luck! Name: 1. The basic idea behind hypothesis testing: A. is important only if you want to compare two populations. B. depends on
Outline. Topic 4 - Analysis of Variance Approach to Regression. Partitioning Sums of Squares. Total Sum of Squares. Partitioning sums of squares
Topic 4 - Analysis of Variance Approach to Regression Outline Partitioning sums of squares Degrees of freedom Expected mean squares General linear test - Fall 2013 R 2 and the coefficient of correlation
Data Mining Techniques Chapter 5: The Lure of Statistics: Data Mining Using Familiar Tools
Data Mining Techniques Chapter 5: The Lure of Statistics: Data Mining Using Familiar Tools Occam s razor.......................................................... 2 A look at data I.........................................................
Section 1: Simple Linear Regression
Section 1: Simple Linear Regression Carlos M. Carvalho The University of Texas McCombs School of Business http://faculty.mccombs.utexas.edu/carlos.carvalho/teaching/ 1 Regression: General Introduction
Chapter 7 Notes - Inference for Single Samples. You know already for a large sample, you can invoke the CLT so:
Chapter 7 Notes - Inference for Single Samples You know already for a large sample, you can invoke the CLT so: X N(µ, ). Also for a large sample, you can replace an unknown σ by s. You know how to do a
Solution Let us regress percentage of games versus total payroll.
Assignment 3, MATH 2560, Due November 16th Question 1: all graphs and calculations have to be done using the computer The following table gives the 1999 payroll (rounded to the nearest million dolars)
Experimental Design. Power and Sample Size Determination. Proportions. Proportions. Confidence Interval for p. The Binomial Test
Experimental Design Power and Sample Size Determination Bret Hanlon and Bret Larget Department of Statistics University of Wisconsin Madison November 3 8, 2011 To this point in the semester, we have largely
Chapter 1: Looking at Data Section 1.1: Displaying Distributions with Graphs
Types of Variables Chapter 1: Looking at Data Section 1.1: Displaying Distributions with Graphs Quantitative (numerical)variables: take numerical values for which arithmetic operations make sense (addition/averaging)
Independent t- Test (Comparing Two Means)
Independent t- Test (Comparing Two Means) The objectives of this lesson are to learn: the definition/purpose of independent t-test when to use the independent t-test the use of SPSS to complete an independent
II. DISTRIBUTIONS distribution normal distribution. standard scores
Appendix D Basic Measurement And Statistics The following information was developed by Steven Rothke, PhD, Department of Psychology, Rehabilitation Institute of Chicago (RIC) and expanded by Mary F. Schmidt,
Thursday, November 13: 6.1 Discrete Random Variables
Thursday, November 13: 6.1 Discrete Random Variables Read 347 350 What is a random variable? Give some examples. What is a probability distribution? What is a discrete random variable? Give some examples.
Comparing Means in Two Populations
Comparing Means in Two Populations Overview The previous section discussed hypothesis testing when sampling from a single population (either a single mean or two means from the same population). Now we
Lecture 14. Chapter 7: Probability. Rule 1: Rule 2: Rule 3: Nancy Pfenning Stats 1000
Lecture 4 Nancy Pfenning Stats 000 Chapter 7: Probability Last time we established some basic definitions and rules of probability: Rule : P (A C ) = P (A). Rule 2: In general, the probability of one event
Mind on Statistics. Chapter 15
Mind on Statistics Chapter 15 Section 15.1 1. A student survey was done to study the relationship between class standing (freshman, sophomore, junior, or senior) and major subject (English, Biology, French,
CHAPTER 13 SIMPLE LINEAR REGRESSION. Opening Example. Simple Regression. Linear Regression
Opening Example CHAPTER 13 SIMPLE LINEAR REGREION SIMPLE LINEAR REGREION! Simple Regression! Linear Regression Simple Regression Definition A regression model is a mathematical equation that descries the
Tests of Hypotheses Using Statistics
Tests of Hypotheses Using Statistics Adam Massey and Steven J. Miller Mathematics Department Brown University Providence, RI 0292 Abstract We present the various methods of hypothesis testing that one
Descriptive Statistics. Purpose of descriptive statistics Frequency distributions Measures of central tendency Measures of dispersion
Descriptive Statistics Purpose of descriptive statistics Frequency distributions Measures of central tendency Measures of dispersion Statistics as a Tool for LIS Research Importance of statistics in research
Statistics 104 Final Project A Culture of Debt: A Study of Credit Card Spending in America TF: Kevin Rader Anonymous Students: LD, MH, IW, MY
Statistics 104 Final Project A Culture of Debt: A Study of Credit Card Spending in America TF: Kevin Rader Anonymous Students: LD, MH, IW, MY ABSTRACT: This project attempted to determine the relationship
