

Statistics II Final Exam - January 2012

Use the University stationery to give your answers to the following questions. Do not forget to write down your name and class group on each page. Indicate clearly the beginning and end of each question.

Exercises

1. (2 points) In a certain game, a good player is assumed to be one who scores more than 4 points per match. You have been following player A, who scored an average of 5 points per match in a large series of 100 matches, with a sample (quasi)variance of 3.94.

a) (0.5 points) Would you consider player A to be a good player at a 95 % confidence level?

b) (0.5 points) Suppose you also observed player B, whose p-value corresponding to the good-player test is 0.002. According to this evidence, whom would you consider a better player, A or B? Why?

c) (0.5 points) You have used Statgraphics to carry out a hypothesis test on the data for player A, with the following results:

    Hypothesis Tests
    Sample mean = 5,0
    Sample standard deviation = 1,98479
    Sample size = 100
    99,0% confidence interval for mean: 5,0 +/- 0,521287 [4,47871;5,52129]
    Null Hypothesis: mean = 4,5
    Alternative: not equal
    Computed t statistic = 2,51916
    P-Value = 0,0133645

Indicate the null and alternative hypotheses for this test. Would you reject the null hypothesis for a significance level of 1 %? Why?

d) (0.5 points) Suppose that the sample average and variance values for player A have been obtained from a small series of 5 matches (instead of 100). Can you reach any meaningful conclusion for the test about the goodness of A?
1) Yes, without any further assumptions.
2) Yes, but we need to make some distributional assumption about the scores.
3) No.

Let $X_i$ denote the points scored by the player in the $i$-th match, and $\bar{X} = (X_1 + \cdots + X_n)/n$.

a) We have to test the null hypothesis

$$H_0: \mu \le 4 \quad \text{vs.} \quad H_1: \mu > 4,$$

where $\mu$ denotes player A's average score per match. As $n = 100$, from the Central Limit Theorem we have that, approximately, the test statistic satisfies

$$Z = \frac{\bar{X} - \mu_0}{S/\sqrt{n}} \sim N(0, 1),$$

and we reject $H_0$ if $z_{obs} > z_{0.05} = 1.65$. In this case

$$z_{obs} = \frac{5 - 4}{\sqrt{3.94/100}} = 5.038,$$

so we reject the null hypothesis and conclude that player A can be considered a good one.

b) To compare the scores of the two players, we consider the corresponding p-values. For player A,

$$\text{p-value} = \Pr(Z > z_{obs}) = \Pr(Z > 5.038) = 2.35 \times 10^{-7} \ll 0.002.$$

The probability of obtaining scores like those of A under the null hypothesis is much lower than the corresponding probability for B, so the evidence against the null hypothesis is stronger for A, and player A seems to be much better than B.

c) The test carried out in this case is

$$H_0: \mu = 4.5 \quad \text{vs.} \quad H_1: \mu \ne 4.5.$$

As the p-value for this test is 0.0134, we will reject the null hypothesis for all significance levels larger than this value. In particular, for a significance level of 1 % we would not reject $H_0$, but we would reject it for 5 %.

d) Option 2. If the score per match and its corresponding sample variance were obtained observing only $n = 5$ matches, this sample size would not be enough to apply the Central Limit Theorem and we could not reach a meaningful decision, unless we were to compensate for the lack of information in such a small sample with an assumption on the probability distribution of the scores, such as considering that they follow a Normal distribution.

2. (2 points) You are conducting a study on the seasonal variations in the sales of shellfish in one of Madrid's districts. You have collected sales data from 20 fish markets in the district, corresponding to two days in two different periods of interest: December 20th (Christmastime), and April 17th (Spring); both days are Wednesdays.
The following table presents a summary of the shellfish sales income on each of the days, as well as the value for the difference in sales income between both periods:

                                December     April        December - April
    Average sales               300 euros    180 euros    120 euros
    Quasi standard deviation    44 euros     29 euros     44 euros

Answer the following questions, indicating in each case any sample or population assumptions that you might need to make:

a) (1 point) Compute two confidence intervals for the average of the sales income in each of the two periods, for a confidence level of 99 %.

b) (1 point) For a significance level of 5 %, conduct a hypothesis test to determine if the average daily sales in December are at least 100 euros greater than the sales in April. Indicate the null and alternative hypotheses and justify your conclusion.

We define the variables of interest as $X$ = shellfish sales on December 20 and $Y$ = shellfish sales on April 17. As we only have information for the 20 fish markets in the district, we cannot assume that we have a large sample; as a consequence, we will need to assume that the population follows a normal distribution. We will also assume that the observations corresponding to $X$ and $Y$ for the 20 markets are simple random samples. These samples $(X, Y)$ are paired, as they have been obtained for the same markets on two different dates.
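Under the assumptions just stated (normal population, paired samples), both parts of this exercise reduce to Student-t computations on the summary statistics in the table. The following is a minimal sketch of that arithmetic, not the exam's official method: the helper names `t_interval` and `paired_t_statistic` are illustrative, and the two quantiles are copied from standard t tables (assumed values 2.861 and 1.729 for 19 degrees of freedom) rather than computed.

```python
from math import sqrt

# Summary statistics from the table (euros); n = 20 fish markets.
n = 20
mean_dec, sd_dec = 300.0, 44.0      # December 20th
mean_apr, sd_apr = 180.0, 29.0      # April 17th
mean_diff, sd_diff = 120.0, 44.0    # paired differences D = X - Y

# Student-t quantiles with n - 1 = 19 degrees of freedom.
# ASSUMPTION: values taken from a t table, not computed here.
T_19_0005 = 2.861   # upper 0.5 % point, for the 99 % intervals
T_19_005 = 1.729    # upper 5 % point, for the one-sided 5 % test


def t_interval(mean, sd, size, t_quantile):
    """Two-sided interval: mean +/- t_quantile * sd / sqrt(size)."""
    half_width = t_quantile * sd / sqrt(size)
    return mean - half_width, mean + half_width


def paired_t_statistic(diff_mean, diff_sd, size, d0):
    """t statistic for H0: mu_D <= d0 against H1: mu_D > d0."""
    return (diff_mean - d0) / (diff_sd / sqrt(size))


ci_dec = t_interval(mean_dec, sd_dec, n, T_19_0005)
ci_apr = t_interval(mean_apr, sd_apr, n, T_19_0005)
t_obs = paired_t_statistic(mean_diff, sd_diff, n, d0=100.0)
reject_h0 = t_obs > T_19_005   # one-sided rejection region {t > 1.729}

print("99% CI December:", ci_dec)
print("99% CI April:   ", ci_apr)
print("t statistic:", t_obs, "reject H0:", reject_h0)
```

Note that the test uses only the mean and quasi standard deviation of the differences, which is what makes it a paired test; the separate December and April standard deviations enter only the two confidence intervals.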

a) The confidence intervals are given by

$$CI_{\mu_X}(99\,\%) = \bar{x} \pm t_{19,0.005} \frac{s_x}{\sqrt{20}} = 300 \pm 2.86\,\frac{44}{\sqrt{20}} = (271.85;\ 328.15) \text{ in euros};$$

$$CI_{\mu_Y}(99\,\%) = \bar{y} \pm t_{19,0.005} \frac{s_y}{\sqrt{20}} = 180 \pm 2.86\,\frac{29}{\sqrt{20}} = (161.45;\ 198.55) \text{ in euros}.$$

b) The null and alternative hypotheses for the test are:

$$H_0: \mu_X - \mu_Y \le 100 \quad \text{vs.} \quad H_1: \mu_X - \mu_Y > 100,$$

and if we define $D = X - Y$,

$$H_0: \mu_D \le 100 \quad \text{vs.} \quad H_1: \mu_D > 100.$$

The value of the test statistic is

$$t = \frac{\bar{d} - d_0}{s_d/\sqrt{n}} = \frac{120 - 100}{44/\sqrt{20}} = 2.033.$$

As this statistic follows a Student-t distribution with $n - 1$ degrees of freedom, the rejection region is defined as those samples that have a value of the statistic larger than the quantile of the Student-t, $t_{19,0.05} = 1.73$, that is, $CR = \{t > 1.73\}$. As this condition is satisfied for our samples, we reject the null hypothesis for a significance level of 5 %, that is, we accept that the average difference in sales income between December and April in this district is larger than 100 euros.

3. (3 points) The sales department of a clothing company is conducting a study on the company's catalog sales. Their goal is to determine if there is a meaningful relationship between the number of phone lines open to receive orders ("Phone lines", L) and the volume of catalog sales ("Sales", S) (measured in hundreds of euros). The department has the following data on the values of these variables for the last 20 days:

$$\sum_{i=1}^{20} l_i = 599, \quad \sum_{i=1}^{20} s_i = 2835, \quad \sum_{i=1}^{20} l_i s_i = 92000, \quad \sum_{i=1}^{20} l_i^2 = 19195, \quad \sum_{i=1}^{20} s_i^2 = 458657, \quad \sum_{i=1}^{20} e_i^2 = 16823.72,$$

where $e_i$ denotes the residuals of the regression model explaining the variable S as a function of L.

a) (0.5 points) Compute the ANOVA table for S.

b) (0.5 points) Test if the variable "Phone lines" has no impact on the values of the variable "Sales", for a significance level of 5 %.

c) (0.5 points) Compute the value of the coefficient of determination and interpret it.
d) (0.5 points) Obtain the least-squares estimates for the parameters of the regression line explaining the variable "Sales" (S) as a function of the values of the variable "Phone lines" (L).

e) (0.5 points) Obtain an estimate for the sales forecast corresponding to a day in which you have 12 open phone lines. Compute also a confidence interval at a 95 % level for this forecast.

f) (0.5 points) Additionally, you have information on the number of catalogs that have been distributed each day ("Number catalogs", C). You fit a multiple regression model including this new variable, and you obtain the following Statgraphics output:

    Multiple Regression - Sales
    Dependent variable: Sales
    Independent variables: Phone_lines, Number_catalogs

                                     Standard      T
    Parameter         Estimate       Error         Statistic    P-Value
    CONSTANT          -99,269        69,8328       -1,42152     0,1733
    Phone_lines       5,01165        1,03056       4,86301      0,0001
    Number_catalogs   0,00957155     0,00861747    1,11071      0,2822

Identify the values of the estimates for the parameters of the multiple linear regression model, and interpret the value of the coefficient of the variable "Phone lines" (L).

a) From the data we have been given we obtain the residual sum of squares, $SSR = 16823.72$, and also the total sum of squares,

$$SST = (n - 1) s_s^2 = \sum_{i=1}^{20} (s_i - \bar{s})^2 = \sum_{i=1}^{20} s_i^2 - 20\,\bar{s}^2 = \sum_{i=1}^{20} s_i^2 - \Big(\sum_{i=1}^{20} s_i\Big)^2 / 20 = 56795.75.$$

Based on this information, the ANOVA table is given by:

    Source       Sum of squares   D.F.   Mean Squares   F-ratio
    Model        39972.03         1      39972.03       42.767
    Residuals    16823.72         18     934.651
    Total        56795.75         19

b) From the information in the ANOVA table, and in particular from the value of the F-ratio, we conduct a significance test for the model with critical region given by

$$CR_{0.05} = \{F > F_{1,18;0.05}\} = \{F > 4.41\}.$$

As the value of the ratio is in the critical region, we reject the null hypothesis and we conclude that the value of the variable "Phone lines" is linearly related to that of the variable "Sales".

c) The coefficient of determination, with $SSE$ denoting the sum of squares explained by the model, is given by

$$R^2 = \frac{SSE}{SST} = \frac{39972.03}{56795.75} = 0.704.$$

The value of the variable "Phone lines" explains 70.4 % of the variability in the variable "Sales".

d) We compute first some required values:

$$\bar{l} = \sum_{i=1}^{20} l_i / 20 = 29.95, \qquad \bar{s} = \sum_{i=1}^{20} s_i / 20 = 141.75,$$

$$s_l^2 = \Big(\sum_{i=1}^{20} l_i^2 - 20\,\bar{l}^2\Big)/19 = 66.05, \qquad s_s^2 = \Big(\sum_{i=1}^{20} s_i^2 - 20\,\bar{s}^2\Big)/19 = 2989.25,$$

$$\mathrm{cov}(l, s) = \Big(\sum_{i=1}^{20} l_i s_i - 20\,\bar{l}\,\bar{s}\Big)/19 = 373.25.$$
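The quantities asked for in parts a) to e) all follow from the six sums given in the statement, so the arithmetic can be checked in a few lines. The sketch below is one possible way to organize it; the variable names are illustrative, the residual sum of squares is taken as given rather than recomputed, and the critical values 4.41 and 2.101 are copied from tables (assumptions, not computed here).

```python
from math import sqrt

# Sums given in the statement (n = 20 days; S is in hundreds of euros).
n = 20
sum_l, sum_s = 599.0, 2835.0
sum_ls, sum_l2, sum_s2 = 92000.0, 19195.0, 458657.0
ss_resid = 16823.72          # sum of squared residuals, as given

# Sample means and quasi-(co)variances (divisor n - 1).
mean_l = sum_l / n
mean_s = sum_s / n
var_l = (sum_l2 - n * mean_l**2) / (n - 1)
var_s = (sum_s2 - n * mean_s**2) / (n - 1)
cov_ls = (sum_ls - n * mean_l * mean_s) / (n - 1)

# ANOVA decomposition: total = model + residual.
ss_total = (n - 1) * var_s
ss_model = ss_total - ss_resid
ms_resid = ss_resid / (n - 2)          # residual variance s_R^2
f_ratio = (ss_model / 1) / ms_resid
r_squared = ss_model / ss_total

# ASSUMPTION: tabulated critical values for 5 % tests.
F_1_18_005 = 4.41     # upper 5 % point of F(1, 18)
T_18_0025 = 2.101     # upper 2.5 % point of Student-t(18)
reject_no_effect = f_ratio > F_1_18_005

# Least-squares line and forecast for l0 = 12 open phone lines.
slope = cov_ls / var_l
intercept = mean_s - slope * mean_l
l0 = 12.0
forecast = intercept + slope * l0
half_width = T_18_0025 * sqrt(
    ms_resid * (1 + 1 / n + (l0 - mean_l) ** 2 / ((n - 1) * var_l))
)
lower, upper = forecast - half_width, forecast + half_width

print("SST, SS model, F, R^2:", ss_total, ss_model, f_ratio, r_squared)
print("slope, intercept:", slope, intercept)
print("forecast and 95% interval:", forecast, (lower, upper))
```

Working with unrounded intermediate values gives interval limits that differ from hand calculations with rounded coefficients only in the second decimal place. The forecast and its limits are in hundreds of euros, like S itself.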

From these values we obtain

$$\hat\beta_1 = \frac{\mathrm{cov}(l, s)}{s_l^2} = 5.651, \qquad \hat\beta_0 = \bar{s} - \hat\beta_1 \bar{l} = -27.50,$$

and the regression model is $\hat{s} = -27.50 + 5.651\,l$. We also have that the residual variance is (see the ANOVA table)

$$s_R^2 = \sum_{i=1}^{20} e_i^2 / (n - 2) = 934.651.$$

e) The point estimate for the forecast corresponding to $l_0 = 12$ is

$$\hat{s}_0 = -27.50 + 5.651\,l_0 = 40.31.$$

To obtain the confidence interval we use the formula

$$CI_{0.05} = \hat{s}_0 \pm t_{18;0.025} \sqrt{s_R^2 \left(1 + \frac{1}{n} + \frac{(l_0 - \bar{l})^2}{(n - 1) s_l^2}\right)} = 40.31 \pm 2.101 \sqrt{934.651 \left(1 + \frac{1}{20} + \frac{(12 - 29.95)^2}{19 \times 66.05}\right)} = (-33.12;\ 113.74).$$

f) The multiple linear regression model of interest is

$$\hat{s}_i = \hat\beta_0 + \hat\beta_1 l_i + \hat\beta_2 c_i,$$

and the values of the parameters from the Statgraphics output are

$$\hat\beta_0 = -99.269, \qquad \hat\beta_1 = 5.01165, \qquad \hat\beta_2 = 0.00957155,$$

yielding the model $\hat{s}_i = -99.269 + 5.01165\,l_i + 0.00957155\,c_i$. If we increase the number of open lines by one unit, while keeping constant the value of the variable number of catalogs, the value of the sales increases by 501.165 euros on average (sales are measured in hundreds of euros).

Questions

1. (1 point) Determine if the following statements are true or false. Provide a brief justification for your answer.

a) (0.5 points) As a response to the current economic crisis, 15 countries have decided to apply a policy based on austerity measures, while another group of 15 countries have chosen to follow a policy based on the use of stimulus packages. You wish to use a statistical testing procedure to evaluate if the growth rates associated with each set of policies are significantly different. An appropriate hypothesis test is a two-sided test for paired samples.

b) (0.5 points) We are interested in studying if there is a significant difference between the salaries of men and women in the communications and services sectors. We have selected 100 companies in the communications sector and 100 companies in the services sector. For each company we collect information on a standardized indicator for the difference in salaries between men and women.
An appropriate hypothesis test is a two-sided test for independent samples.

a) FALSE. We have no information to think that the countries included in both samples can be paired in any meaningful way for this study. It would be more reasonable in this case to consider the samples as independent.

b) TRUE. As in the preceding case, we do not have any information that might indicate that the companies included in both samples have any relationship. Thus, it seems reasonable in this case to treat both samples as independent.

2. (1 point) For a simple linear regression model $y = \beta_0 + \beta_1 x + u$, determine if the following statements are true or false. Provide a brief justification for your answer.

a) (0.5 points) If the variance of the errors is equal to 0, the coefficient of determination is also equal to 0.

b) (0.5 points) For the estimated linear regression model $\hat{y}_i = 3 + 0.5 x_i$, each additional unit of variable X implies a decrease of 3 units in the value of variable Y.

a) FALSE. If the variance of the errors is equal to 0, then the coefficient of determination is equal to 1: in that case the residual sum of squares is $SSR = 0$ and

$$R^2 = \frac{SST - SSR}{SST} = \frac{SST}{SST} = 1.$$

b) FALSE. For each additional unit of X the variable Y has an increase equal to $\hat\beta_1$, that is, 0.5 units.

3. (1 point) Answer the following questions, using the information provided in the Statgraphics output.

    Simple Regression - Y vs. X
    Dependent variable: Y
    Independent variable: X
    Linear model: Y = a + b*x

    Coefficients
                 Least Squares   Standard    T
    Parameter    Estimate        Error       Statistic   P-Value
    Intercept    21,5885         2,46742     8,74945     0,0001
    Slope        -2,68469        0,838677    -3,20111    0,0150

    Analysis of Variance
    Source          Sum of Squares   Df   Mean Square   F-Ratio   P-Value
    Model           561,472          1    561,472       10,25     0,0150
    Residual        383,553          7    54,7933
    Total (Corr.)   945,025          8

    Correlation Coefficient = -0,770801
    R-squared = 59,4134 percent
    R-squared (adjusted for d.f.) = 53,6154 percent
    Standard Error of Est. = 7,40225
    Mean absolute error = 4,99915
    Durbin-Watson statistic = 2,71064 (P=0,8750)
    Lag 1 residual autocorrelation = -0,366548

a) (0.5 points) Specify the values of the estimates for the three parameters in the model.

b) (0.5 points) Is the independent variable significant to explain the values of the response variable? Why?

a) The estimated model is given by $\hat{y}_i = 21.5885 - 2.68469\,x_i$, with a residual variance $s_R^2$ equal to 54.7933 (from the ANOVA table).

b) To carry out this test we look at the p-value associated with the slope of the regression line, equal to 0.0150 (this same p-value is associated with the F-ratio in the ANOVA table). We conclude that for any significance level larger than this p-value ($\alpha > 0.0150$) we reject the null hypothesis that the slope is zero, and the independent variable x is significant to explain the values of the response variable y.
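The remark that the slope and the F-ratio share the same p-value reflects a general fact for simple regression: the F-ratio equals the square of the slope's t statistic. As a rough consistency check on the printed output, the sketch below recomputes the derived quantities from the reported estimates and sums of squares. The values are transcribed from the output with decimal commas converted to points; the names, including the helper `is_significant`, are illustrative.

```python
from math import sqrt

# Values transcribed from the Statgraphics output.
slope, se_slope = -2.68469, 0.838677
ss_model, ss_resid = 561.472, 383.553
df_model, df_resid = 1, 7
P_VALUE_SLOPE = 0.0150          # reported p-value, not recomputed here

# Derived quantities that the output also reports.
t_slope = slope / se_slope                      # T statistic of the slope
ms_resid = ss_resid / df_resid                  # residual variance s_R^2
f_ratio = (ss_model / df_model) / ms_resid      # ANOVA F-ratio
r_squared = ss_model / (ss_model + ss_resid)    # coefficient of determination
std_error_est = sqrt(ms_resid)                  # "Standard Error of Est."


def is_significant(alpha):
    """Reject H0: slope = 0 when the reported p-value is below alpha."""
    return P_VALUE_SLOPE < alpha


print("t:", t_slope, "t^2:", t_slope**2, "F:", f_ratio)
print("R^2:", r_squared, "std. error of est.:", std_error_est)
print("significant at 5%:", is_significant(0.05))
print("significant at 1%:", is_significant(0.01))
```

With these figures the slope is significant at the 5 % level but not at the 1 % level, which is the practical content of the condition $\alpha > 0.0150$.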