7.1 Inference for comparing means of two populations

Objectives: 7.1 Inference for comparing means of two populations
- Matched pair t confidence interval
- Matched pair t hypothesis test
Reference: http://onlinestatbook.com/2/tests_of_means/correlated.html

Overview of what is to come. We have covered the most computationally demanding part of the course, the part where we needed to do extensive calculations. In terms of inference, what have we done so far? We constructed confidence intervals for locating the population mean, and we carried out statistical tests. From now on we will do much the same thing; the only difference is that the data sets will 'appear' more complex. We will still be constructing confidence intervals and testing hypotheses. What changes is that we need to identify the appropriate methodology/procedure given the data and how it was collected. The calculations become more difficult, but we don't do them by hand; we make the computer do them instead. Our role is to understand every single part of the computer output, and the assumptions used to do all the calculations.

As before, the standard error is a vital ingredient and will be used to measure uncertainty: to construct CIs, proceed just as before; to do tests via the t-transform, proceed just as before. Recall

    t = (estimate − mean under the null hypothesis) / standard error.

The t-transform measures the number of standard errors between the estimated mean and the null mean. It is a measure of distance: the further the distance, the less plausible the null. Important: continue to make plots of the normal and t-distributions, using the results from the output. This will help you check that what you are doing makes sense.
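To make the t-transform concrete, here is a minimal Python sketch (the course itself uses StatCrunch; all numbers below are purely hypothetical) computing the t-value and a one-sided p-value from summary statistics:

    # A minimal sketch of the t-transform from summary statistics.
    # All numbers here are hypothetical, for illustration only.
    from scipy import stats

    def t_transform(estimate, null_mean, se):
        """Number of standard errors between the estimate and the null mean."""
        return (estimate - null_mean) / se

    t_val = t_transform(estimate=5.0, null_mean=0.0, se=1.5)
    df = 19                           # degrees of freedom = n - 1 (assuming n = 20)
    p_right = stats.t.sf(t_val, df)   # one-sided p-value: area to the right of t
    print(t_val, p_right)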

New types of data: comparative inference. In most statistical procedures, the objective is to make comparisons. For example: Does tuition lead to higher grades? Does eating healthy food lead to longer life expectancy? Etc. How does one design an experiment to test such hypotheses? In matched pair studies, subjects are matched and comparisons are made within each pairing. Examples: a patient before and after a treatment; assessing whether a diet worked by weighing a person before and after; studies involving twins, with each twin given a different regime. Using matched studies to make comparisons is an extremely useful method for reducing confounding in a design. Twins in space: http://www.nasa.gov/press/2015/january/astronaut-twins-available-forinterviews-about-yearlong-space-station-mission/

Matched paired data. We are given a data set. If there is a clear matching in the data, then we need to use a matched pairs procedure. We determine whether there is matching by understanding how the data were collected. Once we determine there is matching, we need to understand the statistical output of the matched pairs procedure. Examples of matched data we will consider in this chapter:
- The effect red wine has on polyphenol levels (wait a minute, we already considered this data).
- The influence a full moon has on certain patients.
- The differences between running at high and low altitude.
- Whether Friday the 13th changes behavior.
- The weights of baby calves.
The questions asked above are answered by collecting matched data. In an exam you will be asked to identify matched data.

Example 1: Red wine and polyphenol levels in blood. We have already come across this example in Chapters 6 and 7. We used a one-sample method to analyze the data, but only after processing: the data we used were the differences in polyphenol levels before and after taking red wine. The 'raw' data are simply the polyphenol levels before taking red wine and after taking red wine. It is very natural to consider the difference, as it gives the increase/decrease in polyphenol levels after the treatment. The matched pairs methods we discuss in this chapter are identical to the one-sample methods discussed in Chapters 6 and 7, applied after the differences are taken. Our job is to understand that differences need to be taken. In Chapter 10 we will consider data where there is no natural pairing; we need to be sure we do not confuse the two different types of data.

The statistical output: CI and tests. Above are the 95% CI and tests for the mean. The top output is what you get by telling the computer that the data are paired (see demo). The lower output is from manually taking differences and then constructing a CI and test on the differences. We observe that the outputs are identical. In the next slide we review what the output is telling us.

Review of polyphenol output. The output on the left is the CI for where we believe the mean difference after taking red wine should lie. We have 95% confidence that it lies in [2.6, 5.99]. This CI is far above zero, suggesting that mean levels increase; we formally test this in the next output. The output on the right tests H0: µA − µB ≤ 0 against HA: µA − µB > 0 (i.e., taking red wine increases polyphenol levels). The p-value is very small, less than 0.1%. As this is far smaller than 5%, there is strong evidence that the mean level of polyphenol increases with wine consumption. Observe that, using this one-sided test, we can also deduce that if we were to test H0: µA − µB ≥ 0 against HA: µA − µB < 0 (i.e., taking red wine decreases polyphenol levels), the p-value would be greater than 99.9%; thus there is no evidence that polyphenol levels decrease with wine consumption.

Example 2: Does a full moon have an influence on behavior? We want to investigate whether aggressive dementia patients tend to be more aggressive when there is a full moon. The behavior of 18 disruptive dementia patients was studied to see if there is any evidence of this. For each patient, the average number of disruptive events on full moon days and on other days was counted. The data are on the right. The raw numbers do not contain the information on being 'more disruptive'. Instead, one should consider the difference within each pair; it is the difference that actually contains the information on whether an individual is more or less disruptive. In addition, by taking differences we are factoring out some of the natural variability in aggressive behavior between patients.

The hypothesis of interest and output. We want to test whether the full moon makes patients more disruptive. First we set notation: let µN = the mean number of disruptive events (no full moon), and let µF = the mean number of disruptive events under a full moon. We conjecture that there are more disruptive events during a full moon, so we test H0: µF − µN ≤ 0 against HA: µF − µN > 0. We see that the p-value is extremely small, thus there is strong evidence to reject the null. Understanding the output: the t-value is calculated as t = (2.3 − 0)/0.34 ≈ 6.71. Using the same output we can calculate the 95% CI: [2.3 ± 2.11 × 0.34] = [1.58, 3.02].
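As a check, the CI arithmetic can be reproduced from the summary output; a sketch assuming the rounded values above (mean difference 2.3, SE 0.34) and n = 18 patients:

    # Sketch: 95% CI from summary statistics, assuming n = 18 (df = 17).
    from scipy import stats

    mean_diff, se, df = 2.3, 0.34, 17
    t_star = stats.t.ppf(0.975, df)                  # about 2.11 for t(17)
    ci = (mean_diff - t_star * se, mean_diff + t_star * se)
    print(ci)                                        # roughly (1.58, 3.02)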

Lab Practice I (moon example). Load the moon data into StatCrunch. We can look at the differences visually by going to Data -> Compute Data -> Expression -> Build (now compute the expression by clicking on the correct expression). Just looking at the differences, it is clear that for this data set the patients are more disruptive. Now we want to see whether we can infer that, in general, aggressive dementia patients are more disruptive during the full moon. To do this we ask how likely it is that we could get these mainly positive differences by random chance alone; this is the p-value. If it turns out to be large, the data are consistent with these numbers arising by random chance, and there is no evidence of an actual difference in behavior. To do the test and construct CIs, go to Stat -> T-statistics -> Paired and select either the test of interest or the desired CI.
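The same analysis can be run outside StatCrunch; a Python sketch with scipy (1.6+ for the alternative argument), using hypothetical per-patient counts, showing that the paired test is exactly a one-sample test on the differences:

    # Sketch of the paired analysis in Python; the data below are hypothetical.
    import numpy as np
    from scipy import stats

    full_moon  = np.array([3.8, 3.0, 2.6, 2.9, 3.3])   # hypothetical counts
    other_days = np.array([0.8, 0.9, 0.4, 0.7, 1.1])   # hypothetical counts

    # Paired test = one-sample test on the differences:
    diffs = full_moon - other_days
    print(stats.ttest_1samp(diffs, 0, alternative="greater"))

    # ttest_rel pairs the two columns for us and gives an identical result:
    print(stats.ttest_rel(full_moon, other_days, alternative="greater"))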

Checking reliability of output. Now we want to see whether the calculations are trustworthy. We always assume the sample is a simple random sample. In this example, this assumption is a bit dubious, as the patients selected were the ones who were the most disruptive; therefore we can only draw inference on the population of disruptive patients. As the sample size is quite small (18), we should check that the differences do not deviate too much from normality. A histogram and QQ plot are given below.

Observations: The data are numerical continuous (average number of disruptive events per person). The histogram of the differences does not look very bell-shaped, and the points on the QQ plot don't fall on the line. The data are not very close to normal, but there isn't any clear skew, which is the main factor preventing the CLT from kicking in at relatively large sample sizes. The sample size of 18 is below the 30 rule of thumb. However, simulating the sampling distribution using the applet shows that the sample mean based on 18 observations will be quite close to normal.
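The diagnostics described above (histogram and QQ plot of the differences) are easy to produce; a sketch, with placeholder random data standing in for the 18 observed differences:

    # Sketch of the normality diagnostics; 'diffs' is placeholder data.
    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats

    rng = np.random.default_rng(0)
    diffs = rng.normal(2.3, 1.4, size=18)   # placeholder for the 18 differences

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
    ax1.hist(diffs, bins=6)                  # look for rough bell shape / skew
    ax1.set_title("Histogram of differences")
    stats.probplot(diffs, dist="norm", plot=ax2)  # points near the line = near-normal
    plt.tight_layout()
    plt.show()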

The above observations imply that the sample mean will be approximately normal. Therefore the p-value, which was calculated using the t-distribution (remember, we only use the t because the standard deviation is unknown), is relatively close to the truth. Regardless of the normality of the sample mean, the sample mean is 6.71 standard errors from the null (a huge difference); the t-transform is so large that the p-value is very small regardless of the actual distribution. Based on this, there is overwhelming evidence that behavior on full moon days differs from other days. Recommendations: As a consequence of this study, the nursing home may want to bring in more staff on full moon days. As the average number of additional disruptive events on a full moon is between [1.58, 3.02], the nursing home may want to use this interval to calculate the number of extra staff to bring on duty during the full moon.

Example 3: Running at different altitudes. It is usually believed that people's running times at high altitude are worse than their running times at sea level. We want to check this assertion. Data collection: 12 runners are asked to run the same distance at both sea level and high altitude, and their running times are recorded. The data are given on the right. Since the same runner is used at both the high and low altitude, there is clear matching in the data. Also observe that most of the differences are positive: for most of the runners we see an increase in time.

The output and using it for testing. The 95% CI for the difference is given below. We are interested in understanding whether running at a high altitude increases running time, so the hypothesis of interest is H0: µH − µL ≤ 0 against HA: µH − µL > 0. Suppose we do the test at the 1% level. The t-transform is t = (1.2 − 0)/0.311 ≈ 3.88. The p-value is the area to the RIGHT of 3.88. Looking up the t-tables (df = 11), we see that 3.88 lies beyond the critical value whose right-tail area is 0.25%; thus the p-value is less than 0.25%. The p-value can be calculated exactly by doing the test. The results from three hypothesis test outputs are on the next slide.

One- and two-sided tests: the output

The output corresponding to the hypothesis test of interest is the last one, which gives the p-value 0.13%. However, for the purpose of an examination you should be able to deduce the correct p-value from any of the three outputs. The first output corresponds to the opposite hypothesis, where we are interested in whether there is evidence that we run faster at high altitude. The p-value for this test is 99.87%; this is the area to the left of 3.88. Therefore the p-value we are interested in is the area to the right of 3.88, which is 100 − 99.87 = 0.13%. The middle output is the two-sided test, i.e., that the mean running times at sea level and high altitude are different. Its p-value is 0.25%, which (up to rounding) is double our one-sided p-value.
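The arithmetic linking the three outputs can be checked directly; a sketch using the rounded t-value 3.88 and df = 11 (assuming n = 12 runners):

    # Sketch: converting between one- and two-sided p-values,
    # assuming t = 3.88 and df = 11 (n = 12 runners).
    from scipy import stats

    t_val, df = 3.88, 11
    p_right = stats.t.sf(t_val, df)       # H_A: mu_H - mu_L > 0  -> ~0.0013
    p_left  = stats.t.cdf(t_val, df)      # H_A: mu_H - mu_L < 0  -> ~0.9987
    p_two   = 2 * min(p_right, p_left)    # two-sided alternative -> ~0.0026
    print(p_right, p_left, p_two)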

Example 4: Does Friday the 13th increase accidents? To answer this question, the number of accidents on six consecutive Friday the 13ths (during the early 1990s) was collected. A comparison is required in which all the factors are the same except for the 13th itself, so the data are compared with the number of accidents on the preceding Friday the 6th. It is not immediately obvious, but there is matching in this data: a Friday the 6th and the following Friday the 13th share similar factors apart from the date (for example, more accidents tend to happen during July/August, which would increase both values). There is a dependence between them, driven by these common factors.

Hypothesis of interest and the test. To see whether there is evidence that accidents have increased, our hypothesis of interest is H0: µ13 − µ6 ≤ 0 against HA: µ13 − µ6 > 0. We see that the p-value is about 2.11%. This is less than 5%, so we can reject the null at the 5% level and conclude that Friday the 13th tends to increase accidents. Notes of caution: The data are clearly not normally distributed (they are numerical discrete) and the sample size is very small (n = 6), so the p-value is unlikely to be very reliable. As it is relatively close to the 5% boundary, we need to be cautious in interpreting the significance of this result; i.e., the true p-value may be over 5%.

More examples: Comparing the weights of calves at different weeks. Load the calf data into StatCrunch and compare the weights at different weeks. The data are clearly matched, because the same calf is followed over several weeks. Moreover, in the scatterplot of week 0.5 against week 1 we see a clear linear trend. This shows that there is a clear matching between the weights (notice that calves that are heavier at week 0.5 also tend to be heavier at week 1). Based on the above, if we want to compare the weights at different weeks we need to use a matched pairs procedure. Do this!

Summary: matched pairs procedures. Sometimes we want to compare treatments or conditions at the individual level. These situations produce two samples that are not independent; they are related to each other. The subjects of one sample are identical to, or matched (paired) with, the subjects of the other sample. Example: pre-test and post-test studies look at data collected on the same subjects before and after some treatment is performed. Example: twin studies often try to sort out the influence of genetic factors by comparing a variable between sets of twins. Example: using people matched for age, sex, and education in social studies helps to cancel out the effects of these potentially relevant variables. Except for pre/post studies, subjects should be randomly assigned to treatments within each pair.

For data from a matched pairs design, we use the observed differences X_difference = X_1 − X_2 to test the difference in the two population means. The hypotheses can then be expressed as H0: µ_difference = 0 against Ha: µ_difference > 0 (or < 0, or ≠ 0). You will need to decide what test to apply to the data. In Chapter 10 we will cover the independent samples t-test; it tests the same hypothesis, but there is no matching in the data, so a different procedure is used. Based on how the data were collected, you should be able to decide which test to use when (see the sketch below).
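A minimal sketch of this decision in Python (all data below are hypothetical): the matched pairs test and the independent samples test answer the same kind of question but are different procedures:

    # Sketch contrasting the two procedures; all data are hypothetical.
    import numpy as np
    from scipy import stats

    # Matched data: same subjects measured before and after.
    before = np.array([5.1, 4.8, 6.0, 5.5])
    after  = np.array([6.3, 5.2, 6.9, 6.1])
    print(stats.ttest_rel(after, before))                 # matched pairs t-test

    # No matching: two separate groups (Chapter 10's setting).
    group_a = np.array([5.1, 4.8, 6.0, 5.5, 5.9])
    group_b = np.array([6.3, 5.2, 6.9, 6.1])
    print(stats.ttest_ind(group_a, group_b, equal_var=False))  # Welch's t-test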

Calculation Practice: Does no caffeine increase depression? Individuals diagnosed as caffeine-dependent were deprived of caffeine-rich foods and assigned pills for 10 days. Sometimes the pills contained caffeine, and other times they contained a placebo. A depression score was determined separately for the caffeine pills (as a whole) and for the placebo pills. There are 2 data points for each subject, but we only look at the difference. We calculate that x̄_diff = 7.36, s_diff = 6.92, df = 10. We test H0: µ_difference = 0 against Ha: µ_difference > 0, using α = 0.05. (Why is a one-sided test OK?)

    t = (x̄_diff − 0) / (s_diff / √n) = 7.36 / (6.92 / √11) = 3.53.

From the t-distribution: p-value = 0.0027, which is quite small, in fact smaller than α.

Subject | Depression with caffeine | Depression with placebo | Placebo − caffeine
1 | 5 | 16 | 11
2 | 5 | 23 | 18
3 | 4 | 5 | 1
4 | 3 | 7 | 4
5 | 8 | 14 | 6
6 | 5 | 24 | 19
7 | 0 | 6 | 6
8 | 0 | 3 | 3
9 | 2 | 15 | 13
10 | 11 | 12 | 1
11 | 1 | 0 | −1

Depression is greater with the placebo than with the caffeine pills, on average.
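The hand calculation above can be verified from the table; a sketch using the Placebo − caffeine column:

    # Sketch verifying the caffeine example from the differences column.
    import numpy as np
    from scipy import stats

    diffs = np.array([11, 18, 1, 4, 6, 19, 6, 3, 13, 1, -1])  # placebo - caffeine

    x_bar = diffs.mean()                      # 7.36
    s     = diffs.std(ddof=1)                 # 6.92
    n     = len(diffs)                        # 11, so df = 10
    t_val = (x_bar - 0) / (s / np.sqrt(n))    # 3.53
    p_val = stats.t.sf(t_val, n - 1)          # one-sided p-value ~ 0.0027
    print(x_bar, s, t_val, p_val)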

Accompanying problems associated with this chapter:
- Quiz 13
- Homework 6 (Q3)
- Homework 8 (Q1)