CHAPTER TEN. Key Concepts
Key concepts: linear regression: slope, intercept, residual; error sum of squares, or residual sum of squares; sum of squares due to regression; mean square error; mean square due to regression; regression (coefficient) of y on x; least squares; homoscedasticity, heteroscedasticity; linear relationship, covariance, product-moment correlation, rank correlation; multiple regression, stepwise regression, regression diagnostics, multiple correlation coefficient, partial correlation coefficient; regression toward the mean.

Basic Biostatistics for Geneticists and Epidemiologists: A Practical Approach. © 2008 John Wiley & Sons, Ltd. R. Elston and W. Johnson
Correlation and Regression

SYMBOLS AND ABBREVIATIONS

b0: sample intercept
b1, b2, ...: sample regression coefficients
MS: mean square
r: correlation coefficient
R: multiple correlation coefficient
R²: proportion of the variability explained by a regression model
s_XY: sample covariance of X and Y (estimate)
SS: sum of squares
ŷ: value of y predicted from a regression equation
e: estimated error, or residual, from a regression model
β0: population intercept
β1, β2, ...: population regression coefficients
ε: error, or residual, from a regression model (Greek letter epsilon)

SIMPLE LINEAR REGRESSION

In Chapter 9 we discussed categorical data and introduced the notion of response and predictor variables. We turn now to a discussion of relationships between response and predictor variables of the continuous type. To explain such a relationship we often search for mathematical models such as equations for straight lines, parabolas, or other mathematical functions. The anthropologist Sir Francis Galton (1822-1911) used the term regression in explaining the relationship between the heights of fathers and their sons. From a group of 1078 father-son pairs he developed the following model, in which the heights are in inches:

son's height = 33.73 + 0.516 × (father's height).
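As a quick numeric sketch, Galton's fitted line can be evaluated for any father's height. The intercept (33.73 in) and slope (0.516) used here are the values commonly quoted for Galton's father-son data and should be treated as an assumption, not as figures taken from this chapter:

```python
# Hedged sketch: intercept and slope are the commonly quoted
# least-squares values for Galton's data, assumed for illustration.
def son_height(father_height, intercept=33.73, slope=0.516):
    """Predicted son's height in inches from father's height in inches."""
    return intercept + slope * father_height

print(round(son_height(74)))  # a 74-inch father: son predicted about 72 in
print(round(son_height(65)))  # a 65-inch father: son predicted about 67 in
```

Note that the prediction for a tall father is shorter than the father, and for a short father taller: the "regression toward the mean" discussed next.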
If we substitute 74 inches into this equation for the father's height, we arrive at about 72 inches for the son's height (i.e. the son is not as tall as his father). On the other hand, if we substitute 65 inches for the father's height, we find the son's height to be 67 inches (i.e. the son is taller than his father). Galton concluded that although tall fathers tend to have tall sons and short fathers tend to have short sons, the son's height tends to be closer to the average than his father's height. Galton called this regression toward the mean. Although the techniques for modeling relationships among variables have taken on a much wider meaning, the term regression has become entrenched in the statistical literature to describe this modeling. Thus, nowadays we speak of regression models, regression equations, and regression analysis without wishing to imply that there is anything regressive. As we shall see later, however, the phrase regression toward the mean is still used in a sense analogous to what Galton meant by it.

Figure 10.1 Graph of a straight line with intercept 3 and positive slope

We begin by discussing a relatively simple relationship between two variables, namely a straight-line relationship. The straight line shown in Figure 10.1 has intercept 3 and a positive slope. The equation y = 2 − 3x is shown in Figure 10.2,

Figure 10.2 Graph of the equation y = 2 − 3x
Figure 10.3 Graph of the equation y = 0.5x

and the equation y = 0.5x in Figure 10.3. In general, the equation of a straight line can be expressed in the form

y = β0 + β1x,

where β0 and β1 are specified constants. Any point on this line has an x-coordinate and a y-coordinate. When x = 0, y = β0; so the parameter β0 is the value of the y-coordinate where the line crosses the y-axis and is called the intercept. In Figure 10.1, the intercept is 3; in Figure 10.2 it is 2. When β0 = 0, the line goes through the origin (Figure 10.3) and the intercept is 0. The parameter β1 is the slope of the line and measures the amount of change in y per unit increase in x. When β1 is positive (as in Figures 10.1 and 10.3), the slope is upward and x and y increase together; when β1 is negative (Figure 10.2), the slope is downward and y decreases as x increases. When β1 is zero, y is the same (it has value β0) for all values of x and the line is horizontal (i.e. there is no slope). The parameter β1 is undefined for vertical lines but approaches infinity as the line approaches a vertical position.

So far in our discussion we have assumed that the relationship between x and y is explained exactly by a straight line; if we are given x we can determine y, and vice versa, for all values of x and y. Now let us assume that the relationship between the two variables is not exact, because one of the variables is subject to random measurement errors. Let us call this random variable the response variable and denote it Y. The other variable is assumed to be measured without error; it is under the control of the investigator and we call it the predictor variable. This terminology is consistent with that of Chapter 9. In practice the predictor variable will also often be subject to random variability caused by errors of measurement, but we assume that this variability is negligible relative to that of the response variable. For example, suppose an investigator is interested in the rate at which a metabolite is consumed or produced by an enzyme reaction.
A reaction mixture is set up from which aliquots are withdrawn at various intervals of time and the
concentration of the metabolite in each aliquot is determined. Whereas the time at which each aliquot is withdrawn can be accurately measured, the metabolite concentration, being determined by a rather complex assay procedure, is subject to appreciable measurement error. In this situation, time would be the predictor variable under the investigator's control, and metabolite concentration would be the random response variable. Since the relationship is not exact, we write

Y = β0 + β1x + ε,

where ε, called the error, is the amount by which the random variable Y, for a given x, lies above the line (or below the line, if ε is negative). This is illustrated in Figure 10.4. In a practical situation, we would have a sample of pairs of numbers: x1 and y1, x2 and y2, and so on. Then, assuming a straight line is an appropriate model, we would try to find the line that best fits the data. In other words, we would try to find estimates b0 and b1 of the parameters β0 and β1, respectively. One approach that yields estimates with good properties is to take the line that minimizes the sum of the squared errors (i.e. that makes the sum of the squared vertical deviations from the fitted line as small as possible). These are called least-squares estimates. Thus, if we have a sample of pairs, we can denote a particular pair (the ith pair) xi, yi, so that

yi = β0 + β1xi + εi.

Figure 10.4 Graph of the regression equation y = 1 + 2x and the errors (ε = −1 and ε = 4) for the points (x = 2, y = 4) and (x = 6, y = 17)
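The least-squares idea can be sketched numerically. The data below are invented purely for illustration; the closed-form solutions b1 = Sxy/Sxx and b0 = ȳ − b1x̄ are the standard ones:

```python
# Minimal least-squares fit of yhat = b0 + b1*x (hypothetical data).
def fit_line(xs, ys):
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    b1 = sxy / sxx          # slope estimate
    b0 = ybar - b1 * xbar   # intercept estimate
    return b0, b1

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1 = fit_line(xs, ys)
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
# A property of the least-squares line: the residuals sum to zero
# (up to floating-point rounding).
print(abs(sum(residuals)) < 1e-9)
```

The zero-sum property of the residuals, printed as the final check, is discussed in the text that follows.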
The ith error is

εi = yi − β0 − β1xi,

as illustrated in Figure 10.4, and its square is εi² = (yi − β0 − β1xi)². Then the least-squares estimates b0 and b1 of β0 and β1 are those estimates of β0 and β1, respectively, that minimize the sum of these squared deviations over all the sample values. The slope β1 (or its least-squares estimate b1) is also called the regression of y on x, or the regression coefficient of y on x. Notice that if the line provides a perfect fit to the data (i.e. all the points fall on the line), then εi = 0 for all i. Moreover, the poorer the fit, the greater the magnitudes, either positive or negative, of the εi. Now let us define the fitted line by

ŷi = b0 + b1xi,

and the estimated error, or residual, by

ei = yi − ŷi = yi − b0 − b1xi.

Then a special property of the line fitted by least squares is that the sum of the ei over the whole sample is zero. If we sum the squared residuals ei², we obtain a quantity called the error sum of squares, or residual sum of squares. If the line is horizontal (b1 = 0), as in Figure 10.5, the residual sum of squares is equal to the sum of squares about the sample mean. If the line is neither horizontal nor vertical, we have a situation such as that illustrated in Figure 10.6. The deviation of the ith observation from the sample mean, yi − ȳ, has been partitioned into two components: a deviation from the regression line, yi − ŷi = ei, the estimated error or residual; and a deviation of the regression line from the mean, ŷi − ȳ, which we call the deviation due to regression (i.e. due to the straight-line model). If we square each of these three deviations (yi − ȳ, yi − ŷi, and ŷi − ȳ) and separately add them

Figure 10.5 Graph of the regression equation ŷi = b0 + b1xi for the special case in which b1 = 0 and b0 = ȳ, with five residuals e1, ..., e5 depicted
Figure 10.6 Graph of the regression equation ŷi = b0 + b1xi, showing how the difference between each yi and the mean ȳ can be decomposed into a deviation from the line (ei = yi − ŷi) and a deviation of the line from the mean (ŷi − ȳ)

up over all the sample values, we obtain three sums of squares which satisfy the following relationship:

total sum of squared deviations from the mean
= sum of squared deviations from the regression model
+ sum of squared deviations due to, or explained by, the regression model.

We often abbreviate this relationship by writing

SS_T = SS_E + SS_R,

where SS_T is the total sum of squares about the mean, SS_E is the error, or residual, sum of squares, and SS_R is the sum of squares due to regression. These three sums of squares have n − 1, n − 2, and 1 d.f., respectively. If we divide the last two sums of squares by their respective degrees of freedom, we obtain quantities called mean squares: the error, or residual, mean square and the mean square due to regression. These mean squares are used to test for the significance of the regression, which in the case of a straight-line model is the same as testing whether the slope of the straight line is significantly different from zero. In the example discussed above, we may wish to test whether the metabolite concentration in the reaction mixture is in fact changing, or whether it is the same at all the different points in time. Thus we would test the null hypothesis H0: β1 = 0. Denote the error mean square MS_E and the mean square due to the straight-line regression model MS_R. Then, under H0 and certain conditions that we shall specify, the ratio

F = MS_R / MS_E

follows the F-distribution with 1 and n − 2 d.f. (Note: as is always true of an F-statistic, the first number of degrees of freedom corresponds to the numerator, here MS_R, and the second to the denominator, here MS_E.)
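The sum-of-squares identity and the F-ratio just defined can be checked numerically on any least-squares line; the data below are hypothetical, chosen only for illustration:

```python
# Verify SS_T = SS_E + SS_R and compute the F-ratio (hypothetical data).
xs = [1, 2, 3, 4, 5, 6]
ys = [3.0, 5.0, 4.0, 8.0, 9.0, 11.0]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
sxx = sum((x - xbar) ** 2 for x in xs)
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * x for x in xs]
ss_t = sum((y - ybar) ** 2 for y in ys)               # total, n - 1 d.f.
ss_e = sum((y - yh) ** 2 for y, yh in zip(ys, yhat))  # residual, n - 2 d.f.
ss_r = sum((yh - ybar) ** 2 for yh in yhat)           # regression, 1 d.f.
ms_r, ms_e = ss_r / 1, ss_e / (n - 2)
f_ratio = ms_r / ms_e
# Slope t-statistic: t = b1/s_b1, with s_b1 = sqrt(MS_E/Sxx) the
# standard formula for the standard error of b1; t**2 equals F.
t = b1 / (ms_e / sxx) ** 0.5
print(abs(ss_t - (ss_e + ss_r)) < 1e-9, abs(t * t - f_ratio) < 1e-9)
```

Both checks print True: the decomposition holds exactly (up to rounding), and squaring the slope t-statistic recovers the F-ratio.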
As β1 increases in magnitude, the numerator of the F-ratio will tend to have large values, and as β1
approaches zero, it will tend toward zero. Thus, large values of F indicate departure from H0, whether because β1 is greater or less than zero. Thus if the observed value of F is greater than the 95th percentile of the F-distribution, we reject H0 at the 5% significance level for a two-sided test. Otherwise, there is no significant linear relationship between Y and x. The conditions necessary for this test to be valid are the following:

1. For a fixed value of x, the corresponding Y must come from a normal distribution with mean β0 + β1x.
2. The Ys must be independent.
3. The variance of Y must be the same at each value of x. This is called homoscedasticity; if the variance changes for different values of x, we have heteroscedasticity.

Furthermore, under these conditions, the least-squares estimates b0 and b1 are also the maximum likelihood estimates of β0 and β1. The quantities b0, b1 and the mean squares can be automatically calculated on a computer or a pocket calculator. The statistical test is often summarized as shown in Table 10.1. Note that SS_T/(n − 1) is the usual estimator of the (total) variance of Y if we ignore the x-values. The estimator of the variance of Y about the regression line, however, is MS_E, the error mean square or the mean squared error. It estimates the error, or residual, variance not explained by the model.

Table 10.1 Summary results for testing the hypothesis of zero slope (linear regression analysis)

Source of Variability in Y    d.f.     Sum of Squares    Mean Square    F-Ratio
Regression                    1        SS_R              MS_R           MS_R/MS_E
Residual (error)              n − 2    SS_E              MS_E
Total                         n − 1    SS_T

Now b1 is an estimate of β1 and represents a particular value of an estimator which has a standard error that we shall denote s_B1. Some computer programs and pocket calculators give these quantities instead of (or in addition to) the quantities in Table 10.1. Then, under the same conditions, we can test H0: β1 = 0 by using the statistic

t = b1 / s_B1,
which under H0 comes from Student's t-distribution with n − 2 d.f. In fact, the square of this t is identical to the F-ratio defined earlier. So, in the case of simple linear regression, either an F-test or a t-test can be used to test the hypothesis that the slope of the regression line is equal to zero.

THE STRAIGHT-LINE RELATIONSHIP WHEN THERE IS INHERENT VARIABILITY

So far we have assumed that the only source of variability about the regression line is due to measurement error. But you will find that regression analysis is often used in the literature in situations in which it is known that the pairs of values x and y, even if measured without error, do not all lie on a line. The reason for this is not so much because such an analysis is appropriate, but rather because it is a relatively simple method of analysis, easily generalized to the case in which there are multiple predictor variables (as we discuss later in this chapter), and readily available in many computer packages of statistical programs. For example, regression analysis might be used to study the relationship between triglyceride and cholesterol levels even though, however accurately we measure these variables, a large number of pairs of values will never fall on a straight line, but rather give rise to a scatter diagram similar to Figure 3.7. Regression analysis is not the appropriate statistical tool if, in this situation, we want to know how triglyceride levels and cholesterol levels are related in the population. It is, however, an appropriate tool if we wish to develop a prediction equation to estimate one from the other. Let us call triglyceride level x and cholesterol level Y. Using the data illustrated in Figure 3.7, the estimated regression equation of Y on x can be calculated, and the residual variance is estimated to be 776 (mg/dl)².
This variance includes both measurement error and natural variability, so it is better to call it residual variance rather than error variance. Thus, for a population of persons who all have a measured triglyceride level equal to x, we estimate that the random variable Y, measured cholesterol level, has mean equal to the value ŷ given by the estimated regression equation and standard deviation √776 = 27.9 mg/dl. This is the way we use the results of regression analysis to predict the distribution of cholesterol level (Y) for any particular value of triglyceride level (x). But we must not use the same equation to predict triglyceride level from cholesterol level. If we solve the estimated regression equation
for x, we obtain x as a linear function of ŷ. This is exactly the same line, the regression of Y on x, although expressed differently. It is not the regression of the random variable X on y. Such a regression can be estimated, and for these same data the estimated regression of X on y turns out to have residual variance 2930 (mg/dl)². This is the equation that should be used to predict triglyceride level from cholesterol level. Figure 10.7 is a repeat of Figure 3.7 with these two regression lines superimposed on the data. The regression of Y on x is obtained by minimizing the sum of the squared vertical deviations (deviations in Y, that is, parallel to the y-axis) from the straight line and can be used to predict the distribution of Y for a fixed value of x. The regression of X on y, on the other hand, is the converse situation: it is obtained by minimizing the sum of the squared horizontal deviations (deviations in the random variable X, that is, parallel to the x-axis), and it is the appropriate regression to use if we wish to predict the distribution of X for a fixed (controlled) value of y. In this latter case, X is the response variable and y the predictor variable.

Figure 10.7 Scatterplot of serum cholesterol (mg/dl) versus triglyceride (mg/dl) levels of 30 medical students, with the two estimated regression lines
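The two fitted lines can be compared through their slopes. A minimal sketch with invented data: the slope of Y on x is Sxy/Sxx and the slope of X on y is Sxy/Syy, so the product of the two slopes equals the squared correlation, which is 1 only when the lines coincide:

```python
# The regression of Y on x and of X on y give different lines
# (hypothetical data); the product of their slopes equals r**2.
xs = [1, 2, 4, 5, 8]
ys = [2, 3, 3, 6, 8]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
sxx = sum((x - xbar) ** 2 for x in xs)
syy = sum((y - ybar) ** 2 for y in ys)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
slope_y_on_x = sxy / sxx   # minimizes squared vertical deviations
slope_x_on_y = sxy / syy   # minimizes squared horizontal deviations
r_squared = sxy ** 2 / (sxx * syy)
print(abs(slope_y_on_x * slope_x_on_y - r_squared) < 1e-12)
```

Unless the points fall exactly on one line, slope_y_on_x differs from 1/slope_x_on_y, which is why the two prediction equations in the text are distinct.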
It is clear that these are two different lines. Furthermore, the line that best describes the single, underlying linear relationship between the two variables falls somewhere between these two lines. It is beyond the scope of this book to discuss the various methods available for finding such a line, but you should be aware that such methods exist, and that they depend on knowing the relative accuracy with which the two variables X and Y are measured.

CORRELATION

In Chapter 3 we defined variance as a measure of dispersion. The definition applies to a single random variable. In this section we introduce a more general concept of variability called covariance. Let us suppose we have a situation in which two random variables are observed for each study unit in a sample, and we are interested in measuring the strength of the association between the two random variables in the population. For example, without trying to estimate the straight-line relationship itself between cholesterol and triglyceride levels in male medical students, we might wish to estimate how closely the points in Figure 10.7 fit an underlying straight line. First we shall see how to estimate the covariance between the two variables. Covariance is a measure of how two random variables vary together, either in a sample or in the population, when the values of the two random variables occur in pairs. To compute the covariance for a sample of values of two random variables, say X and Y, with sample means x̄ and ȳ, respectively, the following steps are taken:

1. For each pair of values xi and yi, subtract x̄ from xi and ȳ from yi (i.e. compute the deviations xi − x̄ and yi − ȳ).
2. Find the product of each pair of deviations (i.e. compute (xi − x̄)(yi − ȳ)).
3. Sum these products over the whole sample.
4. Divide the sum of these products by one less than the number of pairs in the sample.

Suppose, for example, we wish to compute the sample covariance for X, Y from the following data:

i     1     2     3     4     5
xi    10    20    30    40    50
yi    30    50    70    90    110
Note that the sample means are x̄ = 30 and ȳ = 70. We follow the steps outlined above.

1. Subtract x̄ from each xi and ȳ from each yi:

i     xi − x̄            yi − ȳ
1     10 − 30 = −20     30 − 70 = −40
2     20 − 30 = −10     50 − 70 = −20
3     30 − 30 = 0       70 − 70 = 0
4     40 − 30 = 10      90 − 70 = 20
5     50 − 30 = 20      110 − 70 = 40

2. Find the products (xi − x̄)(yi − ȳ):

i     (xi − x̄)(yi − ȳ)
1     (−20)(−40) = 800
2     (−10)(−20) = 200
3     (0)(0) = 0
4     (10)(20) = 200
5     (20)(40) = 800

3. Sum the products: 800 + 200 + 0 + 200 + 800 = 2000.

4. Divide by one less than the number of pairs:

s_XY = sample covariance = 2000/4 = 500.

Note in steps 1 and 2 that, if both members of a pair are below their respective means (as is the case for the first pair, i = 1), the contribution to the covariance is positive (+800 for the first pair). It is similarly positive when both members of the pair are above their respective means (+200 and +800 for i = 4 and 5, in the example). Thus, a positive covariance implies that X and Y tend to covary in such a manner that when one is either below or above its mean, so is the other. A negative covariance, on the other hand, would imply a tendency for one to be above its mean when the other is below its mean, and vice versa.

Now suppose X is measured in pounds and Y in inches. Then the covariance is in pound-inches, a mixture of units that is difficult to interpret. Recall that a variance
is measured in squared units and we take the square root of the variance to get back to the original units. Obviously this does not work for the covariance. Instead, we divide the covariance by the product of the estimated standard deviation of X and the estimated standard deviation of Y, which we denote s_X and s_Y, respectively. The result is a pure, dimensionless number (no units) that is commonly denoted r and called the correlation coefficient, or Pearson's product-moment correlation coefficient:

r = s_XY / (s_X s_Y),

where s_XY is the sample covariance of X and Y, s_X is the sample standard deviation of X, and s_Y the sample standard deviation of Y. Thus, in the above example, s_X = √250 and s_Y = √1000, so

r = 500 / (√250 × √1000) = 500/500 = 1.

In this example the correlation coefficient is +1. A scatterplot of the data indicates that all the points (xi, yi) lie on a straight line with positive slope, as illustrated in Figure 10.8(a). If all the points lie on a straight line with negative slope, as in Figure 10.8(b), the correlation coefficient is −1. These are the most extreme values possible: a correlation can only take on values between −1 and +1. Figures 10.8(a-h) illustrate a variety of possibilities, and it can be seen that the magnitude of the correlation measures how close the points are to a straight line. Remember that the correlation coefficient is a dimensionless number, and so does not depend on the units of measurement. In Figures 10.8(a-h), the scales have been chosen so that the numerical value of the sample variance of Y is about the same as that of X; you can see that in each figure the range of Y is the same as the range of X. Now look at Figures 10.8(i, j). In each case the points appear to be close to a straight line, and you might therefore think that the correlation coefficient should be large in magnitude.
If the scales are changed to make the range of Y the same as the range of X, however, Figure 10.8(i) becomes identical to Figure 10.8(g), and Figure 10.8(j) becomes identical to Figure 10.8(h). Once the scales have been adjusted, it becomes obvious that the correlation coefficient is near zero in each of these two situations. Of course, if all the points are on a horizontal line or on a vertical line, it is impossible to adjust the scales so that the range is numerically the same for both variables. In such situations, as illustrated in Figures 10.8(k, l), the correlation coefficient is undefined.
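The covariance and correlation computations for the worked example can be mirrored in a few lines of Python. The x and y values here are reconstructed to match the means, deviations, and products shown in the text, and are an assumption in that sense:

```python
# Covariance and Pearson's r for the worked example; data values are
# reconstructed to be consistent with the deviations shown in the text.
xs = [10, 20, 30, 40, 50]   # xbar = 30
ys = [30, 50, 70, 90, 110]  # ybar = 70
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
products = [(x - xbar) * (y - ybar) for x, y in zip(xs, ys)]
s_xy = sum(products) / (n - 1)                              # 2000/4 = 500
s_x = (sum((x - xbar) ** 2 for x in xs) / (n - 1)) ** 0.5   # sqrt(250)
s_y = (sum((y - ybar) ** 2 for y in ys) / (n - 1)) ** 0.5   # sqrt(1000)
r = s_xy / (s_x * s_y)
print(s_xy, round(r, 6))  # 500.0 1.0
```

The correlation comes out as exactly +1 because these points lie on a straight line with positive slope.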
Figure 10.8 Scatterplots illustrating how the correlation coefficient, r, is a measure of the linear association between two variables: (a) r = +1; (b) r = −1; (c) r = 0.9; (d) r = −0.9; (e) r = 0.7; (f) r = −0.7; (g) r = 0; (h) r = 0; (i) r = 0; (j) r = 0; (k) r undefined; (l) r undefined
Note that the denominator in the correlation coefficient is the product of the sample standard deviations, which include both natural variability and measurement errors. Thus (unless the measurement errors in the two variables are themselves correlated), larger measurement errors automatically decrease the correlation coefficient. A small correlation between two variables can thus be due either to (1) little linear association between the two variables, or (2) large errors in their measurement. A correlation close to +1 or −1, on the other hand, implies that the measurement errors must be small relative to the sample standard deviations, and that the data points all lie close to a straight line. In fact, there is a close connection between the correlation coefficient and the estimated slope of the regression line. The estimated slope of the regression of Y on x is r s_Y/s_X, and the estimated slope of the regression of X on y is r s_X/s_Y. The correlation coefficient is significantly different from zero if, and only if, the regression coefficients are significantly different from zero; and such a finding implies a dependency between the two variables. However, a correlation coefficient of zero does not imply two variables are independent (see Figure 10.8(h)) and, as we have seen before, a dependency between two variables does not necessarily imply a causal relationship between them.

SPEARMAN'S RANK CORRELATION

If we rank the xs from 1 to n (from largest to smallest, or vice versa), and we rank the ys from 1 to n in the same direction, and then compute the correlation coefficient between the pairs of ranks, we obtain the so-called rank correlation coefficient, or Spearman's rank correlation coefficient. This correlation measures how closely the points can be fitted by a smooth, monotonic curve (i.e. a curve that is either always increasing or always decreasing).
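Spearman's coefficient can be sketched exactly as described: rank each variable, then compute the product-moment correlation of the ranks. This minimal version assumes no ties, and the data are invented for illustration:

```python
# Spearman's rank correlation: Pearson's r computed on ranks (no ties).
def ranks(values):
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        r[idx] = rank
    return r

def pearson(xs, ys):
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

xs = [1, 2, 3, 4, 5]
ys = [1, 8, 27, 64, 125]   # y = x**3: monotonic but curved
rho = pearson(ranks(xs), ranks(ys))
print(pearson(xs, ys) < 1.0, rho == 1.0)
```

Because the relationship is monotonic but not linear, Pearson's r on the raw data is below 1 while the rank correlation is exactly +1.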
The rank correlations of the data in Figures 10.8(e, f) are +1 and −1, respectively. The curve that best fits the data in Figure 10.8(h), however, is not monotonic; it first increases and then decreases, and the rank correlation for these data is also approximately 0. Apart from being a measure of closeness to a monotonic curve, the rank correlation coefficient is less subject to influence by a few extreme values, and therefore sometimes gives a more reasonable index of association.

MULTIPLE REGRESSION

We have seen how a straight-line regression model can be fitted to data so that the variable x explains part of the variability in a random variable Y. A natural question to ask is, if one variable can account for part of the variability in Y, can more of
the variability be explained by further variables? This leads us to consider models such as

Y = β0 + β1x1 + β2x2 + ... + βqxq + ε,

where x1, x2, ..., xq are q distinct predictor variables. Just as before, we can partition the total sum of squared deviations from the mean of y as SS_T = SS_E + SS_R, where SS_E is the sum of squared deviations from the regression model and SS_R is the sum of squared deviations due to, or explained by, the regression model. But now SS_E has n − q − 1 d.f. and SS_R has q d.f. Following the same line of reasoning as used in the case of simple linear regression, we can compute the quantities indicated in Table 10.2. Thus, F = MS_R/MS_E with q and n − q − 1 d.f. provides a simultaneous test of the hypothesis that all the regression coefficients are equal to zero,

H0: β1 = β2 = ... = βq = 0.

Table 10.2 Summary results for multiple regression

Source of Variability in Y    d.f.         Sum of Squares    Mean Square    F-Ratio
Regression                    q            SS_R              MS_R           MS_R/MS_E
Residual (error)              n − q − 1    SS_E              MS_E
Total                         n − 1        SS_T

The sum of squares for regression can be further partitioned into q terms, each with 1 d.f., so that each coefficient can be tested separately. Thus, the results of a multiple regression analysis with three predictor variables might look something like Table 10.3. The F-test provides an overall test of whether the coefficients β1, β2, and β3 are simultaneously zero. The t-tests provide individual tests for each coefficient separately. These t-tests are to be interpreted only in light of the full model. Thus, if we drop x2 from the model because the results suggest β2 is not significantly different from zero, then we can fit a new model, say

Y = β0′ + β1′x1 + β3′x3 + ε′,
Table 10.3 Summary results of regression analysis for the model Y = β0 + β1x1 + β2x2 + β3x3 + ε

Source of Variability in Y    d.f.     Sum of Squares    Mean Square    F-Ratio
Regression model              3        SS_R              MS_R           MS_R/MS_E
Error (residual)              n − 4    SS_E              MS_E
Total                         n − 1    SS_T

Parameter    Estimate    Standard Error of Estimate    t          p-value
β0           b0          s_B0                          b0/s_B0    p0
β1           b1          s_B1                          b1/s_B1    p1
β2           b2          s_B2                          b2/s_B2    p2
β3           b3          s_B3                          b3/s_B3    p3

to make inferences about the coefficients of x1 and x3 with x2 removed from the model. Note in particular that β1 in the full model is not equal to β1′ in the reduced model. Similarly, β3 ≠ β3′. To study the effect of each predictor variable fully, it is necessary to perform a regression analysis for every possible combination of the predictor variables. In the example of Table 10.3, this would entail conducting a regression analysis for each of the following models:

1. Y regressed on x1
2. Y regressed on x2
3. Y regressed on x3
4. Y regressed on x1 and x2
5. Y regressed on x1 and x3
6. Y regressed on x2 and x3
7. Y regressed on x1, x2, and x3.

The larger the number of predictor variables, the greater the number of possible combinations, and so it is often not feasible to perform all possible regression analyses. For this reason a stepwise approach is often used, though it may not lead to the best subset of variables to keep in the model. There are two types of stepwise regression: forward and backward.
1. Forward stepwise regression first puts into the model the single predictor variable that explains most of the variability in Y, and then successively at each step inserts the variable that explains most of the remaining (residual) variability in Y. However, if at any step none of the remaining predictor variables explains a significant additional amount of variability in Y, at a predetermined level of significance, the procedure is terminated.

2. Backward stepwise regression includes all the predictor variables in the model to begin with, and then successively at each step the variable that explains the least amount of variability in Y (in the presence of the other predictor variables) is dropped from the model. However, a variable is dropped only if, at a predetermined level of significance, its contribution to the variability of Y (in the presence of the other variables) is not significant.

Whatever method is used to select among a set of predictor variables in order to arrive at the best subset to be included in a regression model, it must always be remembered that the final result is merely a prediction model, and not necessarily a model for the causation of variability in the response variable.

Suppose a multiple regression analysis is performed assuming the model

Y = β0 + β1x1 + β2x2 + β3x3 + ε

and, based on a sample of n study units, on each of which we have observations (y, x1, x2, x3), we find b0 = 40, b1 = 5, b2 = 10, b3 = 7, as estimates of β0, β1, β2, and β3, respectively. The fitted regression model in this case is

ŷ = 40 + 5x1 + 10x2 + 7x3.

For each of the n study units we can substitute x1, x2, and x3 into the fitted model to obtain a value ŷ. Suppose, for example, the observations on one of the study units were (y, x1, x2, x3) = (99, 4, 2, 3). On substituting the xs into the fitted model, we obtain

ŷ = 40 + 5(4) + 10(2) + 7(3) = 101.

This procedure provides us with an estimate of the expected value of Y for the observed set of xs. We actually observed y = 99 for these xs; however, if we had
observed a second value of Y for these same xs, that value would likely be some number other than 99. For each set of xs, our model assumes there is a distribution of ys corresponding to the random variable Y. Thus, ŷ is an estimate of the mean value of Y for that set of xs, and y − ŷ (= −2, in our example) is the residual. If we compute ŷ for each of the n sample study units, and then compute the n residuals y − ŷ, we can examine these residuals to investigate the adequacy of the model. In particular, we can obtain clues as to whether:

(i) the regression function is linear;
(ii) the residuals have constant variance;
(iii) the residuals are normally distributed;
(iv) the residuals are not independent;
(v) the model fits all but a few observations;
(vi) one or more predictor variables have been omitted from the model.

Methods for investigating model adequacy are called regression diagnostics. Regression diagnostics play an important role in statistical modeling because it is so easy to fit models with existing computer programs, whether or not those models are really appropriate. Use of good regression diagnostics will guard against blindly accepting misleading models.

Before leaving multiple regression, we should note that a special case involves fitting polynomial (curvilinear) models to data. We may have measured only one predictor variable x, but powers of x are also included in the regression model as separate predictor variables. For example, we may fit such models as the quadratic model, or parabola,

Y = β0 + β1x + β2x² + ε,

and the cubic model

Y = β0 + β1x + β2x² + β3x³ + ε.

MULTIPLE CORRELATION AND PARTIAL CORRELATION

Each sample observation yi of the response variable corresponds to a fitted, or predicted, value ŷi from the regression equation. Let us consider the pairs (yi, ŷi) as a set of data, and compute the correlation coefficient for these data. The result, called the multiple correlation coefficient, is denoted R. It is a measure of the
overall linear association between the response variable Y and the set of predictor variables x1, x2, ..., xq in the regression equation. In the special case where q = 1 (i.e. if there is only one predictor variable), the multiple correlation coefficient R is simply equal to r, the correlation coefficient between X and Y. In general, R² equals the ratio SSR/SST, and so measures the proportion of the variability explained by the regression model. If we fit a model with three predictor variables and find R² = 0.46, and then fit a model that includes an additional, fourth predictor variable and find R² = 0.72, we would conclude that the fourth variable accounts for an additional 26% (0.72 − 0.46) of the variability in Y. The square of the multiple correlation coefficient, R², is often reported when regression analyses have been performed.

The partial correlation coefficient is a measure of the strength of linear association between two variables after controlling for one or more other variables. Suppose, for example, we are interested in the correlation between serum cholesterol and triglyceride values in a random sample of men spanning a wide age range. Now it is known that both cholesterol and triglyceride levels tend to increase with age, so the mere fact that the sample includes men from a wide age range would tend to cause the two variables to be correlated in the sample. If we wish to discount this effect, controlling for age (i.e. determine that part of the correlation that is over and above the correlation induced by a common age), we would calculate the partial correlation coefficient, partialing out the effect of the variable age. The square of the partial correlation between the cholesterol and triglyceride levels would then be the proportion of the variability in cholesterol level that is accounted for by the addition of triglyceride to a regression model that already includes age as a predictor variable.
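As a rough illustration (not from the text), the partial correlation just described can be computed by correlating the residuals from two simple regressions on the controlled variable. A minimal NumPy sketch with invented data — the variable names and all the numbers are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented data: cholesterol and triglyceride both increase with age,
# which induces a correlation between them in the raw sample.
n = 200
age = rng.uniform(20, 65, n)
chol = 150 + 1.2 * age + rng.normal(0, 15, n)
trig = 80 + 1.5 * age + rng.normal(0, 25, n)

def residuals(y, x):
    """Residuals of y after a least-squares regression on x (with intercept)."""
    X = np.column_stack([np.ones_like(x), x])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ b

# Partial correlation of cholesterol and triglyceride, partialing out age:
# correlate the parts of each variable not explained by age.
r_partial = np.corrcoef(residuals(chol, age), residuals(trig, age))[0, 1]

# Raw (marginal) correlation for comparison.
r_raw = np.corrcoef(chol, trig)[0, 1]
print(r_raw, r_partial)
```

Because the only association here is the one induced by age, the raw correlation is substantial while the partial correlation is close to zero — exactly the effect the text describes discounting.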
Similarly, it would also equal the proportion of the variability in triglyceride level that is accounted for by the addition of cholesterol to a regression model that already includes age as a predictor variable.

REGRESSION TOWARD THE MEAN

We conclude this chapter with a concept that, although it includes in its name the word regression and is indeed related to the original idea behind regression, is distinct from modeling the distribution of a random variable in terms of one or more other variables. Consider the three highest triglyceride values among those listed in Table 3.1 for 30 medical students. They are (in mg/dl) 218, 225, and 287, with a mean of 243.3. Suppose we were to draw aliquots of blood from these three students on several subsequent days; should we expect the mean of the subsequent values to be 243.3? Alternatively, if we took later samples from those students with the three lowest values (45, 46, and 49), should we expect their average to remain the same (46.7)? The answer is that we should not: we should expect the mean of the
highest three to become smaller, and the mean of the lowest three to become larger, on subsequent determinations. This phenomenon is called regression toward the mean and occurs whenever we follow up a selected, as opposed to a complete, or random, sample.

To understand why regression toward the mean occurs, think of each student's measured triglyceride value as being made up of two parts: a true value (i.e. the mean of many, many determinations made on that student) and a random deviation from that true value; this latter could be due to measurement error and/or inherent variability in triglyceride value from day to day. When we select the three students with the highest triglyceride values based on a single measurement on each student, we tend to pick three that happen to have their highest random deviations, so that the mean of these three measurements (243.3 in our example) is most probably an overestimate of the mean of the three students' true values. Subsequent measures on these three students are equally likely to have positive or negative random deviations, so the subsequent mean will be expected to be the mean of their true values, and therefore probably somewhat lower. Similarly, if we pick the lowest three students in a sample, these single measurements will usually be underestimates of their true values, because they were probably picked partly because they happened to have their lowest random deviations when they were selected. If we were to make subsequent observations on the whole sample of 30 students, however, or on a random sample of them, regression toward the mean would not be expected to occur.

It is important to distinguish between regression toward the mean and a treatment effect.
If subjects with high cholesterol levels are given a potentially cholesterol-lowering drug, their mean cholesterol level would be expected to decrease on follow-up, even if the drug is ineffective, because of regression toward the mean. This illustrates the importance of having a control group taking a placebo, with subjects randomly assigned to the two groups. Regression toward the mean is then expected to occur equally in both groups, so that the true treatment effect can be estimated by comparing the groups.

Finally, something analogous to regression toward the mean tends to occur whenever multiple regression is used to select a set of variables that best explains, or predicts, a response variable. Given a sample of data, when a model is first fitted to a set of predictor variables, all the estimated regression coefficients are unbiased. But if we now select the most significant predictors, and only report these or include only these in a new model fitted using the same data set, we have automatically chosen the predictors that best explain the response in this particular sample, and the new estimates will be biased. In other words, we should expect estimates of these regression coefficients, when estimated from future samples taken from the same population, to be closer to zero and hence much less significant. This is one reason why many studies that first report a significant finding cannot be replicated by later investigators.
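The selection effect described above is easy to demonstrate by simulation. The following sketch uses invented numbers (not the students' data from Table 3.1): each subject has a fixed true level plus an independent day-to-day error, the three highest day-1 measurements are selected, and those same subjects are remeasured; averaging over many replications shows how far the selected group falls back:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical parameters: 30 subjects, each with a fixed "true" level
# plus independent day-to-day measurement error.
n_subjects, n_reps = 30, 2000
gaps = []
for _ in range(n_reps):
    true_vals = rng.normal(120, 45, n_subjects)
    day1 = true_vals + rng.normal(0, 20, n_subjects)
    day2 = true_vals + rng.normal(0, 20, n_subjects)  # fresh errors on day 2
    top3 = np.argsort(day1)[-3:]  # select the three highest day-1 values
    gaps.append(day1[top3].mean() - day2[top3].mean())

mean_gap = np.mean(gaps)
print(mean_gap)  # positive: the selected top three drop back on remeasurement
```

The day-2 mean of the selected subjects is systematically lower than their day-1 mean, because the top three were partly selected for having large positive day-1 errors; a full random sample shows no such drop.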
SUMMARY

1. In the equation of a straight line, y = β0 + β1x, β0 is the intercept and β1 is the slope.

2. In simple linear regression analysis, it is assumed that one variable (Y), the response variable, is subject to random fluctuations, whereas the other variable (x), the predictor variable, is under the investigator's control. Minimizing the sum of the squared deviations of n sample values of Y from a straight line leads to the least-squares estimates b0 of β0 and b1 of β1, and hence the prediction line ŷ = b0 + b1x. The sample residuals about this line sum to zero.

3. The total sum of squared deviations from the sample mean can be partitioned into two parts: that due to the regression model and that due to error, or the residual sum of squares. Dividing these by their respective degrees of freedom gives rise to mean squares.

4. Under certain conditions the estimates b0 and b1 are maximum likelihood estimates, and the ratio of the mean squares (that due to regression divided by that due to residual) can be compared to the F-distribution with 1 and n − 2 d.f. to test the hypothesis β1 = 0. These conditions are: (a) for a fixed x, Y must be normally distributed with mean β0 + β1x; (b) the Ys must be independent; and (c) there must be homoscedasticity: the variance of Y must be the same at each value of x.

5. It is possible to determine a standard error for b1 and, under the same conditions, b1 divided by its standard error comes from a t-distribution with n − 2 d.f. In this situation, the F-test and the t-test are equivalent tests of the null hypothesis β1 = 0.

6. The regression of Y on x can be used to predict the distribution of Y for a given x value, and the regression of X on y can be used to predict the distribution of X for a given y value. The line that best describes the underlying linear relationship between X and Y is somewhere between these two lines.

7. Covariance is a measure of how two random variables vary together.
When divided by the product of the variables' standard deviations, the result is the (product-moment) correlation, a dimensionless number that measures the strength of the linear association between the two variables. If all the points lie on a straight line with positive slope, the correlation is +1; if they all lie on a straight line with negative slope, the correlation is −1. A nonzero correlation between two random variables does not necessarily imply a causal relationship between them. A correlation of 0 implies no straight-line association, but there may nevertheless be a curvilinear association.
8. The rank correlation (computed from the ranks of the observations) measures how closely the data points fit a monotonic curve. The rank correlation is not greatly influenced by a few outlying values.

9. Multiple regression is used to obtain a prediction equation for a response variable Y from a set of predictor variables x1, x2, .... The significance of the predictor variables can be jointly tested by an F-ratio, or singly tested by t-statistics. A stepwise analysis is often used to obtain the best subset of the x-variables with which to predict the distribution of Y, but theoretically we can only be sure of reaching the best subset by examining all possible subsets. The prediction equation obtained need not reflect any causal relationship between the response and predictor variables.

10. Regression diagnostics are used to investigate the adequacy of a regression model in describing a set of data. An examination of the residuals from a model gives clues as to whether the regression model is adequate, including whether the residuals are approximately normally distributed with constant variance.

11. The square of the multiple correlation coefficient is a measure of the proportion of the variability in a response variable that can be accounted for by a set of predictor variables. The partial correlation coefficient between two variables is a measure of their linear association after allowing for the variables that have been partialed out.

12. Whenever we make subsequent measurements on study units that have been selected for follow-up because they were extreme with respect to the variable being measured, we can expect regression toward the mean (i.e. study units with high initial values will tend to have lower values later, and study units with low initial values will tend to have higher values later).
Similarly, if we select predictor variables that were most significant in the analysis of a particular sample, we can expect their effects to be closer to zero and less significant in a subsequent sample.

FURTHER READING

Kleinbaum, D.G., Kupper, L.L., Nizam, A., and Muller, K.E. (2008) Applied Regression and Other Multivariable Methods, 4th edn. Pacific Grove, CA: Duxbury. (This book covers many aspects of regression analysis, including computational formulas. It requires only a limited mathematical background to read.)
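Before turning to the problems, several of the identities collected in the summary — the residuals summing to zero (item 2), the sum-of-squares partition (item 3), and the relationship between covariance, correlation, and explained variability (items 7 and 11) — can be checked numerically. A sketch with invented data, not taken from the text:

```python
import numpy as np

rng = np.random.default_rng(2)

# Invented data: y roughly linear in x with random error.
x = np.linspace(0, 10, 25)
y = 3.0 + 1.5 * x + rng.normal(0, 2, x.size)

# Least-squares slope and intercept: b1 = cov(x, y) / var(x), b0 = ybar - b1*xbar.
b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)

# Item 2: the sample residuals about the fitted line sum to zero.
print(np.isclose(resid.sum(), 0.0))

# Item 3: the total sum of squares splits into regression SS plus residual SS.
ss_total = ((y - y.mean()) ** 2).sum()
ss_resid = (resid ** 2).sum()
ss_reg = ss_total - ss_resid

# Item 7: correlation = covariance / (product of standard deviations), and
# r squared equals the proportion of variability explained, SS_R / SS_T.
r = np.cov(x, y, ddof=1)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))
print(np.isclose(r ** 2, ss_reg / ss_total))
```

With only one predictor, r² here is also the square of the multiple correlation coefficient R² of item 11.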
PROBLEMS

1. Suppose two variables under study are temperature in degrees Fahrenheit (y) and temperature in degrees Celsius (x). The regression line for this situation is y = 32 + (9/5)x. Assuming there is no error in observing temperature, the correlation coefficient would be expected to be
A. 9/5
B. 5/9
C. −1
D. +1
E.

2. An investigator studies 50 pairs of unlike-sex twins and reports that the regression of female birth weight (y) on male birth weight (x) is given by the following equation (all weights in grams):
One can conclude from this that
A. the mean weight of twin brothers of girls who weigh 1000 g is predicted to be 1624 g
B. the mean weight of twin sisters of boys who weigh 1000 g is predicted to be 1624 g
C. the sample mean weight of the girls is 1221 g
D. the sample mean weight of the boys is 1221 g
E. the sample correlation between girl's weight and boy's weight is

3. In a regression analysis, the residuals of a fitted model can be used to investigate all the following except
A. the model fits all but a few observations
B. the error terms are normally distributed
C. the regression function is linear
D. the robustness of the rank sum test
E. one or more predictor variables have been omitted from the model
4. Which of the following plots might represent data for variables X and Y that have a correlation coefficient equal to 0.82?

[Four scatter plots, labeled (a), (b), (c), and (d), not reproduced here.]

A. a
B. b
C. c
D. d
E. a and b

5. The correlation coefficient for the data in the graph below would be expected to have a value that is
A. a positive number of magnitude approximately equal to one
B. a negative number of magnitude approximately equal to one
C. a positive number of magnitude approximately equal to zero
D. a negative number of magnitude approximately equal to zero
E. none of the above
6. The Pearson correlation coefficient between variables A and B is known to be 0.50, and the correlation between B and C is also known. What can we infer about the relationship between A and C?
A. There is no association between A and C.
B. A and C are independent.
C. A and C are negatively correlated.
D. A and C have a linear association.
E. The relationship cannot be determined from the information given.

7. It is reported that both Pearson's product-moment correlation and Spearman's rank correlation between two variables are zero. This implies
A. the two variables are independent
B. the two variables tend to follow a monotonic curve
C. there is no linear or monotonic association between the two variables
D. all of the above
E. none of the above

8. A data analyst is attempting to determine the adequacy of the model y = β0 + β1x1 + β2x2 + ε for a particular set of data. The parameters of the model are
A. least-squares estimates
B. unbiased
C. β1 and β2
D. robust
E. β0, β1, and β2

9. A study was conducted to investigate the relationship between the estriol level of pregnant women and subsequent height of their children at birth. A scatter diagram of the data suggested a linear relationship. Pearson's product-moment correlation coefficient was computed and found to be r = . The researcher decided to re-express height in inches rather than centimeters and then recompute r. The recalculation should yield the following value of r:
A.
B.
C.
D.
E. cannot be determined from data available

10. An investigator finds the following results for a regression analysis based on the model Y = β0 + β1x1 + β2x2 + β3x3 + ε:

Source of Variability    d.f.    Sum of Squares    Mean Square    F    p-value
Regression model
Error (residual)
Total

Assuming all assumptions that were made in the analysis are justified, an appropriate conclusion is
A. the mean of the outcome variable Y is not significantly different from zero
B. the intercept term β0 is significantly different from zero
C. the parameters of the model are significantly different from zero
D. the predictor variables x1, x2, and x3 do not account for a significant proportion of the variability in Y
E. none of the above

11. A forward stepwise regression analysis was performed according to the following models:

Step 1: Y = β0 + β1x1 + ε

Source of Variability    d.f.    Sum of Squares    Mean Square    F    p-value
Regression model
Error (residual)

Step 2: Y = β0 + β1x1 + β2x2 + ε
Source of Variability    d.f.    Sum of Squares    Mean Square    F    p-value
Regression model
Added to regression by x2
Error (residual)
Total

The analysis was summarized as follows:
Step 1: R² = 32.5%
Step 2: R² = 47.5%

Assuming all assumptions made in the analysis are justified, an appropriate conclusion is
A. x2 accounts for a significant amount of the variability in Y over and above that accounted for by x1
B. neither x1 nor x2 accounts for a significant amount of the variability in Y
C. the proportion of variability explained by the regression model containing both x1 and x2 is less than should be expected in a stepwise regression analysis
D. the residual sum of squares is too large for meaningful interpretation of the regression analysis
E. the F-ratio is too small for interpretation in step 2

12. A multiple regression analysis was performed assuming the model Y = β0 + β1x1 + β2x2 + β3x3 + ε. The following results were obtained:

Parameter    Estimate    Standard Error of Estimate    t-test    p-value
Assuming all assumptions made in the analysis are justified, an appropriate conclusion is
A. none of the predictor variables considered belong in the model
B. all of the predictor variables considered belong in the model
C. x1 and x2 belong in the model, but x3 does not
D. x1 and x3 belong in the model, but x2 does not
E. x2 and x3 belong in the model, but x1 does not

13. An investigator finds the following results for a regression analysis of data on 50 subjects based on the model Y = β0 + β1x1 + β2x2 + β3x3 + ε:

Source of Variability    d.f.    Sum of Squares    Mean Square    F    p-value
Regression model                                                       <0.01
Error (residual)
Total

Assuming all assumptions that were made in the analysis are justified, an appropriate conclusion is
A. the mean of the response variable Y is not significantly different from zero
B. the intercept term β0 is significantly different from zero
C. the parameters of the model are not significantly different from zero
D. the predictor variables x1, x2, and x3 account for a significant proportion of the variability in Y
E. none of the above

14. Regression diagnostics are useful in determining the
A. coefficient of variation
B. adequacy of a fitted model
C. degrees of freedom in a two-way contingency table
D. percentiles of the t-distribution
E. parameters of a normal distribution

15. A group of men were examined during a routine screening for elevated blood pressure. Those men with the highest blood pressure, namely those with diastolic blood pressure higher than the 80th percentile for
the group, were re-examined at a follow-up examination 2 weeks later. It was found that the mean for the re-examined men had decreased by 8 mmHg at the follow-up examination. The most likely explanation for most of the decrease is
A. the men were more relaxed for the second examination
B. some of the men became concerned and sought treatment for their blood pressure
C. observer bias
D. the observers were better trained for the second examination
E. regression toward the mean
Section V.2: Magnitudes, Directions, and Components of Vectors
Section V.: Magnitudes, Directions, and Components of Vectors Vectors in the plane If we graph a vector in the coordinate plane instead of just a grid, there are a few things to note. Firstl, directions
UNDERSTANDING THE TWO-WAY ANOVA
UNDERSTANDING THE e have seen how the one-way ANOVA can be used to compare two or more sample means in studies involving a single independent variable. This can be extended to two independent variables
Using Excel for inferential statistics
FACT SHEET Using Excel for inferential statistics Introduction When you collect data, you expect a certain amount of variation, just caused by chance. A wide variety of statistical tests can be applied
" Y. Notation and Equations for Regression Lecture 11/4. Notation:
Notation: Notation and Equations for Regression Lecture 11/4 m: The number of predictor variables in a regression Xi: One of multiple predictor variables. The subscript i represents any number from 1 through
Exercise 1.12 (Pg. 22-23)
Individuals: The objects that are described by a set of data. They may be people, animals, things, etc. (Also referred to as Cases or Records) Variables: The characteristics recorded about each individual.
Basic Statistics and Data Analysis for Health Researchers from Foreign Countries
Basic Statistics and Data Analysis for Health Researchers from Foreign Countries Volkert Siersma [email protected] The Research Unit for General Practice in Copenhagen Dias 1 Content Quantifying association
Why should we learn this? One real-world connection is to find the rate of change in an airplane s altitude. The Slope of a Line VOCABULARY
Wh should we learn this? The Slope of a Line Objectives: To find slope of a line given two points, and to graph a line using the slope and the -intercept. One real-world connection is to find the rate
FINAL EXAM REVIEW MULTIPLE CHOICE. Choose the one alternative that best completes the statement or answers the question.
FINAL EXAM REVIEW MULTIPLE CHOICE. Choose the one alternative that best completes the statement or answers the question. Determine whether or not the relationship shown in the table is a function. 1) -
1 Maximizing pro ts when marginal costs are increasing
BEE12 Basic Mathematical Economics Week 1, Lecture Tuesda 12.1. Pro t maimization 1 Maimizing pro ts when marginal costs are increasing We consider in this section a rm in a perfectl competitive market
Simple linear regression
Simple linear regression Introduction Simple linear regression is a statistical method for obtaining a formula to predict values of one variable from another where there is a causal relationship between
SECTION 2-2 Straight Lines
- Straight Lines 11 94. Engineering. The cross section of a rivet has a top that is an arc of a circle (see the figure). If the ends of the arc are 1 millimeters apart and the top is 4 millimeters above
A Primer on Forecasting Business Performance
A Primer on Forecasting Business Performance There are two common approaches to forecasting: qualitative and quantitative. Qualitative forecasting methods are important when historical data is not available.
Polynomial Degree and Finite Differences
CONDENSED LESSON 7.1 Polynomial Degree and Finite Differences In this lesson you will learn the terminology associated with polynomials use the finite differences method to determine the degree of a polynomial
POLYNOMIAL FUNCTIONS
POLYNOMIAL FUNCTIONS Polynomial Division.. 314 The Rational Zero Test.....317 Descarte s Rule of Signs... 319 The Remainder Theorem.....31 Finding all Zeros of a Polynomial Function.......33 Writing a
Statistics. Measurement. Scales of Measurement 7/18/2012
Statistics Measurement Measurement is defined as a set of rules for assigning numbers to represent objects, traits, attributes, or behaviors A variableis something that varies (eye color), a constant does
2.6. The Circle. Introduction. Prerequisites. Learning Outcomes
The Circle 2.6 Introduction A circle is one of the most familiar geometrical figures and has been around a long time! In this brief Section we discuss the basic coordinate geometr of a circle - in particular
Descriptive Statistics
Descriptive Statistics Primer Descriptive statistics Central tendency Variation Relative position Relationships Calculating descriptive statistics Descriptive Statistics Purpose to describe or summarize
Part 2: Analysis of Relationship Between Two Variables
Part 2: Analysis of Relationship Between Two Variables Linear Regression Linear correlation Significance Tests Multiple regression Linear Regression Y = a X + b Dependent Variable Independent Variable
17. SIMPLE LINEAR REGRESSION II
17. SIMPLE LINEAR REGRESSION II The Model In linear regression analysis, we assume that the relationship between X and Y is linear. This does not mean, however, that Y can be perfectly predicted from X.
Systems of Linear Equations: Solving by Substitution
8.3 Sstems of Linear Equations: Solving b Substitution 8.3 OBJECTIVES 1. Solve sstems using the substitution method 2. Solve applications of sstems of equations In Sections 8.1 and 8.2, we looked at graphing
NAME DATE PERIOD. 11. Is the relation (year, percent of women) a function? Explain. Yes; each year is
- NAME DATE PERID Functions Determine whether each relation is a function. Eplain.. {(, ), (0, 9), (, 0), (7, 0)} Yes; each value is paired with onl one value.. {(, ), (, ), (, ), (, ), (, )}. No; in the
11. Analysis of Case-control Studies Logistic Regression
Research methods II 113 11. Analysis of Case-control Studies Logistic Regression This chapter builds upon and further develops the concepts and strategies described in Ch.6 of Mother and Child Health:
Slope-Intercept Form and Point-Slope Form
Slope-Intercept Form and Point-Slope Form In this section we will be discussing Slope-Intercept Form and the Point-Slope Form of a line. We will also discuss how to graph using the Slope-Intercept Form.
SECTION 5-1 Exponential Functions
354 5 Eponential and Logarithmic Functions Most of the functions we have considered so far have been polnomial and rational functions, with a few others involving roots or powers of polnomial or rational
Outline. Topic 4 - Analysis of Variance Approach to Regression. Partitioning Sums of Squares. Total Sum of Squares. Partitioning sums of squares
Topic 4 - Analysis of Variance Approach to Regression Outline Partitioning sums of squares Degrees of freedom Expected mean squares General linear test - Fall 2013 R 2 and the coefficient of correlation
Module 5: Multiple Regression Analysis
Using Statistical Data Using to Make Statistical Decisions: Data Multiple to Make Regression Decisions Analysis Page 1 Module 5: Multiple Regression Analysis Tom Ilvento, University of Delaware, College
Mathematical goals. Starting points. Materials required. Time needed
Level A7 of challenge: C A7 Interpreting functions, graphs and tables tables Mathematical goals Starting points Materials required Time needed To enable learners to understand: the relationship between
Find the Relationship: An Exercise in Graphing Analysis
Find the Relationship: An Eercise in Graphing Analsis Computer 5 In several laborator investigations ou do this ear, a primar purpose will be to find the mathematical relationship between two variables.
. 58 58 60 62 64 66 68 70 72 74 76 78 Father s height (inches)
PEARSON S FATHER-SON DATA The following scatter diagram shows the heights of 1,0 fathers and their full-grown sons, in England, circa 1900 There is one dot for each father-son pair Heights of fathers and
Linear Inequality in Two Variables
90 (7-) Chapter 7 Sstems of Linear Equations and Inequalities In this section 7.4 GRAPHING LINEAR INEQUALITIES IN TWO VARIABLES You studied linear equations and inequalities in one variable in Chapter.
Lecture 11: Chapter 5, Section 3 Relationships between Two Quantitative Variables; Correlation
Lecture 11: Chapter 5, Section 3 Relationships between Two Quantitative Variables; Correlation Display and Summarize Correlation for Direction and Strength Properties of Correlation Regression Line Cengage
Linear Equations in Two Variables
Section. Sets of Numbers and Interval Notation 0 Linear Equations in Two Variables. The Rectangular Coordinate Sstem and Midpoint Formula. Linear Equations in Two Variables. Slope of a Line. Equations
10.1. Solving Quadratic Equations. Investigation: Rocket Science CONDENSED
CONDENSED L E S S O N 10.1 Solving Quadratic Equations In this lesson you will look at quadratic functions that model projectile motion use tables and graphs to approimate solutions to quadratic equations
2.7 Applications of Derivatives to Business
80 CHAPTER 2 Applications of the Derivative 2.7 Applications of Derivatives to Business and Economics Cost = C() In recent ears, economic decision making has become more and more mathematicall oriented.
5. Linear Regression
5. Linear Regression Outline.................................................................... 2 Simple linear regression 3 Linear model............................................................. 4
Association Between Variables
Contents 11 Association Between Variables 767 11.1 Introduction............................ 767 11.1.1 Measure of Association................. 768 11.1.2 Chapter Summary.................... 769 11.2 Chi
THE POWER RULES. Raising an Exponential Expression to a Power
8 (5-) Chapter 5 Eponents and Polnomials 5. THE POWER RULES In this section Raising an Eponential Epression to a Power Raising a Product to a Power Raising a Quotient to a Power Variable Eponents Summar
Descriptive statistics; Correlation and regression
Descriptive statistics; and regression Patrick Breheny September 16 Patrick Breheny STA 580: Biostatistics I 1/59 Tables and figures Descriptive statistics Histograms Numerical summaries Percentiles Human
II. DISTRIBUTIONS distribution normal distribution. standard scores
Appendix D Basic Measurement And Statistics The following information was developed by Steven Rothke, PhD, Department of Psychology, Rehabilitation Institute of Chicago (RIC) and expanded by Mary F. Schmidt,
Elements of statistics (MATH0487-1)
Elements of statistics (MATH0487-1) Prof. Dr. Dr. K. Van Steen University of Liège, Belgium December 10, 2012 Introduction to Statistics Basic Probability Revisited Sampling Exploratory Data Analysis -
Chapter 6 Quadratic Functions
Chapter 6 Quadratic Functions Determine the characteristics of quadratic functions Sketch Quadratics Solve problems modelled b Quadratics 6.1Quadratic Functions A quadratic function is of the form where
