PubH 7405: BIOSTATISTICS REGRESSION, 2011
PRACTICE PROBLEMS FOR SIMPLE LINEAR REGRESSION
(Some are new and some are from old exams; the last 4 are from the 2010 Midterm)

Problem 1: The Pearson Correlation Coefficient (r) between two variables X and Y can be expressed in several equivalent forms; one of which is

$$r(X, Y) = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$

where $\bar{x}$ ($\bar{y}$) is the sample mean and $s_x$ ($s_y$) the sample standard deviation of X (Y).
(1) If a and c are two positive constants and b and d are any two constants, prove that:

$$r(aX + b, cY + d) = r(X, Y)$$

(2) Is the result in (1) still true if we do not assume that a and c are positive?
(3) For a group of men, if the Correlation Coefficient between Weight in pounds and Height in inches is r = .9, what is the value of that Correlation Coefficient if Weight is measured in kilograms and Height in centimeters? Explain your answer.
(4) Body Temperature (BT) can be measured at many locations in your body. Suppose, for a certain group of children with fever, the Correlation Coefficient between oral BT and rectal BT is r = .91 when BT is measured in Fahrenheit scale (°F); what is the value of that Correlation Coefficient if BT is measured in Celsius scale (°C)? Explain your answer.

Problem 2: Let X and Y be two variables in a study; the regression line that can be used to predict Y from X values is:

$$\text{Predicted } y = b_0 + b_1 x$$

The estimated intercept and slope can be expressed in several equivalent forms; one of which is

$$b_1 = r \frac{s_y}{s_x}, \qquad b_0 = \bar{y} - b_1 \bar{x}$$

where $\bar{x}$ is the sample mean and $s_x$ is the sample standard deviation of X.
(1) If a and c are two positive constants and b and d are any two constants, consider the data transformation:

$$U = aX + b, \qquad V = cY + d$$

and denote the estimated intercept and slope of the regression line predicting V from U as $B_0$ and $B_1$. Express $B_0$ and $B_1$ as functions of a, b, c, d, $b_0$, and $b_1$.
(2) What would be the results of (1) in the special case that a = c and b = d = 0? What would be the results of (1) in the special case that a = 1 and b = d = 0?
(3) During some operations, it would be more convenient to measure Blood Pressure (BP) from the patient's leg than from a cuff on the arm. Let X = leg BP and Y = arm BP; the results for a group undergoing orthopedic surgeries are $b_0$ = 9.05 and $b_1$ = 0.761 when BP is measured in millimeters of mercury (Hg). What would these results be if BP is measured in centimeters of Hg? Explain your answer.
(4) The Apgar score was devised in 1952 by Dr. Virginia Apgar as a simple method to quickly assess the health of the newborn. Let X = Apgar score and Y = Birth Weight; the results for a group of newborns are $b_0$ = 1.306 and $b_1$ = 0.05 when Birth Weight is measured in kilograms. What would these results be if Birth Weight is measured in pounds? Explain your answer.

Problem 3: Let X and Y be two variables in a study.
(1) Investigator #1 is interested in predicting Y from X, and fits and computes a regression line for this purpose. Investigator #2 is interested in predicting X from Y, and computes his regression line for that purpose (note that in the real problem of parallel-line bioassays, with X = log(dose) and Y = response, we have both of these steps: the first for the Standard Preparation and the second for the Test Preparation). Are these two regression lines the same? If so, why? If not, compute the ratio and the product of the two slopes as functions of standard statistics.
(2) Let X = Height and Y = Weight; we have for a group of 409 men:

$\sum x$ = 28,359 inches
$\sum y$ = 64,938 pounds
$\sum x^2$ = 1,969,716 (inch)²
$\sum y^2$ = 10,517,079 (pound)²
$\sum xy$ = 4,513,810 (inch)(pound)

(a) Calculate the Coefficient of Correlation.
(b) Calculate the Slopes, the product, and the ratio of slopes in question (1).
(c) Calculate the Intercept for Investigator #2.
(d) Calculate the 95 percent Confidence Interval for the Slope for Investigator #1.

Problem 4: Let X and Y be two variables in a study; the regression line that can be used to predict Y from X values is:

$$\text{Predicted } y = b_0 + b_1 x$$

so that the error of the prediction is:

$$e = y - \text{Predicted } y = y - (b_0 + b_1 x)$$

(1) From the Sum of Squared Errors:

$$\sum e^2 = \sum [y - (b_0 + b_1 x)]^2$$

derive the two normal equations.
(2) Use the two normal equations in (1) to prove that (2.1) the average error is zero, and (2.2) the errors of prediction and the values of the Predictor are uncorrelated (the coefficient of correlation is zero, r(e, x) = 0).
(3) Recall that if a, b, c, and d are constants, with a and c positive, then r(ax + b, cy + d) = r(x, y); use this and the result in (2.2) to show that the errors of prediction and the predicted values of the Response are uncorrelated (the coefficient of correlation is zero, r(predicted y, e) = 0).
(4) Prove that Var(y) = Var(Predicted y) + Var(e).
(5) (BONUS) From the result of (4), prove that Var(e) = $(1 - r^2)$Var(y); hence, $-1 \le r \le 1$.
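
The identities in Problem 4 can be checked numerically before attempting the proofs. The sketch below is only an illustration, not a proof: the data in `x` and `y` are hypothetical (they do not come from any problem in this set), and the helper names `mean`, `cov`, `var`, and `corr` are introduced here for convenience. It fits the least-squares line and prints quantities that should all be zero up to rounding error.

```python
# Hypothetical data, chosen only for illustration (not from any problem above).
x = [1.0, 2.0, 4.0, 5.0, 7.0, 8.0]
y = [2.1, 2.9, 5.2, 5.8, 8.1, 9.4]


def mean(v):
    return sum(v) / len(v)


def cov(u, v):
    """Sample covariance with divisor n - 1."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)


def var(v):
    return cov(v, v)


def corr(u, v):
    return cov(u, v) / (var(u) * var(v)) ** 0.5


# Least-squares estimates: b1 = r * s_y / s_x = cov(x, y) / var(x), b0 = ybar - b1 * xbar.
b1 = cov(x, y) / var(x)
b0 = mean(y) - b1 * mean(x)

fitted = [b0 + b1 * xi for xi in x]
resid = [yi - fi for yi, fi in zip(y, fitted)]
r = corr(x, y)

# Each entry should be zero up to floating-point error.
# A zero covariance implies a zero correlation coefficient.
checks = {
    "mean error": mean(resid),
    "cov(e, x)": cov(resid, x),
    "cov(e, predicted y)": cov(resid, fitted),
    "Var(y) - Var(predicted y) - Var(e)": var(y) - var(fitted) - var(resid),
    "Var(e) - (1 - r^2) Var(y)": var(resid) - (1 - r * r) * var(y),
}
for name, value in checks.items():
    print(f"{name}: {value:.2e}")
```

Replacing `x` and `y` with any other data set (with at least two distinct x values) should leave all five quantities at zero, which is what parts (2) through (5) ask you to establish in general.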
Problem 5: From a sample of n = 15 readings on X = Traffic Volume (cars per hour) and Y = Carbon Monoxide Concentration (PPM) taken at certain metropolitan air quality sampling sites, we have these statistics:

$\sum x$ = 3,550
$\sum y$ = 167.8
$\sum x^2$ = 974,450
$\sum y^2$ = 1,915.36
$\sum xy$ = 41,945

(1) Compute the sample Correlation Coefficient r.
(2) Test for $H_0$: $\rho$ = 0 at the .05 level of significance and state your conclusion in the context of this problem ($\rho$ is the Population Coefficient of Correlation).
(3) Determine either the exact p-value for the test or its upper bound.
(4) Construct the 95 percent Confidence Interval for $\rho$ via Fisher's transformation.

Problem 6: Consider the regression line/model without intercept, Predicted y = bx.
(1) Minimize $S = \sum (y - bx)^2$ to verify that the estimated slope of the regression line for predicting Y from X is given by $b_1 = \sum xy / \sum x^2$.
(2) Consider another alternative estimate of the slope, the ratio of the sample means, $b_2 = \bar{y}/\bar{x}$. Show that if Var(Y) is constant then Var($b_1$) $\le$ Var($b_2$). (However, if the variance Var(Y) is proportional to x, Var($b_2$) $\le$ Var($b_1$); an example of this situation would occur in a radioactivity counting experiment where the same material is observed for replicate periods of different lengths; counts are distributed as Poisson.)

Problem 7: The data below show the consumption of alcohol (X, liters per year per person, 14 years or older) and the death rate from cirrhosis, a liver disease (Y, deaths per 100,000 population) in 15 countries (each country is an observation unit).

Country        Alc. Consumption (x)   Death Rate from Cirrhosis (y)       x²        y²        xy
France                 24.7                     46.1                    610.09   2125.21   1138.67
Italy                  15.2                     23.6                    231.04    556.96    358.72
Germany                12.3                     23.7                    151.29    561.69    291.51
Australia              10.9                      7.0                    118.81     49.00     76.30
Belgium                10.8                     12.3                    116.64    151.29    132.84
USA                     9.9                     14.2                     98.01    201.64    140.58
Canada                  8.3                      7.4                     68.89     54.76     61.42
England                 7.2                      3.0                     51.84      9.00     21.60
Sweden                  6.6                      7.2                     43.56     51.84     47.52
Japan                   5.8                     10.6                     33.64    112.36     61.48
Netherlands             5.7                      3.7                     32.49     13.69     21.09
Ireland                 5.6                      3.4                     31.36     11.56     19.04
Norway                  4.2                      4.3                     17.64     18.49     18.06
Finland                 3.9                      3.6                     15.21     12.96     14.04
Israel                  3.1                      5.4                      9.61     29.16     16.74
Total                 134.2                    175.5                   1630.12   3959.61   2419.61

(1) Draw a Scatter Diagram to show the association, if any, between these two variables; can you draw any conclusion/observation without doing any calculation?
(2) Calculate the Coefficient of Correlation and its 95% Confidence Interval using Fisher's transformation; then state your interpretation.
(3) Form the regression line by calculating the estimated Intercept and Slope; if the model holds, what would be the death rate from Cirrhosis for a country with an alcohol consumption rate of 11.0 liters per year per person?
(4) What fraction of the total variability of Y is explained by its relationship to X? Form the ANOVA Table.
(5) Test for $H_0$: Slope = 0 at the .05 level of significance and state your conclusion in terms of this problem description.

Problem 8: When a patient is diagnosed as having cancer of the prostate, an important question in deciding on a treatment strategy for the patient is whether or not the cancer has spread to the neighboring lymph nodes. The question is so critical in prognosis and treatment that it is customary to operate on the patient (i.e., perform a laparotomy) for the sole purpose of examining the nodes and removing tissue samples to examine under the microscope for evidence of cancer. However, certain variables that can be measured without surgery are predictive of the nodal involvement; and the purpose of the study presented here was to examine the data for 53 prostate cancer patients receiving surgery, to determine which of five preoperative variables are predictive of nodal involvement. For each of the 53 patients, there is information on the patient's age and four other potential independent variables: the level of serum acid phosphatase (the factor of primary interest), and three binary variables, namely X-ray reading, pathology reading (grade) of a biopsy of the tumor obtained by needle before surgery, and a rough measure of the size and location of the tumor (stage) obtained by palpation with the fingers via the rectum. The primary outcome of interest, or dependent variable, represents the finding at surgery, which is binary, indicating nodal involvement or no nodal involvement found at surgery.
The analysis, with some results included here, is not about the main objective of predicting nodal involvement; it is a side analysis focusing on a possible confounder, age. The objective here is to see if the level of serum acid phosphatase and the patient's age are related.

Computer Program (SAS):

options ls=79;
data Pcancer;
  input Xray Stage Grade Age Acid Nodes;
  cards;
0 0 0 66 48 0
0 0 0 68 56 0
..
1 1 0 64 89 1
1 1 1 68 126 1
;
Proc UNIVARIATE data=pcancer;
  Var Age Acid;
run;
Proc CORR data=pcancer;
run;
Proc REG data=pcancer;
  model Acid = Age/COVB CLM;
  plot r.*age="+" r.*p.="*";
run;
Computer Output/results

PART A: Univariate Procedure

Variable=AGE
Moments
N            53          Sum Wgts     53
Mean         59.37736    Sum          3147
Std Dev      6.168239    Variance     38.04717
Skewness     -0.49481    Kurtosis     -0.69677
Quantiles
100% Max     68          99%          68
 75% Q3      65          95%          67
 50% Med     60          90%          67
 25% Q1      56          10%          51
  0% Min     45           5%          49

Variable=ACID
Moments
N            53          Sum Wgts     53
Mean         69.41509    Sum          3679
Std Dev      26.20146    Variance     686.5167
Skewness     .51881      Kurtosis     7.9481
Quantiles
100% Max     187         99%          187
 75% Q3      78          95%          126
 50% Med     65          90%          98
 25% Q1      50          10%          48
  0% Min     40           5%          46

PART B: Correlation Analysis

Pearson Correlation Coefficients / Prob > |R| under Ho: Rho=0 / N = 53
(first line of each pair: correlation coefficient; second line: p-value)

            XRAY      STAGE      GRADE        AGE       ACID      NODES
XRAY     1.00000    0.19761    0.20217   -0.00453    0.14973    0.46140
          0.0        0.1561     0.1466     0.9743     0.2846     0.0005
STAGE    0.19761    1.00000    0.37463   -0.01970   -0.02939    0.37463
          0.1561     0.0        0.0057     0.8887     0.8345     0.0057
GRADE    0.20217    0.37463    1.00000   -0.04808   -0.08294    0.27727
          0.1466     0.0057     0.0        0.7324     0.5549     0.0444
AGE     -0.00453   -0.01970   -0.04808    1.00000    0.05399   -0.14365
          0.9743     0.8887     0.7324     0.0        0.7010     0.3048
ACID     0.14973   -0.02939   -0.08294    0.05399    1.00000    0.24252
          0.2846     0.8345     0.5549     0.7010     0.0        0.0802
NODES    0.46140    0.37463    0.27727   -0.14365    0.24252    1.00000
          0.0005     0.0057     0.0444     0.3048     0.0802     0.0
PART C: Regression Analysis, Dependent Variable: ACID

Analysis of Variance
                      Sum of         Mean
Source      DF       Squares       Square    F Value    Prob>F
Model      ____    104.04189     ________   ________    0.7010
Error      ____  35594.82603     ________
C Total    ____

Root MSE   ________    R-square   0.0029
Dep Mean   69.41509

Parameter Estimates
               Parameter       Standard    T for H0:
Variable  DF    Estimate          Error  Parameter=0   Prob > |T|
INTERCEP   1   55.798699     35.4530345        1.574       0.1217
AGE        1    0.229320     0.59394400        0.386       0.7010

Covariance of Estimates
COVB          INTERCEP             AGE
INTERCEP  1256.9176378    -20.94651955
AGE       -20.94651955    0.3527694745

        Dep Var    Predict    Std Err    Lower95%   Upper95%
Obs        ACID      Value    Predict        Mean       Mean    Residual
  1     48.0000    70.9338     ______     60.1898    81.6778    -22.9338
  2     56.0000    71.3924      6.277     58.7914    83.9934    -15.3924
  3     50.0000    70.9338      5.352     60.1898    81.6778    -20.9338
  4     52.0000    68.6406      4.146     60.3164    76.9648    -16.6406
  5     50.0000    69.0992      3.720     61.6312    76.5673    -19.0992
  6     49.0000    69.5579      3.648     62.2349    76.8809    -20.5579
  7     46.0000    70.7045      4.932     60.8038    80.6052    -24.7045
  8     62.0000    _______      3.648     62.2349    76.8809    ________
  9     56.0000    67.2647      6.648     53.9193    80.6101    -11.2647
 10     55.0000    67.0354      7.152     52.6761    81.3946    -12.0354
... etc ...
 44     76.0000    69.5579      3.648     62.2349    76.8809      6.4421
 45     70.0000    66.1181      9.278     47.4909    84.7453      3.8819
 46     78.0000    68.6406      4.146     60.3164    76.9648      9.3594
 47     70.0000    66.3474      8.735     48.8114    83.8834      3.6526
 48     67.0000    71.1631      5.802     59.5146    82.8116     -4.1631
 49     82.0000    70.2458      4.219     61.7763    78.7154     11.7542
 50     67.0000    68.8699      3.894     61.0526    76.6872     -1.8699
 51     72.0000    67.4940      6.158     55.1305    79.8575      4.5060
 52     89.0000    70.4752      4.550     61.3397    79.6106     18.5248
 53    126.0000    71.3924      6.277     58.7914    83.9934     54.6076

Sum of Residuals              0
Sum of Squared Residuals      35594.826
PART D: Graphs

Graph #1: plot of RESIDUAL (vertical axis, tick marks from -40 to 120) against AGE (horizontal axis, tick marks from 44 to 68).

(1) From the problem description in section 1.1, if the main objective is to predict Nodal Involvement from Age, Serum acid phosphatase, X-ray, Grade, and Stage, then why might we be interested in seeing whether Serum acid phosphatase and Age are related?
(2) What is the SAS computer program in section 1.2 supposed to give you?
(3) Using only the computer results/output in PART A (Univariate Procedure), can we calculate the coefficient of correlation between Age and Serum acid phosphatase? Why or why not? If not, what else do you need? Can you obtain the needed piece of information (using your calculator if needed) from all the data given here?
(4) From the computer results/output in PART B, do you think it is reasonable to conclude that Serum acid phosphatase and the patient's Age are not related?
(5) What is the model we usually assume in performing the regression analysis in PART C? Is it true that the assumptions of the model are only about the distribution of the levels of Serum acid phosphatase? Do we make any assumption about the distribution of age?
(6) Fill in the blanks to complete the ANOVA Table given in PART C (Degrees of freedom, Mean squares, F statistic, Root MSE); can you get the F statistic without MSE?
(7) Fixing a value of Age, the values of Serum acid phosphatase from the sub-population of patients with that age form a distribution with variance $\sigma^2$; use the results you just filled in (in question 1.6) to provide a point estimate for this variance.
(8) Re-calculate that point estimate of $\sigma^2$ using only the result in the last row preceding graph #1; does this agree with the previous estimate?
(9) If you are given only the results in PART C but not the results in PART B, can you find the value of the coefficient of correlation? Does this agree with the result given in PART B?
(10) Does the F-test result in PART C agree with the result of the corresponding t-test in PART B? Why or why not?
(11) Fill in the two (2) blanks (Predicted value and Residual) for observation/patient #8 in the long Table preceding the graph.
(12) Fill in the blank (Standard error of the Predicted value) for observation/patient #1.
(13) What is the average/mean Serum acid phosphatase for patients of 40 years of age?
(14) If we treat the Predicted value in that Table as an estimate of a new observation, can we calculate its standard error? Why or why not? If not, what else do you need?
(15) What does the graph tell you about the model assumption(s) in question (5)?

Problem 9: Let FEV (a measure of Lung Health) be the dependent variable and Age a potential predictor; we have the following results (computer output) using data from n subjects.
Suppose we fit the model (Model FEV = Age) and obtain these two tables:

ANOVA
              df       SS       MS       F     Significance F
Regression     1      [A]      [B]      [E]       0.09101
Residual      [C]    16.685    [D]
Total         24     18.94

             Coefficients   Std Error    t Stat    P-value    Lower 95%   Upper 95%
Intercept        5.253        0.921       5.703   <0.00001     3.34776     7.15847
Age             -0.040        0.023      -1.764    0.09101    -0.08708     0.00692

(1) Calculate the quantities A, B, C, D, and E (in order to complete the above ANOVA table); what was the sample size?
(2) Is it reasonable to conclude that FEV and the subject's Age are not related?
(3) What is the model we usually assume in performing the above regression analysis? Do we make any assumption about the distribution of age?
(4) Fixing a value of Age, the values of FEV from the sub-population of subjects with that age form a distribution with variance $\sigma^2$; provide a point estimate for this variance.
(5) Suppose you are also given, as part of the computer output for the regression analysis, $R^2$ = .119. Give your interpretation of this number and use it (and any results from the above computer output) to calculate the Coefficient of Correlation representing the strength of the relationship between Age and FEV.
(6) When we estimate the Mean Response of a sub-population with a common value of the predictor, $X = x_h$, the variance is given by:

$$s^2(\hat{Y}_h) = \text{MSE}\left[\frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum (x_i - \bar{x})^2}\right]$$

(7) Calculate this variance when $x_h$ = 30; you can use all the above computer output plus the following descriptive statistics:

n = 25
                Age (X)    FEV (Y)
Minimum            6         1.86
Maximum           54         4.91
Mean            39.8         3.66
Variance       58.56         0.79
St Deviation    7.65         0.89

(8) Suppose you have a numerical result for # (1f) [the following number is not necessarily correct; it is given just in case you could not answer or skipped #II1f]:

$$s^2(\hat{Y}_h) = .0786$$

(9) However, we would like to treat the Predicted/fitted value as an estimate of a new individual observation; calculate its standard error.

Problem 10: In the last decade, prostate specific antigen doubling time (PSA-DT; the time required for serum PSA to double its value) has been extensively researched in the prostate cancer literature; it is a reliable predictor of many major clinical endpoints. The use of PSA-DT is based on the finding that serum PSA in patients with prostate cancer follows an exponential growth curve model:

$$\text{Model 1a:}\quad \text{PSA}(t) = \text{PSA}(t_0)\exp[\beta(t - t_0)]; \quad t \ge t_0$$

where $t_0$ is the time origin at which the exponential growth stage starts and $\beta$ > 0 is a parameter representing disease severity.
(1) In 1992 a retrospective study of banked serum samples of patients with prostate cancer showed that the exponential increase in serum PSA begins 7 to 9 years before the tumor is detected clinically. In other words, the time of disease inception $t_0$ cannot be determined. Prove that if $t_0 < t_1$, we still have the same exponential model:

$$\text{Model 1b:}\quad \text{PSA}(t) = \text{PSA}(t_1)\exp[\beta(t - t_1)]; \quad t \ge t_1$$

(That means we can use any time $t_1$ in the exponential growth stage as the time origin instead of the unknown time of disease inception $t_0$.)
(2) Show how to express Model 1b as a simple linear regression model:

$$\text{Model 2:}\quad Y_t = \alpha + \beta t + \varepsilon$$

where $Y_t$ = ln[PSA(t)].
(3) What is the meaning of the intercept? Why can the slope be used as a parameter representing disease severity?
Assume that the error term satisfies the assumptions of the normal error regression model; what are these assumptions? Do we really need to know a specific value $t_1$ (in Model 1b) to perform the data analysis, or just any samples in the exponential growth stage? What if we include data points before time $t_0$?
(4) Find a possible link or relationship between the regression coefficients $\alpha$ and $\beta$ (either one or both) and the PSA-DT.
(5) What is the link or relationship between the coefficient of correlation $r(Y_t, t)$ and the coefficient of correlation $r(Y_t, t - t_0)$? Can we calculate or approximate the coefficient of correlation $r[\text{PSA}(t), t - t_0]$ if we know $r(Y_t, t)$ or $r(Y_t, t - t_0)$ or both?
(6) Using the relationship found in (4), show how to calculate or approximate the standard error SE(PSA-DT) using the variance-covariance matrix for the regression coefficients (the variances of the estimates of $\alpha$ and $\beta$ and their covariance; these numbers are often provided by computer output such as SAS).
(7) Given the following small data set:

Time, t     PSA     Y = ln(PSA)
  100       1.2        0.182
  280       1.6        0.470
  640       4          1.386
 1000       8.4        2.128

a) Calculate the coefficient of correlation r(y, t), the estimated slope b, the estimated intercept a, the PSA-DT, the standard error SE(b), the test statistic for the test of independence and its degrees of freedom, and the standard error SE(PSA-DT).
b) Set up the ANOVA table, including up to the F ratio statistic; the p-value is not required. Can we get the value of the F ratio statistic without the ANOVA table?
c) Suppose we change the data by changing the time origin:

Time, t     PSA     Y = ln(PSA)
    0       1.2        0.182
  180       1.6        0.470
  540       4          1.386
  900       8.4        2.128

Do we still get the same results as in (a)?
d) Suppose we change the sampling times to the following:

Time, t     PSA     Y = ln(PSA)
  100
   80
   80
 1000

How does the new SE(PSA-DT) compare to the result in (a)? Explain your answer.

Problem 11: The following data were collected during an experiment in which 10 laboratory animals were inoculated with a pathogen. The variables are Time after inoculation (X, in minutes) and Temperature (Y, in Celsius degrees).
X, Time (Minutes)   Y, Temperature (°C)        x²          y²         xy
       24                 38.8               576.00     1505.44     931.20
       28                 39.5               784.00     1560.25    1106.00
       32                 40.3              1024.00     1624.09    1289.60
       36                 40.7              1296.00     1656.49    1465.20
       40                 41.0              1600.00     1681.00    1640.00
       44                 41.1              1936.00     1689.21    1808.40
       48                 41.4              2304.00     1713.96    1987.20
       52                 41.6              2704.00     1730.56    2163.20
       56                 41.8              3136.00     1747.24    2340.80
       60                 41.9              3600.00     1755.61    2514.00
Total 420                408.1             18960.00    16663.85   17245.60

(1) Draw a Scatter Diagram to show the association, if any, between these two variables (correct scale is not very important); can you draw any conclusion/observation without doing any calculation?
(2) Calculate the Coefficient of Correlation and its 95% Confidence Interval.
(3) Form the regression line of Y on X by calculating the estimated Intercept and Slope; if the model holds, what would be the temperature for an animal, chosen at random, after 30 minutes? Explain, in the context of this problem, why it is risky to predict the response outside the range of values of the independent variable represented in the sample, say at 5 hours.
(4) What fraction of the total variability of Y is explained by its relationship to X? Form the ANOVA Table.
(5) Test for $H_0$: Slope = 0 at the .05 level of significance and state your conclusion in terms of this problem description.

Problem 12: The Pearson Correlation Coefficient (r) between two variables X and Y can be expressed in several equivalent forms; one of which is

$$r(X, Y) = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$

(1) If a and c are two positive constants and b and d are any two constants, prove that:

$$r(aX + b, cY + d) = r(X, Y)$$

Is this result in (1) still true if we do not assume that a and c are positive?
(2) What is the value of the Correlation Coefficient in question 2 of Problem 11 if Temperature is measured in Fahrenheit scale (°F)? Explain your answer.
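
Before proving the invariance claim in Problem 12, it can help to see it on numbers. The sketch below is a numerical illustration only, assuming hypothetical data (`x` and `y` are made up and are not the Problem 11 measurements) and a helper `pearson_r` written here from the definition of r; the constants passed to `transformed_r` are arbitrary choices.

```python
def pearson_r(u, v):
    """Pearson correlation from centered sums (the n - 1 divisors cancel)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / (suu * svv) ** 0.5


# Hypothetical data, for illustration only.
x = [3.0, 5.0, 6.0, 9.0, 10.0, 13.0]
y = [10.2, 11.9, 12.1, 15.3, 15.0, 18.8]


def transformed_r(a, b, c, d):
    """Correlation between U = aX + b and V = cY + d."""
    u = [a * xi + b for xi in x]
    v = [c * yi + d for yi in y]
    return pearson_r(u, v)


r_xy = pearson_r(x, y)
r_pos = transformed_r(2.5, -4.0, 1.8, 32.0)        # a > 0, c > 0
r_neg = transformed_r(-2.5, -4.0, 1.8, 32.0)       # a < 0, c > 0
r_both_neg = transformed_r(-2.5, -4.0, -1.8, 32.0)  # a < 0, c < 0

print("r(X, Y)             :", r_xy)
print("a, c both positive  :", r_pos)
print("a negative only     :", r_neg)
print("a, c both negative  :", r_both_neg)
```

With these numbers the correlation is unchanged when a and c share a sign and changes sign (but not magnitude) when exactly one of them is negative, which is the pattern the proof should explain.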

Problem 13: Let X and Y be two variables in a study; the regression line that can be used to predict Y from X values is:

$$\text{Predicted } y = b_0 + b_1 x$$

The estimated intercept and slope can be expressed in several equivalent forms; one of which is

$$b_1 = r \frac{s_y}{s_x}, \qquad b_0 = \bar{y} - b_1 \bar{x}$$

where $\bar{x}$ is the sample mean and $s_x$ is the sample standard deviation of X.
(1) If a and c are two positive constants and b and d are any two constants, consider the data transformation:

$$U = aX + b, \qquad V = cY + d$$

and denote the estimated intercept and slope of the regression line predicting V from U as $B_0$ and $B_1$. Express $B_0$ and $B_1$ as functions of a, b, c, d, $b_0$, and $b_1$.
(2) What would be the results of (1) in the special case that a = 1 and b = 0? What are the values of the Intercept and Slope in question 3 of Problem 11 if Temperature is measured in Fahrenheit scale (°F)? Explain your answer.

Problem 14: For an experiment like the one in Problem 1, Investigator #1 may be interested in predicting Y from X, and fits and computes a regression line for this purpose. Investigator #2, however, may be interested in predicting X from Y, and computes his regression line for that purpose.
(1) Are these two regression lines the same? If so, why? If not, compute the ratio and the product of the two slopes as functions of standard statistics.
(2) Calculate the product of those two slopes for the data in Problem 11.
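
For Problem 14, a quick numerical experiment can suggest what the product and ratio of the two slopes should be before deriving them. The sketch below assumes hypothetical data (`x` and `y` are invented and are not the Problem 11 or Problem 3 data); the names `slope_y_on_x` and `slope_x_on_y` are introduced here only for illustration.

```python
# Hypothetical data, for illustration only.
x = [61.0, 64.0, 66.0, 68.0, 70.0, 72.0, 75.0]
y = [120.0, 135.0, 150.0, 148.0, 170.0, 180.0, 200.0]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

# Centered sums of squares and cross-products.
sxx = sum((xi - xbar) ** 2 for xi in x)
syy = sum((yi - ybar) ** 2 for yi in y)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

slope_y_on_x = sxy / sxx  # Investigator #1: predict Y from X
slope_x_on_y = sxy / syy  # Investigator #2: predict X from Y

r = sxy / (sxx * syy) ** 0.5
var_x = sxx / (n - 1)
var_y = syy / (n - 1)

product = slope_y_on_x * slope_x_on_y
ratio = slope_y_on_x / slope_x_on_y

print("product of slopes:", product, " r squared:", r ** 2)
print("ratio of slopes  :", ratio, " Var(y)/Var(x):", var_y / var_x)
```

In this example the product of the slopes matches $r^2$ and the ratio matches the ratio of the sample variances; the two fitted lines coincide only when $r^2$ = 1.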