SEMATECH 997 Statstcal Methods Symposum Austn Regresson Models for a Bnary Response Usng EXCEL and JMP Davd C. Trndade, Ph.D. STAT-TECH Consultng and Tranng n Appled Statstcs San Jose, CA
Topcs Practcal Examples Propertes of a Bnary Response Lnear Regresson Models for Bnary Responses Smple Straght Lne Weghted Least Squares Regresson n EXCEL and JMP Logstc Response Functon Logstc Regresson Repeated Observatons (Grouped Data) Indvdual Observatons Logt Analyss n EXCEL and JMP Concluson
Practcal Examples: Bnary Responses Consder the followng stuatons: A weatherman would lke to understand f the probablty of a rany day occurrng depends on atmospherc pressure, temperature, or relatve humdty A doctor wants to estmate the chance of a stroke ncdent as a functon of blood pressure or weght An engneer s nterested n the lkelhood of a devce falng functonalty based on specfc parametrc readngs
More Practcal Examples The correctons department s tryng to learn f the number of nmate tranng hours affects the probablty of released prsoners returnng to jal (recdvsm) The mltary s nterested n the probablty of a mssle destroyng an ncomng target as a functon of the speed of the target A real estate agency s concerned wth measurng the lkelhood of sellng property gven the ncome of varous clents An equpment manufacturer s nvestgatng relablty after sx months of operaton usng dfferent spn rates or temperature settngs
Bnary Responses In all these examples, the dependent varable s a bnary ndcator response, takng on the values of ether 0 or, dependng on whch of of two categores the response falls nto: success-falure, yes-no, ranydry, target ht-target mssed, etc. We are nterested n determnng the role of explanatory or regressor varables X, X 2, on the bnary response for purposes of predcton.
Smple Lnear Regresson Consder the smple lnear regresson model for a bnary response: Y = β + β X + ε 0 where the ndcator varable Y = 0,. Snce E ε = 0, the mean response s ( ) ( ) EY = β + β X 0
Interpretaton of Bnary Response Snce Y can take on only the values 0 and, we choose the Bernoull dstrbuton for the probablty model. Thus, the probablty that Y = s the mean p and the probablty that Y = 0 s - p. The mean response EY ( ) = p+ 0 ( p) = p s thus nterpreted as the probablty that Y = when the regressor varable s X.
Model Consderatons Consder the varance of Y for a gven X : ( ) = ( β + β + ε ) = ( ε ) VYX V X X V X 0 ( ) ( β β )( β β ) = p p = + X X 0 0 We see the varance s not constant snce t depends on the value of X. Ths s a volaton of basc regresson assumptons. Soluton: Use weghted least squares regresson n whch the weghts selected are nversely proportonal to the varance of Y, where Var( Y ) = Yˆ ( Yˆ )
Dstrbuton of Errors Note also that the errors cannot be normally dstrbuted snce there are only two possble values (0 or ) for ε at each regressor level. Ftted model should have the property that the predcted responses le between 0 and for all X wthn the range of orgnal data. No guarantee that the smple lnear model wll have ths behavor.
Example : Mssle Test Data* The table shows the results of test-frng 25 ground to ar mssles at targets of varous speeds. A s a ht and a 0 s a mss. * Example from Montgomery & Peck, Introducton to Lnear Regresson Analyss, 2nd Ed. Table 6.4 Test Frng I Target Speed (knots) x Ht or Mss y 400 0 2 220 3 490 0 4 40 5 500 0 6 270 0 7 200 8 470 0 9 480 0 0 30 240 2 490 0 3 420 0 4 330 5 280 6 20 7 300 8 470 9 230 0 20 430 0 2 460 0 22 220 23 250 24 200 25 390 0
EXCEL Plot of Data There appears to be a tendency for msses to ncrease wth ncreasng target speed. Let us group the data to reveal the assocaton better. Ht or Mss y Plot of y Versus Target Speed x (knots) 0 00 50 200 250 300 350 400 450 500 x, Target Speed(knots)
Grouped Data Succes Fracton versus Speed Interval Speed Number of Number of Fracton Interval Attempts Hts Success 200-240 7 6 0.857 250-290 3 2 0.667 300-340 3 3.000 350-390 0 0.000 400-440 4 0.250 450-490 6 0.67 500 0 0.000 Sum 25 3 Success Fracton.000 0.900 0.800 0.700 0.600 0.500 0.400 0.300 0.200 0.00 0.000 200-240 250-290 300-340 350-390 400-440 450-490 500 Speed Interval (knots) Clearly, the probablty of a ht seems to decrease wth speed. We wll ft a straght-lne model to the data usng weghted least squares.
Weghted Least Squares We wll use the nverse of the varance of Y for the weghts w. Problem: these are not known because they are a functon of the unknown parameters β 0, β n the regresson model. That s, the weghts w are: w = = = VYX p p X X ( ) ( ) ( β + β )( β β ) 0 0 Soluton: We can ntally estmate β 0, β usng ordnary (unweghted) LS. Then, we calculate the weghts wth these estmates and solve for the weghted LS coeffcents. One teraton usually suffces.
Smple Lnear Regresson n EXCEL Several methods exst: Use Regresson macro n Data Analyss Tools. Use Functon button to pull up Slope and Intercept under Statstcal lstngs. Sort data frst by regressor varable. Clck on data ponts n plot of Y vs. X, select menubar Insert followed by Trendlne. In dalog box, select optons tab and choose Dsplay equaton on chart. Use EXCEL array tools (transpose, mnverse, and mmult) to defne and manpulate matrces. (Requres Cntrl-Shft-Enter for array entry.)
EXCEL Data Analyss Tools Output: SUMMARY OUTPUT Regresson Statstcs Multple R 0.64673 R Square 0.4826 Adjusted R Square 0.39296 Standard Error 0.39728 Observatons 25 Can also dsplay resduals and varous plots. ANOVA df SS MS F Sgnfcance F Regresson 2.6099 2.6099 6.53624 0.0004769 Resdual 23 3.63009 0.5783 Total 24 6.24 Coeffcents Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0% Intercept.56228 0.26834 5.8294 0.0000.0077 2.739.0077 2.739 Target Speed (knot -0.0030 0.00074-4.06648 0.00048-0.00453-0.0048-0.00453-0.0048
EXCEL Functons Sorted data. Target Speed (knots) x Ht or Mss y 200 200 20 220 220 230 0 240 250 270 0 280 300 30 330 390 0 400 0 40 420 0 430 0 460 0 470 0 470 480 0 490 0 490 0 500 0 =ntercept(y column, x column) =slope(y column, x column) Output: Intercept.562282 Slope -0.0030
EXCEL Equaton on Chart Plot of y Versus Target Speed x (knots) y = -0.003x +.5623 Ht or Mss y 0 00 50 200 250 300 350 400 450 500 x, Target Speed(knots)
EXCEL Array Functons Three key functons: =transpose(range) =mmult(range, range2) =mnverse(range) Requres Cntrl-Shft-Enter each tme.
EXCEL Matrx Manpulaton Defne the desgn matrx X by addng a column of s for the constant n the model. Then, progressvely calculate: the transpose X the product X X the nverse of X X the product X Y the LS regresson coeffcents = (X X) - (X Y) The standard errors of the coeffcents can be obtaned from the square root of the dagonal elements of the varance-covarance matrx: MSE x (X X) -. Fnd MSE from the resduals SS and df.
EXCEL Matrx Example X X 200 200 20 220 220 230 240 250 270 280 300 30 330 390 400 40 420 430 460 470 470 480 490 490 500 Y 0 0 0 0 0 0 0 0 0 0 0 0 X X 25 8670 8670 3295700 [X X] - 0.45624-0.002-0.002 3.46E-06 X Y 3 3640 Coeffcents = [X X] - X Y β 0 β.562282-0.0030 200 200 20 220 220 230 240 250 270 280 300 30 330 390 400 40 420 430
Target Speed (knots) x EXCEL Matrx Example Ht or Mss y Y pred Resduals 200 0.962 0.0388 200 0.962 0.0388 20 0.93 0.0689 220 0.90 0.0989 220 0.90 0.0989 230 0 0.870-0.870 240 0.840 0.590 250 0.809 0.89 270 0 0.7508-0.7508 280 0.7208 0.2792 300 0.6607 0.3393 30 0.6306 0.3694 330 0.5705 0.4295 390 0 0.3902-0.3902 400 0 0.360-0.360 40 0.330 0.6699 420 0 0.3000-0.3000 430 0 0.2699-0.2699 460 0 0.798-0.798 470 0 0.497-0.497 470 0.497 0.8503 480 0 0.97-0.97 490 0 0.0896-0.0896 490 0 0.0896-0.0896 500 0 0.0596-0.0596 SS Resduals 3.630087 DF 23 MSE 0.5783 Standard Errors [X X] - 0.45624-0.002-0.002 3.46E-06 MSE x [X X] - 0.072008-0.0009-0.0009 5.46E-07 Standard Errors of Coeffcents β 0 β 0.268344 0.000739 Ftted model appears adequate snce all Y predctons are between 0 and. If not, would need non-lnear model.
Smple Lnear Regresson n JMP Specfy number of rows for data Set up X column Set up Y column Select under Analyze Ft Y by X For multple regresson, select under Analyze Ft Model
Note that Y s specfed C for contnuous at ths pont. Data Table n JMP
Ft Model n JMP
Weghted Least Squares Regresson In weghted least squares regresson, the squared devaton between the observed and predcted value (that s, the squared resdual) s multpled by weghts w that are nversely proportonal to Y. We then mnmze the followng functon wth respect to the coeffcents β 0, β : n ( ) SS = w Y β0 βx w = 2
Weghted LS Regresson n Several methods exst: EXCEL Transform all varables, ncludng constant. Use Regresson macro n Data Analyss Tools wth no ntercept Use Solver routne on sum of squares of weghted resduals Use EXCEL array tools (transpose, mnverse, and mmult) to defne and manpulate matrces. (Requres Cntrl-Shft-Enter for array entry.)
Transform Method for Weghted Least Squares Transform the varables by dvdng each term n the model by the square root of the varance of Y. SS w = n = w ( Y β β X ) 0 2 = n = Y var Y β 0 var Y β X var Y 2 = n ( ' ' Y β β ) 0Z X = 2
Transformed Varables The expresson below can be solved usng ordnary LS multple regresson wth the ntercept (constant term) equal to zero. n ' w 0 = ( ' β β ) SS = Y Z X 2
Transformng Varables Transformed Factors Test Frng I Constant Target Speed (knots) x Ht or Mss y Y Pred Var(Y) = (Y )*(-Y ) T = /sqrt[var(y)] Constant T*Cnst X = T*X Y = T*y 400 0 0.360 0.2304 2.083 2.0832 833.3 0.0000 2 220 0.90 0.089 3.350 3.3496 736.9 3.3496 3 490 0 0.0896 0.086 3.50 3.5009 75.4 0.0000 4 40 0.330 0.22 2.27 2.266 87.9 2.266 5 500 0 0.0596 0.0560 4.225 4.2250 22.5 0.0000 6 270 0 0.7508 0.87 2.32 2.39 624.2 0.0000 7 200 0.962 0.0373 5.78 5.780 035.6 5.780 8 470 0 0.497 0.273 2.803 2.8026 37.2 0.0000 9 480 0 0.97 0.054 3.08 3.0809 478.8 0.0000 0 30 0.6306 0.2329 2.072 2.079 642.3 2.079 240 0.840 0.337 2.735 2.7345 656.3 2.7345 2 490 0 0.0896 0.086 3.50 3.5009 75.4 0.0000 3 420 0 0.3000 0.200 2.82 2.822 96.5 0.0000 4 330 0.5705 0.2450 2.020 2.0202 666.7 2.0202 5 280 0.7208 0.203 2.229 2.2290 624. 2.2290 6 20 0.93 0.064 3.949 3.9493 829.3 3.9493 7 300 0.6607 0.2242 2.2 2.20 633.6 2.20 8 470 0.497 0.273 2.803 2.8026 37.2 2.8026 9 230 0 0.870 0.23 2.984 2.9836 686.2 0.0000 20 430 0 0.2699 0.97 2.253 2.2526 968.6 0.0000 2 460 0 0.798 0.475 2.604 2.604 97.9 0.0000 22 220 0.90 0.089 3.350 3.3496 736.9 3.3496 23 250 0.809 0.533 2.554 2.5538 638.5 2.5538 24 200 0.962 0.0373 5.78 5.780 035.6 5.780 25 390 0 0.3902 0.2379 2.050 2.050 799.5 0.0000 LS Coeff b0.56228 b -0.0030
EXCEL Data Analyss Regresson on Transformed Factors (Intercept =0) SUMMARY OUTPUT Regresson Statstcs Multple R 0.8252 R Square 0.6809 Adjusted R Square 0.6235 Standard Error.0060 Observatons 25 ANOVA df SS MS F Sgnfcance F Regresson 2 49.6625 24.832 24.5376 2.4989E-06 Resdual 23 23.2753.020 Total 25 72.9377 Coeffcents Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0% Intercept 0 #N/A #N/A #N/A #N/A #N/A #N/A #N/A Constant T*Cnst.586687 0.85892 8.535539.39E-08.2024.9723.2024.9723 X = T*X -0.00309 0.000533-5.79882 6.58E-06-0.0049-0.0099-0.0049-0.0099
Weghted Least Squares Analyss Usng Solver Use the unweghted LS coeffcents to predct Y. Calculate the varance of Y based on predcted Y n equaton Y (- Y ) Calculate the weghts w as the recprocal varance of Y Usng tral settngs for the coeffcents for weghted LS regresson, calculate the sum of the squared resduals (= observed mnus predcted response) weghted by w. Apply solver to mnmze ths sum by changng the weghted coeffcents
Solver Routne
Solver Soluton SSQ Wres SS Res^2 23.2753 3.63288 b0 W LS.586687 DF DF b W LS -0.00309 23 23 MSE MSE.097 0.5795
EXCEL Matrx Manpulaton Defne the desgn matrx X by addng a column of s for the constant n the model. Defne the dagonal weght matrx V wth varances along dagonal. The standard error of the weghted LS coeffcents can be obtaned from: Var β* = ( ' ) X V X Then, progressvely calculate: the nverse V - the product V - X the transpose X the product X V - X the nverse of X V - X the product V - Y the product X V - Y the coeffcents = (X V - X) - (X V - Y)
Weghted Matrx Results
Weghted LS n JMP Set up a column for predcted Y usng ordnary LS coeffcents (Requres use of formula calculator n JMP) Set up column for weghts as recprocal varance of Y usng formula calculator Label ths column as weghts and select Ft Model
Weghted LS Data Table n JMP
Ft Model for Weghted LS n JMP
Logstc Regresson, A Non-Lnear Model The lnear model constrans the response to have ether a zero probablty or a probablty of one at large or small values of the regressor. Ths model may be unreasonable. Instead, we propose a model n whch the probabltes of zero and one are reached asymptotcally. Frequently we fnd that the response functon s S shaped, smlar to the CDF of a normal dstrbuton. In fact, probt analyss nvolves modelng the response wth a normal CDF.
Logstc Functon Model We attempt to model the ndcator varable response usng the logstc functon (logt analyss): E ( Y X ) = p = = + exp exp + exp ( β ) 0 + β X ( β + β X ) 0 ( β β X ) 0
Lnearzng Logstc Functon Consder the logt transformaton of the probablty p: p * = ln = β0 + β p p X p* s called the logt mean response. The logt response functon s a lnear model.
Fttng the Logt Response Model Two Possbltes:. If we have repeat observatons on Y at each level of X, we can estmate the probabltes usng the proporton of s at each X. Then, we ft the logt response functon usng weghted least squares. 2. If we have only a few or no repeat observatons at the varous X values, we cannot use proportons. We then estmate the logt response functon from ndvdual Y observatons usng maxmum lkelhood methods.
Weghted LS for Fttng Logt Response Model The observed proporton at each X level s If the number of observatons at each level of X s large, the varance of the transformed proporton s p = (# of ' s at X ) (#of observatons at X ) V p * ( * p ) p = ln p = n p ( p )
Weghts for LS Regresson We use the approprate weghts w = n p ( p ) and solve usng weghted LS methods prevously shown usng EXCEL or JMP. Then transform p* to the orgnal unts p usng logstc functon. pˆ e = + pˆ * e pˆ *
Weghted LS Logt Regresson We need to set followng columns: X N (number of observatons at each X) Y (number of s at each X) p (proporton) p* (transformed proporton) w (weghts) At ths pont, may want to consder MLE methods n JMP.
Maxmum Lkelhood Estmaton for Logstc Grouped Data n JMP Data table s easy to set up. Column wth each X value sequentally repeated Y column wth alternatng 0 s and s Frequency column for counts of 0 s and s Label X column C and X, Y column N (nomnal) and Y, and Frequency column C and F Then run Ft Y by X
Cauton: A JMP Feature JMP wll model the lowest value of the bnary response as the success and the alternatve as the falure. Thus, 0 wll be treated as success and as falure. Smlarly, no wll be vewed as success and yes as falure, snce n comes before y n the alphabet. Consequently, the functon you expect to be monotoncally ncreasng wll appear as decreasng and vce versa unless you flp the ndcator values. In the examples that follow, I have lsted the tables as they appear n texts but dsplayed the graphs by nterchangng s and 0 s for analyss (Ft Y by X)
MLE Table for Grouped Data Example from Appled Lnear Statstcal Models by Neter, Wasserman, and Kutner, Table 6.
Ft Y by X
Logstc Regresson n Jump Indvdual Values We can use JMP s MLE to ft a model to the data. The data table entry s smple: Column for X Column for Y or s and 0 s Label X column C and X Label Y column N and Y Ft Y by X
Data Table for Logstc MLE Example from Appled Lnear Statstcal Models by Neter, Wasserman, and Kutner, Table 6.2
Ft Y by X MLE Output
Multple Logstc Regresson Here s an example from the JMP In tranng manual that comes wth the student verson of JMP: A weatherman s tryng to predct the precptaton probablty by lookng at the mornng temperature and the barometrc pressure. He generates a table for 30 days n Aprl. If the precptaton was greater than 0.02 nches, the day was called rany. If below, then dry.
Partal Table: Sprng.JMP Data
JMP Logstc Analyss: Ft Y by X
Multple Logstc Regresson n JMP Ft Y by X Generates a separate logstc regresson for each predctor column X E( Y X ) = + exp β β X Ft Model ( ) Fts an overall logstc regresson model for specfed predctor columns X s and nteractons 0 E ( Y X, X ) 2 = + exp ( β β X β X β X X ) 0 2 2 2 2
Concluson Bnary response data occurs n many mportant applcatons. The smple lnear regresson model has constrants that may affect ts adequacy. The logstc model has many desrable propertes for modelng ndcator varables. EXCEL and JMP have excellent capabltes for analyss and modelng of bnary data. For logstc regresson modelng, JMP s MLE routnes are easy to apply and very useful.