Exhaustive Regression. An Exploration of Regression-Based Data Mining Techniques Using Super Computation

Transcription

1 Exhaustve Regresson An Exploraton of Regresson-Based Data Mnng Technques Usng Super Computaton Antony Daves, Ph.D. Assocate Professor of Economcs Duquesne Unversty Pttsburgh, PA 58 Research Fellow The Mercatus Center George Mason Unversty Arlngton, VA 0 antony@antoln-daves.com RPF Workng Paper No August 8, 008 RESEARCH PROGRAM ON FORECASTING Center of Economc Research Department of Economcs The George Washngton Unversty Washngton, DC Research Program on Forecastng (RPF) Workng Papers represent prelmnary work crculated for comment and dscusson. Please contact the author(s) before ctng ths paper n any publcatons. The vews expressed n RPF Workng Papers are solely those of the author(s) and do not necessarly represent the vews of RPF or George Washngton Unversty.

2 Exhaustve Regresson An Exploraton of Regresson-Based Data Mnng Technques Usng Super Computaton Antony Daves, Ph.D. Assocate Professor of Economcs Duquesne Unversty Pttsburgh, PA 58 Research Fellow The Mercatus Center George Mason Unversty Arlngton, VA 0 antony@antoln-daves.com

3 Regresson analyss s ntended to be used when the researcher seeks to test a gven hypothess aganst a data set. Unfortunately, n many applcatons t s ether not possble to specfy a hypothess, typcally because the research s n a very early stage, or t s not desrable to form a hypothess, typcally because the number of potental explanatory varables s very large. In these cases, researchers have resorted ether to overt data mnng technques such as stepwse regresson, or covert data mnng technques such as runnng varatons on regresson models pror to runnng the fnal model (also known as data peekng ). Whle data mnng sde-steps the need to form a hypothess, t s hghly susceptble to generatng spurous results. Ths paper draws on the known propertes of OLS estmators n the presence of omtted and extraneous varable models to propose a procedure for data mnng that attempts to dstngush between parameter estmates that are sgnfcant due to an underlyng structural relatonshp and those that are sgnfcant due to random chance.

4 . Background Regresson analyss s desgned to estmate the probablty of observng a gven data set gven that a pre-determned hypothess about the relatonshp between an outcome varable and a set of factors s assumed to be true. Unfortunately, n many applcatons t s not possble to specfy a hypothess. Two possble reasons why a researcher would want to perform analyss absent a hypothess are: Large Data Scope: Data sze refers to the number of observatons n a data set. Data scope refers to the number of potental explanatory factors ( canddate factors ) n the data set. In the case of large data scope (e.g., economc and fnancal data sets), the number of canddate factors s so large that the cost of formng a tractable hypothess s prohbtve. Early Stage Analyss: In the case of early stage analyss (e.g., clncal data), there has not been enough observaton to as yet for a hypothess. To date, stepwse regresson (SR) has been one of the more wdely used technques for performng these hypothess-less analyses. SR methods perform a smart samplng of regresson models n an attempt to fnd a regresson model that best fts the data. The procedure for smartly samplng s ad-hoc. Two typcally used procedures are backward (wheren the frst model ncludes all factors and factors are removed one at a tme) and forward (wheren the frst model ncludes no factors and factors are added one at a tme). SR procedures contan three major flaws: Samplng sze: The number of regresson models that can be constructed from a gven data set can be ncredbly large. For example, wth only 30 canddate 3

5 factors one could construct more than bllon regresson models ( 30 ). Stepwse procedures sample only a small number (typcally less than 00) of the set of possble regresson models. Whle stepwse methods can fnd models that ft the data reasonably well, as the number of factors rses, the probablty of stepwse methods fndng the best-ft model s vrtually zero. Ft crteron: In evaluatng competng models, stepwse methods typcally employ an F-statstc crteron. Ths crteron causes stepwse to methods to seek out the model that comes closest to explanng the data set. However, as the number of canddate factors ncreases, what also ncreases s the probablty of a gven factor addng explanatory power smply by random chance. Thus, the stepwse ft crteron cannot dstngush between factors that contrbute explanatory power to the outcome varable because of an underlyng relatonshp ( determnstc factors ) and those that contrbute by random chance only ( spurous factors ). Intal condton: Because SR only examnes a small subset of the space of possble models and because the space of fts of the models (potentally) contans many local optma, the soluton SR returns vares based on the startng pont of the search. For example, for the same data set, SR backward and SR forward can yeld dfferent results. As the startng ponts for SR backward and SR forward are arbtrary, there are many other potental startng ponts each of whch potentally yelds a dfferent result. For example, Fgure depcts the space of possble regresson models that can be formed usng K factors. Each block represents one regresson model. The shade of the block ndcates the qualty of the model (e.g., goodness of ft). A stepwse procedure that starts at model A 4

6 evaluates models n the vcnty of A, moves to the best model, then re-evaluates n the new vcnty. Ths contnues untl the procedure cannot fnd a better model n the vcnty. In ths example, stepwse would move along the ndcated path startng at pont A. Were stepwse to start at pont B, however, t would end up at a dfferent optmal model. Out of the four startng ponts shown, only startng pont D fnds the best model. Fgure. Results from stepwse procedures are dependent on the ntal condton from whch the search commences.. All Subsets Regresson All Subsets Regresson (ASR) s a procedure ntended to be used when a researcher wants to perform analyss n the absence of a hypothess and wants to avod the samplng sze problem nherent n stepwse procedures. For K potental explanatory factors, ASR examnes all K lnear models that can be constructed. Untl recently, ASR has been nfeasble due to the massve computng power requred. By employng grd-enabled super computaton, t s now 5

7 feasble to conduct ASR for moderately szed data sets. For example, as of today an average computer would requre nearly 00 years of contnuous work to perform ASR on a data set contanng 40 canddate factors (see Fgure ), whle a 0,000 node grd computer could complete the same analyss n less than a week. Whle ASR solves the samplng sze and ntal condton problems nherent n SR, ASR remans subject to the ft crteron problem. If anythng, the ft crteron s more of a problem for ASR as the procedure examnes a much larger space of models than does SR and therefore s more lkely to fnd spurous factors. Fgure. The tme a sngle computer requres to perform ASR rses exponentally wth the number of canddate factors. 6

8 3. Exhaustve Regresson Exhaustve Regresson (ER) utlzes the ASR procedure, but attempts to dentfy spurous factors va a cross-model ch-square statstc that tests for stablty n parameter estmates across models. The cross-model stablty test compares parameter estmates for each factor across all K models n whch the factor appears. Factors whose parameter estmates yeld sgnfcantly dfferent results across models are dentfed as spurous. A gven factor can exst n one of three types of models: omtted factor model, correctly specfed model, and extraneous factor model. A correctly specfed model contans all of the explanatory factors (.e., the factors contrbute to explanng the outcome varable because of some underlyng relatonshp between the factor and the outcome) and no other factors. An omtted varable model ncludes at least one, but not all explanatory factors and (possbly) other factors. An extraneous varable model ncludes all explanatory factors and at least one other factor. In the cases of the correctly specfed model and the extraneous varable model, estmates of parameters assocated wth the factors ( slope coeffcents ) are unbased. In the case of the omtted varable model, however, estmates of the slope coeffcents are based and the drecton of bas s a functon of the covarances of the omtted factor wth the outcome varable and the omtted factor wth the factors ncluded n the model. For example, consder a data set contanng k +k +k 3 factors, for each of whch there are N observatons. Let the factors be arranged nto three Nxk j, j={,,3} matrces X, X, and X 3, and let the sets of slope coeffcents assocated wth each set of factors be the k j x vectors β, β, and β 3, respectvely. Let Y be an Nx vector of observatons on the outcome varable. Suppose that, unknown to the researcher, the process that determnes the outcome varable s Assumng, of course, that the remanng classcal lnear model assumptons hold. 7

9 Y=Xβ +Xβ +u () where u s an Nx vector of ndependently and dentcally normally dstrbuted errors. The three cases are attaned when we apply the ordnary least squares (OLS) procedure to estmate the followng models: Omtted Varable Case: Y=Xβ + u O Correctly Specfed Case: Y=Xβ +Xβ +u Extraneous Varable Case: Y=Xβ +Xβ +Xβ 3 3 +u E In the correctly specfed case, the ordnary least squares regresson procedure yelds the slope estmate: βˆ - ' ' = ([ X X] [ X X] ) [ X X] Y ˆ β () where the square brackets ndcate a parttoned matrx. Substtutng the defnton for Y from () nto (), t can be shown that the expected values of the slope estmates equal the true slope values: βˆ β E = ˆ β β Ths s the unbasedness condton. Smlarly, n the extraneous varable case, we have: βˆ ' - ' β ˆ = ([ X X X3] [ X X X3] ) [ X X X3] Y ˆ β3 (3) Agan, t can be shown that the expected values of the slope estmates equal the true slope values: 8

10 By contrast, n the omtted varable case, we have Substtutng () nto (4), we have and the expected values of the slope estmates are: βˆ β E ˆ β = β ˆ β3 0 ' ( ) - ˆ ' β = X X X Y (4) ' - ' ( ) ( ) β ˆ = X X X X β +X β +u E ( ˆ ' ) ( ) - ' β = β + X X X X β β (5) From (5), we see that the expected value of the slope estmates n the omtted varable case are based and that the drecton of the bas depends on ' XXβ. We can construct a sequence of omtted varable cases as follows. Let { Z,Z,..., Z } be the set of all (column-wse unque) subsets of X. Let X % be the set of k regressors formed by mergng X and X and then removng Z so that, from the superset of regressors formed by mergng X and X, X % s the set of ncluded regressors and Z s the correspondng set of excluded regressors. Fnally, let β ˆ be the OLS estmate of β obtaned by regressng Y on X %. The expected value of the mean of the β ˆ k across all regresson models s: k k ˆ ' - ' E E k β = β + ( ) k X% X% X% Zβ (6) = = Snce Z s are subsets of X, each Z s Nxj where j k. 9

11 From (6), we see that the expected value of the mean of the β ˆ s β when any of the followng condtons are met:. For each set of ncluded regressors, there s no covarance between the ncluded regressors and the correspondng set of excluded regressors (.e., ' XZ % =0 );. For some sets of ncluded regressors, there s a non-zero covarance between the ncluded ' regressors and the correspondng set of excluded regressors (.e., XZ % 0 ), but the XZ % =0); ' expected value of the covarances s zero (.e., E ( ) 3. For some sets of ncluded regressors, there s a non-zero covarance between the ncluded ' regressors and the correspondng set of excluded regressors (.e., XZ % 0 ), and the ' expected value of the covarances s non-zero (.e., E ( ) XZ % 0), but the expected covarance of the excluded regressors wth the dependent varable s zero (.e., E ( β ) =0); 4. None of the above holds, but the expected value of the product of () the covarances between the ncluded and excluded regressors and () the covarance of the excluded E XZβ % = 0 ). ' regressors wth the dependent varable s zero (.e., ( ) The ER procedure reles on the reasonable assumpton that, as the data number of factors n the data set ncreases, condtons (), (3), and (4) wll hold asymptotcally. If true, ths enables us to construct the cross-model ch-square statstc. 0

12 4. The Cross-Model Ch-Square Statstc Let us assume that the th (n a set of K) factor, x (where x s an Nx vector) has no structural relatonshp wth an outcome varable y. Consder the equaton: y = α + β x + β x + + β x + u (7)... k k Under the null hypothess of β = 0, the square of the rato of ˆ β to ts standard error s dstrbuted χ wth one degree of freedom. By addng factors to and removng factors (other than x ) from (7), we can obtan other estmates of β. Let ˆj β be the j th such estmate of β. By lookng at all combnatons of factors from the superset of K factors, we can construct K estmates of β. Under the null hypothess that β = 0 and assumng that the ˆj β are ndependent, we have: K ˆ β j ~ χ K (8) j= s β From (8), we can construct c, the cross-model ch-square statstc for the factor x : c K ˆ β j ~ χ K (9) j= s β = Gven the (typcally) large number of degrees of freedom nherent n the ER procedure, t s worth notng that the measure n (8) s lkely to be subject to Type II errors. In an attempt to compensate, we dvde by the number of degrees of freedom to obtan c, a relatve ch-square statstc, a measure that s less dependent on sample sze. Carmnes and McIver (98) and Klne (998) clam that one should conclude that the data represent a good ft to the hypothess when the relatve ch-square measure s less than 3. Ullman (00) recommends usng a ch-square less than.

13 Because the ˆj β are obtaned by explorng all combnatons of factors from a sngle superset, one mght expect the ˆj β to be correlated (partcularly when factors are postvely correlated), and for the correlaton to ncrease n the presence of stronger multcollnearty among the factors. 5. Monte-Carlo Tests of c To test the ablty of the cross-model ch-square statstc to dentfy factors that mght show statstcally sgnfcant slope coeffcents smply by random chance, consder an outcome varable, Y, that s determned by three factors as follows Y = α + β X + β X + β X + u (0) 3 3 where u s an error term satsfyng the requrements for the classcal lnear regresson model. Let us randomly generate addtonal factors X 4 through X 5, run the ER procedure and calculate c for each of the factors. The followng fgures show the results of the ER runs. The frst set of bars show results for ER runs appled to the superset of factors X through X 4. The results are derved as follows:. Generate 500 observatons for X, randomly selectng observatons from the unform dstrbuton.. Generate 500 observatons each for X through X 5 such that X = γ X + v where the γ are randomly selected from the standard normal dstrbuton and dstrbuted, and v are normally dstrbuted wth mean zero and varance 0.. Ths step creates varyng multcollnearty among the factors. 3. Generate Y accordng to (0) where α = β = β = β 3 =, and u s normally dstrbuted wth a varance of.

14 4. Run all K = 5 regresson models to obtan the K estmates for each β: ˆ β,..., ˆ β, ˆ β,..., ˆ β, ˆ β,..., ˆ β, ˆ β,..., ˆ β.,,8,,8 3, 3,8 4, 4,8 5. Calculate c, c, c 3, and c 4 accordng to (9). 6. Repeat steps through 5 three-thousand tmes. 7. Repeat steps through 6, each tme ncreasng the varance of u by untl the varance reaches 0. 3 At the completon of the algorthm, there wll be 60,000 measures for each of c, c, c 3, and c 4 based on (60,000) (5) = 900,000 separate regressons. We then repeat the procedure usng a superset of factors X through X 5, then X through X 6, etc. up to X through X 5. 4 Step ntroduces random multcollnearty among the factors. On average, the multcollnearty of factors wth X follows the pattern shown n Fgure 3. Approxmately half of the correlatons wth X are postve and half are negatve. Whle the correlatons are constructed between X and the other factors, ths wll also result n the other factors beng par-wse correlated though to a lesser extent (on average) than they are correlated wth X. 3 Ths results n an average R for the estmate of equaton (0) of approxmately Ths fnal pass requres the estmaton of bllon separate regressons. The entre Monte-Carlo run requres computaton equvalent to almost,000 CPU-hours. 3

15 has less than ths correlaton wth X % 0% 0% 30% 40% 50% 60% 70% 80% 90% 00% Ths proporton of factors Fgure 3. Pattern of Squared Correlatons of Factors X through X K wth X Fgure 4 shows the results of the Monte-Carlo runs n whch the crtcal value for the c s set to 3. For example, when there are four factors n the data set, c, c, and c 3 pass the sgnfcance test slghtly over 50% of the tme versus 0% for c 4. In other words, for data sets n whch three out of four factors determne the outcome varable (and a crtcal value of 3), the ER procedure wll dentfy the three determnng factors 50% of the tme and dentfy the nondetermnng factor 0% of the tme. As the number of factors n the data set ncreases, the ER procedure better dscrmnates between the factors that determne the outcome varable and those that mght appear sgnfcant by random chance alone. The last set of bars n Fgure 4 shows the results for data sets n whch three out of ffteen factors determne the outcome varable. Here, 4

16 the ER procedure dentfes the determnng factors approxmately 85% of the tme and (erroneously) dentfes non-determnng factors only slghtly more than 0% of the tme. 00% 90% Cross-Model Ch-Square Statstc Exceeds % 70% 60% 50% 40% 30% 0% 0% 0% Number of Factors n the Regresson Model Factor Factor Factor 3 All Other Factors Fgure 4. Monte-Carlo Tests of ER Procedure Usng Supersets of Data from 4 Through 5 Factors (crtcal value = 3) These results are based on the somewhat arbtrary selecton of 3 for the relatve chsquare crtcal value. Reducng the crtcal value to produces the results shown n Fgure 5. As expected, reducng the crtcal value causes the ncdence of false postves (where postve means dentfcaton of a determnng factor ) to approxmately 5%, but the ncdence of false negatves falls to below 0%. Fgure 6 and Fgure 7, where the crtcal value s set to.5 and, respectvely, are shown for comparson. As expected, the margnal gans n the reducton n false negatves (versus Fgure 5) appear to be outweghed by the ncrease n the rate of false postves. 5

17 00% 90% Cross-Model Ch-Square Statstc Exceeds.0 80% 70% 60% 50% 40% 30% 0% 0% 0% Number of Factors n the Regresson Model Factor Factor Factor 3 All Other Factors Fgure 5. Monte-Carlo Tests of ER Procedure Usng Supersets of Data from 4 Through 5 Factors (crtcal value = ) 6

18 00% 90% Cross-Model Ch-Square Statstc Exceeds.5 80% 70% 60% 50% 40% 30% 0% 0% 0% Number of Factors n the Regresson Model Factor Factor Factor 3 All Other Factors Fgure 6. Monte-Carlo Tests of ER Procedure Usng Supersets of Data from 4 Through 5 Factors (crtcal value =.5) 00% 90% Cross-Model Ch-Square Statstc Exceeds.0 80% 70% 60% 50% 40% 30% 0% 0% 0% Number of Factors n the Regresson Model Factor Factor Factor 3 All Other Factors Fgure 7. Monte-Carlo Tests of ER Procedure Usng Supersets of Data from 4 Through 5 Factors (crtcal value = ) 7

19 To measure the effect of the determnstc model s goodness of ft on the cross-model ch-square statstc, we can arrange the Monte-Carlo results accordng to the varance of the error term n (0). The algorthm vares the error term from to 0 n ncrements of. Fgure 8 shows the proporton of tmes that c passes the sgnfcance threshold of 3 for factors X, X, and X 3 (combned) for varous numbers of factors n the data set and for varous levels of error varance. An error varance of corresponds to an R (for the estmate of equaton (0)) of approxmately 0.93 whle an error varance of 0 corresponds to an R of approxmately % 90% Cross-Model Ch-Square Statstc Exceeds % 70% 60% 50% 40% 30% 0% 0% 0% Varance of the Error Term 5-Factor Models Data 0-Factor Models Data 5-Factor Models Data Sets Sets Sets Fgure 8. Monte-Carlo Tests of ER Procedure for Factors X through X 3 (combned) As expected, as the error varance rses n a 5-factor data set, the probablty of a false negatve rses from approxmately 5% (when var(u) = ) to 30% (when var(u) = 0). Results are markedly worse for data sets wth fewer factors. Fgure 9 shows results for factors X 4 through 8

20 X 5, combned. Here, we see that the probablty of a false postve rses from under 5% (when var(u) = ) to almost 0% (when var(u) = 0) for 5-factor data sets. Employng a crtcal value of.0 yelds the results n Fgure 0 and Fgure. Comparng Fgure 8 and Fgure 9 wth Fgure 0 and Fgure, we see that employng the crtcal value of 3 versus cuts n half (approxmately) the lkelhoods of false postves and negatves. 00% 90% Cross-Model Ch-Square Statstc Exceeds % 70% 60% 50% 40% 30% 0% 0% 0% Varance of the Error Term 5-Factor Models Data 0-Factor Models Data 5-Factor Models Data Sets Sets Sets Fgure 9. Monte-Carlo Tests of ER Procedure for Factors X 4 through X 5, X 0, and X 5 (combned) 9

21 00% 90% Cross-Model Ch-Square Statstc Exceeds.0 80% 70% 60% 50% 40% 30% 0% 0% 0% Varance of the Error Term 5-Factor Models Data 0-Factor Models Data 5-Factor Models Data Sets Sets Sets Fgure 0. Monte-Carlo Tests of ER Procedure for Factors X through X 3 (combned) 00% 90% Cross-Model Ch-Square Statstc Exceeds.0 80% 70% 60% 50% 40% 30% 0% 0% 0% Varance of the Error Term 5-Factor Models Data 0-Factor Models Data 5-Factor Models Data Sets Sets Sets Fgure. Monte-Carlo Tests of ER Procedure for Factors X 4 through X 5, X 0, and X 5 (combned) 0

22 6. Estmated ER (EER) Even wth the applcaton of super computaton, large data sets can make ASR and ER mpractcal. For example, t would take a full year for a top-of-the-lne 00,000 node cluster computer to perform ASR/ER on a 50-factor data set. When one consders data mnng just smple non-lnear transformatons of factors (nverse, square, logarthm), the 50-factor data set becomes a 50-factor data set. If one then consders data mnng two-factor cross-products (e.g., X X, X X 3, etc.), the 50-factor data set balloons to an,75-factor data set. Ths suggests that super computaton alone sn t enough to make ER a unversal tool for data mnng. One possble approach to usng ER wth large data sets s to employ estmated ER. Estmated ER (EER) randomly selects J (out of a possble K ) models to estmate. Note that EER does not select the factors randomly, but selects the models randomly. Selectng factors randomly bases the model selecton toward models wth a total of K/ factors. Selectng models randomly gves each of the K models an equal probablty of beng chosen. Fgure and Fgure 3 show the results of EER for varous numbers of randomly selected models for 5-, 0-, and 5-Factor data sets. These tests were performed as follows:. Generate 500 observatons for each of the factors X, randomly selectng observatons from the unform dstrbuton.. Generate 500 observatons each for X through X K (where K s 5, 0, or 5) such that X = γ X + v where the γ are randomly selected from the standard normal dstrbuton and dstrbuted, and v are normally dstrbuted wth mean zero and varance 0.. Ths step creates varyng multcollnearty among the factors. 3. Generate Y accordng to (0) where α = β = β = β 3 =, and u s normally dstrbuted wth a varance of.

23 4. Randomly select J models out of the possble K regresson models to obtan J estmates for each β. 5. Calculate c, c, c 3,, c K accordng to (9). 6. Repeat steps through 5 three-hundred tmes. 7. Calculate the percentage of tmes that the cross-model test statstcs for each β exceed the crtcal value. 8. Repeat steps through 7, for J runnng from to % Cross-Model Ch-Square Statstc Exceeds.0 95% 90% 85% 80% 75% 70% 65% 60% 55% 50% Number of Randomly Selected Models 5-Factor Data Sets 0-Factor Data Sets 5-Factor Data Sets Fgure. Monte-Carlo Tests of EER Procedure for Factors X through X 3 (combned) Evdence suggests that smaller data sets are more senstve to a small number of randomly selected models. For 5-factor data sets, the rate of false negatves (Fgure ) does not stop fallng sgnfcantly untl approxmately J = 30, and the rate of false postves (Fgure 3)

24 does not stop rsng sgnfcantly untl approxmately J = 45. The rates of false negatves (Fgure ) and false postves (Fgure 3) for 0-factor and 5-factor data sets appear to settle for lesser values of J. 80% Cross-Model Ch-Square Statstc Exceeds.0 70% 60% 50% 40% 30% 0% 0% Number of Randomly Selected Models 5-Factor Data Sets 0-Factor Data Sets 5-Factor Data Sets Fgure 3. Monte-Carlo Tests of EER Procedure for Factors X 4 through X 5, X 0, and X 5 (combned) The number of factors n the data set that determne the outcome varable has a greater effect on the number of randomly selected models requred n EER. Fgure 4 compares results for 5-factor data sets when the outcome varable s a functon of only one factor versus beng a functon of three factors. The vertcal axs measures the cross-model ch-squared statstc for the ndcated number of randomly selected models dvded by the average cross-model ch-squared statstc over all 00 runs. Fgure 4 shows that, for 5-factor data sets, as the number of randomly selected models ncreases, the cross-model ch-squared statstc approaches ts mean value for 3

25 the 00 runs faster when the outcome varable s a functon of three factors versus beng a functon of only one factor. Fgure 5 shows smlar results for 0-factor data sets. Cross-Model Ch-Square Statstc Relatve to Mean Number of Randomly Selected Models Y = f (X) Y = f(x, X, X3) Fgure 4. EER Procedure for Factor X and X through X 3 (combned) for 5-Factor Data Sets when Outcome Varable s a Functon of One vs. Three Factors 4

26 Cross-Model Ch-Square Statstc Relatve to Mean Number of Randomly Selected Models Y = f (X) Y = f(x, X, X3) Fgure 5. EER Procedure for Factor X and X through X 3 (combned) for 0-Factor Data Sets when Outcome Varable s a Functon of One vs. Three Factors Cross-Model Ch-Square Statstc Relatve to Mean Number of Randomly Selected Models Y = f (X) Y = f(x, X, X3) Fgure 6. EER Procedure for Factor X and X through X 3 (combned) for 5-Factor Data Sets when Outcome Varable s a Functon of One vs. Three Factors 5

27 Results n Fgure 6 are less compellng, but not contradctory. A second result, common to the large data sets (Fgure 5 and Fgure 6), s that the varaton n the cross-model chsquared estmates s less when the outcome varance s a functon of three versus one factor. Fgure 7, Fgure 8, and Fgure 9 show correspondng results for factors that do not determne the outcome varable. These results suggest that EER may be an adequate procedure for estmatng ER for larger data sets. Cross-Model Ch-Square Statstc Relatve to Mean Number of Randomly Selected Models Y = f (X) Y = f(x, X, X3) Fgure 7. EER Procedure for Factors X through X 5 (combned) and X 4 through X 5 (combned) for 5-Factor Data Sets when Outcome Varable s a Functon of One vs. Three Factors 6

28 Cross-Model Ch-Square Statstc Relatve to Mean Number of Randomly Selected Models Y = f (X) Y = f(x, X, X3) Fgure 8. EER Procedure for Factors X through X 5 (combned) and X 4 through X 5 (combned) for 0-Factor Data Sets when Outcome Varable s a Functon of One vs. Three Factors Cross-Model Ch-Square Statstc Relatve to Mean Number of Randomly Selected Models Y = f (X) Y = f(x, X, X3) Fgure 9. EER Procedure for Factors X through X 5 (combned) and X 4 through X 5 (combned) for 5-Factor Data Sets when Outcome Varable s a Functon of One vs. Three Factors 7

29 7. Comparson to Stepwse and k-fold Holdout Kuk (984) demonstrated that stepwse procedures are nferor to all subsets procedures n data mnng proportonal hazard models. Logcally, the same argument apples to data mnng regresson models. As stepwse examnes only a subset of models and ASR examnes all models, the best that stepwse can do s to match ASR. Because stepwse smartly samples on the bass of margnal changes to a ft functon, multcollnearty among the factors can cause stepwse to return a soluton that s a local, but not global, optmum. What s of nterest s a comparson of stepwse to EER because EER, lke stepwse, samples the space of possble regresson models. K-fold holdout s less an alternatve data mnng method than t s an alternatve objectve. Data mnng methods typcally have the objectve of fndng the model that best fts the data (typcally measured by mprovements to the F-statstc). K-fold holdout offers the alternatve objectve of maxmzng the model s ablty to predct observatons that were not ncluded n the model estmaton (.e., held out observatons). The selecton of whch observatons to hold out vares dependng on the data set. For example, n the case of tme seres data, t makes more sense to hold out observatons at the end of the data set. The followng tests use the same data set and apply EER, estmated all subets regresson (EASR) where we conduct a samplng of models rather than examne all possble models, and stepwse. The procedure s as follows:. Generate 500 observatons for X, randomly selectng observatons from the unform dstrbuton.. Generate 500 observatons each for X through X 5 such that X = γ X + v where the γ are randomly selected from the standard normal dstrbuton and dstrbuted, and v are 8

30 normally dstrbuted wth mean zero and varance 0.. Ths step creates varyng multcollnearty among the factors. 3. Generate Y accordng to (0) where α = β = β = β 3 =, and u s normally dstrbuted wth a varance of. 4. Perform EER and EASR k-fold holdout: a. Randomly select 500 models out of the possble 30 regresson models to obtan J estmates for each β. b. Evaluate the k-fold holdout crteron: c. Calculate c, c, c 3,, c 30 accordng to (9). d. Mark factors for whch c > as beng selected by EER. 5. Evaluate the EER models usng the k-fold holdout crteron: a. For each randomly selected model n step 4a, randomly select 50 observatons to exclude. b. Estmate the model usng the remanng 450 observatons. c. Use the estmated model to predct the 50 excluded observatons. d. Calculate the MSE (mean squared error) where MSE = 50 observaton predcted observaton excluded observatons ( ) e. Over the 500 models randomly selected by EER, dentfy the one for whch the MSE s least. Mark the factors that produce that model as beng selected by the k-fold holdout crteron. 6. Peform backward stepwse: a. Let M be the set of ncluded factors, and N be the set of excluded factors such that M = m, N = n, and m+ n= 30. 9

31 b. Estmate the model β and calculate the estmated model s X M Y = α + X + u adjusted multple correlaton coeffcent, R 0. c. For each factor X, =,, m, move the factor from set M to set N, estmate the model Y = α + β X + u X M, calculate the estmated model s adjusted multple correlaton coeffcent, result n the set of measures d. Let R ( L mn R0,..., Rm ) measure R, and then return the factor X j to M from N. Ths wll R,..., R m. =. Identfy the factor whose removal resulted n the R L. Move that factor from set M to set N. e. Gven the new set M, for each factor X, =,, m, remove the factor from set M, estmate the model Y = α + β X + u X M, calculate the estmated model s adjusted multple correlaton coeffcent, from N. Ths wll result n the set of measures R, and then return the factor X j to M R,..., R m. f. For each factor X, =,, n, move the factor from N to M, estmate the model Y = α + β X + u X M, calculate the estmated model s adjusted multple correlaton coeffcent, R, and then return the factor X to M from N. Ths wll result n the set of measures R,..., m R. + m+ n g. Let R ( L = mn R,..., R m + n) and R ( H max R,..., R m + n) removal resulted n the measure =. Identfy the factor whose R L. Move that factor from set M to set N. If was attaned usng a factor from N, move that factor from set N to set M. R H 30

32 h. Repeat steps e through g untl there s no further mprovement n R H.. Mark the factors n the model attaned n step 5h as beng selected by stepwse. 7. Repeat steps through 6 sx-hundred tmes. 8. Calculate the percentage of tmes that each factor s selected by EER, k-fold, and stepwse. Fgure 0 shows the results of ths comparson. For 30 factors, where the outcome varable s determned by the frst three factors, stepwse correctly dentfed the frst factor slghtly less frequently than dd EER (43% of the tme versus 48% of the tme). Stepwse correctly dentfed the second and thrd factors slghtly more frequently than dd EER (87% and 9% of the tme versus 8% and 85% of the tme). The k-fold crteron when appled to EASR dentfed the frst three factors 53%, 63%, and 63% of the tme, respectvely. In ths, the false postve error rate was comparable for EER and stepwse, and sgnfcantly worse for k-fold. EER erroneously dentfed the remanng factors as beng sgnfcant, on average, 7% of the tme versus 34% of the tme for stepwse and 49% of the tme for k-fold. Ths suggests that EER may be more powerful as a tool for elmnatng factors from consderaton. 3

33 00% 90% 80% Proporton of Tmes Procedure Selects the Factor 70% 60% 50% 40% 30% 0% 0% 0% Factors EER Stepwse EASR Wth k-fold Crteron Fgure 0. Comparson of EER, Stepwse, and k-fold Crteron 8. Applcablty to Other Procedures and Drawbacks Ordnary least squares estmators belong to the class of maxmum lkelhood estmators. The estmates are unbased (.e., E( ˆ β ) arbtrarly small ε), and effcent (.e., var ( ˆ β ) var ( β ) = β ), consstent (.e., lm Pr ( ˆ β β > ε) = 0 for an N < % where β % s any lnear, unbased estmator of β. The ER procedure reles on the fact that parameter estmators are unbased n extraneous varable cases though based n varyng drectons across the omtted varable cases. 3

34 In the case of lmted dependent varable models (e.g., logt), parameter estmates are unbased and consstent when the model s correctly specfed. However, n the omtted varable case, logt slope coeffcents are based toward zero. For example, suppose the outcome varable, * Y, s determned by a latent varable Y * ff Y 0 such that Y = >. Suppose also that Y * s 0 otherwse determned by the equaton = α + β + β +. If we omt the factor X from the (logt) * Y X X u regresson model, we estmate * o Y α β X u = + +. Yatchew and Grlches (985) show that β β = < β o βσ X + σ u () As the denomnator n () s strctly greater than zero, the estmator s based toward zero. More mportantly, the based estmator has the same sgn as the unbased estmator. Ths volates all four of condtons n (6), any one of whch would valdate the ER procedure. However, whle estmates of determnstc parameters are based downward, estmates of nondetermnstc parameters are randomly based. For example, f Y * s determned by the equaton * Y α βx βx u = and we estmate * o o Y α β X β3 X3 u = + + +, we obtan β β = < β o βσ X + σ u () β o 3 β 3 = = βσ X + σ u 0 (3) Because X 3 s extraneous, β 3 s zero and so repeated estmates of ˆ β o 3 wll be randomly dstrbuted around zero. In short, parameter estmates for determnstc factors wll be: () unbased n the 33

35 correctly specfed case, () unbased n the extraneous varable case, and (3) based toward zero n the omtted varable case. But, parameter estmates for spurous factors wll be unbased toward zero n all three cases. Hence, we can expect the cross-model ch-square statstcs to work n the case of logt models, though lkely n an asymptotc sense. One drawback of the ER (and EER) procedure s that the procedure may select a set of factors that, whle ndvdually passng the cross-model ch-square test, are not statstcally sgnfcant when run together n a regresson model. One possble explanaton s that, wthn the confnes of a sngle regresson model, the error varance s large enough to drown out the explanatory power of a factor but, because the cross-model ch-square statstc s based on crossmodel nformaton that has estmated and fltered out the error term, the factor appears sgnfcant. Ths s analogous to the gan n nformaton obtaned from employng panel data versus tme seres data (cf., Daves, 006). A tme seres data set may measure the same phenomenon over the same tme perod as a panel data set, but because the panel data set also measures the phenomenon over multple cross-sectons, nformaton across cross-sectons can be used to mtgate the error varance wthn a gven tme perod. 9. Concluson The purpose of ths analyss s to buld on ASR by proposng a new procedure, ER, that combnes ASR wth a cross-model ch-square statstc that attempts to dstngush determnstc factors from spurous factors. Recognzng that, for large data sets, even ER s nfeasble even wth the applcaton of super computaton, ths analyss further proposes an adaptaton on ER, EER, whch samples the space of possble regresson models. 34

36 Ths analyss () proves that, under condtons less strngent than the classcal lnear model condtons, the cross-model ch-square statstc s a reasonable measure for dstngushng between determnstc and spurous factors, () demonstrates va Monte-Carlo studes the lkelhood of ER produce false postve and false negatve results under condtons of varyng numbers of factors n the data set, varyng varance of the regresson error, and varyng choce of crtcal value, (3) proposes a procedure, EER, that estmates ER results va samplng a subset of the space of possble regresson models, (4) demonstrates va Monte-Carlo studes the lkelhood of EER producng false postve and false negatve results under condtons of varyng sample sze, and varyng number of factors n the determnstc equaton, (5) compares EER model selecton wth backward stepwse and the k-fold crteron by va Monte-Carlo studes that apply the same data sets to all three procedures, and (6) demonstrates that EER avods false postves and false negatves better than the k-fold crteron, avods false postves approxmately as well as stepwse, and avods false negatves sgnfcantly better than stepwse. 35

37 0. References Carmnes, E.G., and J.P. McIver, 98. Analyzng models wth unobserved varables: Analyss of covarance structures, n Bohmstedt. In G.W. and E.F. Borgatta, eds., Socal Measurement. Sage Publcatons: Thousand Oaks, CA. pp Daves, A., 006. A framework for decomposng shocks and measurng volatltes derved from mult-dmensonal panel data of survey forecasts. Internatonal Journal of Forecastng, (): Klne, R.B., 998. Prncples and practce of structural equaton modelng. Gulford Press: New York. Kuk, A.C., 984. All subsets regresson n proportonal hazard models. Bometrka, 7(3): Ullman, J.B., 00. Structural equaton modelng. In Tabachnck, B.G. and L.S. Fdell, eds., Usng Multvarate Statstcs. Allyn and Bacon: Needham Heghts, MA. pp Yatchew, A. and Z. Grlches, 985. Specfcaton error n probt models. The Revew of Economcs and Statstcs, 67: