Exhaustive Regression. An Exploration of RegressionBased Data Mining Techniques Using Super Computation


 Elmer Walker
 2 years ago
 Views:
Transcription
1 Exhaustve Regresson An Exploraton of RegressonBased Data Mnng Technques Usng Super Computaton Antony Daves, Ph.D. Assocate Professor of Economcs Duquesne Unversty Pttsburgh, PA 58 Research Fellow The Mercatus Center George Mason Unversty Arlngton, VA 0 RPF Workng Paper No August 8, 008 RESEARCH PROGRAM ON FORECASTING Center of Economc Research Department of Economcs The George Washngton Unversty Washngton, DC Research Program on Forecastng (RPF) Workng Papers represent prelmnary work crculated for comment and dscusson. Please contact the author(s) before ctng ths paper n any publcatons. The vews expressed n RPF Workng Papers are solely those of the author(s) and do not necessarly represent the vews of RPF or George Washngton Unversty.
2 Exhaustve Regresson An Exploraton of RegressonBased Data Mnng Technques Usng Super Computaton Antony Daves, Ph.D. Assocate Professor of Economcs Duquesne Unversty Pttsburgh, PA 58 Research Fellow The Mercatus Center George Mason Unversty Arlngton, VA 0
3 Regresson analyss s ntended to be used when the researcher seeks to test a gven hypothess aganst a data set. Unfortunately, n many applcatons t s ether not possble to specfy a hypothess, typcally because the research s n a very early stage, or t s not desrable to form a hypothess, typcally because the number of potental explanatory varables s very large. In these cases, researchers have resorted ether to overt data mnng technques such as stepwse regresson, or covert data mnng technques such as runnng varatons on regresson models pror to runnng the fnal model (also known as data peekng ). Whle data mnng sdesteps the need to form a hypothess, t s hghly susceptble to generatng spurous results. Ths paper draws on the known propertes of OLS estmators n the presence of omtted and extraneous varable models to propose a procedure for data mnng that attempts to dstngush between parameter estmates that are sgnfcant due to an underlyng structural relatonshp and those that are sgnfcant due to random chance.
4 . Background Regresson analyss s desgned to estmate the probablty of observng a gven data set gven that a predetermned hypothess about the relatonshp between an outcome varable and a set of factors s assumed to be true. Unfortunately, n many applcatons t s not possble to specfy a hypothess. Two possble reasons why a researcher would want to perform analyss absent a hypothess are: Large Data Scope: Data sze refers to the number of observatons n a data set. Data scope refers to the number of potental explanatory factors ( canddate factors ) n the data set. In the case of large data scope (e.g., economc and fnancal data sets), the number of canddate factors s so large that the cost of formng a tractable hypothess s prohbtve. Early Stage Analyss: In the case of early stage analyss (e.g., clncal data), there has not been enough observaton to as yet for a hypothess. To date, stepwse regresson (SR) has been one of the more wdely used technques for performng these hypothessless analyses. SR methods perform a smart samplng of regresson models n an attempt to fnd a regresson model that best fts the data. The procedure for smartly samplng s adhoc. Two typcally used procedures are backward (wheren the frst model ncludes all factors and factors are removed one at a tme) and forward (wheren the frst model ncludes no factors and factors are added one at a tme). SR procedures contan three major flaws: Samplng sze: The number of regresson models that can be constructed from a gven data set can be ncredbly large. For example, wth only 30 canddate 3
5 factors one could construct more than bllon regresson models ( 30 ). Stepwse procedures sample only a small number (typcally less than 00) of the set of possble regresson models. Whle stepwse methods can fnd models that ft the data reasonably well, as the number of factors rses, the probablty of stepwse methods fndng the bestft model s vrtually zero. Ft crteron: In evaluatng competng models, stepwse methods typcally employ an Fstatstc crteron. Ths crteron causes stepwse to methods to seek out the model that comes closest to explanng the data set. However, as the number of canddate factors ncreases, what also ncreases s the probablty of a gven factor addng explanatory power smply by random chance. Thus, the stepwse ft crteron cannot dstngush between factors that contrbute explanatory power to the outcome varable because of an underlyng relatonshp ( determnstc factors ) and those that contrbute by random chance only ( spurous factors ). Intal condton: Because SR only examnes a small subset of the space of possble models and because the space of fts of the models (potentally) contans many local optma, the soluton SR returns vares based on the startng pont of the search. For example, for the same data set, SR backward and SR forward can yeld dfferent results. As the startng ponts for SR backward and SR forward are arbtrary, there are many other potental startng ponts each of whch potentally yelds a dfferent result. For example, Fgure depcts the space of possble regresson models that can be formed usng K factors. Each block represents one regresson model. The shade of the block ndcates the qualty of the model (e.g., goodness of ft). A stepwse procedure that starts at model A 4
6 evaluates models n the vcnty of A, moves to the best model, then reevaluates n the new vcnty. Ths contnues untl the procedure cannot fnd a better model n the vcnty. In ths example, stepwse would move along the ndcated path startng at pont A. Were stepwse to start at pont B, however, t would end up at a dfferent optmal model. Out of the four startng ponts shown, only startng pont D fnds the best model. Fgure. Results from stepwse procedures are dependent on the ntal condton from whch the search commences.. All Subsets Regresson All Subsets Regresson (ASR) s a procedure ntended to be used when a researcher wants to perform analyss n the absence of a hypothess and wants to avod the samplng sze problem nherent n stepwse procedures. For K potental explanatory factors, ASR examnes all K lnear models that can be constructed. Untl recently, ASR has been nfeasble due to the massve computng power requred. By employng grdenabled super computaton, t s now 5
7 feasble to conduct ASR for moderately szed data sets. For example, as of today an average computer would requre nearly 00 years of contnuous work to perform ASR on a data set contanng 40 canddate factors (see Fgure ), whle a 0,000 node grd computer could complete the same analyss n less than a week. Whle ASR solves the samplng sze and ntal condton problems nherent n SR, ASR remans subject to the ft crteron problem. If anythng, the ft crteron s more of a problem for ASR as the procedure examnes a much larger space of models than does SR and therefore s more lkely to fnd spurous factors. Fgure. The tme a sngle computer requres to perform ASR rses exponentally wth the number of canddate factors. 6
8 3. Exhaustve Regresson Exhaustve Regresson (ER) utlzes the ASR procedure, but attempts to dentfy spurous factors va a crossmodel chsquare statstc that tests for stablty n parameter estmates across models. The crossmodel stablty test compares parameter estmates for each factor across all K models n whch the factor appears. Factors whose parameter estmates yeld sgnfcantly dfferent results across models are dentfed as spurous. A gven factor can exst n one of three types of models: omtted factor model, correctly specfed model, and extraneous factor model. A correctly specfed model contans all of the explanatory factors (.e., the factors contrbute to explanng the outcome varable because of some underlyng relatonshp between the factor and the outcome) and no other factors. An omtted varable model ncludes at least one, but not all explanatory factors and (possbly) other factors. An extraneous varable model ncludes all explanatory factors and at least one other factor. In the cases of the correctly specfed model and the extraneous varable model, estmates of parameters assocated wth the factors ( slope coeffcents ) are unbased. In the case of the omtted varable model, however, estmates of the slope coeffcents are based and the drecton of bas s a functon of the covarances of the omtted factor wth the outcome varable and the omtted factor wth the factors ncluded n the model. For example, consder a data set contanng k +k +k 3 factors, for each of whch there are N observatons. Let the factors be arranged nto three Nxk j, j={,,3} matrces X, X, and X 3, and let the sets of slope coeffcents assocated wth each set of factors be the k j x vectors β, β, and β 3, respectvely. Let Y be an Nx vector of observatons on the outcome varable. Suppose that, unknown to the researcher, the process that determnes the outcome varable s Assumng, of course, that the remanng classcal lnear model assumptons hold. 7
9 Y=Xβ +Xβ +u () where u s an Nx vector of ndependently and dentcally normally dstrbuted errors. The three cases are attaned when we apply the ordnary least squares (OLS) procedure to estmate the followng models: Omtted Varable Case: Y=Xβ + u O Correctly Specfed Case: Y=Xβ +Xβ +u Extraneous Varable Case: Y=Xβ +Xβ +Xβ 3 3 +u E In the correctly specfed case, the ordnary least squares regresson procedure yelds the slope estmate: βˆ  ' ' = ([ X X] [ X X] ) [ X X] Y ˆ β () where the square brackets ndcate a parttoned matrx. Substtutng the defnton for Y from () nto (), t can be shown that the expected values of the slope estmates equal the true slope values: βˆ β E = ˆ β β Ths s the unbasedness condton. Smlarly, n the extraneous varable case, we have: βˆ '  ' β ˆ = ([ X X X3] [ X X X3] ) [ X X X3] Y ˆ β3 (3) Agan, t can be shown that the expected values of the slope estmates equal the true slope values: 8
10 By contrast, n the omtted varable case, we have Substtutng () nto (4), we have and the expected values of the slope estmates are: βˆ β E ˆ β = β ˆ β3 0 ' ( )  ˆ ' β = X X X Y (4) '  ' ( ) ( ) β ˆ = X X X X β +X β +u E ( ˆ ' ) ( )  ' β = β + X X X X β β (5) From (5), we see that the expected value of the slope estmates n the omtted varable case are based and that the drecton of the bas depends on ' XXβ. We can construct a sequence of omtted varable cases as follows. Let { Z,Z,..., Z } be the set of all (columnwse unque) subsets of X. Let X % be the set of k regressors formed by mergng X and X and then removng Z so that, from the superset of regressors formed by mergng X and X, X % s the set of ncluded regressors and Z s the correspondng set of excluded regressors. Fnally, let β ˆ be the OLS estmate of β obtaned by regressng Y on X %. The expected value of the mean of the β ˆ k across all regresson models s: k k ˆ '  ' E E k β = β + ( ) k X% X% X% Zβ (6) = = Snce Z s are subsets of X, each Z s Nxj where j k. 9
11 From (6), we see that the expected value of the mean of the β ˆ s β when any of the followng condtons are met:. For each set of ncluded regressors, there s no covarance between the ncluded regressors and the correspondng set of excluded regressors (.e., ' XZ % =0 );. For some sets of ncluded regressors, there s a nonzero covarance between the ncluded ' regressors and the correspondng set of excluded regressors (.e., XZ % 0 ), but the XZ % =0); ' expected value of the covarances s zero (.e., E ( ) 3. For some sets of ncluded regressors, there s a nonzero covarance between the ncluded ' regressors and the correspondng set of excluded regressors (.e., XZ % 0 ), and the ' expected value of the covarances s nonzero (.e., E ( ) XZ % 0), but the expected covarance of the excluded regressors wth the dependent varable s zero (.e., E ( β ) =0); 4. None of the above holds, but the expected value of the product of () the covarances between the ncluded and excluded regressors and () the covarance of the excluded E XZβ % = 0 ). ' regressors wth the dependent varable s zero (.e., ( ) The ER procedure reles on the reasonable assumpton that, as the data number of factors n the data set ncreases, condtons (), (3), and (4) wll hold asymptotcally. If true, ths enables us to construct the crossmodel chsquare statstc. 0
12 4. The CrossModel ChSquare Statstc Let us assume that the th (n a set of K) factor, x (where x s an Nx vector) has no structural relatonshp wth an outcome varable y. Consder the equaton: y = α + β x + β x + + β x + u (7)... k k Under the null hypothess of β = 0, the square of the rato of ˆ β to ts standard error s dstrbuted χ wth one degree of freedom. By addng factors to and removng factors (other than x ) from (7), we can obtan other estmates of β. Let ˆj β be the j th such estmate of β. By lookng at all combnatons of factors from the superset of K factors, we can construct K estmates of β. Under the null hypothess that β = 0 and assumng that the ˆj β are ndependent, we have: K ˆ β j ~ χ K (8) j= s β From (8), we can construct c, the crossmodel chsquare statstc for the factor x : c K ˆ β j ~ χ K (9) j= s β = Gven the (typcally) large number of degrees of freedom nherent n the ER procedure, t s worth notng that the measure n (8) s lkely to be subject to Type II errors. In an attempt to compensate, we dvde by the number of degrees of freedom to obtan c, a relatve chsquare statstc, a measure that s less dependent on sample sze. Carmnes and McIver (98) and Klne (998) clam that one should conclude that the data represent a good ft to the hypothess when the relatve chsquare measure s less than 3. Ullman (00) recommends usng a chsquare less than.
13 Because the ˆj β are obtaned by explorng all combnatons of factors from a sngle superset, one mght expect the ˆj β to be correlated (partcularly when factors are postvely correlated), and for the correlaton to ncrease n the presence of stronger multcollnearty among the factors. 5. MonteCarlo Tests of c To test the ablty of the crossmodel chsquare statstc to dentfy factors that mght show statstcally sgnfcant slope coeffcents smply by random chance, consder an outcome varable, Y, that s determned by three factors as follows Y = α + β X + β X + β X + u (0) 3 3 where u s an error term satsfyng the requrements for the classcal lnear regresson model. Let us randomly generate addtonal factors X 4 through X 5, run the ER procedure and calculate c for each of the factors. The followng fgures show the results of the ER runs. The frst set of bars show results for ER runs appled to the superset of factors X through X 4. The results are derved as follows:. Generate 500 observatons for X, randomly selectng observatons from the unform dstrbuton.. Generate 500 observatons each for X through X 5 such that X = γ X + v where the γ are randomly selected from the standard normal dstrbuton and dstrbuted, and v are normally dstrbuted wth mean zero and varance 0.. Ths step creates varyng multcollnearty among the factors. 3. Generate Y accordng to (0) where α = β = β = β 3 =, and u s normally dstrbuted wth a varance of.
14 4. Run all K = 5 regresson models to obtan the K estmates for each β: ˆ β,..., ˆ β, ˆ β,..., ˆ β, ˆ β,..., ˆ β, ˆ β,..., ˆ β.,,8,,8 3, 3,8 4, 4,8 5. Calculate c, c, c 3, and c 4 accordng to (9). 6. Repeat steps through 5 threethousand tmes. 7. Repeat steps through 6, each tme ncreasng the varance of u by untl the varance reaches 0. 3 At the completon of the algorthm, there wll be 60,000 measures for each of c, c, c 3, and c 4 based on (60,000) (5) = 900,000 separate regressons. We then repeat the procedure usng a superset of factors X through X 5, then X through X 6, etc. up to X through X 5. 4 Step ntroduces random multcollnearty among the factors. On average, the multcollnearty of factors wth X follows the pattern shown n Fgure 3. Approxmately half of the correlatons wth X are postve and half are negatve. Whle the correlatons are constructed between X and the other factors, ths wll also result n the other factors beng parwse correlated though to a lesser extent (on average) than they are correlated wth X. 3 Ths results n an average R for the estmate of equaton (0) of approxmately Ths fnal pass requres the estmaton of bllon separate regressons. The entre MonteCarlo run requres computaton equvalent to almost,000 CPUhours. 3
15 has less than ths correlaton wth X % 0% 0% 30% 40% 50% 60% 70% 80% 90% 00% Ths proporton of factors Fgure 3. Pattern of Squared Correlatons of Factors X through X K wth X Fgure 4 shows the results of the MonteCarlo runs n whch the crtcal value for the c s set to 3. For example, when there are four factors n the data set, c, c, and c 3 pass the sgnfcance test slghtly over 50% of the tme versus 0% for c 4. In other words, for data sets n whch three out of four factors determne the outcome varable (and a crtcal value of 3), the ER procedure wll dentfy the three determnng factors 50% of the tme and dentfy the nondetermnng factor 0% of the tme. As the number of factors n the data set ncreases, the ER procedure better dscrmnates between the factors that determne the outcome varable and those that mght appear sgnfcant by random chance alone. The last set of bars n Fgure 4 shows the results for data sets n whch three out of ffteen factors determne the outcome varable. Here, 4
16 the ER procedure dentfes the determnng factors approxmately 85% of the tme and (erroneously) dentfes nondetermnng factors only slghtly more than 0% of the tme. 00% 90% CrossModel ChSquare Statstc Exceeds % 70% 60% 50% 40% 30% 0% 0% 0% Number of Factors n the Regresson Model Factor Factor Factor 3 All Other Factors Fgure 4. MonteCarlo Tests of ER Procedure Usng Supersets of Data from 4 Through 5 Factors (crtcal value = 3) These results are based on the somewhat arbtrary selecton of 3 for the relatve chsquare crtcal value. Reducng the crtcal value to produces the results shown n Fgure 5. As expected, reducng the crtcal value causes the ncdence of false postves (where postve means dentfcaton of a determnng factor ) to approxmately 5%, but the ncdence of false negatves falls to below 0%. Fgure 6 and Fgure 7, where the crtcal value s set to.5 and, respectvely, are shown for comparson. As expected, the margnal gans n the reducton n false negatves (versus Fgure 5) appear to be outweghed by the ncrease n the rate of false postves. 5
17 00% 90% CrossModel ChSquare Statstc Exceeds.0 80% 70% 60% 50% 40% 30% 0% 0% 0% Number of Factors n the Regresson Model Factor Factor Factor 3 All Other Factors Fgure 5. MonteCarlo Tests of ER Procedure Usng Supersets of Data from 4 Through 5 Factors (crtcal value = ) 6
18 00% 90% CrossModel ChSquare Statstc Exceeds.5 80% 70% 60% 50% 40% 30% 0% 0% 0% Number of Factors n the Regresson Model Factor Factor Factor 3 All Other Factors Fgure 6. MonteCarlo Tests of ER Procedure Usng Supersets of Data from 4 Through 5 Factors (crtcal value =.5) 00% 90% CrossModel ChSquare Statstc Exceeds.0 80% 70% 60% 50% 40% 30% 0% 0% 0% Number of Factors n the Regresson Model Factor Factor Factor 3 All Other Factors Fgure 7. MonteCarlo Tests of ER Procedure Usng Supersets of Data from 4 Through 5 Factors (crtcal value = ) 7
19 To measure the effect of the determnstc model s goodness of ft on the crossmodel chsquare statstc, we can arrange the MonteCarlo results accordng to the varance of the error term n (0). The algorthm vares the error term from to 0 n ncrements of. Fgure 8 shows the proporton of tmes that c passes the sgnfcance threshold of 3 for factors X, X, and X 3 (combned) for varous numbers of factors n the data set and for varous levels of error varance. An error varance of corresponds to an R (for the estmate of equaton (0)) of approxmately 0.93 whle an error varance of 0 corresponds to an R of approxmately % 90% CrossModel ChSquare Statstc Exceeds % 70% 60% 50% 40% 30% 0% 0% 0% Varance of the Error Term 5Factor Models Data 0Factor Models Data 5Factor Models Data Sets Sets Sets Fgure 8. MonteCarlo Tests of ER Procedure for Factors X through X 3 (combned) As expected, as the error varance rses n a 5factor data set, the probablty of a false negatve rses from approxmately 5% (when var(u) = ) to 30% (when var(u) = 0). Results are markedly worse for data sets wth fewer factors. Fgure 9 shows results for factors X 4 through 8
20 X 5, combned. Here, we see that the probablty of a false postve rses from under 5% (when var(u) = ) to almost 0% (when var(u) = 0) for 5factor data sets. Employng a crtcal value of.0 yelds the results n Fgure 0 and Fgure. Comparng Fgure 8 and Fgure 9 wth Fgure 0 and Fgure, we see that employng the crtcal value of 3 versus cuts n half (approxmately) the lkelhoods of false postves and negatves. 00% 90% CrossModel ChSquare Statstc Exceeds % 70% 60% 50% 40% 30% 0% 0% 0% Varance of the Error Term 5Factor Models Data 0Factor Models Data 5Factor Models Data Sets Sets Sets Fgure 9. MonteCarlo Tests of ER Procedure for Factors X 4 through X 5, X 0, and X 5 (combned) 9
21 00% 90% CrossModel ChSquare Statstc Exceeds.0 80% 70% 60% 50% 40% 30% 0% 0% 0% Varance of the Error Term 5Factor Models Data 0Factor Models Data 5Factor Models Data Sets Sets Sets Fgure 0. MonteCarlo Tests of ER Procedure for Factors X through X 3 (combned) 00% 90% CrossModel ChSquare Statstc Exceeds.0 80% 70% 60% 50% 40% 30% 0% 0% 0% Varance of the Error Term 5Factor Models Data 0Factor Models Data 5Factor Models Data Sets Sets Sets Fgure. MonteCarlo Tests of ER Procedure for Factors X 4 through X 5, X 0, and X 5 (combned) 0
22 6. Estmated ER (EER) Even wth the applcaton of super computaton, large data sets can make ASR and ER mpractcal. For example, t would take a full year for a topofthelne 00,000 node cluster computer to perform ASR/ER on a 50factor data set. When one consders data mnng just smple nonlnear transformatons of factors (nverse, square, logarthm), the 50factor data set becomes a 50factor data set. If one then consders data mnng twofactor crossproducts (e.g., X X, X X 3, etc.), the 50factor data set balloons to an,75factor data set. Ths suggests that super computaton alone sn t enough to make ER a unversal tool for data mnng. One possble approach to usng ER wth large data sets s to employ estmated ER. Estmated ER (EER) randomly selects J (out of a possble K ) models to estmate. Note that EER does not select the factors randomly, but selects the models randomly. Selectng factors randomly bases the model selecton toward models wth a total of K/ factors. Selectng models randomly gves each of the K models an equal probablty of beng chosen. Fgure and Fgure 3 show the results of EER for varous numbers of randomly selected models for 5, 0, and 5Factor data sets. These tests were performed as follows:. Generate 500 observatons for each of the factors X, randomly selectng observatons from the unform dstrbuton.. Generate 500 observatons each for X through X K (where K s 5, 0, or 5) such that X = γ X + v where the γ are randomly selected from the standard normal dstrbuton and dstrbuted, and v are normally dstrbuted wth mean zero and varance 0.. Ths step creates varyng multcollnearty among the factors. 3. Generate Y accordng to (0) where α = β = β = β 3 =, and u s normally dstrbuted wth a varance of.
23 4. Randomly select J models out of the possble K regresson models to obtan J estmates for each β. 5. Calculate c, c, c 3,, c K accordng to (9). 6. Repeat steps through 5 threehundred tmes. 7. Calculate the percentage of tmes that the crossmodel test statstcs for each β exceed the crtcal value. 8. Repeat steps through 7, for J runnng from to % CrossModel ChSquare Statstc Exceeds.0 95% 90% 85% 80% 75% 70% 65% 60% 55% 50% Number of Randomly Selected Models 5Factor Data Sets 0Factor Data Sets 5Factor Data Sets Fgure. MonteCarlo Tests of EER Procedure for Factors X through X 3 (combned) Evdence suggests that smaller data sets are more senstve to a small number of randomly selected models. For 5factor data sets, the rate of false negatves (Fgure ) does not stop fallng sgnfcantly untl approxmately J = 30, and the rate of false postves (Fgure 3)
24 does not stop rsng sgnfcantly untl approxmately J = 45. The rates of false negatves (Fgure ) and false postves (Fgure 3) for 0factor and 5factor data sets appear to settle for lesser values of J. 80% CrossModel ChSquare Statstc Exceeds.0 70% 60% 50% 40% 30% 0% 0% Number of Randomly Selected Models 5Factor Data Sets 0Factor Data Sets 5Factor Data Sets Fgure 3. MonteCarlo Tests of EER Procedure for Factors X 4 through X 5, X 0, and X 5 (combned) The number of factors n the data set that determne the outcome varable has a greater effect on the number of randomly selected models requred n EER. Fgure 4 compares results for 5factor data sets when the outcome varable s a functon of only one factor versus beng a functon of three factors. The vertcal axs measures the crossmodel chsquared statstc for the ndcated number of randomly selected models dvded by the average crossmodel chsquared statstc over all 00 runs. Fgure 4 shows that, for 5factor data sets, as the number of randomly selected models ncreases, the crossmodel chsquared statstc approaches ts mean value for 3
25 the 00 runs faster when the outcome varable s a functon of three factors versus beng a functon of only one factor. Fgure 5 shows smlar results for 0factor data sets. CrossModel ChSquare Statstc Relatve to Mean Number of Randomly Selected Models Y = f (X) Y = f(x, X, X3) Fgure 4. EER Procedure for Factor X and X through X 3 (combned) for 5Factor Data Sets when Outcome Varable s a Functon of One vs. Three Factors 4
26 CrossModel ChSquare Statstc Relatve to Mean Number of Randomly Selected Models Y = f (X) Y = f(x, X, X3) Fgure 5. EER Procedure for Factor X and X through X 3 (combned) for 0Factor Data Sets when Outcome Varable s a Functon of One vs. Three Factors CrossModel ChSquare Statstc Relatve to Mean Number of Randomly Selected Models Y = f (X) Y = f(x, X, X3) Fgure 6. EER Procedure for Factor X and X through X 3 (combned) for 5Factor Data Sets when Outcome Varable s a Functon of One vs. Three Factors 5
27 Results n Fgure 6 are less compellng, but not contradctory. A second result, common to the large data sets (Fgure 5 and Fgure 6), s that the varaton n the crossmodel chsquared estmates s less when the outcome varance s a functon of three versus one factor. Fgure 7, Fgure 8, and Fgure 9 show correspondng results for factors that do not determne the outcome varable. These results suggest that EER may be an adequate procedure for estmatng ER for larger data sets. CrossModel ChSquare Statstc Relatve to Mean Number of Randomly Selected Models Y = f (X) Y = f(x, X, X3) Fgure 7. EER Procedure for Factors X through X 5 (combned) and X 4 through X 5 (combned) for 5Factor Data Sets when Outcome Varable s a Functon of One vs. Three Factors 6
28 CrossModel ChSquare Statstc Relatve to Mean Number of Randomly Selected Models Y = f (X) Y = f(x, X, X3) Fgure 8. EER Procedure for Factors X through X 5 (combned) and X 4 through X 5 (combned) for 0Factor Data Sets when Outcome Varable s a Functon of One vs. Three Factors CrossModel ChSquare Statstc Relatve to Mean Number of Randomly Selected Models Y = f (X) Y = f(x, X, X3) Fgure 9. EER Procedure for Factors X through X 5 (combned) and X 4 through X 5 (combned) for 5Factor Data Sets when Outcome Varable s a Functon of One vs. Three Factors 7
29 7. Comparson to Stepwse and kfold Holdout Kuk (984) demonstrated that stepwse procedures are nferor to all subsets procedures n data mnng proportonal hazard models. Logcally, the same argument apples to data mnng regresson models. As stepwse examnes only a subset of models and ASR examnes all models, the best that stepwse can do s to match ASR. Because stepwse smartly samples on the bass of margnal changes to a ft functon, multcollnearty among the factors can cause stepwse to return a soluton that s a local, but not global, optmum. What s of nterest s a comparson of stepwse to EER because EER, lke stepwse, samples the space of possble regresson models. Kfold holdout s less an alternatve data mnng method than t s an alternatve objectve. Data mnng methods typcally have the objectve of fndng the model that best fts the data (typcally measured by mprovements to the Fstatstc). Kfold holdout offers the alternatve objectve of maxmzng the model s ablty to predct observatons that were not ncluded n the model estmaton (.e., held out observatons). The selecton of whch observatons to hold out vares dependng on the data set. For example, n the case of tme seres data, t makes more sense to hold out observatons at the end of the data set. The followng tests use the same data set and apply EER, estmated all subets regresson (EASR) where we conduct a samplng of models rather than examne all possble models, and stepwse. The procedure s as follows:. Generate 500 observatons for X, randomly selectng observatons from the unform dstrbuton.. Generate 500 observatons each for X through X 5 such that X = γ X + v where the γ are randomly selected from the standard normal dstrbuton and dstrbuted, and v are 8
30 normally dstrbuted wth mean zero and varance 0.. Ths step creates varyng multcollnearty among the factors. 3. Generate Y accordng to (0) where α = β = β = β 3 =, and u s normally dstrbuted wth a varance of. 4. Perform EER and EASR kfold holdout: a. Randomly select 500 models out of the possble 30 regresson models to obtan J estmates for each β. b. Evaluate the kfold holdout crteron: c. Calculate c, c, c 3,, c 30 accordng to (9). d. Mark factors for whch c > as beng selected by EER. 5. Evaluate the EER models usng the kfold holdout crteron: a. For each randomly selected model n step 4a, randomly select 50 observatons to exclude. b. Estmate the model usng the remanng 450 observatons. c. Use the estmated model to predct the 50 excluded observatons. d. Calculate the MSE (mean squared error) where MSE = 50 observaton predcted observaton excluded observatons ( ) e. Over the 500 models randomly selected by EER, dentfy the one for whch the MSE s least. Mark the factors that produce that model as beng selected by the kfold holdout crteron. 6. Peform backward stepwse: a. Let M be the set of ncluded factors, and N be the set of excluded factors such that M = m, N = n, and m+ n= 30. 9
31 b. Estmate the model β and calculate the estmated model s X M Y = α + X + u adjusted multple correlaton coeffcent, R 0. c. For each factor X, =,, m, move the factor from set M to set N, estmate the model Y = α + β X + u X M, calculate the estmated model s adjusted multple correlaton coeffcent, result n the set of measures d. Let R ( L mn R0,..., Rm ) measure R, and then return the factor X j to M from N. Ths wll R,..., R m. =. Identfy the factor whose removal resulted n the R L. Move that factor from set M to set N. e. Gven the new set M, for each factor X, =,, m, remove the factor from set M, estmate the model Y = α + β X + u X M, calculate the estmated model s adjusted multple correlaton coeffcent, from N. Ths wll result n the set of measures R, and then return the factor X j to M R,..., R m. f. For each factor X, =,, n, move the factor from N to M, estmate the model Y = α + β X + u X M, calculate the estmated model s adjusted multple correlaton coeffcent, R, and then return the factor X to M from N. Ths wll result n the set of measures R,..., m R. + m+ n g. Let R ( L = mn R,..., R m + n) and R ( H max R,..., R m + n) removal resulted n the measure =. Identfy the factor whose R L. Move that factor from set M to set N. If was attaned usng a factor from N, move that factor from set N to set M. R H 30
32 h. Repeat steps e through g untl there s no further mprovement n R H.. Mark the factors n the model attaned n step 5h as beng selected by stepwse. 7. Repeat steps through 6 sxhundred tmes. 8. Calculate the percentage of tmes that each factor s selected by EER, kfold, and stepwse. Fgure 0 shows the results of ths comparson. For 30 factors, where the outcome varable s determned by the frst three factors, stepwse correctly dentfed the frst factor slghtly less frequently than dd EER (43% of the tme versus 48% of the tme). Stepwse correctly dentfed the second and thrd factors slghtly more frequently than dd EER (87% and 9% of the tme versus 8% and 85% of the tme). The kfold crteron when appled to EASR dentfed the frst three factors 53%, 63%, and 63% of the tme, respectvely. In ths, the false postve error rate was comparable for EER and stepwse, and sgnfcantly worse for kfold. EER erroneously dentfed the remanng factors as beng sgnfcant, on average, 7% of the tme versus 34% of the tme for stepwse and 49% of the tme for kfold. Ths suggests that EER may be more powerful as a tool for elmnatng factors from consderaton. 3
33 00% 90% 80% Proporton of Tmes Procedure Selects the Factor 70% 60% 50% 40% 30% 0% 0% 0% Factors EER Stepwse EASR Wth kfold Crteron Fgure 0. Comparson of EER, Stepwse, and kfold Crteron 8. Applcablty to Other Procedures and Drawbacks Ordnary least squares estmators belong to the class of maxmum lkelhood estmators. The estmates are unbased (.e., E( ˆ β ) arbtrarly small ε), and effcent (.e., var ( ˆ β ) var ( β ) = β ), consstent (.e., lm Pr ( ˆ β β > ε) = 0 for an N < % where β % s any lnear, unbased estmator of β. The ER procedure reles on the fact that parameter estmators are unbased n extraneous varable cases though based n varyng drectons across the omtted varable cases. 3
34 In the case of lmted dependent varable models (e.g., logt), parameter estmates are unbased and consstent when the model s correctly specfed. However, n the omtted varable case, logt slope coeffcents are based toward zero. For example, suppose the outcome varable, * Y, s determned by a latent varable Y * ff Y 0 such that Y = >. Suppose also that Y * s 0 otherwse determned by the equaton = α + β + β +. If we omt the factor X from the (logt) * Y X X u regresson model, we estmate * o Y α β X u = + +. Yatchew and Grlches (985) show that β β = < β o βσ X + σ u () As the denomnator n () s strctly greater than zero, the estmator s based toward zero. More mportantly, the based estmator has the same sgn as the unbased estmator. Ths volates all four of condtons n (6), any one of whch would valdate the ER procedure. However, whle estmates of determnstc parameters are based downward, estmates of nondetermnstc parameters are randomly based. For example, f Y * s determned by the equaton * Y α βx βx u = and we estmate * o o Y α β X β3 X3 u = + + +, we obtan β β = < β o βσ X + σ u () β o 3 β 3 = = βσ X + σ u 0 (3) Because X 3 s extraneous, β 3 s zero and so repeated estmates of ˆ β o 3 wll be randomly dstrbuted around zero. In short, parameter estmates for determnstc factors wll be: () unbased n the 33
35 correctly specfed case, () unbased n the extraneous varable case, and (3) based toward zero n the omtted varable case. But, parameter estmates for spurous factors wll be unbased toward zero n all three cases. Hence, we can expect the crossmodel chsquare statstcs to work n the case of logt models, though lkely n an asymptotc sense. One drawback of the ER (and EER) procedure s that the procedure may select a set of factors that, whle ndvdually passng the crossmodel chsquare test, are not statstcally sgnfcant when run together n a regresson model. One possble explanaton s that, wthn the confnes of a sngle regresson model, the error varance s large enough to drown out the explanatory power of a factor but, because the crossmodel chsquare statstc s based on crossmodel nformaton that has estmated and fltered out the error term, the factor appears sgnfcant. Ths s analogous to the gan n nformaton obtaned from employng panel data versus tme seres data (cf., Daves, 006). A tme seres data set may measure the same phenomenon over the same tme perod as a panel data set, but because the panel data set also measures the phenomenon over multple crosssectons, nformaton across crosssectons can be used to mtgate the error varance wthn a gven tme perod. 9. Concluson The purpose of ths analyss s to buld on ASR by proposng a new procedure, ER, that combnes ASR wth a crossmodel chsquare statstc that attempts to dstngush determnstc factors from spurous factors. Recognzng that, for large data sets, even ER s nfeasble even wth the applcaton of super computaton, ths analyss further proposes an adaptaton on ER, EER, whch samples the space of possble regresson models. 34
36 Ths analyss () proves that, under condtons less strngent than the classcal lnear model condtons, the crossmodel chsquare statstc s a reasonable measure for dstngushng between determnstc and spurous factors, () demonstrates va MonteCarlo studes the lkelhood of ER produce false postve and false negatve results under condtons of varyng numbers of factors n the data set, varyng varance of the regresson error, and varyng choce of crtcal value, (3) proposes a procedure, EER, that estmates ER results va samplng a subset of the space of possble regresson models, (4) demonstrates va MonteCarlo studes the lkelhood of EER producng false postve and false negatve results under condtons of varyng sample sze, and varyng number of factors n the determnstc equaton, (5) compares EER model selecton wth backward stepwse and the kfold crteron by va MonteCarlo studes that apply the same data sets to all three procedures, and (6) demonstrates that EER avods false postves and false negatves better than the kfold crteron, avods false postves approxmately as well as stepwse, and avods false negatves sgnfcantly better than stepwse. 35
37 0. References Carmnes, E.G., and J.P. McIver, 98. Analyzng models wth unobserved varables: Analyss of covarance structures, n Bohmstedt. In G.W. and E.F. Borgatta, eds., Socal Measurement. Sage Publcatons: Thousand Oaks, CA. pp Daves, A., 006. A framework for decomposng shocks and measurng volatltes derved from multdmensonal panel data of survey forecasts. Internatonal Journal of Forecastng, (): Klne, R.B., 998. Prncples and practce of structural equaton modelng. Gulford Press: New York. Kuk, A.C., 984. All subsets regresson n proportonal hazard models. Bometrka, 7(3): Ullman, J.B., 00. Structural equaton modelng. In Tabachnck, B.G. and L.S. Fdell, eds., Usng Multvarate Statstcs. Allyn and Bacon: Needham Heghts, MA. pp Yatchew, A. and Z. Grlches, 985. Specfcaton error n probt models. The Revew of Economcs and Statstcs, 67:
THE TITANIC SHIPWRECK: WHO WAS
THE TITANIC SHIPWRECK: WHO WAS MOST LIKELY TO SURVIVE? A STATISTICAL ANALYSIS Ths paper examnes the probablty of survvng the Ttanc shpwreck usng lmted dependent varable regresson analyss. Ths appled analyss
More informationCausal, Explanatory Forecasting. Analysis. Regression Analysis. Simple Linear Regression. Which is Independent? Forecasting
Causal, Explanatory Forecastng Assumes causeandeffect relatonshp between system nputs and ts output Forecastng wth Regresson Analyss Rchard S. Barr Inputs System Cause + Effect Relatonshp The job of
More informationPSYCHOLOGICAL RESEARCH (PYC 304C) Lecture 12
14 The Chsquared dstrbuton PSYCHOLOGICAL RESEARCH (PYC 304C) Lecture 1 If a normal varable X, havng mean µ and varance σ, s standardsed, the new varable Z has a mean 0 and varance 1. When ths standardsed
More informationWhat is Candidate Sampling
What s Canddate Samplng Say we have a multclass or mult label problem where each tranng example ( x, T ) conssts of a context x a small (mult)set of target classes T out of a large unverse L of possble
More informationQuestions that we may have about the variables
Antono Olmos, 01 Multple Regresson Problem: we want to determne the effect of Desre for control, Famly support, Number of frends, and Score on the BDI test on Perceved Support of Latno women. Dependent
More informationThe Analysis of Outliers in Statistical Data
THALES Project No. xxxx The Analyss of Outlers n Statstcal Data Research Team Chrysses Caron, Assocate Professor (P.I.) Vaslk Karot, Doctoral canddate Polychrons Economou, Chrstna Perrakou, Postgraduate
More information9.1 The Cumulative Sum Control Chart
Learnng Objectves 9.1 The Cumulatve Sum Control Chart 9.1.1 Basc Prncples: Cusum Control Chart for Montorng the Process Mean If s the target for the process mean, then the cumulatve sum control chart s
More informationb) The mean of the fitted (predicted) values of Y is equal to the mean of the Y values: c) The residuals of the regression line sum up to zero: = ei
Mathematcal Propertes of the Least Squares Regresson The least squares regresson lne obeys certan mathematcal propertes whch are useful to know n practce. The followng propertes can be establshed algebracally:
More informationThe covariance is the two variable analog to the variance. The formula for the covariance between two variables is
Regresson Lectures So far we have talked only about statstcs that descrbe one varable. What we are gong to be dscussng for much of the remander of the course s relatonshps between two or more varables.
More informationTHE METHOD OF LEAST SQUARES THE METHOD OF LEAST SQUARES
The goal: to measure (determne) an unknown quantty x (the value of a RV X) Realsaton: n results: y 1, y 2,..., y j,..., y n, (the measured values of Y 1, Y 2,..., Y j,..., Y n ) every result s encumbered
More informationAn Alternative Way to Measure Private Equity Performance
An Alternatve Way to Measure Prvate Equty Performance Peter Todd Parlux Investment Technology LLC Summary Internal Rate of Return (IRR) s probably the most common way to measure the performance of prvate
More informationCan Auto Liability Insurance Purchases Signal Risk Attitude?
Internatonal Journal of Busness and Economcs, 2011, Vol. 10, No. 2, 159164 Can Auto Lablty Insurance Purchases Sgnal Rsk Atttude? ChuShu L Department of Internatonal Busness, Asa Unversty, Tawan ShengChang
More informationbenefit is 2, paid if the policyholder dies within the year, and probability of death within the year is ).
REVIEW OF RISK MANAGEMENT CONCEPTS LOSS DISTRIBUTIONS AND INSURANCE Loss and nsurance: When someone s subject to the rsk of ncurrng a fnancal loss, the loss s generally modeled usng a random varable or
More informationCHAPTER 14 MORE ABOUT REGRESSION
CHAPTER 14 MORE ABOUT REGRESSION We learned n Chapter 5 that often a straght lne descrbes the pattern of a relatonshp between two quanttatve varables. For nstance, n Example 5.1 we explored the relatonshp
More informationHYPOTHESIS TESTING OF PARAMETERS FOR ORDINARY LINEAR CIRCULAR REGRESSION
HYPOTHESIS TESTING OF PARAMETERS FOR ORDINARY LINEAR CIRCULAR REGRESSION Abdul Ghapor Hussn Centre for Foundaton Studes n Scence Unversty of Malaya 563 KUALA LUMPUR Emal: ghapor@umedumy Abstract Ths paper
More informationCHOLESTEROL REFERENCE METHOD LABORATORY NETWORK. Sample Stability Protocol
CHOLESTEROL REFERENCE METHOD LABORATORY NETWORK Sample Stablty Protocol Background The Cholesterol Reference Method Laboratory Network (CRMLN) developed certfcaton protocols for total cholesterol, HDL
More informationQuality Adjustment of Secondhand Motor Vehicle Application of Hedonic Approach in Hong Kong s Consumer Price Index
Qualty Adustment of Secondhand Motor Vehcle Applcaton of Hedonc Approach n Hong Kong s Consumer Prce Index Prepared for the 14 th Meetng of the Ottawa Group on Prce Indces 20 22 May 2015, Tokyo, Japan
More informationIntroduction to Regression
Introducton to Regresson Regresson a means of predctng a dependent varable based one or more ndependent varables. Ths s done by fttng a lne or surface to the data ponts that mnmzes the total error. 
More informationBinary Dependent Variables. In some cases the outcome of interest rather than one of the right hand side variables is discrete rather than continuous
Bnary Dependent Varables In some cases the outcome of nterest rather than one of the rght hand sde varables s dscrete rather than contnuous The smplest example of ths s when the Y varable s bnary so that
More informationx f(x) 1 0.25 1 0.75 x 1 0 1 1 0.04 0.01 0.20 1 0.12 0.03 0.60
BIVARIATE DISTRIBUTIONS Let be a varable that assumes the values { 1,,..., n }. Then, a functon that epresses the relatve frequenc of these values s called a unvarate frequenc functon. It must be true
More informationNPAR TESTS. OneSample ChiSquare Test. Cell Specification. Observed Frequencies 1O i 6. Expected Frequencies 1EXP i 6
PAR TESTS If a WEIGHT varable s specfed, t s used to replcate a case as many tmes as ndcated by the weght value rounded to the nearest nteger. If the workspace requrements are exceeded and samplng has
More informationInequality and The Accounting Period. Quentin Wodon and Shlomo Yitzhaki. World Bank and Hebrew University. September 2001.
Inequalty and The Accountng Perod Quentn Wodon and Shlomo Ytzha World Ban and Hebrew Unversty September Abstract Income nequalty typcally declnes wth the length of tme taen nto account for measurement.
More information1. Measuring association using correlation and regression
How to measure assocaton I: Correlaton. 1. Measurng assocaton usng correlaton and regresson We often would lke to know how one varable, such as a mother's weght, s related to another varable, such as a
More informationI. SCOPE, APPLICABILITY AND PARAMETERS Scope
D Executve Board Annex 9 Page A/R ethodologcal Tool alculaton of the number of sample plots for measurements wthn A/R D project actvtes (Verson 0) I. SOPE, PIABIITY AD PARAETERS Scope. Ths tool s applcable
More informationThe Development of Web Log Mining Based on ImproveKMeans Clustering Analysis
The Development of Web Log Mnng Based on ImproveKMeans Clusterng Analyss TngZhong Wang * College of Informaton Technology, Luoyang Normal Unversty, Luoyang, 471022, Chna wangtngzhong2@sna.cn Abstract.
More informationMultivariate EWMA Control Chart
Multvarate EWMA Control Chart Summary The Multvarate EWMA Control Chart procedure creates control charts for two or more numerc varables. Examnng the varables n a multvarate sense s extremely mportant
More informationStatistical Methods to Develop Rating Models
Statstcal Methods to Develop Ratng Models [Evelyn Hayden and Danel Porath, Österrechsche Natonalbank and Unversty of Appled Scences at Manz] Source: The Basel II Rsk Parameters Estmaton, Valdaton, and
More informationLuby s Alg. for Maximal Independent Sets using Pairwise Independence
Lecture Notes for Randomzed Algorthms Luby s Alg. for Maxmal Independent Sets usng Parwse Independence Last Updated by Erc Vgoda on February, 006 8. Maxmal Independent Sets For a graph G = (V, E), an ndependent
More informationForecasting the Direction and Strength of Stock Market Movement
Forecastng the Drecton and Strength of Stock Market Movement Jngwe Chen Mng Chen Nan Ye cjngwe@stanford.edu mchen5@stanford.edu nanye@stanford.edu Abstract  Stock market s one of the most complcated systems
More informationSIMPLE LINEAR CORRELATION
SIMPLE LINEAR CORRELATION Smple lnear correlaton s a measure of the degree to whch two varables vary together, or a measure of the ntensty of the assocaton between two varables. Correlaton often s abused.
More informationRegression Models for a Binary Response Using EXCEL and JMP
SEMATECH 997 Statstcal Methods Symposum Austn Regresson Models for a Bnary Response Usng EXCEL and JMP Davd C. Trndade, Ph.D. STATTECH Consultng and Tranng n Appled Statstcs San Jose, CA Topcs Practcal
More informationThe Analysis of Covariance. ERSH 8310 Keppel and Wickens Chapter 15
The Analyss of Covarance ERSH 830 Keppel and Wckens Chapter 5 Today s Class Intal Consderatons Covarance and Lnear Regresson The Lnear Regresson Equaton TheAnalyss of Covarance Assumptons Underlyng the
More informationL10: Linear discriminants analysis
L0: Lnear dscrmnants analyss Lnear dscrmnant analyss, two classes Lnear dscrmnant analyss, C classes LDA vs. PCA Lmtatons of LDA Varants of LDA Other dmensonalty reducton methods CSCE 666 Pattern Analyss
More informationStudy on CET4 Marks in China s Graded English Teaching
Study on CET4 Marks n Chna s Graded Englsh Teachng CHE We College of Foregn Studes, Shandong Insttute of Busness and Technology, P.R.Chna, 264005 Abstract: Ths paper deploys Logt model, and decomposes
More informationLatent Class Regression. Statistics for Psychosocial Research II: Structural Models December 4 and 6, 2006
Latent Class Regresson Statstcs for Psychosocal Research II: Structural Models December 4 and 6, 2006 Latent Class Regresson (LCR) What s t and when do we use t? Recall the standard latent class model
More informationTime Series Analysis in Studies of AGN Variability. Bradley M. Peterson The Ohio State University
Tme Seres Analyss n Studes of AGN Varablty Bradley M. Peterson The Oho State Unversty 1 Lnear Correlaton Degree to whch two parameters are lnearly correlated can be expressed n terms of the lnear correlaton
More informationThe Probit Model. Alexander Spermann. SoSe 2009
The Probt Model Aleander Spermann Unversty of Freburg SoSe 009 Course outlne. Notaton and statstcal foundatons. Introducton to the Probt model 3. Applcaton 4. Coeffcents and margnal effects 5. Goodnessofft
More informationRecurrence. 1 Definitions and main statements
Recurrence 1 Defntons and man statements Let X n, n = 0, 1, 2,... be a MC wth the state space S = (1, 2,...), transton probabltes p j = P {X n+1 = j X n = }, and the transton matrx P = (p j ),j S def.
More informationCS 2750 Machine Learning. Lecture 3. Density estimation. CS 2750 Machine Learning. Announcements
Lecture 3 Densty estmaton Mlos Hauskrecht mlos@cs.ptt.edu 5329 Sennott Square Next lecture: Matlab tutoral Announcements Rules for attendng the class: Regstered for credt Regstered for audt (only f there
More informationEvaluating credit risk models: A critique and a new proposal
Evaluatng credt rsk models: A crtque and a new proposal Hergen Frerchs* Gunter Löffler Unversty of Frankfurt (Man) February 14, 2001 Abstract Evaluatng the qualty of credt portfolo rsk models s an mportant
More informationLinear Regression, Regularization BiasVariance Tradeoff
HTF: Ch3, 7 B: Ch3 Lnear Regresson, Regularzaton BasVarance Tradeoff Thanks to C Guestrn, T Detterch, R Parr, N Ray 1 Outlne Lnear Regresson MLE = Least Squares! Bass functons Evaluatng Predctors Tranng
More informationSolution of Algebraic and Transcendental Equations
CHAPTER Soluton of Algerac and Transcendental Equatons. INTRODUCTION One of the most common prolem encountered n engneerng analyss s that gven a functon f (, fnd the values of for whch f ( = 0. The soluton
More informationH 1 : at least one is not zero
Chapter 6 More Multple Regresson Model The Ftest Jont Hypothess Tests Consder the lnear regresson equaton: () y = β + βx + βx + β4x4 + e for =,,..., N The tstatstc gve a test of sgnfcance of an ndvdual
More informationPortfolio Loss Distribution
Portfolo Loss Dstrbuton Rsky assets n loan ortfolo hghly llqud assets holdtomaturty n the bank s balance sheet Outstandngs The orton of the bank asset that has already been extended to borrowers. Commtment
More informationAnalysis of Premium Liabilities for Australian Lines of Business
Summary of Analyss of Premum Labltes for Australan Lnes of Busness Emly Tao Honours Research Paper, The Unversty of Melbourne Emly Tao Acknowledgements I am grateful to the Australan Prudental Regulaton
More informationLogistic Regression. Lecture 4: More classifiers and classes. Logistic regression. Adaboost. Optimization. Multiple class classification
Lecture 4: More classfers and classes C4B Machne Learnng Hlary 20 A. Zsserman Logstc regresson Loss functons revsted Adaboost Loss functons revsted Optmzaton Multple class classfcaton Logstc Regresson
More informationCHAPTER 7 THE TWOVARIABLE REGRESSION MODEL: HYPOTHESIS TESTING
CHAPTER 7 THE TWOVARIABLE REGRESSION MODEL: HYPOTHESIS TESTING QUESTIONS 7.1. (a) In the regresson contet, the method of least squares estmates the regresson parameters n such a way that the sum of the
More informationCHAPTER 5 RELATIONSHIPS BETWEEN QUANTITATIVE VARIABLES
CHAPTER 5 RELATIONSHIPS BETWEEN QUANTITATIVE VARIABLES In ths chapter, we wll learn how to descrbe the relatonshp between two quanttatve varables. Remember (from Chapter 2) that the terms quanttatve varable
More information+ + +   This circuit than can be reduced to a planar circuit
MeshCurrent Method The meshcurrent s analog of the nodeoltage method. We sole for a new set of arables, mesh currents, that automatcally satsfy KCLs. As such, meshcurrent method reduces crcut soluton to
More informationMoment of a force about a point and about an axis
3. STATICS O RIGID BODIES In the precedng chapter t was assumed that each of the bodes consdered could be treated as a sngle partcle. Such a vew, however, s not always possble, and a body, n general, should
More informationIDENTIFICATION AND CORRECTION OF A COMMON ERROR IN GENERAL ANNUITY CALCULATIONS
IDENTIFICATION AND CORRECTION OF A COMMON ERROR IN GENERAL ANNUITY CALCULATIONS Chrs Deeley* Last revsed: September 22, 200 * Chrs Deeley s a Senor Lecturer n the School of Accountng, Charles Sturt Unversty,
More informationRiskbased Fatigue Estimate of Deep Water Risers  Course Project for EM388F: Fracture Mechanics, Spring 2008
Rskbased Fatgue Estmate of Deep Water Rsers  Course Project for EM388F: Fracture Mechancs, Sprng 2008 Chen Sh Department of Cvl, Archtectural, and Envronmental Engneerng The Unversty of Texas at Austn
More informationLogistic Regression. Steve Kroon
Logstc Regresson Steve Kroon Course notes sectons: 24.324.4 Dsclamer: these notes do not explctly ndcate whether values are vectors or scalars, but expects the reader to dscern ths from the context. Scenaro
More informationEconomic Interpretation of Regression. Theory and Applications
Economc Interpretaton of Regresson Theor and Applcatons Classcal and Baesan Econometrc Methods Applcaton of mathematcal statstcs to economc data for emprcal support Economc theor postulates a qualtatve
More informationMAPP. MERIS level 3 cloud and water vapour products. Issue: 1. Revision: 0. Date: 9.12.1998. Function Name Organisation Signature Date
Ttel: Project: Doc. No.: MERIS level 3 cloud and water vapour products MAPP MAPPATBDClWVL3 Issue: 1 Revson: 0 Date: 9.12.1998 Functon Name Organsaton Sgnature Date Author: Bennartz FUB Preusker FUB Schüller
More informationErrorPropagation.nb 1. Error Propagation
ErrorPropagaton.nb Error Propagaton Suppose that we make observatons of a quantty x that s subject to random fluctuatons or measurement errors. Our best estmate of the true value for ths quantty s then
More information2.4 Bivariate distributions
page 28 2.4 Bvarate dstrbutons 2.4.1 Defntons Let X and Y be dscrete r.v.s defned on the same probablty space (S, F, P). Instead of treatng them separately, t s often necessary to thnk of them actng together
More informationThe OC Curve of Attribute Acceptance Plans
The OC Curve of Attrbute Acceptance Plans The Operatng Characterstc (OC) curve descrbes the probablty of acceptng a lot as a functon of the lot s qualty. Fgure 1 shows a typcal OC Curve. 10 8 6 4 1 3 4
More information8.5 UNITARY AND HERMITIAN MATRICES. The conjugate transpose of a complex matrix A, denoted by A*, is given by
6 CHAPTER 8 COMPLEX VECTOR SPACES 5. Fnd the kernel of the lnear transformaton gven n Exercse 5. In Exercses 55 and 56, fnd the mage of v, for the ndcated composton, where and are gven by the followng
More informationCS 2750 Machine Learning. Lecture 17a. Clustering. CS 2750 Machine Learning. Clustering
Lecture 7a Clusterng Mlos Hauskrecht mlos@cs.ptt.edu 539 Sennott Square Clusterng Groups together smlar nstances n the data sample Basc clusterng problem: dstrbute data nto k dfferent groups such that
More informationChapter 7. RandomVariate Generation 7.1. Prof. Dr. Mesut Güneş Ch. 7 RandomVariate Generation
Chapter 7 RandomVarate Generaton 7. Contents Inversetransform Technque AcceptanceRejecton Technque Specal Propertes 7. Purpose & Overvew Develop understandng of generatng samples from a specfed dstrbuton
More information1 Approximation Algorithms
CME 305: Dscrete Mathematcs and Algorthms 1 Approxmaton Algorthms In lght of the apparent ntractablty of the problems we beleve not to le n P, t makes sense to pursue deas other than complete solutons
More informationSPEE Recommended Evaluation Practice #6 Definition of Decline Curve Parameters Background:
SPEE Recommended Evaluaton Practce #6 efnton of eclne Curve Parameters Background: The producton hstores of ol and gas wells can be analyzed to estmate reserves and future ol and gas producton rates and
More informationStaff Paper. Farm Savings Accounts: Examining Income Variability, Eligibility, and Benefits. Brent Gloy, Eddy LaDue, and Charles Cuykendall
SP 200502 August 2005 Staff Paper Department of Appled Economcs and Management Cornell Unversty, Ithaca, New York 148537801 USA Farm Savngs Accounts: Examnng Income Varablty, Elgblty, and Benefts Brent
More informationSTATISTICAL DATA ANALYSIS IN EXCEL
Mcroarray Center STATISTICAL DATA ANALYSIS IN EXCEL Lecture 6 Some Advanced Topcs Dr. Petr Nazarov 1401013 petr.nazarov@crpsante.lu Statstcal data analyss n Ecel. 6. Some advanced topcs Correcton for
More informationA DYNAMIC CRASHING METHOD FOR PROJECT MANAGEMENT USING SIMULATIONBASED OPTIMIZATION. Michael E. Kuhl Radhamés A. TolentinoPeña
Proceedngs of the 2008 Wnter Smulaton Conference S. J. Mason, R. R. Hll, L. Mönch, O. Rose, T. Jefferson, J. W. Fowler eds. A DYNAMIC CRASHING METHOD FOR PROJECT MANAGEMENT USING SIMULATIONBASED OPTIMIZATION
More informationCapital asset pricing model, arbitrage pricing theory and portfolio management
Captal asset prcng model, arbtrage prcng theory and portfolo management Vnod Kothar The captal asset prcng model (CAPM) s great n terms of ts understandng of rsk decomposton of rsk nto securtyspecfc rsk
More informationCommunication Networks II Contents
8 / 1  Communcaton Networs II (Görg)  www.comnets.unbremen.de Communcaton Networs II Contents 1 Fundamentals of probablty theory 2 Traffc n communcaton networs 3 Stochastc & Marovan Processes (SP
More informationU.C. Berkeley CS270: Algorithms Lecture 4 Professor Vazirani and Professor Rao Jan 27,2011 Lecturer: Umesh Vazirani Last revised February 10, 2012
U.C. Berkeley CS270: Algorthms Lecture 4 Professor Vazran and Professor Rao Jan 27,2011 Lecturer: Umesh Vazran Last revsed February 10, 2012 Lecture 4 1 The multplcatve weghts update method The multplcatve
More informationPRACTICE 1: MUTUAL FUNDS EVALUATION USING MATLAB.
PRACTICE 1: MUTUAL FUNDS EVALUATION USING MATLAB. INDEX 1. Load data usng the Edtor wndow and mfle 2. Learnng to save results from the Edtor wndow. 3. Computng the Sharpe Rato 4. Obtanng the Treynor Rato
More informationDescribing Communities. Species Diversity Concepts. Species Richness. Species Richness. SpeciesArea Curve. SpeciesArea Curve
peces versty Concepts peces Rchness pecesarea Curves versty Indces  mpson's Index  hannonwener Index  rlloun Index peces Abundance Models escrbng Communtes There are two mportant descrptors of a communty:
More informationTHE DISTRIBUTION OF LOAN PORTFOLIO VALUE * Oldrich Alfons Vasicek
HE DISRIBUION OF LOAN PORFOLIO VALUE * Oldrch Alfons Vascek he amount of captal necessary to support a portfolo of debt securtes depends on the probablty dstrbuton of the portfolo loss. Consder a portfolo
More informationII. PROBABILITY OF AN EVENT
II. PROBABILITY OF AN EVENT As ndcated above, probablty s a quantfcaton, or a mathematcal model, of a random experment. Ths quantfcaton s a measure of the lkelhood that a gven event wll occur when the
More informationA Probabilistic Theory of Coherence
A Probablstc Theory of Coherence BRANDEN FITELSON. The Coherence Measure C Let E be a set of n propostons E,..., E n. We seek a probablstc measure C(E) of the degree of coherence of E. Intutvely, we want
More informationPROFIT RATIO AND MARKET STRUCTURE
POFIT ATIO AND MAKET STUCTUE By Yong Yun Introducton: Industral economsts followng from Mason and Ban have run nnumerable tests of the relaton between varous market structural varables and varous dmensons
More informationHow Sets of Coherent Probabilities May Serve as Models for Degrees of Incoherence
1 st Internatonal Symposum on Imprecse Probabltes and Ther Applcatons, Ghent, Belgum, 29 June 2 July 1999 How Sets of Coherent Probabltes May Serve as Models for Degrees of Incoherence Mar J. Schervsh
More informationDiagnostic Tests of Cross Section Independence for Nonlinear Panel Data Models
DISCUSSION PAPER SERIES IZA DP No. 2756 Dagnostc ests of Cross Secton Independence for Nonlnear Panel Data Models Cheng Hsao M. Hashem Pesaran Andreas Pck Aprl 2007 Forschungsnsttut zur Zukunft der Arbet
More informationAn Evaluation of the Extended Logistic, Simple Logistic, and Gompertz Models for Forecasting Short Lifecycle Products and Services
An Evaluaton of the Extended Logstc, Smple Logstc, and Gompertz Models for Forecastng Short Lfecycle Products and Servces Charles V. Trappey a,1, Hsnyng Wu b a Professor (Management Scence), Natonal Chao
More informationBinomial Link Functions. Lori Murray, Phil Munz
Bnomal Lnk Functons Lor Murray, Phl Munz Bnomal Lnk Functons Logt Lnk functon: ( p) p ln 1 p Probt Lnk functon: ( p) 1 ( p) Complentary Log Log functon: ( p) ln( ln(1 p)) Motvatng Example A researcher
More informationPRIVATE SCHOOL CHOICE: THE EFFECTS OF RELIGIOUS AFFILIATION AND PARTICIPATION
PRIVATE SCHOOL CHOICE: THE EFFECTS OF RELIIOUS AFFILIATION AND PARTICIPATION Danny CohenZada Department of Economcs, Benuron Unversty, BeerSheva 84105, Israel Wllam Sander Department of Economcs, DePaul
More informationAn InterestOriented Network Evolution Mechanism for Online Communities
An InterestOrented Network Evoluton Mechansm for Onlne Communtes Cahong Sun and Xaopng Yang School of Informaton, Renmn Unversty of Chna, Bejng 100872, P.R. Chna {chsun,yang}@ruc.edu.cn Abstract. Onlne
More informationThe Greedy Method. Introduction. 0/1 Knapsack Problem
The Greedy Method Introducton We have completed data structures. We now are gong to look at algorthm desgn methods. Often we are lookng at optmzaton problems whose performance s exponental. For an optmzaton
More informationLecture 10: Linear Regression Approach, Assumptions and Diagnostics
Approach to Modelng I Lecture 1: Lnear Regresson Approach, Assumptons and Dagnostcs Sandy Eckel seckel@jhsph.edu 8 May 8 General approach for most statstcal modelng: Defne the populaton of nterest State
More informationMultiplePeriod Attribution: Residuals and Compounding
MultplePerod Attrbuton: Resduals and Compoundng Our revewer gave these authors full marks for dealng wth an ssue that performance measurers and vendors often regard as propretary nformaton. In 1994, Dens
More informationGRAVITY DATA VALIDATION AND OUTLIER DETECTION USING L 1 NORM
GRAVITY DATA VALIDATION AND OUTLIER DETECTION USING L 1 NORM BARRIOT JeanPerre, SARRAILH Mchel BGI/CNES 18.av.E.Beln 31401 TOULOUSE Cedex 4 (France) Emal: jeanperre.barrot@cnes.fr 1/Introducton The
More informationAn Empirical Study of Search Engine Advertising Effectiveness
An Emprcal Study of Search Engne Advertsng Effectveness Sanjog Msra, Smon School of Busness Unversty of Rochester Edeal Pnker, Smon School of Busness Unversty of Rochester Alan RmmKaufman, RmmKaufman
More informationHOUSEHOLDS DEBT BURDEN: AN ANALYSIS BASED ON MICROECONOMIC DATA*
HOUSEHOLDS DEBT BURDEN: AN ANALYSIS BASED ON MICROECONOMIC DATA* Luísa Farnha** 1. INTRODUCTION The rapd growth n Portuguese households ndebtedness n the past few years ncreased the concerns that debt
More informationProceedings of the Annual Meeting of the American Statistical Association, August 59, 2001
Proceedngs of the Annual Meetng of the Amercan Statstcal Assocaton, August 59, 2001 LISTASSISTED SAMPLING: THE EFFECT OF TELEPHONE SYSTEM CHANGES ON DESIGN 1 Clyde Tucker, Bureau of Labor Statstcs James
More informationInstitute of Informatics, Faculty of Business and Management, Brno University of Technology,Czech Republic
Lagrange Multplers as Quanttatve Indcators n Economcs Ivan Mezník Insttute of Informatcs, Faculty of Busness and Management, Brno Unversty of TechnologCzech Republc Abstract The quanttatve role of Lagrange
More informationAdaptive Clinical Trials Incorporating Treatment Selection and Evaluation: Methodology and Applications in Multiple Sclerosis
Adaptve Clncal Trals Incorporatng Treatment electon and Evaluaton: Methodology and Applcatons n Multple cleross usan Todd, Tm Frede, Ngel tallard, Ncholas Parsons, Elsa ValdésMárquez, Jeremy Chataway
More informationBERNSTEIN POLYNOMIALS
OnLne Geometrc Modelng Notes BERNSTEIN POLYNOMIALS Kenneth I. Joy Vsualzaton and Graphcs Research Group Department of Computer Scence Unversty of Calforna, Davs Overvew Polynomals are ncredbly useful
More informationAn Analysis of Factors Influencing the SelfRated Health of Elderly Chinese People
Open Journal of Socal Scences, 205, 3, 520 Publshed Onlne May 205 n ScRes. http://www.scrp.org/ournal/ss http://dx.do.org/0.4236/ss.205.35003 An Analyss of Factors Influencng the SelfRated Health of
More informationv a 1 b 1 i, a 2 b 2 i,..., a n b n i.
SECTION 8.4 COMPLEX VECTOR SPACES AND INNER PRODUCTS 455 8.4 COMPLEX VECTOR SPACES AND INNER PRODUCTS All the vector spaces we have studed thus far n the text are real vector spaces snce the scalars are
More informationStatistical algorithms in Review Manager 5
Statstcal algorthms n Reve Manager 5 Jonathan J Deeks and Julan PT Hggns on behalf of the Statstcal Methods Group of The Cochrane Collaboraton August 00 Data structure Consder a metaanalyss of k studes
More informationNonlinear data mapping by neural networks
Nonlnear data mappng by neural networks R.P.W. Dun Delft Unversty of Technology, Netherlands Abstract A revew s gven of the use of neural networks for nonlnear mappng of hgh dmensonal data on lower dmensonal
More informationModule 2 LOSSLESS IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur
Module LOSSLESS IMAGE COMPRESSION SYSTEMS Lesson 3 Lossless Compresson: Huffman Codng Instructonal Objectves At the end of ths lesson, the students should be able to:. Defne and measure source entropy..
More informationSolution: Let i = 10% and d = 5%. By definition, the respective forces of interest on funds A and B are. i 1 + it. S A (t) = d (1 dt) 2 1. = d 1 dt.
Chapter 9 Revew problems 9.1 Interest rate measurement Example 9.1. Fund A accumulates at a smple nterest rate of 10%. Fund B accumulates at a smple dscount rate of 5%. Fnd the pont n tme at whch the forces
More informationECONOMICS OF PLANT ENERGY SAVINGS PROJECTS IN A CHANGING MARKET Douglas C White Emerson Process Management
ECONOMICS OF PLANT ENERGY SAVINGS PROJECTS IN A CHANGING MARKET Douglas C Whte Emerson Process Management Abstract Energy prces have exhbted sgnfcant volatlty n recent years. For example, natural gas prces
More informationAn empirical study for credit card approvals in the Greek banking sector
An emprcal study for credt card approvals n the Greek bankng sector Mara Mavr George Ioannou Bergamo, Italy 1721 May 2004 Management Scences Laboratory Department of Management Scence & Technology Athens
More informationChapter 14 Simple Linear Regression
Sldes Prepared JOHN S. LOUCKS St. Edward s Unverst Slde Chapter 4 Smple Lnear Regresson Smple Lnear Regresson Model Least Squares Method Coeffcent of Determnaton Model Assumptons Testng for Sgnfcance Usng
More information