A Two-Step Self-Evaluation Algorithm On Imputation Approaches For Missing Categorical Data

Save this PDF as:
Size: px
Start display at page:

Download "A Two-Step Self-Evaluation Algorithm On Imputation Approaches For Missing Categorical Data"

Transcription

1 A Two-Step Self-Evaluaton Algorthm On Imputaton Approaches For Mssng Categorcal Data Lukun Zheng Faculty of Department of Mathematcs Tennessee Technologcal Unversty Cookevlle, TN 385, Unted States of Amerca Abstract Mssng data are often encountered n data sets and a common problem for researchers n dfferent felds of research. There are many reasons why observatons may have mssng values. For nstance, some respondents may not report some of the tems for some reason. The exstence of mssng data brngs dffcultes to the conduct of statstcal analyses, especally when there s a large fracton of data whch are mssng. Many methods have been developed for dealng wth mssng data, numerc or categorcal. The performances of mputaton methods on mssng data are key n choosng whch mputaton method to use. They are usually evaluated on how the mssng data method performs for nference about target parameters based on a statstcal model. One mportant parameter s the expected mputaton accuracy rate, whch, however, reles heavly on the assumptons of mssng data type and the mputaton methods. For nstance, t may requre that the mssng data s mssng completely at random. The goal of the current study was to develop a two-step algorthm to evaluate the performances of mputaton methods for mssng categorcal data. The evaluaton s based on the re-mputaton accuracy rate (RIAR) ntroduced n the current work. A smulaton study based on real data s conducted to demonstrate how the evaluaton algorthm works. Keywords: Categorcal Varable, Imputaton Methods, Mssng Value, Re-Imputaton Accuracy Rate.. INTRODUCTION AND MOTIVATION Data scentsts are often faced wth ssues of mssng data n dfferent stuatons [, 2]. There are several reasons why observatons may have mssng values. For nstance, one reason may be that some respondents just do not report some of the varable for some reason. Researchers have nvestgated the mpact of mssng data on statstcal analyses [3]. The mssng data may lead to based estmates and nflated standard errors. In addton, the power of statstcal tests may drop dramatcally when there s a large amount of mssng data n the data set. Data often are mssng n research n economcs, socology, and poltcal scence because governments choose not to, or fal to, report crtcal statstcs [4]. Sometmes mssng values are caused by the researcher. For example, when data collecton s done mproperly or mstakes are made n data entry [5]. There are dfferent types of mssng data based on the underlyng formaton process. Dfferent forms of mssng data have dfferent mpacts on the statstcal analyses. There are a lot of open resources about mssng data and nterested readers are encouraged to refer to them for a more thorough dscusson. Values n a data set are mssng completely at random (MCAR) f the events that lead to any data-tem beng mssng are ndependent both of observable varables and of unobservable parameters of nterest, and occur entrely at random [6]. When data are MCAR, the mssng data are a random sample of the observed data and the analyss performed on the data s unbased. Mssng at random (MAR) occurs when the mssngness s not random, but where mssngness can be fully accounted for by varables where there s complete nformaton. For nstance, males are less lkely to enroll n a depresson survey than females but ths has nothng to do wth ther level of depresson. Fnally, Internatonal Journal of Scentfc and Statstcal Computng (IJSSC), Volume (7) : Issue () : 28

2 mssng not at random (MNAR) (also known as nongnorable nonresponse) occurs when the mssngness s related to the value of the varable tself. To extend the prevous example, ths would occur f men faled to enroll n a depresson survey because of ther level of depresson. As another example, respondents mght be asked to ndcate the number of tmes they are pulled over due to speedng durng the prevous year. A mssng response would be MNAR f ndvduals who were pulled over for many tmes due to speedng durng ths perod were more lkely to leave the tem unanswered rather than report ther behavor. Gven the potental problems caused by the presence of mssng data, many methods have been suggested for data mputaton. People may refer to [7] for a comprehensve revew of many of these methods. Many mssng data technques have been suggested and developed. Accordng to [8], these mssng data technques can be roughly grouped nto:. technques gnorng ncomplete observatons, 2. mputaton-based technques, 3. weghtng technques, and 4. model-based technques [2]. In ths paper, we wll gnore weghtng technques and manly focus on the mputaton-based technques and model-based technques. The smple and wdely used technque s the so-called lst wse deleton whch gnore all ncomplete observatons and only focus on complete observatons. Lst wse deleton s easy to carry out and the result may be relable and satsfactory wth small porton of observatons wth mssng data. However, t fals f there s a large porton of observatons wth mssng data, whch mght be very common n hgh-dmensonal case. In addton, lst wse deleton may lead to serous bases f the data are mssng not at random. Imputaton-based technques provde a way to replace mssng values wth sutable estmates, whch results n mputed complete data set. Many methods have been suggested for mputng sutable responses to mssng data. Among these technques are mean or mode substtuton, regresson mputaton, k-nearest neghbor mputaton, Hot Deck mputaton, multple mputaton, etc. In the mputaton-based technques, mssng values are replaced by artfcal values and t may cause seres bases. And ths mputed data set n turn mght lead to based result n the subsequent data analyss. And most of these mputaton methods has been found nadequate n reproducng known populaton parameters and standard errors [7]. Model-based mputaton technques performs parameter estmaton. Two popular model-based mputaton technques are regresson-based and lkelhood-based technques [2, 9]. In regressonbased mputaton, the mssng values are mputed by a regresson of the dependent varable wth mssng values on the observed values of other ndependent varables for a gven data set. In lkelhood-based mputaton, the data are descrbed based on a model and the parameters are estmated by maxmzng lkelhood for a gven data set [2]. The mputaton methods can also be roughly classfed nto two: parametrc mputaton methods and non-parametrc mputaton methods. The parametrc mputaton methods are usually superor f the data set can be modeled adequately by a parametrc model. For nstance, the lnear regresson can be used to conduct the mputaton of mssng quanttatve values and logstc regresson can be used to conduct mputaton of mssng bnary qualtatve values [, ]. Nonparametrc mputaton methods conduct mputaton of mssng data by capturng structures n the data sets and they offer an alternatve f the users have no dea of the actual dstrbuton of the data set. In other words, t s benefcal to use a non-parametrc method n cases that the relatonshp between the response and explanatory varables are unknown. For nstance, the nearest-neghbor (NN) mputaton algorthm s one of the non-parametrc methods used for mputaton of mssng data n sample surveys [2]. Chen and Huang [3] constructed a genetc system to mpute n relatonal database systems. The machne learnng methods also nclude decson tree mputaton, auto assocatve neural network and so forth. Internatonal Journal of Scentfc and Statstcal Computng (IJSSC), Volume (7) : Issue () : 28 2

3 Once the mputaton of mssng data s done, t s crucal to evaluate the performance of the mputaton technques used through determnng the effect of mputaton on subsequent statstcal nferences. All these mputaton methods have pros and cons and they may perform well n one or more stuatons but fals n others. And the comparson among these dfferent mputaton methods have been studed n many stuatons. And these comparsons are manly based on the potental consequences n the subsequent data analyss caused by correspondng mputaton methods. For nstance, [] compares the performance of three mputaton methods based on the estmaton accuracy of logstc regresson parameters, standard errors and hypothess testng results from the mputed data usng rounded multple mputaton for contnuous data(mi), stochastc regresson mputaton (SRI), and multple mputaton for categorcal data, respectvely. [4] studed the performance of three mssng value mputaton technques wth respect to dfferent rate of mssng values n the data set. The purpose of ths paper s to present a new self-evaluaton algorthm of mputaton approaches for mssng categorcal data values. The new evaluaton algorthm can be used to test a wde range of mputaton technques of mssng categorcal data. The performance of several leadng mputaton approaches s measured wth respect to dfferent rate of mssng values n the data set. To ths end, the paper provdes:. a descrpton of several popular and modern mputaton approaches, namely, MI, KNN, C5. decson tree, Nave Bayes, and polytomous regresson mputaton- ordered (POLR). 2. a new self-evaluaton algorthm for mputaton approaches of mssng categorcal data. 3. a wde range of evaluaton of qualty of mputaton wth respect to the rate of mssng data. 4. several smulaton results based on real data. The paper s organzed as follows. Secton 2 provdes a descrpton of the relevant mputaton approaches and our proposed self-evaluaton algorthm. Secton 3 explans detals of the expermental study and presents and analyzes the results. Fnally, Secton 4 summarzes the paper. 2. IMPUTATION APPROACHES AND THE PROPOSED SELF-EVALUATION ALGORITHM In ths secton, we wll ntroduce the mputaton approaches used n ths study and the proposed self-evaluaton algorthm. 2. Imputaton Approaches The mputaton technques used n ths paper nclude mean mputaton (MI), k nearest neghbor mputaton (KNN), C5. decson tree mputaton, Nave Bayes (NB), and polytomous regresson mputaton- ordered (POLR).. Mean Imputaton (MI). In mean mputaton, the mssng values are mputed wth the mean for quanttatve data or the most frequent value (mode) for qualtatve data of the correspondng the varable. Anderson et al [5] states that the sample mean provdes an optmal estmate of the most probable value n the case of normal dstrbuton. The use of MI wll shrnk the sample varance and affect the correlaton between the mputed varable and other varables. Such mpact wll be sgnfcant f there s a hgh percentage of mssng values mputed usng MI and the subsequent statstcal nferences mght be msleadng due to the too much centrally located values created by MI [6]. 2. K Nearest Neghbor Imputaton (KNN). KNN s an effcent hot deck method to mpute the mssng value, n whch the mssng values are mputed based on ts k nearest neghbors n the whole data set accordng to some metrc. Ths method apples to both contnuous varables or dscrete varables. For contnuous varables, the most commonly used dstance metrc to determne the k nearest neghbors s the Eucldean dstance (Mnkowsk norm Internatonal Journal of Scentfc and Statstcal Computng (IJSSC), Volume (7) : Issue () : 28 3

4 / p n p d( x, y) x y wth p = 2). Then the mssng values s mputed as the average or weghted average of these k nearest neghbors. For dscrete varables, such as texts, other types of dstance metrc such as Hammng dstance or overlap metrc can be used to determne the k nearest neghbors. Then the mssng value s mputed as the most frequent value among the k nearest neghbors. Usually k s taken as a small postve nteger. When k =, the KNN method s the smlar response pattern mputaton (SRPI), whch conssts of dentfyng the nearest neghbor (the most smlar observaton) and mputng the mssng value by copyng the value of ths nearest neghbor. An advantage over mean mputaton s that the replacement values are nfluenced only by the most smlar cases rather than by all cases. Several studes have found that the k-nn method performs well or better than other methods n certan contexts [7, 8]. A shortcomng of the KNN mputaton s that t s senstve to the local structure of the data. 3. C5. Decson Tree Imputaton (C5.) [9]. C5. extends the C4.5 classfcaton algorthms descrbed n [2]. It s a well-known machne learnng algorthm whch has a good nternal method to treat mssng values. Hurley (27) dd a comparatve study, wth other smple methods to treat mssng values, and concluded that t was one of the best methods. C5. uses the nformaton gan rato measure to choose a good test on one categorcal varable that has mutually exclusve outcomes O, O2,..., Or for a gven tranng data set T. Then T s parttoned nto T, T2,..., Tr wth T consstng of the observatons n T that wll be classfed as O by the test. The same algorthm s then appled to each subset T untl a stop crteron s encountered. When an nstance n T wth known value s assgned to a subset T, ths ndcates that the probablty of that nstance belongng to subset T s and to all other subsets s. When the value s mssng, C5. assocates to each nstance n T a weght representng the probablty of that nstance belongng to T. Ths probablty s estmated as the sum of the weghts of nstances n T known to satsfy the test wth outcome O, dvded by the sum of weghts of the cases n T wth known values on the varable. 4. Nave-Bayes (NB). Nave Bayes s an also a machne learnng technque [2]. The algorthm works wth dscrete data and requres only one pass through the database to generate a classfcaton model, whch makes t computatonally effcent. Ths algorthm assumes that the feature or attrbute values are condtonally ndependent gven the class of the attrbute, d P( x, x2,..., xd c) P( x c), where x s the th attrbute, c represents the class, and d s the number of attrbutes. The data are dvded nto two parts: ) tranng database that ncludes all records for whch class attrbute s complete and 2) testng database for whch the records are mssng. Imputaton based on the Nave Bayes conssts of two smple steps. Frst, the condtonal probabltes P( x c) and the pror probabltes P(c) are estmated based on the tranng database. Then the estmated probabltes are then used to conduct mputaton of the mssng values of the attrbutes n testng database based on the rule: arg max c P ( c x, x,..., x ) arg max P( c). 2 d c d P( x c). 5. Polytomous Regresson Imputaton- Ordered (POLR). Ths s a regresson mputaton method based on proportonal odds model. The proportonal odds model s estmated frst based on avalable complete or ncomplete observatons and then used to mpute mssng values of a varable based on the ftted values from the model. Internatonal Journal of Scentfc and Statstcal Computng (IJSSC), Volume (7) : Issue () : 28 4

5 Imputaton Approaches Works wth contnuous Works wth dscrete data? data? MI Yes Yes KNN Yes Yes C5. No Yes Naïve-Bayes No Yes POLR Yes Yes TABLE : Summary of Imputaton Approaches. Table summarzed the mputaton approaches used n ths study. 2.2 The Self-Evaluaton Algorthm The evaluaton algorthm proposed n the paper conssts of two steps: mputaton and remputaton. Gven a data set, let Y be a categorcal varable wth mssng values n the data set T and let X X, X,..., ) corresponds all other varables. Assume that there are n ( 2 X d ( x, x2,..., xd ; y observatons o ),,2,..., n n the data set. We may assume that there are no mssng values for other varables n the data set. Splt the data set nto two sets: one set S wth observatons wth no mssng values and the other set S of observatons wth mssng values. Assume that there are n observatons n S and n observatons n S. Therefore, n s n the number of mssng values n the data set. And r s the proporton of mssng values n Y. We select one mputaton technque to fll n the mssng value. To evaluate the performance of the selected mputaton technque, we propose the followng self-evaluaton algorthm. Step (Imputaton): Conduct the mputaton of mssng values n S usng the selected mputaton technque to obtan an mputed complete data set. Step 2 (Re-mputaton): From the mputed complete data set, delete certan number g of values of Y from the data set S S usng approprate method to obtan a new ncomplete data set. Then re-mpute the mssng values usng the same mputaton method n Step. Here the method to deleted values of Y must be chosen to keep the type of mssng values the same as those n the orgnal data set. For nstance, f the mssng value n the orgnal data set s mssng completely at random, then we must delete the values of Y n S completely at random. n Internatonal Journal of Scentfc and Statstcal Computng (IJSSC), Volume (7) : Issue () : 28 5

6 The man procedures n the self-evaluaton algorthm are shown n Fgure. FIGURE : Self-Evaluaton Algorthm (? denotes mssng values). In the re-mputaton step, we have the real values for those mssng snce they are made mssng from the data set S. Therefore, the mputaton accuracy can be obtaned. In fact, let c c c, 2,... m be the possble values of Y. For each observaton c be the real value of Y n the observaton j j o j, j,..., g wth value of Y deleted n o and let c j be the mputed value after the remputaton step. Therefore, the number of correctly mputed mssng values s S, let T g j I( c j c j ) () where I( c c, f c j c j ), Otherwse s the ndcator functon. The re-mputaton accuracy rate (RIAR) s j j T RIAR g (2) where g s the total number of observatons n S wth correspondng Y values deleted and later re-mputed. We may repeat these two steps for K tmes ndependently, and, as a result, we obtan correspondngly K assocated mputaton accuracy rates RIAR, k,..., K. Then the average Imputaton accuracy rate s obtaned as: RIAR K k RIAR Internatonal Journal of Scentfc and Statstcal Computng (IJSSC), Volume (7) : Issue () : 28 6 K k K k k T gk k (3)

7 where T k s the number of correctly mputed mssng values n the kth repetton. 2.3 Estmaton and Comparson Due to the law of large numbers, the average mputaton accuracy rate tends to be the expected mputaton accuracy rate under certan assumptons. In other words, as we do suffcent number K of repettons of the algorthm, the average re-mputaton accuracy rate wll approach the true remputaton accuracy rate and can, therefore, can be used to measure the performance of the mputaton technque. Theorem. Gven a data set wth mssng categorcal values and an mputaton technque, let RIAR be the average re-mputaton accuracy rate, then t converges to the expected remputaton accuracy rate E(RIAR ) n probablty. That s, RIAR P E( RIAR ), as K. (4) The followng theorem s a drect result from central lmt theorem on proportons. Theorem 2. Gven a data set wth mssng categorcal values and an mputaton technque, let RIAR be the average re-mputaton accuracy rate, then RIAR E( RIAR ) L N(,), RIAR as K. (5) Where E( RIAR )[ E( RIAR )]/ K. RIAR From Theorems, 2, and Slutsky's theorem, we have the followng corollary. Corollary. Gven a data set wth mssng categorcal values and an mputaton technque, let RIAR be the average re-mputaton accuracy rate, then RIAR E( RIAR ) RIAR ( RIAR ) / K L N(,), as K. (6) These theorems and corollary provde statstcal tools for estmaton and comparson of the remputaton accuracy rates based on dfferent mputaton approaches on a data set wth mssng categorcal data. The expected mputaton accuracy rate E(IAR) s an mportant measure of the performance of a gven mputaton technque. However, there are many cases n practce that no real values of the mssng value are known, whch makes t mpossble to obtan the mputaton accuracy rate. In the re-mputaton step of the proposed algorthm, t becomes possble to obtan the re-mputaton accuracy rate snce the mssng values of Y are made mssng n the data set S and we do have the correspondng real values. The re-mputaton accuracy rate can be obtaned to measure the performance of the mputaton technque. The fact that the same mputaton technque s used to evaluate the performance of tself leads to the name of the proposed self-evaluaton algorthm for mputaton approaches". The proposed algorthm s applcable to a wde range of mputaton methods. There are several evaluaton algorthms whch have been proved to be effectve n many stuatons for comparng dfferent mputaton methods [, 8]. However, they usually assume strong assumptons on the type of varables and statstcal models. Our algorthm s based on the Internatonal Journal of Scentfc and Statstcal Computng (IJSSC), Volume (7) : Issue () : 28 7

8 expected re-mputaton accuracy rate, whch assumes no assumptons on the mssng values n the dataset and models on evaluatons. Hence, t s vald to perform evaluatons on a wde range of mputaton methods n dfferent stuatons. In addton, the proposed algorthm s easer to perform evaluatons compared wth many other algorthms. Imputaton methods are usually evaluated on how they perform for nference about target parameters based on a statstcal model such as a regresson model. Sometmes, these statstcal models are complcated and come wth strong assumptons, whch makes the evaluatons hard to perform and restrct the applcatons to some lmted stuatons. In our proposed algorthm, the performance can be easly obtaned n a two-step procedure based on a smple valuaton measure E(RIAR) and very basc assumptons. In the next secton, we wll use the proposed evaluaton algorthm to measure the performance of the fve mputaton approaches mentoned n secton 2. based on a real data set. 3. EXPERIMENT AND RESULTS The man purpose of the experments s to emprcally evaluate the performances of mputaton approaches based on ther re-mputaton accuracy rates usng the proposed self-evaluaton algorthm. We wll start wth the descrpton of the data sets. The expermental results and analyss wll follow. Varables Categores Frequency Relatve Frequency Rank (Ordnal). Assstant Professor Assocate Professor Full Professor Mssng NA Dscplne (Nomnal) yrs.servce (Ordnal) Salary (Ordnal). Theoretcal (A) Appled (B) Mssng NA. Less than From 6 to More than Mssng 5 NA. Less than $85, From $85, to $, 3. More than $, Mssng 22 NA TABLE 2: Descrptve Summary of Varables In The Data Set. 3. Estmaton and Comparson The experments are based on the data set Salares" ncluded n the R package car [22]. The data set ncluded the 28-9 nne-month academc salares for assstant professors, assocate professors and full professors n a college n the U.S. After some ntal data processng, the resultng data set contaned nformaton of 67 assstant professors, 64 assocate professors, and 94 full professors on four varables, namely, rank, dscplne, years of servce (yrs.servce), and salary. The varable rank has three dfferent levels: assstant professor (AsstProf), assocate professor (AssocProf), and full professor (Prof). The varable dscplne has two levels: theoretcal department (A) and appled department (B). The varable years of servce (yrs.servce) s classfed nto three categores wth level f t s less than sx years, level 2 f t s from sx years to ffteen years, and level 3 f t s more than ffteen years. Smlarly, the varable salary s also classfed nto three categores wth level f t s less than $ 85,, level 2 f t s more than or equal to $ 85, and less than $,, and level 3 f t s more than $,. There are ffteen mssng values for the varable years of servce (yrs.servce) and twenty two Internatonal Journal of Scentfc and Statstcal Computng (IJSSC), Volume (7) : Issue () : 28 8

9 mssng values for the varable salary. A descrptve summary of the resultng data set s gven n table Expermental Setup The experments were performed n the followng procedure. The data set was frst mputed usng a selected mputaton method. Next, the mssng data were generated randomly, usng the MCAR mechansm, from the data set S(the set contanng complete observatons) n the followng amounts: 5%, %, 2%, and 3%. Then the mssng data were agan mputed usng the same mputaton method and the re-mputaton accuracy rate (RIAR) of the selected mputaton method was calculated and evaluated through estmaton. The mputaton methods used ncludes mean mputaton (MI), KNN, C5. decson tree mputaton, Nave-Bayes(NB), and polytomous regresson mputaton-ordered (POLR). The results of the above experments were then examned and compared. 3.3 Expermental Results The experments report the re-mputaton accuracy rate aganst four amounts of mssng values, for fve dfferent mputaton methods mentoned above. For KNN, three neghbors are consdered. In these experment, the accuracy rate s measured based on a zero-one loss, whch s commonly used to evaluate the performance of mputaton methods. We assume a unform cost for all the categores of all varables. A future study would consder dfferent categores havng dfferent assocated costs. Table 3 presents the average re-mputaton accuracy rate of these fve dfferent mputaton methods at dfferent amounts of mssng values based on K = repettons. The best results based on these average RIAR's are hghlghted n bold. We note that the hghest re-mputaton accuracy rate over all amounts of mssng data s 7.9% for 5% mssng data re-mputed by the C5 decson tree mputaton for the varable yrs.servce. The hghest re-mputaton accuracy rate over all amounts of mssng data s 8.6% for % mssng data re-mputed by the C5 decson tree mputaton for the varable salary. Based on these results, we notce that C5 s n general the best mputaton method among these fve mputaton methods for the mssng values n the data set. Varable yrs.servce salary Mssng Imputaton Methods (%) MI KNN C5 Naïve Bayes POLR TABLE 3: Average re-mputaton accuracy rates for two varables, fve mputaton methods and four amounts of mssng data based on repettons (Best average RIAR's are shown n bold). 3.4 Statstcal Comparson Now we use test of sgnfcance to compare the performance of the mputaton methods based on the results of re-mputaton accuracy rates. We analyze the statstcal sgnfcance of dfferences n re-mputaton accuracy rates between mputaton methods at 95% level of sgnfcance. We want to test whether two mean re-mputaton accuracy rates between two dfferent mputaton methods for a gven amount of mssng data are dfferent. The null hypothess H s that they are the same. Snce the average re-mputaton accuracy rates are computed based on K = Internatonal Journal of Scentfc and Statstcal Computng (IJSSC), Volume (7) : Issue () : 28 9

10 ndependent repettons, a Z-test s appled for the comparson. The standard error of the samplng dstrbuton of the accuracy rate dfference s SE p( p) n n 2 2p( p) K, where n ˆ ˆ ˆ ˆ p n2 p2 p p2 p n n 2 2 s the pooled sample accuracy rate ( n n 2 K ). Here ˆp and ˆp 2 denotes the sample remputaton accuracy rates of two selected mputaton methods, respectvely. For llustraton, we test the sgnfcance of dfferences n re-mputaton accuracy rates usng KNN and all other methods at dfferent amounts of mssng data. The results are summarzed n table 4. Varable yrs.servce salary Mssng (%) MI Sgnfcance (z-score) Imputaton Methods C5 Sgnfcance (z-score) Naïve Bayes Sgnfcance (z-score) POLR Sgnfcance (z-score) 5 (-.77) + + (4.2) (.59) + + (3.4) (-:65) + + (5:) + + (3:8) + + (4:8) 2 (-9:) + + (6:46) + + (4:55) + + (5:63) 3 (-7:78) + + (7:82) + + (6:) + + (6:83) 5 (-:4) + + (8:44) + + (8:) (-2:39) (-9.3) + + (9:73) + + (9:45) (-:3) 2 (-7:23) + + (:38) + + (:39) (:87) 3 (-5:63) + + (2:77) + + (2:78) + + (2:74) TABLE 4: Statstcal sgnfcance of the dfference of re-mputaton accuracy rates usng KNN and other methods at dfferent amounts of mssng data. Here, ++" ndcates that the gven mputaton method (columns) gves statstcally sgnfcantly better re-mputaton accuracy rate than KNN; " ndcates that the gven mputaton method (columns) gves statstcally sgnfcantly worse re-mputaton accuracy rate than KNN; " ndcates that the dfference between the accuracy rates usng the gven mputaton methods and KNN s nsgnfcant. 4. CONCLUSION In ths paper, we have ntroduced a new evaluaton algorthm for mputaton methods of mssng categorcal values. There are two man steps n the algorthm. In the frst step, the ncomplete data set s mputed usng a selected mputaton method to obtan an mputed complete data set frst. In the second step, a certan amount of mssng values are generated from the orgnally exstng observatons n the mputed complete data set and then these mssng values get remputed. The re-mputaton accuracy rate s obtaned based on certan number of repettons of ths process and used to evaluate the performance of selected mputaton method. From Table 3 and 4, we can see that, among the selected mputaton methods, C5 reconstructed the mssng data n a better manner n general. The Naïve Bayes methods was comparable to C5 only for the varable salary when the amount of mssng values s 2 percent or 3 percent. The mean mputaton method produced worse results compared to other methods. Internatonal Journal of Scentfc and Statstcal Computng (IJSSC), Volume (7) : Issue () : 28

11 The algorthm assumes no assumptons on the mssng values. That s, t appled to any type of mssng data (MCAR, MAR, or mssng not at random). In addton, researchers can evaluate most of mputaton methods avalable for categorcal data usng our algorthm on a specfc gven data set. The am of the algorthm s to two-fold. Frst, t ntroduces an evaluaton algorthm of mputaton methods for categorcal data n a data set and, therefore, provdes the best" mputaton method for the mssng categorcal data. Second, t can shed some nsghts on the true reason of mssng values n the data set based on the performance of dfferent mputaton methods. In ths paper, we only studed the evaluaton algorthm for mssng categorcal data. The evaluaton algorthm for mssng contnuous data needs to be studed n future works. 5. REFERENCES [] W.H. Fnch. Imputaton methods for mssng categorcal questonnare data: a comparson of approaches. Journal of Data Scence, vol. 8, pp , 2. [2] R.J.A. Lttle. and D.B. Rubn. Statstcal Analyss wth Mssng Data. New York: Wley, 987. [3] E. D. de Leeuw, J. Hox, and M. Husman. Preventon and treatment of tem nonresponse. Journal of Offcal Statstcs, vol. 9, pp , 23. [4] S. F. Messner. Explorng the Consequences of Erratc Data Reportng for Cross- Natonal Research on Homcde. Journal of Quanttatve Crmnology, vol. 8, pp.55-73, 992. [5] D. J. Hand, H. J. Adér, and G. J. Mellenbergh. Advsng on Research Methods: A Consultant's Companon. Huzen, Netherlands: Johannes van Kessel. pp , 28. [6] J. L. Schafer. Analyss of Incomplete Multvarate Data. Chapman and Hall, 997. [7] J. L. Schafer and J. W. Graham. Mssng data: Our vew of the state of the art. Psychologcal Methods, vol. 7, pp.47-77, 22. [8] I. Myrtvert and E. Stensrud. Analyzng data sets wth mssng data: an emprcal evaluaton of mputaton methods and lkelhood-based methods. IEEE Transactons On Software Engneerng, vol. 27, pp.999-3, 2. [9] D.B. Rubn. Multple mputaton after 8+ years. J. Am. Stat. Assoc, vol. 9, pp , 996. [] S.C. Zhang, et al. Optmzed parameters for mssng data mputaton. PRICAI, vol. 6, pp. -6, 26. [] Q. Wang and J. Rao, Emprcal lkelhood-based nferences n lnear models wth mssng data. Scand. J. Statst, vol. 29, pp , 22. [2] J. Chen and J. Shao. Jackknfe varance estmaton for nearest-neghbor mputaton. J. Amer. Statst, Assoc, vol. 96, pp , 2. [3] S.M. Chen and C.M. Huang. Generatng weghted fuzzy rules from relatonal database systems for estmatng null values usng genetc algorthms. IEEE Transactons on Fuzzy Systems, vol., pp , 23. [4] R.S. Somasundaram and R. Nedunchezhan. Evaluaton of three smple mputaton methods for enhancng preprocessng of data wth mssng values. Internatonal Journal of Computer Applcatons, vol. 2, pp. 4-9, 2. Internatonal Journal of Scentfc and Statstcal Computng (IJSSC), Volume (7) : Issue () : 28

12 [5] A.B. Anderson, A. Baslevsky, and D.P.J. Hum. Mssng data: a revew of the lterature, n Handbook of Survey Research. New York: Academc Press, 983, pp [6] M.J. Rovne and M. Delaney. Mssng data estmaton n developmental research, n Statstcal Methods n Longtudnal Research: Prncples and Structurng Change, A. Von Eye ed., New York: Academc Press, pp [7] O. Troyanskaya, M. Cantor, and G. Sherlock. Mssng value estmaton methods for DNA mcroarrays. Bonformatcs, vol. 7, pp , 2. [8] J. Chen and J. Shao. Nearest neghbor mputaton for survey data. Journal of Offcal Statstcs, vol. 6, pp. 3-3, 2. [9] L. Hurley. Mssng covarates n causal nference matchng: Statstcal mputaton usng machne learnng and evolutonary search algorthms. Doctoral dssertaton, Fordham Unversty, 27. [2] J. R. Qunlan. C4.5: Programs for machne learnng, Morgan Kaufman, Los Altos, CA, 993. [2] R.O. Duda and P.E. Hart. Pattern Classfcaton and Scene Analyss, New York: Wley, 973. [22] J. Fox, S. Wesberg, D. Adler, D. Bates, G. Baud-Bovy, S. Ellson,... and R. Heberger. Package car, Companon to Appled Regresson. R Package verson, 2-, 26. Internatonal Journal of Scentfc and Statstcal Computng (IJSSC), Volume (7) : Issue () : 28 2

Can Auto Liability Insurance Purchases Signal Risk Attitude?

Can Auto Liability Insurance Purchases Signal Risk Attitude? Internatonal Journal of Busness and Economcs, 2011, Vol. 10, No. 2, 159-164 Can Auto Lablty Insurance Purchases Sgnal Rsk Atttude? Chu-Shu L Department of Internatonal Busness, Asa Unversty, Tawan Sheng-Chang

More information

An Alternative Way to Measure Private Equity Performance

An Alternative Way to Measure Private Equity Performance An Alternatve Way to Measure Prvate Equty Performance Peter Todd Parlux Investment Technology LLC Summary Internal Rate of Return (IRR) s probably the most common way to measure the performance of prvate

More information

PSYCHOLOGICAL RESEARCH (PYC 304-C) Lecture 12

PSYCHOLOGICAL RESEARCH (PYC 304-C) Lecture 12 14 The Ch-squared dstrbuton PSYCHOLOGICAL RESEARCH (PYC 304-C) Lecture 1 If a normal varable X, havng mean µ and varance σ, s standardsed, the new varable Z has a mean 0 and varance 1. When ths standardsed

More information

The Development of Web Log Mining Based on Improve-K-Means Clustering Analysis

The Development of Web Log Mining Based on Improve-K-Means Clustering Analysis The Development of Web Log Mnng Based on Improve-K-Means Clusterng Analyss TngZhong Wang * College of Informaton Technology, Luoyang Normal Unversty, Luoyang, 471022, Chna wangtngzhong2@sna.cn Abstract.

More information

benefit is 2, paid if the policyholder dies within the year, and probability of death within the year is ).

benefit is 2, paid if the policyholder dies within the year, and probability of death within the year is ). REVIEW OF RISK MANAGEMENT CONCEPTS LOSS DISTRIBUTIONS AND INSURANCE Loss and nsurance: When someone s subject to the rsk of ncurrng a fnancal loss, the loss s generally modeled usng a random varable or

More information

Feature selection for intrusion detection. Slobodan Petrović NISlab, Gjøvik University College

Feature selection for intrusion detection. Slobodan Petrović NISlab, Gjøvik University College Feature selecton for ntruson detecton Slobodan Petrovć NISlab, Gjøvk Unversty College Contents The feature selecton problem Intruson detecton Traffc features relevant for IDS The CFS measure The mrmr measure

More information

What is Candidate Sampling

What is Candidate Sampling What s Canddate Samplng Say we have a multclass or mult label problem where each tranng example ( x, T ) conssts of a context x a small (mult)set of target classes T out of a large unverse L of possble

More information

Forecasting the Direction and Strength of Stock Market Movement

Forecasting the Direction and Strength of Stock Market Movement Forecastng the Drecton and Strength of Stock Market Movement Jngwe Chen Mng Chen Nan Ye cjngwe@stanford.edu mchen5@stanford.edu nanye@stanford.edu Abstract - Stock market s one of the most complcated systems

More information

Calculation of Sampling Weights

Calculation of Sampling Weights Perre Foy Statstcs Canada 4 Calculaton of Samplng Weghts 4.1 OVERVIEW The basc sample desgn used n TIMSS Populatons 1 and 2 was a two-stage stratfed cluster desgn. 1 The frst stage conssted of a sample

More information

Statistical Methods to Develop Rating Models

Statistical Methods to Develop Rating Models Statstcal Methods to Develop Ratng Models [Evelyn Hayden and Danel Porath, Österrechsche Natonalbank and Unversty of Appled Scences at Manz] Source: The Basel II Rsk Parameters Estmaton, Valdaton, and

More information

Study on Model of Risks Assessment of Standard Operation in Rural Power Network

Study on Model of Risks Assessment of Standard Operation in Rural Power Network Study on Model of Rsks Assessment of Standard Operaton n Rural Power Network Qngj L 1, Tao Yang 2 1 Qngj L, College of Informaton and Electrcal Engneerng, Shenyang Agrculture Unversty, Shenyang 110866,

More information

CS 2750 Machine Learning. Lecture 3. Density estimation. CS 2750 Machine Learning. Announcements

CS 2750 Machine Learning. Lecture 3. Density estimation. CS 2750 Machine Learning. Announcements Lecture 3 Densty estmaton Mlos Hauskrecht mlos@cs.ptt.edu 5329 Sennott Square Next lecture: Matlab tutoral Announcements Rules for attendng the class: Regstered for credt Regstered for audt (only f there

More information

Single and multiple stage classifiers implementing logistic discrimination

Single and multiple stage classifiers implementing logistic discrimination Sngle and multple stage classfers mplementng logstc dscrmnaton Hélo Radke Bttencourt 1 Dens Alter de Olvera Moraes 2 Vctor Haertel 2 1 Pontfíca Unversdade Católca do Ro Grande do Sul - PUCRS Av. Ipranga,

More information

1. Measuring association using correlation and regression

1. Measuring association using correlation and regression How to measure assocaton I: Correlaton. 1. Measurng assocaton usng correlaton and regresson We often would lke to know how one varable, such as a mother's weght, s related to another varable, such as a

More information

The OC Curve of Attribute Acceptance Plans

The OC Curve of Attribute Acceptance Plans The OC Curve of Attrbute Acceptance Plans The Operatng Characterstc (OC) curve descrbes the probablty of acceptng a lot as a functon of the lot s qualty. Fgure 1 shows a typcal OC Curve. 10 8 6 4 1 3 4

More information

Module 2 LOSSLESS IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur

Module 2 LOSSLESS IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur Module LOSSLESS IMAGE COMPRESSION SYSTEMS Lesson 3 Lossless Compresson: Huffman Codng Instructonal Objectves At the end of ths lesson, the students should be able to:. Defne and measure source entropy..

More information

CHAPTER 14 MORE ABOUT REGRESSION

CHAPTER 14 MORE ABOUT REGRESSION CHAPTER 14 MORE ABOUT REGRESSION We learned n Chapter 5 that often a straght lne descrbes the pattern of a relatonshp between two quanttatve varables. For nstance, n Example 5.1 we explored the relatonshp

More information

CHOLESTEROL REFERENCE METHOD LABORATORY NETWORK. Sample Stability Protocol

CHOLESTEROL REFERENCE METHOD LABORATORY NETWORK. Sample Stability Protocol CHOLESTEROL REFERENCE METHOD LABORATORY NETWORK Sample Stablty Protocol Background The Cholesterol Reference Method Laboratory Network (CRMLN) developed certfcaton protocols for total cholesterol, HDL

More information

NPAR TESTS. One-Sample Chi-Square Test. Cell Specification. Observed Frequencies 1O i 6. Expected Frequencies 1EXP i 6

NPAR TESTS. One-Sample Chi-Square Test. Cell Specification. Observed Frequencies 1O i 6. Expected Frequencies 1EXP i 6 PAR TESTS If a WEIGHT varable s specfed, t s used to replcate a case as many tmes as ndcated by the weght value rounded to the nearest nteger. If the workspace requrements are exceeded and samplng has

More information

An Evaluation of the Extended Logistic, Simple Logistic, and Gompertz Models for Forecasting Short Lifecycle Products and Services

An Evaluation of the Extended Logistic, Simple Logistic, and Gompertz Models for Forecasting Short Lifecycle Products and Services An Evaluaton of the Extended Logstc, Smple Logstc, and Gompertz Models for Forecastng Short Lfecycle Products and Servces Charles V. Trappey a,1, Hsn-yng Wu b a Professor (Management Scence), Natonal Chao

More information

STATISTICAL DATA ANALYSIS IN EXCEL

STATISTICAL DATA ANALYSIS IN EXCEL Mcroarray Center STATISTICAL DATA ANALYSIS IN EXCEL Lecture 6 Some Advanced Topcs Dr. Petr Nazarov 14-01-013 petr.nazarov@crp-sante.lu Statstcal data analyss n Ecel. 6. Some advanced topcs Correcton for

More information

Latent Class Regression. Statistics for Psychosocial Research II: Structural Models December 4 and 6, 2006

Latent Class Regression. Statistics for Psychosocial Research II: Structural Models December 4 and 6, 2006 Latent Class Regresson Statstcs for Psychosocal Research II: Structural Models December 4 and 6, 2006 Latent Class Regresson (LCR) What s t and when do we use t? Recall the standard latent class model

More information

Editing and Imputing Administrative Tax Return Data. Charlotte Gaughan Office for National Statistics UK

Editing and Imputing Administrative Tax Return Data. Charlotte Gaughan Office for National Statistics UK Edtng and Imputng Admnstratve Tax Return Data Charlotte Gaughan Offce for Natonal Statstcs UK Overvew Introducton Lmtatons Data Lnkng Data Cleanng Imputaton Methods Concluson and Future Work Introducton

More information

DEFINING %COMPLETE IN MICROSOFT PROJECT

DEFINING %COMPLETE IN MICROSOFT PROJECT CelersSystems DEFINING %COMPLETE IN MICROSOFT PROJECT PREPARED BY James E Aksel, PMP, PMI-SP, MVP For Addtonal Informaton about Earned Value Management Systems and reportng, please contact: CelersSystems,

More information

On the Optimal Control of a Cascade of Hydro-Electric Power Stations

On the Optimal Control of a Cascade of Hydro-Electric Power Stations On the Optmal Control of a Cascade of Hydro-Electrc Power Statons M.C.M. Guedes a, A.F. Rbero a, G.V. Smrnov b and S. Vlela c a Department of Mathematcs, School of Scences, Unversty of Porto, Portugal;

More information

Bayesian Network Based Causal Relationship Identification and Funding Success Prediction in P2P Lending

Bayesian Network Based Causal Relationship Identification and Funding Success Prediction in P2P Lending Proceedngs of 2012 4th Internatonal Conference on Machne Learnng and Computng IPCSIT vol. 25 (2012) (2012) IACSIT Press, Sngapore Bayesan Network Based Causal Relatonshp Identfcaton and Fundng Success

More information

Evaluating credit risk models: A critique and a new proposal

Evaluating credit risk models: A critique and a new proposal Evaluatng credt rsk models: A crtque and a new proposal Hergen Frerchs* Gunter Löffler Unversty of Frankfurt (Man) February 14, 2001 Abstract Evaluatng the qualty of credt portfolo rsk models s an mportant

More information

A hybrid global optimization algorithm based on parallel chaos optimization and outlook algorithm

A hybrid global optimization algorithm based on parallel chaos optimization and outlook algorithm Avalable onlne www.ocpr.com Journal of Chemcal and Pharmaceutcal Research, 2014, 6(7):1884-1889 Research Artcle ISSN : 0975-7384 CODEN(USA) : JCPRC5 A hybrd global optmzaton algorthm based on parallel

More information

A Novel Methodology of Working Capital Management for Large. Public Constructions by Using Fuzzy S-curve Regression

A Novel Methodology of Working Capital Management for Large. Public Constructions by Using Fuzzy S-curve Regression Novel Methodology of Workng Captal Management for Large Publc Constructons by Usng Fuzzy S-curve Regresson Cheng-Wu Chen, Morrs H. L. Wang and Tng-Ya Hseh Department of Cvl Engneerng, Natonal Central Unversty,

More information

An Interest-Oriented Network Evolution Mechanism for Online Communities

An Interest-Oriented Network Evolution Mechanism for Online Communities An Interest-Orented Network Evoluton Mechansm for Onlne Communtes Cahong Sun and Xaopng Yang School of Informaton, Renmn Unversty of Chna, Bejng 100872, P.R. Chna {chsun,yang}@ruc.edu.cn Abstract. Onlne

More information

8.5 UNITARY AND HERMITIAN MATRICES. The conjugate transpose of a complex matrix A, denoted by A*, is given by

8.5 UNITARY AND HERMITIAN MATRICES. The conjugate transpose of a complex matrix A, denoted by A*, is given by 6 CHAPTER 8 COMPLEX VECTOR SPACES 5. Fnd the kernel of the lnear transformaton gven n Exercse 5. In Exercses 55 and 56, fnd the mage of v, for the ndcated composton, where and are gven by the followng

More information

An Analysis of Central Processor Scheduling in Multiprogrammed Computer Systems

An Analysis of Central Processor Scheduling in Multiprogrammed Computer Systems STAN-CS-73-355 I SU-SE-73-013 An Analyss of Central Processor Schedulng n Multprogrammed Computer Systems (Dgest Edton) by Thomas G. Prce October 1972 Techncal Report No. 57 Reproducton n whole or n part

More information

Gender differences in revealed risk taking: evidence from mutual fund investors

Gender differences in revealed risk taking: evidence from mutual fund investors Economcs Letters 76 (2002) 151 158 www.elsever.com/ locate/ econbase Gender dfferences n revealed rsk takng: evdence from mutual fund nvestors a b c, * Peggy D. Dwyer, James H. Glkeson, John A. Lst a Unversty

More information

SIMPLE LINEAR CORRELATION

SIMPLE LINEAR CORRELATION SIMPLE LINEAR CORRELATION Smple lnear correlaton s a measure of the degree to whch two varables vary together, or a measure of the ntensty of the assocaton between two varables. Correlaton often s abused.

More information

Recurrence. 1 Definitions and main statements

Recurrence. 1 Definitions and main statements Recurrence 1 Defntons and man statements Let X n, n = 0, 1, 2,... be a MC wth the state space S = (1, 2,...), transton probabltes p j = P {X n+1 = j X n = }, and the transton matrx P = (p j ),j S def.

More information

THE METHOD OF LEAST SQUARES THE METHOD OF LEAST SQUARES

THE METHOD OF LEAST SQUARES THE METHOD OF LEAST SQUARES The goal: to measure (determne) an unknown quantty x (the value of a RV X) Realsaton: n results: y 1, y 2,..., y j,..., y n, (the measured values of Y 1, Y 2,..., Y j,..., Y n ) every result s encumbered

More information

NEURO-FUZZY INFERENCE SYSTEM FOR E-COMMERCE WEBSITE EVALUATION

NEURO-FUZZY INFERENCE SYSTEM FOR E-COMMERCE WEBSITE EVALUATION NEURO-FUZZY INFERENE SYSTEM FOR E-OMMERE WEBSITE EVALUATION Huan Lu, School of Software, Harbn Unversty of Scence and Technology, Harbn, hna Faculty of Appled Mathematcs and omputer Scence, Belarusan State

More information

The Current Employment Statistics (CES) survey,

The Current Employment Statistics (CES) survey, Busness Brths and Deaths Impact of busness brths and deaths n the payroll survey The CES probablty-based sample redesgn accounts for most busness brth employment through the mputaton of busness deaths,

More information

The Application of Fractional Brownian Motion in Option Pricing

The Application of Fractional Brownian Motion in Option Pricing Vol. 0, No. (05), pp. 73-8 http://dx.do.org/0.457/jmue.05.0..6 The Applcaton of Fractonal Brownan Moton n Opton Prcng Qng-xn Zhou School of Basc Scence,arbn Unversty of Commerce,arbn zhouqngxn98@6.com

More information

Causal, Explanatory Forecasting. Analysis. Regression Analysis. Simple Linear Regression. Which is Independent? Forecasting

Causal, Explanatory Forecasting. Analysis. Regression Analysis. Simple Linear Regression. Which is Independent? Forecasting Causal, Explanatory Forecastng Assumes cause-and-effect relatonshp between system nputs and ts output Forecastng wth Regresson Analyss Rchard S. Barr Inputs System Cause + Effect Relatonshp The job of

More information

Project Networks With Mixed-Time Constraints

Project Networks With Mixed-Time Constraints Project Networs Wth Mxed-Tme Constrants L Caccetta and B Wattananon Western Australan Centre of Excellence n Industral Optmsaton (WACEIO) Curtn Unversty of Technology GPO Box U1987 Perth Western Australa

More information

Exhaustive Regression. An Exploration of Regression-Based Data Mining Techniques Using Super Computation

Exhaustive Regression. An Exploration of Regression-Based Data Mining Techniques Using Super Computation Exhaustve Regresson An Exploraton of Regresson-Based Data Mnng Technques Usng Super Computaton Antony Daves, Ph.D. Assocate Professor of Economcs Duquesne Unversty Pttsburgh, PA 58 Research Fellow The

More information

1. Fundamentals of probability theory 2. Emergence of communication traffic 3. Stochastic & Markovian Processes (SP & MP)

1. Fundamentals of probability theory 2. Emergence of communication traffic 3. Stochastic & Markovian Processes (SP & MP) 6.3 / -- Communcaton Networks II (Görg) SS20 -- www.comnets.un-bremen.de Communcaton Networks II Contents. Fundamentals of probablty theory 2. Emergence of communcaton traffc 3. Stochastc & Markovan Processes

More information

Inequality and The Accounting Period. Quentin Wodon and Shlomo Yitzhaki. World Bank and Hebrew University. September 2001.

Inequality and The Accounting Period. Quentin Wodon and Shlomo Yitzhaki. World Bank and Hebrew University. September 2001. Inequalty and The Accountng Perod Quentn Wodon and Shlomo Ytzha World Ban and Hebrew Unversty September Abstract Income nequalty typcally declnes wth the length of tme taen nto account for measurement.

More information

Transition Matrix Models of Consumer Credit Ratings

Transition Matrix Models of Consumer Credit Ratings Transton Matrx Models of Consumer Credt Ratngs Abstract Although the corporate credt rsk lterature has many studes modellng the change n the credt rsk of corporate bonds over tme, there s far less analyss

More information

Course outline. Financial Time Series Analysis. Overview. Data analysis. Predictive signal. Trading strategy

Course outline. Financial Time Series Analysis. Overview. Data analysis. Predictive signal. Trading strategy Fnancal Tme Seres Analyss Patrck McSharry patrck@mcsharry.net www.mcsharry.net Trnty Term 2014 Mathematcal Insttute Unversty of Oxford Course outlne 1. Data analyss, probablty, correlatons, vsualsaton

More information

A DATA MINING APPLICATION IN A STUDENT DATABASE

A DATA MINING APPLICATION IN A STUDENT DATABASE JOURNAL OF AERONAUTICS AND SPACE TECHNOLOGIES JULY 005 VOLUME NUMBER (53-57) A DATA MINING APPLICATION IN A STUDENT DATABASE Şenol Zafer ERDOĞAN Maltepe Ünversty Faculty of Engneerng Büyükbakkalköy-Istanbul

More information

How Sets of Coherent Probabilities May Serve as Models for Degrees of Incoherence

How Sets of Coherent Probabilities May Serve as Models for Degrees of Incoherence 1 st Internatonal Symposum on Imprecse Probabltes and Ther Applcatons, Ghent, Belgum, 29 June 2 July 1999 How Sets of Coherent Probabltes May Serve as Models for Degrees of Incoherence Mar J. Schervsh

More information

Document Clustering Analysis Based on Hybrid PSO+K-means Algorithm

Document Clustering Analysis Based on Hybrid PSO+K-means Algorithm Document Clusterng Analyss Based on Hybrd PSO+K-means Algorthm Xaohu Cu, Thomas E. Potok Appled Software Engneerng Research Group, Computatonal Scences and Engneerng Dvson, Oak Rdge Natonal Laboratory,

More information

FREQUENCY OF OCCURRENCE OF CERTAIN CHEMICAL CLASSES OF GSR FROM VARIOUS AMMUNITION TYPES

FREQUENCY OF OCCURRENCE OF CERTAIN CHEMICAL CLASSES OF GSR FROM VARIOUS AMMUNITION TYPES FREQUENCY OF OCCURRENCE OF CERTAIN CHEMICAL CLASSES OF GSR FROM VARIOUS AMMUNITION TYPES Zuzanna BRO EK-MUCHA, Grzegorz ZADORA, 2 Insttute of Forensc Research, Cracow, Poland 2 Faculty of Chemstry, Jagellonan

More information

Risk-based Fatigue Estimate of Deep Water Risers -- Course Project for EM388F: Fracture Mechanics, Spring 2008

Risk-based Fatigue Estimate of Deep Water Risers -- Course Project for EM388F: Fracture Mechanics, Spring 2008 Rsk-based Fatgue Estmate of Deep Water Rsers -- Course Project for EM388F: Fracture Mechancs, Sprng 2008 Chen Sh Department of Cvl, Archtectural, and Envronmental Engneerng The Unversty of Texas at Austn

More information

L10: Linear discriminants analysis

L10: Linear discriminants analysis L0: Lnear dscrmnants analyss Lnear dscrmnant analyss, two classes Lnear dscrmnant analyss, C classes LDA vs. PCA Lmtatons of LDA Varants of LDA Other dmensonalty reducton methods CSCE 666 Pattern Analyss

More information

A DYNAMIC CRASHING METHOD FOR PROJECT MANAGEMENT USING SIMULATION-BASED OPTIMIZATION. Michael E. Kuhl Radhamés A. Tolentino-Peña

A DYNAMIC CRASHING METHOD FOR PROJECT MANAGEMENT USING SIMULATION-BASED OPTIMIZATION. Michael E. Kuhl Radhamés A. Tolentino-Peña Proceedngs of the 2008 Wnter Smulaton Conference S. J. Mason, R. R. Hll, L. Mönch, O. Rose, T. Jefferson, J. W. Fowler eds. A DYNAMIC CRASHING METHOD FOR PROJECT MANAGEMENT USING SIMULATION-BASED OPTIMIZATION

More information

RESEARCH ON DUAL-SHAKER SINE VIBRATION CONTROL. Yaoqi FENG 1, Hanping QIU 1. China Academy of Space Technology (CAST) yaoqi.feng@yahoo.

RESEARCH ON DUAL-SHAKER SINE VIBRATION CONTROL. Yaoqi FENG 1, Hanping QIU 1. China Academy of Space Technology (CAST) yaoqi.feng@yahoo. ICSV4 Carns Australa 9- July, 007 RESEARCH ON DUAL-SHAKER SINE VIBRATION CONTROL Yaoq FENG, Hanpng QIU Dynamc Test Laboratory, BISEE Chna Academy of Space Technology (CAST) yaoq.feng@yahoo.com Abstract

More information

SPEE Recommended Evaluation Practice #6 Definition of Decline Curve Parameters Background:

SPEE Recommended Evaluation Practice #6 Definition of Decline Curve Parameters Background: SPEE Recommended Evaluaton Practce #6 efnton of eclne Curve Parameters Background: The producton hstores of ol and gas wells can be analyzed to estmate reserves and future ol and gas producton rates and

More information

Searching for Interacting Features for Spam Filtering

Searching for Interacting Features for Spam Filtering Searchng for Interactng Features for Spam Flterng Chuanlang Chen 1, Yun-Chao Gong 2, Rongfang Be 1,, and X. Z. Gao 3 1 Department of Computer Scence, Bejng Normal Unversty, Bejng 100875, Chna 2 Software

More information

CHAPTER 5 RELATIONSHIPS BETWEEN QUANTITATIVE VARIABLES

CHAPTER 5 RELATIONSHIPS BETWEEN QUANTITATIVE VARIABLES CHAPTER 5 RELATIONSHIPS BETWEEN QUANTITATIVE VARIABLES In ths chapter, we wll learn how to descrbe the relatonshp between two quanttatve varables. Remember (from Chapter 2) that the terms quanttatve varable

More information

THE DISTRIBUTION OF LOAN PORTFOLIO VALUE * Oldrich Alfons Vasicek

THE DISTRIBUTION OF LOAN PORTFOLIO VALUE * Oldrich Alfons Vasicek HE DISRIBUION OF LOAN PORFOLIO VALUE * Oldrch Alfons Vascek he amount of captal necessary to support a portfolo of debt securtes depends on the probablty dstrbuton of the portfolo loss. Consder a portfolo

More information

Traditional versus Online Courses, Efforts, and Learning Performance

Traditional versus Online Courses, Efforts, and Learning Performance Tradtonal versus Onlne Courses, Efforts, and Learnng Performance Kuang-Cheng Tseng, Department of Internatonal Trade, Chung-Yuan Chrstan Unversty, Tawan Shan-Yng Chu, Department of Internatonal Trade,

More information

NON-CONSTANT SUM RED-AND-BLACK GAMES WITH BET-DEPENDENT WIN PROBABILITY FUNCTION LAURA PONTIGGIA, University of the Sciences in Philadelphia

NON-CONSTANT SUM RED-AND-BLACK GAMES WITH BET-DEPENDENT WIN PROBABILITY FUNCTION LAURA PONTIGGIA, University of the Sciences in Philadelphia To appear n Journal o Appled Probablty June 2007 O-COSTAT SUM RED-AD-BLACK GAMES WITH BET-DEPEDET WI PROBABILITY FUCTIO LAURA POTIGGIA, Unversty o the Scences n Phladelpha Abstract In ths paper we nvestgate

More information

Fragility Based Rehabilitation Decision Analysis

Fragility Based Rehabilitation Decision Analysis .171. Fraglty Based Rehabltaton Decson Analyss Cagdas Kafal Graduate Student, School of Cvl and Envronmental Engneerng, Cornell Unversty Research Supervsor: rcea Grgoru, Professor Summary A method s presented

More information

IMPACT ANALYSIS OF A CELLULAR PHONE

IMPACT ANALYSIS OF A CELLULAR PHONE 4 th ASA & μeta Internatonal Conference IMPACT AALYSIS OF A CELLULAR PHOE We Lu, 2 Hongy L Bejng FEAonlne Engneerng Co.,Ltd. Bejng, Chna ABSTRACT Drop test smulaton plays an mportant role n nvestgatng

More information

HOUSEHOLDS DEBT BURDEN: AN ANALYSIS BASED ON MICROECONOMIC DATA*

HOUSEHOLDS DEBT BURDEN: AN ANALYSIS BASED ON MICROECONOMIC DATA* HOUSEHOLDS DEBT BURDEN: AN ANALYSIS BASED ON MICROECONOMIC DATA* Luísa Farnha** 1. INTRODUCTION The rapd growth n Portuguese households ndebtedness n the past few years ncreased the concerns that debt

More information

Enterprise Master Patient Index

Enterprise Master Patient Index Enterprse Master Patent Index Healthcare data are captured n many dfferent settngs such as hosptals, clncs, labs, and physcan offces. Accordng to a report by the CDC, patents n the Unted States made an

More information

PRIVATE SCHOOL CHOICE: THE EFFECTS OF RELIGIOUS AFFILIATION AND PARTICIPATION

PRIVATE SCHOOL CHOICE: THE EFFECTS OF RELIGIOUS AFFILIATION AND PARTICIPATION PRIVATE SCHOOL CHOICE: THE EFFECTS OF RELIIOUS AFFILIATION AND PARTICIPATION Danny Cohen-Zada Department of Economcs, Ben-uron Unversty, Beer-Sheva 84105, Israel Wllam Sander Department of Economcs, DePaul

More information

MAPP. MERIS level 3 cloud and water vapour products. Issue: 1. Revision: 0. Date: 9.12.1998. Function Name Organisation Signature Date

MAPP. MERIS level 3 cloud and water vapour products. Issue: 1. Revision: 0. Date: 9.12.1998. Function Name Organisation Signature Date Ttel: Project: Doc. No.: MERIS level 3 cloud and water vapour products MAPP MAPP-ATBD-ClWVL3 Issue: 1 Revson: 0 Date: 9.12.1998 Functon Name Organsaton Sgnature Date Author: Bennartz FUB Preusker FUB Schüller

More information

Chapter XX More advanced approaches to the analysis of survey data. Gad Nathan Hebrew University Jerusalem, Israel. Abstract

Chapter XX More advanced approaches to the analysis of survey data. Gad Nathan Hebrew University Jerusalem, Israel. Abstract Household Sample Surveys n Developng and Transton Countres Chapter More advanced approaches to the analyss of survey data Gad Nathan Hebrew Unversty Jerusalem, Israel Abstract In the present chapter, we

More information

Influence and Correlation in Social Networks

Influence and Correlation in Social Networks Influence and Correlaton n Socal Networks Ars Anagnostopoulos Rav Kumar Mohammad Mahdan Yahoo! Research 701 Frst Ave. Sunnyvale, CA 94089. {ars,ravkumar,mahdan}@yahoo-nc.com ABSTRACT In many onlne socal

More information

Methodology to Determine Relationships between Performance Factors in Hadoop Cloud Computing Applications

Methodology to Determine Relationships between Performance Factors in Hadoop Cloud Computing Applications Methodology to Determne Relatonshps between Performance Factors n Hadoop Cloud Computng Applcatons Lus Eduardo Bautsta Vllalpando 1,2, Alan Aprl 1 and Alan Abran 1 1 Department of Software Engneerng and

More information

METHODOLOGY TO DETERMINE RELATIONSHIPS BETWEEN PERFORMANCE FACTORS IN HADOOP CLOUD COMPUTING APPLICATIONS

METHODOLOGY TO DETERMINE RELATIONSHIPS BETWEEN PERFORMANCE FACTORS IN HADOOP CLOUD COMPUTING APPLICATIONS METHODOLOGY TO DETERMINE RELATIONSHIPS BETWEEN PERFORMANCE FACTORS IN HADOOP CLOUD COMPUTING APPLICATIONS Lus Eduardo Bautsta Vllalpando 1,2, Alan Aprl 1 and Alan Abran 1 1 Department of Software Engneerng

More information

Prediction of Disability Frequencies in Life Insurance

Prediction of Disability Frequencies in Life Insurance Predcton of Dsablty Frequences n Lfe Insurance Bernhard Köng Fran Weber Maro V. Wüthrch October 28, 2011 Abstract For the predcton of dsablty frequences, not only the observed, but also the ncurred but

More information

1.1 The University may award Higher Doctorate degrees as specified from time-to-time in UPR AS11 1.

1.1 The University may award Higher Doctorate degrees as specified from time-to-time in UPR AS11 1. HIGHER DOCTORATE DEGREES SUMMARY OF PRINCIPAL CHANGES General changes None Secton 3.2 Refer to text (Amendments to verson 03.0, UPR AS02 are shown n talcs.) 1 INTRODUCTION 1.1 The Unversty may award Hgher

More information

Sketching Sampled Data Streams

Sketching Sampled Data Streams Sketchng Sampled Data Streams Florn Rusu, Aln Dobra CISE Department Unversty of Florda Ganesvlle, FL, USA frusu@cse.ufl.edu adobra@cse.ufl.edu Abstract Samplng s used as a unversal method to reduce the

More information

Prediction of Disability Frequencies in Life Insurance

Prediction of Disability Frequencies in Life Insurance 1 Predcton of Dsablty Frequences n Lfe Insurance Bernhard Köng 1, Fran Weber 1, Maro V. Wüthrch 2 Abstract: For the predcton of dsablty frequences, not only the observed, but also the ncurred but not yet

More information

Survival analysis methods in Insurance Applications in car insurance contracts

Survival analysis methods in Insurance Applications in car insurance contracts Survval analyss methods n Insurance Applcatons n car nsurance contracts Abder OULIDI 1 Jean-Mare MARION 2 Hervé GANACHAUD 3 Abstract In ths wor, we are nterested n survval models and ther applcatons on

More information

Financial Instability and Life Insurance Demand + Mahito Okura *

Financial Instability and Life Insurance Demand + Mahito Okura * Fnancal Instablty and Lfe Insurance Demand + Mahto Okura * Norhro Kasuga ** Abstract Ths paper estmates prvate lfe nsurance and Kampo demand functons usng household-level data provded by the Postal Servces

More information

Number of Levels Cumulative Annual operating Income per year construction costs costs ($) ($) ($) 1 600,000 35,000 100,000 2 2,200,000 60,000 350,000

Number of Levels Cumulative Annual operating Income per year construction costs costs ($) ($) ($) 1 600,000 35,000 100,000 2 2,200,000 60,000 350,000 Problem Set 5 Solutons 1 MIT s consderng buldng a new car park near Kendall Square. o unversty funds are avalable (overhead rates are under pressure and the new faclty would have to pay for tself from

More information

v a 1 b 1 i, a 2 b 2 i,..., a n b n i.

v a 1 b 1 i, a 2 b 2 i,..., a n b n i. SECTION 8.4 COMPLEX VECTOR SPACES AND INNER PRODUCTS 455 8.4 COMPLEX VECTOR SPACES AND INNER PRODUCTS All the vector spaces we have studed thus far n the text are real vector spaces snce the scalars are

More information

Predicting Software Development Project Outcomes *

Predicting Software Development Project Outcomes * Predctng Software Development Project Outcomes * Rosna Weber, Mchael Waller, June Verner, Wllam Evanco College of Informaton Scence & Technology, Drexel Unversty 3141 Chestnut Street Phladelpha, PA 19104

More information

Scale Dependence of Overconfidence in Stock Market Volatility Forecasts

Scale Dependence of Overconfidence in Stock Market Volatility Forecasts Scale Dependence of Overconfdence n Stoc Maret Volatlty Forecasts Marus Glaser, Thomas Langer, Jens Reynders, Martn Weber* June 7, 007 Abstract In ths study, we analyze whether volatlty forecasts (judgmental

More information

A Probabilistic Theory of Coherence

A Probabilistic Theory of Coherence A Probablstc Theory of Coherence BRANDEN FITELSON. The Coherence Measure C Let E be a set of n propostons E,..., E n. We seek a probablstc measure C(E) of the degree of coherence of E. Intutvely, we want

More information

1 De nitions and Censoring

1 De nitions and Censoring De ntons and Censorng. Survval Analyss We begn by consderng smple analyses but we wll lead up to and take a look at regresson on explanatory factors., as n lnear regresson part A. The mportant d erence

More information

The impact of hard discount control mechanism on the discount volatility of UK closed-end funds

The impact of hard discount control mechanism on the discount volatility of UK closed-end funds Investment Management and Fnancal Innovatons, Volume 10, Issue 3, 2013 Ahmed F. Salhn (Egypt) The mpact of hard dscount control mechansm on the dscount volatlty of UK closed-end funds Abstract The mpact

More information

APPLICATION OF PROBE DATA COLLECTED VIA INFRARED BEACONS TO TRAFFIC MANEGEMENT

APPLICATION OF PROBE DATA COLLECTED VIA INFRARED BEACONS TO TRAFFIC MANEGEMENT APPLICATION OF PROBE DATA COLLECTED VIA INFRARED BEACONS TO TRAFFIC MANEGEMENT Toshhko Oda (1), Kochro Iwaoka (2) (1), (2) Infrastructure Systems Busness Unt, Panasonc System Networks Co., Ltd. Saedo-cho

More information

Staff Paper. Farm Savings Accounts: Examining Income Variability, Eligibility, and Benefits. Brent Gloy, Eddy LaDue, and Charles Cuykendall

Staff Paper. Farm Savings Accounts: Examining Income Variability, Eligibility, and Benefits. Brent Gloy, Eddy LaDue, and Charles Cuykendall SP 2005-02 August 2005 Staff Paper Department of Appled Economcs and Management Cornell Unversty, Ithaca, New York 14853-7801 USA Farm Savngs Accounts: Examnng Income Varablty, Elgblty, and Benefts Brent

More information

Learning from Large Distributed Data: A Scaling Down Sampling Scheme for Efficient Data Processing

Learning from Large Distributed Data: A Scaling Down Sampling Scheme for Efficient Data Processing Internatonal Journal of Machne Learnng and Computng, Vol. 4, No. 3, June 04 Learnng from Large Dstrbuted Data: A Scalng Down Samplng Scheme for Effcent Data Processng Che Ngufor and Janusz Wojtusak part

More information

Data Broadcast on a Multi-System Heterogeneous Overlayed Wireless Network *

Data Broadcast on a Multi-System Heterogeneous Overlayed Wireless Network * JOURNAL OF INFORMATION SCIENCE AND ENGINEERING 24, 819-840 (2008) Data Broadcast on a Mult-System Heterogeneous Overlayed Wreless Network * Department of Computer Scence Natonal Chao Tung Unversty Hsnchu,

More information

Decision Tree Model for Count Data

Decision Tree Model for Count Data Proceedngs of the World Congress on Engneerng 2012 Vol I Decson Tree Model for Count Data Yap Bee Wah, Norashkn Nasaruddn, Wong Shaw Voon and Mohamad Alas Lazm Abstract The Posson Regresson and Negatve

More information

Characterization of Assembly. Variation Analysis Methods. A Thesis. Presented to the. Department of Mechanical Engineering. Brigham Young University

Characterization of Assembly. Variation Analysis Methods. A Thesis. Presented to the. Department of Mechanical Engineering. Brigham Young University Characterzaton of Assembly Varaton Analyss Methods A Thess Presented to the Department of Mechancal Engneerng Brgham Young Unversty In Partal Fulfllment of the Requrements for the Degree Master of Scence

More information

Statistical algorithms in Review Manager 5

Statistical algorithms in Review Manager 5 Statstcal algorthms n Reve Manager 5 Jonathan J Deeks and Julan PT Hggns on behalf of the Statstcal Methods Group of The Cochrane Collaboraton August 00 Data structure Consder a meta-analyss of k studes

More information

Proceedings of the Annual Meeting of the American Statistical Association, August 5-9, 2001

Proceedings of the Annual Meeting of the American Statistical Association, August 5-9, 2001 Proceedngs of the Annual Meetng of the Amercan Statstcal Assocaton, August 5-9, 2001 LIST-ASSISTED SAMPLING: THE EFFECT OF TELEPHONE SYSTEM CHANGES ON DESIGN 1 Clyde Tucker, Bureau of Labor Statstcs James

More information

The Greedy Method. Introduction. 0/1 Knapsack Problem

The Greedy Method. Introduction. 0/1 Knapsack Problem The Greedy Method Introducton We have completed data structures. We now are gong to look at algorthm desgn methods. Often we are lookng at optmzaton problems whose performance s exponental. For an optmzaton

More information

Forecasting the Demand of Emergency Supplies: Based on the CBR Theory and BP Neural Network

Forecasting the Demand of Emergency Supplies: Based on the CBR Theory and BP Neural Network 700 Proceedngs of the 8th Internatonal Conference on Innovaton & Management Forecastng the Demand of Emergency Supples: Based on the CBR Theory and BP Neural Network Fu Deqang, Lu Yun, L Changbng School

More information

Selecting Best Employee of the Year Using Analytical Hierarchy Process

Selecting Best Employee of the Year Using Analytical Hierarchy Process J. Basc. Appl. Sc. Res., 5(11)72-76, 2015 2015, TextRoad Publcaton ISSN 2090-4304 Journal of Basc and Appled Scentfc Research www.textroad.com Selectng Best Employee of the Year Usng Analytcal Herarchy

More information

Mining Multiple Large Data Sources

Mining Multiple Large Data Sources The Internatonal Arab Journal of Informaton Technology, Vol. 7, No. 3, July 2 24 Mnng Multple Large Data Sources Anmesh Adhkar, Pralhad Ramachandrarao 2, Bhanu Prasad 3, and Jhml Adhkar 4 Department of

More information

1 Example 1: Axis-aligned rectangles

1 Example 1: Axis-aligned rectangles COS 511: Theoretcal Machne Learnng Lecturer: Rob Schapre Lecture # 6 Scrbe: Aaron Schld February 21, 2013 Last class, we dscussed an analogue for Occam s Razor for nfnte hypothess spaces that, n conjuncton

More information

The Journal of Systems and Software

The Journal of Systems and Software The Journal of Systems and Software 82 (2009) 241 252 Contents lsts avalable at ScenceDrect The Journal of Systems and Software journal homepage: www. elsever. com/ locate/ jss A study of project selecton

More information

Risk Model of Long-Term Production Scheduling in Open Pit Gold Mining

Risk Model of Long-Term Production Scheduling in Open Pit Gold Mining Rsk Model of Long-Term Producton Schedulng n Open Pt Gold Mnng R Halatchev 1 and P Lever 2 ABSTRACT Open pt gold mnng s an mportant sector of the Australan mnng ndustry. It uses large amounts of nvestments,

More information

Luby s Alg. for Maximal Independent Sets using Pairwise Independence

Luby s Alg. for Maximal Independent Sets using Pairwise Independence Lecture Notes for Randomzed Algorthms Luby s Alg. for Maxmal Independent Sets usng Parwse Independence Last Updated by Erc Vgoda on February, 006 8. Maxmal Independent Sets For a graph G = (V, E), an ndependent

More information

RequIn, a tool for fast web traffic inference

RequIn, a tool for fast web traffic inference RequIn, a tool for fast web traffc nference Olver aul, Jean Etenne Kba GET/INT, LOR Department 9 rue Charles Fourer 90 Evry, France Olver.aul@nt-evry.fr, Jean-Etenne.Kba@nt-evry.fr Abstract As networked

More information