A Novel Method for Data Mining and Classification based on Ensemble Learning

First Author
Neijiang Normal University, Sichuan Neijiang 641112, China, E-mail: lihan-gege@126.com

Abstract

Data mining has received great attention in the information industry, mainly because the industry holds large stores of broadly applicable data that urgently need to be transformed into useful information and knowledge. This paper concerns the classification problem of data mining and designs an ensemble KNN classifier based on distance learning. The classifier first filters uncorrelated attributes out of the data set based on information gain, removing redundant attributes with a low degree of correlation. Then, through bagging integration, the generated classifier not only carries out random selection of training samples through bootstrap sampling, but also randomly filters the attributes of each sub-classifier, enhancing the differences between sub-classifiers while guaranteeing their accuracy. During distance learning, the neighborhood components analysis method is used to optimize the metric under leave-one-out cross validation. The experimental data show that the classification effect of the new classifier is significantly better than that of traditional KNN and single distance-learning classification methods.

Keywords: Ensemble Learning, Data Mining, Bagging Method, k-Nearest Neighbor Algorithm

1. Introduction

Driven by technological development and practical working requirements, database technology is applied to store and manage data, and machine learning technology is used to analyze it. A great deal of knowledge hidden in data has thus been discovered, and the analysis and reorganization of these data finally formed a research field of its own: Knowledge Discovery in Databases (KDD) [1]. KDD generally refers to discovering patterns or relationships in source data. KDD covers the whole process, from the initial business objective to the final result analysis, while data mining describes only the sub-process of applying mining algorithms to the data. Recently, however, people have found that much of the work in data mining can be accomplished by statistics, and they consider the best strategy to be an organic combination of statistics with data mining [2]; data mining technology remains the most critical step in KDD.

Classification technology is an important branch of data mining. Many methods apply to classification in data mining, such as decision trees, genetic algorithms, Bayesian networks, rough sets, KNN and association rules [3-8]. In the ensemble classifier designed in this paper, average information gain is calculated during data preprocessing to filter uncorrelated attributes, and the basic bagging ensemble method generates sub-classifiers and synthesizes their results, where each sub-classifier is an improved KNN classifier.

As one of the most traditional classification algorithms in data mining, KNN is still broadly applied in many areas due to its simplicity, efficiency and nonparametric quality [9]. To address the disadvantages of KNN, several effective improvements have been put forward. On one hand, many learning algorithms for KNN distance metrics have been proposed, aimed at the limitation of a single fixed KNN distance measure.
Representative algorithms include MLCC (Metric Learning by Collapsing Classes), LMNN (Distance Metric Learning for Large Margin Nearest Neighbor Classification) and NCA (Neighborhood Components Analysis). These three distance learning methods [10-12] apply different mathematical models; the trained metric narrows the distance between training samples of the same class while enlarging the distance between samples of different classes. On the other hand, the KNN algorithm has been integrated with other algorithms to improve
classification effect, including integrating SVM with the KNN algorithm, integrating genetic algorithms with the fuzzy KNN algorithm, and integrating Bayesian classifiers with the KNN classifier. Following both lines of improvement, this paper integrates the bagging algorithm with a distance-learning KNN classifier and puts forward an ensemble KNN classifier based on distance learning.

2. Related Study

2.1. KNN Classification Algorithm

KNN, which is based on statistics, assigns a test sample to the category held by the majority of its k nearest neighbors in the feature space. The basic method is: all examples are placed in an n-dimensional space, and each example x is usually expressed as a feature vector {a_1(x), a_2(x), ..., a_n(x)}, where a_r(x) denotes the value of the r-th attribute. The similarity between two examples x_i and x_j is commonly calculated by the Euclidean distance:

d(x_i, x_j) = \sqrt{ \sum_{r=1}^{n} ( a_r(x_i) - a_r(x_j) )^2 }    (1)

KNN is a kind of weak classifier, for it is very sensitive to the curse of dimensionality. To avoid the measurement errors caused by the similarity measure, [13] adds a feature weight to each attribute as an improvement; that is, different attributes have different influence on classification. The improved distance is calculated as

d(x_k, c_i) = \sqrt{ \sum_{j=1}^{m} ( w_j (x_{kj} - c_{ij}) )^2 }    (2)

A meaningful way to improve the KNN algorithm is to reduce the classification error as much as possible by adjusting the weights w_j. The similarity between vectors e_p and e_q in the weighted similarity matrix can be expressed as:

w_{pq} = \frac{1}{1 + d_{pq}^{w}}    (3)

The w_{pq} obtained by optimization under certain criteria is used as the new similarity. This method helps improve the classification results: it reduces the similarity between different classes while increasing the similarity within the same class.
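To make these measures concrete, the following is a minimal NumPy sketch of formulas (1)-(3); the weight vector w is an illustrative placeholder, not one produced by the optimization in [13].

```python
import numpy as np

def euclidean_distance(x_i, x_j):
    """Formula (1): plain Euclidean distance between two feature vectors."""
    return np.sqrt(np.sum((x_i - x_j) ** 2))

def weighted_distance(x_k, c_i, w):
    """Formula (2): feature-weighted distance; attributes with larger
    weights w_j influence the metric more strongly."""
    return np.sqrt(np.sum((w * (x_k - c_i)) ** 2))

def weighted_similarity(e_p, e_q, w):
    """Formula (3): similarity derived from the weighted distance."""
    return 1.0 / (1.0 + weighted_distance(e_p, e_q, w))

# Toy usage with made-up vectors and weights.
x1 = np.array([5.0, 2.0, 1.5])
x2 = np.array([4.0, 2.5, 0.5])
w = np.array([0.2, 0.5, 0.3])  # hypothetical weights, not learned
print(euclidean_distance(x1, x2))
print(weighted_distance(x1, x2, w))
print(weighted_similarity(x1, x2, w))
```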
2.2. Ensemble Learning Algorithms

Ensemble learning is a technology used to improve the accuracy of classification algorithms; it developed gradually out of the field of machine learning. Ensemble learning mainly includes two types of methods: bagging and boosting. This paper takes advantage of the bagging method for KNN classification because distance learning is needed for each sub-classifier: if a sub-classifier had a large number of attributes and samples, the distance learning process would be extremely time-consuming. The formation of each sub-classifier in the bagging method is independent of the others, which makes it possible to balance the total metric-learning workload of the sub-classifiers over multiple threads or parallel processing, so the bagging algorithm can simplify part of the calculation.

Bagging generates classifiers by sampling with replacement, also called bootstrap sampling. In this method, the differences between ensemble members are obtained by bootstrap resampling, that is, by the randomness and independence of the training samples. The bagging method is mainly used for unstable learning algorithms such as neural networks and decision trees; for example, bagging reduces the variance produced by the base classifiers by voting over their predictions, and thereby reduces the generalization error. For stable learning algorithms, such as the naive Bayes method, bagging integration cannot decrease the generalization error.

The algorithm can be described as follows. Let the original training set be D = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}, where N is the number of training samples.

Training phase:
  For t = 1, 2, ..., T do    // T is the number of individuals in the bagging ensemble
    (1) extract m inputs from the training set randomly with replacement;
    (2) obtain model h_t in accordance with the given learning algorithm;
    (3) put the training samples back.
  Return the collection (h_1, h_2, ..., h_T).

Prediction phase: for regression the outputs of the members are averaged, and for classification the majority vote is taken:

H(x) = \frac{1}{T} \sum_{t=1}^{T} h_t(x)    (4)

H(x) = \arg\max_{y \in Y} \sum_{t=1}^{T} I( h_t(x) = y )    (5)
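As a compact sketch of this training and prediction scheme, the code below bags a plain 1-NN base learner in NumPy. The bootstrap size m is taken equal to N and the ensemble size T is an illustrative choice; the learned metrics and attribute perturbation that the paper adds later are deliberately omitted here.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=1):
    """Classify x by majority vote among its k nearest training samples."""
    d = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    nearest = y_train[np.argsort(d)[:k]]
    return np.bincount(nearest).argmax()

def bagging_train(X, y, T=50, seed=0):
    """Training phase: draw T bootstrap samples of the training set."""
    rng = np.random.default_rng(seed)
    N = len(X)
    members = []
    for _ in range(T):
        idx = rng.integers(0, N, size=N)  # N draws with replacement (bootstrap)
        members.append((X[idx], y[idx]))  # KNN is lazy: the sample is the model
    return members

def bagging_predict(members, x, k=1):
    """Prediction phase, formula (5): majority vote over the T members."""
    votes = [knn_predict(Xt, yt, x, k) for Xt, yt in members]
    return np.bincount(votes).argmax()

# Toy usage on a tiny two-class data set.
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
members = bagging_train(X, y, T=50)
print(bagging_predict(members, np.array([0.95, 0.9])))  # expected: 1
```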
3. Ensemble KNN Classifier based on Distance Learning

3.1. Algorithm Process

Suppose a labeled training data set {(x_i, y_i)}_{i=1}^{n} contains n samples, each sample possesses d attributes (x_i ∈ R^d), and the data set has c classes, y_i ∈ {1, 2, ..., c}. This paper introduces an ensemble KNN classification algorithm based on distance learning. First, attribute filtering eliminates the attributes whose correlation is smaller than a threshold value. Then the bagging integration algorithm generates sub-classifiers, each of which is acquired by random attribute selection on the basis of the original data set. In the end, distance learning is carried out for each sub-classifier to calculate a distance measurement model specific to it, and after learning, the models classify the testing samples. The algorithm process mainly includes initial attribute filtering, the bagging integration algorithm and the distance-learning KNN algorithm; the process is shown in Figure 1.

Figure 1. Ensemble KNN classifier based on distance learning. (The training data set is filtered by attribute filtration under threshold f; bootstrap sampling generates sub-classifiers 1..t, whose attributes are randomly eliminated under the attribute disturbance parameter; each sub-classifier learns its own distance measurement A_i and produces classification results S_i; a majority voting system synthesizes these into the final classification result.)

3.2. Attribute Filtration of the Training Dataset

In the initial data set, many attributes are not correlated with the learning target, and they interfere with learning from the correlated attributes. Because input attributes are randomly selected to generate the sub-attribute sets that train the sub-classifiers, the influence of uncorrelated attributes is amplified, which reduces the accuracy of the classifiers. Because of this, uncorrelated attributes need to be filtered out before integration. This paper adopts a method based on information gain to filter attributes. Specifically, the information gain of every original attribute in the data set is calculated, and any attribute whose information gain is smaller than a specific threshold f is removed as uncorrelated. Here, one third of the average information gain of the attributes is taken as the threshold f for attribute filtration.
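The filtering step can be sketched as below: the information gain of each attribute is computed and only attributes reaching the threshold f are kept. The sketch assumes discrete (or pre-binned) attribute values; the numeric attributes in the paper's data sets would first need discretization.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy H(Y) of a label vector."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(feature, labels):
    """IG(A) = H(Y) - sum_v P(A=v) * H(Y | A=v) for a discrete attribute A."""
    gain = entropy(labels)
    for v in np.unique(feature):
        mask = feature == v
        gain -= mask.mean() * entropy(labels[mask])
    return gain

def filter_attributes(X, y):
    """Keep the attributes whose information gain reaches the threshold
    f = one third of the average gain over all attributes."""
    gains = np.array([information_gain(X[:, j], y) for j in range(X.shape[1])])
    f = gains.mean() / 3.0
    return np.flatnonzero(gains >= f)  # indices of retained attributes

# Toy usage: the first attribute is informative, the second is pure noise.
X = np.array([[0, 1], [0, 0], [1, 1], [1, 0]])
y = np.array([0, 0, 1, 1])
print(filter_attributes(X, y))  # expected: [0]
```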
3.3. Bagging Methods

To establish an effective ensemble, each sub-classifier needs to have high accuracy, and the differences among the classifiers should be large. The KNN method is relatively hard to ensemble with the bagging algorithm. For example, given a data set with N samples, the number of times a sample is selected approximately obeys a Poisson distribution with λ = 1, so the probability that a sample appears at least once is 1 - 1/e ≈ 0.632. Assume a two-class problem for which t sub-classifiers are generated. The classification result of a testing sample changes only when some training sample appears in its neighborhood fewer than t/2 times, and this probability obviously keeps decreasing as t increases. Therefore sub-classifiers that are both accurate and diverse cannot be acquired solely by bootstrap methods that randomly select training samples.

The bagging method used in this paper to generate sub-classifiers absorbs the advantage of FASBIR in [14] by adding disturbance to the input attributes. The process of generating sub-classifiers is: on the one hand, samples are drawn from the original data set by the bootstrap method to compose each sub-classifier; on the other hand, the proportion of sub-classifier attributes is controlled by a parameter, and part of the attributes are eliminated randomly. Assume the number of generated sub-classifiers is t; during the process, the attributes of these t sub-classifiers are randomly eliminated again, with a given parameter factor s controlling the number of eliminated attributes. If d attributes remain after the original data set is filtered, each sub-classifier has d·s attributes. This kind of method can improve the classification effect, which is determined by the differences among the sub-classifiers it generates.

3.4. Distance Measurement Learning

We introduce a distance-learning KNN classifier based on neighborhood components analysis (NCA) to improve the classification accuracy of each sub-classifier; the following formulas present the specific realization of the distance learning algorithm. Suppose a labeled sub-classifier contains n real input vectors x_1, ..., x_n, x_i ∈ R^d, with corresponding class labels c_1, ..., c_n. We choose a distance metric that makes the classification effect of KNN reach its best value. The distance learning method adopts the Mahalanobis distance model and seeks a positive semidefinite matrix Q = A^T A:

d(x, y) = (x - y)^T Q (x - y) = (Ax - Ay)^T (Ax - Ay)    (6)

The error rate of leave-one-out validation is a discontinuous function of the variable matrix A: a very tiny change of A can greatly affect the testing results of KNN's leave-one-out validation. Thus, this algorithm introduces a differentiable nonlinear function: each point i takes another point j as its neighbor and inherits its class label with probability p_ij. On the basis of the Mahalanobis distance, the nonlinear softmax function is used to define p_ij as:

p_{ij} = \frac{ \exp( -\| A x_i - A x_j \|^2 ) }{ \sum_{k \neq i} \exp( -\| A x_i - A x_k \|^2 ) }, \quad p_{ii} = 0    (7)

Under this random selection, the probability p_i of correct classification of point i is

p_i = \sum_{j \in C_i} p_{ij}    (8)

where C_i is the set of points in the same class as i. The best classification effect is reached when the expected number of correctly classified points is highest, so we set the objective function as:

f(A) = \sum_i p_i = \sum_i \sum_{j \in C_i} p_{ij}    (9)

This is a matrix function; the NCA method optimizes it through the conjugate gradient method or quadratic programming and obtains the optimal solution A, yielding the learned distance metric. Besides, the objective function can be differentiated with respect to A: with x_{ij} = x_i - x_j, the gradient is

\frac{\partial f}{\partial A} = -2A \sum_i \sum_{j \in C_i} p_{ij} \Bigl( x_{ij} x_{ij}^T - \sum_k p_{ik} x_{ik} x_{ik}^T \Bigr)    (10)

which, after simplification, becomes

\frac{\partial f}{\partial A} = 2A \sum_i \Bigl( p_i \sum_k p_{ik} x_{ik} x_{ik}^T - \sum_{j \in C_i} p_{ij} x_{ij} x_{ij}^T \Bigr)    (11)

This paper makes use of the conjugate gradient method described in [15] to calculate the maximum value of formula (9) and acquire the optimal matrix A. The acquired new distance measure d(x, y) = (x - y)^T Q (x - y), with Q = A^T A so that d(x, y) = (Ax - Ay)^T (Ax - Ay), is then used as the distance measure of the KNN algorithm to carry out classification.
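For reference, a direct NumPy transcription of formulas (7)-(9) is given below; it only evaluates the objective f(A), leaving the maximization itself to an external routine. In practice, a generic optimizer such as scipy.optimize.minimize applied to -f(A) (with A flattened) could stand in for the conjugate gradient routine of [15].

```python
import numpy as np

def nca_objective(A, X, y):
    """f(A) = sum_i p_i (formula 9), with softmax neighbour
    probabilities p_ij (formula 7) under the metric Q = A^T A."""
    Z = X @ A.T                                    # projected samples A x_i
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)                   # enforce p_ii = 0
    P = np.exp(-d2)
    P /= P.sum(axis=1, keepdims=True)              # softmax over k != i
    same_class = y[:, None] == y[None, :]          # the sets C_i
    return (P * same_class).sum()                  # expected correct points

# Toy usage: evaluate the identity metric on a tiny labelled set.
X = np.array([[0.0, 0.0], [0.1, 0.1], [1.0, 1.0], [1.1, 0.9]])
y = np.array([0, 0, 1, 1])
print(nca_objective(np.eye(2), X, y))
```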
4. Experiment Result Analysis

Five groups of commonly used UCI data sets are applied to examine the classification effect of this algorithm, and the data in these data sets are all numeric. Since KNN cannot deal with missing data, samples with missing values are removed before the data sets are used. The information of the data sets is shown in the following table.

Table 1. Five data sets used in the experiment

Dataset   Samples  Features  Categories
Wine      178      13        3
Segment   200      19        7
Iris      150      4         3
Bal       625      4         3
Ion       351      35        2

Leave-one-out cross validation is used to calculate the classification accuracy on each data set. That is, for a data set with n samples, each sample is selected in turn as the testing sample, and the remaining n-1 samples make up the training set; the same steps are repeated n times so that every sample acts as the testing sample once. In the end, the accuracy is the fraction of the n held-out samples that are classified correctly.
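The evaluation loop just described can be sketched as below; train_fn and predict_fn are hypothetical hooks standing in for any of the compared classifiers.

```python
import numpy as np

def leave_one_out_accuracy(X, y, train_fn, predict_fn):
    """Hold out each sample once, train on the remaining n-1 samples,
    and return the fraction of held-out samples classified correctly."""
    n = len(X)
    correct = 0
    for i in range(n):
        mask = np.arange(n) != i
        model = train_fn(X[mask], y[mask])
        correct += int(predict_fn(model, X[i]) == y[i])
    return correct / n

# Example with a trivial 1-NN "model" (the training data itself).
train_fn = lambda Xt, yt: (Xt, yt)
predict_fn = lambda m, x: m[1][np.argmin(((m[0] - x) ** 2).sum(axis=1))]
X = np.array([[0.0], [0.2], [1.0], [1.2]])
y = np.array([0, 0, 1, 1])
print(leave_one_out_accuracy(X, y, train_fn, predict_fn))  # expected: 1.0
```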
First, different values of K are set to compare the classification effect of the traditional KNN and the distance-learning KNN algorithms on these five data sets. The number of sub-classifiers for ensemble learning is set to 50, the default value of K is 1, and the attribute filtering coefficient is 0.33. Because the data sets differ, the attribute extraction coefficient s is 0.6 on the Ion data set, 0.8 on the Segment data set, and 1 on the Wine, Iris and Bal data sets. The results of leave-one-out cross validation are shown in the following tables.

Table 2. Comparison results on Wine

Wine                        K=1     K=3     K=5     K=10
Traditional KNN             0.9494  0.9663  0.9494  0.9551
Single distance learning    0.9888  0.9831  0.9831  1.0000
Single ensemble             0.9831  0.9888  0.9775  0.9775
Ensemble distance learning  1.0000  1.0000  1.0000  1.0000

Table 3. Comparison results on Segment

Segment                     K=1     K=3     K=5     K=10
Traditional KNN             0.9690  0.9557  0.9514  0.9448
Single distance learning    0.9748  0.9700  0.9690  0.9548
Single ensemble             0.9633  0.9513  0.9528  0.9681
Ensemble distance learning  0.9787  0.9654  0.9631  0.9741

Table 4. Comparison results on Iris

Iris                        K=1     K=3     K=5     K=10
Traditional KNN             0.9133  0.9467  0.9133  0.9448
Single distance learning    0.9067  0.9267  0.9333  0.9533
Single ensemble             0.9467  0.9400  0.9467  0.9667
Ensemble distance learning  0.9867  0.9800  0.9601  0.9667

Table 5. Comparison results on Bal

Bal                         K=1     K=3     K=5     K=10
Traditional KNN             0.9133  0.9467  0.9133  0.9448
Single distance learning    0.9067  0.9267  0.9333  0.9533
Single ensemble             0.9467  0.9400  0.9467  0.9667
Ensemble distance learning  0.9867  0.9800  0.9601  0.9667

Table 6. Comparison results on Ion

Ion                         K=1     K=3     K=5     K=10
Traditional KNN             0.8476  0.8775  0.9023  0.7136
Single distance learning    0.9861  0.8917  0.8946  0.8860
Single ensemble             0.9377  0.8803  0.8860  0.8945
Ensemble distance learning  0.9402  0.9173  0.9145  0.9144

Next, each data set is processed by five different algorithms and the results are compared: the traditional KNN classifier, the improved KNN classifier with Mahalanobis distance, the multimodal-perturbation ensemble algorithm FASBIR, the KNN classifier after NCA distance learning, and the ensemble learning KNN classifier introduced in this paper. With the same coefficient values as above, the classification results are shown in the following table.

Table 7. Comparison of the five methods

         KNN     Improved KNN  FASBIR  NCA     NEW
Wine     0.9494  0.9438        0.9785  0.9888  1.0000
Segment  0.9690  0.9697        0.9715  0.9748  0.9928
Iris     0.9133  0.9267        0.9254  0.9067  0.9867
Bal      0.7248  0.7344        0.8145  0.9200  0.9648
Ion      0.8476  0.8775        0.8753  0.8860  0.9401
Average  0.8802  0.8904        0.9022  0.9353  0.9769

From the experimental data, we can see that the new algorithm, which combines the ensemble algorithm and the NCA distance learning algorithm, clearly improves the classification effect compared with either the single ensemble algorithm or the single distance learning algorithm. On some data sets
such as Bal, distance learning improves the classification effect especially clearly. However, the improvement in classification effect brought by the new algorithm comes at a higher computational cost, especially when leave-one-out cross validation is used: the number of times distance learning has to be performed on a data set equals the number of sub-classifiers. Moreover, on large-scale data sets with many samples and attributes, NCA distance learning becomes slow. Besides, many data sets mix numeric data with character data, and the distance model of this algorithm's KNN classifier cannot classify them. Thus, on one hand, the calculation speed of this algorithm needs further improvement; on the other hand, for data sets of mixed type, the distance measurement of the KNN classifier needs to be reconstructed so as to be adapted to data sets with character information.

5. Conclusion

This paper realizes an ensemble KNN classifier based on distance learning. The classifier first filters uncorrelated attributes out of the data set based on information gain, removing redundant attributes with a low degree of correlation. Then, through bagging integration, the generated classifier not only carries out random selection of training samples but also randomly filters the attributes of each sub-classifier, enhancing the differences between sub-classifiers while guaranteeing their accuracy. Every sub-classifier of the bagging integration algorithm is a KNN classifier that performs distance learning based on neighborhood components analysis, and the learned distance measure is applied to calculate its classification results. In the end, a majority voting system synthesizes the classification results into the final judgment. The experimental results show that, since the ensemble algorithm is applied and each sub-classifier carries out distance learning, the classifier put forward in this paper improves the classification effect clearly compared with a single ensemble learning algorithm or a single distance learning algorithm.

6. References

[1] Holmström, Hampus, "Estimation of single-tree characteristics using the kNN method and plotwise aerial photograph interpretations", Forest Ecology and Management, vol.167, no.1-3, pp.303-314, 2002.
[2] Zhu Jianping, "Data Compression of Transactional Database Attribute Item in Data Mining", Statistics & Information Forum, vol.26, no.5, pp.136-141, 2006.
[3] Reza Entezari-Maleki, Arash Rezaei, and Behrouz Minaei-Bidgoli, "Comparison of Classification Methods Based on the Type of Attributes and Sample Size", JCIT, vol.4, no.3, pp.94-102, 2009.
[4] Tan Junshan, He Wei, Qing Yan, "Application of genetic algorithm in data mining", Proceedings of the 1st International Workshop on Education Technology and Computer Science, ETCS, pp.353-356, 2009.
[5] Li Yanmei, Zhang Zhuoku, "Data Mining Based on Bayesian Networks", Computer Simulation, vol.18, no.2, pp.560-564, 2008.
[6] Yitian Xu, Haozhi Zhang, Laisheng Wang, "Rough Margin-Based Linear ν Support Vector Machine", JCIT, vol.5, no.8, pp.226-232, 2010.
[7] Baek SeongJoon, Sung KoengMo, "Fast K-nearest-neighbour search algorithm for nonparametric classification", Electronics Letters, vol.36, no.21, pp.1821-1822, 2000.
[8] Liu JianJiang, "Research on Algorithms of Mining Association Rules with Weighted Items", Journal of Computer Research and Development, vol.9, no.10, pp.178-182, 2002.
[9] Hosein Alizadeh, Behrouz Minaei-Bidgoli and Saeed K. Amirgholipour, "A New Method for Improving the Performance of K Nearest Neighbor using Clustering Technique", JCIT, vol.4, no.2, pp.84-92, 2009.
[10] Wang Jun, Woznica Adam, Kalousis Alexandros, "Learning neighborhoods for metric learning", Lecture Notes in Computer Science, vol.7523, no.1, pp.223-236, 2012.
[11] Yoo, SungGoo, Chong, KilTo, "Obstacle avoidance system using a single camera and LMNN fuzzy controller", Journal of Institute of Control, Robotics and Systems, vol.15, no.2, pp.192-197, 2009.
[12] Manit, Jirapong, Youngkong, Prakarnkiat, "Neighborhood components analysis in sEMG signal dimensionality reduction for gait phase pattern recognition", Proceedings of the 6th International Conference on Broadband Communications and Biomedical Applications, pp.86-90, 2011.
[13] Shen Chuanhe, Wang Xiangrong, Yu Di, "Feature weighting of support vector machines based on derivative saliency analysis and its application to financial data mining", International Journal of Advancements in Computing Technology, vol.4, no.1, pp.199-206, 2012.
[14] Zhou, ZhiHua, Yu Yang, "Ensembling local learners through multimodal perturbation", IEEE Transactions on Systems, Man, and Cybernetics, vol.35, no.4, pp.725-735, 2005.
[15] Li Huaxin, "Conjugate Gradient Applied to Image Reconstruction", CT Theory and Applications, vol.16, no.2, 2007.