Proceedings of the 2011 International Conference on Wavelet Analysis and Pattern Recognition, Guilin, 10-13 July, 2011

CLASSIFYING FEATURE DESCRIPTION FOR SOFTWARE DEFECT PREDICTION

LING-FENG ZHANG, ZHAO-WEI SHANG
College of Computer Science, Chongqing University, Chongqing 400030, China
E-MAIL: zhanglingfeng@cqu.edu.cn, szw@cqu.edu.cn

Abstract:
To overcome the limitation of the numeric feature description of software modules in software defect prediction, we propose a novel module description technology which employs classifying features, rather than numerical features, to describe the software modules. Firstly, we construct an independent classifier on each software metric. Then the classification results on each feature are used to represent every module. We apply two different feature classifier algorithms (based on the mean criterion and the minimum error rate criterion, respectively) to obtain the classifying feature description of software modules. By using the proposed description technology, the discrimination of each metric is enlarged distinctly. Also, the classifying feature description is simpler than the numeric description, which would accelerate prediction model learning and reduce the storage space of massive data sets. Experimental results on four NASA data sets (CM1, KC1, KC2 and PC1) demonstrate the effectiveness of classifying feature description and show that our algorithms can significantly improve the performance of software defect prediction.

Keywords:
Feature classifier description; binary classification; software defect prediction

1. Introduction

As software systems grow in size and complexity, it becomes increasingly difficult to maintain the reliability of software products. Usually, software defects are the major factor influencing software reliability. The majority of a system's faults, over 80%, exist in about 20% of its modules, which is known as the 80:20 rule [1]. Thus, the ability to estimate which modules are faulty is extremely important for minimizing cost and improving the effectiveness of the software testing process. Early prediction of the fault-proneness of modules also allows software developers to allocate limited resources to those defect-prone modules, so that highly reliable software can be produced on time and within budget [2].

Software fault-proneness is estimated based on software metrics, which provide quantitative descriptions of software modules. A number of studies provide empirical evidence that correlation exists between some software metrics and fault-proneness [3]. Using those metric features, software defect prediction is usually viewed as a binary classification task, which classifies software modules into fault-prone (fp) and non-fault-prone (nfp). Many machine learning and statistical techniques have been applied to construct prediction models based on the measurement of static code attributes [4][5], for example Discriminant Analysis, Logistic Regression, Regression Trees, Nearest Neighbor (NN), Random Forest, Bayes, Artificial Neural Networks and Support Vector Machines (SVM).

The common point of previous studies is that each module is represented directly by the numeric software metrics, whose attribute types are real, categorical and integral. We call this the numeric feature description. This is also the common description technology used in pattern recognition and machine learning. Nevertheless, due to the complex relationships between software metrics and fault-proneness, this kind of representation limits the classification effectiveness of each software metric. Indeed, if we treat each software metric individually, the values of the metric lack discrimination. Take McCabe's EV(g), a frequently used software metric, for example. Figure 1 shows the value distribution of this metric in the CM1 data set.

Figure 1. Distribution of McCabe's EV(g) metric in the CM1 data set. The numeric values of the metric are divided into 10 intervals ranging from 1 to 30; the vertical axis shows the proportion of values of each class that fall within each interval.
From Figure 1 we can see that the values of the metric overlap in each interval, and the distributions of the two classes are roughly the same. That is, under the numeric description this metric contains little classification information. The same phenomenon also occurs in other software metrics.

To increase the classification effectiveness of each feature, we propose a novel feature description technology named classifying feature description. Firstly, we construct an independent classifier on each software metric. Then the classification results on each feature, called classifying features, are used to represent the software modules. In this paper, we obtain the classifying feature description of software modules by two different feature classifier algorithms, based on the mean criterion and the minimum error rate criterion, respectively. By using classifying features rather than the numeric features, we enjoy the following advantages:

(1) Classifiers on each feature expand the classification effectiveness of each software metric, and may obtain higher classification accuracy. Classifying feature description does not change the feature dimension of each module. As a consequence, standard machine learning prediction techniques, originally designed for numeric features, remain applicable in the classifying feature space.

(2) The classifying feature is simpler, which could accelerate model learning on software defect data sets. Software defect prediction is a standard binary classification problem, so the classification results of each feature classifier are also defined by binary values, which makes further operations more convenient.

(3) Compared with the traditional numeric feature description, the classifying feature description occupies less storage space. The binary classifying feature description of a software module can be viewed as a sparse representation of the numeric features, which makes the storage of massive data sets feasible.

The remainder of this paper is organized as follows. Section 2 introduces the learning model based on numeric software metrics in software defect prediction. The proposed classifying feature description is described in Section 3. The feature classifying algorithms are discussed in Section 4. Section 5 presents the experiments, in which the performance of classifying feature description is tested in detail. Finally, conclusions are presented in Section 6.

2. Software defect prediction learning model

A number of studies provide empirical evidence that correlation exists between some software metrics and fault-proneness. Thus, we can mathematically describe the software defect prediction model as a binary classification task. Let the training set be $S^{tr} = \{X, Y\} = \{(x_i, y_i)\}_{i=1}^{N}$, with $N$ instances, where $x_i \in \mathbb{R}^d$ and $y_i \in \{0, 1\}$. Each instance $x_i$ is represented by $\{x_{i1}, x_{i2}, \ldots, x_{id}\}$, where $x_{ij}$ is the representation of $x_i$ on the $j$-th feature $F_j = \{x_{1j}, x_{2j}, \ldots, x_{Nj}\}$, so $x_i$ can be viewed as a point in the $d$-dimensional feature space. $y_i$ denotes the class label associated with $x_i$, fault-prone or non-fault-prone. The attribute types of each feature can be real, categorical and integral.

Traditionally, prediction models are constructed on this numeric feature space directly. For a test module $x = (x_1, x_2, \ldots, x_d)$, we obtain the prediction result $y$ (fault-prone or non-fault-prone) from the model determined by the training set. Figure 2(a) shows the traditional learning process of the model.

The success of prediction model learning relies mainly on the representation technology by which the modules are described, as well as on the prediction model operating on the training set. Various aspects of the prediction model have been studied based on machine learning strategies; however, the equally important representation technology is mostly ignored in the existing literature. In fact, a suitable software module description acts as the basis for establishing a successful prediction model. To increase the classification effectiveness of each metric, we propose a novel module representation technology called classifying feature, which is obtained by the feature classifiers constructed on each software metric.
The implementation process of the software defect prediction model based on classifying features is shown in Figure 2(b).

Figure 2. Software defect prediction learning model: (a) the traditional learning process on numeric features; (b) the learning process based on classifying features.

Using the feature classifiers, both the training module set and the test modules are described by classifying features before a prediction model is built.
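The paper gives this workflow only as a diagram (Figure 2(b)); the following Python sketch is our own minimal rendering of it, not the authors' code. It assumes a scikit-learn style predictor (nearest neighbour here), uses a tiny synthetic array in place of the NASA metric tables, and plugs in a placeholder threshold rule where one of the criteria of Section 4 would normally go; the helper names `binarize` and `placeholder_thresholds` are illustrative.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def binarize(X, thresholds):
    # Classifying feature description: 1 if the metric exceeds its threshold, else 0.
    return (X > thresholds).astype(int)

def placeholder_thresholds(X, y):
    # Stand-in rule (per-feature midpoint of the class means); Section 4 gives the actual criteria.
    return (X[y == 1].mean(axis=0) + X[y == 0].mean(axis=0)) / 2.0

# Tiny synthetic stand-in for a NASA metric table: 8 modules x 3 metrics, binary labels.
rng = np.random.default_rng(0)
X_train = rng.random((8, 3))
y_train = np.array([0, 0, 0, 0, 1, 1, 1, 1])
X_test = rng.random((2, 3))

t = placeholder_thresholds(X_train, y_train)        # learn one threshold per metric on the training set
model = KNeighborsClassifier(n_neighbors=1)
model.fit(binarize(X_train, t), y_train)            # train the standard classifier on classifying features
y_pred = model.predict(binarize(X_test, t))         # describe test modules with the same thresholds
print(y_pred)
```

The point is only the shape of the pipeline: thresholds are learned once on the training modules, and both training and test modules are binarized with the same thresholds before the standard classifier is applied.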
3. Classifying feature description

Let $S^{C} = \{X^{C}, Y\} = \{(x^{C}_i, y_i)\}_{i=1}^{N}$ denote the classifying feature training set, with $N$ instances, where $x^{C}_i \in \{0, 1\}^d$ and $y_i \in \{0, 1\}$. Each instance $x^{C}_i = [x^{C}_{i1}, x^{C}_{i2}, \ldots, x^{C}_{id}]$ is a point in the $d$-dimensional binary classifying feature space. $y_i$ denotes the class label associated with $x^{C}_i$, corresponding to the label of $x_i$. The values of each classifying feature $CF_j = \{x^{C}_{1j}, x^{C}_{2j}, \ldots, x^{C}_{Nj}\}$ are chosen within 0 and 1.

Let $h(x) = \{h_1(x), h_2(x), \ldots, h_d(x)\}$ denote the feature classifiers on each feature. The classification results of $h(x)$ are intended to match the class labels of the original training set, which can be described as:

$h_j(x): x_j \rightarrow \{0, 1\}$   (1)

In other words, $h(x)$ represents the mapping $\Phi$ from the numeric feature space to the classifying feature space:

$\Phi: x \rightarrow h(x)$   (2)

According to the above definition, each instance of the training set obtains a simpler description in which all the classifying features are represented by 0 and 1. Figure 3 shows the implementation process of classifying feature description. The key idea is that classifiers on each feature can help enlarge the discrimination of each metric between the different classes.

Figure 3. Classifying feature description.

Figure 4 shows an ideal classification example under numeric feature description and classifying feature description. In this example, the two classes are represented by stars and diamonds. Each module has two metric features x and y, which are normalized between 0 and 1. Figure 4(a) shows the distribution of the numeric features in the 2D feature space; the two classes are linearly separable. Now, we construct an independent classifier on each feature. A simple threshold classifier is chosen in this example. Let us set 0.5 as the threshold of the two features. If the value of a feature is bigger than the threshold, the classification result is 1; if the value is smaller than the threshold, the classification result is 0. Then we can obtain the classifying feature description of each instance from the binary classification results. Figure 4(b) shows the distribution of the classifying features of the two classes.

Figure 4. Ideal classification problem with (a) numeric features and (b) classifying features.

In this example, we apply a feature threshold classifier to obtain the binary classifying feature representation of each metric. Under the classifying feature description, modules of the two classes are represented as (0,1) and (1,0). From Figure 4 we can see that the classifying feature representation significantly improves the average distance and margin between the two classes. Also, the simple description of 0 and 1 reduces the storage space of the data and makes further operations easier.

4. Feature classifier algorithm

In the binary classification task, the classification result of each feature classifier is either 0 or 1, so all the classifying features are represented by 0 and 1. The aim of the feature classifiers is to expand the difference between the two classes using simple classification rules. In this paper, each feature classifier is defined by a measure on the value of $x_j$, which can be as simple as a threshold classifier. The classifier $h_j(x)$ is defined as follows:

$h_j(x) = \begin{cases} 1 & \text{if } x_j > threshold_j \\ 0 & \text{else} \end{cases}$   (3)

Let $T = \{t_1, t_2, \ldots, t_d\}$ denote the threshold set of the $d$ features determined by the training set. The classification results of the feature classifiers are used as the novel features of each module, so that all the classifying features are represented by 0 and 1. The classifying feature of each training module is defined as:

$x^{C}_{ij} = \begin{cases} 1 & \text{if } x_{ij} > t_j \\ 0 & \text{else} \end{cases}$   (4)
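To make equations (3) and (4) concrete, here is a small worked sketch of the two-feature example around Figure 4. The point coordinates are invented for illustration; the fixed 0.5 threshold matches the example in the text.

```python
import numpy as np

# Two classes in a normalized 2-D metric space (stars vs. diamonds in Figure 4);
# the point values below are invented for illustration.
stars    = np.array([[0.2, 0.7], [0.3, 0.9], [0.1, 0.6]])   # class represented as (0, 1)
diamonds = np.array([[0.8, 0.2], [0.7, 0.4], [0.9, 0.1]])   # class represented as (1, 0)

thresholds = np.array([0.5, 0.5])   # the fixed threshold of the example, one per feature

def classifying_feature(x, t):
    # Equations (3)-(4): h_j(x) = 1 if x_j > t_j, else 0, applied feature by feature.
    return (x > t).astype(int)

print(classifying_feature(stars, thresholds))     # every row becomes [0 1]
print(classifying_feature(diamonds, thresholds))  # every row becomes [1 0]
```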
With the thresholds obtained from the training set, we can also get the classifying feature description of a numeric testing sample. Taking $x = (x_1, x_2, \ldots, x_d)$ for example, its classifying feature description $x^{C}$ is obtained by:

$x^{C}_{j} = \begin{cases} 1 & \text{if } x_{j} > t_j \\ 0 & \text{else} \end{cases}$   (5)

Learning a good threshold plays a crucial role in each feature classifier. In this paper, the optimal threshold of each attribute is determined by two different technologies, the mean criterion and the minimum error rate criterion.

4.1. Mean criterion

This criterion assumes that the values of each feature in the two classes obey a uniform distribution, so the class means of a feature can be used to represent its values in each class. We then choose the midpoint of the two class means of each feature as the classification threshold. Let $X^{+} = \{x^{+}_i, i = 1, \ldots, n^{+}\}$ and $X^{-} = \{x^{-}_i, i = 1, \ldots, n^{-}\}$ denote the fault-prone and non-fault-prone subsets, respectively. The means of each feature $j$ in the two classes are calculated by:

$m^{+}_j = \frac{1}{n^{+}} \sum_{i=1}^{n^{+}} x^{+}_{ij}, \qquad m^{-}_j = \frac{1}{n^{-}} \sum_{i=1}^{n^{-}} x^{-}_{ij}$   (6)

The threshold of the mean criterion is defined as:

$t_j = \frac{m^{+}_j + m^{-}_j}{2}$   (7)

The pseudo-code of the feature classifier algorithm based on the mean criterion is listed in Figure 5.

Feature Classifier Algorithm 1
Input:     X  /* training set */
           Y  /* class labels */
Variables: x_ij  /* the j-th feature of the i-th instance */
           y_i   /* the class label of the i-th instance */
Output:    T    /* thresholds */
           X^C  /* classifying feature training set */
BEGIN
1. for j = 1 to d do
2.   Calculate m_j^+ and m_j^-.
3.   t_j = mean(m_j^+, m_j^-)
4.   x_ij^C = 1 if x_ij > t_j, else 0
5. end for
END

Figure 5. The pseudo-code of the feature classifier algorithm based on the mean criterion.

4.2. Minimum error rate criterion

This criterion is based on a complete search, which ensures that the minimum number of instances in the training set are misclassified. The threshold $t_j$ of feature $j$ is chosen from all the feasible interval values. Firstly, the values of $F_j$ are sorted from small to large, represented as $F^{*}_j = \{x^{*}_{1j}, x^{*}_{2j}, \ldots, x^{*}_{Nj}\}$. Then each interval value of the sorted feature is taken in turn as a candidate threshold, and its error rate $Error_i$ is calculated. The threshold which leads to the minimum error rate is chosen for the feature:

$t_j = threshold_{i^{*}}, \quad i^{*} = \arg\min_i (Error_i)$   (8)

The pseudo-code of the feature classifier algorithm based on the minimum error rate criterion is listed in Figure 6.

Feature Classifier Algorithm 2
Input:     X  /* training set */
           Y  /* class labels */
Variables: x_ij  /* the j-th feature of the i-th instance */
           y_i   /* the class label of the i-th instance */
Output:    T    /* thresholds */
           X^C  /* classifying feature training set */
BEGIN
1.  for j = 1 to d do
2.    F_j^* = {x_1j^*, x_2j^*, ..., x_Nj^*} = sort(F_j, ascending)
3.    for i = 1 to N do
4.      if i = 1
5.        threshold_i = x_1j^* - ε
6.      else if i = N
7.        threshold_i = x_Nj^* + ε
8.      else
9.        threshold_i = 0.5 * (x_ij^* + x_(i+1)j^*)
10.     end if
11.     Error_i1 = (sum_num(x^+_j < threshold_i) + sum_num(x^-_j > threshold_i)) / N
12.     Error_i2 = (sum_num(x^+_j > threshold_i) + sum_num(x^-_j < threshold_i)) / N
13.     Error_i = min(Error_i1, Error_i2)
14.   end for
15.   t_j = threshold_i*, with i* = argmin_i(Error_i)
16.   x_ij^C = 1 if x_ij > t_j, else 0
17. end for
END

Figure 6. The pseudo-code of the feature classifier algorithm based on the minimum error rate criterion.
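The two criteria are specified only as pseudo-code in Figures 5 and 6. The sketch below is our reading of that pseudo-code in plain Python (NumPy assumed), not the authors' implementation: `mean_threshold` follows equations (6)-(7), and `min_error_threshold` performs the complete search of Figure 6, keeping the better of the two class orientations at each candidate cut. The toy metric table at the end is invented for illustration.

```python
import numpy as np

def mean_threshold(f, y):
    # Mean criterion, equations (6)-(7): midpoint of the two class means of feature f.
    return (f[y == 1].mean() + f[y == 0].mean()) / 2.0

def min_error_threshold(f, y, eps=1e-6):
    # Minimum error rate criterion (Figure 6): complete search over candidate thresholds.
    n = len(f)
    fs = np.sort(f)
    best_t, best_err = fs[0] - eps, 1.0
    for i in range(n):
        if i == 0:
            cand = fs[0] - eps                    # below the smallest value
        elif i == n - 1:
            cand = fs[-1] + eps                   # above the largest value
        else:
            cand = 0.5 * (fs[i] + fs[i + 1])      # midpoint of consecutive sorted values
        # Error if fault-prone modules are expected above the threshold ...
        err1 = (np.sum((f <= cand) & (y == 1)) + np.sum((f > cand) & (y == 0))) / n
        # ... or below it; keep whichever orientation misclassifies fewer modules.
        err2 = (np.sum((f > cand) & (y == 1)) + np.sum((f <= cand) & (y == 0))) / n
        err = min(err1, err2)
        if err < best_err:
            best_t, best_err = cand, err
    return best_t

def fit_thresholds(X, y, rule):
    # One threshold per metric, giving the threshold set T = {t_1, ..., t_d}.
    return np.array([rule(X[:, j], y) for j in range(X.shape[1])])

# Example on a toy metric table (invented values): 6 modules, 2 metrics.
X = np.array([[1., 0.2], [2., 0.1], [3., 0.3], [7., 0.8], [8., 0.9], [9., 0.7]])
y = np.array([0, 0, 0, 1, 1, 1])
print(fit_thresholds(X, y, mean_threshold))       # midpoints of the class means, e.g. [5.  0.5]
print(fit_thresholds(X, y, min_error_threshold))  # cuts that separate the two groups with zero error
```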
So far, we have constructed classifiers on the features and obtained the classifying feature description of the software modules by the two feature classifier algorithms. Figure 7 shows the distribution of the classifying feature of McCabe's EV(g) in the CM1 data set.

Figure 7. Distribution of the classifying feature of McCabe's EV(g): (a) based on the mean criterion; (b) based on the minimum error rate criterion.

Compared with the distribution of the numeric feature in Figure 1, the classifying features of the metric are much more separable between the two classes. Especially under the description based on the minimum error rate criterion, if we treat this metric as the single feature of the modules, only a few modules are misclassified. The key idea of classifying feature description is to improve the classification performance of the features through the classifiers constructed on them.

5. Experiments

In this section, we evaluate the effectiveness of the proposed Classifying Feature (CF) description and of the two feature classifier algorithms, denoted CF1 (based on the mean criterion) and CF2 (based on the minimum error rate criterion). The experiments are conducted on 4 benchmark data sets from the NASA data sets (KC2, KC1, CM1 and PC1), which are publicly accessible from the NASA IV&V Facility Metrics Data Program. Each data set contains twenty-one metrics as features and the associated dependent Boolean variable, fault-prone or non-fault-prone.

The performance of software defect prediction is typically evaluated using a confusion matrix, which is shown in Table 1. In this section we use the commonly used performance measures: accuracy, precision, recall and F-measure. The prediction results of three algorithms (NN, Bayes and SVM) on the four data sets are shown in Table 2, Table 3, Table 4 and Table 5 (in %).

Table 1. Prediction results (confusion matrix)

                           Predicted
  Actual                   Non-Fault-Prone        Fault-Prone
  Non-Fault-Prone          TN (True Negative)     FP (False Positive)
  Fault-Prone              FN (False Negative)    TP (True Positive)
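The four measures are used without explicit formulas; the sketch below states their conventional definitions in terms of the Table 1 entries (TP, FP, FN, TN). These are the standard textbook formulas, and the example counts are invented, not taken from the NASA experiments.

```python
def scores(tp, fp, fn, tn):
    # Standard definitions in terms of the confusion matrix of Table 1.
    accuracy  = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall    = tp / (tp + fn) if (tp + fn) else 0.0   # proportion of fault-prone modules detected
    f_measure = (2 * precision * recall / (precision + recall)
                 if (precision + recall) else 0.0)     # harmonic mean of precision and recall
    return accuracy, precision, recall, f_measure

# Illustrative counts only (not from the NASA experiments):
print([round(100 * s, 2) for s in scores(tp=40, fp=15, fn=10, tn=35)])
```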
Table 2. Prediction results on the CM1 data set

            Accuracy   Precision  Recall     F-measure
NN          62.2530    63.7050    62.667     62.868
NN+CF1      69.805     73.7753    65.9792    68.4305
NN+CF2      71.2872    70.766     77.458     73.453
Bayes       65.63      76.3280    48.9375    56.0973
Bayes+CF1   67.2842    75.2083    54.7083    62.228
Bayes+CF2   69.9048    69.4953    76.0208    71.8738
SVM         57.4464    50.0080    88.667     63.3539
SVM+CF1     70.7560    70.6232    74.6875    71.8267
SVM+CF2     70.5476    70.4383    75.3958    72.068

Table 3. Prediction results on the KC1 data set

            Accuracy   Precision  Recall     F-measure
NN          71.7895    73.879     70.3292    71.945
NN+CF1      73.457     77.7793    67.8858    72.3474
NN+CF2      75.9856    75.0230    80.6224    77.4708
Bayes       65.6968    82.397     43.095     56.950
Bayes+CF1   70.0498    78.9030    57.7469    66.547
Bayes+CF2   73.806     73.8372    76.7078    74.966
SVM         69.8030    63.8792    98.430     77.256
SVM+CF1     74.0659    75.4972    74.3673    74.7933
SVM+CF2     74.3654    72.27      82.4023    76.797

Table 4. Prediction results on the KC2 data set

            Accuracy   Precision  Recall     F-measure
NN          69.420     70.955     67.7644    68.97
NN+CF1      78.0460    82.29      72.9567    76.9966
NN+CF2      76.4909    76.6820    77.9327    76.9494
Bayes       69.6679    87.0080    47.6442    60.7960
Bayes+CF1   75.8396    85.2068    63.983     72.6060
Bayes+CF2   77.8562    78.6577    78.490     78.509
SVM         63.687     58.5960    95.600     72.5795
SVM+CF1     77.5370    78.4068    77.572     77.7266
SVM+CF2     77.7838    76.8082    81.5385    78.833

Table 5. Prediction results on the PC1 data set

            Accuracy   Precision  Recall     F-measure
NN          62.24      63.4266    62.2556    62.3820
NN+CF1      67.76      72.3669    59.299     64.0422
NN+CF2      73.965     74.8950    74.044     73.9790
Bayes       65.0083    81.9888    39.6992    51.8326
Bayes+CF1   65.4569    74.550     50.2726    59.222
Bayes+CF2   67.699     68.938     67.4530    67.4687
SVM         58.6220    87.008     34.6992    36.2836
SVM+CF1     65.7736    67.903     63.4868    64.8424
SVM+CF2     69.303     68.743     73.9944    70.7430

From Table 2 to Table 5, on all four data sets the classifying feature description achieves the highest results in both accuracy and F-measure under the three classifiers. In precision, except for Bayes, the classifying feature description performs better than the numeric feature description. In recall, the classifying feature achieves higher prediction results with NN and Bayes (with SVM, higher on PC1 only). Compared to the numeric feature description, the two feature classifier algorithms provide improvement in most of the measurements. To confirm this, we compute the average prediction results over the four data sets, which are shown in Figure 8.

From Figure 8, it is observed that the classifying feature description outperforms the numeric feature description in accuracy under the three classifiers, from 3.6% (Bayes, CF1) to 10.4% (SVM, CF2). The F-measure considers the harmonic mean of precision and recall. It can be observed that the feature classifier algorithms achieve a higher F-measure than the numeric ones; moreover, the F-measures are significantly higher, from 4.0% (NN, CF1) to 16.88% (Bayes, CF2).

Figure 8. Average results over the four data sets: (a) accuracy; (b) precision; (c) recall; (d) F-measure.

6. Conclusions

This paper proposed a novel feature description method for software modules in the prediction of software defects, called classifying feature. The main advantage of this description in comparison to traditional numeric metrics is that the classification effectiveness of each metric is improved effectively. For future work, we will investigate the applicability of the classifying feature description to other domains and generalize it to multi-class classification problems.

Acknowledgements

This paper is supported by Project No. CDJXS08226 and No. CDJRC080009 of the Fundamental Research Funds for the Central Universities and Project No. CSC200BB227 of the Natural Science Foundation of Chongqing.

References

[1] Gondra, I., "Applying machine learning to software fault-proneness prediction", The Journal of Systems and Software, Vol. 81, pp. 186-195, 2008.
[2] Zheng, J., "Cost-sensitive boosting neural networks for software defect prediction", Expert Systems with Applications, Vol. 37, pp. 4537-4543, 2010.
[3] Gill, G. and Kemerer, C., "Cyclomatic complexity density and software maintenance productivity", IEEE Transactions on Software Engineering, Vol. 17, No. 12, pp. 1284-1288, 1991.
[4] Guo, L., Ma, Y., Cukic, B. and Singh, H., "Robust prediction of fault-proneness by random forests", Proceedings of the 15th International Symposium on Software Reliability Engineering (ISSRE'04), pp. 417-428, 2004.
[5] Guang-de, L. and Wen-yong, W., "Research on an educational software defect prediction model based on SVM", Entertainment for Education. Digital Techniques and Systems, LNCS 6249, pp. 215-222, 2010.