DOI: 10.2298/CSIS120130044Z

Clustering based Two-Stage Text Classification Requiring Minimal Training Data

Xue Zhang 1,2 and Wangxin Xiao 3,4

1 Key Laboratory of High Confidence Software Technologies, Ministry of Education, Peking University, Beijing 100871, China
1 School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China
2 Department of Physics, Shangqiu Normal University, Shangqiu 476000, China
jane_zhang@pku.edu.cn
3 Department of Computer Science, Jinggangshan University, Ji'an 343009, China
4 School of Traffic and Transportation Engineering, Changsha University of Science and Technology, Changsha 410114, China
wx.xiao@rioh.cn

Abstract. Clustering has been employed to expand training data in some semi-supervised learning methods. Clustering based methods rest on the assumption that the clusters learned under the guidance of the initial training data can, to some extent, characterize the underlying distribution of the data set. However, our experiments show that whether this assumption holds depends on both the separability of the data set and the size of the training data set. It is often violated on data sets of bad separability, especially when the initial training data are too few. In this case, clustering based methods perform worse. In this paper, we propose a clustering based two-stage text classification approach to address this problem. In the first stage, labeled and unlabeled data are clustered with the guidance of the labeled data. Then a self-training style clustering strategy is used to iteratively expand the training data under the guidance of an oracle or expert. In the second stage, discriminative classifiers are trained with the expanded labeled data set. Unlike other clustering based methods, the proposed clustering strategy can effectively cope with data of bad separability. Furthermore, our framework converts the challenging problem of sparsely labeled text classification into a supervised one. Therefore, supervised classification models, e.g. SVM, can be applied, and techniques proposed for supervised learning, such as feature selection, sampling methods and data editing or noise filtering, can be used to further improve the classification accuracy. Our experimental results demonstrate the effectiveness of the proposed approach, especially when the training data set is very small.

Keywords: text classification, clustering, active semi-supervised clustering, two-stage classification.
ComSIS Vol. 9, No. 4, Special Issue, December 2012

1. Introduction

The goal of automatic text classification is to automatically assign documents to a number of predefined categories. It is of great importance because of the ever-expanding amount of text documents available in digital form in many real-world applications, such as web-page classification and recommendation, and email processing and filtering. Text classification has traditionally been treated as a supervised learning task, and a large number of supervised learning algorithms have been developed, such as Support Vector Machines (SVM) [1], Naïve Bayes [2], Nearest Neighbor [3], and Neural Networks [4]. A comparative study was given in [5]. SVM has been recognized as one of the most effective text classification methods. Furthermore, a number of techniques suitable for supervised learning have been proposed to improve classification accuracy, such as feature selection, data editing or noise filtering, and sampling methods against bias.

A supervised classification model often needs a very large number of training data to generalize well. The classification accuracy of traditional supervised text classification algorithms degrades dramatically as the number of training data in each class decreases. Manually labeling the training data for a machine learning algorithm is a tedious and time-consuming process, and is sometimes impractical (e.g., online web-page recommendation). Correspondingly, one important challenge for automatic text classification is how to reduce the number of labeled documents required for building a reliable text classifier. This leads to an active research problem, semi-supervised learning. A number of semi-supervised text classification methods have been proposed, including Transductive SVM (TSVM) [6], Co-Training [7] and EM [8]. A comprehensive review can be found in [9]. By exploiting the information contained in unlabeled data, these methods obtain considerable improvement over supervised methods when the training data set is relatively small. However, most of these methods adopt an iterative approach that trains an initial classifier based on the distribution of the labeled data. They still face difficulties when the labeled data set is extremely small: because of the high dimensionality, the few labeled data may lie far from the corresponding class centers, so these methods have a poor starting point and accumulate more errors over the iterations.

To address the problem of sparsely labeled text classification, we present a clustering based two-stage text classification method that uses both labeled and unlabeled data. Experimental results on several real-world data sets validate the effectiveness of our proposed approach. Our contributions can be summarized as follows.

- We propose a novel clustering based two-stage classification approach that requires minimal training data to achieve high classification accuracy.
- In order to improve the accuracy of the training data self-labeled by clustering, we propose an active semi-supervised clustering method to cope with data sets of bad separability.
- On the basis of clustering, we convert the challenging problem of sparsely labeled text classification into a supervised one. Thus supervised classification models and techniques suitable for text classification can be used to further improve the overall performance.
- We conduct extensive experiments to validate our approach and study related issues.

The rest of this paper is organized as follows. Section 2 reviews several existing methods. Our clustering method is given in Section 3 with some analysis. The detailed algorithm is then presented in Section 4. Experimental results are presented in Section 5. Section 6 concludes this paper.

2. Related Work

Clustering has been applied in many sub-domains of text classification, including feature compression or extraction [10], semi-supervised learning [11], and clustering in large-scale classification problems [12,13]. In the following we review several related works in which clustering aids classification in the area of semi-supervised learning. A comprehensive review of text classification aided by clustering can be found in [14].

Clustering has been used to extract information from unlabeled data in order to boost the classification task. There are roughly four cases of semi-supervised classification aided by clustering. In particular, clustering is used: (1) to create a training set from the unlabeled set [15], (2) to augment an existing labeled set with new documents from the unlabeled data set [11], (3) to augment the data set with new features [8,16], and (4) to co-train a classifier [17,18]. More recently, simultaneous learning frameworks for clustering and classification have been proposed [19,20].

To make use of unlabeled data, most semi-supervised learning algorithms make, explicitly or implicitly, the so-called cluster assumption: two points are likely to have the same class label if there is a path connecting them that passes through regions of high density only. That is, the decision boundary should lie in regions of low density. Based on the ideas of spectral clustering and random walks, a framework for constructing kernels which implement the cluster assumption was proposed in [21].
Also based on the cluster assumption, [22] applied spectral clustering to represent the labeled and unlabeled data. By clustering unlabeled data with labeled data using probabilistic and fuzzy approaches, [23] proposed a framework to improve the performance of a base classifier with unlabeled data. In text classification, there are often many low-density areas between positive and negative labeled examples because of the high dimensionality and data sparseness. This situation worsens as the number of training data in each class decreases.

The most closely related work is the clustering based text classification (CBC) approach [11]. In CBC, semi-supervised soft k-means is first used to cluster the labeled and unlabeled data into k clusters, where k is set to the number of classes in the classification task. The p% most confident unlabeled examples from each cluster (i.e. the ones nearest to the cluster's centroid) are added to the training data set. Then TSVM is trained on the augmented training data set and the unlabeled data set. Similarly, the p% most confident unlabeled examples from each class (i.e. the ones with the largest margin) are added to the training data set. CBC alternates the clustering step and the classification step until there is no unlabeled data left.

In CBC, in order to guarantee the labeling accuracy, the value of p should be small enough. That is, after the clustering step in each iteration, the training data set is augmented with very few examples. Therefore, the classifier in the following classification step must perform acceptably with a small training data set. This puts a strong constraint on the selected classification models. CBC can hardly perform well with supervised classification models, e.g. SVM, which will be demonstrated later.

The success of CBC is based on the assumption that even when some of the data points are wrongly classified, the most confident data points, i.e. the ones with the largest margin under the classification model and the ones nearest to the centroids under the clustering model, are correctly classified or clustered. This assumption guarantees the high accuracy of the self-labeled training data and correspondingly the good performance of the algorithm. For convenience, we separate this assumption into a clustering assumption and a classification assumption. However, our empirical experiments show that these assumptions are often violated on data sets of bad separability. Firstly, the clustering assumption does not hold in this case, at least for the soft-constrained k-means [11]. In fact, each cluster's centroid may be located in: 1) the domain of its corresponding true class, 2) the border between its true class and other classes, or 3) the domain of another class. The probability that the last two cases occur increases as data separability degrades. In the last two cases (we call them cluster bias), CBC will introduce more noise into the training data set in its clustering step. This may in turn violate the classification assumption, since the noise has a large effect when the truly labeled training data are initially very few.
The following iterative steps will then accumulate further errors. In sparsely labeled text classification, the extremely few training data make many techniques that are useful for ameliorating data separability, e.g. feature selection, ineffective, because the training data cannot characterize the whole data set well. When the training data set is extremely small, unsupervised learning can give better performance than supervised and semi-supervised learning algorithms.

In this paper, we develop an active semi-supervised clustering based two-stage approach to address the problem of sparsely labeled text classification. Different from CBC, our aim is to convert the problem of sparsely labeled text classification into a supervised one by using clustering. Therefore, supervised classification models, e.g. SVM, can be applied, and techniques proposed for supervised learning, such as feature selection, sampling methods and data editing or noise filtering, can be used to further improve the classification accuracy. Furthermore, our proposed active semi-supervised clustering method aims to cope with data sets of any separability. The goal of clustering here is to generate enough training data, with high accuracy, for supervised learning.
3. Active Semi-supervised Clustering

When clustering is used to aid semi-supervised classification, the key point is that the clustering results can to some extent characterize the underlying distribution of the whole data set. Only in this case is clustering helpful for augmenting the training data set or extracting useful features to improve classification performance. Although clustering methods are more robust to the bias caused by the initial sparsely labeled data, empirical experience shows that the results of clustering might also be biased (e.g. the cluster bias of cases 2) and 3)), sometimes heavily, especially on data sets of bad separability. The soft-constrained k-means in CBC can reduce bias in the labeled examples by basing the constraints (the guidance of the initially labeled data) not on exact examples but on their centroid. But it still cannot cope well with the bias in the training data on data sets of bad separability.

Table 1 gives two clustering based algorithms to augment training data. They implement the iterative reinforcement strategy. In each iteration, a clustering method is used to cluster the whole data set with the guidance of the labeled training data, and then several examples are selected according to some criteria and labeled with the labels of the centroids they belong to. In the SemCC algorithm, the clustering method is the soft-constrained k-means adopted in the CBC algorithm. The clustering method in the SemCCAc algorithm (we call it active soft-constrained k-means) is proposed in order to address the cluster-bias problem.

Table 1. Two Clustering Algorithms: SemCC and SemCCAc

Input: Labeled data set $D_l$ and unlabeled data set $D_u$, the number of iterations maxiter, p
Output: Augmented labeled data set $D_l'$
Initialize: $D_l' = D_l$, $D_u' = D_u$, iter = 0

Algorithm SemCC:
While iter < maxiter and $D_u' \neq \emptyset$
    iter = iter + 1
    Calculate initial centroids: $o_i^* = \frac{1}{n_i} \sum_{x_j \in D_l',\, t_j = i} x_j$, $i = 1, \ldots, c$, and set current centroids $o_i = o_i^*$. $n_i$ is the number of examples in $D_l'$ whose label is $i$. The labels of the centroids, $t(o_i) = t(o_i^*)$, are equal to the labels of the corresponding examples.
    Repeat until the cluster result doesn't change any more
        Assign $t(o_i)$ to each $x \in D_l' \cup D_u'$ that is nearer to $o_i$ than to other centroids.
        Update current centroids: $o_i = \frac{1}{n_i} \sum_{x_j \in D_l' \cup D_u',\, t_j = i} x_j$, $i = 1, \ldots, c$. $n_i$ is the number of examples in $D_l' \cup D_u'$ whose label is $i$.
        Calculate the nearest initial centroid $o_j^*$ for each $o_i$; if $t(o_i) \neq t(o_j^*)$, exit the loop.
    From each cluster, select the p% examples $x \in D_u'$ nearest to $o_i$, add them to $D_l'$, and delete them from $D_u'$.

Algorithm SemCCAc:
While iter < maxiter and $D_u' \neq \emptyset$
    iter = iter + 1
    Calculate initial centroids: $o_i^* = \frac{1}{n_i} \sum_{x_j \in D_l',\, t_j = i} x_j$, $i = 1, \ldots, c$, and set current centroids $o_i = o_i^*$. $n_i$ is the number of examples in $D_l'$ whose label is $i$. The labels of the centroids, $t(o_i) = t(o_i^*)$, are equal to the labels of the corresponding examples.
    Repeat until the cluster result doesn't change any more
        Assign $t(o_i)$ to each $x \in D_l' \cup D_u'$ that is nearer to $o_i$ than to other centroids.
        Update current centroids: $o_i = \frac{1}{n_i} \sum_{x_j \in D_l' \cup D_u',\, t_j = i} x_j$, $i = 1, \ldots, c$. $n_i$ is the number of examples in $D_l' \cup D_u'$ whose label is $i$.
        Calculate the nearest initial centroid $o_j^*$ for each $o_i$; if $t(o_i) \neq t(o_j^*)$, exit the loop.
    From each cluster, select the p% examples $x \in D_u'$ nearest to $o_i$ and sort them in descending order of confidence: $x_1, \ldots, x_m$.
    If the true labels of $x_1$ and $x_m$ both equal $t(o_i)$:
        add the m examples to $D_l'$
        delete them from $D_u'$
    else
        add $x_1$ and $x_m$ with their true labels to $D_l'$
        delete $x_1$ and $x_m$ from $D_u'$
    end

SemCCAc differs from SemCC in the labeling strategy for the selected examples of highest confidence according to the clustering results. Soft-constrained k-means doesn't take the location of the found centroids into consideration. It just labels the selected examples nearest to each centroid. Therefore, it will introduce much noise into the training data set in the presence of cluster bias. Active soft-constrained k-means first estimates the location of each centroid. Only for clusters whose centroids are located within their true classes does it label all the selected examples nearest to the corresponding centroids with the labels of their centroids. For clusters with cluster bias, it labels just two examples per cluster, with their true labels.

An important problem in active soft-constrained k-means is how to estimate the location of each cluster's centroid. The strategy used here is to ask an oracle or expert, for each cluster, for the true labels of two examples: the nearest and the farthest examples to the centroid among the selected p% examples. If both examples have the same label as their centroid, then all the p% selected examples are labeled with the label of the centroid. Otherwise, only the two examples are added to the training data set, with their true labels. The strategy is based on the intuition that cluster bias is more likely to have happened when the two examples have labels different from that of their centroid. When the two examples have the same label, but one different from that of their centroid, the cluster's centroid is most likely located in the domain of another class. When only one of the two examples has the same label as the centroid, the cluster's centroid is most likely located on the border between the true class and other classes. We can filter out much noise by using this strategy.
A further benefit is that cluster bias can be rectified in the following iterations of SemCCAc by estimating the location of the clusters' centroids. This property helps keep the accuracy of the self-labeled training data high in spite of poor starting points.
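The oracle-based labeling rule described above can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: the function name `label_selected_examples`, the use of string document ids, and the `oracle` callable (standing in for the human expert) are all assumptions made for the example.

```python
from typing import Callable, Dict, Sequence


def label_selected_examples(
    selected: Sequence[str],
    centroid_label: int,
    oracle: Callable[[str], int],
) -> Dict[str, int]:
    """Decide which labels one cluster contributes to the training set.

    `selected` holds the ids of the p% unlabeled examples nearest to the
    cluster centroid, sorted from nearest to farthest (an assumption of
    this sketch). `oracle` is a hypothetical stand-in for the expert and
    returns the true label of one example.
    """
    if not selected:
        return {}
    nearest, farthest = selected[0], selected[-1]
    true_nearest = oracle(nearest)
    true_farthest = oracle(farthest)
    if true_nearest == centroid_label and true_farthest == centroid_label:
        # Both queried examples agree with the centroid: the centroid is
        # presumed to lie inside its true class, so every selected example
        # receives the centroid's label.
        return {doc_id: centroid_label for doc_id in selected}
    # Cluster bias is suspected: keep only the two queried examples,
    # each with the true label returned by the oracle.
    return {nearest: true_nearest, farthest: true_farthest}
```

For instance, with true labels `{"d1": 0, "d2": 0, "d3": 1}` and a centroid labeled 0, the selection `["d1", "d2"]` is accepted in full, whereas `["d1", "d2", "d3"]` contributes only `d1` and `d3` with their true labels, so the unverified `d2` is left unlabeled for later iterations.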
Fig. 1. Accuracy of self-labeled training data with iterations (x-axis: maxiter, 0 to 100; y-axis: accuracy, 0.75 to 1; curves: SemCCAc (p=0.5), SemCC (p=0.5), SemCCAc (p=1), SemCC (p=1))

In figure 1, we depict the average accuracy, over 20 runs, of the self-labeled training data obtained by applying the two clustering algorithms to a text classification problem (same2, consisting of the two most similar classes in 20Newsgroups, with 5 training data for each class). Figure 1 shows that the accuracy of the training data self-labeled by SemCCAc is significantly higher than that of SemCC. With a larger value of p, the accuracy is lower at first, but it then rises as the number of iterations increases. When maxiter=1, SemCC degrades to the soft-constrained k-means. We can also see that the average accuracy of SemCC is below 0.95 when maxiter=1, which indicates that soft-constrained k-means introduces noise into the training data set with a certain probability. This noise will hurt the learning of the subsequent classifier, especially when the initial training data set is very small. We think the phenomenon of cluster bias can partially explain why the performance of CBC improves more slowly than those of TSVM and co-training as the training data increase. As the number of iterations increases, the accuracy of the training data self-labeled by SemCC degrades much faster than that of SemCCAc. Therefore, techniques to cope with cluster bias are very important for clustering based semi-supervised classification. This also tells us that the proposed active semi-supervised clustering method is effective in addressing the problem of cluster bias.
4. Two-Stage Classification Framework: ACTC

In this section, we present the details of the Active semi-supervised Clustering based Two-stage text Classification algorithm (ACTC). All documents are tokenized into terms and we construct one component for each distinct term. Thus each document is represented by a vector $(w_{i1}, w_{i2}, \ldots, w_{ip})$ where $w_{ij}$ is weighted by TFIDF. The cosine function is used in the clustering algorithm to calculate the distance from an example to the centroid. In the classification stage, we use an SVM classifier trained with the augmented training data to classify the whole data set. The detailed algorithm is presented in table 2.

ACTC consists of two stages: a clustering stage and a classification stage. In the clustering stage, SemCCAc is used to augment the training data set. Users can set the values of maxiter and p to determine how many new documents should be labeled by SemCCAc. In the second stage, discriminative classifiers can subsequently be trained with the expanded labeled data set. Soft-constrained k-means is in fact a generative classifier [11]. According to [24], generative classifiers reach their asymptotic performance faster than discriminative classifiers, but usually have a higher asymptotic error. This motivates us to combine clustering with discriminative classifiers to address the problem of sparsely labeled text classification. ACTC in fact converts the problem of sparsely labeled text classification into a supervised one, so supervised classification models suitable for text classification can be used. Moreover, the techniques proposed for supervised learning can be used to improve the performance. For instance, since some examples are unavoidably mislabeled in the clustering stage, data editing or noise filtering techniques are expected to improve the performance of ACTC. Other techniques, such as feature selection and sampling, can also be used to improve the performance.

Table 2. ACTC and CBCSVM

Input: Labeled data set $D_l$ and unlabeled data set $D_u$ and the number of iterations maxiter, p
Output: The full labeled set $D = D_l + D_u$
        A classifier L

Algorithm ACTC:
1. Clustering Stage
    Use SemCCAc (repeat maxiter iterations) to augment the training data set $D_l$ and get an augmented training data set $D_l'$.
2. Classification Stage
    Train an SVM classifier L based on $D_l'$, and use the learned classifier to classify the whole data set.

Algorithm CBCSVM:
iter = 0
1. while iter < maxiter/2
    iter = iter + 1
    1.1 Clustering step
        Use soft-constrained k-means to cluster the whole data set, and for each cluster select the p% unlabeled examples nearest to its centroid and add them to $D_l$.
    1.2 Classification step
        Train an SVM classifier based on $D_l$. From each class, select the p% unlabeled examples with the largest margin, and add them to $D_l$.
2. Train an SVM classifier L based on $D_l$, and use the learned classifier to classify the whole data set.

In order to verify that the two-stage framework performs better than the CBC algorithm in supervised learning, we substitute SVM for TSVM in CBC and name the result CBCSVM. Note that if we used TSVM as the classifier, the performance of both algorithms would be expected to improve. However, the time complexity of a TSVM classifier is much higher than that of an SVM classifier, because it repeatedly switches the estimated labels of unlabeled data and tries to find the maximal margin hyperplane. The more unlabeled data there are, and the worse the data separability is, the more time it requires. For example, on same2, which consists of the two most similar classes of 20Newsgroups with 1000 examples in each class, TSVM requires several hours to complete when 5 training data for each class are used, whereas SVM needs only about 1 second. With enough training data, the performance of SVM is expected to be similar to that of TSVM, but it requires much less time. This motivates us to propose the two-stage classification method, which converts the problem of sparsely labeled text classification into a supervised one. CBCSVM is also given in table 2 for convenience. Since CBC selects p% unlabeled examples in both the clustering and the classification step of each iteration, we set its number of iterations to half of that in ACTC so that both algorithms perform the same number of selections.

The difference between our approach and CBC is that we expand the training data set by a self-training style clustering process and resort to an oracle or expert to evaluate the clusters' centroids.
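The document representation used by both stages (TFIDF-weighted term vectors, with the cosine function giving the distance from an example to a centroid) can be illustrated with a small standard-library sketch. The exact TFIDF variant produced by the preprocessing software is not stated in the text, so the weight `tf * log(N / df)` used here is an assumption, as are the function names `tfidf_vectors`, `centroid` and `cosine_distance`.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Turn tokenized documents into sparse TFIDF vectors.

    Assumed weighting: raw term frequency times log(N / document frequency).
    """
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


def centroid(vectors: List[Dict[str, float]]) -> Dict[str, float]:
    """Component-wise mean of a list of sparse vectors."""
    total: Dict[str, float] = {}
    for vec in vectors:
        for term, weight in vec.items():
            total[term] = total.get(term, 0.0) + weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def cosine_distance(a: Dict[str, float], b: Dict[str, float]) -> float:
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)
```

In a clustering pass, each unlabeled vector would be assigned to the centroid with the smallest `cosine_distance`, and the p% examples with the smallest distances in each cluster would be the candidates for self-labeling.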
After the training data expansion is complete, discriminative classifiers can be trained on the expanded training data set. Therefore, ACTC puts less constraint on the classification model, which enables us to treat the following classification stage as a supervised learning problem.

5. Performance Evaluation

5.1. Data sets

For a consistent evaluation, we conduct our empirical experiments on two benchmark data sets, 20Newsgroups and Reuters-21578.

20Newsgroups is a well-known Web-related data collection. From the original 20Newsgroups data set, same2, consisting of 2 very similar newsgroups (comp.windows.x, comp.os.ms-windows), is used to evaluate the performance of the algorithms. Same2 contains 2000 instances, 1000 for each class. We use the Rainbow software^1 to preprocess the data (removing stop words and words whose document frequency is less than 3, and stemming) and we get 7765 unique terms for same2. Then terms are weighted with their TFIDF values.

The Reuters-21578 corpus contains Reuters news articles from 1987. We only show the experimental results on train1.svm in LWE^2, since the algorithms perform similarly on the other Reuters data sets. Train1.svm contains 1239 documents (two classes) and 6889 unique terms.

^1 http://www.cs.cmu.edu/~mccallum/bow/
^2 http://ews.uiuc.edu/~jinggao3/kdd08transfer

5.2. Evaluation Metric

We use the macro-average of the F1 measure over all classes to evaluate the classification result. For each class $i \in [1, c]$, let $A_i$ be the number of documents whose real label is $i$, $B_i$ the number of documents whose label is predicted to be $i$, and $C_i$ the number of correctly predicted documents in this class. The precision and recall of class $i$ are defined as $P_i = C_i / B_i$ and $R_i = C_i / A_i$ respectively. For each class, the F1 metric is defined as $F1_i = 2 P_i R_i / (P_i + R_i)$, where $P_i$ and $R_i$ are the precision and recall of that class. The F1 metric takes both precision and recall into account, so it is a more comprehensive metric than either precision or recall considered separately. The macro-averaged F1 evaluates the overall performance of the classification model. It is defined as:

$$\text{Macro\_F1} = \frac{1}{c} \sum_{i=1}^{c} \frac{2 P_i R_i}{P_i + R_i} \qquad (1)$$
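Equation (1) can be computed directly from the true and predicted labels. The sketch below follows the definitions of $A_i$, $B_i$ and $C_i$ given above. Two details are assumptions of this example rather than statements from the text: the set of classes is taken from the true labels, and a class with zero precision and recall contributes an F1 of 0.

```python
from typing import Sequence


def macro_f1(true_labels: Sequence[int], predicted_labels: Sequence[int]) -> float:
    """Macro-averaged F1 over the classes that occur in `true_labels`."""
    classes = sorted(set(true_labels))
    total = 0.0
    for c in classes:
        a = sum(1 for t in true_labels if t == c)        # A_i: documents truly in class c
        b = sum(1 for p in predicted_labels if p == c)   # B_i: documents predicted as c
        correct = sum(                                   # C_i: correctly predicted as c
            1 for t, p in zip(true_labels, predicted_labels) if t == c and p == c
        )
        precision = correct / b if b else 0.0
        recall = correct / a if a else 0.0
        denom = precision + recall
        # Assumed convention: F1 is 0 when precision and recall are both 0.
        total += 2 * precision * recall / denom if denom else 0.0
    return total / len(classes)
```

As a worked check, for true labels `[0, 0, 0, 1, 1, 1]` and predictions `[0, 0, 1, 1, 1, 1]`, class 0 has $P = 1$, $R = 2/3$ and $F1 = 0.8$, class 1 has $P = 0.75$, $R = 1$ and $F1 \approx 0.857$, so the macro-averaged F1 is about 0.829.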
5.3. Experimental Results

The SVMlight package^3 is used in our experiments for the implementation of SVM, with default configurations. We first compare ACTC and CBCSVM with different numbers of iterations on the two data sets. SVM and SemCCAc are used as baselines in order to see the benefits brought by our two-stage classification framework and by CBCSVM. We set p=0.5. We run each experiment 30 times and report the average results. The number of training data is 5 for each class, randomly sampled in each run. Figures 2 and 3 give the Macro_F1 performance with different numbers of iterations on same2 and Reuters respectively.

In ACTC and CBCSVM, the parameter maxiter determines the number of self-labeled training data. A larger value of maxiter means more self-labeled training data and thus a larger training data set for the final SVM training. From figure 2, we can see that ACTC significantly outperforms the other algorithms for any value of maxiter and that its performance improves as maxiter increases. This indicates two things. One is that the SVM classifier benefits significantly from the augmented training data set, as can be seen by comparing its performance with that of the SVM trained on the initial training data set. The other is that the self-labeled training data are of high accuracy, so that the benefit from the self-labeled training data exceeds the negative effect of the noise they contain. This accords with the results shown in figure 1 in section 3. The performance of CBCSVM degrades slightly as maxiter increases. Because soft-constrained k-means cannot cope well with cluster bias, it introduces more noise into the self-labeled training data, which in turn has a negative effect on the SVM training. Such noise accumulates in the following iterations, which makes the final SVM perform worse than that in ACTC. SemCCAc outperforms SVM, which accords with the earlier observation that unsupervised learning gives better performance than supervised learning when the training data set is extremely small.

^3 http://svmlight.joachims.org/
Fig. 2. Performance with maxiter on same2 (x-axis: maxiter, 10 to 90; y-axis: Macro-F1; curves: SemCCAc, ACTC, CBCSVM, SVM)

Fig. 3. Performance with maxiter on Reuters (x-axis: maxiter, 10 to 90; y-axis: Macro-F1; curves: SemCCAc, ACTC, CBCSVM, SVM)

From figure 3, we can see that ACTC outperforms the other algorithms when maxiter>20. For smaller values of maxiter it does not, which may be because the self-labeled training data are unbalanced across classes in SemCCAc, and because SemCCAc may filter out useful examples when it copes with the cluster bias. Both have a relatively larger effect on the final SVM performance when the training data set is small. This is expected to be improved by applying sampling techniques to the training data set, e.g. over-sampling. On the Reuters data set, SemCCAc outperforms CBCSVM slightly and outperforms SVM significantly. This indicates that clustering gives better performance than SVM when the initial training data set is small.

To evaluate the performance of ACTC over a large range of labeled data, we run the algorithm together with CBCSVM, SVM and SemCCAc with different amounts of labeled data on the above two data sets. Figures 4 and 5 give the results. We set p=0.5 and maxiter=60. We run each experiment 30 times and report the average results. Training data are randomly sampled in each run.

Fig. 4. Performance with number of training data on same2 (x-axis: number of training data in each class, 5 to 100; y-axis: Macro-F1; curves: SemCCAc, ACTC, CBCSVM, SVM)

ACTC performs best on the two data sets for all sizes of the training data set. SVM performs worst when the number of training data is 5 for each class, but its performance then improves quickly as the training data increase. SVM outperforms CBCSVM and SemCCAc when the number of training data in each class is larger than 20 on same2 and larger than 10 on Reuters. As the training data increase, the performance of ACTC, CBCSVM, and SemCCAc grows very slowly. For ACTC and CBCSVM, the reason may be the effect of the noise contained in the self-labeled training data. Therefore data editing or noise filtering techniques may help to improve the performance. After noise filtering, feature selection and sampling may also help to improve the overall performance. ACTC always significantly outperforms CBCSVM and SemCCAc, which indicates that our two-stage classification framework is superior to that of CBC, and that combining a generative model with a discriminative model can overcome the shortcomings of both models.
Fig. 5. Performance with size of training data set on Reuters (x-axis: number of training data in each class, 5 to 100; y-axis: Macro-F1; curves: SemCCAc, ACTC, CBCSVM, SVM)

In ACTC and CBCSVM, the parameter p determines the number of self-labeled examples in each selection process. A larger value of p means that more examples are self-labeled in each selection, so fewer iterations are needed when the number of self-labeled examples is fixed. But more noise may be introduced into the training data set (please refer to figure 1).

6. Conclusion

This paper presents an active semi-supervised clustering based two-stage classification framework for sparsely labeled text classification. In order to address the cluster bias problem, an active semi-supervised clustering method is proposed. We use a self-training style clustering method to augment the training data set, so that we can convert the challenging problem of sparsely labeled text classification into a supervised one. Therefore supervised classification models, e.g. SVM, can be used, and useful techniques for supervised learning can be employed to further improve the performance. The experiments show the superior performance of our method over SVM and CBC (with SVM as base learner).

In the future, we plan to evaluate other clustering methods to address the cluster bias problem, e.g. affinity propagation clustering and density based clustering. In terms of noise control, data editing or noise filtering techniques will also be explored. Other directions include investigating the problems of example selection, confidence assessment, and resampling techniques.
Acknowledgment. The authors would like to thank the anonymous reviewers for their useful advice. This work is partially supported by the National Natural Science Foundation of China (No.61127005, No.61170054, No.50708085, and No.50978127), the special scientific research funding of Research Institute of Highway, Ministry of Transport (No.1206030211003), and the Project of Education Department of Jiangxi Province (No.GJJ08415).

References

1. Joachims, T.: Text categorization with support vector machines: Learning with Many Relevant Features. In Proceedings of the European Conference on Machine Learning. Chemnitz, Germany, April 21-24, 137-142 (1998).
2. Lewis, D. D.: Naïve Bayes at forty: The independence assumption in information retrieval. In Proceedings of the European Conference on Machine Learning. Chemnitz, Germany, April 21-24 (1998).
3. Masand, B., Linoff, G., Waltz, D.: Classifying news stories using memory based reasoning. In Proceedings of the 15th International ACM/SIGIR Conference on Research & Development in Information Retrieval. Copenhagen, Denmark, June 21-24, 59-64 (1992).
4. Ng, T. H., Goh, W. B., Low, K. L.: Feature selection, perceptron learning and a usability case study for text categorization. In Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Philadelphia, PA, USA, July 27-31, 1997.
5. Yang, Y., Liu, X.: A re-examination of text categorization methods. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Berkeley, CA, USA, August 15-19, 1999.
6. Joachims, T.: Transductive inference for text classification using support vector machines. In Proceedings of the 16th International Conference on Machine Learning (ICML 1999). Bled, Slovenia, June 27-30, 200-209 (1999).
7. Blum, A., Mitchell, T.: Combining labeled and unlabeled data with Co-Training. In Proceedings of the 11th Annual Conference on Computational Learning Theory. Madison, Wisconsin, July 24-26, 92-100 (1998).
8. Nigam, K., McCallum, A. K., Thrun, S., Mitchell, T.: Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2/3): 103-134, 2000.
9. Seeger, M.: Learning with labeled and unlabeled data. Technical report, Edinburgh University, 2001.
10. Slonim, N., Tishby, N.: Document Clustering using Word Clusters via the Information Bottleneck Method. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Athens, Greece, July 24-28, 208-215 (2000).
11. Zeng, H. J., Wang, X. H., Chen, Z., Ma, W. Y.: CBC: Clustering based text classification requiring minimal labeled data. In Proceedings of the 3rd IEEE International Conference on Data Mining. Melbourne, Florida, USA, November 19-22, 2003.
12. Yu, H., Yang, J., Han, J.: Classifying large data sets using SVMs with hierarchical clusters. In Proceedings of the 9th ACM SIGKDD 2003, Washington, DC, USA, 2003.
13. Evans, R., Pfahringer, B., Holmes, G.: Clustering and Classification. 7th International Conference on Information Technology in Asia (CITA 11). Sarawak, Malaysia, July 12-13, 1-8 (2011).
14. Kyriakopoulou, A.: Text classification aided by clustering: a literature review. Tools in Artificial Intelligence, 233-252, 2008.
15. Fung, G., Mangasarian, O. L.: Semi-supervised support vector machines for unlabeled data classification. Optim. Methods Software, 15(1): 29-44, 2001.
16. Kyriakopoulou, A., Kalamboukis, T.: Combining clustering with classification for spam detection in social bookmarking systems. In Proceedings of ECML/PKDD Discovery Challenge 2008 (RSDC 2008). Antwerp, Belgium, 47-54, 2008.
17. Kyriakopoulou, A.: Using Clustering and Co-Training to Boost Classification Performance. In Proceedings of the 19th IEEE International Conference on Tools with Artificial Intelligence. Patras, Greece, October 29-31, 325-330 (2007).
18. Raskutti, B., Ferrá, H., Kowalczyk, A.: Combining clustering and co-training to enhance text classification using unlabeled data. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Edmonton, Alberta, Canada, July 23-26, 2002.
19. Cai, W., Chen, S., Zhang, D.: A multiobjective simultaneous learning framework for clustering and classification. IEEE Transactions on Neural Networks, 21(2): 185-200, 2010.
20. Qian, Q., Chen, S., Cai, W.: Simultaneous clustering and classification over cluster structure representation. Pattern Recognition, 2011, October 27.
21. Chapelle, O., Weston, J., Schölkopf, B.: Cluster kernels for semi-supervised learning. In Advances in Neural Information Processing Systems (NIPS 2002), Vol. 15, 585-592 (2003).
22. Zhou, D., Bousquet, O., Lal, T. N., Weston, J., Schölkopf, B.: Learning with local and global consistency. Advances in Neural Information Processing Systems 16, 321-328, 2004.
23. Keswani, G., Hall, L. O.: Text classification with enhanced semi-supervised fuzzy clustering. Handbook of Fuzzy Computation, 1994, 511-515.
24. Ng, A. Y., Jordan, M. I.: On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. Advances in Neural Information Processing Systems 14, 2002.
Xue Zhang received the BS degree in electronic engineering from Xian University, Xian, China, in 1999. She received the MS degree in control theory and control engineering from Southwest University of Science and Technology, Mianyang, China, in 2003, and the PhD degree in computer science from Southeast University, Nanjing, China, in 2007. Since 2008, she has been a postdoctoral fellow at Peking University. Her research interests include data mining and machine learning, with emphasis on applications to text mining and bioinformatics.

Wangxin Xiao received the PhD degree in traffic information and control engineering from Southeast University, Nanjing, China, in 2004. From 2005 to 2007, he conducted postdoctoral research at Wuhan University of Technology. Since 2008, he has been an associate professor at the Research Institute of Highway, Ministry of Transport. From 2009 to 2011, he was also a postdoctoral fellow at Changsha University of Science and Technology. His research interests include pattern recognition, Intelligent Transport Systems (ITS), and data mining with applications to traffic data.

Received: January 30, 2012; Accepted: December 05, 2012.

ComSIS Vol. 9, No. 4, Special Issue, December 2012