Clustering based Two-Stage Text Classification Requiring Minimal Training Data


DOI: /CSIS Z
ComSIS Vol. 9, No. 4, Special Issue, December 2012

Xue Zhang 1,2 and Wangxin Xiao 3,4

1 Key Laboratory of High Confidence Software Technologies, Ministry of Education, Peking University, Beijing, China
1 School of Electronics Engineering and Computer Science, Peking University, Beijing, China
2 Department of Physics, Shangqiu Normal University, Shangqiu, China
3 Department of Computer Science, Jinggangshan University, Ji'an, China
4 School of Traffic and Transportation Engineering, Changsha University of Science and Technology, Changsha, China

Abstract. Clustering has been employed to expand training data in some semi-supervised learning methods. Clustering based methods rest on the assumption that the clusters learned under the guidance of the initial training data can to some extent characterize the underlying distribution of the data set. However, our experiments show that whether this assumption holds depends on both the separability of the data set and the size of the training data set. It is often violated on data sets of bad separability, especially when the initial training data are too few, and in this case clustering based methods perform worse. In this paper, we propose a clustering based two-stage text classification approach to address this problem. In the first stage, labeled and unlabeled data are first clustered with the guidance of the labeled data. Then a self-training style clustering strategy is used to iteratively expand the training data under the guidance of an oracle or expert. In the second stage, discriminative classifiers can subsequently be trained with the expanded labeled data set. Unlike other clustering based methods, the proposed clustering strategy can effectively cope with data of bad separability. Furthermore, our proposed framework converts the challenging problem of sparsely labeled text classification into a supervised one. Therefore, supervised classification models, e.g. SVM, can be applied, and techniques proposed for supervised learning, such as feature selection, sampling methods and data editing or noise filtering, can be used to further improve the classification accuracy. Our experimental results demonstrate the effectiveness of the proposed approach, especially when the training data set is very small.

Keywords: text classification, clustering, active semi-supervised clustering, two-stage classification.
1. Introduction

The goal of automatic text classification is to automatically assign documents to a number of predefined categories. It is of great importance due to the ever-expanding amount of text documents available in digital form in many real-world applications, such as web-page classification and recommendation, and email processing and filtering. Text classification was long treated as a supervised learning task, and a large number of supervised learning algorithms have been developed, such as Support Vector Machines (SVM) [1], Naïve Bayes [2], Nearest Neighbor [3], and Neural Networks [4]. A comparative study was given in [5]. SVM has been recognized as one of the most effective text classification methods. Furthermore, a number of techniques suitable for supervised learning have been proposed to improve classification accuracy, such as feature selection, data editing or noise filtering, and sampling methods against bias.

A supervised classification model often needs a very large number of training data for the classifier to generalize well. The classification accuracy of traditional supervised text classification algorithms degrades dramatically as the number of training data in each class decreases. Manually labeling the training data for a machine learning algorithm is a tedious and time-consuming process, and sometimes even impractical (e.g., online web-page recommendation). Correspondingly, one important challenge for automatic text classification is how to reduce the number of labeled documents required for building a reliable text classifier. This leads to an active research problem, semi-supervised learning. A number of semi-supervised text classification methods have been proposed, including Transductive SVM (TSVM) [6], Co-Training [7] and EM [8]. A comprehensive review can be found in [9]. By exploiting information contained in unlabeled data, these methods obtain considerable improvement over supervised methods with a relatively small training data set. However, most of these methods adopt an iterative approach that trains an initial classifier based on the distribution of the labeled data. They still face difficulties when the labeled data set is extremely small: they have a poor starting point and accumulate more errors over the iterations when the extremely few labeled data are far from the corresponding class centers due to the high dimensionality.

To address the problem of sparsely labeled text classification, we present a clustering based two-stage text classification method that uses both labeled and unlabeled data. Experimental results on several real-world data sets validate the effectiveness of our proposed approach. Our contributions can be summarized as follows.
(1) We propose a novel clustering based two-stage classification approach that requires minimal training data to achieve high classification accuracy.
(2) In order to improve the accuracy of the training data self-labeled by clustering, we propose an active semi-supervised clustering method to cope with data sets of bad separability.
(3) On the basis of clustering, we convert the challenging problem of sparsely labeled text classification into a supervised one. Thus supervised classification models and techniques suitable for text classification can be used to further improve the overall performance.
(4) We conduct extensive experiments to validate our approach and study related issues.

The rest of this paper is organized as follows. Section 2 reviews several existing methods. Our clustering method is given in Section 3 with some analysis. The detailed algorithm is then presented in Section 4. Experimental results are presented in Section 5. Section 6 concludes this paper.

2. Related Work

Clustering has been applied in many subdomains of text classification, including feature compression or extraction [10], semi-supervised learning [11], and clustering in large-scale classification problems [12,13]. In the following we review several related works on clustering aiding classification in the area of semi-supervised learning. A comprehensive review of text classification aided by clustering can be found in [14].

Clustering has been used to extract information from unlabeled data in order to boost the classification task. There are roughly four cases of semi-supervised classification aided by clustering. In particular, clustering is used: (1) to create a training set from the unlabeled set [15], (2) to augment an existing labeled set with new documents from the unlabeled data set [11], (3) to augment the data set with new features [8,16], and (4) to co-train a classifier [17,18]. More recently, simultaneous learning frameworks for clustering and classification have been proposed [19,20].

To make use of unlabeled data, one assumption made, explicitly or implicitly, by most semi-supervised learning algorithms is the so-called cluster assumption: two points are likely to have the same class label if there is a path connecting them that passes through regions of high density only. That is, the decision boundary should lie in regions of low density. Based on the ideas of spectral clustering and random walks, a framework for constructing kernels which implement the cluster assumption was proposed in [21].
Also based on the cluster assumption, [22] applied spectral clustering to represent the labeled and unlabeled data. By clustering unlabeled data with labeled data using probabilistic and fuzzy approaches, [23] proposed a framework to improve the performance of a base classifier with unlabeled data. In text classification, there are often many low-density areas between positive and negative labeled examples because of the high dimensionality and data sparseness. This situation worsens as the number of training data in each class decreases.

The most closely related work is the clustering based text classification (CBC) approach [11]. In CBC, semi-supervised soft k-means is first used to cluster the labeled and unlabeled data into k clusters, where k is set to the number of classes in the classification task. The p% most confident unlabeled examples from each cluster (i.e. the ones nearest to the cluster's centroid) are added to the training data set. Then TSVM is trained on the augmented training data set and the unlabeled data set. Similarly, the p% most confident unlabeled examples from each class (i.e. the ones with the largest margin) are added to the training data set. CBC alternates the clustering step and the classification step until there is no unlabeled data left.

In CBC, in order to guarantee the labeling accuracy, the value of p should be small enough. That is, after the clustering step in each iteration, the training data set is augmented with very few examples. Therefore, the classifier in the following classification step must perform acceptably with a small training data set. This puts a strong constraint on the selected classification models. CBC can hardly perform well with supervised classification models, e.g. SVM, which will be demonstrated later.

The success of CBC is based on the assumption that even when some of the data points are wrongly classified, the most confident data points, i.e. the ones with the largest margin under the classification model and the ones nearest to the centroids under the clustering model, are correctly classified or clustered. This assumption guarantees the high accuracy of the self-labeled training data and correspondingly the good performance of the algorithm. For convenience, we separate this assumption into a clustering assumption and a classification assumption. However, our empirical experiments show that the assumptions are often violated on data sets of bad separability. Firstly, the clustering assumption cannot hold in this case, at least for the soft-constrained k-means [11]. In fact, each cluster's centroid may be located in: 1) the domain of its corresponding true class, 2) the border between its true class and other classes, 3) the domain of another class. The probability that the last two cases occur increases as data separability degrades. In the last two cases (we call them cluster bias), CBC will introduce more noise into the training data set in its clustering step. This might cause the classification assumption to be violated as well, since the noise has a big effect when the truly labeled training data are initially very few. The following iterative steps will then accumulate further errors.

In sparsely labeled text classification, the extremely few training data make many techniques that are useful for ameliorating data separability, e.g. feature selection, ineffective, because the training data cannot characterize the whole data set well. When the size of the training data set is extremely small, unsupervised learning gives better performance than supervised and semi-supervised learning algorithms.

In this paper, we develop an active semi-supervised clustering based two-stage approach to address the problem of sparsely labeled text classification. Different from CBC, our aim is to convert the problem of sparsely labeled text classification into a supervised one by using clustering. Therefore, supervised classification models, e.g. SVM, can be applied, and techniques proposed for supervised learning, such as feature selection, sampling methods and data editing or noise filtering, can be used to further improve the classification accuracy. Furthermore, our proposed active semi-supervised clustering method aims to cope with data sets of any separability. The goal of clustering here is to generate, with high accuracy, enough training data for supervised learning.
3. Active Semi-supervised Clustering

When clustering is used to aid semi-supervised classification, the key point is that the clustering results can to some extent characterize the underlying distribution of the whole data set. Only in this case is clustering helpful for augmenting the training data set or extracting useful features to improve classification performance. Although clustering methods are more robust to the bias caused by the initial sparsely labeled data, empirical experience shows that the results of clustering might also be biased (e.g. the cluster bias of cases 2) and 3)), sometimes heavily, especially on data sets of bad separability. The soft-constrained k-means in CBC can reduce bias in the labeled examples by basing the constraints (the guidance of the initially labeled data) not on exact examples but on their centroid. But it still cannot cope well with the bias in training data on data sets with bad separability.

Table 1 gives two clustering based algorithms to augment training data. They implement the iterative reinforcement strategy. In each iteration, a clustering method is used to cluster the whole data set with the guidance of labeled training data, and then several examples are selected according to some criteria and labeled with the labels of the centroids they belong to. In the SemiCC algorithm, the clustering method is the soft-constrained k-means adopted in the CBC algorithm. The clustering method in the SemiCCAc algorithm (we call it active soft-constrained k-means) is proposed in order to address the cluster-bias problem.

Table 1. Two Clustering Algorithms: SemiCC and SemiCCAc

Input: Labeled data set $D_l$ and unlabeled data set $D_u$, the number of iterations maxIter, p
Output: Augmented labeled data set $D_l'$
Initialize: $D_l' = D_l$, $D_u' = D_u$, iter = 0

Algorithm SemiCC:
While iter < maxIter and $D_u' \neq \Phi$
    iter = iter + 1
    Calculate initial centroids: $o_i = \frac{1}{n_i} \sum_{x_j \in D_l',\, t_j = i} x_j$, $i = 1, ..., c$, and set current centroids $o_i^* = o_i$. $n_i$ is the number of examples in $D_l'$ whose label is $i$. The labels of the centroids $t(o_i) = t(o_i^*)$ are equal to the labels of the corresponding examples.
    Repeat until the cluster result doesn't change any more
        Assign $t(o_i^*)$ to each $x \in D_u'$ that is nearer to $o_i^*$ than to other centroids.
        Update current centroids: $o_i^* = \frac{1}{n_i} \sum_{x_j \in D_l' \cup D_u',\, t_j = i} x_j$, $i = 1, ..., c$. $n_i$ is the number of examples in $D_l' \cup D_u'$ whose label is $i$.
        Calculate the nearest centroid $o_j$ for each $o_i^*$; if $t(o_i^*) \neq t(o_j)$, exit the loop.
    From each cluster, select the p% examples $x \in D_u'$ which are nearest to $o_i^*$, add them to $D_l'$, and delete them from $D_u'$.

Algorithm SemiCCAc:
While iter < maxIter and $D_u' \neq \Phi$
    iter = iter + 1
    Calculate initial centroids: $o_i = \frac{1}{n_i} \sum_{x_j \in D_l',\, t_j = i} x_j$, $i = 1, ..., c$, and set current centroids $o_i^* = o_i$. $n_i$ is the number of examples in $D_l'$ whose label is $i$. The labels of the centroids $t(o_i) = t(o_i^*)$ are equal to the labels of the corresponding examples.
    Repeat until the cluster result doesn't change any more
        Assign $t(o_i^*)$ to each $x \in D_u'$ that is nearer to $o_i^*$ than to other centroids.
        Update current centroids: $o_i^* = \frac{1}{n_i} \sum_{x_j \in D_l' \cup D_u',\, t_j = i} x_j$, $i = 1, ..., c$. $n_i$ is the number of examples in $D_l' \cup D_u'$ whose label is $i$.
        Calculate the nearest centroid $o_j$ for each $o_i^*$; if $t(o_i^*) \neq t(o_j)$, exit the loop.
    From each cluster, select the p% examples $x \in D_u'$ which are nearest to $o_i^*$ and sort them in descending order of confidence, $x_1, ..., x_m$
    If the true labels of $x_1$ and $x_m$ both equal $t(o_i^*)$:
        add the m examples to $D_l'$
        delete them from $D_u'$
    else
        add $x_1$ and $x_m$ with their true labels to $D_l'$
        delete $x_1$ and $x_m$ from $D_u'$
    end

SemiCCAc differs from SemiCC in the labeling strategy for the selected examples of highest confidence according to the clustering results. Soft-constrained k-means does not take the location of the found centroids into consideration. It just labels the selected examples nearest to each centroid. Therefore, it will introduce much noise into the training data set in the presence of cluster bias. Active soft-constrained k-means first estimates the location of each centroid. Only for clusters whose centroids are located within their true classes does it label all the selected examples nearest to the corresponding centroids with the labels of their centroids. For clusters with cluster bias, it labels just two examples per cluster, with their true labels.

An important problem in active soft-constrained k-means is how to estimate the location of each cluster's centroid. The strategy used here is, for each cluster, to ask an oracle or expert for the true labels of two examples: the nearest and the farthest examples to the centroid among the selected p% examples. If the two examples have the same label as their centroid, then all the p% selected examples are labeled with the label of the centroid. Otherwise, only the two examples are added to the training data set, with their true labels. The strategy is based on the intuition that cluster bias is more likely to have happened when the two examples have labels different from that of their centroid. When the two examples have the same label, but one different from their centroid's, the cluster's centroid is most likely located in the domain of another class. When only one of the two examples has the same label as the centroid, the cluster's centroid is most likely located on the border between the true class and other classes. We can filter out much noise by using this strategy. A further benefit is that cluster bias can be rectified in the following iterations of SemiCCAc by estimating the location of the clusters' centroids, a property that keeps the accuracy of the self-labeled training data high in spite of poor starting points.
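The labeling rule of SemiCCAc described above can be sketched for a single cluster as follows. This is a minimal illustration rather than the authors' implementation: the function name `label_selected` and the `oracle` callable (a stand-in for the human expert) are hypothetical, and the selected examples are assumed to arrive already sorted from nearest to farthest from the centroid.

```python
def label_selected(selected, centroid_label, oracle):
    """Apply the SemiCCAc labeling rule to one cluster.

    selected       -- ids of the p% examples nearest to the centroid,
                      sorted by descending confidence (nearest first)
    centroid_label -- label t(o_i*) carried by the cluster's centroid
    oracle         -- callable id -> true label (hypothetical stand-in
                      for the oracle or expert)
    Returns the (id, label) pairs to add to the training set.
    """
    if not selected:
        return []
    nearest, farthest = selected[0], selected[-1]
    true_nearest = oracle(nearest)
    true_farthest = oracle(farthest)
    if true_nearest == centroid_label and true_farthest == centroid_label:
        # Both queried labels agree with the centroid: the centroid is taken
        # to lie inside its true class, so all selected examples inherit it.
        return [(doc, centroid_label) for doc in selected]
    # Cluster bias suspected: keep only the two queried examples,
    # each with its true label.
    pairs = [(nearest, true_nearest)]
    if farthest != nearest:
        pairs.append((farthest, true_farthest))
    return pairs
```

Under this rule the oracle is consulted for at most two examples per cluster in each iteration, regardless of how many examples are selected.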
[Figure 1: accuracy (y-axis) versus maxIter (x-axis) for SemiCCAc (p=0.5), SemiCC (p=0.5), SemiCCAc (p=1) and SemiCC (p=1).]
Fig. 1. Accuracy of self-labeled training data with iterations

In Figure 1, we depict the average accuracy, over 20 runs, of the self-labeled training data obtained by applying the two clustering algorithms to a text classification problem (same2, consisting of the two most similar classes in 20Newsgroups, with 5 training data for each class). Figure 1 shows that the accuracy of the training data self-labeled by SemiCCAc is significantly higher than that of SemiCC. With a larger value of p, the accuracy degrades at first, but then rises as the number of iterations increases. When maxIter=1, SemiCC reduces to the soft-constrained k-means. We can also see that the average accuracy of SemiCC is below 0.95 when maxIter=1, which indicates that soft-constrained k-means introduces noise into the training data set with a certain probability. This noise will hurt the subsequent learning of the classifier, especially when the initial training data set is very small. We think the phenomenon of cluster bias can partially explain why the performance of CBC improves more slowly than those of TSVM and co-training as the training data increase. As the number of iterations increases, the accuracy of the training data self-labeled by SemiCC degrades much faster than that of SemiCCAc. Therefore, techniques to cope with cluster bias are very important for clustering based semi-supervised classification. This also shows that the proposed active semi-supervised clustering method is effective in addressing the problem of cluster bias.
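The clustering step shared by SemiCC and SemiCCAc in Table 1 can be sketched as below. This is one possible reading of the table, not the authors' code: all function names are hypothetical, vectors are plain Python lists, and two choices are assumptions of this sketch. First, when an updated centroid ends up closer to another class's initial centroid than to its own, the loop stops and keeps the centroids from before that update. Second, the number of selected examples is rounded up.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity of two dense vectors (1.0 if either is all zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def soft_constrained_kmeans(labeled, unlabeled, max_rounds=100):
    """labeled: list of (vector, label); unlabeled: list of vectors.

    Returns (centroids, assignment): a dict label -> centroid and the
    list of labels assigned to the unlabeled vectors.
    """
    classes = sorted({label for _, label in labeled})
    # Initial centroids are the class means of the labeled examples.
    initial = {c: mean([v for v, lab in labeled if lab == c]) for c in classes}
    current = dict(initial)
    assignment = None
    for _ in range(max_rounds):
        # Each unlabeled example takes the label of its nearest centroid.
        new_assignment = [
            min(classes, key=lambda c: cosine_distance(x, current[c]))
            for x in unlabeled
        ]
        if new_assignment == assignment:
            break  # cluster result no longer changes
        assignment = new_assignment
        # Recompute centroids over labeled plus currently assigned examples.
        updated = {}
        for c in classes:
            members = [v for v, lab in labeled if lab == c]
            members += [x for x, lab in zip(unlabeled, assignment) if lab == c]
            updated[c] = mean(members)
        # Soft constraint: a centroid must stay nearest to the initial
        # centroid carrying its own label; otherwise exit (assumed handling).
        drifted = any(
            min(classes, key=lambda j: cosine_distance(updated[c], initial[j])) != c
            for c in classes
        )
        if drifted:
            break
        current = updated
    return current, assignment


def select_nearest(unlabeled, assignment, centroids, p):
    """Per cluster, indices of the p% assigned examples nearest to the centroid."""
    chosen = {}
    for c, centroid in centroids.items():
        idx = [i for i, lab in enumerate(assignment) if lab == c]
        idx.sort(key=lambda i: cosine_distance(unlabeled[i], centroid))
        chosen[c] = idx[:math.ceil(len(idx) * p / 100.0)]
    return chosen
```

With p=0.5 and 1000 examples assigned to a cluster, `select_nearest` returns the 5 examples closest to that cluster's centroid.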
4. Two-Stage Classification Framework: ACTC

In this section, we present the details of the Active semi-supervised Clustering based Two-stage text Classification algorithm (ACTC). All documents are tokenized into terms and we construct one component for each distinct term. Thus each document is represented by a vector $(w_1, w_2, ..., w_p)$ where $w_j$ is weighted by TFIDF. The cosine function is used in the clustering algorithm to calculate the distance from an example to the centroid. In the classification stage, we use an SVM classifier trained with the augmented training data to classify the whole data set. The detailed algorithm is presented in Table 2.

ACTC consists of two stages: a clustering stage and a classification stage. In the clustering stage, SemiCCAc is used to augment the training data set. Users can set the values of maxIter and p to determine how many new documents should be labeled by SemiCCAc. In the second stage, discriminative classifiers can subsequently be trained with the expanded labeled data set.

Soft-constrained k-means is in fact a generative classifier [11]. According to [24], generative classifiers reach their asymptotic performance faster than discriminative classifiers, but usually have a higher asymptotic error. This motivates us to combine clustering with discriminative classifiers to address the problem of sparsely labeled text classification. ACTC in fact converts the problem of sparsely labeled text classification into a supervised one, so supervised classification models suitable for text classification can be used. Moreover, the techniques proposed for supervised learning can be used to improve the performance. For instance, since falsely labeling some examples in the clustering stage is unavoidable, data editing or noise filtering techniques are expected to improve the performance of ACTC. Other techniques, such as feature selection and sampling, can also be used to improve the performance.

Table 2. ACTC and CBCSVM

Input: Labeled data set $D_l$ and unlabeled data set $D_u$ and the number of iterations maxIter, p
Output: The full labeled set $D = D_l + D_u$; a classifier L

Algorithm ACTC:
1. Clustering Stage
    Use SemiCCAc (repeat maxIter iterations) to augment the training data set, which gives an augmented training data set $D_l'$.
2. Classification Stage
    Train an SVM classifier L based on $D_l'$, and use the learned classifier to classify the whole data set.

Algorithm CBCSVM:
iter = 0
1. while iter < maxIter/2
    iter = iter + 1
    1.1 Clustering step
        Use soft-constrained k-means to cluster the whole data set, and for each cluster select the p% unlabeled examples nearest to its centroid and add them to $D_l'$.
    1.2 Classification step
        Train an SVM classifier based on $D_l'$. From each class, select the p% unlabeled examples with the largest margin, and add them to $D_l'$.
2. Train an SVM classifier L based on $D_l'$, and use the learned classifier to classify the whole data set.

In order to verify that the two-stage framework performs better than the CBC algorithm in supervised learning, we substitute SVM for TSVM in CBC and name the result CBCSVM. Note that if we used TSVM as the classifier, the performance of both algorithms would be expected to improve. However, the time complexity of a TSVM classifier is much higher than that of an SVM classifier, because it repeatedly switches the estimated labels of unlabeled data and tries to find the maximal margin hyperplane. The more unlabeled data there are, the more time it requires; the worse the data separability is, the more time it requires. For example, on same2, which consists of the two most similar classes of 20Newsgroups with 1000 examples in each class, TSVM requires several hours to complete when 5 training data for each class are used, whereas SVM needs only about 1 second. With enough training data, the performance of SVM is expected to be similar to that of TSVM, but it requires much less time. This motivates us to propose the two-stage classification method, which converts the problem of sparsely labeled text classification into a supervised one.

CBCSVM is also given in Table 2 for convenience. Since CBC selects p% unlabeled examples in both the clustering and the classification step of each iteration, we set its number of iterations to half of that in ACTC so that both algorithms perform the same number of selections. The difference between our approach and CBC is that we expand the training data set by a self-training style clustering process and resort to an oracle or expert to evaluate the clusters' centroids.
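The document representation described at the start of this section (TFIDF-weighted term vectors compared with the cosine function) can be sketched as follows. The text does not state which TFIDF variant is used, so the weighting below (raw term frequency times log(N / document frequency)) is an assumption of this sketch, as are the function names and the sparse dictionary representation.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one sparse dict term -> weight per doc.

    Assumed weighting: tf * log(N / df), with tf the raw count of the term
    in the document and df the number of documents containing the term.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {term: tf * math.log(n / df[term]) for term, tf in Counter(doc).items()}
        for doc in docs
    ]


def sparse_cosine_distance(u, v):
    """1 - cosine similarity of two sparse vectors (1.0 if either is all zero)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


# Toy usage: the first and third documents share two terms, the first and
# second only one, so the first document is closer to the third.
docs = [
    ["x", "window", "server"],
    ["ms", "window", "driver"],
    ["x", "server", "font"],
]
vectors = tfidf_vectors(docs)
closer = sparse_cosine_distance(vectors[0], vectors[2])
farther = sparse_cosine_distance(vectors[0], vectors[1])
```

A term that occurs in every document receives weight zero under this variant, so it does not influence the distance.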
After the training data expansion is complete, discriminative classifiers can be trained on the expanded training data set. Therefore, ACTC puts less constraint on the classification model, which enables us to treat the following classification stage as a supervised learning problem.

5. Performance Evaluation

5.1. Data sets

For a consistent evaluation, we conduct our empirical experiments on two benchmark data sets, 20NewsGroups and Reuters. 20 Newsgroups is a famous Web-related data collection. From the original 20 Newsgroups data set, same2, consisting of 2 very similar newsgroups (comp.windows.x, comp.os.ms-windows), is used to evaluate the performance of the algorithms. Same2 contains 2000 instances, 1000 for each class. We use the Rainbow software to preprocess the data (removing stop words and words whose document frequency is less than 3, and stemming), which gives 7765 unique terms for same2. Then terms are weighted with their TFIDF values. The Reuters corpus consists of Reuters news articles. We only show the experimental results on train1.svm in LWE, since the algorithms perform similarly on the other Reuters data sets. Train1.svm contains 1239 documents (two classes) and 6889 unique terms.

5.2. Evaluation Metric

We use the macro-average of the F1 measure over all classes to evaluate the classification result. For each class $i \in [1, c]$, let $A_i$ be the number of documents whose real label is $i$, $B_i$ the number of documents whose label is predicted to be $i$, and $C_i$ the number of correctly predicted documents in this class. The precision and recall of class $i$ are defined as $P_i = C_i / B_i$ and $R_i = C_i / A_i$ respectively. For each class, the F1 metric is defined as $F1_i = 2 P_i R_i / (P_i + R_i)$, where $P_i$ and $R_i$ are the precision and recall of that class. The F1 metric takes both precision and recall into account, so it is a more comprehensive metric than either precision or recall considered separately. The macro-averaged F1 evaluates the overall performance of the classification model. It is defined as:

$$\mathrm{Macro\_F1} = \frac{1}{c} \sum_{i=1}^{c} \frac{2 P_i R_i}{P_i + R_i} \qquad (1)$$
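Equation (1) can be computed directly from the per-class counts $A_i$, $B_i$ and $C_i$. The sketch below is illustrative only: the function name is hypothetical, the set of classes is taken from the true labels, and a class with an undefined precision, recall or F1 (zero denominator) is assumed to contribute 0, a convention the text does not specify.

```python
def macro_f1(true_labels, predicted_labels):
    """Macro-averaged F1 over the classes that occur in true_labels."""
    classes = sorted(set(true_labels))
    total = 0.0
    for c in classes:
        a = sum(1 for t in true_labels if t == c)        # A_i: real label is i
        b = sum(1 for p in predicted_labels if p == c)   # B_i: predicted as i
        correct = sum(                                   # C_i: correctly predicted
            1 for t, p in zip(true_labels, predicted_labels) if t == c and p == c
        )
        precision = correct / b if b else 0.0  # assumed 0 when nothing is predicted as i
        recall = correct / a if a else 0.0
        if precision + recall > 0.0:
            total += 2.0 * precision * recall / (precision + recall)
    return total / len(classes)
```

For true labels [0, 0, 1, 1] and predictions [0, 1, 1, 1], class 0 has P=1 and R=1/2 (F1=2/3), class 1 has P=2/3 and R=1 (F1=4/5), so the macro-averaged F1 is 11/15.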
5.3. Experimental Results

The SVM-light package is used in our experiments for the implementation of SVM, with default configurations. We first compare ACTC and CBCSVM with different numbers of iterations on the two data sets. SVM and SemiCCAc are used as baselines in order to see the benefits brought by our two-stage classification framework and by CBCSVM. We set p=0.5. We run each experiment 30 times and report the average results. The number of training data is 5 for each class, randomly sampled in each run. Figures 2 and 3 give the Macro_F1 performance with different numbers of iterations on same2 and Reuters respectively.

In ACTC and CBCSVM, the parameter maxIter determines the number of self-labeled training data. A larger value of maxIter means more self-labeled training data and therefore a larger training data set for the final SVM training. From Figure 2, we can see that ACTC significantly outperforms the other algorithms for every value of maxIter and that its performance improves as maxIter increases. This indicates two things. One is that the SVM classifier benefits significantly from the augmented training data set, as a comparison with the SVM trained on the initial training data set shows. The other is that the self-labeled training data are of high accuracy, so that the benefit from them exceeds the negative effect of the noise they contain. This agrees with Figure 1 in Section 3. The performance of CBCSVM degrades slightly as maxIter increases. Because soft-constrained k-means cannot cope well with cluster bias, it introduces more noise into the self-labeled training data, which in turn has a negative effect on the SVM training. Such noise accumulates in the following iterations, which makes the final SVM perform worse than that in ACTC. SemiCCAc outperforms SVM, which agrees with the earlier conclusion that unsupervised learning gives better performance than supervised learning when the training data set is extremely small.
[Figure 2: Macro_F1 (y-axis) versus maxIter (x-axis) on same2 for SemiCCAc, ACTC, CBCSVM and SVM.]
Fig. 2. Performance with maxIter on same2

[Figure 3: Macro_F1 (y-axis) versus maxIter (x-axis) on Reuters for SemiCCAc, ACTC, CBCSVM and SVM.]
Fig. 3. Performance with maxIter on Reuters

From Figure 3, we can see that ACTC outperforms the other algorithms when maxIter>20. That it does not do so for smaller values of maxIter may be because the self-labeled training data are unbalanced across classes in SemiCCAc, and because SemiCCAc may filter out useful examples when it copes with cluster bias. Both have a relatively larger effect on the final SVM performance when the training data set is small. This is expected to be improved by applying a sampling technique to the training data set, e.g. over-sampling. On the Reuters data set, SemiCCAc outperforms CBCSVM slightly and significantly outperforms SVM. This indicates that clustering gives better performance than SVM when the initial training data set is small.

To evaluate the performance of ACTC over a large range of labeled data, we run the algorithm together with CBCSVM, SVM and SemiCCAc with different amounts of labeled data on the above two data sets. Figures 4 and 5 give the results. We set p=0.5 and maxIter=60. We run each experiment 30 times and report the average results. Training data are randomly sampled in each run.

[Figure 4: Macro_F1 (y-axis) versus number of training data in each class (x-axis) on same2 for SemiCCAc, ACTC, CBCSVM and SVM.]
Fig. 4. Performance with number of training data on same2

ACTC performs best on the two data sets for all sizes of the training data set. SVM performs worst when the size of the training data is 5 for each class, but its performance then improves quickly as the training data increase. SVM outperforms CBCSVM and SemiCCAc when the size of the training data set is larger than 20 on same2 and larger than 10 on Reuters. As the training data increase, the performance of ACTC, CBCSVM, and SemiCCAc grows very slowly. For ACTC and CBCSVM, the reason may be the effect of the noise contained in the self-labeled training data. Therefore data editing or noise filtering techniques may help to improve the performance. After noise filtering, feature selection and sampling may also help to improve the overall performance. ACTC always significantly outperforms CBCSVM and SemiCCAc, which indicates that our two-stage classification framework is superior to that of CBC, and that combining a generative model with a discriminative model can overcome the shortcomings of both models.
[Figure 5: Macro_F1 (y-axis) versus number of training data in each class (x-axis) on Reuters for SemiCCAc, ACTC, CBCSVM and SVM.]
Fig. 5. Performance with size of training data set on Reuters

In ACTC and CBCSVM, the parameter p determines the number of self-labeled examples in each selection process. A larger value of p means that more examples are self-labeled in each selection, so fewer iterations are needed when the number of self-labeled examples is fixed. But more noise may be introduced into the training data set (please refer to Figure 1).

6. Conclusion

This paper presents an active semi-supervised clustering based two-stage classification framework for sparsely labeled text classification. In order to address the cluster bias problem, an active semi-supervised clustering method is proposed. We use a self-training style clustering method to augment the training data set, so that we can convert the challenging problem of sparsely labeled text classification into a supervised one. Therefore supervised classification models, e.g. SVM, can be used, and useful techniques for supervised learning can be employed to further improve the performance. The experiments show the superior performance of our method over SVM and CBC (with SVM as base learner). In the future, we plan to evaluate other clustering methods to address the cluster bias problem, e.g. affinity propagation clustering and density based clustering. In terms of noise control, data editing or noise filtering techniques will also be explored. Other directions include investigating the problems of example selection, confidence assessment, and resampling techniques.
Acknowledgment. The authors would like to thank the anonymous reviewers for their useful advice. This work is partially supported by the National Natural Science Foundation of China, the special scientific research funding of Research Institute of Highway, Ministry of Transport, and the Project of Education Department of Jiangxi Province (No. GJJ08415).

Xue Zhang received the BS degree in electronic engineering from Xian University, Xian, China. She received the MS degree in control theory and control engineering from Southwest University of Science and Technology, Mianyang, China, in 2003, and the PhD degree in computer science from Southeast University, Nanjing, China. Since 2008, she has been a postdoctoral fellow at Peking University. Her research interests include data mining and machine learning, with emphasis on applications to text mining and bioinformatics.

Wangxin Xiao received the PhD degree in traffic information and control engineering from Southeast University, Nanjing, China. From 2005 to 2007, he was engaged in postdoctoral research at Wuhan University of Technology. Since 2008 he has been an associate professor in the Research Institute of Highway, Ministry of Transport. From 2009 to 2011, he was also a postdoctoral fellow at Changsha University of Science and Technology. His research interests include pattern recognition, Intelligent Transport Systems (ITS), and data mining with applications to traffic data.

Received: January 30, 2012; Accepted: December 05.

ComSIS Vol. 9, No. 4, Special Issue, December 2012