Learning with Unlabeled Data for Text Categorization Using Bootstrapping and Feature Projection Techniques

Youngjoong Ko
Dept. of Computer Science, Sogang Univ.
Sinsu-dong, Mapo-gu, Seoul, 121-742, Korea
kyj@nlpzodiac.sogang.ac.kr

Jungyun Seo
Dept. of Computer Science, Sogang Univ.
Sinsu-dong, Mapo-gu, Seoul, 121-742, Korea
seojy@ccs.sogang.ac.kr

Abstract

A wide range of supervised learning algorithms have been applied to text categorization. However, supervised learning approaches have some problems. One of them is that they require a large, often prohibitive, number of labeled training documents for accurate learning. Generally, acquiring class labels for training data is costly, while gathering a large quantity of unlabeled data is cheap. We propose a new automatic text categorization method that learns from only unlabeled data using a bootstrapping framework and a feature projection technique. In our experiments, our method showed performance reasonably comparable to that of a supervised method. If our method is used in a text categorization task, building text categorization systems will become significantly faster and less expensive.

1 Introduction

Text categorization is the task of classifying documents into a certain number of pre-defined categories. Many supervised learning algorithms have been applied to this area. These algorithms are reasonably successful today when provided with enough labeled or annotated training examples. Examples include Naive Bayes (McCallum and Nigam, 1998), Rocchio (Lewis et al., 1996), Nearest Neighbor (kNN) (Yang et al., 2002), TCFP (Ko and Seo, 2002), and Support Vector Machine (SVM) (Joachims, 1998).

However, the supervised learning approach has some difficulties. One key difficulty is that it requires a large, often prohibitive, number of labeled training data for accurate learning. Since a labeling task must be done manually, it is a painfully time-consuming process. Furthermore, since the application area of text categorization has diversified from newswire articles and web pages to E-mails and newsgroup postings, it is also a difficult task to create training data for each application area (Nigam et al., 1998). In this light, we consider learning algorithms that do not require such a large amount of labeled data.

While labeled data are difficult to obtain, unlabeled data are readily available and plentiful. Therefore, this paper advocates using a bootstrapping framework and a feature projection technique with just unlabeled data for text categorization.
The input to the bootstrapping process is a large amount of unlabeled data and a small amount of seed information to tell the learner about the specific task. In this paper, we consider seed information in the form of title words associated with categories. In general, since unlabeled data are much less expensive and easier to collect than labeled data, our method is useful for text categorization tasks involving online data sources such as web pages, E-mails, and newsgroup postings.

To automatically build a text classifier with unlabeled data, we must solve two problems: how to automatically generate labeled training documents (machine-labeled data) from only title words, and how to handle incorrectly labeled documents in the machine-labeled data. This paper provides solutions for these problems. For the first problem, we employ the bootstrapping framework. For the second, we use the TCFP classifier, which is robust to noisy data (Ko and Seo, 2004).

How can labeled training data be automatically created from unlabeled data and title words? Unlabeled data may seem to carry no information for building a text classifier because they do not contain the most important information, their category. Thus we must assign a class to each document in order to use supervised learning approaches. Since text categorization is a task based on pre-defined categories, we know the categories for classifying documents. Knowing the categories means that we can choose at least one representative title word for each category. This is the starting point of our proposed method. As we carry out a bootstrapping task from these title words, we can finally get labeled training data.

Suppose, for example, that we are interested in classifying newsgroup postings about the 'Autos' category. First, we can select 'automobile' as a title word and automatically extract keywords ('car', 'gear', 'transmission', 'sedan', and so on) using co-occurrence information. In our method, we use a context (a sequence of 60 words) as a unit of meaning for bootstrapping from title words; its size generally lies between that of a sentence and that of a document. We then extract core contexts that include at least one of the title words and the keywords. We call them centroid-contexts because they are regarded as contexts with the core meaning of each category. From the centroid-contexts, we can gain many words that contextually co-occur with the title words and keywords: 'driver', 'clutch', 'trunk', and so on. These are words in first-order co-occurrence with the title words and the keywords. To gather more vocabulary, we extract contexts that are similar to centroid-contexts by a similarity measure; they contain words in second-order co-occurrence with the title words and the keywords. We finally construct the context-cluster of each category as the combination of centroid-contexts and contexts selected by the similarity measure. Using the context-clusters as labeled training data, a Naive Bayes classifier can be built. Since the Naive Bayes classifier can label all unlabeled documents with a category, we can finally obtain labeled training data (machine-labeled data).

When the machine-labeled data are used to learn a text classifier, another difficulty arises: they contain more incorrectly labeled documents than manually labeled data. Thus we develop and employ the TCFP classifier, which is robust to noisy data.

The rest of this paper is organized as follows. Section 2 reviews previous works. In sections 3 and 4, we explain the proposed method in detail. Section 5 is devoted to the analysis of the empirical results. The final section describes conclusions and future works.

2 Related Works

In general, related approaches for using unlabeled data in text categorization follow two directions: one builds classifiers from a combination of labeled and unlabeled data (Nigam, 2001; Bennett and Demiriz, 1999), and the other employs clustering algorithms for text categorization (Slonim et al., 2002).

Nigam studied an Expectation-Maximization (EM) technique for combining labeled and unlabeled data for text categorization in his dissertation. He showed that the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training data with a large pool of unlabeled data.

Bennett and Demiriz achieved small improvements on some UCI data sets using SVM.
It seems that SVMs assume that decision boundaries lie between classes in low-density regions of instance space, and the unlabeled examples help find these areas.

Slonim suggested clustering techniques for unsupervised document classification. Given a collection of unlabeled data, he attempted to find clusters that are highly correlated with the true topics of documents by unsupervised clustering methods. In his paper, Slonim proposed a new clustering method, the sequential Information Bottleneck (sIB) algorithm.

3 The Bootstrapping Algorithm for Creating Machine-labeled Data

The bootstrapping framework described in this paper consists of the following steps. Each module is described in detail in the following sections.

1. Preprocessing: Contexts are separated from unlabeled documents and content words are extracted from them.
2. Constructing context-clusters for training:
   - Keywords of each category are created
   - Centroid-contexts are extracted and verified
   - Context-clusters are created by a similarity measure
3. Learning Classifier: Naive Bayes classifiers are learned by using the context-clusters

3.1 Preprocessing

The preprocessing module has two main roles: extracting content words and reconstructing the collected documents into contexts. We use the Brill POS tagger to extract content words (Brill, 1995). Generally, the supervised learning approach with labeled data regards a document as a unit of meaning. But since we can use only the title words and unlabeled data, we define a context as a unit of meaning and employ it as the meaning unit to bootstrap the meaning of each category. In our system, we regard a sequence of 60 content words within a document as a context. To extract contexts from a document, we use sliding window techniques (Maarek et al., 1991). The window slides from the first word of the document to the last, with a window size of 60 words and an interval of 30 words between windows. Therefore, the final output of preprocessing is a set of context vectors that are represented as the content words of each context.
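The sliding-window step can be illustrated with a minimal Python sketch. It assumes that POS tagging and content-word filtering have already produced a flat list of content words for one document. The function name `extract_contexts` and the decision to keep a shorter final window are assumptions made for illustration, not details given in the text.

```python
def extract_contexts(content_words, window_size=60, step=30):
    """Split one document's content words into overlapping contexts.

    window_size and step follow the sizes stated in the text (60 and 30).
    Keeping a shorter trailing window is an assumption of this sketch.
    """
    if not content_words:
        return []
    contexts = []
    start = 0
    while True:
        contexts.append(content_words[start:start + window_size])
        # Stop once the current window has reached the end of the document.
        if start + window_size >= len(content_words):
            break
        start += step
    return contexts


# Hypothetical document of 100 content words.
words = [f"w{i}" for i in range(100)]
contexts = extract_contexts(words)
print([len(c) for c in contexts])
```

With a step of half the window size, every content word except those at the document edges appears in two contexts, so a keyword near a window boundary is still seen together with its neighbours on both sides.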
3.2 Constructing Context-Clusters for Training

At first, we automatically create keywords from a title word for each category using co-occurrence information. Then centroid-contexts are extracted using the title word and keywords. They contain at least one of the title words and keywords. Finally, we can gain more information about each category by assigning remaining contexts to each context-cluster using a similarity measure technique; the remaining contexts do not contain any keywords or title words.

3.2.1 Creating Keyword Lists

The starting point of our method is that we have title words and collected documents. A title word can present the main meaning of each category, but it may be insufficient for representing a category for text categorization. Thus we need to find words that are semantically related to a title word, and we define them as keywords of each category.

The score of semantic similarity between a title word, T, and a word, W, is calculated by the cosine metric as follows:

$$sim(T, W) = \frac{\sum_{i=1}^{n} t_i \times w_i}{\sqrt{\sum_{i=1}^{n} t_i^2 \times \sum_{i=1}^{n} w_i^2}} \qquad (1)$$

where $t_i$ and $w_i$ represent the occurrence (binary value: 0 or 1) of words T and W in the i-th document respectively, and n is the total number of documents in the collected documents. This method calculates the similarity score between words based on the degree of their co-occurrence in the same document.

Since the keywords for text categorization must have the power to discriminate categories as well as similarity with the title words, we assign a word to the keyword list of the category with the maximum similarity score and recalculate the score of the word in that category using the following formula:

$$Score(W, c_{max}) = sim(T_{max}, W) + \left( sim(T_{max}, W) - sim(T_{secondmax}, W) \right) \qquad (2)$$

where $T_{max}$ is the title word with the maximum similarity score with a word W, $c_{max}$ is the category of the title word $T_{max}$, and $T_{secondmax}$ is the other title word with the second highest similarity score with the word W.

This formula means that a word with a high ranking in a category has a high similarity score with the title word of the category and a large similarity score difference from the other title words. We sort the words assigned to each category according to the calculated score in descending order. We then choose the top m words as keywords in the category. Table 1 shows the list of keywords (top 5) for each category in the WebKB data set.

Table 1. The list of keywords in the WebKB data set

Category   Title Word   Keywords
course     course       assignments, hours, instructor, class, fall
faculty    professor    associate, ph.d, fax, interests, publications
project    project      system, systems, research, software, information
student    student      graduate, computer, science, page, university

3.2.2 Extracting and Verifying Centroid-Contexts

We choose contexts with a keyword or a title word of a category as centroid-contexts. Among centroid-contexts, some contexts may not have good features of a category even though they include the keywords of the category. To rank the importance of centroid-contexts, we compute the importance score of each centroid-context. First of all, the weights ($W_{ij}$) of word $w_i$ in the j-th category are calculated using Term Frequency (TF) within a category and Inverse Category Frequency (ICF) (Cho and Kim, 1997) as follows:

$$W_{ij} = TF_{ij} \times ICF_i = TF_{ij} \times \left( \log(M) - \log(CF_i) \right) \qquad (3)$$

where $CF_i$ is the number of categories that contain $w_i$ and M is the total number of categories.

Using the word weights ($W_{ij}$) calculated by formula 3, the score of a centroid-context ($S_k$) in the j-th category ($c_j$) is computed as follows:

$$Score(S_k, c_j) = \frac{W_{1j} + W_{2j} + \dots + W_{Nj}}{N} \qquad (4)$$

where N is the number of words in the centroid-context.

As a result, we obtain a set of words in first-order co-occurrence from the centroid-contexts of each category.

3.2.3 Creating Context-Clusters

We gather the second-order co-occurrence information by assigning remaining contexts to the context-cluster of each category. For the assigning criterion, we calculate the similarity between remaining contexts and the centroid-contexts of each category. Thus we employ the similarity measure technique by Karov and Edelman (1998). In our method, a part of this technique is reformed for our purpose, and remaining contexts are assigned to each context-cluster by that revised technique.

1) Measurement of word and context similarities

As similar words tend to appear in similar contexts, we can compute the similarity by using contextual information. Words and contexts play complementary roles. Contexts are similar to the extent that they contain similar words, and words are similar to the extent that they appear in similar contexts (Karov and Edelman, 1998). This definition is circular. Thus it is applied iteratively using two matrices, WSM and CSM.

Each category has a word similarity matrix $WSM_n$ and a context similarity matrix $CSM_n$. In each iteration n, we update $WSM_n$, whose rows and columns are labeled by all content words encountered in the centroid-contexts of each category and in the input remaining contexts. In that matrix, the cell (i, j) holds a value between 0 and 1, indicating the extent to which the i-th word is contextually similar to the j-th word. Also, we keep and update a $CSM_n$, which holds similarities among contexts. The rows of $CSM_n$ correspond to the remaining contexts and the columns to the centroid-contexts. In this paper, the number of input contexts of row and column in CSM is limited to …00, considering execution time and memory allocation, and the number of iterations is set to 3.

To compute the similarities, we initialize $WSM_n$ to the identity matrix. The following steps are iterated until the changes in the similarity values are small enough.

1. Update the context similarity matrix $CSM_n$, using the word similarity matrix $WSM_n$.
2. Update the word similarity matrix $WSM_n$, using the context similarity matrix $CSM_n$.

2) Affinity formulae

To simplify the symmetric iterative treatment of similarity between words and contexts, we define an auxiliary relation between words and contexts as affinity. Affinity formulae are defined as follows (Karov and Edelman, 1998):

$$aff_n(W, X) = \max_{W_i \in X} sim_n(W, W_i) \qquad (5)$$

$$aff_n(X, W) = \max_{W \in X_j} sim_n(X, X_j) \qquad (6)$$

In the above formulae, n denotes the iteration number, and the similarity values are defined by $WSM_n$ and $CSM_n$. Every word has some affinity to the context, and the context can be represented by a vector indicating the affinity of each word to it.

3) Similarity formulae

The similarity of $W_1$ to $W_2$ is the average affinity of the contexts that include $W_1$ to $W_2$, and the similarity of a context $X_1$ to $X_2$ is a weighted average of the affinity of the words in $X_1$ to $X_2$.
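Similarity formulae are defined as follows:

$$sim_{n+1}(X_1, X_2) = \sum_{W \in X_1} weight(W, X_1) \cdot aff_n(W, X_2) \qquad (7)$$

$$sim_{n+1}(W_1, W_2) = \begin{cases} 1 & \text{if } W_1 = W_2 \\ \sum_{W_1 \in X} weight(X, W_1) \cdot aff_n(X, W_2) & \text{otherwise} \end{cases} \qquad (8)$$

The weights in formula 7 are computed as reflecting global frequency, log-likelihood factors, and part of speech, as used in (Karov and Edelman, 1998). The sum of the weights in formula 8, each of which is the reciprocal of the number of contexts that contain $W_1$, is 1.

4) Assigning remaining contexts to a category

We decide a similarity value of each remaining context for each category using the following method:

$$sim(X, c_i) = \underset{S_j \in CC_{c_i}}{aver} \; sim(X, S_j) \qquad (9)$$

In formula 9, X is a remaining context, $C = \{c_1, c_2, \dots, c_m\}$ is a category set, and $CC_{c_i} = \{S_1, \dots, S_n\}$ is the centroid-contexts set of category $c_i$.

Each remaining context is assigned to the category with the maximum similarity value. But there may exist noisy remaining contexts which do not belong to any category. To remove these noisy remaining contexts, we set up a dropping threshold using the normal distribution of similarity values as follows (Ko and Seo, 2000):

$$\max_{c_i \in C} \{ sim(X, c_i) \} \geq \mu + \theta\sigma \qquad (10)$$

where X is a remaining context, $\mu$ is the average of the similarity values $sim(X, c_i)$ for $c_i \in C$, $\sigma$ is the standard deviation of the similarity values, and $\theta$ is a numerical value corresponding to the threshold (%) in the normal distribution table.

Finally, a remaining context is assigned to the context-cluster of a category when the category has a maximum similarity above the dropping threshold value. In this paper, we empirically use a 5% threshold value from an experiment using a validation set.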
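The assignment rule in formulae 9 and 10 can be sketched in Python as follows. The sketch assumes the context-to-centroid similarities have already been computed, and it reads "threshold (%) in the normal distribution table" as an upper-tail percentage converted to a z-value with the standard library's `NormalDist`; that reading, the function names, and the choice to drop a context when all categories tie are assumptions made for illustration.

```python
from statistics import NormalDist, mean, pstdev


def category_similarity(sims_to_centroids):
    """Formula 9: average similarity of a context to one category's centroid-contexts."""
    return mean(sims_to_centroids)


def assign_context(sims_by_category, tail_percent=5.0):
    """Return the category for a remaining context, or None if it is dropped as noise.

    sims_by_category maps a category name to the list of similarities between
    the remaining context and that category's centroid-contexts.
    """
    cat_sims = {c: category_similarity(v) for c, v in sims_by_category.items()}
    mu = mean(cat_sims.values())
    sigma = pstdev(cat_sims.values())
    # Assumption: theta is the z-value whose upper tail holds tail_percent percent.
    theta = NormalDist().inv_cdf(1.0 - tail_percent / 100.0)
    best = max(cat_sims, key=cat_sims.get)
    # Formula 10: keep the context only if its best similarity clears mu + theta * sigma.
    # Assumption: a context equally similar to every category (sigma == 0) is dropped.
    if sigma == 0.0 or cat_sims[best] < mu + theta * sigma:
        return None
    return best


# Hypothetical similarities for six categories.
clear = {"a": [0.9, 0.8], "b": [0.1], "c": [0.1], "d": [0.1], "e": [0.1], "f": [0.1]}
noisy = {"a": [0.3], "b": [0.28], "c": [0.1], "d": [0.12], "e": [0.3], "f": [0.29]}
print(assign_context(clear), assign_context(noisy))
```

Under this reading, a context is kept only when one category stands out clearly from the rest, which matches the stated goal of discarding remaining contexts that do not belong to any category.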
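3.3 Learning the Naive Bayes Classifier Using Context-Clusters

In the above section, we obtained labeled training data: context-clusters. Since the training data are labeled at the context unit, we employ a Naive Bayes classifier because it can be built by estimating the word probability in a category, not in a document. That is, unlike other classifiers, the Naive Bayes classifier does not require labeled data with the unit of documents.

We use the Naive Bayes classifier with minor modifications based on Kullback-Leibler Divergence (Craven et al., 2000). We classify a document $d_i$ according to the following formula:

$$P(c_j \mid d_i; \hat{\theta}) = \frac{P(c_j \mid \hat{\theta}) \, P(d_i \mid c_j; \hat{\theta})}{P(d_i \mid \hat{\theta})} \approx \frac{P(c_j \mid \hat{\theta}) \prod_{t=1}^{|V|} P(w_t \mid c_j; \hat{\theta})^{N(w_t, d_i)}}{P(d_i \mid \hat{\theta})} \propto \frac{\log P(c_j \mid \hat{\theta})}{n} + \sum_{t=1}^{|V|} P(w_t \mid d_i; \hat{\theta}) \log \frac{P(w_t \mid c_j; \hat{\theta})}{P(w_t \mid d_i; \hat{\theta})} \qquad (11)$$

where n is the number of words in document $d_i$, $w_t$ is the t-th word in the vocabulary, and $N(w_t, d_i)$ is the frequency of word $w_t$ in document $d_i$.

Here, Laplace smoothing is used to estimate the probability of word $w_t$ in class $c_j$ and the probability of class $c_j$ as follows:

$$P(w_t \mid c_j; \hat{\theta}) = \frac{1 + N(w_t, G_{c_j})}{|V| + \sum_{t=1}^{|V|} N(w_t, G_{c_j})} \qquad (12)$$

$$P(c_j \mid \hat{\theta}) = \frac{1 + |G_{c_j}|}{|C| + \sum_{c_i} |G_{c_i}|} \qquad (13)$$

where $N(w_t, G_{c_j})$ is the count of the number of times word $w_t$ occurs in the context-cluster ($G_{c_j}$) of category $c_j$.

4 Using a Feature Projection Technique for Handling Noisy Data of Machine-labeled Data

We finally obtained labeled data at the document unit: machine-labeled data. Now we can learn text classifiers using them. But since the machine-labeled data are created by our method, they generally include far more incorrectly labeled documents than human-labeled data. Thus we employ a feature projection technique for our method. By the property of the feature projection technique, a classifier (the TCFP classifier) can be robust to noisy data (Ko and Seo, 2004). As seen in our experimental results, TCFP showed the highest performance among conventional classifiers when using machine-labeled data.

The TCFP classifier with robustness from noisy data

Here, we briefly describe the TCFP classifier using the feature projection technique (Ko and Seo, 2002; 2004). In this approach, the classification knowledge is represented as sets of projections of training data on each feature dimension. The classification of a test document is based on the voting of each feature of that test document. That is, the final prediction score is calculated by accumulating the voting scores of all features.

First of all, we must calculate the voting ratio of each category for all features.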
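Since elements with a high TF-IDF value in the projections of a feature must become more useful classification criteria for the feature, we use only elements with TF-IDF values above the average TF-IDF value for voting. The selected elements participate in proportional voting with the same importance as the TF-IDF value of each element. The voting ratio of each category $c_j$ in a feature $t_m$ is calculated by the following formula:

$$r(c_j, t_m) = \frac{\sum_{t_m(l) \in I_{t_m}} w(t_m, \vec{d}_l) \cdot y(c_j, t_m(l))}{\sum_{t_m(l) \in I_{t_m}} w(t_m, \vec{d}_l)} \qquad (14)$$

In formula 14, $w(t_m, \vec{d}_l)$ is the weight of term $t_m$ in document $d_l$, $I_{t_m}$ denotes the set of elements selected for voting, and $y(c_j, t_m(l)) \in \{0, 1\}$ is a function: if the category of an element $t_m(l)$ is equal to $c_j$, the output value is 1; otherwise, the output value is 0.

Next, since each feature votes separately on feature projections, contextual information is missing. Thus we calculate the co-occurrence frequency of features in the training data and modify the TF-IDF values of two terms $t_i$ and $t_j$ in a test document by the co-occurrence frequency between them; terms with a high co-occurrence frequency value have higher term weights.

Finally, the voting score of each category $c_j$ in the m-th feature $t_m$ of a test document d is calculated by the following formula:

$$vs(c_j, t_m) = tw(t_m, \vec{d}) \cdot r(c_j, t_m) \cdot \log\left(1 + \chi^2(t_m)\right) \qquad (15)$$

where $tw(t_m, \vec{d})$ denotes the term weight modified by the co-occurrence frequency and $\chi^2(t_m)$ denotes the calculated $\chi^2$ statistics value of $t_m$.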
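The voting in formulae 14 and 15 can be illustrated with a small Python sketch. It is a simplified illustration rather than the TCFP implementation: the co-occurrence adjustment of term weights is omitted, so the plain test-document weights stand in for the modified weights, and elements whose weight equals the average are kept so that a projection with uniform weights still votes. All function names and sample values are hypothetical.

```python
import math
from collections import defaultdict


def voting_ratios(projection):
    """Formula 14 for one feature.

    projection is a list of (weight, category) pairs, one per training
    document that contains the feature.
    """
    if not projection:
        return {}
    average = sum(w for w, _ in projection) / len(projection)
    # Assumption: elements at or above the average weight take part in voting.
    selected = [(w, c) for w, c in projection if w >= average]
    total = sum(w for w, _ in selected)
    if total == 0:
        return {}
    ratios = defaultdict(float)
    for w, c in selected:
        ratios[c] += w / total
    return dict(ratios)


def classify(doc_weights, ratios_by_feature, chi2):
    """Accumulate formula 15 over the features of a test document.

    doc_weights maps feature -> term weight in the test document (used here
    in place of the co-occurrence-modified weight, which this sketch omits).
    """
    votes = defaultdict(float)
    for feature, weight in doc_weights.items():
        scale = math.log(1.0 + chi2.get(feature, 0.0))
        for category, ratio in ratios_by_feature.get(feature, {}).items():
            votes[category] += weight * ratio * scale
    return max(votes, key=votes.get) if votes else None


# Hypothetical training projections and chi-square values.
projections = {
    "goal": [(0.9, "sports"), (0.8, "sports"), (0.2, "politics")],
    "vote": [(0.7, "politics"), (0.6, "politics"), (0.1, "sports")],
}
ratios = {f: voting_ratios(p) for f, p in projections.items()}
chi2 = {"goal": 10.0, "vote": 10.0}
print(classify({"goal": 0.5, "vote": 0.1}, ratios, chi2))
```

In this example the low-weight, possibly mislabeled elements (`0.2` for "goal", `0.1` for "vote") fall below the average and never vote, which is the mechanism the text credits for robustness to noisy labels.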
Table 2. The top micro-avg F1 scores and precision-recall breakeven points of each method.

Data set     (basis)   (NB)    (Rocchio)   (kNN)   (SVM)   (TCFP)
Newsgroups   79.36     83.46   83          79.95   8.49    86.19
WebKB        73.63     73.     75.8        68.04   73.74   75.47
Reuters      88.62     88.3    86.6        85.65   87.4    89.09

The outline of the TCFP classifier is as follows:

1. input: test document $\vec{d} = \langle t_1, t_2, \dots, t_n \rangle$
2. main process:
   For each feature $t_m$
       $tw(t_m, \vec{d})$ is calculated
   For each feature $t_m$
       For each category $c_j$
           vote[$c_j$] = vote[$c_j$] + $vs(c_j, t_m)$ by Formula 15
   prediction = $\arg\max_{c_j}$ vote[$c_j$]

5 Empirical Evaluation

5.1 Data Sets and Experimental Settings

To test our method, we used three different kinds of data sets: UseNet newsgroups (20 Newsgroups), web pages (WebKB), and newswire articles (Reuters 21578). For fair evaluation on Newsgroups and WebKB, we employed the five-fold cross-validation method.

The Newsgroups data set, collected by Ken Lang, contains about 20,000 articles evenly divided among 20 UseNet discussion groups (McCallum and Nigam, 1998). In this paper, we used only 16 categories after removing 4 categories: three miscellaneous categories (talk.politics.misc, talk.religion.misc, and comp.os.ms-windows.misc) and one category with a duplicate meaning (comp.sys.ibm.pc.hardware).

The second data set comes from the WebKB project at CMU (Craven et al., 2000). This data set contains web pages gathered from university computer science departments.

The Reuters 21578 Distribution 1.0 data set consists of 12,902 articles and 90 topic categories from the Reuters newswire. Like another study (Nigam, 2001), we used the ten most populous categories to identify the news topic.

About 5% of the documents from the training data of each data set are selected for a validation set. We applied a statistical feature selection method ($\chi^2$ statistics) in a preprocessing stage for each classifier (Yang and Pedersen, 1997).

As performance measures, we followed the standard definitions of recall, precision, and F1 measure. To evaluate average performance across categories, we used the micro-averaging method (Yang et al., 2002). Results on Reuters are reported as precision-recall breakeven points, which is a standard information retrieval measure for binary classification (Joachims, 1998).

Title words in our experiment are selected according to the category names of each data set (see Table 1 as an example).

5.2 Experimental Results

5.2.1 Observing the Performance According to the Number of Keywords

First of all, we determine the number of keywords in our method using the validation set.
The number of keywords is limited by the top m-th keyword from the ordered list of each category. Figure 1 displays the performance at different numbers of keywords (from 0 to …0) in each data set.

Figure 1. The comparison of performance according to the number of keywords

We empirically set the number of keywords to … in Newsgroups, 5 in WebKB, and 3 in Reuters. Generally, we recommend that the number of keywords be between … and 5.

5.2.2 Comparing our Method Using TCFP with those Using other Classifiers

In this section, we show the superiority of TCFP over the other classifiers (SVM, kNN, Naive Bayes (NB), Rocchio) on training data with much noise, such as machine-labeled data. As shown in Table 2, we obtained the best performance using TCFP on all three data sets.

Let us define the notations. (basis) denotes the Naive Bayes classifier using labeled contexts and (NB) denotes the Naive Bayes classifier using machine-labeled data as training data. The same notation is applied to the other classifiers.

(TCFP) achieved higher scores than (basis): by 6.83 in Newsgroups, 1.84 in WebKB, and 0.47 in Reuters.

5.2.3 Comparing with the Supervised Naive Bayes Classifier

For this experiment, we consider two possible cases for the labeling task. The first is to label a part of the collected documents and the second is to label all of them. For the first, we built a new training data set; it consists of 500 different documents randomly chosen from appropriate categories, as in the experiment in (Slonim et al., 2002). As a result, we report performances from two kinds of Naive Bayes classifiers, which are learned from 500 training documents and from the whole training documents respectively.

Table 3. The comparison of our method and the supervised NB classifier

Data set     (TCFP)   NB (500)   NB (All)
Newsgroups   86.19    7.68       9.7
WebKB        75.47    74.        85.9
Reuters      89.09    8.         91.64

In Table 3, the results of our method are higher than those of NB(500) and are comparable to those of NB(All) on all data sets. In particular, the result on Reuters came within 2.55 of that of NB(All), even though NB(All) used the whole labeled training data.

5.2.4 Enhancing our Method from Choosing Keywords by Human

The main problem of our method is that the performance depends on the quality of the keywords and title words. As we have seen in Table 3, we obtained the worst performance on the WebKB data set. In fact, the title words and keywords of each category in the WebKB data set also have high frequency in other categories. We think these factors contribute to the comparatively poor performance of our method. If keywords as well as title words are supplied by humans, our method may achieve higher performance. However, choosing the proper keywords for each category is a very difficult task. Moreover, keywords from developers who have insufficient knowledge about an application domain do not guarantee high performance. In order to overcome this problem, we propose a hybrid method for choosing keywords. That is, a developer obtains …0 candidate keywords from our keyword extraction method and then chooses proper keywords from them. Table 4 shows the results on the three data sets.

Table 4. The comparison of our method and enhancing method

Data set     (TCFP)   Enhancing (TCFP)   Improvement
Newsgroups   86.19    86.23              +0.04
WebKB        75.47    77.59              +2.12
Reuters      89.09    89.52              +0.43

As shown in Table 4, we achieved a significant improvement on the WebKB data set in particular.
Thus we find that the new method for choosing keywords is more useful in a domain with keywords that are confused between categories, such as the WebKB data set.

5.2.5 Comparing with a Clustering Technique

In related works, we presented two approaches using unlabeled data in text categorization: one approach combines unlabeled data and labeled data, and the other uses a clustering technique for text categorization. Since our method does not use any labeled data, it cannot be fairly compared with the former approach. Therefore, we compare our method with a clustering technique.

Slonim et al. (2002) proposed a new clustering algorithm (sIB) for unsupervised document classification and verified the superiority of this algorithm. In their experiments, the sIB algorithm was superior to other clustering algorithms. Using the same experimental settings as in Slonim's experiments, we verify that our method outperforms the sIB algorithm.

In our experiments, we used the micro-averaging precision as the performance measure and two revised data sets: revised_NG and revised_Reuters. These data sets were revised in the same way as in Slonim's paper, as follows.

In revised_NG, the categories of Newsgroups were united with respect to 10 meta-categories: five comp categories, three politics categories, two sports categories, three religions categories, and two transportation categories were merged into five big meta-categories.

The revised_Reuters used the 10 most frequent categories in the Reuters 21578 corpus under the ModApte split.

As shown in Table 5, our method scores 6.65 higher on revised_NG and 3.2 higher on revised_Reuters.

Table 5. The comparison of our method and sIB

Data set          sIB     (TCFP)   Improvement
revised_NG        79.5    86.15    +6.65
revised_Reuters   85.8    89       +3.2
6 Conclusions and Future Works

This paper has addressed a new unsupervised or semi-unsupervised text categorization method. Though our method uses only title words and unlabeled data, it shows performance reasonably comparable to that of the supervised Naive Bayes classifier. Moreover, it outperforms a clustering method, sIB. Labeled data are expensive while unlabeled data are inexpensive and plentiful. Therefore, our method is useful for low-cost text categorization. Furthermore, if some text categorization tasks require high accuracy, our method can be used as an assistant tool for easily creating labeled training data.

Since our method depends on title words and keywords, we need additional studies about the characteristics of candidate words for title words and keywords according to each data set.

Acknowledgement

This work was supported by grant No. R0-003-000-588-0 from the basic Research Program of the KOSEF.

References

K. Bennett and A. Demiriz, 1999, Semi-supervised Support Vector Machines, Advances in Neural Information Processing Systems 11, pp. 368-374.

E. Brill, 1995, Transformation-Based Error-driven Learning and Natural Language Processing: A Case Study in Part of Speech Tagging, Computational Linguistics, Vol. 21, No. 4.

K. Cho and J. Kim, 1997, Automatic Text Categorization on Hierarchical Category Structure by using ICF (Inverse Category Frequency) Weighting, In Proc. of KISS conference, pp. 507-50.

M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and S. Slattery, 2000, Learning to construct knowledge bases from the World Wide Web, Artificial Intelligence, 118(1-2), pp. 69-113.

T. Joachims, 1998, Text Categorization with Support Vector Machines: Learning with Many Relevant Features, In Proc. of ECML, pp. 137-142.

Y. Karov and S. Edelman, 1998, Similarity-based Word Sense Disambiguation, Computational Linguistics, Vol. 24, No. 1, pp. 41-60.

Y. Ko and J. Seo, 2000, Automatic Text Categorization by Unsupervised Learning, In Proc. of COLING 2000, pp. 453-459.

Y. Ko and J. Seo, 2002, Text Categorization using Feature Projections, In Proc. of COLING 2002, pp. 467-473.

Y. Ko and J. Seo, 2004, Using the Feature Projection Technique based on the Normalized Voting Method for Text Classification, Information Processing and Management, Vol. 40, No. 2, pp. 191-208.

D.D. Lewis, R.E. Schapire, J.P. Callan, and R. Papka, 1996, Training Algorithms for Linear Text Classifiers, In Proc. of SIGIR 96, pp. 89-97.

Y. Maarek, D. Berry, and G. Kaiser, 1991, An Information Retrieval Approach for Automatically Construction Software Libraries, IEEE Transaction on Software Engineering, Vol. 17, No. 8, pp. 800-813.

A. McCallum and K. Nigam, 1998, A Comparison of Event Models for Naive Bayes Text Classification, AAAI 98 workshop on Learning for Text Categorization, pp. 41-48.

K. P. Nigam, A. McCallum, S. Thrun, and T. Mitchell, 1998, Learning to Classify Text from Labeled and Unlabeled Documents, In Proc. of AAAI-98.

K. P. Nigam, 2001, Using Unlabeled Data to Improve Text Classification, The dissertation for the degree of Doctor of Philosophy.

N. Slonim, N. Friedman, and N. Tishby, 2002, Unsupervised Document Classification using Sequential Information Maximization, In Proc. of SIGIR 02, pp. 129-136.

Y. Yang and J. P. Pedersen, 1997, Feature selection in statistical learning of text categorization, In Proc. of ICML 97, pp. 412-420.

Y. Yang, S. Slattery, and R. Ghani, 2002, A study of approaches to hypertext categorization, Journal of Intelligent Information Systems, Vol. 18, No. 2.