Inductive Learning Algorithms and Representations for Text Categorization

Susan Dumais, Microsoft Research, One Microsoft Way, Redmond, WA 98052, sdumais@microsoft.com
John Platt, Microsoft Research, One Microsoft Way, Redmond, WA 98052, jplatt@microsoft.com
David Heckerman, Microsoft Research, One Microsoft Way, Redmond, WA 98052, heckerma@microsoft.com
Mehran Sahami, Computer Science Department, Stanford University, Stanford, CA 94305-9010, sahami@cs.stanford.edu

1. ABSTRACT
Text categorization (the assignment of natural language texts to one or more predefined categories based on their content) is an important component in many information organization and management tasks. We compare the effectiveness of five different automatic learning algorithms for text categorization in terms of learning speed, real-time classification speed, and classification accuracy. We also examine training set size, and alternative document representations. Very accurate text classifiers can be learned automatically from training examples. Linear Support Vector Machines (SVMs) are particularly promising because they are very accurate, quick to train, and quick to evaluate.

1.1 Keywords
Text categorization, classification, support vector machines, machine learning, information management.

2. INTRODUCTION
As the volume of information available on the Internet and corporate intranets continues to increase, there is growing interest in helping people better find, filter, and manage these resources. Text categorization (the assignment of natural language texts to one or more predefined categories based on their content) is an important component in many information organization and management tasks. Its most widespread application to date has been for assigning subject categories to documents to support text retrieval, routing and filtering. Automatic text categorization can play an important role in a wide variety of more flexible, dynamic and personalized information management tasks as well: real-time sorting of email or files into folder hierarchies; topic identification to support topic-specific processing operations; structured search and/or browsing; or finding documents that match long-term standing interests or more dynamic task-based interests.
Classification technologies should be able to support category structures that are very general, consistent across individuals, and relatively static (e.g., Dewey Decimal or Library of Congress classification systems, Medical Subject Headings (MeSH), or Yahoo!'s topic hierarchy), as well as those that are more dynamic and customized to individual interests or tasks (e.g., email about the CIKM conference). In many contexts (Dewey, MeSH, Yahoo!, CyberPatrol), trained professionals are employed to categorize new items. This process is very time-consuming and costly, thus limiting its applicability. Consequently there is increased interest in developing technologies for automatic text categorization. Rule-based approaches similar to those used in expert systems are common (e.g., Hayes and
Weinstein's CONSTRUE system for classifying Reuters news stories, 1990), but they generally require manual construction of the rules, make rigid binary decisions about category membership, and are typically difficult to modify. Another strategy is to use inductive learning techniques to automatically construct classifiers using labeled training data. Text classification poses many challenges for inductive learning methods since there can be millions of word features. The resulting classifiers, however, have many advantages: they are easy to construct and update, they depend only on information that is easy for people to provide (i.e., examples of items that are in or out of categories), they can be customized to specific categories of interest to individuals, and they allow users to smoothly trade off precision and recall depending on the task. A growing number of statistical classification and machine learning techniques have been applied to text categorization, including multivariate regression models (Fuhr et al., 1991; Yang and Chute, 1994; Schütze et al., 1995), nearest neighbor classifiers (Yang, 1994), probabilistic Bayesian models (Lewis and Ringuette, 1994), decision trees (Lewis and Ringuette, 1994), neural networks (Wiener et al., 1995; Schütze et al., 1995), and symbolic rule learning (Apte et al., 1994; Cohen and Singer, 1996). More recently, Joachims (1998) has explored the use of Support Vector Machines (SVMs) for text classification with promising results. In this paper we describe results from experiments using a collection of hand-tagged financial newswire stories from Reuters. We use supervised learning methods to build our classifiers, and evaluate the resulting models on new test cases. The focus of our work has been on comparing the effectiveness of different inductive learning algorithms (Find Similar, Naïve Bayes, Bayesian Networks, Decision Trees, and Support Vector Machines) in terms of learning speed, real-time classification speed, and classification accuracy. We also explored alternative document representations (words vs. syntactic phrases, and binary vs. non-binary features), and training set size.

3. INDUCTIVE LEARNING METHODS

3.1 Classifiers
A classifier is a function that maps an input attribute vector, x = (x_1, x_2, x_3, ..., x_n), to a confidence that the input belongs to a class, that is, f(x) = confidence(class). In the case of text classification, the attributes are words in the document and the classes correspond to text categories (e.g., typical Reuters categories include acquisitions, earnings, interest). Examples of classifiers for the Reuters category interest include:

- if (interest AND rate) OR (quarterly), then confidence(interest category) = 0.9
- confidence(interest category) = 0.3*interest + 0.4*rate + 0.7*quarterly

Some of the classifiers that we consider (decision trees, naïve-Bayes classifier, and Bayes nets) are probabilistic in the sense that confidence(class) is a probability distribution.

3.2 Inductive Learning of Classifiers
Our goal is to learn classifiers like these using inductive learning methods. In this paper we compared five learning methods:

- Find Similar (a variant of Rocchio's method for relevance feedback)
- Decision Trees
- Naïve Bayes
- Bayes Nets
- Support Vector Machines (SVMs)

We describe these different models in detail in section 3.4. All methods require only a small amount of labeled training data (i.e., examples of items in each category) as input. This training data is used to learn parameters of the classification model. In the testing or evaluation phase, the effectiveness of the model is tested on previously unseen instances. Learned classifiers are easy to construct and update. They require only subject knowledge ("I know it when I see it") and not programming or rule-writing skills. Inductively learned classifiers make it easy for users to customize category definitions, which is important for some applications. In addition, all the learning methods we looked at provide graded estimates of category membership, allowing for tradeoffs between precision and recall, depending on the task.

3.3 Text Representation and Feature Selection
Each document is represented as a vector of words, as is typically done in the popular vector representation for information retrieval (Salton & McGill, 1983). For the Find Similar algorithm, tf*idf term weights are computed and all features are used.
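As a minimal illustration of the vector representation with tf*idf weighting, the sketch below assumes one common variant (raw term frequency times log inverse document frequency); the paper does not specify which variant its indexer computes, and the function name is hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build tf*idf vectors for a list of tokenized documents.

    Uses one common weighting: tf(w, d) * log(N / df(w)), where df(w) is
    the number of documents containing w.  This is an illustrative
    variant, not necessarily the one used in the paper.
    """
    n_docs = len(docs)
    df = Counter()                       # document frequency per word
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)                # term frequency in this document
        vectors.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return vectors
```

Words that occur in every document get weight log(1) = 0, so very common words contribute nothing to the similarity comparisons.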
For the other learning algorithms, the feature space is reduced substantially (as described below) and only binary feature values are used: a word either occurs or does not occur in a document. For reasons of both efficiency and efficacy, feature selection is widely used when applying machine learning methods to text categorization. To reduce the number of features, we first remove features based on overall frequency counts, and then select a small number of features based on their fit to categories. Yang and Pedersen (1997) compare a number of methods for feature selection. We used the mutual information measure. The mutual information MI(x, c) between a feature, x, and a category, c, is defined as:
MI(x, c) = Σ_{x ∈ {0,1}} Σ_{c ∈ {0,1}} P(x, c) log [ P(x, c) / (P(x) P(c)) ]

We select the k features for which mutual information is largest for each category. These features are used as input to the various inductive learning algorithms. For the SVM and decision-tree methods we used k=300, and for the remaining methods we used k=50. We did not rigorously explore the optimum number of features for this problem, but these numbers provided good results on a training validation set so they were used for testing.

3.4 Inductive Learning of Classifiers

3.4.1 Find Similar
Our Find Similar method is a variant of Rocchio's method for relevance feedback (Rocchio, 1971), which is a popular method for expanding user queries on the basis of relevance judgements. In Rocchio's formulation, the weight assigned to a term is a combination of its weight in an original query, and judged relevant and non-relevant documents:

x_j = α·x_{q,j} + β·( Σ_{i ∈ rel} x_{i,j} ) / n + γ·( Σ_{i ∈ non-rel} x_{i,j} ) / (N - n)

where n of the N judged documents are relevant. The parameters α, β, and γ control the relative importance of the original query vector, the positive examples and the negative examples. In the context of text classification, there is no initial query, so α=0. We also set γ=0 so we could easily use available code. Thus, for our Find Similar method the weight of each term is simply the average (or centroid) of its weights in positive instances of the category. There is no explicit error minimization involved in computing the Find Similar weights. Thus, there is no learning time so to speak, except for taking the sum of weights from positive examples of each category. Test instances are classified by comparing them to the category centroids using the Jaccard similarity measure. If the score exceeds a threshold, the item is classified as belonging to the category.

3.4.2 Decision Trees
A decision tree was constructed for each category using the approach described by Chickering et al. (1997). The decision trees were grown by recursive greedy splitting, and splits were chosen using the Bayesian posterior probability of model structure. We used a structure prior that penalized each additional parameter with probability 0.1, and derived parameter priors from a prior network as described in Chickering et al. (1997) with an equivalent sample size of 10.
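The mutual-information criterion of section 3.3 can be sketched as follows. This is a minimal illustration, not the authors' code; it assumes binary feature columns and binary category labels, and the helper names are hypothetical.

```python
import math

def mutual_information(feature_col, labels):
    """MI(x, c) for a binary feature against a binary category label:
    the sum over x, c in {0, 1} of P(x, c) * log(P(x, c) / (P(x) P(c)))."""
    n = len(labels)
    mi = 0.0
    for x in (0, 1):
        for c in (0, 1):
            p_xc = sum(1 for f, l in zip(feature_col, labels)
                       if f == x and l == c) / n
            p_x = sum(1 for f in feature_col if f == x) / n
            p_c = sum(1 for l in labels if l == c) / n
            if p_xc > 0:                 # 0 * log(0) is taken as 0
                mi += p_xc * math.log(p_xc / (p_x * p_c))
    return mi

def select_features(doc_word_matrix, labels, k):
    """Pick the k column indices with largest mutual information."""
    scores = [(mutual_information([row[j] for row in doc_word_matrix], labels), j)
              for j in range(len(doc_word_matrix[0]))]
    return [j for _, j in sorted(scores, reverse=True)[:k]]
```

A feature that perfectly predicts the category scores MI = log 2, while a feature independent of the category scores 0, so ranking by MI favors the most category-specific words.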
In the decision trees, a class probability rather than a binary decision is retained at each node.

3.4.3 Naïve Bayes
A naïve-Bayes classifier is constructed by using the training data to estimate the probability of each category given the document feature values of a new instance. We use Bayes' theorem to estimate the probabilities:

P(C = c_k | x) = P(x | C = c_k) P(C = c_k) / P(x)

The quantity P(x | C = c_k) is often impractical to compute without simplifying assumptions. For the Naïve Bayes classifier (Good, 1965), we assume that the features X_1, ..., X_n are conditionally independent, given the category variable C. This simplifies the computations, yielding:

P(x | C = c_k) = Π_i P(x_i | C = c_k)

Despite the fact that the assumption of conditional independence is generally not true for word appearance in documents, the Naïve Bayes classifier is surprisingly effective.

3.4.4 Bayes Nets
More recently, there has been interest in learning more expressive Bayesian networks (Heckerman et al., 1995) as well as methods for learning networks specifically for classification (Sahami, 1996). Sahami, for example, allows for a limited form of dependence between feature variables, thus relaxing the very restrictive assumptions of the Naïve Bayes classifier. We used a 2-dependence Bayesian classifier that allows the probability of each feature x_i to be directly influenced by the appearance/non-appearance of at most two other features.

3.4.5 Support Vector Machines (SVMs)
Vapnik proposed Support Vector Machines (SVMs) in 1979 (Vapnik, 1995), but they have only recently been gaining popularity in the learning community. In its simplest linear form, an SVM is a hyperplane that separates a set of positive examples from a set of negative examples with maximum margin (see Figure 1).

[Figure 1: A linear Support Vector Machine. The hyperplane with normal vector w separates the positive from the negative examples; the margin is set by the nearest examples on each side, the support vectors.]

The formula for the output of a linear SVM is u = w·x - b, where w is the normal vector to the hyperplane, and x is the input vector. In the linear case, the margin is defined by the distance of the hyperplane to the nearest of the positive and negative
examples. Maximizing the margin can be expressed as an optimization problem:

minimize (1/2) ||w||^2  subject to  y_i (w·x_i - b) >= 1, for all i

where x_i is the i-th training example and y_i is the correct output of the SVM for the i-th training example. Of course, not all problems are linearly separable. Cortes and Vapnik (1995) proposed a modification to the optimization formulation that allows, but penalizes, examples that fall on the wrong side of the decision boundary. Additional extensions to non-linear classifiers were described by Boser et al. in 1992. SVMs have been shown to yield good generalization performance on a wide variety of classification problems, including: handwritten character recognition (LeCun et al., 1995), face detection (Osuna et al., 1997) and most recently text categorization (Joachims, 1998). We used the simplest linear version of the SVM because it provided good classification accuracy, is fast to learn and fast for classifying new instances. Training an SVM requires the solution of a quadratic programming (QP) problem. Any QP optimization method can be used to learn the weights, w, on the basis of training examples. However, many QP methods can be very slow for large problems such as text categorization. We used a new and very fast method developed by Platt (1998) which breaks the large QP problem down into a series of small QP problems that can be solved analytically. Additional improvements can be realized because the training sets used for text classification are sparse and binary. Once the weights are learned, new items are classified by computing w·x, where w is the vector of learned weights, and x is the binary vector representing the new document to classify. After training the SVM, we fit a sigmoid to the output of the SVM using regularized maximum likelihood fitting, so that the SVM can produce posterior probabilities that are directly comparable between categories.

4. REUTERS DATA SET

4.1 Reuters-21578 (ModApte split)
We used the new version of Reuters, the so-called Reuters-21578 collection. (This collection is publicly available at http://www.research.att.com/~lewis/reuters21578.html.)
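As a concrete illustration of the linear SVM classification step of section 3.4.5, the sketch below uses a few of the interest-category weights reported later in section 5.3; the sigmoid parameters are placeholders, since the fitted values are not given in the paper.

```python
import math

# A small subset of the weights the paper reports for the Reuters
# category "interest" (the full classifier uses 300 binary features).
INTEREST_WEIGHTS = {"prime": 0.70, "rate": 0.67, "interest": 0.63,
                    "rates": 0.60, "discount": 0.46, "dlrs": -0.71}

def svm_score(weights, doc_words, b=0.0):
    """Linear SVM output u = w.x - b; with binary features this is just
    the sum of the learned weights for words present in the document."""
    return sum(weights.get(w, 0.0) for w in set(doc_words)) - b

def svm_posterior(u, a=-1.0, c=0.0):
    """Map the raw SVM output through a sigmoid, P = 1/(1 + exp(a*u + c)).
    The paper fits the sigmoid by regularized maximum likelihood; the
    parameter values here are placeholders for illustration only."""
    return 1.0 / (1.0 + math.exp(a * u + c))
```

For a story containing "prime" and "rate", the raw score is 0.70 + 0.67 = 1.37, and the sigmoid turns such scores into probabilities that can be compared across categories.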
We used the 12,902 stories that had been classified into 118 categories (e.g., corporate acquisitions, earnings, money market, grain, and interest). The stories average about 200 words in length. We followed the ModApte split in which 75% of the stories (9603 stories) are used to build classifiers and the remaining 25% (3299 stories) to test the accuracy of the resulting models in reproducing the manual category assignments. The stories are split temporally, so the training items all occur before the test items. The mean number of categories assigned to a story is 1.2, but many stories are not assigned to any of the 118 categories, and some stories are assigned to 12 categories. The number of stories in each category varied widely as well, ranging from "earnings", which contains 3964 documents, to "castor-oil", which contains only one test document. Table 1 shows the ten most frequent categories along with the number of training and test examples in each. These 10 categories account for 75% of the training instances, with the remainder distributed among the other 108 categories.

Category Name    Num Train    Num Test
Earn             2877         1087
Acquisitions     1650         719
Money-fx         538          179
Grain            433          149
Crude            389          189
Trade            369          117
Interest         347          131
Ship             197          89
Wheat            212          71
Corn             181          56

Table 1: Number of Training/Test Items

4.2 Summary of Inductive Learning Process for Reuters
Figure 2 summarizes the process we use for testing the various learning algorithms. Text files are processed using Microsoft's Index Server. All features are saved along with the tf*idf weights. We distinguished between words occurring in the Title and Body of the stories. For the Find Similar method, similarity is computed between test examples and category centroids using all these features. For all other methods, we reduce the feature space by eliminating words that appear in only a single document (hapax legomena), then selecting the k words with highest mutual information with each category. These k-element binary feature vectors are used as input to four different learning algorithms. For SVMs and decision trees k=300, and for the other methods, k=50.
[Figure 2: Schematic of the learning process. Text files are indexed by Index Server into word counts per file; feature selection produces the data set used by the learning methods (Find Similar, Decision Tree, Naïve Bayes, Bayes nets, Support Vector Machine), whose output classifiers are then tested.]
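Among the learning methods in Figure 2, Find Similar (section 3.4.1) is simple enough to sketch directly. This is a minimal illustration assuming tf*idf vectors stored as dicts; the paper does not specify which weighted form of the Jaccard measure it uses, so a common generalized form is shown.

```python
def centroid(positive_vectors):
    """Find Similar weights: the average of each term's weight over the
    positive training instances of the category (alpha = gamma = 0)."""
    total = {}
    for vec in positive_vectors:
        for term, w in vec.items():
            total[term] = total.get(term, 0.0) + w
    n = len(positive_vectors)
    return {term: w / n for term, w in total.items()}

def jaccard_sim(a, b):
    """Generalized Jaccard similarity, sum(min)/sum(max) over all terms.
    This weighted form is one common choice; the paper's exact variant
    is not specified."""
    terms = set(a) | set(b)
    num = sum(min(a.get(t, 0.0), b.get(t, 0.0)) for t in terms)
    den = sum(max(a.get(t, 0.0), b.get(t, 0.0)) for t in terms)
    return num / den if den else 0.0
```

A test story is then scored against each category centroid with jaccard_sim, and assigned to the category when the score exceeds that category's threshold.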
A separate classifier is learned for each category. New instances are classified by computing a score and comparing the score with a learned threshold. New instances exceeding the threshold are said to belong to the category. As already mentioned, all classifiers output a graded measure of category membership, so different thresholds can be set to favor precision or recall depending on the application; for Reuters we optimized the average of precision and recall (details below). All model parameters and thresholds are set to optimize performance on a validation set and are not modified during testing. For Reuters, the training set contains 9603 stories and the test set 3299 stories. In order to decide which models to use we performed initial experiments on a subset of the training data, which we subdivided into 7147 training stories and 2456 validation stories for this purpose. We used this to set the number of features (k), decision thresholds and document representations to use for the final runs. We estimated parameters for these chosen models using the full 9603 training stories and evaluated performance on the 3299 test items. We did not further optimize performance by tuning parameters to achieve optimal performance on the test set.

5. RESULTS

5.1 Training Time
Training times for the 9603 training examples vary substantially across methods. We tested these algorithms on a 266MHz Pentium II running Windows NT. Unless otherwise noted, times are for the 10 largest categories, because they take longest to learn. Find Similar is the fastest learning method (<1 CPU sec/category) because there is no explicit error minimization. The linear SVM is the next fastest (<2 CPU secs/category). These are both substantially faster than Naïve Bayes (8 CPU secs/category), Bayes Nets (~145 CPU secs/category) or Decision Trees (~70 CPU secs/category). In general, performing the mutual-information feature-extraction step takes much more time than any of the inductive learning algorithms. The linear SVM with SMO, for example, takes an average of 0.26 CPU seconds to train a category when averaged over all 118 Reuters categories.
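The per-category threshold setting described above can be sketched as follows: a minimal illustration over validation-set scores, assuming candidate thresholds are taken from the observed scores (the function names are hypothetical).

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when items scoring at or above the threshold
    are assigned to the category."""
    tp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l)
    fp = sum(1 for s, l in zip(scores, labels) if s >= threshold and not l)
    fn = sum(1 for s, l in zip(scores, labels) if s < threshold and l)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r

def pick_threshold(scores, labels):
    """Choose the candidate threshold maximizing (precision + recall) / 2,
    mirroring the validation-set criterion used for Reuters."""
    best_t, best_avg = None, -1.0
    for t in sorted(set(scores)):
        p, r = precision_recall(scores, labels, t)
        if (p + r) / 2 > best_avg:
            best_t, best_avg = t, (p + r) / 2
    return best_t
```

Shifting the chosen threshold up trades recall for precision, and down trades precision for recall, which is the tradeoff discussed above.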
The training speeds for the SVM are particularly impressive, since training speed has been a barrier to its widespread applicability for large problems. Platt's SMO algorithm is roughly 30 times faster than the popular chunking algorithm on the Reuters data set (Vapnik, 1995).

5.2 Classification Speed for New Instances
In many applications, it is important to quickly classify new instances. All of the classifiers we explored are very fast in this regard: all require less than 2 msec to determine if a new document should be assigned to a particular category. Far more time is spent in pre-processing the text to extract even simple words than is spent in categorization. With the SVM model, for example, we need only compute w·x, where w is the vector of learned weights, and x is the feature vector for the new instance. Since features are binary, this is just the sum of up to 300 numbers.

5.3 Classification Accuracy
Many evaluation criteria for classification have been proposed. The most popular measures are based on precision and recall. Precision is the proportion of items placed in the category that are really in the category, and Recall is the proportion of items in the category that are actually placed in the category. We report the average of precision and recall (the so-called breakeven point) for comparability to earlier results in text classification. In addition, we plot precision as a function of recall in order to understand the relationship among methods at different points along this curve. Table 2 summarizes microaveraged breakeven performance for the 5 different learning algorithms for the 10 most frequent categories as well as the overall score for all 118 categories. Support Vector Machines were the most accurate method, averaging 92% for the 10 most frequent categories and 87% over all 118 categories. Accuracy for Decision Trees was 3.6% lower, averaging 88.4% for the 10 most frequent categories. Bayes Nets provided some performance improvement over Naïve Bayes as expected, but the advantages were rather small.
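The micro-averaged figures reported in this section pool the binary decisions across categories before computing precision and recall. The paper does not spell the computation out, so the sketch below shows the standard definition of micro-averaging.

```python
def microaverage(per_category_counts):
    """Micro-averaged precision and recall: sum true positives, false
    positives, and false negatives over all categories, then divide.
    Each entry is a dict with keys "tp", "fp", and "fn"."""
    tp = sum(c["tp"] for c in per_category_counts)
    fp = sum(c["fp"] for c in per_category_counts)
    fn = sum(c["fn"] for c in per_category_counts)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Because the counts are pooled before dividing, micro-averaging weights frequent categories (such as earnings) more heavily than rare ones.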
As has previously been reported, all the more advanced learning algorithms increase performance by 15-20% compared with Rocchio-style query expansion (Find Similar).

            Findsim  NBayes  BayesNets  Trees   LinearSVM
earn        92.9%    95.9%   95.8%      97.8%   98.0%
acq         64.7%    87.8%   88.3%      89.7%   93.6%
money-fx    46.7%    56.6%   58.8%      66.2%   74.5%
grain       67.5%    78.8%   81.4%      85.0%   94.6%
crude       70.1%    79.5%   79.6%      85.0%   88.9%
trade       65.1%    63.9%   69.0%      72.5%   75.9%
interest    63.4%    64.9%   71.3%      67.1%   77.7%
ship        49.2%    85.4%   84.4%      74.2%   85.6%
wheat       68.9%    69.7%   82.7%      92.5%   91.8%
corn        48.2%    65.3%   76.4%      91.8%   90.3%
Avg Top 10  64.6%    81.5%   85.0%      88.4%   92.0%
Avg All Cat 61.7%    75.2%   80.0%      N/A     87.0%

Table 2: Breakeven performance for the 10 largest categories, and over all 118 categories.

Both SVMs and Decision Trees produce very high overall classification accuracy, and are among the best known results for this test collection. Most previous results have used the older Reuters collection, so it is difficult to compare precisely, but 85% is the best micro-averaged breakeven point previously reported (Yang, 1997). Joachims (1998) used the new collection, and our SVM results are more accurate (87% for our linear SVM vs. 84.2% for Joachims' linear SVM and 86.5% for his radial basis function network with gamma = 0.8) and far more efficient for both initial model learning and for real-time classification of new instances. It is also worth noting that Joachims chose optimal parameters based on the test data and used only the 90 categories that have at least one training and test item, and our results would improve some if we did the same. Apte et al. (1998) have recently reported accuracies slightly better than ours (87.8%) for a system with 100 decision trees. The approach involves learning many decision trees using an adaptive resampling approach (boosting) and is much more complex to learn than our one simple linear classifier. The 92% breakeven point (for the top 10 categories) corresponds roughly to 92% precision at 92% recall. Note, however, that the decision threshold can be varied to produce higher precision (at the cost of lower recall), or higher recall (at the cost of lower precision), as appropriate for different applications. A user would be quite happy with 92% precision for information discovery tasks, but might want additional human confirmation before deleting important email messages with this level of accuracy. Figure 3 shows a representative precision-recall curve for the category grain. The advantages of SVM can be seen over the entire recall-precision space.

[Figure 3: Precision-recall curves for the category "grain", comparing LSVM, Decision Tree, Naïve Bayes, and Find Similar.]

Although we have not conducted any formal tests, the learned classifiers appear to be intuitively reasonable.
For example, the SVM representation for the category interest includes the words prime (.70), rate (.67), interest (.63), rates (.60), and discount (.46) with large positive weights, and the words group (-.24), year (-.25), sees (-.33), world (-.35), and dlrs (-.71) with large negative weights.

5.4 Other Experiments

5.4.1 Sample Size
For an application like Reuters, it is easy to imagine developing a large training corpus of the sort we worked with (e.g., a few categories had more than 1000 positive training instances). For other applications, training data may be much harder to come by. For this reason we examined how many positive training examples were necessary to provide good generalization performance. We looked at performance for the 10 most frequent categories, varying the number of positive instances but keeping the negative data the same. For the linear SVM, using 100% of the training data (7147 stories), the micro-averaged breakeven point is 92%. For smaller training sets we took multiple random samples and report the average score. Using only 10% of the training data, performance is 89.6%; with a 5% sample, 86.2%; and with a 1% sample, 72.6%. When we get down to a training set with only 1% of the positive examples, most of the categories have fewer than 5 training instances, resulting in somewhat unstable performance for some categories. In general, having 20 or more training instances provides stable generalization performance. While the number of examples needed per category will vary across applications, we find these results encouraging. In addition, it is important to note that in most categorization scenarios, the distribution of instances varies tremendously across categories: some categories will have hundreds or thousands of instances, and others only a few (a kind of Zipf's law for category size). In such cases, the most popular categories will quickly receive the necessary number of training examples in the normal course of operation.

5.4.2 Simple words vs. NLP-derived phrases
For all the results reported so far, we simply used the default pre-processing provided by Microsoft's Index Server, resulting in single words as index terms. We wanted to explore how NLP analyses might improve classification accuracy.
For example, the phrase "interest rate" is more predictive of the Reuters category interest than is either the word "interest" or "rate". We used NLP analyses in a very simple fashion to aid in the extraction of richer phrases for indexing (see Lewis and Sparck Jones, 1996 for an overview of related NLP issues). We considered:

- factoids (e.g., Salomon_Brothers_International, April_8)
- multi-word dictionary entries (e.g., New_York, interest_rate)
- noun phrases (e.g., first_quarter, modest_growth)
As before, we used tf*idf weights for Find Similar and the mutual information criterion for selecting features for Naïve Bayes and SVMs. Unfortunately, the NLP-derived phrases did not improve classification accuracy. For the SVM, the NLP features actually reduced performance on the 118 categories by 0.2%. Because of these initial results, we did not try the NLP-derived phrases for Decision Trees or the more complex 2-dependence Bayesian network, or use NLP features in any of the final evaluations.

5.4.3 Binary vs. 0/1/2 features
We also looked at whether moving to a richer representation than binary features would improve categorization accuracy. To this end, we considered a representation that encoded words as appearing 0, 1, or >=2 times in each document. Initial results using this representation with Decision Tree classifiers did not yield improved performance, so we did not pursue this further.

6. SUMMARY
Very accurate text classifiers can be learned automatically from training examples, as others have shown. The accuracy of our simple linear SVM is among the best reported for the Reuters-21578 collection. In addition, the model is very simple (300 binary features per category), and Platt's SMO training method for SVMs provides a very efficient method for learning the classifier: at least 30 times faster than the chunking method for QP, and 35 times faster than the next most accurate classifier (Decision Trees) we examined. Classification of new items is fast as well since we need only compute the sum of the learned weights for features in the test items. We found that the simplest document representation (using individual words delimited by white space with no stemming) was at least as good as representations involving more complicated syntactic and morphological analysis. And, representing documents as binary vectors of words, chosen using a mutual information criterion for each category, was as good as finer-grained coding (at least for Decision Trees). Joachims' (1998) work is similar to ours in its use of SVMs for the purpose of text categorization. Our results are somewhat more accurate than his but, more importantly, based on a much simpler and more efficient model.
Joachims' best results are obtained using a non-linear radial basis function of 9962 real-valued input features (based on the popular tf*idf term weights). In contrast, we use a single linear function of 300 binary features per category. SVMs work well because they create a classifier which maximizes the margin between positive and negative examples. Other algorithms, such as boosting (Schapire et al., 1998), have been shown to maximize margin and are also very effective at text categorization. We have also used SVMs for categorizing email messages and Web pages with results comparable to those reported here: SVMs are the most accurate classifier and the fastest to train. We hope to extend the text representation models to include additional structural information about documents, as well as knowledge-based features which have been shown to provide substantial improvements in classification accuracy (Sahami et al., 1998). Finally, we will look at extending this work to automatically classify items into hierarchical category structures. We believe that inductive learning methods like the ones we have described can be used to support flexible, dynamic, and personalized information access and management in a wide variety of tasks. Linear SVMs are particularly promising since they are both very accurate and fast.

7. REFERENCES
[1] Apte, C., Damerau, F. and Weiss, S. Automated learning of decision rules for text categorization. ACM Transactions on Information Systems, 12(3), 233-251, 1994.
[2] Apte, C., Damerau, F. and Weiss, S. Text mining with decision rules and decision trees. Proceedings of the Conference on Automated Learning and Discovery, CMU, June, 1998.
[3] Boser, B. E., Guyon, I. M., and Vapnik, V. A training algorithm for optimal margin classifiers. Fifth Annual Workshop on Computational Learning Theory, ACM, 1992.
[4] Chickering, D., Heckerman, D., and Meek, C. A Bayesian approach for learning Bayesian networks with local structure. In Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence, 1997.
[5] Cohen, W.W. and Singer, Y.
Context-sensitive learning methods for text categorization. In SIGIR '96: Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 307-315, 1996.
[6] Cortes, C., and Vapnik, V. Support vector networks. Machine Learning, 20, 273-297, 1995.
[7] Fuhr, N., Hartmanna, S., Lustig, G., Schwantner, M., and Tzeras, K. AIR/X - a rule-based multi-stage indexing system for large subject fields. In Proceedings of RIAO '91, 606-623, 1991.
[8] Good, I.J. The Estimation of Probabilities: An Essay on Modern Bayesian Methods. MIT Press, 1965.
[9] Hayes, P.J. and Weinstein, S.P. CONSTRUE/TIS: A system for content-based indexing of a database of news stories. In Second Annual Conference on Innovative Applications of Artificial Intelligence, 1990.
[10] Heckerman, D., Geiger, D. and Chickering, D.M. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20, 131-163, 1995.
[11] Joachims, T. Text categorization with support vector machines: Learning with many relevant features. In Proceedings of the 10th European Conference on Machine Learning (ECML), Springer Verlag, 1998. http://www-ai.cs.uni-dortmund.de/dokumente/joachims_97a.ps.gz
[12] LeCun, Y., Jackel, L. D., Bottou, L., Cortes, C., Denker, J. S., Drucker, H., Guyon, I., Muller, U. A., Sackinger, E., Simard, P. and Vapnik, V. Learning algorithms for classification: A comparison on handwritten digit recognition. Neural Networks: The Statistical Mechanics Perspective, 261-276, 1995.
[13] Lewis, D.D. An evaluation of phrasal and clustered representations on a text categorization task. In SIGIR '92: Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 37-50, 1992.
[14] Lewis, D.D. and Hayes, P.J. (Eds.) ACM Transactions on Information Systems, Special Issue on Text Categorization, 12(3), 1994.
[15] Lewis, D.D. and Ringuette, M. A comparison of two learning algorithms for text categorization. In Third Annual Symposium on Document Analysis and Information Retrieval, 81-93, 1994.
[16] Lewis, D.D. and Sparck Jones, K. Natural language processing for information retrieval. Communications of the ACM, 39(1), 92-101, January 1996.
[17] Lewis, D.D., Schapire, R., Callan, J.P., and Papka, R. Training algorithms for linear text classifiers. In SIGIR '96: Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 298-306, 1996.
[18] Osuna, E., Freund, R., and Girosi, F. Training support vector machines: An application to face detection. In Proceedings of Computer Vision and Pattern Recognition '97, 130-136, 1997.
[19] Platt, J. Fast training of SVMs using sequential minimal optimization. To appear in: B. Scholkopf, C. Burges, and A. Smola (Eds.), Advances in Kernel Methods: Support Vector Learning, MIT Press, 1998.
[20] Rocchio, J.J. Jr. Relevance feedback in information retrieval. In G. Salton (Ed.), The SMART Retrieval System: Experiments in Automatic Document Processing, 313-323. Prentice Hall, 1971.
[21] Sahami, M. Learning limited dependence Bayesian classifiers.
In KDD-96: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, 335-338, AAAI Press, 1996. http://robotics.stanford.edu/users/sahami/papers-dir/kdd96-learn-bn.ps
[22] Sahami, M., Dumais, S., Heckerman, D., Horvitz, E. A Bayesian approach to filtering junk e-mail. AAAI '98 Workshop on Text Categorization, July 1998. http://robotics.stanford.edu/users/sahami/papers-dir/spam.ps
[23] Salton, G. and McGill, M. Introduction to Modern Information Retrieval. McGraw Hill, 1983.
[24] Schapire, R., Freund, Y., Bartlett, P. and Lee, W. S. Boosting the margin: A new explanation for the effectiveness of voting methods. Annals of Statistics, to appear, 1998.
[25] Schütze, H., Hull, D. and Pedersen, J.O. A comparison of classifiers and document representations for the routing problem. In SIGIR '95: Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 229-237, 1995.
[26] Vapnik, V. The Nature of Statistical Learning Theory. Springer-Verlag, 1995.
[27] Wiener, E., Pedersen, J.O. and Weigend, A.S. A neural network approach to topic spotting. In Proceedings of the Fourth Annual Symposium on Document Analysis and Information Retrieval (SDAIR '95), 1995.
[28] Yang, Y. Expert network: Effective and efficient learning from human decisions in text categorization and retrieval. In SIGIR '94: Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 13-22, 1994.
[29] Yang, Y. and Chute, C.G. An example-based mapping method for text categorization and retrieval. ACM Transactions on Information Systems, 12(3), 252-277, 1994.
[30] Yang, Y. and Pedersen, J.O. A comparative study on feature selection in text categorization. In Machine Learning: Proceedings of the Fourteenth International Conference (ICML '97), 412-420, 1997.
[31] Yang, Y. An evaluation of statistical approaches to text categorization. CMU Technical Report, CMU-CS-97-127, April 1997.
[32] The Reuters-21578 collection is available at: http://www.research.att.com/~lewis/reuters21578.html