2011 International Conference on Information and Electronics Engineering
IPCSIT vol. 6 (2011) © (2011) IACSIT Press, Singapore

A Chi-Square-Test for Word Importance Differentiation in Text Classification

Phayung Meesad 1, Pudsadee Boonrawd 2 and Vatinee Nuipian 2,3 +
1 Department of Teacher Training in Electrical Engineering
2 Department of Information Technology
3 Institute of Computer and Information Technology
King Mongkut's University of Technology North Bangkok, Bangkok, Thailand

+ Corresponding author. Tel.: +66 8664-0179; fax: +66-91-019. E-mail address: vtn@kmutnb.ac.th, pym@kmutnb.ac.th, pudsadee@kmutnb.ac.th

Abstract. Text classification is a key issue in supporting searches of digital libraries and the Internet. Most approaches suffer from the high dimensionality of the feature space, e.g. word frequency vectors. To overcome this problem, a new feature selection technique based on a new application of the chi-square test is used. Experiments have shown that determining word importance can increase the speed of the classification algorithm and significantly reduce its resource use. By doing so, a success rate of 92.20% could be reached using documents of the ACM digital library.

Keywords: digital library, text classification, support vector machine, feature selection, Chi-Square test

1. Introduction

Development and advanced research are necessary to facilitate recommender and expert systems, complex searches, the summarization of retrieval results, and tools that make it easier for researchers to develop the next generation of such systems. A semantic system is considered an important answer to the problem of information overload: data grows every day, while information retrieval today relies on keywords, which cannot capture the meaning of a word and its relationship to other words [1]. Therefore, many researchers have investigated semantic retrieval. Text mining is part of this research: data classification models are used to train the computer automatically, but one issue to consider is the ambiguity introduced by the classification of the information provided [2]. Saengsiri [3] and Haruechaiyasak [6] used feature selection based on word frequency, measuring Information Gain, Gain Ratio, Cfs, Document Frequency, and Chi-Square to select terms and attributes, since these measures can reduce resource use and increase processing speed. Classification techniques that many researchers use are Decision Tree [4], Naïve Bayes [5], and Support Vector Machine (SVM) [6]; Tammasiri [7] applied a Support Vector Machine with grid search to credit scoring. Tuning the kernel function with appropriate parameters should give the best results for data classification.

Thus, a major difficulty of text categorization is the high dimensionality of the feature space, and feature selection is an important step in text categorization to reduce it. This research uses feature selection methods such as Information Gain, Gain Ratio, Cfs, Document Frequency, Chi-Square, Consistency, and Filter, and compares them against each other. After that, text classification is employed to create a new model.

2. A Review of Text Categorization and Feature Selection

Text categorization is the process of automatically assigning a text document to some predefined categories and building models. For the text domain, features are a set of terms extracted from a document corpus. The document corpus must be analysed to determine the ambiguous words, because those words create confusion in the classification. Documents are represented by keywords or indexes, which are used for retrieval, together with the frequencies of their words; a small illustration of this representation is given below, and the selection principles follow in the next subsections.
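For illustration, the following minimal Python sketch builds such term-frequency vectors over a toy corpus; the documents and names are invented for this example and are not taken from the paper's data.

```python
# A minimal sketch of the bag-of-words representation described above.
# The three "documents" are invented for illustration; stop-word removal
# and stemming (handled by LexTo in the paper) are omitted here.
from collections import Counter

docs = [
    "support vector machine for text classification",
    "feature selection reduces the feature space",
    "chi square test for feature selection",
]

# Vocabulary: the set of unique terms occurring in all documents.
vocabulary = sorted({term for doc in docs for term in doc.split()})

def tf_vector(doc: str) -> list[int]:
    """Represent one document as a term-frequency vector over the vocabulary."""
    counts = Counter(doc.split())
    return [counts[term] for term in vocabulary]

vectors = [tf_vector(doc) for doc in docs]
print(len(vocabulary), "dimensions per document")  # grows with corpus size
for v in vectors:
    print(v)
```

Even on this toy corpus the vector dimensionality equals the vocabulary size, which is why feature selection matters on a real corpus.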
2.1 Feature Selection

The main problem for text categorization is the high dimensionality of the feature space. The feature set for a text document is the set of unique terms or words that occur in all documents. Feature selection is a method which reduces the number of attributes. The advantage of reducing the attribute list is processing speed, which in turn yields higher performance. Saengsiri [3] and Haruechaiyasak [6] presented seven feature selection models. The feature selection methods are as follows.

Chi-Square (χ²): based on statistical theory, it measures the lack of independence between a term and a category [3], as shown in equation (1); a worked sketch is given at the end of this subsection.

\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}    (1)

where O_i is the observed frequency and E_i the frequency expected under independence.

Consistency: subsets of attributes are evaluated by their level of consistency with the class. The consistency of any subset can never be lower than that of the full set of attributes; hence the usual practice is to use this subset evaluator in conjunction with a Random or Exhaustive search, which looks for the smallest subset whose consistency equals that of the full set of attributes.

Filter: these methods are based on a performance evaluation metric calculated directly from the data, without direct feedback from the predictors that will finally be used on the data with the reduced number of features. Such algorithms are usually computationally less expensive than wrapper or embedded methods.

Information Gain (IG): finding node impurity is the main idea for selecting the best split; common impurity measures are the GINI index, entropy, and misclassification error [8]. The entropy-based measure INFO decreases as a result of splitting. The entropy at a given node t is given in (2):

Entropy(t) = -\sum_{j} p(j \mid t) \log p(j \mid t)    (2)

where p(j | t) is the relative frequency of category j at node t. The gain of a split is shown in (3), where the parent node t is split into n partitions and n_i is the number of records in partition i:

Gain_{split} = Entropy(t) - \sum_{i=1}^{n} \frac{n_i}{n} Entropy(i)    (3)

Nevertheless, biased splits can happen when a split produces a large number of partitions.

Gain Ratio (GR): this technique corrects the bias of INFO. The method is built using a top-down design. GR was developed by Quinlan in 1986 and is based on information theory: in general, if an answer v_i occurs with probability P(v_i), the information content of the answer is a function of these probabilities [9]. SplitINFO, presented in (4), resolves the bias of INFO: in (5), the gain is normalized by the entropy of the partitioning (SplitINFO), so higher-entropy partitioning is penalized.

SplitINFO = -\sum_{i=1}^{n} \frac{n_i}{n} \log \frac{n_i}{n}    (4)

GainRatio = \frac{Gain_{split}}{SplitINFO}    (5)

Document Frequency (DF): the number of documents in which a term occurs. The value can be calculated for each term from the document corpus. All unique terms whose document frequency in the training set falls below some predefined threshold are removed [6].

Cfs: a measurement process which selects subsets of features that are highly correlated with the class while having low correlation among themselves. Therefore, irrelevant features are removed and powerful features are retained [3].
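To make equation (1) concrete, the following sketch scores one term against one category from a 2x2 contingency table, the standard way the chi-square test is applied to term selection. The counts are invented for illustration, not taken from the paper's corpus.

```python
# A minimal sketch of equation (1): the chi-square score of one term for
# one category, computed from a 2x2 contingency table of invented counts.
def chi_square(observed):
    """observed[i][j]: row 0/1 = term present/absent, col 0/1 = in/out of category."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    score = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # E_i: expected count under independence
            score += (o - e) ** 2 / e              # accumulate (O_i - E_i)^2 / E_i
    return score

# A term found in 40 of 50 in-category documents but only 10 of 50
# out-of-category documents: strong dependence, hence a high score.
print(chi_square([[40, 10], [10, 40]]))  # 36.0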
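The entropy-based criteria of equations (2)-(5) can likewise be read as short functions. The parent node and the candidate two-way split below are hypothetical, not drawn from the paper's data.

```python
# A minimal sketch of equations (2)-(5) on a hypothetical split.
from math import log2

def entropy(labels):
    """Equation (2): -sum_j p(j|t) log p(j|t) at a node t."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def information_gain(parent, partitions):
    """Equation (3): parent entropy minus the weighted entropy of the partitions."""
    n = len(parent)
    return entropy(parent) - sum(len(p) / n * entropy(p) for p in partitions)

def gain_ratio(parent, partitions):
    """Equations (4) and (5): gain normalized by the split entropy (SplitINFO)."""
    n = len(parent)
    split_info = -sum((len(p) / n) * log2(len(p) / n) for p in partitions)
    return information_gain(parent, partitions) / split_info

parent = ["A"] * 6 + ["B"] * 4                   # class labels at the parent node
split = [["A"] * 5 + ["B"], ["A"] + ["B"] * 3]   # a candidate two-way split
print(round(information_gain(parent, split), 3)) # ~0.256
print(round(gain_ratio(parent, split), 3))       # ~0.264
```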
2.2 Classification

Classification is a data mining (machine learning) technique used to predict group membership for data instances.

Decision trees: tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi-Square Automatic Interaction Detection (CHAID) [9].

Naïve Bayes (NB): this algorithm was first proposed and used for the text categorization task by D. Lewis (1998) [10]. NB is based on Bayes' theorem in a probabilistic framework. The basic idea is to use the joint probabilities of words and categories to estimate the probabilities of categories given a document. The NB algorithm makes the assumption of word independence, i.e., the conditional probability of a word given a category is assumed to be independent of the conditional probabilities of other words given that category.

Support Vector Machine (SVM): a machine learning algorithm introduced by Vapnik [11]; it has also been applied to credit scoring [7]. SVM is based on structural risk minimization with error-bound analysis. SVM models are a close cousin of classical multilayer perceptron neural networks. Through a kernel function, SVMs provide an alternative training method for polynomial and radial-basis-function classifiers, as shown in equations (6) and (7).

Polynomial kernel (SVMP):

K(x_i, x_j) = (1 + x_i \cdot x_j)^d    (6)

Radial basis function kernel (SVMR):

K(x_i, x_j) = \exp(-\gamma \| x_i - x_j \|^2)    (7)

3. Experiment and Discussion

To evaluate the proposed methodology, experimental simulations were performed. Abstract data from the ACM Digital Library [12], Information Systems domain, were used. The data consisted of 1,099 documents from 2009-2010 and was pre-processed to obtain only the data needed. The text analysis component converts semi-structured data such as documents into structured data stored in a database; the fields are divided into title, author, abstract, keywords, etc. Ambiguous words are treated as a source of confusion between classes; a confusion matrix, a visualization tool typically used in supervised learning in which each column represents the instances of a predicted class, is used to examine them. The LexTo program [13] performs word segmentation and keyword selection, removes stop words, and applies stemming. WEKA, an open-source machine learning tool offering many data mining algorithms, was used to perform the experiments. In this study, Decision Tree, Naïve Bayes, BayesNet, and Support Vector Machine classifiers were used to judge the feature selection process. The performance metrics used to evaluate text categorization were accuracy, precision, recall, and F-measure. The selected algorithms were trained with the 10-fold cross-validation technique; a sketch of this pipeline is given after Fig. 1. The feature selection for the classification model is shown in Fig. 1, and the experimental results are summarized in Table 1 below.

Fig. 1: Feature Selection for Classification Model.
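As an illustration of the experimental pipeline just described, the sketch below uses scikit-learn in place of WEKA and LexTo (which have no Python API); the toy corpus, number of selected features, and SVM parameters are stand-ins, not the paper's data or settings.

```python
# A minimal sketch of the Section 3 evaluation pipeline: term-frequency
# features, chi-square feature selection, an RBF-kernel SVM, and
# F-measure under 10-fold cross validation. Toy data only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

docs = (["database query optimization and storage engines"] * 10 +
        ["neural network training for text classification"] * 10)
labels = ["databases"] * 10 + ["machine learning"] * 10

pipeline = make_pipeline(
    CountVectorizer(stop_words="english"),    # bag-of-words term frequencies
    SelectKBest(chi2, k=3),                   # chi-square feature selection, eq. (1)
    SVC(kernel="rbf", C=1.0, gamma="scale"),  # RBF kernel of eq. (7)
)

# F-measure averaged over 10 stratified folds, as in the experiments.
scores = cross_val_score(pipeline, docs, labels, cv=10, scoring="f1_macro")
print(scores.mean())
```

Performing feature selection inside the cross-validated pipeline, rather than once on the full corpus, keeps the chi-square scores from seeing the test folds.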
Fig. 2: F-measure values for each feature selection method and classifier.

Fig. 3: Performance of NB, BN, and SVM with the radial basis function kernel for varied C and gamma.

Table 1: Text Classification Evaluation Results

From Table 1, we can see that the χ² method for feature selection had the best performance. Classification quality was measured via F-measure. The results are as follows: SVMR = 92.20%, NB = 91.70%, BN = 91.40%, SVMP = 90.40%, and ID3 = 86.20%. The results match the studies of Saengsiri [3] and Haruechaiyasak [6].
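The C and gamma tuning illustrated in Fig. 3 can be sketched as a grid search. The exponential grid bounds below are a common convention and an assumption here, not the authors' exact search space.

```python
# A minimal sketch of C/gamma tuning for the RBF-kernel SVM (Fig. 3),
# using scikit-learn's grid search with illustrative grid bounds.
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    "C": [2.0 ** e for e in range(-5, 16, 2)],      # soft-margin penalty
    "gamma": [2.0 ** e for e in range(-15, 4, 2)],  # RBF kernel width, eq. (7)
}

search = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid,
    scoring="f1_macro",  # the F-measure reported in Table 1
    cv=10,               # 10-fold cross validation, as in Section 3
)
# Given feature vectors X and category labels y (e.g. from the pipeline
# sketched after Fig. 1), the best setting is found with:
#   search.fit(X, y)
#   print(search.best_params_, search.best_score_)
```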
Fig. 2 shows the SVMR classification with each feature selection method, evaluated by F-measure. The results are as follows: Chi-Square = 92.20%; Consistency, Filter, and InfoGain = 91.20%; GainRatio = 91.00%; No Reduction = 90.80%; DF = 90.80%; and Cfs = 90.70%. The optimal values are obtained by adjusting the C and gamma parameters, as shown in Fig. 3.

4. Conclusions and Future Work

In this paper, a Chi-Square test combined with the best classification model is proposed to overcome the high dimensionality of the feature space. Data used in the experiments came from the ACM Digital Library, Information Systems domain, during 2009-2010, and comprised 1,099 documents. Searching used keywords or indexes to represent the documents. The experiments show that the proposed method improves the performance of text categorization, using Chi-Square (χ²) for feature selection, reaching an F-measure of 92.20%. The best classification model is based on the Support Vector Machine with the radial basis function kernel (SVMR). Feature selection can reduce the number of features while preserving the high performance of the classifiers. In future work, to further test our approach, we can increase the number of datasets and the number of patterns to see whether this has any positive or negative effect on the results.

5. Acknowledgements

Choochart Haruechaiyasak of the Human Language Technology Laboratory (HLT), National Electronics and Computer Technology Center (NECTEC), Thailand Science Park, Klong Luang, Pathumthani 12120, Thailand, helped by providing LexTo and giving suggestions.

6. References

[1] S. Saeneapaya, "A Development of Knowledge Warehouse Prototype for Knowledge Sharing Support: Plant Diseases Case Study," Information Technology, Faculty of Computer Engineering, Kasetsart University, 2005.
[2] K. Thonglin, S. Vanichayobon and W. Wett, "Word Sense Disambiguation and Attribute Selection Using Gain Ratio and RBF Neural Network," IEEE Conference on Innovation and Vision for the Future in Computing & Communication Technologies (RIVF '08), 2008.
[3] P. Saengsiri, P. Meesad, S. Na Wichian and U. Herwig, "Comparison of Hybrid Feature Selection Models on Gene Expression Data," IEEE International Conference on ICT and Knowledge Engineering, 2010, pp. 13-18.
[4] Y. Ko and J. Seo, "Using the Feature Projection Technique Based on a Normalized Voting Method for Text Classification," Information Processing & Management, Vol. 40, pp. 191-208, 2004.
[5] K. Canasa and J. Chuleerat, "Thai Text Classification based on Naïve Bayes," Faculty of Computer Science, Kasetsart University, 2001.
[6] C. Haruechaiyasak, W. Jitkrittum, C. Sangkeettrakarn, and C. Damrongrat, "Implementing News Article Category Browsing Based on Text Categorization Technique," The 2008 IEEE/WIC/ACM International Conference on Web Intelligence (WI-08) Workshop on Intelligent Web Interaction (IWI 2008), 2008, pp. 143-146.
[7] D. Tammasiri and P. Meesad, "Credit Scoring using Data Mining based on Support Vector Machine and Grid," The 5th National Conference on Computing and Information Technology, 2009, pp. 49-57.
[8] P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining, Addison Wesley, 2006, pp. 150-163.
[9] J. R. Quinlan, "Induction of Decision Trees," Machine Learning 1(1), 1986, pp. 81-106.
[10] D. Lewis, "Naive Bayes at forty: The independence assumption in information retrieval," Proc. of the European Conference on Machine Learning, 1998, pp. 4-15.
[11] V. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, 1995.
[12] The ACM Portal, published by the Association for Computing Machinery, 2009-2010. Available online at http://portal.acm.org/portal.cfm
[13] LexTo: Thai Lexeme Tokenizer. Available online at http://www.sansarn.com/lexto/