A Chi-Square-Test for Word Importance Differentiation in Text Classification


2011 International Conference on Information and Electronics Engineering, IPCSIT vol. 6 (2011), IACSIT Press, Singapore

Phayung Meesad 1, Pudsadee Boonrawd 2 and Vatinee Nupan 2,3 +

1 Department of Teacher Training in Electrical Engineering
2 Department of Information Technology
3 Institute of Computer and Information Technology
King Mongkut's University of Technology North Bangkok, Bangkok, Thailand

Abstract. Text classification is a key issue in supporting searches of digital libraries and the Internet. Most approaches suffer from the high dimensionality of the feature space, e.g. word frequency vectors. To overcome this problem, a new feature selection technique based on a new application of the chi-square test is used. Experiments have shown that determining word importance can increase the speed of the classification algorithm and significantly reduce its resource use. In this way, a success rate of 92.20% was reached using documents of the ACM digital library.

Keywords: digital library, text classification, support vector machine, feature selection, Chi-Square test

1. Introduction

Development and advanced research are needed to facilitate recommender and expert systems, complex searches, and the summarization of retrieval results, as well as tools that make it easier for researchers to develop the next generation of retrieval theories and algorithms. A semantic system is considered an important part of the answer to the problem of information overload, because data increases every day and information retrieval today relies on keywords. Keyword matching, however, cannot process the meaning of a word and its relationship to other words [1]. Therefore many researchers have worked on semantic retrieval. Text mining is part of this research: data classification models are used to train the computer automatically, but one thing to consider is the ambiguity introduced by the classification of the information provided [2].
Saengsiri [3] and Haruechaiyasak [6] used feature selection, i.e. the principle of measuring word frequency with Information Gain, Gain Ratio, Cfs, Document Frequency and Chi-Square to select term frequencies and attributes, because these can reduce resource use and increase processing speed. Classification techniques that many researchers use are Decision Tree [4], Naïve Bayes [5] and Support Vector Machine (SVM) [6]; Tammasiri [7] applied Support Vector Machine and Grid search to credit scoring. A kernel function with appropriately adjusted parameters should give the best results for data classification. Thus, a major difficulty of text categorization is the high dimensionality of the feature space, and feature selection is an important step in text categorization to reduce the feature space. This research uses feature selection methods such as Information Gain, Gain Ratio, Cfs, Document Frequency, Chi-Square, Consistency and Filter, and compares these baseline methods. After that, text classification is employed to create a new model.

2. A Review of Text Categorization and Feature Selection

Text categorization is the process of automatically assigning a text document to some predefined categories and building models. For the text domain, features are a set of terms extracted from the document corpus. The document corpus must be analysed to determine the ambiguous words, because those words create confusion in the classification. Documents are represented by keywords or indexes which are used for retrieval; word frequencies are also computed using the following principles.

+ Corresponding author. Tel.: +66 8664-0179; fax: +66-91-019. E-mail addresses: vtn@kmutnb.ac.th, pym@kmutnb.ac.th, pudsadee@kmutnb.ac.th

2.1 Feature Selection

The main problem for text categorization is the high dimensionality of the feature space. The feature set for a text document is the set of unique terms or words that occur across all documents. Feature selection is a method which reduces the number of attributes. The advantage of reducing the attribute list is processing speed, which in turn yields higher performance. Saengsiri [3] and Haruechaiyasak [6] presented seven feature selection models. The feature selection methods are as follows.

Chi-Square (χ²): based on statistical theory, it measures the lack of independence between a term and a category [3], as shown in equation (1):

χ² = Σ_{i=1}^{n} (O_i − E_i)² / E_i    (1)

where O_i is an observed frequency and E_i the corresponding expected frequency under independence.

Consistency: subsets of attributes are evaluated by their level of consistency. The consistency of any subset can never be lower than that of the full set of attributes; hence the usual practice is to use this subset evaluator in conjunction with a Random or Exhaustive search which looks for the smallest subset whose consistency equals that of the full set of attributes.

Filter: these methods are based on a performance evaluation metric calculated directly from the data, without direct feedback from the predictors that will finally be used on the data with the reduced number of features. Such algorithms are usually computationally less expensive than those of the first or the second group.

Information Gain (IG): node impurity is the main idea for selecting the best split; common impurity measures are the GINI index, entropy and misclassification error [8]. INFO, based on the entropy measurement, decreases as a result of the split. The entropy at a given node t is given in (2):

Entropy(t) = − Σ_j p(j|t) log₂ p(j|t)    (2)

where p(j|t) is the relative frequency of category j at node t. The gain is shown in (3); the parent node t is split into k partitions, where n_i is the number of records in partition i and n the total:

GAIN_split = Entropy(t) − Σ_{i=1}^{k} (n_i/n) Entropy(i)    (3)

Nevertheless, biased splits can happen with large numbers of partitions.

Gain Ratio (GR): this technique corrects the bias of INFO. The method is built using a top-down design.
GR was developed by Quinlan in 1986 and is based on information theory: generally, given the probability P(v_i) of an answer v_i, the information content of the answer follows [9]. SplitINFO, presented in (4), resolves the bias in INFO: the gain is normalized by the entropy of the partitioning (SplitINFO), so higher-entropy (many-way) partitioning is penalized.

SplitINFO = − Σ_{i=1}^{k} (n_i/n) log₂ (n_i/n)    (4)

GainRatio = GAIN_split / SplitINFO    (5)

Document Frequency (DF): the number of documents in which a term occurs. The value can be calculated for each term from the document corpus. All unique terms whose document frequency in the training set is below some predefined threshold are removed [6].

Cfs: a measurement process which selects feature subsets that are highly correlated with the class while having low correlation among themselves. Therefore, irrelevant features are removed and predictive features retained [3].

2.2 Classification

Classification is a data mining (machine learning) technique used to predict group membership for data instances.
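Before detailing the classifiers, the selection measures of equations (1)-(5) can be rendered as a short Python sketch. This is illustrative only: the function names, and the representation of documents and nodes as plain count lists, are assumptions rather than anything specified in the paper.

```python
from math import log2

def chi_square(n_tc, n_t, n_c, n):
    """Eq. (1): chi-square score of term t for category c from document
    counts: n_tc docs of c containing t, n_t docs containing t,
    n_c docs in c, n docs total. Sums (O - E)^2 / E over the 2x2 table."""
    observed = [
        [n_tc, n_c - n_tc],                      # in c: t present / absent
        [n_t - n_tc, (n - n_c) - (n_t - n_tc)],  # not in c: t present / absent
    ]
    rows = [sum(r) for r in observed]
    cols = [sum(col) for col in zip(*observed)]
    score = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n     # E under independence
            score += (observed[i][j] - expected) ** 2 / expected
    return score

def entropy(counts):
    """Eq. (2): entropy of a node from its per-class record counts."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c)

def gain(parent, partitions):
    """Eq. (3): entropy reduction when the parent node is split."""
    n = sum(parent)
    return entropy(parent) - sum(sum(p) / n * entropy(p) for p in partitions)

def gain_ratio(parent, partitions):
    """Eqs. (4)-(5): gain normalized by SplitINFO to penalize many-way splits."""
    n = sum(parent)
    split_info = -sum(sum(p) / n * log2(sum(p) / n)
                      for p in partitions if sum(p))
    return gain(parent, partitions) / split_info
```

For example, chi_square(40, 50, 60, 200) scores a term that appears in 40 of a category's 60 documents out of 200 in total; a term spread proportionally over the categories scores 0, and gain_ratio gives 1.0 for a perfect two-way split.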

Decision trees: tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi-Square Automatic Interaction Detection (CHAID) [9].

Naïve Bayes (NB): the algorithm was first proposed for the text categorization task by D. Lewis (1998) [10]. NB is based on Bayes' theorem in a probabilistic framework. The basic idea is to use the joint probabilities of words and categories to estimate the probabilities of categories given a document. The NB algorithm makes the assumption of word independence, i.e., the conditional probability of a word given a category is assumed to be independent of the conditional probabilities of other words given that category.

Support Vector Machine (SVM): a machine learning algorithm introduced by Vapnik [11], applied for example to credit scoring [7]. SVM is based on structural risk minimization with error-bound analysis. SVM models are a close cousin of classical multilayer perceptron neural networks. Through a kernel function, SVMs can be trained as polynomial or radial basis function classifiers, as shown in equations (6) and (7).

Polynomial kernel (SVMP):

K(x_i, x_j) = (1 + x_i · x_j)^d    (6)

Radial basis function kernel (SVMR):

K(x_i, x_j) = exp(−γ ||x_i − x_j||²)    (7)

3. Experiment and Discussion

To evaluate the proposed methodology, experimental simulations were performed. Abstract data from the ACM Digital Library [12], Domain Information System, were used. The data consisted of 1,099 documents from 2009-2010 and were pre-processed to obtain only the data needed. The text analysis component converts semi-structured data such as documents into structured data stored in a database; the fields are divided into title, author, abstract, keywords etc. Ambiguous words are considered to be part of the confusion matrix. A confusion matrix is a visualization tool typically used in supervised learning; each column of the matrix represents the instances in a predicted class.
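The SVM configurations evaluated in the experiments use the kernels of equations (6) and (7), which are straightforward to state in code. Below is a minimal illustrative Python sketch; the function names and the default values of d and gamma are arbitrary choices, not values from the paper.

```python
from math import exp

def poly_kernel(xi, xj, d=2):
    # Eq. (6): K(xi, xj) = (1 + xi . xj)^d
    return (1.0 + sum(a * b for a, b in zip(xi, xj))) ** d

def rbf_kernel(xi, xj, gamma=0.5):
    # Eq. (7): K(xi, xj) = exp(-gamma * ||xi - xj||^2)
    return exp(-gamma * sum((a - b) ** 2 for a, b in zip(xi, xj)))
```

Note that rbf_kernel(x, x) is always 1.0, and gamma controls how quickly similarity decays with distance; together with the penalty parameter C this is what the grid search of Fig. 3 tunes.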
The LexTo program [13] performs the text processing and keyword selection, stop-word removal and stemming. WEKA, an open-source machine learning tool offering many data mining algorithms, was used to perform the experiments. In this study the classifiers Decision Tree, Naïve Bayes, BayesNet and Support Vector Machine were used to judge the feature selection process. The performance metrics used to evaluate the text categorization in the experiments were accuracy, precision, recall and F-measure. The selected algorithms were trained with the 10-fold cross-validation technique. The feature selection for the classification model is shown in Fig. 1, and the experimental results are summarized in Table 1 below.

Fig. 1: Feature Selection for Classification Model.
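The evaluation itself was run in WEKA; purely to make the protocol concrete, here is a small stdlib-Python sketch of the per-class precision/recall/F-measure computation and a simple 10-fold index split. The function names and counts are hypothetical, and this is not the authors' code.

```python
def precision_recall_f(tp, fp, fn):
    """Per-class metrics from confusion-matrix counts: tp = correctly
    predicted, fp = wrongly assigned to the class, fn = missed."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

def k_fold_indices(n_docs, k=10):
    """Yield (train, test) document-index lists for k-fold cross-validation."""
    folds = [list(range(i, n_docs, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test
```

With tp=90, fp=10, fn=30, for example, precision is 0.90, recall 0.75 and the F-measure about 0.818; averaging the per-fold F-measures over the 10 folds gives figures comparable to those reported in Table 1.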

Fig. 2: F-Measure values for feature selection and classification.

Fig. 3: High performance of NB, BN and SVM with radial basis function kernel, depending on C and gamma.

Table 1: Text Classification Evaluation Results

From Table 1 we can see that the χ² method for feature selection had the best performance. Classification quality was measured via F-measure. The results are as follows: SVMR = 92.20%, NB = 91.70%, BN = 91.40%, SVMP = 90.40% and ID3 = 86.20%, respectively. The result matches the studies of Saengsiri [3] and Haruechaiyasak [6].

Fig. 2 shows the SVMR classification with each feature selection method, evaluated by F-Measure. The results are as follows: Chi-Square = 92.20%; Consistency, Filter and InfoGain = 91.20%; GainRatio = 91.00%; No Reduction = 90.80%; DF = 90.80%; and Cfs = 90.70%. The optimal values of the C and gamma parameters are shown in Fig. 3.

4. Conclusions and Future Works

In this paper, the Chi-Square test combined with the best classification model is proposed to overcome the high dimensionality of the feature space. The data used in the experiments came from the ACM Digital Library, Domain Information System, during 2009-2010, and comprised 1,099 documents. Searching used keywords or indexes to represent the documents. The experiments show that the proposed method improves the performance of text categorization, with Chi-Square (χ²) feature selection reaching an F-measure of 92.20%. The best classification model is based on the Support Vector Machine with radial basis function kernel (SVMR). Feature selection can reduce the number of features while preserving the high performance of the classifiers. In future work, to test our approach further, we can increase the number of datasets and the number of patterns to see whether this has any positive or negative effects.

5. Acknowledgements

Choochart Haruechaiyasak of the Human Language Technology Laboratory (HLT), National Electronics and Computer Technology Center (NECTEC), Thailand Science Park, Klong Luang, Pathumthani 12120, Thailand, helped by providing LexTo and giving suggestions.

6. References

[1] S. Saeneapaya, A Development of Knowledge Warehouse Prototype for Knowledge Sharing Support: Plant Diseases Case Study, Information Technology, Faculty of Computer Engineering, Kasetsart University, 2005.

[2] K. Thonglin, S. Vanichayobon and W. Wett, Word Sense Disambiguation and Attribute Selection Using Gain Ratio and RBF Neural Network, IEEE Conference on Innovation and Vision for the Future in Computing & Communication Technologies (RIVF'08), 2008.

[3] P. Saengsiri, P. Meesad, S. Na Wichian and U. Herwig, Comparison of Hybrid Feature Selection Models on Gene Expression Data, IEEE International Conference on ICT and Knowledge Engineering, 2010, pp. 13-18.

[4] Y. Ko and J. Seo, Using the Feature Projection Technique Based on a Normalized Voting Method for Text Classification, Information Processing & Management, Vol. 40, pp. 191-208, 2004.

[5] K. Canasa and J. Chuleerat, Thai Text Classification based on Naïve Bayes, Faculty of Computer Science, Kasetsart University, 2001.

[6] C. Haruechaiyasak, W. Jitrittum, C. Sangkeettrakarn and C. Damrongrat, Implementing News Article Category Browsing Based on Text Categorization Technique, The 2008 IEEE/WIC/ACM International Conference on Web Intelligence (WI-08) Workshop on Intelligent Web Interaction (IWI 2008), 2008, pp. 143-146.

[7] D. Tammasiri and P. Meesad, Credit Scoring using Data Mining based on Support Vector Machine and Grid, The 5th National Conference on Computing and Information Technology, 2009, pp. 49-57.

[8] P.-N. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining, Addison Wesley, 2006, pp. 150-163.

[9] J. R. Quinlan, Induction of Decision Trees, Machine Learning 1(1), 1986, pp. 81-106.

[10] D. Lewis, Naive Bayes at Forty: The Independence Assumption in Information Retrieval, Proc. of the European Conf. on Machine Learning, pages 4-15, 1998.

[11] V. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, 1995.

[12] The ACM Portal, published by the Association for Computing Machinery, Inc. Copyright 2009-2010. Available online at http://portal.acm.org/portal.cfm

[13] LexTo: Thai Lexeme Tokenizer. Available online at http://www.sansarn.com/lexto/