Statistical Feature Selection Techniques for Arabic Text Categorization

Rehab M. Duwairi
Department of Computer Information Systems
Jordan University of Science and Technology
Irbid 22110, Jordan
Tel. +962-2-7201000 ext. 20000
rehab@just.edu.jo

ABSTRACT
This paper compares several statistical feature selection techniques for Arabic text. Feature selection is especially important for text classification because, when dealing with text, the number of features/words grows rapidly. This makes the document-term matrix sparse, which degrades classifier performance both in accuracy and in processing time. One typically reduces the number of features during preprocessing by eliminating stopwords or by stemming. This paper adds a further preprocessing step that reduces the number of features based on their merit as determined by statistical measures such as correlation, chi square, deviation, and uncertainty. The dataset used in this study consists of 4,000 Arabic documents that fall into two categories. Naïve Bayes with 3-fold cross-validation is used throughout. Results show that weight by correlation gave very high accuracy in very little time. The highest accuracy was still achieved in the base case, when no filters were used, but the classifier then took twice the time required with weight by correlation.

Categories and Subject Descriptors
H.2.8 Data mining. I.2.7 Natural language processing. I.5.2 Classifier design and evaluation; feature evaluation and selection.

General Terms
Algorithms, Management, Performance, Design, Experimentation.

Keywords
Text categorization, text classification, feature selection, chi square, weight by correlation

1. INTRODUCTION
Text classification or categorization is the process of assigning a document to one or more labels or categories [12]. The first case is known as single-label classification; the latter is addressed in the literature as multi-label classification [15].
Text categorization is a supervised process since the set of labels is known a priori. It has many applications such as spam filtering, topic identification, and tailored news or advertisement delivery. Exponential growth of the number of features is common in text categorization [12], since every unique token in every document appears as a dimension in the document-term matrix. Most text categorization techniques reduce this large number of features by eliminating stopwords or by stemming. This is effective to a certain extent, but the remaining number of features is still huge. More sophisticated feature selection techniques have been reported in the literature [1, 2, 4, 9, 16, 17, 18]. One needs to distinguish between feature selection and feature reduction. In feature selection, a subset of the features in the document-term matrix is selected based on their merit in the classification process [14]. Feature selection methods fall into filter and wrapper approaches. Filters rely on a measure (most often a statistical one) to calculate the merit of a given feature; only features whose weight is greater than a threshold are kept for classification. Filters are independent of the classifier and therefore inexpensive to use. Examples of filters include weight by correlation, chi square, information gain [8], mutual information [10], the Gini index [13] and many more. Wrappers [1], by comparison, evaluate features by training a classifier; this means that the classifier's accuracy is calculated over several subsets of features determined by greedy algorithms. Wrappers yield better results, but they are expensive and may suffer from overfitting. Feature reduction, on the other hand, transforms the original set of features into new features by applying some transformation function. The new feature set contains far fewer features or dimensions than the original set [2].
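As a concrete illustration of the filter approach, the following sketch scores every term against the class labels with the chi square statistic and keeps only the top-ranked terms. It uses scikit-learn and a toy English corpus purely for illustration; the actual tool used in this paper is RapidMiner, and the corpus here is an assumption of the sketch.

```python
# Filter-approach sketch: weight terms with chi square against the
# class labels, then keep only the k highest-scoring terms.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["cheap loans offer", "meeting agenda today",
        "win cheap prize", "project meeting notes"]  # toy corpus
labels = [1, 0, 1, 0]                                # 1 = spam, 0 = ham

X = TfidfVectorizer().fit_transform(docs)            # document-term matrix
selector = SelectKBest(chi2, k=3)                    # keep the 3 best terms
X_reduced = selector.fit_transform(X, labels)

print(X.shape[1], "->", X_reduced.shape[1])          # 10 -> 3
```

Because the filter never trains the classifier, its cost is one pass over the document-term matrix, which is why filters are cheap compared with wrappers.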
Common feature reduction techniques include term clustering [3], latent semantic indexing [7, 9], and principal component analysis [2, 19]. These are also expensive to use, but they yield good results over a reduced-dimension feature space. In this paper, we investigate the effects of the weight by correlation, chi square, deviation and uncertainty filters on classification accuracy. The Naïve Bayes classifier was used in this study.

The Fourth International Conference on Information and Communication Systems (ICICS 2013), Irbid, Jordan, April 23-25, 2013.

The dataset that we experimented with consists of 4,000 Arabic
documents collected from the Internet and labeled manually. The results indicate that weight by correlation achieves high classification accuracy with minimum time to complete the classification. Working on the set of all features, after removing stopwords and stemming, gave the maximum accuracy but at the expense of processing time.

This paper is organized as follows. Section 1 has introduced this work. Section 2 explains the necessary background and related work. Section 3 explains the research methodology followed in this work. Section 4 discusses the experiments and the obtained results. Finally, Section 5 draws the conclusions of this work.

2. BACKGROUND AND RELATED WORK
Feature selection and feature reduction are well-studied problems in the literature. Several sophisticated algorithms have been introduced and used [1, 2, 4, 9, 16, 17, 18]. The bulk of these works addressed structured data. Text classification or categorization aims at assigning a label to a document. Since text is unstructured, the words or tokens which appear in the collection of documents are used as features or attributes. The importance of a word or term is commonly determined using TF-IDF (term frequency-inverse document frequency). Usually, the number of features or terms is huge, and this can adversely affect accuracy as a result of overfitting, besides increasing processing time. Thus feature selection is commonly used with text classification. Feature selection/reduction for Arabic text has been investigated in the literature. Duwairi, Al-Refai and Khasawneh [3] investigated three heuristics for feature reduction in text classification: stemming, light stemming and word clustering. Their dataset consists of 15,000 documents, and Naïve Bayes and KNN were used for classification. Their results show that light stemming outperforms the other two methods.
Mesleh [10] compared the effects of 17 feature selection techniques on the accuracy of SVM when applied to a corpus of 7,842 documents written in Arabic. He concluded that chi square yields the best precision and recall values. Mesleh did not apply stemming to the text, but he applied normalization of certain letters: hamza ('), alef madda (|), alef with hamza above (>), alef with hamza below (<) and hamza on ya (}) are normalized to alef without hamza (A). Stopwords were also removed from the text documents. We are using Buckwalter's transliteration system for writing Arabic letters. Harrag and El-Qawasmeh [6] used SVD (Singular Value Decomposition) with neural networks to classify documents; their work falls under feature reduction rather than selection. Harrag, El-Qawasmeh and Pichappan [5] applied decision trees to classify Arabic documents on two rather small datasets collected manually. The accuracy of the classifier was calculated once without feature selection and once with feature selection. The feature selection measures that were used are term frequency (TF), document frequency (DF) and their ratio (TF/DF). The best filter was determined for each dataset, and then the decision tree was used to classify documents. Their results demonstrate that feature selection improves accuracy. The features that they used are well known and widely used in the literature.

3. RESEARCH METHODOLOGY
3.1 Data Set
The dataset consists of 4,000 Arabic documents which fall into two categories. The first category contains documents related to economics and is therefore labeled Economics; it consists of 2,000 documents. The second category contains documents related to political activities and is therefore labeled Politics; it consists of 2,000 documents as well. This dataset was created manually by collecting documents from the Internet, and it was labeled manually as well.
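The alef/hamza normalization described in Section 2 is simple to express in code. The following is a minimal, illustrative sketch over Buckwalter transliteration; it is not part of the RapidMiner pipeline used in this work.

```python
# Illustrative sketch: collapse hamza (') and the alef variants
# |, >, < and hamza-on-ya (}) to bare alef (A), in Buckwalter
# transliteration, as in Mesleh's letter normalization.
ALEF_VARIANTS = str.maketrans(dict.fromkeys("'|><}", "A"))

def normalize(word: str) -> str:
    """Collapse hamza/alef variants to plain alef (A)."""
    return word.translate(ALEF_VARIANTS)

print(normalize("<slAm"))  # -> "AslAm"
```

Normalization of this kind merges spelling variants of the same word into a single feature, which shrinks the document-term matrix before any statistical filter is applied.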
3.2 Preprocessing
RapidMiner [11] was used in the preprocessing and classification tasks. RapidMiner is an open-source data mining tool that supports many data mining tasks and is readily usable for text. Preprocessing, in this work, consists of the following steps, carried out in order on every document in the dataset:
1. Replace Tokens: every non-Arabic character is replaced by a whitespace character; that is, matches of the regular expression [a-zA-Z] are replaced by whitespace.
2. Tokenize: the words or tokens of the document are extracted. A token means any sequence of Arabic letters.
3. Filter Stopwords: all stopwords are removed from the documents. We used RapidMiner's built-in Arabic stopword list.
4. Stemming: we utilized RapidMiner's built-in light stemming algorithm. Stemming in Arabic falls into two classes: light stemming and root extraction (often simply referred to as stemming). In the first, a word is not reduced to its three-letter root; instead, common prefixes and suffixes are removed. In root extraction, a word is reduced to its three-letter root. A very small percentage of Arabic words have quad-literal or penta-literal roots.
3.3 Classification Model
Naïve Bayes is used in this research. After preprocessing, the document-term matrix is generated, and the importance of a term is expressed using TF-IDF. The accuracy of a classification task is based on precision and recall: the precision/recall for every class is calculated, and the overall performance of the classifier is determined by averaging precision/recall over the two classes. Because the dataset is relatively small, we utilized 3-fold cross-validation.
3.4 Feature Selection Model
In order to assess the impact of feature selection on text classification, we first ran the classification task on the dataset without any feature selection apart from the preprocessing explained in Section 3.2, and we calculated and stored the precision and recall. After that, the classification task was run several times; at each run, we used a specific feature selection method and calculated accuracy in terms of precision and recall. The idea is to assess the suitability of feature selection for text classification. The following paragraphs explain each of the filters that we have used in this study:

Weight by Correlation: this measure calculates the weight (degree of association) of attributes with respect to the class label. The weight belongs to [-1, 1] and can be normalized to [0, 1]. From a classification perspective, the higher the correlation between a feature and a class label, the better the discrimination power of the feature (i.e., the more important the feature is for classification).

Weight by Chi Square: this filter computes the lack of independence between a feature and a class. It determines whether the distribution of observed frequencies differs from the expected distribution. The higher the chi square value, the better the feature.

Weight by Deviation: this filter computes weights based on the (normalized) standard deviation of the attributes.

Weight by Uncertainty: this operator calculates the importance of a feature by measuring its symmetrical uncertainty with respect to the class.

4. EXPERIMENTATION AND RESULT ANALYSIS
The first series of experiments aimed at comparing the performance of weight by correlation, chi square, deviation and uncertainty. Figure 1 shows the average precision and recall for these filters compared against the base case. We used the Naïve Bayes classifier with 3-fold cross-validation. The dataset was preprocessed by removing stopwords and light-stemming the remaining words.
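An experimental run of this kind, i.e. a statistical filter followed by Naïve Bayes under 3-fold cross-validation, can be sketched as below. scikit-learn and a six-document toy corpus stand in for RapidMiner and the 4,000-document Arabic dataset (both substitutions are assumptions of this sketch), with chi square as the example weighting filter.

```python
# Sketch of one experimental run: TF-IDF features, a statistical
# filter (chi square here), and Naive Bayes under 3-fold CV.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Placeholder texts for the two classes of the paper's dataset.
docs = ["market shares rise", "oil prices fall", "bank profits grow",
        "election vote today", "parliament passes law", "minister gives speech"]
labels = [0, 0, 0, 1, 1, 1]   # 0 = Economics, 1 = Politics

pipe = make_pipeline(TfidfVectorizer(),
                     SelectKBest(chi2, k=5),   # the filter under test
                     MultinomialNB())
scores = cross_val_score(pipe, docs, labels, cv=3)
print(scores.mean())
```

Swapping the `SelectKBest` scorer (or removing that stage entirely for the base case) reproduces the structure of the comparison reported in this section.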
We used a personal computer with 16GB RAM and a 3.4GHz Intel Core i7-3770 CPU running Windows 7 Enterprise 64-bit. As Figure 1 shows, the base case gave very good results, and the subsequent filters did not enhance the classifier's accuracy. Feature selection using correlation and uncertainty gave much better accuracy when compared with chi square and deviation.

Figure 1: Precision and Recall for Various Filters with Stemming

Table 1 shows the time, in seconds, that was necessary to complete the classification task under the different filters. As can be seen from the table, weight by correlation took very little time when compared with the chi square statistic.

Table 1: Classification Time in Seconds
Filter Name     Time in Seconds
Base Case       583
Correlation     266
Chi Square      1192
Deviation       505
Uncertainty     1399

Figure 2 plots precision, recall and the time required to complete the classification task. The figure indicates that using weight by correlation gave slightly lower accuracy but needed only 266 seconds to complete. By comparison, the base case with no filters gave the highest accuracy but at the expense of time. Feature selection is an optimization problem that takes into consideration several factors, including the classifier's accuracy and the time necessary to complete the classification task.

Figure 2: Precision, Recall and Time against Several Filters

5. CONCLUSIONS
The work reported in this paper has compared the effects of statistical feature selection techniques on the accuracy of the Naïve Bayes classifier. Specifically, the effects of weight by correlation, chi square, deviation and uncertainty were investigated. The dataset consists of 4,000 Arabic documents uniformly distributed over two classes. The results of the experiments demonstrate that weight by correlation gives the highest accuracy among the filters compared. They also show that correlation and deviation take much less time to compute. Occasionally, a slight decrease in accuracy is acceptable in favor of speedy computation. Recall that feature selection runs a classification task with a subset of the features, aiming to avoid overfitting (and thus increase performance) and to reduce time.

REFERENCES
[1] Ahmadizar F., Hemmati M. & Rabanimotlagh A., Two-Stage Text Feature Selection Method Using Fuzzy Entropy Measure and Ant Colony Optimization, Proceedings of the 20th Iranian Conference on Electrical Engineering (ICEE), pages 695-700, May 15-17, Tehran, Iran, 2012.
[2] Anghelescu A. & Muchnik I., Combinatorial PCA and SVM Methods for Feature Selection in Learning Classifications (Applications to Text Categorization), Proceedings of the IEEE KIMAS Conference, Boston, MA, USA, Oct. 1-3, pages 491-496, 2003.
[3] Duwairi R., Al-Refai M. & Khasawneh N., "Feature Reduction Techniques for Arabic Text Categorization", Journal of the American Society for Information Science and Technology (JASIST), Volume 60, Issue 11, pages 2347-2352, 2009.
[4] Figueiredo F. et al., Word Co-occurrence Features for Text Classification, Journal of Information Systems, Vol. 36, pages 843-858, 2011.
[5] Harrag F., El-Qawasmeh E. & Pichappan P., Improving Arabic Text Categorization using Decision Trees, Proceedings of the First International Conference on Networked Digital Technologies, July 28-31, pages 110-115, 2009.
[6] Harrag F.
& El-Qawasmeh E., Improving Arabic Text Categorization Using Neural Network with SVD, Journal of Digital Information Management, Vol. 8, No. 2, pages 125-135, 2010.
[7] Harrag F., El-Qawasmeh E. & Al-Salman A., A Comparative Study of Statistical Feature Reduction Methods for Arabic Text Categorization, Networked Digital Technologies (NDT 2010), Part II, Communications in Computer and Information Science, Vol. 88, pages 676-682, 2010.
[8] Lee C. & Lee G. G., Information Gain and Divergence-based Feature Selection for Machine Learning-based Text Categorization, Journal of Information Processing and Management, Volume 42, pages 155-165, 2006.
[9] Meng J. & Lin H., A Two-Stage Feature Selection Method for Text Categorization, Proceedings of the 7th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), pages 1492-1496, August 10-12, Yantai, China, 2010.
[10] Mesleh A., Feature Sub-set Selection Metrics for Arabic Text Classification, Journal of Pattern Recognition Letters, Vol. 32, pages 1922-1929, 2011.
[11] RapidMiner, http://rapid-i.com/, last accessed 16-Jan-2013.
[12] Sebastiani F., Text Categorization, in A. Zanasi (Ed.), Text Mining and Its Applications to Intelligence, CRM and Knowledge Management, pages 109-129, Southampton, UK: WIT Press, 2005.
[13] Singh S. R., Murthy H. A. & Gonsalves T. A., Feature Selection for Text Classification based on Gini Coefficient of Inequality, Proceedings of the 4th JMLR International Workshop on Feature Selection in Data Mining, pages 76-85, June 21, Hyderabad, India, 2010.
[14] Seo Y., Ankolekar A. & Sycara K., Feature Selection for Extracting Semantically Rich Words, Technical Report CMU-RI-TR-04-18, Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, 2004.
[15] Tsoumakas G. & Katakis I., Multi-Label Classification: An Overview, Journal of Data Warehousing & Mining, Vol. 3, No. 3, pages 1-13, 2007.
[16] Uysal A. K.
& Gunal S., A Novel Probabilistic Feature Selection Method for Text Classification, Journal of Knowledge-Based Systems, Vol. 36, pages 226-235, 2012.
[17] Wang et al., Feature Selection with Maximum Information Metric in Text Categorization, Proceedings of the 1st International Conference on Information Science and Engineering (ICISE), pages 857-860, Dec. 26-28, Nanjing, China, 2009.
[18] Wang S. et al., A Feature Selection Method based on Improved Fisher's Discriminant Ratio for Text Sentiment Classification, Journal of Expert Systems with Applications, Vol. 38, pages 8696-8702, 2011.
[19] Yang J. et al., A New Feature Selection based on Comprehensive Measurement both in Inter-Category and Intra-Category for Text Categorization, Journal of Information Processing and Management, Vol. 48, pages 741-754, 2012.