Automatic Text Categorization using NTC

Taeho Jo
School of Computer and Information Engineering, Inha University

Abstract

In this research, we propose NTC (Neural Text Categorizer) as an approach to text categorization. Traditional approaches to text categorization require encoding documents into numerical vectors, which leads to two main problems: huge dimensionality and sparse distribution in each numerical vector. In this research, documents are encoded into string vectors instead of numerical vectors, and a new neural network called NTC, which receives a string vector as its input vector, is used for text categorization. The goal of this research is to avoid the two main problems by encoding documents into structured data other than numerical vectors. We validate the performance of NTC by comparing it with other machine learning algorithms on the standard test bed, Reuters-21578.

1 Introduction

Text categorization refers to the process of assigning one or more of the predefined categories to each document. Two preliminary tasks precede text categorization: the predefinition of categories and the preparation of sample labeled documents. After the preliminary tasks, a text categorization system learns from the sample labeled documents in order to build its classification capacity, and it then uses that capacity to classify unseen documents. The system therefore performs text categorization in three steps: the preliminary tasks, the learning, and the classification.

Traditional approaches to text categorization require encoding documents into numerical vectors. By indexing a corpus that consists of sample labeled documents, words called feature candidates are generated. Only some of them are selected as features, and various criteria for the selection have already been defined [7].
Each document is then represented as a numerical vector whose attributes are the selected features, with a value assigned to each feature. Encoding documents in this way causes two main problems: huge dimensionality and sparse distribution.

The first problem caused by encoding documents into numerical vectors is huge dimensionality: many features are required to avoid representing a document as a zero vector. According to the previous literature, the number of features is usually several hundred [13][15][1]. Although this number is small compared with the number of feature candidates, it is still large as an input size. The problem costs a great deal of time and system resources for processing each document, and it demands a number of sample documents proportional to the number of features.

The second problem is sparse distribution: zero values occupy almost all positions of each numerical vector. Under this problem, the representations of sample labeled documents may include complete zero vectors regardless of their categories, and unseen documents may also be encoded into zero vectors. The problem severely weakens the robustness of both the learning and the classification of approaches to text categorization.

The fundamental idea of this research is to encode documents into string vectors instead of numerical vectors. A string, as an element of a string vector, is a combination of letters of a language's alphabet that forms a basic semantic unit. A string vector is a finite ordered set of strings. Note that string vectors with identical elements in different orders are treated as different vectors. The intention of encoding documents in this way is to avoid the two main problems that characterize the encoding of documents into numerical vectors.
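To make the contrast between the two encodings concrete, the following minimal Python sketch encodes one short document both ways. The toy corpus, the naive whitespace tokenizer, the function names, and the rule used for the string vector (the d most frequent words, most frequent first, ties broken alphabetically, short documents padded with empty strings) are assumptions made for illustration only; they are not taken from the paper.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def to_numerical_vector(text, features):
    """Traditional encoding: one term frequency per selected feature."""
    counts = Counter(tokenize(text))
    return [counts[f] for f in features]


def to_string_vector(text, d, pad=""):
    """Assumed rule: the d most frequent words, most frequent first.

    Ties are broken alphabetically; short documents are padded with `pad`.
    """
    counts = Counter(tokenize(text))
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return (ranked + [pad] * d)[:d]


# Hypothetical three-document corpus.
corpus = [
    "oil prices rise as crude oil supply falls",
    "wheat and corn exports rise",
    "bank raises interest rate",
]

# Every distinct word in the corpus is used as a feature here.
features = sorted({w for doc in corpus for w in tokenize(doc)})

doc = corpus[2]
numeric = to_numerical_vector(doc, features)
zeros = numeric.count(0)
strings = to_string_vector(doc, d=3)

print("numerical vector size:", len(numeric), "zero positions:", zeros)
print("string vector:", strings)
```

Even in this tiny corpus the numerical vector needs one position per distinct corpus word and is mostly zeros for any single document, whereas the string vector has only d positions and each position holds a word.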
In order to solve the two main problems, a neural network model called NTC (Neural Text Categorizer) is proposed as the approach to text categorization. The model receives string vectors as its input vectors and initializes its weights with the frequencies of elements within each class. The weights are optimized during the learning process to minimize the classification error on the training examples. When an unseen string vector representing an unseen document is given, the value of each output node is computed by summing the weights connected to it, and the document is classified into the category corresponding to the output node with the maximum value. Since the proposed neural network model uses string vectors as its input vectors, it is free from the two main problems.

The proposed neural network model is expected to improve the performance of text categorization by avoiding the two main problems. To use the model, documents are encoded into string vectors, as mentioned above. No particular value dominates a string vector, so no problem similar to the sparse distribution arises. A much smaller dimension than that of numerical vectors is sufficient for robust learning and classification. Therefore, the two main problems do not arise when the proposed neural network model is used for text categorization.

This paper consists of six sections, including this one. Section 2 explores the relevant literature. Sections 3 and 4 describe the definition of string vectors and the learning algorithm and classification of the proposed neural network model, respectively. Section 5 empirically validates the performance of the proposed neural network model by comparing it with other machine learning algorithms on the test bed, Reuters-21578. Section 6 mentions the significance of this research and further research as the conclusion of this article.

2 Previous Works

This section is concerned with previous works relevant to this research.
Although many approaches to text categorization have already been proposed, we mention four representative and popular ones: KNN (K Nearest Neighbor), NB (Naive Bayes), SVM (Support Vector Machine), and Neural Networks. Using any of them for text categorization requires encoding documents into numerical vectors, which causes the two main problems. The string kernel was proposed for using the SVM in text categorization as a solution to the two main problems, but it failed to improve the performance. In this section, we explore previous works on these approaches and on the previous solution to the two main problems.

The KNN may be considered a typical and popular approach to text categorization [1]. It was initially created by Cover and Hart in 1967 as a generic classification algorithm [2], and it was first applied to text categorization by Massand et al. in 1992 [3]. The KNN was recommended as a practical approach to text categorization by Yang in 1999 [4] and by Sebastiani in 2002 [1]. Therefore, the KNN has served as the base approach in other literature [1].

The Naive Bayes may be considered another approach to text categorization. It was initially created by Kononenko in 1989, based on Bayes' rule [5]. Its application to text categorization was mentioned in the textbook by Mitchell in 1997 [6]. In 1999, taking the Naive Bayes as a popular approach, Mladenic and Grobelink proposed and evaluated feature selection methods [7]. The Naive Bayes has been compared with subsequent approaches to text categorization [9][10].

More recently, the SVM was recommended as a practical approach to text categorization [9][10]. It was introduced by Hearst in her magazine article in 1998 [8], and in the same year it was applied to text categorization by Joachims [9]. In 1999, Drucker et al. adopted it for spam mail filtering, a practical instance of text categorization [10].
Furthermore, the SVM is popularly used not only for text categorization but also for other pattern classification tasks [11].

Neural networks may be considered an approach to text categorization, and among them the MLP (Multi-Layer Perceptron) with back propagation is the most popular model. The model was initially created in 1986 by McClelland and Rumelhart, and it was intended for pattern classification and nonlinear regression as a supervised learning algorithm [12]. It was first applied to text categorization in 1995 by Wiener, who validated its performance in his master's thesis by comparing it with KNN on the Reuters test bed [13]. Although the neural network classifies documents more accurately, it takes a very long time to learn the training documents.

The string kernel was proposed as a solution to the two main problems that are inherent in encoding documents into numerical vectors. It was initially proposed by Lodhi et al. in 2002 as a kernel function of the SVM [15]. The string kernel receives two raw texts as its inputs and computes the syntactic similarity between them. Since documents do not need to be encoded into numerical vectors, the two main problems are naturally avoided. However, computing the similarity took a very long time, and the string kernel failed to improve the performance of text categorization.

This research has three advantages, as mentioned in section 1. The first advantage is that it avoids the two main problems by encoding documents into structured data other than numerical vectors. The second advantage is that string vectors are more transparent than numerical vectors with respect to the content of the full text: it is easier to guess the content of a document from its string vector than from its numerical vector. The third advantage, derived from the second, is that it is potentially easier to trace why each document is classified as it is. Because of these three advantages, this research proposes a neural network that receives string vectors as its input data.

3 String Vector

This section is concerned with the general aspects of string vectors. A string vector refers to an ordered set of strings. In the context of a natural language, a string indicates a word; in other words, words are given as elements of string vectors. This section therefore describes string vectors in detail as an alternative representation of documents to numerical vectors.

A string vector is defined as an ordered set of words with a fixed size. It is denoted by [s_1, s_2, ..., s_d], where s_i denotes a string and there are d elements. When documents are represented as string vectors, their size is fixed at d, which is called the dimension of the string vectors. Since the elements of each string vector are ordered, two string vectors with identical elements in different orders are treated as different vectors. The reason is that each position corresponds to its own feature.

Table 1 illustrates the differences between string vectors and numerical vectors. The first difference is that numerical values are given as elements in numerical vectors, while strings are given as elements in string vectors.
The second difference is that the similarity measure between two numerical vectors is the cosine similarity or the Euclidean distance, while that between two string vectors is the average semantic similarity (see footnote 1). The third difference is that the features for encoding documents into numerical vectors are words, while those for encoding them into string vectors are statistical, linguistic, and posting properties of words. In short, a string vector is a numerical vector whose numerical values are replaced by strings.

Footnote 1. The average semantic similarity is not described here, since it is not involved in executing the proposed neural network.

Table 1. The Comparison of Numerical Vectors and String Vectors

                     Numerical Vector      String Vector
Element              Numerical Value       String
Similarity Measure   Inner Products,       Semantic Similarity
                     Euclidean Distance
Attributes           Words                 Property of Words

The differences between string vectors and bags of words are illustrated in table 2. Both types of structured data have strings as their elements. As the similarity measure, the cardinality of the intersection of two bags of words is used, while the average semantic similarity is used for string vectors. A bag of words is defined as an unordered infinite set of words, while a string vector is defined as an ordered finite set of words. Although a bag of words and a string vector look similar to each other, they should be distinguished on the basis of table 2.

Table 2. The Comparison of Bag of Words and String Vectors

                     Bag of Words             String Vector
Element              String                   String
Similarity Measure   Number of Shared Words   Semantic Similarity
Set                  Unordered Infinite Set   Ordered Finite Set

There are three advantages in representing documents as string vectors. The first advantage is that the two main problems, the huge dimensionality and the sparse distribution, are completely avoided.
The second advantage is that string vectors are more transparent representations of documents than numerical vectors: it is easier to guess the content of documents from their representations alone. The third advantage is that it is potentially easier to trace why documents are classified as they are. However, in order to use string vectors more freely, their mathematical foundations need to be established.

4 Neural Text Categorizer

This section is concerned with the architecture, learning algorithm, and classification of the proposed neural network, called NTC. Documents are encoded into string vectors for using the NTC. The neural network contains a special layer called the learning layer. With respect to the architecture of the NTC, the input nodes are directly connected to the output nodes, and the weights between the input and output nodes are decided by the learning nodes. In this section, we describe the architecture and the learning process of the NTC.

The architecture of the NTC is illustrated in figure 1. The input layer receives an input vector given as a string vector, and the number of nodes in the layer equals the dimension of the string vector. The output layer generates categorical scores, which indicate the likelihood that the given input vector belongs to each category, and the number of nodes in the layer equals the number of predefined categories or classes. The learning layer decides the weights between the input and output layers differently depending on the given input vector, and the number of nodes in the layer also equals the number of categories or classes. Therefore, with respect to its architecture, the NTC has three layers, as shown in figure 1.

Figure 1. Overall Architecture of the NTC

In the context of the learning process, the first step of the NTC is to initialize the weights between the input and output layers. Let us assume that the NTC is applied to text categorization without decomposing the task into binary classification tasks. The set of training string vectors is partitioned category by category. Each learning node has its own table, which consists of words and their weights. The frequencies of the elements of string vectors within each category are assigned in the table as the initial weights. Therefore, the initial step is to set up the tables in the learning nodes.

The learning process of the NTC refers to the process of optimizing the weights in the tables of the learning nodes. Each training example is classified by summing the initial weights and selecting the category corresponding to the maximal sum. If the training example is classified correctly, the weights are not updated. Otherwise, the weights are incremented toward the target category and decremented toward the classified category. The optimized weights are generated as the output of this process.

In the NTC, each example is classified by summing the optimized weights, whether it is a training example or an unseen one. Each output node generates the summation of the weights connected to it from the input nodes as its categorical score. The weights are decided by referring to the table owned by the corresponding learning node. The category corresponding to the output node that generates the maximum categorical score is decided as the category of the given example. Therefore, the output of this process is one of the predefined categories, assuming that the NTC is applied to text categorization without the decomposition.

The property that characterizes the NTC is that the learning layer exists inherently in it, and each learning node has its own table of words and their weights as the reference for deciding weights. Each input node receives a string as an element of a string vector, and the learning nodes decide the weights connected from the input node by referring to their own tables. In the current version of the NTC, if a word is not registered in a table, its weight is set to zero. This case will be considered in further research.

5 Empirical Results

This section is concerned with the empirical validation of the performance of the NTC. We use the collection of news articles called Reuters-21578 as the test bed. This set of experiments involves five approaches: KNN, NB, SVM, NNBP, and NTC. The F1 measure is used for evaluating the performance of each approach to text categorization. In this section, the test bed and the configurations of the approaches involved in the set of experiments are described, and the results of the set of experiments are presented and discussed.
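The initialization, learning, and classification steps of the NTC described in section 4, together with the micro-averaged and macro-averaged F1 used for evaluation, can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the author's implementation: the additive update by the learning rate, the tie-breaking rule, the early stop, the toy string vectors, and the names NTC and f1_scores are all hypothetical choices made for illustration.

```python
class NTC:
    """Sketch of the Neural Text Categorizer: one word -> weight table per category."""

    def __init__(self, categories, learning_rate=0.3):
        self.categories = list(categories)
        self.learning_rate = learning_rate
        # One table per learning node (one learning node per category).
        self.tables = {c: {} for c in self.categories}

    def initialize(self, vectors, labels):
        """Initial weights: frequency of each word within each category."""
        for vec, label in zip(vectors, labels):
            table = self.tables[label]
            for word in vec:
                table[word] = table.get(word, 0.0) + 1.0

    def scores(self, vec):
        """Categorical score: sum of the weights of the input words.

        A word that is not registered in a table contributes zero.
        """
        return {c: sum(self.tables[c].get(w, 0.0) for w in vec)
                for c in self.categories}

    def classify(self, vec):
        s = self.scores(vec)
        # Assumption: ties go to the category listed first.
        return max(self.categories, key=lambda c: s[c])

    def fit(self, vectors, labels, epochs=100):
        self.initialize(vectors, labels)
        for _ in range(epochs):
            errors = 0
            for vec, target in zip(vectors, labels):
                predicted = self.classify(vec)
                if predicted == target:
                    continue  # correctly classified: no update
                errors += 1
                for word in vec:
                    # Assumed additive update by the learning rate.
                    self.tables[target][word] = (
                        self.tables[target].get(word, 0.0) + self.learning_rate)
                    self.tables[predicted][word] = (
                        self.tables[predicted].get(word, 0.0) - self.learning_rate)
            if errors == 0:  # assumption: stop once training is error-free
                break
        return self


def f1_scores(y_true, y_pred, categories):
    """Return (micro-averaged F1, macro-averaged F1) for single-label output."""
    tp = {c: 0 for c in categories}
    fp = {c: 0 for c in categories}
    fn = {c: 0 for c in categories}
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in categories) / len(categories)
    return micro, macro


# Hypothetical 3-dimensional string vectors.
categories = ["crude", "grain"]
train_x = [("market", "oil", "price"), ("market", "oil", "crude"),
           ("market", "price", "crude"), ("market", "wheat", "grain")]
train_y = ["crude", "crude", "crude", "grain"]
test_x = [("oil", "crude", "export"), ("wheat", "grain", "export"),
          ("market", "grain", "harvest")]
test_y = ["crude", "grain", "grain"]

model = NTC(categories).fit(train_x, train_y)
predictions = [model.classify(v) for v in test_x]
micro, macro = f1_scores(test_y, predictions, categories)
print(predictions, micro, macro)
```

In the toy training set the word "market" occurs in both categories, so the single "grain" example is initially tied with "crude" and the update step is exercised once before training becomes error-free.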
Table 3 illustrates the predefined categories and the number of news articles per category in the test bed, Reuters-21578. In this set of experiments, the ten most frequent categories are selected from more than 100 categories. As illustrated in table 3, the collection of news articles is partitioned into two sets: the training set and the test set. The selection of the ten most frequent categories and the partition follow the literature [16]. Reuters-21578 is popularly used as the standard test bed for evaluating approaches to text categorization [1].

Table 3. The Most Frequent Categories in Reuters-21578

Category Name   #Training   #Test   #Total
Earn
Acq
Money-Fx
Crude
Grain
Trade
Interest
Ship
Wheat
Corn

The configurations of the involved approaches are illustrated in table 4. The parameters of the SVM and the KNN, the capacity and the number of nearest neighbors, are set to four and three, respectively; the NB has no parameter. The parameters of the NNBP, such as the number of hidden nodes and the learning rate, are set arbitrarily as shown in table 4. News articles are encoded into 500-dimensional numerical vectors and 50-dimensional string vectors. The configurations of the involved approaches are therefore set as shown in table 4.

Table 4. The Configurations of Participating Approaches

Approach      Parameter Configuration
SVM           Capacity = 4.0
KNN           #nearest neighbors = 3
Naive Bayes   N/A
NNBP          Hidden layer: 10 hidden nodes; learning rate: 0.3; #training epochs: 1000
NTC           Learning rate: 0.3; #training epochs: 100

Figure 2. The Results of This Set of Experiments

The results of comparing the involved approaches with each other are presented in figure 2. Among the five bars, the black bar indicates the performance of the proposed approach. The left group indicates the micro-averaged F1 measure of the five approaches, and the right group indicates the macro-averaged F1 measure. The proposed approach, NTC, shows the best performance among the five approaches.

We need to discuss the results illustrated in figure 2. Note that documents are encoded into string vectors of a smaller dimension than that of the numerical vectors. The NTC shows the best performance among the five approaches involved in this set of experiments. Even if the NTC had only comparable performance to the others, it would be the more practical approach, since its input size is much smaller. Therefore, this set of experiments shows that the proposed approach is the most practical with respect to both its better performance and its more compact input size.

6 Conclusion

Four contributions are considered as the significance of this research. First, this research proposes a practical approach, according to the results of the set of experiments. Second, it solves the two main problems, the huge dimensionality and the sparse distribution, which are inherent in encoding documents into numerical vectors. Third, it creates a new neural network, called NTC, which, unlike previous neural networks, receives string vectors.
Last, it makes it potentially easier to trace why each document is classified as it is.

Let us consider four remaining tasks as further research. The first task is to apply the proposed neural network to the categorization of documents within a specific domain such as medicine, law, or engineering. The second task is to modify it into a regression version. The third task is to create an unsupervised neural network based on the proposed neural network. The last task is to develop a text categorization system in which the approach is adopted as the categorization engine.

References

[1] F. Sebastiani, Machine Learning in Automated Text Categorization, pp1-47, ACM Computing Surveys, Vol 34, No 1, 2002.
[2] T.M. Cover and P.E. Hart, Nearest Neighbor Pattern Classification, pp21-27, IEEE Transactions on Information Theory, Vol 13, 1967.
[3] B. Massand, G. Linoff, and D. Waltz, Classifying News Stories using Memory based Reasoning, pp59-65, The Proceedings of 15th ACM International Conference on Research and Development in Information Retrieval, 1992.
[4] Y. Yang, An evaluation of statistical approaches to text categorization, pp67-88, Information Retrieval, Vol 1, No 1-2, 1999.
[5] I. Kononenko, ID3, sequential Bayes, naive Bayes and Bayesian neural networks, pp91-98, The Proceedings of 4th European Working Session on Learning, Montpellier, 1989.
[6] T. M. Mitchell, Machine Learning, McGraw-Hill, 1997.
[7] D. Mladenic and M. Grobelink, Feature Selection for unbalanced class distribution and Naive Bayes, The Proceedings of International Conference on Machine Learning, 1999.
[8] M. Hearst, Support Vector Machines, pp18-28, IEEE Intelligent Systems, Vol 13, No 4, 1998.
[9] T. Joachims, Text Categorization with Support Vector Machines: Learning with many Relevant Features, The Proceedings of 10th European Conference on Machine Learning, 1998.
[10] H. Drucker, D. Wu, and V. N. Vapnik, Support Vector Machines for Spam Categorization, IEEE Transactions on Neural Networks, Vol 10, No 5, 1999.
[11] N. Cristianini and J. Shawe-Taylor, Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press.
[12] J. McClelland and D. Rumelhart, Parallel Distributed Processing, Vol 1 and 2, MIT Press, 1986.
[13] E. D. Wiener, A Neural Network Approach to Topic Spotting in Text, The Thesis of Master of University of Colorado, 1995.
[14] M. E. Ruiz and P. Srinivasan, Hierarchical Text Categorization Using Neural Networks, pp87-118, Information Retrieval, Vol 5, No 1.
[15] H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins, Text Classification with String Kernels, Journal of Machine Learning Research, Vol 2, No 2, 2002.
[16] A. Estabrooks, T. Jo, and N. Japkowicz, A Multiple Resampling Method for Learning from Imbalanced Data Sets, pp18-36, Computational Intelligence, Vol 28, No 1.