SURVEY OF TEXT CLASSIFICATION ALGORITHMS FOR SPAM FILTERING

I J I T E ISSN: (1-2), 2012, pp

K. SARULADHA 1 AND L. SASIREKA 2
1 Assistant Professor, Department of Computer Science and Engineering, Pondicherry Engineering College
2 M.Tech (Information Security), Pondicherry Engineering College, Pondicherry
E-mails: charuladha@pec.edu, sasirekal69@pec.edu

Abstract: E-mail is a convenient way for users to communicate on the Internet. The growth in the number of Internet users and the abuse of e-mail by unwanted users cause an exponential increase of e-mails in users' mailboxes, which is known as spam. Spam is defined as junk e-mail, Unsolicited Commercial E-mail, and Unsolicited Bulk E-mail. It causes huge economic losses to large-scale organizations through network bandwidth consumption and mail server processing overload. Text Categorization is a prominent way to automatically sort a set of e-mails into categories from a predefined set. E-mail text classification plays a major role in making filtering more flexible, robust, and personalized. This paper reviews various text classification processes, the phases of each process, and the methods used at each phase for spam filtering.

Keywords: E-mails, Text Categorization, Text Classification, Spam filtering

I. INTRODUCTION

Electronic mail (e-mail) is a communication channel between people on the Internet. It is an efficient and popular communication mechanism, and its use grows as the number of Internet users increases. E-mail thus also becomes a major problem for individuals and organizations, because it is prone to misuse. The posting of unsolicited e-mail messages is known as spam. Spam, also known as Unsolicited Commercial E-mail or junk e-mail, floods Internet users' electronic mailboxes. These junk e-mails may contain phishing messages, advertising, viruses or quasi-legal services. When a user is flooded with a large amount of spam, the chance that he or she overlooks a legitimate message increases. As a result, many e-mail readers have to spend a major portion of their time removing unwanted messages.
E-mail spamming may lead to a more unfavorable situation if the recipient replies to the messages, because a reply exposes the recipient's address to attack by other spammers. Spam also burdens mail servers and Internet traffic, all for unwanted messages. Many different approaches attempt to solve the spam flood, and Text Categorization is a prominent approach to efficient spam filtering.

Text Categorization [1] is the task of automatically sorting a set of documents into categories, such as topics, from a predefined set. This task falls at the crossroads of Machine Learning and Information Retrieval. Automated Text Categorization into predefined categories has witnessed booming interest in the last ten years, due to the increased availability of documents in digital form and the ensuing need to organize them. The advantages of this approach over the knowledge engineering approach are very good effectiveness, considerable savings in expert labor, and straightforward portability to different domains.

Applications of Text Categorization include automatic indexing for Boolean information retrieval systems, document organization, text filtering, word sense disambiguation, and hierarchical categorization of web pages. Automatic indexing [2] with controlled dictionaries is related to automated metadata generation. In digital libraries, one is usually interested in tagging documents with metadata that describes them under a variety of aspects. Indexing with a controlled vocabulary is an instance of the general problem of document base organization. Text filtering is the activity of classifying a stream of incoming documents dispatched asynchronously by an information producer to an information consumer. An e-mail filter might be trained to discard junk mail and further classify non-junk mail into topical categories of interest to the user. Various Text Categorization methods, such as Naïve Bayesian, Support Vector Machine, Decision Tree and Fuzzy Logic, can be applied to the spam detection and filtering process.

This paper is organized as follows. Section II summarizes the taxonomy of text classification algorithms, Section III discusses rule based classifiers, Section IV discusses linear classifiers and example based classifiers, and Section V is the conclusion.

II. TAXONOMY OF TEXT CLASSIFICATION ALGORITHMS

Classification algorithms can classify anything that has features. A large number of automatic learning algorithms developed in the Artificial Intelligence community have recently come to be applied in the Information Retrieval context. Three broad classes of automatic text classification algorithms can be distinguished: rule based classifiers, linear classifiers and example based classifiers.

III. RULE BASED CLASSIFIERS

Rule based classifiers learn by inferring a set of rules (a disjunction of conjunctions of atomic tests like "this feature has that value") from preclassified documents.

Fuzzy Logic

Fuzzy logic is a rule based classifier for spam classification and filtering. It uses linguistic variables [3], overlapping classes and approximate reasoning to model a classification problem.
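As a rough illustration of what overlapping classes mean, the sketch below defines two membership functions over a hypothetical spam score in [0, 1]. The score itself, the breakpoints 0.3 and 0.7, and all function names are assumptions made for illustration; they are not taken from [3], [4] or any particular fuzzy filter.

```python
def spam_membership(score: float) -> float:
    """Degree (0..1) to which a message with this score belongs to Spam."""
    # Linear ramp: 0 below 0.3, 1 above 0.7 (assumed breakpoints).
    return min(1.0, max(0.0, (score - 0.3) / 0.4))


def valid_membership(score: float) -> float:
    """Degree (0..1) to which the same message belongs to Valid."""
    return 1.0 - spam_membership(score)


def fuzzy_classify(score: float):
    """Return the category with the larger membership, plus both degrees."""
    mu_spam = spam_membership(score)
    mu_valid = valid_membership(score)
    label = "Spam" if mu_spam > mu_valid else "Valid"
    return label, mu_spam, mu_valid


if __name__ == "__main__":
    for s in (0.1, 0.45, 0.9):
        print(s, fuzzy_classify(s))
```

Between the two breakpoints a message belongs partly to both categories, which is the fuzzy boundary; outside them it belongs fully to one.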
Fuzzy logic suits spam detection because the classes of spam and non-spam messages overlap over a fuzzy boundary. Fuzzy-based spam detection first preprocesses the documents (removing stop words such as "the" and "it", as well as HTML tags), then builds a fuzzy model of the overlapping categories {Spam, Valid} with membership functions derived from the training set, and finally classifies each input message by calculating the fuzzy similarity between the received message and each category. The strength of this approach is that it scans the content of the message to predict its category rather than relying on a fixed, pre-specified set of keywords.

Figure 1: Taxonomy of Text Classification Algorithms

P. Sudhakar et al. [4] proposed a fuzzy rule implementation for spam classification in which fuzzy rules are generated and applied to all incoming e-mails to improve the performance of the classifier. Three fuzzy rules were constructed. Rule 1 operates on the e-mail address of the sender: the address is extracted from the e-mail and compared against origin based spam filter lists such as the blacklist, which contains the e-mail addresses of spammers, and the whitelist. If a match is found in these filters, the attack factor is set within the range 0.25 to -0.25. Rule 2 operates on the IP address of the sender: as in Rule 1, the IP address is compared against the origin based spam filters, and spam e-mails are filtered out based on the value of the attack factor. Rule 3 operates on the subject words. Every e-mail may contain one or more words in its subject line; these words are compared against the origin based spam filters together with an impact factor value. Based on these three rules, spam e-mails can be detected faster, and the process can be extended according to the user's attitude. This approach can therefore adapt to spammer tactics and build up its knowledge base.

IV. LINEAR CLASSIFIERS

In linear classifiers, a class profile is computed for each class: a vector of weights, one for each feature, based on occurrence frequency and probabilistic reasoning. For each class and document, a score is obtained by taking the inner product of the class profile and the document profile. Naïve (or simple) Bayesian classification is based on the estimation of conditional probabilities. Support Vector Machines compute an optimal linear classifier by transforming the feature space. This class furthermore comprises some heuristic learning algorithms from Artificial Intelligence, like the perceptron, in which the weights are obtained through an iterative learning process. The methods discussed here under linear classifiers are Decision Trees, the Naïve Bayes classifier and the Support Vector Machine.

1. Decision Trees

A decision tree is a predictive model that lays out a tree of decisions and their possible consequences, including chance event outcomes and resource costs. The outcomes can be discrete or, as in the case of regression trees, continuous; each leaf corresponds to a conjunction of features that leads to its classification. Popular decision tree learning methods are C4.5, CART, ID3 and Naïve Tree [5]. Decision trees generate understandable rules and, on the given datasets, achieve high classification accuracy and good performance. They are easy to adapt, and their knowledge can be built up dynamically.
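Tree learners such as ID3 and C4.5 choose the attribute to split on by information gain, that is, the reduction in label entropy obtained by partitioning the training set on that attribute. The following is a minimal sketch of that computation; the four toy messages and the attribute names `contains_free` and `has_attachment` are invented for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy of the labels minus the weighted entropy after splitting."""
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in partitions.values())
    return entropy(labels) - remainder


# Hypothetical training messages described by two Boolean attributes.
ROWS = [
    {"contains_free": True, "has_attachment": True},
    {"contains_free": True, "has_attachment": False},
    {"contains_free": False, "has_attachment": True},
    {"contains_free": False, "has_attachment": False},
]
LABELS = ["spam", "spam", "ham", "ham"]

if __name__ == "__main__":
    for name in ("contains_free", "has_attachment"):
        print(name, information_gain(ROWS, LABELS, name))
```

On this toy data the first attribute separates the classes perfectly and the second not at all, so a learner that maximizes information gain would place `contains_free` at the root.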
C4.5 Decision Tree

The C4.5 decision tree classifier takes the form of a tree structure with nodes and branches. C4.5, a variant of the ID3 learning algorithm, builds the tree top-down. It selects the best attribute as the root node, and this selection is based on information gain. The information gain of an attribute is the expected reduction in entropy (a measure of the impurity of a set of instances) when the instances are partitioned by that attribute alone; it measures how well the attribute separates the classes. Once the attribute for the root node is determined, branches are created for all the values associated with that attribute, and the next attribute is selected in the same way. This process continues for the remaining attributes until the leaf nodes, which display the classes, are reached.

When a decision tree is built, many of the branches reflect anomalies in the training data caused by noise or outliers. Tree pruning in C4.5 addresses this problem of overfitting the data. Pruning methods use statistical measures to remove the least reliable branches, which generally makes classification faster and improves the ability of the tree to correctly classify independent test datasets. Decision trees produce high classification accuracy compared to Support Vector Machine, Naïve Bayes and neural network classifiers.

CART

CART is the Classification and Regression Trees algorithm, a data exploration and prediction algorithm. It progressively splits the set of training examples into smaller and smaller subsets on the basis of the possible answers to a series of questions posed by the designer. When all samples in a subset carry the same category label, the subset is pure, and that portion of the tree terminates. Text documents are typically characterized by very high dimensional feature spaces.
Such excessive detail, or noisy training data, runs the risk of overfitting. To avoid overfitting and improve generalization accuracy, some pruning technique must be employed. CART uses the Gini impurity measure to pick only the most appropriate feature for each split.

ID3

ID3, the Iterative Dichotomiser algorithm, computes entropy based information gain for optimized feature selection. The recursive feature selection continues until only one class remains in the data or no features are left.

Naïve Tree

Kohavi et al. propose a hybrid algorithm, Naïve Tree (NT), that combines the elegance of a recursive tree-based partitioning technique such as C4.5 with the robustness of Naïve Bayesian categorizers applied at each leaf. With various datasets as inputs, the average accuracy is shown to be 84.47% for NT, 81.91% for C4.5 and 81.69% for Naïve Bayesian.

2. Naïve Bayesian Classifier

The Bayesian classifier [6] is a learning method based on a probabilistic approach and is commonly used in text categorization. The basic idea is to use the joint probabilities of words and categories to estimate the probabilities of categories given a document. The classifier works on the principle that the probability of an event occurring in the future can be inferred from previous occurrences of that event. It applies Bayesian statistics with strong independence assumptions on the features that drive the classification process. During its training phase, a Naïve Bayesian classifier learns the word probabilities for each class. The authors constructed a corpus, Ling-Spam, with 2411 non-spam and 481 spam messages and used a parameter λ to impose a greater penalty on false positives. The paper demonstrated that the weighted accuracy of a Naïve Bayesian filter can exceed 99%. Variations of the basic algorithm that use word positions and multi-word N-grams as attributes have also yielded good results.

The main strength of the Naïve Bayesian algorithm lies in its simplicity. Since the variables are assumed to be mutually independent, only the variances of the individual variables within each class need to be determined, rather than the entire covariance matrix. This makes Naïve Bayesian one of the most efficient models for e-mail filtering.
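The training and classification steps described above can be sketched as a small multinomial Naïve Bayes filter. This is only an illustration, not the filter evaluated on Ling-Spam: the four training messages are invented, Laplace (add-one) smoothing is an assumed choice, and the λ penalty on false positives is left out.

```python
import math
from collections import Counter


def train(messages):
    """Count messages per class and word occurrences per class."""
    class_counts = Counter()
    word_counts = {}
    vocabulary = set()
    for text, label in messages:
        words = text.lower().split()
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary


def classify(text, model):
    """Return the class with the highest log-probability score."""
    class_counts, word_counts, vocabulary = model
    total_messages = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = math.log(count / total_messages)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word in vocabulary:  # words never seen in training are skipped
                # Add-one smoothing keeps unseen (word, class) pairs non-zero.
                smoothed = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
                score += math.log(smoothed)
        scores[label] = score
    return max(scores, key=scores.get)


# Invented toy corpus; a real filter would be trained on thousands of messages.
TRAINING = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project report attached", "ham"),
]
model = train(TRAINING)

if __name__ == "__main__":
    print(classify("free money", model))
    print(classify("project meeting", model))
```

Working in log space turns the product of word probabilities into a sum, so the score is a weighted sum over word counts, consistent with the class-profile view of linear classifiers.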
The Naïve Bayesian filter is also robust: it continuously improves its accuracy and adapts to each user's preferences, because the model is retrained whenever the user identifies an incorrect classification.

3. Support Vector Machine

The idea of Support Vector Machines (SVM) was proposed by Vapnik [7]. SVM is a supervised learning method based on structural risk minimization. It subjects every category to a separate binary classifier, and it classifies a dataset by constructing a hyperplane in N-dimensional space that separates the data into two categories. In a simple two-dimensional space, linearly separable classes can be separated by a straight line, and an infinite number of such lines can be found. Among them, one linear separator gives the greatest separation between the classes. It is called the maximum margin hyperplane and can be found using the convex hulls of the two classes; when the classes are linearly separable, the convex hulls do not overlap. The support vectors are the instances that are closest to the maximum margin hyperplane. When there are more than two attributes, support vector machines find an (N-1)-dimensional hyperplane that optimally separates the data points represented in N-dimensional space.

Instead of using linear hyperplanes, many implementations of these algorithms use kernel functions, which lead to non-linear classification surfaces such as polynomial, radial or sigmoid surfaces. The kernel functions transform the data to a higher dimensional space where linear separation is possible. The choice of kernel function depends on the application. Training an SVM is a quadratic optimization problem, which can be solved with a Quadratic Programming (QP) optimization algorithm.
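For linearly separable training pairs, the quadratic optimization problem is commonly written in the hard-margin form below. The notation is introduced here and is not the paper's: x_i is a feature vector, y_i in {-1, +1} its class label, w the normal vector of the hyperplane and b its offset.

```latex
\min_{\mathbf{w},\, b} \; \frac{1}{2}\,\lVert \mathbf{w} \rVert^{2}
\quad \text{subject to} \quad
y_i \left( \mathbf{w}^{\top} \mathbf{x}_i + b \right) \ge 1,
\qquad i = 1, \dots, n .
```

The margin between the two classes is 2 / ||w||, so minimizing ||w|| maximizes the margin. The training points for which the constraint holds with equality are the support vectors.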
To avoid overfitting, cross validation is used to evaluate the fit provided by each parameter value set tried during the grid or pattern search process. The main advantages of SVM are that training is relatively easy, that the tradeoff between classifier complexity and error can be controlled explicitly, and that non-traditional data such as strings and trees can be used as input instead of feature vectors. Its weakness is that identifying a suitable kernel function for the characteristics of the problem is an intricate task. SVMs are very popular algorithms for text categorization and are among the best learning algorithms for spam filtering tasks. SVMs are also applied in image classification and handwriting recognition, and they are very effective in biometrics problems.

Soft Margin SVM

A sharp separation is not always possible, so the soft margin SVM chooses a hyperplane that splits the examples as cleanly as possible while still maximizing the distance to the nearest cleanly split examples. The main strength of the SVM is its ability to perform well even when a plethora of features is used; it tunes itself and maintains accuracy and generalization. Therefore, there is no compelling need to find the optimum number of features.

Neural Network

A neural network [8] is a collection of interconnected nodes, or neurons. Neural networks cover a large class of models and learning methods. A neural network processes records one at a time and learns by comparing its classification of each record with the known actual classification of that record. The error from the initial classification of the first record is fed back into the network and used to modify the network the second time around, and so on for many iterations.

EXAMPLE BASED CLASSIFIERS

Example based classifiers classify a new document by finding the k documents nearest to it in the training set and taking some form of majority vote on the classes of these nearest neighbors.

K Nearest Neighbors

The KNN technique [9] proceeds by choosing the first random points as initial seed clusters. Next, it enters a learning phase in which training data points are iteratively assigned to the cluster whose center is nearest (e.g. by Euclidean distance). Cluster centers are repeatedly adjusted to the mean of their currently assigned data points. The classification algorithm then finds the K nearest neighbors of a test data point and uses a majority vote to determine its class label.
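The final voting step can be sketched in a few lines. The example below covers only the nearest-neighbor vote with Euclidean distance, not the cluster seeding phase described in [9]. The two-dimensional feature vectors (for instance, the count of the word "free" and the number of links) and the choice of k = 3 are assumptions made for illustration.

```python
import math
from collections import Counter


def knn_predict(training, query, k=3):
    """Label a query point by majority vote among its k nearest neighbors."""
    # Sort the labeled points by Euclidean distance to the query.
    neighbors = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Invented feature vectors: (occurrences of "free", number of links).
TRAINING_POINTS = [
    ((3, 4), "spam"),
    ((4, 5), "spam"),
    ((5, 3), "spam"),
    ((0, 0), "ham"),
    ((1, 0), "ham"),
    ((0, 1), "ham"),
]

if __name__ == "__main__":
    print(knn_predict(TRAINING_POINTS, (4, 4)))
    print(knn_predict(TRAINING_POINTS, (0.5, 0.5)))
```

An odd k avoids tied votes in a two-class problem. `math.dist` requires Python 3.8 or later.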
The main strength of the KNN algorithm is that it provides good generalization accuracy on many domains and that its learning phase is fast.

V. CONCLUSION

In this paper we discussed the problem of spam and gave an overview of a taxonomy of text classification algorithms used in spam filtering. Three classes of text classification algorithms, namely rule based classifiers, linear classifiers and example based classifiers, were briefly discussed. In future, machine learning classifiers such as J48, Alternate Decision Tree, Decision Stump, boosting algorithms, Naïve Trees and CART can be used for text classification in spam detection to improve accuracy and performance.

References

[1] Mohammed Abdul Wajeed, T. Adilakshmi, Text Classification Using Machine Learning, Journal of Theoretical and Applied Information Technology.
[2] Fabrizio Sebastiani, Machine Learning in Automated Text Categorization, ACM Computing Surveys, 34(1).
[3] L. A. Zadeh, The Concept of a Linguistic Variable and Its Application to Approximate Reasoning-I, Information Science, 8.
[4] P. Sudhakar, G. Poonkuzhali, K. Thiagarajan, R. Kripa Keshav, K. Sarukesi, Fuzzy Logic for Spam Deduction, Recent Researches in Applied Computer and Applied Computational Science, ISBN:
[5] Aman Kumar Sharma, Suruchi Sahni, A Comparative Study of Classification Algorithms for Spam Data Analysis, International Journal on Computer Science and Engineering (IJSCE), 3(5).
[6] Johan Hovold, Naïve Bayes Spam Filtering Using Word Position Based Attributes, International Conference of E-mail and Anti Spam.
[7] H. D. Drucker, D. Wu, V. Vapnik, Support Vector Machines for Spam Categorization, IEEE Transactions on Neural Networks.
[8] Alia Taha Sabri, Adel Hamdan Mohammads, Bassam Al-Shargabi, Developing New Continuous Learning Approach for Spam Detection using Artificial Neural Network (CLA_ANN), ejsr.htm
[9] P. I. Nakov, P. M. Dobrikov, Non-Parametric Spam Filtering Based On KNN and LSA, Proceedings of the 33rd National Spring Conference, 2004.