Lasso-based Spam Filtering with Chinese Emails


Journal of Computational Information Systems 8: 8 (2012)

Zunxiong LIU 1, Xianlong ZHANG 1, Shujuan ZHENG 2
1 School of Information Engineering, East China Jiaotong University, Nanchang, China
2 Division of Scientific Research, Jiangxi University of Finance and Economics, Nanchang, China

Abstract

In spam filtering, a classifier built directly on the high-dimensional, sparse data produced by the vector space model incurs heavy computation and generalizes poorly. Feature extraction or feature selection is therefore commonly applied to reduce the dimension before classifier training. This paper proposes a spam filtering approach based on the L1-regularized multivariate linear model known as lasso regression, which builds a regression model for spam filtering and selects the important terms automatically. Logistic regression (LR) models are then built on the selected terms. Simulations on TREC06C show that LR with lasso term selection achieves better performance.

Keywords: Lasso; Feature Selection; LAR; Spam Filtering; Logistic Regression

1 Introduction

Spam emails, commonly known as unsolicited bulk emails (UBE) or unsolicited commercial emails (UCE) [1], have become an increasingly severe problem on the Internet. Growing at exponential speed, spam emails waste social resources and reduce productivity (misuse of storage space and computational resources, time spent reading and removing spam), and even threaten Internet security. Automatic spam filtering technologies are needed to deal with them. Among the techniques for combating spam emails, content-based spam filtering is a promising and effective approach [2]. Generally, there are two different approaches to detecting spam emails [2, 3, 4]: generative models (for instance, Naive Bayes [4]) and discriminative models such as SVM [2, 5] and Logistic Regression [3, 6].
It is widely acknowledged that support vector machines and Naive Bayes classifiers are among the top performers [7] in text classification. In TREC (Text REtrieval Conference), LR also achieved distinguished performance [8]. We adopt LR as the classifier in this paper.

The rest of this paper is organized as follows. Section 2 presents the related work. Section 3 describes the email representation and the VSM (Vector Space Model) [9, 10]. The feature selection method LARS is introduced in Section 4. Section 5 applies the methodology in our experiments and gives the experimental results. Finally, the conclusion is drawn in Section 6, and Section 7 is the acknowledgment.

This work is supported by the National Natural Science Foundation of China, the Social Sciences Foundation of the State Education Ministry (No. 10YJC630379), the Natural Science Foundation of Jiangxi Province (2010GZS0034), and the Technology Project under the Jiangxi Education Administration Department (GJJ10446). Corresponding author. Email address: xianlongok@163.com (Xianlong ZHANG). Copyright 2012 Binary Information Press, April 2012.

2 Related Works

Spam detection can be considered a binary document classification problem in which each email is regarded as a document. An email contains the body, the subject, and other header fields. The Vector Space Model, also called the bag-of-words method, is the most widely used document representation. With it, each email is processed and represented as a high-dimensional sparse vector. The available emails therefore form high-dimensional data (with thousands of features), in which many features are irrelevant or redundant. High-dimensional data increase computational complexity and reduce the generalization of classifiers, so dimensionality reduction is a crucial step in spam filtering.

Feature selection is one approach to dimension reduction [11]. It searches for an optimal feature subset in a high-dimensional feature space using statistical methods or information theory. The related measures [12] include document frequency, information gain (IG), the χ² statistic, and so on [7, 13].

Lasso (Least absolute shrinkage and selection operator) regression [14] is a multivariate linear regression with a bound on the sum of the absolute values of the coefficients, which selects variables and estimates coefficients simultaneously. Since the lasso was proposed, many extensions have been put forward, such as the elastic net and the group lasso [15].
Moreover, Least Angle Regression (LAR) [14] was proposed by Efron et al. to compute the lasso efficiently. In this paper, a new spam filtering approach based on the lasso is proposed and applied to Chinese spam filtering.

3 Email Representation

Email representation is the first step in spam filtering, and the VSM is commonly employed. In the VSM, the individual terms in each email are collected to construct a feature set $T_m = (t_1, t_2, \dots, t_m)$ containing $m$ terms. Each email can be represented as a vector $d_i = (w_1(d_i), w_2(d_i), \dots, w_m(d_i))$, where $w_j(d_i)$ is the weight of term $t_j$ in email $d_i$. Gathering the data vectors, the whole corpus is represented as the term-document matrix $X_{n \times m} = (x_1, x_2, \dots, x_m)$, where $x_i = (w_i(d_1), w_i(d_2), \dots, w_i(d_n))^T$ and $n$ is the number of documents. The value of $w_j(d_i)$ is calculated using the normalized ltc function [13], defined as:

$$w_j(d_i) = \frac{\log(f_{ji} + 1.0)\,\log\left(\frac{N}{n_j}\right)}{\sqrt{\sum_{k=1}^{m}\left[\log(f_{ki} + 1.0)\,\log\left(\frac{N}{n_k}\right)\right]^2}} \qquad (1)$$

where $f_{ji}$ is the number of occurrences of term $t_j$ in document $d_i$, $N$ is the total number of documents, $n_j$ is the number of documents in which $t_j$ occurs, and $m$ is the number of terms.

With the available corpus of $n$ emails, the term-document matrix $X = (x_1, x_2, \dots, x_m)$ of size $n \times m$ can be produced, where $x_1, x_2, \dots, x_m$ are $n$-dimensional vectors corresponding to the $m$ features. The responses for the $n$ emails constitute the vector $y$, whose components depend on the corresponding labels: if email $k$ is spam then $y_k = 0$, and if it is ham then $y_k = 1$.

4 LARS Algorithm

This section describes the approach to term selection in detail. Given data $X$ with $m$ features, $X$ and $y$ are first standardized so that $\sum_{i=1}^{n} x_{ij} = 0$, $\sum_{i=1}^{n} y_i = 0$, and $\frac{1}{N}\sum_{i=1}^{n} x_{ij}^2 = 1$ for later use. Let $\beta = (\beta_1, \beta_2, \dots, \beta_m)^T$. The lasso is equivalent to the following optimization problem:

$$\text{minimize: } S(\beta) = \sum_{i=1}^{n}\left(y_i - \sum_{j=1}^{m} x_{ij}\beta_j\right)^2 + \lambda \sum_{j=1}^{m} |\beta_j| \qquad (2)$$

where $\lambda > 0$ is a tunable regularization parameter. To solve this L1-penalized problem efficiently, LARS (Least Angle Regression) was put forward as an improvement on stagewise regression, offering high precision and easy computation. In LARS, the correlation between the term $x_i$ and the target $y$ is defined as:

$$c_i = x_i^T (y - \mu), \qquad \mu = X\beta \qquad (3)$$

Fig. 1: The procedure of LARS (predictors $x_{j1}$, $x_{j2}$, $x_{j3}$ and equiangular directions $u_1$, $u_2$, $u_3$)

LARS works with the procedure shown in Fig. 1. At the beginning of the algorithm $\mu = 0$. The correlation values are calculated and the maximum correlation $c_{j1}$ is found, meaning that the variable $x_{j1}$ is most correlated with the residual $y - \mu$; it is then added to the active set $\mathcal{A}$. After that, the largest possible step in the direction of this predictor is taken until some other predictor, say $x_{j2}$, has as much correlation as $x_{j1}$ with the current residual. LARS then proceeds in a direction equiangular between the two predictors until a third variable $x_{j3}$ earns its way into the most correlated set. LARS then proceeds equiangularly between the three variables to find the next variable.
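As an illustration of the first LARS step described above, the following minimal Python sketch standardizes a toy term-document matrix, evaluates the correlations of Eq. (3) with μ = 0, and picks the most correlated term. The function names, the toy data, the scaling to unit mean square, and the use of the absolute correlation when choosing the term are assumptions made for illustration; this is not the authors' implementation.

```python
import math


def standardize_columns(X):
    """Center each column and scale it so that its mean of squares is 1."""
    n, m = len(X), len(X[0])
    out = [[0.0] * m for _ in range(n)]
    for j in range(m):
        col = [X[i][j] for i in range(n)]
        mean = sum(col) / n
        centered = [v - mean for v in col]
        rms = math.sqrt(sum(v * v for v in centered) / n)
        for i in range(n):
            # Constant columns carry no information and are left at zero.
            out[i][j] = centered[i] / rms if rms > 0 else 0.0
    return out


def center(y):
    """Subtract the mean so that the responses sum to zero."""
    mean = sum(y) / len(y)
    return [v - mean for v in y]


def correlations(X, y, mu):
    """Eq. (3): c_j = x_j^T (y - mu) for every column j."""
    n, m = len(X), len(X[0])
    return [sum(X[i][j] * (y[i] - mu[i]) for i in range(n)) for j in range(m)]


def first_active_index(c):
    """Index of the term with the largest absolute correlation (assumption)."""
    return max(range(len(c)), key=lambda j: abs(c[j]))


# Hypothetical toy corpus: 4 emails, 3 terms; labels follow the paper
# (spam = 0, ham = 1).
X_toy = [[1, 0, 2], [0, 1, 2], [2, 0, 1], [0, 1, 0]]
y_toy = [0, 1, 0, 1]

Xs = standardize_columns(X_toy)
yc = center(y_toy)
c = correlations(Xs, yc, [0.0] * len(yc))  # mu = 0 at the start
j1 = first_active_index(c)                  # first member of the active set
```

In this toy example the second term occurs exactly in the ham emails, so it is the one that enters the active set first.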
The LARS implementation is listed in Table 1.

Table 1: The LARS algorithm

Notation: $u$ is the direction of the angle bisector, $c$ the correlation, $\mu$ the predictor, and $r$, $\hat{r}$ the step lengths.

Initialize $\mu = 0$.
While $\|y - \mu^{(k)}\| > \varepsilon$ and $|\mathcal{A}| \le m$:
  1. $c = X^T (y - \mu) = (c_1, c_2, \dots, c_m)^T$, $\mathcal{A} = \{j : c_j = \max(c) = C\}$, $X_{\mathcal{A}} = [\dots x_j \dots,\ j \in \mathcal{A}]$
  2. $u^{(k)} = X_{\mathcal{A}} w_{\mathcal{A}}$, where $w_{\mathcal{A}} = A_{\mathcal{A}} G_{\mathcal{A}}^{-1} 1_{\mathcal{A}}$ and $A_{\mathcal{A}} = (1_{\mathcal{A}}^T G_{\mathcal{A}}^{-1} 1_{\mathcal{A}})^{-1/2}$
  3. $a = X^T u^{(k)} = (a_1, a_2, \dots, a_m)^T$
  4. $r = \min^{+}_{j \in \mathcal{A}} \left\{ -\frac{\beta_j}{w_j} \right\}$; if $|\mathcal{A}| < m$ then $\hat{r} = \min^{+}_{j \notin \mathcal{A}} \left\{ \frac{C - c_j}{A_{\mathcal{A}} - a_j}, \frac{C + c_j}{A_{\mathcal{A}} + a_j} \right\}$, else $\hat{r} = \frac{C}{A_{\mathcal{A}}}$
  5. If $r < \hat{r}$: $\mu^{(k)} = \mu^{(k-1)} + r u^{(k)}$, $\beta_{\mathcal{A}} = \beta_{\mathcal{A}} + r w_{\mathcal{A}}$, $\mathcal{A} = \mathcal{A} \setminus \{\tilde{j}\}$;
     else: $\mu^{(k)} = \mu^{(k-1)} + \hat{r} u^{(k)}$, $\beta_{\mathcal{A}} = \beta_{\mathcal{A}} + \hat{r} w_{\mathcal{A}}$, $\mathcal{A} = \mathcal{A} \cup \{\hat{j}\}$
Return $\beta_{\mathcal{A}}$.

Here $\tilde{j}$ and $\hat{j}$ denote the indices attaining $r$ and $\hat{r}$, respectively.

5 Simulation

5.1 Document preprocessing

The Chinese corpus used for the simulation experiments comes from the 2006 TREC Spam Filtering Track public dataset trec06c [16]. It consists of spam emails and ham emails. Before the corpus is represented as a term-document matrix, the email contents must be extracted and analyzed. A Visual C++ application was written to extract the subject, content, and other major information from the original email text. Because Chinese text has no explicit spaces between words and often includes numbers and symbols, Chinese word segmentation is necessary. ICTCLAS from the Chinese Academy of Sciences is therefore used for word segmentation. In the process, useless features such as stop words, white space, and punctuation are deleted. Of all the terms in the corpus, 8879 appear in more than 60 emails (document frequency is less than 0.2%). These 8879 terms are chosen as the original feature variables. Moreover, terms that appear in the subject are given a higher weight because the subject may contain more important information.

5.2 Evaluation method

The TREC Spam Filter Evaluation [16], developed for TREC 2005, provides a standardized method for evaluating spam filtering techniques.
Several statistics are commonly used:

hm%: the ham misclassification percentage.
sm%: the spam misclassification percentage.
1-ROCA%: the area above the ROC curve, the most crucial statistic.
h=.1% (reported as hm:.1% in the tables): the value of sm% when hm% = 0.1.
lam%: the logistic average misclassification percentage, defined as

$$\text{lam\%} = \text{logit}^{-1}\left(\frac{\text{logit}(\text{hm\%}) + \text{logit}(\text{sm\%})}{2}\right) \qquad (4)$$

Here $\text{logit}(x) = \log\left(\frac{x}{1 - x}\right)$ and $\text{logit}^{-1}(x) = \frac{e^x}{1 + e^x}$.

5.3 Experiments and result

Many emails have no content and would yield zero vectors, which are removed to avoid computational problems. Because spam emails make up the majority of the emails in the experiment, spam and ham are selected in a ratio of 2:1. Before the experiment, the selected data are partitioned into training and testing datasets according to fixed rules.

First, lasso regression is used as the filtering algorithm to classify emails. At the same time, the important features are picked out, namely those whose coefficients are nonzero in the lasso solution. Then another filter based on logistic regression is evaluated on the new data, which contain only the terms selected by the lasso. That is, the new data are used to train and test the LR classifier.

With the training data $\{X, y\}$ ($X$ is the data matrix with $n$ documents and $y$ is the response vector holding the labels of the $n$ emails), 10-fold cross-validation is employed to train the lasso regression filter with varying regularization parameters. Given the solution $\beta$ from the lasso algorithm, the terms whose coefficients are nonzero form an effective term subset. Changing the regularization parameter controls the L1 penalty and thereby determines the number of selected terms. Here $\lambda$ is set to $\lambda_i \in \{12, 10, 8, 7\}$, and independent experiments are carried out with 10-fold cross-validation. That is to say, 10 lasso models are built for each selected regularization parameter. They are also checked against the testing data.
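The link between the regularization parameter and the number of selected terms can be sketched in a few lines of Python. The sketch below minimizes the objective of Eq. (2) by cyclic coordinate descent with soft-thresholding, which is a different solver from the LARS procedure used in the paper and is swapped in here only because it is short. The function names, the toy orthogonal data, and the λ values are hypothetical and chosen for illustration.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, n_iter=200):
    """Minimize sum_i (y_i - sum_j x_ij b_j)^2 + lam * sum_j |b_j|
    by cyclic coordinate descent (not LARS)."""
    n, m = len(X), len(X[0])
    beta = [0.0] * m
    resid = list(y)  # residual y - X beta, with beta = 0 initially
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) for j in range(m)]
    for _ in range(n_iter):
        for j in range(m):
            if col_sq[j] == 0.0:
                continue  # an all-zero column can never be selected
            # Inner product of column j with the residual that excludes
            # column j's own current contribution.
            rho = sum(X[i][j] * resid[i] for i in range(n)) + col_sq[j] * beta[j]
            # The squared loss carries no factor 1/2, hence the lam / 2.
            new_beta = soft_threshold(rho, lam / 2.0) / col_sq[j]
            delta = new_beta - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new_beta
    return beta


def selected_terms(beta, tol=1e-12):
    """Indices of the terms whose lasso coefficient is nonzero."""
    return [j for j, b in enumerate(beta) if abs(b) > tol]


# Hypothetical toy data with three mutually orthogonal, centered columns.
X_toy = [[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]]
y_toy = [3.5, 2.5, -2.5, -3.5]  # equals 3.0 * column 0 + 0.5 * column 1

# A larger penalty keeps fewer terms.
subsets = {lam: selected_terms(lasso_cd(X_toy, y_toy, lam)) for lam in (2.0, 8.0, 30.0)}
```

In a pipeline like the one described above, the columns indexed by each subset would then be passed to the logistic regression classifier.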
The average number of selected terms and the evaluation statistics, which were analyzed carefully, are calculated and presented in Fig. 2(a) and Table 2. For each lasso model and each regularization parameter, a new term matrix is built that covers only the selected terms with nonzero coefficients. The new data are also divided into two groups, a training dataset and a testing dataset. The training dataset is used to train the logistic regression classifier, and the new LR classifier is checked against the testing dataset. The experimental results are shown in Fig. 2(b) and Table 3.

According to the TREC Spam Filter Evaluation standard, a good spam filter has a small area above the Receiver Operating Characteristic (ROC) curve. Comparing the results in Fig. 2(a) and Fig. 2(b), it is easy to see that logistic regression with lasso term selection performs better than the lasso alone, although the former needs additional time to build the logistic regression model. Furthermore, performance improves as the number of selected terms increases. In this experiment the best ROC curve is obtained when λ = 7, as shown in the figures. Table 2 and Table 3 also show that the filters perform better as the number of selected terms increases, and that they perform best when λ = 7. By comparison, the filters with lasso-based term selection and logistic regression are superior to the lasso-based filters.

Fig. 2: The ROC curve of Logistic regression and Lasso regression. (a) The ROC curve of lasso; (b) The ROC curve with the lasso-based term selection and logistic regression. Axes: %Ham Misclassification (logit scale) against %Spam Misclassification (logit scale); curves for λ=12, λ=10, λ=8, λ=7.

Table 2: Results with the lasso regression (columns: λ, Terms, 1-ROCA%, hm:.1%, lam%)

Table 3: Result with the lasso-based term selection and logistic regression (columns: λ, Terms, 1-ROCA%, hm:.1%, lam%)

Afterwards, more experiments are carried out to show that the proposed method achieves better performance. Two other typical feature selection methods, IG and χ², are used to choose the important terms, and two further logistic regression models are built on them to classify the emails. The ROC curves and results are presented in Fig. 3, Table 4, and Table 5. Fig. 3(a) and Fig. 3(b) show that χ² feature selection has a smaller area above the ROC curve than the IG method, but both are inferior to the lasso approach. In this sense the lasso is the better feature selection method for spam filtering. Likewise, a filter with better performance has smaller values of 1-ROCA%, hm:.1%, and lam%. The χ² and IG feature selection methods also perform poorly on hm:.1% and lam% compared with the lasso selection method. The conclusion can therefore be drawn that the lasso does well at feature selection in spam filtering.
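The lam% statistic reported in the tables follows directly from Eq. (4). The Python sketch below is a minimal illustration under two assumptions: the percentages are converted to proportions before the logit is taken, and the labels follow the paper's convention (spam = 0, ham = 1). The function names are hypothetical.

```python
import math


def logit(p):
    """log(p / (1 - p)) for a proportion 0 < p < 1."""
    return math.log(p / (1.0 - p))


def inv_logit(x):
    """Inverse of logit; algebraically equal to e^x / (1 + e^x)."""
    return 1.0 / (1.0 + math.exp(-x))


def lam_percent(hm_percent, sm_percent):
    """Eq. (4), with percentages converted to proportions (assumption).
    Undefined when either rate is exactly 0% or 100%."""
    hm = hm_percent / 100.0
    sm = sm_percent / 100.0
    return 100.0 * inv_logit((logit(hm) + logit(sm)) / 2.0)


def misclassification_percentages(labels, predictions):
    """Return (hm%, sm%) for labels and predictions coded spam = 0, ham = 1."""
    ham_total = sum(1 for t in labels if t == 1)
    spam_total = sum(1 for t in labels if t == 0)
    ham_wrong = sum(1 for t, p in zip(labels, predictions) if t == 1 and p == 0)
    spam_wrong = sum(1 for t, p in zip(labels, predictions) if t == 0 and p == 1)
    return 100.0 * ham_wrong / ham_total, 100.0 * spam_wrong / spam_total


# Hypothetical predictions: 1 of 4 ham and 2 of 4 spam are misclassified.
hm, sm = misclassification_percentages(
    [1, 1, 1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 1, 1],
)
lam = lam_percent(hm, sm)
```

Because the average is taken on the logit scale, lam% always lies between hm% and sm% and treats the two error types symmetrically.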

Fig. 3: The ROC curve of Logistic using χ² and IG. (a) The ROC curve of Logistic using χ² to select terms; (b) The ROC curve of Logistic using IG to select terms. Axes: %Ham Misclassification (logit scale) against %Spam Misclassification (logit scale); curves for terms=577, terms=820, terms=1370, terms=1823.

Table 4: Result of the logistic regression using χ² variable selection (columns: λ, Terms, 1-ROCA%, hm:.1%, lam%)

Table 5: Result of the logistic regression using IG variable selection (columns: λ, Terms, 1-ROCA%, hm:.1%, lam%)

6 Conclusion

This paper presents an approach to lasso-based spam filtering. The key features, those whose coefficients are nonzero in the lasso solution, are selected for feature reduction. Based on them, logistic regression filters are built and tested on Chinese text emails. Lasso regression selects the terms and estimates the regression coefficients simultaneously; in addition, lasso-based spam filters can be built directly. The approach is compared with two other term selection methods, all of them working with classifiers based on logistic regression. The simulation results show that the lasso approach succeeds in term selection.

There are still many challenges in spam filtering. Senders deliver junk emails in other ways, such as word obfuscation, tokenization [17], and even image spam. Applying the lasso method to these new problems will be our future research.

References

[1] A. Çıltık, T. Güngör, Time-efficient spam e-mail filtering using n-gram models, Pattern Recognition Letters, 2008.
[2] H. Yong, H. Xiaoning, Y. Muyun, Q. Haoliang, and S. Chao, Chinese Spam Filter Based on Relaxed Online Support Vector Machine, Proc. 2010 International Conference on Asian Language Processing (IALP), 2010.
[3] H. Yong, Y. Muyun, Q. Haoliang, H. Xiaoning, and L. Sheng, The Improved Logistic Regression Models for Spam Filtering, Proc. International Conference on Asian Language Processing (IALP 09), 2009.
[4] S. Lu, D. Chiang, H. Keh, and H. Huang, Chinese text classification by the Naive Bayes Classifier and the associative classifier with multiple confidence threshold values, Knowledge-Based Systems, 2010.
[5] H. Drucker, D. Wu, and V. N. Vapnik, Support vector machines for spam categorization, IEEE Transactions on Neural Networks, 1999.
[6] H. Qi, X. He, Y. Han, M. Yang, and S. Li, Information Theory Based Feature Valuing for Logistic Regression for Spam Filtering, International Conference on Asian Language Processing, 2010.
[7] T. Almeida, J. Almeida, and A. Yamakami, Spam filtering: how the dimensionality reduction affects the accuracy of Naive Bayes classifiers, Journal of Internet Services and Applications, 2011.
[8] G. Cormack, TREC 2007 Spam Track Overview, Proceedings of the Sixteenth Text REtrieval Conference, National Institute of Standards and Technology (NIST).
[9] F. Sebastiani, Machine learning in automated text categorization, ACM Comput. Surv., 2002.
[10] X. Tai, F. Ren, and K. Kita, An information retrieval model based on vector space method by supervised learning, Inf. Process. Manage., 2002.
[11] I. A. Gheyas and L. S. Smith, Feature subset selection in large dimensionality domains, Pattern Recognition, 2010.
[12] W. Zhang, T. Yoshida, and X. Tang, A comparative study of TF*IDF, LSI and multi-words for text classification, Expert Systems With Applications, 2011.
[13] Y. Li, C. Luo, and S. M. Chung, Text Clustering with Feature Selection by Using Statistical Data, IEEE Transactions on Knowledge and Data Engineering, 2008.
[14] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, Least angle regression, Annals of Statistics, 2004.
[15] L. Meier, S. van de Geer, and P. Bühlmann, The group lasso for logistic regression, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2008.
[16] G. V. Cormack, TREC 2006 Spam Track Overview.
[17] D. Sculley, G. Wachman, and C. E. Brodley, Spam Filtering Using Inexact String Matching in Explicit Feature Space with On-Line Linear Classifiers, 2006.