Lasso-based Spam Filtering with Chinese Emails

Journal of Computational Information Systems 8: 8 (2012) 3315-3322
Available at http://www.jofcis.com

Zunxiong LIU 1, Xianlong ZHANG 1, Shujuan ZHENG 2
1 School of Information Engineering, East China Jiaotong University, Nanchang 330013, China
2 Division of Scientific Research, Jiangxi University of Finance and Economics, Nanchang 330013, China

Abstract
In spam email filtering, a classifier built directly on the high-dimensional, sparse email data produced by the vector space model incurs heavy computation and generalizes poorly. Feature extraction or feature selection is therefore commonly applied to reduce the dimension before classifier training. This paper proposes a spam filtering approach based on Lasso regression, an l1-regularized multivariate linear model, which builds a regression model for spam filtering and selects the important terms automatically. Logistic regression (LR) models are then built on the selected terms. Simulations on TREC06C show that LR with lasso term selection performs better than the lasso regression filter alone and than LR with information gain or χ2 term selection.

Keywords: Lasso; Feature Selection; LAR; Spam Filtering; Logistic Regression

1 Introduction
Spam emails, commonly known as unsolicited bulk emails (UBE) or unsolicited commercial emails (UCE) [1], have become an increasingly severe problem on the Internet. Growing at an exponential rate, spam wastes social resources and reduces productivity (misuse of storage space and computational resources, time spent reading and removing spam), and even threatens Internet security. Automatic spam filtering technologies are needed to deal with it. Among the techniques for combating spam, content-based filtering is a promising and effective approach [2].
Generally, there are two different approaches to detecting spam emails [2, 3, 4]: generative models (for instance, Naive Bayes [4]) and discriminative models such as SVM [2, 5] and logistic regression [3, 6]. It is widely acknowledged that support vector machines and Naive Bayes classifiers are among the top performers in text classification [7]. In the TREC (Text Retrieval Conference) spam track, LR achieved an outstanding performance [8], so we adopt LR as the classifier in this paper.

The rest of this paper is organized as follows. Section 2 presents the related work. Section 3 describes the email representation and the VSM (Vector Space Model) [9, 10]. The LARS feature selection method is introduced in Section 4. Section 5 applies the methodology in our experiments and reports the experimental results. Finally, the conclusion is drawn in Section 6, and Section 7 is the acknowledgment.

This work is supported by the National Natural Science Foundation of China (61065003), the Social Sciences Foundation of the State Education Ministry (No. 10YJC630379), the Natural Science Foundation of Jiangxi Province (2010GZS0034) and the Technology Project under Jiangxi Education Administration Department (GJJ10446).
Corresponding author. Email address: xianlongok@163.com (Xianlong ZHANG).
1553-9105 / Copyright 2012 Binary Information Press, April 2012

2 Related Works
Spam detection can be considered a binary document classification problem in which each email is regarded as a document. An email contains the body, the subject and other header fields. The Vector Space Model, also called the bag-of-words method, is the most widely used document representation. With it, each email is processed and represented as a high-dimensional sparse vector. The available emails therefore form high-dimensional data (with thousands of features), in which many features are irrelevant or redundant. High-dimensional data increase computational complexity and reduce the generalization ability of classifiers, so dimensionality reduction is a crucial step in spam filtering.

Feature selection is one approach to dimension reduction [11]. It aims to find an optimal feature subset in a high-dimensional feature space by using statistical methods or information theory. The related measures [12] include document frequency, information gain (IG), the χ2 statistic and so on [7, 13]. Lasso [14] (Least absolute shrinkage and selection operator) regression is a multivariate linear regression with a bound on the sum of the absolute values of the coefficients, which selects variables and estimates coefficients simultaneously.
After the lasso was proposed, many advanced techniques based on it were put forward, such as the elastic net and the group lasso [15]. Moreover, Least Angle Regression (LAR) [14] was proposed by Efron et al. to compute lasso solutions efficiently. In this paper, a new spam filtering approach based on the lasso is proposed and applied to Chinese email spam filtering.

3 Email Representation
Email representation is the first step in spam email filtering, and the VSM is commonly employed. In the VSM, the individual terms in all emails are collected to construct the feature set $T_m = (t_1, t_2, \ldots, t_m)$ containing $m$ terms. Each email is represented as a vector $d_i = (w_1(d_i), w_2(d_i), \ldots, w_m(d_i))$, where $w_j(d_i)$ is the weight of term $t_j$ in email $d_i$. Gathering the email vectors, the whole corpus is represented as the term-document matrix $X_{n \times m} = (x_1, x_2, \ldots, x_m)$, where $x_j = (w_j(d_1), w_j(d_2), \ldots, w_j(d_n))^T$ and $n$ is the number of documents. The value of $w_j(d_i)$ is calculated using the normalized ltc function [13], defined as:

$$ w_j(d_i) = \frac{\log(f_{ji} + 1.0) \cdot \log(N / n_j)}{\sqrt{\sum_{k=1}^{m} \left[ \log(f_{ki} + 1.0) \cdot \log(N / n_k) \right]^2}} \tag{1} $$

where $f_{ji}$ is the number of occurrences of term $t_j$ in document $d_i$, $N$ is the total number of documents, $n_j$ is the number of documents in which $t_j$ occurs, and $m$ is the number of terms.

With the available corpus of $n$ emails, a term-document matrix $X = (x_1, x_2, \ldots, x_m)$ of size $n \times m$ is produced, where $x_1, x_2, \ldots, x_m$ are $n$-dimensional vectors corresponding to the $m$ features. The responses for the $n$ emails constitute the vector $y$, whose components depend on the corresponding email label: if email $k$ is spam then $y_k = 0$; otherwise it is ham and $y_k = 1$.

4 LARS Algorithm
In this section, the approach to term selection is described in detail. Given email data $X$ with $m$ features, $X$ and $y$ are first standardized so that

$$ \sum_{i=1}^{n} x_{ij} = 0, \qquad \sum_{i=1}^{n} y_i = 0, \qquad \frac{1}{n} \sum_{i=1}^{n} x_{ij}^2 = 1 $$

for later use. Let $\beta = (\beta_1, \beta_2, \ldots, \beta_m)^T$. The lasso is equivalent to the following optimization problem:

$$ \text{minimize: } S(\beta) = \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{m} x_{ij} \beta_j \Big)^2 + \lambda \sum_{j=1}^{m} |\beta_j| \tag{2} $$

where $\lambda > 0$ is a tunable regularization parameter. To solve this problem with the L1 penalty efficiently, LARS (Least Angle Regression) was put forward as an improvement on stagewise regression, with high precision and easy computation. In LARS, the correlation between the term $x_i$ and the target $y$ is defined as:

$$ c_i = x_i^T (y - \mu), \qquad \mu = X\beta \tag{3} $$

LARS follows the procedure shown in Fig. 1. At the beginning of the algorithm $\mu = 0$. The correlations are calculated and the maximum absolute correlation $c_{j1}$ is found, meaning that the variable $x_{j1}$ is the one most correlated with the residual $y - \mu$; it is added to the active set $A$. The largest possible step is then taken in the direction of this predictor until some other predictor, say $x_{j2}$, has as much correlation with the current residual as $x_{j1}$. LARS then proceeds in the direction equiangular between the two predictors until a third variable $x_{j3}$ earns its way into the most correlated set. LARS then proceeds equiangularly between the three variables to find the next variable, and so on. The LARS implementation is listed in Table 1.

Fig. 1: The procedure of LARS (predictors $x_{j1}$, $x_{j2}$, $x_{j3}$ and successive directions $u_1$, $u_2$, $u_3$)
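The equiangular direction used in the procedure above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `equiangular_direction`, the synthetic data, and the simplification that all active correlations are positive (sign handling omitted) are assumptions made for illustration, and the columns are scaled to unit length rather than to the standardization stated in the text.

```python
import numpy as np

def equiangular_direction(X, active):
    """Return (u, w, a_A) for the active columns of X (hypothetical helper).

    G_A = X_A^T X_A, a_A = (1^T G_A^{-1} 1)^(-1/2),
    w_A = a_A * G_A^{-1} 1, u = X_A w_A.
    Assumes all active correlations are positive, so no sign flips are applied.
    """
    X_A = X[:, active]
    G = X_A.T @ X_A
    ones = np.ones(len(active))
    G_inv_ones = np.linalg.solve(G, ones)
    a_A = 1.0 / np.sqrt(ones @ G_inv_ones)
    w = a_A * G_inv_ones
    u = X_A @ w
    return u, w, a_A

# Synthetic, centered data standing in for a small term-document matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
X -= X.mean(axis=0)
X /= np.linalg.norm(X, axis=0)        # unit-length columns (an assumption)
y = rng.normal(size=50)
y -= y.mean()

c = X.T @ y                            # correlations c = X^T (y - mu) at mu = 0
j1 = int(np.argmax(np.abs(c)))         # first variable to enter the active set

active = [0, 1, 2]                     # an arbitrary active set for illustration
u, w, a_A = equiangular_direction(X, active)
inner = X[:, active].T @ u             # each entry equals a_A
```

The vector `u` has unit length and makes the same inner product `a_A` with every active column, which is what "equiangular" means here.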

Table 1: The LARS algorithm

Notation: $u$ is the direction of the angle bisector, $c$ the correlation, $\mu$ the predictor, and $r$, $\hat{r}$ the step lengths.

Initialize $\mu = 0$.
While $\|y - \mu(k)\| > \varepsilon$ and $|A| \le m$:
1. $c = X^T (y - \mu) = (c_1, c_2, \ldots, c_m)^T$, $A = \{ j : |c_j| = \max(|c|) = C \}$, $X_A = [\ldots x_j \ldots,\ j \in A]$
2. $u(k) = X_A w_A$, with $w_A = a_A G_A^{-1} 1_A$, $a_A = (1_A^T G_A^{-1} 1_A)^{-1/2}$ and $G_A = X_A^T X_A$
3. $a = X^T u(k) = (a_1, a_2, \ldots, a_m)^T$
4. $r = \min^{+}_{j \in A} \{ -\beta_j / w_j \}$; if $|A| < m$ then $\hat{r} = \min^{+}_{j \notin A} \left\{ \frac{C - c_j}{a_A - a_j}, \frac{C + c_j}{a_A + a_j} \right\}$, else $\hat{r} = C / a_A$
5. If $r < \hat{r}$: $\mu(k) = \mu(k-1) + r\,u(k)$, $\beta_A = \beta_A + r\,w_A$, $A = A \setminus \{\tilde{j}\}$, where $\tilde{j}$ attains $r$;
   else: $\mu(k) = \mu(k-1) + \hat{r}\,u(k)$, $\beta_A = \beta_A + \hat{r}\,w_A$, $A = A \cup \{\hat{j}\}$, where $\hat{j}$ attains $\hat{r}$.
Return $\beta_A$.

5 Simulation

5.1 Preprocessing of email documents
The Chinese email corpus used for the simulation experiments comes from the 2006 TREC Spam Filtering Track public dataset trec06c [16]. It consists of 42854 spam emails and 21766 ham emails. Before the email corpus is represented as a term-document matrix, the email contents must be extracted and analyzed. A Visual C++ application was written to extract the subject, content, and other major information from the original email text. Since Chinese text has no explicit spaces between words and often includes numbers and symbols, Chinese word segmentation is necessary. ICTCLAS, from the Chinese Academy of Sciences, is used for word segmentation. In the process, useless features such as stop words, white space and punctuation are deleted. Statistics on the email corpus show that there are 49452 terms in all and that 8879 terms appear in more than 60 emails (document frequency is less than 0.2%). These 8879 terms are chosen as the original feature variables. Moreover, terms appearing in the email subject are given a higher weight because the subject may contain more important information.
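The normalized ltc weighting of Eq. (1), combined with a document-frequency cut like the one described above, might be sketched as follows. This is only an illustration on a toy count matrix: the function name `ltc_weights` and the `min_df` parameter are hypothetical, and the higher weight for subject terms is not modeled.

```python
import numpy as np

def ltc_weights(counts, min_df=1):
    """Normalized ltc weights for a (documents x terms) count matrix.

    counts[i, j] is f_ji, the raw count of term j in document i.
    Terms occurring in fewer than min_df documents are dropped.
    Returns (W, kept): the weight matrix and the kept term indices.
    """
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[0]
    df = (counts > 0).sum(axis=0)                  # n_j: documents containing term j
    kept = np.flatnonzero(df >= max(min_df, 1))    # never keep terms with df = 0
    tf = np.log(counts[:, kept] + 1.0)             # log(f_ji + 1.0)
    idf = np.log(n_docs / df[kept])                # log(N / n_j)
    raw = tf * idf
    norms = np.linalg.norm(raw, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                        # all-zero documents stay zero
    return raw / norms, kept

# Toy corpus: 3 documents, 3 terms; term 2 occurs in every document.
counts = [[2, 0, 1],
          [0, 1, 1],
          [1, 0, 1]]
W, kept = ltc_weights(counts)
W2, kept2 = ltc_weights(counts, min_df=2)
```

A term that occurs in every document gets weight zero because $\log(N/n_j) = 0$, and a document whose remaining terms all have zero weight becomes a zero vector, which is the situation Section 5.3 handles by removing such emails.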
5.2 Evaluation method
The TREC Spam Filter Evaluation [16], developed for TREC 2005, provides a standardized method for evaluating spam filtering techniques. Several statistics are commonly used:
hm%: the ham misclassification percentage.
sm%: the spam misclassification percentage.
1-ROCA%: the area above the ROC curve, the most crucial statistic.

lam%: the logistic average misclassification percentage, defined as

$$ \text{lam\%} = \operatorname{logit}^{-1}\!\left( \frac{\operatorname{logit}(\text{hm\%}) + \operatorname{logit}(\text{sm\%})}{2} \right) \tag{4} $$

where $\operatorname{logit}(x) = \log\left(\frac{x}{1-x}\right)$ and $\operatorname{logit}^{-1}(x) = \frac{e^x}{1+e^x}$.
h=.1% (reported as hm:.1% in the tables): the value of sm% when hm% = 0.1.

5.3 Experiments and result
Many emails have no content and lead to zero vectors, which are removed to avoid computational problems. Because spam makes up the majority of the emails, spam and ham are selected in a ratio of 2:1. Before the experiment, the selected email data are partitioned into training and testing datasets.

First, lasso regression is used as the filter algorithm to classify emails. At the same time, the important features are picked out, namely those whose coefficients are nonzero in the lasso solution. Then a second filter based on logistic regression is tested on new data containing only the terms selected by the lasso; these data are used to train and test the LR classifier.

With the training data {X, y} (X is the data matrix of n email documents and y is the response vector holding the labels of the n emails), 10-fold cross-validation is employed to train the lasso regression filter with varying regularization parameters. Given the solution β from the lasso algorithm, the terms whose coefficients are nonzero form an effective term subset. Changing the regularization parameter controls the L1-norm penalty and therefore determines the number of selected terms. Here λ takes the values 12, 10, 8 and 7, and independent experiments are run with 10-fold cross-validation. That is to say, 10 lasso models are built for each selected regularization parameter. They are also checked with the testing data.
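Returning briefly to the evaluation statistics of Section 5.2, the lam% of Eq. (4) can be computed with a few lines of Python. This is a minimal sketch, and it assumes that the percentages are converted to proportions before the logit is applied, which the text does not state explicitly; the function names are illustrative.

```python
import math

def logit(p):
    """logit(p) = log(p / (1 - p)) for a proportion 0 < p < 1."""
    return math.log(p / (1.0 - p))

def inv_logit(x):
    """Inverse logit: e^x / (1 + e^x)."""
    return math.exp(x) / (1.0 + math.exp(x))

def lam_percent(hm_percent, sm_percent):
    """Logistic average misclassification, in percent.

    Assumption: hm% and sm% are given in percent and are converted to
    proportions before the logit; both must lie strictly between 0 and 100.
    """
    hm = hm_percent / 100.0
    sm = sm_percent / 100.0
    return 100.0 * inv_logit((logit(hm) + logit(sm)) / 2.0)

equal_case = lam_percent(1.0, 1.0)     # equal error rates give the same rate back
mixed_case = lam_percent(0.1, 10.0)    # lies between the two input rates
```

Because the average is taken on the logit scale, lam% always lies between hm% and sm%, and it is undefined when either rate is exactly 0 or 100.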
The average number of selected terms and the evaluation statistics are calculated and presented in Fig. 2(a) and Table 2. For each lasso model and each regularization parameter, a new term-document matrix is built that contains only the selected terms, that is, those with nonzero coefficients. The new data are again divided into a training dataset and a testing dataset. The training dataset is used to train the logistic regression classifier, and the resulting LR classifier is checked with the testing dataset. The experimental results are shown in Fig. 2(b) and Table 3.

According to the TREC Spam Filter Evaluation standard, a good spam filter has a small area above the Receiver Operating Characteristic (ROC) curve. Comparing Fig. 2(a) and Fig. 2(b), it is easy to see that logistic regression with lasso term selection performs better than the lasso alone, although the former needs additional time to build the logistic regression model. Furthermore, performance improves as the number of selected terms increases. In this experiment the best ROC curve is obtained when λ = 7, as shown in the figures.

Table 2 and Table 3 also show that the filters perform better as the number of selected terms increases, and that the best 1-ROCA% and lam% are obtained when λ = 7. By comparison, the filters with lasso-based term selection and logistic regression are superior to the lasso-based filters.

Fig. 2: The ROC curve of Logistic regression and Lasso regression. (a) The ROC curve of lasso. (b) The ROC curve with the lasso-based term selection and logistic regression. Axes: % Ham Misclassification (logit scale) and % Spam Misclassification (logit scale); curves for λ = 12, 10, 8, 7.

Table 2: Results with the lasso regression
λ    Terms   1-ROCA%                  hm:.1%                   lam%
12   577     0.0305 (0.0252-0.0398)   0.0859 (0.0529-0.1191)   0.0962 (0.0817-0.1138)
10   820     0.0276 (0.0227-0.0362)   0.0751 (0.0368-0.1103)   0.0902 (0.0737-0.1050)
8    1370    0.0245 (0.0200-0.0322)   0.0675 (0.0368-0.1044)   0.0874 (0.0708-0.1014)
7    1823    0.0229 (0.0189-0.0299)   0.0662 (0.0368-0.1044)   0.0857 (0.0685-0.0991)

Table 3: Result with the lasso-based term selection and logistic regression
λ    Terms   1-ROCA%                  hm:.1%                   lam%
12   577     0.0047 (0.0025-0.0109)   0.0023 (0.0015-0.0059)   0.0140 (0.0072-0.0186)
10   820     0.0026 (0.0006-0.0060)   0.0021 (0.0015-0.0059)   0.0115 (0.0063-0.0165)
8    1370    0.0013 (0.0001-0.0032)   0.0010 (0.0006-0.0015)   0.0072 (0.0001-0.0123)
7    1823    0.0012 (0.0001-0.0025)   0.0011 (0.0008-0.0015)   0.0063 (0.0001-0.0124)

Afterwards, more experiments are carried out to verify that the proposed method achieves better performance. Two other typical feature selection methods, IG and χ2, are used to choose the important terms, and two further logistic regression models are built on them to classify the emails. The ROC curves and results are presented in Fig. 3, Table 4 and Table 5. Fig. 3(a) and Fig. 3(b) show that χ2 feature selection has a smaller area above the ROC curve than the IG method, but both are inferior to the lasso-based term selection of Table 3. It can be said that the lasso is a better feature selection method for spam filtering. In the same way, a filter with better performance has smaller values of 1-ROCA%, hm:.1% and lam%.
The χ2 and IG feature selection methods also perform worse on hm:.1% and lam% than the lasso selection method. The conclusion can therefore be drawn that the lasso does well at feature selection in spam filtering.
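The two-stage procedure of Section 5.3 (lasso term selection followed by logistic regression on the selected terms) can be sketched on synthetic data as follows. This is an illustration under stated assumptions, not the experimental setup of the paper: the lasso of Eq. (2) is solved by cyclic coordinate descent rather than by LARS, logistic regression is fitted by plain gradient descent, and the data, the value `lam=30.0` and all function names are hypothetical.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=100):
    """Minimize sum((y - X b)^2) + lam * sum(|b|) by cyclic coordinate descent."""
    n, m = X.shape
    beta = np.zeros(m)
    resid = y.copy()
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(m):
            resid += X[:, j] * beta[j]             # take term j out of the fit
            rho = X[:, j] @ resid
            # Soft-thresholding at lam / 2 (the squared loss has no 1/2 factor).
            beta[j] = np.sign(rho) * max(abs(rho) - lam / 2.0, 0.0) / col_sq[j]
            resid -= X[:, j] * beta[j]             # put the updated term back
    return beta

def fit_logistic(X, y, lr=0.5, n_iter=300):
    """Gradient-descent logistic regression with an intercept column."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(Xb @ w)))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def predict_logistic(w, X):
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    return (Xb @ w > 0).astype(int)

# Synthetic data: only features 0 and 1 carry label information.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

# Standardize with training statistics; center the response for the lasso.
mean, std = X_train.mean(axis=0), X_train.std(axis=0)
Z_train, Z_test = (X_train - mean) / std, (X_test - mean) / std

beta = lasso_cd(Z_train, y_train - y_train.mean(), lam=30.0)
selected = np.flatnonzero(beta)                    # terms with nonzero coefficients
w = fit_logistic(Z_train[:, selected], y_train)
accuracy = (predict_logistic(w, Z_test[:, selected]) == y_test).mean()
```

Raising `lam` shrinks more coefficients to exactly zero and so selects fewer features, which mirrors how the number of selected terms falls as λ increases in Table 2.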

Fig. 3: The ROC curve of Logistic using χ2 and IG. (a) The ROC curve of Logistic using χ2 to select terms. (b) The ROC curve of Logistic using IG to select terms. Axes: % Ham Misclassification (logit scale) and % Spam Misclassification (logit scale); curves for terms = 577, 820, 1370, 1823.

Table 4: Result of the logistic regression using χ2 variable selection
λ    Terms   1-ROCA%                  hm:.1%                   lam%
12   577     0.0240 (0.0177-0.0285)   0.0306 (0.0206-0.0456)   0.0468 (0.0353-0.0573)
10   820     0.0242 (0.0159-0.0312)   0.0265 (0.0176-0.0368)   0.0454 (0.0332-0.0526)
8    1370    0.0199 (0.0133-0.0239)   0.0182 (0.0088-0.0235)   0.0376 (0.0249-0.0453)
7    1823    0.0141 (0.0076-0.0189)   0.0153 (0.0074-0.0235)   0.0335 (0.0256-0.0417)

Table 5: Result of the logistic regression using IG variable selection
λ    Terms   1-ROCA%                  hm:.1%                   lam%
12   577     0.0382 (0.0259-0.0568)   0.0787 (0.0353-0.1765)   0.0816 (0.0624-0.1005)
10   820     0.0450 (0.0281-0.0599)   0.0693 (0.0382-0.1059)   0.0750 (0.0633-0.0845)
8    1370    0.0379 (0.0295-0.0487)   0.0396 (0.0250-0.0544)   0.0625 (0.0499-0.0706)
7    1823    0.0323 (0.0223-0.0509)   0.0269 (0.0088-0.0588)   0.0553 (0.0467-0.0694)

6 Conclusion
In this paper, an approach to lasso-based spam filtering is presented. The key features, namely those whose coefficients are nonzero in the lasso solution, are selected for feature reduction. Based on them, logistic regression filters are built and tested on Chinese text email. Lasso regression selects the terms and estimates the regression coefficients simultaneously, so lasso-based spam filters can also be built directly. The approach is compared with two other term selection methods, all of which work with classifiers based on logistic regression.
The simulation results show that the lasso approach succeeds in term selection. There are still many challenges in spam filtering: senders will deliver junk emails in other ways, such as word obfuscation, tokenization attacks [17], and even image spam. Applying the lasso method to these new problems will be our future research.

References
[1] A. Çıltık, T. Güngör, Time-efficient spam e-mail filtering using n-gram models, Pattern Recognition Letters, 2008, pp. 19-33.
[2] H. Yong, H. Xiaoning, Y. Muyun, Q. Haoliang, and S. Chao, Chinese Spam Filter Based on Relaxed Online Support Vector Machine, Proc. 2010 International Conference on Asian Language Processing (IALP), 2010, pp. 185-188.
[3] H. Yong, Y. Muyun, Q. Haoliang, H. Xiaoning, and L. Sheng, The Improved Logistic Regression Models for Spam Filtering, Proc. 2009 International Conference on Asian Language Processing (IALP 09), 2009, pp. 314-317.
[4] S. Lu, D. Chiang, H. Keh, and H. Huang, Chinese text classification by the Naive Bayes Classifier and the associative classifier with multiple confidence threshold values, Knowledge-Based Systems, 2010, pp. 598-604.
[5] H. Drucker, D. Wu, and V. N. Vapnik, Support vector machines for spam categorization, IEEE Transactions on Neural Networks, 1999, pp. 1048-1054.
[6] H. Qi, X. He, Y. Han, M. Yang, and S. Li, Information Theory Based Feature Valuing for Logistic Regression for Spam Filtering, International Conference on Asian Language Processing, 2010, pp. 166-169.
[7] T. Almeida, J. Almeida, and A. Yamakami, Spam filtering: how the dimensionality reduction affects the accuracy of Naive Bayes classifiers, Journal of Internet Services and Applications, 2011, pp. 183-200.
[8] G. Cormack, TREC 2007 Spam Track Overview, Proc. of the Sixteenth Text REtrieval Conference, National Institute of Standards and Technology (NIST), 2007.
[9] F. Sebastiani, Machine learning in automated text categorization, ACM Comput. Surv., 2002, pp. 1-47.
[10] X. Tai, F. Ren, and K. Kita, An information retrieval model based on vector space method by supervised learning, Inf. Process. Manage., 2002, pp. 749-764.
[11] I. A. Gheyas and L. S. Smith, Feature subset selection in large dimensionality domains, Pattern Recognition, 2010, pp. 5-13.
[12] W. Zhang, T. Yoshida, and X. Tang, A comparative study of TF*IDF, LSI and multi-words for text classification, Expert Systems With Applications, 2011, pp. 2758-2765.
[13] Y. Li, C. Luo, and S. M. Chung, Text Clustering with Feature Selection by Using Statistical Data, IEEE Transactions on Knowledge and Data Engineering, 2008, pp. 641-652.
[14] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, Least angle regression, Annals of Statistics, 2004, pp. 407-499.
[15] L. Meier, S. van de Geer, and P. Bühlmann, The group lasso for logistic regression, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2008, pp. 53-71.
[16] G. V. Cormack, TREC 2006 Spam Track Overview, 2006.
[17] D. Sculley, G. Wachman, and C. E. Brodley, Spam Filtering Using Inexact String Matching in Explicit Feature Space with On-Line Linear Classifiers, 2006, pp. 1.