An Analysis of Machine Learning Methods for Spam Host Detection


Renato M. Silva, Akebo Yamakami
School of Electrical and Computer Engineering, University of Campinas (UNICAMP), Campinas, São Paulo, Brazil

Tiago A. Almeida
Department of Computer Science, Federal University of São Carlos (UFSCar), Sorocaba, São Paulo, Brazil

Abstract: The web is becoming an increasingly important source of entertainment, communication, research, news and trade. Web sites therefore compete to attract the attention of users, and many of them achieve visibility through malicious strategies that try to circumvent the search engines. Such sites are known as web spam, and they are generally responsible for personal injury and economic losses. Given this scenario, this paper presents a comprehensive performance evaluation of several established machine learning techniques used to automatically detect and filter hosts that disseminate web spam. Our experiments were diligently designed to ensure statistically sound results, and they indicate that bagging of decision trees, multilayer perceptron neural networks, random forest and adaptive boosting of decision trees are promising for the task of web spam classification and, hence, can be used as a good baseline for further comparison.

Keywords: spamdexing; web spam; spam host; classification.

I. INTRODUCTION

Search engines are powerful allies of Internet users looking for information and are therefore responsible for a significant percentage of the accesses to most web sites. Thus, to succeed, it is important that sites increase their relevance ranking in search engines, so that they are better positioned in the search results when users query terms related to their content. To achieve this, strategies known as search engine optimization (SEO) can be used [1]. There are many ethical SEO strategies; however, Ledford [1] states that a lot of time and hard work is required to learn the most successful ones. All elements of a web site should be created to maximize its search engine ranking. But, as mentioned in the SEO guide published by Google¹, it should be considered that the ultimate consumers of a web site are the users, not the search engines. It is important that the site offers relevant content, easy navigation and other features that benefit its visitors. The problem is that several sites prefer to invest in unethical SEO techniques and circumvent the search engines to achieve relevance without merit. The most affected are the users who, when making their queries in search engines, receive unexpected and irrelevant answers, often infected with malicious content. Such techniques are known as web spamming or spamdexing [2] and, according to Gyongyi and Garcia-Molina [3], they can be divided into two main categories: content spam and link spam. The first is related to the creation of pages with irrelevant keywords, and the second corresponds to the addition of links pointing to the web page that one intends to promote. The main reason contributing to the growth of web spam is the economic incentive: spammers create web pages with irrelevant content, increase their ranking in search engines through spamming techniques and advertise products, earning money for each click of visitors [3]. Recent studies indicate that the amount of web spam is increasing dramatically.

¹ Search Engine Optimization Starter Guide. See: jp/intl/en/webmasters/docs/search-engine-optimization-starter-guide.pdf
According to John et al. [4], 36% of the results generated by the most popular search engines contain malicious URLs. A report produced by Websense² shows that 22.4% of the search results about entertainment present malicious links. Furthermore, a report published by McAfee³ states that 49% of the most popular search terms return some malicious site in the top 100 search results, and the same study found that 1.2% of the queries return links to malicious sites in the top 100 results. Given this scenario, this paper presents a comprehensive performance evaluation of several well-known machine learning techniques employed to automatically detect spam hosts, in order to provide good baseline results for further comparison.

This paper is organized as follows: Section II briefly describes the related work available in the literature. Section III presents the basic concepts of the methods evaluated in this paper. Section IV presents the database and the settings used in the experiments. Section V presents the main results. Finally, Section VI describes the conclusions and guidelines for future work.

II. RELATED WORK

Several papers that address the problem of web spamming can be found in the literature.

² Websense 2010 Threat Report. See: reports/report-websense-2010-threat-report-en.pdf
³ McAfee Threats Report: First Quarter. See: com/us/resources/reports/rp-quarterly-threat-q pdf

Silva et al. [5], [6], [7] evaluated different neural-based approaches for web spam detection and also analyzed how different sets of features impact the classifiers' accuracy. Largillier and Peyronnet [8] presented several node aggregation methods for demoting the effects of web spamming on the PageRank algorithm [1]. Liu et al. [9] proposed a number of user-behavior features for separating spam pages from normal pages, and presented a framework that combines machine-learning techniques assisted by user behavior to detect spam pages. Rungsawang et al. [10] proposed an ant colony optimization algorithm that exploits both content and link features for spam host detection. Castillo et al. [11] proposed content-based features that were later used in several other relevant works and important events; in addition, they presented a spam detection system that combines content-based and link-based features and uses the topology of the web graph by exploiting the link dependencies among web pages. Svore et al. [2] presented a method for detecting web spam that uses content-based features and rank-time features. Gyongyi and Garcia-Molina [3] described several web spamming techniques used by spammers. Nevertheless, as each of the mentioned papers used a different experimental setup, it is still unclear which of the machine learning methods available in the literature are best suited to automatically detect spam hosts.

III. METHODS

This section presents the basic concepts of the following well-known methods, which we evaluate in this paper: multilayer perceptron neural networks, support vector machines, tree-based methods (decision trees, random forest, bagging and adaptive boosting of trees) and k-nearest neighbors. These methods were chosen because most of them have been evaluated and presented as being among the best machine learning and data mining techniques currently available [12].

A. Multilayer perceptron neural network (MLP)

A multilayer perceptron neural network is a perceptron-type network that has a set of sensory units composed of an input layer, one or more intermediate (hidden) layers, and an output layer of neurons [13]. By default, the MLP is a supervised learning method that uses the backpropagation algorithm, which can be summarized in two stages: forward and backward [14].

In the forward stage, the signal propagates through the network, layer by layer, as follows:

$u_j^l(n) = \sum_{i=0}^{m_{l-1}} w_{ji}^l(n)\, y_i^{l-1}(n)$,

where $l = 0, 1, 2, \ldots, L$ indexes the network layers, so that $l = 0$ represents the input layer and $l = L$ represents the output layer. Here, $y_i^{l-1}(n)$ is the output of neuron $i$ in the previous layer $l-1$, $w_{ji}^l(n)$ is the synaptic weight of neuron $j$ in layer $l$, and $m_l$ is the number of neurons in layer $l$. For $i = 0$, $y_0^{l-1}(n) = +1$ and $w_{j0}^l(n)$ represents the bias applied to neuron $j$ in layer $l$ [13]. The output of neuron $j$ in layer $l$ is given by $y_j^l(n) = \varphi_j(u_j^l(n))$, where $\varphi_j$ is the activation function of $j$. Then, the error can be calculated by $e_j^L(n) = y_j^L(n) - d(n)$, where $d(n)$ is the desired output for an input pattern $x(n)$.

In the backward stage, the backpropagation of the error starts from the output layer, as follows: $\delta_j^L(n) = \varphi_j'(u_j^L(n))\, e_j^L(n)$, where $\varphi_j'$ is the derivative of the activation function. Then, for $l = L, L-1, \ldots, 2$, one computes

$\delta_j^{l-1}(n) = \varphi_j'(u_j^{l-1}(n)) \sum_{i=1}^{m_l} w_{ij}^l(n)\, \delta_i^l(n)$, for $j = 0, 1, \ldots, m_{l-1}$.

Consult Haykin [13] and Bishop [14] for more details.
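To make the two stages concrete, the following is a minimal NumPy sketch of the forward and backward computations for a single-hidden-layer MLP with a hyperbolic tangent hidden activation and a linear output neuron (the configuration reported later, in Section IV). The variable names, shapes and learning rate are illustrative assumptions, not taken from the paper.

    import numpy as np

    def forward(x, W1, b1, W2, b2):
        # Forward stage: propagate the input signal layer by layer.
        u1 = W1 @ x + b1            # induced local fields of the hidden layer
        y1 = np.tanh(u1)            # hidden activations, phi = tanh
        u2 = W2 @ y1 + b2           # induced local field of the output neuron
        y2 = u2                     # linear activation in the output layer
        return u1, y1, y2

    def backward(x, d, W1, b1, W2, b2, lr=0.01):
        # Backward stage: compute the local gradients (deltas) and update the weights.
        u1, y1, y2 = forward(x, W1, b1, W2, b2)
        e = y2 - d                                    # error e(n) = y(n) - d(n)
        delta2 = e                                    # linear output, so phi'(u) = 1
        delta1 = (1.0 - y1 ** 2) * (W2.T @ delta2)    # tanh'(u) = 1 - tanh(u)^2
        W2 -= lr * np.outer(delta2, y1); b2 -= lr * delta2   # gradient-descent step (MLP-GD)
        W1 -= lr * np.outer(delta1, x);  b1 -= lr * delta1
        return 0.5 * float(e @ e)                     # squared error, useful for monitoring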
1) Levenberg-Marquardt algorithm: the Levenberg-Marquardt algorithm is usually employed to optimize and accelerate the convergence of the backpropagation algorithm [14]. It is considered a second-order method because it uses information about the second derivative of the error function. Details can be found in Bishop [14] and Hagan and Menhaj [15].

B. Support vector machines (SVM)

The support vector machine (SVM) [16], [17], [18], [19], [20] is a machine learning method that can be used for pattern classification, regression and other learning tasks [13], [21]. The method was conceived on the idea that the input vectors are non-linearly mapped to a high-dimensional feature space, in which a linear decision surface separating the classes of the input patterns is constructed. One of the main elements that the SVM uses to separate the patterns of distinct classes is the kernel function: through it, the SVM constructs a decision surface that is non-linear in the input space but linear in the feature space [13]. Table I presents the most popular SVM kernel functions, where $\gamma$, $r$ and $d$ are parameters that must be set by the user.

Table I: The most popular SVM kernel functions [13], [22].
    Linear:      $k(x_i, x_j) = x_i^T x_j$
    RBF:         $k(x_i, x_j) = \exp(-\frac{1}{2\gamma^2}\|x_i - x_j\|^2)$, $\gamma > 0$
    Polynomial:  $k(x_i, x_j) = (\gamma x_i^T x_j + r)^d$, $\gamma > 0$
    Sigmoid:     $k(x_i, x_j) = \tanh(\gamma x_i^T x_j + r)$

To assist in the choice of the SVM parameters, Hsu et al. [22] recommend the use of a grid search. For instance, for the SVM with RBF kernel, in which it is necessary to define the regularization parameter $C$ and $\gamma$, the authors suggest that the grid search evaluate exponential sequences such as $C = 2^{-5}, 2^{-4}, 2^{-3}, \ldots, 2^{15}$ and $\gamma = 2^{-15}, \ldots, 2^{3}$.
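A minimal sketch of such a grid search, using scikit-learn's SVC as a stand-in for the LIBSVM interface the authors used; the exponent ranges follow the sequences quoted above, and the 80%/20% random sub-sampling with an F-measure criterion anticipates the protocol of Section IV. The step of the gamma sequence and the data arrays (X, y) are assumptions for illustration.

    from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
    from sklearn.svm import SVC

    param_grid = {
        "C": [2.0 ** k for k in range(-5, 16)],         # C = 2^-5, 2^-4, ..., 2^15
        "gamma": [2.0 ** k for k in range(-15, 4, 2)],  # gamma = 2^-15, ..., 2^3 (assumed step of 2)
    }
    # 80%/20% random sub-sampling, selecting the pair (C, gamma) with the highest F-measure.
    search = GridSearchCV(
        SVC(kernel="rbf"),
        param_grid,
        scoring="f1",
        cv=StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=0),
    )
    # search.fit(X, y)            # X: host feature vectors, y: 1 = spam, 0 = ham
    # print(search.best_params_)  # best C and gamma found by the search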

C. Decision trees (C4.5)

C4.5 [23] is one of the classical decision tree algorithms and handles both categorical and continuous attributes. It uses a divide-and-conquer approach to increase the predictive ability of the decision tree: a problem is divided into several sub-problems by creating sub-trees along the path between the root and the leaves of the decision tree.

D. Random forest

A random forest [24] is a combination of decision trees in which each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. In this method, after a large number of trees has been generated, each tree votes for one class of the problem and the class with the largest number of votes is chosen.

E. K-nearest neighbors (IBK)

IBK is an instance-based learning (IBL) algorithm [25] derived from the k-nearest neighbors (KNN) classifier. KNN is a non-incremental algorithm that aims to keep perfect consistency with the initial training set, whereas the IBL algorithm is incremental and one of its goals is to maximize the classification accuracy on new instances [25]. As in KNN, in IBK the classification assigned to a sample i is influenced by the classes of its k nearest neighbors, because similar samples often have similar classifications [25], [26].

F. Adaptive boosting (AdaBoost)

Adaptive boosting [27] is a boosting algorithm widely used in pattern classification problems. Like any boosting method, it combines classifiers, but it has some properties that make it more practical and easier to implement than the boosting algorithms that preceded it. One of these properties is that it does not require any prior knowledge of the performance of the weak classifiers. Instead, it adapts to their errors and generates a weighted majority hypothesis in which the weight of each weak classifier's prediction is a function of its performance.

G. Bagging

Bagging [28] is a method for generating multiple versions of a classifier that are combined into an aggregate classifier. The classification process is similar to that of boosting methods but, according to Witten and Frank [26], unlike boosting, bagging gives the different models the same weight when generating a prediction.

H. LogitBoost

The LogitBoost method [29] is a statistical version of boosting and, according to Witten and Frank [26], has some similarities with AdaBoost, but it optimizes the likelihood of a class, whereas AdaBoost optimizes an exponential cost function. Friedman et al. [29] define this method as an algorithm for fitting additive logistic regression models.

IV. DATABASE AND EXPERIMENT SETTINGS

To lend credibility to the results and make the experiments reproducible, all tests were performed with the large, public and well-known WEBSPAM-UK2006 collection⁴. It is composed of 77.9 million web pages hosted on 11,000 hosts in the UK domain. It is important to note that this corpus was used in Web Spam Challenges⁵ I and II, the best-known competitions on web spam detection techniques, and in our experiments we followed the same competition guidelines. We used three sets of 8,487 feature vectors employed to discriminate the hosts as spam or ham. Each set is composed of 6,509 hosts labeled as ham and 1,978 labeled as spam.
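For illustration only, the tree-based, ensemble and nearest-neighbor classifiers surveyed in Section III could be instantiated with scikit-learn roughly as follows. The paper itself used WEKA implementations (Section IV-A), so these are approximate equivalents rather than the authors' code, and the parameter values are placeholders.

    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                                  RandomForestClassifier)
    from sklearn.neighbors import KNeighborsClassifier

    classifiers = {
        # CART-style tree as a stand-in for C4.5
        "Decision tree": DecisionTreeClassifier(),
        "Random forest": RandomForestClassifier(n_estimators=100),
        "Bagging of trees": BaggingClassifier(DecisionTreeClassifier(), n_estimators=100),
        "AdaBoost of trees": AdaBoostClassifier(DecisionTreeClassifier(), n_estimators=100),
        "IBK (k-NN)": KNeighborsClassifier(n_neighbors=1),   # k = 1 is an assumed default
        # LogitBoost has no direct scikit-learn equivalent and is omitted here.
    }
    # for name, clf in classifiers.items():
    #     clf.fit(X_train, y_train)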
The organizers provided three sets of features: the first composed of 96 content-based features [11], the second composed of 41 link-based features [30], and the third composed of 138 transformed link-based features [11], which are simple combinations or logarithms of the link-based features.

To assess the performance of the algorithms, we used random sub-sampling validation, also known as Monte Carlo cross-validation [31]. This method provides more freedom to define the sizes of the training and testing subsets: unlike traditional k-fold cross-validation, random sub-sampling validation allows as many repetitions as desired, using any percentage of the data for training and testing. We divided each simulation into 10 tests and calculated the arithmetic mean and standard deviation of the following well-known measures: spam recall, spam precision and F-measure [26], [32], [33]. In each test, we randomly selected 80% of the samples of each class to be presented to the algorithms in the training stage, and the remaining ones were reserved for testing.

A. Settings

In the following, we describe the main parameters used for each classifier evaluated in this work.

1) MLP: we evaluated the following well-known artificial neural networks: the multilayer perceptron trained with the gradient descent method (MLP-GD) and the multilayer perceptron trained with the Levenberg-Marquardt method (MLP-LM).

⁴ Yahoo! Research: Web Spam Collections. Available at barcelona.research.yahoo.net/webspam/datasets/.
⁵ Web Spam Challenge:

We implemented all the MLPs with a single hidden layer and one neuron in the output layer. We employed a linear activation function for the output neuron and a hyperbolic tangent activation function for the neurons of the hidden layer, and accordingly normalized the data to the interval [-1, 1]. Furthermore, one of the stopping criteria was an increase of the validation set error (checked every 10 iterations). The other parameters were empirically calibrated and are presented in Table II.

Table II: Parameters used in the multilayer perceptron artificial neural networks ("—" marks a value that is not available here).
    Parameter                                        MLP-GD    MLP-LM
    Maximum number of iterations                     —         —
    Minimum limit for the mean square error (MSE)    —         —
    Learning step                                    —         —
    Neurons in the hidden layer                      —         —

2) SVM: we implemented the SVM using the LIBSVM library [21] available for MATLAB. We performed simulations with the linear, RBF, sigmoid and polynomial kernel functions and used a grid search to define the parameters. However, for the SVMs with polynomial and sigmoid kernels, which have a larger number of parameters, we performed the grid search only over the parameters C and γ, due to the excessive computational cost, and kept the LIBSVM default values for the other parameters. We performed the grid search using random sub-sampling validation with 80% of the samples for training and 20% for testing. We then chose as the best parameters those that achieved the highest F-measures and used them in the experiments with the SVM. In such experiments we evaluated all types of kernels; however, due to space limits, we present only the results achieved by the SVM with RBF kernel, since it achieved the best performance. Table III presents the parameters used in the simulations with the SVM.

Table III: Parameters obtained by grid search and used by the SVM with RBF kernel, according to the feature set and class balance ("—" marks a value that is not available here).
                  Content       Links         Transf. links   Content+links
                  C      γ      C      γ      C      γ        C      γ
    Balanced      —      —      —      —      —      —        —      —
    Unbalanced    —      —      —      —      —      —        —      —

3) Remaining methods: we implemented the remaining classifiers using the WEKA tool [34]. The AdaBoost and bagging algorithms were trained with 100 iterations, and both of them aggregate multiple versions of the C4.5 method. For all the other approaches we used the default parameters.

V. RESULTS

For each evaluated method and feature set, we performed a simulation using unbalanced classes, as originally provided in the web spam competition, with 6,509 (76.6%) samples of ham hosts and 1,978 (23.4%) samples of spam hosts. Moreover, to evaluate whether the class imbalance impacts the classifiers' accuracy, we also performed simulations using the same number of samples in each class; for this, both classes were composed of 1,978 representatives, randomly selected. Tables IV and V present the results achieved by each classifier exploring the content-based features, the link-based features, the transformed link-based features, and the combination of links and content, using unbalanced and balanced data, respectively. In each table, the results are sorted by F-measure for each feature set, and bold values indicate the highest score for each performance measure. Further, values preceded by the symbol * indicate the highest score considering all the simulations. In general, the evaluated classifiers achieved good results, since they detected a significant number of spam hosts with high precision, regardless of the feature set used. Furthermore, the results indicate that bagging of trees, in general, presented the best results.
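As a side note, the balanced setting described above can be reproduced by random under-sampling; the following is a minimal sketch under the assumption that the labels distinguish ham from spam and that the array names are placeholders.

    import numpy as np

    def balance_by_undersampling(X, y, n_per_class=1978, seed=0):
        # Randomly keep n_per_class samples of each class (ham and spam).
        rng = np.random.default_rng(seed)
        kept = []
        for label in np.unique(y):
            idx = np.flatnonzero(y == label)
            kept.append(rng.choice(idx, size=n_per_class, replace=False))
        kept = np.concatenate(kept)
        return X[kept], y[kept]

    # X_bal, y_bal = balance_by_undersampling(X, y)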
On average, bagging of trees was able to detect 83.5% of the spam hosts with a precision rate of 82.9%. On the other hand, the SVM classifier, in general, achieved the worst performance. It is important to note that, since the data we used in our experiments is unbalanced, the results also indicate that all the evaluated techniques perform better when trained with the same number of samples of each class; this is because the models tend to be biased in favor of the class with the largest number of samples. Comparing the feature sets, we observe that the best average result over all classifiers was achieved when the transformed link-based features were employed. However, the highest F-measure was achieved by AdaBoost with the combination of content-based and link-based features (Table V). Therefore, we can conclude that the most appropriate feature set varies according to the classifier.

In order to support our claims, we also performed a statistical analysis of the results (Table VI). For that, we ranked the methods by F-measure using the Wilcoxon rank-sum test [35] with a 95% confidence interval. This statistical hypothesis test is also sometimes called the Mann-Whitney test [35]. Methods at the same level in the table have statistically equivalent results. Note that, in our simulations, bagging of trees is statistically superior to the other evaluated approaches for unbalanced data, but statistically equivalent to AdaBoost, the MLPs and random forest for balanced data. Therefore, it is safe to conclude that such methods can be used as a good baseline for further comparison. On the other hand, the SVM is statistically inferior to all the other evaluated approaches, regardless of the class balance and feature set.

Table IV: Results achieved for each combination of classifier and feature set using unbalanced data, sorted by F-measure (recall and precision in %; "—" marks a value that is not available here).

Content-based features
    Classifier      Recall        Precision      F-measure
    Bagging         68.7 ± —      —              — ± 0.014
    Random forest   65.7 ± —      —              — ± 0.024
    MLP-LM          69.3 ± —      —              — ± 0.032
    AdaBoost        66.6 ± —      —              — ± 0.024
    IBK             64.6 ± —      —              — ± 0.011
    C4.5            —             —              — ± 0.024
    MLP-GD          57.0 ± —      —              — ± 0.039
    LogitBoost      54.2 ± —      —              — ± 0.023
    SVM             36.0 ± —      —              — ± 0.023

Link-based features
    Bagging         79.2 ± —      —              — ± 0.009
    Random forest   76.5 ± —      —              — ± 0.027
    AdaBoost        71.8 ± —      —              — ± 0.017
    C4.5            —             —              — ± 0.011
    MLP-LM          70.8 ± —      —              — ± 0.049
    LogitBoost      71.8 ± —      —              — ± 0.009
    MLP-GD          61.6 ± —      —              — ± 0.026
    IBK             67.8 ± —      —              — ± 0.023
    SVM             47.1 ± —      —              — ± 0.021

Transformed link-based features
    Bagging         78.9 ± —      —              — ± 0.014
    Random forest   76.5 ± —      —              — ± 0.012
    MLP-LM          74.9 ± —      —              — ± 0.027
    SVM             76.6 ± —      —              — ± 0.012
    MLP-GD          75.2 ± —      —              — ± 0.014
    AdaBoost        72.6 ± —      —              — ± 0.019
    LogitBoost      70.2 ± —      —              — ± 0.019
    C4.5            —             —              — ± 0.016
    IBK             67.5 ± —      —              — ± 0.011

Combination of content and link-based features
    AdaBoost        80.4 ± 1.9    *86.3 ± 1.2    *0.832 ± 0.014
    MLP-LM          81.7 ± —      —              — ± 0.008
    Bagging         *81.1 ± —     —              — ± 0.016
    Random forest   74.4 ± —      —              — ± 0.006
    MLP-GD          74.2 ± —      —              — ± 0.016
    C4.5            —             —              — ± 0.014
    LogitBoost      70.2 ± —      —              — ± 0.019
    IBK             68.7 ± —      —              — ± 0.015
    SVM             52.4 ± —      —              — ± 0.018

Table V: Results achieved for each combination of classifier and feature set using balanced data, sorted by F-measure (recall and precision in %; "—" marks a value that is not available here).

Content-based features
    Classifier      Recall        Precision      F-measure
    MLP-LM          86.5 ± 2.0    *89.1 ± —      — ± 0.017
    Bagging         83.5 ± —      —              — ± 0.013
    MLP-GD          82.9 ± —      —              — ± 0.015
    Random forest   81.8 ± —      —              — ± 0.015
    AdaBoost        81.8 ± —      —              — ± 0.014
    IBK             77.0 ± —      —              — ± 0.017
    C4.5            —             —              — ± 0.012
    LogitBoost      77.0 ± —      —              — ± 0.015
    SVM             60.6 ± —      —              — ± 0.019

Link-based features
    Bagging         91.7 ± —      —              — ± 0.013
    Random forest   88.8 ± —      —              — ± 0.010
    MLP-LM          92.1 ± —      —              — ± 0.019
    AdaBoost        87.7 ± —      —              — ± 0.009
    MLP-GD          90.2 ± —      —              — ± 0.019
    LogitBoost      88.0 ± —      —              — ± 0.018
    C4.5            —             —              — ± 0.011
    IBK             80.8 ± —      —              — ± 0.003
    SVM             77.1 ± —      —              — ± 0.014

Transformed link-based features
    Bagging         91.3 ± —      —              — ± 0.011
    AdaBoost        88.8 ± —      —              — ± 0.011
    MLP-GD          89.6 ± —      —              — ± 0.019
    Random forest   89.5 ± —      —              — ± 0.016
    MLP-LM          87.5 ± —      —              — ± 0.027
    SVM             88.5 ± —      —              — ± 0.007
    LogitBoost      87.6 ± —      —              — ± 0.021
    C4.5            —             —              — ± 0.020
    IBK             81.6 ± —      —              — ± 0.002

Combination of content and link-based features
    AdaBoost        92.6 ± —      — ± 0.9        *0.905 ± 0.005
    Bagging         *93.2 ± —     —              — ± 0.005
    MLP-GD          92.6 ± —      —              — ± 0.013
    Random forest   92.1 ± —      —              — ± 0.008
    MLP-LM          91.9 ± —      —              — ± 0.021
    LogitBoost      89.9 ± —      —              — ± 0.005
    C4.5            —             —              — ± 0.009
    IBK             84.7 ± —      —              — ± 0.012
    SVM             71.4 ± —      —              — ± 0.015

VI. CONCLUSIONS AND FUTURE WORK

In this paper, we presented a comprehensive performance evaluation of different established machine learning algorithms used to automatically identify spam hosts based on features extracted from their web pages. For this, we employed a real, public, non-encoded and large database composed of samples represented by content-based features, link-based features, the combination of both, and transformed link-based features. The results indicated that, in general, all evaluated techniques achieved good performance, regardless of the attributes used, which indicates that, besides being efficient, they have a good capability of generalization. Furthermore, since the data used in our experiments is unbalanced, the results also indicated that all the evaluated techniques perform better when trained with the same number of samples of each class, because the models tend to be biased in favor of the class with the largest number of samples. Among all evaluated classifiers, the aggregation techniques, such as bagging and boosting of trees, statistically achieved the best performances, demonstrating that they are promising for identifying spam hosts; consequently, they are recommended as a good baseline for further comparison.
For future work, we intend to adapt the most promising methods in order to optimize their performance, and we aim to study and propose new sets of features in order to increase the predictive capacity of the classifiers.
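For reference, the pairwise comparison behind Table VI (below) can be sketched with SciPy's implementation of the Wilcoxon rank-sum (Mann-Whitney) test. This is an illustrative sketch, not the authors' code; it assumes that the ten per-test F-measures of each classifier have been collected into arrays with placeholder names.

    from scipy.stats import ranksums

    def statistically_equivalent(f_measures_a, f_measures_b, alpha=0.05):
        # Two-sided Wilcoxon rank-sum test on the per-test F-measures of two classifiers.
        statistic, p_value = ranksums(f_measures_a, f_measures_b)
        return p_value >= alpha   # True: no significant difference at the 95% confidence level

    # Example with placeholder score arrays:
    # same_level = statistically_equivalent(bagging_f_measures, adaboost_f_measures)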

Table VI: Statistical analysis of the results using the Wilcoxon rank-sum test.
    Level   Unbalanced classes       Balanced classes
    1       Bagging                  Bagging, AdaBoost, MLP-LM, MLP-GD and random forest
    2       AdaBoost and MLP-LM      C4.5, IBK and LogitBoost
    3       MLP-LM and MLP-GD        SVM
    4       Random forest
    5       C4.5 and LogitBoost
    6       LogitBoost and IBK
    7       SVM

ACKNOWLEDGMENT

The authors would like to thank the Brazilian agencies FAPESP, CAPES and CNPq for their financial support.

REFERENCES

[1] J. L. Ledford, Search Engine Optimization Bible, 2nd ed. Indianapolis, IN, USA: Wiley Publishing.
[2] K. M. Svore, Q. Wu, and C. J. Burges, "Improving web spam classification using rank-time features," in Proc. of the 3rd AIRWeb, Banff, Alberta, Canada, 2007.
[3] Z. Gyongyi and H. Garcia-Molina, "Spam: It's not just for inboxes anymore," Computer, vol. 38, no. 10.
[4] J. P. John, F. Yu, Y. Xie, A. Krishnamurthy, and M. Abadi, "deSEO: combating search-result poisoning," in Proc. of the 20th SEC, Berkeley, CA, USA, 2011.
[5] R. M. Silva, T. A. Almeida, and A. Yamakami, "Redes neurais artificiais para detecção de web spams" [Artificial neural networks for web spam detection], in Proc. of the 8th Brazilian Symposium on Information Systems (SBSI), São Paulo, Brazil, 2012.
[6] R. M. Silva, T. A. Almeida, and A. Yamakami, "Artificial neural networks for content-based web spam detection," in Proc. of the 14th ICAI, Las Vegas, NV, USA, 2012.
[7] R. M. Silva, T. A. Almeida, and A. Yamakami, "Towards web spam filtering with neural-based approaches," in Proc. of the 13th IBERAMIA, Cartagena de Indias, Colombia, 2012.
[8] T. Largillier and S. Peyronnet, "Webspam demotion: Low complexity node aggregation methods," Neurocomputing, vol. 76, no. 1.
[9] Y. Liu, F. Chen, W. Kong, H. Yu, M. Zhang, S. Ma, and L. Ru, "Identifying web spam with the wisdom of the crowds," ACM Trans. on the Web, vol. 6, no. 1, pp. 2:1-2:30.
[10] A. Rungsawang, A. Taweesiriwate, and B. Manaskasemsak, "Spam host detection using ant colony optimization," in IT Convergence and Services, ser. Lecture Notes in Electrical Engineering. Springer Netherlands, 2011.
[11] C. Castillo, D. Donato, and A. Gionis, "Know your neighbors: Web spam detection using the web topology," in Proc. of the 30th SIGIR, Amsterdam, The Netherlands, 2007.
[12] X. Wu, V. Kumar, J. R. Quinlan, J. Ghosh, Q. Yang, H. Motoda, G. J. McLachlan, A. Ng, B. Liu, P. S. Yu, Z.-H. Zhou, M. Steinbach, D. J. Hand, and D. Steinberg, "Top 10 algorithms in data mining," Knowledge and Information Systems, vol. 14, no. 1, pp. 1-37.
[13] S. Haykin, Neural Networks: A Comprehensive Foundation, 2nd ed. New York, NY, USA: Prentice Hall.
[14] C. M. Bishop, Neural Networks for Pattern Recognition. Oxford, UK: Oxford University Press.
[15] M. T. Hagan and M. B. Menhaj, "Training feedforward networks with the Marquardt algorithm," IEEE Trans. on Neural Networks, vol. 5, no. 6.
[16] C. Cortes and V. N. Vapnik, "Support-vector networks," Machine Learning, 1995.
[17] T. Almeida and A. Yamakami, "Compression-Based Spam Filter," Security and Communication Networks, pp. 1-9.
[18] T. Almeida and A. Yamakami, "Occam's Razor-Based Spam Filter," Journal of Internet Services and Applications, pp. 1-9.
[19] T. A. Almeida and A. Yamakami, "Advances in Spam Filtering Techniques," in Computational Intelligence for Privacy and Security, ser. Studies in Computational Intelligence, D. Elizondo, A. Solanas, and A. Martinez-Balleste, Eds. Springer, 2012, vol. 394.
[20] T. A. Almeida and A. Yamakami, "Facing the Spammers: A Very Effective Approach to Avoid Junk E-mails," Expert Systems with Applications, vol. 39, no. 7.
[21] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Trans. on Intelligent Systems and Technology, vol. 2, pp. 27:1-27:27.
[22] C.-W. Hsu, C.-C. Chang, and C.-J. Lin, "A practical guide to support vector classification," National Taiwan University, Tech. Rep.
[23] J. R. Quinlan, C4.5: Programs for Machine Learning, 1st ed. San Mateo, CA, USA: Morgan Kaufmann.
[24] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5-32.
[25] D. W. Aha, D. Kibler, and M. K. Albert, "Instance-based learning algorithms," Machine Learning, vol. 6, no. 1.
[26] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed. San Francisco, CA, USA: Morgan Kaufmann.
[27] Y. Freund and R. E. Schapire, "Experiments with a new boosting algorithm," in Proc. of the 13th ICML, Bari, Italy: Morgan Kaufmann, 1996.
[28] L. Breiman, "Bagging predictors," Machine Learning, vol. 24, 1996.

[29] J. Friedman, T. Hastie, and R. Tibshirani, "Additive logistic regression: a statistical view of boosting," Annals of Statistics, vol. 28, no. 2.
[30] L. Becchetti, C. Castillo, D. Donato, S. Leonardi, and R. Baeza-Yates, "Using rank propagation and probabilistic counting for link-based spam detection," in Proc. of WebKDD'06, Philadelphia, USA.
[31] J. Shao, "Linear model selection by cross-validation," Journal of the American Statistical Association, vol. 88, no. 422.
[32] T. Almeida, J. Almeida, and A. Yamakami, "Spam Filtering: How the Dimensionality Reduction Affects the Accuracy of Naive Bayes Classifiers," Journal of Internet Services and Applications, vol. 1, no. 3.
[33] T. Almeida, J. G. Hidalgo, and A. Yamakami, "Contributions to the Study of SMS Spam Filtering: New Collection and Results," in Proc. of the 2011 ACM Symposium on Document Engineering, Mountain View, CA, USA, 2011.
[34] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, "The WEKA data mining software: an update," SIGKDD Explorations Newsletter, vol. 11, no. 1.
[35] D. C. Montgomery and G. C. Runger, Applied Statistics and Probability for Engineers, 3rd ed. New York, NY, USA: John Wiley & Sons, 2002.


More information

A Health Degree Evaluation Algorithm for Equipment Based on Fuzzy Sets and the Improved SVM

A Health Degree Evaluation Algorithm for Equipment Based on Fuzzy Sets and the Improved SVM Journal of Computational Information Systems 10: 17 (2014) 7629 7635 Available at http://www.jofcis.com A Health Degree Evaluation Algorithm for Equipment Based on Fuzzy Sets and the Improved SVM Tian

More information

A Study on Prediction of User s Tendency Toward Purchases in Online Websites Based on Behavior Models

A Study on Prediction of User s Tendency Toward Purchases in Online Websites Based on Behavior Models A Study on Prediction of User s Tendency Toward Purchases in Online Websites Based on Behavior Models Emad Gohari Boroujerdi, Soroush Mehri, Saeed Sadeghi Garmaroudi, Mohammad Pezeshki, Farid Rashidi Mehrabadi,

More information

Classification algorithm in Data mining: An Overview

Classification algorithm in Data mining: An Overview Classification algorithm in Data mining: An Overview S.Neelamegam #1, Dr.E.Ramaraj *2 #1 M.phil Scholar, Department of Computer Science and Engineering, Alagappa University, Karaikudi. *2 Professor, Department

More information

How To Cluster On A Search Engine

How To Cluster On A Search Engine Volume 2, Issue 2, February 2012 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: A REVIEW ON QUERY CLUSTERING

More information

T-61.3050 : Email Classification as Spam or Ham using Naive Bayes Classifier. Santosh Tirunagari : 245577

T-61.3050 : Email Classification as Spam or Ham using Naive Bayes Classifier. Santosh Tirunagari : 245577 T-61.3050 : Email Classification as Spam or Ham using Naive Bayes Classifier Santosh Tirunagari : 245577 January 20, 2011 Abstract This term project gives a solution how to classify an email as spam or

More information

SVM Ensemble Model for Investment Prediction

SVM Ensemble Model for Investment Prediction 19 SVM Ensemble Model for Investment Prediction Chandra J, Assistant Professor, Department of Computer Science, Christ University, Bangalore Siji T. Mathew, Research Scholar, Christ University, Dept of

More information

Web Document Clustering

Web Document Clustering Web Document Clustering Lab Project based on the MDL clustering suite http://www.cs.ccsu.edu/~markov/mdlclustering/ Zdravko Markov Computer Science Department Central Connecticut State University New Britain,

More information

Benchmarking of different classes of models used for credit scoring

Benchmarking of different classes of models used for credit scoring Benchmarking of different classes of models used for credit scoring We use this competition as an opportunity to compare the performance of different classes of predictive models. In particular we want

More information

Ensemble Data Mining Methods

Ensemble Data Mining Methods Ensemble Data Mining Methods Nikunj C. Oza, Ph.D., NASA Ames Research Center, USA INTRODUCTION Ensemble Data Mining Methods, also known as Committee Methods or Model Combiners, are machine learning methods

More information

Three types of messages: A, B, C. Assume A is the oldest type, and C is the most recent type.

Three types of messages: A, B, C. Assume A is the oldest type, and C is the most recent type. Chronological Sampling for Email Filtering Ching-Lung Fu 2, Daniel Silver 1, and James Blustein 2 1 Acadia University, Wolfville, Nova Scotia, Canada 2 Dalhousie University, Halifax, Nova Scotia, Canada

More information

Role of Neural network in data mining

Role of Neural network in data mining Role of Neural network in data mining Chitranjanjit kaur Associate Prof Guru Nanak College, Sukhchainana Phagwara,(GNDU) Punjab, India Pooja kapoor Associate Prof Swami Sarvanand Group Of Institutes Dinanagar(PTU)

More information

Discovering Fraud in Online Classified Ads

Discovering Fraud in Online Classified Ads Proceedings of the Twenty-Sixth International Florida Artificial Intelligence Research Society Conference Discovering Fraud in Online Classified Ads Alan McCormick and William Eberle Department of Computer

More information

Crowdfunding Support Tools: Predicting Success & Failure

Crowdfunding Support Tools: Predicting Success & Failure Crowdfunding Support Tools: Predicting Success & Failure Michael D. Greenberg Bryan Pardo mdgreenb@u.northwestern.edu pardo@northwestern.edu Karthic Hariharan karthichariharan2012@u.northwes tern.edu Elizabeth

More information

Machine learning for algo trading

Machine learning for algo trading Machine learning for algo trading An introduction for nonmathematicians Dr. Aly Kassam Overview High level introduction to machine learning A machine learning bestiary What has all this got to do with

More information

Why Ensembles Win Data Mining Competitions

Why Ensembles Win Data Mining Competitions Why Ensembles Win Data Mining Competitions A Predictive Analytics Center of Excellence (PACE) Tech Talk November 14, 2012 Dean Abbott Abbott Analytics, Inc. Blog: http://abbottanalytics.blogspot.com URL:

More information

Social Media Mining. Data Mining Essentials

Social Media Mining. Data Mining Essentials Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

More information