An Analysis of Machine Learning Methods for Spam Host Detection


Renato M. Silva, Akebo Yamakami
School of Electrical and Computer Engineering, University of Campinas (UNICAMP), Campinas, São Paulo, Brazil

Tiago A. Almeida
Department of Computer Science, Federal University of São Carlos (UFSCar), Sorocaba, São Paulo, Brazil

Abstract: The web is becoming an increasingly important source of entertainment, communication, research, news and trade. Web sites therefore compete to attract the attention of users, and many of them achieve visibility through malicious strategies that try to circumvent the search engines. Such sites are known as web spam, and they are generally responsible for personal injury and economic losses. Given this scenario, this paper presents a comprehensive performance evaluation of several established machine learning techniques used to automatically detect and filter hosts that disseminate web spam. Our experiments were diligently designed to ensure statistically sound results, and they indicate that bagging of decision trees, multilayer perceptron neural networks, random forest and adaptive boosting of decision trees are promising for the task of web spam classification and, hence, can be used as a good baseline for further comparison.

Keywords: spamdexing; web spam; spam host; classification.

I. INTRODUCTION

Search engines are powerful allies of Internet users looking for information and are therefore responsible for a significant percentage of the accesses to most web sites. Thus, to succeed, it is important that sites increase their relevance ranking in search engines, so that they are better positioned in the search results when users query terms related to their content. To achieve this, strategies known as search engine optimization (SEO) can be used [1]. There are many ethical SEO strategies; however, Ledford [1] states that a lot of time and hard work is required to learn the most successful ones. All elements of a web site should be created to maximize its search engine ranking. But, as mentioned in the SEO guide published by Google¹, it should be considered that the ultimate consumers of a web site are the users, not the search engines. It is important that the site offers relevant content, easy navigation and other features that benefit its visitors. The problem is that several sites prefer to invest in unethical SEO techniques and circumvent the search engines to achieve relevance without merit. The most affected are the users who, when making their queries in search engines, receive unexpected and irrelevant answers, often infected with malicious content. Such techniques are known as web spamming or spamdexing [2] and, according to Gyongyi and Garcia-Molina [3], they can be divided into two main categories: content spam and link spam. The first is related to the creation of pages with irrelevant keywords, and the second corresponds to the addition of links pointing to the web page that one intends to promote. The main reason contributing to the growth of web spam is the economic incentive: spammers create web pages with irrelevant content, increase their ranking in search engines through spamming techniques and advertise products, earning money for each click of visitors [3]. Recent studies indicate that the amount of web spam is increasing dramatically.

¹ Search Engine Optimization Starter Guide. See: jp/intl/en/webmasters/docs/search-engine-optimization-starter-guide.pdf
According to John et al. [4], 36% of the results generated by the most popular search engines contain malicious URLs. A report produced by Websense² shows that 22.4% of the search results about entertainment present malicious links. Furthermore, a report published by McAfee³ states that 49% of the most popular search terms return some malicious site in the top 100 search results, and the same study found that 1.2% of the queries return links to malicious sites in the top 100 results. Given this scenario, this paper presents a comprehensive performance evaluation of several well-known machine learning techniques employed to automatically detect spam hosts, in order to provide good baseline results for further comparison.

This paper is organized as follows: Section II briefly describes the related work available in the literature. Section III presents the basic concepts of the methods evaluated in this paper. Section IV presents the database and the settings used in the experiments. Section V presents the main results. Finally, Section VI describes the conclusions and guidelines for future work.

II. RELATED WORK

Several papers that address the problem of web spamming can be found in the literature.

² Websense 2010 Threat Report. See: reports/report-websense-2010-threat-report-en.pdf
³ McAfee Threats Report: First Quarter. See: com/us/resources/reports/rp-quarterly-threat-q pdf

Silva et al. [5], [6], [7] evaluated different neural-based approaches for web spam detection and also analyzed how different sets of features impact the classifiers' accuracy. Largillier and Peyronnet [8] presented several node aggregation methods for demoting the effects of web spamming on the PageRank algorithm [1]. Liu et al. [9] proposed a number of user-behavior features for separating spam pages from normal pages, and presented a framework that combines machine-learning techniques assisted by user behavior to detect spam pages. Rungsawang et al. [10] proposed an ant colony optimization algorithm that exploits both content and link features for spam host detection. Castillo et al. [11] proposed content-based features that were later used in several other relevant works and important events; in addition, they presented a spam detection system that combines content-based and link-based features and uses the topology of the web graph by exploiting the link dependencies among web pages. Svore et al. [2] presented a method for detecting web spam that uses content-based features and rank-time features. Gyongyi and Garcia-Molina [3] described several web spamming techniques used by spammers. Nevertheless, as each of the mentioned papers used a different experimental setup, it is still unclear which of the machine learning methods available in the literature are best suited to automatically detect spam hosts.

III. METHODS

This section presents the basic concepts of the following well-known methods, which we evaluate in this paper: multilayer perceptron neural networks, support vector machines, tree-based methods (decision trees, random forest, bagging and adaptive boosting of trees) and k-nearest neighbors. These methods were chosen because most of them have been evaluated and presented as being among the best machine learning and data mining techniques currently available [12].

A. Multilayer perceptron neural network (MLP)

A multilayer perceptron neural network is a perceptron-type network that has a set of sensory units composed of an input layer, one or more intermediate (hidden) layers, and an output layer of neurons [13]. By default, the MLP is a supervised learning method that uses the backpropagation algorithm, which can be summarized in two stages: forward and backward [14].

In the forward stage, the signal propagates through the network, layer by layer, as follows:

$u_j^l(n) = \sum_{i=0}^{m_{l-1}} w_{ji}^l(n)\, y_i^{l-1}(n)$,

where $l = 0, 1, 2, \ldots, L$ indexes the network layers, so that $l = 0$ represents the input layer and $l = L$ represents the output layer. Here, $y_i^{l-1}(n)$ is the output of neuron $i$ in the previous layer $l-1$, $w_{ji}^l(n)$ is the synaptic weight of neuron $j$ in layer $l$, and $m_l$ is the number of neurons in layer $l$. For $i = 0$, $y_0^{l-1}(n) = +1$ and $w_{j0}^l(n)$ represents the bias applied to neuron $j$ in layer $l$ [13]. The output of neuron $j$ in layer $l$ is given by $y_j^l(n) = \varphi_j(u_j^l(n))$, where $\varphi_j$ is the activation function of $j$. Then, the error can be calculated by $e_j^L(n) = y_j^L(n) - d(n)$, where $d(n)$ is the desired output for an input pattern $x(n)$.

In the backward stage, the backpropagation of the error starts from the output layer, as follows: $\delta_j^L(n) = \varphi_j'(u_j^L(n))\, e_j^L(n)$, where $\varphi_j'$ is the derivative of the activation function. Then, for $l = L, L-1, \ldots, 2$, one computes

$\delta_j^{l-1}(n) = \varphi_j'(u_j^{l-1}(n)) \sum_{i=1}^{m_l} w_{ij}^l(n)\, \delta_i^l(n)$, for $j = 0, 1, \ldots, m_{l-1}$.

Consult Haykin [13] and Bishop [14] for more details.
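To make the two stages concrete, the following is a minimal NumPy sketch of the forward and backward computations for a single-hidden-layer MLP with a hyperbolic tangent hidden activation and a linear output neuron (the configuration reported later, in Section IV). The variable names, shapes and learning rate are illustrative assumptions, not taken from the paper.

    import numpy as np

    def forward(x, W1, b1, W2, b2):
        # Forward stage: propagate the input signal layer by layer.
        u1 = W1 @ x + b1            # induced local fields of the hidden layer
        y1 = np.tanh(u1)            # hidden activations, phi = tanh
        u2 = W2 @ y1 + b2           # induced local field of the output neuron
        y2 = u2                     # linear activation in the output layer
        return u1, y1, y2

    def backward(x, d, W1, b1, W2, b2, lr=0.01):
        # Backward stage: compute the local gradients (deltas) and update the weights.
        u1, y1, y2 = forward(x, W1, b1, W2, b2)
        e = y2 - d                                    # error e(n) = y(n) - d(n)
        delta2 = e                                    # linear output, so phi'(u) = 1
        delta1 = (1.0 - y1 ** 2) * (W2.T @ delta2)    # tanh'(u) = 1 - tanh(u)^2
        W2 -= lr * np.outer(delta2, y1); b2 -= lr * delta2   # gradient-descent step (MLP-GD)
        W1 -= lr * np.outer(delta1, x);  b1 -= lr * delta1
        return 0.5 * float(e @ e)                     # squared error, useful for monitoring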
1) Levenberg-Marquardt algorithm: the Levenberg-Marquardt algorithm is usually employed to optimize and accelerate the convergence of the backpropagation algorithm [14]. It is considered a second-order method because it uses information about the second derivative of the error function. Details can be found in Bishop [14] and Hagan and Menhaj [15].

B. Support vector machines (SVM)

The support vector machine (SVM) [16], [17], [18], [19], [20] is a machine learning method that can be used for pattern classification, regression and other learning tasks [13], [21]. The method was conceived on the idea that the input vectors are non-linearly mapped to a high-dimensional feature space, in which a linear decision surface separating the classes of the input patterns is constructed. One of the main elements that the SVM uses to separate the patterns of distinct classes is the kernel function: through it, the SVM constructs a decision surface that is non-linear in the input space but linear in the feature space [13]. Table I presents the most popular SVM kernel functions, where $\gamma$, $r$ and $d$ are parameters that must be set by the user.

Table I: The most popular SVM kernel functions [13], [22].
    Linear:      $k(x_i, x_j) = x_i^T x_j$
    RBF:         $k(x_i, x_j) = \exp(-\frac{1}{2\gamma^2}\|x_i - x_j\|^2)$, $\gamma > 0$
    Polynomial:  $k(x_i, x_j) = (\gamma x_i^T x_j + r)^d$, $\gamma > 0$
    Sigmoid:     $k(x_i, x_j) = \tanh(\gamma x_i^T x_j + r)$

To assist in the choice of the SVM parameters, Hsu et al. [22] recommend the use of a grid search. For instance, for the SVM with RBF kernel, in which it is necessary to define the regularization parameter $C$ and $\gamma$, the authors suggest that the grid search evaluate exponential sequences such as $C = 2^{-5}, 2^{-4}, 2^{-3}, \ldots, 2^{15}$ and $\gamma = 2^{-15}, \ldots, 2^{3}$.
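A minimal sketch of such a grid search, using scikit-learn's SVC as a stand-in for the LIBSVM interface the authors used; the exponent ranges follow the sequences quoted above, and the 80%/20% random sub-sampling with an F-measure criterion anticipates the protocol of Section IV. The step of the gamma sequence and the data arrays (X, y) are assumptions for illustration.

    from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
    from sklearn.svm import SVC

    param_grid = {
        "C": [2.0 ** k for k in range(-5, 16)],         # C = 2^-5, 2^-4, ..., 2^15
        "gamma": [2.0 ** k for k in range(-15, 4, 2)],  # gamma = 2^-15, ..., 2^3 (assumed step of 2)
    }
    # 80%/20% random sub-sampling, selecting the pair (C, gamma) with the highest F-measure.
    search = GridSearchCV(
        SVC(kernel="rbf"),
        param_grid,
        scoring="f1",
        cv=StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=0),
    )
    # search.fit(X, y)            # X: host feature vectors, y: 1 = spam, 0 = ham
    # print(search.best_params_)  # best C and gamma found by the search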

C. Decision trees (C4.5)

C4.5 [23] is one of the classical decision tree algorithms and handles both categorical and continuous attributes. It uses a divide-and-conquer approach to increase the predictive ability of the decision tree: a problem is divided into several sub-problems by creating sub-trees along the path between the root and the leaves of the decision tree.

D. Random forest

A random forest [24] is a combination of decision trees in which each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. In this method, after a large number of trees has been generated, each tree votes for one class of the problem and the class with the largest number of votes is chosen.

E. K-nearest neighbors (IBK)

IBK is an instance-based learning (IBL) algorithm [25] derived from the k-nearest neighbors (KNN) classifier. KNN is a non-incremental algorithm that aims to keep perfect consistency with the initial training set, whereas the IBL algorithm is incremental and one of its goals is to maximize the classification accuracy on new instances [25]. As in KNN, in IBK the classification assigned to a sample i is influenced by the classes of its k nearest neighbors, because similar samples often have similar classifications [25], [26].

F. Adaptive boosting (AdaBoost)

Adaptive boosting [27] is a boosting algorithm widely used in pattern classification problems. Like any boosting method, it combines classifiers, but it has some properties that make it more practical and easier to implement than the boosting algorithms that preceded it. One of these properties is that it does not require any prior knowledge of the performance of the weak classifiers. Instead, it adapts to their errors and generates a weighted majority hypothesis in which the weight of each weak classifier's prediction is a function of its performance.

G. Bagging

Bagging [28] is a method for generating multiple versions of a classifier that are combined into an aggregate classifier. The classification process is similar to that of boosting methods but, according to Witten and Frank [26], unlike boosting, bagging gives the different models the same weight when generating a prediction.

H. LogitBoost

The LogitBoost method [29] is a statistical version of boosting and, according to Witten and Frank [26], has some similarities with AdaBoost, but it optimizes the likelihood of a class, whereas AdaBoost optimizes an exponential cost function. Friedman et al. [29] define this method as an algorithm for fitting additive logistic regression models.

IV. DATABASE AND EXPERIMENT SETTINGS

To lend credibility to the results and make the experiments reproducible, all tests were performed with the large, public and well-known WEBSPAM-UK2006 collection⁴. It is composed of 77.9 million web pages hosted on 11,000 hosts in the UK domain. It is important to note that this corpus was used in Web Spam Challenges⁵ I and II, the best-known competitions on web spam detection techniques, and in our experiments we followed the same competition guidelines. We used three sets of 8,487 feature vectors employed to discriminate the hosts as spam or ham. Each set is composed of 6,509 hosts labeled as ham and 1,978 labeled as spam.
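For illustration only, the tree-based, ensemble and nearest-neighbor classifiers surveyed in Section III could be instantiated with scikit-learn roughly as follows. The paper itself used WEKA implementations (Section IV-A), so these are approximate equivalents rather than the authors' code, and the parameter values are placeholders.

    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                                  RandomForestClassifier)
    from sklearn.neighbors import KNeighborsClassifier

    classifiers = {
        # CART-style tree as a stand-in for C4.5
        "Decision tree": DecisionTreeClassifier(),
        "Random forest": RandomForestClassifier(n_estimators=100),
        "Bagging of trees": BaggingClassifier(DecisionTreeClassifier(), n_estimators=100),
        "AdaBoost of trees": AdaBoostClassifier(DecisionTreeClassifier(), n_estimators=100),
        "IBK (k-NN)": KNeighborsClassifier(n_neighbors=1),   # k = 1 is an assumed default
        # LogitBoost has no direct scikit-learn equivalent and is omitted here.
    }
    # for name, clf in classifiers.items():
    #     clf.fit(X_train, y_train)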
The organizers provided three sets of features: the first composed of 96 content-based features [11], the second composed of 41 link-based features [30], and the third composed of 138 transformed link-based features [11], which are simple combinations or logarithms of the link-based features.

To assess the performance of the algorithms, we used random sub-sampling validation, also known as Monte Carlo cross-validation [31]. This method provides more freedom to define the sizes of the training and testing subsets: unlike traditional k-fold cross-validation, random sub-sampling validation allows as many repetitions as desired, using any percentage of the data for training and testing. We divided each simulation into 10 tests and calculated the arithmetic mean and standard deviation of the following well-known measures: spam recall, spam precision and F-measure [26], [32], [33]. In each test, we randomly selected 80% of the samples of each class to be presented to the algorithms in the training stage, and the remaining ones were reserved for testing.

A. Settings

In the following, we describe the main parameters used for each classifier evaluated in this work.

1) MLP: we evaluated the following well-known artificial neural networks: the multilayer perceptron trained with the gradient descent method (MLP-GD) and the multilayer perceptron trained with the Levenberg-Marquardt method (MLP-LM).

⁴ Yahoo! Research: Web Spam Collections. Available at barcelona.research.yahoo.net/webspam/datasets/.
⁵ Web Spam Challenge:

We implemented all the MLPs with a single hidden layer and one neuron in the output layer. We employed a linear activation function for the output neuron and a hyperbolic tangent activation function for the neurons of the hidden layer, and accordingly normalized the data to the interval [-1, 1]. Furthermore, one of the stopping criteria was an increase of the validation set error (checked every 10 iterations). The other parameters were empirically calibrated and are presented in Table II.

Table II: Parameters used in the multilayer perceptron artificial neural networks ("—" marks a value that is not available here).
    Parameter                                        MLP-GD    MLP-LM
    Maximum number of iterations                     —         —
    Minimum limit for the mean square error (MSE)    —         —
    Learning step                                    —         —
    Neurons in the hidden layer                      —         —

2) SVM: we implemented the SVM using the LIBSVM library [21] available for MATLAB. We performed simulations with the linear, RBF, sigmoid and polynomial kernel functions and used a grid search to define the parameters. However, for the SVMs with polynomial and sigmoid kernels, which have a larger number of parameters, we performed the grid search only over the parameters C and γ, due to the excessive computational cost, and kept the LIBSVM default values for the other parameters. We performed the grid search using random sub-sampling validation with 80% of the samples for training and 20% for testing. We then chose as the best parameters those that achieved the highest F-measures and used them in the experiments with the SVM. In such experiments we evaluated all types of kernels; however, due to space limits, we present only the results achieved by the SVM with RBF kernel, since it achieved the best performance. Table III presents the parameters used in the simulations with the SVM.

Table III: Parameters obtained by grid search and used by the SVM with RBF kernel, according to the feature set and class balance ("—" marks a value that is not available here).
                  Content       Links         Transf. links   Content+links
                  C      γ      C      γ      C      γ        C      γ
    Balanced      —      —      —      —      —      —        —      —
    Unbalanced    —      —      —      —      —      —        —      —

3) Remaining methods: we implemented the remaining classifiers using the WEKA tool [34]. The AdaBoost and bagging algorithms were trained with 100 iterations, and both of them aggregate multiple versions of the C4.5 method. For all the other approaches we used the default parameters.

V. RESULTS

For each evaluated method and feature set, we performed a simulation using unbalanced classes, as originally provided in the web spam competition, with 6,509 (76.6%) samples of ham hosts and 1,978 (23.4%) samples of spam hosts. Moreover, to evaluate whether the class imbalance impacts the classifiers' accuracy, we also performed simulations using the same number of samples in each class; for this, both classes were composed of 1,978 representatives, randomly selected. Tables IV and V present the results achieved by each classifier exploring the content-based features, the link-based features, the transformed link-based features, and the combination of links and content, using unbalanced and balanced data, respectively. In each table, the results are sorted by F-measure for each feature set, and bold values indicate the highest score for each performance measure. Further, values preceded by the symbol * indicate the highest score considering all the simulations. In general, the evaluated classifiers achieved good results, since they detected a significant number of spam hosts with high precision, regardless of the feature set used. Furthermore, the results indicate that bagging of trees, in general, presented the best results.
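As a side note, the balanced setting described above can be reproduced by random under-sampling; the following is a minimal sketch under the assumption that the labels distinguish ham from spam and that the array names are placeholders.

    import numpy as np

    def balance_by_undersampling(X, y, n_per_class=1978, seed=0):
        # Randomly keep n_per_class samples of each class (ham and spam).
        rng = np.random.default_rng(seed)
        kept = []
        for label in np.unique(y):
            idx = np.flatnonzero(y == label)
            kept.append(rng.choice(idx, size=n_per_class, replace=False))
        kept = np.concatenate(kept)
        return X[kept], y[kept]

    # X_bal, y_bal = balance_by_undersampling(X, y)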
On average, bagging of trees was able to detect 83.5% of the spam hosts with a precision rate of 82.9%. On the other hand, the SVM classifier, in general, achieved the worst performance. It is important to note that, since the data we used in our experiments is unbalanced, the results also indicate that all the evaluated techniques perform better when trained with the same number of samples of each class; this is because the models tend to be biased in favor of the class with the largest number of samples. Comparing the feature sets, we observe that the best average result over all classifiers was achieved when the transformed link-based features were employed. However, the highest F-measure was achieved by AdaBoost with the combination of content-based and link-based features (Table V). Therefore, we can conclude that the most appropriate feature set varies according to the classifier.

In order to support our claims, we also performed a statistical analysis of the results (Table VI). For that, we ranked the methods by F-measure using the Wilcoxon rank-sum test [35] with a 95% confidence interval. This statistical hypothesis test is also sometimes called the Mann-Whitney test [35]. Methods at the same level in the table have statistically equivalent results. Note that, in our simulations, bagging of trees is statistically superior to the other evaluated approaches for unbalanced data, but statistically equivalent to AdaBoost, the MLPs and random forest for balanced data. Therefore, it is safe to conclude that such methods can be used as a good baseline for further comparison. On the other hand, the SVM is statistically inferior to all the other evaluated approaches, regardless of the class balance and feature set.

Table IV: Results achieved for each combination of classifier and feature set using unbalanced data, sorted by F-measure (recall and precision in %; "—" marks a value that is not available here).

Content-based features
    Classifier      Recall        Precision      F-measure
    Bagging         68.7 ± —      —              — ± 0.014
    Random forest   65.7 ± —      —              — ± 0.024
    MLP-LM          69.3 ± —      —              — ± 0.032
    AdaBoost        66.6 ± —      —              — ± 0.024
    IBK             64.6 ± —      —              — ± 0.011
    C4.5            —             —              — ± 0.024
    MLP-GD          57.0 ± —      —              — ± 0.039
    LogitBoost      54.2 ± —      —              — ± 0.023
    SVM             36.0 ± —      —              — ± 0.023

Link-based features
    Bagging         79.2 ± —      —              — ± 0.009
    Random forest   76.5 ± —      —              — ± 0.027
    AdaBoost        71.8 ± —      —              — ± 0.017
    C4.5            —             —              — ± 0.011
    MLP-LM          70.8 ± —      —              — ± 0.049
    LogitBoost      71.8 ± —      —              — ± 0.009
    MLP-GD          61.6 ± —      —              — ± 0.026
    IBK             67.8 ± —      —              — ± 0.023
    SVM             47.1 ± —      —              — ± 0.021

Transformed link-based features
    Bagging         78.9 ± —      —              — ± 0.014
    Random forest   76.5 ± —      —              — ± 0.012
    MLP-LM          74.9 ± —      —              — ± 0.027
    SVM             76.6 ± —      —              — ± 0.012
    MLP-GD          75.2 ± —      —              — ± 0.014
    AdaBoost        72.6 ± —      —              — ± 0.019
    LogitBoost      70.2 ± —      —              — ± 0.019
    C4.5            —             —              — ± 0.016
    IBK             67.5 ± —      —              — ± 0.011

Combination of content and link-based features
    AdaBoost        80.4 ± 1.9    *86.3 ± 1.2    *0.832 ± 0.014
    MLP-LM          81.7 ± —      —              — ± 0.008
    Bagging         *81.1 ± —     —              — ± 0.016
    Random forest   74.4 ± —      —              — ± 0.006
    MLP-GD          74.2 ± —      —              — ± 0.016
    C4.5            —             —              — ± 0.014
    LogitBoost      70.2 ± —      —              — ± 0.019
    IBK             68.7 ± —      —              — ± 0.015
    SVM             52.4 ± —      —              — ± 0.018

Table V: Results achieved for each combination of classifier and feature set using balanced data, sorted by F-measure (recall and precision in %; "—" marks a value that is not available here).

Content-based features
    Classifier      Recall        Precision      F-measure
    MLP-LM          86.5 ± 2.0    *89.1 ± —      — ± 0.017
    Bagging         83.5 ± —      —              — ± 0.013
    MLP-GD          82.9 ± —      —              — ± 0.015
    Random forest   81.8 ± —      —              — ± 0.015
    AdaBoost        81.8 ± —      —              — ± 0.014
    IBK             77.0 ± —      —              — ± 0.017
    C4.5            —             —              — ± 0.012
    LogitBoost      77.0 ± —      —              — ± 0.015
    SVM             60.6 ± —      —              — ± 0.019

Link-based features
    Bagging         91.7 ± —      —              — ± 0.013
    Random forest   88.8 ± —      —              — ± 0.010
    MLP-LM          92.1 ± —      —              — ± 0.019
    AdaBoost        87.7 ± —      —              — ± 0.009
    MLP-GD          90.2 ± —      —              — ± 0.019
    LogitBoost      88.0 ± —      —              — ± 0.018
    C4.5            —             —              — ± 0.011
    IBK             80.8 ± —      —              — ± 0.003
    SVM             77.1 ± —      —              — ± 0.014

Transformed link-based features
    Bagging         91.3 ± —      —              — ± 0.011
    AdaBoost        88.8 ± —      —              — ± 0.011
    MLP-GD          89.6 ± —      —              — ± 0.019
    Random forest   89.5 ± —      —              — ± 0.016
    MLP-LM          87.5 ± —      —              — ± 0.027
    SVM             88.5 ± —      —              — ± 0.007
    LogitBoost      87.6 ± —      —              — ± 0.021
    C4.5            —             —              — ± 0.020
    IBK             81.6 ± —      —              — ± 0.002

Combination of content and link-based features
    AdaBoost        92.6 ± —      — ± 0.9        *0.905 ± 0.005
    Bagging         *93.2 ± —     —              — ± 0.005
    MLP-GD          92.6 ± —      —              — ± 0.013
    Random forest   92.1 ± —      —              — ± 0.008
    MLP-LM          91.9 ± —      —              — ± 0.021
    LogitBoost      89.9 ± —      —              — ± 0.005
    C4.5            —             —              — ± 0.009
    IBK             84.7 ± —      —              — ± 0.012
    SVM             71.4 ± —      —              — ± 0.015

VI. CONCLUSIONS AND FUTURE WORK

In this paper, we presented a comprehensive performance evaluation of different established machine learning algorithms used to automatically identify spam hosts based on features extracted from their web pages. For this, we employed a real, public, non-encoded and large database composed of samples represented by content-based features, link-based features, the combination of both, and transformed link-based features. The results indicated that, in general, all evaluated techniques achieved good performance, regardless of the attributes used, which indicates that, besides being efficient, they have a good capability of generalization. Furthermore, since the data used in our experiments is unbalanced, the results also indicated that all the evaluated techniques perform better when trained with the same number of samples of each class, because the models tend to be biased in favor of the class with the largest number of samples. Among all evaluated classifiers, the aggregation techniques, such as bagging and boosting of trees, statistically achieved the best performances, demonstrating that they are promising for identifying spam hosts; consequently, they are recommended as a good baseline for further comparison.
For future work, we intend to adapt the most promising methods in order to optimize their performance, and we aim to study and propose new sets of features in order to increase the predictive capacity of the classifiers.
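For reference, the pairwise comparison behind Table VI (below) can be sketched with SciPy's implementation of the Wilcoxon rank-sum (Mann-Whitney) test. This is an illustrative sketch, not the authors' code; it assumes that the ten per-test F-measures of each classifier have been collected into arrays with placeholder names.

    from scipy.stats import ranksums

    def statistically_equivalent(f_measures_a, f_measures_b, alpha=0.05):
        # Two-sided Wilcoxon rank-sum test on the per-test F-measures of two classifiers.
        statistic, p_value = ranksums(f_measures_a, f_measures_b)
        return p_value >= alpha   # True: no significant difference at the 95% confidence level

    # Example with placeholder score arrays:
    # same_level = statistically_equivalent(bagging_f_measures, adaboost_f_measures)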

Table VI: Statistical analysis of the results using the Wilcoxon rank-sum test.
    Level   Unbalanced classes       Balanced classes
    1       Bagging                  Bagging, AdaBoost, MLP-LM, MLP-GD and random forest
    2       AdaBoost and MLP-LM      C4.5, IBK and LogitBoost
    3       MLP-LM and MLP-GD        SVM
    4       Random forest
    5       C4.5 and LogitBoost
    6       LogitBoost and IBK
    7       SVM

ACKNOWLEDGMENT

The authors would like to thank the Brazilian agencies FAPESP, CAPES and CNPq for their financial support.

REFERENCES

[1] J. L. Ledford, Search Engine Optimization Bible, 2nd ed. Indianapolis, IN, USA: Wiley Publishing.
[2] K. M. Svore, Q. Wu, and C. J. Burges, "Improving web spam classification using rank-time features," in Proc. of the 3rd AIRWeb, Banff, Alberta, Canada, 2007.
[3] Z. Gyongyi and H. Garcia-Molina, "Spam: It's not just for inboxes anymore," Computer, vol. 38, no. 10.
[4] J. P. John, F. Yu, Y. Xie, A. Krishnamurthy, and M. Abadi, "deSEO: combating search-result poisoning," in Proc. of the 20th SEC, Berkeley, CA, USA, 2011.
[5] R. M. Silva, T. A. Almeida, and A. Yamakami, "Redes neurais artificiais para detecção de web spams" [Artificial neural networks for web spam detection], in Proc. of the 8th Brazilian Symposium on Information Systems (SBSI), São Paulo, Brazil, 2012.
[6] R. M. Silva, T. A. Almeida, and A. Yamakami, "Artificial neural networks for content-based web spam detection," in Proc. of the 14th ICAI, Las Vegas, NV, USA, 2012.
[7] R. M. Silva, T. A. Almeida, and A. Yamakami, "Towards web spam filtering with neural-based approaches," in Proc. of the 13th IBERAMIA, Cartagena de Indias, Colombia, 2012.
[8] T. Largillier and S. Peyronnet, "Webspam demotion: Low complexity node aggregation methods," Neurocomputing, vol. 76, no. 1.
[9] Y. Liu, F. Chen, W. Kong, H. Yu, M. Zhang, S. Ma, and L. Ru, "Identifying web spam with the wisdom of the crowds," ACM Trans. on the Web, vol. 6, no. 1, pp. 2:1-2:30.
[10] A. Rungsawang, A. Taweesiriwate, and B. Manaskasemsak, "Spam host detection using ant colony optimization," in IT Convergence and Services, ser. Lecture Notes in Electrical Engineering. Springer Netherlands, 2011.
[11] C. Castillo, D. Donato, and A. Gionis, "Know your neighbors: Web spam detection using the web topology," in Proc. of the 30th SIGIR, Amsterdam, The Netherlands, 2007.
[12] X. Wu, V. Kumar, J. R. Quinlan, J. Ghosh, Q. Yang, H. Motoda, G. J. McLachlan, A. Ng, B. Liu, P. S. Yu, Z.-H. Zhou, M. Steinbach, D. J. Hand, and D. Steinberg, "Top 10 algorithms in data mining," Knowledge and Information Systems, vol. 14, no. 1, pp. 1-37.
[13] S. Haykin, Neural Networks: A Comprehensive Foundation, 2nd ed. New York, NY, USA: Prentice Hall.
[14] C. M. Bishop, Neural Networks for Pattern Recognition. Oxford, UK: Oxford University Press.
[15] M. T. Hagan and M. B. Menhaj, "Training feedforward networks with the Marquardt algorithm," IEEE Trans. on Neural Networks, vol. 5, no. 6.
[16] C. Cortes and V. N. Vapnik, "Support-vector networks," Machine Learning, 1995.
[17] T. Almeida and A. Yamakami, "Compression-Based Spam Filter," Security and Communication Networks, pp. 1-9.
[18] T. Almeida and A. Yamakami, "Occam's Razor-Based Spam Filter," Journal of Internet Services and Applications, pp. 1-9.
[19] T. A. Almeida and A. Yamakami, "Advances in Spam Filtering Techniques," in Computational Intelligence for Privacy and Security, ser. Studies in Computational Intelligence, D. Elizondo, A. Solanas, and A. Martinez-Balleste, Eds. Springer, 2012, vol. 394.
[20] T. A. Almeida and A. Yamakami, "Facing the Spammers: A Very Effective Approach to Avoid Junk E-mails," Expert Systems with Applications, vol. 39, no. 7.
[21] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Trans. on Intelligent Systems and Technology, vol. 2, pp. 27:1-27:27.
[22] C.-W. Hsu, C.-C. Chang, and C.-J. Lin, "A practical guide to support vector classification," National Taiwan University, Tech. Rep.
[23] J. R. Quinlan, C4.5: Programs for Machine Learning, 1st ed. San Mateo, CA, USA: Morgan Kaufmann.
[24] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5-32.
[25] D. W. Aha, D. Kibler, and M. K. Albert, "Instance-based learning algorithms," Machine Learning, vol. 6, no. 1.
[26] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed. San Francisco, CA, USA: Morgan Kaufmann.
[27] Y. Freund and R. E. Schapire, "Experiments with a new boosting algorithm," in Proc. of the 13th ICML, Bari, Italy: Morgan Kaufmann, 1996.
[28] L. Breiman, "Bagging predictors," Machine Learning, vol. 24, 1996.

[29] J. Friedman, T. Hastie, and R. Tibshirani, "Additive logistic regression: a statistical view of boosting," Annals of Statistics, vol. 28, no. 2.
[30] L. Becchetti, C. Castillo, D. Donato, S. Leonardi, and R. Baeza-Yates, "Using rank propagation and probabilistic counting for link-based spam detection," in Proc. of WebKDD'06, Philadelphia, USA.
[31] J. Shao, "Linear model selection by cross-validation," Journal of the American Statistical Association, vol. 88, no. 422.
[32] T. Almeida, J. Almeida, and A. Yamakami, "Spam Filtering: How the Dimensionality Reduction Affects the Accuracy of Naive Bayes Classifiers," Journal of Internet Services and Applications, vol. 1, no. 3.
[33] T. Almeida, J. G. Hidalgo, and A. Yamakami, "Contributions to the Study of SMS Spam Filtering: New Collection and Results," in Proc. of the 2011 ACM Symposium on Document Engineering, Mountain View, CA, USA, 2011.
[34] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, "The WEKA data mining software: an update," SIGKDD Explorations Newsletter, vol. 11, no. 1.
[35] D. C. Montgomery and G. C. Runger, Applied Statistics and Probability for Engineers, 3rd ed. New York, NY, USA: John Wiley & Sons, 2002.


More information

A Health Degree Evaluation Algorithm for Equipment Based on Fuzzy Sets and the Improved SVM

A Health Degree Evaluation Algorithm for Equipment Based on Fuzzy Sets and the Improved SVM Journal of Computational Information Systems 10: 17 (2014) 7629 7635 Available at http://www.jofcis.com A Health Degree Evaluation Algorithm for Equipment Based on Fuzzy Sets and the Improved SVM Tian

More information

A Study on Prediction of User s Tendency Toward Purchases in Online Websites Based on Behavior Models

A Study on Prediction of User s Tendency Toward Purchases in Online Websites Based on Behavior Models A Study on Prediction of User s Tendency Toward Purchases in Online Websites Based on Behavior Models Emad Gohari Boroujerdi, Soroush Mehri, Saeed Sadeghi Garmaroudi, Mohammad Pezeshki, Farid Rashidi Mehrabadi,

More information

Classification algorithm in Data mining: An Overview

Classification algorithm in Data mining: An Overview Classification algorithm in Data mining: An Overview S.Neelamegam #1, Dr.E.Ramaraj *2 #1 M.phil Scholar, Department of Computer Science and Engineering, Alagappa University, Karaikudi. *2 Professor, Department

More information

How To Cluster On A Search Engine

How To Cluster On A Search Engine Volume 2, Issue 2, February 2012 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: A REVIEW ON QUERY CLUSTERING

More information

T-61.3050 : Email Classification as Spam or Ham using Naive Bayes Classifier. Santosh Tirunagari : 245577

T-61.3050 : Email Classification as Spam or Ham using Naive Bayes Classifier. Santosh Tirunagari : 245577 T-61.3050 : Email Classification as Spam or Ham using Naive Bayes Classifier Santosh Tirunagari : 245577 January 20, 2011 Abstract This term project gives a solution how to classify an email as spam or

More information

SVM Ensemble Model for Investment Prediction

SVM Ensemble Model for Investment Prediction 19 SVM Ensemble Model for Investment Prediction Chandra J, Assistant Professor, Department of Computer Science, Christ University, Bangalore Siji T. Mathew, Research Scholar, Christ University, Dept of

More information

Web Document Clustering

Web Document Clustering Web Document Clustering Lab Project based on the MDL clustering suite http://www.cs.ccsu.edu/~markov/mdlclustering/ Zdravko Markov Computer Science Department Central Connecticut State University New Britain,

More information

Benchmarking of different classes of models used for credit scoring

Benchmarking of different classes of models used for credit scoring Benchmarking of different classes of models used for credit scoring We use this competition as an opportunity to compare the performance of different classes of predictive models. In particular we want

More information

Ensemble Data Mining Methods

Ensemble Data Mining Methods Ensemble Data Mining Methods Nikunj C. Oza, Ph.D., NASA Ames Research Center, USA INTRODUCTION Ensemble Data Mining Methods, also known as Committee Methods or Model Combiners, are machine learning methods

More information

Three types of messages: A, B, C. Assume A is the oldest type, and C is the most recent type.

Three types of messages: A, B, C. Assume A is the oldest type, and C is the most recent type. Chronological Sampling for Email Filtering Ching-Lung Fu 2, Daniel Silver 1, and James Blustein 2 1 Acadia University, Wolfville, Nova Scotia, Canada 2 Dalhousie University, Halifax, Nova Scotia, Canada

More information

Role of Neural network in data mining

Role of Neural network in data mining Role of Neural network in data mining Chitranjanjit kaur Associate Prof Guru Nanak College, Sukhchainana Phagwara,(GNDU) Punjab, India Pooja kapoor Associate Prof Swami Sarvanand Group Of Institutes Dinanagar(PTU)

More information

Discovering Fraud in Online Classified Ads

Discovering Fraud in Online Classified Ads Proceedings of the Twenty-Sixth International Florida Artificial Intelligence Research Society Conference Discovering Fraud in Online Classified Ads Alan McCormick and William Eberle Department of Computer

More information

Crowdfunding Support Tools: Predicting Success & Failure

Crowdfunding Support Tools: Predicting Success & Failure Crowdfunding Support Tools: Predicting Success & Failure Michael D. Greenberg Bryan Pardo mdgreenb@u.northwestern.edu pardo@northwestern.edu Karthic Hariharan karthichariharan2012@u.northwes tern.edu Elizabeth

More information

Machine learning for algo trading

Machine learning for algo trading Machine learning for algo trading An introduction for nonmathematicians Dr. Aly Kassam Overview High level introduction to machine learning A machine learning bestiary What has all this got to do with

More information

Why Ensembles Win Data Mining Competitions

Why Ensembles Win Data Mining Competitions Why Ensembles Win Data Mining Competitions A Predictive Analytics Center of Excellence (PACE) Tech Talk November 14, 2012 Dean Abbott Abbott Analytics, Inc. Blog: http://abbottanalytics.blogspot.com URL:

More information

Social Media Mining. Data Mining Essentials

Social Media Mining. Data Mining Essentials Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

More information