Lasso-based Spam Filtering with Chinese Emails


Journal of Computational Information Systems 8: 8 (2012)

Lasso-based Spam Filtering with Chinese Emails

Zunxiong LIU 1, Xianlong ZHANG 1, Shujuan ZHENG 2
1 School of Information Engineering, East China Jiaotong University, Nanchang, China
2 Division of Scientific Research, Jiangxi University of Finance and Economics, Nanchang, China

Abstract

In spam filtering, a classifier built directly on the high-dimensional, sparse data produced by the vector space model incurs heavy computation and generalizes poorly. Feature extraction or feature selection is therefore commonly applied to reduce the dimension before the classifier is trained. This paper proposes a spam filtering approach based on Lasso regression, an l1-regularized multivariate linear model, which builds a regression model for spam filtering and selects the important terms automatically. Logistic regression (LR) models are then built on the selected terms. Simulations on TREC06C show that LR combined with lasso term selection achieves better performance.

Keywords: Lasso; Feature Selection; LAR; Spam Filtering; Logistic Regression

This work is supported by the National Natural Science Foundation of China, the Social Sciences Foundation of the State Education Ministry (No. 10YJC630379), the Natural Science Foundation of Jiangxi Province (2010GZS0034) and the Technology Project under the Jiangxi Education Administration Department (GJJ10446).
Corresponding author. Email address: xianlongok@163.com (Xianlong ZHANG)
Copyright 2012 Binary Information Press, April 2012

1 Introduction

Spam emails, commonly known as unsolicited bulk emails (UBE) or unsolicited commercial emails (UCE) [1], have become an increasingly severe problem on the Internet. Growing at an exponential rate, spam wastes social resources and productivity (misuse of storage space and computational resources, time spent reading and removing spam) and even undermines Internet security. To deal with it, automatic spam filtering technologies are needed. Among the techniques for combating spam emails, content-based filtering is a promising and effective approach [2]. Generally, there are two different approaches to detecting spam emails [2, 3, 4]: generative models (for instance, Naive Bayes [4]) and discriminative models such as SVM [2, 5] and Logistic Regression [3, 6].

It is widely acknowledged that support vector machines and Naive Bayes classifiers are among the top performers [7] in text classification, and in the TREC (Text REtrieval Conference) evaluations LR achieved distinguished performance [8]. We therefore adopt LR as the classifier in this paper.

The rest of this paper is organized as follows. Section 2 presents the related work. Section 3 describes the email representation and the VSM (Vector Space Model) [9, 10]. The new feature selection method, LARS, is introduced in Section 4. In Section 5, we apply the new methodology in our experiments and report the experimental results. Finally, conclusions are drawn in Section 6, and Section 7 is the acknowledgment.

2 Related Works

Spam detection can be considered a binary document classification problem in which each email is regarded as a document. An email contains the body, the subject and other header fields. The Vector Space Model, also called the bag-of-words method, is the most widely used way to represent documents. With it, each email is processed and represented as a high-dimensional sparse vector. The available emails thus make up high-dimensional data (with thousands of features), many of which are irrelevant or redundant. High-dimensional data increase the computational complexity and reduce the generalization ability of the classifiers, so dimensionality reduction is a crucial step in spam filtering.

Feature selection is one approach to dimension reduction [11]; it aims to find an optimal feature subset in a high-dimensional feature space using statistical or information-theoretic methods. The related measures [12] include document frequency, information gain (IG), the χ² statistic and so on [7, 13]. Lasso (Least absolute shrinkage and selection operator) regression [14] is a multivariate linear regression with a bound on the sum of the absolute values of the coefficients, which selects variables and estimates coefficients simultaneously. Since the lasso was proposed, many advanced techniques based on it have been put forward, such as the elastic net and the group lasso [15].
Moreover, Least Angle Regression (LAR) was proposed by Efron et al. to compute lasso solutions efficiently [14]. In this paper, a new spam filtering approach based on the lasso is proposed and applied to Chinese spam filtering.

3 Email Representation

Email representation is the first step in spam filtering, and the VSM is commonly employed. In the VSM, the individual terms in the emails are collected to construct the feature set $T_m = (t_1, t_2, \ldots, t_m)$ containing $m$ terms. Each email can be represented as a vector $d_i = (w_1(d_i), w_2(d_i), \ldots, w_m(d_i))$, where $w_j(d_i)$ is the weight of term $t_j$ in email $d_i$. By gathering these vectors, the whole corpus is represented as the term-document matrix $X_{n \times m} = (x_1, x_2, \ldots, x_m)$, where $x_i = (w_i(d_1), w_i(d_2), \ldots, w_i(d_n))^T$ and $n$ is the number of documents. The value of $w_j(d_i)$ is calculated with the normalized ltc function [13], defined as:

\[ w_j(d_i) = \frac{\log(f_{ji} + 1.0)\,\log(N / n_j)}{\sqrt{\sum_{k=1}^{m} \left[ \log(f_{ki} + 1.0)\,\log(N / n_k) \right]^2}} \tag{1} \]

where $f_{ji}$ is the number of occurrences of term $t_j$ in document $d_i$, $N$ is the total number of documents, $n_j$ is the number of documents in which $t_j$ occurs, and $m$ is the number of terms.

With the available corpus of $n$ emails, the term-document matrix $X = (x_1, x_2, \ldots, x_m)$ of size $n \times m$ can be produced, where $x_1, x_2, \ldots, x_m$ are $n$-dimensional vectors corresponding to the $m$ features. The responses for the $n$ emails constitute the vector $y$, whose components depend on the corresponding labels: if email $k$ is spam then $y_k = 0$; otherwise it is ham and $y_k = 1$.

4 LARS Algorithm

In this section, the approach to term selection is described in detail. Given data $X$ with $m$ features, $X$ and $y$ are first standardized for later use so that

\[ \sum_{i=1}^{n} x_{ij} = 0, \qquad \sum_{i=1}^{n} y_i = 0, \qquad \frac{1}{n} \sum_{i=1}^{n} x_{ij}^2 = 1, \qquad j = 1, \ldots, m. \]

Let $\beta = (\beta_1, \beta_2, \ldots, \beta_m)^T$. The lasso is equivalent to the following optimization problem:

\[ \text{minimize: } S(\beta) = \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{m} x_{ij} \beta_j \Big)^2 + \lambda \sum_{j=1}^{m} |\beta_j| \tag{2} \]

where $\lambda > 0$ is a tunable regularization parameter. To solve this problem with its L1 penalty efficiently, LARS (Least Angle Regression) was put forward; it improves on stagewise regression, offering high precision at low computational cost. In LARS, the correlation between the term $x_i$ and the target $y$ is defined as:

\[ c_i = x_i^T (y - \mu), \qquad \mu = X\beta \tag{3} \]

Fig. 1: The procedure of LARS (predictors $x_{j1}$, $x_{j2}$, $x_{j3}$ and equiangular directions $u_1$, $u_2$, $u_3$)

LARS works according to the procedure shown in Fig. 1. At the beginning of the algorithm $\mu = 0$. The correlations are calculated and the largest one, $c_{j1}$, is found, meaning that the variable $x_{j1}$ is the one most correlated with the residual $y - \mu$; it is added to the active set $A$. The largest possible step is then taken in the direction of this predictor until some other predictor, say $x_{j2}$, has as much correlation with the current residual as $x_{j1}$. LARS then proceeds in a direction equiangular between the two predictors until a third variable, $x_{j3}$, earns its way into the most correlated set. LARS then proceeds equiangularly between the three variables to find the next variable.
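
As a rough illustration of the first LARS step described above, the sketch below standardizes a small term-weight matrix, evaluates the correlations of Eq. (3) with µ = 0, and picks the term that would enter the active set first. It is only a sketch under stated assumptions: the function names (`standardize_columns`, `center`, `correlations`, `first_active_term`) and the toy data are hypothetical and are not taken from the paper, and only the first selection step is shown, not the full algorithm.

```python
import math


def standardize_columns(X):
    """Center each column and scale it so that (1/n) * sum_i x_ij^2 == 1."""
    n, m = len(X), len(X[0])
    cols = []
    for j in range(m):
        col = [X[i][j] for i in range(n)]
        mean = sum(col) / n
        centered = [v - mean for v in col]
        scale = math.sqrt(sum(v * v for v in centered) / n)
        # A constant column carries no information; keep it at zero.
        cols.append([v / scale if scale > 0 else 0.0 for v in centered])
    # Transpose back to row-major form (one row per email).
    return [[cols[j][i] for j in range(m)] for i in range(n)]


def center(y):
    """Subtract the mean so that sum_i y_i == 0."""
    mean = sum(y) / len(y)
    return [v - mean for v in y]


def correlations(X, y, mu):
    """Eq. (3): c_j = x_j^T (y - mu) for every column j."""
    n, m = len(X), len(X[0])
    residual = [y[i] - mu[i] for i in range(n)]
    return [sum(X[i][j] * residual[i] for i in range(n)) for j in range(m)]


def first_active_term(X, y):
    """Return the index of the most correlated term and all correlations."""
    Xs = standardize_columns(X)
    yc = center(y)
    mu = [0.0] * len(y)  # LARS starts from mu = 0
    c = correlations(Xs, yc, mu)
    j = max(range(len(c)), key=lambda k: abs(c[k]))
    return j, c


# Toy data (hypothetical): 4 emails, 3 term weights; labels 0 = spam, 1 = ham.
X_toy = [
    [1.0, 0.2, 0.5],
    [0.9, 0.1, 0.4],
    [0.1, 0.3, 0.6],
    [0.0, 0.2, 0.5],
]
y_toy = [0.0, 0.0, 1.0, 1.0]
j_first, c_toy = first_active_term(X_toy, y_toy)
```

In this toy example the first term is the one most strongly (negatively) correlated with the ham label, so it would be the first to join the active set; the later equiangular steps are omitted here.
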
The LARS implementation is listed in Table 1.

Table 1: The LARS algorithm
Notation: $u$, the direction of the angle bisector; $c$, the correlation; $\mu$, the predictor; $r$ and $\hat{r}$, step lengths.

$\mu = 0$.
while $\|y - \mu(k)\| > \varepsilon$ and $|A| \le m$:
  1. $c = X^T (y - \mu) = (c_1, c_2, \ldots, c_m)^T$, $A = \{ j : |c_j| = \max_j |c_j| = C \}$, $X_A = [\ldots x_j \ldots,\ j \in A]$
  2. $u(k) = X_A w_A$, with $w_A = a_A G_A^{-1} 1_A$ and $a_A = (1_A^T G_A^{-1} 1_A)^{-1/2}$
  3. $a = X^T u(k) = (a_1, a_2, \ldots, a_m)^T$
  4. $r = \min^{+}_{j \in A} \{ -\beta_j / w_j \}$; if $|A| < m$ then $\hat{r} = \min^{+}_{j \notin A} \left\{ \frac{C - c_j}{a_A - a_j}, \frac{C + c_j}{a_A + a_j} \right\}$, else $\hat{r} = C / a_A$
  5. if $r < \hat{r}$: $\mu(k) = \mu(k-1) + r\,u(k)$, $\beta_A = \beta_A + r\,w_A$, $A = A \setminus \{\hat{j}\}$; else: $\mu(k) = \mu(k-1) + \hat{r}\,u(k)$, $\beta_A = \beta_A + \hat{r}\,w_A$, $A = A \cup \{\hat{j}\}$
return $\beta_A$

5 Simulation

5.1 Preprocessing with document

The Chinese email corpus used for the simulation experiments comes from the 2006 TREC Spam Filtering Track public dataset trec06c [16]. It consists of spam emails and ham emails. Before the corpus is represented as a term-document matrix, the email contents must be extracted and analyzed. A Visual C++ application was written to extract the subject, content and other major information from the original email text. Because Chinese text has no explicit spaces between words and often includes numbers and symbols, Chinese word segmentation is necessary; ICTCLAS, from the Chinese Academy of Sciences, is used for this purpose. In the process, useless features such as stop words, white space and punctuation are deleted. Statistics on the corpus show that, of all the terms, 8879 appear in more than 60 emails (document frequency is less than 0.2%). These 8879 terms are chosen as the original feature variables. Moreover, terms that appear in the subject are given a higher weight because the subject may contain more important information.

5.2 Evaluation method

The TREC Spam Filter Evaluation [16], developed for TREC 2005, provides a standardized method for evaluating spam filtering techniques.
Several statistics are commonly used:
hm%: the ham misclassification percentage.
sm%: the spam misclassification percentage.
1-ROCA%: the area above the ROC curve, the most crucial statistic.
lam%: the logistic average misclassification percentage, defined in Eq. (4).
h=.1: the value of sm% when hm% = 0.1 (reported as hm:.1% in the tables).

\[ \mathrm{lam\%} = \mathrm{logit}^{-1}\!\left( \frac{\mathrm{logit}(\mathrm{hm\%}) + \mathrm{logit}(\mathrm{sm\%})}{2} \right) \tag{4} \]

Here $\mathrm{logit}(x) = \log\left(\frac{x}{1 - x}\right)$ and $\mathrm{logit}^{-1}(x) = \frac{e^x}{1 + e^x}$.

5.3 Experiments and result

Many emails have no content and therefore lead to zero vectors, which are removed to avoid computational problems. Because spam emails make up the majority of the corpus, spam and ham are selected in a ratio of 2:1. Before the experiment, the selected data are partitioned into training and testing datasets according to some rules.

First, lasso regression is used directly as the filtering algorithm to classify emails. At the same time, the important features are picked out, namely those whose coefficients are nonzero in the lasso solution. Then another filter, based on logistic regression, is evaluated on the new data, which contain only the terms selected by the lasso; these data are used to train and test the LR classifier.

With the training data $\{X, y\}$ ($X$ is the data matrix with $n$ documents and $y$ is the response vector holding the labels of the $n$ emails), 10-fold cross-validation is employed to train the lasso regression filter with varying regularization parameters. Given the solution $\beta$ from the lasso algorithm, the terms whose coefficients are nonzero form an effective term subset. Changing the regularization parameter controls the L1 penalty and hence determines the number of selected terms. Here $\lambda$ is set to $\lambda \in \{12, 10, 8, 7\}$ and independent experiments are run with 10-fold cross-validation; that is, 10 lasso models are built for each chosen regularization parameter. They are also checked against the testing data.
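
The link between the regularization parameter and the number of selected terms can be illustrated with a small sketch. Note the assumption: the paper solves the lasso with LARS, whereas the code below minimizes the objective of Eq. (2) by plain cyclic coordinate descent with soft-thresholding, a different and simpler solver swapped in purely for illustration. The function names and the toy data are hypothetical.

```python
def soft_threshold(rho, t):
    """Shrink rho toward zero by t; values within [-t, t] become exactly 0."""
    if rho > t:
        return rho - t
    if rho < -t:
        return rho + t
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimize sum_i (y_i - sum_j x_ij b_j)^2 + lam * sum_j |b_j| (Eq. 2)."""
    n, m = len(X), len(X[0])
    beta = [0.0] * m
    residual = list(y)  # y - X beta, with beta = 0 initially
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) for j in range(m)]
    for _ in range(n_iter):
        for j in range(m):
            if col_sq[j] == 0.0:
                continue  # an all-zero column can never be selected
            # Correlation of column j with the residual that excludes term j.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * beta[j])
                      for i in range(n))
            # The threshold is lam / 2 because the squared loss has no 1/2.
            new_beta = soft_threshold(rho, lam / 2.0) / col_sq[j]
            delta = new_beta - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_beta
    return beta


def selected_terms(beta):
    """Indices of the terms whose lasso coefficients are nonzero."""
    return [j for j, b in enumerate(beta) if b != 0.0]


# Toy data (hypothetical): three orthogonal columns, y = 2*x0 + 0.5*x1.
X_toy = [
    [1.0, 1.0, 1.0],
    [-1.0, 1.0, -1.0],
    [1.0, -1.0, -1.0],
    [-1.0, -1.0, 1.0],
]
y_toy = [2.5, -1.5, 1.5, -2.5]

beta_small = lasso_coordinate_descent(X_toy, y_toy, lam=2.0)
beta_mid = lasso_coordinate_descent(X_toy, y_toy, lam=6.0)
beta_large = lasso_coordinate_descent(X_toy, y_toy, lam=20.0)
```

On this toy problem a small penalty keeps both informative terms, a moderate one keeps only the strongest, and a large one removes every term, which mirrors the way a smaller λ yields more selected terms in the experiments. The columns of the reduced term matrix passed on to the LR classifier would be those returned by `selected_terms`.
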
The average number of selected terms and the evaluation statistics, which were analyzed carefully, are calculated and presented in Fig. 2(a) and Table 2.

For each lasso model and each regularization parameter, a new term matrix is built that covers only the selected terms, i.e. those with nonzero coefficients. The new data are again divided into a training dataset and a testing dataset. The training dataset is used to train the logistic regression classifier, and the resulting LR classifier is checked against the testing dataset. The experimental results are given in Fig. 2(b) and Table 3.

According to the TREC Spam Filter Evaluation standard, a distinguished spam filter has a small area above the Receiver Operating Characteristic (ROC) curve. Comparing the results shown in Fig. 2(a) and Fig. 2(b), it is easy to see that logistic regression with lasso term selection performs better than the lasso alone, although the former needs additional time to build the logistic regression model. Furthermore, the performance improves as the number of selected terms increases. In this experiment the best ROC curve is obtained when λ = 7, as shown in the figures.

From Table 2 and Table 3 it can also be seen that the filters perform better as the number of selected terms increases, and that they perform best when λ = 7. By comparison, the filters with lasso-based term selection and logistic regression are superior to the lasso-based filters.

Fig. 2: The ROC curve of Logistic regression and Lasso regression. (a) The ROC curve of lasso. (b) The ROC curve with the lasso-based term selection and logistic regression. Axes: % ham misclassification (logit scale) and % spam misclassification (logit scale); curves for λ = 12, 10, 8 and 7.

Table 2: Results with the lasso regression. Columns: λ, Terms, 1-ROCA%, hm:.1%, lam%.

Table 3: Result with the lasso-based term selection and logistic regression. Columns: λ, Terms, 1-ROCA%, hm:.1%, lam%.

Afterwards, more experiments are carried out to show that the proposed method achieves better performance. Two other typical feature selection methods, IG and χ², are used to choose the important terms, and two further logistic regression models are built on them to classify the emails. The ROC curves and results are presented in Fig. 3, Table 4 and Table 5. Fig. 3(a) and Fig. 3(b) show that χ² feature selection yields a smaller area above the ROC curve than the IG method, but both are inferior to the lasso approach. In the same way, a filter with better performance has smaller values of 1-ROCA%, hm:.1% and lam%; χ² and IG feature selection also perform worse on hm:.1% and lam% than lasso selection. It can therefore be concluded that the lasso is the better feature selection method for spam filtering among those compared.
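
Since lam% is one of the statistics used in the comparison above, a short sketch of Eq. (4) may help. It assumes that the misclassification rates are passed as fractions strictly between 0 and 1 (a percentage must be divided by 100 first); the function names are hypothetical and this is not the TREC evaluation toolkit itself.

```python
import math


def logit(p):
    """logit(p) = log(p / (1 - p)), defined for 0 < p < 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("rate must lie strictly between 0 and 1")
    return math.log(p / (1.0 - p))


def inv_logit(x):
    """Inverse logit: e^x / (1 + e^x)."""
    return math.exp(x) / (1.0 + math.exp(x))


def lam(hm, sm):
    """Logistic average misclassification (Eq. 4), with hm and sm as fractions."""
    return inv_logit((logit(hm) + logit(sm)) / 2.0)


# Hypothetical rates: 0.2% ham misclassification, 1.5% spam misclassification.
lam_percent = 100.0 * lam(0.002, 0.015)
```

Because the average is taken on the logit scale, lam% always lies between hm% and sm%, and it equals both when the two rates coincide.
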

Fig. 3: The ROC curve of Logistic using χ² and IG. (a) The ROC curve of Logistic using χ² to select terms. (b) The ROC curve of Logistic using IG to select terms. Axes: % ham misclassification (logit scale) and % spam misclassification (logit scale); curves for terms = 577, 820, 1370 and 1823.

Table 4: Result of the logistic regression using χ² variable selection. Columns: λ, Terms, 1-ROCA%, hm:.1%, lam%.

Table 5: Result of the logistic regression using IG variable selection. Columns: λ, Terms, 1-ROCA%, hm:.1%, lam%.

6 Conclusion

In this paper, an approach to lasso-based spam filtering is presented. The key features, those whose coefficients are nonzero in the lasso solution, are selected for feature reduction. Based on them, logistic regression filters are built and tested on Chinese text emails. Lasso regression selects the terms and estimates the regression coefficients simultaneously, and lasso-based spam filters can also be built directly. The approach is compared with two other term selection methods, all of them working with classifiers based on logistic regression. The simulation results show that the lasso approach succeeds in term selection.

There are still many challenges in spam filtering: senders will deliver junk emails in other ways, such as word obfuscation, tokenization [17], and even image spam. Applying the lasso method to these new problems will be our future research.

References

[1] A. Çıltık, T. Güngör, Time-efficient spam e-mail filtering using n-gram models, Pattern Recognition Letters, 2008.
[2] H. Yong, H. Xiaoning, Y. Muyun, Q. Haoliang, and S. Chao, Chinese Spam Filter Based on Relaxed Online Support Vector Machine, Proc. 2010 International Conference on Asian Language Processing (IALP), 2010.
[3] H. Yong, Y. Muyun, Q. Haoliang, H. Xiaoning, and L. Sheng, The Improved Logistic Regression Models for Spam Filtering, Proc. International Conference on Asian Language Processing (IALP '09), 2009.
[4] S. Lu, D. Chiang, H. Keh, and H. Huang, Chinese text classification by the Naive Bayes Classifier and the associative classifier with multiple confidence threshold values, Knowledge-Based Systems, 2010.
[5] H. Drucker, D. Wu, and V. N. Vapnik, Support vector machines for spam categorization, IEEE Transactions on Neural Networks, 1999.
[6] H. Qi, X. He, Y. Han, M. Yang, and S. Li, Information Theory Based Feature Valuing for Logistic Regression for Spam Filtering, Proc. International Conference on Asian Language Processing, 2010.
[7] T. Almeida, J. Almeida, and A. Yamakami, Spam filtering: how the dimensionality reduction affects the accuracy of Naive Bayes classifiers, Journal of Internet Services and Applications, 2011.
[8] G. Cormack, TREC 2007 Spam Track Overview, Proc. of the Sixteenth Text REtrieval Conference, National Institute of Standards and Technology (NIST).
[9] F. Sebastiani, Machine learning in automated text categorization, ACM Comput. Surv., 2002.
[10] X. Tai, F. Ren, and K. Kita, An information retrieval model based on vector space method by supervised learning, Inf. Process. Manage., 2002.
[11] I. A. Gheyas and L. S. Smith, Feature subset selection in large dimensionality domains, Pattern Recognition, 2010.
[12] W. Zhang, T. Yoshida, and X. Tang, A comparative study of TF*IDF, LSI and multi-words for text classification, Expert Systems with Applications, 2011.
[13] Y. Li, C. Luo, and S. M. Chung, Text Clustering with Feature Selection by Using Statistical Data, IEEE Transactions on Knowledge and Data Engineering, 2008.
[14] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, Least angle regression, Annals of Statistics, 2004.
[15] L. Meier, S. van de Geer, and P. Bühlmann, The group lasso for logistic regression, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2008.
[16] G. V. Cormack, TREC 2006 Spam Track Overview.
[17] D. Sculley, G. Wachman, and C. E. Brodley, Spam Filtering Using Inexact String Matching in Explicit Feature Space with On-Line Linear Classifiers, 2006, pp. 1.


More information

Bagged Ensemble Classifiers for Sentiment Classification of Movie Reviews

Bagged Ensemble Classifiers for Sentiment Classification of Movie Reviews www.ijecs.in International Journal Of Engineering And Computer Science ISSN:2319-7242 Volume 3 Issue 2 February, 2014 Page No. 3951-3961 Bagged Ensemble Classifiers for Sentiment Classification of Movie

More information

Regularized Logistic Regression for Mind Reading with Parallel Validation

Regularized Logistic Regression for Mind Reading with Parallel Validation Regularized Logistic Regression for Mind Reading with Parallel Validation Heikki Huttunen, Jukka-Pekka Kauppi, Jussi Tohka Tampere University of Technology Department of Signal Processing Tampere, Finland

More information

Less naive Bayes spam detection

Less naive Bayes spam detection Less naive Bayes spam detection Hongming Yang Eindhoven University of Technology Dept. EE, Rm PT 3.27, P.O.Box 53, 5600MB Eindhoven The Netherlands. E-mail:h.m.yang@tue.nl also CoSiNe Connectivity Systems

More information

Spam detection with data mining method:

Spam detection with data mining method: Spam detection with data mining method: Ensemble learning with multiple SVM based classifiers to optimize generalization ability of email spam classification Keywords: ensemble learning, SVM classifier,

More information

Term extraction for user profiling: evaluation by the user

Term extraction for user profiling: evaluation by the user Term extraction for user profiling: evaluation by the user Suzan Verberne 1, Maya Sappelli 1,2, Wessel Kraaij 1,2 1 Institute for Computing and Information Sciences, Radboud University Nijmegen 2 TNO,

More information

Cross-Validation. Synonyms Rotation estimation

Cross-Validation. Synonyms Rotation estimation Comp. by: BVijayalakshmiGalleys0000875816 Date:6/11/08 Time:19:52:53 Stage:First Proof C PAYAM REFAEILZADEH, LEI TANG, HUAN LIU Arizona State University Synonyms Rotation estimation Definition is a statistical

More information
