Lasso-based Spam Filtering with Chinese Emails


Journal of Computational Information Systems 8: 8 (2012)

Zunxiong LIU 1, Xianlong ZHANG 1,*, Shujuan ZHENG 2
1 School of Information Engineering, East China Jiaotong University, Nanchang, China
2 Division of Scientific Research, Jiangxi University of Finance and Economics, Nanchang, China

Abstract

In spam filtering, a classifier built directly on the high-dimensional, sparse data produced by the vector space model incurs a high computational cost and generalizes poorly. Feature extraction or feature selection is therefore commonly applied to reduce the dimension before classifier training. This paper proposes a spam filtering approach based on Lasso regression, an l1-regularized multivariate linear model, which builds a regression model for spam filtering and selects the important terms automatically. Logistic regression (LR) models are then built on the selected terms. Simulations on TREC06C show that LR with lasso term selection performs best among the compared filters.

Keywords: Lasso; Feature Selection; LAR; Spam Filtering; Logistic Regression

1 Introduction

Spam emails, commonly known as unsolicited bulk emails (UBE) or unsolicited commercial emails (UCE) [1], have become an increasingly severe problem on the Internet. Growing at an exponential rate, spam emails waste social resources and reduce productivity (misuse of storage space and computational resources, time spent reading and removing spam), and even undermine Internet security. Automatic spam filtering technologies are needed to deal with them. Among the techniques for combating spam emails, content-based spam filtering is a promising and effective approach [2]. Generally, there are two different approaches to detecting spam emails [2, 3, 4]: generative models (for instance, Naive Bayes [4]) and discriminative models such as SVM [2, 5] and Logistic Regression [3, 6].
It is widely acknowledged that support vector machines and Naive Bayes classifiers are among the top performers in text classification [7], and LR achieved outstanding performance in the TREC (Text REtrieval Conference) spam track [8]. We therefore adopt LR as the classifier in this paper.

The rest of this paper is organized as follows. Section 2 presents the related work. Section 3 describes email representation and the Vector Space Model (VSM) [9, 10]. The LARS algorithm used for feature selection is introduced in Section 4. Section 5 applies the method in experiments and reports the results. Conclusions are drawn in Section 6, and Section 7 is the acknowledgment.

This work is supported by the National Natural Science Foundation of China, the Social Sciences Foundation of the State Education Ministry (No. 10YJC630379), the Natural Science Foundation of Jiangxi Province (2010GZS0034) and the Technology Project under Jiangxi Education Administration Department (GJJ10446).
* Corresponding author: Xianlong ZHANG.
Copyright 2012 Binary Information Press, April 2012.

2 Related Works

Spam detection can be treated as a binary document classification problem in which each email is regarded as a document. An email contains the body, the subject and other header fields. The Vector Space Model, also called the bag-of-words method, is the most widely used document representation. With it, each email is processed and represented as a high-dimensional sparse vector. The available emails thus form high-dimensional data (with thousands of features), many of which are irrelevant or redundant. High-dimensional data increase computational complexity and reduce the generalization of classifiers, so dimensionality reduction is a crucial step in spam filtering.

Feature selection is one approach to dimension reduction [11]; it searches for an optimal feature subset in a high-dimensional feature space using statistical or information-theoretic measures. Common measures [12] include document frequency, information gain (IG) and the χ² statistic [7, 13]. Lasso (least absolute shrinkage and selection operator) [14] regression is a multivariate linear regression with a bound on the sum of the absolute values of the coefficients, which selects variables and estimates coefficients simultaneously. Since the lasso was proposed, many techniques have been built on it, such as the elastic net and the group lasso [15].
Moreover, Least Angle Regression (LAR) was proposed by Efron et al. to compute the lasso efficiently [14]. In this paper, a new spam filtering approach based on the lasso is proposed and applied to Chinese spam filtering.

3 Email Representation

Email representation is the first step in spam filtering, and the VSM is commonly employed. In the VSM, the individual terms in the emails are collected to construct a feature set $T_m = (t_1, t_2, \ldots, t_m)$ containing m terms. Each email is represented as a vector $d_i = (w_1(d_i), w_2(d_i), \ldots, w_m(d_i))$, where $w_j(d_i)$ is the weight of term $t_j$ in email $d_i$. Gathering these vectors, the whole corpus is represented as the term-document matrix $X_{n \times m} = (x_1, x_2, \ldots, x_m)$, where $x_i = (w_i(d_1), w_i(d_2), \ldots, w_i(d_n))^T$ and n is the number of documents. The value of $w_j(d_i)$ is calculated with the normalized ltc function [13], defined as

$$w_j(d_i) = \frac{\log(f_{ji} + 1.0)\,\log(N/n_j)}{\sqrt{\sum_{k=1}^{m} \left[\log(f_{ki} + 1.0)\,\log(N/n_k)\right]^2}} \qquad (1)$$

where $f_{ji}$ is the number of occurrences of term $t_j$ in document $d_i$, N is the total number of documents, $n_j$ is the number of documents in which $t_j$ occurs, and m is the number of terms.

With the available corpus of n emails, the term-document matrix $X = (x_1, x_2, \ldots, x_m)$ of size n × m is produced, where $x_1, x_2, \ldots, x_m$ are n-dimensional vectors corresponding to the m features. The responses for the n emails constitute the vector y, whose components depend on the corresponding labels: if email k is spam then $y_k = 0$; otherwise it is ham and $y_k = 1$.

4 LARS Algorithm

This section describes the term selection approach in detail. Given data X with m features, X and y are first standardized so that

$$\sum_{i=1}^{n} x_{ij} = 0, \qquad \sum_{i=1}^{n} y_i = 0, \qquad \frac{1}{n}\sum_{i=1}^{n} x_{ij}^2 = 1, \qquad j = 1, \ldots, m.$$

Let $\beta = (\beta_1, \beta_2, \ldots, \beta_m)^T$. The lasso is equivalent to the following optimization problem:

$$\text{minimize: } S(\beta) = \sum_{i=1}^{n} \Big(y_i - \sum_{j=1}^{m} x_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^{m} |\beta_j| \qquad (2)$$

where λ > 0 is a tunable regularization parameter. To solve this L1-penalized problem efficiently, LARS (Least Angle Regression) was put forward as an improvement on stagewise regression, offering high precision and easy computation. In LARS, the correlation between term $x_i$ and the target y is defined as

$$c_i = x_i^T (y - \mu), \qquad \mu = X\beta \qquad (3)$$

[Fig. 1: The procedure of LARS, showing the predictors x_{j1}, x_{j2}, x_{j3} and the directions u_1, u_2, u_3]

LARS works as shown in Fig. 1. At the start of the algorithm μ = 0. The correlations are computed and the largest one in absolute value, $c_{j_1}$, is found, meaning that the variable $x_{j_1}$ is the one most correlated with the residual y − μ; it is added to the active set A. The largest possible step is then taken in the direction of this predictor until some other predictor, say $x_{j_2}$, has as much correlation with the current residual as $x_{j_1}$. LARS then proceeds in the direction equiangular between the two predictors until a third variable $x_{j_3}$ earns its way into the most correlated set, and then equiangular between the three variables to find the next one.
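To make the equiangular step described above concrete, the following is a minimal NumPy sketch, not the authors' implementation. It computes the correlations of Eq. (3) and the unit vector that makes equal angles with the active predictors, following the construction in Efron et al. [14]. The function name `equiangular_direction`, the synthetic data and the chosen active set are all assumptions made for illustration.

```python
import numpy as np

def equiangular_direction(X, y, mu, active):
    """Return (u, w, A_A, c) for the given active set.

    u   : unit vector equiangular with the (signed) active columns
    w   : weights such that u = X_A w
    A_A : common inner product between each signed active column and u
    c   : correlations of all columns with the residual, Eq. (3)
    """
    c = X.T @ (y - mu)                       # c = X^T (y - mu)
    signs = np.sign(c[active])               # orient columns along their correlation
    XA = X[:, active] * signs                # signed active columns
    G = XA.T @ XA                            # Gram matrix G_A
    g = np.linalg.solve(G, np.ones(len(active)))   # G_A^{-1} 1_A
    A_A = 1.0 / np.sqrt(g.sum())             # (1_A^T G_A^{-1} 1_A)^{-1/2}
    w = A_A * g
    u = XA @ w
    return u, w, A_A, c

# Synthetic data (hypothetical), standardized as described in the text.
rng = np.random.default_rng(0)
n = 50
X = rng.standard_normal((n, 5))
X -= X.mean(axis=0)
X /= np.sqrt((X ** 2).mean(axis=0))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 0.0]) + 0.1 * rng.standard_normal(n)
y -= y.mean()

mu = np.zeros(n)
c0 = X.T @ (y - mu)
j1 = int(np.argmax(np.abs(c0)))              # first variable to enter the active set
active = [0, 2]                              # an illustrative two-variable active set
u, w, A_A, c = equiangular_direction(X, y, mu, active)
print("first entering variable:", j1)
print("inner products with u:", (X[:, active] * np.sign(c[active])).T @ u)
```

Each signed active column has the same inner product A_A with u, which is what "equiangular" means here; a full LARS implementation would additionally compute the step length and update the active set.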
The LARS implementation is listed in Table 1.

Table 1: The LARS algorithm

Notation: u, the direction of the angle bisector; c, the correlation; μ, the predictor; r and r̂, the step lengths; ĵ, the index attaining the corresponding minimum.

Initialize μ = 0.
while $\|y - \mu(k)\| > \varepsilon$ and $|A| \le m$:
1. $c = X^T (y - \mu) = (c_1, c_2, \ldots, c_m)^T$, $C = \max_j |c_j|$, $A = \{j : |c_j| = C\}$, $X_A = [\ldots x_j \ldots]_{j \in A}$
2. $u(k) = X_A w_A$, where $w_A = A_A G_A^{-1} 1_A$, $G_A = X_A^T X_A$ and $A_A = (1_A^T G_A^{-1} 1_A)^{-1/2}$
3. $a = X^T u(k) = (a_1, a_2, \ldots, a_m)^T$
4. $r = \min^{+}_{j \in A} \{ -\beta_j / w_j \}$; if $|A| < m$ then $\hat{r} = \min^{+}_{j \notin A} \left\{ \frac{C - c_j}{A_A - a_j}, \frac{C + c_j}{A_A + a_j} \right\}$, else $\hat{r} = C / A_A$
5. if $r < \hat{r}$: $\mu(k) = \mu(k-1) + r\,u(k)$, $\beta_A = \beta_A + r\,w_A$, $A = A \setminus \{\hat{j}\}$
   else: $\mu(k) = \mu(k-1) + \hat{r}\,u(k)$, $\beta_A = \beta_A + \hat{r}\,w_A$, $A = A \cup \{\hat{j}\}$
return $\beta_A$

5 Simulation

5.1 Document preprocessing

The Chinese email corpus used in the simulation experiments comes from the 2006 TREC Spam Filtering Track public dataset trec06c [16], which consists of spam emails and ham emails. Before the corpus is represented as a term-document matrix, the email contents must be extracted and analyzed. A Visual C++ application was written to extract the subject, content and other major information from the original email text. Because Chinese text has no explicit spaces between words and often includes numbers and symbols, Chinese word segmentation is necessary, and ICTCLAS from the Chinese Academy of Sciences is used for this purpose. In the process, useless features such as stop words, white space and punctuation are deleted. Of all the terms in the corpus, 8879 appear in more than 60 emails (a document frequency of less than 0.2%); these 8879 terms are chosen as the original feature variables. Moreover, terms that appear in the subject are given a higher weight because the subject may contain more important information.

5.2 Evaluation method

The TREC Spam Filter Evaluation [16], developed for TREC 2005, provides a standardized method for evaluating spam filtering techniques.
Several statistics are commonly used:
hm%: the ham misclassification percentage.
sm%: the spam misclassification percentage.
(1-ROCA)%: the area above the ROC curve, the most important statistic.
h=.1: the sm% obtained when hm% = 0.1 (listed as hm:.1% in the tables).
lam%: the logistic average misclassification percentage, defined as

$$\mathrm{lam\%} = \mathrm{logit}^{-1}\left(\frac{\mathrm{logit}(hm\%) + \mathrm{logit}(sm\%)}{2}\right) \qquad (4)$$

where $\mathrm{logit}(x) = \log\frac{x}{1-x}$ and $\mathrm{logit}^{-1}(x) = \frac{e^x}{1+e^x}$.

5.3 Experiments and results

Many emails have no content and would produce zero vectors; these are removed to avoid computational problems. Because spam emails make up the majority of the emails used, spam and ham are selected in a ratio of 2:1. Before the experiments, the selected data are partitioned into training and testing datasets according to some rules.

First, the lasso regression is used directly as a filter to classify emails. At the same time, the important features, those whose coefficients are nonzero in the lasso solution, are picked out. A second filter based on logistic regression is then trained and tested on the new data, which contain only the terms selected by the lasso.

With the training data {X, y} (X is the data matrix for n documents and y is the response vector holding the labels of the n emails), 10-fold cross-validation is used to train the lasso regression filter with varying regularization parameters. Given the solution β of the lasso algorithm, the terms with nonzero coefficients form an effective term subset. The regularization parameter controls the L1-norm penalty and therefore determines the number of selected terms. Here λ takes the values $\lambda_i \in \{12, 10, 8, 7\}$, and independent experiments are run with 10-fold cross-validation; that is, 10 lasso models are built for each regularization parameter. They are also checked with the testing data.
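As an illustration of how the regularization parameter controls the number of selected terms, here is a minimal sketch that minimizes the lasso objective of Eq. (2). It uses cyclic coordinate descent rather than the LARS procedure used in the paper, because coordinate descent fits in a few lines. The data are synthetic and the λ values are hypothetical; they are not the paper's corpus or settings.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator: shrink z toward zero by t."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize sum_i (y_i - x_i . beta)^2 + lam * sum_j |beta_j|
    by cyclic coordinate descent."""
    n, m = X.shape
    beta = np.zeros(m)
    col_sq = (X ** 2).sum(axis=0)            # sum_i x_ij^2 for each column
    resid = y.copy()                         # residual y - X beta (beta = 0)
    for _ in range(n_iter):
        for j in range(m):
            resid += X[:, j] * beta[j]       # remove column j's contribution
            rho = X[:, j] @ resid            # correlation with the partial residual
            beta[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
            resid -= X[:, j] * beta[j]       # add the updated contribution back
    return beta

# Synthetic, standardized data with three truly relevant "terms".
rng = np.random.default_rng(1)
n, m = 100, 20
X = rng.standard_normal((n, m))
X -= X.mean(axis=0)
X /= np.sqrt((X ** 2).mean(axis=0))
true_beta = np.zeros(m)
true_beta[[0, 5, 10]] = [2.0, -3.0, 1.5]
y = X @ true_beta + 0.1 * rng.standard_normal(n)
y -= y.mean()

for lam in (40.0, 20.0, 10.0):               # hypothetical regularization values
    beta = lasso_cd(X, y, lam)
    selected = np.flatnonzero(beta)          # terms with nonzero coefficients
    print(f"lambda={lam}: {len(selected)} selected terms -> {selected.tolist()}")
```

The columns indexed by `selected` would form the reduced term matrix on which a logistic regression classifier is then trained.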
The average numbers of selected terms and the evaluation statistics are calculated and presented in Fig. 2(a) and Table 2. For each lasso model and each regularization parameter, a new term matrix is built that covers only the selected terms with nonzero coefficients. The new data are likewise divided into a training dataset and a test dataset. The training dataset is used to train the logistic regression classifier, which is then checked with the test dataset. The results are shown in Fig. 2(b) and Table 3.

According to the TREC Spam Filter Evaluation standard, a good spam filter has a small area above the Receiver Operating Characteristic (ROC) curve. Comparing Fig. 2(a) and Fig. 2(b) shows that logistic regression with lasso term selection performs better than the lasso filter alone, although the former needs additional time to build the logistic regression model. Furthermore, performance improves as the number of selected terms increases, and in this experiment the best ROC curve is obtained at λ = 7. Table 2 and Table 3 show the same: the filters perform better as more terms are selected, and best at λ = 7. By comparison, the filters with lasso-based term selection and logistic regression are superior to the lasso-based filters.

[Fig. 2: The ROC curves of logistic regression and lasso regression. Axes: %Ham Misclassification (logit scale) and %Spam Misclassification (logit scale); curves for λ = 12, 10, 8, 7. (a) The ROC curve of lasso. (b) The ROC curve with lasso-based term selection and logistic regression.]

[Table 2: Results with the lasso regression. Columns: λ, Terms, (1-ROCA)%, hm:.1%, lam%.]

[Table 3: Results with lasso-based term selection and logistic regression. Columns: λ, Terms, (1-ROCA)%, hm:.1%, lam%.]

Further experiments were carried out to compare the proposed method with other feature selection methods. Two typical methods, IG and χ², are used to choose the important terms, and two more logistic regression models are built on them to classify the emails. The ROC curves and results are presented in Fig. 3, Table 4 and Table 5. Fig. 3(a) and Fig. 3(b) show that χ² feature selection gives a smaller area above the ROC curve than the IG method, but both are inferior to the lasso approach, which suggests that the lasso is the better feature selection method for spam filtering. Similarly, a better filter has smaller values of (1-ROCA)%, hm:.1% and lam%. χ² and IG feature selection also perform worse on hm:.1% and lam% than lasso selection. It can therefore be concluded that the lasso performs well for feature selection in spam filtering.
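The lam% statistic compared above follows Eq. (4) and is simple to compute. The sketch below is a minimal illustration that assumes the two misclassification rates are given as percentages and converted to fractions before the logit is taken; the function names and the input values are hypothetical, not results from the paper.

```python
import math

def logit(p):
    """logit(p) = log(p / (1 - p)) for a probability 0 < p < 1."""
    return math.log(p / (1.0 - p))

def inv_logit(x):
    """Inverse logit: e^x / (1 + e^x)."""
    return math.exp(x) / (1.0 + math.exp(x))

def lam_percent(hm_percent, sm_percent):
    """Logistic average misclassification percentage, Eq. (4).

    Both inputs are percentages strictly between 0 and 100;
    a rate of exactly 0 or 100 has no finite logit.
    """
    hm = hm_percent / 100.0
    sm = sm_percent / 100.0
    return 100.0 * inv_logit((logit(hm) + logit(sm)) / 2.0)

# Hypothetical rates, for illustration only.
print(lam_percent(0.5, 2.0))
```

Because the average is taken on the logit scale, lam% always lies between hm% and sm% and is symmetric in the two rates.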
[Fig. 3: The ROC curves of logistic regression using χ² and IG. Axes: %Ham Misclassification (logit scale) and %Spam Misclassification (logit scale); curves for 577, 820, 1370 and 1823 terms. (a) The ROC curve of logistic regression using χ² to select terms. (b) The ROC curve of logistic regression using IG to select terms.]

[Table 4: Results of logistic regression using χ² variable selection. Columns: λ, Terms, (1-ROCA)%, hm:.1%, lam%.]

[Table 5: Results of logistic regression using IG variable selection. Columns: λ, Terms, (1-ROCA)%, hm:.1%, lam%.]

6 Conclusion

In this paper, an approach to lasso-based spam filtering is presented. The key features, those whose coefficients are nonzero in the lasso solution, are selected for feature reduction. Logistic filters are then built on them and tested with Chinese email text. Lasso regression selects the terms and estimates the regression coefficients simultaneously, so lasso-based spam filters can also be built directly. The approach is compared with two other term selection methods, all working with classifiers based on logistic regression. The simulation results show that the lasso approach succeeds in term selection.

Many challenges remain in spam filtering: senders spread junk emails in other ways, such as word obfuscation, tokenization [17] and even image spam. Applying the lasso method to these new problems is left for future research.
References

[1] A. Çıltık, T. Güngör, Time-efficient spam e-mail filtering using n-gram models, Pattern Recognition Letters, 2008.
[2] H. Yong, H. Xiaoning, Y. Muyun, Q. Haoliang, S. Chao, Chinese Spam Filter Based on Relaxed Online Support Vector Machine, Proc. International Conference on Asian Language Processing (IALP), 2010.
[3] H. Yong, Y. Muyun, Q. Haoliang, H. Xiaoning, L. Sheng, The Improved Logistic Regression Models for Spam Filtering, Proc. International Conference on Asian Language Processing (IALP), 2009.
[4] S. Lu, D. Chiang, H. Keh, H. Huang, Chinese text classification by the Naive Bayes Classifier and the associative classifier with multiple confidence threshold values, Knowledge-Based Systems, 2010.
[5] H. Drucker, D. Wu, V. N. Vapnik, Support vector machines for spam categorization, IEEE Transactions on Neural Networks, 1999.
[6] H. Qi, X. He, Y. Han, M. Yang, S. Li, Information Theory Based Feature Valuing for Logistic Regression for Spam Filtering, International Conference on Asian Language Processing, 2010.
[7] T. Almeida, J. Almeida, A. Yamakami, Spam filtering: how the dimensionality reduction affects the accuracy of Naive Bayes classifiers, Journal of Internet Services and Applications, 2011.
[8] G. Cormack, TREC 2007 Spam Track Overview, Proc. of the Sixteenth Text REtrieval Conference, National Institute of Standards and Technology (NIST).
[9] F. Sebastiani, Machine learning in automated text categorization, ACM Comput. Surv., 2002.
[10] X. Tai, F. Ren, K. Kita, An information retrieval model based on vector space method by supervised learning, Inf. Process. Manage., 2002.
[11] I. A. Gheyas, L. S. Smith, Feature subset selection in large dimensionality domains, Pattern Recognition, 2010.
[12] W. Zhang, T. Yoshida, X. Tang, A comparative study of TF*IDF, LSI and multi-words for text classification, Expert Systems with Applications, 2011.
[13] Y. Li, C. Luo, S. M. Chung, Text Clustering with Feature Selection by Using Statistical Data, IEEE Transactions on Knowledge and Data Engineering, 2008.
[14] B. Efron, T. Hastie, I. Johnstone, R. Tibshirani, Least angle regression, Annals of Statistics, 2004.
[15] L. Meier, S. van de Geer, P. Bühlmann, The group lasso for logistic regression, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2008.
[16] G. V. Cormack, TREC 2006 Spam Track Overview.
[17] D. Sculley, G. Wachman, C. E. Brodley, Spam Filtering Using Inexact String Matching in Explicit Feature Space with On-Line Linear Classifiers, 2006, pp. 1.
Information Retrieval INFO 4300 / CS 4300! Retrieval models Older models» Boolean retrieval» Vector Space model Probabilistic Models» BM25» Language models Web search» Learning to Rank Search Taxonomy!
More informationA STUDY REGARDING INTER DOMAIN LINKED DOCUMENTS SIMILARITY AND THEIR CONSEQUENT BOUNCE RATE
STUDIA UNIV. BABEŞ BOLYAI, INFORMATICA, Volume LIX, Number 1, 2014 A STUDY REGARDING INTER DOMAIN LINKED DOCUMENTS SIMILARITY AND THEIR CONSEQUENT BOUNCE RATE DIANA HALIŢĂ AND DARIUS BUFNEA Abstract. Then
More informationRegularized Logistic Regression for Mind Reading with Parallel Validation
Regularized Logistic Regression for Mind Reading with Parallel Validation Heikki Huttunen, JukkaPekka Kauppi, Jussi Tohka Tampere University of Technology Department of Signal Processing Tampere, Finland
More informationTerm extraction for user profiling: evaluation by the user
Term extraction for user profiling: evaluation by the user Suzan Verberne 1, Maya Sappelli 1,2, Wessel Kraaij 1,2 1 Institute for Computing and Information Sciences, Radboud University Nijmegen 2 TNO,
More informationMachine Learning model evaluation. Luigi Cerulo Department of Science and Technology University of Sannio
Machine Learning model evaluation Luigi Cerulo Department of Science and Technology University of Sannio Accuracy To measure classification performance the most intuitive measure of accuracy divides the
More informationLeast Squares Estimation
Least Squares Estimation SARA A VAN DE GEER Volume 2, pp 1041 1045 in Encyclopedia of Statistics in Behavioral Science ISBN13: 9780470860809 ISBN10: 0470860804 Editors Brian S Everitt & David
More informationStatistical Models in Data Mining
Statistical Models in Data Mining Sargur N. Srihari University at Buffalo The State University of New York Department of Computer Science and Engineering Department of Biostatistics 1 Srihari Flood of
More informationContentBased Recommendation
ContentBased Recommendation Contentbased? Item descriptions to identify items that are of particular interest to the user Example Example Comparing with Noncontent based Items Userbased CF Searches
More informationWeb Document Clustering
Web Document Clustering Lab Project based on the MDL clustering suite http://www.cs.ccsu.edu/~markov/mdlclustering/ Zdravko Markov Computer Science Department Central Connecticut State University New Britain,
More information6367(Print), ISSN 0976 6375(Online) & TECHNOLOGY Volume 4, Issue 1, (IJCET) January February (2013), IAEME
INTERNATIONAL International Journal of Computer JOURNAL Engineering OF COMPUTER and Technology ENGINEERING (IJCET), ISSN 09766367(Print), ISSN 0976 6375(Online) & TECHNOLOGY Volume 4, Issue 1, (IJCET)
More informationMachine Learning. Chapter 18, 21. Some material adopted from notes by Chuck Dyer
Machine Learning Chapter 18, 21 Some material adopted from notes by Chuck Dyer What is learning? Learning denotes changes in a system that... enable a system to do the same task more efficiently the next
More informationEmail Spam Detection Using Customized SimHash Function
International Journal of Research Studies in Computer Science and Engineering (IJRSCSE) Volume 1, Issue 8, December 2014, PP 3540 ISSN 23494840 (Print) & ISSN 23494859 (Online) www.arcjournals.org Email
More informationWE DEFINE spam as an email message that is unwanted basically
1048 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 10, NO. 5, SEPTEMBER 1999 Support Vector Machines for Spam Categorization Harris Drucker, Senior Member, IEEE, Donghui Wu, Student Member, IEEE, and Vladimir
More informationResearch on News Video Multitopic Extraction and Summarization
International Journal of New Technology and Research (IJNTR) ISSN:24544116, Volume2, Issue3, March 2016 Pages 3739 Research on News Video Multitopic Extraction and Summarization Di Li, Hua Huo Abstract
More informationPredict the Popularity of YouTube Videos Using Early View Data
000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050
More informationCustomer Classification And Prediction Based On Data Mining Technique
Customer Classification And Prediction Based On Data Mining Technique Ms. Neethu Baby 1, Mrs. Priyanka L.T 2 1 M.E CSE, Sri Shakthi Institute of Engineering and Technology, Coimbatore 2 Assistant Professor
More informationSURVEY OF TEXT CLASSIFICATION ALGORITHMS FOR SPAM FILTERING
I J I T E ISSN: 22297367 3(12), 2012, pp. 233237 SURVEY OF TEXT CLASSIFICATION ALGORITHMS FOR SPAM FILTERING K. SARULADHA 1 AND L. SASIREKA 2 1 Assistant Professor, Department of Computer Science and
More informationRecognition Method for Handwritten Digits Based on Improved Chain Code Histogram Feature
3rd International Conference on Multimedia Technology ICMT 2013) Recognition Method for Handwritten Digits Based on Improved Chain Code Histogram Feature Qian You, Xichang Wang, Huaying Zhang, Zhen Sun
More informationSimple Language Models for Spam Detection
Simple Language Models for Spam Detection Egidio Terra Faculty of Informatics PUC/RS  Brazil Abstract For this year s Spam track we used classifiers based on language models. These models are used to
More informationBehavior Analysis of SVM Based Spam Filtering Using Various Kernel Functions and Data Representations
ISSN: 2278181 Vol. 2 Issue 9, September  213 Behavior Analysis of SVM Based Spam Filtering Using Various Kernel Functions and Data Representations Author :Sushama Chouhan Author Affiliation: MTech Scholar
More informationProjektgruppe. Categorization of text documents via classification
Projektgruppe Steffen Beringer Categorization of text documents via classification 4. Juni 2010 Content Motivation Text categorization Classification in the machine learning Document indexing Construction
More informationIDENTIFIC ATION OF SOFTWARE EROSION USING LOGISTIC REGRESSION
http:// IDENTIFIC ATION OF SOFTWARE EROSION USING LOGISTIC REGRESSION Harinder Kaur 1, Raveen Bajwa 2 1 PG Student., CSE., Baba Banda Singh Bahadur Engg. College, Fatehgarh Sahib, (India) 2 Asstt. Prof.,
More informationDistributed forests for MapReducebased machine learning
Distributed forests for MapReducebased machine learning Ryoji Wakayama, Ryuei Murata, Akisato Kimura, Takayoshi Yamashita, Yuji Yamauchi, Hironobu Fujiyoshi Chubu University, Japan. NTT Communication
More informationUsing Biased Discriminant Analysis for Email Filtering
Using Biased Discriminant Analysis for Email Filtering Juan Carlos Gomez 1 and MarieFrancine Moens 2 1 ITESM, Eugenio Garza Sada 2501, Monterrey NL 64849, Mexico juancarlos.gomez@invitados.itesm.mx 2
More informationAutomatic Web Page Classification
Automatic Web Page Classification Yasser Ganjisaffar 84802416 yganjisa@uci.edu 1 Introduction To facilitate user browsing of Web, some websites such as Yahoo! (http://dir.yahoo.com) and Open Directory
More informationThe Data Mining Process
Sequence for Determining Necessary Data. Wrong: Catalog everything you have, and decide what data is important. Right: Work backward from the solution, define the problem explicitly, and map out the data
More informationForecasting stock markets with Twitter
Forecasting stock markets with Twitter Argimiro Arratia argimiro@lsi.upc.edu Joint work with Marta Arias and Ramón Xuriguera To appear in: ACM Transactions on Intelligent Systems and Technology, 2013,
More informationVisualization of large data sets using MDS combined with LVQ.
Visualization of large data sets using MDS combined with LVQ. Antoine Naud and Włodzisław Duch Department of Informatics, Nicholas Copernicus University, Grudziądzka 5, 87100 Toruń, Poland. www.phys.uni.torun.pl/kmk
More informationSpam Filtering using Inexact String Matching in Explicit Feature Space with OnLine Linear Classifiers
Spam Filtering using Inexact String Matching in Explicit Feature Space with OnLine Linear Classifiers D. Sculley, Gabriel M. Wachman, and Carla E. Brodley Department of Computer Science, Tufts University
More informationA semisupervised Spam mail detector
A semisupervised Spam mail detector Bernhard Pfahringer Department of Computer Science, University of Waikato, Hamilton, New Zealand Abstract. This document describes a novel semisupervised approach
More informationStabilization by Conceptual Duplication in Adaptive Resonance Theory
Stabilization by Conceptual Duplication in Adaptive Resonance Theory Louis Massey Royal Military College of Canada Department of Mathematics and Computer Science PO Box 17000 Station Forces Kingston, Ontario,
More informationSentiment analysis on tweets in a financial domain
Sentiment analysis on tweets in a financial domain Jasmina Smailović 1,2, Miha Grčar 1, Martin Žnidaršič 1 1 Dept of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia 2 Jožef Stefan International
More informationThe fastclime Package for Linear Programming and LargeScale Precision Matrix Estimation in R
Journal of Machine Learning Research 15 (2014) 489493 Submitted 3/13; Revised 8/13; Published 2/14 The fastclime Package for Linear Programming and LargeScale Precision Matrix Estimation in R Haotian
More informationIDENTIFICATION OF AUCTION FRAUDULENT IN ECOMMERCE WEB
INTERNATIONAL JOURNAL OF REVIEWS ON RECENT ELECTRONICS AND COMPUTER SCIENCE IDENTIFICATION OF AUCTION FRAUDULENT IN ECOMMERCE WEB Kimaya Nandkishor Shirke 1, K.Pushpa Rani 2 1 M.Tech Student, Dept of
More informationMachine Learning in Spam Filtering
Machine Learning in Spam Filtering A Crash Course in ML Konstantin Tretyakov kt@ut.ee Institute of Computer Science, University of Tartu Overview Spam is Evil ML for Spam Filtering: General Idea, Problems.
More informationAn Approach to Detect Spam Emails by Using Majority Voting
An Approach to Detect Spam Emails by Using Majority Voting Roohi Hussain Department of Computer Engineering, National University of Science and Technology, H12 Islamabad, Pakistan Usman Qamar Faculty,
More informationPredicting the Risk of Heart Attacks using Neural Network and Decision Tree
Predicting the Risk of Heart Attacks using Neural Network and Decision Tree S.Florence 1, N.G.Bhuvaneswari Amma 2, G.Annapoorani 3, K.Malathi 4 PG Scholar, Indian Institute of Information Technology, Srirangam,
More informationServer Load Prediction
Server Load Prediction Suthee Chaidaroon (unsuthee@stanford.edu) Joon Yeong Kim (kim64@stanford.edu) Jonghan Seo (jonghan@stanford.edu) Abstract Estimating server load average is one of the methods that
More informationNetwork Machine Learning Research Group. Intended status: Informational October 19, 2015 Expires: April 21, 2016
Network Machine Learning Research Group S. Jiang InternetDraft Huawei Technologies Co., Ltd Intended status: Informational October 19, 2015 Expires: April 21, 2016 Abstract Network Machine Learning draftjiangnmlrgnetworkmachinelearning00
More informationThe basic unit in matrix algebra is a matrix, generally expressed as: a 11 a 12. a 13 A = a 21 a 22 a 23
(copyright by Scott M Lynch, February 2003) Brief Matrix Algebra Review (Soc 504) Matrix algebra is a form of mathematics that allows compact notation for, and mathematical manipulation of, highdimensional
More informationUniversité de Montpellier 2 Hugo AlatristaSalas : hugo.alatristasalas@teledetection.fr
Université de Montpellier 2 Hugo AlatristaSalas : hugo.alatristasalas@teledetection.fr WEKA Gallirallus Zeland) australis : Endemic bird (New Characteristics Waikato university Weka is a collection
More informationWikipedia and Web document based Query Translation and Expansion for Crosslanguage IR
Wikipedia and Web document based Query Translation and Expansion for Crosslanguage IR LingXiang Tang 1, Andrew Trotman 2, Shlomo Geva 1, Yue Xu 1 1Faculty of Science and Technology, Queensland University
More informationChapter 6. The stacking ensemble approach
82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described
More informationAzure Machine Learning, SQL Data Mining and R
Azure Machine Learning, SQL Data Mining and R Daybyday Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:
More informationComparing the Results of Support Vector Machines with Traditional Data Mining Algorithms
Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Scott Pion and Lutz Hamel Abstract This paper presents the results of a series of analyses performed on direct mail
More informationMachine learning for algo trading
Machine learning for algo trading An introduction for nonmathematicians Dr. Aly Kassam Overview High level introduction to machine learning A machine learning bestiary What has all this got to do with
More informationDenial of Service Attack Detection Using Multivariate Correlation Information and Support Vector Machine Classification
International Journal of Computer Sciences and Engineering Open Access Research Paper Volume4, Issue3 EISSN: 23472693 Denial of Service Attack Detection Using Multivariate Correlation Information and
More informationModelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches
Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches PhD Thesis by Payam Birjandi Director: Prof. Mihai Datcu Problematic
More informationDoptimal plans in observational studies
Doptimal plans in observational studies Constanze Pumplün Stefan Rüping Katharina Morik Claus Weihs October 11, 2005 Abstract This paper investigates the use of Design of Experiments in observational
More information