Lasso-based Spam Filtering with Chinese Emails
Journal of Computational Information Systems 8: 8 (2012)

Zunxiong LIU 1, Xianlong ZHANG 1, Shujuan ZHENG 2
1 School of Information Engineering, East China Jiaotong University, Nanchang, China
2 Division of Scientific Research, Jiangxi University of Finance and Economics, Nanchang, China

Abstract

In spam filtering, a classifier built directly on the high-dimensional, sparse data produced by the vector space model incurs high computational cost and generalizes poorly. Feature extraction or feature selection is therefore commonly applied to reduce the dimension before classifier training. This paper proposes a spam filtering approach based on Lasso regression, an l1-regularized multivariate linear model, which builds a regression model for spam filtering and selects the important terms automatically. Logistic regression (LR) models are then built on the selected terms. The simulations are run on TREC06C, and the results show that LR with lasso term selection achieves better performance.

Keywords: Lasso; Feature Selection; LAR; Spam Filtering; Logistic Regression

1 Introduction

Spam emails, commonly known as unsolicited bulk emails (UBE) or unsolicited commercial emails (UCE) [1], have become an increasingly severe problem on the Internet. Growing at exponential speed, spam emails waste social resources and reduce productivity (misuse of storage space and computational resources, time spent reading and removing spam), and even threaten Internet security. Automatic spam filtering technologies are needed to deal with them. Among the techniques for combating spam emails, content-based spam filtering is a promising and effective approach [2]. Generally, there are two different approaches to detecting spam emails [2, 3, 4]: generative models (for instance, Naive Bayes [4]) and discriminative models such as SVM [2, 5] and Logistic Regression [3, 6].
It is universally acknowledged that support vector machines and Naive Bayes classifiers are among the top performers [7] in text classification. In the TREC (Text Retrieval Conference) evaluations, LR achieved an extraordinarily distinguished performance [8]. We adopt LR as the classifier in this paper.

The rest of this paper is organized as follows. Section 2 presents the related work. Section 3 describes the email representation and the VSM (Vector Space Model) [9, 10]. The feature selection method LARS is introduced in Section 4. Section 5 applies the methodology in our experiments and reports the experimental results. Finally, the conclusion is drawn in Section 6, and Section 7 is the acknowledgment.

This work is supported by the National Natural Science Foundation of China, the Social Sciences Foundation of the State Education Ministry (No. 10YJC630379), the Natural Science Foundation of Jiangxi Province (2010GZS0034) and the Technology Project under Jiangxi Education Administration Department (GJJ10446). Corresponding author. Email address: xianlongok@163.com (Xianlong ZHANG). Copyright 2012 Binary Information Press, April 2012.

2 Related Works

Spam detection can be considered a binary document classification problem in which each email is regarded as a document. An email contains the body, the subject and other header fields. The Vector Space Model, also called the bag-of-words method, is the most widely used document representation. With it, each email can be processed and represented as a high-dimensional sparse vector. The available emails thus form high-dimensional data (with thousands of features), in which many features are irrelevant or redundant. High-dimensional data increase computational complexity and reduce the generalization ability of the classifiers, so dimensionality reduction is a crucial step in spam filtering.

Feature selection is an approach to dimension reduction [11] that searches for an optimal feature subset in a high-dimensional feature space using statistical methods or information theory. The related measures [12] include document frequency, information gain (IG), the χ² statistic and so on [7, 13]. Lasso [14] (Least absolute shrinkage and selection operator) regression is a multivariate linear regression with a bound on the sum of the absolute values of the coefficients, which can select variables and estimate coefficients simultaneously. Since the lasso was proposed, many advanced techniques based on it have been put forward, such as the elastic net and the group lasso [15].
Moreover, Least Angle Regression (LAR) [14] was proposed by Efron et al. to compute lasso solutions efficiently. In this paper, a new spam filtering approach based on the lasso is proposed and applied to Chinese spam filtering.

3 Email Representation

Email representation is the first step in spam filtering, and the VSM is commonly employed. In the VSM, the individual terms in each email are collected to construct a feature set $T_m = (t_1, t_2, \ldots, t_m)$ containing $m$ terms. Each email can be represented as a vector $d_i = (w_1(d_i), w_2(d_i), \ldots, w_m(d_i))$, where $w_j(d_i)$ is the weight of term $t_j$ in email $d_i$. By gathering these vectors, the whole corpus is represented as the term-document matrix $X_{n \times m} = (x_1, x_2, \ldots, x_m)$, where $x_j = (w_j(d_1), w_j(d_2), \ldots, w_j(d_n))^T$ and $n$ is the number of documents. The value of $w_j(d_i)$ is calculated using the normalized ltc function [13], defined as:

$$ w_j(d_i) = \frac{\log(f_{ji} + 1.0)\,\log(N / n_j)}{\sqrt{\sum_{k=1}^{m} \left[\log(f_{ki} + 1.0)\,\log(N / n_k)\right]^2}} \tag{1} $$

where $f_{ji}$ is the number of occurrences of term $t_j$ in document $d_i$, $N$ is the total number of documents, $n_j$ is the number of those documents in which $t_j$ occurs, and $m$ is the number of terms.

With the available corpus of $n$ emails, the term-document matrix $X = (x_1, x_2, \ldots, x_m)$ of size $n \times m$ can be produced, where $x_1, x_2, \ldots, x_m$ are $n$-dimensional vectors corresponding to the $m$ features. The responses for the $n$ emails constitute the vector $y$, whose components depend on the corresponding labels: if email $k$ is spam then $y_k = 0$; otherwise it is ham and $y_k = 1$.

4 LARS Algorithm

In this section, the approach to term selection is described in detail. Given data $X$ with $m$ features, $X$ and $y$ are first standardized for later use, so that for each feature $j$

$$ \sum_{i=1}^{n} x_{ij} = 0, \qquad \sum_{i=1}^{n} y_i = 0, \qquad \frac{1}{n} \sum_{i=1}^{n} x_{ij}^2 = 1. $$

Let $\beta = (\beta_1, \beta_2, \ldots, \beta_m)^T$; the lasso is then equivalent to the following optimization problem:

$$ \text{minimize: } S(\beta) = \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{m} x_{ij} \beta_j \Big)^2 + \lambda \sum_{j=1}^{m} |\beta_j| \tag{2} $$

where $\lambda > 0$ is a tunable regularization parameter. To solve this L1-penalized problem efficiently, LARS (Least Angle Regression) was put forward as an improvement on the stagewise procedure, offering high precision and easy computation. In LARS, the correlation between term $x_i$ and the target $y$ is defined as:

$$ c_i = x_i^T (y - \mu), \qquad \mu = X\beta \tag{3} $$

Fig. 1: The procedure of LARS (predictors $x_{j1}$, $x_{j2}$, $x_{j3}$ and directions $u_1$, $u_2$, $u_3$).

LARS works with the procedure shown in Fig. 1. At the beginning of the algorithm $\mu = 0$. The correlations are calculated and the largest absolute correlation $c_{j1}$ is found, meaning that the variable $x_{j1}$ is the most correlated with the current residual $y - \mu$; it is added to the active set $A$. The largest possible step is then taken in the direction of this predictor until some other predictor, say $x_{j2}$, has as much correlation with the current residual as $x_{j1}$. LARS then proceeds in a direction equiangular between the two predictors until a third variable $x_{j3}$ earns its way into the most correlated set. LARS then proceeds equiangularly between the three variables to find the next variable.
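To make the objective in Eq. (2) concrete, the following is a minimal sketch that standardizes a toy term-document matrix as described above and minimizes $S(\beta)$ by cyclic coordinate descent. Coordinate descent is swapped in here only because it fits in a few lines; it is not the LARS procedure of this section. The toy matrix, the labels and the function names (`standardize`, `lasso_cd`, `selected_terms`) are illustrative assumptions, not part of the original experiments.

```python
import math

def standardize(X, y):
    """Center y; center each column of X and scale it so (1/n) * sum(x^2) = 1."""
    n, m = len(X), len(X[0])
    y_mean = sum(y) / n
    yc = [v - y_mean for v in y]
    cols = []
    for j in range(m):
        col = [X[i][j] for i in range(n)]
        mean = sum(col) / n
        col = [v - mean for v in col]
        scale = math.sqrt(sum(v * v for v in col) / n)
        # A constant column carries no information; leave it as zeros.
        cols.append([v / scale for v in col] if scale > 0 else col)
    return cols, yc

def soft_threshold(a, t):
    """Shrink a toward zero by t, returning 0 when |a| <= t."""
    return math.copysign(max(abs(a) - t, 0.0), a)

def lasso_cd(cols, y, lam, n_iter=200):
    """Minimize sum((y - X beta)^2) + lam * sum(|beta_j|) one coordinate at a time."""
    n, m = len(y), len(cols)
    beta = [0.0] * m
    resid = list(y)  # current residual y - X beta
    for _ in range(n_iter):
        for j in range(m):
            z = sum(v * v for v in cols[j])
            if z == 0.0:
                continue
            # Correlation of column j with the residual that excludes term j.
            rho = sum(cols[j][i] * resid[i] for i in range(n)) + z * beta[j]
            new = soft_threshold(rho, lam / 2.0) / z
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= cols[j][i] * delta
                beta[j] = new
    return beta

def selected_terms(beta):
    """Indices of the terms whose lasso coefficient is nonzero."""
    return [j for j, b in enumerate(beta) if b != 0.0]

# Hypothetical toy data: 6 emails, 3 terms; y = 0 for spam, 1 for ham.
X = [[1, 0, 1], [2, 1, 1], [3, 0, 0], [4, 1, 0], [5, 0, 1], [6, 1, 1]]
y = [0, 0, 0, 1, 1, 1]
cols, yc = standardize(X, y)
for lam in (3.0, 6.0):
    print(lam, selected_terms(lasso_cd(cols, yc, lam)))
```

A larger $\lambda$ raises the threshold every coefficient must clear, so fewer terms survive; this is the mechanism by which the regularization parameter controls the size of the selected term subset.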
The LARS implementation is listed in Table 1.

Table 1: The LARS algorithm

Notation: $u$ is the direction of the angle bisector, $c$ the correlations, $\mu$ the predictor, and $r$, $\hat{r}$ the step lengths; $G_A = X_A^T X_A$.

Initialize $\mu = 0$.
while $\|y - \mu(k)\| > \varepsilon$ and $|A| \le m$:
  1. $c = X^T (y - \mu) = (c_1, c_2, \ldots, c_m)^T$, $A = \{ j : |c_j| = \max_j |c_j| = C \}$, $X_A = [\ldots x_j \ldots]_{j \in A}$
  2. $u(k) = X_A w_A$, where $w_A = a_A G_A^{-1} 1_A$ and $a_A = (1_A^T G_A^{-1} 1_A)^{-1/2}$
  3. $a = X^T u = (a_1, a_2, \ldots, a_m)^T$
  4. $r = \min^{+}_{j \in A} \{ -\beta_j / w_j \}$; if $|A| < m$ then $\hat{r} = \min^{+}_{j \notin A} \left\{ \frac{C - c_j}{a_A - a_j}, \frac{C + c_j}{a_A + a_j} \right\}$, else $\hat{r} = C / a_A$
  5. if $r < \hat{r}$: $\mu(k) = \mu(k-1) + r\,u(k)$, $\beta_A = \beta_A + r\,w_A$, $A = A \setminus \{\hat{j}\}$; else: $\mu(k) = \mu(k-1) + \hat{r}\,u(k)$, $\beta_A = \beta_A + \hat{r}\,w_A$, $A = A \cup \{\hat{j}\}$
return $\beta_A$

5 Simulation

5.1 Preprocessing with document

The Chinese email corpus used for the simulation experiments comes from the 2006 TREC Spam Filtering Track public dataset trec06c [16]. It consists of spam emails and ham emails. Before the corpus is represented as a term-document matrix, the email contents must be extracted and analyzed. A Visual C++ application was written to extract the subject, content and other major information from the original email text. Since Chinese text has no explicit spaces between words and often includes numbers and symbols, Chinese word segmentation is necessary. ICTCLAS from the Chinese Academy of Sciences is therefore used for word segmentation. In the process, useless features such as stop words, white space and punctuation are deleted. Statistics on the corpus show that, of all the terms, 8879 appear in more than 60 emails (document frequency is less than 0.2%). These 8879 terms are chosen as the original feature variables. Moreover, terms that appear in the subject are given a higher weight, because the subject may contain more important information.

5.2 Evaluation method

The TREC Spam Filter Evaluation [16], developed for TREC 2005, provides a standardized method for evaluating spam filtering techniques.
Several statistics are commonly used.
hm%: the ham misclassification percentage.
sm%: the spam misclassification percentage.
1-ROCA%: the area above the ROC curve, the most crucial statistic.
h=.1: the value of sm% when hm% = 0.1.
lam%: the logistic average misclassification percentage, defined as

$$ lam\% = \mathrm{logit}^{-1}\!\left( \frac{\mathrm{logit}(hm\%) + \mathrm{logit}(sm\%)}{2} \right) \tag{4} $$

Here $\mathrm{logit}(x) = \log\!\left(\frac{x}{1 - x}\right)$ and $\mathrm{logit}^{-1}(x) = \frac{e^x}{1 + e^x}$.

5.3 Experiments and result

Many emails have no content and therefore yield zero vectors, which are removed to avoid computational problems. Because spam emails make up the majority of the emails in the experiment, spam and ham emails are selected in a ratio of 2:1. Before the experiment, the selected data are partitioned into training and testing datasets according to some rules. First, Lasso regression is used as the filter algorithm to classify the emails. At the same time the important features, those whose coefficients are nonzero in the lasso solution, are picked out. Then another filter based on logistic regression is tested on the new data, which contain only the terms selected by the lasso; that is, the new data are used to train and test the LR classifier.

With the training data {X, y} (X is the data matrix of n documents and y is the response vector holding the labels of the n emails), 10-fold cross-validation is employed to train the lasso regression filter with varying regularization parameters. Given the solution β from the lasso algorithm, the terms whose coefficients are nonzero form an effective term subset. Changing the regularization parameter controls the L1-norm penalty and thereby determines the number of selected terms. Here λ takes the values 12, 10, 8 and 7, and independent experiments are run with 10-fold cross-validation. That is, 10 lasso models are built for each selected regularization parameter. They are also checked with the testing data.
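As a worked illustration of the statistics in Section 5.2, the sketch below computes hm, sm and the logistic average of Eq. (4) from hard predictions. It assumes the paper's label encoding (0 for spam, 1 for ham) and works with rates as fractions strictly between 0 and 1, since the logit is undefined at 0 and 1; multiply by 100 to obtain percentages. The function names and the toy labels are illustrative only.

```python
import math

def logit(p):
    """Log-odds of a rate p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))

def inv_logit(x):
    """Inverse of logit: maps a log-odds value back to a rate."""
    return math.exp(x) / (1.0 + math.exp(x))

def misclassification_rates(labels, predictions):
    """Return (hm, sm) as fractions; labels and predictions use 0 = spam, 1 = ham."""
    ham = [p for l, p in zip(labels, predictions) if l == 1]
    spam = [p for l, p in zip(labels, predictions) if l == 0]
    hm = sum(1 for p in ham if p == 0) / len(ham)    # ham flagged as spam
    sm = sum(1 for p in spam if p == 1) / len(spam)  # spam passed as ham
    return hm, sm

def lam(hm, sm):
    """Logistic average misclassification rate of Eq. (4)."""
    return inv_logit((logit(hm) + logit(sm)) / 2.0)

# Hypothetical example: 4 ham and 4 spam emails.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
predictions = [1, 1, 1, 0, 0, 0, 1, 1]
hm, sm = misclassification_rates(labels, predictions)
print(hm, sm, lam(hm, sm))
```

Averaging on the logit scale means that lam equals the common value when hm and sm agree, and otherwise lies between them, closer to the smaller rate than the arithmetic mean would be.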
The average number of selected terms and the evaluation statistics, which were analyzed carefully, are calculated and presented in Fig 2(a) and Table 2. For each lasso model and each regularization parameter, a new term matrix covering only the selected terms, those with nonzero coefficients, is built. The new data are also divided into a training dataset and a test dataset. The training dataset is used to train the logistic regression classifier, and the new LR classifier is checked with the test dataset. The experimental results are shown in Fig 2(b) and Table 3.

According to the TREC Spam Filter Evaluation standard, a distinguished spam filter has a small area above the Receiver Operating Characteristic (ROC) curve. Comparing the results shown in Fig 2(a) and Fig 2(b), it is easy to see that logistic regression with lasso term selection performs better than the lasso alone, although the former needs additional time to build the logistic regression model. Furthermore, performance improves as the number of selected terms increases. In this experiment the best ROC curve is obtained when λ = 7, as shown in the figures. Table 2 and Table 3 likewise show that the filters perform better as the number of selected terms increases, with the best performance at λ = 7. By comparison, the filters with lasso-based term selection and logistic regression are superior to the lasso-based filters.

Fig. 2: The ROC curves of logistic regression and lasso regression, plotted as % spam misclassification against % ham misclassification (both on a logit scale) for λ = 12, 10, 8 and 7. (a) The ROC curve of lasso. (b) The ROC curve with the lasso-based term selection and logistic regression.

Table 2: Results with the lasso regression. Columns: λ, Terms, 1-ROCA%, hm:.1%, lam%.

Table 3: Result with the lasso-based term selection and logistic regression. Columns: λ, Terms, 1-ROCA%, hm:.1%, lam%.

Afterwards, more experiments are carried out to show that the proposed method achieves better performance. Two other typical feature selection methods, IG and χ², are used to choose the important terms, and two further logistic regression models are built on them to classify the emails. The ROC curves and results are presented in Fig 3, Table 4 and Table 5. Fig 3(a) and Fig 3(b) show that χ² feature selection has a smaller area above the ROC curve than the IG method, but both are inferior to the lasso approach. It can be said that the lasso is a better feature selection method for spam filtering. Likewise, a filter with better performance has smaller values of 1-ROCA%, hm:.1% and lam%. The χ² and IG feature selection results are also poorer on hm:.1% and lam% than those of the lasso selection method. So the conclusion can be drawn that the lasso does well at feature selection in spam filtering.
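For readers unfamiliar with the χ² baseline, the sketch below scores each term with the standard two-by-two χ² statistic used in text categorization and keeps the highest-scoring terms. The paper does not state which exact form of the statistic it used, so this formula is an assumption, and the token sets and names (`chi_square`, `rank_terms`) are hypothetical English stand-ins for segmented Chinese terms.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic for a term/class contingency table.

    a: spam emails containing the term    b: ham emails containing the term
    c: spam emails without the term       d: ham emails without the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - c * b) ** 2 / denom

def rank_terms(docs, labels, top_k):
    """Score every term and return (top_k terms, all scores); labels: 0 = spam, 1 = ham."""
    n_spam = sum(1 for l in labels if l == 0)
    n_ham = len(labels) - n_spam
    vocab = sorted(set().union(*docs))
    scores = {}
    for t in vocab:
        a = sum(1 for doc, l in zip(docs, labels) if l == 0 and t in doc)
        b = sum(1 for doc, l in zip(docs, labels) if l == 1 and t in doc)
        scores[t] = chi_square(a, b, n_spam - a, n_ham - b)
    ranked = sorted(vocab, key=lambda t: -scores[t])
    return ranked[:top_k], scores

# Hypothetical toy corpus: each email is the set of terms it contains.
docs = [
    {"free", "offer"}, {"free", "win"}, {"free", "meeting"},
    {"meeting", "report"}, {"report", "offer"}, {"meeting", "lunch"},
]
labels = [0, 0, 0, 1, 1, 1]
top, scores = rank_terms(docs, labels, top_k=2)
print(top, scores["free"])
```

Unlike the lasso, this filter scores each term in isolation, so the number of retained terms has to be fixed by hand through `top_k` rather than falling out of a regularization parameter.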
Fig. 3: The ROC curves of logistic regression using χ² and IG, plotted as % spam misclassification against % ham misclassification (both on a logit scale) for 577, 820, 1370 and 1823 terms. (a) The ROC curve of logistic regression using χ² to select terms. (b) The ROC curve of logistic regression using IG to select terms.

Table 4: Result of the logistic regression using χ² variable selection. Columns: λ, Terms, 1-ROCA%, hm:.1%, lam%.

Table 5: Result of the logistic regression using IG variable selection. Columns: λ, Terms, 1-ROCA%, hm:.1%, lam%.

6 Conclusion

In this paper, an approach to lasso-based spam filtering is presented. The key features, those whose coefficients are nonzero in the lasso solution, can be selected for feature reduction. Based on them, logistic filters are built and tested with Chinese email text. Lasso regression selects the terms and estimates the regression coefficients simultaneously; on the other hand, lasso-based spam filters can also be built directly. The approach is compared with two other term selection methods, all of which work with classifiers based on logistic regression. The simulation results show that the lasso approach succeeds in term selection.

There are still many challenges in spam filtering: senders will deliver junk emails in other ways, such as word obfuscation, tokenization [17], and even image spam. Applying the lasso method to these new problems will be our future research.
References

[1] A. Çıltık, T. Güngör, Time-efficient spam e-mail filtering using n-gram models, Pattern Recognition Letters, 2008.
[2] H. Yong, H. Xiaoning, Y. Muyun, Q. Haoliang, and S. Chao, Chinese Spam Filter Based on Relaxed Online Support Vector Machine, Proc. 2010 International Conference on Asian Language Processing (IALP), 2010.
[3] H. Yong, Y. Muyun, Q. Haoliang, H. Xiaoning, and L. Sheng, The Improved Logistic Regression Models for Spam Filtering, Proc. International Conference on Asian Language Processing (IALP 09), 2009.
[4] S. Lu, D. Chiang, H. Keh, and H. Huang, Chinese text classification by the Naive Bayes Classifier and the associative classifier with multiple confidence threshold values, Knowledge-Based Systems, 2010.
[5] H. Drucker, D. Wu, and V. N. Vapnik, Support vector machines for spam categorization, IEEE Transactions on Neural Networks, 1999.
[6] H. Qi, X. He, Y. Han, M. Yang, and S. Li, Information Theory Based Feature Valuing for Logistic Regression for Spam Filtering, International Conference on Asian Language Processing, 2010.
[7] T. Almeida, J. Almeida, and A. Yamakami, Spam filtering: how the dimensionality reduction affects the accuracy of Naive Bayes classifiers, Journal of Internet Services and Applications, 2011.
[8] G. Cormack, TREC 2007 Spam Track Overview, Proc. Sixteenth Text REtrieval Conference, National Institute of Standards and Technology (NIST).
[9] F. Sebastiani, Machine learning in automated text categorization, ACM Comput. Surv., 2002.
[10] X. Tai, F. Ren, and K. Kita, An information retrieval model based on vector space method by supervised learning, Inf. Process. Manage., 2002.
[11] I. A. Gheyas and L. S. Smith, Feature subset selection in large dimensionality domains, Pattern Recognition, 2010.
[12] W. Zhang, T. Yoshida, and X. Tang, A comparative study of TF*IDF, LSI and multi-words for text classification, Expert Systems With Applications, 2011.
[13] Y. Li, C. Luo, and S. M. Chung, Text Clustering with Feature Selection by Using Statistical Data, IEEE Transactions on Knowledge and Data Engineering, 2008.
[14] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, Least angle regression, Annals of Statistics, 2004.
[15] L. Meier, S. van de Geer, and P. Bühlmann, The group lasso for logistic regression, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2008.
[16] G. V. Cormack, TREC 2006 Spam Track Overview.
[17] D. Sculley, G. Wachman, and C. E. Brodley, Spam Filtering Using Inexact String Matching in Explicit Feature Space with On-Line Linear Classifiers, 2006, pp. 1.
More informationSURVEY OF TEXT CLASSIFICATION ALGORITHMS FOR SPAM FILTERING
I J I T E ISSN: 2229-7367 3(1-2), 2012, pp. 233-237 SURVEY OF TEXT CLASSIFICATION ALGORITHMS FOR SPAM FILTERING K. SARULADHA 1 AND L. SASIREKA 2 1 Assistant Professor, Department of Computer Science and
More informationRecognition Method for Handwritten Digits Based on Improved Chain Code Histogram Feature
3rd International Conference on Multimedia Technology ICMT 2013) Recognition Method for Handwritten Digits Based on Improved Chain Code Histogram Feature Qian You, Xichang Wang, Huaying Zhang, Zhen Sun
More informationDistributed forests for MapReduce-based machine learning
Distributed forests for MapReduce-based machine learning Ryoji Wakayama, Ryuei Murata, Akisato Kimura, Takayoshi Yamashita, Yuji Yamauchi, Hironobu Fujiyoshi Chubu University, Japan. NTT Communication
More informationSearch Taxonomy. Web Search. Search Engine Optimization. Information Retrieval
Information Retrieval INFO 4300 / CS 4300! Retrieval models Older models» Boolean retrieval» Vector Space model Probabilistic Models» BM25» Language models Web search» Learning to Rank Search Taxonomy!
More informationSimple Language Models for Spam Detection
Simple Language Models for Spam Detection Egidio Terra Faculty of Informatics PUC/RS - Brazil Abstract For this year s Spam track we used classifiers based on language models. These models are used to