ThreeWay Decisions Solution to Filter Spam An Empirical Study


 Ethelbert Ray
 3 years ago
 Views:
Transcription
1 ThreeWay Decisions Solution to Filter Spam An Empirical Study Xiuyi Jia 1,4, Kan Zheng 2,WeiweiLi 3, Tingting Liu 2, and Lin Shang 4 1 School of Computer Science and Technology, Nanjing University of Science and Technology, Nanjing, China, School of Mechanical Engineering, Nanjing University of Science and Technology, Nanjing, China, College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China, State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China, Abstract. A threeway decisions solution based on Bayesian decision theory for filtering spam s is examined in this paper. Compared to existed filtering systems, the spam filtering is no longer viewed as a binary classification problem. Each incoming is accepted as a legitimate or rejected as a spam or undecided as a furtherexam by considering the misclassification cost. The threeway decisions solution for spam filtering can reduce the error rate of classifying a legitimate to spam, and provide a more meaningful decision procedure for users. The solution is not restricted to a specific classifier. Experimental results on several corpus show that the threeway decisions solution can get a better total cost ratio value and a lower weighted error. Keywords: Decisiontheoretic rough set model, spam filtering, threeway decisions solution. 1 Introduction Spam filtering is often considered as a binary classification problem. Many machine learning algorithms were employed in different filters to classify an incoming as a legitimate or a spam, such as Naive bayesian classifier [11], memory based classifier () [1], based classifier [5] and so on [2,12]. In this paper, we will treat spam filtering as an optimum decision problem under the framework of cost sensitive learning. It is easy to understand that Spam filtering is a cost sensitive learning problem. The cost for misclassifying a legitimate as spam far outweighs the cost of marking a spam as legitimate [11]. Many machine learning approaches to spam filtering papers considered the different costs of two types of J.T. Yao et al. (Eds.): RSCTC 2012, LNAI 7413, pp , c SpringerVerlag Berlin Heidelberg 2012
2 288 X. Jia et al. misclassifications (legit spam and spam legit). In [11], 99.9% was used as the certain threshold for classifying test as spam to reflect the asymmetric cost of errors. The threshold was set manually. In [1], three scenarios(λ =1, λ =9,λ = 999) were discussed, while legit spam is λ times costly than spam legit. These parameters were applied in the cost sensitive evaluation procedure only, which can measure the efficiency of the learning algorithm, but can not help learn a better result because they did not consider them in the training procedure. In these papers, appropriate settings of parameters were not discussed. We will discuss this point in this paper. It is intuitively to treat spam filtering as a threeway decision problem [15]. Besides the usual decisions include accept an as a legitimate one and reject an as a spam, the third type decision furtherexam for those suspicious spam s is also considered in threeway decisions solution. The idea of threeway decisions making can be found in many areas. In [3], an optimum rejection scheme was derived to safeguard against excessive misclassification in pattern recognition system. In clinical decision making for a certain disease, with options of treating the conditional directly, not treating the condition, or performing a diagnose test to decide whether or not to treat the condition [9]. Yao et al. [13,14] introduced decision theoretic rough set model(dtrs) based on threeway decisions, which considered the cost of each error and Bayesian decision theory. Based on DTRS and Naive Bayesian classifier, a threeway decision approach to spam filtering was proposed recently [15]. Our work is continuation of them. Different with their work, our threeway decision solution is not only suit to Naive Bayesian classifier, but also, and other classifiers. The main advantage of threeway decisions solution is that it allows the possibility of refusing to make a direct decision, which means it can convert some potential misclassifications into rejections, and these s will be furtherexamed by users. Several cost functions are defined to state how costly each decision is, and the final decision can make the overall cost minimal filtering. We apply the threeway decisions solution on several classical classifiers which are used in spam filtering, including Naive Bayesian classifier, classifier and classifier. The tests on several benchmark corpus include LingSpam, PUCorpora and EnronSpam 1 show the efficiency of the threeway decisions solution. We can get a lower weighted error and a better total cost ratio (TCR). 2 ThreeWay Decisions Solution to Filter Spam 2.1 DecisionTheoretic Rough Set Model Decisiontheoretic rough set model was proposed by Yao et al. [13], which is based on Bayesian decision theory. The basic ideas of the theory [14] are reviewed. Let Ω = {ω 1,...,ω s } be a finite set of s states and let A = {a 1,...,a m } be a finite set of m possible actions. Let λ(a i ω j ) denote the cost, for taking 1 All corpus are available from
3 ThreeWay Decisions Solution to Filter Spam 289 action a i when the state is ω j.letp(ω j x) be the conditional probability of an incoming x being in state ω j, suppose action a i is taken. The expected cost associated with taking action a i is given by: R(a i x) = s λ(a i ω j ) p(ω j x). (1) j=1 In rough set theory [10], a set C is approximated by three regions, namely, the positive region POS(C) includes the objects that are sure belong to C, the boundary region BND(C) includes the objects that are possible belong to C, and the negative region NEG(C) includes the objects that are not belong to C. In spam filtering, we have a set of two states Ω = {C, C c } indicating that an is in C (i.e., legitimate) or not in C (i.e., spam), respectively. The set of s can be divided into three regions, POS(C) includes s that are legitimate s, BND(C) includes s that need furtherexam, and NEG(C) includes s that are spam. With respect to these three regions, the set of actions is given by A = {a P,a B,a N },wherea P, a B and a N represent the three actions in classifying an x, namely, deciding x POS(C), deciding x BND(C), and deciding x NEG(C). Six cost functions are imported, λ PP, λ BP and λ NP denote the costs incurred for taking actions a P, a B, a N, respectively, when an belongs to C, andλ PN, λ BN and λ NN denote the costs incurred for taking these actions when the does not belong to C. The expected costs associated with taking different actions for x can be expressed as: R(a P x) =λ PP p(c x)+λ PN p(c c x), R(a B x) =λ BP p(c x)+λ BN p(c c x), R(a N x) =λ NP p(c x)+λ NN p(c c x). (2) The Bayesian decision procedure suggests the following minimumcost decision rules: (P) If R(a P x) R(a B x) andr(a P x) R(a N x), decide x POS(C); (B) If R(a B x) R(a P x) andr(a B x) R(a N x), decide x BND(C); (N) If R(a N x) R(a P x) andr(a N x) R(a B x), decide x NEG(C); Consider a special kind of cost functions with: λ PP λ BP <λ NP, λ NN λ BN <λ PN. (3) That is, the cost of classifying an x being in C into the positive region POS(C) is less than or equal to the cost of classifying x into the boundary region BND(C), and both of these costs are strictly less than the cost of classifying x into the negative region NEG(C). The reverseorder of costs is used for classifying an not in C. Sincep(C x) +p(c c x) = 1, under above condition, we can simplify decision rules (P)(N) as follows:
4 290 X. Jia et al. (P) If p(c x) α and p(c x) γ, decide x POS(C); (B) If p(c x) α and p(c x) β, decide x BND(C); (N) If p(c x) β and p(c x) γ, decide x NEG(C). Where (λ PN λ BN ) α = (λ PN λ BN )+(λ BP λ PP ), (λ BN λ NN ) β = (λ BN λ NN )+(λ NP λ BP ), (λ PN λ NN ) γ = (λ PN λ NN )+(λ NP λ PP ). (4) Each rule is defined by two out of the three parameters. The conditions of rule (B) suggest that α>βmay be a reasonable constraint; it will ensure a welldefined boundary region. If we obtain the following condition on the cost functions [14]: (λ NP λ BP ) (λ BN λ NN ) > λ BP λ PP (λ PN λ BN ), (5) then 0 β<γ<α 1. In this case, after tiebreaking, the following simplified rules are obtained: (P1) If p(c x) α, decide x POS(C); (B1) If β<p(c x) <α, decide x BND(C); (N1) If p(c x) β, decide x NEG(C). The threshold parameters can be systematically calculated from cost functions based on the Bayesian decision theory. 2.2 ThreeWay Decisions Solution Given the cost functions, we can make the proper decisions for incoming s based on the parameters (α, β), which are computed by cost functions, and the probability of each being a legitimate one, which is provided by the running classifier. For an x, if the probability of being a legitimate is p(c x), then the threeway decisions solution is: If p(c x) α, thenx is a legitimate ; If p(c x) β, thenx is a spam; If β<p(c x) <α,thenx needs furtherexam. 3 Experiments on Several Benchmark Corpus In this section, we will check the efficiency of the threeway decisions solution. The followings are the detail of classifiers, corpus and the evaluation methods used in our experiments.
5 ThreeWay Decisions Solution to Filter Spam Classifiers Three classical classifiers (Naive Bayesian classifier, classifier and classifier) are applied in the experiments. The Naive Bayesian classifier can provide the probability of an incoming being a legitimate . Suppose an x is described by a feature vector x =(x 1, x 2,...,x n ), where x 1, x 2,...,x n are the values of attributes of the . Let C denote the legitimate class. Based on Bayes theorem and the theorem of total probability, given the vector of an , the probability of being a legitimate one is: p(c x) = p(c) p(x C), (6) p(x) where p(x) =p(x C) p(c)+p(x C c ) p(c c ). Here p(c) is the prior probability of an being in the legitimate class. p(c) is commonly known as the likelihood of an being in the legitimate class with respect to x. The likelihood p(x C) is a joint probability of p(x 1, x 2,...,x n C). The knearest neighbors algorithm() classifier [8] predicts an s class by a majority vote of its neighbors. Euclidean distance is usually used as the distance metric. Let n(l) denote the legitimate s number in the k nearest neighbors, and n(s) denote the spam s number. So n(l)+n(s) =k, thenweuse sigmoid function to estimate the posterior probability of an x being in the legitimate class. p(c x) = 1 1+exp( a (n(l) n(s))). (7) In our experiments, a =1andk = 17. For classifier, the sigmoid function also be used to estimate the posterior probability: p(c x) = 1 1+exp( a d(x)). (8) While d(x) = w.x+b w, where weighted vector w and threshold b are used to define the hyperplane. a = 6 in our experiments. 3.2 Benchmark Corpus Three benchmark corpus are used in this paper, called LingSpam, PUCorpora and EnronSpam. The corpus LingSpam was preprocessed as three different type of corpus, named LingSpambare, LingSpamlemm and LingSpamstop, respectively. LingSpambare is a kind of words only dataset, each attribute shows if a particular word occurs in the . For LingSpamlemm, a lemmatizer was applied to LingSpam, which means each word was substituted by its base form(e.g.
6 292 X. Jia et al. earning becomes earn ). LingSpamstop is a data set generated by considering stoplist based on LingSpamlemm. The corpus PUCorpora contains PU1, PU2, PU3 and PUA corpora. All corpus are only in bare form: tokens were separated by white characters, but no lemmtizer or stoplist has been applied. Another corpus EnronSpam is divided to 6 datasets, each dataset contains legitimate s from a single user of the Enron corpus, to which fresh spam s with varying legitspam ratios were added. Since the corpus in LingSpam and PUCorpora were divided into 10 datasets, 10fold crossvalidation is used in these corpus. EnronSpam was divided into 6 datasets, 6fold crossvalidation is used in it s experiment. Because we just want to compare the threeway decisions solution with twoway decisions solution based on the same classifiers and the same corpus, then it is nonnecessary to apply feature selection procedures in corpus, all attributes are used in the experiments. 3.3 Measures to Evaluate Performance As a cost sensitive learning problem, Androutsopoulos et al. [1] suggested that using weighted accuracy(or weighted error rate) and total cost ratio to replace the classical accuracy(or error rate) to measure the spam filter performance is reasonable. Let N legit and N spam be the total numbers of legitimate and spam s, to be classified by the filter, and n legit spam the number of s belonging to legitimate class that the filter classified as belonging to spam, n spam legit, reversely. Then the classical accuracy and error rate are defined as: Acc = n legit legit + n spam spam, N legit + N spam (9) Err = n legit spam + n spam legit. N legit + N spam (10) If legit spam is λ times more costly than spam legit, then weighted accuracy(wacc)andweighted error rate(werr) are defined as: WAcc = λ n legit legit + n spam spam λ N legit + N spam, (11) WErr = λ n legit spam + n spam legit λ N legit + N spam. (12) As the values of accuracy and error rate(or their weighted versions) are often misleadingly high, another measure is defined to get a clear picture of a classifier s performance, the ratio of its error rate and that of a simplistic baseline approach. The baseline approach is the filter that never blocks legitimate s and always passes spam s. The weighted error rate of the baseline is:
7 ThreeWay Decisions Solution to Filter Spam 293 The total cost ratio(tcr) is: WErr b = N spam λ N legit + N spam. (13) TCR = WErrb WErr = N spam. (14) λ n legit spam + n spam legit Greater TCR values indicate better performance. For TCR < 1, the baseline is better. If cost is proportional to wasted time, an intuitive meaning for TCR is the following: it measures how much time is wasted to delete manually all spam s when no filter is used, compared to the time wasted to delete manually any spam s that passed the filter plus the time needed to recover from mistakenly blocked legitimate s. For our threeway decisions solution, weighted rejection rate(wrej) is defined to indicate the weighted ratio of s need furtherexam. WRej = λ n legit boundary + n spam boundary λ N legit + N spam, (15) while n legit boundary and n spam boundary mean the numbers of legitimate and spam s being classified to boundary(need furtherexam), WRej = 1 WAcc WErr, actually. For classical classifiers, WRej = Experimental Result In our experiments, three different λ values(λ =1,λ =3,andλ = 9) are applied, same values were considered in [15]. For threeway decisions solution, we set α =0.9 andβ =0.1 to make corresponding decisions. From the results showed in the following 8 tables, we can see that the values of WAcc in threeway decisions solution are lower than that in classical approaches. It is the result of moving some s into the boundary for furtherexam. We can also conclude that the threeway decisions solution decreases the values of WErr and increases the values of TCR from the results, which means the threeway decisions solution can reduce the number of wrong classified s. For those suspicious spam s, it is a better choice to let users decide which is a legitimate or a spam. The increment of TCR shows that the threeway decisions solution gives a better performance. There exists a kind of tradeoff between the error rate and the rejection rate. Users can decrease the error rate by increasing the rejection rate. The most important conclusion we can get from the experiments result is that, the threeway decisions solution can get a better performance than twoway decision solution under the same situation, it does not depend on the parameter λ or some special classifiers.
8 294 X. Jia et al. Table 1. Comparison results on corpora LingSpambare Table 2. Comparison results on corpora LingSpamlemm NB λ = 1 Threeway NB λ = 3 Threeway NB λ = 9 Threeway λ = 1 Threeway λ = 3 Threeway λ = 9 Threeway λ = 1 Threeway λ = 3 Threeway λ = 9 Threeway NB λ = 1 Threeway NB λ = 3 Threeway NB λ = 9 Threeway λ = 1 Threeway λ = 3 Threeway λ = 9 Threeway λ = 1 Threeway λ = 3 Threeway λ = 9 Threeway Table 3. Comparison results on corpora LingSpamstop NB λ = 1 Threeway NB λ = 3 Threeway NB λ = 9 Threeway λ = 1 Threeway λ = 3 Threeway λ = 9 Threeway λ = 1 Threeway λ = 3 Threeway λ = 9 Threeway Table 4. Comparison results on corpora PU1 NB λ = 1 Threeway NB λ = 3 Threeway NB λ = 9 Threeway λ = 1 Threeway λ = 3 Threeway λ = 9 Threeway λ = 1 Threeway λ = 3 Threeway λ = 9 Threeway
9 ThreeWay Decisions Solution to Filter Spam 295 Table 5. Comparison results on corpora PU2 NB λ = 1 Threeway NB λ = 3 Threeway NB λ = 9 Threeway λ = 1 Threeway λ = 3 Threeway λ = 9 Threeway λ = 1 Threeway λ = 3 Threeway λ = 9 Threeway Table 6. Comparison results on corpora PU3 NB λ = 1 Threeway NB λ = 3 Threeway NB λ = 9 Threeway λ = 1 Threeway λ = 3 Threeway λ = 9 Threeway λ = 1 Threeway λ = 3 Threeway λ = 9 Threeway Table 7. Comparison results on corpora PUA NB λ = 1 Threeway NB λ = 3 Threeway NB λ = 9 Threeway λ = 1 Threeway λ = 3 Threeway λ = 9 Threeway λ = 1 Threeway λ = 3 Threeway λ = 9 Threeway Table 8. Comparison results on corpora EnronSpam NB λ = 1 Threeway NB λ = 3 Threeway NB λ = 9 Threeway λ = 1 Threeway λ = 3 Threeway λ = 9 Threeway λ = 1 Threeway λ = 3 Threeway λ = 9 Threeway Conclusions In this paper, spam filtering is seen as a cost sensitive learning problem. From the point of optimum decision view, we propose a threeway decisions solution to filter spam . For those suspicious s, the threeway decisions solution moves them to boundary for furtherexam, which can covert potential misclassification into rejections. The solution can be applied in any classifier only
10 296 X. Jia et al. if the classifier can provide the probability of each mail being a legitimate. Naive Bayesian classifier, and classifiers by considering threeway decisions solution are examined on several corpus, the result show the efficiency of the solution. With considering the threeway decisions solution, we can get a lower weighted error rate and a higher TCR. Acknowledgments. This research is supported by the National Natural Science Foundation of China under Grant No , , and the Fundamental Research Funds for the Central Universities under Grant No. NS References 1. Androutsopoulos, I., Paliouras, G., Karkaletsis, V., Sakkis, G., Spyropoulos, C.D., Stamatopoulos, P.: Learning to filter spam A comparison of a naive bayesian and a memorybased approach. In: 4th European Conference on Principles and Practice of Knowledge Discovery in Databases, pp (2000) 2. Carreras, X., Marquez, L.: Boosting trees for antispam filtering. In: European Conference on Recent Advances in NLP (2001) 3. Chow, C.K.: On optimum recognition error and reject tradeoff. IEEE Transcations on Information Theory 16(1), (1970) 4. Domingos, P., Pazzani, M.: Beyond independece: Conditions for the optimality of the simple Bayesian classifier. In: 13th International Conference on Machine Learning, pp (1996) 5. Drucker, H., Wu, D.H., Vapnik, V.N.: Support vector machines for spam categorization. IEEE Transactions on Neural Networks 10(5), (1999) 6. Elkan, C.: The foundations of costsensitive learning. In: 17th International Joint Conference on Artificial Intelligence, pp (2001) 7. Metsis, V., Androutsopoulos, I., Paliouras, G.: Spam filtering with naive bayeswhich naive bayes? In: 3rd Conference on and AntiSpam (2006) 8. Mitchell, T.M.: Machine Learning. McGrawHill (1997) 9. Pauker, S.G., Kassirer, J.P.: The threshold approach to clinical decision making. New England Journal of Medicine 302, (1980) 10. Pawlak, Z.: Rough sets. International Journal of Computer and Information Sciences 11, (1982) 11. Sahami, M., Dumais, S., Heckerman, D., Horvitz, E.: A bayesian approach to filtering junk . In: Learning for Text CategorizationPapers from the AAAI Workshop, pp (1996) 12. Schneider, K.M.: A comparison of event models for Naive Bayes antispam filtering. In: 10th Conference of the European Chapter of the Association for Computational Linguistics, pp (2003) 13. Yao, Y., Wong, S.K.M., Lingras, P.: A decisiontheoretic rough set model. Methodologies for Intelligent Systems 5, (1992) 14. Yao, Y.: Threeway decisions with probabilistic rough sets. Information Sciences 180, (2010) 15. Zhou, B., Yao, Y., Luo, J.: A ThreeWay Decision Approach to Spam Filtering. In: Farzindar, A., Kešelj, V. (eds.) Canadian AI LNCS, vol. 6085, pp Springer, Heidelberg (2010)
A ThreeWay Decision Approach to Email Spam Filtering
A ThreeWay Decision Approach to Email Spam Filtering Bing Zhou, Yiyu Yao, and Jigang Luo Department of Computer Science, University of Regina Regina, Saskatchewan, Canada S4S 0A2 {zhou200b,yyao,luo226}@cs.uregina.ca
More informationNaive Bayes Spam Filtering Using WordPositionBased Attributes
Naive Bayes Spam Filtering Using WordPositionBased Attributes Johan Hovold Department of Computer Science Lund University Box 118, 221 00 Lund, Sweden johan.hovold.363@student.lu.se Abstract This paper
More informationA MemoryBased Approach to AntiSpam Filtering for Mailing Lists
Information Retrieval, 6, 49 73, 2003 c 2003 Kluwer Academic Publishers. Manufactured in The Netherlands. A MemoryBased Approach to AntiSpam Filtering for Mailing Lists GEORGIOS SAKKIS gsakis@iit.demokritos.gr
More informationCASICT at TREC 2005 SPAM Track: Using NonTextual Information to Improve Spam Filtering Performance
CASICT at TREC 2005 SPAM Track: Using NonTextual Information to Improve Spam Filtering Performance Shen Wang, Bin Wang and Hao Lang, Xueqi Cheng Institute of Computing Technology, Chinese Academy of
More informationA TwoPass Statistical Approach for Automatic Personalized Spam Filtering
A TwoPass Statistical Approach for Automatic Personalized Spam Filtering Khurum Nazir Junejo, Mirza Muhammad Yousaf, and Asim Karim Dept. of Computer Science, Lahore University of Management Sciences
More informationSingleClass Learning for Spam Filtering: An Ensemble Approach
SingleClass Learning for Spam Filtering: An Ensemble Approach TsangHsiang Cheng Department of Business Administration Southern Taiwan University of Technology Tainan, Taiwan, R.O.C. ChihPing Wei Institute
More informationSpam Detection System Combining Cellular Automata and Naive Bayes Classifier
Spam Detection System Combining Cellular Automata and Naive Bayes Classifier F. Barigou*, N. Barigou**, B. Atmani*** Computer Science Department, Faculty of Sciences, University of Oran BP 1524, El M Naouer,
More informationLan, Mingjun and Zhou, Wanlei 2005, Spam filtering based on preference ranking, in Fifth International Conference on Computer and Information
Lan, Mingjun and Zhou, Wanlei 2005, Spam filtering based on preference ranking, in Fifth International Conference on Computer and Information Technology : CIT 2005 : proceedings : 2123 September, 2005,
More informationDeveloping Methods and Heuristics with Low Time Complexities for Filtering Spam Messages
Developing Methods and Heuristics with Low Time Complexities for Filtering Spam Messages Tunga Güngör and Ali Çıltık Boğaziçi University, Computer Engineering Department, Bebek, 34342 İstanbul, Turkey
More informationMachine Learning in Spam Filtering
Machine Learning in Spam Filtering A Crash Course in ML Konstantin Tretyakov kt@ut.ee Institute of Computer Science, University of Tartu Overview Spam is Evil ML for Spam Filtering: General Idea, Problems.
More informationFiltering Junk Mail with A Maximum Entropy Model
Filtering Junk Mail with A Maximum Entropy Model ZHANG Le and YAO Tianshun Institute of Computer Software & Theory. School of Information Science & Engineering, Northeastern University Shenyang, 110004
More informationIMPROVING SPAM EMAIL FILTERING EFFICIENCY USING BAYESIAN BACKWARD APPROACH PROJECT
IMPROVING SPAM EMAIL FILTERING EFFICIENCY USING BAYESIAN BACKWARD APPROACH PROJECT M.SHESHIKALA Assistant Professor, SREC Engineering College,Warangal Email: marthakala08@gmail.com, Abstract Unethical
More informationFiltering Spam EMail from Mixed Arabic and English Messages: A Comparison of Machine Learning Techniques
52 The International Arab Journal of Information Technology, Vol. 6, No. 1, January 2009 Filtering Spam EMail from Mixed Arabic and English Messages: A Comparison of Machine Learning Techniques Alaa ElHalees
More informationFiltering Spam EMail with Generalized Additive Neural Networks
Filtering Spam EMail with Generalized Additive Neural Networks Tiny du Toit School of Computer, Statistical and Mathematical Sciences NorthWest University, Potchefstroom Campus Private Bag X6001, Potchefstroom,
More informationAbstract. Find out if your mortgage rate is too high, NOW. Free Search
Statistics and The War on Spam David Madigan Rutgers University Abstract Text categorization algorithms assign texts to predefined categories. The study of such algorithms has a rich history dating back
More informationTowards better accuracy for Spam predictions
Towards better accuracy for Spam predictions Chengyan Zhao Department of Computer Science University of Toronto Toronto, Ontario, Canada M5S 2E4 czhao@cs.toronto.edu Abstract Spam identification is crucial
More informationDetecting Email Spam Using Spam Word Associations
Detecting Email Spam Using Spam Word Associations N.S. Kumar 1, D.P. Rana 2, R.G.Mehta 3 Sardar Vallabhbhai National Institute of Technology, Surat, India 1 p10co977@coed.svnit.ac.in 2 dpr@coed.svnit.ac.in
More informationEffectiveness and Limitations of Statistical Spam Filters
Effectiveness and Limitations of Statistical Spam Filters M. Tariq Banday, Lifetime Member, CSI P.G. Department of Electronics and Instrumentation Technology University of Kashmir, Srinagar, India Abstract
More informationEmail Classification Using Data Reduction Method
Email Classification Using Data Reduction Method Rafiqul Islam and Yang Xiang, member IEEE School of Information Technology Deakin University, Burwood 3125, Victoria, Australia Abstract Classifying user
More informationDifferential Voting in Case Based Spam Filtering
Differential Voting in Case Based Spam Filtering Deepak P, Delip Rao, Deepak Khemani Department of Computer Science and Engineering Indian Institute of Technology Madras, India deepakswallet@gmail.com,
More informationBayesian Spam Filtering
Bayesian Spam Filtering Ahmed Obied Department of Computer Science University of Calgary amaobied@ucalgary.ca http://www.cpsc.ucalgary.ca/~amaobied Abstract. With the enormous amount of spam messages propagating
More informationA MACHINE LEARNING APPROACH TO SERVERSIDE ANTISPAM EMAIL FILTERING 1 2
UDC 004.75 A MACHINE LEARNING APPROACH TO SERVERSIDE ANTISPAM EMAIL FILTERING 1 2 I. Mashechkin, M. Petrovskiy, A. Rozinkin, S. Gerasimov Computer Science Department, Lomonosov Moscow State University,
More informationThree types of messages: A, B, C. Assume A is the oldest type, and C is the most recent type.
Chronological Sampling for Email Filtering ChingLung Fu 2, Daniel Silver 1, and James Blustein 2 1 Acadia University, Wolfville, Nova Scotia, Canada 2 Dalhousie University, Halifax, Nova Scotia, Canada
More informationTweaking Naïve Bayes classifier for intelligent spam detection
682 Tweaking Naïve Bayes classifier for intelligent spam detection Ankita Raturi 1 and Sunil Pranit Lal 2 1 University of California, Irvine, CA 92697, USA. araturi@uci.edu 2 School of Computing, Information
More informationSimple Language Models for Spam Detection
Simple Language Models for Spam Detection Egidio Terra Faculty of Informatics PUC/RS  Brazil Abstract For this year s Spam track we used classifiers based on language models. These models are used to
More informationUsing Online Linear Classifiers to Filter Spam Emails
BIN WANG 1,2 GARETH J.F. JONES 2 WENFENG PAN 1 Using Online Linear Classifiers to Filter Spam Emails 1 Institute of Computing Technology, Chinese Academy of Sciences, China 2 School of Computing, Dublin
More informationCombining SVM classifiers for email antispam filtering
Combining SVM classifiers for email antispam filtering Ángela Blanco Manuel MartínMerino Abstract Spam, also known as Unsolicited Commercial Email (UCE) is becoming a nightmare for Internet users and
More informationAn Efficient Threephase Email Spam Filtering Technique
An Efficient Threephase Email Filtering Technique Tarek M. Mahmoud 1 *, Alaa Ismail ElNashar 2 *, Tarek AbdElHafeez 3 *, Marwa Khairy 4 * 1, 2, 3 Faculty of science, Computer Sci. Dept., Minia University,
More informationA Proposed Algorithm for Spam Filtering Emails by Hash Table Approach
International Research Journal of Applied and Basic Sciences 2013 Available online at www.irjabs.com ISSN 2251838X / Vol, 4 (9): 24362441 Science Explorer Publications A Proposed Algorithm for Spam Filtering
More informationPSSF: A Novel Statistical Approach for Personalized Serviceside Spam Filtering
2007 IEEE/WIC/ACM International Conference on Web Intelligence PSSF: A Novel Statistical Approach for Personalized Serviceside Spam Filtering Khurum Nazir Juneo Dept. of Computer Science Lahore University
More informationAn Evaluation of Statistical Spam Filtering Techniques
An Evaluation of Statistical Spam Filtering Techniques Le Zhang, Jingbo Zhu, Tianshun Yao Natural Language Processing Laboratory Institute of Computer Software & Theory Northeastern University, China ejoy@xinhuanet.com,
More informationAn Outline of a Theory of Threeway Decisions
An Outline of a Theory of Threeway Decisions Yiyu Yao Department of Computer Science, University of Regina, Regina, Saskatchewan, Canada S4S 0A2 Email: yyao@cs.uregina.ca Abstract. A theory of threeway
More informationSVMBased Spam Filter with Active and Online Learning
SVMBased Spam Filter with Active and Online Learning Qiang Wang Yi Guan Xiaolong Wang School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China Email:{qwang, guanyi,
More information1 Introductory Comments. 2 Bayesian Probability
Introductory Comments First, I would like to point out that I got this material from two sources: The first was a page from Paul Graham s website at www.paulgraham.com/ffb.html, and the second was a paper
More informationAn Experimental Comparison of Naive Bayesian and KeywordBased AntiSpam Filtering with Personal Email Messages
An Experimental Comparison of Naive Bayesian and KeywordBased AntiSpam Filtering with Personal Email Messages Ion Androutsopoulos, John Koutsias, Konstantinos V. Cbandrinos and Constantine D. Spyropoulos
More informationNaïve Bayesian Antispam Filtering Technique for Malay Language
Naïve Bayesian Antispam Filtering Technique for Malay Language Thamarai Subramaniam 1, Hamid A. Jalab 2, Alaa Y. Taqa 3 1,2 Computer System and Technology Department, Faulty of Computer Science and Information
More information2. NAIVE BAYES CLASSIFIERS
Spam Filtering with Naive Bayes Which Naive Bayes? Vangelis Metsis Institute of Informatics and Telecommunications, N.C.S.R. Demokritos, Athens, Greece Ion Androutsopoulos Department of Informatics, Athens
More informationEmail Spam Detection A Machine Learning Approach
Email Spam Detection A Machine Learning Approach Ge Song, Lauren Steimle ABSTRACT Machine learning is a branch of artificial intelligence concerned with the creation and study of systems that can learn
More informationA Personalized Spam Filtering Approach Utilizing Two Separately Trained Filters
2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology A Personalized Spam Filtering Approach Utilizing Two Separately Trained Filters WeiLun Teng, WeiChung Teng
More informationImpact of Feature Selection Technique on Email Classification
Impact of Feature Selection Technique on Email Classification Aakanksha Sharaff, Naresh Kumar Nagwani, and Kunal Swami Abstract Being one of the most powerful and fastest way of communication, the popularity
More informationA Content based Spam Filtering Using Optical Back Propagation Technique
A Content based Spam Filtering Using Optical Back Propagation Technique Sarab M. Hameed 1, Noor Alhuda J. Mohammed 2 Department of Computer Science, College of Science, University of Baghdad  Iraq ABSTRACT
More informationAntiSpam Filter Based on Naïve Bayes, SVM, and KNN model
AI TERM PROJECT GROUP 14 1 AntiSpam Filter Based on,, and model YunNung Chen, CheAn Lu, ChaoYu Huang Abstract spam email filters are a wellknown and powerful type of filters. We construct different
More informationShafzon@yahool.com. Keywords  Algorithm, Artificial immune system, Email Classification, NonSpam, Spam
An Improved AIS Based Email Classification Technique for Spam Detection Ismaila Idris Dept of Cyber Security Science, Fed. Uni. Of Tech. Minna, Niger State Idris.ismaila95@gmail.com Abdulhamid Shafi i
More informationDetecting Spam Bots in Online Social Networking Sites: A Machine Learning Approach
Detecting Spam Bots in Online Social Networking Sites: A Machine Learning Approach Alex Hai Wang College of Information Sciences and Technology, The Pennsylvania State University, Dunmore, PA 18512, USA
More informationA Bayesian Topic Model for Spam Filtering
Journal of Information & Computational Science 1:12 (213) 3719 3727 August 1, 213 Available at http://www.joics.com A Bayesian Topic Model for Spam Filtering Zhiying Zhang, Xu Yu, Lixiang Shi, Li Peng,
More informationClassification algorithm in Data mining: An Overview
Classification algorithm in Data mining: An Overview S.Neelamegam #1, Dr.E.Ramaraj *2 #1 M.phil Scholar, Department of Computer Science and Engineering, Alagappa University, Karaikudi. *2 Professor, Department
More informationArtificial Immune System Based Methods for Spam Filtering
Artificial Immune System Based Methods for Spam Filtering Ying Tan, Guyue Mi, Yuanchun Zhu, and Chao Deng Key Laboratory of Machine Perception (Ministry of Education) Department of Machine Intelligence,
More informationSocial Media Mining. Data Mining Essentials
Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers
More informationINTERNATIONAL JOURNAL OF ADVANCES IN COMPUTING AND INFORMATION TECHNOLOGY An International online open access peer reviewed journal
INTERNATIONAL JOURNAL OF ADVANCES IN COMPUTING AND INFORMATION TECHNOLOGY An International online open access peer reviewed journal Research Article ISSN 2277 9140 ABSTRACT Web page categorization based
More informationA CaseBased Approach to Spam Filtering that Can Track Concept Drift
A CaseBased Approach to Spam Filtering that Can Track Concept Drift Pádraig Cunningham 1, Niamh Nowlan 1, Sarah Jane Delany 2, Mads Haahr 1 1 Department of Computer Science, Trinity College Dublin 2 School
More informationSpam or Not Spam That is the question
Spam or Not Spam That is the question Ravi Kiran S S and Indriyati Atmosukarto {kiran,indria}@cs.washington.edu Abstract Unsolicited commercial email, commonly known as spam has been known to pollute the
More informationSpam Filtering using Naïve Bayesian Classification
Spam Filtering using Naïve Bayesian Classification Presented by: Samer Younes Outline What is spam anyway? Some statistics Why is Spam a Problem Major Techniques for Classifying Spam Transport Level Filtering
More informationWE DEFINE spam as an email message that is unwanted basically
1048 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 10, NO. 5, SEPTEMBER 1999 Support Vector Machines for Spam Categorization Harris Drucker, Senior Member, IEEE, Donghui Wu, Student Member, IEEE, and Vladimir
More informationSpam Filter: VSM based Intelligent Fuzzy Decision Maker
IJCST Vo l. 1, Is s u e 1, Se p te m b e r 2010 ISSN : 09768491(Online Spam Filter: VSM based Intelligent Fuzzy Decision Maker Dr. Sonia YMCA University of Science and Technology, Faridabad, India Email
More informationPredicting the Risk of Heart Attacks using Neural Network and Decision Tree
Predicting the Risk of Heart Attacks using Neural Network and Decision Tree S.Florence 1, N.G.Bhuvaneswari Amma 2, G.Annapoorani 3, K.Malathi 4 PG Scholar, Indian Institute of Information Technology, Srirangam,
More informationThe Optimality of Naive Bayes
The Optimality of Naive Bayes Harry Zhang Faculty of Computer Science University of New Brunswick Fredericton, New Brunswick, Canada email: hzhang@unbca E3B 5A3 Abstract Naive Bayes is one of the most
More informationA Novel Spam Email Detection System Based on Negative Selection
2009 Fourth International Conference on Computer Sciences and Convergence Information Technology A Novel Spam Email Detection System Based on Negative Selection Wanli Ma, Dat Tran, and Dharmendra Sharma
More informationNonParametric Spam Filtering based on knn and LSA
NonParametric Spam Filtering based on knn and LSA Preslav Ivanov Nakov Panayot Markov Dobrikov Abstract. The paper proposes a nonparametric approach to filtering of unsolicited commercial email messages,
More informationAn Approach to Detect Spam Emails by Using Majority Voting
An Approach to Detect Spam Emails by Using Majority Voting Roohi Hussain Department of Computer Engineering, National University of Science and Technology, H12 Islamabad, Pakistan Usman Qamar Faculty,
More informationAn Efficient Twophase Spam Filtering Method Based on Emails Categorization
International Journal of Network Security, Vol.9, No., PP.34 43, July 29 34 An Efficient Twophase Spam Filtering Method Based on Emails Categorization JyhJian Sheu Department of Information Management,
More informationA Novel Classification Approach for C2C ECommerce Fraud Detection
A Novel Classification Approach for C2C ECommerce Fraud Detection *1 Haitao Xiong, 2 Yufeng Ren, 2 Pan Jia *1 School of Computer and Information Engineering, Beijing Technology and Business University,
More informationPaper Classification for Recommendation on Research Support System Papits
IJCSNS International Journal of Computer Science and Network Security, VOL.6 No.5A, May 006 17 Paper Classification for Recommendation on Research Support System Papits Tadachika Ozono, and Toramatsu Shintani,
More informationOn Attacking Statistical Spam Filters
On Attacking Statistical Spam Filters Gregory L. Wittel and S. Felix Wu Department of Computer Science University of California, Davis One Shields Avenue, Davis, CA 95616 USA Abstract. The efforts of antispammers
More informationPrediction of Heart Disease Using Naïve Bayes Algorithm
Prediction of Heart Disease Using Naïve Bayes Algorithm R.Karthiyayini 1, S.Chithaara 2 Assistant Professor, Department of computer Applications, Anna University, BIT campus, Tiruchirapalli, Tamilnadu,
More informationA LVQbased neural network antispam email approach
A LVQbased neural network antispam email approach Zhan Chuan Lu Xianliang Hou Mengshu Zhou Xu College of Computer Science and Engineering of UEST of China, Chengdu, China 610054 zhanchuan@uestc.edu.cn
More informationAn Efficient Spam Filtering Techniques for Email Account
American Journal of Engineering Research (AJER) eissn : 23200847 pissn : 23200936 Volume02, Issue10, pp6373 www.ajer.org Research Paper Open Access An Efficient Spam Filtering Techniques for Email
More informationLearning to classify email
Information Sciences 177 (2007) 2167 2187 www.elsevier.com/locate/ins Learning to classify email Irena Koprinska *, Josiah Poon, James Clark, Jason Chan School of Information Technologies, The University
More informationSender and Receiver Addresses as Cues for AntiSpam Filtering ChihChien Wang
Sender and Receiver Addresses as Cues for AntiSpam Filtering ChihChien Wang Graduate Institute of Information Management National Taipei University 69, Sec. 2, JianGuo N. Rd., Taipei City 10433, Taiwan
More informationEmail Filters that use Spammy Words Only
Email Filters that use Spammy Words Only Vasanth Elavarasan Department of Computer Science University of Texas at Austin Advisors: Mohamed Gouda Department of Computer Science University of Texas at Austin
More informationA crash course in probability and Naïve Bayes classification
Probability theory A crash course in probability and Naïve Bayes classification Chapter 9 Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. s: A person s
More informationHow does the performance of spam filters change with decreasing message length? Adrian Scoică (as2270), Girton College
How does the performance of spam filters change with decreasing message length? Adrian Scoică (as2270), Girton College Total word count (excluding numbers, tables, equations, captions, headings, and references):
More informationA Comparison of Event Models for Naive Bayes AntiSpam EMail Filtering
A Comparison of Event Models for Naive Bayes AntiSpam EMail Filtering KarlMichael Schneider University of Passau Department of General Linguistics Innstr. 40, D94032 Passau schneide@phil.uni passau.de
More informationECUE: A Spam Filter that Uses Machine Learning to Track Concept Drift
ECUE: A Spam Filter that Uses Machine Learning to Track Concept Drift Sarah Jane Delany 1 and Pádraig Cunningham 2 and Barry Smyth 3 Abstract. While text classification has been identified for some time
More informationAn incremental clusterbased approach to spam filtering
Available online at www.sciencedirect.com Expert Systems with Applications Expert Systems with Applications 34 (2008) 1599 1608 www.elsevier.com/locate/eswa An incremental clusterbased approach to spam
More informationGaussian Classifiers CS498
Gaussian Classifiers CS498 Today s lecture The Gaussian Gaussian classifiers A slightly more sophisticated classifier Nearest Neighbors We can classify with nearest neighbors x m 1 m 2 Decision boundary
More informationA Multiobjective Evolutionary Algorithm for Spam Email Filtering
A Multiobjective Evolutionary Algorithm for Spam Email Filtering A.G. LópezHerrera 1, E. HerreraViedma 2, F. Herrera 2 1.Dept. of Computer Sciences, University of Jaén, E23071, Jaén (Spain), aglopez@ujaen.es
More informationRepresentation of Electronic Mail Filtering Profiles: A User Study
Representation of Electronic Mail Filtering Profiles: A User Study Michael J. Pazzani Department of Information and Computer Science University of California, Irvine Irvine, CA 92697 +1 949 824 5888 pazzani@ics.uci.edu
More informationMachine Learning Final Project Spam Email Filtering
Machine Learning Final Project Spam Email Filtering March 2013 Shahar Yifrah Guy Lev Table of Content 1. OVERVIEW... 3 2. DATASET... 3 2.1 SOURCE... 3 2.2 CREATION OF TRAINING AND TEST SETS... 4 2.3 FEATURE
More informationTracking Concept Drift at Feature Selection Stage in SpamHunting: An Antispam InstanceBased Reasoning System
Tracking Concept Drift at Feature Selection Stage in SpamHunting: An Antispam InstanceBased Reasoning System J.R. Méndez 1, F. FdezRiverola 1, E.L. Iglesias 1, F. Díaz 2, and J.M. Corchado 3 1 Dept.
More informationLearning to Filter Unsolicited Commercial EMail
Learning to Filter Unsolicited Commercial EMail Ion Androutsopoulos Athens University of Economics and Business and Georgios Paliouras and Eirinaios Michelakis National Centre for Scientific Research
More informationDomain Classification of Technical Terms Using the Web
Systems and Computers in Japan, Vol. 38, No. 14, 2007 Translated from Denshi Joho Tsushin Gakkai Ronbunshi, Vol. J89D, No. 11, November 2006, pp. 2470 2482 Domain Classification of Technical Terms Using
More informationECUE: A Spam Filter that Uses Machine Learning to Track Concept Drift
ECUE: A Spam Filter that Uses Machine Learning to Track Concept Drift Sarah Jane Delany 1 and Pádraig Cunningham 2 Abstract. While text classification has been identified for some time as a promising application
More informationComparing the Results of Support Vector Machines with Traditional Data Mining Algorithms
Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Scott Pion and Lutz Hamel Abstract This paper presents the results of a series of analyses performed on direct mail
More informationNot So Naïve Online Bayesian Spam Filter
Not So Naïve Online Bayesian Spam Filter Baojun Su Institute of Artificial Intelligence College of Computer Science Zhejiang University Hangzhou 310027, China freizsu@gmail.com Congfu Xu Institute of Artificial
More informationRough Set Theory Approach for Filtering Spams from boundary messages in a Chat System
Rough Set Theory Approach for Filtering Spams from boundary messages in a Chat System Sanjiban Sekhar Roy 1, Saptarshi Charaborty 1, Swapnil Sourav 1 and Ajith Abraham 2,3 1 School of Computing Science
More informationHoodwinking Spam Email Filters
Proceedings of the 2007 WSEAS International Conference on Computer Engineering and Applications, Gold Coast, Australia, January 1719, 2007 533 Hoodwinking Spam Email Filters WANLI MA, DAT TRAN, DHARMENDRA
More informationEcommerce Transaction Anomaly Classification
Ecommerce Transaction Anomaly Classification Minyong Lee minyong@stanford.edu Seunghee Ham sham12@stanford.edu Qiyi Jiang qjiang@stanford.edu I. INTRODUCTION Due to the increasing popularity of ecommerce
More information1 Maximum likelihood estimation
COS 424: Interacting with Data Lecturer: David Blei Lecture #4 Scribes: Wei Ho, Michael Ye February 14, 2008 1 Maximum likelihood estimation 1.1 MLE of a Bernoulli random variable (coin flips) Given N
More informationSpam Detection Approaches with Case Study Implementation on Spam Corpora
194 Chapter 12 Spam Detection Approaches with Case Study Implementation on Spam Corpora Biju Issac Swinburne University of Technology (Sarawak Campus), Malaysia EXECUTIVE SUMMARY Email has been considered
More informationEfficient Spam Email Filtering using Adaptive Ontology
Efficient Spam Email Filtering using Adaptive Ontology Seongwook Youn and Dennis McLeod Computer Science Department, University of Southern California Los Angeles, CA 90089, USA {syoun, mcleod}@usc.edu
More informationMachine Learning Techniques in Spam Filtering
Machine Learning Techniques in Spam Filtering Konstantin Tretyakov, kt@ut.ee Institute of Computer Science, University of Tartu Data Mining Problemoriented Seminar, MTAT.03.177, May 2004, pp. 6079. Abstract
More informationAn Imbalanced Spam Mail Filtering Method
, pp. 119126 http://dx.doi.org/10.14257/ijmue.2015.10.3.12 An Imbalanced Spam Mail Filtering Method Zhiqiang Ma, Rui Yan, Donghong Yuan and Limin Liu (College of Information Engineering, Inner Mongolia
More informationSpam detection with data mining method:
Spam detection with data mining method: Ensemble learning with multiple SVM based classifiers to optimize generalization ability of email spam classification Keywords: ensemble learning, SVM classifier,
More informationRoulette Sampling for CostSensitive Learning
Roulette Sampling for CostSensitive Learning Victor S. Sheng and Charles X. Ling Department of Computer Science, University of Western Ontario, London, Ontario, Canada N6A 5B7 {ssheng,cling}@csd.uwo.ca
More informationA Hybrid ACO Based Feature Selection Method for Email Spam Classification
A Hybrid ACO Based Feature Selection Method for Email Spam Classification KARTHIKA RENUKA D 1, VISALAKSHI P 2 1 Department of Information Technology 2 Department of Electronics and Communication Engineering
More informationPARAMETER MAPPING OF FEATURE SELECTION VIA TAGUCHI METHOD ON EMAIL FILTERING
PARAMETER MAPPING OF FEATURE SELECTION VIA TAGUCHI METHOD ON EMAIL FILTERING 1 NOORMADINAH ALLIAS, 1 MEGAT NORULAZMI MEGAT MOHAMED NOOR, 2 MOHD NAZRI ISMAIL 1 Universiti Kuala Lumpur, Department of Malaysian
More informationEnhancing Scalability in Anomalybased Email Spam Filtering
Enhancing Scalability in Anomalybased Email Spam Filtering Carlos Laorden DeustoTech Computing  S 3 Lab, University of Deusto Avenida de las Universidades 24, 48007 Bilbao, Spain claorden@deusto.es Borja
More informationSpam Filters: Bayes vs. Chisquared; Letters vs. Words
Spam Filters: Bayes vs. Chisquared; Letters vs. Words Cormac O Brien & Carl Vogel Abstract We compare two statistical methods for identifying spam or junk electronic mail. The proliferation of spam email
More informationGood Word Attacks on Statistical Spam Filters
Good Word Attacks on Statistical Spam Filters Daniel Lowd Dept. of Computer Science and Engineering University of Washington Seattle, WA 981952350 lowd@cs.washington.edu Christopher Meek Microsoft Research
More informationSpam Filtering with Naive Bayesian Classification
Spam Filtering with Naive Bayesian Classification Khuong An Nguyen Queens College University of Cambridge L101: Machine Learning for Language Processing MPhil in Advanced Computer Science 09April2011
More information