Detecting Spam Using Spam Word Associations
N.S. Kumar 1, D.P. Rana 2, R.G. Mehta 3
Sardar Vallabhbhai National Institute of Technology, Surat, India
1 p10co977@coed.svnit.ac.in 2 dpr@coed.svnit.ac.in 3 rgm@coed.svnit.ac.in

Abstract
Mailbox management has become a big task. A large proportion of the emails we receive are spam. These unwanted emails clog the inbox and are ubiquitous. Here, a new technique for spam detection is presented that makes use of clustering and of association rules generated by the Apriori algorithm. Vector space notation is used to represent the emails. The results obtained from experiments conducted on the ling-spam dataset demonstrate the effectiveness of the proposed technique.

Keywords
Association rules, Content based spam, Email detection, Email spam, Text clustering, Vector space model

I. INTRODUCTION

Spam is an unfortunate problem on the internet. Spam emails are the emails that we get without our consent. They are typically sent to millions of users at the same time. Spam can be defined as unsolicited (unwanted, junk) email for a recipient, or any email that the user does not want to have in his inbox. It is also defined as follows: "Internet spam is one or more unsolicited messages, sent or posted as part of a larger collection of messages, all having substantially identical content." [1]

Most spam is sent to sell products and services, and spam works because a small number of people choose to respond to it. It costs the sender almost nothing to send millions of spam emails. Spam is a big problem because of the large amount of shared resources it consumes. Spam increases the load on the servers and on the bandwidth of the ISPs, and the added cost of handling this load must be borne by the customers. In addition, the time people spend reading and deleting spam emails is wasted. According to the 2010 statistics [2], 89.1% of all emails were spam. This amounts to about 262 spam emails per day. These are truly large numbers.
This paper is organized as follows. A brief introduction to spam was given in the paragraphs above. Related work in this field is discussed in Section II. The work proposed in this paper is explained in Section III. The results and inferences are presented in Section IV. Lastly, the conclusions are presented in Section V.

II. RELATED WORK

Several solutions to the spam problem involve detection and filtering of spam emails on the client side. Machine learning approaches have been used in the past for this purpose. Examples include Bayesian classifiers such as Naive Bayes [3], [4], [5], [6], as well as C4.5 [7], Ripper [8], Support Vector Machines (SVM) [9] and others. In many of these approaches, Bayesian classifiers were observed to give good results, so they have been widely used in spam filtering software.

A number of techniques make use of clustering as part of their spam detection approach: clustering followed by KNN classification [10], [11], clustering followed by KNN or BIRCH classification [1], and clustering followed by SVM classification [12]. To the best of the authors' knowledge, clustering with association rules has not been used for spam detection in the past.

The vector space model [13] is an algebraic model for representing text documents as vectors of identifiers. Each dimension corresponds to a separate term. If the term occurs in the document, it has some non-zero value in the vector. This is shown in Figure 1. The simplest scheme is to set this value to the number of times a particular word occurs in that document. The drawback of this approach is that some terms occur with very high frequency in most documents and therefore cannot be used to discriminate between documents. The tf-idf weighting scheme is an improvement over the simple scheme. In this scheme, the value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus.
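As an illustration of the tf-idf weighting just described, the following is a minimal Python sketch, not the implementation used in this work (the paper relies on Apache Mahout for vectorization). It assumes the raw term count as tf and the common form idf = log(N / df); the function name `tf_idf_vectors` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one {term: weight} dict per tokenised document.

    Assumption: tf is the raw count of the term in the document and
    idf = log(N / df), where df is the number of documents that
    contain the term and N is the number of documents.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


# Toy corpus (hypothetical): three tokenised messages.
docs = [
    ["free", "lottery", "win", "free"],
    ["meeting", "agenda", "free"],
    ["lottery", "prize", "win"],
]
vectors = tf_idf_vectors(docs)
print(vectors[0])
```

A term that occurs in every document receives weight log(1) = 0, which is how very common terms are filtered out.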
According to this scheme the value is computed as:

$$\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \times \mathrm{idf}(t, D)$$

where tf is the term frequency as discussed above and idf(t, D) is the inverse document frequency, given as:

$$\mathrm{idf}(t, D) = \log \frac{N}{df_t}$$

where N is the number of documents and df_t is the number of documents that contain that term. The tf-idf value is high when the term frequency is high and the document frequency is low, that is, when the inverse document frequency is high. The effect is that the common terms are filtered out.

Figure 1. Vector space notation

Clustering is the assignment of a set of objects into groups (called clusters) so that the objects in the same cluster are more similar to each other than to those in other clusters. It is one of the most important unsupervised learning problems. Document (or text) clustering is a subset of the larger field of data clustering. In our system, clustering is a data reduction step: after the clusters of documents are formed, we select only the 'spammy' clusters, and the subsequent steps are applied only to the selected clusters. This reduces the time and effort needed to perform the entire operation.

K-means [14] is one of the simplest clustering algorithms. It attempts to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. The main idea is to define k centroids, one for each cluster. The choice of the initial centroids affects the final outcome, and ideally they should be far away from each other. The steps in the algorithm are as given below:

1. Arbitrarily choose k documents as the initial centroids.
2. Repeat
   a. (Re)assign each document to the cluster to which it is the most similar, based on the decided similarity measure.
   b. Obtain new centroids for each cluster.
   Until no change.

In distance-based clustering, the similarity criterion is distance: two or more objects belong to the same cluster if they are close according to a given distance. In our case, the measure that we use is the cosine distance between documents. It is the most popular similarity measure applied to text documents. The cosine similarity of two documents is defined by the angle between their feature vectors:

$$\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}$$

where "·" denotes the dot product of the two frequency vectors A and B, and ‖A‖ denotes the length (or norm) of a vector. Document similarity is based on the amount of overlapping content between documents. In general, the resulting similarity ranges from -1, meaning exactly opposite, to 1, meaning exactly the same, with 0 usually indicating independence and in-between values indicating intermediate similarity or dissimilarity. For tf-idf vectors, whose entries are non-negative, the value lies between 0 and 1.

The advantages of the K-means algorithm are that, apart from being simple, it is efficient on large data sets. The disadvantages are that the initial choice of the centroids can give varied outcomes, that it is sensitive to noise and outliers, and that it tends to terminate at local optima. The time complexity of the k-means algorithm is O(nkl), where n is the number of objects, k is the number of clusters, and l is the number of iterations [15].

Association rules are if/then statements that help uncover relationships between seemingly unrelated data in a relational database or other information repository. For example, the rule {milk, bread} → {butter} found in the sales data of a supermarket would indicate that if a customer buys milk and bread together, he or she is also likely to buy butter. Two important terms that go along with association rules are support and confidence. Support is an indication of how frequently the items appear in the database. Confidence indicates how often the if/then statement has been found to be true, that is, the fraction of the records containing the "if" part that also contain the "then" part. Support and confidence thresholds can be changed to control the number of rules that are generated.

Apriori [16] is a classic algorithm for learning association rules. It attempts to find subsets which are common to at least a minimum number C of the item sets. Apriori uses a "bottom up" approach, where frequent subsets are extended one item at a time and groups of candidates are tested against the data. The algorithm terminates when no further successful extensions are found.
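The bottom-up search described above can be sketched in a few lines of Python. This is an illustrative toy version, not the implementation used in this work; it assumes support is expressed as a fraction of transactions rather than as a minimum count C, and the function names `apriori` and `generate_rules` as well as the sample transactions are hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset (frozenset): support} for all frequent itemsets.

    Assumption: support is the fraction of transactions that contain
    the itemset.
    """
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    level = {frozenset([i]) for i in items}
    level = {s for s in level if support(s) >= min_support}
    frequent = {}
    k = 1
    while level:
        frequent.update({s: support(s) for s in level})
        k += 1
        # Join step: extend frequent (k-1)-itemsets by one item.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))
        }
        level = {c for c in candidates if support(c) >= min_support}
    return frequent


def generate_rules(frequent, min_confidence):
    """Yield (antecedent, consequent, confidence) for rules above the threshold."""
    for itemset, supp in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                confidence = supp / frequent[antecedent]
                if confidence >= min_confidence:
                    yield antecedent, itemset - antecedent, confidence


# Hypothetical spam-word sets, one per message.
transactions = [
    {"lottery", "gambling", "win"},
    {"lottery", "gambling"},
    {"lottery", "win"},
    {"free", "offer"},
]
frequent = apriori(transactions, min_support=0.5)
found = list(generate_rules(frequent, min_confidence=0.9))
for antecedent, consequent, confidence in found:
    print(sorted(antecedent), "->", sorted(consequent), confidence)
```

Raising `min_support` or `min_confidence` shrinks the rule set, which is the control knob the text refers to.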
The purpose of the Apriori algorithm is to find associations between different sets of data. In our technique, the associations that we are interested in are those between the spam words that occur in emails.

True positives (TP), true negatives (TN), false positives (FP) and false negatives (FN) are the four possible outcomes when an email is classified by the system. A false positive occurs when an email is incorrectly identified as spam when it is in fact non-spam. A false negative occurs when an email is incorrectly identified as non-spam when it is in fact spam. True positives and true negatives are correct detections.
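The four outcomes can be counted directly from labelled predictions. The sketch below is illustrative only: the label strings, the function names `confusion_counts` and `measures`, and the sample data are assumptions. It also derives the standard precision, recall, specificity and accuracy ratios from the counts, assuming that no denominator is zero.

```python
def confusion_counts(actual, predicted, positive="spam"):
    """Count TP, TN, FP and FN, treating `positive` as the spam label."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1  # spam correctly flagged
            else:
                fp += 1  # legitimate mail flagged as spam
        else:
            if a == positive:
                fn += 1  # spam that slipped through
            else:
                tn += 1  # legitimate mail correctly passed
    return tp, tn, fp, fn


def measures(tp, tn, fp, fn):
    """Standard ratios; assumes every denominator is non-zero."""
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }


# Hypothetical labels for six messages.
actual = ["spam", "spam", "legit", "legit", "legit", "spam"]
predicted = ["spam", "legit", "legit", "spam", "legit", "spam"]
counts = confusion_counts(actual, predicted)
m = measures(*counts)
print(counts, m)
```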
TABLE I
FOUR POSSIBLE OUTCOMES OF ASSOCIATION

Support and confidence are used to control the number of rules generated; they are not the evaluation criteria. Instead, four measures (precision, recall, specificity and accuracy) were used in the evaluation process. They are defined in Table II [1].

TABLE II
EVALUATION MEASURES USED IN THE SYSTEM

Measure | Defined as | What it means
Precision | TP / (TP + FP) | Percentage of positive predictions that are correct
Recall/Sensitivity | TP / (TP + FN) | Percentage of positive labeled instances that were predicted as positive
Specificity | TN / (TN + FP) | Percentage of negative labeled instances that were predicted as negative
Accuracy | (TP + TN) / (TP + TN + FP + FN) | Percentage of predictions that are correct

Accuracy alone is not sufficient to gauge the effectiveness of the system, because we want both the spam and the non-spam emails to be correctly labeled. The accuracy value might be high even though the system labels only one of the two classes correctly. If the precision is high, there are few false positives. If the recall is high, the system recognizes most of the spam messages. A high value of specificity means that very few non-spam messages are classified as spam. A perfect predictor would have 100% sensitivity and 100% specificity.

III. PROPOSED WORK

The purpose of this paper is to present a technique to identify email messages as spam or non-spam. Once that is done, the accuracy of the method is tested to see how many mails were correctly categorized. Email messages are represented as vectors. Clustering is then applied to group together spam and non-spam emails. Then, the Apriori algorithm is applied to generate the association rules. These can then be applied to new mails to see whether they are spam or not. The steps in the proposed system are as indicated in Figure 2.

Apache Mahout [17] contains free implementations of scalable machine learning algorithms.
It contains several algorithms for clustering, classification, pattern mining and frequent item mining, amongst others. We use it to convert the set of documents to the vector space notation and then to cluster the emails. Christian Borgelt's [18] implementation of the Apriori algorithm is used for generating frequent item sets. By varying the values used for support and confidence, the number of rules generated can be controlled. The proposed system is set up on a machine with the following hardware: Intel Centrino Duo 1.66 GHz, 3 GB RAM, 160 GB HDD.

Figure 2. Flowchart of the proposed system
The set of documents is converted into the vector space notation. Entries in the vectors are the tf-idf values. Some emails in the set may be longer than others. In such emails, the same terms are likely to appear more often, i.e. they may have higher term frequencies. In addition, such emails may also contain more terms that can be considered as spam. To compensate for this difference in length, some sort of normalization is needed. L2 normalization [19] is used for this purpose.

Clusters of similar emails are then formed. The K-means algorithm is used for clustering. The clusters may not be distinct spam and non-spam clusters; each may consist of a mixture of spam and non-spam emails. Out of these clusters, the ones having at least 50% spam emails are selected to generate the association rules. This ensures that the set of emails obtained after this step contains a substantial number of spam emails, which will improve the accuracy of the system. As the size of the data to be clustered increases, the number of clusters to be formed (k) should also be increased accordingly to obtain better clusters.

A list of commonly occurring spam words was created. The emails obtained once we have selected the spammy clusters are compared against this list. This forms a filtering step. Only the words that occur in the list are retained and the rest of the text is deleted. At the end of this operation, we get the various combinations in which the spam words occur in the set of emails. This list can be updated periodically to include new words that can be considered as spam words.

Figure 3. Obtaining spam words combination from an email document

At this point we can go ahead and generate the association rules using the Apriori algorithm. Once we have the rules, they can be matched against new emails to decide whether each one is spam or not.

To understand how new emails will be processed, let us take an example. Assume that the words lottery and gambling occur in the list of spam words. So there may be a rule of the form lottery → gambling.
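The filtering and matching steps for a rule such as lottery → gambling can be sketched as follows. This is a simplified illustration under stated assumptions: the spam-word list `SPAM_WORDS`, the whitespace tokenisation, and the function names `spam_word_set` and `is_spam` are hypothetical, and a rule is represented as a pair of word sets.

```python
# Hypothetical spam-word list; the list used in the experiments is not reproduced here.
SPAM_WORDS = {"lottery", "gambling", "win", "prize"}


def spam_word_set(text, spam_words=SPAM_WORDS):
    """Keep only the words of the message that appear in the spam-word list."""
    return {w for w in text.lower().split() if w in spam_words}


def is_spam(text, rules, spam_words=SPAM_WORDS):
    """Flag the message if all words of at least one rule occur in it."""
    present = spam_word_set(text, spam_words)
    return any(antecedent | consequent <= present for antecedent, consequent in rules)


# One rule of the form lottery -> gambling.
rules = [(frozenset({"lottery"}), frozenset({"gambling"}))]

flagged = is_spam("Try online gambling and win the lottery today", rules)
passed = is_spam("The lottery results were discussed in class", rules)
print(flagged, passed)  # True False
```

The second message contains only one of the two words in the rule, so no rule matches and the message is treated as non-spam.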
Any new email that has both these words in its content will be treated as spam. Likewise, several other rules may match for some test email. Emails which are not spam will not contain any words from the list of spam words, or will not contain all the words that form a rule. Such emails will be identified as non-spam by the system.

IV. EXPERIMENTAL RESULT

The corpus used for training and testing is the ling-spam corpus [20]. In ling-spam, there are four subdirectories, corresponding to 4 versions of the corpus:

bare: Lemmatiser disabled, stop-list disabled
lemm: Lemmatiser enabled, stop-list disabled
lemm_stop: Lemmatiser enabled, stop-list enabled
stop: Lemmatiser disabled, stop-list enabled

Here, lemmatizing is similar to stemming, and the stop-list setting indicates whether stop words were removed from the content of the parts. We use the lemm_stop subdirectory in our approach.

Each one of these 4 directories contains 10 subdirectories (part 1, ..., part 10). These correspond to the 10 partitions of the corpus that were used in the 10-fold experiments. In every experiment, 9 partitions were used for training and the remaining partition was used for testing. Each one of the 10 subdirectories contains both spam and legitimate messages, one message in each file. Files whose names have the form spmsg*.txt are spam messages. All other files are legitimate messages. The corpus contains 481 spam messages, which make up 16.6% of the corpus. Final results are obtained by taking the average of the scores obtained in each of the 10 experiments.

One advantage of the ling-spam corpus is its focus on the textual component of the email, as needed in our system. In a real email-filtering system, some part of the message header may be used to improve the classification performance, e.g. the sender's address could be looked up in the address book. Such strategies do not require machine learning and are not the focus of our work here.
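The 10-fold protocol over this directory layout can be sketched as below. This is a minimal illustration, not the scripts used in the experiments: the directory names `part1` to `part10`, the `.txt` suffix for all messages, and the function names `load_part` and `ten_fold_splits` are assumptions. The demonstration builds a tiny stand-in corpus in a temporary directory so that the sketch runs without the real data set.

```python
import tempfile
from pathlib import Path


def load_part(part_dir):
    """Return (text, label) pairs for one partition directory."""
    messages = []
    for path in sorted(Path(part_dir).glob("*.txt")):
        # Files named spmsg*.txt are spam; all others are legitimate.
        label = "spam" if path.name.startswith("spmsg") else "legit"
        messages.append((path.read_text(errors="ignore"), label))
    return messages


def ten_fold_splits(corpus_dir, n_parts=10):
    """Yield (train, test) lists, holding out each partition once."""
    # Assumption: partitions are stored as part1 ... part10.
    parts = [load_part(Path(corpus_dir) / f"part{i}") for i in range(1, n_parts + 1)]
    for held_out in range(n_parts):
        train = [m for i, part in enumerate(parts) if i != held_out for m in part]
        yield train, parts[held_out]


# Stand-in corpus: one spam and one legitimate message per partition.
with tempfile.TemporaryDirectory() as tmp:
    for i in range(1, 11):
        part = Path(tmp) / f"part{i}"
        part.mkdir()
        (part / f"spmsg{i}.txt").write_text("win the lottery")
        (part / f"msg{i}.txt").write_text("call for papers")
    splits = list(ten_fold_splits(tmp))

print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 10 18 2
```

With the real corpus, `corpus_dir` would point at the lemm_stop subdirectory, and the scores from the ten (train, test) pairs would be averaged.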
Results of conducting the experiments on the ling-spam data set are shown in the table below. Values are the average of 10 experiments, each using 1 of the 10 parts for testing and the other 9 for training.

TABLE III
RESULTS OF APPLYING THE PROPOSED APPROACH ON THE LING-SPAM DATASET

Measure | Average Value
Precision | 60.10%
Recall | 71.60%
Specificity | 92.28%
Accuracy | 89.31%
As seen in the table above, 89.31% of all emails were correctly identified. The precision value is on the lower side, so a noticeable number of false positives are being generated. Recall is also low, which indicates that not all spam messages are recognized. Since the specificity is high, most non-spam emails are recognized as non-spam.

V. CONCLUSION

In this paper, a new technique to detect spam emails using clustering and association rules was suggested. Clustering is used as a data reduction step to find the spammy clusters out of all the emails. After the doubtful clusters are identified, association rules can be generated for such clusters. Using these rules, we can then classify an incoming email as spam or non-spam. As part of future work, the system can be made truly dynamic by automating the entire process on the server side.

REFERENCES

[1] Prabhakar, R. and Basavaraju, M. A Novel Method of Spam Mail Detection Using Text Based Clustering Approach. Phil. Trans. Roy. Soc. London, vol. A247.
[2] Internet 2010 in Numbers. Internet: [Mar. 23, 2012].
[3] Androutsopoulos, I., Chandrinos, K., Koutsias, J., Paliouras, G. and Spyropoulos, C. An Evaluation of Naive Bayesian Anti-spam Filtering, in Proceedings of the Workshop on Machine Learning in the New Information Age: 11th European Conference on Machine Learning (ECML 2000).
[4] Bogofilter. Internet: [Mar. 23, 2012].
[5] Graham, P. Better Bayesian Filtering. Internet: [Mar. 23, 2012].
[6] Dumais, S., Heckerman, D., Horvitz, E. and Sahami, M. A Bayesian Approach to Filtering Junk Email, in Proceedings of the AAAI-98 Workshop on Learning for Text Categorization.
[7] Quinlan, J.R. C4.5: Programs for Machine Learning, 1st ed., San Mateo, CA: Morgan Kaufmann.
[8] Cohen, W.W. Learning Rules that Classify Email, in Proceedings of the 1996 AAAI Spring Symposium on Machine Learning in Information Access.
[9] Drucker, H. Support Vector Machines for Spam Categorization, in IEEE Transactions on Neural Networks.
[10] Firte, L., Lemnaru, C. and Potolea, R. Spam Detection Filter Using KNN Algorithm and Resampling, in Intelligent Computer Communication and Processing (ICCP), 2010 IEEE International Conference.
[11] Alguliev, R.M., Aliguliyev, R.M. and Nazirova, S.A. Classification of Textual Email Spam Using Data Mining Techniques. Applied Computational Intelligence and Soft Computing, vol. 2011, 8 pages.
[12] Kyriakopoulou, A. and Kalamboukis, T. Text Classification Using Clustering, in ECML-PKDD Discovery Challenge Workshop Proceedings.
[13] Vector Space Model. Internet: Vector_space_model [Mar. 23, 2012].
[14] MacQueen, J. Some Methods for Classification and Analysis of Multivariate Observations. Proc. Fifth Berkeley Sympos. Math. Statist. and Probability (Berkeley, Calif., 1965/66), Vol. I: Statistics.
[15] Manning, C.D., Raghavan, P. and Schutze, H. Introduction to Information Retrieval, 1st ed., Cambridge, England: Cambridge University Press.
[16] Agrawal, R. and Srikant, R. Fast Algorithms for Mining Association Rules in Large Databases, in Proceedings of the 20th International Conference on Very Large Data Bases, VLDB, Santiago, Chile.
[17] Apache Mahout: Scalable Machine Learning and Data Mining. Internet: [Mar. 23, 2012].
[18] Apriori - Association Rule Induction / Frequent Item Set Mining. Internet: [Mar. 23, 2012].
[19] Lp space. Internet: [Mar. 23, 2012].
[20] Ling-Spam data set. Internet: [Mar. 23, 2012].
More informationSpam Detection System Combining Cellular Automata and Naive Bayes Classifier
Spam Detection System Combining Cellular Automata and Naive Bayes Classifier F. Barigou*, N. Barigou**, B. Atmani*** Computer Science Department, Faculty of Sciences, University of Oran BP 1524, El M Naouer,
More informationSpam Filter: VSM based Intelligent Fuzzy Decision Maker
IJCST Vo l. 1, Is s u e 1, Se p te m b e r 2010 ISSN : 0976-8491(Online Spam Filter: VSM based Intelligent Fuzzy Decision Maker Dr. Sonia YMCA University of Science and Technology, Faridabad, India E-mail
More informationSpam Filter Optimality Based on Signal Detection Theory
Spam Filter Optimality Based on Signal Detection Theory ABSTRACT Singh Kuldeep NTNU, Norway HUT, Finland kuldeep@unik.no Md. Sadek Ferdous NTNU, Norway University of Tartu, Estonia sadek@unik.no Unsolicited
More informationHow To Filter Spam Image From A Picture By Color Or Color
Image Content-Based Email Spam Image Filtering Jianyi Wang and Kazuki Katagishi Abstract With the population of Internet around the world, email has become one of the main methods of communication among
More informationInternational Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015
RESEARCH ARTICLE OPEN ACCESS Data Mining Technology for Efficient Network Security Management Ankit Naik [1], S.W. Ahmad [2] Student [1], Assistant Professor [2] Department of Computer Science and Engineering
More informationHow To Solve The Kd Cup 2010 Challenge
A Lightweight Solution to the Educational Data Mining Challenge Kun Liu Yan Xing Faculty of Automation Guangdong University of Technology Guangzhou, 510090, China catch0327@yahoo.com yanxing@gdut.edu.cn
More informationAccelerating Techniques for Rapid Mitigation of Phishing and Spam Emails
Accelerating Techniques for Rapid Mitigation of Phishing and Spam Emails Pranil Gupta, Ajay Nagrale and Shambhu Upadhyaya Computer Science and Engineering University at Buffalo Buffalo, NY 14260 {pagupta,
More informationDATA MINING TECHNIQUES AND APPLICATIONS
DATA MINING TECHNIQUES AND APPLICATIONS Mrs. Bharati M. Ramageri, Lecturer Modern Institute of Information Technology and Research, Department of Computer Application, Yamunanagar, Nigdi Pune, Maharashtra,
More informationSURVEY OF TEXT CLASSIFICATION ALGORITHMS FOR SPAM FILTERING
I J I T E ISSN: 2229-7367 3(1-2), 2012, pp. 233-237 SURVEY OF TEXT CLASSIFICATION ALGORITHMS FOR SPAM FILTERING K. SARULADHA 1 AND L. SASIREKA 2 1 Assistant Professor, Department of Computer Science and
More informationSpam detection with data mining method:
Spam detection with data mining method: Ensemble learning with multiple SVM based classifiers to optimize generalization ability of email spam classification Keywords: ensemble learning, SVM classifier,
More informationAdaptive Filtering of SPAM
Adaptive Filtering of SPAM L. Pelletier, J. Almhana, V. Choulakian GRETI, University of Moncton Moncton, N.B.,Canada E1A 3E9 {elp6880, almhanaj, choulav}@umoncton.ca Abstract In this paper, we present
More informationTop 10 Algorithms in Data Mining
Top 10 Algorithms in Data Mining Xindong Wu ( 吴 信 东 ) Department of Computer Science University of Vermont, USA; 合 肥 工 业 大 学 计 算 机 与 信 息 学 院 1 Top 10 Algorithms in Data Mining by the IEEE ICDM Conference
More informationOverview. Evaluation Connectionist and Statistical Language Processing. Test and Validation Set. Training and Test Set
Overview Evaluation Connectionist and Statistical Language Processing Frank Keller keller@coli.uni-sb.de Computerlinguistik Universität des Saarlandes training set, validation set, test set holdout, stratification
More informationEmail Spam Detection A Machine Learning Approach
Email Spam Detection A Machine Learning Approach Ge Song, Lauren Steimle ABSTRACT Machine learning is a branch of artificial intelligence concerned with the creation and study of systems that can learn
More informationSender and Receiver Addresses as Cues for Anti-Spam Filtering Chih-Chien Wang
Sender and Receiver Addresses as Cues for Anti-Spam Filtering Chih-Chien Wang Graduate Institute of Information Management National Taipei University 69, Sec. 2, JianGuo N. Rd., Taipei City 104-33, Taiwan
More informationA Novel Technique of Email Classification for Spam Detection
A Novel Technique of Email Classification for Spam Detection Vinod Patidar Student (M. Tech.), CSE Department, BUIT Divakar singh HOD, CSE Department, BUIT Anju Singh Assistant Professor, IT Department,
More informationAn Imbalanced Spam Mail Filtering Method
, pp. 119-126 http://dx.doi.org/10.14257/ijmue.2015.10.3.12 An Imbalanced Spam Mail Filtering Method Zhiqiang Ma, Rui Yan, Donghong Yuan and Limin Liu (College of Information Engineering, Inner Mongolia
More informationLarge-Scale Data Sets Clustering Based on MapReduce and Hadoop
Journal of Computational Information Systems 7: 16 (2011) 5956-5963 Available at http://www.jofcis.com Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE
More informationKnowledge Discovery from patents using KMX Text Analytics
Knowledge Discovery from patents using KMX Text Analytics Dr. Anton Heijs anton.heijs@treparel.com Treparel Abstract In this white paper we discuss how the KMX technology of Treparel can help searchers
More informationPredicting the Risk of Heart Attacks using Neural Network and Decision Tree
Predicting the Risk of Heart Attacks using Neural Network and Decision Tree S.Florence 1, N.G.Bhuvaneswari Amma 2, G.Annapoorani 3, K.Malathi 4 PG Scholar, Indian Institute of Information Technology, Srirangam,
More informationDetecting Spam Bots in Online Social Networking Sites: A Machine Learning Approach
Detecting Spam Bots in Online Social Networking Sites: A Machine Learning Approach Alex Hai Wang College of Information Sciences and Technology, The Pennsylvania State University, Dunmore, PA 18512, USA
More informationHow To Filter Spam With A Poa
A Multiobjective Evolutionary Algorithm for Spam E-mail Filtering A.G. López-Herrera 1, E. Herrera-Viedma 2, F. Herrera 2 1.Dept. of Computer Sciences, University of Jaén, E-23071, Jaén (Spain), aglopez@ujaen.es
More informationUsing Biased Discriminant Analysis for Email Filtering
Using Biased Discriminant Analysis for Email Filtering Juan Carlos Gomez 1 and Marie-Francine Moens 2 1 ITESM, Eugenio Garza Sada 2501, Monterrey NL 64849, Mexico juancarlos.gomez@invitados.itesm.mx 2
More informationTop Top 10 Algorithms in Data Mining
ICDM 06 Panel on Top Top 10 Algorithms in Data Mining 1. The 3-step identification process 2. The 18 identified candidates 3. Algorithm presentations 4. Top 10 algorithms: summary 5. Open discussions ICDM
More informationInternational Journal of Research in Advent Technology Available Online at: http://www.ijrat.org
IMPROVING PEFORMANCE OF BAYESIAN SPAM FILTER Firozbhai Ahamadbhai Sherasiya 1, Prof. Upen Nathwani 2 1 2 Computer Engineering Department 1 2 Noble Group of Institutions 1 firozsherasiya@gmail.com ABSTARCT:
More informationEfficient Spam Email Filtering using Adaptive Ontology
Efficient Spam Email Filtering using Adaptive Ontology Seongwook Youn and Dennis McLeod Computer Science Department, University of Southern California Los Angeles, CA 90089, USA {syoun, mcleod}@usc.edu
More informationRobust Outlier Detection Technique in Data Mining: A Univariate Approach
Robust Outlier Detection Technique in Data Mining: A Univariate Approach Singh Vijendra and Pathak Shivani Faculty of Engineering and Technology Mody Institute of Technology and Science Lakshmangarh, Sikar,
More informationIt is designed to resist the spam in the Internet. It can provide the convenience to the email user and save the bandwidth of the network.
1. Abstract: Our filter program is a JavaTM 2 SDK, Standard Edition Version 1.5.0 (J2SE) based application, which can be running on the machine that has installed JDK 1.5.0. It can integrate with a JavaServer
More informationThe Enron Corpus: A New Dataset for Email Classification Research
The Enron Corpus: A New Dataset for Email Classification Research Bryan Klimt and Yiming Yang Language Technologies Institute Carnegie Mellon University Pittsburgh, PA 15213-8213, USA {bklimt,yiming}@cs.cmu.edu
More informationECUE: A Spam Filter that Uses Machine Learning to Track Concept Drift
ECUE: A Spam Filter that Uses Machine Learning to Track Concept Drift Sarah Jane Delany 1 and Pádraig Cunningham 2 and Barry Smyth 3 Abstract. While text classification has been identified for some time
More informationEffectiveness and Limitations of Statistical Spam Filters
Effectiveness and Limitations of Statistical Spam Filters M. Tariq Banday, Lifetime Member, CSI P.G. Department of Electronics and Instrumentation Technology University of Kashmir, Srinagar, India Abstract
More informationInternational Journal of World Research, Vol: I Issue XIII, December 2008, Print ISSN: 2347-937X DATA MINING TECHNIQUES AND STOCK MARKET
DATA MINING TECHNIQUES AND STOCK MARKET Mr. Rahul Thakkar, Lecturer and HOD, Naran Lala College of Professional & Applied Sciences, Navsari ABSTRACT Without trading in a stock market we can t understand
More informationStatistical Feature Selection Techniques for Arabic Text Categorization
Statistical Feature Selection Techniques for Arabic Text Categorization Rehab M. Duwairi Department of Computer Information Systems Jordan University of Science and Technology Irbid 22110 Jordan Tel. +962-2-7201000
More informationClustering Technique in Data Mining for Text Documents
Clustering Technique in Data Mining for Text Documents Ms.J.Sathya Priya Assistant Professor Dept Of Information Technology. Velammal Engineering College. Chennai. Ms.S.Priyadharshini Assistant Professor
More informationInternational Journal of Advance Research in Computer Science and Management Studies
Volume 2, Issue 12, December 2014 ISSN: 2321 7782 (Online) International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online
More informationContent-Based Spam Filtering and Detection Algorithms- An Efficient Analysis & Comparison
Content-Based Spam Filtering and Detection Algorithms- An Efficient Analysis & Comparison 1 R.Malarvizhi, 2 K.Saraswathi 1 Research scholar, PG & Research Department of Computer Science, Government Arts
More informationSpam Filtering using Spam Mail Communities
Spam Filtering using Spam Mail Communities Deepak P 1, Jyothi John 1, Sandeep Parameswaran 2 1 Model Engg: College, Kochi, Kerala, India 2 IBM Global Services India Pvt. Ltd., Bangalore, India deepak-p@eth.net,
More informationUsing Data Mining Methods to Predict Personally Identifiable Information in Emails
Using Data Mining Methods to Predict Personally Identifiable Information in Emails Liqiang Geng 1, Larry Korba 1, Xin Wang, Yunli Wang 1, Hongyu Liu 1, Yonghua You 1 1 Institute of Information Technology,
More informationPredicting Student Performance by Using Data Mining Methods for Classification
BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 13, No 1 Sofia 2013 Print ISSN: 1311-9702; Online ISSN: 1314-4081 DOI: 10.2478/cait-2013-0006 Predicting Student Performance
More informationData Mining Algorithms Part 1. Dejan Sarka
Data Mining Algorithms Part 1 Dejan Sarka Join the conversation on Twitter: @DevWeek #DW2015 Instructor Bio Dejan Sarka (dsarka@solidq.com) 30 years of experience SQL Server MVP, MCT, 13 books 7+ courses
More informationSVM-Based Spam Filter with Active and Online Learning
SVM-Based Spam Filter with Active and Online Learning Qiang Wang Yi Guan Xiaolong Wang School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China Email:{qwang, guanyi,
More informationSpam Filters: Bayes vs. Chi-squared; Letters vs. Words
Spam Filters: Bayes vs. Chi-squared; Letters vs. Words Cormac O Brien & Carl Vogel Abstract We compare two statistical methods for identifying spam or junk electronic mail. The proliferation of spam email
More informationSingle-Class Learning for Spam Filtering: An Ensemble Approach
Single-Class Learning for Spam Filtering: An Ensemble Approach Tsang-Hsiang Cheng Department of Business Administration Southern Taiwan University of Technology Tainan, Taiwan, R.O.C. Chih-Ping Wei Institute
More informationSpam Filtering Based On The Analysis Of Text Information Embedded Into Images
Journal of Machine Learning Research 7 (2006) 2699-2720 Submitted 3/06; Revised 9/06; Published 12/06 Spam Filtering Based On The Analysis Of Text Information Embedded Into Images Giorgio Fumera Ignazio
More informationEmail Spam Detection Using Customized SimHash Function
International Journal of Research Studies in Computer Science and Engineering (IJRSCSE) Volume 1, Issue 8, December 2014, PP 35-40 ISSN 2349-4840 (Print) & ISSN 2349-4859 (Online) www.arcjournals.org Email
More informationA Collaborative Approach to Anti-Spam
A Collaborative Approach to Anti-Spam Chia-Mei Chen National Sun Yat-Sen University TWCERT/CC, Taiwan Agenda Introduction Proposed Approach System Demonstration Experiments Conclusion 1 Problems of Spam
More information