A Spam Filtering Technique using SVM Light Tool
|
|
|
- Erick Wilcox
- 10 years ago
- Views:
Transcription
1 A Spam Filtering Technique using SVM Light Tool Agrawal Rammohan Ashok 1, Gaurav Shrivastav 2 1 M. Tech. CSE, RKDFIST, Bhopal, India 2 Assistant Professor, CSE, RKDFIST, Bhopal, India Abstract: Spam has become a headache of the today s internet causing many problems to all users on Internet. Spam s not only consume computing resources, money but are frustrating. A lot of money is being spent/ lost due to spam globally each year. Data mining approaches for content based spam filtering have been seen promising. Filtering is the one of the important technique to stop spam. Now-a-days there is a tremendous need of some trust-worthy and adaptive spam filtering system in the market which should have the ability to react quickly to the real time changes and provide fast and qualitative self-tuning in accordance with a new set of features. This paper explores the use of support vector machines for classifying and filtering spam messages and also to improve the accuracy and time performance. Also a comparative analysis among the algorithms is also being presented here with a view to get the optimized results in spam filtering. Keywords: Data Mining, Machine Learning, Spam Classification, Spam Filtering, Support Vector Machine 1. Introduction Spam is flooding the internet with many copies of the same message or unwanted messages in an attempt to force the message on people who would not otherwise choose to receive it. Mostly spam is commercial advertising. Spam costs the sender very little to send and most of the costs are paid by recipients or the carriers rather than by the sender. spam is a subset of spam that involves nearly identical messages sent to numerous recipients by [1]. Day by day the amount of incoming spam increase, scammer attacks are becoming targeted & consequently more of a threat. Below is a graphic that s based on spam data collected by Symantec s Message Labs. It shows that global spam volumes fell and spiked fairly regularly, from highs of 6 trillion messages sent per month to just below 1 trillion, but still the problem is not being solved completely and some measures are to be adopted in order to solve this issues. This graph is based on Symantec s raw spam data [3] computed with the available global spam data. Figure 1: Spam volumes from to [3] Spam should be figure out with some new effective approaches otherwise few problems may arise as such: Consumes computing resources and time. Reduces effectiveness of legitimate advertising. Cost Shifting. Fraud with Identity theft. percentage of that is spam coming from users of the top three services: Spam percentage recorded were Yahoo (22%), Microsoft (11%) & Google (6%). Figure 2: Spam Percentage by Service Providers [3] 2. Literature Survey Spam mail is sent to a group of recipients who have not requested it. These mails are causing many problems such as filling mailboxes, engulfing important personal mail, wasting network bandwidth, consuming users time and energy to sort through it and many more[11]. According to a series of surveys conducted by CAUBE.AU 1, the number of total spams received by 41 addresses has increased by a factor of six in two years (from 1753 spams in 2000 to 10,847 spams in 2001)[14]. Therefore it is challenging to develop spam filters that can effectively eliminate the increasing volumes of unwanted mails automatically before they enter a user's mailbox. D. Puniskis [12] in his research applied the neural network approach to the classification of spam. His method employs attributes composed of descriptive characteristics of the evasive patterns that spammers employ rather than using the context or frequency of keywords in the message. The data used is corpus of 2788 legitimate and 1812 spam s received during a period of several months. The result shows that ANN is good and ANN is not suitable for using alone as a spam filtering tool. Spam volumes are region dependent and providers you rely on for and connections to the internet. The following is a Spam data from Cloudmark, a San Francisco-based security firm. Their data (shown in the figure below) paint a very interesting picture of the difference in In [13] data was classified using four different classifiers (Neural Network, SVM classifier, Native Bayesian Classifier, and J48 classifier). The experiment was performed based on different data size and feature size. The final classification result should be 1 if it is spam otherwise Paper ID:
2 0. This paper shows that simple J48 classifier which make a binary tree, could be efficient for the dataset which could be classified as binary tree. In [2] author provides a comprehensive analysis of various classifiers using different software tools viz. WEKA, RapidMiner which were to be implemented on a common dataset for implementing the supervised learning approach for spam classification. 3. SPAM Filtering Techniques Depending on used techniques spam filtering methods [1] are generally divided into two categories: 1. Avoid spam distribution in their origins methods 2. Avoid spam at destination point methods. Spam must be detected initially by using some spam detection [10] techniques in order to lower the costs of implementation and then filtered to improve the results using some filtering techniques as: Hiding contact information Looking for an filtering software 3.1 Theoretical Approach for Spam Filtering [1] Depending on used theoretical approaches spam filtering methods are divided into traditional, learning-based and hybrid methods. In traditional methods the classification model or the data (rights, patterns, keywords, lists of IP addresses of servers), based on which messages are classified, is defined by expert. The data storage collected by experts is called as the knowledge base. There are also used trusted and mistrusted senders lists, which help to select legal mail. Actually it makes sense only creation of the white list, because spammers use fictitious addresses. This technique can t represent itself as a highgrade anti-spam filter, but can reduce considerably amount of false operations, being a part of filtration system based on other classification methods. In learning-based methods the classification model is developed using Data Mining techniques. There are some problems from the point of view of data mining as changing of spam content with time, the proportion of spam to legitimate mail, insufficient amount of training data are characteristic for learning based methods. Some Traditional methods [1] are: Acceptance of sender as a spammer. Verification of sender mail address & domain name. Content Filtering. providers get to the Black lists. For such systems it is actually impossible to estimate the level of false positives (the legitimate wrongly classified as spam) on real mail streams Verifying sender mail address & domain name: This is the simplest traditional method of filtration if DNS request s name is same with the domain name of sender. But spammers can use real addresses so that current method becomes ineffective. In this case it may be verified with possibility of sending the message from current IP address. Firstly, the Sender ID technology [4] can be used where sender s address is protected from falsification by means of publishing the policy of domain name use in DNS. Secondly, there can be used SPF (Sender Policy Framework) technology [5], where DNS protocol is used for verification of sender s address. The principle is that if domain s owner wants support SPF verification, then he adds special entry to DNS entry of his domain, where indicates the release of SPF and ranges of IP addresses from where may become an from users of current domain. Traditional method s disadvantages are: Necessary to update the knowledge base regularly. Dependence on update suppliers. Low Security level. Dependence on natural language of correspondence. Low level of detection because of general models of classification Traditional Content Filtering: This technique was used to filter the contents of the received mails. A mail may be pdf file, an excel document, an image or even an mp3 file. The problems with content filtering were: Low cost to evasion: Spammers can easily alter features of an s content. Customized s are easy to generate: Content-based filters need fuzzy hashes over content, etc. High cost to filter maintainers: Filters must be continually updated as content-changing techniques become more sophisticated. 3.3 Learning Based Methods An actively developed intellectual method based on few data mining algorithms for filtration divide the object to some categories using classification model previously defined on the base precedential information. Assume spam filtration is defined by the function 3.2 Acceptance of Sender as a Spammer These methods rely on different blackhole lists of IP and e- mail addresses. It is possible to apply own blackhole and white lists or to use RBL services (Real-time Blackhole List) Where m is a classified mail, m is a vector of parameters and DNSBL (DNS-based Blackhole List) for address verification. Advantage of these methods is detection of and is spam and legitimate respectively. spam in early step of mail receiving process. Disadvantage is Many spam filters based on classification using machine that the policy of addition and deletion of addresses is not learning techniques. In learning-based methods the vector of always transparent. Often the whole subnets belonging to parameters is a result of classification trainings on previously collected s. Paper ID:
3 various filtering techniques and get the advantages of the traditional and learning-based methods. Where m1, m2 mn are previously collected messages, y1, y2 are the corresponding labels and, Z is the training function. The following types are belonged to learning-based methods. 1) Image-based spam filtering. Spammers embed the message into the image and then attach it to the mail. Some traditional methods based on analysis of text-based information do not work in this case. Image filtering process is costly and time-consuming work. In the paper [9] it is proposed three-layer (Mail Header Classifier, the Image Header Classifier and the Visual Feature Classifier) image-spam filtering. In the First layer it is applied Bayesian classifier and SVM classifier in the remaining layers. 2) Bag of words Model. The bag-of-words model is a simplifying assumption used in natural language processing and information retrieval. In this model, a text (such as a sentence or a document) is represented as an unordered collection of words, disregarding grammar and even word order [1, 6]. In spam filtering two bags of words are considered. One bag is filled with word found in spam s, and the other bag is filled with words met in legitimate s. Considering as a pile of words from one of these bags, there used Bayesian probability to determine to which bag this belongs. K-Nearest neighbor, SVM, boosting classifiers may also be applicable to the bag of words. 3) Collaborative spam filtering: This is gathering spam reports between P2P users or from mail server [1, 4, 5]. The collaborative centralized spam filtration is more economic in comparison with personal approach, but only under condition of presence of adequate procedures of the analysis of false operations and operative reclassification of not correctly classified messages. 4) Social networking against spam: This is a one of the latest methods where the information extracted from social networks is used to fight spammers [1, 5]. So in case of learning-based methods user defines the classification model himself, so that the majority disadvantages of traditional methods are solved successfully; intellectual methods are autonomous, independent on external knowledge base, doesn t require regular update, multilingual, independent of natural language, able to study new types of spam user-aided. There is advantage as construction of personalized mail classification model, where user himself defines which mail is legal or which one is a spam. Therefore learningbased methods have higher rank in spam determination. In many spam filtration systems based on the learningbased methods the Baye s theorem, Marcov s chain and others are successfully applied. 3.4 Smart Spam Filtering Method 4. Support Vector Machines Support Vector Machines [7, 8] has been recently proposed by Dr. V. Vapnik as an effective statistical learning method for pattern recognition. SVM based on statistical learning theory has proved many advantages and is different from previous nonparametric techniques such as nearestneighbors. It operates on induction principle called Structural risk minimization, which can overcome the problem of over fitting and local minimum and gain better generalization capability. Kernel function when applied, doesn t increase the computational complexity, furthermore overcomes the curse of dimensionality problem effectively. SVM has demonstrated higher generalization capabilities [8] in high dimensional space and spare samples. Its essence is to map optimal separating hyper plane that can correctly classify all samples. SVM has proved to be one of the most efficient kernel methods. Unlike many learning algorithms, SVM leads to good performances without the need to incorporate prior information. Moreover, the use of positive definite kernel in the SVM can be interpreted as an embedding of the input space into a high dimensional feature space where the classification is carried out without using explicitly this feature space. Hence, the problem of choosing architecture for a neural network application is replaced by the problem of choosing a suitable kernel for a SVM. Many SVM uses kernel functions but we are to incorporate some efficient and time-saving approaches to stop the spam traffic. It s a great deal of implementing such kernel functions. One such way is through classification and the building the incoming data in the SVM light format directly which will result in time saving at the end side, thus optimizing the performance. 4.1 Classification through SVM The idea of SVM classification [15] is the same as that of the preceptron: find a linear separation boundary wt x + b = 0 that correctly classifies training samples (such that a boundary exists). The difference from the perceptron here is that we don t search for any separating hyper plane, but for a very special maximal margin separating hyper plane, for which the distance to the closest training sample is maximal. Definition: Let X = {(xi, ci)}, xi 2 Rm, ci 2 { 1, +1} denote as usually the set of training samples. Suppose (w, b) is a separating hyper plane (i.e. sign(wt xi+b) = ci for all i). Define the margin mi of a training sample (xi, ci) with respect to the separating hyper plane as the distance from point xi to the hyper plane: mi = wt xi + b kwk. The margin m of the separating hyper plane with respect to the whole training set X is the smallest margin of an instance in the training set: m = min I min. Finally, the maximal margin separating hyper plane for a training set X is the separating hyper plane having the maximal margin with respect to the training set. One of the latest approaches in spam filtering is hybrid filtration system which is a combination of different algorithms, especially if they use some of the unrelated features to produce a solution. In this case it can be applied Paper ID:
4 Figure 3: Maximal margin separating hyper plane The above figure 3 shows the maximal margin separating the hyper plane. Here the circle represents the support vectors. Now here since the hyper plane given by parameters (x, b) is the same as the hyper plane given by parameters (kx, kb), we can safely bound our search by only considering canonical hyper planes for which min I wt xi + b = 1. It is possible to show that the optimal canonical hyper plane has minimal kwk, and that in order to find a canonical hype rplane it suffices to solve the following minimization problem: minimize ½ wtw under the conditions. Using the Lagrangian theory the problem may be transformed to a certain dual form: maximize with respect to the dual variables that for all I and 5. Applying Clustering using SVM Further on we will designate by classification the process of supervised learning on labeled data and we will designate by clustering the process of unsupervised learning on unlabeled data. Dr. Vapnik [VAP 01] presents an alteration of the classical algorithm which is used for unlabeled training data. Here finding the hyper-plane becomes finding a maximum dimensional sphere of minimum cost that groups the most resembling data. In some of the approaches, we can find a different clustering algorithm based mostly on probabilities. For document clustering we will use the terms defined above and we will mention the necessary changes for running the algorithm on unlabeled data. There are more types of usually used kernels that can be used in the decision function. so be considered as margins of the clusters in the input space. Points belonging to the same cluster will have the same boundary. As the width parameter of the Gaussian kernel decreases the number of unconnected boundaries increases. When the width parameters increases there will be overlapping clusters. We may make use of some of the Gaussian Kernels for the effective, efficient algorithms that maximizes the cluster classifications throughput while using a limited number of resources. Also there are various clustering methods that can be applied over a number of domains, each having its own pros and cons. The use of the best techniques shall be applicable. 5.1 Use of SVM in Clustering Problems SVM for Binary Classification Multiclass Classification Clustering & Sequential Minimal Optimization Probabilistic Outputs for SVM 5.2 Advantages of Clustering using SVM High performance & Large capacity High availability Incremental growth. Improved efficiency 5.3 Applications of Clustering Scientific computing & Making movies Commercial servers (web/database/etc) 6. Evaluation Metrics: Spam Classification & Filtering In [1] the author has seven freeware anti-spam software products for testing. Each software was being installed on Windows XP platform on different personal computers with POP3 mail server. The following table shows the summary of result which was obtained using the real post traffic. Table 1: Testing different spam filtering software s We use the kernel function to transpose the training data from the given input sequence to a higher dimensional feature space and also to separate this data in the new obtained state. The basic idea is to calculate the norm of the differences between the two vectors. The most frequently used are the linear kernel, the polynomial kernel, the Gaussian kernel and the sigmoid kernel. We can choose the kernel according to the type of data that we are using. The linear and the polynomial kernel run best when the data is well separated. The Gaussian and the sigmoid kernel work best when data is overlapped but the number of support vectors also increases. For clustering the training data will be mapped in a higher dimensional feature space using the Gaussian kernel. In this space we shall try to find the smallest sphere that includes the image of the mapped data. This is possible as data is generated by a given distribution and when they are mapped in a higher dimensional feature space they will group in a cluster. After computing the dimensions of the sphere this will be remapped in the original space. The boundary of the sphere will be transformed in one or more boundaries that will contain the classified data. The resulting boundaries can Paper ID:
5 Table 2: Survey of Spam Data collected in a specific time period 7.4 Evaluations of Performance Accuracy (AC) is the proportion of the total number of predictions that were correct. AC = (a + d) / (a + b + c + d) Recall is the proportion of positive cases that were correctly identified. R = d / (c + d) {Actual positive case no} Precision is the proportion of the predicted positive cases that were correct. P = d / (b + d) {Predicted positive case no} Generally In the field of text classification using the SVM tool, the performance evaluation on spam filtering makes use of related indexes. The followings are definitions about two common indexes: Recall Ratio & Precision Ratio of information retrieval in spam filtering [6]. Recall Ratio is the ratio of the amount of spam that has been filtered to the amount of s that should be filtered. Precision Ratio is the ratio of the amount of spam that has been filtered to the amount of s that have been filtered. Figure 4: An Example using SVMLight 7. SVMLight TOOL SVMLight is an implementation of Support Vector Machine in C Language. 7.1 Training Step The following syntax is used for training. svm-learn [-option] train_file model_file where, train_file contains training data. The filename of train_file can be any filename. The extension of train_file can be defined by user arbitrarily. model_file contains the model built based on training data by SVM. 7.2 Format of input files (Training data) Training data is a collection of documents. Each line represents a document. Each feature represents a term (word) in the document. The label and each of the feature value pairs are separated by a space character. Feature value pairs must be ordered by increasing feature number value e.g., tf-idf. 7.3 Testing Step The following syntax is used for testing. svm-classify test_file model_file predictions The format of test_file is exactly the same as train_file. It needs to be scaled into same range. We use the model_file based on training data to classify test data, and compare the predictions with the original label of each test document. Figure 5: Actual & Predicted Test cases 8. Implementation The primary concern will be the use of compression models in spam classification/filtering. Collections of some data sets in date warehouse and then perform data mining on those sets of data them by training and supervised learning method. Training the data involves a series of inputs sequences and the desired outputs which will eliminate the spam from the desired messages. In the second step, learning a function from training data set available to the user and the objective is to predict output from the function which is based upon any valid input after having seen a number of training examples. The steps followed here are as follows: 1. Acquisition of data and Dataset description a. Preparations of the required datasets. b. Training the datasets. 2. The Spam based dataset(has 541 instances) 3. Each instance in the file relates to a separate Each is represented using some real attributes like integers and subsets selectors. 5. Generating the svm_light format directly based on the above features (using the SVM-Light tool) Paper ID:
6 instances were used for training and 41 were used for testing. 7. Training the classifiers simultaneously. 8. Testing the Classifiers based on the above trained classifiers. 9. Conclusion & Future Enhancements This problem differs from classical text categorization tasks in many ways. Cost of misclassification is highly unbalanced. Messages in an stream arrive in order and must be classified upon delivery. It seems to be very common to deploy a filter without training data. Here we try to implement a new kernel function that would rather helps in achieving efficiency and solving many clustering problems. We also try to implement the kernels function that would even help in spam detection and classification/filtering upon detection at the origins. These unique characteristics of the spam classification and filtering and clustering approaches problem are reflected in the design of our experiments and the choice of measures that were used for classifier evaluation. Here after our analysis about the overall scenario of spam classification/filtering, we conclude that various classifiers like Decision Tree & Graphs Models classifiers have large memory requirement. Also the number of features for spam classification/filtering is havoc & large and hence in order to evaluate these attributes the use SVM has proved to be a good classifier because of its sparse data format and acceptable recall and precision value. SVM is regarded as an important application of kernel methods implementation, one of the key areas in machine learning. In Future, we have planned to propose and implement some more new algorithms using kernel functions that will enhance the efficiency and performance over any of the available datasets at real-time and removing spam significantly. 10. Acknowledgement I would like to thank Shri. Gaurav Shrivastav, for his inspiring discussions, guidance and mutual encouragement provided to me at all times!!!! [6] Bag of Words Model. omputer_vision. [7] N. Cristianini and J. S. Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press, [8] V. Vapnik., The Nature of Statistical Learning Theory, Springer, New York, [9] T.-J. Liu, W.-L. Tsao and C.-L. Lee, A High Performance Image-Spam Filtering System, Ninth International Symposium on Distributed Computing and Applications to Business, Engineering and Science, August 2010, Hong Kong, pp doi: /dcabes [10] M. Sasaki & H. Shinnou, Spam Detection Using Text Clustering, IEEE Proceedings of 2005 International Conference on Cyberwords, Singapore, [11] Duncan Cook, Catching Spam before it arrives: Domain Specific Dynamic Blacklists, Australian Computer Society, 2006, ACM. [12] D. Puniškis, R. Laurutis, R. Dirmeikis, An Artificial Neural Nets for Spam Recognition, electronics and electrical engineering ISSN Nr. 5(69) [13] Youn and Dennis McLeod, A Comparative Study for Classification, Seongwook Los Angeles, CA 90089, USA, [14] Bekker, S 2003, Spam to Cost U.S. Companies $10 Billion in 2003, ENT News, viewed May , >. [15] SVM Application List. [16] html [17] Wikipedia, Spam. [18] Wikipedia, spam. [19] [VAP01] Vladimir Vapnik, Asa Ben-Hur, David Horn, Hava T, Sieglmann, Support Vector Clustering, Journal of Machine Learning Research 2, pages ,2001. References [1] Saadat Nazirova, Institute of Information Technology of Azerbaijan National Academy of Sciences, Azerbaijan. Received April 11, 2011; revised May 8, 2011; accepted May 15, 2011 [2] R.Deepa Lakshmi and N. Radha, Supervised Learning Approach for Spam Classification Analysis using Data Mining Tools International Journal on Computer Science and Engineering, Vol. 02, No. 08, 2010, [3] Brian Krebs, Krebs on Security, [4] Microsoft Sender ID Framework. enderid/default.mspx. [5] Sender Policy Framework. Paper ID:
A Content based Spam Filtering Using Optical Back Propagation Technique
A Content based Spam Filtering Using Optical Back Propagation Technique Sarab M. Hameed 1, Noor Alhuda J. Mohammed 2 Department of Computer Science, College of Science, University of Baghdad - Iraq ABSTRACT
International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015
RESEARCH ARTICLE OPEN ACCESS Data Mining Technology for Efficient Network Security Management Ankit Naik [1], S.W. Ahmad [2] Student [1], Assistant Professor [2] Department of Computer Science and Engineering
Machine Learning in Spam Filtering
Machine Learning in Spam Filtering A Crash Course in ML Konstantin Tretyakov [email protected] Institute of Computer Science, University of Tartu Overview Spam is Evil ML for Spam Filtering: General Idea, Problems.
Artificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing and Developing E-mail Classifier
International Journal of Recent Technology and Engineering (IJRTE) ISSN: 2277-3878, Volume-1, Issue-6, January 2013 Artificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing
Email Spam Detection A Machine Learning Approach
Email Spam Detection A Machine Learning Approach Ge Song, Lauren Steimle ABSTRACT Machine learning is a branch of artificial intelligence concerned with the creation and study of systems that can learn
Improving spam mail filtering using classification algorithms with discretization Filter
International Association of Scientific Innovation and Research (IASIR) (An Association Unifying the Sciences, Engineering, and Applied Research) International Journal of Emerging Technologies in Computational
Savita Teli 1, Santoshkumar Biradar 2
Effective Spam Detection Method for Email Savita Teli 1, Santoshkumar Biradar 2 1 (Student, Dept of Computer Engg, Dr. D. Y. Patil College of Engg, Ambi, University of Pune, M.S, India) 2 (Asst. Proff,
SURVEY PAPER ON INTELLIGENT SYSTEM FOR TEXT AND IMAGE SPAM FILTERING Amol H. Malge 1, Dr. S. M. Chaware 2
International Journal of Computer Engineering and Applications, Volume IX, Issue I, January 15 SURVEY PAPER ON INTELLIGENT SYSTEM FOR TEXT AND IMAGE SPAM FILTERING Amol H. Malge 1, Dr. S. M. Chaware 2
Learning is a very general term denoting the way in which agents:
What is learning? Learning is a very general term denoting the way in which agents: Acquire and organize knowledge (by building, modifying and organizing internal representations of some external reality);
Social Media Mining. Data Mining Essentials
Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers
Knowledge Discovery from patents using KMX Text Analytics
Knowledge Discovery from patents using KMX Text Analytics Dr. Anton Heijs [email protected] Treparel Abstract In this white paper we discuss how the KMX technology of Treparel can help searchers
SURVEY OF TEXT CLASSIFICATION ALGORITHMS FOR SPAM FILTERING
I J I T E ISSN: 2229-7367 3(1-2), 2012, pp. 233-237 SURVEY OF TEXT CLASSIFICATION ALGORITHMS FOR SPAM FILTERING K. SARULADHA 1 AND L. SASIREKA 2 1 Assistant Professor, Department of Computer Science and
An Overview of Knowledge Discovery Database and Data mining Techniques
An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,
Support Vector Machines with Clustering for Training with Very Large Datasets
Support Vector Machines with Clustering for Training with Very Large Datasets Theodoros Evgeniou Technology Management INSEAD Bd de Constance, Fontainebleau 77300, France [email protected] Massimiliano
Support Vector Machines Explained
March 1, 2009 Support Vector Machines Explained Tristan Fletcher www.cs.ucl.ac.uk/staff/t.fletcher/ Introduction This document has been written in an attempt to make the Support Vector Machines (SVM),
Classification algorithm in Data mining: An Overview
Classification algorithm in Data mining: An Overview S.Neelamegam #1, Dr.E.Ramaraj *2 #1 M.phil Scholar, Department of Computer Science and Engineering, Alagappa University, Karaikudi. *2 Professor, Department
Spam Filtering using Naïve Bayesian Classification
Spam Filtering using Naïve Bayesian Classification Presented by: Samer Younes Outline What is spam anyway? Some statistics Why is Spam a Problem Major Techniques for Classifying Spam Transport Level Filtering
Anti-Spam Filter Based on Naïve Bayes, SVM, and KNN model
AI TERM PROJECT GROUP 14 1 Anti-Spam Filter Based on,, and model Yun-Nung Chen, Che-An Lu, Chao-Yu Huang Abstract spam email filters are a well-known and powerful type of filters. We construct different
E-commerce Transaction Anomaly Classification
E-commerce Transaction Anomaly Classification Minyong Lee [email protected] Seunghee Ham [email protected] Qiyi Jiang [email protected] I. INTRODUCTION Due to the increasing popularity of e-commerce
How To Filter Spam Image From A Picture By Color Or Color
Image Content-Based Email Spam Image Filtering Jianyi Wang and Kazuki Katagishi Abstract With the population of Internet around the world, email has become one of the main methods of communication among
KEITH LEHNERT AND ERIC FRIEDRICH
MACHINE LEARNING CLASSIFICATION OF MALICIOUS NETWORK TRAFFIC KEITH LEHNERT AND ERIC FRIEDRICH 1. Introduction 1.1. Intrusion Detection Systems. In our society, information systems are everywhere. They
Spam detection with data mining method:
Spam detection with data mining method: Ensemble learning with multiple SVM based classifiers to optimize generalization ability of email spam classification Keywords: ensemble learning, SVM classifier,
Network Machine Learning Research Group. Intended status: Informational October 19, 2015 Expires: April 21, 2016
Network Machine Learning Research Group S. Jiang Internet-Draft Huawei Technologies Co., Ltd Intended status: Informational October 19, 2015 Expires: April 21, 2016 Abstract Network Machine Learning draft-jiang-nmlrg-network-machine-learning-00
Big Data Analytics CSCI 4030
High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Web advertising
Groundbreaking Technology Redefines Spam Prevention. Analysis of a New High-Accuracy Method for Catching Spam
Groundbreaking Technology Redefines Spam Prevention Analysis of a New High-Accuracy Method for Catching Spam October 2007 Introduction Today, numerous companies offer anti-spam solutions. Most techniques
Customer Classification And Prediction Based On Data Mining Technique
Customer Classification And Prediction Based On Data Mining Technique Ms. Neethu Baby 1, Mrs. Priyanka L.T 2 1 M.E CSE, Sri Shakthi Institute of Engineering and Technology, Coimbatore 2 Assistant Professor
Lan, Mingjun and Zhou, Wanlei 2005, Spam filtering based on preference ranking, in Fifth International Conference on Computer and Information
Lan, Mingjun and Zhou, Wanlei 2005, Spam filtering based on preference ranking, in Fifth International Conference on Computer and Information Technology : CIT 2005 : proceedings : 21-23 September, 2005,
Index Terms Domain name, Firewall, Packet, Phishing, URL.
BDD for Implementation of Packet Filter Firewall and Detecting Phishing Websites Naresh Shende Vidyalankar Institute of Technology Prof. S. K. Shinde Lokmanya Tilak College of Engineering Abstract Packet
An Introduction to Data Mining
An Introduction to Intel Beijing [email protected] January 17, 2014 Outline 1 DW Overview What is Notable Application of Conference, Software and Applications Major Process in 2 Major Tasks in Detail
How To Use Neural Networks In Data Mining
International Journal of Electronics and Computer Science Engineering 1449 Available Online at www.ijecse.org ISSN- 2277-1956 Neural Networks in Data Mining Priyanka Gaur Department of Information and
Introduction to Support Vector Machines. Colin Campbell, Bristol University
Introduction to Support Vector Machines Colin Campbell, Bristol University 1 Outline of talk. Part 1. An Introduction to SVMs 1.1. SVMs for binary classification. 1.2. Soft margins and multi-class classification.
Feature Subset Selection in E-mail Spam Detection
Feature Subset Selection in E-mail Spam Detection Amir Rajabi Behjat, Universiti Technology MARA, Malaysia IT Security for the Next Generation Asia Pacific & MEA Cup, Hong Kong 14-16 March, 2012 Feature
Scalable Developments for Big Data Analytics in Remote Sensing
Scalable Developments for Big Data Analytics in Remote Sensing Federated Systems and Data Division Research Group High Productivity Data Processing Dr.-Ing. Morris Riedel et al. Research Group Leader,
Kaspersky Anti-Spam 3.0
Kaspersky Anti-Spam 3.0 Whitepaper Collecting spam samples The Linguistic Laboratory Updates to antispam databases Spam filtration servers Spam filtration is more than simply a software program. It is
Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification
Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification Tina R. Patil, Mrs. S. S. Sherekar Sant Gadgebaba Amravati University, Amravati [email protected], [email protected]
Machine Learning Final Project Spam Email Filtering
Machine Learning Final Project Spam Email Filtering March 2013 Shahar Yifrah Guy Lev Table of Content 1. OVERVIEW... 3 2. DATASET... 3 2.1 SOURCE... 3 2.2 CREATION OF TRAINING AND TEST SETS... 4 2.3 FEATURE
A Game Theoretical Framework for Adversarial Learning
A Game Theoretical Framework for Adversarial Learning Murat Kantarcioglu University of Texas at Dallas Richardson, TX 75083, USA muratk@utdallas Chris Clifton Purdue University West Lafayette, IN 47907,
A Two-Pass Statistical Approach for Automatic Personalized Spam Filtering
A Two-Pass Statistical Approach for Automatic Personalized Spam Filtering Khurum Nazir Junejo, Mirza Muhammad Yousaf, and Asim Karim Dept. of Computer Science, Lahore University of Management Sciences
Bayesian Spam Filtering
Bayesian Spam Filtering Ahmed Obied Department of Computer Science University of Calgary [email protected] http://www.cpsc.ucalgary.ca/~amaobied Abstract. With the enormous amount of spam messages propagating
Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval
Information Retrieval INFO 4300 / CS 4300! Retrieval models Older models» Boolean retrieval» Vector Space model Probabilistic Models» BM25» Language models Web search» Learning to Rank Search Taxonomy!
Investigation of Support Vector Machines for Email Classification
Investigation of Support Vector Machines for Email Classification by Andrew Farrugia Thesis Submitted by Andrew Farrugia in partial fulfillment of the Requirements for the Degree of Bachelor of Software
Simple and efficient online algorithms for real world applications
Simple and efficient online algorithms for real world applications Università degli Studi di Milano Milano, Italy Talk @ Centro de Visión por Computador Something about me PhD in Robotics at LIRA-Lab,
Detecting E-mail Spam Using Spam Word Associations
Detecting E-mail Spam Using Spam Word Associations N.S. Kumar 1, D.P. Rana 2, R.G.Mehta 3 Sardar Vallabhbhai National Institute of Technology, Surat, India 1 [email protected] 2 [email protected]
T-61.3050 : Email Classification as Spam or Ham using Naive Bayes Classifier. Santosh Tirunagari : 245577
T-61.3050 : Email Classification as Spam or Ham using Naive Bayes Classifier Santosh Tirunagari : 245577 January 20, 2011 Abstract This term project gives a solution how to classify an email as spam or
Prediction of Heart Disease Using Naïve Bayes Algorithm
Prediction of Heart Disease Using Naïve Bayes Algorithm R.Karthiyayini 1, S.Chithaara 2 Assistant Professor, Department of computer Applications, Anna University, BIT campus, Tiruchirapalli, Tamilnadu,
AN EFFECTIVE SPAM FILTERING FOR DYNAMIC MAIL MANAGEMENT SYSTEM
ISSN: 2229-6956(ONLINE) ICTACT JOURNAL ON SOFT COMPUTING, APRIL 212, VOLUME: 2, ISSUE: 3 AN EFFECTIVE SPAM FILTERING FOR DYNAMIC MAIL MANAGEMENT SYSTEM S. Arun Mozhi Selvi 1 and R.S. Rajesh 2 1 Department
Supervised Learning (Big Data Analytics)
Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used
A MACHINE LEARNING APPROACH TO SERVER-SIDE ANTI-SPAM E-MAIL FILTERING 1 2
UDC 004.75 A MACHINE LEARNING APPROACH TO SERVER-SIDE ANTI-SPAM E-MAIL FILTERING 1 2 I. Mashechkin, M. Petrovskiy, A. Rozinkin, S. Gerasimov Computer Science Department, Lomonosov Moscow State University,
Introduction to Machine Learning Lecture 1. Mehryar Mohri Courant Institute and Google Research [email protected]
Introduction to Machine Learning Lecture 1 Mehryar Mohri Courant Institute and Google Research [email protected] Introduction Logistics Prerequisites: basics concepts needed in probability and statistics
An Overview of Spam Blocking Techniques
An Overview of Spam Blocking Techniques Recent analyst estimates indicate that over 60 percent of the world s email is unsolicited email, or spam. Spam is no longer just a simple annoyance. Spam has now
DATA MINING TECHNIQUES AND APPLICATIONS
DATA MINING TECHNIQUES AND APPLICATIONS Mrs. Bharati M. Ramageri, Lecturer Modern Institute of Information Technology and Research, Department of Computer Application, Yamunanagar, Nigdi Pune, Maharashtra,
Towards better accuracy for Spam predictions
Towards better accuracy for Spam predictions Chengyan Zhao Department of Computer Science University of Toronto Toronto, Ontario, Canada M5S 2E4 [email protected] Abstract Spam identification is crucial
eprism Email Security Appliance 6.0 Intercept Anti-Spam Quick Start Guide
eprism Email Security Appliance 6.0 Intercept Anti-Spam Quick Start Guide This guide is designed to help the administrator configure the eprism Intercept Anti-Spam engine to provide a strong spam protection
Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches
Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches PhD Thesis by Payam Birjandi Director: Prof. Mihai Datcu Problematic
An Imbalanced Spam Mail Filtering Method
, pp. 119-126 http://dx.doi.org/10.14257/ijmue.2015.10.3.12 An Imbalanced Spam Mail Filtering Method Zhiqiang Ma, Rui Yan, Donghong Yuan and Limin Liu (College of Information Engineering, Inner Mongolia
CLASSIFYING NETWORK TRAFFIC IN THE BIG DATA ERA
CLASSIFYING NETWORK TRAFFIC IN THE BIG DATA ERA Professor Yang Xiang Network Security and Computing Laboratory (NSCLab) School of Information Technology Deakin University, Melbourne, Australia http://anss.org.au/nsclab
Detection. Perspective. Network Anomaly. Bhattacharyya. Jugal. A Machine Learning »C) Dhruba Kumar. Kumar KaKta. CRC Press J Taylor & Francis Croup
Network Anomaly Detection A Machine Learning Perspective Dhruba Kumar Bhattacharyya Jugal Kumar KaKta»C) CRC Press J Taylor & Francis Croup Boca Raton London New York CRC Press is an imprint of the Taylor
How to keep spam off your network
GFI White Paper How to keep spam off your network What features to look for in anti-spam technology A buyer s guide to anti-spam software, this white paper highlights the key features to look for in anti-spam
Intrusion Detection via Machine Learning for SCADA System Protection
Intrusion Detection via Machine Learning for SCADA System Protection S.L.P. Yasakethu Department of Computing, University of Surrey, Guildford, GU2 7XH, UK. [email protected] J. Jiang Department
Machine learning for algo trading
Machine learning for algo trading An introduction for nonmathematicians Dr. Aly Kassam Overview High level introduction to machine learning A machine learning bestiary What has all this got to do with
Combining Global and Personal Anti-Spam Filtering
Combining Global and Personal Anti-Spam Filtering Richard Segal IBM Research Hawthorne, NY 10532 Abstract Many of the first successful applications of statistical learning to anti-spam filtering were personalized
Operations Research and Knowledge Modeling in Data Mining
Operations Research and Knowledge Modeling in Data Mining Masato KODA Graduate School of Systems and Information Engineering University of Tsukuba, Tsukuba Science City, Japan 305-8573 [email protected]
Active Learning SVM for Blogs recommendation
Active Learning SVM for Blogs recommendation Xin Guan Computer Science, George Mason University Ⅰ.Introduction In the DH Now website, they try to review a big amount of blogs and articles and find the
Analysis of Spam Filter Methods on SMTP Servers Category: Trends in Anti-Spam Development
Analysis of Spam Filter Methods on SMTP Servers Category: Trends in Anti-Spam Development Author André Tschentscher Address Fachhochschule Erfurt - University of Applied Sciences Applied Computer Science
The Data Mining Process
Sequence for Determining Necessary Data. Wrong: Catalog everything you have, and decide what data is important. Right: Work backward from the solution, define the problem explicitly, and map out the data
Email Spam Detection Using Customized SimHash Function
International Journal of Research Studies in Computer Science and Engineering (IJRSCSE) Volume 1, Issue 8, December 2014, PP 35-40 ISSN 2349-4840 (Print) & ISSN 2349-4859 (Online) www.arcjournals.org Email
Tightening the Net: A Review of Current and Next Generation Spam Filtering Tools
Tightening the Net: A Review of Current and Next Generation Spam Filtering Tools Spam Track Wednesday 1 March, 2006 APRICOT Perth, Australia James Carpinter & Ray Hunt Dept. of Computer Science and Software
Detecting spam using social networking concepts Honours Project COMP4905 Carleton University Terrence Chiu 100605339
Detecting spam using social networking concepts Honours Project COMP4905 Carleton University Terrence Chiu 100605339 Supervised by Dr. Tony White School of Computer Science Summer 2007 Abstract This paper
FRACTAL RECOGNITION AND PATTERN CLASSIFIER BASED SPAM FILTERING IN EMAIL SERVICE
FRACTAL RECOGNITION AND PATTERN CLASSIFIER BASED SPAM FILTERING IN EMAIL SERVICE Ms. S.Revathi 1, Mr. T. Prabahar Godwin James 2 1 Post Graduate Student, Department of Computer Applications, Sri Sairam
6367(Print), ISSN 0976 6375(Online) & TECHNOLOGY Volume 4, Issue 1, (IJCET) January- February (2013), IAEME
INTERNATIONAL International Journal of Computer JOURNAL Engineering OF COMPUTER and Technology ENGINEERING (IJCET), ISSN 0976-6367(Print), ISSN 0976 6375(Online) & TECHNOLOGY Volume 4, Issue 1, (IJCET)
Email Classification Using Data Reduction Method
Email Classification Using Data Reduction Method Rafiqul Islam and Yang Xiang, member IEEE School of Information Technology Deakin University, Burwood 3125, Victoria, Australia Abstract Classifying user
Search and Information Retrieval
Search and Information Retrieval Search on the Web 1 is a daily activity for many people throughout the world Search and communication are most popular uses of the computer Applications involving search
Dual Mechanism to Detect DDOS Attack Priyanka Dembla, Chander Diwaker 2 1 Research Scholar, 2 Assistant Professor
International Association of Scientific Innovation and Research (IASIR) (An Association Unifying the Sciences, Engineering, and Applied Research) International Journal of Engineering, Business and Enterprise
International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014
RESEARCH ARTICLE OPEN ACCESS A Survey of Data Mining: Concepts with Applications and its Future Scope Dr. Zubair Khan 1, Ashish Kumar 2, Sunny Kumar 3 M.Tech Research Scholar 2. Department of Computer
An Efficient Spam Filtering Techniques for Email Account
American Journal of Engineering Research (AJER) e-issn : 2320-0847 p-issn : 2320-0936 Volume-02, Issue-10, pp-63-73 www.ajer.org Research Paper Open Access An Efficient Spam Filtering Techniques for Email
A LVQ-based neural network anti-spam email approach
A LVQ-based neural network anti-spam email approach Zhan Chuan Lu Xianliang Hou Mengshu Zhou Xu College of Computer Science and Engineering of UEST of China, Chengdu, China 610054 [email protected]
Ipswitch IMail Server with Integrated Technology
Ipswitch IMail Server with Integrated Technology As spammers grow in their cleverness, their means of inundating your life with spam continues to grow very ingeniously. The majority of spam messages these
International Journal of Research in Advent Technology Available Online at: http://www.ijrat.org
IMPROVING PEFORMANCE OF BAYESIAN SPAM FILTER Firozbhai Ahamadbhai Sherasiya 1, Prof. Upen Nathwani 2 1 2 Computer Engineering Department 1 2 Noble Group of Institutions 1 [email protected] ABSTARCT:
FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM
International Journal of Innovative Computing, Information and Control ICIC International c 0 ISSN 34-48 Volume 8, Number 8, August 0 pp. 4 FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT
Mobile Phone APP Software Browsing Behavior using Clustering Analysis
Proceedings of the 2014 International Conference on Industrial Engineering and Operations Management Bali, Indonesia, January 7 9, 2014 Mobile Phone APP Software Browsing Behavior using Clustering Analysis
Making Sense of the Mayhem: Machine Learning and March Madness
Making Sense of the Mayhem: Machine Learning and March Madness Alex Tran and Adam Ginzberg Stanford University [email protected] [email protected] I. Introduction III. Model The goal of our research
Image Based Spam: White Paper
The Rise of Image-Based Spam No matter how you slice it - the spam problem is getting worse. In 2004, it was sufficient to use simple scoring mechanisms to determine whether email was spam or not because
Adaptive Anomaly Detection for Network Security
International Journal of Computer and Internet Security. ISSN 0974-2247 Volume 5, Number 1 (2013), pp. 1-9 International Research Publication House http://www.irphouse.com Adaptive Anomaly Detection for
Supervised and unsupervised learning - 1
Chapter 3 Supervised and unsupervised learning - 1 3.1 Introduction The science of learning plays a key role in the field of statistics, data mining, artificial intelligence, intersecting with areas in
IMPROVING SPAM EMAIL FILTERING EFFICIENCY USING BAYESIAN BACKWARD APPROACH PROJECT
IMPROVING SPAM EMAIL FILTERING EFFICIENCY USING BAYESIAN BACKWARD APPROACH PROJECT M.SHESHIKALA Assistant Professor, SREC Engineering College,Warangal Email: [email protected], Abstract- Unethical
Feature Selection using Integer and Binary coded Genetic Algorithm to improve the performance of SVM Classifier
Feature Selection using Integer and Binary coded Genetic Algorithm to improve the performance of SVM Classifier D.Nithya a, *, V.Suganya b,1, R.Saranya Irudaya Mary c,1 Abstract - This paper presents,
REVIEW OF ENSEMBLE CLASSIFICATION
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IJCSMC, Vol. 2, Issue.
Predicting Flight Delays
Predicting Flight Delays Dieterich Lawson [email protected] William Castillo [email protected] Introduction Every year approximately 20% of airline flights are delayed or cancelled, costing
Spam filtering. Peter Likarish Based on slides by EJ Jung 11/03/10
Spam filtering Peter Likarish Based on slides by EJ Jung 11/03/10 What is spam? An unsolicited email equivalent to Direct Mail in postal service UCE (unsolicited commercial email) UBE (unsolicited bulk
Machine Learning. CUNY Graduate Center, Spring 2013. Professor Liang Huang. [email protected]
Machine Learning CUNY Graduate Center, Spring 2013 Professor Liang Huang [email protected] http://acl.cs.qc.edu/~lhuang/teaching/machine-learning Logistics Lectures M 9:30-11:30 am Room 4419 Personnel
Content-Based Recommendation
Content-Based Recommendation Content-based? Item descriptions to identify items that are of particular interest to the user Example Example Comparing with Noncontent based Items User-based CF Searches
Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms
Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Scott Pion and Lutz Hamel Abstract This paper presents the results of a series of analyses performed on direct mail
Objective This howto demonstrates and explains the different mechanisms for fending off unwanted spam e-mail.
Collax Spam Filter Howto This howto describes the configuration of the spam filter on a Collax server. Requirements Collax Business Server Collax Groupware Suite Collax Security Gateway Collax Platform
