Improving spam mail filtering using classification algorithms with discretization Filter
International Association of Scientific Innovation and Research (IASIR) (An Association Unifying the Sciences, Engineering, and Applied Research)
International Journal of Emerging Technologies in Computational and Applied Sciences (IJETCAS) ISSN (Print): ISSN (Online):

Supriya S. Shinde 1, Prof. Rahul Patil 2
1 ME Computer Student, 2 Professor
1,2 Computer Department, Pimpri Chinchwad College of Engineering, Nigdi, Pune, Maharashtra, INDIA

Abstract: Spam, or junk e-mail, is one of the major problems of today's Internet usage: it causes financial damage to organizations and annoys individual users. Among the approaches developed to date to stop spam, filtering is an important and popular technique. Common uses of mail filters include organizing incoming e-mail and removing spam and viruses. A less common use is to inspect outgoing e-mail at some companies to ensure that employees comply with the appropriate rules. Users might also employ a mail filter to prioritize messages and to sort them into folders based on subject matter or other criteria. For many years these needs have driven the pattern recognition and machine learning communities to keep improving filtering techniques. Mail filters can be installed by the user, either as separate programs or as part of their e-mail client. In e-mail programs, users can define manual filters that automatically filter mail according to criteria chosen by the user. In this paper, we present a survey of the performance of six machine learning methods in spam filtering. Experiments are carried out on different classification techniques and association techniques using the Waikato Environment for Knowledge Analysis (WEKA). Different classifiers are applied to one benchmark dataset in order to evaluate which classifier gives the better result. The dataset is in Attribute-Relation File Format (ARFF).
10-fold cross-validation is used to obtain a reliable accuracy estimate. Results of the classification algorithms are compared on the UCI Spambase dataset, and it is found that no single algorithm performs best for spam mail filtering; performance varies across data sets. Our results indicate that classification performance improves when filters are applied.

Key Terms: Classification, Spam, Spam filtering, Machine learning, algorithm, WEKA classification Algorithm.

I. Introduction
In recent years, e-mails have become a common and important medium of communication for most Internet users. Data mining [3] is defined as the process of discovering patterns in data. Today it is used in areas such as fraud detection, scientific discovery, marketing, artificial intelligence, and surveillance. This paper uses the UCI Spambase dataset [7]. Preprocessing is done, and different classification methods are compared based on their performance. Rules are generated using association rule mining. Classification is a data mining (machine learning) technique that starts from a set of predefined classes and determines which class a new object belongs to. A large number of classifiers are available for classifying data, grouped into families such as Bayes, function, rule, lazy, meta, and decision tree. Data preprocessing is a data mining technique that transforms raw data into an understandable format. The data must be preprocessed to select a subset of attributes to use in learning. For preprocessing we have applied the Discretization filter.

A. Weka: WEKA is a popular suite of machine learning software. It is developed by researchers at the University of Waikato in New Zealand.
WEKA is freely available under the GNU General Public License (GPL). The software is written in Java and provides an interactive graphical user interface (GUI). It supports many data mining tasks such as preprocessing, classification, clustering, attribute selection, association, and visualization. Three graphical user interfaces are available in WEKA: the Explorer (exploratory data analysis), the Experimenter (experimental environment), and the Knowledge Flow (a new process-model-inspired interface) [2]. WEKA uses flat text files to describe the data and can work with a wide variety of data files. Data can be imported from a file in various formats (ARFF, CSV, C4.5, binary) and can also be read from a URL or from an SQL database [3].

B. Dataset: The UCI Spambase data set has 4601 instances, of which 1813 are spam and 2788 are non-spam. Each instance has 58 attributes. The first 48 attributes are continuous real numbers ranging from 0 to 100, named word_freq_WORD. The last attribute is a nominal class attribute, where 1 denotes that the e-mail is considered spam and 0 denotes that it is non-spam.

IJETCAS; 2014, IJETCAS All Rights Reserved, Page 82

II. Literature Review
This section describes the Discretize preprocessing filter, spam filtering methods, and classification methods in detail. We have taken the UCI Spambase dataset and examined the classification results of six families of methods: Bayes, Function, Lazy, Meta, Rules, and Tree classifiers.

Spam Techniques:

A. Bag-of-Words model based methods:

Support vector machine (SVM): In 2006, the SVM classifier was proposed for spam filtering. An SVM works from training samples and a predefined transformation $\Phi: \mathbb{R}^s \to F$, which maps the features into a transformed feature space, and it separates the samples of the two classes with a hyperplane in that space [10]. The decision rule has the following form:

$$f(x) = \operatorname{sign}\left(\sum_{i=1}^{n} \alpha_i y_i K(x_i, x) + b\right)$$

where $K(u, v) = \Phi(u) \cdot \Phi(v)$ is the kernel function, $(x_i, y_i)$ are the training samples with their class labels, and $\alpha_i$, $i = 1..n$, and $b$ are chosen to maximize the margin of the separating hyperplane.

Term frequency-inverse document frequency (TF-IDF): TF-IDF is used to compute word weights, which serve as features for clustering. The document frequency of a word w, written DF(w), is the number of messages in the collected data set in which w appears at least once. The weight is given by:

$$w_{ab} = TF_{ab} \cdot \log\frac{S}{DF_a}$$

where $w_{ab}$ is the weight of the a-th word in the b-th document (e-mail), $TF_{ab}$ is the number of occurrences of the a-th word in the b-th document, $DF_a$ is the number of messages in which the a-th word occurs, and $S$ is the total number of messages in the training dataset.

Naïve Bayes classifiers: This is a simple probabilistic classifier based on Bayes' theorem with strong (naïve) independence assumptions. In 2006, the Naïve Bayes classifier was proposed for phishing e-mail filtering at Microsoft (Ganger 2006).
The Naive Bayes assumption is that the individual attributes are conditionally independent of each other given the class.

k-nearest neighbor (k-NN): This classifier was proposed for spam filtering by Gansterer. The decision is made as follows: the k nearest training samples are chosen using a predefined similarity function, and the e-mail is then labeled as belonging to the same class as the majority of those k samples.

B. Multiple Classification DM algorithms:

Discretization Filter: In data mining, filters are used to preprocess the data and remove noise. There are two types of filters, supervised and unsupervised, and each type offers several methods. Filters can be applied to both the training and the test dataset. In our work we have used the unsupervised Discretization filter, which quantizes each attribute without any knowledge of the classes of the instances in the training set. The Discretize filter converts numeric values to nominal ones [3] and distributes them equally into a selected number of bins.

[1] Bayes Classifier: Bayes methods are also used for classification in data mining. There are six Bayesian methods: AODE, AODEsr, Naive Bayes, Bayesian Net, Naive Bayes Simple, and Naive Bayes Updateable. In our work we have used the Naive Bayes classification method. The Naive Bayes classifier assumes that the attributes are independent. Bayes' rule is used to predict the class given some feature values [4].

[2] Function Classifier: Function classifiers draw on neural network and regression concepts. They can be written down as mathematical equations in a natural way, whereas other methods such as decision trees and rules cannot. The methods in this family include Linear Regression, Logistic, FunctionsLogistic, RBFNetwork, etc. For our experiment we have used FunctionsLogistic. FunctionsLogistic builds logistic models, fitting them using LogitBoost with simple regression functions as base learners and determining how many iterations to perform using cross-validation, which supports automatic attribute selection [3].

[3] Rule Classifier: Interesting relationships among attributes can be found by using association rules. A rule classifier can predict more than one conclusion. The correctly predicted records are called the coverage. Accuracy is the number of correctly classified records divided by the total number of records. Support is the coverage divided by the total number of records. The methods in the rules family are Conjunctive Rule, Decision Table, DTNB, PART, ZeroR, JRip, and Ridor. For our dataset we have used the PART method. PART obtains rules from partial decision trees [3]. It builds the tree using C4.5's heuristics with the same user-defined parameters as J4.8.

[4] Lazy Classifier: Lazy learners store the training instances and do no real work until classification time [3]. They support incremental learning. The lazy classifier methods are IB1, IBk, KStar, LWL, and LBR. For our data set we have used the KStar algorithm. KStar is a nearest-neighbor method with a generalized distance function based on transformations.

[5] Meta Classifier: Meta classifiers operate on the output of other learners and are hence called meta learners. This family includes a large number of classifiers. The most important are Bagging, AdaBoostM1, LogitBoost, MultiClassClassifier, CVParameterSelection, and FilteredClassifier. For our dataset we have used FilteredClassifier, which runs a classifier on filtered data.

[6] Decision Trees: Decision trees specify the sequence of decisions that need to be made, along with the resulting recommendation.
A divide-and-conquer approach to the problem of learning from a set of independent instances leads naturally to a style of representation called a decision tree [3]. There are different decision tree methods, such as ADTree, BFTree, J48, J48graft, and DecisionStump. For our data set we have used J48. Using entropy-based information gain to choose attributes, it creates a tree that depicts the arrangement of the attributes in a tree structure. J48 is WEKA's implementation of C4.5 [5].

C. Classifiers Models-Based Features: These approaches build full models that are able to create new features, using many adaptive algorithms and classifiers to produce the final results.

D. Hybrid Approach: These approaches combine different classification algorithms that work together to enhance results. SVM and Naïve Bayes, when combined, give better results for spam classification in terms of higher accuracy and a lower false positive rate.

E. Clustering of spam mails: Clustering is the process of grouping data together according to similarity. It is usually an unsupervised machine learning approach. This group of machine learning techniques filters phishing e-mails by clustering e-mails in online or offline mode. One of the most widely used clustering techniques is k-means clustering. K-means is an offline, unsupervised algorithm that begins by choosing k assumed cluster centres. Any random object or feature vector can be selected as an initial centre. The following three steps are then carried out:
1- Determine the centre coordinates.
2- Determine the distance of each object (vector) to the centres.
3- Group the objects based on minimum distance.

F. Image based Spam Filtering: Image-based spam has become a new target of spam classification.
Spammers insert the message into an image and then attach it to the mail. Traditional methods based on the analysis of text-based information do not work in this case. In [11], a three-layer image spam filtering system is proposed, consisting of a Mail Header Classifier, an Image Header Classifier, and a Visual Feature Classifier. A Bayesian classifier is applied in the first layer and an SVM classifier in the remaining layers. In [9], statistical feature extraction is proposed for the classification of image-based spam using artificial neural networks. The authors consider two statistical image features for image classification: the histogram and the mean value of image blocks.
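The k-means steps listed under "Clustering of spam mails" can be illustrated with a short sketch. This is a minimal illustration only, not the procedure used in any of the cited works: the function name `kmeans`, the choice of the first k points as initial centres (instead of random ones), and the toy two-dimensional points are all assumptions made so the example stays deterministic.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def kmeans(points: Sequence[Point], k: int, max_iter: int = 100) -> Tuple[List[Point], List[int]]:
    """Cluster points into k groups; returns (centres, label per point)."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # Assumption: take the first k points as initial centres (deterministic).
    centres: List[Point] = [tuple(p) for p in points[:k]]
    labels = [-1] * len(points)
    for _ in range(max_iter):
        # Steps 2 and 3: compute distances and assign each point to the nearest centre.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centres[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centres will not move again
        labels = new_labels
        # Step 1: recompute each centre as the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centres[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centres, labels


# Hypothetical feature vectors for four messages (two obvious groups).
toy_points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
toy_centres, toy_labels = kmeans(toy_points, k=2)
print(toy_labels)  # [0, 0, 1, 1]
```

In practice the initial centres are usually drawn at random and the algorithm is restarted several times, since the result depends on the starting points.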
No. / Techniques Used / Advantages / Disadvantages

1. Bag-of-Words model based methods
   Advantages: Build a secure connection between the user's mail transfer agent and mail user agent.
   Disadvantages: Time consuming; large number of features; memory consuming.
2. Multiple Classification DM algorithms
   Advantages: Provide a clear idea of how effective each classifier is at spam detection.
   Disadvantages: These are non-standard classifiers.
3. Hybrid system
   Advantages: High level of accuracy by taking advantage of many classifiers.
   Disadvantages: Time consuming, because this technique has many layers to produce the final result.
4. Classifiers Model-Based Features
   Advantages: A high level of accuracy can be achieved.
   Disadvantages: Huge number of features; many classification algorithms, which means it is time consuming; higher cost; needs a large mail server and has high memory requirements.
5. Clustering Technique
   Advantages: The classification process is fast.
   Disadvantages: Less accuracy, because it depends on unsupervised learning; needs to be fed continuously.
6. Image based Spam Filtering
   Advantages: More efficient than text-based approaches.
   Disadvantages: Very costly; time consuming.

Table 1: Comparison of the advantages and disadvantages of machine learning techniques

III. Proposed Work
The figure shows the flow of the experiment performed. In this work the performance of various classification techniques is analyzed with and without the preprocessing technique. The steps to apply the classification techniques to the data set and obtain results in Weka are:
Step 1: Take the spambase dataset as input.
Step 2: Apply the classifier algorithm to the whole data set.
Step 3: Note the accuracy it gives and the time required for execution.
Step 4: Repeat Steps 2 and 3 for the different classification algorithms.
Step 5: Apply the Discretization filter and repeat Steps 2 and 3.
Step 6: Analyze the accuracy results of the different spam classifiers.

Figure 1: Workflow of Experimented Classification process.
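To make Step 5 more concrete, the sketch below shows one common form of unsupervised discretization, equal-width binning, in plain Python. It is an illustration under stated assumptions rather than the WEKA filter itself: the function name `equal_width_bins`, the choice of equal-width (rather than equal-frequency) intervals, and the sample word-frequency values are all hypothetical.

```python
from typing import List


def equal_width_bins(values: List[float], n_bins: int = 10) -> List[int]:
    """Map each numeric value to a bin index in [0, n_bins - 1].

    The range [min, max] of the attribute is split into n_bins intervals of
    equal width; no class labels are used, so the method is unsupervised.
    """
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute carries no information; put everything in bin 0.
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    bins = []
    for v in values:
        idx = int((v - lo) / width)
        bins.append(min(idx, n_bins - 1))  # the maximum value falls in the last bin
    return bins


# Hypothetical values of one word-frequency attribute for six messages.
word_freq = [0.0, 0.5, 1.2, 2.5, 4.9, 5.0]
print(equal_width_bins(word_freq, n_bins=5))  # [0, 0, 1, 2, 4, 4]
```

After this transformation each numeric attribute becomes a nominal one whose values are bin labels, which is the form the classifiers in Steps 2 and 3 then receive.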
IV. Experimental Result and Discussion
To investigate the performance of the selected classification algorithms, namely Naive Bayes (Bayes), Logistic (Functions), J48, PART, KStar, and FilteredClassifier, we use the experimental procedure suggested by WEKA, i.e. preprocessing -> classification -> evaluation of the different spam classifier algorithms. In WEKA, all data are considered as instances, and the features in the data are known as attributes [4]. For experimentation the UCI Spambase dataset is used, which has a total of 4601 instances. The results of the simulation are shown in Tables 2 and 3 below. Table 2 summarizes the accuracy and the time taken for each simulation when the Discretization filter is not used. Table 3 shows the accuracy and the time taken for each simulation when the Discretization filter is used for data preprocessing. Figures 1 and 2 are the graphical representations of the simulation result.

Columns: Algorithms; Correctly classified instances (%); Incorrectly classified instances (%); Time taken to build model
Rows: NaiveBayes, Logistic, KStar, FilteredClassifier, PART, J48
[Table values not preserved in this transcription.]
Table 2: Result Before Preprocessing of classification methods

Columns: Algorithms; Correctly classified instances (%); Incorrectly classified instances (%); Time taken to build model
Rows: NaiveBayes, Logistic, KStar, FilteredClassifier, PART, J48
[Table values not preserved in this transcription.]
Table 3: Result After Preprocessing using Discretization Filter of classification method

From Tables 2 and 3 we observe that more instances are correctly classified if the data are preprocessed using the Discretization filter. Table 2 shows the performance of the different classification algorithms. The Logistic classifier has the best performance, with 4346 of the 4601 instances correctly classified. From Table 3 we can see that if the dataset is preprocessed using the Discretization filter, the performance of the same classification algorithm increases.
With the Logistic classifier, 4346 of the 4601 instances are correctly classified. Thus, from the overall analysis, we can say that the performance of the classification algorithms can be improved by preprocessing the data set before applying them. The figure below shows the performance of the various algorithms used in WEKA. Every algorithm has its own strengths, and the choice among them depends on the behaviour of the data.

[Chart: correctly classified instances (%), incorrectly classified instances (%), and time taken to build the model, per algorithm.]
Figure 1: Classification methods comparative performance
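As a worked check on the figures quoted above, the sketch below converts the reported count (4346 of 4601 instances) into a percentage and shows how 4601 instances can be split into ten folds for 10-fold cross-validation. It is a simplified illustration: the helper names `accuracy_percent` and `k_fold_indices` are hypothetical, and the folds are taken in order without the shuffling or class stratification that evaluation tools commonly apply.

```python
from typing import Iterator, List, Tuple


def accuracy_percent(correct: int, total: int) -> float:
    """Percentage of correctly classified instances."""
    return 100.0 * correct / total


def k_fold_indices(n: int, k: int = 10) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of k contiguous folds.

    Fold sizes differ by at most one instance; every instance is used for
    testing exactly once and for training k - 1 times.
    """
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size


# Reported count for the Logistic classifier on the 4601 Spambase instances.
print(round(accuracy_percent(4346, 4601), 2))         # 94.46
print(round(accuracy_percent(4601 - 4346, 4601), 2))  # 5.54

fold_test_sizes = [len(test) for _, test in k_fold_indices(4601, k=10)]
print(fold_test_sizes)  # [461, 460, 460, 460, 460, 460, 460, 460, 460, 460]
```

The cross-validated accuracy is then the total number of correct predictions over all ten test folds divided by the number of instances.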
V. Conclusion
Spam mail is becoming a very serious problem for today's Internet community, threatening both the integrity of networks and the productivity of users. Through this work we have met our objective, which was to analyze six selected classification algorithms in Weka together with various spam filtering techniques. The results show that the best classifier for the UCI Spambase dataset is the Logistic classifier, and that the performance of each of the six algorithms can be improved if the dataset is preprocessed using the Discretization filter. Among the spam filtering techniques described, the hybrid approach gives the best spam mail filtering results in terms of higher accuracy and a lower false positive rate.

VI. References
[1] Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, The WEKA Data Mining Software: An Update.
[2] Ian H. Witten and Eibe Frank, (2005) "Data Mining: Practical Machine Learning Tools and Techniques," Second Edition, Morgan Kaufmann, San Francisco.
[3] Mohd Fauzi bin Othman, Thomas Moh Shan Yau, Comparison of Different Classification Techniques Using WEKA for Breast Cancer. Control and Instrumentation Department, Faculty of Electrical Engineering, Universiti Teknologi Malaysia, Skudai, Malaysia.
[4] Vaithiyanathan, K. Rajeswari, Kapil Tajane, Rahul Pitale. Comparison of Different Classification Techniques Using Different Datasets.
[5] Saadat Nazirova, Survey on Spam Filtering Techniques, Communications and Network, 2011, 3, Published Online August 2011.
[6] UCI Machine Learning Repository:
[7] R. Kishore Kumar, G. Poonkuzhali, P. Sudhakar, Member, Comparative Study on Spam techniques, IMECS 2012.
[8] Saadat Nazirova, Survey on Filtering Techniques, Science Research-Communication and Network (2011).
[9] M. Soranamageswari and C. Meena, Statistical Feature Extraction for Classification of Image Spam Using Artificial Neural Networks, Second International Conference on Machine Learning and Computing, Bangalore, 9-11 February, 2010, pp
[10] W.A. Awad and S.M. ELseuofi, Machine Learning Methods For Spam Classification, International Journal of Computer Science & Information Technology (IJCSIT), Vol 3, No 1, Feb
[11] T.-J. Liu, W.-L. Tsao and C.-L. Lee, A High Performance Image-Spam Filtering System, Ninth International Symposium on Distributed Computing and Applications to Business, Engineering and Science, August 2010, Hong Kong.
152 International Journal of Computer Science and Technology. Paavai Engineering College, India
IJCST Vol. 2, Issue 3, September 2011 Improving the Performance of Data Mining Algorithms in Health Care Data 1 P. Santhi, 2 V. Murali Bhaskaran, 1,2 Paavai Engineering College, India ISSN : 2229-4333(Print)
Predict Influencers in the Social Network
Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, [email protected] Department of Electrical Engineering, Stanford University Abstract Given two persons
Data Mining: Overview. What is Data Mining?
Data Mining: Overview What is Data Mining? Recently * coined term for confluence of ideas from statistics and computer science (machine learning and database methods) applied to large databases in science,
Customer Classification And Prediction Based On Data Mining Technique
Customer Classification And Prediction Based On Data Mining Technique Ms. Neethu Baby 1, Mrs. Priyanka L.T 2 1 M.E CSE, Sri Shakthi Institute of Engineering and Technology, Coimbatore 2 Assistant Professor
BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts
BIDM Project Predicting the contract type for IT/ITES outsourcing contracts Nandini Govindarajan (61210556) The authors believe that data modelling can be used to predict if an
DATA MINING AND REPORTING IN HEALTHCARE
DATA MINING AND REPORTING IN HEALTHCARE Divya Gandhi 1, Pooja Asher 2, Harshada Chaudhari 3 1,2,3 Department of Information Technology, Sardar Patel Institute of Technology, Mumbai,(India) ABSTRACT The
Practical Data Science with Azure Machine Learning, SQL Data Mining, and R
Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be
ON INTEGRATING UNSUPERVISED AND SUPERVISED CLASSIFICATION FOR CREDIT RISK EVALUATION
ISSN 9 X INFORMATION TECHNOLOGY AND CONTROL, 00, Vol., No.A ON INTEGRATING UNSUPERVISED AND SUPERVISED CLASSIFICATION FOR CREDIT RISK EVALUATION Danuta Zakrzewska Institute of Computer Science, Technical
Impact of Feature Selection Technique on Email Classification
Impact of Feature Selection Technique on Email Classification Aakanksha Sharaff, Naresh Kumar Nagwani, and Kunal Swami Abstract Being one of the most powerful and fastest way of communication, the popularity
Introduction to Data Mining
Introduction to Data Mining Jay Urbain Credits: Nazli Goharian & David Grossman @ IIT Outline Introduction Data Pre-processing Data Mining Algorithms Naïve Bayes Decision Tree Neural Network Association
Keywords data mining, prediction techniques, decision making.
Volume 5, Issue 4, April 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Analysis of Datamining
Predicting the Risk of Heart Attacks using Neural Network and Decision Tree
Predicting the Risk of Heart Attacks using Neural Network and Decision Tree S.Florence 1, N.G.Bhuvaneswari Amma 2, G.Annapoorani 3, K.Malathi 4 PG Scholar, Indian Institute of Information Technology, Srirangam,
Email Spam Detection A Machine Learning Approach
Email Spam Detection A Machine Learning Approach Ge Song, Lauren Steimle ABSTRACT Machine learning is a branch of artificial intelligence concerned with the creation and study of systems that can learn
Machine Learning using MapReduce
Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous
EFFICIENCY OF DECISION TREES IN PREDICTING STUDENT'S ACADEMIC PERFORMANCE
EFFICIENCY OF DECISION TREES IN PREDICTING STUDENT'S ACADEMIC PERFORMANCE S. Anupama Kumar 1 and Dr. Vijayalakshmi M.N 2 1 Research Scholar, PRIST University, 1 Assistant Professor, Dept of M.C.A. 2 Associate
AUTO CLAIM FRAUD DETECTION USING MULTI CLASSIFIER SYSTEM
AUTO CLAIM FRAUD DETECTION USING MULTI CLASSIFIER SYSTEM ABSTRACT Luis Alexandre Rodrigues and Nizam Omar Department of Electrical Engineering, Mackenzie Presbiterian University, Brazil, São Paulo [email protected],[email protected]
A Review of Data Mining Techniques
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 4, April 2014,
Data Mining - Evaluation of Classifiers
Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010
Prof. Pietro Ducange Students Tutor and Practical Classes Course of Business Intelligence 2014 http://www.iet.unipi.it/p.ducange/esercitazionibi/
Prof. Pietro Ducange Students Tutor and Practical Classes Course of Business Intelligence 2014 http://www.iet.unipi.it/p.ducange/esercitazionibi/ Email: [email protected] Office: Dipartimento di Ingegneria
How To Predict Web Site Visits
Web Site Visit Forecasting Using Data Mining Techniques Chandana Napagoda Abstract: Data mining is a technique which is used for identifying relationships between various large amounts of data in many
Categorical Data Visualization and Clustering Using Subjective Factors
Categorical Data Visualization and Clustering Using Subjective Factors Chia-Hui Chang and Zhi-Kai Ding Department of Computer Science and Information Engineering, National Central University, Chung-Li,
IDENTIFICATION OF SOFTWARE EROSION USING LOGISTIC REGRESSION
IDENTIFICATION OF SOFTWARE EROSION USING LOGISTIC REGRESSION Harinder Kaur 1, Raveen Bajwa 2 1 PG Student, CSE, Baba Banda Singh Bahadur Engg. College, Fatehgarh Sahib, (India) 2 Asstt. Prof.,
Tweaking Naïve Bayes classifier for intelligent spam detection
Tweaking Naïve Bayes classifier for intelligent spam detection Ankita Raturi 1 and Sunil Pranit Lal 2 1 University of California, Irvine, CA 92697, USA. [email protected] 2 School of Computing, Information
Financial Trading System using Combination of Textual and Numerical Data
Financial Trading System using Combination of Textual and Numerical Data Shital N. Dange Computer Science Department, Walchand Institute of Rajesh V. Argiddi Assistant Prof. Computer Science Department,
Data Quality Mining: Employing Classifiers for Assuring consistent Datasets
Data Quality Mining: Employing Classifiers for Assuring consistent Datasets Fabian Grüning Carl von Ossietzky Universität Oldenburg, Germany, [email protected] Abstract: Independent
A Secured Approach to Credit Card Fraud Detection Using Hidden Markov Model
A Secured Approach to Credit Card Fraud Detection Using Hidden Markov Model Twinkle Patel, Ms. Ompriya Kale Abstract: - As the usage of credit card has increased the credit card fraud has also increased
SURVEY PAPER ON INTELLIGENT SYSTEM FOR TEXT AND IMAGE SPAM FILTERING Amol H. Malge 1, Dr. S. M. Chaware 2
International Journal of Computer Engineering and Applications, Volume IX, Issue I, January 15 SURVEY PAPER ON INTELLIGENT SYSTEM FOR TEXT AND IMAGE SPAM FILTERING Amol H. Malge 1, Dr. S. M. Chaware 2
IDENTIFYING BANK FRAUDS USING CRISP-DM AND DECISION TREES
IDENTIFYING BANK FRAUDS USING CRISP-DM AND DECISION TREES Bruno Carneiro da Rocha 1,2 and Rafael Timóteo de Sousa Júnior 2 1 Bank of Brazil, Brasília-DF, Brazil [email protected] 2 Network Engineering
Impelling Heart Attack Prediction System using Data Mining and Artificial Neural Network
General Article International Journal of Current Engineering and Technology E-ISSN 2277 4106, P-ISSN 2347-5161 2014 INPRESSCO, All Rights Reserved Available at http://inpressco.com/category/ijcet Impelling
Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05
Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern KTI, TU Graz 2015-03-05 Outline 1 Introduction 2 Classification
BIG DATA IN HEALTHCARE THE NEXT FRONTIER
BIG DATA IN HEALTHCARE THE NEXT FRONTIER Divyaa Krishna Sonnad 1, Dr. Jharna Majumdar 2 2 Dean R&D, Prof. and Head, 1,2 Dept of CSE (PG), Nitte Meenakshi Institute of Technology Abstract: The world of
Data Mining Solutions for the Business Environment
Database Systems Journal vol. IV, no. 4/2013 21 Data Mining Solutions for the Business Environment Ruxandra PETRE University of Economic Studies, Bucharest, Romania [email protected] Over
ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA
ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA D.Lavanya 1 and Dr.K.Usha Rani 2 1 Research Scholar, Department of Computer Science, Sree Padmavathi Mahila Visvavidyalayam, Tirupati, Andhra Pradesh,
Maschinelles Lernen mit MATLAB
Maschinelles Lernen mit MATLAB Jérémy Huard Applikationsingenieur The MathWorks GmbH 2015 The MathWorks, Inc. 1 Machine Learning is Everywhere Image Recognition Speech Recognition Stock Prediction Medical
Scalable Developments for Big Data Analytics in Remote Sensing
Scalable Developments for Big Data Analytics in Remote Sensing Federated Systems and Data Division Research Group High Productivity Data Processing Dr.-Ing. Morris Riedel et al. Research Group Leader,
Extension of Decision Tree Algorithm for Stream Data Mining Using Real Data
Fifth International Workshop on Computational Intelligence & Applications IEEE SMC Hiroshima Chapter, Hiroshima University, Japan, November 10, 11 & 12, 2009 Extension of Decision Tree Algorithm for Stream
Data Mining with Weka
Data Mining with Weka Class 1 Lesson 1 Introduction Ian H. Witten Department of Computer Science University of Waikato New Zealand weka.waikato.ac.nz Data Mining with Weka a practical course on how to
First Semester Computer Science Students Academic Performances Analysis by Using Data Mining Classification Algorithms
First Semester Computer Science Students Academic Performances Analysis by Using Data Mining Classification Algorithms Azwa Abdul Aziz, Nor Hafieza Ismail and Fadhilah Ahmad Faculty Informatics & Computing
8. Machine Learning Applied Artificial Intelligence
8. Machine Learning Applied Artificial Intelligence Prof. Dr. Bernhard Humm Faculty of Computer Science Hochschule Darmstadt University of Applied Sciences 1 Retrospective Natural Language Processing Name
ANALYSIS OF FEATURE SELECTION WITH CLASSFICATION: BREAST CANCER DATASETS
ANALYSIS OF FEATURE SELECTION WITH CLASSFICATION: BREAST CANCER DATASETS Abstract D.Lavanya * Department of Computer Science, Sri Padmavathi Mahila University Tirupati, Andhra Pradesh, 517501, India [email protected]
Knowledge Discovery and Data Mining
Knowledge Discovery and Data Mining Unit # 11 Sajjad Haider Fall 2013 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unsupervised Identification of right
Anti-Spam Filter Based on Naïve Bayes, SVM, and KNN model
AI TERM PROJECT GROUP 14 1 Anti-Spam Filter Based on Naïve Bayes, SVM, and KNN model Yun-Nung Chen, Che-An Lu, Chao-Yu Huang Abstract spam email filters are a well-known and powerful type of filters. We construct different
Comparative Analysis of EM Clustering Algorithm and Density Based Clustering Algorithm Using WEKA tool.
International Journal of Engineering Research and Development e-issn: 2278-067X, p-issn: 2278-800X, www.ijerd.com Volume 9, Issue 8 (January 2014), PP. 19-24 Comparative Analysis of EM Clustering Algorithm
Heart Disease Diagnosis Using Predictive Data mining
ISSN (Online) : 2319-8753 ISSN (Print) : 2347-6710 International Journal of Innovative Research in Science, Engineering and Technology Volume 3, Special Issue 3, March 2014 2014 International Conference
Data quality in Accounting Information Systems
Data quality in Accounting Information Systems Comparing Several Data Mining Techniques Erjon Zoto Department of Statistics and Applied Informatics Faculty of Economy, University of Tirana Tirana, Albania
Clustering Connectionist and Statistical Language Processing
Clustering Connectionist and Statistical Language Processing Frank Keller [email protected] Computerlinguistik Universität des Saarlandes Clustering p.1/21 Overview clustering vs. classification supervised
Data Mining. Knowledge Discovery, Data Warehousing and Machine Learning Final remarks. Lecturer: JERZY STEFANOWSKI
Data Mining Knowledge Discovery, Data Warehousing and Machine Learning Final remarks Lecturer: JERZY STEFANOWSKI Email: [email protected] Data Mining a step in A KDD Process Data mining:
Experiments in Web Page Classification for Semantic Web
Experiments in Web Page Classification for Semantic Web Asad Satti, Nick Cercone, Vlado Kešelj Faculty of Computer Science, Dalhousie University E-mail: {rashid,nick,vlado}@cs.dal.ca Abstract We address
