Efficient Spam Classification by Appropriate Feature Selection


Global Journal of Computer Science and Technology: Software & Data Engineering, Volume 13 Issue 5 Version 1.0, Year 2013. Type: Double Blind Peer Reviewed International Research Journal. Publisher: Global Journals Inc. (USA). Online ISSN: & Print ISSN:

Efficient Spam Classification by Appropriate Feature Selection

By Prajakta Ozarkar & Dr. Manasi Patwardhan, Vishwakarma Institute of Technology, India

Abstract - Spam is a key problem in electronic communication, including large-scale e-mail systems and the growing number of blogs. Currently a lot of research work is being performed on automatic detection of spam e-mails using classification techniques such as SVM, NB, MLP, KNN, ID3, J48, Random Tree, etc. For a spam dataset it is possible to have a large number of training instances. Based on this fact, we have made use of the Random Forest and Partial Decision Trees algorithms to classify spam vs. non-spam e-mails. These algorithms outperformed the previously implemented algorithms in terms of accuracy and time complexity. As a preprocessing step we have used feature selection methods such as Chi-square, Information gain, Gain ratio, Symmetrical uncertainty, Relief, OneR and Correlation. This allowed us to select a subset of relevant, non-redundant and most contributing features, with the added benefit of improved accuracy and reduced time complexity.

Keywords: feature selection, preprocessing, random forest, part. GJCST-C Classification: H.4.3

Strictly as per the compliance and regulations of: Prajakta Ozarkar & Dr. Manasi Patwardhan. This is a research/review paper, distributed under the terms of the Creative Commons Attribution-Noncommercial 3.0 Unported License, permitting all noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Efficient Spam Classification by Appropriate Feature Selection

Prajakta Ozarkar α & Dr. Manasi Patwardhan σ

I. Introduction

In this paper we have studied previous approaches used for classifying spam and non-spam e-mails with distinct classification algorithms. We have also studied the distinct features extracted for classifier training and the feature selection algorithms applied to get rid of irrelevant features and select the most contributing ones. After studying the current feature selection and classification approaches, we have applied two new classification techniques, viz. Random Forest and Partial Decision Trees, along with distinct feature selection algorithms. R. Parimala, et al. [1] present a new feature selection technique which is guided by the FSelector package.
They have used nine feature selection techniques, namely Correlation-based feature selection, Chi-square, Entropy, Information Gain, Gain Ratio, Mutual Information, Symmetrical Uncertainty, OneR and Relief, and five classification algorithms, Linear Discriminant Analysis, Random Forest, Rpart, Naïve Bayes and Support Vector Machine, on the spambase dataset. In their evaluation, the results show that the filter methods Correlation, Chi-squared, Gain Ratio, Relief, Symmetrical Uncertainty, Information Gain and OneR enable the classifiers to achieve the highest increase in classification accuracy. They conclude that the implemented feature selection techniques can improve the accuracy of Support Vector Machine classifiers.

(Author α: Prajakta Ozarkar, Student, Vishwakarma Institute of Technology, Pune, Maharashtra, India. Author σ: Manasi Patwardhan, Professor, Vishwakarma Institute of Technology, Pune, Maharashtra, India.)

In the paper by R. Kishore Kumar, et al. [2], the spam e-mail dataset is analyzed using the Tanagra data mining tool. Initially, feature construction and feature selection are done to extract the relevant features using Fisher filtering, Relief, Runs filtering and StepDisc. Then classification algorithms such as C4.5, C-PLS, C-RT, CS-CRT, CS-MC4, CS-SVC, ID3, K-NN, LDA, Log Reg TRIRLS, Multilayer Perceptron, Multinomial Logistic Regression, Naïve Bayes Continuous, PLS-DA, PLS-LDA, Rnd Tree and SVM are applied over the spambase dataset, and cross validation is done for each of these classifiers. They conclude that the Fisher filtering and Runs filtering feature selection algorithms perform better for many classifiers. The Rnd Tree classification algorithm, with the relevant features extracted by Fisher filtering, produces more than 99% accuracy in spam detection. W. A. Awad, et al. [3] review the machine learning methods Bayesian classification, k-NN, ANNs, SVMs, Artificial Immune System and Rough Sets on the SpamAssassin spam corpus.
They conclude that the Naïve Bayes method has the highest precision among the six algorithms, while the k-nearest neighbor has the worst precision percentage. Also, the Rough Sets method has a very competitive percentage. The work by V. Christina, et al. [4] employs supervised machine learning techniques, namely the C4.5 decision tree classifier, Multilayer Perceptron and Naïve Bayes classifier. Five feature sets of an e-mail, all (A), header (H), body (B), subject (S), and body with subject (B+S), are used to evaluate the performance of the machine learning algorithms. The training dataset, a spam and legitimate message corpus, is generated from the mails they received from their institute mail server over a period of six months. They conclude that the Multilayer Perceptron classifier outperforms the other classifiers, and its false positive rate is also very low compared to the other algorithms. Rafiqul Islam, et al. [5] have presented an effective and efficient e-mail classification technique based on a data filtering method. In their testing they have introduced an innovative filtering technique using an instance selection method (ISM) to reduce the pointless data instances from the training model and then classify the test data. In their model, tokenization and domain

specific feature selection methods are used for feature extraction. Behavioral features are also included to improve performance, especially for reducing false positive (FP) problems. The behavioral features include the frequency of sending/receiving e-mails, attachment, type of attachment, size of attachment, and length of the e-mail. In their experiment, they have tested five base classifiers, Naive Bayes, SVM, IB1, Decision Table and Random Forest, on 6 different datasets. They have also tested adaptive boosting (AdaBoostM1) as a meta-classifier on top of the base classifiers. They have achieved overall classification accuracy above 97%. A comparative analysis is performed by Ms. D. Karthika Renuka, et al. [6] of the classification techniques MLP, J48 and Naïve Bayesian for classifying spam messages, using the WEKA tool. The dataset gathered from the UCI repository had 2788 legitimate and 1813 spam e-mails received during a period of several months. Using this dataset as a training dataset, models are built for the classification algorithms. The study reveals that the same classifier performed dissimilarly when run on the same dataset but using different software tools. Thus, from all perspectives MLP is the top performer in all cases and can be deemed consistent. The following table summarizes the previous classification approaches listed above and compares, where available, the % accuracy achieved with the application of a specific feature selection algorithm.

Table 1: Comparison of previous approaches of spam detection

- R. Parimala, et al. [1]: SVM trained on varying percentages of the feature set, with no feature selection (93% accuracy at 100% of features) and with feature subsets selected by Correlation (16%), Chi-square (70%), IG (70%), GR (70%), SU (70%), OneR (70%), Relief (70%), LDA (32%), Rpart (12%), SVM (16%), RF (21%) and NB (7%).
- R. Kishore Kumar, et al. [2]: the classifiers C4.5, C-PLS, C-RT, CS-CRT, CS-MC4, CS-SVC, ID3, KNN, LDA, LogReg TRIRLS, MLP, Multinomial LR, NBC, PLS-DA, PLS-LDA, Rnd Tree and SVM, each evaluated with Fisher filtering, Relief, Runs filtering and StepDisc.
- W. A. Awad, et al. [3]: NBC, SVM, KNN, ANN, AIS and Rough Sets.
- V. Christina, et al. [4]: NBC, J48 and MLP.
- Rafiqul Islam, et al. [5]: NB, SMO, IB1, DT and RF.
- Ms. D. Karthika Renuka, et al. [6]: MLP (93%), J48 (92%), NBC (89%).

II. Proposed Work

After a detailed review of the existing techniques used for spam detection, in this section we illustrate the methodology and techniques we used for spam mail detection.

Figure 1 shows the process we have used for spam mail identification and how it is used in conjunction with a machine learning scheme. Feature ranking techniques such as Chi-square, Information gain, Gain ratio, Symmetrical uncertainty, Relief, OneR and Correlation are applied to a copy of the training data. After feature selection, the subset with the highest merit is used to reduce the dimensionality of both the original training data and the testing data. Both reduced datasets may then be passed to a machine learning scheme for training and testing. Results are obtained using the Random Forest and PART classification techniques.

Figure 1: Stages of Spam Classification

In the following subsections we discuss the basic concepts related to our work. This includes a brief background on feature ranking techniques, classification techniques and results.

III. Data Set

The dataset used for our experiment is spambase [13]. The last column of 'spambase.data' denotes whether the e-mail was considered spam (1) or not (0). Most of the attributes indicate the frequency of occurrence of spam-related terms. The first set of 48 attributes (1-48) give tf-idf (term frequency and inverse document frequency) values for spam-related words, whereas the next 6 attributes (49-54) provide tf-idf values for spam-related characters. The run-length attributes (55-57) measure the length of sequences of consecutive capital letters: capital_run_length_average, capital_run_length_longest and capital_run_length_total. Thus, our dataset has in total 57 attributes serving as input features for spam detection, and the last attribute represents the class (spam/non-spam). We have also used one public dataset, Enron [20]. Its preprocessed subdirectory contains the e-mail messages in preprocessed format, with each message in a separate text file. The body of an e-mail contains the actual information. This information needs to be extracted before running a filter, by means of preprocessing.
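As an illustration, reading the spambase file into feature vectors and class labels can be sketched as below. This is a minimal stdlib-only sketch, not the authors' implementation; it assumes a local copy of the comma-separated UCI file, and the two-row synthetic sample used for the demonstration is ours.

```python
import csv
import os
import tempfile

def load_spambase(path):
    """Load a UCI spambase-style file: 57 numeric attributes + class label (1=spam, 0=ham)."""
    X, y = [], []
    with open(path) as f:
        for row in csv.reader(f):
            if not row:
                continue
            *features, label = row
            X.append([float(v) for v in features])
            y.append(int(label))
    return X, y

# Demonstrate on a tiny synthetic sample in the same 58-column format
# (2 fake rows; the real file has 4601 rows).
sample = ",".join(["0.1"] * 57) + ",1\n" + ",".join(["0.0"] * 57) + ",0\n"
tmp = tempfile.NamedTemporaryFile("w", suffix=".data", delete=False)
tmp.write(sample)
tmp.close()
X, y = load_spambase(tmp.name)
os.unlink(tmp.name)
print(len(X[0]), y)  # 57 [1, 0]
```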
The purpose of preprocessing is to transform the e-mail messages into a uniform format that can be understood by the learning algorithm. The following steps are involved in preprocessing:

1. Feature extraction (Tokenization): extracting features from the e-mail into a vector space.
2. Stemming: a process for removing the commoner morphological and inflexional endings from words in English.
3. Stop word removal: removal of non-informative words.
4. Noise removal: removal of obscure text or symbols from features.
5. Representation: tf-idf is a statistical measure used to calculate how significant a word is to a document in a feature corpus. Word frequency is established by term frequency (tf); the number of times the word appears in the message yields the significance of the word to the document. The term frequency is then multiplied by the inverse document frequency (idf), which down-weights words that occur in many messages.

IV. Feature Ranking and Subset Selection

From the above defined feature vector of 58 features in total, we use feature ranking and selection algorithms to select a subset of features. We rank the given set of features using the following distinct approaches.

a) Chi-square

Chi-squared hypothesis tests may be performed on contingency tables in order to decide whether or not effects are present. Effects in a contingency table are defined as relationships between the row and column variables; that is, whether the levels of the row variable are differentially distributed over the levels of the column variable. Significance in this hypothesis test means that interpretation of the cell frequencies is warranted. Non-significance means that any differences in cell frequencies could be explained by chance. Hypothesis tests on contingency tables are based on a statistic called Chi-square [8]:

χ² = Σ (O − E)² / E

where O is the observed cell frequency and E is the expected cell frequency.
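The chi-square statistic above can be computed from a word-presence vs. class contingency table. The following is a minimal sketch; the toy counts are illustrative, not taken from the spambase data.

```python
def chi_square(observed):
    """Chi-square statistic for a contingency table:
    sum over cells of (O - E)^2 / E, where E = row_total * col_total / grand_total."""
    row_tot = [sum(r) for r in observed]
    col_tot = [sum(c) for c in zip(*observed)]
    grand = sum(row_tot)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_tot[i] * col_tot[j] / grand
            chi2 += (o - e) ** 2 / e
    return chi2

# Rows: word present / word absent; columns: spam / ham (illustrative counts)
table = [[30, 10],   # word present: 30 spam, 10 ham
         [20, 40]]   # word absent : 20 spam, 40 ham
print(round(chi_square(table), 3))  # 16.667 -> the word is strongly class-dependent
```

A high score means the word's presence and the class label are far from independent, so the word ranks as a useful feature.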
b) Information Gain

Information Gain is the expected reduction in entropy caused by partitioning the examples according to a given attribute. Information gain is a symmetrical

measure; that is, the amount of information gained about Y after observing X is equal to the amount of information gained about X after observing Y. The entropy of Y is given by [9]

H(Y) = − Σ_{y∈Y} p(y) log2 p(y)

If the observed values of Y in the training data are partitioned according to the values of a second feature X, and the entropy of Y with respect to the partitions induced by X is less than the entropy of Y prior to partitioning, then there is a relationship between features Y and X. The following equation gives the entropy of Y after observing X:

H(Y|X) = − Σ_{x∈X} p(x) Σ_{y∈Y} p(y|x) log2 p(y|x)

The amount by which the entropy of Y decreases reflects additional information about Y provided by X and is called the information gain or, alternatively, mutual information [9]. Information gain is given by

Gain = H(Y) − H(Y|X) = H(X) − H(X|Y) = H(X) + H(Y) − H(X,Y)

c) Gain Ratio

The various selection criteria have been compared empirically in a series of experiments. When all attributes are binary, the gain ratio criterion has been found to give considerably smaller decision trees. When the task includes attributes with large numbers of values, the subset criterion gives smaller decision trees that also have better predictive performance, but can require much more computation. However, when these many-valued attributes are augmented by redundant attributes which contain the same information at a lower level of detail, the gain ratio criterion gives decision trees with the greatest predictive accuracy. All in all, this suggests that the gain ratio criterion does pick a good attribute for the root of the tree [12].

Gain Ratio = (H(Y) + H(X) − H(Y,X)) / H(X)

d) Symmetrical Uncertainty

Information gain is a symmetrical measure; that is, the amount of information gained about Y after observing X is equal to the amount of information gained about X after observing Y. Symmetry is a desirable property for a measure of feature-feature intercorrelation to have.
Unfortunately, information gain is biased in favor of features with more values. Symmetrical uncertainty compensates for information gain's bias toward attributes with more values and normalizes its value to the range [0, 1] [9]:

Symmetrical Uncertainty = 2.0 × Gain / (H(Y) + H(X))

e) Relief

Relief [10] is a feature weighting algorithm that is sensitive to feature interactions. Relief attempts to approximate the following difference of probabilities for the weight of a feature X [9]:

W_X = P(different value of X | nearest instance of different class) − P(different value of X | nearest instance of same class)

By removing the context sensitivity provided by the "nearest instance" condition, attributes are treated as independent of one another:

Relief_X = P(different value of X | different class) − P(different value of X | same class)

which can be reformulated as

Relief_X = Gini′ × Σ_{x∈X} p(x)² / [(1 − Σ_{c∈C} p(c)²) Σ_{c∈C} p(c)²]

where C is the class variable and

Gini′ = Σ_{c∈C} p(c)(1 − p(c)) − Σ_{x∈X} [p(x)² / Σ_{x∈X} p(x)²] Σ_{c∈C} p(c|x)(1 − p(c|x))

f) OneR

Like other empirical learning methods, 1R [11] takes as input a set of examples, each with several attributes and a class. The aim is to infer a rule that predicts the class given the values of the attributes. The 1R algorithm chooses the most informative single attribute and bases the rule on this attribute alone. The basic idea is: for each attribute a, form a rule as follows. For each value v from the domain of a, select the set of instances where a has value v, let c be the most frequent class in that set, and add the following clause to the rule for a: if a has value v then the class is c. Calculate the classification accuracy of this rule, and finally use the rule with the highest classification accuracy.
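The entropy-based rankers above (information gain, symmetrical uncertainty) and the 1R rule score can be sketched as follows. This is an illustrative stdlib-only sketch, not the authors' code; the four-instance toy feature is ours.

```python
from collections import Counter
from math import log2

def entropy(values):
    """H(Y) = -sum p(y) log2 p(y) over the observed value distribution."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def info_gain(feature, labels):
    """Gain = H(Y) - H(Y|X): class entropy minus the entropy remaining
    after partitioning the instances by the feature's values."""
    n = len(labels)
    h_cond = sum(
        feature.count(v) / n * entropy([y for x, y in zip(feature, labels) if x == v])
        for v in set(feature))
    return entropy(labels) - h_cond

def symmetrical_uncertainty(feature, labels):
    """SU = 2 * Gain / (H(X) + H(Y)), normalized to [0, 1]."""
    denom = entropy(feature) + entropy(labels)
    return 2.0 * info_gain(feature, labels) / denom if denom else 0.0

def one_r_accuracy(feature, labels):
    """1R: predict the most frequent class for each attribute value; score the rule."""
    rule = {v: Counter(y for x, y in zip(feature, labels) if x == v).most_common(1)[0][0]
            for v in set(feature)}
    return sum(rule[x] == y for x, y in zip(feature, labels)) / len(labels)

x = [1, 1, 0, 0]                    # a discrete feature (e.g. "word present")
y = ['spam', 'spam', 'ham', 'ham']  # class labels
print(info_gain(x, y), symmetrical_uncertainty(x, y), one_r_accuracy(x, y))  # 1.0 1.0 1.0
```

Here the feature determines the class exactly, so all three scores reach their maximum; an uninformative feature would score near zero (or near 0.5 accuracy for 1R).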
The algorithm assumes that the attributes are discrete. If not, they must be discretized.

g) Correlation

Feature selection for classification tasks in machine learning can be accomplished on the basis of correlation between features, and such a feature selection procedure can be beneficial to common machine learning algorithms [9]. Features are relevant if their values vary systematically with category membership. In other words, a feature is useful if it is correlated with or predictive of the class; otherwise it is irrelevant. A good feature subset is one that contains features highly correlated with (predictive of) the class, yet uncorrelated with (not predictive of) each other. The acceptance of a feature will depend on the extent to which it predicts classes in areas of the instance space not already predicted by other features. Correlation

based feature selection uses the following feature subset evaluation function [9]:

Merit_S = k · r̄_cf / sqrt(k + k(k − 1) · r̄_ff)

where Merit_S is the heuristic merit of a feature subset S containing k features, r̄_cf is the mean feature-class correlation, and r̄_ff is the average feature-feature intercorrelation.

Feature ranking further helps us to:

1. Remove irrelevant features, which might mislead the classifier, decreasing classifier interpretability and reducing generalization by increasing overfitting.
2. Remove redundant features, which provide no additional information beyond the other features and unnecessarily decrease the efficiency of the classifier.
3. Select high-rank features, which may not matter much as far as improving precision and recall is concerned, but reduces time complexity drastically. Selection of such high-rank features reduces the dimensionality of the feature space of the domain. It speeds up the classifier, thereby improving performance and increasing the comprehensibility of the classification result.

We have considered 87%, 77% and 70% of the features, and observed a performance improvement at 70% feature consideration.

V. Classification Method

Based on the assumption that the given dataset has enough training instances, we have chosen the following two classification algorithms. The algorithms work well based on the fact that the dataset is of good quality.

a) Random Forest

Random Forests [14] are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Each tree is grown as follows: 1.
If the number of cases in the training set is N, sample N cases at random, but with replacement, from the original data. This sample will be the training set for growing the tree.
2. If there are M input variables, a number m << M is specified such that at each node, m variables are selected at random out of the M, and the best split on these m is used to split the node. The value of m is held constant during forest growing.
3. Each tree is grown to the largest extent possible. There is no pruning.

Random Forest is an ensemble of trees. In our implementation of random forest we have selected a vector of 4 randomly chosen features to build each tree in a forest of 10 random trees. Each tree grows to its maximum depth, as the depth argument is set to zero, indicating unlimited depth. Classification is done using bagging and voting techniques. For example, a sample part of the output of the forest (a very small portion of the forest) is shown below:

Total Random forest Trees: 10
Number of random features: 4
Out of bag error:
All the trees in the forest:
RandomTree
==========
word_freq_hpl < 0.07
char_freq_$ < 0.03
word_freq_you < 0.12
word_freq_hp < 0.02
char_freq_! < 0.01
word_freq_3d < 9.87
word_freq_000 < 0.08
char_freq_( < 0.04
word_freq_meeting < 0.85
word_freq_remove < 2.27
word_freq_free < 6.47
word_freq_will < 0.17
word_freq_pm < 0.42
word_freq_all < 0.21
word_freq_mail < 2.96
word_freq_re < 5.4
word_freq_technology < 1.43
capital_run_length_total < 18.5
word_freq_re < 0.68
word_freq_make < 1.39
capital_run_length_total < 10.5 : 0 (218/0)
capital_run_length_total >= 10.5
word_freq_internet < 0.89
word_freq_people < 1.47
word_freq_data < 3.7
word_freq_edu < 2.38
char_freq_[ < 0.59
char_freq_; < 0.16
capital_run_length_total < 11.5
word_freq_credit < 9.09 : 0 (1/0)

This is the case when 100% of the features are selected for the training model; accordingly, the root node of each tree changes.
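The tree-growing procedure above (bootstrap sampling of N cases, m random features per tree, majority voting) can be sketched in miniature. This is not the authors' implementation: it is a stdlib-only illustration that grows one-level trees (decision stumps) instead of full-depth trees, while keeping the paper's settings of 10 trees and 4 random features.

```python
import random
from collections import Counter

def best_stump(X, y, feat_ids):
    """Exhaustive one-level 'tree': pick (feature, threshold, class_above)
    with the highest accuracy on the bootstrap sample."""
    best_acc, best = -1.0, None
    for f in feat_ids:
        for t in sorted(set(row[f] for row in X)):
            for hi in (0, 1):                      # class predicted when value > t
                pred = [hi if row[f] > t else 1 - hi for row in X]
                acc = sum(p == c for p, c in zip(pred, y)) / len(y)
                if acc > best_acc:
                    best_acc, best = acc, (f, t, hi)
    return best

def grow_forest(X, y, n_trees=10, m_features=4, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]            # bootstrap: N cases with replacement
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        feats = rng.sample(range(len(X[0])), m_features)    # m random features per tree (m << M)
        forest.append(best_stump(Xb, yb, feats))
    return forest

def predict(forest, row):
    votes = [hi if row[f] > t else 1 - hi for f, t, hi in forest]
    return Counter(votes).most_common(1)[0][0]              # majority vote over the ensemble

# Toy data: 5 features, each individually separating spam (high values) from ham (low)
X = [[0.9] * 5] * 3 + [[0.1] * 5] * 3
y = [1, 1, 1, 0, 0, 0]
forest = grow_forest(X, y)
print(len(forest), predict(forest, [0.8] * 5))  # 10 trees vote a high-valued row as spam: 10 1
```

A real random forest repeats the random feature selection at every node of a full-depth tree; the stump keeps the bagging and voting machinery visible in a few lines.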

b) Partial Decision Tree

Rule learners are prominent representatives of supervised machine learning approaches. Basically, this type of learner tries to induce a set of rules from a collection of training instances. These rules are then applied to the test instances for classification purposes. Two well-known members of the family of rule learners are C4.5 and RIPPER. C4.5 [16], for instance, generates an unpruned decision tree and transforms this tree into a set of rules. For each path from the root node to a leaf, a rule is generated. Then each rule is simplified separately, followed by a rule-ranking strategy. Finally, the algorithm deletes rules from the rule set as long as the rule set's error rate on the training instances decreases. RIPPER [17] implements a divide-and-conquer strategy for rule induction. Only one rule is generated at a time, and the instances from the training set covered by this rule are removed. It iteratively derives new rules for the remaining instances of the training set. PART (Partial Decision Trees) adopts the divide-and-conquer strategy of RIPPER [17] and combines it with the decision tree approach of C4.5 [16]. PART generates a set of rules according to the divide-and-conquer strategy, removes all instances from the training collection that are covered by a rule, and proceeds recursively until no instance remains. To generate a single rule, PART builds a partial decision tree for the current set of instances and chooses the leaf with the largest coverage as the new rule. For example, the following rules illustrate the rule formation in our implementation of PART:

Rule 1: word_freq_remove > 0.0 AND char_freq_! > AND word_freq_edu <= 0.06 : 1 (Instances: 490, Incorrect: 7)

After Rule 1, the next set of rules is formed excluding these 490 instances from the 4601 total instances of spambase.
Rule 2: char_freq_$ > AND word_freq_hp <= 0.4 AND capital_run_length_longest > 9.0 AND word_freq_1999 <= 0.0 AND word_freq_edu <= 0.08 AND char_freq_! > 0.107 : 1 (Instances: 334, Incorrect: 2)

The next set of rules is formed on the remaining 3777 instances of spambase.

Rule 3: word_freq_money <= 0.03 AND word_freq_000 <= 0.25 AND word_freq_remove <= 0.26 AND word_freq_free <= 0.19 AND word_freq_font <= 0.12 AND char_freq_! <= AND char_freq_$ <= 0.172 AND word_freq_george > 0.0 : 0 (Instances: 553, Incorrect: 0)

In total, 42 rules are formulated during training.

VI. Results

a) Spambase Results

The spambase dataset was taken from the UCI machine learning repository [13]. It contains 4601 instances and 58 attributes: 57 continuous attributes and 1 nominal class label. The spam classification has been implemented in Eclipse, considered by many to be the best Java development tool available. Feature ranking and feature selection are done using the methods Chi-square, Information gain, Gain ratio, Symmetrical uncertainty, Relief, OneR and Correlation as a preprocessing step, so as to select the feature subset for building the learning model. The classification algorithms are from the decision tree family, viz. Random Forest and Partial Decision Trees. Random forest is an effective tool in prediction. Because of the law of large numbers, random forests do not overfit. Random inputs and random features produce good results in classification, less so in regression. For larger data sets, it seems that significantly lower error rates are possible [14]. The feature space can be reduced by a magnitude of 10 while achieving similar classification results; for example, it takes about 2,000 features to achieve accuracies similar to those obtained with 149 PART features [15]. As part of our implementation, we have divided the dataset into two parts: 80% of the dataset is used for training and 20% for testing.
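The 80/20 division can be sketched as a shuffled hold-out split. The index-based helper below is an illustration under our own naming, not the authors' code.

```python
import random

def train_test_split(X, y, test_frac=0.2, seed=42):
    """Shuffle the instance indices, then hold out the last test_frac for testing."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * (1 - test_frac))
    tr, te = idx[:cut], idx[cut:]
    return ([X[i] for i in tr], [y[i] for i in tr],
            [X[i] for i in te], [y[i] for i in te])

# Toy stand-in for the 4601-instance dataset: 100 one-feature instances
X = [[i] for i in range(100)]
y = [i % 2 for i in range(100)]
Xtr, ytr, Xte, yte = train_test_split(X, y)
print(len(Xtr), len(Xte))  # 80 20
```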
After the preprocessing step, the top 87%, 77% and 70% of features are considered while building the training model and testing, because there is a significant performance improvement. Prediction accuracy, correctly classified instances, incorrectly classified instances, the confusion matrix and time complexity are used as performance measures of the system. More than 99% prediction accuracy is achieved by Random Forest with all seven feature selection methods in consideration, whereas 97% prediction accuracy is achieved by PART with almost all of the seven feature selection methods while training the model. Training and testing results when 100% of the features are considered are given in Table 2.

Table 2: Results of 100% feature selection (columns: Classifier, Training, Testing, Time; rows: Random Forest, PART)
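The performance measures listed above (prediction accuracy, correctly/incorrectly classified instances, confusion matrix, and wall-clock time) can be sketched as follows; the toy label vectors are illustrative only.

```python
import time
from collections import Counter

def confusion_matrix(truth, pred):
    """2x2 counts: rows = actual class (0=ham, 1=spam), columns = predicted class."""
    m = [[0, 0], [0, 0]]
    for t, p in zip(truth, pred):
        m[t][p] += 1
    return m

def accuracy(truth, pred):
    """Fraction of correctly classified instances."""
    return sum(t == p for t, p in zip(truth, pred)) / len(truth)

truth = [1, 1, 1, 0, 0, 0]
pred  = [1, 1, 0, 0, 0, 1]
start = time.perf_counter()
m = confusion_matrix(truth, pred)
elapsed_ms = (time.perf_counter() - start) * 1000  # wall-clock ms, as in the paper's timings
print(m, Counter(t == p for t, p in zip(truth, pred)))  # [[2, 1], [1, 2]] and 4 correct / 2 wrong
```

Here 4 of 6 instances are classified correctly (accuracy 0.667), with one false positive and one false negative visible off the diagonal.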

Both training and testing results on the spambase dataset after feature ranking and subset selection are shown in Table 3 and Table 4.

Table 3: Training Results (RF and PART accuracy and time for the Chi-square, Infogain, Gainratio, Relief, SU, OneR and Correlation selectors at 87%, 77% and 70% of the features)

Table 4: Testing Results (RF and PART accuracy for the same selectors and feature percentages)

From the results above, it can be observed that for Random Forest, after using 87% of the extracted feature set the training accuracy is 96.012%, whereas the computation time reduced by about 52% (from 9466 ms to 4584 ms). This shows that the remaining 13% of features were not contributing towards the classification. It can also be observed that for PART, after using 87% and 77% of the extracted feature set the training accuracy increased. There is a significant improvement of 1% at 87% feature selection, and the computation time is reduced (to 5961 ms). This shows that the remaining 30% of features were not only redundant but were also misleading the classification.

VII. Enron Results

More than 96% prediction accuracy is achieved by Random Forest with all seven feature selection methods in consideration, whereas more than 95% prediction accuracy is achieved by PART with almost all of the seven feature selection methods while training the model. Training and testing results when 100% of the features are considered are given in Table 5.

Table 5: Results of 100% feature selection (columns: Classifier, Training, Testing, Time; rows: Random Forest, PART)

Both training and testing results after feature ranking and subset selection are shown in Table 6 and Table 7.

Table 6: Training Results (RF and PART accuracy and time for the seven selectors at 87%, 77% and 70% of the features)

Table 7: Testing Results (RF and PART accuracy for the seven selectors at 87%, 77% and 70% of the features)

G-mail Dataset Test Results

Further, we have tested our Enron model on a dataset created using the e-mails we received in our Gmail accounts during the last 3 months. The results are shown in Table 8. In this experiment the test dataset is completely non-overlapping with the training set, allowing us to truly evaluate the performance of our system.

VIII. Conclusion

In this paper we have studied previous approaches to spam detection using machine learning methodologies. We have compared and evaluated the approaches based on factors such as the dataset used; the features extracted, ranked and selected; the feature selection algorithms applied; and the results obtained in terms of accuracy (precision, recall and error rate) and performance (time required). The datasets available for spam detection are large, and for such large datasets Random Forest and PART tend to produce better results, with lower error rates and higher precision. So we used these two classifiers for spam classification. For the spambase dataset, we acquired the best percentage
accuracy of % with Random Forest, which is 9% better than previous spambase approaches, and % with PART. For the Enron dataset, we acquired the best percentage accuracy of % with Random Forest and % with PART. The Enron dataset is also used by [21] in an unsupervised spam learning and detection scheme. The feature selection algorithms used also contributed to better accuracy with lower time complexity, due to dimensionality reduction. For Random Forest, after using 70% of the extracted feature set on the spambase dataset, the training accuracy remained the same (99.918%) whereas the computation time reduced by 20% (from 1540 ms to 1276 ms); for PART, the training accuracy increased by 1.521% and computation time reduced by 52% (from 4938 ms to 2409 ms).

Table 8: Personal Dataset Testing Results (Classifier / Testing Accuracy): Random Forest 96, PART

References

1. R. Parimala, Dr. R. Nallaswamy (National Institute of Technology), "A Study of Spam E-mail Classification using Feature Selection Package", Global Journal of Computer Science and Technology, Volume 11 Issue 7 Version 1.0, May.
2. R. Kishore Kumar, G. Poonkuzhali, P. Sudhakar, "Comparative Study on E-mail Spam Classifier using Data Mining Techniques", Proceedings of the International MultiConference of Engineers and Computer Scientists 2012, Vol. I, IMECS 2012, March 14-16, 2012, Hong Kong.
3. W. A. Awad, S. M. ELseuofi, "Machine Learning Methods for Spam E-mail Classification", International Journal of Computer Applications, Volume 16, No. 1, February.
4. V. Christina, S. Karpagavalli, G. Suganya, "E-mail Spam Filtering using Supervised Machine Learning Techniques", (IJCSE) International Journal on Computer Science and Engineering, Vol. 02, No. 09, 2010.
5. Rafiqul Islam, Yang Xiang, "E-mail Classification Using Data Reduction Method", School of Information Technology, Deakin University, Burwood 3125, Victoria, Australia.
6.
6. Spam Classification based on Supervised Learning using Machine Learning Techniques, Ms. D. Karthika Renuka, Dr. T. Hamsapriya, Mr. M. Raja Chakkaravarthi, Ms. P. Lakshmi Surya, IEEE, 2011.
7. An Empirical Performance Comparison of Machine Learning Methods for Spam E-mail Categorization, Chih-Chin Lai, Ming-Chi Tsai, Proceedings of the Fourth International Conference on Hybrid Intelligent Systems (HIS'04), IEEE, 2004.
8. Introductory Statistics: Concepts, Models, and Applications, David W. Stockburger.
9. Feature Subset Selection: A Correlation Based Filter Approach, Hall, M. A., Smith, L. A., 1997 International Conference on Neural Information Processing and Intelligent Information Systems, Springer.
10. A Practical Approach to Feature Selection, K. Kira and L. A. Rendell, Proceedings of the Ninth International Conference on Machine Learning, 1992.
11. Very Simple Classification Rules Perform Well on Most Commonly Used Datasets, Holte, R. C. (1993), Machine Learning, Vol. 11.
12. Induction of Decision Trees, J. R. Quinlan, Machine Learning 1, 1986.
13. UCI Repository of Machine Learning Databases, Hettich, S., Blake, C. L., and Merz, C. J., Department of Information and Computer Science, University of California, Irvine, CA.
14. Random Forests, Leo Breiman, Statistics Department, University of California, Berkeley, CA 94720, January 2001.
15. Exploiting Partial Decision Trees for Feature Subset Selection in E-mail Categorization, Helmut Berger, Dieter Merkl, Michael Dittenbach, SAC'06, April 23-27, 2006, Dijon, France, ACM.
16. C4.5: Programs for Machine Learning, J. R. Quinlan, Morgan Kaufmann Publishers Inc., 1993.
17. Fast Effective Rule Induction, W. W. Cohen, In Proc. of the Int'l Conf. on Machine Learning, Morgan Kaufmann, 1995.
18. Toward Optimal Feature Selection using Ranking Methods and Classification Algorithms, Jasmina Novaković, Perica Strbac, Dusan Bulatović, March 2011.
19. SpamAssassin.
20. The Enron spam dataset, ion/data/enron-spa
21. A Case for Unsupervised-Learning-based Spam Filtering, Feng Qian, Abhinav Pathak, Y. Charlie Hu, Z. Morley Mao, Yinglian Xie.


More information

Random forest algorithm in big data environment

Random forest algorithm in big data environment Random forest algorithm in big data environment Yingchun Liu * School of Economics and Management, Beihang University, Beijing 100191, China Received 1 September 2014, www.cmnt.lv Abstract Random forest

More information

Content-Based Recommendation

Content-Based Recommendation Content-Based Recommendation Content-based? Item descriptions to identify items that are of particular interest to the user Example Example Comparing with Noncontent based Items User-based CF Searches

More information

ISSN: 2321-7782 (Online) Volume 2, Issue 10, October 2014 International Journal of Advance Research in Computer Science and Management Studies

ISSN: 2321-7782 (Online) Volume 2, Issue 10, October 2014 International Journal of Advance Research in Computer Science and Management Studies ISSN: 2321-7782 (Online) Volume 2, Issue 10, October 2014 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online

More information

Use of Data Mining Techniques to Improve the Effectiveness of Sales and Marketing

Use of Data Mining Techniques to Improve the Effectiveness of Sales and Marketing Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 4, April 2015,

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015 RESEARCH ARTICLE OPEN ACCESS Data Mining Technology for Efficient Network Security Management Ankit Naik [1], S.W. Ahmad [2] Student [1], Assistant Professor [2] Department of Computer Science and Engineering

More information

Anti-Spam Filter Based on Naïve Bayes, SVM, and KNN model

Anti-Spam Filter Based on Naïve Bayes, SVM, and KNN model AI TERM PROJECT GROUP 14 1 Anti-Spam Filter Based on,, and model Yun-Nung Chen, Che-An Lu, Chao-Yu Huang Abstract spam email filters are a well-known and powerful type of filters. We construct different

More information

Application of Data Mining Techniques to Model Breast Cancer Data

Application of Data Mining Techniques to Model Breast Cancer Data Application of Data Mining Techniques to Model Breast Cancer Data S. Syed Shajahaan 1, S. Shanthi 2, V. ManoChitra 3 1 Department of Information Technology, Rathinam Technical Campus, Anna University,

More information

IDENTIFIC ATION OF SOFTWARE EROSION USING LOGISTIC REGRESSION

IDENTIFIC ATION OF SOFTWARE EROSION USING LOGISTIC REGRESSION http:// IDENTIFIC ATION OF SOFTWARE EROSION USING LOGISTIC REGRESSION Harinder Kaur 1, Raveen Bajwa 2 1 PG Student., CSE., Baba Banda Singh Bahadur Engg. College, Fatehgarh Sahib, (India) 2 Asstt. Prof.,

More information