AN ANTI-SPAM SYSTEM USING ARTIFICIAL NEURAL NETWORKS AND GENETIC ALGORITHMS

ABDUELBASET M. GOWEDER *, TARIK RASHED **, ALI S. ELBEKAIE ***, and HUSIEN A. ALHAMMI ****

* The High Institute of Surman for Comprehensive Professions, Surman-Libya, agoweder@yahoo.com
** The High Institute of Zahra for Comprehensive Professions, Zahra-Libya, tarmanlib@yahoo.com
*** The High Institute of Computer Technology, Tripoli-Libya, ali.elbekai@yahoo.co.uk
**** The High Institute of Zawia for Comprehensive Professions, Zawia-Libya, h1974hami@yahoo.com

Abstract

Nowadays, e-mail is widely becoming one of the fastest and most economical forms of communication. Thus, e-mail is prone to be misused. One such misuse is the posting of unsolicited, unwanted e-mails known as spam or junk e-mails. This paper presents and discusses an implementation of an anti-spam filtering system, which uses a Multi-Layer Perceptron (MLP) as a classifier and a Genetic Algorithm (GA) as a training algorithm. Standard genetic operators and advanced GA techniques are used to train the MLP. The implemented filtering system has achieved an accuracy of about 94% in detecting spam e-mails and 89% in detecting legitimate e-mails.

Keywords: Artificial Neural Networks, Genetic Algorithms, Spam E-mails, Legitimate E-mails, Arabic Spam, Text Classification.

1 INTRODUCTION

Spam is becoming an increasingly large problem. Many Internet Service Providers (ISPs) receive over a billion spam messages per day. Many of these e-mails are filtered before they reach end users. Content-based filtering is a key technological method for e-mail filtering. Spam e-mail contents usually contain common words called features. The frequency of occurrence of these features inside an e-mail gives an indication of whether the e-mail is spam or legitimate [1, 11, 26, 28].

Spam filtering is a highly sensitive application of the text classification (TC) task. Because spam e-mails contain a great deal of noise and redundant data intended to bypass filtering systems, e-mails must be pre-processed in order to separate their contents from HTML tags (e-mail structure) and to decide which information to use. The information is organized in an e-mail as a set of fields, for example: From, To, Cc, Subject, and Body fields. In addition, we should handle the cases where some words appear in different forms (e.g.: CLICK, C*L*I*C*K, N-O-W, now!). In other languages such as Arabic, some words also occur in different forms (e.g.: ألتحق (Altehk, "Join"), ألتحق! (Altehk!, "Join!"), and إضغط* (Edkat*, "Click*")).

For Arabic spam e-mails, some of the challenges we encountered in the features reduction and selection phases are as follows. Some Arabic letters have many orthographical forms, such as ألان (Alan, "NOW"), إلان (Elan, "NOW"), and الان (Alan, "NOW"). In addition, some Arabic e-mails include English words, which need to be considered when designing and implementing an Arabic spam filtering system.

This paper is organized as follows: Section 2 gives a theoretical background for the research and a review of relevant recent work. Section 3 provides a description of Genetic Algorithms. Section 4 describes the Multi-Layer Feed-Forward Artificial Neural Networks. Section 5 discusses the experimental work and the results of the experiments conducted, and includes an analysis of these results. Section 6 presents the conclusions drawn by the researchers.

2 BACKGROUND AND LITERATURE REVIEW

The success of statistical-probabilistic algorithms and machine learning algorithms in text categorization (TC) has led researchers to explore applying these algorithms to anti-spam filtering [9, 10, 18].

Various techniques to extract features from e-mail have been proposed and implemented. Payne and Edwards [20] used features consisting of words in the From and Subject fields. Segal et al. [23] developed the MailCat system, which uses the information in the To, Cc, Subject, From, and Body fields. Jason and Rennie [12] developed the ifile system and used the words found in the From, Subject, and Body fields.
Graham [11] extracted features from all fields in the header and body of e-mails. In this paper, we have used the features found in the From, Subject, and Body fields.

There are three common and intuitive representations found in text categorization: Term Frequency (TF), Term Frequency Inverse Document Frequency (TF-IDF) weight representation, and semantic approaches. Jason and Rennie [12] and Boone [3] used the TF representation in the ifile filtering system. Segal et al. [23] used the TF-IDF weighting scheme to develop the MailCat text classifier. Boone [3] showed that the TF-IDF weighting scheme captures the idea that the subject words will occur frequently in the document on a given topic. Liao et al. [15] compared the TF and TF-IDF feature representations and concluded that the TF-IDF representation is better than the TF representation. Scott and Matwin [24] discussed the semantic approach to representation in text classification. Their approach focused on word meanings by clustering together words which have the same meaning. The TF-IDF representation has an advantage over both the semantic approaches and TF, because TF-IDF shows the degree of information represented by feature occurrences in e-mails.

Features reduction is often applied to reduce the number of features extracted from e-mails. Almost all techniques for features reduction include stop-word removal. Normalizing some letters of the Arabic alphabet is a very useful and necessary reduction step: it converts Alef (ا) with hamza (ء) above or below, or with Madda (~) above, into the plain Arabic letter Alef (ا), and Waw (و) with hamza (ء) or Madda (~) above into the plain Arabic letter Waw (و).

Features selection approaches are usually employed to reduce the size of the feature set and to select a subset of the original features. The Chi-square test is used as a selection method [15, 25, 30]. Boone [3] and Salton [25] used TF-IDF as a feature selection and weighting scheme and found it useful for reducing the number of features. Joachims [13] used information gain to select a subset of features. Liao et al. [15] showed that TF-IDF has similar performance to the information gain and Chi-square test methods. The TF-IDF feature selection method is intended to select the most discriminative features while eliminating irrelevant ones among arbitrarily constructed feature sets.

Several algorithms have been developed to classify and filter e-mails. The RIPPER algorithm [4] is a rule-based algorithm for filtering e-mails. Drucker, et al.
[8] proposed an SVM algorithm for spam categorization. Jason [12] and Rennie [14] demonstrated that the SVM is costly to train and requires significant time to classify. Sahami et al. [22] proposed a Bayesian junk e-mail filter using the bag-of-words representation and the Naïve Bayes algorithm. Graham [11] described a simple implementation of the Naïve Bayes algorithm. Chuan et al. [7] proposed an anti-spam approach based on a Learning Vector Quantizer (LVQ) neural network. Özgür et al. [17] proposed an anti-spam filtering method based on ANN and Bayesian networks for agglutinative languages in general and for Turkish in particular. Clark et al. [5] used the bag-of-words representation and an ANN for an automated spam filtering system. Previous research has shown that ANNs can achieve very accurate results, which are sometimes more accurate than those of other TC classifiers [27].

Some researchers have used GAs as an alternative approach for training ANNs [16]. Branke [2] discussed how the genetic algorithm can be used to assist in ANN design and training. Riley [21] described a method of utilizing genetic algorithms to train fixed-architecture feed-forward and recurrent neural networks. Yao and Liu [29] reviewed the different combinations of ANN and GA, and used a GA to evolve ANN connection weights, architectures, learning rules, and input features. Prados [19] reported that the GA-based training algorithm is more useful for training ANNs, especially when a simple ANN topology is used.

3 A GENETIC ALGORITHM

A GA is used in the system proposed in this paper for training the MLP. Training the MLP with the GA benefits from the GA's defining property: parallel interaction between the information (genes) of a number of different chromosomes in a population pool of candidate solutions. This interaction leads to the creation of several new chromosomes.
In this paper, the GA chromosome of the MLP is encoded as a vector of weights (w_1, w_2, ..., w_n), where n is the number of MLP connections and each gene is a real-valued number in the interval [-, ].

There are two genetic operators. The first one is the uniform crossover operator. Its occurrence is based on the crossover probability (Cp): crossover occurs if a generated random number in [0, 1] is greater than or equal to Cp. The second genetic operator is mutation, which simply changes a gene's value by adding a uniformly random-generated number to it. Mutation occurs with probability one for a chromosome that has not been crossed, and with probability (1 - Cp) for a chromosome that has been crossed. The mutation value is computed according to the following equations:

\[ \mathit{value} = \mathit{random\_value}[0,1] \times (\mathit{Min\_bound} - \mathit{Max\_bound}) + \mathit{Max\_b} \tag{3.1} \]

Where:

\[ \mathit{Min\_bound} = \mathit{Min\_b} \times \mathit{Random\_value}[0,1] \times \mathit{Generation\_Rate} \tag{3.2} \]

\[ \mathit{Max\_bound} = \mathit{Max\_b} \times \mathit{Random\_value}[0,1] \times \mathit{Generation\_Rate} \tag{3.3} \]

Min_b: lower interval value = -3; Max_b: upper interval value = +3.

\[ \mathit{Generation\_Rate} = \log(\mathit{Max\_Gen}) - (\mathit{Cur\_Gen}) / \log(\mathit{Max\_Gen}) \tag{3.4} \]

Max_Gen: maximum number of generations; Cur_Gen: current generation.

3.1 A FITNESS FUNCTION

The fitness function is the absolute sum of the differences between the actual and desired outputs of a chromosome over all training data. It is computed by the following equation:

\[ \text{The fitness function} = \sum_{c=1}^{C} \left| \mathit{desired\_output}(i) - \mathit{actual\_output}(i) \right| \tag{3.5} \]

Where:

C: is the number of chromosomes in the population pool.
desired_output(i): indicates the class, which is either spam (represented by the value 0.1) or legitimate (represented by the value 0.9), for an e-mail i.
actual_output(i): is the expected output value of chromosome c over all e-mails in the training data.

3.2 ELITISM STRATEGY AND A RANK-BASED SELECTION

A rank-based selection is needed to make a few copies of a set of the best chromosomes. Equation 3.6 was used to calculate the number of copies for each chromosome, depending on its position in an ordered set.

\[ \mathit{Copies} = (q - ((\mathit{Chr\_order} - 1) \times p)) \times \mathit{Chrs} \tag{3.6} \]

Where:
Chr_order: is the order number of the chromosome in the population pool list.
Chrs: is the number of chromosomes in the population pool list.
q = 2 / Chrs.
p = q / (Chrs - 1).

4 AN ARTIFICIAL NEURAL NETWORK (ANN)

The ANN used in our system is the key component that performs the filtering operation. The MLP architecture is a fully connected feed-forward network whose number of inputs depends on the number of selected features. Each input corresponds to a single feature, which is converted to its TF-IDF weight; the inputs are organized as TF-IDF feature vectors with a class label of spam (0.1) or legitimate (0.9). The MLP has a single output. Training is done by constructing one target output for legitimate or spam e-mails, and training with the appropriate output value for the input data. By observation, a threshold value of 0.6 was chosen. An output value less than 0.6 is thresholded to 0.1; otherwise the value is thresholded to 0.9.

Training the MLP is performed using one and two hidden layers. A number of hidden neurons and one output neuron with a sigmoid activation function are used. The English and Arabic data sets are tested on different combinations (5,,, 2, and 3) of hidden neurons. Training the MLP is achieved through the use of the GA described in Section 3. The training procedure starts with 2 chromosomes. Other experiments are conducted with different numbers of chromosomes (e.g.: 4 and 6).
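The evaluation of a single chromosome described in Sections 3.1 and 4 can be sketched as follows. This is a minimal illustration, not the authors' Visual Basic .NET implementation: it assumes one hidden layer, no bias terms (the text counts only "MLP connections"), and reads the garbled constants in the text as 0.1 (spam), 0.9 (legitimate), and 0.6 (threshold). The function names `mlp_output`, `classify`, and `fitness` are hypothetical.

```python
import math

# Assumed readings of the constants given in the text.
SPAM, LEGIT, THRESHOLD = 0.1, 0.9, 0.6


def sigmoid(x):
    """Logistic activation used for the hidden and output neurons."""
    return 1.0 / (1.0 + math.exp(-x))


def mlp_output(chromosome, features, n_hidden):
    """Forward pass of a one-hidden-layer MLP whose weights are the genes.

    Assumption: the flat chromosome stores the input-to-hidden weights first
    (n_hidden blocks of len(features) genes), then the hidden-to-output weights.
    """
    n_in = len(features)
    assert len(chromosome) == n_in * n_hidden + n_hidden
    hidden = []
    for h in range(n_hidden):
        weights = chromosome[h * n_in:(h + 1) * n_in]
        hidden.append(sigmoid(sum(w * x for w, x in zip(weights, features))))
    out_weights = chromosome[n_in * n_hidden:]
    return sigmoid(sum(w * h for w, h in zip(out_weights, hidden)))


def classify(output):
    """Threshold the raw output: below 0.6 becomes spam (0.1), else legitimate (0.9)."""
    return SPAM if output < THRESHOLD else LEGIT


def fitness(chromosome, training_data, n_hidden):
    """Sum of absolute differences between desired and actual outputs (lower is better)."""
    return sum(abs(target - mlp_output(chromosome, x, n_hidden))
               for x, target in training_data)
```

With an all-zero chromosome every neuron outputs 0.5, so each e-mail contributes an error of 0.4 regardless of its class; the GA's task is to move the genes away from such uninformative weights.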
The initial chromosome gene values were real numbers in the interval [-, ]. The training procedure was repeated many times with many different training data, over several generations, until one of the following conditions was met:
1. The maximum number of generations (set to be 5,) is reached.
2. The fitness value (the MLP error) is less than or equal to .5.

5 THE EXPERIMENTAL WORK

In this section, we first present the data sets that we used to conduct our evaluation experiments. Next, the pre-processing of our data and the implementation of our system are given. Then, the evaluation measures used to assess our system are described. Finally, a set of experiments is presented, followed by the results and their discussion.

5.1 THE DATA SETS

Three different data sets are used to conduct our experiments. These data sets were collected from different sources [6, 31]. Table 5.1 shows these three data sets.

Table 5.1: The Data Sets (corpora).
Corpus Name | No. of Spam E-mails | No. of Legitimate E-mails | Total
SpamAssassin | | |
TREC | | |
The Arabic Corpus | | |

5.2 TRAINING AND TEST DATA

Each data set was split equally into two sets (50% for training and 50% for test data). Table 5.2 shows the training and test data for each corpus (data set).

Table 5.2: Training and Test Data.
 | SpamAssassin Corpus (Training set, Test set) | TREC Corpus (Training set, Test set) | The Arabic Corpus (Training set, Test set)
Spam | | |
Legitimate | | |
Total | | |

5.3 DATA PRE-PROCESSING

Data pre-processing is the analysis of the textual data and the extraction of information from e-mails. The general procedure for data pre-processing can be described by the following steps:
(i) Deletion: Remove irrelevant elements of e-mails, and select segments suitable for processing (e.g., Subject and Body fields).
(ii) Normalization: For Arabic e-mails, convert Arabic letters which have the same shape: Alef (ا) with hamza (ء) above or below, or with Madda (~) above, into the Arabic letter Alef (ا), and Waw (و) with hamza (ء) or Madda (~) above into the Arabic letter Waw (و).

(iii) Tokenization: Divide the message into semantically coherent segments (e.g.: words, other character strings).
(iv) Representation: Convert the message into a vector of values, where each value in this vector represents an e-mail feature.
(v) Selection: Delete the least predictive features using the TF-IDF weighting scheme. The features with the highest TF-IDF values are selected to represent the set of training features.

5.4 IMPLEMENTATION

We have implemented an anti-spam system that runs on the Windows XP platform. The code is written in Visual Basic .NET. The system was built from scratch without using any ANN or GA libraries. The system has three main modules:
(1) A features extraction and reduction module.
(2) A features weighting and selection module.
(3) A classifier module, which consists of an MLP classifier and GA.

5.4.1 THE FEATURES EXTRACTION AND REDUCTION MODULE

This module is concerned with features extraction and reduction. It first tokenizes each e-mail included in the training data set. Then, a bag-of-words is created for each data set. No stemming was applied. Next, words that appear three times or fewer in each corpus were discarded. Finally, words that are 2 characters in length or longer were removed from the e-mail. As a result, the initial number of unique features is reduced from about 48 to 981 for the Arabic and English corpus. For the SpamAssassin corpus, the initial number of features is reduced from 22 to 32, while for the TREC corpus, the features are reduced from 29 to

5.4.2 THE FEATURES WEIGHTING AND SELECTION MODULE

Feature selection using the TF-IDF scheme was carried out after the construction of the bag-of-words. The best features are selected by sorting the TF-IDF features in descending order. We then decide how many features to include.
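The normalization, tokenization, and TF-IDF selection steps above can be sketched as follows. This is a hedged illustration only: the paper does not state its exact TF-IDF formula or how per-e-mail weights are aggregated before sorting, so the sketch assumes raw term frequency times log(N / document frequency) and ranks terms by their summed weight. The function names and the regular-expression tokenizer are hypothetical, and only the precomposed Unicode forms of Alef and Waw are handled.

```python
import math
import re
from collections import Counter

# Map Alef with hamza above/below or madda above to plain Alef,
# and Waw with hamza above to plain Waw.
ARABIC_NORMALIZATION = str.maketrans({
    "\u0623": "\u0627",  # Alef with hamza above
    "\u0625": "\u0627",  # Alef with hamza below
    "\u0622": "\u0627",  # Alef with madda above
    "\u0624": "\u0648",  # Waw with hamza above
})


def normalize(text):
    """Collapse the orthographic variants of Alef and Waw."""
    return text.translate(ARABIC_NORMALIZATION)


def tokenize(text):
    """Split normalized text into word tokens (assumed tokenizer)."""
    return re.findall(r"\w+", normalize(text))


def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized e-mail.

    Assumed weighting: tf * log(N / df).
    """
    n_docs = len(docs)
    counts = [Counter(doc) for doc in docs]
    doc_freq = Counter(term for c in counts for term in c)
    return [{term: tf * math.log(n_docs / doc_freq[term]) for term, tf in c.items()}
            for c in counts]


def select_top_features(vectors, k):
    """Sort terms by summed TF-IDF weight, descending, and keep the top k."""
    totals = Counter()
    for vec in vectors:
        for term, weight in vec.items():
            totals[term] += weight
    ranked = sorted(totals, key=lambda term: (-totals[term], term))
    return ranked[:k]
```

A term that occurs in every e-mail receives a weight of zero under this weighting, so it sorts last and is the first to be dropped as k shrinks.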
The experiments were conducted using different numbers of selected features.

5.4.3 THE CLASSIFIER MODULE

The MLP architecture is a fully connected feed-forward network whose number of inputs depends on the number of selected features. Each input is converted to its TF-IDF weight, and the inputs are organized as TF-IDF feature vectors with a class label of spam (0.1) or legitimate (0.9). Two matrices are used to calculate the outputs of every layer. The first matrix holds the MLP inputs organized as vectors, each consisting of a set of TF-IDF values. The second matrix contains a set of chromosomes, which represent the weights associated with every MLP input.

5.5 EVALUATION MEASURES

The performance of spam filtering techniques is determined by two well-known measures used in text classification. These measures are precision and recall [5, ], which can be computed as follows:

\[ \text{Spam Precision (SP)} = \frac{N_{SS}}{N_{SS} + N_{LS}} \tag{5.1} \]

\[ \text{Legitimate Precision (LP)} = \frac{N_{LL}}{N_{SL} + N_{LL}} \tag{5.2} \]

\[ \text{Spam Recall (SR)} = \frac{N_{SS}}{N_{SL} + N_{SS}} \tag{5.3} \]

\[ \text{Legitimate Recall (LR)} = \frac{N_{LL}}{N_{LL} + N_{LS}} \tag{5.4} \]

Where:
N_SS = the number of spam messages correctly classified as spam.
N_SL = the number of spam messages incorrectly classified as legitimate.
N_LL = the number of legitimate messages correctly classified as legitimate.
N_LS = the number of legitimate messages incorrectly classified as spam.

5.6 EXPERIMENTS

The purpose of these experiments is to evaluate the performance of the MLP in spam filtering and the efficiency of the GA in training the MLP. A series of tests is performed on a small problem (the XOR) to discover the GA parameters (e.g., mutation probability, crossover probability, and population size) that give the best performance of the MLP. The best GA parameters obtained are then used to train our MLP classifier.

5.6.1 THE XOR PROBLEM

The XOR problem was the first problem to be solved using the MLP trained by the GA. This problem has become a standard example used by many researchers to explain the training process.
Table 5.3 shows the different values of the mutation probability (Mp), the crossover probability (Cp), and the population size (Ps) for each experiment. Table 5.3 clearly shows that experiment 2 recorded the minimum time to train the MLP using the GA for the XOR problem.
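To make the XOR training loop concrete, the sketch below combines the operators of Section 3 (uniform crossover, additive mutation, rank-based copies from Equation 3.6, and elitism) on a 2-2-1 MLP. It is an illustration under several assumptions rather than a reproduction of the authors' code: bias inputs are added, the initial genes are drawn from [-1, 1] because the original interval is missing from the text, the defaults `pop_size=20` and `cp=0.7` are one reading of "Ps=2" and "Cp=.7", and a simple linear decay stands in for the Generation_Rate of Equation 3.4, whose exact form is ambiguous in the extracted text. All function names are hypothetical.

```python
import math
import random

XOR_DATA = [((0, 0), 0.0), ((0, 1), 1.0), ((1, 0), 1.0), ((1, 1), 0.0)]
N_HIDDEN = 2
N_GENES = 3 * N_HIDDEN + (N_HIDDEN + 1)  # bias weights are an assumption


def sigmoid(x):
    x = max(-60.0, min(60.0, x))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-x))


def forward(chrom, x):
    """2-2-1 MLP; each neuron reads its inputs plus a constant bias input."""
    inputs = list(x) + [1.0]
    hidden = [sigmoid(sum(w * v for w, v in zip(chrom[3 * h:3 * h + 3], inputs)))
              for h in range(N_HIDDEN)]
    return sigmoid(sum(w * v for w, v in zip(chrom[3 * N_HIDDEN:], hidden + [1.0])))


def error(chrom):
    """Fitness: sum of absolute output differences over the training data."""
    return sum(abs(t - forward(chrom, x)) for x, t in XOR_DATA)


def rank_copies(order, chrs):
    """Equation 3.6: number of copies for the chromosome ranked `order` (1 = best)."""
    q = 2.0 / chrs
    p = q / (chrs - 1)
    return (q - (order - 1) * p) * chrs


def uniform_crossover(a, b, rng):
    """Swap each gene between the two parents with probability 0.5."""
    child_a, child_b = list(a), list(b)
    for i in range(len(a)):
        if rng.random() < 0.5:
            child_a[i], child_b[i] = child_b[i], child_a[i]
    return child_a, child_b


def mutate(chrom, rng, rate):
    """Add a uniform random number from [-3, 3], scaled by the decay rate, to each gene."""
    return [g + rng.uniform(-3.0, 3.0) * rate for g in chrom]


def train(pop_size=20, cp=0.7, max_gen=200, seed=0):
    rng = random.Random(seed)
    pop = [[rng.uniform(-1.0, 1.0) for _ in range(N_GENES)] for _ in range(pop_size)]
    history = []
    for gen in range(max_gen):
        pop.sort(key=error)
        history.append(error(pop[0]))
        pool = []  # mating pool built from rank-based copies
        for order, chrom in enumerate(pop, start=1):
            pool.extend([chrom] * round(rank_copies(order, pop_size)))
        rate = 1.0 - gen / max_gen  # assumed stand-in for Generation_Rate
        nxt = [pop[0]]  # elitism: the best chromosome survives unchanged
        while len(nxt) < pop_size:
            a, b = rng.choice(pool), rng.choice(pool)
            crossed = rng.random() >= cp  # crossover condition as stated in the text
            if crossed:
                a, b = uniform_crossover(a, b, rng)
            if not crossed or rng.random() < (1.0 - cp):
                a, b = mutate(a, rng, rate), mutate(b, rng, rate)
            nxt.extend([a, b])
        pop = nxt[:pop_size]
    pop.sort(key=error)
    history.append(error(pop[0]))
    return pop[0], history
```

Because the elite chromosome is carried over unchanged, the best error recorded in `history` can never increase from one generation to the next; whether it reaches a given stopping threshold depends on the seed and the parameter settings, which is what the experiments in Table 5.3 vary.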

Table 5.3: The GA Parameters for the XOR Problem.
Experiment name | Mp | Cp | Ps | Training Time in Seconds (s)
Experiment 1 | Mp=.3 | Cp=.7 | Ps= | s
Experiment 2 | Mp=.3 | Cp=.7 | Ps=2 | 5s
Experiment 3 | Mp=.3 | Cp=.7 | Ps=4 | s
Experiment 4 | Mp=.3 | Cp=.7 | Ps=6 | >2s
Experiment 5 | Mp=.5 | Cp=.7 | Ps= | s
Experiment 6 | Mp=.5 | Cp=.7 | Ps=2 | s
Experiment 7 | Mp=.5 | Cp=.7 | Ps=4 | >2s
Experiment 8 | Mp=.5 | Cp=.7 | Ps=6 | 4s

5.6.2 THE MLP AND GA CLASSIFIER

A series of experiments was conducted to train our MLP using the GA parameters obtained from experiment 2, described in the previous section. These experiments train our MLP using the GA on three different data sets. Although the training process completed, some combinations of the MLP parameters were counted as failures because of the low rates of the SR, SP, LR, and LP evaluation measures. Other combinations of the MLP parameters were abandoned and their training processes terminated because the training time exceeded 6 hours while the MLP errors improved only slightly. One of the terminated training processes was the experiment that used 25 features as input, where the first layer contained 3 neurons, and the second layer contained neurons. The following sections present the results of the experiments conducted on the three data sets.

THE SPAMASSASSIN DATA SET RESULTS

In this experiment, we trained the MLP using the GA on the SpamAssassin data set. This section presents the results obtained using and 2 different features. Tables 5.4 and 5.5 show the SR, SP, LR, and LP values using and 2 input features respectively. It can be observed from Table 5.4 that the best results, as highlighted, are obtained using the MLP which consists of one hidden layer with 3 neurons. These results were achieved through a gradual improvement of the initial error rate, which was 123 in the first generation; it took about generations to reach the error value of .49. Table 5.5 also shows that the best results, as highlighted, are obtained using the MLP which consists of one hidden layer with 3 neurons.
These results were achieved through a gradual improvement of the initial error rate, which was in the first generation; it took about generations to reach the error value of .432.

Table 5.4: The Results of SpamAssassin Data Set: ( input features)

Table 5.5: The Results of SpamAssassin Data Set: (2 input features)

THE TREC DATA SET RESULTS

In this experiment, the MLP was trained using the GA on the TREC data set. The results obtained using and 2 different features are given in Tables 5.6 and 5.7. These tables show the SR, SP, LR, and LP values using and 2 input features respectively. In general, the results show low rates, because the TREC corpus contains a large number of spam e-mails that are highly similar to legitimate e-mails (hard spam). It can be observed from Table 5.6 that the best results, as highlighted, are obtained using the MLP which consists of two hidden layers, with 3 neurons in the first layer and neurons in the second one. These results were achieved through a gradual improvement of the initial error rate, which was 124 in the first generation; it took about generations to reach the error value of .498. Table 5.7 also shows that the best results, as highlighted, are obtained using the MLP which consists of two hidden layers, with 3 neurons in the first layer and neurons in the second one. These results were achieved through a gradual improvement of the initial error rate, which was 112 in the first generation; it took about generations to reach the error value of .46.

Table 5.6: The Results of TREC Data Set: ( input features).

Table 5.7: The Results of TREC Data Set: (2 input features)

THE ARABIC DATA SET RESULTS

In this experiment, we trained the MLP using the GA on the Arabic data set. This section presents the results obtained using 5 and 9 different features. Tables 5.8 and 5.9 show the SR, SP, LR, and LP values using 5 and 9 input features respectively. It can be observed from Table 5.8 that the best results, as highlighted, are obtained using the MLP which consists of one hidden layer with neurons. These results were achieved through a gradual improvement of the initial error rate, which was 8 in the first generation; it took about 3546 generations to reach the error value of .48. Table 5.9 also shows that the best results, as highlighted, are obtained using the MLP which consists of one hidden layer with neurons. These results were achieved through a gradual improvement of the initial error rate, which was 7 in the first generation; it took about 4546 generations to reach the error value of

Table 5.8: The Results of Arabic Data Set: (5 input features)

Table 5.9: The Results of Arabic Data Set: (9 input features)

THE OVERALL PERFORMANCE

The results of our experiments indicate that our implemented MLP classifier trained with the GA performed well. The overall accuracy rate is about 94% for detecting spam e-mails and about 89% for detecting legitimate e-mails.

5.8 DISCUSSION OF THE RESULTS

An analysis of the results and a close study of the experiments produced the following remarks:
(1) The best input features for English e-mails were that generate the best results comparable to the 2 input features. For Arabic e-mails, 9 input features are considered to be the best. This implies that the success rates are influenced by the number of input features.
(2) Words in legitimate e-mails are as important as words in spam e-mails for the filtering process.
By observation, most misclassifications were e-mails containing only one or two words, or Arabic e-mails which have Arabic mixed with English words.
(3) A wise setting of the number of hidden layers and the number of neurons can significantly decrease the MLP error rates.
(4) The initial parameters that were used during the development of the GA were Mp = .3, Cp = .7, and Ps = 2, and the maximum number of generations was set to 5,. These settings were suitable for the e-mail filtering domain. Increasing the population size gives good chromosomes fewer chances to appear in the next generation under the rank-based selection. The GA works better with many inputs (spam filtering) than with a few inputs (the XOR problem).

6 CONCLUSION

An anti-spam filtering system was proposed which uses a multi-layer artificial neural network trained by a genetic algorithm. The results clearly show that the Subject and Body fields can contain enough information to classify e-mails as spam or legitimate. The results have also shown that an MLP with -3 neurons in the first hidden layer is sufficient to filter both easy spam and easy legitimate e-mails. The MLP architecture used to develop our system is good for filtering e-mails, if we do not take into account the long time needed to train the MLP. We have also investigated the effects of several GA parameters. The parameters found to be most significant to the performance of the classifier are the size of the population pool, the crossover and mutation probabilities, and the mutation method. It is important to remember that e-mail filtering is a highly sensitive application of the textual classification problem. The classifier must be able to handle many input features, with low false positive and low false negative rates.

ACKNOWLEDGEMENT

We would like to express our gratitude to the Libyan General Secretariat for Human Resources and Training for supporting this work.

REFERENCES

[1] Bruening, P., "Technological Responses to the Problem of Spam: Preserving Free Speech and Open Internet Values", First Conference on Email and Anti-Spam, 24.
[2] Branke, J., "Evolutionary algorithms for neural network design and training", In Proceedings 1st Nordic Workshop on Genetic Algorithms and its Applications, Finland,
[3] Boone, G., "Concept Features in Re:Agent, an Intelligent Email Agent", The Second International Conference on Autonomous Agents, 1998.
[4] Cohen, W., "Learning Rules that Classify E-mail", In AAAI Spring Symposium on Machine Learning in Information Access, California,
[5] Clark, et al., "A Neural Network Based Approach to Automated E-mail Classification", IEEE/WIC International Conference on Web Intelligence, 23.
[6] Cormack, G., Lynam, T., "Spam Corpus Creation for TREC", Second Conference on Email and Anti-Spam, 25.
[7] Chuan, Z., et al., "A LVQ-based neural network anti-spam email approach", Proceedings of the 5th International Conference, Singapore, 24.
[8] Drucker, H., et al., "Support Vector Machines for Spam Categorization", In IEEE Transactions on Neural Networks,
[9] Flavio, D., et al., "Spam Filter Analysis", University of Nijmegen, the Netherlands, 23.
[10] Goodman, J., "Spam: Technologies and Policies", Microsoft Research, 23.
[11] Graham, P., "A Plan for Spam", MIT Conference on Spam, 23.
[12] Jason, D., Rennie, M., "ifile: An Application of Machine Learning to E-Mail Filtering", Text Mining Workshop, Boston, U.S.A, 2.
[13] Joachims, T., "Text Categorization with Support Vector Machines: Learning with Many Relevant Features", Proceedings of ECML-98, 10th European Conference on Machine Learning,
[14] Kolcz, A., Alspector, J., "SVM-based filtering of e-mail spam with content-specific misclassification costs", In Proceedings of the Workshop on Text Mining, IEEE International Conference on Data Mining, San Jose, California, 21.
[15] Liao, C., Alpha, S., Dixon, P., "Feature Preparation in Text Categorization", Oracle Corporation, 24.
[16] Montana, D., Davis, L., "Training feed-forward neural networks using genetic algorithms", In Proceedings of the 11th International Joint Conference on Artificial Intelligence, 1989.
[17] Ozgur, L., et al., "Adaptive Turkish Anti-spam Filtering", International Twelfth Turkish Symposium on Artificial Intelligence and Neural Networks (TAINN), 23.
[18] Oda, T., White, T., "Increasing the Accuracy of a Spam-detecting Artificial Immune System", In the Congress on Evolutionary Computation Proceedings, Canberra, Australia, 23.
[19] Prados, D., "Training multilayered neural networks by replacing the least fit hidden neurons", In Proceedings IEEE SOUTHEASTCON 22, 22.
[20] Payne, T., Edwards, P., "Interface Agents that Learn: An Investigation of Learning Issues in a Mail Agent Interface", Applied Artificial Intelligence,
[21] Riley, J., "An evolutionary approach to training Feed-Forward and Recurrent Neural Networks", Master thesis of Applied Science in Information Technology, Department of Computer Science, Royal Melbourne Institute of Technology, Australia, 22.
[22] Sahami, M., et al., "A Bayesian Approach to Filtering Junk E-mail", In Learning for Text Categorization, AAAI Technical Report, U.S.A,
[23] Segal, R., et al., "MailCat: An intelligent assistant for organizing e-mail", Proceedings of the Third International Conference on Autonomous Agents,
[24] Scott, S., Matwin, S., "Feature engineering for text classification", Proceedings of ICML-99, 16th International Conference on Machine Learning,
[25] Salton, G., Buckley, C., "Term Weighting Approaches in Automatic Text Retrieval", Information Processing and Management, Vol. 24, No. 5, P513,
[26] Fürnkranz, J., "A Study Using n-gram Features for Text Categorization", Austrian Research Institute,
[27] Vinther, M., "Intelligent junk mail detection using Neural networks", URL: kdetection.pdf, 22.
[28] William, S., et al., "A Unified Model of Spam Filtration", MIT Spam Conference, Cambridge, 25.
[29] Yao, X., Liu, Y., "A new evolutionary system for evolving artificial neural networks", IEEE Transactions on Neural Networks,
[30] Yang, Y., Pedersen, J., "A comparative study on feature selection in text categorization", In Proceedings of ICML-97, 14th International Conference on Machine Learning, U.S.A,
[31] URL:


More information

E-mail Spam Classification With Artificial Neural Network and Negative Selection Algorithm

E-mail Spam Classification With Artificial Neural Network and Negative Selection Algorithm E-mail Spam Classification With Artificial Neural Network and Negative Selection Algorithm Ismaila Idris Dept of Cyber Security Science, Federal University of Technology, Minna, Nigeria. Idris.ismaila95@gmail.com

More information

Spam Filtering Based On The Analysis Of Text Information Embedded Into Images

Spam Filtering Based On The Analysis Of Text Information Embedded Into Images Journal of Machine Learning Research 7 (2006) 2699-2720 Submitted 3/06; Revised 9/06; Published 12/06 Spam Filtering Based On The Analysis Of Text Information Embedded Into Images Giorgio Fumera Ignazio

More information

Data Pre-Processing in Spam Detection

Data Pre-Processing in Spam Detection IJSTE - International Journal of Science Technology & Engineering Volume 1 Issue 11 May 2015 ISSN (online): 2349-784X Data Pre-Processing in Spam Detection Anjali Sharma Dr. Manisha Manisha Dr. Rekha Jain

More information

Increasing the Accuracy of a Spam-Detecting Artificial Immune System

Increasing the Accuracy of a Spam-Detecting Artificial Immune System Increasing the Accuracy of a Spam-Detecting Artificial Immune System Terri Oda Carleton University terri@zone12.com Tony White Carleton University arpwhite@scs.carleton.ca Abstract- Spam, the electronic

More information

Predicting the Risk of Heart Attacks using Neural Network and Decision Tree

Predicting the Risk of Heart Attacks using Neural Network and Decision Tree Predicting the Risk of Heart Attacks using Neural Network and Decision Tree S.Florence 1, N.G.Bhuvaneswari Amma 2, G.Annapoorani 3, K.Malathi 4 PG Scholar, Indian Institute of Information Technology, Srirangam,

More information

Anti-Spam Filter Based on Naïve Bayes, SVM, and KNN model

Anti-Spam Filter Based on Naïve Bayes, SVM, and KNN model AI TERM PROJECT GROUP 14 1 Anti-Spam Filter Based on,, and model Yun-Nung Chen, Che-An Lu, Chao-Yu Huang Abstract spam email filters are a well-known and powerful type of filters. We construct different

More information

Term extraction for user profiling: evaluation by the user

Term extraction for user profiling: evaluation by the user Term extraction for user profiling: evaluation by the user Suzan Verberne 1, Maya Sappelli 1,2, Wessel Kraaij 1,2 1 Institute for Computing and Information Sciences, Radboud University Nijmegen 2 TNO,

More information

Evolutionary Detection of Rules for Text Categorization. Application to Spam Filtering

Evolutionary Detection of Rules for Text Categorization. Application to Spam Filtering Advances in Intelligent Systems and Technologies Proceedings ECIT2004 - Third European Conference on Intelligent Systems and Technologies Iasi, Romania, July 21-23, 2004 Evolutionary Detection of Rules

More information

The Enron Corpus: A New Dataset for Email Classification Research

The Enron Corpus: A New Dataset for Email Classification Research The Enron Corpus: A New Dataset for Email Classification Research Bryan Klimt and Yiming Yang Language Technologies Institute Carnegie Mellon University Pittsburgh, PA 15213-8213, USA {bklimt,yiming}@cs.cmu.edu

More information

Statistical Feature Selection Techniques for Arabic Text Categorization

Statistical Feature Selection Techniques for Arabic Text Categorization Statistical Feature Selection Techniques for Arabic Text Categorization Rehab M. Duwairi Department of Computer Information Systems Jordan University of Science and Technology Irbid 22110 Jordan Tel. +962-2-7201000

More information

PSSF: A Novel Statistical Approach for Personalized Service-side Spam Filtering

PSSF: A Novel Statistical Approach for Personalized Service-side Spam Filtering 2007 IEEE/WIC/ACM International Conference on Web Intelligence PSSF: A Novel Statistical Approach for Personalized Service-side Spam Filtering Khurum Nazir Juneo Dept. of Computer Science Lahore University

More information

ARTIFICIAL INTELLIGENCE METHODS IN STOCK INDEX PREDICTION WITH THE USE OF NEWSPAPER ARTICLES

ARTIFICIAL INTELLIGENCE METHODS IN STOCK INDEX PREDICTION WITH THE USE OF NEWSPAPER ARTICLES FOUNDATION OF CONTROL AND MANAGEMENT SCIENCES No Year Manuscripts Mateusz, KOBOS * Jacek, MAŃDZIUK ** ARTIFICIAL INTELLIGENCE METHODS IN STOCK INDEX PREDICTION WITH THE USE OF NEWSPAPER ARTICLES Analysis

More information

CAS-ICT at TREC 2005 SPAM Track: Using Non-Textual Information to Improve Spam Filtering Performance

CAS-ICT at TREC 2005 SPAM Track: Using Non-Textual Information to Improve Spam Filtering Performance CAS-ICT at TREC 2005 SPAM Track: Using Non-Textual Information to Improve Spam Filtering Performance Shen Wang, Bin Wang and Hao Lang, Xueqi Cheng Institute of Computing Technology, Chinese Academy of

More information

International Journal of Research in Advent Technology Available Online at: http://www.ijrat.org

International Journal of Research in Advent Technology Available Online at: http://www.ijrat.org IMPROVING PEFORMANCE OF BAYESIAN SPAM FILTER Firozbhai Ahamadbhai Sherasiya 1, Prof. Upen Nathwani 2 1 2 Computer Engineering Department 1 2 Noble Group of Institutions 1 firozsherasiya@gmail.com ABSTARCT:

More information

Mining a Corpus of Job Ads

Mining a Corpus of Job Ads Mining a Corpus of Job Ads Workshop Strings and Structures Computational Biology & Linguistics Jürgen Jürgen Hermes Hermes Sprachliche Linguistic Data Informationsverarbeitung Processing Institut Department

More information

Web Document Clustering

Web Document Clustering Web Document Clustering Lab Project based on the MDL clustering suite http://www.cs.ccsu.edu/~markov/mdlclustering/ Zdravko Markov Computer Science Department Central Connecticut State University New Britain,

More information

Lan, Mingjun and Zhou, Wanlei 2005, Spam filtering based on preference ranking, in Fifth International Conference on Computer and Information

Lan, Mingjun and Zhou, Wanlei 2005, Spam filtering based on preference ranking, in Fifth International Conference on Computer and Information Lan, Mingjun and Zhou, Wanlei 2005, Spam filtering based on preference ranking, in Fifth International Conference on Computer and Information Technology : CIT 2005 : proceedings : 21-23 September, 2005,

More information

Email Classification Using Data Reduction Method

Email Classification Using Data Reduction Method Email Classification Using Data Reduction Method Rafiqul Islam and Yang Xiang, member IEEE School of Information Technology Deakin University, Burwood 3125, Victoria, Australia Abstract Classifying user

More information

Adaption of Statistical Email Filtering Techniques

Adaption of Statistical Email Filtering Techniques Adaption of Statistical Email Filtering Techniques David Kohlbrenner IT.com Thomas Jefferson High School for Science and Technology January 25, 2007 Abstract With the rise of the levels of spam, new techniques

More information

Naïve Bayesian Anti-spam Filtering Technique for Malay Language

Naïve Bayesian Anti-spam Filtering Technique for Malay Language Naïve Bayesian Anti-spam Filtering Technique for Malay Language Thamarai Subramaniam 1, Hamid A. Jalab 2, Alaa Y. Taqa 3 1,2 Computer System and Technology Department, Faulty of Computer Science and Information

More information

Developing Methods and Heuristics with Low Time Complexities for Filtering Spam Messages

Developing Methods and Heuristics with Low Time Complexities for Filtering Spam Messages Developing Methods and Heuristics with Low Time Complexities for Filtering Spam Messages Tunga Güngör and Ali Çıltık Boğaziçi University, Computer Engineering Department, Bebek, 34342 İstanbul, Turkey

More information

IMPROVING SPAM EMAIL FILTERING EFFICIENCY USING BAYESIAN BACKWARD APPROACH PROJECT

IMPROVING SPAM EMAIL FILTERING EFFICIENCY USING BAYESIAN BACKWARD APPROACH PROJECT IMPROVING SPAM EMAIL FILTERING EFFICIENCY USING BAYESIAN BACKWARD APPROACH PROJECT M.SHESHIKALA Assistant Professor, SREC Engineering College,Warangal Email: marthakala08@gmail.com, Abstract- Unethical

More information

Lasso-based Spam Filtering with Chinese Emails

Lasso-based Spam Filtering with Chinese Emails Journal of Computational Information Systems 8: 8 (2012) 3315 3322 Available at http://www.jofcis.com Lasso-based Spam Filtering with Chinese Emails Zunxiong LIU 1, Xianlong ZHANG 1,, Shujuan ZHENG 2 1

More information

Three types of messages: A, B, C. Assume A is the oldest type, and C is the most recent type.

Three types of messages: A, B, C. Assume A is the oldest type, and C is the most recent type. Chronological Sampling for Email Filtering Ching-Lung Fu 2, Daniel Silver 1, and James Blustein 2 1 Acadia University, Wolfville, Nova Scotia, Canada 2 Dalhousie University, Halifax, Nova Scotia, Canada

More information

T-61.3050 : Email Classification as Spam or Ham using Naive Bayes Classifier. Santosh Tirunagari : 245577

T-61.3050 : Email Classification as Spam or Ham using Naive Bayes Classifier. Santosh Tirunagari : 245577 T-61.3050 : Email Classification as Spam or Ham using Naive Bayes Classifier Santosh Tirunagari : 245577 January 20, 2011 Abstract This term project gives a solution how to classify an email as spam or

More information

Intelligent Word-Based Spam Filter Detection Using Multi-Neural Networks

Intelligent Word-Based Spam Filter Detection Using Multi-Neural Networks www.ijcsi.org 17 Intelligent Word-Based Spam Filter Detection Using Multi-Neural Networks Ann Nosseir 1, Khaled Nagati 1 and Islam Taj-Eddin 1 1 Faculty of Informatics and Computer Sciences British University

More information

WE DEFINE spam as an e-mail message that is unwanted basically

WE DEFINE spam as an e-mail message that is unwanted basically 1048 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 10, NO. 5, SEPTEMBER 1999 Support Vector Machines for Spam Categorization Harris Drucker, Senior Member, IEEE, Donghui Wu, Student Member, IEEE, and Vladimir

More information

Towards better accuracy for Spam predictions

Towards better accuracy for Spam predictions Towards better accuracy for Spam predictions Chengyan Zhao Department of Computer Science University of Toronto Toronto, Ontario, Canada M5S 2E4 czhao@cs.toronto.edu Abstract Spam identification is crucial

More information

Technical Report. The KNIME Text Processing Feature:

Technical Report. The KNIME Text Processing Feature: Technical Report The KNIME Text Processing Feature: An Introduction Dr. Killian Thiel Dr. Michael Berthold Killian.Thiel@uni-konstanz.de Michael.Berthold@uni-konstanz.de Copyright 2012 by KNIME.com AG

More information

Impact of Feature Selection Technique on Email Classification

Impact of Feature Selection Technique on Email Classification Impact of Feature Selection Technique on Email Classification Aakanksha Sharaff, Naresh Kumar Nagwani, and Kunal Swami Abstract Being one of the most powerful and fastest way of communication, the popularity

More information

SpamNet Spam Detection Using PCA and Neural Networks

SpamNet Spam Detection Using PCA and Neural Networks SpamNet Spam Detection Using PCA and Neural Networks Abhimanyu Lad B.Tech. (I.T.) 4 th year student Indian Institute of Information Technology, Allahabad Deoghat, Jhalwa, Allahabad, India abhimanyulad@iiita.ac.in

More information

Search and Information Retrieval

Search and Information Retrieval Search and Information Retrieval Search on the Web 1 is a daily activity for many people throughout the world Search and communication are most popular uses of the computer Applications involving search

More information

Experiments in Web Page Classification for Semantic Web

Experiments in Web Page Classification for Semantic Web Experiments in Web Page Classification for Semantic Web Asad Satti, Nick Cercone, Vlado Kešelj Faculty of Computer Science, Dalhousie University E-mail: {rashid,nick,vlado}@cs.dal.ca Abstract We address

More information

Clustering Technique in Data Mining for Text Documents

Clustering Technique in Data Mining for Text Documents Clustering Technique in Data Mining for Text Documents Ms.J.Sathya Priya Assistant Professor Dept Of Information Technology. Velammal Engineering College. Chennai. Ms.S.Priyadharshini Assistant Professor

More information

Bagged Ensemble Classifiers for Sentiment Classification of Movie Reviews

Bagged Ensemble Classifiers for Sentiment Classification of Movie Reviews www.ijecs.in International Journal Of Engineering And Computer Science ISSN:2319-7242 Volume 3 Issue 2 February, 2014 Page No. 3951-3961 Bagged Ensemble Classifiers for Sentiment Classification of Movie

More information

Spam Filtering with Naive Bayesian Classification

Spam Filtering with Naive Bayesian Classification Spam Filtering with Naive Bayesian Classification Khuong An Nguyen Queens College University of Cambridge L101: Machine Learning for Language Processing MPhil in Advanced Computer Science 09-April-2011

More information

Numerical Research on Distributed Genetic Algorithm with Redundant

Numerical Research on Distributed Genetic Algorithm with Redundant Numerical Research on Distributed Genetic Algorithm with Redundant Binary Number 1 Sayori Seto, 2 Akinori Kanasugi 1,2 Graduate School of Engineering, Tokyo Denki University, Japan 10kme41@ms.dendai.ac.jp,

More information

Spam Filtering Based on Latent Semantic Indexing

Spam Filtering Based on Latent Semantic Indexing Spam Filtering Based on Latent Semantic Indexing Wilfried N. Gansterer Andreas G. K. Janecek Robert Neumayer Abstract In this paper, a study on the classification performance of a vector space model (VSM)

More information

EFFICIENT DATA PRE-PROCESSING FOR DATA MINING

EFFICIENT DATA PRE-PROCESSING FOR DATA MINING EFFICIENT DATA PRE-PROCESSING FOR DATA MINING USING NEURAL NETWORKS JothiKumar.R 1, Sivabalan.R.V 2 1 Research scholar, Noorul Islam University, Nagercoil, India Assistant Professor, Adhiparasakthi College

More information

Chapter 6. The stacking ensemble approach

Chapter 6. The stacking ensemble approach 82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described

More information

Analecta Vol. 8, No. 2 ISSN 2064-7964

Analecta Vol. 8, No. 2 ISSN 2064-7964 EXPERIMENTAL APPLICATIONS OF ARTIFICIAL NEURAL NETWORKS IN ENGINEERING PROCESSING SYSTEM S. Dadvandipour Institute of Information Engineering, University of Miskolc, Egyetemváros, 3515, Miskolc, Hungary,

More information

Shafzon@yahool.com. Keywords - Algorithm, Artificial immune system, E-mail Classification, Non-Spam, Spam

Shafzon@yahool.com. Keywords - Algorithm, Artificial immune system, E-mail Classification, Non-Spam, Spam An Improved AIS Based E-mail Classification Technique for Spam Detection Ismaila Idris Dept of Cyber Security Science, Fed. Uni. Of Tech. Minna, Niger State Idris.ismaila95@gmail.com Abdulhamid Shafi i

More information

1. Classification problems

1. Classification problems Neural and Evolutionary Computing. Lab 1: Classification problems Machine Learning test data repository Weka data mining platform Introduction Scilab 1. Classification problems The main aim of a classification

More information

Spam? Not Any More! Detecting Spam emails using neural networks

Spam? Not Any More! Detecting Spam emails using neural networks Spam? Not Any More! Detecting Spam emails using neural networks ECE / CS / ME 539 Project Submitted by Sivanadyan, Thiagarajan Last Name First Name TABLE OF CONTENTS 1. INTRODUCTION...2 1.1 Importance

More information

Spam detection with data mining method:

Spam detection with data mining method: Spam detection with data mining method: Ensemble learning with multiple SVM based classifiers to optimize generalization ability of email spam classification Keywords: ensemble learning, SVM classifier,

More information

Email Spam Detection A Machine Learning Approach

Email Spam Detection A Machine Learning Approach Email Spam Detection A Machine Learning Approach Ge Song, Lauren Steimle ABSTRACT Machine learning is a branch of artificial intelligence concerned with the creation and study of systems that can learn

More information

Spam Filtering using Naïve Bayesian Classification

Spam Filtering using Naïve Bayesian Classification Spam Filtering using Naïve Bayesian Classification Presented by: Samer Younes Outline What is spam anyway? Some statistics Why is Spam a Problem Major Techniques for Classifying Spam Transport Level Filtering

More information

Impact of Feature Selection on the Performance of Wireless Intrusion Detection Systems

Impact of Feature Selection on the Performance of Wireless Intrusion Detection Systems 2009 International Conference on Computer Engineering and Applications IPCSIT vol.2 (2011) (2011) IACSIT Press, Singapore Impact of Feature Selection on the Performance of ireless Intrusion Detection Systems

More information

ECUE: A Spam Filter that Uses Machine Learning to Track Concept Drift

ECUE: A Spam Filter that Uses Machine Learning to Track Concept Drift ECUE: A Spam Filter that Uses Machine Learning to Track Concept Drift Sarah Jane Delany 1 and Pádraig Cunningham 2 and Barry Smyth 3 Abstract. While text classification has been identified for some time

More information

Artificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing and Developing E-mail Classifier

Artificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing and Developing E-mail Classifier International Journal of Recent Technology and Engineering (IJRTE) ISSN: 2277-3878, Volume-1, Issue-6, January 2013 Artificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing

More information

Anti Spamming Techniques

Anti Spamming Techniques Anti Spamming Techniques Written by Sumit Siddharth In this article will we first look at some of the existing methods to identify an email as a spam? We look at the pros and cons of the existing methods

More information

Spam Detection System Combining Cellular Automata and Naive Bayes Classifier

Spam Detection System Combining Cellular Automata and Naive Bayes Classifier Spam Detection System Combining Cellular Automata and Naive Bayes Classifier F. Barigou*, N. Barigou**, B. Atmani*** Computer Science Department, Faculty of Sciences, University of Oran BP 1524, El M Naouer,

More information

Filtering Junk Mail with A Maximum Entropy Model

Filtering Junk Mail with A Maximum Entropy Model Filtering Junk Mail with A Maximum Entropy Model ZHANG Le and YAO Tian-shun Institute of Computer Software & Theory. School of Information Science & Engineering, Northeastern University Shenyang, 110004

More information

Neural Networks for Sentiment Detection in Financial Text

Neural Networks for Sentiment Detection in Financial Text Neural Networks for Sentiment Detection in Financial Text Caslav Bozic* and Detlef Seese* With a rise of algorithmic trading volume in recent years, the need for automatic analysis of financial news emerged.

More information

Machine Learning for Naive Bayesian Spam Filter Tokenization

Machine Learning for Naive Bayesian Spam Filter Tokenization Machine Learning for Naive Bayesian Spam Filter Tokenization Michael Bevilacqua-Linn December 20, 2003 Abstract Background Traditional client level spam filters rely on rule based heuristics. While these

More information

A Collaborative Approach to Anti-Spam

A Collaborative Approach to Anti-Spam A Collaborative Approach to Anti-Spam Chia-Mei Chen National Sun Yat-Sen University TWCERT/CC, Taiwan Agenda Introduction Proposed Approach System Demonstration Experiments Conclusion 1 Problems of Spam

More information

DATA MINING TECHNIQUES AND APPLICATIONS

DATA MINING TECHNIQUES AND APPLICATIONS DATA MINING TECHNIQUES AND APPLICATIONS Mrs. Bharati M. Ramageri, Lecturer Modern Institute of Information Technology and Research, Department of Computer Application, Yamunanagar, Nigdi Pune, Maharashtra,

More information

Journal of Information Technology Impact

Journal of Information Technology Impact Journal of Information Technology Impact Vol. 8, No., pp. -0, 2008 Probability Modeling for Improving Spam Filtering Parameters S. C. Chiemeke University of Benin Nigeria O. B. Longe 2 University of Ibadan

More information

ifile: An Application of Machine Learning to E Mail Filtering

ifile: An Application of Machine Learning to E Mail Filtering ifile: An Application of Machine Learning to E Mail Filtering Jason D. M. Rennie Artificial Intelligence Lab Massachusetts Institute of Technology Cambridge, MA 02139 jrennie@ai.mit.edu ABSTRACT The rise

More information

Predict Influencers in the Social Network

Predict Influencers in the Social Network Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, lyzhou@stanford.edu Department of Electrical Engineering, Stanford University Abstract Given two persons

More information

Email Spam Detection Using Customized SimHash Function

Email Spam Detection Using Customized SimHash Function International Journal of Research Studies in Computer Science and Engineering (IJRSCSE) Volume 1, Issue 8, December 2014, PP 35-40 ISSN 2349-4840 (Print) & ISSN 2349-4859 (Online) www.arcjournals.org Email

More information

Unmasking Spam in Email Messages

Unmasking Spam in Email Messages Unmasking Spam in Email Messages Anjali Sharma 1, Manisha 2, Dr. Manisha 3, Dr. Rekha Jain 4 Abstract: Today e-mails have become one of the most popular and economical forms of communication for Internet

More information

8. Machine Learning Applied Artificial Intelligence

8. Machine Learning Applied Artificial Intelligence 8. Machine Learning Applied Artificial Intelligence Prof. Dr. Bernhard Humm Faculty of Computer Science Hochschule Darmstadt University of Applied Sciences 1 Retrospective Natural Language Processing Name

More information

Spam Filter: VSM based Intelligent Fuzzy Decision Maker

Spam Filter: VSM based Intelligent Fuzzy Decision Maker IJCST Vo l. 1, Is s u e 1, Se p te m b e r 2010 ISSN : 0976-8491(Online Spam Filter: VSM based Intelligent Fuzzy Decision Maker Dr. Sonia YMCA University of Science and Technology, Faridabad, India E-mail

More information

The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2

The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2 2nd International Conference on Advances in Mechanical Engineering and Industrial Informatics (AMEII 2016) The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2 1 School of

More information

IT services for analyses of various data samples

IT services for analyses of various data samples IT services for analyses of various data samples Ján Paralič, František Babič, Martin Sarnovský, Peter Butka, Cecília Havrilová, Miroslava Muchová, Michal Puheim, Martin Mikula, Gabriel Tutoky Technical

More information

Naive Bayes Spam Filtering Using Word-Position-Based Attributes

Naive Bayes Spam Filtering Using Word-Position-Based Attributes Naive Bayes Spam Filtering Using Word-Position-Based Attributes Johan Hovold Department of Computer Science Lund University Box 118, 221 00 Lund, Sweden johan.hovold.363@student.lu.se Abstract This paper

More information

wealak@yahoo.com Sherif.shawki@gmail.com

wealak@yahoo.com Sherif.shawki@gmail.com MACHINE LEARNING METHODS FOR SPAM E-MAIL CLASSIFICATION ABSTRACT W.A. Awad 1 and S.M. ELseuofi 2 1 Math.&Comp.Sci.Dept., Science faculty, Port Said University wealak@yahoo.com 2 Inf. System Dept.,Ras El

More information

Search Result Optimization using Annotators

Search Result Optimization using Annotators Search Result Optimization using Annotators Vishal A. Kamble 1, Amit B. Chougule 2 1 Department of Computer Science and Engineering, D Y Patil College of engineering, Kolhapur, Maharashtra, India 2 Professor,

More information

Comparing Support Vector Machines, Recurrent Networks and Finite State Transducers for Classifying Spoken Utterances

Comparing Support Vector Machines, Recurrent Networks and Finite State Transducers for Classifying Spoken Utterances Comparing Support Vector Machines, Recurrent Networks and Finite State Transducers for Classifying Spoken Utterances Sheila Garfield and Stefan Wermter University of Sunderland, School of Computing and

More information

Neural Network Predictor for Fraud Detection: A Study Case for the Federal Patrimony Department

Neural Network Predictor for Fraud Detection: A Study Case for the Federal Patrimony Department DOI: 10.5769/C2012010 or http://dx.doi.org/10.5769/c2012010 Neural Network Predictor for Fraud Detection: A Study Case for the Federal Patrimony Department Antonio Manuel Rubio Serrano (1,2), João Paulo

More information

Data Quality Mining: Employing Classifiers for Assuring consistent Datasets

Data Quality Mining: Employing Classifiers for Assuring consistent Datasets Data Quality Mining: Employing Classifiers for Assuring consistent Datasets Fabian Grüning Carl von Ossietzky Universität Oldenburg, Germany, fabian.gruening@informatik.uni-oldenburg.de Abstract: Independent

More information

Hoodwinking Spam Email Filters

Hoodwinking Spam Email Filters Proceedings of the 2007 WSEAS International Conference on Computer Engineering and Applications, Gold Coast, Australia, January 17-19, 2007 533 Hoodwinking Spam Email Filters WANLI MA, DAT TRAN, DHARMENDRA

More information

How To Create A Text Classification System For Spam Filtering

How To Create A Text Classification System For Spam Filtering Term Discrimination Based Robust Text Classification with Application to Email Spam Filtering PhD Thesis Khurum Nazir Junejo 2004-03-0018 Advisor: Dr. Asim Karim Department of Computer Science Syed Babar

More information

Data quality in Accounting Information Systems

Data quality in Accounting Information Systems Data quality in Accounting Information Systems Comparing Several Data Mining Techniques Erjon Zoto Department of Statistics and Applied Informatics Faculty of Economy, University of Tirana Tirana, Albania

More information

Dr. D. Y. Patil College of Engineering, Ambi,. University of Pune, M.S, India University of Pune, M.S, India

Dr. D. Y. Patil College of Engineering, Ambi,. University of Pune, M.S, India University of Pune, M.S, India Volume 4, Issue 6, June 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Effective Email

More information

An evolutionary learning spam filter system

An evolutionary learning spam filter system An evolutionary learning spam filter system Catalin Stoean 1, Ruxandra Gorunescu 2, Mike Preuss 3, D. Dumitrescu 4 1 University of Craiova, Romania, catalin.stoean@inf.ucv.ro 2 University of Craiova, Romania,

More information

An Approach to Detect Spam Emails by Using Majority Voting

An Approach to Detect Spam Emails by Using Majority Voting An Approach to Detect Spam Emails by Using Majority Voting Roohi Hussain Department of Computer Engineering, National University of Science and Technology, H-12 Islamabad, Pakistan Usman Qamar Faculty,

More information

Data Mining Algorithms Part 1. Dejan Sarka

Data Mining Algorithms Part 1. Dejan Sarka Data Mining Algorithms Part 1 Dejan Sarka Join the conversation on Twitter: @DevWeek #DW2015 Instructor Bio Dejan Sarka (dsarka@solidq.com) 30 years of experience SQL Server MVP, MCT, 13 books 7+ courses

More information

Social Media Mining. Data Mining Essentials

Social Media Mining. Data Mining Essentials Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

More information

PLAANN as a Classification Tool for Customer Intelligence in Banking

PLAANN as a Classification Tool for Customer Intelligence in Banking PLAANN as a Classification Tool for Customer Intelligence in Banking EUNITE World Competition in domain of Intelligent Technologies The Research Report Ireneusz Czarnowski and Piotr Jedrzejowicz Department

More information

Towards applying Data Mining Techniques for Talent Mangement

Towards applying Data Mining Techniques for Talent Mangement 2009 International Conference on Computer Engineering and Applications IPCSIT vol.2 (2011) (2011) IACSIT Press, Singapore Towards applying Data Mining Techniques for Talent Mangement Hamidah Jantan 1,

More information

SURVEY PAPER ON INTELLIGENT SYSTEM FOR TEXT AND IMAGE SPAM FILTERING Amol H. Malge 1, Dr. S. M. Chaware 2

SURVEY PAPER ON INTELLIGENT SYSTEM FOR TEXT AND IMAGE SPAM FILTERING Amol H. Malge 1, Dr. S. M. Chaware 2 International Journal of Computer Engineering and Applications, Volume IX, Issue I, January 15 SURVEY PAPER ON INTELLIGENT SYSTEM FOR TEXT AND IMAGE SPAM FILTERING Amol H. Malge 1, Dr. S. M. Chaware 2

More information

Adaptive Filtering of SPAM

Adaptive Filtering of SPAM Adaptive Filtering of SPAM L. Pelletier, J. Almhana, V. Choulakian GRETI, University of Moncton Moncton, N.B.,Canada E1A 3E9 {elp6880, almhanaj, choulav}@umoncton.ca Abstract In this paper, we present

More information

Using News Articles to Predict Stock Price Movements

Using News Articles to Predict Stock Price Movements Using News Articles to Predict Stock Price Movements Győző Gidófalvi Department of Computer Science and Engineering University of California, San Diego La Jolla, CA 9237 gyozo@cs.ucsd.edu 21, June 15,

More information

Filtering Spam E-Mail from Mixed Arabic and English Messages: A Comparison of Machine Learning Techniques

Filtering Spam E-Mail from Mixed Arabic and English Messages: A Comparison of Machine Learning Techniques 52 The International Arab Journal of Information Technology, Vol. 6, No. 1, January 2009 Filtering Spam E-Mail from Mixed Arabic and English Messages: A Comparison of Machine Learning Techniques Alaa El-Halees

More information