Spam Filtering


A Thesis Paper Submitted to the School of Information Technology in Partial Fulfillment of the Requirements for the Degree of Bachelor of Science in Computer Science

By Bello, Lyndsay B.; Genelsa, Rick Joshua L.; Roxas, Edgel John M.

Mapúa Institute of Technology, July 2013


TABLE OF CONTENTS
Abstract
Acknowledgements
Chapter 1 Introduction
  Research Questions
  Scope and Limitation
  Significance of the Study
  Conceptual Framework
Chapter 2 Review of Related Literature
Chapter 3 Methodology
  Header Analysis
    STEP 1. Extraction of Header Features
    STEP 2. Validate FROM Field
    STEP 3. Validate TO Field
    STEP 4. Validate DATE Field
    STEP 5. Validate X-Mailer, Message-ID, Return-Path, Reply-To, CC, BCC, Sender
  Content-based Body Analysis
    STEP 1. HTML Tags Removal
    STEP 2. Extraction of Body Features
    STEP 3. Selection of Attributes
    STEP 4. Training of the Classifiers
  Testing of the Classifiers
  Classification of E-mails
  Accuracy Evaluation
Chapter 4 Results and Discussion
  Pre-Experiment
  Header Analysis
  Body Analysis
Chapter 5 Conclusion and Recommendations
  Conclusion
  Recommendations
APPENDICES

LIST OF FIGURES
Figure 1.1: Bayes Theorem
Figure 1.2: Naïve Bayesian Spam Filtering
Figure 1.3: C4.5 Pseudocode
Figure 1.4: Example of Decision Tree Built by C4.5 Algorithm
Figure 1.5: Parallel Approach Conceptual Framework
Figure 1.6: Sequential Approach Conceptual Framework
Figure 2.1: A Comparison of the Classification Accuracy of Case-based and Bayesian Spam Filtering
Figure 2.2: 10-fold Cross Validation
Figure 2.3: Sample Header
Figure 2.4: TCR result of SVM on SA corpus
Figure 2.5: TCR result of SVM on ZH1 corpus
Figure 3.1: Experimental Groups
Figure 3.2: Sample Extracted E-mails
Figure 3.3: Sample Header
Figure 3.4: Sample Body
Figure 3.5: Header Analysis
Figure 3.6: Different Valid Formats for the From Field
Figure 3.7: Spam with Wrong From and To Fields Format
Figure 3.8: Empty To Field
Figure 3.9: Sample of an Invalid Date
Figure 3.10: Sample of a Common Mail User Agent
Figure 3.11: Different Domain Names of the From and Message-ID Fields
Figure 3.12: Sample Header Analysis Results
Figure 3.13: Sample Body without HTML Tags
Figure 3.14: Sample Attributes/Tokens of an E-mail
Figure 3.15: Sample Summary of Frequency Table
Figure 3.16: Sample .csv File
Figure 3.17: Sample .arff File
Figure 3.18: Sample .arff File after Feature Selection
Figure 3.19: Sample .arff File of Test Set
Figure 3.20: Sample .arff File of the Test Result
Figure 3.21: Sample Result Tabulation
Figure 4.1: Sample Header Analysis Result
Figure 4.2: Sample Summary of All Feature Selection and Classifier Algorithms Performed
Figure 4.3: Threshold Curve for SVM Attribute Eval Ranker J48Graft
Figure 4.4: Threshold Curve for OneR Attribute Eval Ranker Naïve Bayes
Figure 4.5: Sample Classification Without Header Analysis
Figure 4.6: Screenshot of Spam Filter
Figure 4.7: Sample Tabulation for Validation
Figure 4.8: Software Architecture for Parallel Evaluation
Figure 4.9: Software Architecture for Sequential Evaluation

LIST OF TABLES
Table 2.1: Comparison of Different Spam Filtering Methods
Table 3.1: Approach and Rules
Table 4.1: Indicators of Spam
Table 4.2: Indicators of Nonspam
Table 4.3: Parallel Top 3 Result
Table 4.4: Sequential Top 3 Result
Table 4.5: Evaluation of Classifier using Training Set for Spam Class
Table 4.6: Sample Summary of 10-Fold Cross Result for Spam
Table 4.7: Results of Pre-selected Algorithms without Header Analysis
Table 4.8: Summary of Validation Result

ABSTRACT

Using e-mail is one of the major activities on the Internet. Now considered a major form of communication, it reaches thousands of users worldwide within a couple of seconds. However, spam has threatened the viability of communication via e-mail. To address this problem, the Researchers developed a model based on header fields and on words (also referred to as attributes or tokens) extracted from the e-mail's body. The data were then processed using WEKA, a data mining tool. The model studies the different header fields and attributes that are good indicators of spam e-mails.

Keywords: data mining, e-mail, spam, header analysis

ACKNOWLEDGEMENTS

The Researchers would never have been able to finish this Thesis without the guidance of the technical adviser and research panelists, help from friends, and support from family and God;

To the adviser, Mr. Ramon L. Rodriguez, for his excellent guidance, care, patience, and providence, and for an excellent atmosphere for doing research;

To the Faculty of the School of IT, for their inputs and wise suggestions;

To the beloved Ms. Sally, for going out of her way just to help;

To the Center for Student Advising, Mrs. Pamela A. Roldan and Ms. Mary Anne F. Balmes, for sharing school supplies and a place to study;

To the Friends, especially those who willingly lent their e-mail information for the needed data, and for the countless laughter shared;

Lastly, to the Families, for supporting the Researchers both financially and emotionally;

To God Himself, for without Him not a single word of this Thesis would have been conceived.

Chapter 1
INTRODUCTION

As the Internet continues to grow, it has opened new ways of communication. Using e-mail is thus a major activity when surfing the Internet. [1] This form of communication reaches thousands of users worldwide within a second; however, this freedom of communication can be misused. In the last couple of years, spam has become a phenomenon that threatens the viability of communication via e-mail. [2]

Spam started in the spring of 1978 with a man named Gary Thuerk, who wanted everyone to know about new products from his company, DEC. [3] Though users can recognize spam messages easily, it is difficult to give an accurate definition of spam. By one definition, spam is unsolicited, usually commercial, e-mail sent to a large number of addresses. [4] Spam is also defined as flooding the Internet with many copies of the same message, in an attempt to force the message on people who would not otherwise choose to receive it. [5] Some companies, organizations, or people take advantage of the freedom of using e-mail by sending unsolicited advertising or offensive messages; as a result, many users' inboxes are populated with unwanted messages. In 2007, a study [6] showed that eighty-five percent (85%) of e-mail traffic was accounted to spam and that, if trends continued, it would increase to ninety percent (90%) by year end.

Because of this growing problem, several techniques and solutions for spam filtering have already been proposed: decision trees (DT), support vector machines (SVM), the K-nearest neighbor algorithm (KNN), Naïve Bayes (NB), neural networks, ensemble decision trees (EDT), boosting, bagging and stacking, and others. [7][8] The problem with some techniques, however, is over-blocking, that is, filtering out even legitimate or normal e-mails.

Rev. Thomas Bayes, an English Nonconformist, mathematician, and author of An Essay Towards Solving a Problem in the Doctrine of Chances (1763), first used probability inductively and established a mathematical basis for probability inference (a means of calculating, from the number of times an event has occurred, the probability that it will occur in future trials). [9][10] He introduced what is known today as Bayes' Theorem: the probability of event B given A can be found by assuming that event A has occurred and, working under that assumption, calculating the probability that event B will occur. [11]

Figure 1.1: Bayes Theorem

From this theorem comes one of the most successful spam filtering tools available today [12], Bayesian spam filtering. The use of Bayesian logic in spam filtering started with Paul Graham's online article A Plan for Spam [13] and was later adopted by many developers. Bayesian spam filtering is based on the idea that the presence of certain words indicates that a message is spam, while other words identify it as legitimate. The method starts by analyzing different sets of e-mails that are already sorted as spam and legitimate. The Bayesian filter compares the contents of both sets and builds a database of words (also known as tokens), which can then be used to classify future e-mails as spam or not. [14] Naïve Bayesian (NB) is one of the commonly used algorithms based on the described Bayesian spam filtering.
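The token-based idea described above can be illustrated with a minimal sketch. It is not the filter built in this study: the four training messages, the whitespace tokenizer, the Laplace smoothing, and the names `train` and `spam_probability` are all assumptions made for illustration.

```python
import math
from collections import Counter

def train(messages):
    """Count tokens per class from (text, label) pairs; labels are 'spam' or 'ham'."""
    token_counts = {"spam": Counter(), "ham": Counter()}
    class_counts = Counter()
    for text, label in messages:
        class_counts[label] += 1
        token_counts[label].update(text.lower().split())
    return token_counts, class_counts

def spam_probability(text, token_counts, class_counts):
    """Naive Bayes posterior P(spam | tokens), assuming independent tokens."""
    vocab = set(token_counts["spam"]) | set(token_counts["ham"])
    total_messages = sum(class_counts.values())
    log_score = {}
    for label in ("spam", "ham"):
        n_tokens = sum(token_counts[label].values())
        # Start from the class prior, then add one log-likelihood per token.
        score = math.log(class_counts[label] / total_messages)
        for token in text.lower().split():
            # Laplace (add-one) smoothing keeps unseen tokens from zeroing the score.
            score += math.log((token_counts[label][token] + 1) / (n_tokens + len(vocab)))
        log_score[label] = score
    # Normalise the two log scores into a probability.
    top = max(log_score.values())
    weights = {label: math.exp(s - top) for label, s in log_score.items()}
    return weights["spam"] / (weights["spam"] + weights["ham"])

# Hypothetical, hand-made training data (not taken from the thesis corpora).
training = [
    ("buy cheap pills now", "spam"),
    ("free offer click here", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch with the project team", "ham"),
]
token_counts, class_counts = train(training)
print(spam_probability("free pills offer", token_counts, class_counts))
print(spam_probability("project meeting monday", token_counts, class_counts))
```

The first message scores above 0.5 and the second below it, because their tokens appear only in the spam and legitimate training messages respectively.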

Figure 1.2: Naïve Bayesian Spam Filtering

Another learning approach used to counter spam is the C4.5 algorithm, a decision tree inducer developed by John Ross Quinlan. C4.5 is Quinlan's extension of his own ID3 algorithm. He introduced the gain ratio, which the algorithm uses as its splitting criterion. Splitting stops when the number of instances to be split falls below a specified threshold. After the decision tree is grown, error-based pruning is performed. [15]

Figure 1.3: C4.5 Pseudocode
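As a rough illustration of the gain ratio mentioned above, the sketch below computes it for a single nominal attribute. The attribute values and labels are invented toy data, and the function names are assumptions; the sketch does not reproduce the pseudocode of Figure 1.3 or WEKA's J48 implementation.

```python
import math
from collections import Counter

def entropy(items):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(items)
    return -sum((n / total) * math.log2(n / total) for n in Counter(items).values())

def gain_ratio(values, labels):
    """Information gain of splitting on `values`, divided by the split information."""
    total = len(labels)
    subsets = {}
    for value, label in zip(values, labels):
        subsets.setdefault(value, []).append(label)
    # Expected entropy of the class labels after the split.
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    info_gain = entropy(labels) - remainder
    # Split information penalises attributes with many distinct values.
    split_info = entropy(values)
    return info_gain / split_info if split_info > 0 else 0.0

# Toy data: does the message contain the token "free", and does it have a CC field?
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]
contains_free = [1, 1, 1, 0, 0, 0]   # separates the classes perfectly
has_cc = [1, 0, 1, 0, 1, 0]          # barely informative

print(gain_ratio(contains_free, labels))
print(gain_ratio(has_cc, labels))
```

A tree inducer using this criterion would branch on `contains_free` first, since its gain ratio is the larger of the two.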

Using a training set, C4.5 builds a decision tree according to the splitting node strategy. At each node, the algorithm selects the single attribute that most effectively splits its set of instances into subsets. It recursively visits each decision node and selects the optimal split until no further splits are possible. [16] The following premises guide the algorithm: (1) if all cases are of the same class, the tree is a leaf, and the leaf is returned with this class; (2) for each attribute, calculate the potential information provided by a test on the attribute (based on the probabilities of each case having a particular value for the attribute), and also calculate the gain in information that would result from a test on the attribute (based on the probabilities of each case with a particular value for the attribute being of a particular class); and (3) find the best attribute to branch on, depending on the current selection criterion. [17]

Figure 1.4: Example of Decision Tree Built by C4.5 Algorithm

E-mail headers are included in the message by the sender or by a component of the mail system, and also contain transit-handling trace information. [18] Standard headers include the following fields: (1) Return-Path, (2) Delivery-date, (3) Date, (4) Message-ID, (5) X-Mailer, (6) From, (7) To, and (8) Subject. [19]

According to Trevino [43], header analysis still has life. Results of his tests showed that header analysis is capable of detecting over ninety percent (90%) of current spam with less than one percent (1%) false positives. These tests also require very little training and processing

power. Since it focuses only on the header of the e-mail, messages that can trick a statistical filter (such as phishing scams or image spam) are still easily detected and eliminated.

A study by Wang and Chen [1], which used the header session for anti-spam, also focused on header analysis. Wang and Chen used header fields such as To, CC, From, X-Mailer, and Message-ID as cues for spam filtering. These fields were the primary basis for analysis in their study; they analyzed the fields and found loopholes and patterns that serve as cues in classifying an e-mail as spam or not spam. Rules used in classifying spam include the sender address being invalid and the recipient not appearing in the e-mail's To or CC fields.

As the fight against spam intensifies, spammers find more ways to hide their identities when sending spam e-mails and to bypass filtering methods. Because of this, header analysis is considered one of the main approaches to counter spam attacks with falsified header information. [20]

Research Questions

What other features, in both the header and the body, can increase the accuracy rate in classifying messages?

Which is a better approach for the detection of spam e-mails, parallel or sequential evaluation?

Which combination of feature selection and classification algorithms available in WEKA yields the highest accuracy rate?

Scope and Limitation

The training set and test set used in the study come only from the chosen corpuses. The e-mails used are in plain-text and HTML format only, and the analysis did not cover e-mail attachments. The frequency table created consists only of unigrams (a single item from a sequence). The study did not conduct any live system testing. Domain Name Server (DNS) spoofing detection was not considered, because testing it requires a live mail server. The extraction tool (Export Messages to EML Format of OutlookFreeware Utilities) had problems extracting some messages. As of this writing, the said tool has no documentation.

Significance of the Study

The result of the study will be of significance to future spam filtering developers. By increasing the accuracy of the algorithms with the help of the header session, it can help improve the classification rate of e-mails, thus decreasing the high e-mail traffic accounted to spam and reducing the heavy load taken by mail servers.

The study can also help improve the accuracy of filtering spam e-mails using different classification algorithms. It will also show whether header analysis, in sequence or in parallel, has an effect on the accuracy of spam filtering. The study can show which algorithm works best with header analysis by comparing the results for accuracy and false positives.

The study also aims to discover new header features and additional rules that could help improve classification accuracy.

Conceptual Framework

Figure 1.5: Parallel Approach Conceptual Framework (flowchart: the e-mail header passes through header features extraction, a text (.txt) file, format matching against RFC 1036 rules, and header classification; the e-mail body passes through body features extraction, data set cleaning, text (.txt), .csv and .arff files, and a classifier algorithm for body classification; if both results are the same, the classification is the matching result, otherwise it is indeterminate)

An e-mail consists of two parts, the header and the body. These features were extracted, cleaned, and separated for analysis. Extracted header features were saved in a text (.txt) file. The header part was analyzed using header analysis, while the body part was saved in a .csv file and converted to a .arff file for the content-based analysis.

For the parallel approach, header analysis and content-based body analysis classified the e-mails simultaneously, based on the rules for each method. The results of both analysis methods were compared, and the final classification is based on the rules indicated in Table 3.2.
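The comparison step of the parallel approach can be sketched as follows. This follows only the decision shown in Figure 1.5 (same result gives the matching classification, different results give "indeterminate"); the actual rules are those of the table cited above, which may be more detailed. The function name and the label strings are assumptions.

```python
def combine_parallel(header_label, body_label):
    """Return the shared label, or 'indeterminate' when the two analyses disagree."""
    if header_label == body_label:
        return header_label
    return "indeterminate"

# Hypothetical (header result, body result) pairs for three e-mails.
results = [
    ("spam", "spam"),
    ("nonspam", "nonspam"),
    ("spam", "nonspam"),
]
final_labels = [combine_parallel(header, body) for header, body in results]
print(final_labels)  # ['spam', 'nonspam', 'indeterminate']
```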

Figure 1.6: Sequential Approach Conceptual Framework (flowchart: the e-mail header passes through header features extraction, a text (.txt) file, format matching against RFC 1036 rules, and header classification; e-mails found to be spam are classified as spam, while the rest pass through body features extraction, data set cleaning, text (.txt), .csv and .arff files, and a classifier algorithm for body classification, which gives the final classification of spam or nonspam)

The sequential approach follows the same pre-processing steps as the parallel approach. However, header analysis processed the header part first, applying a slack filter. Only those e-mails classified as non-spam were then processed using the content-based analysis. E-mail classification is based on the rules indicated in Table 3.3.

The research used personal e-mails collected from people who agreed to give their received e-mails for the training set. E-mails from two corpuses, the SpamAssassin data set, which has 6,047 messages with a known 31% spam ratio [21], and Bruce Guenter's Spam Archive [22], were also used as the test data. The header fields extracted from the e-mails are the From, To, Subject, Date, Return-Path, Reply-To, CC, BCC, Sender, Message-ID, and X-Mailer (MUA) fields.
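As an illustration of extracting the header fields listed above, the sketch below uses Python's standard `email` package. The study itself used a separate extraction tool, so this is only an assumed equivalent; the sample message, the addresses, and the function name are made up.

```python
from email import message_from_string

# Header fields named in the text; lookups in the email package are case-insensitive.
FIELDS = ["From", "To", "Subject", "Date", "Return-Path", "Reply-To",
          "Cc", "Bcc", "Sender", "Message-ID", "X-Mailer"]

# Hypothetical raw message: headers, one empty line, then the body.
RAW = """\
Return-Path: <alice@example.com>
From: Alice <alice@example.com>
To: bob@example.org
Subject: Project meeting
Date: Mon, 01 Jul 2013 09:30:00 +0800
Message-ID: <1234@example.com>
X-Mailer: ExampleMailer 1.0

See you at 10.
"""

def extract_header_features(raw):
    """Map each field name to its value, or None when the field is absent."""
    message = message_from_string(raw)
    return {name: message.get(name) for name in FIELDS}

features = extract_header_features(RAW)
for name, value in features.items():
    print(f"{name}: {value}")
```

Fields that are missing from a message come back as `None`, which is itself a usable feature for rules such as an empty To field.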

Chapter 2
REVIEW OF RELATED LITERATURE

Spam is an issue about consent, not content. Whether the Unsolicited Bulk E-mail ("UBE") message is an advert, a scam, porn, a begging letter or an offer of a free lunch, the content is irrelevant - if the message was sent unsolicited and in bulk then the message is spam. [23]

Many studies have been published sharing different ways to fight spam, such as Rule-Based Spam Filtering, Content Hash Based Filtering, Machine Learning techniques, Support Vector Machines (SVM), Collaborative Filtering (CF), and Content-Based Filtering (CBF), to name a few. Among these methods, CBF has been the most widely used anti-spam solution because it is freely available and has commercial implementations. [24] Current research focuses on improving individual classifier performance through better preprocessing or enhancement of the learning algorithm. Ensembles that combine distinct spam classifiers have also been proposed. [25]

However, both CF and CBF have drawbacks. CF faces problems such as first-rater, sparsity of data, and privacy. The first issue arises from the difficulty of classifying e-mails that have not been rated before; the second arises when users rate few messages; and the last depends on what is shared. [25]

One of the strong benefits of CBF is that it reduces error rates, since legitimate e-mail would not be blocked even if the ISP from which it originated is on a real-time block list. It also needs only occasional refinement, meaning less hassle for the end user. [26]

One popular method under CBF is the Naïve Bayesian, an anti-spam filter text categorization technique. [27] This method is simple and fast because it requires only linear training time. [28]

In a study by Taninpong and Ngamsuriyaroj [28], an incremental Naïve Bayesian approach with variant incremental training was evaluated using the Trec05p-1 and Trec06p corpora. Their approach uses the concept of sliding windows in training on messages after classification by the filter. Using three schemes (with each scheme adding more data to the current scheme), they showed an increase in the accuracy rate of classifying e-mails: from 78.24% (for Trec05p) and 85.27% (for Trec06p) for the first scheme, it increased to 91.51% and 92.42% for the third scheme.

A study [29] conducted by Pantel and Lin also used the Naïve Bayesian algorithm to classify e-mails as spam or legitimate messages. They presented a spam-filtering program called SpamCop, which treats an e-mail message as a multiple set of words. In their experiments, SpamCop was able to classify ninety-two percent (92%) of the spam while misclassifying only 1.16% of non-spam e-mails. It also showed that a high accuracy rate is achievable using as few as 32 spam examples.

Another study of Naïve Bayesian as an anti-spam technique [30] used two methods, namely unigrams and bigrams. The idea of unigrams was taken from Paul Graham [13], and the question arose of whether unigrams and bigrams were redundant. The results showed otherwise: of the total number of spam messages detected by either of the classifiers, about forty percent (40%) were detected by both of them simultaneously. The remaining part was divided in approximately equal shares between them. This suggests that the two versions of the Naïve Bayesian classifier complement each other, without a need for human supervision of the retraining process.

However, another study [31], comparing several forms of Naïve Bayes and linear SVM, found that a different method is better. Though all the forms used are good choices for automatic filtering of spam, SVM presented the best average performance, with an accuracy rate of more than ninety percent (90%), compared to the Naïve Bayes approaches.

Cunningham, Nowlan, Delany, and Haahr [32] compared the Naïve Bayes approach to a case-based approach over a period of time, from May to January. They used cases containing 30 spam words, 30 non-spam words, and 7 header features to test the classification accuracy of both classifiers. Both classifiers were trained on 200 spam and 200 non-spam cases and evaluated on 150 spam and 150 non-spam cases at each test point. Results show that the case-based approach, with an accuracy greater than ninety-five percent (95%) throughout the testing process, performs better than the Naïve Bayes approach, which had an average accuracy of less than ninety-five percent (95%).

Figure 2.1: A Comparison of the Classification Accuracy of Case-based and Bayesian Spam Filtering

Another study [27], which evaluated a total of four different types of Naïve Bayesian, found the Fixed Token approach to be the most effective among the four techniques evaluated. In the series of tests and evaluations, keeping the cost of false positives to a minimum was the top priority. The researchers also found that "Click here" and "Buy free" are better indicators of spam than the independent words "Click", "here", "Buy", and "Free". It was also noted that, to get efficiency above ninety percent (90%) with less than one percent (1%) false positives, a content-based filter would not be enough. Because of this loophole in spam filtering, several other studies have included the header of the e-mail in fighting against spam.

Another technique used to classify spam is the C4.5 Decision Tree Algorithm developed by John Ross Quinlan, sometimes also called J48 after its open source Java implementation in WEKA. Researchers Abdelghani Bellaachia and Erhan Guven [33] used WEKA to investigate three data mining techniques (Naïve Bayes, the back-propagated neural network, and the C4.5 decision tree algorithm) and concluded that the C4.5 algorithm has a much better performance than the other two techniques.

In a study [34] comparing different decision tree algorithms conducted using different sizes of dataset; C4.5 yielded a 95.80% accuracy for the data size of Also in the same study, C4.5 and the Naïve Bayesian classifiers performed better compared to the other algorithms, averaging over ninety-five percent (95%) for the precision and recall rates.

Another study [16] showed that, though another decision tree (Logistic Model Tree) outperforms the C4.5 algorithm/J48 in terms of classification accuracy rate, C4.5/J48 is the best in terms of training time among the decision tree algorithms compared in the study.

This makes the said algorithm the best choice if training time is considered the critical factor in selecting filtering algorithms.

A comparative study by Sharma and Sahni [35] analyzed and compared four (4) algorithms, namely C4.5/J48, ID3, ADTree, and SimpleCART. For each algorithm, Sharma and Sahni used the 10-fold Cross Validation method for classifier training and results analysis. This 10-fold Cross Validation method means that:

Figure 2.2: 10-fold Cross Validation

This validation method reduces overfitting (random error), making the built classifier more reliable. After the comparison the C4.5/J48 got the highest accuracy rate of ( %) compared to the other three algorithms.

Another study [36], which evaluated different data mining techniques for spam filtering, showed that the three (3) tree classification algorithms (C4.5, CS-MC4, and Rnd) produce ninety-five percent accuracy. In the same study, the best results (above ninety-five percent (95%) accuracy) for runs filtering and stepwise discriminant analysis are from the algorithms C4.5 and CS-MC4.

J48 was also the most suitable algorithm to associate with AdaBoostM1 to filter spam, according to a study [37] by Ali and Xiang. They compared three algorithms (Decision Stump, J48, and Naïve Bayes), first without the AdaBoostM1 algorithm, and the second in

combination with AdaBoostM1. In the first experimental setup, J48 showed the highest classification accuracy rate of 92.98%, compared to 91.50% and 79.29% for Decision Stump and Naïve Bayes, respectively. However, based on training time, J48 performed the slowest having seconds, while Decision Stump having 7.65 seconds and Naïve Bayes with seconds. For the next experimental setup (all algorithms with AdaBoostM1), J48 again yielded the highest classification accuracy rate, 95.15%. Training time also improved for J48 when AdaBoostM1 was applied to the algorithm. From seconds, the training time was decreased to 6.84 seconds. The same study also showed that J48 has a high true positive rate and a low false positive rate, and is computationally less expensive than the other algorithms used in the study.

An e-mail has three (3) main elements: the header, the body, and the envelope. [38] The header part is said to be the most fascinating part of the e-mail. [39] It includes details about the message such as the sender, the receiver, the date, and the subject. Below is an example of an e-mail header. E-mail headers should always be read from bottom to top. [40]

Figure 2.3: Sample Header

Header analysis is the breakdown of the header part of the e-mail, separating it into its elements. Header-based spam filtering represents an efficient and lightweight approach to

achieve filtering of spam messages by inspecting message header information. Typically, a machine learning classifier is applied to features extracted from header information to distinguish legitimate messages from spam. E-mails can be filtered based just on the headers, no matter what they say in the body. [41]

Chin-Chien Wang [42] conducted a study to determine whether the header fields could be of use when filtering junk e-mails. In his study, he used 3,417 unsolicited e-mails; 60.3% of those had an invalid sender address, and in 92.8% the receiver address was not shown in the To or CC headers. He concluded that invalid sender addresses and irrelevant receiver addresses left in the header section of junk e-mails could be used by spam filter developers to develop new anti-spam strategies or to improve current anti-spam filters. Four anti-spam filtering techniques or methods were named in his study:

1. Filtering by Number of Recipients. This method is used for blocking e-mails sent to a large number of recipients. A downside of this method is that it could filter non-spam messages, because it assumes and considers all bulk e-mails to be junk.

2. Filtering by Keyword. This method is said to be efficient in filtering junk mails. It uses the probability that the header or the body of the e-mail contains specific words, such as "sell", "sex", "buy now", and other keywords that most people consider spam. The downside of this method is that there is a high chance that solicited mails might be filtered because of the words on the filter list.

3. Filtering by Sender Address. This method filters spam based on the sender address. As with other methods of filtering junk mails, the downside is that it is easy for the sender to use a new address or create a fake address and use it for spamming. There are many other problems regarding this method.

4. Filtering by Address Validity. This method checks whether the e-mail follows the proper format described in the Request for Comments. It is said that every mail should at least have a sender and a receiver.

According to Trevino, header analysis still has life. Results of his tests showed that header analysis is capable of detecting over ninety percent (90%) of current spam with less than one percent (1%) false positives. These tests also require very little training and processing power. Since it focuses only on the header of the e-mail, messages that can trick a statistical filter (such as phishing scams or image spam) are still easily detected and eliminated. [43]

A study conducted by Hu [44] also focused on the header part, including the originator field, destination field, X-Mailer field, sender server IP address, and e-mail subject. It tested 5 different spam classifiers: Random Forest (RF), Decision Tree (DT), Naïve Bayes (NB), Bayesian Network (BN), and Support Vector Machine (SVM). Testing accuracy, precision, recall, and F-measure on 2 different datasets consisting of 33,209 e-mails and 21,725 e-mails, the hybrid spam filtering framework showed that the random forest classifier has the best performance, with 96.7% accuracy, 92.99% precision, 92.99% recall, and 93.3% F-measure.

Using a two-phase spam filtering method based on a categorized Decision Tree data mining algorithm, Sheu utilized the basic information in e-mail header sessions to identify spam or legitimate e-mails. [45] Based on his experimental results, the efficiency was evaluated as follows: 96.5% accuracy, 96.67% precision, and 96.35% recall. He pointed out that these figures are not lower than those of other filtering methods that check e-mail content, even though he checked only the header sessions of e-mails, which reduces the computation cost and saves many system resources. Table 2.1 shows the results obtained by Sheu.
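The evaluation measures quoted in the studies above (accuracy, precision, recall, and F-measure) can be computed from actual and predicted labels as in the sketch below. These are the standard definitions with spam as the positive class; the eight labels are invented and do not come from any of the cited experiments.

```python
def evaluate(actual, predicted, positive="spam"):
    """Return accuracy, precision, recall and F-measure for the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)  # spam caught
    fp = sum(a != positive and p == positive for a, p in pairs)  # legitimate mail blocked
    fn = sum(a == positive and p != positive for a, p in pairs)  # spam missed
    tn = sum(a != positive and p != positive for a, p in pairs)  # legitimate mail passed
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure}

# Hypothetical results for eight messages.
actual    = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham"]
predicted = ["spam", "spam", "ham",  "ham", "ham", "ham", "ham", "spam"]
scores = evaluate(actual, predicted)
print(scores)
```

Here two spam messages are caught, one is missed, and one legitimate message is blocked, so accuracy is 0.75 while precision, recall, and F-measure are each 2/3.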

Table 2.1: Comparison of Different Spam Filtering Methods

In an experiment conducted by Le Zhang, Jingbo Zhu, and Tianshun Yao [46], the SpamAssassin and ZH1 corpora were processed into 3 versions to determine the contribution of different parts of an e-mail in filtering spam: one version uses only terms from the message body plus the subject line, another uses only tokens that occur in message headers, and the last has both mail body and headers tokenized. The experiment showed that the SVM classifier achieved good TCR values using information from mail headers only. They concluded that using message headers can be more reliable in filtering spam mails than using spam filters that focus only on the body. They also found that using both header and body in filtering mails is better than focusing on either the body or the header alone. They concluded that message headers should not be ignored and should be considered as important as mail bodies in filtering spam. Figures 2.4 and 2.5 show the result of the experiment.

Figure 2.4: TCR result of SVM on SA corpus. The 9 thresholds for body, header and all (body+header) feature types are 0.63, and 0.48 respectively. The corresponding 999 values are 2.31, 1.43 and

Figure 2.5: TCR result of SVM on ZH1 corpus. The 9 thresholds for body, header and all (body+header) feature types are 0.771, and 0.36 respectively. The corresponding 999 values are 1.47, 1.69 and 1.15.

A study [38] by Ahmed Obied used a machine learning approach based on Bayesian analysis to filter spam. This study differs from most anti-spam methods because it evaluated both the header and the body of the e-mail. The filter learns what spam and non-spam messages look like and can make binary classification decisions (spam or non-spam) based on what it has learned. The filter does not require any heavy maintenance; all that is needed is to train it once. After training, the filter becomes capable of filtering spam with high accuracy. The study used a feature extraction step that extracted words from both the header and the body using delimiters ( \n\f\r\t\./&%# {}[]! +=-() *?:;<>) and placed them in a hash table. The header and the body of an e-mail message are separated by an empty line. The 4 tests, on 5,000 messages equally divided between spam and non-spam, gave an average accuracy of 97.80%.

Wang and Chen [1] used header fields such as To, CC, From, X-Mailer, and Message-ID as cues for spam filtering. These fields were the primary basis for analysis in their study; they analyzed the fields and found loopholes and patterns that serve as cues in classifying an e-mail as spam or not spam. Rules used in classifying spam include the sender address being invalid and the recipient not appearing in the e-mail's To or CC fields.

Among the many studies on spam filtering, particularly on the Naïve Bayesian approach, the very influential article [13] by Paul Graham was adopted by many other researchers. In it, Graham pointed out the importance of message headers. The main difference of this study from [27] is probably the inclusion of message headers in filtering spam e-mails, on the grounds that such data should not be discarded. In his Bayesian analysis of the body and header of the e-mail, he made use of tokens, scores, and hash tables to determine whether an e-mail is spam or not. Based on a corpus of spam mail, each common word is given a score depending on how often it occurs in mail (for example "free", "sex", "sexy").

However, another study [47] by Gary Robinson put Paul Graham's approach under scrutiny. According to him, Graham's algorithm is subtly asymmetric in how it handles words that indicate an email is spam compared with words that indicate a legitimate message. He also pointed out that Graham's technique was based on an anonymous article [48] that takes the probabilities to be independent, which is not the case for words found in emails.

There are many studies on CBF spam filtering, but some ignore the header of the email. Although the different approaches applied to the content of the body achieve high accuracy in distinguishing spam from non-spam emails, other studies have shown that combining the body of the email with its header can further increase the accuracy of the filter. Because of these results, the researchers want to know which spam filtering algorithm, combined with header analysis, performs better under the Parallel or the Sequential approach. The researchers also want to discover other header features and additional rules that could help improve the classification accuracy of header analysis.

Chapter 3
METHODOLOGY

In order to validate the Researchers' point of view, two (2) major experimental groups of data collection were created. Group A worked in parallel with header analysis, while Group B worked in sequence. Using exhaustive search, both groups were trained using the same set of classification algorithms available in WEKA. To ensure fairness in the experiment, all 4,179 samples were selected from personal emails collected from people who opted to participate. The test set was taken from two existing spam corpora: the CSDMC2010 SPAM corpus, which has 4,327 messages in total (2,949 non-spam and 1,378 spam); and Bruce Guenter's Spam Archive (Figure 3.1).

Figure 3.1: Experimental Groups (flowchart: Body and Header Separation; Email's Body; Email's Header; Exhaustive Feature Selection; Exhaustive Search Algorithm; Apply Header Analysis; Classify; Evaluate Accuracy)

Emails for the training set were extracted using the Export Messages to EML Format tool [55] for Microsoft Outlook. Figure 3.2 shows sample emails extracted in .eml format using this tool.

Figure 3.2: Sample Extracted Emails

Before performing the two (2) methods of analysis, the header and body parts of each email were separated and saved in a .txt file, as shown in Figures 3.3 and 3.4. However, the content of the Subject field was included in the body.
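The separation step can be illustrated with a minimal sketch. It uses Python's standard `email` package rather than the researchers' own Java tool, and the function name `split_header_body` and the sample message are hypothetical. It assumes a single-part plain-text message, and moves the Subject value into the body as described above.

```python
from email import message_from_string


def split_header_body(raw_message):
    """Split a raw message into a dict of header fields and the body text.

    The Subject value is moved into the body, mirroring the step above.
    Repeated header names (e.g. Received) keep only their last value here.
    """
    msg = message_from_string(raw_message)
    headers = {name: value for name, value in msg.items()
               if name.lower() != "subject"}
    subject = msg.get("Subject", "")
    payload = msg.get_payload()
    # Multipart messages return a list; handling them is left out of this sketch.
    body = payload if isinstance(payload, str) else ""
    return headers, (subject + "\n" + body).strip()


# Hypothetical sample message; the empty line separates header from body.
RAW = (
    "From: Alice <alice@example.com>\n"
    "To: bob@example.com\n"
    "Subject: Lunch on Friday\n"
    "\n"
    "Are you free at noon?\n"
)

headers, body = split_header_body(RAW)
```

Here `headers` holds the From and To fields only, and `body` starts with the subject line followed by the message text.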

Figure 3.3: Sample Header

Figure 3.4: Sample Body

Header Analysis

In performing header analysis, the Researchers adapted Wang and Chen's [1] method, applying the rules the researchers observed from the spam emails collected.

Figure 3.5: Header Analysis (flowchart: Extracted Header Features; Validate FROM field; Invalid FROM field? (Yes/No); Validate TO field; Invalid TO field? (Yes/No); Mark as SPAM; Validate X-MAILER; Validate MESSAGE-ID; Validate RETURN-PATH; Validate REPLY-TO; Validate CC/BCC/SENDER; Header Analysis Result)

Table 3.1: Approach and Rules

Judgement: Judged as normal emails
Approach: Do not filter out emails with the following characteristics.
Rules: Normal email has the following characteristics:
1 Valid FROM field format.
2 Valid TO field format.
3 Valid DATE range.
4 Mail-User-Agent (X-MAILER) is commonly used.
5 MESSAGE-ID field format is valid.
6 MESSAGE-ID field domain is the same as the FROM field domain.
7 RETURN-PATH field domain is the same as the FROM field domain.
8 REPLY-TO field domain is the same as the FROM field domain.
9 Valid CC field format.
10 Valid BCC field format.
11 Valid SENDER field format.

Judgement: Judged as spam
Approach: Filter out emails that match rule 1, 2, and 3; or match any two of rules 4 to 11.
Rules: Spam has the following characteristics:
1 Invalid FROM field format.
2 Invalid TO field format.
3 Invalid DATE range.
4 Mail-User-Agent (X-MAILER) is not commonly used.
5 MESSAGE-ID field format is invalid.
6 MESSAGE-ID field domain is not the same as the FROM field domain.
7 RETURN-PATH field domain is not the same as the FROM field domain.
8 REPLY-TO field domain is not the same as the FROM field domain.
9 Invalid CC field format.
10 Invalid BCC field format.
11 Invalid SENDER field format.

Judgement: Judged as indeterminate
Approach: Neither normal nor spam emails.
Rules: Neither normal nor spam emails.

All the rules applied in the header analysis were patterned after the Request for Comments (RFC) 1036 Standard for USENET Messages. [51]
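The judgement approach in Table 3.1 can be sketched as a small decision function. This is an illustration only, not the researchers' implementation: the rule names are hypothetical labels, and the spam condition is read as "any one of rules 1 to 3, or any two of rules 4 to 11", which is how Chapter 4 describes the priority fields. The indeterminate judgement is not modeled.

```python
# Hypothetical labels for the eleven rules of Table 3.1.
PRIORITY_RULES = ("from", "to", "date")                       # rules 1-3
SECONDARY_RULES = ("x_mailer", "message_id_format",
                   "message_id_domain", "return_path",
                   "reply_to", "cc", "bcc", "sender")         # rules 4-11


def judge_header(violations):
    """Return 'spam' or 'normal' given the set of violated rule labels."""
    # A single violated priority rule is enough (assumed reading of the table).
    if any(rule in violations for rule in PRIORITY_RULES):
        return "spam"
    # Otherwise at least two secondary rules must be violated.
    secondary_hits = sum(rule in violations for rule in SECONDARY_RULES)
    return "spam" if secondary_hits >= 2 else "normal"
```

For example, `judge_header({"x_mailer"})` stays normal, while adding a second secondary violation such as `"reply_to"` tips the email to spam.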

STEP 1. Extraction of Header Features

Email headers are present on every email an individual receives via the Internet, and they can provide valuable diagnostic information such as hop delays, anti-spam results, and more. Figure 2.3 shows a sample of an email header. The Researchers extracted the header features for both Group A and Group B. Using the JavaMail API [58] in our own parsing tool, we:
a. Parsed an email message and extracted the header part.
b. Extracted the selected header fields: From, To, Date, X-Mailer, and Message-ID; the Return-Path, Reply-To, CC, BCC, and Sender fields were extracted if present.

STEP 2. Validate FROM Field

According to RFC 1036, each message must include the From field, containing the email address of the sender who wishes this message to be sent.

Figure 3.6: Different Valid Formats for the From Field

Based on observations from the collected spam emails, the From field of some spam emails does not follow the standard set by RFC 1036. This loophole was used as a criterion in judging the emails.

Figure 3.7: Spam with Wrong From and To fields Format

STEP 3. Validate TO Field

The recipient address was also used as a cue for judging normal emails. The To field follows a strict structure like the From field. It was also observed that the To field of spam emails is usually empty or contains undisclosed-recipients, as shown in the figure below.

Figure 3.8: Empty To Field

STEP 4. Validate DATE Field

The Date field was also used as a spam indicator. As shown in the figure below, some spam emails contain a date that is in the future. The researchers validated the date the email was sent, setting the valid range from when email started (1960) [52] up to the present date.

Figure 3.9: Sample of an Invalid Date

STEP 5. Validate X-Mailer, Message-ID, Return-Path, Reply-To, CC, BCC, Sender

MUAs (Mail User Agents) are software applications that are commonly used in sending email (refer to Appendix A). Conversely, an email that was not sent with a common MUA was treated as a cue for spam.

Figure 3.10: Sample of a Common Mail User Agent
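The To and Date checks of Steps 3 and 4 can be sketched as follows. This is a hedged illustration using Python's standard `email.utils`, not the JavaMail-based tool used in the study. The function names are hypothetical, the address test is deliberately loose (it only looks for an `@`), and the sample dates in the usage note are invented.

```python
from datetime import datetime, timezone
from email.utils import getaddresses, parsedate_to_datetime

# Lower bound of the accepted date range, following the text (1960).
EARLIEST = datetime(1960, 1, 1, tzinfo=timezone.utc)


def to_field_is_valid(value):
    """Treat a To field as valid if it holds at least one address with an '@'."""
    if value is None or not value.strip():
        return False                      # empty To field
    if "undisclosed-recipients" in value.lower():
        return False                      # undisclosed recipients
    addresses = [addr for _name, addr in getaddresses([value])]
    return any("@" in addr for addr in addresses)


def date_field_is_valid(value, now=None):
    """Treat a Date field as valid if it parses and lies between 1960 and now."""
    now = now or datetime.now(timezone.utc)
    try:
        sent = parsedate_to_datetime(value)
    except (TypeError, ValueError):       # raised for unparseable dates
        return False
    if sent.tzinfo is None:               # "-0000" yields a naive datetime
        sent = sent.replace(tzinfo=timezone.utc)
    return EARLIEST <= sent <= now
```

With `now` fixed at 1 July 2013, a message dated `Mon, 20 May 2013 10:00:00 +0800` passes, while one dated in 2038 is rejected as being in the future.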

Message-ID also follows a strict format (unique@full_domain_name) as stated in RFC 1036. This format was used to check the validity of the field. The researchers also compared the Message-ID's domain to the From field's domain. As observed in spam emails, the two domains were different. Figure 3.11 shows the comparison of the two fields' domains.

Figure 3.11: Different Domain Names of The From and Message-ID Fields

If the email contains Return-Path, Reply-To, CC, BCC, and/or Sender fields, these fields were also used to judge the emails. Most spammers tend to spoof the From field but leave their real email address in the Return-Path and Reply-To fields, so that when the recipient replies, the reply goes to their email address. [8] With this, the researchers compared the From field to the Return-Path and Reply-To fields to verify whether the fields are the same. The CC, BCC, and Sender fields were processed in the same way as the To field.

All possible combinations of the header fields were used to allow the researchers to find the combination that, together with the body classifier algorithms, would yield the highest accuracy. The different results for all combinations are shown in Figure 3.12.
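A minimal sketch of the Message-ID checks is given below. The regular expression is a simplified stand-in for the `unique@full_domain_name` format and is not a full RFC grammar; the function names are hypothetical, and the domains are compared for exact equality, which is an assumption about how "same domain" was decided.

```python
import re
from email.utils import parseaddr

# Simplified pattern for <unique@full_domain_name>; an assumption, not the RFC grammar.
MESSAGE_ID_PATTERN = re.compile(
    r"^<[^<>@\s]+@([A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+)>$"
)


def message_id_domain(message_id):
    """Return the lowercased domain of a well-formed Message-ID, else None."""
    match = MESSAGE_ID_PATTERN.match(message_id.strip())
    return match.group(1).lower() if match else None


def address_domain(field_value):
    """Return the lowercased domain of the address in a From-style field, else None."""
    _name, address = parseaddr(field_value)
    if "@" not in address:
        return None
    return address.rsplit("@", 1)[1].lower()


def message_id_matches_from(message_id, from_value):
    """True when the Message-ID is well formed and shares the From domain."""
    mid_domain = message_id_domain(message_id)
    return mid_domain is not None and mid_domain == address_domain(from_value)
```

Exact comparison means `mail.example.com` and `example.com` count as different domains; a real filter might instead compare only the registered domain, which would reduce false alarms for legitimate senders that use subdomains.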

Figure 3.12: Sample Header Analysis Results

Content-based Body Analysis

The contents of the email body were filtered and classified using the attribute selection and classifier algorithms available in WEKA that the Researchers selected.

STEP 1. HTML Tags Removal

HTML tags were removed in order to clean the email and extract only the contents of the body, without its formatting.

Figure 3.13: Sample Body without HTML Tags

STEP 2. Extraction of Body Features

Using Apache OpenNLP [53], the body features were extracted from the messages (shown in Figure 3.14). Applying Ahmed Obied's [38] method, we:
a. Extracted attributes from the body by tokenizing using the delimiter: \n\f\r\t\\ /&%# {}[]! +=-()\'\"*?:;<>@~._
b. Ignored attributes of three characters or less.
c. Ignored stop words (refer to Appendix C).
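The three sub-steps above can be sketched in a few lines. This is an illustration in Python rather than the OpenNLP-based tool used in the study. The stop-word set is a small hypothetical stand-in for Appendix C, lowercasing is an added assumption, and the spaces that appear inside the delimiter list are treated as the single space delimiter.

```python
import re

# Delimiter characters as listed in the text (space included).
DELIMITERS = "\n\f\r\t\\ /&%#{}[]!+=-()'\"*?:;<>@~._"
SPLIT_PATTERN = re.compile("[" + re.escape(DELIMITERS) + "]+")

# Hypothetical stand-in for the stop-word list of Appendix C.
STOP_WORDS = {"this", "that", "with", "from", "have", "your"}


def extract_tokens(body):
    """Tokenize on the delimiters, then drop short tokens and stop words."""
    tokens = SPLIT_PATTERN.split(body.lower())   # lowercasing is an assumption
    return [t for t in tokens if len(t) > 3 and t not in STOP_WORDS]


sample_tokens = extract_tokens(
    "Click now for fast shipping with best prices: http://shop.example.com"
)
```

Because `.`, `:` and `/` are delimiters, a URL is broken into pieces such as `http`, `shop` and `example`, which is consistent with `http` appearing as a token in the frequency tables.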

Figure 3.14: Sample Attributes/Tokens of an Email

STEP 3. Selection of Attributes

Using the 4,179 samples of pre-classified emails as the training set, two (2) frequency tables were built containing the number of occurrences of words/tokens. Frequency Table 1 contains all the tokens extracted without using a dictionary. Frequency Table 2, on the other hand, contains all the tokens that were considered valid words by WordNet [54]. The frequency tables contain:
a. Word/Token
b. Number of times each word occurred that belongs to spam

Figure 3.15: Sample Summary of Frequency Table

With the threshold set at one hundred (100), the features from the two frequency tables with a hundred or more occurrences, totaling two hundred sixty (260) attributes, were selected as attributes for the body analysis (refer to Appendix D). A .csv file (shown in Figure 3.16) was created containing the computed frequency of each token and the actual classification of the emails. Each row represents an email in the training set. The frequency is computed using the formula:

Equation 3.1: Frequency Formula
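The construction of a frequency table and the threshold-based selection can be sketched as below. This is a minimal illustration with hypothetical function names and toy data; it does not reproduce Equation 3.1, whose exact form is not shown here, and it does not include the WordNet filtering used for Frequency Table 2.

```python
from collections import Counter


def build_frequency_tables(samples):
    """Count token occurrences over all emails and over spam emails only.

    `samples` is a list of (tokens, label) pairs, label being 'spam' or 'nonspam'.
    """
    total, spam = Counter(), Counter()
    for tokens, label in samples:
        total.update(tokens)
        if label == "spam":
            spam.update(tokens)
    return total, spam


def select_attributes(table, threshold=100):
    """Keep tokens occurring at least `threshold` times, most frequent first."""
    return [tok for tok, count in table.most_common() if count >= threshold]


# Toy training data; a threshold of 2 replaces the study's threshold of 100.
toy = [(["free", "click", "free"], "spam"),
       (["debian", "lists", "click"], "nonspam")]
total_table, spam_table = build_frequency_tables(toy)
selected = select_attributes(total_table, threshold=2)
```

In the toy data, `free` and `click` each occur twice and are kept, while `debian` and `lists` fall below the threshold.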

Figure 3.16: Sample .csv File

This file was converted to an .arff file that is readable by WEKA. A sample is shown in the figure below:

Figure 3.17: Sample .arff File
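The conversion from .csv to .arff can be sketched as follows. This is an illustrative converter, not the tool the researchers used. It assumes the first CSV row holds the token names, that every column except the class column is numeric, and that the class column is named `class`; the relation name and sample rows are hypothetical.

```python
import csv
import io


def csv_to_arff(csv_text, relation="emails", class_column="class"):
    """Convert CSV text (header row + data rows) to ARFF text for WEKA."""
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    header, data = rows[0], rows[1:]
    class_index = header.index(class_column)
    labels = sorted({row[class_index] for row in data})

    lines = ["@relation " + relation, ""]
    for i, name in enumerate(header):
        if i == class_index:
            # Nominal attribute listing the class labels found in the data.
            lines.append("@attribute %s {%s}" % (name, ",".join(labels)))
        else:
            lines.append("@attribute %s numeric" % name)
    lines += ["", "@data"]
    lines += [",".join(row) for row in data]
    return "\n".join(lines) + "\n"


arff_text = csv_to_arff("free,click,class\n0.25,0.0,spam\n0.0,0.1,nonspam\n")
```

Since the tokenizer splits on spaces and punctuation, the attribute names need no quoting in this sketch; names containing spaces would have to be quoted in a real ARFF file.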

In order to select the features that are useful in classifying the messages, the researchers applied the feature selection algorithms available in WEKA, each of which is divided into two adjustable functions: an attribute evaluator and a search method. Appendix B lists all the feature selection algorithms that were used. A sample .arff file of the training set after applying feature selection is shown in Figure 3.18.

Figure 3.18: Sample .arff File after Feature Selection

STEP 4. Training of the Classifiers

Using exhaustive search, different classification algorithms in WEKA [49] were used in the study to verify which algorithm works best with header analysis (refer to Appendix B).

Testing of the Classifiers

As shown in Figure 3.19, the test set was composed of 512 emails (155 nonspam and 357 spam). Testing was also done in WEKA using the model created in training each classifier. The result of the test was also saved in an .arff file, shown in Figure 3.20.

Figure 3.19: Sample .arff File of Test Set

Figure 3.20: Sample .arff File of the Test Result

Classification of Email

The results of the two (2) analyses were tabulated in an Excel (.xlsx) file as shown in Figure 3.21.

Figure 3.21: Sample Result Tabulation

Parallel Classification

The result of the header analysis was compared to the body analysis result. When the results of the two (2) analyses matched, the matching result was taken as the final classification; otherwise, the email was considered Indeterminate.

Sequential Classification

To avoid mistakenly filtering out normal emails, we applied Slack Filtering. In a slack filter, all normal emails were kept and processing continued. Emails classified as spam by the header analysis were automatically considered spam; otherwise, the final classification was based on the body analysis result.

Accuracy Evaluation

To evaluate the performance of the experimental models, the following performance measures were used:
1. True Positive Percentage (Correctly classified SPAM)
2. False Positive Percentage (NONSPAM classified as SPAM)
3. True Negative Percentage (Correctly classified NONSPAM)
4. False Negative Percentage (SPAM classified as NONSPAM)
5. Accuracy (What percent of the predictions is correct?)
6. Indeterminate (Percentage of indeterminate emails)
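The two ways of combining the header and body results can be written as two short functions. This is a sketch under the assumption that both analyses return either `"spam"` or `"nonspam"`; the function names are hypothetical.

```python
def classify_parallel(header_result, body_result):
    """Parallel: keep the label only when both analyses agree."""
    if header_result == body_result:
        return header_result
    return "indeterminate"


def classify_sequential(header_result, body_result):
    """Sequential (slack filtering): header spam is final, else use the body."""
    if header_result == "spam":
        return "spam"
    return body_result
```

The difference shows up when the analyses disagree: for a header result of `"nonspam"` and a body result of `"spam"`, the parallel rule returns `"indeterminate"` while the sequential rule returns `"spam"`, so the sequential rule never leaves an email unclassified.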

Chapter 4
RESULTS AND DISCUSSION

This chapter presents the results and discussion of the study. The series of steps and the accompanying computations are shown for both parallel and sequential evaluation. The chapter also presents the combinations of header fields that yielded the highest accuracy, the lowest false positive rate, and a low to zero indeterminate rate (both in parallel and in sequence), together with the best three models for each type of evaluation.

The Researchers looked for volunteers among their peers, colleagues, relatives, and mentors who were willing to give their emails for the purpose of the research. Before the study was carried out, its significance, rationale, and purpose were explained to the volunteers who gave their email accounts. The volunteers were also assured that all the data they provided would be used solely for the research and that their identities and email data would be kept confidential. Roughly 5,000 emails were collected, from which the researchers selected only emails in plain-text and HTML format. The pre-classified emails collected totaled 4,179. Afterwards, they used the Export Messages to EML Format [55] tool, which extracts emails from Microsoft Outlook to .eml format, which is Notepad-readable.

Pre-Experiment

Originally the Researchers considered only the common header fields from Wang & Chen [1], such as Subject, From, To, X-Mailer, and Message-ID. Once the emails were Notepad-readable, the Researchers began observing them and deduced that more header fields could be critical in header evaluation, such as the following:
Return-Path
BCC
Reply-To
Sender
CC
Date

The Researchers wrote a program to separate the header and body parts. They also appended the Subject to the body, since the Subject field offers no comparison or cue that can be used to judge an email as spam. Data cleaning was used only in Body Analysis.

Header Analysis

When the Researchers had gathered a reasonable number of emails for the training set, the emails were extracted from Microsoft Outlook. To further analyze the behavior of the fields of each email, the test set was analyzed to show whether any header field was invalid. They observed different irregularities in the fields of the emails' headers. The following spam characteristics were seen:
From and Return-Path/Reply-To fields are not similar;
From, To, and Sender fields contain no email address;
To fields contain undisclosed-recipients;
Message-ID contains a dollar ($) sign and does not follow the proper format;
Message-ID domain is not the same as the From field's domain; and
Date is invalid (example: Tue, 19 Jan :13: )

The JavaMail API was used to access and extract the header fields that were used for the header evaluation. In validating the selected fields, they based the rules on RFC 1036, which contains the valid formats for the header part of an email. [51]

According to RFC 1036, each message must have the From field. This field contains the email address of the sender who wishes this message to be sent and must follow the proper format of an email address. They then validated the field's format based on the standard set by RFC 1036.

In validating the To, CC, and BCC fields, the recipient address was also used as a cue for judging emails. The To field follows a strict structure like the From field. It was also observed that the To field of spam emails is usually empty or contains undisclosed-recipients; the same rule applies to the CC and BCC fields.

The Date field was also used as a spam indicator. As observed, some spam emails contain a date that is in the future. They validated the date the email was sent, setting the valid range from when email started (1960) [52] up to the present date.

MUAs (Mail User Agents) are software applications that are commonly used in sending email (refer to Appendix A). Conversely, an email that was not sent with a common MUA was treated as a cue for spam. The X-Mailer field of each email was checked against the list of commonly used Mail User Agents (refer to Appendix A).

Most spammers tend to spoof the From field but leave their real email address in the Return-Path and Reply-To fields, so that when the recipient replies, the reply goes to their email address. This loophole was also used as a cue for spam. If present, these two fields were compared to the From field to check whether the fields have the same values.
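The X-Mailer and Return-Path/Reply-To checks can be sketched as below. The list of common MUAs is a hypothetical stand-in for Appendix A, the function names are invented, and treating a missing X-Mailer as "not common" is an assumption that the text does not settle. Addresses are compared case-insensitively with display names ignored.

```python
from email.utils import parseaddr

# Hypothetical stand-in for the list of common MUAs in Appendix A.
COMMON_MUAS = ("microsoft outlook", "mozilla thunderbird", "apple mail")


def uses_common_mua(x_mailer):
    """True when the X-Mailer value names one of the listed MUAs."""
    if not x_mailer:
        return False          # assumption: a missing X-Mailer is not "common"
    value = x_mailer.lower()
    return any(name in value for name in COMMON_MUAS)


def same_address(from_value, other_value):
    """True when both fields carry the same address, ignoring display names."""
    from_addr = parseaddr(from_value)[1].lower()
    other_addr = parseaddr(other_value)[1].lower()
    return bool(from_addr) and from_addr == other_addr
```

For instance, `same_address("Alice <alice@example.com>", "<ALICE@example.com>")` holds, whereas a Reply-To pointing at a different mailbox does not and would count as one violated rule.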

The Researchers set the From, To, and Date fields as priority indicators of a spam email. If one of those three fields is invalid, the email is automatically classified as spam. The Message-ID, X-Mailer, Return-Path, Reply-To, CC, BCC, and Sender fields were treated with less priority, meaning an email must violate two of the rules imposed on these fields before being classified as spam. Below is a sample result of the Header Analysis showing all the emails that have an invalid field or format.

Figure 4.1: Sample Header Analysis Result

As shown in Figure 4.1, based on the rules set, the header analysis results show that 66.89% of the test emails have an invalid Message-ID domain, which is the most frequently violated rule, followed by the invalid Return-Path (39.84%) and the invalid Message-ID format (33%). Other rules showed low percentages, given that those fields are optional as indicated in RFC 1036. This also shows that these fields are good indicators for classification.

Body Analysis

Before body analysis, HTML tags were removed from all HTML-format emails. They used OpenNLP, a machine learning based toolkit for processing natural language text, for tokenization using the delimiter: \n\f\r\t\\ /&%# {}[]! and applied Ahmed Obied's method in parsing the words of the body part of the email. To further polish the tokens, data cleaning was done, such as ignoring words of three characters or less and removing stop words (refer to Appendix C).

From these, two frequency tables were created. Frequency Table 1 contains all the tokens extracted. Frequency Table 2, on the other hand, contains the tokens considered valid words by WordNet, a large database of English. From the two tables, features with one hundred (100) or more occurrences were selected as significant attributes, totaling two hundred sixty (260) words/tokens. From this frequency table, the following tokens were considered to be good indicators of a spam email.

Table 4.1: Indicators of Spam
Tokens listed: http, fast, click, shipping, best, please, customers, address, delivery, prices (= 1511)

On the other hand, the following are indicators of a nonspam email.

Table 4.2: Indicators of Nonspam
Tokens listed: debian, weblog, lists, unsubscribe, radio, blogspot (= 284), blog, postregsql, weblogs, index (= 259)

A .csv file was created containing the probability of the frequency of each token in an email. This file was then converted to an .arff file. Using exhaustive search, all possible combinations of evaluator and search methods (feature selection) were applied to at most five (5) classifiers from the different classifier groups (Bayes, Functions, Lazy, Meta, MI, Misc, Rules, and Trees groups) available in WEKA. Models were created using the training set, and the test set was then evaluated using these models. Predictions of the classifiers were saved to an .arff file and converted to a .csv file for easier tabulation. Results of each train-and-evaluate process were tabulated in an Excel file as shown in Figure 4.2.

For the parallel evaluation, the header result is compared to the body result. If the results of both analyses are the same, the final classification of the email is the matching result. Sequential evaluation, on the other hand, considers the result of the header analysis first. If that result is spam, the email is automatically considered spam. Otherwise, the body analysis result is used (Slack Filtering). Other results are shown in Appendix E.
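The size of the exhaustive search can be illustrated by enumerating the combinations. This sketch only lists candidate combinations and trains nothing; the names are plain labels taken from this chapter rather than WEKA class names, and only a handful of the algorithms in Appendix B are shown. The helper names are hypothetical.

```python
from itertools import combinations, product

# A few labels taken from this chapter; Appendix B lists the full sets.
EVALUATORS = ["OneR Attribute Eval", "SVM Attribute Eval", "Relief Attribute Eval"]
SEARCH_METHODS = ["Ranker"]
CLASSIFIERS = ["Naive Bayes", "J48Graft", "Decision Table"]


def enumerate_models(evaluators, search_methods, classifiers):
    """Every (evaluator, search method, classifier) combination."""
    return list(product(evaluators, search_methods, classifiers))


def header_field_subsets(fields):
    """Every non-empty subset of the given header fields."""
    subsets = []
    for size in range(1, len(fields) + 1):
        subsets.extend(combinations(fields, size))
    return subsets


models = enumerate_models(EVALUATORS, SEARCH_METHODS, CLASSIFIERS)
field_sets = header_field_subsets(["Date", "Message-ID", "Return-Path"])
```

Three header fields already give seven subsets, and the number doubles with each added field, which is why the results needed to be tabulated and summarized rather than inspected one by one.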

Figure 4.2: Sample Summary of All Feature Selection and Classifier Algorithms Performed

Permutation of the header fields used was also done to identify which combination of fields would work best with body analysis. All gathered results were summarized in a table as shown in Figure 4.3. Since several combinations yielded similar results, the top three were chosen based on the header fields used. The combination of header fields with the largest number of required fields was considered in order to make the header analysis more efficient.

Tables 4.3 and 4.4 present the True Positive (TP) rate, False Positive (FP) rate, True Negative (TN) rate, False Negative (FN) rate, Accuracy (A) rate, and Indeterminate (I) rate of the top three classifications for each type of evaluation. The TP rate is the percentage of correctly classified spam emails; the FP rate is the percentage of nonspam emails identified as spam; the TN rate is the percentage of correctly classified nonspam emails; the FN rate is the percentage of spam emails identified as nonspam; the A rate is the percentage of correct predictions; and the I rate is the percentage of unclassified emails.

For both the parallel and the sequential evaluation, the header, feature selection, and classifier algorithm combination with the highest accuracy rate, a low false positive rate, and a low indeterminate rate was considered. These performance measures were used as the basis for selection for the following reasons: the accuracy rate is the percentage of correctly classified emails, so a high accuracy rate shows a high number of correctly classified emails; a low false positive rate means few nonspam emails are classified as spam, so legitimate emails can pass through the filter; and a low indeterminate rate indicates that few emails passed through the filter without being classified. The top three combinations were selected based on these criteria.

Table 4.3: Parallel Top 3 Result
Columns (classifier with feature selection): NB with OneRAttribute-Ranker; J48G with SVMAttribute-Ranker; DT with OneRAttribute-Ranker
Header fields (run together across the three columns): From/To, Date, MID, Date, MID, MID, RPath, XMAIL, CC, BCC, RPath, XMAIL, RTo, XMAIL, Sender CC, BCC CC, BCC
Rows (performance measures): TP, FP, TN, FN, A, I
*TP = True Positive *FP = False Positive *TN = True Negative *A = Accuracy *I = Indeterminate

Table 4.4: Sequential Top 3 Result
Columns (classifier with feature selection and header fields): NB with Relief Attribute Eval-Ranker (Date, RPath, RTo, XMail); NB with SVMAttribute-Ranker (Date, RPath, RTo, XMail, BCC); NB with OneRAttribute-Ranker (Date, RPath, RTo, XMail, CC, BCC)
Rows (performance measures): TP, FP, TN, FN, A, I

The pre-selected models for both types of evaluation are: OneR Attribute Eval Ranker (with Naïve Bayes), SVM Attribute Eval Ranker (with J48Graft), OneR Attribute Eval Ranker (with Decision Table), Relief Attribute Eval Ranker (with Naïve Bayes), and SVM Attribute Eval Ranker (with Naïve Bayes). Table 4.5 presents a summary of the Correctly Classified Instances, Incorrectly Classified Instances, Kappa Statistic, Mean Absolute Error, Root Mean Squared Error, TP Rate, FP Rate, and ROC Area of these classifiers when evaluated on the training set.

Table 4.5: Evaluation of Classifier using Training Set for Spam Class
Columns: OneR Attribute Eval Ranker NB; OneR Attribute Eval Ranker DT; SVM Attribute Eval Ranker J48G; SVM Attribute Eval Ranker NB; Relief Attribute Eval Ranker NB
Rows: Correctly Classified Instances; Incorrectly Classified Instances; Kappa Statistic; Mean Absolute Error; Root Mean Squared Error; TP Rate; FP Rate; ROC Area

The correctly classified instances range from 75.47% to 99.29%. The kappa statistic for the classifiers starts at 0.26; the classifiers with a kappa of 0.89 have the better results. The ROC area values are high, starting at 0.88, which means that these models have a very good classifying ability.

Ten-fold cross validation was performed on the pre-selected classifiers to verify which of them would perform better with header analysis. This form of validation was done to reduce overfitting (random error), making the classifiers more reliable.

Table 4.6: Sample Summary of 10-Fold Cross Validation Result for Spam
Columns: OneR Attribute Eval Ranker NB; OneR Attribute Eval Ranker DT; SVM Attribute Eval Ranker J48G; SVM Attribute Eval Ranker NB; Relief Attribute Eval Ranker NB
Rows: Correctly Classified Instances; Incorrectly Classified Instances; Kappa Statistic; Mean Absolute Error; Root Mean Squared Error; TP Rate; FP Rate; ROC Area

The correctly classified instances range from 75.31% to as much as 97.97%. The 10-fold cross validation divides the data into ten equal sets, trains the classifier on nine sets, and tests it on the remaining set; the process is repeated ten times and the outputs of the tests are averaged. As shown in Table 4.6, a noticeable decrease in accuracy can be seen after performing 10-fold cross validation because, unlike the usual evaluation in which the classifier is trained and tested on the same data set, 10-fold cross validation uses a different training set and test set in each of its ten tests. OneR Attribute Eval Ranker DT and SVM Attribute Eval Ranker J48G (both from the parallel evaluation) again showed better results than the other models.

The ROC area values start at 0.85. For all the classifiers, the ROC area decreased after 10-fold cross validation; however, the difference is quite small. All models tested well. Again, OneR Attribute Eval Ranker DT and SVM Attribute Eval Ranker J48G have better results (0.95 and 0.97, respectively) than the other models.

Error measures such as the Mean Absolute Error (MAE), which indicates how far a prediction is from the actual value, and the Root Mean Squared Error (RMSE), which measures the difference between the values predicted by a given model and the actual values, were also considered in the evaluation. [57] The results show again that the OneR Attribute Eval Ranker DT and SVM Attribute Eval Ranker J48G models are more accurate than the other models.

SVM Attribute Eval Ranker J48 Graft yielded the highest accuracy rate, 97.97%, for the parallel evaluation; OneR Attribute Eval Ranker Naïve Bayes and Relief Attribute Eval Ranker Naïve Bayes both reached 75.31% accuracy for the sequential evaluation. Figures 4.4 and 4.5 show the threshold curves of SVM Attribute Eval Ranker with J48 Graft and OneR Attribute Eval Ranker Naïve Bayes. Other graphs are shown in Appendix G and H.

In the ROC curves shown, the x-axis corresponds to the false positive rate, whereas the y-axis corresponds to the true positive rate. The graphs for the two selected models show that, across thresholds, the true positive rate exceeds the false positive rate. With these results, the two models can be used to predict classification.
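The 10-fold procedure described above can be sketched in plain Python. This is an illustration, not WEKA's implementation, which may shuffle or stratify the data; this sketch does neither. The function names and the majority-class "classifier" are hypothetical, and the number of samples is assumed to be at least `k`.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for k folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def cross_validate(samples, labels, train_fn, k=10):
    """Average accuracy over k folds; train_fn returns a predict function."""
    accuracies = []
    for train, test in k_fold_indices(len(samples), k):
        predict = train_fn([samples[i] for i in train],
                           [labels[i] for i in train])
        correct = sum(predict(samples[i]) == labels[i] for i in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / k


def train_majority(samples, labels):
    """Toy learner: always predict the most common training label."""
    majority = max(set(labels), key=labels.count)
    return lambda sample: majority


toy_labels = ["spam"] * 15 + ["nonspam"] * 5
mean_accuracy = cross_validate(list(range(20)), toy_labels, train_majority)
```

Each sample is used for testing exactly once and never appears in the training set of its own fold, which is why cross-validated accuracy tends to be lower than accuracy measured on the training data itself.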

Figure 4.3: Threshold Curve for SVM Attribute Eval Ranker J48Graft

Figure 4.4: Threshold Curve for OneR Attribute Eval Ranker Naïve Bayes

Another set of tests was done without the header analysis to verify whether header analysis increases classification accuracy, as shown in Figure 4.7.

Figure 4.5: Sample Classification Without Header Analysis

Table 4.7: Results of Pre-selected Algorithms without Header Analysis
Columns (classifier with feature selection): NB with SVMAttribute Ranker; J48G with SVMAttribute Ranker; NB with OneRAttribute Ranker; DT with OneRAttribute Ranker; NB with ReliefAttribute Ranker
Rows: TP, FP, TN, FN, A

Table 4.7 presents the results of the pre-selected algorithms performed without header analysis. Taking OneR Attribute Eval Ranker Naïve Bayes as an example (see Tables 4.3 and 4.4 for the result with header analysis), the accuracy increased slightly from 82.81% to 83.79% when header analysis was included in the evaluation. This shows that header analysis can be a supplement to body analysis.

Comparing the two types of evaluation, the top three models for the parallel evaluation showed strong results based on the accuracy rate of the classified emails. The accuracy rates for the three models are quite high (above 90%). However, because the indeterminate rates were also quite high (ranging from 38.87% to 44.34%), meaning that a large number of emails were not classified, the evaluation is still considered weak. Many emails can still pass through the filter without being classified, which defeats the purpose of a spam filter.

On the other hand, the accuracy rates of the top three models of the sequential evaluation are also high, ranging from 83.20% to 83.79%. This type of evaluation outperformed its parallel counterpart because all emails were classified (0% indeterminate rate for all models), making it more reliable when applied to a live system.

Figure 4.6: Screenshot of Spam Filter

Out of all the models selected, SVM Attribute Eval Ranker J48 Graft yielded the highest accuracy rate (97.97%). Since this model was the most effective in classification, it was embedded in a program (see Figure 4.7) for further evaluation. For the header analysis, the Date, Message-ID, Return-Path, X-Mailer, CC, and BCC fields were used, as these are the fields commonly used by the top three pre-selected models (for both types of evaluation); these fields were also used with the selected model during the testing phase of the experiment. Sequential evaluation was used because the results presented showed that it performed better than Parallel evaluation. A set of 100 new emails (50 spam and 50 nonspam), taken from the researchers' own emails received in May 2013 and from the Enron Spam Dataset, was used for the validation. The result of the validation is shown in the figure below.

Figure 4.7: Sample Tabulation for Validation

After the validation, the performance of the spam filter was computed. The TP rate, FP rate, TN rate, FN rate, and Accuracy rate of the validation are shown in Table 4.8. As shown in the results, 85% of the emails were correctly classified by the spam filter. This result shows that SVM Attribute Eval Ranker J48 Graft is promising when used with Header Analysis.

Table 4.8: Summary of Validation Result
Model: SVM Attribute Eval Ranker J48 Graft
Header fields: Date, Message-ID, Return-Path, X-Mailer, CC, and BCC
TP 96%
FP 26%
TN 74%
FN 4%
A 85%

The figures below show the software architecture of the Spam Filter for both parallel and sequential evaluation. In parallel analysis, processing starts with the input of the email to be classified, which is then divided into the body part and the header part. The body part goes through removal of HTML tags and data cleaning or tokenization, is converted to a .csv file, is formatted as an .arff file, and is fed into the model that had the best result in the parallel analysis of the research, which produces the body output. Meanwhile, the header part goes through its own feature extraction and into its own model, which uses the combination that got the highest accuracy, producing the header output. Afterwards, the final classification is determined from the body output and the header output.

Figure 4.8: Software Architecture for Parallel Evaluation (diagram components: .eml (email); Header; RFC 1036 rules; Header Features Extraction; Text (.txt) File; Header Format Matching; Header Classification; Body; Body Features Extraction; Text (.txt) File; .csv File; .arff File; Model; Body Classification; Spam? (Yes/No); Classification: Spam; Classification: Nonspam)
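The rates in Table 4.8 can be cross-checked with a short calculation. The counts used below are not reported in the text; they are inferred from the percentages and the stated 50 spam and 50 nonspam validation emails, taking the 4% row as the false negative rate. The function name `rates` is hypothetical, and each rate is computed within its own class, which is an assumption about how the percentages were defined.

```python
def rates(tp, fn, tn, fp):
    """Per-class rates and overall accuracy, all as percentages."""
    spam_total = tp + fn          # all actual spam emails
    nonspam_total = tn + fp       # all actual nonspam emails
    return {
        "TP": 100 * tp / spam_total,
        "FN": 100 * fn / spam_total,
        "TN": 100 * tn / nonspam_total,
        "FP": 100 * fp / nonspam_total,
        "A": 100 * (tp + tn) / (spam_total + nonspam_total),
    }


# Inferred counts: 48 of 50 spam caught, 37 of 50 nonspam passed.
validation = rates(tp=48, fn=2, tn=37, fp=13)
```

Under these inferred counts the computed rates are 96% TP, 4% FN, 74% TN, 26% FP, and 85% accuracy, which agrees with the table: 48 + 37 = 85 correct classifications out of 100 emails.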

In sequential analysis, when an email is input, the header part of the email is extracted and then fed into the model that had the highest accuracy in the research, which produces the header output. If the header output is spam, the final classification is spam; otherwise, processing of the email proceeds to the body analysis. During body analysis, the HTML tags are removed, and data cleaning (tokenization/parsing) follows. The computed probability of the frequencies is then saved to a .csv file and converted to an .arff file. The .arff file is fed to the body analysis model. The final classification is based on the output of the body analysis.

Figure 4.9: Software Architecture for Sequential Evaluation (diagram components: .eml (email); Header; RFC 1036 rules; Header Features Extraction; Text (.txt) File; Header Format Matching; Header Classification; Body; Body Features Extraction; Text (.txt) File; .csv File; .arff File; Model; Body Classification; Same Result? (Yes/No); Classification: Matching Result; Classification: Indeterminate)

Chapter 5
CONCLUSION AND RECOMMENDATIONS

Conclusion

Based on the exploration of different header features for header analysis, the top three (3) models for both parallel and sequential evaluation use the following fields: Date, Message-ID, Return-Path, Reply-To, X-Mailer, CC, and BCC. It was also shown that performing header analysis can slightly increase the accuracy rate of the filter.

Using exhaustive search, all combinations of the selected feature selection and classifier algorithms (Appendix B) were explored. The experiment found that SVM Attribute Evaluation with the Ranker Search Method, combined with the J48 Graft algorithm as classifier, gave the best accuracy rate of 97.97% under 10-fold cross-validation. Another notable combination is OneR Attribute Evaluation with the Ranker Search Method and Decision Table as classifier, which yielded a 97.75% accuracy rate. With 75.31% correctly classified instances, OneR Attribute Evaluation with the Ranker Search Method and the Naïve Bayes algorithm was also considered, because it worked best with sequential evaluation.

The summary of the results clearly shows that sequential evaluation outperformed parallel evaluation for all combinations of attribute selection and classifier algorithms, because all emails were classified (0% indeterminate rate). This makes spam filtering more effective, since emails cannot easily pass through the filter without being classified.
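For readers unfamiliar with the 10-fold cross-validation behind the accuracy rates above, the following minimal sketch shows how a data set of n instances can be partitioned so that each instance is used for testing exactly once. It is a simplified illustration: the split is neither shuffled nor stratified by class, and the function names are hypothetical. It is not the procedure of the data mining tool used in the research.

```python
from typing import Iterator, List, Tuple


def k_fold_indices(n: int, k: int = 10) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists for k-fold cross-validation.

    Fold sizes differ by at most one, and every index is tested once.
    """
    if not 2 <= k <= n:
        raise ValueError("need 2 <= k <= n")
    base, extra = divmod(n, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size


def cross_validated_accuracy(correct_per_fold: List[int], n: int) -> float:
    """Correctly classified test instances summed over all folds, over n."""
    return sum(correct_per_fold) / n
```

A model is trained on each `train` list and scored on the matching `test` list; the reported accuracy is the share of all instances classified correctly across the folds.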

Recommendations

One future improvement to this research is to set more rules for the header analysis. The researchers were not able to implement some rules because of certain limitations; for example, detecting DNS spoofing and validating whether the sender address in the From field exists both require a live server. With such rules, the analysis and classification of emails could become more reliable.

Another improvement is to consider phrases in the body analysis. The researchers were only able to analyze unigrams; further studies could improve the result with bigrams such as "Sexy Girls" and "0% interest". By considering the meaning of phrases, false positives could be decreased, because some words that are considered spam indicators also appear in nonspam emails with a different meaning.

The researchers also had a hard time collecting emails from volunteers because of security concerns. To increase the number of emails that can be used for testing, the researchers recommend conducting the study in a bigger setting, such as a whole institution. Greatly increasing the number of samples would test the capacity of the filter further and would also allow a different set of attributes to be tested.

SVM Attribute Evaluation with the Ranker Search Method, the J48 Graft algorithm, and sequential analysis obtained the highest scores among the classifiers validated. Since this classifier shows a lot of promise, the researchers highly recommend further studies with this combination of feature selection, classifier algorithm, and evaluation type. They also recommend that the following header fields, which were used in the validation phase of the classifier, be used in further studies: Date, Message-ID, Return-Path, Reply-To, X-Mailer, CC, and BCC.
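To illustrate the recommendation on phrases, the sketch below shows one simple way bigrams could be extracted from an already tokenized email body and counted as candidate attributes. The function names are hypothetical, and the sketch is not part of the implemented filter.

```python
from collections import Counter
from typing import List, Tuple


def ngrams(tokens: List[str], n: int = 2) -> List[Tuple[str, ...]]:
    """Return all contiguous n-token windows, in order."""
    if n < 1:
        raise ValueError("n must be >= 1")
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bigram_counts(tokens: List[str]) -> Counter:
    """Count bigrams so that phrases can serve as attributes."""
    return Counter(ngrams(tokens, 2))
```

For example, the tokens `["0%", "interest", "loan"]` give the bigrams `("0%", "interest")` and `("interest", "loan")`, so "interest" is counted together with its neighbours instead of on its own.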


Appendix A
COMMONLY USED MAIL-USER-AGENTS

Microsoft Outlook
Open WebMail
Yahoo
iPhone
iPad
Apple Mail
Microsoft Outlook Express
Gmail
Thunderbird
iPod Touch
Hotmail
AOL
Sparrow
Windows Phone 7
Lotus Notes
Excite
BlackBerry
Palm WebOS
Entourage
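How this list is applied is described in the methodology, not in this appendix. As a hedged illustration only, one plausible use is to check whether the X-Mailer header of an email names a commonly used mail user agent. The sketch below uses Python's standard `email` module; the case-insensitive substring match and the function name are assumptions, not the researchers' rule.

```python
from email import message_from_string

# Mail user agents as listed in Appendix A.
COMMON_MUAS = [
    "Microsoft Outlook", "Open WebMail", "Yahoo", "iPhone", "iPad",
    "Apple Mail", "Microsoft Outlook Express", "Gmail", "Thunderbird",
    "iPod Touch", "Hotmail", "AOL", "Sparrow", "Windows Phone 7",
    "Lotus Notes", "Excite", "BlackBerry", "Palm WebOS", "Entourage",
]


def has_common_mailer(raw_email: str) -> bool:
    """Return True if the X-Mailer header names a listed mail user agent.

    Assumption: a case-insensitive substring match is sufficient.
    """
    message = message_from_string(raw_email)
    x_mailer = message.get("X-Mailer")
    if x_mailer is None:
        return False
    value = x_mailer.lower()
    return any(mua.lower() in value for mua in COMMON_MUAS)
```

A missing or unrecognized X-Mailer value does not by itself prove that an email is spam; in this sketch it is only one possible header feature.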

Appendix B
SELECTED ATTRIBUTE SELECTION AND CLASSIFIER ALGORITHMS

Attribute Selection

1. CfsSubsetEval Evaluator using the following Search Methods:
   a. Greedy Stepwise
   b. Scatter Search
   c. SubsetSize Forward Selection
   d. Linear Forward Selection
   e. Best First
   f. Genetic Search
6. Consistency using the following Search Methods:
   a. Greedy Stepwise
   b. Genetic Search
   c. Linear Forward Selection
   d. SubsetSize Forward Selection
   e. Best First
7. One Attribute Eval using Ranker Search Method
8. Relief Attribute Eval using Ranker Search Method
9. Wrapper Subset Eval using Genetic Search Method
10. SVM Attribute Eval using Ranker Search Method
11. Classifier Subset Eval using Genetic Search Method

Classifier Algorithms

1. Naïve Bayes
2. Naïve Bayes Simple
3. BayesNet (Bayes Network Learning using Hill Climbing Algorithm)
4. Complement Naïve Bayes
5. Naïve Bayes Updateable
6. Logistic (multinomial logistic regression model)
7. Multilayer Perceptron (back propagation)
8. RBF Network (normalized Gaussian radial basis function network)
9. Simple Logistic (linear logistic regression model)
10. SMO (John Platt's sequential minimal optimization algorithm)
11. SPegasos (stochastic variant of the Pegasos Primal Estimated sub-gradient solver for SVM method of Shalev-Shwartz et al.)
12. Voted Perceptron (voted perceptron algorithm by Freund and Schapire)
13. IB1 (nearest-neighbour classifier)
14. IBk (K-nearest neighbours classifier; LinearNNSearch)
15. KStar (K* is an instance-based classifier; that is, the class of a test instance is based upon the class of those training instances similar to it, as determined by some similarity function)
16. LWL (locally weighted learning; classifier: Decision Stump, nearest neighbour search algorithm: LinearNNSearch)
17. AdaBoost (boosting a nominal class classifier using the AdaBoost M1 method; classifier: Decision Stump)

18. Attribute Selected Classifier (dimensionality of training and test data is reduced by attribute selection before being passed on to a classifier; classifier: J48, evaluator: CfsSubsetEval, search: Best First)
19. Bagging (classifier to reduce variance; classifier: REPTree)
20. Classification Via Clustering (meta-classifier that uses a clusterer for classification; clusterer: SimpleKMeans)
21. Classification Via Regression (using regression methods; M5Base: M5 model trees and rules)
22. LogitBoost (additive logistic regression; classifier: Decision Stump)
23. Rotation Forest (classifier: J48, projection filter: principal components)
24. Dagging (creates a number of disjoint, stratified folds out of the data and feeds each chunk of data to a copy of the supplied base classifier; classifier: SMO)
25. CV Parameter Selection (parameter selection by cross-validation for any classifier; classifier: ZeroR)
26. FilteredClassifier (arbitrary classifier on data that has been passed through an arbitrary filter; classifier: J48, filter: Discretize)
27. Hyper Pipes
28. VFI (voting feature intervals)
29. Conjunctive Rule
30. Decision Table (search: Best First)
31. DTNB (decision table/naïve Bayes hybrid classifier)
32. JRip (repeated incremental pruning to produce error reduction, RIPPER)
33. NNge (nearest-neighbor-like algorithm using non-nested generalized exemplars)

34. OneR
35. PART (PART decision list; uses separate-and-conquer, builds a partial C4.5 decision tree in each iteration, and makes the best leaf into a rule)
36. ADTree (alternating decision tree)
37. BFTree (best-first decision tree)
38. Decision Stump
39. J48 (C4.5)
40. LADTree (multi-class alternating decision tree using the LogitBoost strategy)
41. NBTree (decision tree with naïve Bayes classifiers at the leaves)
42. Random Forest (forest of random trees)
43. Random Tree
44. FT/Functional Trees (classification trees that could have logistic regression functions at the inner nodes and/or leaves)
45. J48 Graft (grafted C4.5)

Appendix C
STOP WORDS

i me my myself we us our ours ourselves you your yours yourself yourselves he him his himself she her hers herself it its itself they them their theirs themselves what which who whom whose this that these those am is are was were be been being have has had having do does did doing will would should can could ought i'm you're he's she's it's we're they're i've you've we've they've i'd you'd he'd she'd we'd they'd i'll you'll he'll she'll we'll they'll isn't aren't wasn't weren't hasn't haven't hadn't doesn't don't didn't won't wouldn't shan't shouldn't can't cannot couldn't mustn't let's that's who's what's here's there's when's where's why's how's a an the and but if or because as until while of at by for with about against between into through during before after above below to from up upon down in out on off over under again further then once here there when where why how all any also both each few more most other some such no nor not only own same so than too very say says said shall
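As an illustration of how a stop-word list like this one can be applied during data cleaning, the sketch below tokenizes a message body, drops stop words, and counts the remaining tokens. The regular expression used for tokenization and the short excerpt of the list are assumptions made for the example; they are not the tokenizer of the implemented filter.

```python
import re
from collections import Counter

# Small excerpt of the Appendix C list, for illustration only.
STOP_WORDS = {"i", "you", "the", "a", "an", "and", "to", "of", "is", "your", "for"}

# Assumption: a token is a run of letters, optionally containing apostrophes.
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")


def token_frequencies(text: str, stop_words=STOP_WORDS) -> Counter:
    """Lowercase the text, tokenize it, and count non-stop-word tokens."""
    tokens = TOKEN_RE.findall(text.lower())
    return Counter(token for token in tokens if token not in stop_words)
```

Frequency counts of this kind are the sort of per-token values that could then be written to a .csv file before conversion to .arff.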

Appendix D
ATTRIBUTES

With Dictionary

http = 4192 | packaging = 1145 | money = 434 | watch = 300
click = | = 1143 | express = 431 | watches = 300
best = 2777 | images = 1025 | internet = 423 | details = 299
customers = 2565 | viagra = 1023 | visa = 423 | business = 290
delivery = 2255 | cialis = 1012 | method = 419 | content = 290
fast = 2162 | spam = 1004 | bank = 403 | offer = 285
shipping = 2130 | free = 831 | credit = 398 | family = 276
address = 1583 | usps = 755 | price = 371 | like = 276
prices = 1511 | canadian = 737 | discount = 356 | html = 267
quality = 1429 | show = 731 | service = 349 | need = 267
copy = 1391 | payment = 653 | view = 347 | make = 266
paste = 1363 | yahoo = 641 | save = 328 | special = 258
pharmacy = 1327 | links = 577 | time = 327 | blue = 251
discounts = 1302 | pack = 517 | windows = 321 | fund = 250
professional = | today = 505 | days = 320 | visit =
| security = 458 | products = 316 | wish = 244
market = 1276 | information = 452 | name = 310 | full = 239
link = 1255 | contact = 451 | info = 309 | funds = 233
guarantee = 1250 | mail = 446 | account = 306 | want = 230
drugs = 1151 | american = 441 | message = 306 | font = 228

match = 226 | place = 168 | customer = 145 | high = 126
server = 226 | stop = 161 | official = 145 | list = 126
product = 223 | browser = 160 | dear = 143 | female = 124
sent = 214 | bulletin = 160 | home = 141 | date = 123
find = 213 | transfer = 160 | installation = 141 | marketing = 123
know = 209 | country = 158 | dollars = 140 | questions = 123
number = 208 | office = 158 | term = 140 | replicas = 123
company = 205 | first = 157 | s = 139 | claim = 122
hello = 205 | sale = 156 | quick = 139 | friend = 122
tell = 201 | process = 154 | well = 139 | gifts = 122
work = 201 | months = 153 | deposit = 138 | must = 122
good = 197 | regards = 152 | love = 138 | response = 122
card = 191 | group = 151 | purchase = 138 | thanks = 121
explorer = 189 | people = 151 | reply = 137 | system = 120
take = 188 | deal = 150 | international = | great = 118
help = 187 | read = | | remove = 118
life = 183 | check = 149 | give = 133 | video = 118
order = 183 | february = 149 | offers = 133 | years = 118
privacy = 180 | updates = 149 | style = 132 | software = 117
systems = 171 | retail = 148 | goods = 130 | parcel = 116
luxury = 169 | support = 147 | policy = 129 | cash = 115
phone = 169 | million = 146 | world = 129 | loan = 115
back = 168 | size = 146 | package = 128 | release = 114

update = 114 | type = 109 | subject = 107 | part = 105
note = 113 | year = 109 | twitter = 107 | transaction = 102
start = 113 | dish = 108 | right = 106 | future = 101
month = 112 | replica = 108 | states = 106 | notification = 101
philippines = 111 | longer = 107 | capital = 105
button = 109 | look = 107 | even = 105
pills = 109 | rights = 107 | much = 105

Without Dictionary

please = 1696 | echeck = 391 | based = 165 | visual = 131
approved = 1318 | receive = | = 162 | facebook = 129
satisfied = 1291 | vigara = 343 | levtira = | = 124
returning = 1224 | just = 317 | required = | greetings = 123
productas = 1221 | unsubscribe = 307 | lowest = 155 | instead =
= 1141 | cheap = 293 | easy = 151 | thank = 117
online = 1023 | send = 251 | dealspot = 150 | doctorberk = 116
nbsp = 820 | next = 239 | earn = 149 | trackable = 116
enable = 574 | aspx = 212 | cilais = 148 | learn = 115
microsoft = 556 | chels = 210 | every = 147 | paid =
= 509 | fanbox = 198 | received = 140 | never = 112
smartdraw = 397 | united = 197 | without = 137 | really = 112
,ach = 391 | available = 196 | important = 134 | reserved = 110
,mastercard, = 391 | lose = 181 | interested = 134
bitcoin = 391 | receiving = 173 | worldwide = 134

Appendix E
SUMMARY OF EVALUATION TABULATION

PARALLEL RESULT TABULATION
Feature selection algorithm: OneAttributeEval - Ranker
Algorithm for classification: DecisionTable
Columns: No., Original Classification, Header Analysis Result, Body Analysis Result, Final Classification
[Table: one row per test email file, Nspam_01.eml.txt through Spam_99.eml.txt in file-name order; the per-email results are not preserved in this transcription.]

PARALLEL RESULT TABULATION
Feature selection algorithm: OneAttributeEval - Ranker
Algorithm for classification: NaiveBayes
Columns: No., Original Classification, Header Analysis Result, Body Analysis Result, Final Classification
[Table: one row per test email file, Nspam_01.eml.txt through Spam_99.eml.txt in file-name order; the per-email results are not preserved in this transcription.]

PARALLEL RESULT TABULATION
Feature selection algorithm: SVMAttributeEval - Ranker
Algorithm for classification: J48 Graft
Columns: No., Original Classification, Header Analysis Result, Body Analysis Result, Final Classification
[Table: one row per test email file, Nspam_01.eml.txt through Spam_99.eml.txt in file-name order; the per-email results are not preserved in this transcription.]

SEQUENTIAL RESULT TABULATION
Feature selection algorithm: OneAttributeEval - Ranker
Algorithm for classification: NaiveBayes
Columns: No., Original Classification, Header Analysis Result, Body Analysis Result, Final Classification
[Table: one row per test email file, beginning with Nspam_01.eml.txt; the per-email results are not preserved in this transcription.]

SEQUENTIAL RESULT TABULATION
Feature selection algorithm: SVMAttributeEval - Ranker
Algorithm for classification: Naive Bayes
Columns: No., Original Classification, Header Analysis Result, Body Analysis Result, Final Classification
[Table: one row per test email file, beginning with Nspam_01.eml.txt; the per-email results are not preserved in this transcription.]

SEQUENTIAL RESULT TABULATION
Feature selection algorithm: OneAttributeEval - Ranker
Algorithm for classification: NaiveBayes
Columns: No., Original Classification, Header Analysis Result, Body Analysis Result, Final Classification
[Table: one row per test email file, beginning with Nspam_01.eml.txt; the per-email results are not preserved in this transcription.]

Appendix F
SUMMARY OF THRESHOLD CURVE

SPAM CLASS
[Threshold curve plots. Surviving panel labels, in order: NAÏVE BAYES; OneR Attribute Evaluation, Ranker Search Method; DECISION TABLE; J48 Graft; SVM Attribute Evaluation, Ranker Search Method; NAÏVE BAYES; Relief Attribute Evaluation, Ranker Search Method; NAÏVE BAYES.]

NONSPAM CLASS
[Threshold curve plots. Surviving panel labels, in order: NAÏVE BAYES; OneR Attribute Evaluation, Ranker Search Method; DECISION TABLE.]

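Appendix G
SUMMARY OF COST-BENEFIT ANALYSIS

SPAM CLASS
[Cost-benefit analysis plots. Surviving panel labels, in order: NAÏVE BAYES; OneR Attribute Evaluation, Ranker Search Method; DECISION TABLE.]

NONSPAM CLASS
[Cost-benefit analysis plots. Surviving panel labels, in order: NAÏVE BAYES; OneR Attribute Evaluation, Ranker Search Method; DECISION TABLE.]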
