Spam detection with data mining method: Ensemble learning with multiple SVM based classifiers to optimize generalization ability of email spam classification

Keywords: ensemble learning, SVM classifier, multiple classifiers, generalization ability, email spam, email spam classification

By Po-Chun CHANG

A dissertation submitted to the Advanced Analytics Institute for the degree of Master by Research in the University of Technology, Sydney

2012

Word count: 2520
Acknowledgements
CONTENTS
Chapter 1. Introduction
  a. Background
  b. Significance
Chapter 2. Literature review
  a. Background
  b. Text Analytics
  c. SVM based classifiers
    i. SVM overview
    ii. Kernel trick
  d. Optimization
  e. Incremental learning
  f. Ensemble learning
Chapter 3. Methodology
Chapter 4. Results
Chapter 5. Discussion
Chapter 6. Conclusion
Reference
Appendix A
Chapter 1. INTRODUCTION

[email spam] same as the previous assignment in TRP

Email has become a popular medium for spreading spam because it is fast, cheap, and globally accessible. Spam email, also known as junk email, unsolicited bulk email, or unsolicited commercial email, has become a serious problem. One consequence is financial loss for companies, whose servers need more storage space and computational power to handle large volumes of email [1]. Another is that spam is delivered to and stored in users' mailboxes without their consent, so users must spend time checking and deleting junk mail [2]. In addition, because spam emails may carry malicious software (e.g. phishing software), illegal advertising such as pyramid schemes, or sensitive information, spam has become a serious security issue on the Internet [3].

[classification] same as the previous assignment in TRP

One solution to the spam problem is data mining with machine learning techniques. According to Witten, Frank & Hall [4], data mining is the automatic or semi-automatic process of discovering structural patterns in data, that is, of extracting knowledge from existing information. Machine learning comprises the algorithms, formulas, and models that computers apply to recognize patterns in data efficiently and to use those patterns to predict outcomes on new data. Applied to spam, the principle is to measure the similarity between a new incoming email and existing emails labelled as spam [5]. If the match is positive, the email is classified as spam; otherwise it is treated as legitimate.

[generalization ability] Based on the concepts of data mining and machine learning, the key property of a learning algorithm is generalization.
As noted in the previous paragraph, data mining discovers patterns in existing data, so there is no guarantee that the discovered patterns will perform well on new incoming data. In statistical learning theory, the bounds based on the Vapnik-Chervonenkis (VC) dimension show that a small training error does not by itself guarantee a small generalization error [6-8]. For email spam classification, generalization ability means that the learning algorithm maintains its detection rate when the training data is reduced or when new forms of spam messages appear.

[SVM based classifiers]
Many learning algorithms have been proposed for data classification and categorization. The support vector machine (SVM) [6] is one of the preferred supervised learning algorithms because of its solid theoretical foundation, its theoretically good classification accuracy with a low risk of overfitting, and its reasonable training time [9]. An SVM is a linear learning algorithm: training finds the hyperplane that separates the data into two groups with the maximum margin [10]. For a dataset that is not linearly separable, the SVM uses the kernel trick to map the data instances implicitly into another feature space, usually of higher dimension, in which the data can become linearly separable [11].

[multiple classifiers] Empirical observation and machine learning applications show that one learning algorithm may achieve better results than others on a particular problem, but it is unrealistic to expect a single classifier to achieve the best results over the whole problem domain. Moreover, many learning algorithms rely on optimization techniques to reach high accuracy, and these may become stuck in local optima [12]. Nor is it practical for a single inducer (a model or classifier trained on a specific training set) to predict new incoming data with 100% accuracy. Provided that the classifiers' results do not compromise one another, combining the outcomes of multiple classifiers can improve accuracy. As the old saying goes, two heads are better than one.

[Ensemble learning] Ensemble learning is a technique that combines multiple classifiers into one synthesized classifier to improve prediction accuracy and generalization [13]. The generalization ability of an ensemble of multiple classifiers is usually much stronger than that of a single classifier.
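The claim that pooling classifiers whose results do not compromise one another improves accuracy can be illustrated with a small calculation. The sketch below is not part of the proposed method. It assumes an idealized case in which an odd number of classifiers make errors independently, all with the same error rate, and are combined by simple majority vote. The function name `majority_vote_error` is introduced here for illustration only.

```python
from math import comb


def majority_vote_error(n_classifiers: int, error_rate: float) -> float:
    """Probability that a majority of n independent classifiers is wrong.

    Assumes every classifier errs independently with the same error_rate.
    """
    if n_classifiers % 2 == 0:
        raise ValueError("use an odd number of classifiers to avoid ties")
    smallest_majority = n_classifiers // 2 + 1
    # Sum the binomial probabilities of k wrong votes for every k that
    # forms a majority.
    return sum(
        comb(n_classifiers, k)
        * error_rate ** k
        * (1 - error_rate) ** (n_classifiers - k)
        for k in range(smallest_majority, n_classifiers + 1)
    )


print(round(majority_vote_error(3, 0.3), 3))  # 0.216
```

Under these assumptions, three classifiers that are each wrong 30% of the time produce a majority vote that is wrong about 21.6% of the time, whereas classifiers that are wrong more often than not make the vote worse. Real classifiers trained on related data are rarely independent, so this calculation is an upper limit on the benefit rather than a prediction.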
Ensemble learning works by weighting several individual classifiers and combining them to produce the final decision. In particular, an ensemble of diverse classifiers usually generalizes much better than a single classifier.

This paper is organized as follows. In the literature review in Chapter 2, section (b) briefly describes how text messages are translated into a clean dataset. Section (c) introduces one of the most widely used learning algorithms, the support vector machine (SVM), and the kernel trick. Sections (d), (e), and (f) discuss existing optimization techniques for overcoming the weaknesses of SVM. The methodology proposed in this paper is discussed in Chapter 3, and the experimental results are presented in Chapter 4. Chapter 5 provides the discussion and Chapter 6 the conclusion.
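The weighted combination of individual classifiers described in this chapter can be sketched as follows. This is a minimal illustration under stated assumptions, not the method developed in Chapter 3: each classifier is assumed to return +1 for spam and -1 for legitimate, the weights are assumed to be given, and the three toy classifiers stand in for trained SVM models. All names (`weighted_vote`, `has_free`, and so on) are hypothetical.

```python
from typing import Callable, Sequence

# A classifier maps a message to +1 (spam) or -1 (legitimate).
Classifier = Callable[[str], int]


def weighted_vote(classifiers: Sequence[Classifier],
                  weights: Sequence[float],
                  message: str) -> int:
    """Combine classifier votes by a weighted sum and return +1 or -1."""
    if len(classifiers) != len(weights):
        raise ValueError("one weight is required per classifier")
    score = sum(w * clf(message) for clf, w in zip(classifiers, weights))
    # A tie (score == 0) is treated as legitimate, on the assumption that
    # wrongly deleting a real message costs more than letting spam through.
    return 1 if score > 0 else -1


# Toy stand-ins for trained classifiers (hypothetical rules, not SVMs).
def has_free(message: str) -> int:
    return 1 if "free" in message.lower() else -1


def has_many_exclamations(message: str) -> int:
    return 1 if message.count("!") >= 3 else -1


def mentions_invoice(message: str) -> int:
    return -1 if "invoice" in message.lower() else 1


ensemble = [has_free, has_many_exclamations, mentions_invoice]
weights = [0.5, 0.3, 0.2]  # assumed values, for illustration only

print(weighted_vote(ensemble, weights, "FREE prize!!! Click now"))           # 1
print(weighted_vote(ensemble, weights, "Please find the invoice attached"))  # -1
```

Because each classifier is only a function from message to vote, a newly trained classifier can be appended to the list with its own weight without retraining the others.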
a. BACKGROUND

[Email spam- what it is, why it is serious]

The Internet has become one of the most common media for data communication. It offers many services for connecting people to one another, such as Twitter, email, and blogs. However, technology is a double-edged sword: the Internet also provides an efficient way to spread junk messages, known as spam. Spam takes different forms on different channels and services, and email spam is one of the most widely recognized. Email spam, also known as junk email or unsolicited bulk email (UBE), consists of messages with similar content delivered to numerous recipients. Many studies report that the volume of spam received has increased steadily over the past decade [1, 2, 14-16].

Because many Internet service providers offer cheap or even free services to consumers, spammers (those who send spam messages) take advantage of them to run their businesses. For instance, free web-based email accounts provided by Gmail, Yahoo!, or Hotmail are misused by spammers to send junk mail [17]. Although some jurisdictions have enacted legislation to prohibit spam, such as the USA (CAN-SPAM Act of 2003) and the EU (Directive 2002/58/EC) [18], many spam messages are sent from other countries [19]. As a result, the spam problem keeps growing, because spreading spam is profitable and carries low risk.

According to Siponen and Stucke's qualitative analysis of the spam issue [1], the most serious problem for companies is the waste of human and technical resources. Many respondents believe spam will ruin the reputation of the Internet as a communication medium, since spammers use spam for advertising and even for spreading viruses and malware. Recipients may therefore come to regard email as a channel for less important information if the majority of the messages they receive are spam.
[how Email spam detection relate to data mining and machine learning ]

One of the most useful techniques for addressing spam is a spam filter based on content analysis of spam messages. Such filters identify spam using user-defined rules derived from the characteristics of spam messages [2]. For instance, the keyword "free" appears frequently in spam messages, so its presence can serve as one condition. However, spammers constantly look for ways to bypass the filter, so spam messages are crafted to exploit weaknesses in the filtering rules. To return to the previous example, the keyword "free" may be written as "f r 33" [2]. Spam filters today still look for clues that discriminate spam from legitimate messages, but they apply machine learning and data mining algorithms to discover the patterns of spam.

[Brabrabrabrabrabra. Incomplete]
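The keyword rule and its evasion described above can be made concrete with a short sketch. It is an illustration only: the keyword set, the character substitutions, and the function names are assumptions introduced here, not rules taken from any real filter.

```python
import re

SPAM_KEYWORDS = {"free"}  # hypothetical rule list

# Hypothetical table undoing a few digit-for-letter substitutions.
DIGIT_TO_LETTER = str.maketrans({"3": "e", "0": "o", "1": "l"})


def rule_filter(message: str) -> bool:
    """Flag a message if any whole word matches a spam keyword."""
    words = re.findall(r"[a-z0-9]+", message.lower())
    return any(word in SPAM_KEYWORDS for word in words)


def normalized_rule_filter(message: str) -> bool:
    """Strip spacing and punctuation, undo digit substitutions, then
    search for the keywords anywhere in the remaining text."""
    squashed = re.sub(r"[^a-z0-9]", "", message.lower())
    squashed = squashed.translate(DIGIT_TO_LETTER)
    return any(keyword in squashed for keyword in SPAM_KEYWORDS)


print(rule_filter("Get your FREE gift"))               # True
print(rule_filter("Get your f r 33 gift"))             # False
print(normalized_rule_filter("Get your f r 33 gift"))  # True
print(normalized_rule_filter("A carefree day"))        # True
```

The exact-match rule misses the obfuscated spelling, and the hand-written normalization that catches it also flags an innocent word such as "carefree". This trade-off is one motivation for learning patterns from data rather than maintaining rules by hand.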
[relationship between text message and data mining and machine learning]

[learning algorithm and ensemble learning algorithm]

b. SIGNIFICANCE

[project summary (will be deleted or move to other section) problem, idea, approach, outcome]

(Email spam, that is, unsolicited bulk messages, is a persistent and aggravating issue for companies and individuals. One simple but powerful solution is a spam detection system. Nowadays, spam detection does not focus merely on spam but also covers messages that are unwanted under company policy or individual preference. The content of a message is the crucial information for discriminating spam from legitimate messages. This project uses data mining and machine learning techniques to discover patterns in existing message content. Based on the discovered patterns, the system can sort received emails into categories and treat each category according to the organization's policy.)

[problem] Email spam, that is, unsolicited bulk messages, is a persistent and aggravating issue for companies and individuals. Receiving large numbers of unwanted messages not only wastes technical resources such as storage space but also creates the additional task of deleting junk email. Moreover, some spammers use spam to spread viruses and malware, which raises system security concerns. Even though some countries have enacted legislation to prohibit spam (for example, the Spam Act 2003 in Australia), the result is not satisfactory, because many spam messages are sent from other countries. Better to do it than wish it done: a certain degree of security measures is therefore recommended.

[idea] One well-known method for dealing with spam is a spam filter. In general, a spam filter classifies spam using rules or signatures. Nowadays, spam no longer means only unsolicited bulk messages; unwanted messages in general are also considered spam.
This raises a challenge: which messages should be considered unwanted depends on personal or company judgment. Furthermore, a spam detection system should be able to update its detection rules and signatures, since spammers continually create new forms of spam to penetrate the filter. At the same time, a filter with a high false detection rate is not acceptable, especially for a company: misclassifying one legitimate message as spam and deleting it may lose a potential customer.

[approach] Data mining and machine learning techniques are introduced to discriminate suspect messages from legitimate email based on text content. Through data analysis, a data mining method can discover message patterns in an existing database. A machine learning approach can effectively reduce manual intervention in spam detection and adapt better to continual changes in spam patterns. The spam filter proposed in this paper spans two research areas: implementation of the learning algorithm, and text-content analysis for feature selection.

The ensemble learning approach is the skeleton of the proposed spam filter system. Ensemble learning is a technique that combines multiple classifiers into one synthesized classifier to improve prediction accuracy and generalization. Algorithms and techniques are not good or bad in themselves; what matters is choosing and applying the methods appropriate to the conditions at hand so as to obtain high performance and accurate results. In this paper, multiple spam classifiers are trained diversely with support vector machine (SVM) algorithms, each from a different aspect. Pooling these classifiers' predictions, provided they do not compromise one another, should achieve a better result than a single classifier.

Feature selection based on text content is equally important. Because many learning algorithms, e.g. SVM, can handle only numeric data, the unstructured text must be converted into an algorithm-friendly format. How well the data is prepared affects both the training of the spam classifier and its prediction results.

[outcome] The new spam filter system is expected to be significant not only for its spam detection accuracy but also as a framework for future improvement. For example, a new spam classifier based on a new approach can be trained independently, and its predictions can simply be aggregated by the ensemble learning algorithm. In addition, the ensemble learning approach can enhance detection accuracy without replacing the old system, so the spam filter can be improved incrementally with less downtime.

Chapter 2. LITERATURE REVIEW
a. BACKGROUND
b. TEXT ANALYTICS
c. SVM BASED CLASSIFIERS
i. SVM OVERVIEW
ii. KERNEL TRICK
d. OPTIMIZATION
e. INCREMENTAL LEARNING
f. ENSEMBLE LEARNING

Chapter 3. METHODOLOGY

Chapter 4. RESULTS
Chapter 5. DISCUSSION

Chapter 6. CONCLUSION
REFERENCE:
1. Siponen, M. and C. Stucke. Effective Anti-Spam Strategies in Companies: An International Study. in Proceedings of the 39th Annual Hawaii International Conference on System Sciences (HICSS '06). 2006.
2. Guzella, T.S. and W.M. Caminhas, A review of machine learning approaches to Spam filtering. Expert Systems with Applications, 2009. 36(7): p. 10206-10222.
3. Kumar, R.K., G. Poonkuzhali, and P. Sudhakar, Comparative Study on Email Spam Classifier using Data Mining Techniques. Proceedings of the International MultiConference of Engineers and Computer Scientists, 2012. 1.
4. Witten, I.H., E. Frank, and M.A. Hall, Data Mining: Practical machine learning tools and techniques. 2011: Morgan Kaufmann.
5. Amayri, O. and N. Bouguila, A study of spam filtering using support vector machines. Artificial Intelligence Review, 2010. 34(1): p. 73-108.
6. Vapnik, V.N., The Nature of Statistical Learning Theory. 1995, NY: Springer Verlag.
7. Vapnik, V.N., The nature of statistical learning theory. 2000: Springer-Verlag New York Inc.
8. Burges, C.J.C., A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 1998. 2(2): p. 121-167.
9. Diao, L., C. Yang, and H. Wang, Training SVM email classifiers using very large imbalanced dataset. Journal of Experimental and Theoretical Artificial Intelligence, 2012. 24(2): p. 193-210.
10. Boser, B.E., I.M. Guyon, and V.N. Vapnik, A training algorithm for optimal margin classifiers, in Proceedings of the fifth annual workshop on Computational learning theory. 1992, ACM: Pittsburgh, Pennsylvania, United States. p. 144-152.
11. Schölkopf, B. and A.J. Smola, Learning with kernels: Support vector machines, regularization, optimization, and beyond. 2002: the MIT Press.
12. Valentini, G. and F. Masulli, Ensembles of learning machines. Neural Nets, 2002: p. 3-20.
13. Dietterich, T., Ensemble methods in machine learning. Multiple classifier systems, 2000: p. 1-15.
14. Sastry, G., Spam Classification & Spam Filtering. 2011.
15. Laclavík, M., et al., Email analysis and information extraction for enterprise benefit. Computing and Informatics, 2011. 30(1): p. 57-87.
16. Fan, W.-c. Spam Message Recognition Based on Content. in 2011 International Conference on Computational and Information Sciences (ICCIS). 2011.
17. Ramachandran, A., et al. Spam or ham?: characterizing and detecting fraudulent not spam reports in web mail systems. 2011. ACM.
18. Carpinter, J. and R. Hunt, Tightening the net: A review of current and next generation spam filtering tools. Computers & Security, 2006. 25(8): p. 566-578.
19. Talbot, D., Where SPAM is born. Technology Review, 2008. 111(3): p. 28.

APPENDIX A: