How To Create A Text Classification System For Spam Filtering


Term Discrimination Based Robust Text Classification with Application to Spam Filtering
PhD Thesis
Khurum Nazir Junejo
Advisor: Dr. Asim Karim
Department of Computer Science
Syed Babar Ali School of Science and Engineering
Lahore University of Management Sciences

Final Defence Committee Members
Dr. Asim Karim (Advisor)
Dr. Mian Muhammad Awais
Dr. Shahid Masud
Dr. Hamid Abdul Basit
Dr. Haroon Atique Babri (External Examiner)

Abstract

The Internet has touched every part of our lives, including our interactions and communications. Printed books are being replaced by electronic books (e-books), personal and official correspondence has shifted to electronic mail (e-mail), and news is now read online. This generates huge volumes of unstructured textual data that must be analyzed, filtered, and organized automatically in order to harness its wealth of information for profitable gains. It is projected that by 2013 the worldwide volume of e-mail will reach 507 billion messages per day, of which 89% will be spam. In 2008, the cost of spam to businesses in terms of hardware, software, and human resources was around $140 billion. Content-based text classification can automatically organize text documents into predefined thematic categories. However, text classification is challenging in the modern Internet environment. Firstly, text documents are sparsely represented in a very high dimensional feature space (easily in the hundreds of thousands of dimensions), making learning and generalization difficult. Secondly, because labeling documents is costly, researchers are forced to collect training data from sources other than the target domain, which results in a distribution shift between training and test data. Thirdly, although unlabeled data is easily available, using it to improve performance in practical text classification remains a challenge. One important domain for text classification that embodies these challenges is e-mail spam filtering. A typical e-mail service provider (ESP) serves thousands to millions of users, each of whom can have their own topics of interest and their own preferences regarding spam and non-spam e-mails. Personalized service-side spam filtering provides a solution to this problem; however, for such solutions to be practically usable they must be efficient, scalable, and robust to distribution shifts.
In this thesis, we propose a robust text classification technique that combines local generative models and global discriminative classifiers through the use of discriminative term weighting and linear opinion pooling. Terms in the documents are assigned weights that quantify the discrimination information they provide for one category over the others. These weights, called discriminative term weights (DTW), also serve to partition the terms into two sets. An opinion pooling strategy consolidates the discrimination information of the terms in each set to yield a two-dimensional feature space, in which a discriminant function is learned to categorize the documents. In addition to a supervised technique, we also develop two semi-supervised variants for personalizing the local and global models using unlabeled data. We then generalize our technique into a classifier framework that integrates different feature selection criteria, discriminative term weighting schemes, information pooling strategies, and discriminative classifiers. We provide a theoretical comparison of our proposed framework with existing generative, discriminative, and hybrid classifiers.

Our text classification framework is evaluated with five discriminative term weighting strategies, six opinion consolidation techniques, and four discriminative classifiers. We employ nine real-world datasets from different domains in our experimental evaluation, and the results are compared with four benchmark text classification algorithms via accuracy and AUC values. Our framework is also evaluated under varying distribution shift, on gray e-mails, on unseen e-mails, and under varying classifier size. Scalability of our spam filter is also demonstrated for personalized service-side spam filtering. Statistical significance tests confirm that our technique performs significantly better than the compared techniques in both supervised and semi-supervised settings, and in global and personalized spam filtering. In particular, it performs remarkably well when distribution shift is high between training and test data, a phenomenon common in e-mail systems.

Additional contributions of this thesis include a systematic analysis of the spam filtering problem and the challenges to effective global and personalized spam filtering at the service side. We formally define key characteristics of e-mail classification, such as distribution shift and gray e-mails, and relate them to machine learning problem settings. The concept of term discrimination introduced in this work has also found applications in text clustering, visualization, and feature extraction, and it can be extended to keyword extraction and topic identification from textual documents.
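The term weighting and pooling steps described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not state the weighting formula, so the sketch assumes a relative-risk-style ratio of smoothed document frequencies as the discriminative term weight, a plain average as the linear opinion pool, and a fixed straight line in place of a learned discriminant function. All function and parameter names (`learn_term_weights`, `pooled_scores`, `classify`, `slope`, `bias`, `smoothing`) are hypothetical.

```python
from collections import Counter


def learn_term_weights(docs, labels, smoothing=1.0):
    """Assumed weighting: ratio of smoothed per-class document frequencies.

    docs is a list of token lists; labels uses 1 for spam, 0 for non-spam.
    Returns two dicts, one per term set (spam terms, non-spam terms).
    """
    n_spam = sum(labels)
    n_ham = len(labels) - n_spam
    spam_df, ham_df = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        # Count each term once per document (document frequency).
        (spam_df if y == 1 else ham_df).update(set(tokens))

    spam_w, ham_w = {}, {}
    for term in set(spam_df) | set(ham_df):
        p_spam = (spam_df[term] + smoothing) / (n_spam + 2 * smoothing)
        p_ham = (ham_df[term] + smoothing) / (n_ham + 2 * smoothing)
        # The weight also partitions the vocabulary into two sets.
        if p_spam > p_ham:
            spam_w[term] = p_spam / p_ham
        elif p_ham > p_spam:
            ham_w[term] = p_ham / p_spam
    return spam_w, ham_w


def pooled_scores(tokens, spam_w, ham_w):
    """Average the weights of a document's terms within each set,
    mapping the document to a point in a two-dimensional space."""
    terms = set(tokens)
    n = max(len(terms), 1)
    spam_score = sum(spam_w.get(t, 0.0) for t in terms) / n
    ham_score = sum(ham_w.get(t, 0.0) for t in terms) / n
    return spam_score, ham_score


def classify(tokens, spam_w, ham_w, slope=1.0, bias=0.0):
    """Fixed linear discriminant in the 2-D space; slope and bias are
    placeholders for parameters that would be learned from data."""
    s, h = pooled_scores(tokens, spam_w, ham_w)
    return 1 if s - slope * h - bias > 0 else 0


# Toy data, invented purely for illustration.
train_docs = [
    ["free", "money", "now"],
    ["free", "offer"],
    ["meeting", "tomorrow"],
    ["project", "meeting", "notes"],
]
train_labels = [1, 1, 0, 0]
spam_w, ham_w = learn_term_weights(train_docs, train_labels)
print(classify(["free", "money"], spam_w, ham_w))
print(classify(["meeting", "notes"], spam_w, ham_w))
```

Because every document is reduced to two pooled scores, the final classifier has only a slope and a bias to fit, whatever the vocabulary size.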


Thesis Related Publications

Khurum Nazir Junejo, Mirza Muhammad Yousaf, and Asim Karim, "A Two-Pass Statistical Approach for Automatic Personalized Spam Filtering", Discovery Challenge Workshop, 17th European Conference on Machine Learning, 2006, Berlin, Germany.

Khurum Nazir Junejo and Asim Karim, "Automatic Personalized Spam Filtering through Significant Word Modeling", In Proceedings of 19th IEEE International Conference on Tools with Artificial Intelligence (ICTAI), 2007, Greece.

Khurum Nazir Junejo and Asim Karim, "PSSF: A Novel Statistical Approach for Personalized Service-side Spam Filtering", In Proceedings of IEEE/WIC/ACM International Conference on Web Intelligence (WI), 2007, California, USA.

Khurum Nazir Junejo and Asim Karim, "A Robust Discriminative Term Weighting based Linear Discriminant Method for Text Classification", In Proceedings of IEEE International Conference on Data Mining (ICDM), 2008, Italy.

Khurum Nazir Junejo and Asim Karim, "Robust Personalizable Spam Filtering via Local and Global Discrimination Modeling", Knowledge and Information Systems


Other Publications

Malik Tahir Hassan, Khurum Nazir Junejo, and Asim Karim, "Bayesian Inference for Web Surfer Behavior Prediction", Discovery Challenge Workshop, 18th European Conference on Machine Learning, 2007, Warsaw, Poland.

Malik Tahir Hassan, Khurum Nazir Junejo, and Asim Karim, "Learning and Predicting Key Web Navigation Patterns Using Bayesian Models", In Proceedings of Springer LNCS International Conference on Computational Science and Its Applications (ICCSA), 2009, Korea.

Fahad Javed, Malik Tahir Hassan, Khurum Nazir Junejo, Naveed Arshad, and Asim Karim, "Self-Calibration: Enabling Self-management in Autonomous Systems by Preserving Model Fidelity", In Proceedings of International Conference on Engineering of Complex Computer Systems (ICECCS), 2012, Paris, France.

Fahad Javed, Malik Tahir Hassan, Khurum Nazir Junejo, Naveed Arshad, and Asim Karim, "Enabling Self-management in Autonomous Systems by Preserving Model Fidelity", A Special Issue on Emerging Synergies of Artificial Intelligence and Software Engineering, International Journal of Software Engineering and Knowledge Engineering. Submitted.

Imran Junejo, Khurum Nazir Junejo, and Zaher Al Aghbari, "Silhouette-based Human Action Recognition using SAX-Shapes", The Visual Computer. Submitted.

