How To Create A Text Classification System For Spam Filtering




Term Discrimination Based Robust Text Classification with Application to Email Spam Filtering

PhD Thesis
Khurum Nazir Junejo 2004-03-0018
Advisor: Dr. Asim Karim
Department of Computer Science
Syed Babar Ali School of Science and Engineering
Lahore University of Management Sciences

Final Defence Committee Members
Dr. Asim Karim (Advisor)
Dr. Mian Muhammad Awais
Dr. Shahid Masud
Dr. Hamid Abdul Basit
Dr. Haroon Atique Babri (External Examiner)

Abstract

The Internet has touched every part of our lives, including our interactions and communications. Printed books are being replaced by electronic books (e-books), personal and official correspondence has shifted to electronic mail (e-mail), and news is now read online. This is generating huge volumes of unstructured textual data that need to be analyzed, filtered, and organized automatically in order to harness their wealth of information. By 2013, it is projected that the worldwide volume of e-mails will reach 507 billion per day, of which 89% will be spam. In 2008, the cost of spam to businesses in terms of hardware, software, and human resources was around $140 billion. Content-based text classification can automatically organize text documents into predefined thematic categories. However, text classification is challenging in the modern Internet environment. Firstly, text documents are sparsely represented in a very high dimensional feature space (easily in the hundreds of thousands), making learning and generalization difficult. Secondly, due to the high cost of labeling documents, researchers are forced to collect training data from sources different from the target domain, which results in a distribution shift between training and test data. Thirdly, although unlabeled data is easily available, its utilization in practical text classification for improved performance remains a challenge. One important domain for text classification that embodies these challenges is e-mail spam filtering. A typical e-mail service provider (ESP) serves thousands to millions of users, each of whom can have his or her own topics of interest and preferences for spam and non-spam e-mails. Personalized service-side spam filtering provides a solution to this problem; however, for such solutions to be practically usable they must be efficient, scalable, and robust to distribution shifts.
In this thesis, we propose a robust text classification technique that combines local generative models and global discriminative classifiers through the use of discriminative term weighting and linear opinion pooling. Terms in the documents are assigned weights that quantify the discrimination information they provide for one category over the others. These weights, called discriminative term weights (DTW), also serve to partition the terms into two sets. An opinion pooling strategy consolidates the discrimination information of terms in the sets to yield a two-dimensional feature space, in which a discriminant function is learned to categorize the documents. In addition to a supervised technique, we also develop two semi-supervised variants for personalizing the local and global models using unlabeled data. We then generalize our technique into a classifier framework that integrates different feature selection criteria, discriminative term weighting schemes, information pooling strategies, and discriminative classifiers. We provide a theoretical comparison of our proposed framework with existing generative, discriminative, and hybrid classifiers. Our text classification framework is evaluated with five discriminative term weighting strategies, six opinion consolidation techniques, and four discriminative classifiers. We employ nine real-world datasets from different domains in our experimental evaluation, and the results are compared with four benchmark text classification algorithms via accuracy and AUC values. Our framework is also evaluated under varying distribution shift, on gray e-mails, on unseen e-mails, and under varying classifier size. Scalability of our spam filter is also demonstrated for personalized service-side spam filtering. Statistical significance tests confirm that our technique performs significantly better than the compared techniques in both supervised and semi-supervised settings, and in global and personalized spam filtering. In particular, it performs remarkably well when distribution shift is high between training and test data, a phenomenon common in e-mail systems. Additional contributions of this thesis include a systematic analysis of the spam filtering problem and the challenges to effective global and personalized spam filtering at the service side. We formally define key characteristics of e-mail classification such as distribution shift and gray e-mails, and relate them to machine learning problem settings. The concept of term discrimination introduced in this work has also found applications in text clustering, visualization, and feature extraction, and it can be extended for keyword extraction and topic identification from textual documents.
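The pipeline outlined in the abstract (weight terms, partition them into two sets, pool each set into one score, then separate documents in the resulting two-dimensional space) can be illustrated with a minimal sketch. The abstract does not state the weighting formula, the pooling normalization, or how the discriminant is learned, so everything specific here is an assumption made for illustration: the weight is taken to be a smoothed ratio of a term's document probabilities in the two classes, the pool is a plain average over the distinct terms of a document, and the discriminant is a line with fixed `slope` and `bias` rather than learned parameters. All function names are hypothetical.

```python
from collections import Counter

def term_doc_probs(docs, vocab, smoothing=1.0):
    """Smoothed probability that a term occurs in a document of one class."""
    counts = Counter(t for d in docs for t in set(d))
    return {t: (counts[t] + smoothing) / (len(docs) + 2 * smoothing) for t in vocab}

def fit_dtw(spam_docs, ham_docs):
    """Assign each term a weight and partition the terms into two sets.

    Assumed weight: ratio of the term's document probability in the class
    it favours to its probability in the other class (always > 1).
    """
    vocab = {t for d in spam_docs + ham_docs for t in d}
    p_spam = term_doc_probs(spam_docs, vocab)
    p_ham = term_doc_probs(ham_docs, vocab)
    spam_terms, ham_terms = {}, {}
    for t in vocab:
        if p_spam[t] > p_ham[t]:
            spam_terms[t] = p_spam[t] / p_ham[t]
        elif p_ham[t] > p_spam[t]:
            ham_terms[t] = p_ham[t] / p_spam[t]
        # Terms equally likely in both classes carry no discrimination
        # information and are dropped.
    return spam_terms, ham_terms

def pool(doc, weights):
    """Linear opinion pool: average weight over the distinct terms in doc."""
    terms = set(doc)
    if not terms:
        return 0.0
    return sum(weights.get(t, 0.0) for t in terms) / len(terms)

def to_2d(doc, spam_terms, ham_terms):
    """Map a document to (spam score, non-spam score)."""
    return pool(doc, spam_terms), pool(doc, ham_terms)

def classify(doc, spam_terms, ham_terms, slope=1.0, bias=0.0):
    """Linear discriminant in the 2-D space; slope and bias are assumed
    fixed here, whereas a real filter would learn them from training data."""
    s, h = to_2d(doc, spam_terms, ham_terms)
    return "spam" if s - slope * h - bias > 0 else "ham"

# Toy training data, invented purely for illustration.
SPAM = [["free", "money", "now"], ["win", "free", "prize"], ["cheap", "money", "offer"]]
HAM = [["meeting", "at", "noon"], ["project", "report", "attached"],
       ["lunch", "meeting", "tomorrow"]]

spam_terms, ham_terms = fit_dtw(SPAM, HAM)
print(classify(["free", "prize", "money"], spam_terms, ham_terms))
print(classify(["meeting", "report"], spam_terms, ham_terms))
```

However large the vocabulary, each document is reduced to two numbers, so the final classifier has only a slope and a bias to fit. This is one way to read the abstract's point about coping with a very high dimensional, sparse feature space.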

Thesis Related Publications

Khurum Nazir Junejo, Mirza Muhammad Yousaf, and Asim Karim, "A Two-Pass Statistical Approach for Automatic Personalized Spam Filtering", Discovery Challenge Workshop, 17th European Conference on Machine Learning, 2006, Berlin, Germany.

Khurum Nazir Junejo and Asim Karim, "Automatic Personalized Spam Filtering through Significant Word Modeling", In Proceedings of 19th IEEE International Conference on Tools with Artificial Intelligence (ICTAI), 2007, Greece.

Khurum Nazir Junejo and Asim Karim, "PSSF: A Novel Statistical Approach for Personalized Service-side Spam Filtering", In Proceedings of IEEE/WIC/ACM International Conference on Web Intelligence (WI), 2007, California, USA.

Khurum Nazir Junejo and Asim Karim, "A Robust Discriminative Term Weighting based Linear Discriminant Method for Text Classification", In Proceedings of IEEE International Conference on Data Mining (ICDM), 2008, Italy.

Khurum Nazir Junejo and Asim Karim, "Robust Personalizable Spam Filtering via Local and Global Discrimination Modeling", Knowledge and Information Systems, 2012.

Other Publications

Malik Tahir Hassan, Khurum Nazir Junejo, and Asim Karim, "Bayesian Inference for Web Surfer Behavior Prediction", Discovery Challenge Workshop, 18th European Conference on Machine Learning, 2007, Warsaw, Poland.

Malik Tahir Hassan, Khurum Nazir Junejo, and Asim Karim, "Learning and Predicting Key Web Navigation Patterns Using Bayesian Models", In Proceedings of Springer LNCS International Conference on Computational Science and Its Applications (ICCSA), 2009, Korea.

Fahad Javed, Malik Tahir Hassan, Khurum Nazir Junejo, Naveed Arshad, and Asim Karim, "Self-Calibration: Enabling Self-management in Autonomous Systems by Preserving Model Fidelity", In Proceedings of International Conference on Engineering of Complex Computer Systems (ICECCS), 2012, Paris, France.

Fahad Javed, Malik Tahir Hassan, Khurum Nazir Junejo, Naveed Arshad, and Asim Karim, "Enabling Self-management in Autonomous Systems by Preserving Model Fidelity", A Special Issue on Emerging Synergies of Artificial Intelligence and Software Engineering, International Journal of Software Engineering and Knowledge Engineering. Submitted.

Imran Junejo, Khurum Nazir Junejo, and Zaher Al Aghbari, "Silhouette-based Human Action Recognition using SAX-Shapes", The Visual Computer. Submitted.