Text Mining and eforensics: Spam email Filtering




Marie-Francine Moens. Artificial Intelligence Lecture Series III: Data Mining Applications, 26-10-2010, University of Luxembourg. Joint work with (in alphabetical order): Erik Boiy, Jan De Beer and Juan Carlos Gomez.

Overview: 1. Problem definition. 2. Detection and resolution of hidden text salting. 3. Advanced feature extraction techniques and use in email filtering. 4. Future applications and conclusions.

Problem definition: Spam = unsolicited bulk messages sent indiscriminately; email spam is also known as unsolicited bulk email, junk mail, or unsolicited commercial email. Phishing = contacting a user while falsely claiming to be an established legitimate enterprise, in an attempt to scam the user into surrendering private information that will be used for identity theft; it might be done by means of phishing emails that direct a user to a bogus website.

Problem definition: Filtering based on IP addresses is largely insufficient: spammers are known to change domain names and servers frequently. Content-based filters or blockers make fine-grained control possible. Text categorization techniques based on machine learning increasingly replace handcrafted rules that detect keywords or lexicographic characteristics in the email. Current filters are rather accurate, but we want to reach 100% area under the ROC curve.
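As an illustration of the target metric, the area under the ROC curve can be computed as the fraction of (spam, ham) pairs in which the spam message receives the higher score. The sketch below is a minimal pure-Python version; the function name `roc_auc` and the toy scores are assumptions for illustration, not part of the lecture.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores: classifier scores (higher = more likely spam).
    labels: 1 for spam (positive), 0 for ham (negative).
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    # A positive ranked above a negative counts 1, a tie counts 0.5.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical scores: a perfect ranking and an imperfect one.
perfect = roc_auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
imperfect = roc_auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0])
```

A filter that ranks every spam message above every ham message reaches an AUC of 1.0, which is the 100% goal mentioned above.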

Problem definition: Spammers are very inventive: surface and hidden salting; embedded text in graphical content; personalization of the emails based on content extracted from social network sites; ...

[Figure: Phishing]

Problem definition: Spam and phishing messages contain a core (fraudulent) message wrapped in different disguises. How to identify the core (fraudulent) message? The main focus here is on the detection of phishing messages.

Problem definition: Extraction of the core message is related to the extraction of core features. Two strategies: (1) eliminate noisy features, especially the ones that take the form of hidden salting; (2) extract highly discriminative features that robustly distinguish spam from ham, and spam from phishing. Core features are used in a classification model.

Problem definition: Design and implement feature extraction methods to be used in highly accurate filters that classify messages. Feature extraction is to be adaptive. Algorithms are to be integrated in fielded systems for message (email, SMS, ...) filtering, in wired and wireless environments.

2. Detection and resolution of hidden text salting

The detection and resolution of salting: Salting = intentional addition or distortion of content in order to obfuscate or evade automated inspection. Two kinds: surface salting and hidden salting. Any medium (text: ASCII, HTML, ...; images; audio). Any content genre, e.g. emails, Web pages or MMS messages => including phishing messages and Web pages. The distinction between surface salting and hidden salting depends on whether the salting is, respectively, visually perceivable by the user of the content or not.

The detection and resolution of salting: Extraction of salting features => gives an indication that the message is probably fraudulent. Resolution of the salting features => might improve the message classification. Extra difficult when the salting is hidden.

Salting detection methodology: Two steps. Step 1: we tap into the rendering process to detect hidden content (= manifestations of salting). Step 2: we feed the intercepted, visible text into an artificially intelligent cognitive model which returns the text truly perceived by the user. Differences between source and perceived text = additional evidence of salting. This yields an improved content representation for filtering, mining, retrieval, ...

Step 1: Glyphs = positioned shapes of individual characters, with rendering attributes and any concealing shapes. Hidden salting => glyph visibility (which glyphs are seen by the user?). Four conditions: clipping = the glyph is drawn within the physical bounds of the drawing clip, which is a type of 'spatial mask'; concealment = the glyph is not concealed by other glyphs or shapes; font colour = the glyph's fill colour contrasts well with the background colour; glyph size = the glyph's size and shape are sufficiently large. Failure to comply with any condition results in an invisible glyph => indication of hidden salting. Perceived text = the text after elimination of all invisible glyphs.
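The four visibility conditions of Step 1 can be sketched as a simple predicate over glyph records. This is only an illustrative sketch: the `Glyph` fields, the Euclidean colour distance and the thresholds `min_size` and `min_contrast` are assumptions made here, not the values or data structures of the actual system.

```python
import math
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float
    y: float
    width: float
    height: float
    fill: tuple            # (R, G, B), 0-255
    background: tuple      # (R, G, B), 0-255
    concealed: bool = False  # True if covered by another glyph or shape


def colour_distance(a, b):
    # Assumption: plain Euclidean distance in RGB as a crude contrast measure.
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def is_visible(g, clip, min_size=4.0, min_contrast=60.0):
    """Apply the four conditions: clipping, concealment, colour, size."""
    x0, y0, x1, y1 = clip
    inside_clip = (x0 <= g.x and y0 <= g.y
                   and g.x + g.width <= x1 and g.y + g.height <= y1)
    contrasts = colour_distance(g.fill, g.background) >= min_contrast
    large_enough = min(g.width, g.height) >= min_size
    return inside_clip and not g.concealed and contrasts and large_enough


def perceived_text(glyphs, clip):
    # Perceived text = source text minus all invisible glyphs.
    return "".join(g.char for g in glyphs if is_visible(g, clip))


# Hypothetical rendering of "Pay" salted with four hidden glyphs.
CLIP = (0, 0, 200, 50)
BLACK, WHITE = (0, 0, 0), (255, 255, 255)
glyphs = [
    Glyph("P", 0, 10, 8, 12, BLACK, WHITE),
    Glyph("x", 8, 10, 8, 12, WHITE, WHITE),       # invisible font colour
    Glyph("a", 16, 10, 8, 12, BLACK, WHITE),
    Glyph("q", 24, 10, 0.5, 0.5, BLACK, WHITE),   # too small
    Glyph("y", 32, 10, 8, 12, BLACK, WHITE),
    Glyph("k", 40, 10, 8, 12, BLACK, WHITE, concealed=True),
    Glyph("z", 500, 10, 8, 12, BLACK, WHITE),     # outside the clip
]
text = perceived_text(glyphs, CLIP)
hidden_count = sum(not is_visible(g, CLIP) for g in glyphs)
```

A non-zero `hidden_count` would serve as the salting indicator, and `text` as the cleaned input for the classifier.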

Step 2: Segmentation: find a partitioning into segments with a proper and coherent reading order, by top-down processing of the perceived text. Detection of the reading order: the reading order is detected based on language-specific statistics. If reading order <> compositional (glyph) order: extra indication of hidden salting (slice-and-dice trick).

[Figure: Segmentation]

[Figure: Different reading orders considered]

Determining the reading order of the text block: Evidence for the reading order of a text block: (1) measuring the alignment of glyphs both horizontally and vertically; (2) congruence with 3 language models: distribution of word lengths, distribution of character k-grams, and distribution of common words obtained via a dictionary.
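To make the language-model evidence concrete, the sketch below scores candidate reading orders with a character k-gram model only (the word-length and dictionary models are left out). The reference text, the add-one smoothing and the `alphabet_size` of 27 are assumptions chosen for illustration; a real system would estimate the statistics from a large corpus per language.

```python
import math
from collections import Counter


def train_char_kgrams(text, k=2):
    """Count character k-grams in a reference text."""
    counts = Counter(text[i:i + k] for i in range(len(text) - k + 1))
    return counts, sum(counts.values())


def avg_log_likelihood(candidate, model, k=2, alphabet_size=27):
    """Average add-one-smoothed log-probability of the candidate's k-grams."""
    counts, total = model
    grams = [candidate[i:i + k] for i in range(len(candidate) - k + 1)]
    vocab = alphabet_size ** k  # assumption: lowercase letters plus space
    log_sum = sum(math.log((counts[g] + 1) / (total + vocab)) for g in grams)
    return log_sum / max(len(grams), 1)


def best_reading_order(candidates, model, k=2):
    # The candidate most congruent with the language model wins.
    return max(candidates, key=lambda c: avg_log_likelihood(c, model, k))


# Hypothetical reference text standing in for language statistics.
reference = ("please confirm your bank account details and verify your "
             "password to secure your online account")
model = train_char_kgrams(reference)

# Two readings of the same glyphs: left-to-right and right-to-left.
forward = "please verify your account"
candidates = [forward[::-1], forward]
chosen = best_reading_order(candidates, model)
```

If the chosen reading order differs from the order in which the glyphs appear in the source, that mismatch is the extra salting evidence described in Step 2.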

Gathering statistics on hidden text salting: Most common salting trick: glyph order. Phishing mails: preference for invisible font.

Classification of emails: Slight improvement of the classification into spam and ham by resolving the hidden salting.

Classification of emails: We are especially interested in the classification of phishing mails. On a proprietary corpus, the F1 measure of the classification into phish is 85.91% using the covertext, compared to 81.04% using the plaintext. Recall improves for both spam and phishing when using the covertext: from 81.39% to 84.25% for spam and from 70.87% to 79.34% for phishing (confidence of 99.95%, determined by the paired version of Student's t-test).
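For reference, recall and the F1 measure reported above are derived from true positives, false positives and false negatives for the class of interest. The sketch below shows the computation on a few made-up labels; the function name and the toy predictions are hypothetical and do not reproduce the reported figures.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 for one class (e.g. "phish")."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gold labels and predictions for five messages.
y_true = ["phish", "phish", "phish", "ham", "ham"]
y_pred = ["phish", "phish", "ham", "phish", "ham"]
precision, recall, f1 = precision_recall_f1(y_true, y_pred, "phish")
```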

Because of its communicative function, a text, in our view, is defined by what a user perceives, no matter how it is digitally constructed now or in the future. The digital textual source gives us additional information on how the text is constructed and possibly manipulated. This aspect provides a timeless dimension to our research and transcends applications such as email filtering.

Hidden salting detection and resolution beyond email filtering: Web content might contain hidden content to fool content filters, e.g., spoofed phishing websites, or sites with offensive content, defamation, hate speech, child abuse images and content, speech that attacks the legitimacy of government institutions and the preservation of the national identity, obscene content and pornography. Unsolicited popups, spam and advertisements, malware and many more scams might have an interest in hiding content to avoid filtering. When content is disguised and obfuscated, the detection of intellectual property rights (IPR) infringements and of plagiarism is more difficult.

3. Advanced feature extraction techniques and use in email filtering

Improving the classification performance: General idea: there is a core (fraudulent) content which is common to the bad messages despite the different forms the messages take. How to detect this core automatically from multiple messages and so improve the classification performance? In an adversarial setting the disguises and forms change over time to avoid the filters. How can we build filters that are robust and maintain their classification performance over time?

Traditional content filters for email: Supervised learning: a set of emails is manually classified as positive or negative examples of the spam category (e.g., spam versus ham, spam versus phishing). A classifier is trained using the annotated examples, which hopefully can correctly predict the class of unseen emails. The classification model can be of any type, but Bayesian classifiers (often naive Bayes) and support vector machines are quite popular. The emails are usually represented by unigram features (e.g., words of the mails); sometimes n-grams of a larger size are used.
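As a minimal sketch of such a traditional filter, the code below trains a multinomial naive Bayes classifier on unigram counts with add-one smoothing. The four training messages, the whitespace tokenizer and the function names are assumptions for illustration; a real filter would be trained on a large annotated corpus.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs):
    """docs: list of (text, label) pairs with manually assigned labels."""
    word_counts = defaultdict(Counter)  # label -> unigram counts
    class_counts = Counter()            # label -> number of messages
    vocab = set()
    for text, label in docs:
        tokens = text.lower().split()   # assumption: whitespace tokenizer
        word_counts[label].update(tokens)
        class_counts[label] += 1
        vocab.update(tokens)
    return word_counts, class_counts, vocab


def predict_nb(model, text):
    word_counts, class_counts, vocab = model
    n_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        total = sum(word_counts[label].values())
        score = math.log(class_counts[label] / n_docs)  # log prior
        for tok in text.lower().split():
            if tok in vocab:  # words never seen in training are ignored
                score += math.log(
                    (word_counts[label][tok] + 1) / (total + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical annotated examples.
training = [
    ("win money now", "spam"),
    ("cheap pills win prize", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project report attached", "ham"),
]
nb_model = train_nb(training)
label_a = predict_nb(nb_model, "win a prize now")
label_b = predict_nb(nb_model, "monday meeting report")
```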

Dimensionality reduction: Dimensionality reduction has been popular since the early 90s in text processing tasks, e.g., Latent Semantic Analysis (LSA), Probabilistic Latent Semantic Analysis (pLSA) and Latent Dirichlet Allocation (LDrA). The above methods can be used with or without annotated examples. Linear Discriminant Analysis (LDA) uses class information in order to separate the classes well.

Dimensionality reduction: Recently, the computer vision community has successfully proposed several variants of LDA that artificially pull apart the positive and the negative examples of the training set. An example of such an approach is Biased Discriminant Analysis (BDA). It is an eigenvalue-based method: an eigenvalue is a number indicating the weight of a particular pattern or cluster of features expressed by the corresponding eigenvector. The larger the eigenvalue, the more important the pattern is.

Dimensionality reduction: The goal is to represent the emails with few, but highly discriminative, features. Let $\{(x_1, c_1), \ldots, (x_n, c_n)\}$ be a set of email messages with their corresponding classes, where $x_i \in \mathbb{R}^d$ is the $i$th email, represented by a $d$-dimensional row vector, and $c_i \in C$ is the class of $x_i$. We have two classes, $C = \{-1, +1\}$, where $-1$ refers to the negative class $N$ (ham messages) and $+1$ to the positive class $P$ (spam or phishing). The dimensionality reduction learns a $d \times l$ projection matrix $W$, which projects the data as $z_i = x_i W$, where $z_i \in \mathbb{R}^l$ is the projected data with $l \ll d$.

Linear Discriminant Analysis: LDA aims at maximizing the following function:
$$W^* = \arg\max_W \frac{|W^T S_{PN} W|}{|W^T S_P W|}$$
The inter-class scatter matrix $S_{PN}$ is computed as:
$$S_{PN} = p_P (\mu_P - \mu)^T (\mu_P - \mu) + p_N (\mu_N - \mu)^T (\mu_N - \mu)$$
where $p_P$ and $\mu_P$ are respectively the prior and the mean of the examples in the positive class; $p_N$ and $\mu_N$ are respectively the prior and the mean of the examples in the negative class; and $\mu$ is the mean of the entire dataset. The intra-class scatter matrix $S_P$ is computed as:
$$S_P = \sum_{x \in P} (x - \mu_P)^T (x - \mu_P)$$

Biased Discriminant Analysis: BDA aims at maximizing the same function as LDA, but redefines the inter-class scatter matrix $S_{PN}$:
$$S_{PN} = \sum_{y \in N} (y - \mu_P)^T (y - \mu_P)$$

Biased Discriminant Analysis: BDA transforms the feature space so that the positive examples cluster together and each negative instance is pushed away as far as possible from this positive cluster. As a result, the centroids of both the negative and positive examples are moved.

Biased Discriminant Analysis: We then perform an eigenvalue decomposition of $S_P^{-1} S_{PN}$ and construct the $d \times l$ matrix $W$ whose columns are the eigenvectors of $S_P^{-1} S_{PN}$ corresponding to its $l$ largest eigenvalues. The goal of BDA is to transform the training data set $X$ into a new data set $Z$ using the projection matrix $W$, with $Z = XW$, in such a way that the examples inside the new data set are well separated by class. If $q$ is a test example, its projection using BDA is $u = qW$.
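The BDA projection can be sketched in a few lines of NumPy, following the scatter matrices defined for BDA (both centred on the positive mean). This is a hedged sketch, not the authors' implementation: the function name `bda_projection`, the small `ridge` term added to keep the positive scatter matrix invertible, and the two-dimensional toy data are all assumptions made here.

```python
import numpy as np


def bda_projection(X_pos, X_neg, n_components=1, ridge=1e-3):
    """Return a d x l matrix W of top eigenvectors of S_P^{-1} S_PN."""
    mu_p = X_pos.mean(axis=0)
    D_pos = X_pos - mu_p          # positives centred on the positive mean
    D_neg = X_neg - mu_p          # negatives centred on the positive mean
    S_p = D_pos.T @ D_pos         # scatter of positives
    S_pn = D_neg.T @ D_neg        # scatter of negatives around mu_p
    d = X_pos.shape[1]
    # Assumption: ridge regularisation so that S_p can be inverted.
    M = np.linalg.solve(S_p + ridge * np.eye(d), S_pn)
    eigvals, eigvecs = np.linalg.eig(M)
    order = np.argsort(-eigvals.real)[:n_components]
    return eigvecs[:, order].real


# Hypothetical data: positives vary only in feature 2,
# negatives differ from the positive mean mainly in feature 1.
X_pos = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
X_neg = np.array([[4.0, 1.0], [-2.0, 2.0], [5.0, 0.0], [-3.0, 3.0]])

W = bda_projection(X_pos, X_neg)
Z_pos = X_pos @ W                 # Z = XW for the training data
Z_neg = X_neg @ W
q = np.array([[1.0, 1.5]])        # a test example
u = q @ W                         # u = qW
```

In the projected space the positives collapse into a tight cluster while every negative lies at a distance from it, which is the behaviour described above.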

[Gomez et al. submitted]

Experiments: Evaluated on 4 public spam corpora: Ling-Spam (LS); SpamAssassin (SA); TREC 2007 spam corpus (TREC); and a subset of the Phishing Corpus (PC), created by randomly selecting 1,250 phishing messages from the Nazario corpus and 1,250 ham messages from the TREC corpus.

Experiments: Raw features: unigrams weighted by their term frequency and inverse document frequency. Classifier: bagging ensemble classifier using the C4.5 decision tree as single classifier. Baselines: raw features (all terms) and the classical LDA model.
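The raw feature representation can be illustrated with a small tf-idf sketch. The exact weighting formula used in the experiments is not given here, so the variant below (raw term frequency times log(N/df)) is an assumption, as are the three toy messages.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dictionary per document."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(t for toks in tokenized for t in set(toks))
    return [
        {t: count * math.log(n / df[t]) for t, count in Counter(toks).items()}
        for toks in tokenized
    ]


# Hypothetical messages.
docs = ["free offer free", "meeting offer", "offer report"]
vectors = tfidf(docs)
```

A term that occurs in every message ("offer") gets weight 0, while a term concentrated in one message ("free") gets a high weight, which is why tf-idf unigrams are a common starting point before dimensionality reduction.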

[Figure: Training on oldest data and testing on the remainder of the data]

4. Future applications and conclusions

The above content mining techniques and variants can be used in many other applications: opportunities to monitor information, especially Web content ... But there are many novel challenges.

Examples of applications: Protection of citizens against harmful content: Webpages (e.g., protect children, PuppyIR EU FP7); spam and phishing websites (e.g., protect citizens, AntiPhish EU FP6); false information (e.g., protect customers); defamation (e.g., protect companies, individuals). Protection of groups: intelligent surveillance (e.g., video surveillance).

Examples of applications: Protection of European companies: against industrial espionage, unlawful copying. Protection of nations: against terrorist groups. Restoring security at moments of crisis: fusion, filtering and generation of information. Example message (translated from Dutch): "This problem has been solved in the meantime and you can send the mail again. Not all outgoing mails were refused; it concerns 700 mails in total and you will get a message later."

Issues: Recognition of content, but: heterogeneous sources, different languages and media; fraudulent scams cloak content; fraudulent scams change strategies continuously; content can be unreliable (credibility): can you trust it?

Needs: Robust and reliable extractors (text, speech, images, video, ...). Robust and reliable linking technologies (connecting the dots ...), which also includes disambiguation. Adaptable to different languages and media with a minimum of human intervention.

Response: ICT technologies: knowledge methodologies are maturing (ontologies, semantics, machine learning, data/text/graph mining, joint classification, alignment, ...); probabilistic models for reasoning; latent class models for discovering hidden semantics. FP7 European Security Research programme: develop technologies and knowledge to ensure the security of citizens from threats such as terrorism, (organised) crime, natural disasters and industrial accidents.

Conclusions: We presented innovative work with regard to spam and phishing email filtering: detection and resolution of hidden text salting, and extraction, by means of Biased Discriminant Analysis, of highly discriminative features that are robust notwithstanding changes of the messages over time. Content filtering is an important research area with many novel challenges.

Main references:
Moens, M.-F., Boiy, E., De Beer, J. & Gomez, J.-C. (2010). Identifying and Resolving Hidden Text Salting. IEEE Transactions on Information Forensics and Security 5 (3) (in press).
Gomez, J.-C. & Moens, M.-F. (2010). Using Biased Discriminant Analysis for Email Filtering. In Proceedings of the 14th International Conference on Knowledge-Based and Intelligent Information & Engineering Systems (Lecture Notes in Computer Science 6276) (pp. 566-575). Berlin: Springer.

We thank the EU FP6-027600 Antiphish consortium (http://www.antiphishresearch.org/) and in particular Christina Lioma, Gerhard Paass, André Bergholz, Patrick Horkan, Brian Witten, Marc Dacier and Domenico Dato.