Text Mining and eforensics: Spam email Filtering Marie-Francine Moens Artificial Intelligence Lecture Series III: Data Mining Applications 26-10-2010 University of Luxembourg Joint work with (in alphabetical order): Erik Boiy, Jan De Beer and Juan Carlos Gomez
Overview 1 Problem definition 2 Detection and resolution of hidden text salting 3 Advanced feature extraction techniques and use in email filtering 4 Future applications and conclusions
Problem definition Spam = unsolicited bulk messages sent indiscriminately; email spam is also known as unsolicited bulk email, junk mail, or unsolicited commercial email Phishing = sending a message to a user while falsely claiming to be an established legitimate enterprise, in an attempt to scam the user into surrendering private information that will be used for identity theft; this might be done by means of phishing emails that direct a user to a bogus Website
Problem definition Filtering based on IP addresses is largely insufficient: it is known that spammers frequently change domain names and servers Filters or blockers based on content make fine-grained control possible Text categorization techniques based on machine learning increasingly replace handcrafted rules that detect keywords or lexicographic characteristics in the email These filters are rather accurate, but we want to reach 100% area under the ROC curve
Problem definition Spammers are very inventive Surface and hidden salting Embedded text in graphical content Personalization of the emails based on content extracted from social network sites...
Phishing
Problem definition Spam and phishing messages contain a core (fraudulent) message wrapped in different disguises How to identify the core (fraudulent) message? The main focus here is on the detection of phishing messages
Problem definition Extraction of the core message is related to the extraction of core features Two strategies: Eliminate noisy features: especially the ones that take the form of hidden salting Extract highly discriminative features that robustly distinguish the spam from ham, and the spam from phishing Core features are used in a classification model
Problem definition Design and implement feature extraction methods to be used in highly accurate filters that classify messages Feature extraction is to be adaptive Algorithms are to be integrated in the field systems for message (email, SMS, ...) filtering in wired and wireless environments
Overview 1 Problem definition 2 Detection and resolution of hidden text salting 3 Advanced feature extraction techniques and use in email filtering 4 Future applications and conclusions
The detection and resolution of salting Salting = intentional addition or distortion of content in order to obfuscate or evade automated inspection: Surface salting Hidden salting: Any medium (text: ASCII, HTML,...; images; audio) Any content genre, e.g. emails, Web pages or MMS messages => including phishing messages, Web pages Distinction between surface salting and hidden salting depends on whether the salting is respectively visually perceivable by the user of the content or not
The detection and resolution of salting Extraction of salting features => gives an indication that the message is probably fraudulent Resolution of the salting features => might improve the message classification Extra difficult when the salting is hidden
Salting detection methodology Two steps: Step 1: we tap into the rendering process to detect hidden content (= manifestations of salting) Step 2: we feed the intercepted, visible text into an artificially intelligent cognitive model which returns the text truly perceived by the user: Differences between source and perceived text = additional evidence of salting Yields improved content representation for filtering, mining, retrieval, ...
Step 1 Glyphs = positioned shapes of individual characters, with rendering attributes and any concealing shapes Hidden salting => glyph visibility (which glyphs are seen by the user). A glyph is visible when all of the following conditions hold: Clipping = glyph drawn within the physical bounds of the drawing clip, which is a type of 'spatial mask' Concealment = glyph not concealed by other glyphs or shapes Font colour = glyph's fill colour contrasts well with the background colour Glyph size = glyph size and shape are sufficiently large Failure to comply with any condition results in an invisible glyph => indication of hidden salting Perceived text = text after elimination of all invisible glyphs
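The four visibility conditions above can be sketched as a simple filter over glyphs. This is a minimal illustration, not the authors' implementation: the `Box` and `Glyph` types, the luminance-based contrast measure, and the thresholds `min_contrast` and `min_height` are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Box:
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, other: "Box") -> bool:
        # True when `other` lies completely inside this box
        return (self.x0 <= other.x0 and self.y0 <= other.y0
                and self.x1 >= other.x1 and self.y1 >= other.y1)

@dataclass
class Glyph:
    char: str
    box: Box
    fill: tuple        # (r, g, b), each 0-255
    background: tuple  # (r, g, b), each 0-255

def luminance(rgb):
    # Common luma approximation; the contrast measure is an assumption
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b

def is_visible(glyph, clip, occluders, min_contrast=40.0, min_height=4.0):
    inside_clip = clip.contains(glyph.box)                        # clipping
    concealed = any(o.contains(glyph.box) for o in occluders)     # concealment
    contrast_ok = abs(luminance(glyph.fill)
                      - luminance(glyph.background)) >= min_contrast  # font colour
    size_ok = (glyph.box.y1 - glyph.box.y0) >= min_height         # glyph size
    return inside_clip and not concealed and contrast_ok and size_ok

def perceived_text(glyphs, clip, occluders=()):
    # Perceived text = source glyphs minus all invisible glyphs
    return "".join(g.char for g in glyphs if is_visible(g, clip, occluders))

clip = Box(0, 0, 100, 20)
white, black = (255, 255, 255), (0, 0, 0)
glyphs = [
    Glyph("b", Box(0, 0, 8, 12), black, white),
    Glyph("x", Box(8, 0, 16, 12), white, white),    # invisible font colour
    Glyph("u", Box(16, 0, 24, 12), black, white),
    Glyph("q", Box(24, 0, 32, 1), black, white),    # too small to see
    Glyph("y", Box(32, 0, 40, 12), black, white),
    Glyph("z", Box(120, 0, 128, 12), black, white), # outside the clip
]
source_text = "".join(g.char for g in glyphs)
visible_text = perceived_text(glyphs, clip)
```

Any difference between `source_text` and `visible_text` is the kind of evidence of hidden salting that Step 1 is meant to surface.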
Step 2 Segmentation: Find a partitioning into segments with proper and coherent reading order Top-down processing of the perceived text Detection of the reading order: Reading order is detected based on language-specific statistics If the reading order differs from the compositional (glyph) order: extra indication of hidden salting (the slice-and-dice trick)
Segmentation
Different reading orders considered
Determining the reading order of the text block Evidence for the reading order of a text block: Measuring the alignment of glyphs both horizontally and vertically Congruence with 3 language models: Distribution of word lengths Distribution of character k-grams Distribution of common words obtained via a dictionary
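One of the three language models, the character k-gram distribution, can be sketched as follows. This is a toy illustration under stated assumptions: k = 2, add-one smoothing, a one-sentence reference text, and only three candidate orders; the glyph tuples `(char, x, y)` and all function names are hypothetical.

```python
import math
from collections import Counter

def train_bigram_model(text):
    """Return a function scoring a string by its add-one-smoothed
    character bigram log-likelihood under `text`."""
    text = text.lower()
    bigrams = Counter(zip(text, text[1:]))
    firsts = Counter(text[:-1])
    vocab = len(set(text))

    def log_prob(s):
        s = s.lower()
        return sum(math.log((bigrams[(a, b)] + 1) / (firsts[a] + vocab))
                   for a, b in zip(s, s[1:]))
    return log_prob

def candidate_orders(glyphs):
    # glyphs: list of (char, x, y) in compositional (source) order
    return {
        "compositional": "".join(c for c, _, _ in glyphs),
        "left_to_right": "".join(
            c for c, _, _ in sorted(glyphs, key=lambda g: (g[2], g[1]))),
        "right_to_left": "".join(
            c for c, _, _ in sorted(glyphs, key=lambda g: (g[2], -g[1]))),
    }

def best_reading_order(glyphs, log_prob):
    candidates = candidate_orders(glyphs)
    scores = {name: log_prob(text) for name, text in candidates.items()}
    best = max(scores, key=scores.get)
    return best, candidates[best]

# Toy reference text standing in for real language statistics
log_prob = train_bigram_model(
    "please verify your account and password at the bank")

# Glyphs emitted in scrambled source order but positioned to read "account"
glyphs = [("c", 1, 0), ("t", 6, 0), ("a", 0, 0), ("u", 4, 0),
          ("c", 2, 0), ("n", 5, 0), ("o", 3, 0)]
best, text = best_reading_order(glyphs, log_prob)
salting_suspected = best != "compositional"
```

A real system would combine this score with the alignment evidence and the other two language models; the sketch only shows how congruence with one model can separate a plausible reading order from the compositional order.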
Gathering statistics on hidden text salting Most common salting trick: glyph order Phishing mails: preference for invisible font
Classification of emails Slight improvement of the classification into spam and ham by resolving the hidden salting
Classification of emails We are especially interested in the classification of phishing mails: Proprietary corpus: the F1 measure of classification into phish is 85.91% using the covertext, compared to 81.04% using the plaintext Recall of both phishing and spam improves when using the covertext: from 81.39% to 84.25% for spam and from 70.87% to 79.34% for phishing (confidence of 99.95%, determined by the paired version of Student's t-test)
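For reference, recall and the F1 measure reported above are computed from true positives, false positives and false negatives. The sketch below uses made-up counts purely for illustration; they are not taken from the experiments.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from raw counts for one class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts for the phishing class (illustrative only)
precision, recall, f1 = precision_recall_f1(tp=80, fp=10, fn=20)
```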
Because of its communicative function, a text - in our view - is defined by what a user perceives, no matter how it is now or in the future digitally constructed The digital textual source gives us additional information on how the text is constructed and possibly manipulated This aspect provides a timeless dimension to our research and transcends applications such as email filtering
Hidden salting detection and resolution beyond email filtering Web content might contain hidden content to fool content filters: E.g., spoofed phishing websites E.g., sites with offensive content, defamation, hate speech, child abuse images and content, speech that attacks the legitimacy of government institutions and preservation of the national identity, obscene content and pornography Unsolicited popups, spam and advertisements, malware and many more scams might have an interest in hiding content and avoiding filtering When content is disguised and obfuscated, detecting intellectual property rights (IPR) infringements and plagiarism is more difficult
Overview 1 Problem definition 2 Detection and resolution of hidden text salting 3 Advanced feature extraction techniques and use in email filtering 4 Future applications and conclusions
Improving the classification performance General idea: there is a core (fraudulent) content which is common to the bad messages despite the different forms the messages take How to detect this core automatically from multiple messages and so improve the classification performance? In an adversarial setting the disguises and forms change over time to avoid the filters How can we build filters that are robust over time and maintain their classification performance over time?
Traditional content filters for email Supervised learning: a set of emails is manually classified as positive or negative examples of the spam category (e.g., spam versus ham, spam versus phishing) A classifier is trained using the annotated examples, which hopefully can correctly predict the class of unseen emails The classification model can be of any type, but Bayesian classifiers (often naive Bayes) and support vector machines are quite popular The emails are usually represented by unigram features (e.g., words of the mails), sometimes grams of a larger size are used
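A traditional filter of this kind can be sketched as a multinomial naive Bayes classifier over unigram counts. This is a minimal standard-library illustration with add-one smoothing and whitespace tokenization, both of which are assumptions; the four training messages are invented for the example.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Unigram features: lowercase whitespace-separated words (assumption)
    return text.lower().split()

class NaiveBayes:
    def fit(self, texts, labels):
        self.class_docs = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        n_docs = sum(self.class_docs.values())
        best, best_score = None, -math.inf
        for label, n in self.class_docs.items():
            counts = self.word_counts[label]
            total = sum(counts.values())
            score = math.log(n / n_docs)  # class prior
            for w in tokenize(text):
                if w in self.vocab:       # ignore words never seen in training
                    score += math.log(
                        (counts[w] + 1) / (total + len(self.vocab)))
            if score > best_score:
                best, best_score = label, score
        return best

model = NaiveBayes().fit(
    ["win money now", "cheap money offer",
     "meeting agenda attached", "lunch meeting tomorrow"],
    ["spam", "spam", "ham", "ham"],
)
```

With this toy model, `model.predict("money offer now")` leans towards spam and `model.predict("agenda for lunch")` towards ham, because each message shares words with only one class.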
Dimensionality reduction Dimensionality reduction has been popular since the early 90s in text processing tasks, e.g., Latent Semantic Analysis (LSA), Probabilistic Latent Semantic Analysis (pLSA) and Latent Dirichlet Allocation (LDrA) The above methods can be used with or without annotated examples Linear Discriminant Analysis (LDA) uses class information in order to separate the classes well
Dimensionality reduction Recently, the computer vision community has successfully proposed several variants of LDA that artificially pull apart the positive and the negative examples of the training set An example of such an approach is Biased Discriminant Analysis (BDA): Eigenvalue-based method: An eigenvalue is a number indicating the weight of a particular pattern or cluster of features expressed by the corresponding eigenvector The larger the eigenvalue, the more important the pattern is
Dimensionality reduction The goal is to represent the emails with few, but highly discriminative features Let $\{(x_1, c_1), \ldots, (x_n, c_n)\}$ be a set of email messages with their corresponding classes, where $x_i \in \mathbb{R}^d$ is the $i$th email, represented by a $d$-dimensional row vector, and $c_i \in C$ is the class of $x_i$ We have two classes $C = \{-1, +1\}$, where $-1$ refers to the negative class $N$ (ham messages) and $+1$ to the positive class $P$ (spam or phishing) The data dimensionality reduction learns a $d \times l$ projection matrix $W$, which projects each email as $z_i = x_i W$, where $z_i \in \mathbb{R}^l$ is the projected data with $l \ll d$
Linear Discriminant Analysis LDA aims at maximizing the following function:
$$W^* = \arg\max_W \frac{W^T S_{PN} W}{W^T S_P W}$$
The inter-class scatter matrix $S_{PN}$ is computed as:
$$S_{PN} = p_P (\mu_P - \mu)^T (\mu_P - \mu) + p_N (\mu_N - \mu)^T (\mu_N - \mu)$$
where $p_P$ and $\mu_P$ are respectively the prior and the mean of the examples in the positive class; $p_N$ and $\mu_N$ are respectively the prior and the mean of the examples in the negative class; and $\mu$ is the mean of the entire dataset The intra-class scatter matrix $S_P$ is computed as:
$$S_P = \sum_{x \in P} (x - \mu_P)^T (x - \mu_P)$$
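The two scatter matrices can be computed directly from their definitions. The sketch below assumes NumPy, emails stored as rows of `X`, labels in {-1, +1}, and class priors estimated as class frequencies; the function name and the toy data are illustrative.

```python
import numpy as np

def lda_scatter(X, c):
    """Inter-class scatter S_PN and intra-class scatter S_P as defined above.
    X: (n, d) array of row vectors; c: labels in {-1, +1}."""
    X = np.asarray(X, dtype=float)
    c = np.asarray(c)
    P, N = X[c == 1], X[c == -1]
    mu, mu_p, mu_n = X.mean(axis=0), P.mean(axis=0), N.mean(axis=0)
    p_p, p_n = len(P) / len(X), len(N) / len(X)  # priors as class frequencies
    # (v)^T (v) for a row vector v is the d x d outer product
    S_PN = (p_p * np.outer(mu_p - mu, mu_p - mu)
            + p_n * np.outer(mu_n - mu, mu_n - mu))
    S_P = (P - mu_p).T @ (P - mu_p)
    return S_PN, S_P

X = np.array([[2.0, 0.0], [4.0, 0.0], [0.0, 1.0], [0.0, 3.0]])
c = np.array([1, 1, -1, -1])
S_PN, S_P = lda_scatter(X, c)
```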
Biased Discriminant Analysis BDA aims at maximizing the same function as LDA, but redefines the inter-class scatter matrix $S_{PN}$:
$$S_{PN} = \sum_{y \in N} (y - \mu_P)^T (y - \mu_P)$$
Biased Discriminant Analysis BDA transforms the feature space so that : The positive examples cluster together Each negative instance is pushed away as far as possible from this positive cluster As a result the centroids of both the negative and positive examples are moved
Biased Discriminant Analysis We then perform an eigenvalue decomposition of $S_P^{-1} S_{PN}$ and construct the $d \times l$ matrix $W$ whose columns are the eigenvectors of $S_P^{-1} S_{PN}$ corresponding to its largest eigenvalues The goal of BDA is to transform the training data set $X$ into a new data set $Z$ using the projection matrix $W$, with $Z = XW$, in such a way that the examples inside the new data set are well separated by class If $q$ is a test example, its projection using BDA is $u = qW$
[Gomez et al. submitted]
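The BDA steps can be sketched with NumPy as below. This is an illustration under assumptions, not the authors' code: the small ridge term `reg` added to $S_P$ (so that it is invertible when there are few positive examples) is not stated in the slides, and the function name and toy data are hypothetical.

```python
import numpy as np

def bda_fit(X, c, l, reg=1e-3):
    """Return the d x l projection matrix W built from the top-l
    eigenvectors of S_P^{-1} S_PN. X: (n, d) rows; c: labels in {-1, +1}."""
    X = np.asarray(X, dtype=float)
    c = np.asarray(c)
    P, N = X[c == 1], X[c == -1]
    mu_p = P.mean(axis=0)
    S_P = (P - mu_p).T @ (P - mu_p)    # scatter of positives around mu_P
    S_PN = (N - mu_p).T @ (N - mu_p)   # scatter of negatives around mu_P
    # Assumption: ridge regularisation keeps S_P invertible
    S_P_reg = S_P + reg * np.eye(X.shape[1])
    M = np.linalg.solve(S_P_reg, S_PN)  # equals S_P_reg^{-1} S_PN
    eigvals, eigvecs = np.linalg.eig(M)
    order = np.argsort(eigvals.real)[::-1][:l]  # largest eigenvalues first
    return eigvecs[:, order].real

# Toy data: positives are tight along the first axis, negatives far away on it
X = np.array([[0.0, -1.0], [0.0, 1.0], [0.1, 0.0], [-0.1, 0.0],
              [3.0, 0.0], [-3.0, 0.5], [4.0, -0.5]])
c = np.array([1, 1, 1, 1, -1, -1, -1])
W = bda_fit(X, c, l=1)
Z = X @ W                          # projected training set, Z = XW
u = np.array([[3.5, 0.2]]) @ W     # projection of a test example, u = qW
```

In this toy case the learned direction is close to the first axis, so the positives collapse near zero in `Z` while every negative lands far from them.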
Experiments Evaluated on 4 public spam corpora: Ling-Spam (LS) SpamAssassin (SA) TREC 2007 spam corpus (TREC) A subset of Phishing Corpus created by randomly selecting 1,250 phishing messages from the Nazario corpus and 1,250 ham messages from the TREC corpus (PC)
Experiments Raw features: unigrams weighted by their term frequency and inverse document frequency Classifier: bagging ensemble classifier using the C4.5 decision tree as the single (base) classifier Baselines: Raw features (all terms) Classical LDA model
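The raw feature weighting can be sketched as follows. Many TF-IDF variants exist and the slides do not say which one was used; this example assumes raw term counts multiplied by log(N / document frequency), with whitespace tokenization.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document,
    with weight = tf * log(N / df)."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    # Document frequency: number of documents containing each term
    df = Counter(t for toks in tokenized for t in set(toks))
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(toks).items()}
            for toks in tokenized]

vectors = tfidf(["free money free", "meeting money"])
```

A term occurring in every document (here "money") gets weight zero under this variant, which is one reason such terms contribute little to discrimination.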
Training on oldest data and testing on the remainder of the data
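Such a chronological evaluation can be sketched as a split by message date. The `date` field, the `train_fraction` parameter and its default value are assumptions; the slides do not state what proportion of the data was used for training.

```python
from datetime import date

def chronological_split(messages, train_fraction=0.5):
    """Train on the oldest messages, test on the remainder."""
    ordered = sorted(messages, key=lambda m: m["date"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]

messages = [
    {"id": "c", "date": date(2007, 6, 1)},
    {"id": "a", "date": date(2007, 1, 15)},
    {"id": "d", "date": date(2007, 9, 30)},
    {"id": "b", "date": date(2007, 3, 2)},
]
train, test = chronological_split(messages)
```

Splitting by time rather than at random mimics the adversarial setting, where the filter must cope with disguises that appear after it was trained.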
Overview 1 Problem definition 2 Detection and resolution of hidden text salting 3 Advanced feature extraction techniques and use in email filtering 4 Future applications and conclusions
The above content mining techniques and variants can be used in many other applications: Opportunities to monitor information Especially Web content... But there are many novel challenges
Examples of applications Protection of citizens from harmful content: Webpages (e.g., protect children, PuppyIR EU FP7) Spam and phishing Websites (e.g., protect citizens, AntiPhish EU FP6) False information (e.g., protect customers) Defamation (e.g., protect companies, individuals) Protection of groups: Intelligent surveillance (e.g., video surveillance) (images: Wikia.com, www.kansascitypi.com)
Examples of applications Protection of European companies: Against industrial espionage, unlawful copying Protection of nations: Against terrorist groups (image: www.newsweek.com) Restoring security at moments of crisis: fusion, filtering and generation of information "This problem has since been resolved and you can send the mail again. Not all outgoing mails were rejected; in total about 700 mails are affected, and you will receive a notification later." (image: AP / Brynjar Gauti)
Issues Recognition of content: but Heterogeneous sources, different languages, media Fraudulent scams cloak content Fraudulent scams change strategies continuously Content can be unreliable (credibility) Can you trust it?
Needs Robust and reliable extractors (text, speech, images, video...) Robust and reliable linking technologies (connecting the dots...) Includes also disambiguation Adaptable to different languages and media with minimum of human intervention
Response ICT Technologies: Knowledge methodologies maturing: ontologies, semantics, machine learning, data/text/graph mining, joint classification, alignment,... Probabilistic models for reasoning Latent class models for discovering hidden semantics FP7: European Security Research programme: Develop technologies and knowledge to ensure security of citizens from threats such as terrorism, (organised) crime, natural disasters and industrial accidents
Conclusions We presented innovative work with regard to spam and phishing email filtering: Detection and resolution of hidden text salting Extraction of highly discriminative features by means of Biased Discriminant Analysis that are robust notwithstanding changes of the messages over time Content filtering is an important research area with many novel challenges
Main references Moens, M.-F., Boiy, E., De Beer, J. & Gomez, J.-C. (2010). Identifying and Resolving Hidden Text Salting. IEEE Transactions on Information Forensics and Security, 5 (3) (in press). Gomez, J.-C. & Moens, M.-F. (2010). Using Biased Discriminant Analysis for Email Filtering. In Proceedings of the 14th International Conference on Knowledge-Based and Intelligent Information & Engineering Systems (Lecture Notes in Computer Science 6276) (pp. 566-575). Berlin: Springer.
We thank the EU FP6-027600 Antiphish consortium (http://www.antiphishresearch.org/) and in particular Christina Lioma, Gerhard Paass, André Bergholz, Patrick Horkan, Brian Witten, Marc Dacier and Domenico Dato.