Text Mining and eforensics: Spam Filtering

Transcription

1 Text Mining and eforensics: Spam Filtering Marie-Francine Moens Artificial Intelligence Lecture Series III: Data Mining Applications University of Luxembourg Joint work with (in alphabetical order): Erik Boiy, Jan De Beer and Juan Carlos Gomez

2 Overview 1 Problem definition 2 Detection and resolution of hidden text salting 3 Advanced feature extraction techniques and use in filtering 4 Future applications and conclusions

3 Problem definition Spam = unsolicited bulk messages sent indiscriminately; email spam is also known as unsolicited bulk email, junk mail, or unsolicited commercial email Phishing = sending an email to a user while falsely claiming to be an established legitimate enterprise, in an attempt to scam the user into surrendering private information that will be used for identity theft; this might be done by means of phishing emails that direct the user to a bogus website

5 Problem definition Filtering based on IP addresses is largely insufficient: spammers are known to change domain names and servers frequently Filters or blockers based on content make fine-grained control possible Text categorization techniques based on machine learning increasingly replace handcrafted rules that detect keywords or lexicographic characteristics in the emails The resulting filters are rather accurate, but we want to reach 100% area under the ROC curve
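
The target of 100% area under the ROC curve can be made concrete with a small sketch. The helper `roc_auc` below is hypothetical (it is not from the lecture) and the scores are invented toy values; it computes the area as the fraction of (spam, ham) pairs in which the spam message receives the higher filter score, counting ties as half.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve by pairwise comparison.

    scores: filter outputs, higher = more spam-like (toy values here).
    labels: 1 for spam, 0 for ham.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # spam ranked above ham
            elif p == n:
                wins += 0.5      # tie counts as half
    return wins / (len(pos) * len(neg))


# Invented example: one spam message (score 0.3) is ranked below a ham message.
scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0]
auc = roc_auc(scores, labels)
print(f"AUC = {auc:.3f}")
```

An area of 1.0 means every spam message is scored above every ham message, which is the goal stated on the slide.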

6 Problem definition Spammers are very inventive Surface and hidden salting Embedded text in graphical content Personalization of the emails based on content extracted from social network sites...

7 Phishing October 26, 2010

8 Problem definition Spam and phishing messages contain a core (fraudulent) message wrapped in different disguises How to identify the core (fraudulent) message? The main focus here is on the detection of phishing messages

9 Problem definition Extraction of the core message is related to the extraction of core features Two strategies: Eliminate noisy features: especially the ones that take the form of hidden salting Extract highly discriminative features that robustly distinguish the spam from ham, and the spam from phishing Core features are used in a classification model

10 Problem definition Design and implement feature extraction methods to be used in highly accurate filters that classify messages Feature extraction is to be adaptive Algorithms are to be integrated in fielded systems for Message (email, SMS, ...) filtering In wired and wireless environments

12 The detection and resolution of salting Salting = intentional addition or distortion of content in order to obfuscate or evade automated inspection: Surface salting Hidden salting: Any medium (text: ASCII, HTML, ...; images; audio) Any content genre, e.g., emails, Web pages or MMS messages => including phishing messages, Web pages The distinction between surface salting and hidden salting depends on whether or not the salting is visually perceivable by the user of the content

13 The detection and resolution of salting Extraction of salting features => gives an indication that the message is probably fraudulent Resolution of the salting features => might improve the message classification Extra difficult when the salting is hidden

16 Salting detection methodology Two steps: Step 1: we tap into the rendering process to detect hidden content (= manifestations of salting) Step 2: we feed the intercepted, visible text into an artificially intelligent cognitive model which returns the truly perceived text by the user: Differences between source and perceived text = additional evidence of salting Yields improved content representation for filtering, mining, retrieval,...

18 Step 1 Glyphs = positioned shapes of individual characters, with rendering attributes and any concealing shapes Hidden salting => glyph visibility (which glyphs are seen by the user): Clipping = glyph is drawn within the physical bounds of the drawing clip, which is a type of `spatial mask' Concealment = glyph is not concealed by other glyphs or shapes Font colour = glyph's fill colour contrasts well with the background colour Glyph size = glyph size and shape are sufficiently large Failure to comply with any of these conditions results in an invisible glyph => indication of hidden salting Perceived text = the text that remains after elimination of all invisible glyphs
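
The four visibility conditions of Step 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' renderer hook: the `Box` and `Glyph` types, the thresholds `MIN_CONTRAST` and `MIN_HEIGHT`, and the rule that a glyph must lie fully inside the clip (and be fully covered to count as concealed) are all hypothetical simplifications.

```python
from dataclasses import dataclass


@dataclass
class Box:
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, other):
        return (self.x0 <= other.x0 and self.y0 <= other.y0
                and self.x1 >= other.x1 and self.y1 >= other.y1)


@dataclass
class Glyph:
    char: str
    box: Box
    fill: tuple        # (r, g, b), each 0..255
    background: tuple  # (r, g, b) colour behind the glyph
    height: float      # glyph height in points


MIN_CONTRAST = 60.0  # assumed threshold on RGB distance
MIN_HEIGHT = 4.0     # assumed minimum legible height


def colour_distance(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5


def is_visible(glyph, clip, occluders):
    inside_clip = clip.contains(glyph.box)                            # clipping
    unconcealed = not any(o.contains(glyph.box) for o in occluders)   # concealment
    contrasting = colour_distance(glyph.fill, glyph.background) >= MIN_CONTRAST
    large_enough = glyph.height >= MIN_HEIGHT                         # glyph size
    return inside_clip and unconcealed and contrasting and large_enough


def perceived_text(glyphs, clip, occluders=()):
    """Text left after eliminating all invisible glyphs."""
    return "".join(g.char for g in glyphs if is_visible(g, clip, occluders))


WHITE, BLACK = (255, 255, 255), (0, 0, 0)
page = Box(0, 0, 200, 50)


def make_glyph(ch, i, fill=BLACK, height=12.0):
    return Glyph(ch, Box(10 * i, 10, 10 * i + 8, 22), fill, WHITE, height)


# Invented example: "bank" salted with a white-on-white glyph, a tiny glyph
# and a glyph positioned outside the drawing clip.
glyphs = [make_glyph("b", 0), make_glyph("q", 1, fill=WHITE), make_glyph("a", 2),
          make_glyph("z", 3, height=1.0), make_glyph("n", 4), make_glyph("k", 5),
          make_glyph("!", 30)]
print(perceived_text(glyphs, page))
```

The number of glyphs dropped by `perceived_text` can itself serve as a salting feature, in line with the slide's remark that an invisible glyph is an indication of hidden salting.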

19 Step 2 Segmentation: Find a partitioning into segments with a proper and coherent reading order Top-down processing of the perceived text Detection of the reading order: Reading order is detected based on language-specific statistics If the reading order differs from the compositional (glyph) order: extra indication of hidden salting (the slice-and-dice trick)

20 Segmentation

23 Different reading orders considered

24 Determining the reading order of the text block Evidence for the reading order of a text block: Measuring the alignment of glyphs both horizontally and vertically Congruence with 3 language models: Distribution of word lengths Distribution of character k-grams Distribution of common words obtained via a dictionary
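
As an illustration of how a language model can select the reading order, the sketch below scores two candidate orders with only one of the three models listed above, the common-word dictionary; the word-length and character k-gram models would be combined in a similar way. The tiny `COMMON_WORDS` set, the `(column, row, char)` glyph representation and the function names are assumptions made for this example.

```python
# Assumed toy dictionary of common words; a real one would be far larger.
COMMON_WORDS = {"the", "a", "to", "of", "and", "is", "you", "your",
                "now", "please", "click", "here", "account", "verify"}

ORDERS = ("row-major", "column-major")


def read(glyphs, order):
    """Read glyphs (col, row, char) line by line in the given order."""
    primary, secondary = (1, 0) if order == "row-major" else (0, 1)
    lines = {}
    for g in glyphs:
        lines.setdefault(g[primary], []).append(g)
    parts = []
    for key in sorted(lines):
        ordered = sorted(lines[key], key=lambda g: g[secondary])
        parts.append("".join(g[2] for g in ordered))
    return " ".join(parts)


def common_word_score(text):
    """Fraction of tokens that occur in the common-word dictionary."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(t in COMMON_WORDS for t in tokens) / len(tokens)


# Slice-and-dice example: the source lists the glyphs column by column,
# although they are rendered as two ordinary lines of text.
rendered = ["verify your", "account now"]
glyphs = [(c, r, rendered[r][c]) for c in range(11) for r in range(2)]

source_text = "".join(g[2] for g in glyphs)           # compositional order
best = max(ORDERS, key=lambda o: common_word_score(read(glyphs, o)))
perceived = read(glyphs, best)

# Reading order differing from compositional order hints at hidden salting.
salted = perceived.replace(" ", "") != source_text.replace(" ", "")
print(best, "|", perceived, "| salting suspected:", salted)
```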

25 Gathering statistics on hidden text salting Most common salting trick: glyph order Phishing mails: preference for invisible font

26 Classification of emails Slight improvement of the classification into spam and ham by resolving the hidden salting

27 Classification of emails We are especially interested in the classification of phishing mails: Proprietary corpus: F1 measure of classification into phish is 85.91% using the covertext, compared to 81.04% using the plaintext Recall of both phishing and spam improves using the covertext: from 81.39% to 84.25% for spam and from 70.87% to 79.34% for phishing (confidence of 99.95% determined by the paired version of Student's t-test)

28 Because of its communicative function, a text - in our view - is defined by what a user perceives, no matter how it is now or in the future digitally constructed The digital textual source gives us additional information on how the text is constructed and possibly manipulated This aspect provides a timeless dimension to our research and transcends applications such as filtering

29 Hidden salting detection and resolution beyond filtering Web content might contain hidden content to fool content filters: E.g., spoofed phishing websites E.g., sites with offensive content, defamation, hate speech, child abuse images and content, speech that attacks the legitimacy of government institutions and preservation of the national identity, obscene content and pornography Unsolicited popups, spam and advertisements, malware and many more scams might have interest in hiding content and avoid filtering When content is disguised and obfuscated, the detection of intellectual property rights (IPR) infringements and plagiarism detection is more difficult

31 Improving the classification performance General idea: there is a core (fraudulent) content which is common to the bad messages despite the different forms the messages take How to detect this core automatically from multiple messages and so improve the classification performance? In an adversarial setting the disguises and forms change over time to avoid the filters How can we build filters that are robust over time and maintain their classification performance over time?

32 Traditional content filters for email Supervised learning: a set of emails is manually classified as positive or negative examples of the spam category (e.g., spam versus ham, spam versus phishing) A classifier is trained using the annotated examples, which hopefully can correctly predict the class of unseen emails The classification model can be of any type, but Bayesian classifiers (often naive Bayes) and support vector machines are quite popular The emails are usually represented by unigram features (e.g., words of the mails); sometimes n-grams of a larger size are used
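
The traditional setup described on this slide can be sketched with a multinomial naive Bayes classifier over unigram features. This is a generic textbook formulation with add-one smoothing, not the filter evaluated in the lecture; the class name `NaiveBayes` and the four training messages are invented for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()   # unigram (word) features


class NaiveBayes:
    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set().union(*self.word_counts.values())
        return self

    def log_scores(self, text):
        n_docs = sum(self.doc_counts.values())
        scores = {}
        for c in self.classes:
            total = sum(self.word_counts[c].values())
            score = math.log(self.doc_counts[c] / n_docs)        # class prior
            for w in tokenize(text):
                if w in self.vocab:                              # skip unseen words
                    # add-one (Laplace) smoothed word likelihood
                    score += math.log((self.word_counts[c][w] + 1)
                                      / (total + len(self.vocab)))
            scores[c] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Invented, manually labelled toy examples.
train = [("verify your account password now", "phish"),
         ("your account is suspended click here", "phish"),
         ("meeting agenda for monday attached", "ham"),
         ("lunch on monday with the project team", "ham")]
model = NaiveBayes().fit([t for t, _ in train], [y for _, y in train])
print(model.predict("please verify your password"))
print(model.predict("project meeting on monday"))
```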

33 Dimensionality reduction Dimensionality reduction has been popular in text processing tasks since the early 90s, e.g., Latent Semantic Analysis (LSA), Probabilistic Latent Semantic Analysis (pLSA) and Latent Dirichlet Allocation (LDrA) The above methods can be used with or without annotated examples Linear Discriminant Analysis (LDA) uses class information in order to separate the classes well

34 Dimensionality reduction Recently, the computer vision community has successfully proposed several variants of LDA that artificially pull apart the positive and the negative examples of the training set An example of such an approach is Biased Discriminant Analysis (BDA): Eigenvalue-based method: An eigenvalue is a number indicating the weight of a particular pattern or cluster of features expressed by the corresponding eigenvector The larger the eigenvalue, the more important the pattern is

35 Dimensionality reduction The goal is to represent the emails with few, but highly discriminative features Let $\{(x_1, c_1), \ldots, (x_n, c_n)\}$ be a set of messages with their corresponding classes, where $x_i \in \mathbb{R}^d$ is the $i$-th email, represented by a $d$-dimensional row vector, and $c_i \in C$ is the class of $x_i$ We have two classes $C = \{-1, +1\}$, where $-1$ refers to the negative class $N$ (ham messages) and $+1$ to the positive class $P$ (spam or phishing) The data dimensionality reduction learns a $d \times l$ projection matrix $W$, which projects each email to $z_i = x_i W$, where $z_i \in \mathbb{R}^l$ is the projected data with $l \ll d$

36 Linear Discriminant Analysis LDA aims at maximizing the following function: $W^* = \arg\max_W \frac{|W^T S_{PN} W|}{|W^T S_P W|}$ The inter-class scatter matrix $S_{PN}$ is computed as: $S_{PN} = p_P (\mu_P - \mu)^T (\mu_P - \mu) + p_N (\mu_N - \mu)^T (\mu_N - \mu)$ where $p_P$ and $\mu_P$ are respectively the prior and the mean of the examples in the positive class; $p_N$ and $\mu_N$ are respectively the prior and the mean of the examples in the negative class; and $\mu$ is the mean of the entire dataset The intra-class scatter matrix $S_P$ is computed as: $S_P = \sum_{x \in P} (x - \mu_P)^T (x - \mu_P)$

37 Biased Discriminant Analysis BDA aims at maximizing the same function as LDA, but redefines the inter-class scatter matrix $S_{PN}$: $S_{PN} = \sum_{y \in N} (y - \mu_P)^T (y - \mu_P)$

38 Biased Discriminant Analysis BDA transforms the feature space so that : The positive examples cluster together Each negative instance is pushed away as far as possible from this positive cluster As a result the centroids of both the negative and positive examples are moved

39 Biased Discriminant Analysis We then perform an eigenvalue decomposition of $S_P^{-1} S_{PN}$ and construct the $d \times l$ matrix $W$ whose columns are the eigenvectors of $S_P^{-1} S_{PN}$ corresponding to its largest eigenvalues The goal of BDA is to transform the training data set $X$ into a new data set $Z$ using the projection matrix $W$, with $Z = XW$, in such a way that the examples inside the new data set are well separated by class If $q$ is a test example, its projection using BDA is $u = qW$
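
The BDA recipe from the preceding slides can be sketched in plain Python for the simplest case of a single projection direction (l = 1). This is an illustration under assumptions, not the implementation of Gomez and Moens: the leading eigenvector of the product of the inverse positive scatter matrix and the biased inter-class scatter matrix is found by power iteration, a small `ridge` term is added to the positive scatter matrix so that it is invertible, and the two-dimensional toy data are invented.

```python
def mean(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def scatter(rows, centre):
    """Sum over rows of (x - centre)^T (x - centre), a d x d matrix."""
    d = len(centre)
    s = [[0.0] * d for _ in range(d)]
    for x in rows:
        diff = [xi - ci for xi, ci in zip(x, centre)]
        for i in range(d):
            for j in range(d):
                s[i][j] += diff[i] * diff[j]
    return s


def solve(a, b):
    """Solve a y = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def matvec(a, v):
    return [sum(aij * vj for aij, vj in zip(row, v)) for row in a]


def bda_direction(pos, neg, ridge=1e-3, iters=200):
    """Leading eigenvector of S_P^{-1} S_PN via power iteration (l = 1)."""
    mu_p = mean(pos)
    s_p = scatter(pos, mu_p)     # positives around their own mean
    s_pn = scatter(neg, mu_p)    # negatives around the POSITIVE mean (the bias)
    for i in range(len(mu_p)):
        s_p[i][i] += ridge       # assumed regulariser, not from the slides
    w = [1.0] * len(mu_p)
    for _ in range(iters):
        w = solve(s_p, matvec(s_pn, w))          # w <- S_P^{-1} S_PN w
        norm = sum(wi * wi for wi in w) ** 0.5
        w = [wi / norm for wi in w]
    return w


def project(x, w):
    """u = q W for a single column W."""
    return sum(xi * wi for xi, wi in zip(x, w))


# Invented toy data: positives are tight in feature 0, negatives are spread.
pos = [[2.0, 1.0], [2.2, 3.0], [1.8, 5.0], [2.0, 7.0]]
neg = [[0.0, 2.0], [4.0, 6.0], [-0.5, 4.0], [4.5, 3.0]]
w = bda_direction(pos, neg)
centre = project(mean(pos), w)
print("direction:", [round(wi, 3) for wi in w])
print("positive offsets:", [round(project(x, w) - centre, 2) for x in pos])
print("negative offsets:", [round(project(y, w) - centre, 2) for y in neg])
```

On this toy set the learned direction is dominated by feature 0, so the projected positives stay close to their centroid while every negative lands farther away, which is the clustering behaviour the slides describe.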

40 [Gomez et al. submitted]

41 Experiments Evaluated on 4 public spam corpora: Ling-Spam (LS) SpamAssassin (SA) TREC 2007 spam corpus (TREC) A subset of Phishing Corpus created by randomly selecting 1,250 phishing messages from the Nazario corpus and 1,250 ham messages from the TREC corpus (PC)

42 Experiments Raw features: unigrams weighted by their term frequency and inverse document frequency Classifier: bagging ensemble classifier using the C4.5 decision tree as the single (base) classifier Baselines: Raw features (all terms) Classical LDA model
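
The raw feature weighting can be illustrated with a short sketch. The slide does not state which tf-idf variant was used, so the formula below (raw term count times the logarithm of N divided by the document frequency) is an assumption, as are the function name `tfidf` and the three toy messages; the bagged C4.5 classifier is not reproduced here.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Assumed variant: weight = tf * log(N / df), with tf the raw count of the
    term in the document and df the number of documents containing it.
    """
    tokenised = [doc.lower().split() for doc in docs]
    n = len(tokenised)
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors


# Invented toy messages.
docs = ["verify your account", "your meeting agenda", "verify verify password"]
vectors = tfidf(docs)
for doc, vec in zip(docs, vectors):
    print(doc, "->", {t: round(wt, 3) for t, wt in vec.items()})
```

Terms that occur in fewer messages ("account", "password") receive a higher weight than terms shared across messages ("your", "verify").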

45 Training on oldest data and testing on the remainder of the data

47 The above content mining techniques and variants can be used in many other applications: Opportunities to monitor information Especially Web content... But there are many novel challenges

48 Examples of applications Wikia.com Protection of citizens against harmful content: Webpages (e.g., protect children - PuppyIR EU FP7) Spam and phishing Websites (e.g., protect citizens, AntiPhish EU FP6) False information (e.g., protect customers) Defamation (e.g., protect companies, individuals) Protection of groups: Intelligent surveillance (e.g., video surveillance)

49 Examples of applications Protection of European companies: Against industrial espionage, unlawful copying Protection of nations: Against terrorist groups Restoring security at moments of crisis: fusion, filtering and generation of information "This problem has since been resolved and you can send the mail again. Not all outgoing mails were rejected; about 700 mails are affected in total, and you will receive a message later." AP / Brynjar Gauti

50 Issues Recognition of content: but Heterogeneous sources, different languages, media Fraudulent scams cloak content Fraudulent scams change strategies continuously Content can be unreliable (credibility) Can you trust it?

51 Needs Robust and reliable extractors (text, speech, images, video...) Robust and reliable linking technologies (connecting the dots...) Includes also disambiguation Adaptable to different languages and media with minimum of human intervention

52 Response ICT Technologies: Knowledge methodologies maturing: ontologies, semantics, machine learning, data/text/graph mining, joint classification, alignment,... Probabilistic models for reasoning Latent class models for discovering hidden semantics FP7: European Security Research programme: Develop technologies and knowledge to ensure security of citizens from threats such as terrorism, (organised) crime, natural disasters and industrial accidents

53 Conclusions We presented innovative work with regard to spam and phishing filtering: Detection and resolution of hidden text salting Extraction of highly discriminative features by means of Biased Discriminant Analysis that are robust notwithstanding changes of the messages over time Content filtering is an important research area with many novel challenges

54 Main references Moens, M.-F., Boiy, E., De Beer, J. & Gomez, J.-C. (2010). Identifying and Resolving Hidden Text Salting. IEEE Transactions on Information Forensics and Security 5 (3) (in press). Gomez, J.-C. & Moens, M.-F. (2010). Using Biased Discriminant Analysis for Email Filtering. In Proceedings of the 14th International Conference on Knowledge-Based and Intelligent Information & Engineering Systems (Lecture Notes in Computer Science 6276) (pp ). Berlin: Springer.

55 We thank the EU FP6 AntiPhish consortium, and in particular Christina Lioma, Gerhard Paass, André Bergholz, Patrick Horkan, Brian Witten, Marc Dacier and Domenico Dato.