Text Mining and eForensics: Spam Email Filtering

1 Text Mining and eForensics: Spam Email Filtering. Marie-Francine Moens. Artificial Intelligence Lecture Series III: Data Mining Applications, University of Luxembourg. Joint work with (in alphabetical order): Erik Boiy, Jan De Beer and Juan Carlos Gomez.

2 Overview: (1) Problem definition; (2) Detection and resolution of hidden text salting; (3) Advanced feature extraction techniques and use in filtering; (4) Future applications and conclusions.

3 Problem definition. Spam = unsolicited bulk messages sent indiscriminately; email spam is also known as unsolicited bulk email, junk mail, or unsolicited commercial email. Phishing = an email sent to a user that falsely claims to come from an established legitimate enterprise, in an attempt to scam the user into surrendering private information that will be used for identity theft; this might be done by means of phishing emails that direct the user to a bogus website.

5 Problem definition. Filtering based on IP addresses is largely insufficient: spammers are known to change domain names and servers frequently. Content-based filters or blockers make fine-grained control possible. Text categorization techniques based on machine learning increasingly replace handcrafted rules that detect keywords or lexicographic characteristics in the email. Current filters are rather accurate, but we want to reach 100% area under the ROC curve.

6 Problem definition. Spammers are very inventive: surface and hidden salting; text embedded in graphical content; personalization of the emails based on content extracted from social network sites; ...

7 Phishing October 26, 2010

8 Problem definition. Spam and phishing messages contain a core (fraudulent) message wrapped in different disguises. How can the core (fraudulent) message be identified? The main focus here is on the detection of phishing messages.

9 Problem definition. Extraction of the core message is related to the extraction of core features. Two strategies: (1) eliminate noisy features, especially the ones that take the form of hidden salting; (2) extract highly discriminative features that robustly distinguish spam from ham, and spam from phishing. The core features are used in a classification model.

10 Problem definition. Design and implement feature extraction methods to be used in highly accurate filters that classify messages. The feature extraction is to be adaptive. The algorithms are to be integrated in field systems for message (email, SMS, ...) filtering, in wired and wireless environments.

11 Overview: (1) Problem definition; (2) Detection and resolution of hidden text salting; (3) Advanced feature extraction techniques and use in filtering; (4) Future applications and conclusions.

12 The detection and resolution of salting. Salting = intentional addition or distortion of content in order to obfuscate or evade automated inspection. It takes the form of surface salting or hidden salting, in any medium (text: ASCII, HTML, ...; images; audio) and in any content genre, e.g., emails, Web pages or MMS messages, and thus including phishing messages and phishing Web pages. Salting is surface salting when it is visually perceivable by the user of the content, and hidden salting when it is not.

13 The detection and resolution of salting. Extraction of salting features gives an indication that the message is probably fraudulent. Resolution of the salting features might improve the message classification. Both are extra difficult when the salting is hidden.

16 Salting detection methodology. Two steps. Step 1: we tap into the rendering process to detect hidden content (= manifestations of salting). Step 2: we feed the intercepted, visible text into an artificially intelligent cognitive model, which returns the text as truly perceived by the user. Differences between the source text and the perceived text are additional evidence of salting. The result is an improved content representation for filtering, mining, retrieval, ...

18 Step 1. Glyphs = positioned shapes of individual characters, with rendering attributes and any concealing shapes. Hidden salting is detected through glyph visibility (which glyphs are seen by the user), which requires four conditions. Clipping: the glyph is drawn within the physical bounds of the drawing clip, which is a type of 'spatial mask'. Concealment: the glyph is not concealed by other glyphs or shapes. Font colour: the glyph's fill colour contrasts well with the background colour. Glyph size: the glyph's size and shape are sufficiently large. Failure to comply with any condition results in an invisible glyph, which is an indication of hidden salting. Perceived text = the text that remains after elimination of all invisible glyphs.
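The four visibility conditions of Step 1 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Glyph` fields, the luminance-based contrast measure and the two thresholds `MIN_SIZE` and `MIN_CONTRAST` are all assumptions chosen for illustration, and a real system would obtain these values from the rendering process.

```python
from dataclasses import dataclass

MIN_SIZE = 2.0        # hypothetical minimum width/height for a legible glyph
MIN_CONTRAST = 40.0   # hypothetical minimum luminance difference

@dataclass
class Glyph:
    char: str
    x: float
    y: float
    width: float
    height: float
    fill: tuple          # (r, g, b) fill colour, 0-255
    background: tuple    # (r, g, b) colour behind the glyph
    concealed: bool = False  # True if another glyph or shape covers it

def luminance(rgb):
    """Rough perceived brightness of an RGB colour (assumed measure)."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b

def is_visible(g, clip):
    """Apply the four conditions: clipping, concealment, colour, size."""
    cx, cy, cw, ch = clip
    inside_clip = (g.x >= cx and g.y >= cy and
                   g.x + g.width <= cx + cw and g.y + g.height <= cy + ch)
    contrasting = abs(luminance(g.fill) - luminance(g.background)) >= MIN_CONTRAST
    large_enough = g.width >= MIN_SIZE and g.height >= MIN_SIZE
    return inside_clip and not g.concealed and contrasting and large_enough

def perceived_text(glyphs, clip):
    """Keep only the visible glyphs, in their original order."""
    return "".join(g.char for g in glyphs if is_visible(g, clip))

def salting_evidence(glyphs, clip):
    """Invisible glyphs are the indication of hidden salting."""
    return [g.char for g in glyphs if not is_visible(g, clip)]
```

Any glyph returned by `salting_evidence` failed at least one condition; the perceived text is what remains once those glyphs are removed.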

19 Step 2. Segmentation: find a partitioning of the perceived text into segments with a proper and coherent reading order, by top-down processing of the perceived text. Detection of the reading order: the reading order is detected based on language-specific statistics. If the reading order differs from the compositional (glyph) order, this is an extra indication of hidden salting (the 'slice-and-dice' trick).

20 Segmentation

23 Different reading orders considered

24 Determining the reading order of the text block. Evidence for the reading order of a text block: (1) the alignment of glyphs, measured both horizontally and vertically; (2) congruence with 3 language models: the distribution of word lengths, the distribution of character k-grams, and the distribution of common words obtained via a dictionary.
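As an illustration of the character k-gram evidence only, the sketch below scores two candidate reading orders of a small block of text against a bigram model and picks the more language-like one. The slides do not say how the three language models and the alignment measure are combined, so everything here (the function names, add-one smoothing, the tiny reference text, the two candidate orders) is an assumption.

```python
import math
from collections import Counter

def train_kgram_model(text, k=2):
    """Count character k-grams in a reference text of the target language."""
    grams = Counter(text[i:i + k] for i in range(len(text) - k + 1))
    return grams, sum(grams.values())

def avg_log_prob(candidate, model, k=2, vocab_size=27 ** 2):
    """Average add-one-smoothed log probability of the candidate's k-grams."""
    grams, total = model
    n = len(candidate) - k + 1
    if n <= 0:
        return float("-inf")
    score = 0.0
    for i in range(n):
        score += math.log((grams[candidate[i:i + k]] + 1) / (total + vocab_size))
    return score / n

def candidate_orders(rows):
    """Two candidate reading orders for a block given as equal-length rows."""
    return {
        "row-major": "".join(rows),
        "column-major": "".join("".join(col) for col in zip(*rows)),
    }

def best_reading_order(rows, model):
    """Return the name of the candidate most congruent with the k-gram model."""
    cands = candidate_orders(rows)
    return max(cands, key=lambda name: avg_log_prob(cands[name], model))

reference = ("please click here to verify your account "
             "and click here to confirm your password")
model = train_kgram_model(reference)
rows = ["click here", "to verify "]
detected = best_reading_order(rows, model)
```

If the detected order differs from the order in which the glyphs were composed in the source, that mismatch is the extra indication of hidden salting mentioned on the previous slide.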

25 Gathering statistics on hidden text salting. Most common salting trick: glyph order. Phishing mails: preference for invisible font.

26 Classification of emails. Resolving the hidden salting slightly improves the classification into spam and ham.

27 Classification of emails. We are especially interested in the classification of phishing mails. On a proprietary corpus, the F1 measure of the classification into phish is 85.91% using the covertext, compared to 81.04% using the plaintext. Recall of both phishing and spam improves using the covertext: from 81.39% to 84.25% for spam and from 70.87% to 79.34% for phishing (confidence of 99.95%, determined by the paired version of Student's t-test).
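For readers unfamiliar with the measures reported above, the following short sketch shows how precision, recall and F1 are computed from classification counts. The function name and the counts in the usage lines are hypothetical and are not taken from the experiments.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and their harmonic mean (F1) from raw counts.

    tp: phishing messages correctly flagged
    fp: other messages wrongly flagged as phishing
    fn: phishing messages that were missed
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts, for illustration only.
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=20)
```

Recall is the measure that rises when a filter misses fewer phishing messages, which is why it is reported separately from F1 on the slide.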

28 Because of its communicative function, a text is, in our view, defined by what a user perceives, no matter how it is digitally constructed now or in the future. The digital textual source gives us additional information on how the text is constructed and possibly manipulated. This aspect gives our research a timeless dimension and transcends applications such as filtering.

29 Hidden salting detection and resolution beyond filtering. Web content might contain hidden content to fool content filters, e.g., spoofed phishing websites, or sites with offensive content (defamation, hate speech, child abuse images and content, speech that attacks the legitimacy of government institutions and the preservation of the national identity, obscene content and pornography). Unsolicited popups, spam and advertisements, malware and many other scams might have an interest in hiding content to avoid filtering. When content is disguised and obfuscated, detecting intellectual property rights (IPR) infringements and plagiarism is more difficult.

30 Overview: (1) Problem definition; (2) Detection and resolution of hidden text salting; (3) Advanced feature extraction techniques and use in filtering; (4) Future applications and conclusions.

31 Improving the classification performance. General idea: there is a core (fraudulent) content that is common to the bad messages despite the different forms the messages take. How can this core be detected automatically from multiple messages, and so improve the classification performance? In an adversarial setting, the disguises and forms change over time to avoid the filters. How can we build filters that are robust and maintain their classification performance over time?

32 Traditional content filters for email. Supervised learning: a set of emails is manually classified as positive or negative examples of the spam category (e.g., spam versus ham, spam versus phishing). A classifier is trained on the annotated examples, and hopefully predicts the class of unseen emails correctly. The classification model can be of any type, but Bayesian classifiers (often naive Bayes) and support vector machines are quite popular. The emails are usually represented by unigram features (e.g., the words of the mails); sometimes n-grams of a larger size are used.
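A traditional filter of this kind can be sketched as a multinomial naive Bayes classifier over unigram features. This is a generic textbook sketch, not the filter evaluated in the talk: the whitespace tokenizer, the add-one smoothing and the toy training messages are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Unigram features: lower-cased, whitespace-separated words."""
    return text.lower().split()

class NaiveBayes:
    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(tokenize(doc))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, doc):
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, float("-inf")
        for label, n in self.class_counts.items():
            total = sum(self.word_counts[label].values())
            score = math.log(n / n_docs)  # log prior of the class
            for w in tokenize(doc):
                if w in self.vocab:       # ignore words never seen in training
                    count = self.word_counts[label][w]
                    score += math.log((count + 1) / (total + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Toy annotated examples (hypothetical).
train_docs = ["win money now", "cheap pills win prize",
              "meeting agenda for monday", "lunch on monday with the team"]
train_labels = ["spam", "spam", "ham", "ham"]
clf = NaiveBayes().fit(train_docs, train_labels)
```

Calling `clf.predict` on an unseen message returns the class with the highest posterior score under the independence assumption of naive Bayes.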

33 Dimensionality reduction. Dimensionality reduction has been popular in text processing tasks since the early 1990s, e.g., Latent Semantic Analysis (LSA), probabilistic Latent Semantic Analysis (pLSA) and Latent Dirichlet Allocation (LDrA). These methods can be used with or without annotated examples. Linear Discriminant Analysis (LDA) uses class information in order to separate the classes well.

34 Dimensionality reduction. Recently, the computer vision community has successfully proposed several variants of LDA that artificially pull apart the positive and the negative examples of the training set. An example of such an approach is Biased Discriminant Analysis (BDA). It is an eigenvalue-based method: an eigenvalue is a number indicating the weight of a particular pattern or cluster of features expressed by the corresponding eigenvector. The larger the eigenvalue, the more important the pattern.

35 Dimensionality reduction. The goal is to represent the emails with few, but highly discriminative features. Let $\{(x_1, c_1), \ldots, (x_n, c_n)\}$ be a set of messages with their corresponding classes, where $x_i \in \mathbb{R}^d$ is the $i$-th email, represented by a $d$-dimensional row vector, and $c_i \in C$ is the class of $x_i$. We have two classes, $C = \{-1, +1\}$, where $-1$ refers to the negative class $N$ (ham messages) and $+1$ to the positive class $P$ (spam or phishing). The dimensionality reduction learns a $d \times l$ projection matrix $W$, which projects the data as $z_i = x_i W$, where $z_i \in \mathbb{R}^l$ is the projected data with $l \ll d$.

36 Linear Discriminant Analysis. LDA aims at maximizing the following function: $W^* = \arg\max_W \frac{|W^T S_{PN} W|}{|W^T S_P W|}$. The inter-class scatter matrix $S_{PN}$ is computed as $S_{PN} = p_P (\mu_P - \mu)^T (\mu_P - \mu) + p_N (\mu_N - \mu)^T (\mu_N - \mu)$, where $p_P$ and $\mu_P$ are respectively the prior and the mean of the examples in the positive class; $p_N$ and $\mu_N$ are respectively the prior and the mean of the examples in the negative class; and $\mu$ is the mean of the entire dataset. The intra-class scatter matrix $S_P$ is computed as $S_P = \sum_{x \in P} (x - \mu_P)^T (x - \mu_P)$.
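The two scatter matrices and the criterion can be computed directly from these definitions. The sketch below assumes NumPy is available and follows the formulas exactly as stated on the slide (in particular, the intra-class scatter is summed over the positive class only); the function names and the toy data are hypothetical.

```python
import numpy as np

def lda_scatter(X, c):
    """Scatter matrices as defined on the slide.

    X: n x d array of row vectors; c: array of labels in {-1, +1}.
    """
    P, N = X[c == 1], X[c == -1]
    mu, mu_p, mu_n = X.mean(axis=0), P.mean(axis=0), N.mean(axis=0)
    p_p, p_n = len(P) / len(X), len(N) / len(X)   # class priors
    S_PN = (p_p * np.outer(mu_p - mu, mu_p - mu) +
            p_n * np.outer(mu_n - mu, mu_n - mu))  # inter-class scatter
    S_P = (P - mu_p).T @ (P - mu_p)                # intra-class scatter
    return S_PN, S_P

def criterion(W, S_PN, S_P):
    """Ratio of determinants that LDA maximizes over W (d x l)."""
    return np.linalg.det(W.T @ S_PN @ W) / np.linalg.det(W.T @ S_P @ W)

# Toy data: the classes differ along the first feature only.
X = np.array([[2, 0], [2, 1], [3, 0], [3, 1],
              [-2, 0], [-2, 1], [-3, 0], [-3, 1]], dtype=float)
c = np.array([1, 1, 1, 1, -1, -1, -1, -1])
S_PN, S_P = lda_scatter(X, c)
along_first = criterion(np.array([[1.0], [0.0]]), S_PN, S_P)
along_second = criterion(np.array([[0.0], [1.0]]), S_PN, S_P)
```

On this toy data the criterion is large for a projection onto the first feature and zero for a projection onto the second, which matches the intuition that only the first feature separates the classes.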

37 Biased Discriminant Analysis. BDA aims at maximizing the same function as LDA, but redefines the inter-class scatter matrix $S_{PN}$ as $S_{PN} = \sum_{y \in N} (y - \mu_P)^T (y - \mu_P)$.

38 Biased Discriminant Analysis. BDA transforms the feature space so that the positive examples cluster together and each negative instance is pushed away as far as possible from this positive cluster. As a result, the centroids of both the negative and the positive examples are moved.

39 Biased Discriminant Analysis. We then perform an eigenvalue decomposition of $S_P^{-1} S_{PN}$ and construct the $d \times l$ matrix $W$ whose columns are the eigenvectors of $S_P^{-1} S_{PN}$ corresponding to its $l$ largest eigenvalues. The goal of BDA is to transform the training data set $X$ into a new data set $Z$ using the projection matrix $W$, with $Z = XW$, in such a way that the examples in the new data set are well separated by class. If $q$ is a test example, its projection using BDA is $u = qW$.
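The BDA steps described on these slides can be sketched in a few lines, assuming NumPy is available. This is an illustrative reading of the slides rather than the authors' code: the function name `bda_projection`, the small `ridge` term added to keep $S_P$ invertible, and the toy data are assumptions.

```python
import numpy as np

def bda_projection(X, c, l, ridge=1e-6):
    """Return the d x l BDA projection matrix W.

    X: n x d array of row vectors; c: labels in {-1, +1}.
    ridge is an assumed regularizer, not part of the slides.
    """
    P, N = X[c == 1], X[c == -1]
    mu_p = P.mean(axis=0)
    S_P = (P - mu_p).T @ (P - mu_p)    # scatter of positives around their mean
    S_PN = (N - mu_p).T @ (N - mu_p)   # scatter of negatives around the positive mean
    S_P = S_P + ridge * np.eye(X.shape[1])
    # Eigen-decomposition of S_P^{-1} S_PN (solve avoids an explicit inverse).
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_P, S_PN))
    order = np.argsort(eigvals.real)[::-1][:l]  # indices of the l largest eigenvalues
    return eigvecs[:, order].real

# Toy data: positives are tight in feature 0, negatives are far away in feature 0.
X = np.array([[0.0, 0.0], [0.1, 1.0], [-0.1, 2.0], [0.0, 3.0],
              [5.0, 1.0], [-5.0, 2.0], [6.0, 0.0], [-6.0, 3.0]])
c = np.array([1, 1, 1, 1, -1, -1, -1, -1])
W = bda_projection(X, c, l=1)
Z = X @ W                          # projected training set, Z = XW
u = np.array([[4.0, 1.5]]) @ W     # projection of a test example, u = qW
```

In the projected space the positive examples stay close to one another while every negative example lies far from that cluster, on either side of it, which is the asymmetry that distinguishes BDA from LDA.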

40 [Gomez et al. submitted]

41 Experiments. Evaluated on 4 public spam corpora: Ling-Spam (LS); SpamAssassin (SA); the TREC 2007 spam corpus (TREC); and a subset of Phishing Corpus (PC), created by randomly selecting 1,250 phishing messages from the Nazario corpus and 1,250 ham messages from the TREC corpus.

42 Experiments. Raw features: unigrams weighted by their term frequency and inverse document frequency. Classifier: a bagging ensemble classifier that uses the C4.5 decision tree as its single (base) classifier. Baselines: raw features (all terms) and the classical LDA model.
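The raw feature weighting can be sketched as follows. The slides do not state which tf-idf variant was used, so this sketch assumes the plain form tf x log(N / df) with whitespace tokenization; the function name and the toy messages are hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Weight each unigram by term frequency times inverse document frequency.

    Returns one {term: weight} dictionary per document.
    Assumed variant: weight = tf * log(N / df).
    """
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    # Document frequency: number of documents that contain the term.
    df = Counter(term for doc in tokenized for term in set(doc))
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

docs = ["verify your account now",
        "verify your password now now",
        "meeting notes attached"]
vectors = tfidf(docs)
```

Terms that occur in few messages receive a high weight, while a term that occurs in every message would receive weight zero under this variant.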

45 Training on oldest data and testing on the remainder of the data
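This evaluation protocol, which tests robustness over time, amounts to a chronological split of the corpus. A minimal sketch follows; the tuple layout of the messages and the `train_fraction` parameter are assumptions, since the slide does not give the proportion used.

```python
def chronological_split(messages, train_fraction=0.5):
    """Train on the oldest messages and test on the remainder.

    messages: list of (timestamp, text, label) tuples, where timestamps
    sort in time order (e.g. ISO 8601 date strings).
    """
    ordered = sorted(messages, key=lambda m: m[0])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]

# Hypothetical messages, for illustration only.
messages = [("2007-03-01", "c", "ham"), ("2007-01-01", "a", "spam"),
            ("2007-02-01", "b", "ham"), ("2007-04-01", "d", "spam")]
train, test = chronological_split(messages)
```

Unlike a random split, every test message here is newer than every training message, so the filter is evaluated on disguises it could not have seen during training.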

46 Overview: (1) Problem definition; (2) Detection and resolution of hidden text salting; (3) Advanced feature extraction techniques and use in filtering; (4) Future applications and conclusions.

47 The above content mining techniques and their variants can be used in many other applications. There are opportunities to monitor information, especially Web content ... But there are many novel challenges.

48 Examples of applications. Protection of citizens against harmful content: Web pages (e.g., protect children, PuppyIR EU FP7); spam and phishing websites (e.g., protect citizens, AntiPhish EU FP6); false information (e.g., protect customers); defamation (e.g., protect companies, individuals). Protection of groups: intelligent surveillance (e.g., video surveillance). (Image: Wikia.com)

49 Examples of applications. Protection of European companies: against industrial espionage and unlawful copying. Protection of nations: against terrorist groups. Restoring security at moments of crisis: fusion, filtering and generation of information. Message text shown on the slide (translated from Dutch): "This problem has since been resolved and you can send the mail again. Not all outgoing mails were rejected; about 700 mails in total are concerned, and you will receive a message later." (Photo: AP / Brynjar Gauti)

50 Issues. Recognition of content, but: heterogeneous sources, different languages and media; fraudulent scams cloak content; fraudulent scams change strategies continuously; content can be unreliable (credibility): can you trust it?

51 Needs. Robust and reliable extractors (text, speech, images, video, ...). Robust and reliable linking technologies (connecting the dots ...), which also includes disambiguation. Adaptable to different languages and media with a minimum of human intervention.

52 Response. ICT technologies: knowledge methodologies are maturing (ontologies, semantics, machine learning, data/text/graph mining, joint classification, alignment, ...); probabilistic models for reasoning; latent class models for discovering hidden semantics. FP7 European Security Research programme: develop technologies and knowledge to ensure the security of citizens from threats such as terrorism, (organised) crime, natural disasters and industrial accidents.

53 Conclusions. We presented innovative work with regard to spam and phishing filtering: the detection and resolution of hidden text salting; and the extraction, by means of Biased Discriminant Analysis, of highly discriminative features that are robust notwithstanding changes of the messages over time. Content filtering is an important research area with many novel challenges.

54 Main references. Moens, M.-F., Boiy, E., De Beer, J. & Gomez, J.-C. (2010). Identifying and Resolving Hidden Text Salting. IEEE Transactions on Information Forensics and Security 5 (3) (in press). Gomez, J.-C. & Moens, M.-F. (2010). Using Biased Discriminant Analysis for Email Filtering. In Proceedings of the 14th International Conference on Knowledge-Based and Intelligent Information & Engineering Systems (Lecture Notes in Computer Science 6276) (pp ). Berlin: Springer.

55 We thank the EU FP6 AntiPhish consortium, and in particular Christina Lioma, Gerhard Paass, André Bergholz, Patrick Horkan, Brian Witten, Marc Dacier and Domenico Dato.

Using Biased Discriminant Analysis for Email Filtering

Using Biased Discriminant Analysis for Email Filtering Using Biased Discriminant Analysis for Email Filtering Juan Carlos Gomez 1 and Marie-Francine Moens 2 1 ITESM, Eugenio Garza Sada 2501, Monterrey NL 64849, Mexico [email protected] 2

More information

Feature Subset Selection in E-mail Spam Detection

Feature Subset Selection in E-mail Spam Detection Feature Subset Selection in E-mail Spam Detection Amir Rajabi Behjat, Universiti Technology MARA, Malaysia IT Security for the Next Generation Asia Pacific & MEA Cup, Hong Kong 14-16 March, 2012 Feature

More information

T-61.3050 : Email Classification as Spam or Ham using Naive Bayes Classifier. Santosh Tirunagari : 245577

T-61.3050 : Email Classification as Spam or Ham using Naive Bayes Classifier. Santosh Tirunagari : 245577 T-61.3050 : Email Classification as Spam or Ham using Naive Bayes Classifier Santosh Tirunagari : 245577 January 20, 2011 Abstract This term project gives a solution how to classify an email as spam or

More information

01219211 Software Development Training Camp 1 (0-3) Prerequisite : 01204214 Program development skill enhancement camp, at least 48 person-hours.

01219211 Software Development Training Camp 1 (0-3) Prerequisite : 01204214 Program development skill enhancement camp, at least 48 person-hours. (International Program) 01219141 Object-Oriented Modeling and Programming 3 (3-0) Object concepts, object-oriented design and analysis, object-oriented analysis relating to developing conceptual models

More information

A Content based Spam Filtering Using Optical Back Propagation Technique

A Content based Spam Filtering Using Optical Back Propagation Technique A Content based Spam Filtering Using Optical Back Propagation Technique Sarab M. Hameed 1, Noor Alhuda J. Mohammed 2 Department of Computer Science, College of Science, University of Baghdad - Iraq ABSTRACT

More information

Spam detection with data mining method:

Spam detection with data mining method: Spam detection with data mining method: Ensemble learning with multiple SVM based classifiers to optimize generalization ability of email spam classification Keywords: ensemble learning, SVM classifier,

More information

Data Mining - Evaluation of Classifiers

Data Mining - Evaluation of Classifiers Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010

More information

Recurrent Patterns Detection Technology. White Paper

Recurrent Patterns Detection Technology. White Paper SeCure your Network Recurrent Patterns Detection Technology White Paper January, 2007 Powered by RPD Technology Network Based Protection against Email-Borne Threats Spam, Phishing and email-borne Malware

More information

Social Media Mining. Data Mining Essentials

Social Media Mining. Data Mining Essentials Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014 RESEARCH ARTICLE OPEN ACCESS A Survey of Data Mining: Concepts with Applications and its Future Scope Dr. Zubair Khan 1, Ashish Kumar 2, Sunny Kumar 3 M.Tech Research Scholar 2. Department of Computer

More information

Machine Learning Final Project Spam Email Filtering

Machine Learning Final Project Spam Email Filtering Machine Learning Final Project Spam Email Filtering March 2013 Shahar Yifrah Guy Lev Table of Content 1. OVERVIEW... 3 2. DATASET... 3 2.1 SOURCE... 3 2.2 CREATION OF TRAINING AND TEST SETS... 4 2.3 FEATURE

More information

Domain Name Abuse Detection. Liming Wang

Domain Name Abuse Detection. Liming Wang Domain Name Abuse Detection Liming Wang Outline 1 Domain Name Abuse Work Overview 2 Anti-phishing Research Work 3 Chinese Domain Similarity Detection 4 Other Abuse detection ti 5 System Information 2 Why?

More information

Machine Learning for Data Science (CS4786) Lecture 1

Machine Learning for Data Science (CS4786) Lecture 1 Machine Learning for Data Science (CS4786) Lecture 1 Tu-Th 10:10 to 11:25 AM Hollister B14 Instructors : Lillian Lee and Karthik Sridharan ROUGH DETAILS ABOUT THE COURSE Diagnostic assignment 0 is out:

More information

PSG College of Technology, Coimbatore-641 004 Department of Computer & Information Sciences BSc (CT) G1 & G2 Sixth Semester PROJECT DETAILS.

PSG College of Technology, Coimbatore-641 004 Department of Computer & Information Sciences BSc (CT) G1 & G2 Sixth Semester PROJECT DETAILS. PSG College of Technology, Coimbatore-641 004 Department of Computer & Information Sciences BSc (CT) G1 & G2 Sixth Semester PROJECT DETAILS Project Project Title Area of Abstract No Specialization 1. Software

More information

Experiments in Web Page Classification for Semantic Web

Experiments in Web Page Classification for Semantic Web Experiments in Web Page Classification for Semantic Web Asad Satti, Nick Cercone, Vlado Kešelj Faculty of Computer Science, Dalhousie University E-mail: {rashid,nick,vlado}@cs.dal.ca Abstract We address

More information

SURVEY PAPER ON INTELLIGENT SYSTEM FOR TEXT AND IMAGE SPAM FILTERING Amol H. Malge 1, Dr. S. M. Chaware 2

SURVEY PAPER ON INTELLIGENT SYSTEM FOR TEXT AND IMAGE SPAM FILTERING Amol H. Malge 1, Dr. S. M. Chaware 2 International Journal of Computer Engineering and Applications, Volume IX, Issue I, January 15 SURVEY PAPER ON INTELLIGENT SYSTEM FOR TEXT AND IMAGE SPAM FILTERING Amol H. Malge 1, Dr. S. M. Chaware 2

More information

SpamNet Spam Detection Using PCA and Neural Networks

SpamNet Spam Detection Using PCA and Neural Networks SpamNet Spam Detection Using PCA and Neural Networks Abhimanyu Lad B.Tech. (I.T.) 4 th year student Indian Institute of Information Technology, Allahabad Deoghat, Jhalwa, Allahabad, India [email protected]

More information

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM International Journal of Innovative Computing, Information and Control ICIC International c 0 ISSN 34-48 Volume 8, Number 8, August 0 pp. 4 FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT

More information

Three types of messages: A, B, C. Assume A is the oldest type, and C is the most recent type.

Three types of messages: A, B, C. Assume A is the oldest type, and C is the most recent type. Chronological Sampling for Email Filtering Ching-Lung Fu 2, Daniel Silver 1, and James Blustein 2 1 Acadia University, Wolfville, Nova Scotia, Canada 2 Dalhousie University, Halifax, Nova Scotia, Canada

More information

The Data Mining Process

The Data Mining Process Sequence for Determining Necessary Data. Wrong: Catalog everything you have, and decide what data is important. Right: Work backward from the solution, define the problem explicitly, and map out the data

More information

How To Filter Spam Image From A Picture By Color Or Color

How To Filter Spam Image From A Picture By Color Or Color Image Content-Based Email Spam Image Filtering Jianyi Wang and Kazuki Katagishi Abstract With the population of Internet around the world, email has become one of the main methods of communication among

More information

CS Master Level Courses and Areas COURSE DESCRIPTIONS. CSCI 521 Real-Time Systems. CSCI 522 High Performance Computing

CS Master Level Courses and Areas COURSE DESCRIPTIONS. CSCI 521 Real-Time Systems. CSCI 522 High Performance Computing CS Master Level Courses and Areas The graduate courses offered may change over time, in response to new developments in computer science and the interests of faculty and students; the list of graduate

More information

How To Cluster

How To Cluster Data Clustering Dec 2nd, 2013 Kyrylo Bessonov Talk outline Introduction to clustering Types of clustering Supervised Unsupervised Similarity measures Main clustering algorithms k-means Hierarchical Main

More information

How To Create A Text Classification System For Spam Filtering

How To Create A Text Classification System For Spam Filtering Term Discrimination Based Robust Text Classification with Application to Email Spam Filtering PhD Thesis Khurum Nazir Junejo 2004-03-0018 Advisor: Dr. Asim Karim Department of Computer Science Syed Babar

More information

Email Spam Detection A Machine Learning Approach

Email Spam Detection A Machine Learning Approach Email Spam Detection A Machine Learning Approach Ge Song, Lauren Steimle ABSTRACT Machine learning is a branch of artificial intelligence concerned with the creation and study of systems that can learn

More information

IT services for analyses of various data samples

IT services for analyses of various data samples IT services for analyses of various data samples Ján Paralič, František Babič, Martin Sarnovský, Peter Butka, Cecília Havrilová, Miroslava Muchová, Michal Puheim, Martin Mikula, Gabriel Tutoky Technical

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015 RESEARCH ARTICLE OPEN ACCESS Data Mining Technology for Efficient Network Security Management Ankit Naik [1], S.W. Ahmad [2] Student [1], Assistant Professor [2] Department of Computer Science and Engineering

More information

Introduction to Pattern Recognition

Introduction to Pattern Recognition Introduction to Pattern Recognition Selim Aksoy Department of Computer Engineering Bilkent University [email protected] CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)

More information

DATA MINING TECHNIQUES AND APPLICATIONS

DATA MINING TECHNIQUES AND APPLICATIONS DATA MINING TECHNIQUES AND APPLICATIONS Mrs. Bharati M. Ramageri, Lecturer Modern Institute of Information Technology and Research, Department of Computer Application, Yamunanagar, Nigdi Pune, Maharashtra,

More information

A Statistical Text Mining Method for Patent Analysis

A Statistical Text Mining Method for Patent Analysis A Statistical Text Mining Method for Patent Analysis Department of Statistics Cheongju University, [email protected] Abstract Most text data from diverse document databases are unsuitable for analytical

More information

OCT Training & Technology Solutions [email protected] (718) 997-4875

OCT Training & Technology Solutions Training@qc.cuny.edu (718) 997-4875 OCT Training & Technology Solutions [email protected] (718) 997-4875 Understanding Information Security Information Security Information security refers to safeguarding information from misuse and theft,

More information

Acceptable Use Policy

Acceptable Use Policy Acceptable Use Policy Contents 1. Internet Abuse... 2 2. Bulk Commercial E-Mail... 2 3. Unsolicited E-Mail... 3 4. Vulnerability Testing... 3 5. Newsgroup, Chat Forums, Other Networks... 3 6. Offensive

More information

Chapter 6. The stacking ensemble approach

Chapter 6. The stacking ensemble approach 82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described

More information

TIETS34 Seminar: Data Mining on Biometric identification

TIETS34 Seminar: Data Mining on Biometric identification TIETS34 Seminar: Data Mining on Biometric identification Youming Zhang Computer Science, School of Information Sciences, 33014 University of Tampere, Finland [email protected] Course Description Content

More information

Impact of Feature Selection Technique on Email Classification

Impact of Feature Selection Technique on Email Classification Impact of Feature Selection Technique on Email Classification Aakanksha Sharaff, Naresh Kumar Nagwani, and Kunal Swami Abstract Being one of the most powerful and fastest way of communication, the popularity

More information

W. Heath Rushing Adsurgo LLC. Harness the Power of Text Analytics: Unstructured Data Analysis for Healthcare. Session H-1 JTCC: October 23, 2015

W. Heath Rushing Adsurgo LLC. Harness the Power of Text Analytics: Unstructured Data Analysis for Healthcare. Session H-1 JTCC: October 23, 2015 W. Heath Rushing Adsurgo LLC Harness the Power of Text Analytics: Unstructured Data Analysis for Healthcare Session H-1 JTCC: October 23, 2015 Outline Demonstration: Recent article on cnn.com Introduction

More information

Spam Filtering Based On The Analysis Of Text Information Embedded Into Images

Spam Filtering Based On The Analysis Of Text Information Embedded Into Images Journal of Machine Learning Research 7 (2006) 2699-2720 Submitted 3/06; Revised 9/06; Published 12/06 Spam Filtering Based On The Analysis Of Text Information Embedded Into Images Giorgio Fumera Ignazio

More information

Data Warehousing and Data Mining in Business Applications

Data Warehousing and Data Mining in Business Applications 133 Data Warehousing and Data Mining in Business Applications Eesha Goel CSE Deptt. GZS-PTU Campus, Bathinda. Abstract Information technology is now required in all aspect of our lives that helps in business

More information

Email Spam Detection Using Customized SimHash Function

Email Spam Detection Using Customized SimHash Function International Journal of Research Studies in Computer Science and Engineering (IJRSCSE) Volume 1, Issue 8, December 2014, PP 35-40 ISSN 2349-4840 (Print) & ISSN 2349-4859 (Online) www.arcjournals.org Email

More information

Detection. Perspective. Network Anomaly. Bhattacharyya. Jugal. A Machine Learning »C) Dhruba Kumar. Kumar KaKta. CRC Press J Taylor & Francis Croup

Detection. Perspective. Network Anomaly. Bhattacharyya. Jugal. A Machine Learning »C) Dhruba Kumar. Kumar KaKta. CRC Press J Taylor & Francis Croup Network Anomaly Detection A Machine Learning Perspective Dhruba Kumar Bhattacharyya Jugal Kumar KaKta»C) CRC Press J Taylor & Francis Croup Boca Raton London New York CRC Press is an imprint of the Taylor

More information

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches PhD Thesis by Payam Birjandi Director: Prof. Mihai Datcu Problematic

More information

A Hybrid Approach to Detect Zero Day Phishing Websites

A Hybrid Approach to Detect Zero Day Phishing Websites International Journal of Information & Computation Technology. ISSN 0974-2239 Volume 4, Number 17 (2014), pp. 1761-1770 International Research Publications House http://www. irphouse.com A Hybrid Approach

More information
