Text Mining and eforensics: Spam Filtering
1 Text Mining and eforensics: Spam Filtering Marie-Francine Moens Artificial Intelligence Lecture Series III: Data Mining Applications University of Luxembourg Joint work with (in alphabetical order): Erik Boiy, Jan De Beer and Juan Carlos Gomez
2 Overview 1 Problem definition 2 Detection and resolution of hidden text salting 3 Advanced feature extraction techniques and use in filtering 4 Future applications and conclusions
3 Problem definition Spam = unsolicited bulk messages sent indiscriminately; email spam is also known as unsolicited bulk email, junk mail, or unsolicited commercial email. Phishing = email sent to a user, falsely claiming to be from an established legitimate enterprise, in an attempt to scam the user into surrendering private information that will be used for identity theft; this might be done by means of phishing emails that direct a user to a bogus website.
5 Problem definition Filtering based on IP addresses is largely insufficient: spammers are known to change domain names and servers frequently. Filters or blockers based on content make fine-grained control possible. Text categorization techniques based on machine learning increasingly replace handcrafted rules that detect keywords or lexicographic characteristics in the emails. These filters are rather accurate, but we want to reach 100% area under the ROC curve.
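As an illustration of the evaluation target mentioned on this slide, the area under the ROC curve can be computed directly from classifier scores. The sketch below is a minimal example, not the evaluation code used in the lecture; the function name `roc_auc` and the convention that higher scores mean "more spam-like" are assumptions made here.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via the pairwise (Mann-Whitney) formulation.

    scores: classifier scores; higher is assumed to mean "more spam-like".
    labels: 1 for the positive class (spam/phishing), 0 for ham.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Count (positive, negative) pairs that are ranked correctly;
    # ties count as half a correct pair.
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # A filter that ranks every spam message above every ham message
    # reaches the 100% AUC target.
    print(roc_auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))
```

An AUC of 1.0 means that every positive message receives a higher score than every negative one, which is the "100%" goal stated above.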
6 Problem definition Spammers are very inventive: surface and hidden salting; text embedded in graphical content; personalization of the emails based on content extracted from social network sites; ...
7 Phishing October 26, 2010
8 Problem definition Spam and phishing messages contain a core (fraudulent) message wrapped in different disguises. How to identify the core (fraudulent) message? The main focus here is on the detection of phishing messages.
9 Problem definition Extraction of the core message is related to the extraction of core features Two strategies: Eliminate noisy features: especially the ones that take the form of hidden salting Extract highly discriminative features that robustly distinguish the spam from ham, and the spam from phishing Core features are used in a classification model
10 Problem definition Design and implement feature extraction methods to be used in highly accurate filters that classify messages. Feature extraction is to be adaptive. Algorithms are to be integrated in field systems for message (email, SMS, ...) filtering, in wired and wireless environments.
11 Overview 1 Problem definition 2 Detection and resolution of hidden text salting 3 Advanced feature extraction techniques and use in filtering 4 Future applications and conclusions
12 The detection and resolution of salting Salting = intentional addition or distortion of content in order to obfuscate or evade automated inspection. Two kinds: surface salting and hidden salting. Salting can affect any medium (text: ASCII, HTML, ...; images; audio) and any content genre, e.g. emails, Web pages or MMS messages, including phishing messages and Web pages. The distinction between surface salting and hidden salting depends on whether the salting is visually perceivable by the user of the content (surface) or not (hidden).
13 The detection and resolution of salting Extraction of salting features => gives an indication that the message is probably fraudulent. Resolution of the salting features => might improve the message classification. Both are extra difficult when the salting is hidden.
16 Salting detection methodology Two steps. Step 1: we tap into the rendering process to detect hidden content (= manifestations of salting). Step 2: we feed the intercepted, visible text into an artificially intelligent cognitive model, which returns the text truly perceived by the user. Differences between source and perceived text = additional evidence of salting. The result is an improved content representation for filtering, mining, retrieval, ...
18 Step 1 Glyphs = positioned shapes of individual characters, with rendering attributes and any concealing shapes. Hidden salting => glyph visibility (which glyphs are seen by the user?). A glyph is visible when four conditions hold. Clipping: the glyph is drawn within the physical bounds of the drawing clip, which is a type of 'spatial mask'. Concealment: the glyph is not concealed by other glyphs or shapes. Font colour: the glyph's fill colour contrasts well with the background colour. Glyph size: the glyph's size and shape are sufficiently large. Failure to comply with any condition results in an invisible glyph => indication of hidden salting. Perceived text = the text after elimination of all invisible glyphs.
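The four visibility conditions of Step 1 can be sketched as a simple filter over intercepted glyphs. This is a minimal illustration under stated assumptions, not the lecture's implementation: the `Glyph` record, the crude per-channel contrast measure, and the thresholds `min_contrast` and `min_size` are all hypothetical choices made for this example, and concealment is taken as a precomputed flag rather than derived from shape overlap.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """Hypothetical record of one rendered character (names are assumptions)."""
    char: str
    x: float
    y: float
    width: float
    height: float
    fill: tuple        # RGB fill colour, 0-255 per channel
    background: tuple  # RGB colour behind the glyph
    concealed: bool = False  # True if another glyph or shape covers it


def inside_clip(g, clip):
    """Clipping condition: the glyph box lies within the clip rectangle."""
    cx, cy, cw, ch = clip
    return (g.x >= cx and g.y >= cy
            and g.x + g.width <= cx + cw
            and g.y + g.height <= cy + ch)


def contrast(a, b):
    """Crude stand-in for perceptual contrast: largest per-channel difference."""
    return max(abs(p - q) for p, q in zip(a, b))


def is_visible(g, clip, min_contrast=40, min_size=3.0):
    """A glyph is visible only if all four conditions hold."""
    return (inside_clip(g, clip)
            and not g.concealed
            and contrast(g.fill, g.background) >= min_contrast
            and min(g.width, g.height) >= min_size)


def perceived_text(glyphs, clip):
    """Perceived text = text after elimination of all invisible glyphs."""
    return "".join(g.char for g in glyphs if is_visible(g, clip))


def salting_detected(glyphs, clip):
    """Any invisible glyph is taken as an indication of hidden salting."""
    return any(not is_visible(g, clip) for g in glyphs)
```

For instance, a message whose source reads "hxi" but whose "x" is drawn white on white would yield the perceived text "hi" and a positive salting indication.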
19 Step 2 Segmentation: find a partitioning of the perceived text into segments with a proper and coherent reading order, by top-down processing of the perceived text. Detection of the reading order: the reading order is detected based on language-specific statistics. If reading order ≠ compositional (glyph) order: extra indication of hidden salting (the slice-and-dice trick).
20 Segmentation
23 Different reading orders considered
24 Determining the reading order of the text block Evidence for the reading order of a text block: Measuring the alignment of glyphs both horizontally and vertically Congruence with 3 language models: Distribution of word lengths Distribution of character k-grams Distribution of common words obtained via a dictionary
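One of the three language models listed above, the distribution of character k-grams, can be sketched as follows. This is a simplified illustration and not the model from the lecture: the add-one smoothing, the choice k = 3, and the helper names are assumptions, and the other evidence sources (glyph alignment, word lengths, dictionary words) are left out.

```python
import math
from collections import Counter


def train_kgram_model(corpus, k=3):
    """Count character k-grams in a reference corpus of the target language."""
    counts = Counter(corpus[i:i + k] for i in range(len(corpus) - k + 1))
    return counts, sum(counts.values())


def log_likelihood(text, model, k=3):
    """Add-one smoothed log-likelihood of the k-grams of a candidate reading."""
    counts, total = model
    vocab = len(counts) + 1  # one extra slot for unseen k-grams
    score = 0.0
    for i in range(len(text) - k + 1):
        score += math.log((counts[text[i:i + k]] + 1) / (total + vocab))
    return score


def best_reading_order(candidates, model, k=3):
    """Pick the candidate reading whose text looks most like the language.

    candidates maps an order name to the text obtained by reading the same
    glyphs in that order, so all candidate texts have equal length.
    """
    return max(candidates, key=lambda name: log_likelihood(candidates[name], model, k))


if __name__ == "__main__":
    model = train_kgram_model("please verify your account details now " * 20)
    candidates = {
        "compositional": "esaelp yfirev ruoy tnuocca",  # glyph order in the source
        "left_to_right": "please verify your account",  # order seen on screen
    }
    order = best_reading_order(candidates, model)
    # A mismatch with the compositional order is extra evidence of salting.
    print(order, order != "compositional")
```

Because every candidate contains the same glyphs, the log-likelihoods are directly comparable without length normalisation.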
25 Gathering statistics on hidden text salting Most common salting trick: glyph order Phishing mails: preference for invisible font
26 Classification of emails Slight improvement of the classification into spam and ham by resolving the hidden salting
27 Classification of emails We are especially interested in the classification of phishing mails. Proprietary corpus: the F1 measure of classification into phish is 85.91% using the covertext, compared to 81.04% using the plaintext. Recall of both phishing and spam improves using the covertext: from 81.39% to 84.25% for spam and from 70.87% to 79.34% for phishing (confidence of 99.95%, determined by the paired version of Student's t-test).
28 Because of its communicative function, a text - in our view - is defined by what a user perceives, no matter how it is now or in the future digitally constructed The digital textual source gives us additional information on how the text is constructed and possibly manipulated This aspect provides a timeless dimension to our research and transcends applications such as filtering
29 Hidden salting detection and resolution beyond filtering Web content might contain hidden content to fool content filters: E.g., spoofed phishing websites E.g., sites with offensive content, defamation, hate speech, child abuse images and content, speech that attacks the legitimacy of government institutions and preservation of the national identity, obscene content and pornography Unsolicited popups, spam and advertisements, malware and many more scams might have interest in hiding content and avoid filtering When content is disguised and obfuscated, the detection of intellectual property rights (IPR) infringements and plagiarism detection is more difficult
30 Overview 1 Problem definition 2 Detection and resolution of hidden text salting 3 Advanced feature extraction techniques and use in filtering 4 Future applications and conclusions
31 Improving the classification performance General idea: there is a core (fraudulent) content which is common to the bad messages despite the different forms the messages take How to detect this core automatically from multiple messages and so improve the classification performance? In an adversarial setting the disguises and forms change over time to avoid the filters How can we build filters that are robust over time and maintain their classification performance over time?
32 Traditional content filters for email Supervised learning: a set of emails is manually classified as positive or negative examples of the spam category (e.g., spam versus ham, spam versus phishing). A classifier is trained using the annotated examples, so that it can, hopefully, correctly predict the class of unseen emails. The classification model can be of any type, but Bayesian classifiers (often naive Bayes) and support vector machines are quite popular. The emails are usually represented by unigram features (e.g., the words of the mails); sometimes n-grams of a larger size are used.
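The traditional setup described on this slide can be sketched with a small multinomial naive Bayes classifier over unigram features. This is a toy illustration only: the whitespace tokenizer, the add-one smoothing, and the four training messages are assumptions made for the example, not data or code from the lecture.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Unigram features: lower-cased, whitespace-separated words."""
    return text.lower().split()


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_label in self.class_counts.items():
            total = sum(self.word_counts[label].values())
            # log prior + sum of smoothed log word likelihoods
            score = math.log(n_label / n_docs)
            for word in tokenize(text):
                score += math.log(
                    (self.word_counts[label][word] + 1) / (total + len(self.vocab))
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


if __name__ == "__main__":
    train_texts = ["win money now", "cheap money offer",
                   "meeting at noon", "lunch at noon tomorrow"]
    train_labels = ["spam", "spam", "ham", "ham"]
    clf = NaiveBayes().fit(train_texts, train_labels)
    print(clf.predict("money offer now"), clf.predict("meeting tomorrow"))
```

In practice the annotated set would contain thousands of manually labelled emails, and the same interface would apply to spam versus ham or spam versus phishing.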
33 Dimensionality reduction Dimensionality reduction has been popular in text processing tasks since the early 90s, e.g., Latent Semantic Analysis (LSA), Probabilistic Latent Semantic Analysis (pLSA) and Latent Dirichlet Allocation (LDrA). The above methods can be used with or without annotated examples. Linear Discriminant Analysis (LDA) uses class information in order to separate the classes well.
34 Dimensionality reduction Recently, the computer vision community has successfully proposed several variants of LDA that artificially pull apart the positive and the negative examples of the training set. An example of such an approach is Biased Discriminant Analysis (BDA), an eigenvalue-based method: an eigenvalue is a number indicating the weight of a particular pattern or cluster of features expressed by the corresponding eigenvector. The larger the eigenvalue, the more important the pattern.
35 Dimensionality reduction The goal is to represent the emails with few, but highly discriminative features. Let $\{(x_1, c_1), \ldots, (x_n, c_n)\}$ be a set of messages with their corresponding classes, where $x_i \in \mathbb{R}^d$ is the $i$th email, represented by a $d$-dimensional row vector, and $c_i \in C$ is the class of $x_i$. We have two classes, $C = \{-1, +1\}$, where $-1$ refers to the negative class $N$ (ham messages) and $+1$ to the positive class $P$ (spam or phishing). The dimensionality reduction learns a $d \times l$ projection matrix $W$, which projects the data as $z_i = x_i W$, where $z_i \in \mathbb{R}^l$ is the projected data with $l \ll d$.
36 Linear Discriminant Analysis LDA aims at maximizing the following function:
$$W^* = \arg\max_W \frac{W^T S_{PN} W}{W^T S_P W}$$
The inter-class scatter matrix $S_{PN}$ is computed as:
$$S_{PN} = p_P (\mu_P - \mu)^T (\mu_P - \mu) + p_N (\mu_N - \mu)^T (\mu_N - \mu)$$
where $p_P$ and $\mu_P$ are respectively the prior and the mean of the examples in the positive class; $p_N$ and $\mu_N$ are respectively the prior and the mean of the examples in the negative class; and $\mu$ is the mean of the entire dataset. The intra-class scatter matrix $S_P$ is computed as:
$$S_P = \sum_{x \in P} (x - \mu_P)^T (x - \mu_P)$$
37 Biased Discriminant Analysis BDA aims at maximizing the same function as LDA, but redefines the inter-class scatter matrix $S_{PN}$ as:
$$S_{PN} = \sum_{y \in N} (y - \mu_P)^T (y - \mu_P)$$
38 Biased Discriminant Analysis BDA transforms the feature space so that: the positive examples cluster together; each negative instance is pushed away as far as possible from this positive cluster. As a result, the centroids of both the negative and positive examples are moved.
39 Biased Discriminant Analysis We then perform an eigenvalue decomposition of $S_P^{-1} S_{PN}$ and construct the $d \times l$ matrix $W$ whose columns are the eigenvectors of $S_P^{-1} S_{PN}$ corresponding to its $l$ largest eigenvalues. The goal of BDA is to transform the training data set $X$ into a new data set $Z$ using the projection matrix $W$, with $Z = XW$, in such a way that the examples inside the new data set are well separated by class. If $q$ is a test example, its projection using BDA is $u = qW$.
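The BDA steps above (scatter matrices, eigenvalue decomposition, projection) can be sketched with NumPy. This is a minimal sketch under stated assumptions rather than the authors' implementation: the function name `bda_projection`, the small ridge term `reg` added to the intra-class scatter so that it can be inverted, and the toy data are all choices made for this example.

```python
import numpy as np


def bda_projection(X, c, n_components, reg=1e-3):
    """Learn a d x l projection matrix W following the BDA recipe.

    X: n x d matrix with one message per row.
    c: labels, +1 for the positive class P and -1 for the negative class N.
    reg: ridge term (an assumption here) that keeps S_P invertible.
    """
    X = np.asarray(X, dtype=float)
    c = np.asarray(c)
    P, N = X[c == 1], X[c == -1]
    mu_p = P.mean(axis=0)
    # Intra-class scatter of the positives around their own mean.
    S_P = (P - mu_p).T @ (P - mu_p)
    # Biased inter-class scatter: negatives measured from the positive mean.
    S_PN = (N - mu_p).T @ (N - mu_p)
    S_P += reg * np.eye(X.shape[1])
    # Eigen-decomposition of S_P^{-1} S_PN (solve avoids an explicit inverse).
    evals, evecs = np.linalg.eig(np.linalg.solve(S_P, S_PN))
    order = np.argsort(-evals.real)[:n_components]
    return evecs[:, order].real


if __name__ == "__main__":
    # Toy data: positives are tight along the second feature,
    # negatives lie far from the positive mean along that feature.
    X = [[0, 0], [1, 0.1], [2, -0.1], [3, 0],
         [1.5, 5], [1.5, -5], [1, 4], [2, -4]]
    c = [1, 1, 1, 1, -1, -1, -1, -1]
    W = bda_projection(X, c, n_components=1)
    Z = np.asarray(X, dtype=float) @ W           # Z = XW
    u = np.array([[1.5, 0.05]]) @ W              # u = qW for a test example q
    print(W.ravel(), Z.ravel(), u.ravel())
```

On this toy data the learned direction is dominated by the second feature, so the projected positives stay close together while the projected negatives end up far from them.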
40 [Gomez et al. submitted]
41 Experiments Evaluated on 4 public spam corpora: Ling-Spam (LS) SpamAssassin (SA) TREC 2007 spam corpus (TREC) A subset of Phishing Corpus created by randomly selecting 1,250 phishing messages from the Nazario corpus and 1,250 ham messages from the TREC corpus (PC)
42 Experiments Raw features: unigrams weighted by their term frequency and inverse document frequency. Classifier: bagging ensemble classifier using the C4.5 decision tree as the single (base) classifier. Baselines: raw features (all terms); classical LDA model.
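The raw feature weighting named on this slide can be illustrated with a small TF-IDF computation. The slide does not specify the exact variant, so the formula used here (raw term count times the natural log of the inverse document frequency) and the whitespace tokenizer are assumptions for the sake of the example.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dictionary per document.

    weight = term frequency in the document * log(n_docs / document frequency).
    """
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(docs)
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


if __name__ == "__main__":
    docs = ["free money free", "money transfer", "project meeting"]
    for vec in tfidf(docs):
        print(vec)
```

A term that is frequent within one message but rare across the corpus (here "free") receives a high weight, while a term shared by many messages (here "money") is down-weighted.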
45 Training on oldest data and testing on the remainder of the data
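The evaluation protocol of this slide, training on the oldest messages and testing on the rest, can be sketched as a chronological split. The message representation (dictionaries with an ISO-formatted `date` field) and the `train_fraction` parameter are hypothetical; the slide does not state which proportion was used.

```python
def chronological_split(messages, train_fraction=0.5):
    """Train on the oldest messages, test on the remainder.

    messages: list of dicts with an ISO 8601 'date' string (an assumption),
    which sorts correctly as plain text.
    """
    ordered = sorted(messages, key=lambda m: m["date"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


if __name__ == "__main__":
    mails = [
        {"date": "2007-05-01", "text": "c"},
        {"date": "2007-01-15", "text": "a"},
        {"date": "2007-08-30", "text": "d"},
        {"date": "2007-03-02", "text": "b"},
    ]
    train, test = chronological_split(mails)
    print([m["text"] for m in train], [m["text"] for m in test])
```

Unlike a random split, this setup exposes the filter to disguises that appeared only after training, which is the robustness-over-time question raised earlier in the talk.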
46 Overview 1 Problem definition 2 Detection and resolution of hidden text salting 3 Advanced feature extraction techniques and use in filtering 4 Future applications and conclusions
47 The above content mining techniques and variants can be used in many other applications: Opportunities to monitor information Especially Web content... But there are many novel challenges
48 Examples of applications Protection of citizens from harmful content: Web pages (e.g., protect children - PuppyIR EU FP7); spam and phishing websites (e.g., protect citizens - AntiPhish EU FP6); false information (e.g., protect customers); defamation (e.g., protect companies, individuals). Protection of groups: intelligent surveillance (e.g., video surveillance).
49 Examples of applications Protection of European companies: against industrial espionage, unlawful copying. Protection of nations: against terrorist groups. Restoring security at moments of crisis: fusion, filtering and generation of information. Example message (translated from Dutch): "This problem has been solved in the meantime and you can send the mail again. Not all outgoing mails were rejected; about 700 mails in total are affected, and you will receive a message later." (Photo credit: AP / Brynjar Gauti)
50 Issues Recognition of content: but Heterogeneous sources, different languages, media Fraudulent scams cloak content Fraudulent scams change strategies continuously Content can be unreliable (credibility) Can you trust it?
51 Needs Robust and reliable extractors (text, speech, images, video...) Robust and reliable linking technologies (connecting the dots...) Includes also disambiguation Adaptable to different languages and media with minimum of human intervention
52 Response ICT Technologies: Knowledge methodologies maturing: ontologies, semantics, machine learning, data/text/graph mining, joint classification, alignment,... Probabilistic models for reasoning Latent class models for discovering hidden semantics FP7: European Security Research programme: Develop technologies and knowledge to ensure security of citizens from threats such as terrorism, (organised) crime, natural disasters and industrial accidents
53 Conclusions We presented innovative work with regard to spam and phishing filtering: Detection and resolution of hidden text salting Extraction of highly discriminative features by means of Biased Discriminant Analysis that are robust notwithstanding changes of the messages over time Content filtering is an important research area with many novel challenges
54 Main references Moens, M.-F., Boiy, E., De Beer, J. & Gomez, J.-C. (2010). Identifying and Resolving Hidden Text Salting. IEEE Transactions on Information Forensics and Security 5 (3) (in press). Gomez, J.-C. & Moens, M.-F. (2010). Using Biased Discriminant Analysis for Email Filtering. In Proceedings of the 14th International Conference on Knowledge-Based and Intelligent Information & Engineering Systems (Lecture Notes in Computer Science 6276) (pp ). Berlin: Springer.
55 We thank the EU FP Antiphish consortium ( and in particular Christina Lioma, Gerhard Paass, André Bergholz, Patrick Horkan, Brian Witten, Marc Dacier and Domenico Dato.
CS 348: Introduction to Artificial Intelligence Lab 2: Spam Filtering
THE PROBLEM Spam is e-mail that is both unsolicited by the recipient and sent in substantively identical form to many recipients. In 2004, MSNBC reported that spam accounted for 66% of all electronic mail.
Machine Learning in Spam Filtering
Machine Learning in Spam Filtering A Crash Course in ML Konstantin Tretyakov [email protected] Institute of Computer Science, University of Tartu Overview Spam is Evil ML for Spam Filtering: General Idea, Problems.
Novelty Detection in image recognition using IRF Neural Networks properties
Novelty Detection in image recognition using IRF Neural Networks properties Philippe Smagghe, Jean-Luc Buessler, Jean-Philippe Urban Université de Haute-Alsace MIPS 4, rue des Frères Lumière, 68093 Mulhouse,
Ipswitch IMail Server with Integrated Technology
Ipswitch IMail Server with Integrated Technology As spammers grow in their cleverness, their means of inundating your life with spam continues to grow very ingeniously. The majority of spam messages these
Spam Filtering Based on Latent Semantic Indexing
Spam Filtering Based on Latent Semantic Indexing Wilfried N. Gansterer Andreas G. K. Janecek Robert Neumayer Abstract In this paper, a study on the classification performance of a vector space model (VSM)
What is Visualization? Information Visualization An Overview. Information Visualization. Definitions
What is Visualization? Information Visualization An Overview Jonathan I. Maletic, Ph.D. Computer Science Kent State University Visualize/Visualization: To form a mental image or vision of [some
SURVEY OF TEXT CLASSIFICATION ALGORITHMS FOR SPAM FILTERING
I J I T E ISSN: 2229-7367 3(1-2), 2012, pp. 233-237 SURVEY OF TEXT CLASSIFICATION ALGORITHMS FOR SPAM FILTERING K. SARULADHA 1 AND L. SASIREKA 2 1 Assistant Professor, Department of Computer Science and
Increasing the Accuracy of a Spam-Detecting Artificial Immune System
Increasing the Accuracy of a Spam-Detecting Artificial Immune System Terri Oda Carleton University [email protected] Tony White Carleton University [email protected] Abstract- Spam, the electronic
Machine Learning for Cyber Security Intelligence
Machine Learning for Cyber Security Intelligence 27 th FIRST Conference 17 June 2015 Edwin Tump Senior Analyst National Cyber Security Center Introduction whois Edwin Tump 10 yrs at NCSC.NL (GOVCERT.NL)
Text Mining in JMP with R Andrew T. Karl, Senior Management Consultant, Adsurgo LLC Heath Rushing, Principal Consultant and Co-Founder, Adsurgo LLC
Text Mining in JMP with R Andrew T. Karl, Senior Management Consultant, Adsurgo LLC Heath Rushing, Principal Consultant and Co-Founder, Adsurgo LLC 1. Introduction A popular rule of thumb suggests that
CS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning.
Lecture Machine Learning Milos Hauskrecht [email protected] 539 Sennott Square, x5 http://www.cs.pitt.edu/~milos/courses/cs75/ Administration Instructor: Milos Hauskrecht [email protected] 539 Sennott
A SURVEY OF TEXT CLASSIFICATION ALGORITHMS
Chapter 6 A SURVEY OF TEXT CLASSIFICATION ALGORITHMS Charu C. Aggarwal IBM T. J. Watson Research Center Yorktown Heights, NY [email protected] ChengXiang Zhai University of Illinois at Urbana-Champaign
SPATIAL DATA CLASSIFICATION AND DATA MINING
, pp.-40-44. Available online at http://www. bioinfo. in/contents. php?id=42 SPATIAL DATA CLASSIFICATION AND DATA MINING RATHI J.B. * AND PATIL A.D. Department of Computer Science & Engineering, Jawaharlal
Machine Learning. Chapter 18, 21. Some material adopted from notes by Chuck Dyer
Machine Learning Chapter 18, 21 Some material adopted from notes by Chuck Dyer What is learning? Learning denotes changes in a system that... enable a system to do the same task more efficiently the next
Tweaking Naïve Bayes classifier for intelligent spam detection
682 Tweaking Naïve Bayes classifier for intelligent spam detection Ankita Raturi 1 and Sunil Pranit Lal 2 1 University of California, Irvine, CA 92697, USA. [email protected] 2 School of Computing, Information
Component Ordering in Independent Component Analysis Based on Data Power
Component Ordering in Independent Component Analysis Based on Data Power Anne Hendrikse Raymond Veldhuis University of Twente University of Twente Fac. EEMCS, Signals and Systems Group Fac. EEMCS, Signals
Statistical Validation and Data Analytics in ediscovery. Jesse Kornblum
Statistical Validation and Data Analytics in ediscovery Jesse Kornblum Administrivia Silence your mobile Interactive talk Please ask questions 2 Outline Introduction Big Questions What Makes Things Similar?
Strategic Online Advertising: Modeling Internet User Behavior with
2 Strategic Online Advertising: Modeling Internet User Behavior with Patrick Johnston, Nicholas Kristoff, Heather McGinness, Phuong Vu, Nathaniel Wong, Jason Wright with William T. Scherer and Matthew
DATA PREPARATION FOR DATA MINING
Applied Artificial Intelligence, 17:375 381, 2003 Copyright # 2003 Taylor & Francis 0883-9514/03 $12.00 +.00 DOI: 10.1080/08839510390219264 u DATA PREPARATION FOR DATA MINING SHICHAO ZHANG and CHENGQI
Distance Metric Learning in Data Mining (Part I) Fei Wang and Jimeng Sun IBM TJ Watson Research Center
Distance Metric Learning in Data Mining (Part I) Fei Wang and Jimeng Sun IBM TJ Watson Research Center 1 Outline Part I - Applications Motivation and Introduction Patient similarity application Part II
Data Mining Yelp Data - Predicting rating stars from review text
Data Mining Yelp Data - Predicting rating stars from review text Rakesh Chada Stony Brook University [email protected] Chetan Naik Stony Brook University [email protected] ABSTRACT The majority
The Scientific Data Mining Process
Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In
How to Use the Greymail Spam Filter
How to Use the Greymail Spam Filter This guide will show you the basics of how to view messages flagged as spam, and how to recover them, if improperly flagged. For a full overview of the New Greymail
1. Classification problems
Neural and Evolutionary Computing. Lab 1: Classification problems Machine Learning test data repository Weka data mining platform Introduction Scilab 1. Classification problems The main aim of a classification
