CS 348: Introduction to Artificial Intelligence Lab 2: Spam Filtering

Size: px
Start display at page:

Download "CS 348: Introduction to Artificial Intelligence Lab 2: Spam Filtering"

Transcription

1 THE PROBLEM Spam is that is both unsolicited by the recipient and sent in substantively identical form to many recipients. In 2004, MSNBC reported that spam accounted for 66% of all electronic mail. By December 2006, Infoweek reported that 94% of all electronic mail is now spam. Currently, spam is dealt with on the receiving end by automated spam filter programs that attempt to classify each message as either spam (undesired mass ) or ham (mail the user would like to receive). A Naïve Bayes Spam Filter uses a Naive Bayes classifier to identify spam . Many modern mail programs implement Bayesian spam filtering. Examples include SpamAssassin, SpamBayes, and Bogofilter. In this lab, you will create a spam filter based on Naïve Bayes classification. NAÏVE BAYES CLASSIFIER A Naïve Bayesian classifier takes objects described by a feature vector and classifies each object based on the features. Let V be the set of possible classifications. In this case, that will be spam or ham. V { spam, ham} = (1) Let a 1 through a n be Boolean values. We will represent each mail document by these n Boolean values. Here, a i will be true if the ith word in our dictionary is present in the document, and a i will otherwise be false. a Dictionary words (2) i If we have a dictionary consisting of the following set of words: { feature, ham, impossible, zebra }, then the first paragraph of this section would be characterized by a vector of four Boolean values (one for each word). paragraph1: a = T a = T a = F a = F more compactly: paragraph1 = { T, T, F, F} (3) A Bayesian classifier would then find the classification for the spam that produces the maximal probability for the following equation. v = arg max P( v a a... a ) (4) MAP j 1 2 n Now in order to create our probabilities, it is going to be more convenient to flip these probabilities around, so that we can talk about the likelihood of each attribute, given that something is spam. We can do this by applying Bayes rule. v MAP Pa ( 1 a2... an vj) Pv ( j) = arg max v V Pa ( a... a) j 1 2 n (5) Since we are comparing values over the set V in this equation, we can remove the divisor, since it will be the same, no matter what v j is set to. v = arg max P( a a... a v ) P( v ) (6) MAP 1 2 n j j

2 A Naïve Bayes classifier is naïve because it assumes the likelihood of each feature being present is completely independent of whether any other feature is present. v = arg max P( a v ) P( a v )... P( a v ) P( v ) NB 1 j 2 j n j j = arg max Pv ( j) Pa ( i vj) i (7) Now, given this mathematical framework, building a spam classifier is fairly straightforward. Each attribute a i is a word drawn from a dictionary. To classify a document, generate a feature vector for that document that specifies which words from the dictionary are present and which are absent. Then, for each attribute (such as the word halo ), look up the probability of that attribute being present (or absent), given a particular categorization ( spam or ham ) and multiply them together. Don t forget, of course, to determine the base probability of a particular category (how likely is it for any given document to be spam?). Voila, the categorization that produces the highest probability is the category selected for the document. A TRAINING CORPUS In order to train your system, we suggest you use the spam corpus used by the people who built the Spam Assassin program. To get this spam corpus, go to the following link: And download the following files: _spam.tar.bz _easy_ham.tar.bz2 You can unzip these files using the standard unix utility tar (with the options xjf) or from windows using WinRAR (which can be found here: On this page, use the very first link on the list. Once you have unzipped the files you should have two folders, one with spam s and one with "ham" a.k.a. non-spam s. THE DICTIONARY BUILDER (four points) You will need to write a program that builds a dictionary of keywords from a set of spam and ham files, along with the associated probability of occurrence of each word. Your program must be written in C, C++, Java, MATLAB or Lisp. Specific OSes used and calling procedures are described in the section labeled The Testing Environment. For the remainder of this section, we will assume the calling procedure for a Windows executable. The dictionary builder should be callable from the command line and accept two parameters: spam_directory and ham_directory, as shown here. builddictionary.exe [spam_directory] [ham_directory] When run, your program should compile a dictionary of key words. For each key word, it should determine the likelihood of observing that word in a spam document and the likelihood of observing that word in a ham document. Once done, it should save the results to a file named dictionary_filename. The format for this dictionary file is described below.

3 Here, the parameter spam_directory is an absolute path to a directory containing ONLY spam files. Each file in this directory will be an ASCII text file that is spam. The parameter ham_directory is an absolute path to a directory containing ONLY ham files. Each file in this directory will be an ASCII file containing a ham document. The dictionary file should consist of three space-delineated elements per line. The first element is a keyword. The second element is the probability of observing that word at least once in a spam document. The third element is the probability of observing the word in a ham document. Lines in the dictionary MUST BE IN ALPHABETICAL ORDER. Assume A-Z is equivalent to a-z (i.e. capitalization does not matter) and that the digits 1-9 precede A-Z in alphabetical order. Here is an example dictionary format [word] [P(word spam] [P(word ham)] Here is an example dictionary of four keywords. Bob break Cl3an Supercalafragalisticexpealidocious THE SPAM CLASSIFIER (four points) You will need to write a program that classifies a directory of ASCII text documents as either spam (undesired junk mail) or ham (desired document) Your program must be written in C, C++, Java, MATLAB or Lisp. Specific OSes used and calling procedures are described in the section labeled The Testing Environment. For the remainder of this section, we will assume the calling procedure for a Windows executable. The classifier should be callable from the command line and accept three parameters, as shown here. spamsort.exe [mail_directory] [spam_directory] [ham_directory] [spam_probability] Here, mail_directory specifies the path to a directory where every file is an ASCII text file to be classified as spam or not. The parameter dictionary_filename specifies the full path to the dictionary file in the same format as the output from the builddictionary program. The parameter spam_probability specifies the base probability that any given document is spam. Your program must classify each document in the mail directory as either spam or ham and move it to spam_directory if it is spam and move it to the ham_directory if it is ham (not spam). When done, the mail_directory should be empty, since all the files will have been moved to either spam_directory or ham_directory. THE TESTING ENVIRONMENT The only acceptable languages for your lab are C, C++, Java, Lisp and Matlab. Assignments in other languages will not be graded. The environment in which your code will be tested is shown below. Code that does not run in the test environment will be graded as non-functional. So, don t hand in code for C++ code compiled for Mac OS X and expect full credit. Specifics for each language follow. If your program is written in C or C++: Your executables must run in Windows XP and must be callable from the command line exactly as shown below. builddictionary.exe [spam_directory] [ham_directory]

4 spamsort.exe [mail_directory] [spam_directory] [ham_directory] [spam_probability] If your program is written in Java: You must create executable.jar files so that we may call the file using the following syntax. java jar builddictionary.jar [spam_directory] [ham_directory] java jar spamsort.jar [mail_directory] [spam_directory] [ham_directory] [spam_probability] Each.jar file must run under J2SE 5.0. If you do not know how to create a.jar file, there is an excellent tutorial available at the following URL. If your program is written in Lisp: The dictionary builder code should be written in a file called builddictionary.lisp. Within this file, you should include a function called builddictionary which takes the parameters as mentioned above for the other languages ([spam_directory] [ham_directory]). The classifier must be in a file called spamsort.lisp and it must include a function called spamsort that takes the parameters [mail_directory] [spam_directory] [ham_directory] [spam_probability]. The parameters should all be treated as strings. Your code will be tested using Allegro CL, which will be configured to conform to the ANSI Common Lisp standard, so you should make sure that it runs in that environment. If your program is written in MATLAB: Your code should be written in a file called builddictionary.m. Within this file, you should include a function called "builddictionary" which can be called from the MATLAB command line as follows: builddictionary([spam_directory], [ham_directory],) spamsort([mail_directory], [spam_directory], [ham_directory], [spam_probability], ) Your code will be tested using MATLAB (R2006b) for Windows XP SP 2, so you should make sure that it runs in this environment. The parameters should all be treated as strings. QUESTION TO ANSWER (two points total) You will need to test the performance of your system. This means creating a training dataset and a testing dataset. You will train the spam filter on the training set and then evaluate results on the testing set. 1) Describe your datasets. What is the testing set? What is the training set? How many files are in each? What is the percentage of spam in each set? Where did you get your data? Is your data representative of the kinds of spam you might see in the real world? Why or why not? 2) What percentage of correct classifications does your system have on the testing set? What percentage of your classifications are false positives (a ham file categorized as spam)? What percentage of your classifications are false negatives (a spam file categorized as ham)? 3) Describe how you decided to build your dictionary. How did you decide to deal with non-alphanumeric characters? Is every word from the training corpus in your dictionary or did you remove some words or limit the dictionary size? How did you calculate the probabilities? Why did you make the choices you made? What are the costs/benefits of your choices? 4) How might you try to redesign the system to take into account higher level features when categorizing as spam? Examples might include part-of-speech information, sentence structure, formatting, etc. Be specific when describing what information you would try to use and how you would approach using it to determine what is spam and what is not.

5 WHAT TO HAND IN Note that we will not grade programs that do not meet the criteria listed below. You are to submit the following things: 1) The Source Code for ALL programs. 2) An Executable for each program: You must submit an executable of your code that runs on Windows XP as specified in the section titled The Executable. *CAVEAT ABOUT.EXE* there have been issues in the past with.exe files being stripped from attachments, please rename your executables [name].txt in the.zip file sent to us. We ll rename it as an.exe when we try it. 3) Output Files: You are required to provide an example dictionary file you generated from your training corpus. 4) A.txt (not.pdf or.doc) file containing answers to the question in Question to Answer. 5) Please place all files in one directory. HOW TO HAND IT IN To submit your lab, compress all of the files specified into a.zip file. The file should be named in the following manner, lastname_labnumber.zip. For example, my lab submission would be called pardo_2.zip. Submit this.zip file via to *NOTE* there have been issues in the past with.exe files being stripped from attachments, please rename your executable [name].txt in the.zip file sent to us. Due by the start of class on Wednesday May 30, 2007.

On Attacking Statistical Spam Filters

On Attacking Statistical Spam Filters On Attacking Statistical Spam Filters Gregory L. Wittel and S. Felix Wu Department of Computer Science University of California, Davis One Shields Avenue, Davis, CA 95616 USA Paper review by Deepak Chinavle

More information

Anti Spamming Techniques

Anti Spamming Techniques Anti Spamming Techniques Written by Sumit Siddharth In this article will we first look at some of the existing methods to identify an email as a spam? We look at the pros and cons of the existing methods

More information

Question 2 Naïve Bayes (16 points)

Question 2 Naïve Bayes (16 points) Question 2 Naïve Bayes (16 points) About 2/3 of your email is spam so you downloaded an open source spam filter based on word occurrences that uses the Naive Bayes classifier. Assume you collected the

More information

Homework 4 Statistics W4240: Data Mining Columbia University Due Tuesday, October 29 in Class

Homework 4 Statistics W4240: Data Mining Columbia University Due Tuesday, October 29 in Class Problem 1. (10 Points) James 6.1 Problem 2. (10 Points) James 6.3 Problem 3. (10 Points) James 6.5 Problem 4. (15 Points) James 6.7 Problem 5. (15 Points) James 6.10 Homework 4 Statistics W4240: Data Mining

More information

Is Spam Bad For Your Mailbox?

Is Spam Bad For Your Mailbox? Whitepaper Spam and Ham Spam and Ham A Simple Guide Fauzi Yunos 12 Page2 Executive Summary People tend to be much less bothered by spam slipping through filters into their mail box (false negatives), than

More information

Feature Subset Selection in E-mail Spam Detection

Feature Subset Selection in E-mail Spam Detection Feature Subset Selection in E-mail Spam Detection Amir Rajabi Behjat, Universiti Technology MARA, Malaysia IT Security for the Next Generation Asia Pacific & MEA Cup, Hong Kong 14-16 March, 2012 Feature

More information

Web Document Clustering

Web Document Clustering Web Document Clustering Lab Project based on the MDL clustering suite http://www.cs.ccsu.edu/~markov/mdlclustering/ Zdravko Markov Computer Science Department Central Connecticut State University New Britain,

More information

An Efficient Spam Filtering Techniques for Email Account

An Efficient Spam Filtering Techniques for Email Account American Journal of Engineering Research (AJER) e-issn : 2320-0847 p-issn : 2320-0936 Volume-02, Issue-10, pp-63-73 www.ajer.org Research Paper Open Access An Efficient Spam Filtering Techniques for Email

More information

Machine Learning Final Project Spam Email Filtering

Machine Learning Final Project Spam Email Filtering Machine Learning Final Project Spam Email Filtering March 2013 Shahar Yifrah Guy Lev Table of Content 1. OVERVIEW... 3 2. DATASET... 3 2.1 SOURCE... 3 2.2 CREATION OF TRAINING AND TEST SETS... 4 2.3 FEATURE

More information

Spam Filtering with Naive Bayesian Classification

Spam Filtering with Naive Bayesian Classification Spam Filtering with Naive Bayesian Classification Khuong An Nguyen Queens College University of Cambridge L101: Machine Learning for Language Processing MPhil in Advanced Computer Science 09-April-2011

More information

the barricademx end user interface documentation for barricademx users

the barricademx end user interface documentation for barricademx users the barricademx end user interface documentation for barricademx users BarricadeMX Plus The End User Interface This short document will show you how to use the end user web interface for the BarricadeMX

More information

Simple Language Models for Spam Detection

Simple Language Models for Spam Detection Simple Language Models for Spam Detection Egidio Terra Faculty of Informatics PUC/RS - Brazil Abstract For this year s Spam track we used classifiers based on language models. These models are used to

More information

T-61.3050 : Email Classification as Spam or Ham using Naive Bayes Classifier. Santosh Tirunagari : 245577

T-61.3050 : Email Classification as Spam or Ham using Naive Bayes Classifier. Santosh Tirunagari : 245577 T-61.3050 : Email Classification as Spam or Ham using Naive Bayes Classifier Santosh Tirunagari : 245577 January 20, 2011 Abstract This term project gives a solution how to classify an email as spam or

More information

Installing (1.8.7) 9/2/2009. 1 Installing jgrasp

Installing (1.8.7) 9/2/2009. 1 Installing jgrasp 1 Installing jgrasp Among all of the jgrasp Tutorials, this one is expected to be the least read. Most users will download the jgrasp self-install file for their system, doubleclick the file, follow the

More information

How To Send Email From A Netbook To A Spam Box On A Pc Or Mac Or Mac (For A Mac) On A Mac Or Ipo (For An Ipo) On An Ipot Or Ipot (For Mac) (For

How To Send Email From A Netbook To A Spam Box On A Pc Or Mac Or Mac (For A Mac) On A Mac Or Ipo (For An Ipo) On An Ipot Or Ipot (For Mac) (For INSTITUTE of TECHNOLOGY CARLOW Intelligent Anti-Spam Technology User Manual Author: CHEN LIU (C00140374) Supervisor: Paul Barry Date: 16 th April 2010 Content 1. System Requirement... 3 2. Interface and

More information

It is designed to resist the spam in the Internet. It can provide the convenience to the email user and save the bandwidth of the network.

It is designed to resist the spam in the Internet. It can provide the convenience to the email user and save the bandwidth of the network. 1. Abstract: Our filter program is a JavaTM 2 SDK, Standard Edition Version 1.5.0 (J2SE) based application, which can be running on the machine that has installed JDK 1.5.0. It can integrate with a JavaServer

More information

A Crash Course in OS X D. Riley and M. Allen

A Crash Course in OS X D. Riley and M. Allen Objectives A Crash Course in OS X D. Riley and M. Allen To learn some of the basics of the OS X operating system - including the use of the login panel, system menus, the file browser, the desktop, and

More information

Reasoning Component Architecture

Reasoning Component Architecture Architecture of a Spam Filter Application By Avi Pfeffer A spam filter consists of two components. In this article, based on my book Practical Probabilistic Programming, first describe the architecture

More information

SpamTitan Outlook Addin v1.1 Installation Instructions

SpamTitan Outlook Addin v1.1 Installation Instructions SpamTitan Outlook Addin v1.1 Installation Instructions Introduction What does this Addin Do? Allows reporting of SPAM and HAM messages to the SpamTitan appliance, this in turn will allow the Bayesian appliance

More information

1 Introduction. 2 An Interpreter. 2.1 Handling Source Code

1 Introduction. 2 An Interpreter. 2.1 Handling Source Code 1 Introduction The purpose of this assignment is to write an interpreter for a small subset of the Lisp programming language. The interpreter should be able to perform simple arithmetic and comparisons

More information

Outlook Add-in Deployment Guide

Outlook Add-in Deployment Guide Outlook Add-in Deployment Guide Sophos TOC 3 Contents Introduction...4 Prerequisites...4 Installation...4 Downloading the Outlook Add-in...5 Installing the Add-in on a Single Workstation...5 Installing

More information

Bayes and Naïve Bayes. cs534-machine Learning

Bayes and Naïve Bayes. cs534-machine Learning Bayes and aïve Bayes cs534-machine Learning Bayes Classifier Generative model learns Prediction is made by and where This is often referred to as the Bayes Classifier, because of the use of the Bayes rule

More information

Virto Active Directory Service for SharePoint. Release 4.1.2. Installation and User Guide

Virto Active Directory Service for SharePoint. Release 4.1.2. Installation and User Guide Virto Active Directory Service for SharePoint Release 4.1.2 Installation and User Guide 2 Table of Contents OVERVIEW... 3 SYSTEM REQUIREMENTS... 4 OPERATING SYSTEM... 4 SERVER... 4 BROWSER... 5 INSTALLATION...

More information

Bayesian Spam Filtering

Bayesian Spam Filtering Bayesian Spam Filtering Ahmed Obied Department of Computer Science University of Calgary amaobied@ucalgary.ca http://www.cpsc.ucalgary.ca/~amaobied Abstract. With the enormous amount of spam messages propagating

More information

Bayesian Filtering. Scoring

Bayesian Filtering. Scoring 9 Bayesian Filtering The Bayesian filter in SpamAssassin is one of the most effective techniques for filtering spam. Although Bayesian statistical analysis is a branch of mathematics, one doesn't necessarily

More information

A Two-Pass Statistical Approach for Automatic Personalized Spam Filtering

A Two-Pass Statistical Approach for Automatic Personalized Spam Filtering A Two-Pass Statistical Approach for Automatic Personalized Spam Filtering Khurum Nazir Junejo, Mirza Muhammad Yousaf, and Asim Karim Dept. of Computer Science, Lahore University of Management Sciences

More information

Email Filters that use Spammy Words Only

Email Filters that use Spammy Words Only Email Filters that use Spammy Words Only Vasanth Elavarasan Department of Computer Science University of Texas at Austin Advisors: Mohamed Gouda Department of Computer Science University of Texas at Austin

More information

FileMaker 11. ODBC and JDBC Guide

FileMaker 11. ODBC and JDBC Guide FileMaker 11 ODBC and JDBC Guide 2004 2010 FileMaker, Inc. All Rights Reserved. FileMaker, Inc. 5201 Patrick Henry Drive Santa Clara, California 95054 FileMaker is a trademark of FileMaker, Inc. registered

More information

Programming Project 1: Lexical Analyzer (Scanner)

Programming Project 1: Lexical Analyzer (Scanner) CS 331 Compilers Fall 2015 Programming Project 1: Lexical Analyzer (Scanner) Prof. Szajda Due Tuesday, September 15, 11:59:59 pm 1 Overview of the Programming Project Programming projects I IV will direct

More information

Backing Up TestTrack Native Project Databases

Backing Up TestTrack Native Project Databases Backing Up TestTrack Native Project Databases TestTrack projects should be backed up regularly. You can use the TestTrack Native Database Backup Command Line Utility to back up TestTrack 2012 and later

More information

FireEye Email Threat Prevention Cloud Evaluation

FireEye Email Threat Prevention Cloud Evaluation Evaluation Prepared for FireEye June 9, 2015 Tested by ICSA Labs 1000 Bent Creek Blvd., Suite 200 Mechanicsburg, PA 17050 www.icsalabs.com Table of Contents Executive Summary... 1 Introduction... 1 About

More information

How to Create and Manage your Junk Email Inbox Rule

How to Create and Manage your Junk Email Inbox Rule How to Create and Manage your Junk Email Inbox Rule Overview Every email message received from outside of the University is scanned by a program called SpamAssassin before it is delivered to your Exchange

More information

Microsoft Windows PowerShell v2 For Administrators

Microsoft Windows PowerShell v2 For Administrators Course 50414B: Microsoft Windows PowerShell v2 For Administrators Course Details Course Outline Module 1: Introduction to PowerShell the Basics This module explains how to install and configure PowerShell.

More information

Detecting Email Spam. MGS 8040, Data Mining. Audrey Gies Matt Labbe Tatiana Restrepo

Detecting Email Spam. MGS 8040, Data Mining. Audrey Gies Matt Labbe Tatiana Restrepo Detecting Email Spam MGS 8040, Data Mining Audrey Gies Matt Labbe Tatiana Restrepo 5 December 2011 INTRODUCTION This report describes a model that may be used to improve likelihood of recognizing undesirable

More information

Introduction to: Computers & Programming: Input and Output (IO)

Introduction to: Computers & Programming: Input and Output (IO) Introduction to: Computers & Programming: Input and Output (IO) Adam Meyers New York University Summary What is Input and Ouput? What kinds of Input and Output have we covered so far? print (to the console)

More information

2.0. Specification of HSN 2.0 JavaScript Static Analyzer

2.0. Specification of HSN 2.0 JavaScript Static Analyzer 2.0 Specification of HSN 2.0 JavaScript Static Analyzer Pawe l Jacewicz Version 0.3 Last edit by: Lukasz Siewierski, 2012-11-08 Relevant issues: #4925 Sprint: 11 Summary This document specifies operation

More information

Logistic Regression for Spam Filtering

Logistic Regression for Spam Filtering Logistic Regression for Spam Filtering Nikhila Arkalgud February 14, 28 Abstract The goal of the spam filtering problem is to identify an email as a spam or not spam. One of the classic techniques used

More information

OUTLOOK ADDIN V1.5 ABOUT THE ADDIN

OUTLOOK ADDIN V1.5 ABOUT THE ADDIN OUTLOOK ADDIN V1.5 ABOUT THE ADDIN The SpamTitan Outlook Addin v1.5 allows reporting of SPAM and HAM messages to the SpamTitan appliance, these messages are then examined by the SpamTitan Bayesian filter

More information

Code Estimation Tools Directions for a Services Engagement

Code Estimation Tools Directions for a Services Engagement Code Estimation Tools Directions for a Services Engagement Summary Black Duck software provides two tools to calculate size, number, and category of files in a code base. This information is necessary

More information

Bitrix Site Manager ASP.NET. Installation Guide

Bitrix Site Manager ASP.NET. Installation Guide Bitrix Site Manager ASP.NET Installation Guide Contents Introduction... 4 Chapter 1. Checking for IIS Installation... 5 Chapter 2. Using An Archive File to Install Bitrix Site Manager ASP.NET... 7 Preliminary

More information

Configuring MDaemon for Centralized Spam Blocking and Filtering

Configuring MDaemon for Centralized Spam Blocking and Filtering Configuring MDaemon for Centralized Spam Blocking and Filtering Alt-N Technologies, Ltd 2201 East Lamar Blvd, Suite 270 Arlington, TX 76006 (817) 525-2005 http://www.altn.com July 26, 2004 Contents A Centralized

More information

REVIEWER S GUIDE - QURB -

REVIEWER S GUIDE - QURB - REVIEWER S GUIDE - QURB - INTRODUCTION Industry watchers predict that nearly half of a typical user s Inbox will be spam by 2005. What if you could guarantee that each morning you opened a completely clean

More information

Evolutionary Detection of Rules for Text Categorization. Application to Spam Filtering

Evolutionary Detection of Rules for Text Categorization. Application to Spam Filtering Advances in Intelligent Systems and Technologies Proceedings ECIT2004 - Third European Conference on Intelligent Systems and Technologies Iasi, Romania, July 21-23, 2004 Evolutionary Detection of Rules

More information

Adaption of Statistical Email Filtering Techniques

Adaption of Statistical Email Filtering Techniques Adaption of Statistical Email Filtering Techniques David Kohlbrenner IT.com Thomas Jefferson High School for Science and Technology January 25, 2007 Abstract With the rise of the levels of spam, new techniques

More information

Hands-on Exercises with Big Data

Hands-on Exercises with Big Data Hands-on Exercises with Big Data Lab Sheet 1: Getting Started with MapReduce and Hadoop The aim of this exercise is to learn how to begin creating MapReduce programs using the Hadoop Java framework. In

More information

PANDA CLOUD EMAIL PROTECTION 4.0.1 1 User Manual 1

PANDA CLOUD EMAIL PROTECTION 4.0.1 1 User Manual 1 PANDA CLOUD EMAIL PROTECTION 4.0.1 1 User Manual 1 Contents 1. INTRODUCTION TO PANDA CLOUD EMAIL PROTECTION... 4 1.1. WHAT IS PANDA CLOUD EMAIL PROTECTION?... 4 1.1.1. Why is Panda Cloud Email Protection

More information

1 Maximum likelihood estimation

1 Maximum likelihood estimation COS 424: Interacting with Data Lecturer: David Blei Lecture #4 Scribes: Wei Ho, Michael Ye February 14, 2008 1 Maximum likelihood estimation 1.1 MLE of a Bernoulli random variable (coin flips) Given N

More information

SpamPanel Email Level Manual Version 1 Last update: March 21, 2014 SpamPanel

SpamPanel Email Level Manual Version 1 Last update: March 21, 2014 SpamPanel SpamPanel Email Level Manual Version 1 Last update: March 21, 2014 SpamPanel Table of Contents Incoming... 1 Incoming Spam Quarantine... 2 Incoming Log Search... 4 Delivery Queue... 7 Report Non-Spam...

More information

Eventia Log Parsing Editor 1.0 Administration Guide

Eventia Log Parsing Editor 1.0 Administration Guide Eventia Log Parsing Editor 1.0 Administration Guide Revised: November 28, 2007 In This Document Overview page 2 Installation and Supported Platforms page 4 Menus and Main Window page 5 Creating Parsing

More information

Spam Filtering using Naïve Bayesian Classification

Spam Filtering using Naïve Bayesian Classification Spam Filtering using Naïve Bayesian Classification Presented by: Samer Younes Outline What is spam anyway? Some statistics Why is Spam a Problem Major Techniques for Classifying Spam Transport Level Filtering

More information

Part III: Machine Learning. CS 188: Artificial Intelligence. Machine Learning This Set of Slides. Parameter Estimation. Estimation: Smoothing

Part III: Machine Learning. CS 188: Artificial Intelligence. Machine Learning This Set of Slides. Parameter Estimation. Estimation: Smoothing CS 188: Artificial Intelligence Lecture 20: Dynamic Bayes Nets, Naïve Bayes Pieter Abbeel UC Berkeley Slides adapted from Dan Klein. Part III: Machine Learning Up until now: how to reason in a model and

More information

Immunity from spam: an analysis of an artificial immune system for junk email detection

Immunity from spam: an analysis of an artificial immune system for junk email detection Immunity from spam: an analysis of an artificial immune system for junk email detection Terri Oda and Tony White Carleton University, Ottawa ON, Canada terri@zone12.com, arpwhite@scs.carleton.ca Abstract.

More information

USER S MANUAL Cloud Email Firewall 4.3.2.4 1. Cloud Email & Web Security

USER S MANUAL Cloud Email Firewall 4.3.2.4 1. Cloud Email & Web Security USER S MANUAL Cloud Email Firewall 4.3.2.4 1 Contents 1. INTRODUCTION TO CLOUD EMAIL FIREWALL... 4 1.1. WHAT IS CLOUD EMAIL FIREWALL?... 4 1.1.1. What makes Cloud Email Firewall different?... 4 1.1.2.

More information

Programming Assignment II Due Date: See online CISC 672 schedule Individual Assignment

Programming Assignment II Due Date: See online CISC 672 schedule Individual Assignment Programming Assignment II Due Date: See online CISC 672 schedule Individual Assignment 1 Overview Programming assignments II V will direct you to design and build a compiler for Cool. Each assignment will

More information

Cisco Networking Academy Program Curriculum Scope & Sequence. Fundamentals of UNIX version 2.0 (July, 2002)

Cisco Networking Academy Program Curriculum Scope & Sequence. Fundamentals of UNIX version 2.0 (July, 2002) Cisco Networking Academy Program Curriculum Scope & Sequence Fundamentals of UNIX version 2.0 (July, 2002) Course Description: Fundamentals of UNIX teaches you how to use the UNIX operating system and

More information

Learning from Data: Naive Bayes

Learning from Data: Naive Bayes Semester 1 http://www.anc.ed.ac.uk/ amos/lfd/ Naive Bayes Typical example: Bayesian Spam Filter. Naive means naive. Bayesian methods can be much more sophisticated. Basic assumption: conditional independence.

More information

1.1.1. What makes Panda Cloud Email Protection different?... 4. 1.1.2. Is it secure?... 4. 1.2.1. How messages are classified... 5

1.1.1. What makes Panda Cloud Email Protection different?... 4. 1.1.2. Is it secure?... 4. 1.2.1. How messages are classified... 5 Contents 1. INTRODUCTION TO PANDA CLOUD EMAIL PROTECTION... 4 1.1. WHAT IS PANDA CLOUD EMAIL PROTECTION?... 4 1.1.1. What makes Panda Cloud Email Protection different?... 4 1.1.2. Is it secure?... 4 1.2.

More information

Cisco IronPort Email Security Plug-in 7.3 Administrator Guide

Cisco IronPort Email Security Plug-in 7.3 Administrator Guide Cisco IronPort Email Security Plug-in 7.3 Administrator Guide May 01, 2013 Americas Headquarters Cisco Systems, Inc. 170 West Tasman Drive San Jose, CA 95134-1706 USA http://www.cisco.com Tel: 408 526-4000

More information

NLP Lab Session Week 3 Bigram Frequencies and Mutual Information Scores in NLTK September 16, 2015

NLP Lab Session Week 3 Bigram Frequencies and Mutual Information Scores in NLTK September 16, 2015 NLP Lab Session Week 3 Bigram Frequencies and Mutual Information Scores in NLTK September 16, 2015 Starting a Python and an NLTK Session Open a Python 2.7 IDLE (Python GUI) window or a Python interpreter

More information

Discrete Structures for Computer Science

Discrete Structures for Computer Science Discrete Structures for Computer Science Adam J. Lee adamlee@cs.pitt.edu 6111 Sennott Square Lecture #20: Bayes Theorem November 5, 2013 How can we incorporate prior knowledge? Sometimes we want to know

More information

How To Use A Pmsft On A Pc Or Mac Or Mac (For Mac) With A Pmf (For Pc) Or Mac Mac (Or Mac) On A Mac Or Pc (For Pmsf) On An Ipad

How To Use A Pmsft On A Pc Or Mac Or Mac (For Mac) With A Pmf (For Pc) Or Mac Mac (Or Mac) On A Mac Or Pc (For Pmsf) On An Ipad Capario Secure File Transfer User Guide Notices This user guide (the Guide ) is provided by Capario in order to facilitate your use of the Capario Secure File Transfer Software. This Guide is subject to

More information

Bayesian Learning Email Cleansing. In its original meaning, spam was associated with a canned meat from

Bayesian Learning Email Cleansing. In its original meaning, spam was associated with a canned meat from Bayesian Learning Email Cleansing. In its original meaning, spam was associated with a canned meat from Hormel. In recent years its meaning has changed. Now, an obscure word has become synonymous with

More information

Sample CSE8A midterm Multiple Choice (circle one)

Sample CSE8A midterm Multiple Choice (circle one) Sample midterm Multiple Choice (circle one) (2 pts) Evaluate the following Boolean expressions and indicate whether short-circuiting happened during evaluation: Assume variables with the following names

More information

Figure 1: Main screen

Figure 1: Main screen Version 0.15 May 2007 Contents 1.0 Introduction... 2 1.1 License (GNU GPL)... 3 2.0 Requirements... 3 2.1 Free FTP Programs... 4 3.0 Owl Ultralite Installation... 4 3.1 Downloading and Decompressing (unzipping)...

More information

CS 2112 Spring 2014. 0 Instructions. Assignment 3 Data Structures and Web Filtering. 0.1 Grading. 0.2 Partners. 0.3 Restrictions

CS 2112 Spring 2014. 0 Instructions. Assignment 3 Data Structures and Web Filtering. 0.1 Grading. 0.2 Partners. 0.3 Restrictions CS 2112 Spring 2014 Assignment 3 Data Structures and Web Filtering Due: March 4, 2014 11:59 PM Implementing spam blacklists and web filters requires matching candidate domain names and URLs very rapidly

More information

Spam Filtering and Removing Spam Content from Massage by Using Naive Bayesian

Spam Filtering and Removing Spam Content from Massage by Using Naive Bayesian www..org 104 Spam Filtering and Removing Spam Content from Massage by Using Naive Bayesian 1 Abha Suryavanshi, 2 Shishir Shandilya 1 Research Scholar, NIIST Bhopal, India. 2 Prof. (CSE), NIIST Bhopal,

More information

CMPSCI 240: Reasoning about Uncertainty

CMPSCI 240: Reasoning about Uncertainty CMPSCI 240: Reasoning about Uncertainty Lecture 18: Spam Filtering and Naive Bayes Classification Andrew McGregor University of Massachusetts Last Compiled: April 9, 2015 Review Total Probability If A

More information

Tweaking Naïve Bayes classifier for intelligent spam detection

Tweaking Naïve Bayes classifier for intelligent spam detection 682 Tweaking Naïve Bayes classifier for intelligent spam detection Ankita Raturi 1 and Sunil Pranit Lal 2 1 University of California, Irvine, CA 92697, USA. araturi@uci.edu 2 School of Computing, Information

More information

Customer Control Panel Manual

Customer Control Panel Manual Customer Control Panel Manual Contents Introduction... 2 Before you begin... 2 Logging in to the Control Panel... 2 Resetting your Control Panel password.... 3 Managing FTP... 4 FTP details for your website...

More information

FileMaker 12. ODBC and JDBC Guide

FileMaker 12. ODBC and JDBC Guide FileMaker 12 ODBC and JDBC Guide 2004 2012 FileMaker, Inc. All Rights Reserved. FileMaker, Inc. 5201 Patrick Henry Drive Santa Clara, California 95054 FileMaker and Bento are trademarks of FileMaker, Inc.

More information

BARRACUDA. N e t w o r k s SPAM FIREWALL 600

BARRACUDA. N e t w o r k s SPAM FIREWALL 600 BARRACUDA N e t w o r k s SPAM FIREWALL 600 Contents: I. What is Barracuda?...1 II. III. IV. How does Barracuda Work?...1 Quarantine Summary Notification...2 Quarantine Inbox...4 V. Sort the Quarantine

More information

E-mail Spam Classification With Artificial Neural Network and Negative Selection Algorithm

E-mail Spam Classification With Artificial Neural Network and Negative Selection Algorithm E-mail Spam Classification With Artificial Neural Network and Negative Selection Algorithm Ismaila Idris Dept of Cyber Security Science, Federal University of Technology, Minna, Nigeria. Idris.ismaila95@gmail.com

More information

On Attacking Statistical Spam Filters

On Attacking Statistical Spam Filters On Attacking Statistical Spam Filters Gregory L. Wittel and S. Felix Wu Department of Computer Science University of California, Davis One Shields Avenue, Davis, CA 95616 USA Abstract. The efforts of anti-spammers

More information

webmethods Certificate Toolkit

webmethods Certificate Toolkit Title Page webmethods Certificate Toolkit User s Guide Version 7.1.1 January 2008 webmethods Copyright & Document ID This document applies to webmethods Certificate Toolkit Version 7.1.1 and to all subsequent

More information

How to Configure URL Filtering Using the Firewall s HTTP Proxy

How to Configure URL Filtering Using the Firewall s HTTP Proxy How to Configure URL Filtering Using the Firewall s HTTP Proxy Introduction Allied Telesyn switches and routers have a built-in firewall designed to allow safe access to the Internet by applying a set

More information

Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science

Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science 6.035, Fall 2005 Handout 7 Scanner Parser Project Wednesday, September 7 DUE: Wednesday, September 21 This

More information

SpamNet Spam Detection Using PCA and Neural Networks

SpamNet Spam Detection Using PCA and Neural Networks SpamNet Spam Detection Using PCA and Neural Networks Abhimanyu Lad B.Tech. (I.T.) 4 th year student Indian Institute of Information Technology, Allahabad Deoghat, Jhalwa, Allahabad, India abhimanyulad@iiita.ac.in

More information

Downloading and Using Mozilla Thunderbird By Brazos Price Spring 2005

Downloading and Using Mozilla Thunderbird By Brazos Price Spring 2005 Downloading and Using Mozilla Thunderbird By Brazos Price Spring 2005 What is Thunderbird? Why should I use it? Thunderbird is the email client portion of Mozilla, an open-source suite of applications

More information

A Content based Spam Filtering Using Optical Back Propagation Technique

A Content based Spam Filtering Using Optical Back Propagation Technique A Content based Spam Filtering Using Optical Back Propagation Technique Sarab M. Hameed 1, Noor Alhuda J. Mohammed 2 Department of Computer Science, College of Science, University of Baghdad - Iraq ABSTRACT

More information

Some fitting of naive Bayesian spam filtering for Japanese environment

Some fitting of naive Bayesian spam filtering for Japanese environment Some fitting of naive Bayesian spam filtering for Japanese environment Manabu Iwanaga 1, Toshihiro Tabata 2, and Kouichi Sakurai 2 1 Graduate School of Information Science and Electrical Engineering, Kyushu

More information

Anti-Spam Service User s Guide Advanced Internet Technologies, Inc. December 3, 2004

Anti-Spam Service User s Guide Advanced Internet Technologies, Inc. December 3, 2004 Page 1 of 7 Anti-Spam Service User s Guide Advanced Internet Technologies, Inc. December 3, 2004 Search All Your Favorite Engines from a Single Source with tybit!!! (Download Now) Revision History: This

More information

Machine Learning. CS 188: Artificial Intelligence Naïve Bayes. Example: Digit Recognition. Other Classification Tasks

Machine Learning. CS 188: Artificial Intelligence Naïve Bayes. Example: Digit Recognition. Other Classification Tasks CS 188: Artificial Intelligence Naïve Bayes Machine Learning Up until now: how use a model to make optimal decisions Machine learning: how to acquire a model from data / experience Learning parameters

More information

Netbeans IDE Tutorial for using the Weka API

Netbeans IDE Tutorial for using the Weka API Netbeans IDE Tutorial for using the Weka API Kevin Amaral University of Massachusetts Boston First, download Netbeans packaged with the JDK from Oracle. http://www.oracle.com/technetwork/java/javase/downloads/jdk-7-netbeans-download-

More information

Oracle Fusion Middleware 11gR2: Forms, and Reports (11.1.2.0.0) Certification with SUSE Linux Enterprise Server 11 SP2 (GM) x86_64

Oracle Fusion Middleware 11gR2: Forms, and Reports (11.1.2.0.0) Certification with SUSE Linux Enterprise Server 11 SP2 (GM) x86_64 Oracle Fusion Middleware 11gR2: Forms, and Reports (11.1.2.0.0) Certification with SUSE Linux Enterprise Server 11 SP2 (GM) x86_64 http://www.suse.com 1 Table of Contents Introduction...3 Hardware and

More information

10605 BigML Assignment 4(a): Naive Bayes using Hadoop Streaming

10605 BigML Assignment 4(a): Naive Bayes using Hadoop Streaming 10605 BigML Assignment 4(a): Naive Bayes using Hadoop Streaming Due: Friday, Feb. 21, 2014 23:59 EST via Autolab Late submission with 50% credit: Sunday, Feb. 23, 2014 23:59 EST via Autolab Policy on Collaboration

More information

Naive Bayes Spam Filtering Using Word-Position-Based Attributes

Naive Bayes Spam Filtering Using Word-Position-Based Attributes Naive Bayes Spam Filtering Using Word-Position-Based Attributes Johan Hovold Department of Computer Science Lund University Box 118, 221 00 Lund, Sweden johan.hovold.363@student.lu.se Abstract This paper

More information

RecoveryVault Express Client User Manual

RecoveryVault Express Client User Manual For Linux distributions Software version 4.1.7 Version 2.0 Disclaimer This document is compiled with the greatest possible care. However, errors might have been introduced caused by human mistakes or by

More information

SpamTitan Outlook Addin V2.0

SpamTitan Outlook Addin V2.0 SpamTitan Outlook Addin V2.0 The SpamTitan Outlook Addin v2.0 allows users to do a varity of spam filtering management from the outlook client, including Report Spam and Ham messages to the SpamTitan Bayesian

More information

Object Oriented Software Design

Object Oriented Software Design Object Oriented Software Design Introduction to Java - II Giuseppe Lipari http://retis.sssup.it/~lipari Scuola Superiore Sant Anna Pisa October 28, 2010 G. Lipari (Scuola Superiore Sant Anna) Introduction

More information

Creating a Java application using Perfect Developer and the Java Develo...

Creating a Java application using Perfect Developer and the Java Develo... 1 of 10 15/02/2010 17:41 Creating a Java application using Perfect Developer and the Java Development Kit Introduction Perfect Developer has the facility to execute pre- and post-build steps whenever the

More information

COS 116 The Computational Universe Laboratory 11: Machine Learning

COS 116 The Computational Universe Laboratory 11: Machine Learning COS 116 The Computational Universe Laboratory 11: Machine Learning In last Tuesday s lecture, we surveyed many machine learning algorithms and their applications. In this lab, you will explore algorithms

More information

Introduction to Bayesian Classification (A Practical Discussion) Todd Holloway Lecture for B551 Nov. 27, 2007

Introduction to Bayesian Classification (A Practical Discussion) Todd Holloway Lecture for B551 Nov. 27, 2007 Introduction to Bayesian Classification (A Practical Discussion) Todd Holloway Lecture for B551 Nov. 27, 2007 Naïve Bayes Components ML vs. MAP Benefits Feature Preparation Filtering Decay Extended Examples

More information

Installing GFI MailEssentials

Installing GFI MailEssentials Installing GFI MailEssentials Introduction to installing GFI MailEssentials This chapter shows you how to install and configure GFI MailEssentials. GFI MailEssentials can be installed in two ways: Installation

More information

PLESK 7 NEW FEATURES HOW-TO RESOURCES

PLESK 7 NEW FEATURES HOW-TO RESOURCES PLESK 7 NEW FEATURES HOW-TO RESOURCES Copyright (C) 1999-2004 SWsoft, Inc. All rights reserved. Distribution of this work or derivative of this work in any form is prohibited unless prior written permission

More information

Journal of Information Technology Impact

Journal of Information Technology Impact Journal of Information Technology Impact Vol. 8, No., pp. -0, 2008 Probability Modeling for Improving Spam Filtering Parameters S. C. Chiemeke University of Benin Nigeria O. B. Longe 2 University of Ibadan

More information

Towards better accuracy for Spam predictions

Towards better accuracy for Spam predictions Towards better accuracy for Spam predictions Chengyan Zhao Department of Computer Science University of Toronto Toronto, Ontario, Canada M5S 2E4 czhao@cs.toronto.edu Abstract Spam identification is crucial

More information

How To Filter Emails On Gmail On A Computer Or A Computer (For Free)

How To Filter Emails On Gmail On A Computer Or A Computer (For Free) www.ijcsi.org 115 Email Filtering and Analysis Using Classification Algorithms Akshay Iyer, Akanksha Pandey, Dipti Pamnani, Karmanya Pathak and IT Dept, VESIT Chembur, Mumbai-74, India Prof. Mrs. Jayshree

More information

User Manual. version 3.0-r1

User Manual. version 3.0-r1 User Manual version 3.0-r1 Contents 1 What is Confixx? - General Information 5 1.1 Login................................ 5 1.2 Settings Lag............................ 6 2 The Sections of the Web Interface

More information

Analysis of Spam Filter Methods on SMTP Servers Category: Trends in Anti-Spam Development

Analysis of Spam Filter Methods on SMTP Servers Category: Trends in Anti-Spam Development Analysis of Spam Filter Methods on SMTP Servers Category: Trends in Anti-Spam Development Author André Tschentscher Address Fachhochschule Erfurt - University of Applied Sciences Applied Computer Science

More information

Chapter 2 Text Processing with the Command Line Interface

Chapter 2 Text Processing with the Command Line Interface Chapter 2 Text Processing with the Command Line Interface Abstract This chapter aims to help demystify the command line interface that is commonly used in UNIX and UNIX-like systems such as Linux and Mac

More information