Homework 4
Statistics W4240: Data Mining
Columbia University
Due Tuesday, October 29 in Class




Problem 1. (10 Points) James 6.1
Problem 2. (10 Points) James 6.3
Problem 3. (10 Points) James 6.5
Problem 4. (15 Points) James 6.7
Problem 5. (15 Points) James 6.10

Problems 6 and 7 implement naive Bayes on a spam data set.

Naive Bayes Spam Filtering Walkthrough

Working with data can often be messy and frustrating, especially when you are learning a language. The following walkthrough is designed to be an introduction to working with real-world data. You will implement a naive Bayes spam filter for the lingspam dataset, which contains legitimate emails and spam sent to a linguistics email list.

Step 1: Download the dataset Homework3Data from the course website and place it in your working directory for R. The data are partitioned into ham and spam, and further partitioned into training and testing sets, with 350 examples in each training set and 130 in each testing set, for a total of 960 emails. These are a collection of emails to a linguistics server list. They have been preprocessed for you by 1) removing non-letter characters, 2) removing stopwords, and 3) stemming. Stopwords are short function words, such as "in", "is", and "the". Stemming trims inflected words down to their stems, for example reducing "running" to "run".

For example, the original ham email 3-380msg1.txt is

Subject: negative concord i am interested in the grammar of negative concord in various dialects of american and british english. if anyone out there speaks natively a dialect that uses negative concord and is willing to answer grammaticality questions about their dialect, would they please send me an email note to that effect and i ll get back to them with my queries. my address is : kroch @ change. ling. upenn. edu thanks.

The cleaned version is

negative concord interest grammar negative concord various dialect american british english anyone speak natively dialect negative concord answer grammaticality question dialect please send email note effect ll back query address kroch change ling upenn edu thank

Likewise, the original spam message spmsga108.txt is

Subject: the best, just got better the 2 newest and hottest interactive adult web sites will soon be the largest!!! check out both of the following adult web sites free samples and trial memberships!!!! live triple x adult entertainment and pictures. new content added weekly!!! girls, boys, couples, and more!!! check them both out : http : / / www2. dirtyonline. com http : / / www. chixchat. com

The cleaned version is

best better newest hottest interactive adult web site soon largest check both follow adult web site free sample trial membership live triple x adult entertainment picture content add weekly girl boy couple check both http www dirtyonline com http www chixchat com

The original email database can be found at:
http://csmining.org/index.php/ling-spam-datasets.html

The cleaned version was pulled from Andrew Ng's course materials for CS229:
http://openclassroom.stanford.edu/mainfolder/documentpage.php?course=machinelearning&doc=exercises/ex6/ex6.html

However, there are R packages like tm that can clean datasets, or this can easily be done through custom code.
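For reference only (this is not part of the assignment), a minimal sketch of that kind of cleaning with the tm package might look like the following. The function names (VCorpus, tm_map, stemDocument, stopwords) and their exact behavior are assumptions about the tm and SnowballC packages and may differ across package versions.

library(tm)          # text-mining utilities (assumed installed)
library(SnowballC)   # stemmer used by stemDocument (assumed installed)

raw <- "i am interested in the grammar of negative concord in various dialects"
corpus <- VCorpus(VectorSource(raw))
corpus <- tm_map(corpus, content_transformer(tolower))       # lower-case the text
corpus <- tm_map(corpus, removePunctuation)                  # strip punctuation
corpus <- tm_map(corpus, removeNumbers)                      # strip digits
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # drop stopwords
corpus <- tm_map(corpus, stemDocument)                       # stem the remaining words
as.character(corpus[[1]])                                    # inspect the cleaned text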

Step 2: Upload the data into R. Open a new document in R and save it as NaiveBayesFiles.R. We are going to write a function that reads the emails from a directory into your workspace. Copy the following code into NaiveBayesFiles.R:

# To read in data from the directories:
# Partially based on code from C. Shalizi
read.directory <- function(dirname) {
    # Store the emails in a list
    emails = list();
    # Get a list of filenames in the directory
    filenames = dir(dirname, full.names=TRUE);
    for (i in 1:length(filenames)){
        emails[[i]] = scan(filenames[i], what="", quiet=TRUE);
    }
    return(emails)
}
# Example: ham.test <- read.directory("Homework3Data/nonspam-test/")

The above code creates a reusable function in R, which takes the input argument dirname. Use functions when you need to run the same bit of code repeatedly; in this case, we will be reading data from four different directories. The function returns a list of emails. Lists store objects with an ordering; here the order is dictated by the position of each file in the directory. The code is structured as follows:

- emails = list() instantiates the variable emails as a list, albeit with no entries yet.

- filenames = dir(dirname, full.names=TRUE) creates a vector of strings populated by the names of the files in directory dirname. The option full.names=TRUE means that the directory path is prepended onto each file name to give a full path; if this is set to FALSE, only the file names are given.

- for (i in 1:length(filenames)){ loops through all of the file names.

- emails[[i]] = scan(filenames[i], what="", quiet=TRUE) fills list element i with a vector (or list) of input values. Here what="" means that all fields are read as strings. The default separator is whitespace; this can be changed through the sep argument. The argument quiet=TRUE keeps scan from displaying statistics about each imported document, which would otherwise fill the console when hundreds of documents are read. The net result is that element i of emails is filled by a vector of strings, with each string representing one word.

- return(emails) returns the object emails when the function read.directory(dirname) is called.

To be able to use a function, save the file after making any changes, and then run the source file by selecting File > Source File... and then selecting NaiveBayesFiles.R. Read in all four email directories:

ham.train <- read.directory("spamdata/nonspam-train/")
ham.test <- read.directory("spamdata/nonspam-test/")
spam.train <- read.directory("spamdata/spam-train/")
spam.test <- read.directory("spamdata/spam-test/")

Step 3: Make a dictionary. We are going to use all of the files (training and testing, ham and spam) to make a dictionary, or a list of all the words used. First, copy the following code into NaiveBayesFiles.R:

# Make dictionary sorted by number of times a word appears in corpus
# (useful for using commonly appearing words as factors)
# NOTE: Use the *entire* corpus: training, testing, spam and ham
make.sorted.dictionary.df <- function(emails){
    # This returns a dataframe that is sorted by the number of times
    # a word appears
    # List of vectors to one big vector
    dictionary.full <- unlist(emails)
    # Tabulate the full dictionary
    tabulate.dic <- tabulate(factor(dictionary.full))
    # Find unique values
    dictionary <- unique(dictionary.full)
    # Sort them alphabetically
    dictionary <- sort(dictionary)
    dictionary.df <- data.frame(word = dictionary, count = tabulate.dic)
    sort.dictionary.df <- dictionary.df[order(dictionary.df$count, decreasing=TRUE),];
    return(sort.dictionary.df)
}

The function make.sorted.dictionary.df(emails) takes the list emails generated by read.directory() and creates a data frame that holds 1) the list of unique words and 2) the number of times each appears in the corpus, sorted in descending order of frequency. We want the sorted list so that we can easily access the most frequently used words. The code is structured as follows:

- dictionary.full <- unlist(emails) takes the list emails and converts it into one very long vector, dictionary.full.

- tabulate.dic <- tabulate(factor(dictionary.full)) makes a vector of counts for each word that appears in dictionary.full. The argument factor(dictionary.full) tells the tabulate function to bin the data according to the enumerated categories (levels) of dictionary.full.

- dictionary <- unique(dictionary.full) returns a vector of the unique words in dictionary.full.

- dictionary <- sort(dictionary) sorts the vector dictionary from a to z; this is necessary because the levels of factor(dictionary.full), and hence the counts in tabulate.dic, are in alphabetical order.

- dictionary.df <- data.frame(word = dictionary, count = tabulate.dic) creates a new data frame, dictionary.df, with two columns, word and count. The word column is populated by the alphabetized dictionary; the count column is populated by the counts for the alphabetized words.

- sort.dictionary.df <- dictionary.df[order(dictionary.df$count, decreasing=TRUE),] reorders the rows of dictionary.df into descending order of the count values. order(dictionary.df$count, decreasing=TRUE) returns the index ordering that sorts the count values in a descending manner; this is used to rearrange the rows, but not the columns, of the data frame.

- return(sort.dictionary.df) returns the sorted data frame.

We now create a dictionary for the full dataset by concatenating the individual lists into a single large one for all emails, and then using the large list to create the dictionary:

all.emails <- c(ham.train, ham.test, spam.train, spam.test)
dictionary <- make.sorted.dictionary.df(all.emails)
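As an optional sanity check (not required by the assignment), you can peek at the resulting data frame; since it is sorted by count, the first rows are the most common words in the corpus:

head(dictionary, 10)   # the ten most frequent words and their counts
nrow(dictionary)       # the number of unique words, i.e. the dictionary size D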

Step 4: Make dictionary counts for all subsets. We are now going to count how often each dictionary word appears in each document and place the counts in a document (rows) by word (columns) matrix. Copy the following script into NaiveBayesFiles.R:

# Make a document-term matrix, which counts the number of times each
# dictionary element is used in a document
make.document.term.matrix <- function(emails, dictionary){
    # This takes the email and dictionary objects from above and outputs a
    # document-term matrix
    num.emails <- length(emails);
    num.words <- length(dictionary);
    # Instantiate a matrix where rows are documents and columns are words
    dtm <- mat.or.vec(num.emails, num.words); # A matrix filled with zeros
    for (i in 1:num.emails){
        num.words.email <- length(emails[[i]]);
        email.temp <- emails[[i]];
        for (j in 1:num.words.email){
            ind <- which(dictionary == email.temp[j]);
            dtm[i,ind] <- dtm[i,ind] + 1;
        }
    }
    return(dtm);
}
# Example: dtm <- make.document.term.matrix(emails.train, dictionary)

In many cases, producing a full document-term matrix is a poor idea: it requires a large amount of storage space for relatively little information. In settings with large corpora or large dictionaries, it is often best to store the data as tokens that simply record the document number, the word number in the dictionary, and the count for that word-document combination. Only counts of at least 1 are stored, yielding something like the table below (a small sketch of this triplet representation is given at the end of Step 4):

document   word   count
3          1      2
3          5      1
3          11     2
3          12     2
...        ...    ...

However, the matrix representation greatly simplifies the coding required to implement a naive Bayes classifier. The function is structured as follows:

- num.emails <- length(emails) calculates the number of emails by computing the length of the list.

- num.words <- length(dictionary) calculates the length of the dictionary.

- dtm <- mat.or.vec(num.emails, num.words) instantiates a matrix with num.emails rows and num.words columns, filled with zeros. Note that mat.or.vec(num.emails) would instead instantiate a vector with num.emails entries, filled with zeros.

- for (i in 1:num.emails){ loops through each of the emails.

- num.words.email <- length(emails[[i]]) calculates the number of words in email i.

- email.temp <- emails[[i]] temporarily stores the vector for email i in email.temp.

- for (j in 1:num.words.email){ loops through each of the words in email i.

- ind <- which(dictionary == email.temp[j]) stores the dictionary index of the jth word of email i.

- dtm[i,ind] <- dtm[i,ind] + 1 adds a count of 1 to the observation matrix for word ind in email i.

- return(dtm) returns the document-term matrix.

We now create a document-term matrix for each group of emails. Since we are looping over a large space in R, this may take a few minutes to run:

dtm.ham.train <- make.document.term.matrix(ham.train, dictionary$word)
dtm.ham.test <- make.document.term.matrix(ham.test, dictionary$word)
dtm.spam.train <- make.document.term.matrix(spam.train, dictionary$word)
dtm.spam.test <- make.document.term.matrix(spam.test, dictionary$word)
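As mentioned in the storage remark above, a dense document-term matrix can be converted to the (document, word, count) triplet form. The following is a minimal sketch only, not needed for the assignment; the helper name dtm.to.triplets is made up for illustration.

# Illustrative only: keep the nonzero entries of a document-term matrix
# as (document, word, count) triplets
dtm.to.triplets <- function(dtm){
    nonzero <- which(dtm > 0, arr.ind=TRUE)   # (row, column) indices of nonzero counts
    data.frame(document = nonzero[, 1],
               word = nonzero[, 2],
               count = dtm[nonzero])
}
# Example: triplets.spam.train <- dtm.to.triplets(dtm.spam.train)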

Step 5: Create a naive Bayes classifier. All of your data is now in matrix form, and we will use it to compute naive Bayes probabilities. Computationally, we may ignore the normalizer p(x^{test}) since we only need to determine which class has the higher likelihood. A test document is declared spam if

\[ p(y = 1) \prod_{j=1}^{n_{test}} p(x_j = x_j^{test} \mid y = 1) \;\ge\; p(y = 0) \prod_{j=1}^{n_{test}} p(x_j = x_j^{test} \mid y = 0), \]

and ham otherwise. Therefore, we only need to estimate the probabilities p(y = 1), p(y = 0), p(x_j = k | y = 1) for k = 1, ..., D, and p(x_j = k | y = 0) for k = 1, ..., D. Note that \hat{p}(y = 0) = \hat{p}(y = 1) = 0.5 because the number of spam documents is equal to the number of ham documents in both the training and testing sets.

When computing probabilities defined by products, it is more convenient to work with log probabilities. Consider a dictionary with two words, each with equal probability given that the document is spam, and a test document with 100 words. Then

\[ p(y = 1 \mid x^{test}) = \frac{1}{2} \prod_{j=1}^{100} \frac{1}{2} = \left(\frac{1}{2}\right)^{101} \approx 3.9 \times 10^{-31}. \]

This is a best case for a small document; realistic dictionaries and longer documents quickly drive such products below machine precision. If we instead compute the log probability,

\[ \log\left( p(y = 1 \mid x^{test}) \right) = \log\left( \left(\frac{1}{2}\right)^{101} \right) = -101 \log(2) \approx -70.01. \]

This can easily be stored within machine precision. Even when computing a vector of probabilities, it is often a good idea to work in log probabilities. Note that log in programs like R or Matlab uses base e. To compute a vector of log probabilities from a document-term matrix with µ phantom examples, copy the following code into NaiveBayesFiles.R:

make.log.pvec <- function(dtm, mu){
    # Sum up the number of instances per word
    pvec.no.mu <- colSums(dtm)
    # Sum up the number of words
    n.words <- sum(pvec.no.mu)
    # Get the dictionary size
    dic.len <- length(pvec.no.mu)
    # Incorporate mu and normalize
    log.pvec <- log(pvec.no.mu + mu) - log(mu*dic.len + n.words)
    return(log.pvec)
}

The following commands are introduced in this code:

- sum(pvec.no.mu) sums over all values in the vector pvec.no.mu.

- colSums(dtm) sums each column of the matrix dtm, producing a vector with the number of elements equal to the number of columns of dtm.

- pvec.no.mu + mu adds the scalar value mu to all elements of the vector pvec.no.mu.

- (pvec.no.mu + mu)/(mu*dic.len + n.words) would divide all elements of the vector pvec.no.mu + mu by the scalar value mu*dic.len + n.words; the code computes the log of this quantity as log(pvec.no.mu + mu) - log(mu*dic.len + n.words).

Write a function to compute log(\hat{p}(y = \tilde{y} \mid x^{test})) and discriminate between ham and spam,

naive.bayes <- function(log.pvec.spam, log.pvec.ham, log.prior.spam, log.prior.ham, dtm.test)

This takes the log probabilities for the dictionary in a spam document (log.pvec.spam), in a ham document (log.pvec.ham), the log priors (log.prior.spam, log.prior.ham), and a document-term matrix for the testing corpus (dtm.test). Note: it will be much more efficient to use matrix operations than loops for data sets of this size.
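Before turning to the questions, the training inputs for naive.bayes can be assembled along these lines. This is a sketch using only the objects and values defined above: the priors are 0.5 because the ham and spam training sets are the same size, and µ = 1/D as specified in Question 6.

mu <- 1/nrow(dictionary)                           # mu = 1/D, where D is the dictionary size
log.pvec.spam <- make.log.pvec(dtm.spam.train, mu)
log.pvec.ham <- make.log.pvec(dtm.ham.train, mu)
log.prior.spam <- log(0.5)                         # equal numbers of spam and ham emails
log.prior.ham <- log(0.5)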

Question 6. (20 points) Set µ = 1/D, where D is the dictionary size, and use the training set for training and the testing set for testing. What proportion is classified correctly? Additionally, report the proportion of your: true positives (spam classified as spam, divided by the total amount of testing spam), true negatives (ham classified as ham, divided by the total amount of testing ham), false positives (ham classified as spam, divided by the total amount of testing ham), and false negatives (spam classified as ham, divided by the total amount of testing spam).

Question 7. (20 points) Finally, we will use feature selection to remove features that are irrelevant to classification. Instead of calculating the probability over the entire dictionary, we will simply count the number of times each of the n most relevant features appears and treat the set of selected features itself as the dictionary.

a. (10 points) A common way to select features is by using the mutual information between feature x_k and class label y. Mutual information is expressed in terms of entropies,

\[ I(X, Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X). \]

Show that

\[ I_k = \sum_{\tilde{y}=0}^{1} \left[ p(x^{test} = k \mid y = \tilde{y})\, p(y = \tilde{y}) \log \frac{p(x^{test} = k \mid y = \tilde{y})}{p(x^{test} = k)} + \left(1 - p(x^{test} = k \mid y = \tilde{y})\right) p(y = \tilde{y}) \log \frac{1 - p(x^{test} = k \mid y = \tilde{y})}{1 - p(x^{test} = k)} \right]. \]

b. (10 points) Compute the mutual information for all features and use it to select the top n features as a dictionary. Set µ = 1/n. Compute the proportion classified correctly, the proportion of false negatives, and the proportion of false positives for n = {200, 500, 1000, 2500, 5000, 10000}. Display the results in three graphs. What happens? Why do you think this is?