Homework 4 Statistics W4240: Data Mining Columbia University Due Tuesday, October 29 in Class
Problem 1. (10 Points) James 6.1
Problem 2. (10 Points) James 6.3
Problem 3. (10 Points) James 6.5
Problem 4. (15 Points) James 6.7
Problem 5. (15 Points) James 6.10

Problems 6 and 7 implement naive Bayes on a spam data set.

Naive Bayes Spam Filtering Walkthrough. Working with data can often be messy and frustrating, especially when you are learning a language. The following is designed to be an introduction to working with real-world data. You will implement a spam filter for the lingspam dataset using a naive Bayes spam filter. This dataset contains legitimate emails and spam sent to a linguistics list.

Step 1: Download the dataset Homework3Data from the course website. Place it in your working directory for R. The data are partitioned into ham and spam, and further partitioned into training and testing, with 350 examples in each training set and 130 in each testing set, for a total of 960 emails. These are a collection of emails to a linguistics server list. They have been preprocessed for you by 1) removing non-letter characters, 2) removing stopwords, and 3) stemming. Stopwords are short function words, such as "in", "is", and "the". Stemming involves trimming inflected words to their stem, such as reducing "running" to "run". For example, the original ham email msg1.txt is

Subject: negative concord i am interested in the grammar of negative concord in various dialects of american and british english. if anyone out there speaks natively a dialect that uses negative concord and is willing to answer grammaticality questions about their dialect, would they please send me an email note to that effect and i ll get back to them with my queries. my email address is : change. ling. upenn. edu thanks.
The cleaned version is

negative concord interest grammar negative concord various dialect american british english anyone speak natively dialect negative concord answer grammaticality question dialect please send note effect ll back query address kroch change ling upenn edu thank

Likewise, the original spam message spmsga108.txt is
Subject: the best, just got better the 2 newest and hottest interactive adult web sites will soon be the largest!!! check out both of the following adult web sites free samples and trial memberships!!!! live triple x adult entertainment and pictures. new content added weekly!!! girls, boys, couples, and more!!! check them both out : http : / / www2. dirtyonline. com http : / / www. chixchat. com

The cleaned version is

best better newest hottest interactive adult web site soon largest check both follow adult web site free sample trial membership live triple x adult entertainment picture content add weekly girl boy couple check both http www dirtyonline com http www chixchat com

The original database can be found at: The cleaned version was pulled from Andrew Ng's course materials for CS229: However, there are R packages like tm that can clean datasets, or this can easily be done through custom code.

Step 2: Upload the data into R. Open a new document in R. Save it as NaiveBayesFiles.R. We are going to make a function to read the emails from a directory into your workspace. Copy the following code into NaiveBayesFiles.R:

    # To read in data from the directories:
    # Partially based on code from C. Shalizi
    read.directory <- function(dirname) {
      # Store the emails in a list
      emails = list();
      # Get a list of filenames in the directory
      filenames = dir(dirname, full.names=TRUE);
      for (i in 1:length(filenames)) {
        emails[[i]] = scan(filenames[i], what="", quiet=TRUE);
      }
      return(emails)
    }
    # Example: ham.test <- read.directory("homework3data/nonspam-test/")

The above code creates a reusable function in R, which takes the input argument dirname. Use functions when you need to repeatedly use a bit of code. In this case, we will be reading in data from four different directories. This code creates a list of emails. Lists store objects with an ordering. In this case, the order is dictated by the position in the directory. The code is structured as follows:
emails = list() instantiates the variable emails as a list, albeit with no entries currently.

filenames = dir(dirname, full.names=TRUE) creates a vector of strings, populated by the names of the files in directory dirname. The option full.names=TRUE means that the directory is prepended onto each file name to give a path; if this is set to FALSE, only the file names are given.

for (i in 1:length(filenames)) { loops through all of the file names.

emails[[i]] = scan(filenames[i], what="", quiet=TRUE) fills list element i with a vector of input values. Here what="" means that all fields will be read as strings. The default separator is white space; this can be changed through the sep argument. The argument quiet=TRUE keeps scan from displaying statistics about the imported document, which would fill up a command screen when hundreds of documents are read. The net result of this command is that element i of emails is filled by a vector of strings, with each string representing a word.

return(emails) returns the object emails when the function read.directory(dirname) is called.

To be able to use a function, save it after making any changes, and then run the source file by selecting File > Source File... and then selecting NaiveBayesFiles.R. Read in all four directories:

    ham.train <- read.directory("spamdata/nonspam-train/")
    ham.test <- read.directory("spamdata/nonspam-test/")
    spam.train <- read.directory("spamdata/spam-train/")
    spam.test <- read.directory("spamdata/spam-test/")

Step 3: Make a dictionary. We are going to use all of the files (training and testing, ham and spam) to make a dictionary, or a list of all the words used.
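As a quick, self-contained check of the reading step above, the sketch below builds a tiny corpus in a temporary directory (the file names and contents here are made up for illustration, not part of the homework data) and reads it back the same way read.directory() does:

```r
# Hypothetical toy corpus: two one-line "emails" in a temp directory
tmp <- file.path(tempdir(), "toy-emails")
dir.create(tmp, showWarnings = FALSE)
writeLines("negative concord grammar", file.path(tmp, "msg1.txt"))
writeLines("free adult web site", file.path(tmp, "spmsg1.txt"))

# Same pattern as read.directory(): one list element per file,
# each element a character vector with one entry per word
emails <- list()
filenames <- dir(tmp, full.names = TRUE)
for (i in 1:length(filenames)) {
  emails[[i]] <- scan(filenames[i], what = "", quiet = TRUE)
}

length(emails)   # 2: one element per file
emails[[1]]      # c("negative", "concord", "grammar")
```

Note that dir() returns file names in sorted order, which is why the list ordering is "dictated by the position in the directory."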
First, copy the following code into NaiveBayesFiles.R:

    # Make dictionary sorted by number of times a word appears in corpus
    # (useful for using commonly appearing words as factors)
    # NOTE: Use the *entire* corpus: training, testing, spam and ham
    make.sorted.dictionary.df <- function(emails) {
      # This returns a dataframe that is sorted by the number of times
      # a word appears
      # List of vectors to one big vector
      dictionary.full <- unlist(emails)
      # Tabulates the full dictionary
      tabulate.dic <- tabulate(factor(dictionary.full))
      # Find unique values
      dictionary <- unique(dictionary.full)
      # Sort them alphabetically
      dictionary <- sort(dictionary)
      dictionary.df <- data.frame(word = dictionary, count = tabulate.dic)
      sort.dictionary.df <- dictionary.df[order(dictionary.df$count, decreasing=TRUE),];
      return(sort.dictionary.df)
    }

The function make.sorted.dictionary.df(emails) takes the list of emails generated by read.directory() and creates a data frame that holds 1) the list of unique words, and 2) the number of times each appears in the corpus, sorted in descending order by frequency. We want the sorted list so that we can easily access the most used words. The code is structured as follows:

dictionary.full <- unlist(emails) takes the list emails and converts it into a very long vector, dictionary.full.

tabulate.dic <- tabulate(factor(dictionary.full)) makes a vector of counts for each word that appears in dictionary.full. The argument factor(dictionary.full) tells the tabulate function to bin the data according to the enumerated categories of dictionary.full.

dictionary <- unique(dictionary.full) returns a vector of the unique words in dictionary.full.

dictionary <- sort(dictionary) sorts the vector dictionary from a to z; this is necessary because the levels of factor(dictionary.full) are sorted alphabetically, so the counts in tabulate.dic are in alphabetical order.

dictionary.df <- data.frame(word = dictionary, count = tabulate.dic) creates a new data frame, dictionary.df, with two columns, word and count. The word column is populated by the alphabetized dictionary, and the count column by the counts for the alphabetized words.

sort.dictionary.df <- dictionary.df[order(dictionary.df$count, decreasing=TRUE),] reorders the rows of dictionary.df in descending order according to the count values. order(dictionary.df$count, decreasing=TRUE) returns the index ordering that sorts the count values in a descending manner; this is used to rearrange the rows, but not the columns, of the data frame.

return(sort.dictionary.df) returns the sorted values as a data frame.
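The alignment between sort(unique(...)) and tabulate(factor(...)) is the subtle point here, and it can be checked on a toy word list (the words below are made up for illustration):

```r
# Toy corpus: "free" and "spam" appear twice, "ham" once
dictionary.full <- c("spam", "free", "free", "ham", "spam")

# factor levels are sorted alphabetically, so tabulate() returns counts
# in alphabetical order; sorting the unique words lines the two up
tabulate.dic <- tabulate(factor(dictionary.full))
dictionary <- sort(unique(dictionary.full))
dictionary.df <- data.frame(word = dictionary, count = tabulate.dic)
dictionary.df[order(dictionary.df$count, decreasing = TRUE), ]
# rows come out as: free (2), spam (2), ham (1)
```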
We now create a dictionary for the full dataset by concatenating the individual lists into a single large one for all emails, and then using the large list to create a dictionary:

    all.emails <- c(ham.train, ham.test, spam.train, spam.test)
    dictionary <- make.sorted.dictionary.df(all.emails)

Step 4: Make dictionary counts for all subsets. We are now going to create counts for each dictionary word in all documents and place them in a document (rows) by word (columns) matrix. Copy the following script to NaiveBayesFiles.R:

    # Make a document-term matrix, which counts the number of times each
    # dictionary element is used in a document
    make.document.term.matrix <- function(emails, dictionary) {
      # This takes the email and dictionary objects from above and outputs a
      # document term matrix
      num.emails <- length(emails);
      num.words <- length(dictionary);
      # Instantiate a matrix where rows are documents and columns are words
      dtm <- mat.or.vec(num.emails, num.words); # A matrix filled with zeros
      for (i in 1:num.emails) {
        num.words.email <- length(emails[[i]]);
        email.temp <- emails[[i]];
        for (j in 1:num.words.email) {
          ind <- which(dictionary == email.temp[j]);
          dtm[i,ind] <- dtm[i,ind] + 1;
        }
      }
      return(dtm);
    }
    # Example: dtm <- make.document.term.matrix(emails.train, dictionary$word)

In many cases, producing a full document-term matrix is a poor idea. It requires a large amount of storage space for relatively little information. In settings with large corpora or large dictionaries, it is often best to store the data in the form of tokens that simply denote document number, word number in the dictionary, and count for that word-document combination, with only counts of at least 1 stored, yielding triples of the form (document, word, count). However, the matrix representation greatly simplifies the coding required for implementing a naive Bayes classifier. The function is structured as follows:

num.emails <- length(emails) calculates the number of emails by computing the length of the list.

num.words <- length(dictionary) calculates the length of the dictionary.

dtm <- mat.or.vec(num.emails, num.words) instantiates a matrix with num.emails rows and num.words columns that is filled with zeros. Note that mat.or.vec(num.emails) would instantiate a vector with num.emails entries that is filled with zeros.

for (i in 1:num.emails) { loops through each of the emails.

num.words.email <- length(emails[[i]]) calculates the number of words in email i.
email.temp <- emails[[i]] temporarily stores the vector for email i in email.temp.

for (j in 1:num.words.email) { loops through each of the words in email i.

ind <- which(dictionary == email.temp[j]) stores the dictionary index equal to the jth word of email i.

dtm[i,ind] <- dtm[i,ind] + 1 adds a count of 1 to the observation matrix for word ind in email i.

return(dtm) returns the document-term matrix.

We now create a document-term matrix for each group of emails. Since we are looping over a large space in R, this may take a few minutes to run:

    dtm.ham.train <- make.document.term.matrix(ham.train, dictionary$word)
    dtm.ham.test <- make.document.term.matrix(ham.test, dictionary$word)
    dtm.spam.train <- make.document.term.matrix(spam.train, dictionary$word)
    dtm.spam.test <- make.document.term.matrix(spam.test, dictionary$word)

Step 5: Create a naive Bayes classifier. All of your data is now in matrix form. We are now going to use it to compute naive Bayes probabilities. Computationally, we may ignore the normalizer p(x^{test}) since we only need to determine which class has the higher likelihood. A test document is declared spam if

    p(y = 1) \prod_{j=1}^{n^{test}} p(x_j = x_j^{test} \mid y = 1) \ge p(y = 0) \prod_{j=1}^{n^{test}} p(x_j = x_j^{test} \mid y = 0),

and ham otherwise. Therefore, we only need to estimate the probabilities p(y = 1), p(y = 0), p(x_j = k \mid y = 1) for k = 1, ..., D, and p(x_j = k \mid y = 0) for k = 1, ..., D. Note that \hat{p}(y = 0) = \hat{p}(y = 1) = 0.5 because the number of spam documents is equal to the number of ham documents in both the training and test sets.

When computing probabilities defined by products it is more convenient to work in log probabilities. Consider a dictionary with two words, each with equal probability given that the document is spam, and a test document with 100 words. Then,

    p(y = 1 \mid x^{test}) \propto p(y = 1) \prod_{j=1}^{100} p(x_j = x_j^{test} \mid y = 1) = (1/2)^{101}.

While this is a best case for a small document, such products quickly underflow machine precision as documents and dictionaries grow. If we instead compute the log probability,

    \log(p(y = 1 \mid x^{test})) \propto \log((1/2)^{101}) = -101 \log(2).

This can easily be stored within machine precision.
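The precision argument above is easy to verify directly in R (the exponent 2000 below is just an arbitrary stand-in for a longer document):

```r
# The homework's example: (1/2)^101 is tiny but still representable
(1/2)^101          # about 3.9e-31
-101 * log(2)      # about -70.01, the same quantity in log space

# For longer documents the raw product underflows to exactly 0,
# while the log stays well within range
(1/2)^2000 == 0    # TRUE in double precision
-2000 * log(2)     # about -1386.29
```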
Even when computing a vector of probabilities, it is often a good idea to work in log probabilities. Note that programs like R or Matlab use base e. To compute a vector of log probabilities from a document-term matrix with µ phantom examples, copy the following code into NaiveBayesFiles.R:
    make.log.pvec <- function(dtm, mu) {
      # Sum up the number of instances per word
      pvec.no.mu <- colSums(dtm)
      # Sum up number of words
      n.words <- sum(pvec.no.mu)
      # Get dictionary size
      dic.len <- length(pvec.no.mu)
      # Incorporate mu and normalize
      log.pvec <- log(pvec.no.mu + mu) - log(mu*dic.len + n.words)
      return(log.pvec)
    }

The following commands were introduced in this code:

sum(pvec.no.mu) sums over all values in the vector pvec.no.mu.

colSums(dtm) sums down the rows of the matrix dtm, producing a vector with the number of elements equal to the number of columns in dtm.

pvec.no.mu + mu adds the scalar value mu to all elements of the vector pvec.no.mu.

(pvec.no.mu + mu)/(mu*dic.len + n.words) divides all elements of the vector pvec.no.mu + mu by the scalar value mu*dic.len + n.words; the difference of logs in the code computes the log of this ratio.

Write a function to compute log(\hat{p}(y = ỹ \mid x^{test})) and discriminate between ham and spam,

    naive.bayes <- function(log.pvec.spam, log.pvec.ham, log.prior.spam, log.prior.ham, dtm.test)

This takes the log probabilities for the dictionary in a spam document (log.pvec.spam), a ham document (log.pvec.ham), the log priors (log.prior.spam, log.prior.ham), and a document-term matrix for the testing corpus. Note: it will be much more efficient to use matrix operations than loops for data sets of this size.

Question 6. (20 points) Set µ = 1/D and use the training set for training and the testing set for testing. What proportion is classified correctly? Additionally, report the proportion of your: true positives (spam classified as spam divided by the total amount of testing spam), true negatives (ham classified as ham divided by the total amount of testing ham), false positives (ham classified as spam divided by the total amount of testing ham), and false negatives (spam classified as ham divided by the total amount of testing spam).

Question 7. (20 points) Finally, we will use feature selection to remove features that are irrelevant to classification.
Instead of calculating the probability over the entire dictionary, we will simply count the number of times each of the n most relevant features appears and treat the set of features itself as a dictionary.

a. (10 points) A common way to select features is by using the mutual information between feature x_k and class label y. Mutual information is expressed in terms of entropies,

    I(X, Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X).
Show that

    I_k = \sum_{ỹ=0}^{1} \left[ p(x^{test} = k \mid y = ỹ) \, p(y = ỹ) \log \frac{p(x^{test} = k \mid y = ỹ)}{p(x^{test} = k)} + \left( 1 - p(x^{test} = k \mid y = ỹ) \right) p(y = ỹ) \log \frac{1 - p(x^{test} = k \mid y = ỹ)}{1 - p(x^{test} = k)} \right].

b. (10 points) Compute the mutual information for all features; use this to select the top n features as a dictionary. Set µ = 1/n. Compute the proportion classified correctly, the proportion of false negatives, and the proportion of false positives for n ∈ {200, 500, 1000, 2500, 5000, ...}. Display the results in three graphs. What happens? Why do you think this is?
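As a sanity check before running the classifier on the real data, the smoothing arithmetic in make.log.pvec can be verified on a toy document-term matrix (the two documents and three-word dictionary below are made up for illustration):

```r
# Toy dtm: 2 documents (rows) over a 3-word dictionary (columns)
dtm <- rbind(c(2, 1, 0),
             c(1, 0, 0))
mu <- 1/3                          # mu = 1/D with D = 3

# Same steps as make.log.pvec()
pvec.no.mu <- colSums(dtm)         # per-word counts: 3 1 0
n.words <- sum(pvec.no.mu)         # 4 words total
dic.len <- length(pvec.no.mu)      # D = 3
log.pvec <- log(pvec.no.mu + mu) - log(mu * dic.len + n.words)

exp(log.pvec)        # (3 + 1/3)/5, (1 + 1/3)/5, (0 + 1/3)/5
sum(exp(log.pvec))   # 1: the smoothed estimates form a distribution,
                     # and the unseen third word gets nonzero probability
```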
IMPACT: International Journal of Research in Engineering & Technology (IMPACT: IJRET) ISSN(E): 2321-8843; ISSN(P): 2347-4599 Vol. 2, Issue 4, Apr 2014, 129-136 Impact Journals CLUSTERING FOR FORENSIC ANALYSIS
Categorical Data Visualization and Clustering Using Subjective Factors
Categorical Data Visualization and Clustering Using Subjective Factors Chia-Hui Chang and Zhi-Kai Ding Department of Computer Science and Information Engineering, National Central University, Chung-Li,
Introduction to Synoptic
Introduction to Synoptic 1 Introduction Synoptic is a tool that summarizes log files. More exactly, Synoptic takes a set of log files, and some rules that tell it how to interpret lines in those logs,
Université de Montpellier 2 Hugo Alatrista-Salas : [email protected]
Université de Montpellier 2 Hugo Alatrista-Salas : [email protected] WEKA Gallirallus Zeland) australis : Endemic bird (New Characteristics Waikato university Weka is a collection
1. Classification problems
Neural and Evolutionary Computing. Lab 1: Classification problems Machine Learning test data repository Weka data mining platform Introduction Scilab 1. Classification problems The main aim of a classification
Active Learning SVM for Blogs recommendation
Active Learning SVM for Blogs recommendation Xin Guan Computer Science, George Mason University Ⅰ.Introduction In the DH Now website, they try to review a big amount of blogs and articles and find the
Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.
Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C
Novell ZENworks Asset Management 7.5
Novell ZENworks Asset Management 7.5 w w w. n o v e l l. c o m October 2006 USING THE WEB CONSOLE Table Of Contents Getting Started with ZENworks Asset Management Web Console... 1 How to Get Started...
NLP Lab Session Week 3 Bigram Frequencies and Mutual Information Scores in NLTK September 16, 2015
NLP Lab Session Week 3 Bigram Frequencies and Mutual Information Scores in NLTK September 16, 2015 Starting a Python and an NLTK Session Open a Python 2.7 IDLE (Python GUI) window or a Python interpreter
Introduction to Matlab
Introduction to Matlab Social Science Research Lab American University, Washington, D.C. Web. www.american.edu/provost/ctrl/pclabs.cfm Tel. x3862 Email. [email protected] Course Objective This course provides
GUI Input and Output. Greg Reese, Ph.D Research Computing Support Group Academic Technology Services Miami University
GUI Input and Output Greg Reese, Ph.D Research Computing Support Group Academic Technology Services Miami University GUI Input and Output 2010-13 Greg Reese. All rights reserved 2 Terminology User I/O
OTN Developer Day: Oracle Big Data
OTN Developer Day: Oracle Big Data Hands On Lab Manual Oracle Big Data Connectors: Introduction to Oracle R Connector for Hadoop ORACLE R CONNECTOR FOR HADOOP 2.0 HANDS-ON LAB Introduction to Oracle R
Sales Performance Management Using Salesforce.com and Tableau 8 Desktop Professional & Server
Sales Performance Management Using Salesforce.com and Tableau 8 Desktop Professional & Server Author: Phil Gilles Sales Operations Analyst, Tableau Software March 2013 p2 Executive Summary Managing sales
131-1. Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10
1/10 131-1 Adding New Level in KDD to Make the Web Usage Mining More Efficient Mohammad Ala a AL_Hamami PHD Student, Lecturer m_ah_1@yahoocom Soukaena Hassan Hashem PHD Student, Lecturer soukaena_hassan@yahoocom
DiskPulse DISK CHANGE MONITOR
DiskPulse DISK CHANGE MONITOR User Manual Version 7.9 Oct 2015 www.diskpulse.com [email protected] 1 1 DiskPulse Overview...3 2 DiskPulse Product Versions...5 3 Using Desktop Product Version...6 3.1 Product
Visualization with Excel Tools and Microsoft Azure
Visualization with Excel Tools and Microsoft Azure Introduction Power Query and Power Map are add-ins that are available as free downloads from Microsoft to enhance the data access and data visualization
Distributed Computing and Big Data: Hadoop and MapReduce
Distributed Computing and Big Data: Hadoop and MapReduce Bill Keenan, Director Terry Heinze, Architect Thomson Reuters Research & Development Agenda R&D Overview Hadoop and MapReduce Overview Use Case:
Three types of messages: A, B, C. Assume A is the oldest type, and C is the most recent type.
Chronological Sampling for Email Filtering Ching-Lung Fu 2, Daniel Silver 1, and James Blustein 2 1 Acadia University, Wolfville, Nova Scotia, Canada 2 Dalhousie University, Halifax, Nova Scotia, Canada
Lavastorm Analytic Library Predictive and Statistical Analytics Node Pack FAQs
1.1 Introduction Lavastorm Analytic Library Predictive and Statistical Analytics Node Pack FAQs For brevity, the Lavastorm Analytics Library (LAL) Predictive and Statistical Analytics Node Pack will be
AMATH 352 Lecture 3 MATLAB Tutorial Starting MATLAB Entering Variables
AMATH 352 Lecture 3 MATLAB Tutorial MATLAB (short for MATrix LABoratory) is a very useful piece of software for numerical analysis. It provides an environment for computation and the visualization. Learning
Admin Report Kit for Active Directory
Admin Report Kit for Active Directory Reporting tool for Microsoft Active Directory Enterprise Product Overview Admin Report Kit for Active Directory (ARKAD) is a powerful reporting solution for the Microsoft
Analysis of MapReduce Algorithms
Analysis of MapReduce Algorithms Harini Padmanaban Computer Science Department San Jose State University San Jose, CA 95192 408-924-1000 [email protected] ABSTRACT MapReduce is a programming model
Learning from Diversity
Learning from Diversity Epitope Prediction with Sequence and Structure Features using an Ensemble of Support Vector Machines Rob Patro and Carl Kingsford Center for Bioinformatics and Computational Biology
IBM SPSS Direct Marketing 23
IBM SPSS Direct Marketing 23 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 23, release
DATA MINING TOOL FOR INTEGRATED COMPLAINT MANAGEMENT SYSTEM WEKA 3.6.7
DATA MINING TOOL FOR INTEGRATED COMPLAINT MANAGEMENT SYSTEM WEKA 3.6.7 UNDER THE GUIDANCE Dr. N.P. DHAVALE, DGM, INFINET Department SUBMITTED TO INSTITUTE FOR DEVELOPMENT AND RESEARCH IN BANKING TECHNOLOGY
ReadMe: Software for Automated Content Analysis 1
ReadMe: Software for Automated Content Analysis 1 Daniel Hopkins 2 Gary King 3 Matthew Knowles 4 Steven Melendez 5 Version 0.99835 March 8, 2012 1 Available from http://gking.harvard.edu/readme under the
NEXT Analytics Business Intelligence User Guide
NEXT Analytics Business Intelligence User Guide This document provides an overview of the powerful business intelligence functions embedded in NEXT Analytics v5. These functions let you build more useful
Technical Report. The KNIME Text Processing Feature:
Technical Report The KNIME Text Processing Feature: An Introduction Dr. Killian Thiel Dr. Michael Berthold [email protected] [email protected] Copyright 2012 by KNIME.com AG
