Homework Assignment 7


36-350, Data Mining: Solutions

1. Base rates (10 points)

(a) What fraction of the e-mails are actually spam?

Answer: 39%.

> sum(spam$spam=="spam")
[1] 1813
> 1813/nrow(spam)
[1] 0.3940448

(b) What should the constant classifier predict?

Answer: the non-spam class, since most messages are not spam.

(c) What is the error rate of the constant classifier?

Answer: 39%, the fraction of messages which actually are spam:

> sum(spam$spam=="spam")/nrow(spam)
[1] 0.3940448

2. Training and testing (10 points)

Divide the data set at random into a training set of 2301 rows and a testing set of 2300 rows. Check that the two halves do not overlap, and that they have the right number of rows. What fraction of each half is spam? (Do not hand in a list of 2301 row numbers.) Include your code for all of this.

Answer: This is like the training/testing divide from the last homework.

> training.rows = sample(1:nrow(spam),2301,replace=FALSE)
> training.data = spam[training.rows,]
> testing.data = spam[-training.rows,]
> dim(training.data)
[1] 2301 58
> dim(testing.data)
[1] 2300 58
> intersect(rownames(training.data),rownames(testing.data))
character(0)
> sum(training.data$spam=="spam")/2301
[1] 0.4059105
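The split-and-check pattern above can be illustrated on a built-in dataset; here is a minimal sketch using iris in place of the spam data (the 100/50 split sizes and the seed are arbitrary choices, not from the assignment):

```r
# Random train/test split of a built-in data frame, mirroring the
# procedure used for the spam data above
set.seed(1)                      # for reproducibility
train.rows <- sample(1:nrow(iris), 100, replace=FALSE)
train.data <- iris[train.rows, ]
test.data  <- iris[-train.rows, ]

# Sanity checks: the sizes are right and the halves do not overlap
nrow(train.data)                 # 100
nrow(test.data)                  # 50
intersect(rownames(train.data), rownames(test.data))  # character(0)

# Class make-up of each half (analogous to the spam fractions)
table(train.data$Species) / nrow(train.data)
```

Negative indexing with `iris[-train.rows, ]` is what guarantees the two halves are disjoint.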

> sum(testing.data$spam=="spam")/2300
[1] 0.3821739

In other words, the training data is 41% spam and the testing data is 38% spam; the difference is not statistically significant.

3. Fitting (40 points)

(a) Fit a prototype linear classifier to the training data, and report its error rate on the testing data.

Answer: I'll re-use the code from last time. (See those solutions.)

> prototype.predictions = prototype.classifier(newdata=testing.data[,-58],
+   examples.inputs = training.data[,-58],
+   examples.labels = training.data$spam)
> sum(prototype.predictions != testing.data$spam)/nrow(testing.data)
[1] 0.3160870

The error rate on the testing data is 32%.

(b) Fit a classification tree to the training data. Prune the tree by cross-validation (see below). Include a plot of the CV error versus tree size, a plot of the best tree, and its error rate on the testing data. Which variables appear in the tree?

> spam.tr = tree(spam ~ ., data=training.data)
> summary(spam.tr)

Classification tree:
tree(formula = spam ~ ., data = training.data)
Variables actually used in tree construction:
[1] "A.53" "A.7"  "A.52" "A.25" "A.5"  "A.56" "A.55" "A.16" "A.27"
Number of terminal nodes:  12
Residual mean deviance:  0.4665 = 1068 / 2289
Misclassification error rate: 0.0804 = 185 / 2301

(Notice that summary tells us which variables appear in the tree.) Using "." on the right-hand side of the formula means "include every variable from the data frame other than the response".

Using cross-validation to prune, and plotting error versus size:

> spam.tr.cv = cv.tree(spam.tr,method="misclass")
> plot(spam.tr.cv)

The best tree is, in this case, the largest tree found. In part this is because the default tree-fitting algorithm includes some sensible stopping rules.

> plot(spam.tr)

The error rate on the testing data:

Figure 1: Mis-classification rate versus tree size for prunings of the default tree. (The upper horizontal axis shows the values of the error/size trade-off parameter which give us a tree with a given number of nodes.)

Figure 2: The default tree.

> spam.tr.predictions = predict(spam.tr,newdata=testing.data,type="class")
> sum(spam.tr.predictions != testing.data$spam)/nrow(testing.data)
[1] 0.1034783

(This could be done in one line, but it looks uglier.) This rate is 10%; the difference between this and the 8% rate on the training data is statistically significant, but not substantively huge.

Things look a bit different if we start by fitting a very big tree.

> spam.tr.big = tree(spam ~ ., data=training.data,minsize=1,mindev=0)
> summary(spam.tr.big)

Classification tree:
tree(formula = spam ~ ., data = training.data, minsize = 1, mindev = 0)
Variables actually used in tree construction:
 [1] "A.53" "A.7"  "A.52" "A.25" "A.5"  "A.27" "A.45" "A.56" "A.55" "A.11" "A.20" "A.3"
[13] "A.19" "A.57" "A.46" "A.16" "A.12" "A.49" "A.21" "A.22" "A.29" "A.14" "A.28" "A.8"
[25] "A.9"  "A.15" "A.24" "A.1"  "A.44" "A.10" "A.13" "A.33" "A.39" "A.2"  "A.18" "A.37"
Number of terminal nodes:  125
Residual mean deviance:  0.001274 = 2.773 / 2176
Misclassification error rate: 0.0004346 = 1 / 2301

minsize is the minimum number of observations to allow in a node, and mindev is the minimum improvement in the error needed to create a node; setting minsize=1 and mindev=0 tells the tree-growing algorithm to ignore those limits. Plotting the tree gives us a pretty picture, but trying to label all the nodes seems foolish.

To get the best size, we look for the point where the cross-validated error rate bottoms out. Unfortunately, cv.tree lists sizes in decreasing order, and which.min returns the first match. Use rev to reverse the vectors.

> spam.tr.big.cv = cv.tree(spam.tr.big,method="misclass")
> rev(spam.tr.big.cv$size)[which.min(rev(spam.tr.big.cv$dev))]
[1] 19

(Why do we need to use rev twice?)
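To see why rev has to be applied to both vectors, here is a toy sketch with made-up size and deviance vectors in the same shape as cv.tree output (the numbers are hypothetical, not from the spam fit):

```r
# Hypothetical cv.tree-style output: sizes listed in decreasing order,
# with the minimum CV deviance attained by more than one size
size <- c(19, 12, 8, 5, 2, 1)            # number of leaves, largest first
dev  <- c(300, 300, 310, 350, 500, 900)  # CV deviance for each size

# Naive approach: which.min returns the FIRST minimum, which here
# is the largest tree achieving the lowest error
size[which.min(dev)]                 # 19

# Reversing both vectors puts sizes in increasing order, so the first
# minimum is now the SMALLEST tree achieving the lowest CV error
rev(size)[which.min(rev(dev))]       # 12
```

Both rev calls are needed because which.min returns an index into the vector it is given; indexing the un-reversed size vector with an index computed on the reversed dev vector would pair sizes with the wrong deviances.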
> spam.tr.pruned = prune.tree(spam.tr.big,best=19)
> summary(spam.tr.pruned)

Classification tree:
snip.tree(tree = spam.tr.big, nodes = c(27, 129, 14, 10, 37, 135, 66, 17, 49, 19, 26, 48, 25, 36, 128, 134))
Variables actually used in tree construction:
[1] "A.53" "A.7"  "A.52" "A.25" "A.5"  "A.27" "A.45" "A.56" "A.46" "A.55" "A.16"
Number of terminal nodes:  19
Residual mean deviance:  0.3911 = 892.6 / 2282

Figure 3: The maximal tree obtained from the spam data.

Misclassification error rate: 0.06389 = 147 / 2301

Much smaller than the full tree, this one can actually be plotted and labeled. The error rate on the testing data (9.1%) is a bit better than that of the slightly smaller tree we got at first, but not hugely:

> spam.pruned.preds = predict(spam.tr.pruned,newdata=testing.data,type="class")
> sum(spam.pruned.preds != testing.data$spam)/2300
[1] 0.09130435

Whether the extra 1% of error is worth having somewhat fewer leaves is up to you.

(c) Use bagging to fit an ensemble of 100 trees to the training data. Report the error rate of the ensemble on the testing data. Include a plot of the importance of the variables, according to the ensemble.

Answer:

> spam.bag = bagging(spam ~ ., data = training.data, minsplit=1, cp=0)
> summary(spam.bag)
           Length Class   Mode
formula         3 formula call
trees         100 -none-  list
votes        4602 -none-  numeric
class        2301 -none-  character
samples    230100 -none-  numeric
importance     57 -none-  numeric

With bagging, we'll not worry about individual trees being too overfit; rather, we'll let the averaging take care of that for us. (In fact, here, if I use the default control settings, bagging sometimes returns the same tree on every bootstrap sample!) The summary method, evidently, is not too informative; in part, however, this is because there are 100 separate trees to keep track of. The in-sample and out-of-sample error rates are 5.9% and 7.6%:

> sum(spam.bag$class != training.data$spam)/nrow(training.data)
[1] 0.05910474
> predict(spam.bag,newdata=testing.data)$error
[1] 0.07565217

(d) Use boosting to fit an ensemble of 100 trees to the training data. Report the error rate of the ensemble on the testing data. Include a plot of the importance of the variables.

Answer:

> spam.boost = adaboost.M1(spam ~ ., data = training.data, boos=FALSE, minsplit=1, cp=0, maxdepth=5)
> summary(spam.boost)

plot(spam.tr.pruned)
text(spam.tr.pruned,cex=0.5)

Figure 4: Tree obtained by starting from the maximal tree and using cross-validation to prune.

plot(1:57,spam.bag$importance,xlab="feature",ylab="importance")

Figure 5: Importance of the variables according to bagging.

           Length Class   Mode
formula         3 formula call
trees         100 -none-  list
weights       100 -none-  numeric
votes        4602 -none-  numeric
class        2301 -none-  character
importance     57 -none-  numeric

The error rate on the training data is 9%, and on the testing data it's 11%:

> sum(spam.boost$class != training.data$spam)/nrow(training.data)
[1] 0.09126467
> predict(spam.boost,newdata=testing.data)$error
[1] 0.1108696

The importance plot is shown in Figure 6; for reference, Figure 7 compares the two importance plots.

(e) Which (if any) of these methods out-performs the constant classifier?

Answer: They all do.

4. Errors (20 points)

Pick the prediction method from the previous problem with the lowest error rate. This was the bagged trees, at 7.6% mis-classification.

(a) What is its rate of false negatives? That is, what fraction of the spam e-mails in the testing set did it not classify as spam?

Answer:

> testing.spam.rows = (testing.data$spam == "spam")
> predict(spam.bag, newdata=testing.data[testing.spam.rows,])$error
[1] 0.1467577

That is, first we check which rows of the testing data have the true class "spam". (Note that testing.spam.rows is a Boolean vector of length 2300.) Then we predict only on those data points; all the errors there are false negatives. This gives an error rate of 14.7%.

(b) What is its rate of false positives? That is, what fraction of the genuine e-mails in the testing set did it classify as spam?

Answer: We can re-use the same trick:

> predict(spam.bag, newdata=testing.data[!testing.spam.rows,])$error
[1] 0.03166784

Notice that we need to logically negate testing.spam.rows: we want the rows which aren't spam. This gives an error rate of 3.2%.

Quick sanity check: the two error rates, weighted by the class proportions, should add up to the overall error rate. And indeed, 0.147 * 0.38 + 0.032 * (1 - 0.38) ≈ 0.076, as it should.
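The sanity check is easy to reproduce numerically; a minimal sketch using the error rates and the spam fraction quoted above:

```r
# Error rates reported above for the bagged ensemble on the testing data
fnr    <- 0.147   # false negative rate, P(predict non-spam | spam)
fpr    <- 0.032   # false positive rate, P(predict spam | non-spam)
p.spam <- 0.38    # fraction of the testing set which is spam

# Overall error decomposes as FNR * P(spam) + FPR * P(non-spam)
overall <- fnr * p.spam + fpr * (1 - p.spam)
round(overall, 3)   # 0.076, matching the reported test-set error rate
```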

plot(1:57,spam.boost$importance,xlab="feature",ylab="importance")

Figure 6: Importance of variables according to boosting.

> plot(1:57,spam.boost$importance,xlab="feature",ylab="importance")
> points(1:57,spam.bag$importance,pch=20)

Figure 7: Variable importance from boosting (open circles) compared to bagging (filled circles). Boosting gives more importance to a smaller number of variables.

In this case, the predict method for bagging not only returns the error rate, it will actually return the confusion matrix, the contingency table of predicted versus observed classes:

> predict(spam.bag, newdata=testing.data)$confusion
                Observed Class
Predicted Class        spam
                 1376   129
           spam    45   750

From this we can calculate that the false positive rate is 45/(45 + 1376) = 3.2%, and the false negative rate is 129/(129 + 750) = 14.7%, as we saw above.

(c) What fraction of the e-mails it classified as spam were actually spam?

Answer: From the confusion matrix, 750/(750 + 45) = 94%. Without the confusion matrix, we use Bayes's rule. (To save space, below +1 stands for the class "spam" and -1 for the non-spam class.)

Pr(Y = +1 | Ŷ = +1)
  = Pr(Ŷ = +1 | Y = +1) Pr(Y = +1) / [ Pr(Ŷ = +1 | Y = +1) Pr(Y = +1) + Pr(Ŷ = +1 | Y = -1) Pr(Y = -1) ]
  = (1 - 0.147)(0.38) / [ (1 - 0.147)(0.38) + (0.032)(1 - 0.38) ]
  = 0.94

The confusion matrix is easier.
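All of these rates can be recomputed directly from the confusion counts; here is a small sketch in base R using the counts quoted above (the variable names are mine, not from the adabag output):

```r
# Confusion counts for the bagged ensemble on the testing data,
# taken from the table above
true.neg  <- 1376   # predicted non-spam, actually non-spam
false.neg <- 129    # predicted non-spam, actually spam
false.pos <- 45     # predicted spam, actually non-spam
true.pos  <- 750    # predicted spam, actually spam

fpr <- false.pos / (false.pos + true.neg)       # false positive rate
fnr <- false.neg / (false.neg + true.pos)       # false negative rate
precision <- true.pos / (true.pos + false.pos)  # fraction of predicted
                                                # spam that is spam
round(c(fpr=fpr, fnr=fnr, precision=precision), 3)  # 0.032, 0.147, 0.943

# Bayes's rule with the 0.38 base rate recovers (approximately)
# the same fraction of predicted spam that is actually spam
p <- 0.38
round((1 - fnr) * p / ((1 - fnr) * p + fpr * (1 - p)), 2)  # 0.94
```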