# Homework Assignment 7

Save this PDF as:

Size: px
Start display at page:

## Transcription

1 Homework Assignment , Data Mining Solutions 1. Base rates (10 points) (a) What fraction of the s are actually spam? Answer: 39%. > sum(spam\$spam=="spam") [1] 1813 > 1813/nrow(spam) [1] (b) What should the constant classifier predict? Answer:, since most letters are not spam. (c) What is the error rate of the constant classifier? Answer: 39% : > sum(spam\$spam!="")/nrow(spam) [1] Training and testing (10 points) Divide the data set at random into a training set of 2301 rows and a testing set of 2300 rows. Check that the two halves do not overlap, and that they have the right number of rows. What fraction of each half is spam? (Do not hand in a list of 2301 row numbers.) Include your code for all of this. Answer: This is like the training/testing divide from the last homework. > training.rows = sample(1:nrow(spam),2301,replace=false) > training.data = spam[training.rows,] > testing.data = spam[-training.rows,] > dim(training.data) [1] > dim(testing.data) [1] > intersect(rownames(training.data),rownames(testing.data)) character(0) > sum(training.data\$spam=="spam")/2301 [1]

2 > sum(testing.data\$spam=="spam")/2300 [1] In other words, the training data is 41% spam, and the testing data is 38%. This is not statistically significant. 3. Fitting (40 points) (a) Fit a prototype linear classifier to the training data, and report its error rate on the testing data. Answer: I ll re-use the code from last time. (See those solutions.) > prototype.predictions = prototype.classifier(newdata=testing.data[,-58], examples.inputs = training.data[,-58], examples.labels = training.data\$spam) > sum(prototype.predictions!= testing.data\$spam)/nrow(testing.data) [1] The error rate on the testing data is 32%. (b) Fit a classification tree to the training data. Prune the tree by crossvalidation (see below). Include a plot of the CV error versus tree size, a plot of the best tree, and its error rate on the testing data. Which variables appear in the tree? > spam.tr = tree(spam ~., data=training.data) > summary(spam.tr) Classification tree: tree(formula = spam ~., data = training.data) Variables actually used in tree construction: [1] "A.53" "A.7" "A.52" "A.25" "A.5" "A.56" "A.55" "A.16" "A.27" Number of terminal nodes: 12 Residual mean deviance: = 1068 / 2289 Misclassification error rate: = 185 / 2301 (Notice that summary tells us which variables appear in the tree.) Using. on the right-hand side of the formula means include every variable from the data frame other than the response. Using cross-validation to prune, and plotting error versus size: spam.tr.cv = cv.tree(spam.tr,method="misclass") plot(spam.tr.cv) The best tree is, in this case, the largest tree found. In part this is because the default tree-fitting algorithm includes some sensible stopping rules. plot(spam.tr) The error rate on the testing data: 2

3 Inf misclass size Figure 1: Mis-classification rate versus tree size for prunings of the default tree. (The upper horizontal axis shows the values of the error/size trade-off parameter which give us a tree with a given number of nodes.) 3

4 A.53 < A.7 < A.25 < A.52 < spam spam A.52 < A.27 < 0.14 spam A.25 < 0.03 A.5 < 0.01 A.56 < 10.5 spam A.55 < A.16 < 0.04 spam spam Figure 2: The default tree. 4

5 > spam.tr.predctions = predict(spam.tr,newdata=testing.data,type="class") > sum(spam.tr.predictions!= testing.data\$spam)/nrow(testing.data) [1] (This could be done in one line, but it looks uglier.) This rate is 10%; the difference between this and the 8% rate on the training data is statistically significant, but not substantively huge. Things look a bit different if we start by fitting a very big tree. > spam.tr.big = tree(spam ~., data=training.data,minsize=1,mindev=0) > summary(spam.tr.big) Classification tree: tree(formula = spam ~., data = training.data, minsize = 1, mindev = 0) Variables actually used in tree construction: [1] "A.53" "A.7" "A.52" "A.25" "A.5" "A.27" "A.45" "A.56" "A.55" "A.11" "A.20" "A.3" [13] "A.19" "A.57" "A.46" "A.16" "A.12" "A.49" "A.21" "A.22" "A.29" "A.14" "A.28" "A.8" [25] "A.9" "A.15" "A.24" "A.1" "A.44" "A.10" "A.13" "A.33" "A.39" "A.2" "A.18" "A.37" Number of terminal nodes: 125 Residual mean deviance: = / 2176 Misclassification error rate: = 1 / 2301 minsize is the minimum number of observations to allow in a node, and mindev is the minimum improvement in the error needed to create a node. This tells the tree-growing algorithm to ignore such limits. Plotting the tree gives us a pretty picture, but trying to label all the nodes seems foolish. To get the best size, we look for the point where the error rate bottoms out. Unfortunately, cv.tree lists sizes in decreasing order, and which.min returns the first match. Use rev to reverse vectors. > rev(spam.tr.big.cv\$size)[which.min(rev(spam.tr.big.cv\$dev))] [1] 19 (Why do we need to use rev twice?) spam.tr.pruned = prune.tree(spam.tr.big,best=19) > summary(spam.tr.pruned) Classification tree: snip.tree(tree = spam.tr.big, nodes = c(27, 129, 14, 10, 37, 135, 66, 17, 49, 19, 26, 48, 25, 36, 128, 134)) Variables actually used in tree construction: [1] "A.53" "A.7" "A.52" "A.25" "A.5" "A.27" "A.45" "A.56" "A.46" "A.55" "A.16" Number of terminal nodes: 19 Residual mean deviance: = /

6 Figure 3: Maximal tree obtained from the spam data. 6

7 Misclassification error rate: = 147 / 2301 Much smaller than the full tree, this can actually be plotted and labeled. The error rate on the testing data (9.1%) is a bit better than that of the slightly smaller tree we got at first, but not hugely: > spam.pruned.preds = predict(spam.tr.pruned,newdata=testing.data,type="class") > sum(spam.pruned.preds!= testing.data\$spam)/2300 [1] Whether the extra 1% of error is worth having somewhat fewer leaves is up to you. (c) Use bagging to fit an ensemble of 100 trees to the training data. Report the error rate of the ensemble on the testing data. Include a plot of the importance of the variables, according to the ensemble. Answer: > spam.bag = bagging(spam ~., data = training.data, minsplit=1, cp = 0) > summary(spam.bag) Length Class Mode formula 3 formula call trees 100 -none- list votes none- numeric class none- character samples none- numeric importance 57 -none- numeric With bagging, we ll not worry about individual trees being too overfit; rather we ll let the averaging take care of that for us. (In fact, here if I use the default control settings bagging sometimes returns the same tree on every bootstrap sample!) The summary method, evidently, is not too informative. In part however this is because there are 100 separate trees to keep track of! The in-sample and out-of-sample error rates are 5.9% and 7.6%: > sum(spam.bag\$class!= training.data\$spam)/nrow(training.data) [1] > predict(spam.bag,newdata=testing.data)\$error [1] (d) Use boosting to fit an ensemble of 100 trees to the training data. Report the error rate of the ensemble on the testing data. Include a plot of the importance of the variables. Answer: > spam.boost = adaboost.m1(spam ~., data = training.data, boos=false, minsplit=1, cp =0, maxdepth=5) > summary(spam.boost) 7

8 A.53 < A.7 < A.25 < A.5 < 0.06 A.55 < spam A.52 < A.46 < 0.2 spam spam A.7 < spam A.52 < A.27 < 0.14 spam A.25 < 0.03 A.55 < A.27 < A.45 < 0.04 A.5 < 0.01 A.56 < 10.5 A.46 < spam A.16 < 0.04 spam spam plot(spam.tr.pruned) text(spam.tr.pruned,cex=0.5) Figure 4: Tree obtained by starting from the maximal tree and using crossvalidation to prune. 8

9 Importance Feature plot(1:57,spam.bag\$importance,xlab="feature", ylab="importance") Figure 5: Importance of the variables according to bagging 9

10 Length Class Mode formula 3 formula call trees 100 -none- list weights 100 -none- numeric votes none- numeric class none- character importance 57 -none- numeric The error rate on the training data is 9%, and on the testing data it s 11%: > sum(spam.boost\$class!= training.data\$spam)/nrow(training.data) [1] > predict(spam.boost,newdata=testing.data)\$error [1] The importance plot: For reference, here are the two importance plots comapred: (e) Which (if any) of these methods out-performs the constant classifier? Answer: They all do. 4. Errors (20 points) Pick the prediction method from the previous problem with the lowest error rate. This was the bagged trees, at 7.6% mis-classification. (a) What is its rate of false negatives? That is, what fraction of the spam s in the training set did it not classify as spam? Answer: > testing.spam.rows = (testing.data\$spam == "spam") > predict(spam.bag, newdata=testing.data[testing.spam.rows,])\$error [1] That is, first we check which rows of the testing data have the true class of spam. (Note that testing.spam.rows is a Boolean vector, of length 2300.) Then we then predict only on those data points; all the errors then are false negatives. This gives an error rate of 14.7%. (b) What is its rate of false positives? That is, what fraction of the genuine s in the testing set did it classify as spam? Answer: We can re-use the same trick: > predict(spam.bag, newdata=testing.data[!testing.spam.rows,])\$error [1] Notice that we need to logically negate testing.spam.rows we want the rows which aren t spam. This gives an error rate of 3.2%. Quick sanity check: the two error rates together should add up to the over-all error rate. And, indeed, (1 0.38) = 0.076, as it should. 10

11 Importance Feature plot(1:57,spam.boost\$importance,xlab="feature", ylab="importance") Figure 6: Importance of variables according to boosting. 11

12 Importance Feature > plot(1:57,spam.boost\$importance,xlab="feature", ylab="importance") > points(1:57,spam.bag\$importance,pch=20) Figure 7: Variable importance from boosting (open circles) compared to bagging (filled circles). Boosting gives more importance to a smaller number of variables. 12

13 In this case, the predict method for bagging not only returns the error rate, it will actually return the confusion matrix, the contigency table of actual versus observed classes: > predict(spam.bag, newdata=testing.data)\$confusion Observed Class Predicted Class spam spam From this we can calculate that the false positive rate is 45/( ) = 3.2%, and the false negative rate is 129/( ) = 14.7%, as we saw above. (c) What fraction of s it classified as spam were actually spam? Answer: From the confusion matrix, 750/( ) = 94%. Without the confusion matrix, we use Bayes s rule. (To save space, below +1 stands for the class spam, and 1 for the class.) ) (Y = +1 Ŷ = +1 Pr = Pr = ) Pr (Ŷ = +1 Y = +1 (Ŷ = +1 Y = +1 ) Pr (Y = +1) + Pr ( )(0.38) ( )(0.38) + (0.032)(1 0.38) = 0.94 The confusion matrix is easier. Pr (Y = +1) ) (Ŷ = +1 Y = 1 Pr (Y = 1) 13

### Statistical Data Mining. Practical Assignment 3 Discriminant Analysis and Decision Trees

Statistical Data Mining Practical Assignment 3 Discriminant Analysis and Decision Trees In this practical we discuss linear and quadratic discriminant analysis and tree-based classification techniques.

### Lecture 10: Regression Trees

Lecture 10: Regression Trees 36-350: Data Mining October 11, 2006 Reading: Textbook, sections 5.2 and 10.5. The next three lectures are going to be about a particular kind of nonlinear predictive model,

### Data Mining. Nonlinear Classification

Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Nonlinear Classification Classes may not be separable by a linear boundary Suppose we randomly generate a data set as follows: X has range between 0 to 15

### We discuss 2 resampling methods in this chapter - cross-validation - the bootstrap

Statistical Learning: Chapter 5 Resampling methods (Cross-validation and bootstrap) (Note: prior to these notes, we'll discuss a modification of an earlier train/test experiment from Ch 2) We discuss 2

### Data Mining Practical Machine Learning Tools and Techniques

Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea

### Knowledge Discovery and Data Mining. Bootstrap review. Bagging Important Concepts. Notes. Lecture 19 - Bagging. Tom Kelsey. Notes

Knowledge Discovery and Data Mining Lecture 19 - Bagging Tom Kelsey School of Computer Science University of St Andrews http://tom.host.cs.st-andrews.ac.uk twk@st-andrews.ac.uk Tom Kelsey ID5059-19-B &

### Chapter 6. The stacking ensemble approach

82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described

### UNDERSTANDING THE EFFECTIVENESS OF BANK DIRECT MARKETING Tarun Gupta, Tong Xia and Diana Lee

UNDERSTANDING THE EFFECTIVENESS OF BANK DIRECT MARKETING Tarun Gupta, Tong Xia and Diana Lee 1. Introduction There are two main approaches for companies to promote their products / services: through mass

### Model Combination. 24 Novembre 2009

Model Combination 24 Novembre 2009 Datamining 1 2009-2010 Plan 1 Principles of model combination 2 Resampling methods Bagging Random Forests Boosting 3 Hybrid methods Stacking Generic algorithm for mulistrategy

### Classification and Regression Trees

Classification and Regression Trees 36-350, Data Mining 6 November 2009 Contents 1 Prediction Trees 1 2 Regression Trees 4 2.1 Example: California Real Estate Again............... 4 2.2 Regression Tree

### Introduction to Machine Learning Lecture 1. Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu

Introduction to Machine Learning Lecture 1 Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Introduction Logistics Prerequisites: basics concepts needed in probability and statistics

### Question 2 Naïve Bayes (16 points)

Question 2 Naïve Bayes (16 points) About 2/3 of your email is spam so you downloaded an open source spam filter based on word occurrences that uses the Naive Bayes classifier. Assume you collected the

### Fraud Detection for Online Retail using Random Forests

Fraud Detection for Online Retail using Random Forests Eric Altendorf, Peter Brende, Josh Daniel, Laurent Lessard Abstract As online commerce becomes more common, fraud is an increasingly important concern.

### !"!!"#\$\$%&'()*+\$(,%!"#\$%\$&'()*""%(+,'-*&./#-\$&'(-&(0*".\$#-\$1"(2&."3\$'45"

!"!!"#\$\$%&'()*+\$(,%!"#\$%\$&'()*""%(+,'-*&./#-\$&'(-&(0*".\$#-\$1"(2&."3\$'45"!"#"\$%&#'()*+',\$\$-.&#',/"-0%.12'32./4'5,5'6/%&)\$).2&'7./&)8'5,5'9/2%.%3%&8':")08';:

### Using multiple models: Bagging, Boosting, Ensembles, Forests

Using multiple models: Bagging, Boosting, Ensembles, Forests Bagging Combining predictions from multiple models Different models obtained from bootstrap samples of training data Average predictions or

### Classification and regression trees

Classification and regression trees December 9 Introduction We ve seen that local methods and splines both operate by partitioning the sample space of the regression variable(s), and then fitting separate/piecewise

### Data Mining with R. December 13, 2008

Data Mining with R John Maindonald (Centre for Mathematics and Its Applications, Australian National University) and Yihui Xie (School of Statistics, Renmin University of China) December 13, 2008 Data

### Classification: Basic Concepts, Decision Trees, and Model Evaluation. General Approach for Building Classification Model

10 10 Classification: Basic Concepts, Decision Trees, and Model Evaluation Dr. Hui Xiong Rutgers University Introduction to Data Mining 1//009 1 General Approach for Building Classification Model Tid Attrib1

### Machine Learning Final Project Spam Email Filtering

Machine Learning Final Project Spam Email Filtering March 2013 Shahar Yifrah Guy Lev Table of Content 1. OVERVIEW... 3 2. DATASET... 3 2.1 SOURCE... 3 2.2 CREATION OF TRAINING AND TEST SETS... 4 2.3 FEATURE

### Data Mining - Evaluation of Classifiers

Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010

### Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C

### CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore.

CI6227: Data Mining Lesson 11b: Ensemble Learning Sinno Jialin PAN Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore Acknowledgements: slides are adapted from the lecture notes

### Neural Networks & Boosting

Neural Networks & Boosting Bob Stine Dept of Statistics, School University of Pennsylvania Questions How is logistic regression different from OLS? Logistic mean function for probabilities Larger weight

### Experiments in Web Page Classification for Semantic Web

Experiments in Web Page Classification for Semantic Web Asad Satti, Nick Cercone, Vlado Kešelj Faculty of Computer Science, Dalhousie University E-mail: {rashid,nick,vlado}@cs.dal.ca Abstract We address

### 1 Maximum likelihood estimation

COS 424: Interacting with Data Lecturer: David Blei Lecture #4 Scribes: Wei Ho, Michael Ye February 14, 2008 1 Maximum likelihood estimation 1.1 MLE of a Bernoulli random variable (coin flips) Given N

### Classification/Decision Trees (II)

Classification/Decision Trees (II) Department of Statistics The Pennsylvania State University Email: jiali@stat.psu.edu Right Sized Trees Let the expected misclassification rate of a tree T be R (T ).

### A Lightweight Solution to the Educational Data Mining Challenge

A Lightweight Solution to the Educational Data Mining Challenge Kun Liu Yan Xing Faculty of Automation Guangdong University of Technology Guangzhou, 510090, China catch0327@yahoo.com yanxing@gdut.edu.cn

### Section 6: Model Selection, Logistic Regression and more...

Section 6: Model Selection, Logistic Regression and more... Carlos M. Carvalho The University of Texas McCombs School of Business http://faculty.mccombs.utexas.edu/carlos.carvalho/teaching/ 1 Model Building

### Local classification and local likelihoods

Local classification and local likelihoods November 18 k-nearest neighbors The idea of local regression can be extended to classification as well The simplest way of doing so is called nearest neighbor

### Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris

Class #6: Non-linear classification ML4Bio 2012 February 17 th, 2012 Quaid Morris 1 Module #: Title of Module 2 Review Overview Linear separability Non-linear classification Linear Support Vector Machines

### Classification and Regression by randomforest

Vol. 2/3, December 02 18 Classification and Regression by randomforest Andy Liaw and Matthew Wiener Introduction Recently there has been a lot of interest in ensemble learning methods that generate many

### Leveraging Ensemble Models in SAS Enterprise Miner

ABSTRACT Paper SAS133-2014 Leveraging Ensemble Models in SAS Enterprise Miner Miguel Maldonado, Jared Dean, Wendy Czika, and Susan Haller SAS Institute Inc. Ensemble models combine two or more models to

### Evaluation & Validation: Credibility: Evaluating what has been learned

Evaluation & Validation: Credibility: Evaluating what has been learned How predictive is a learned model? How can we evaluate a model Test the model Statistical tests Considerations in evaluating a Model

### Author Gender Identification of English Novels

Author Gender Identification of English Novels Joseph Baena and Catherine Chen December 13, 2013 1 Introduction Machine learning algorithms have long been used in studies of authorship, particularly in

### Supervised Learning (Big Data Analytics)

Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used

### Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Unit # 11 Sajjad Haider Fall 2013 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right

### CS570 Data Mining Classification: Ensemble Methods

CS570 Data Mining Classification: Ensemble Methods Cengiz Günay Dept. Math & CS, Emory University Fall 2013 Some slides courtesy of Han-Kamber-Pei, Tan et al., and Li Xiong Günay (Emory) Classification:

### Decompose Error Rate into components, some of which can be measured on unlabeled data

Bias-Variance Theory Decompose Error Rate into components, some of which can be measured on unlabeled data Bias-Variance Decomposition for Regression Bias-Variance Decomposition for Classification Bias-Variance

### If A is divided by B the result is 2/3. If B is divided by C the result is 4/7. What is the result if A is divided by C?

Problem 3 If A is divided by B the result is 2/3. If B is divided by C the result is 4/7. What is the result if A is divided by C? Suggested Questions to ask students about Problem 3 The key to this question

### 2/2/2011. Data source. Define a new data source. Connect to the Iris database in SQL Server Impersonation Information. Default option D B M G

Data Mining in SQL Server 2005 SQL Server 2005 Data Mining structures SQL Server 2005 mining structures created and managed using Visual Studio 2005 Business Intelligence Projects Analysis Services Project

### Mining the Software Change Repository of a Legacy Telephony System

Mining the Software Change Repository of a Legacy Telephony System Jelber Sayyad Shirabad, Timothy C. Lethbridge, Stan Matwin School of Information Technology and Engineering University of Ottawa, Ottawa,

### Introduction to Machine Learning and Data Mining. Prof. Dr. Igor Trajkovski trajkovski@nyus.edu.mk

Introduction to Machine Learning and Data Mining Prof. Dr. Igor Trajkovski trajkovski@nyus.edu.mk Ensembles 2 Learning Ensembles Learn multiple alternative definitions of a concept using different training

### M1 in Economics and Economics and Statistics Applied multivariate Analysis - Big data analytics Worksheet 3 - Random Forest

Nathalie Villa-Vialaneix Année 2014/2015 M1 in Economics and Economics and Statistics Applied multivariate Analysis - Big data analytics Worksheet 3 - Random Forest This worksheet s aim is to learn how

### Machine Learning. Chapter 18, 21. Some material adopted from notes by Chuck Dyer

Machine Learning Chapter 18, 21 Some material adopted from notes by Chuck Dyer What is learning? Learning denotes changes in a system that... enable a system to do the same task more efficiently the next

### Towards better accuracy for Spam predictions

Towards better accuracy for Spam predictions Chengyan Zhao Department of Computer Science University of Toronto Toronto, Ontario, Canada M5S 2E4 czhao@cs.toronto.edu Abstract Spam identification is crucial

### Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05

Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification

### Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Evaluating the Accuracy of a Classifier Holdout, random subsampling, crossvalidation, and the bootstrap are common techniques for

### Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Unit # 10 Sajjad Haider Fall 2012 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right

### Data Mining Methods: Applications for Institutional Research

Data Mining Methods: Applications for Institutional Research Nora Galambos, PhD Office of Institutional Research, Planning & Effectiveness Stony Brook University NEAIR Annual Conference Philadelphia 2014

### CART 6.0 Feature Matrix

CART 6.0 Feature Matri Enhanced Descriptive Statistics Full summary statistics Brief summary statistics Stratified summary statistics Charts and histograms Improved User Interface New setup activity window

### Gerry Hobbs, Department of Statistics, West Virginia University

Decision Trees as a Predictive Modeling Method Gerry Hobbs, Department of Statistics, West Virginia University Abstract Predictive modeling has become an important area of interest in tasks such as credit

### Supervised Feature Selection & Unsupervised Dimensionality Reduction

Supervised Feature Selection & Unsupervised Dimensionality Reduction Feature Subset Selection Supervised: class labels are given Select a subset of the problem features Why? Redundant features much or

### Classification of Bad Accounts in Credit Card Industry

Classification of Bad Accounts in Credit Card Industry Chengwei Yuan December 12, 2014 Introduction Risk management is critical for a credit card company to survive in such competing industry. In addition

### COC131 Data Mining - Clustering

COC131 Data Mining - Clustering Martin D. Sykora m.d.sykora@lboro.ac.uk Tutorial 05, Friday 20th March 2009 1. Fire up Weka (Waikako Environment for Knowledge Analysis) software, launch the explorer window

### Data Mining with R. Decision Trees and Random Forests. Hugh Murrell

Data Mining with R Decision Trees and Random Forests Hugh Murrell reference books These slides are based on a book by Graham Williams: Data Mining with Rattle and R, The Art of Excavating Data for Knowledge

### Overview. Evaluation Connectionist and Statistical Language Processing. Test and Validation Set. Training and Test Set

Overview Evaluation Connectionist and Statistical Language Processing Frank Keller keller@coli.uni-sb.de Computerlinguistik Universität des Saarlandes training set, validation set, test set holdout, stratification

### Lab 13: Logistic Regression

Lab 13: Logistic Regression Spam Emails Today we will be working with a corpus of emails received by a single gmail account over the first three months of 2012. Just like any other email address this account

### Distances, Clustering, and Classification. Heatmaps

Distances, Clustering, and Classification Heatmaps 1 Distance Clustering organizes things that are close into groups What does it mean for two genes to be close? What does it mean for two samples to be

### Fast Analytics on Big Data with H20

Fast Analytics on Big Data with H20 0xdata.com, h2o.ai Tomas Nykodym, Petr Maj Team About H2O and 0xdata H2O is a platform for distributed in memory predictive analytics and machine learning Pure Java,

### Predictive Data modeling for health care: Comparative performance study of different prediction models

Predictive Data modeling for health care: Comparative performance study of different prediction models Shivanand Hiremath hiremat.nitie@gmail.com National Institute of Industrial Engineering (NITIE) Vihar

### The Taxman Game. Robert K. Moniot September 5, 2003

The Taxman Game Robert K. Moniot September 5, 2003 1 Introduction Want to know how to beat the taxman? Legally, that is? Read on, and we will explore this cute little mathematical game. The taxman game

### Decision Trees from large Databases: SLIQ

Decision Trees from large Databases: SLIQ C4.5 often iterates over the training set How often? If the training set does not fit into main memory, swapping makes C4.5 unpractical! SLIQ: Sort the values

### Lecture Slides for INTRODUCTION TO. Machine Learning. By: Postedited by: R.

Lecture Slides for INTRODUCTION TO Machine Learning By: alpaydin@boun.edu.tr http://www.cmpe.boun.edu.tr/~ethem/i2ml Postedited by: R. Basili Learning a Class from Examples Class C of a family car Prediction:

### Data mining techniques: decision trees

Data mining techniques: decision trees 1/39 Agenda Rule systems Building rule systems vs rule systems Quick reference 2/39 1 Agenda Rule systems Building rule systems vs rule systems Quick reference 3/39

### 1. How different is the t distribution from the normal?

Statistics 101 106 Lecture 7 (20 October 98) c David Pollard Page 1 Read M&M 7.1 and 7.2, ignoring starred parts. Reread M&M 3.2. The effects of estimated variances on normal approximations. t-distributions.

### Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification

Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification Tina R. Patil, Mrs. S. S. Sherekar Sant Gadgebaba Amravati University, Amravati tnpatil2@gmail.com, ss_sherekar@rediffmail.com

### DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES

DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES Vijayalakshmi Mahanra Rao 1, Yashwant Prasad Singh 2 Multimedia University, Cyberjaya, MALAYSIA 1 lakshmi.mahanra@gmail.com

### Classification and Regression Trees (CART) Theory and Applications

Classification and Regression Trees (CART) Theory and Applications A Master Thesis Presented by Roman Timofeev (188778) to Prof. Dr. Wolfgang Härdle CASE - Center of Applied Statistics and Economics Humboldt

### Predicting the Stock Market with News Articles

Predicting the Stock Market with News Articles Kari Lee and Ryan Timmons CS224N Final Project Introduction Stock market prediction is an area of extreme importance to an entire industry. Stock price is

### Sharing Classifiers among Ensembles from Related Problem Domains

Sharing Classifiers among Ensembles from Related Problem Domains Yi Zhang yi-zhang-2@uiowa.edu W. Nick Street nick-street@uiowa.edu Samuel Burer samuel-burer@uiowa.edu Department of Management Sciences,

### Data Mining Lab 5: Introduction to Neural Networks

Data Mining Lab 5: Introduction to Neural Networks 1 Introduction In this lab we are going to have a look at some very basic neural networks on a new data set which relates various covariates about cheese

### Azure Machine Learning, SQL Data Mining and R

Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:

### Updates to Graphing with Excel

Updates to Graphing with Excel NCC has recently upgraded to a new version of the Microsoft Office suite of programs. As such, many of the directions in the Biology Student Handbook for how to graph with

### Free Pre-Algebra Lesson 8 page 1

Free Pre-Algebra Lesson 8 page 1 Lesson 8 Factor Pairs Measuring more accurately requires breaking our inches into fractions of an inch, little parts smaller than a whole inch. You can think ahead and

### Package trimtrees. February 20, 2015

Package trimtrees February 20, 2015 Type Package Title Trimmed opinion pools of trees in a random forest Version 1.2 Date 2014-08-1 Depends R (>= 2.5.0),stats,randomForest,mlbench Author Yael Grushka-Cockayne,

### Chapter 12 Bagging and Random Forests

Chapter 12 Bagging and Random Forests Xiaogang Su Department of Statistics and Actuarial Science University of Central Florida - 1 - Outline A brief introduction to the bootstrap Bagging: basic concepts

### Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be

### A crash course in probability and Naïve Bayes classification

Probability theory A crash course in probability and Naïve Bayes classification Chapter 9 Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. s: A person s

### Social Media Mining. Data Mining Essentials

Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

### Unit 4 DECISION ANALYSIS. Lesson 37. Decision Theory and Decision Trees. Learning objectives:

Unit 4 DECISION ANALYSIS Lesson 37 Learning objectives: To learn how to use decision trees. To structure complex decision making problems. To analyze the above problems. To find out limitations & advantages

### Email Spam Detection A Machine Learning Approach

Email Spam Detection A Machine Learning Approach Ge Song, Lauren Steimle ABSTRACT Machine learning is a branch of artificial intelligence concerned with the creation and study of systems that can learn

### Big Data Decision Trees with R

REVOLUTION ANALYTICS WHITE PAPER Big Data Decision Trees with R By Richard Calaway, Lee Edlefsen, and Lixin Gong Fast, Scalable, Distributable Decision Trees Revolution Analytics RevoScaleR package provides

### Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets http://info.salford-systems.com/jsm-2015-ctw August 2015 Salford Systems Course Outline Demonstration of two classification

### SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING

AAS 07-228 SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING INTRODUCTION James G. Miller * Two historical uncorrelated track (UCT) processing approaches have been employed using general perturbations

### Active Learning SVM for Blogs recommendation

Active Learning SVM for Blogs recommendation Xin Guan Computer Science, George Mason University Ⅰ.Introduction In the DH Now website, they try to review a big amount of blogs and articles and find the

### On the application of multi-class classification in physical therapy recommendation

RESEARCH Open Access On the application of multi-class classification in physical therapy recommendation Jing Zhang 1,PengCao 1,DouglasPGross 2 and Osmar R Zaiane 1* Abstract Recommending optimal rehabilitation

### Identifying Market Price Levels using Differential Evolution

Identifying Market Price Levels using Differential Evolution Michael Mayo University of Waikato, Hamilton, New Zealand mmayo@waikato.ac.nz WWW home page: http://www.cs.waikato.ac.nz/~mmayo/ Abstract. Evolutionary

### Roulette Sampling for Cost-Sensitive Learning

Roulette Sampling for Cost-Sensitive Learning Victor S. Sheng and Charles X. Ling Department of Computer Science, University of Western Ontario, London, Ontario, Canada N6A 5B7 {ssheng,cling}@csd.uwo.ca

### Selecting Data Mining Model for Web Advertising in Virtual Communities

Selecting Data Mining for Web Advertising in Virtual Communities Jerzy Surma Faculty of Business Administration Warsaw School of Economics Warsaw, Poland e-mail: jerzy.surma@gmail.com Mariusz Łapczyński

### Keywords Data mining, Classification Algorithm, Decision tree, J48, Random forest, Random tree, LMT, WEKA 3.7. Fig.1. Data mining techniques.

International Journal of Emerging Research in Management &Technology Research Article October 2015 Comparative Study of Various Decision Tree Classification Algorithm Using WEKA Purva Sewaiwar, Kamal Kant

### Decision Support Systems

Decision Support Systems 50 (2011) 602 613 Contents lists available at ScienceDirect Decision Support Systems journal homepage: www.elsevier.com/locate/dss Data mining for credit card fraud: A comparative

### On Cross-Validation and Stacking: Building seemingly predictive models on random data

On Cross-Validation and Stacking: Building seemingly predictive models on random data ABSTRACT Claudia Perlich Media6 New York, NY 10012 claudia@media6degrees.com A number of times when using cross-validation

### Outlier Ensembles. Charu C. Aggarwal IBM T J Watson Research Center Yorktown, NY 10598. Keynote, Outlier Detection and Description Workshop, 2013

Charu C. Aggarwal IBM T J Watson Research Center Yorktown, NY 10598 Outlier Ensembles Keynote, Outlier Detection and Description Workshop, 2013 Based on the ACM SIGKDD Explorations Position Paper: Outlier

### Using Random Forest to Learn Imbalanced Data

Using Random Forest to Learn Imbalanced Data Chao Chen, chenchao@stat.berkeley.edu Department of Statistics,UC Berkeley Andy Liaw, andy liaw@merck.com Biometrics Research,Merck Research Labs Leo Breiman,

### Getting Even More Out of Ensemble Selection

Getting Even More Out of Ensemble Selection Quan Sun Department of Computer Science The University of Waikato Hamilton, New Zealand qs12@cs.waikato.ac.nz ABSTRACT Ensemble Selection uses forward stepwise

### Employer Health Insurance Premium Prediction Elliott Lui

Employer Health Insurance Premium Prediction Elliott Lui 1 Introduction The US spends 15.2% of its GDP on health care, more than any other country, and the cost of health insurance is rising faster than

### MAXIMIZING RETURN ON DIRECT MARKETING CAMPAIGNS

MAXIMIZING RETURN ON DIRET MARKETING AMPAIGNS IN OMMERIAL BANKING S 229 Project: Final Report Oleksandra Onosova INTRODUTION Recent innovations in cloud computing and unified communications have made a

### CNFSAT: Predictive Models, Dimensional Reduction, and Phase Transition

CNFSAT: Predictive Models, Dimensional Reduction, and Phase Transition Neil P. Slagle College of Computing Georgia Institute of Technology Atlanta, GA npslagle@gatech.edu Abstract CNFSAT embodies the P

### Data Mining Practical Machine Learning Tools and Techniques. Slides for Chapter 7 of Data Mining by I. H. Witten and E. Frank

Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 7 of Data Mining by I. H. Witten and E. Frank Engineering the input and output Attribute selection Scheme independent, scheme

### Artificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing and Developing E-mail Classifier

International Journal of Recent Technology and Engineering (IJRTE) ISSN: 2277-3878, Volume-1, Issue-6, January 2013 Artificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing