Nathalie Villa-Vialaneix, Year 2014/2015
M1 in Economics and Economics and Statistics
Applied Multivariate Analysis - Big data analytics

Worksheet 3 - Random Forest

This worksheet's aim is to learn how to train and use random forests with R. The method is implemented in the package randomForest, which has to be installed and loaded. The method is illustrated on examples taken from the UCI Machine Learning repository, which are available in the R package mlbench:

library(randomForest)
randomForest
Type rfnews() to see new features/changes/bug fixes.
library(mlbench)

Exercise 1 - Random forest: training, interpretation and tuning

This exercise illustrates the use of random forests to discriminate sonar signals bounced off a metal cylinder from those bounced off a roughly cylindrical rock. Data are contained in the dataset Sonar:

data(Sonar)

1. Using help(Sonar), read the description of the dataset. How many potential predictors are included in the dataset and what are their types? How many observations are included in the dataset?

Answer: The dataset contains 208 observations on 60 predictors, all of them numerical. The last variable of the dataset, Class, defines the class (M or R) of the observation, which is to be predicted.

2. Randomly split the data into a training and a test set of sizes 100 and 108, respectively.

Answer:
ind.test <- sample(1:208, 108, replace=FALSE)
sonar.train <- Sonar[-ind.test, ]
sonar.test <- Sonar[ind.test, ]

3. Use the function randomForest to train a random forest on the training set with default parameters (use the formula syntax described in help(randomForest) and the options keep.inbag=TRUE and importance=TRUE). What is the OOB error? Make a contingency table between OOB predictions and real classes.

Answer:
rf.sonar <- randomForest(Class~., data=sonar.train, keep.inbag=TRUE, importance=TRUE)
rf.sonar
Call:
 randomForest(formula = Class ~ ., data = sonar.train, keep.inbag = TRUE, importance = TRUE)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 7

        OOB estimate of error rate: 20%
Confusion matrix:
   M   R   class.error
M
R

4. Using the help, describe what is in predicted, err.rate, confusion, votes, oob.times, importance, ntree, mtry and inbag in the random forest object trained in the previous question.

Answer:

head(rf.sonar$predicted)
 M  R  R  R  R  R
Levels: M R

contains the OOB predictions of the random forest;

head(rf.sonar$err.rate)
      OOB   M   R
[1,]
[2,]
[3,]
[4,]
[5,]
[6,]

contains the evolution of the misclassification rate during the learning process, while the number of trees in the forest is increasing. The misclassification rate is given for each of the two classes and as an overall misclassification rate. It is estimated from OOB predictions;

rf.sonar$confusion
   M   R   class.error
M
R

contains the confusion matrix between OOB predictions and real classes;

head(rf.sonar$votes)
      M   R

contains the ratio of trees voting for each of the two classes for every observation (it is thus a matrix). The votes are based on the trees for which the observation is out of the bag;

head(rf.sonar$oob.times)
[1]

contains, for every observation, the number of trees for which the observation is out-of-bag;
head(rf.sonar$importance)
      M   R   MeanDecreaseAccuracy   MeanDecreaseGini
V1
V2
V3
V4
V5
V6

contains, for every variable, the importance of the variable, expressed as the mean decrease in accuracy in predicting each of the two classes, the overall mean decrease in accuracy, or another importance index, the mean decrease in the Gini index (not explained in the lesson);

rf.sonar$ntree; rf.sonar$mtry
[1] 500
[1] 7

respectively contain the number of trees in the forest and the number of variables used for each split (here equal to 7, the default floor(sqrt(60)) for classification);

head(rf.sonar$inbag[,1:5])
      [,1]  [,2]  [,3]  [,4]  [,5]
[1,]
[2,]
[3,]
[4,]
[5,]
[6,]

is a matrix which contains 1 each time an observation (row) is included in the bag of a tree (column).

5. What is the OOB prediction for the 50th observation in your training sample? What is the true class for this observation? How many times has it been included in a tree and in which trees has it been included? How many trees voted for the prediction R for this observation?

Answer: The OOB prediction for the 50th observation in the training sample is

rf.sonar$predicted[50]
 98
  R
Levels: M R

and the true class for this observation is

sonar.train$Class[50]
[1] M
Levels: M R

The observation has been included in

nb.oob.50 <- rf.sonar$oob.times[50]
rf.sonar$ntree - nb.oob.50
[1] 299

trees, which are the trees numbered:

which(rf.sonar$inbag[50, ]==1)
[1] [18] [35] [52] [69] [86] [103] [120] [137] [154] [171] [188] [205] [222] [239] [256] [273] [290]

For this observation, the number of (OOB) trees that have voted R is equal to

rf.sonar$votes[50, "R"] * nb.oob.50
[1] 132

(over 201 OOB trees for this observation).

6. Using the plot.randomForest function, make a plot that displays the evolution of the OOB misclassification rate when the number of trees in the forest increases. What is represented by the different colors? Add a legend and comment.

Answer:
plot(rf.sonar, main="Evolution of misclassification rates")
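For reference, essentially the same picture can be redrawn by hand from the err.rate matrix described above (a minimal sketch, assuming the rf.sonar object from the previous questions; plot.randomForest produces this plot internally, up to cosmetic details):

matplot(rf.sonar$err.rate, type="l", lty=1, col=c("black", "red", "green"),
        xlab="trees", ylab="misclassification rate",
        main="Evolution of misclassification rates")
# the column names of err.rate ("OOB", "M", "R") identify the three curves
legend("topright", col=c("black", "red", "green"), lty=1,
       legend=colnames(rf.sonar$err.rate))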
[Figure: "Evolution of misclassification rates" - error rate versus number of trees, three curves]

The three curves show the evolution of the overall misclassification rate, of the misclassification rate for the "R" class and of the misclassification rate for the "M" class, respectively.

tail(rf.sonar$err.rate)
        OOB   M   R
[495,]
[496,]
[497,]
[498,]
[499,]
[500,]

shows that the black line corresponds to the overall misclassification rate, the green one to the misclassification rate for "R" and the red one to the misclassification rate for "M". The legend can be added with:

plot(rf.sonar, main="Evolution of misclassification rates")
legend("topright", col=c("black", "red", "green"), lwd=2, lty=1,
       legend=c("overall", "class M", "class R"))
[Figure: "Evolution of misclassification rates" with legend: overall (black), class M (red), class R (green)]

The graphic shows that the three final misclassification rates are close, all equal to about 20%.

7. Make a barplot of the variable importances ranked in decreasing order. Comment.

Answer:
ordered.variables <- order(rf.sonar$importance[,3], decreasing=TRUE)
barplot(rf.sonar$importance[ordered.variables,3], col="orange", cex.names=0.7,
        main="Variable importance (decreasing order)", las=3)
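As an alternative to the manual barplot, the randomForest package also provides varImpPlot(), which sorts and displays the importance measures directly (a brief sketch, assuming rf.sonar from above; type=1 selects the mean decrease in accuracy and, by default, only the 30 highest-ranked variables are shown):

# dotchart of the permutation-based importance, sorted in decreasing order
varImpPlot(rf.sonar, type=1, main="Variable importance (mean decrease in accuracy)")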
[Figure: barplot "Variable importance (decreasing order)" over the 60 variables; the first (largest) bars are V11, V12, V49, V9, V27, ...]

Two variables have a particularly large importance in the accuracy of the forest (V11 and V12) and three others are also rather important (V49, V9 and V27).

8. With the function predict.randomForest, predict the output of the random forest on the test set and find the test misclassification rate. Comment.

Answer: The predictions of the random forest for the test set are found with:

pred.test <- predict(rf.sonar, sonar.test)

Hence, the misclassification rate for the test set is equal to:

sum(pred.test != sonar.test$Class)/length(pred.test)
[1]

which is smaller than the OOB misclassification rate and seems to show that the model does not overfit.

9. Use the package e1071 and the function tune.randomForest to find, with a 10-fold cross validation performed on the training set only, which pair of parameters is the best among ntree=c(500, 1000, 1500, 2000) and mtry=c(5, 7, 10, 15, 20). Set the option importance to TRUE to obtain the variable importances for the best model. Use the functions summary.tune and plot.tune to interpret the result.
library(e1071)

Answer: The tuning is performed with:

res.tune.rf <- tune.randomForest(sonar.train[,1:60], sonar.train[,61],
    mtry=c(5, 7, 10, 15, 20), ntree=c(500, 1000, 1500, 2000), importance=TRUE)

which can be analyzed with:

summary(res.tune.rf)

Parameter tuning of 'randomForest':

- sampling method: 10-fold cross validation

- best parameters:
 mtry ntree

- best performance:

- Detailed performance results:
   mtry ntree error dispersion

plot(res.tune.rf)
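The numbers behind summary() and plot() can also be read directly off the tune object (a minimal sketch; best.parameters, best.performance and performances are standard components of e1071 tune objects):

res.tune.rf$best.parameters    # mtry and ntree retained by the 10-fold CV
res.tune.rf$best.performance   # corresponding CV misclassification rate
head(res.tune.rf$performances) # full grid of (mtry, ntree, error, dispersion)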
[Figure: "Performance of randomForest" - CV error over the grid of mtry and ntree values]

The best parameters obtained by cross validation are a number of trees equal to 2000 and a number of variables used for every split equal to 15. The CV error is then a bit smaller (about 17%) than the OOB error obtained with the default parameters.

10. From the previous question, extract the best model (in the object obtained with the function tune.randomForest, it is in $best.model). Compare its OOB error and its evolution, its most important variables and its test error with the random forest obtained with default parameters. Comment.

Answer: The best model is obtained with

rf.sonar.2 <- res.tune.rf$best.model
rf.sonar.2

Call:
 best.randomForest(x = sonar.train[, 1:60], y = sonar.train[, 61], mtry = c(5, 7, 10, 15, 20
               Type of random forest: classification
                     Number of trees: 2000
No. of variables tried at each split: 15

        OOB estimate of error rate: 19%
Confusion matrix:
   M   R   class.error
M
R

The OOB error is thus equal to 19%, compared to 20% with the default parameters. The evolution of the OOB error is obtained with:

plot(rf.sonar.2)
legend("topright", col=c("black", "red", "green"), lwd=2, lty=1,
       legend=c("overall", "class M", "class R"))

[Figure: error evolution for rf.sonar.2, with legend: overall, class M, class R]

which gives results very similar to those of the random forest with default parameters. The important variables are given with the following barplot:

ordered.variables <- order(rf.sonar.2$importance[,3], decreasing=TRUE)
barplot(rf.sonar.2$importance[ordered.variables,3], col="orange", cex.names=0.7,
        main="Variable importance (decreasing order)", las=3)
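Before looking at the barplot, the agreement between the two forests' rankings can be checked textually (a small sketch, assuming rf.sonar and rf.sonar.2 from above; "MeanDecreaseAccuracy" is the importance column created when importance=TRUE):

# top 5 variables of each forest, by mean decrease in accuracy
top5 <- function(rf) head(rownames(rf$importance)[
  order(rf$importance[, "MeanDecreaseAccuracy"], decreasing=TRUE)], 5)
top5(rf.sonar)    # forest with default parameters
top5(rf.sonar.2)  # tuned forest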
[Figure: barplot "Variable importance (decreasing order)" for rf.sonar.2; the first (largest) bars are V11, V49, V12, V27, V9, ...]

The five most important variables are the same as the ones found in the previous forest (but not in the same order). Finally, the test error rate is obtained with:

pred.test.2 <- predict(rf.sonar.2, sonar.test)
sum(pred.test.2 != sonar.test$Class)/length(sonar.test$Class)
[1]

The test error rate is a bit larger than in the previous case, which is expected: the forest is based on trees that are more flexible (more variables used for the splits) and thus tends to overfit more than the previous model.

Exercise 2 - Comparison between random forest and CART bagging

This exercise's aim is to compare random forests and bagging of CART trees from an accuracy point of view. The notion of importance is also introduced for bagged CART. The comparison is illustrated on the dataset Vehicle

data(Vehicle)

(package mlbench), whose purpose is to classify a given silhouette as one of four types of vehicle, using a set of features extracted from the silhouette.

1. Using help(Vehicle), read the description of the dataset. How many potential predictors are included in the dataset and what are their types? How many observations are included in the dataset?

Answer: The dataset contains 846 observations on 18 predictors, all of them numerical. The last variable of the dataset, Class, defines the class (bus, opel, saab or van) of the observation, which is to be predicted.

2. Split the data set into a training and a test set, of respective sizes 400 and 446.

Answer: The data set is split into training and test sets with:

ind.test <- sample(1:nrow(Vehicle), 446, replace=FALSE)
vehicle.test <- Vehicle[ind.test, ]
vehicle.train <- Vehicle[-ind.test, ]

3. Using the function tune.randomForest, train a random forest with possible numbers of trees 500, 1000, 2000 and 3000 and possible numbers of variables selected at every split 4, 5, 6, 8, 10. Keep the variable importances and use the options xtest and ytest to obtain the test misclassification error directly. Analyze the results of the tuning process.

Warning: When using the options xtest and ytest, the forest is not kept unless keep.forest=TRUE is set, which means that it cannot then be used to make predictions on new data. To use it with the method tune, you must therefore set keep.forest=TRUE, otherwise the function returns an error while trying to compute the CV error.

Answer: The forest is tuned with the following command line:

vehicle.tune.rf <- tune.randomForest(vehicle.train[,1:18], vehicle.train[,19],
    xtest=vehicle.test[,1:18], ytest=vehicle.test[,19], mtry=c(4, 5, 6, 8, 10),
    ntree=c(500, 1000, 2000, 3000), importance=TRUE, keep.forest=TRUE)

The performances of the different pairs of parameters can be visualized with:

summary(vehicle.tune.rf)

Parameter tuning of 'randomForest':

- sampling method: 10-fold cross validation

- best parameters:
 mtry ntree

- best performance:

- Detailed performance results:
   mtry ntree error dispersion
plot(vehicle.tune.rf)

[Figure: "Performance of randomForest" - CV error over the grid of mtry and ntree values (ntree up to 3000)]

The CV misclassification errors range from ... to ..., with a minimum obtained for 1000 trees trained with 4 variables at each split.

4. Analyze the best forest obtained with the tuning process in the previous question in terms of out-of-bag error, test error and their evolutions. What are the most important variables in this model?

Answer: The best model obtained by the tuning process is extracted with:

vehicle.rf <- vehicle.tune.rf$best.model
vehicle.rf

Call:
 best.randomForest(x = vehicle.train[, 1:18], y = vehicle.train[, 19], mtry = c(4, 5, 6, 8, 1
               Type of random forest: classification
                     Number of trees: 1000
No. of variables tried at each split: 4

        OOB estimate of error rate: 28.5%
Confusion matrix:
      bus   opel   saab   van   class.error
bus
opel
saab
van

                Test set error rate: 25.56%
Confusion matrix:
      bus   opel   saab   van   class.error
bus
opel
saab
van

which shows that the OOB misclassification rate is about 28.5%, mainly due to a difficulty in discriminating between the classes opel and saab. The test misclassification error is slightly smaller (about 25.56%), with the same difficulty in discriminating opel and saab observations. The evolution of the misclassification errors is displayed using

plot(vehicle.rf, main="Error evolutions for the dataset 'Vehicle'")
legend(700, 0.45, col=1:5, lty=1, lwd=2, cex=0.8,
       legend=c("overall", levels(Vehicle$Class)))
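Because xtest and ytest were supplied during tuning, the test-set results are also stored inside the fitted forest; a small sketch of where to find them (the $test component is documented in help(randomForest), and vehicle.rf is the object defined above):

dim(vehicle.rf$test$err.rate)      # one row per tree: overall test error and per-class errors
tail(vehicle.rf$test$err.rate, 1)  # final test error rates
vehicle.rf$test$confusion          # test-set confusion matrix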
[Figure: "Error evolutions for the dataset 'Vehicle'", with legend: overall, bus, opel, saab, van]

This figure shows that the misclassification rate is stabilized (or even slightly increases) at the end of the training process (when the number of trees in the forest exceeds approximately 500). The two classes opel and saab exhibit the worst misclassification rates, as already explained. Finally, the most important variables are obtained with:

ordered.variables <- order(vehicle.rf$importance[,5], decreasing=TRUE)
barplot(vehicle.rf$importance[ordered.variables,5], col="orange", cex.names=0.7,
        main="Variable importance (decreasing order)", las=3)
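Since the bar labels can be hard to read with 18 variables, the same ranking can also be printed textually (a sketch using the vehicle.rf object above; column 5 of $importance is labelled "MeanDecreaseAccuracy"):

head(sort(vehicle.rf$importance[, "MeanDecreaseAccuracy"], decreasing=TRUE), 6)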
[Figure: barplot "Variable importance (decreasing order)"; variables ranked: Max.L.Ra, Sc.Var.maxis, Elong, Sc.Var.Maxis, D.Circ, Scat.Ra, Max.L.Rect, Holl.Ra, Comp, Pr.Axis.Rect, Skew.Maxis, Kurt.Maxis, Circ, Rad.Ra, Pr.Axis.Ra, Ra.Gyr, Skew.maxis, Kurt.maxis]

One variable (Max.L.Ra, the max.length aspect ratio) is more important than the others, and another set of 5 variables (Sc.Var.maxis, Elong, Sc.Var.Maxis, D.Circ and Scat.Ra) also has a large importance compared to the rest.

5. This question aims at building a simple bagging of trees to compare with the previous random forest. The packages boot and rpart will be used:

library(boot)
library(rpart)

To do so, answer the following questions:

(a) Define a function boot.vehicle <- function(x, d) { ... } that returns 3 values: the OOB misclassification rate for the classification tree trained with the subsample x[d, ]; the test misclassification rate for the same classification tree; the misclassification rates obtained with the same classification tree on OOB observations when one of the variables has been permuted (it is thus a vector of size 18, one value for each variable).
Answer:

boot.vehicle <- function(x, d) {
  inbag <- x[d, ]
  oobag <- x[-d, ]
  cur.tree <- rpart(Class~., data=inbag, method="class")
  # OOB error
  cur.oob <- predict(cur.tree, oobag)
  cur.oob <- sum(oobag$Class != colnames(cur.oob)[apply(cur.oob, 1, which.max)]) /
    nrow(cur.oob)
  # test error
  cur.test <- predict(cur.tree, vehicle.test)
  cur.test <- sum(vehicle.test$Class != colnames(cur.test)[apply(cur.test, 1, which.max)]) /
    nrow(cur.test)
  # OOB shuffled error
  shuffled.oob <- sapply(1:(ncol(oobag)-1), function(ind) {
    new.oob <- oobag
    new.oob[,ind] <- oobag[sample(1:nrow(oobag), nrow(oobag), replace=FALSE), ind]
    cur.oob <- predict(cur.tree, new.oob)
    cur.oob <- sum(oobag$Class != colnames(cur.oob)[apply(cur.oob, 1, which.max)]) /
      nrow(cur.oob)
    return(cur.oob)
  })
  return(c(cur.oob, cur.test, shuffled.oob))
}

(b) Use the function defined in the previous question with boot to obtain bootstrap estimates of the OOB misclassification rate (computed only from the OOB observations in the training data set) and of the test misclassification rate for a bagging of R trees, with R being the optimal number of trees found in question 3.

Answer: Results of the bagging approach are obtained from:

res.bagging <- boot(vehicle.train, boot.vehicle, R=vehicle.tune.rf$best.parameters$ntree)

which gives the following OOB misclassification rate:

mean(res.bagging$t[,1])
[1]

and the following test misclassification rate:

mean(res.bagging$t[,2])
[1]

(c) Use the result of the previous function to compute and represent the variable importances.

Answer: A variable's importance is computed as the mean, over the bootstrap samples, of the difference between the OOB misclassification rate obtained when that variable is permuted and the original OOB misclassification rate (i.e., the mean decrease in accuracy caused by permuting the variable).

difference.acc <- sweep(res.bagging$t[,3:20], 1, res.bagging$t[,1], "-")
cart.importance <- apply(difference.acc, 2, mean)
names(cart.importance) <- colnames(vehicle.test)[1:18]
ordered.variables <- order(cart.importance, decreasing=TRUE)
barplot(cart.importance[ordered.variables], col="orange", cex.names=0.7,
        main="Variable importance (decreasing order)", las=3)
[Figure: barplot "Variable importance (decreasing order)" for the bagging of trees; variables ranked: Max.L.Ra, Sc.Var.Maxis, Elong, Sc.Var.maxis, D.Circ, Comp, Scat.Ra, Holl.Ra, Max.L.Rect, Skew.maxis, Skew.Maxis, Circ, Kurt.maxis, Rad.Ra, Kurt.Maxis, Pr.Axis.Ra, Pr.Axis.Rect, Ra.Gyr]

One variable (Max.L.Ra, the max.length aspect ratio) is more important than the others, and another set of 3 variables (Sc.Var.maxis, Elong and Sc.Var.Maxis) also has a large importance compared to the rest.

6. Analyze the results (OOB and test errors, important variables) of the bagging of classification trees by comparing them with the ones obtained from a random forest.

Answer: Misclassification rates (for both OOB and test sets) are smaller with random forests than with bagging of trees. Bagging of trees even shows a tendency to overfit the data (with a larger test error than OOB error). The variable importances are consistent between the two methods, with a larger set of important variables for the random forest than for the bagging of CART.
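To make this last comparison concrete, the two importance rankings could be placed side by side (a minimal sketch, assuming the vehicle.rf and cart.importance objects built above; "MeanDecreaseAccuracy" is the permutation-based importance column of the forest):

rf.imp <- vehicle.rf$importance[, "MeanDecreaseAccuracy"]
comparison <- data.frame(rf.rank = rank(-rf.imp),
                         bagging.rank = rank(-cart.importance[names(rf.imp)]))
comparison[order(comparison$rf.rank), ]   # variables ordered by their forest rank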