M1 in Economics and Economics and Statistics
Applied Multivariate Analysis - Big Data Analytics
Worksheet 3 - Random Forest


Nathalie Villa-Vialaneix
Year 2014/2015

This worksheet's aim is to learn how to train and use random forests with R. The method is implemented in the package randomForest, which is to be installed and loaded. It is illustrated on examples taken from the UCI Machine Learning repository, which are available in the R package mlbench:

library(randomForest)
randomForest
Type rfNews() to see new features/changes/bug fixes.
library(mlbench)

Exercise 1 - Random forest: training, interpretation and tuning

This exercise illustrates the use of random forests to discriminate sonar signals bounced off a metal cylinder from those bounced off a roughly cylindrical rock. The data are contained in the dataset Sonar:

data(Sonar)

1. Using help(Sonar), read the description of the dataset. How many potential predictors are included in the dataset and what are their types? How many observations are included in the dataset?

Answer: The dataset contains 208 observations on 60 predictors, all of them numerical. The last variable of the dataset, Class, defines the class (M or R) of the observation, which is to be predicted.

2. Randomly split the data into a training set and a test set of sizes 100 and 108, respectively.

Answer:

ind.test <- sample(1:208, 108, replace=FALSE)
sonar.train <- Sonar[-ind.test, ]
sonar.test <- Sonar[ind.test, ]
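Since the split is random, the numerical results below change from one run to the next; fixing the seed of the random number generator just before the call to sample makes them reproducible (a minimal sketch; the seed value 1234 is an arbitrary choice):

set.seed(1234)  # any fixed value makes the following sample() call reproducible
ind.test <- sample(1:208, 108, replace=FALSE)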

3. Use the function randomForest to train a random forest on the training set with default parameters (use the formula syntax described in help(randomForest) and the options keep.inbag=TRUE and importance=TRUE). What is the OOB error? Make a contingency table between OOB predictions and real classes.

Answer:

rf.sonar <- randomForest(Class~., data=sonar.train, keep.inbag=TRUE, importance=TRUE)
rf.sonar

Call:
 randomForest(formula = Class ~ ., data = sonar.train, keep.inbag = TRUE, importance = TRUE)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 7

        OOB estimate of error rate: 20%
Confusion matrix:
  M R class.error
M
R

4. Using the help, describe what is in predicted, err.rate, confusion, votes, oob.times, importance, ntree, mtry and inbag in the random forest object trained in the previous question.

Answer:

head(rf.sonar$predicted)
M R R R R R
Levels: M R

contains the OOB predictions of the random forest;

head(rf.sonar$err.rate)
     OOB M R
[1,]
[2,]
[3,]
[4,]
[5,]
[6,]

contains the evolution of the misclassification rate during the learning process, as the number of trees in the forest increases. The misclassification rate is given for each of the two classes and as an overall misclassification rate; it is estimated from OOB predictions;

rf.sonar$confusion
  M R class.error
M
R

contains the confusion matrix between OOB predictions and real classes;

head(rf.sonar$votes)
  M R

contains, for every observation, the fraction of trees voting for each of the two classes (it is thus a matrix). The votes are based on the trees for which the observation is out-of-bag;

head(rf.sonar$oob.times)
[1]

contains, for every observation, the number of trees for which the observation is out-of-bag;

head(rf.sonar$importance)
   M R MeanDecreaseAccuracy MeanDecreaseGini
V1
V2
V3
V4
V5
V6

contains, for every variable, the importance of the variable, expressed as the mean decrease in accuracy when predicting each of the two classes, as the overall mean decrease in accuracy, and as another importance index, the mean decrease in the Gini index (not explained in the lesson);

rf.sonar$ntree; rf.sonar$mtry
[1] 500
[1] 7

respectively contain the number of trees in the forest and the number of variables used for each split (here equal to 7, the default value floor(sqrt(60)) for a classification problem with 60 predictors);

head(rf.sonar$inbag[,1:5])
     [,1] [,2] [,3] [,4] [,5]
[1,]
[2,]
[3,]
[4,]
[5,]
[6,]

is a matrix which contains 1 each time an observation (row) is included in the bag of a tree (column).
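These components are consistent with each other: an observation is out-of-bag for a tree exactly when the corresponding entry of inbag is 0, so oob.times can be recomputed from inbag (a small sanity check on the rf.sonar object above):

# for each observation, count the trees in which it is out-of-bag
all(rf.sonar$oob.times == rowSums(rf.sonar$inbag == 0))
# should return TRUE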

5. What is the OOB prediction for the 50th observation in your training sample? What is the true class of this observation? How many times has it been included in a tree, and in which trees has it been included? How many trees voted for the prediction R for this observation?

Answer: The OOB prediction for the 50th observation in the training sample is

rf.sonar$predicted[50]
98
 R
Levels: M R

and the true class of this observation is

sonar.train$Class[50]
[1] M
Levels: M R

The observation has been included in

nb.oob.50 <- rf.sonar$oob.times[50]
rf.sonar$ntree - nb.oob.50
[1] 299

trees, whose numbers are given by:

which(rf.sonar$inbag[50, ]==1)

(a vector of the 299 tree indices). For this observation, the number of (OOB) trees that voted R is equal to

rf.sonar$votes[50,"R"] * nb.oob.50
[1] 132

(over 201 OOB trees for this observation).
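This count can be recovered directly from the individual tree predictions: predict.randomForest returns the prediction of every tree when predict.all=TRUE, and restricting attention to the trees for which observation 50 is out-of-bag reproduces the OOB vote (a sketch, using the rf.sonar object above):

# predictions of every individual tree for the training observations
all.preds <- predict(rf.sonar, sonar.train, predict.all=TRUE)$individual
# trees for which observation 50 is out-of-bag
oob.50 <- which(rf.sonar$inbag[50, ] == 0)
# number of these OOB trees voting "R": should match the 132 found above
sum(all.preds[50, oob.50] == "R")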

6. Using the plot.randomForest function, make a plot that displays the evolution of the OOB misclassification rate when the number of trees in the forest increases. What is represented by the different colors? Add a legend and comment.

Answer:

plot(rf.sonar, main="Evolution of misclassification rates")

[Figure: "Evolution of misclassification rates"; x-axis: trees, y-axis: Error.]

The three curves show the evolution of the overall misclassification rate, of the misclassification rate for the "R" class and of the misclassification rate for the "M" class.

tail(rf.sonar$err.rate)
       OOB M R
[495,]
[496,]
[497,]
[498,]
[499,]
[500,]

shows that the black line corresponds to the overall misclassification rate, the green one to the misclassification rate for "R" and the red one to the misclassification rate for "M". The legend can be added with:

plot(rf.sonar, main="Evolution of misclassification rates")
legend("topright", col=c("black", "red", "green"), lwd=2, lty=1,
       legend=c("overall", "class M", "class R"))

[Figure: the same plot with the legend "overall" / "class M" / "class R".]

The graphic shows that the three final misclassification rates are close, all equal to about 20%.
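plot.randomForest simply draws one curve per column of err.rate, so the same figure can be built by hand, which makes the column-to-color mapping explicit (a sketch, assuming the rf.sonar object above):

# one curve per column of err.rate ("OOB", "M", "R"), in the same colors
matplot(rf.sonar$err.rate, type="l", lty=1, col=c("black", "red", "green"),
        xlab="trees", ylab="misclassification rate")
legend("topright", col=c("black", "red", "green"), lty=1,
       legend=colnames(rf.sonar$err.rate))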

7. Make a barplot of the variable importances, ranked in decreasing order. Comment.

Answer:

ordered.variables <- order(rf.sonar$importance[,3], decreasing=TRUE)
barplot(rf.sonar$importance[ordered.variables,3], col="orange", cex.names=0.7,
        main="Variable importance (decreasing order)", las=3)

[Figure: "Variable importance (decreasing order)"; sorted barplot of the 60 variable importances, starting with V11 and V12.]

Two variables have a particularly large importance for the accuracy of the forest (V11 and V12) and three others are also rather important (V49, V9 and V27).

8. With the function predict.randomForest, predict the output of the random forest on the test set and find the test misclassification rate. Comment.

Answer: The predictions of the random forest on the test set are obtained with:

pred.test <- predict(rf.sonar, sonar.test)

Hence, the misclassification rate on the test set is equal to:

sum(pred.test != sonar.test$Class)/length(pred.test)
[1]

which is smaller than the OOB misclassification rate and seems to show that the model does not overfit.
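Beyond this single rate, the test-set confusion matrix shows where the remaining errors are made (a short complement, using the objects defined above):

# contingency table between test-set predictions and true classes
table(predicted=pred.test, true=sonar.test$Class)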

9. Use the package e1071 and the function tune.randomForest to find, with a 10-fold cross validation performed on the training set only, which pair of parameters is the best among ntree=c(500, 1000, 1500, 2000) and mtry=c(5, 7, 10, 15, 20). Set the option importance to TRUE to obtain the variable importances of the best model. Use the functions summary.tune and plot.tune to interpret the result.

library(e1071)

Answer: The tuning is performed with:

res.tune.rf <- tune.randomForest(sonar.train[,1:60], sonar.train[,61],
                                 mtry=c(5, 7, 10, 15, 20),
                                 ntree=c(500, 1000, 1500, 2000),
                                 importance=TRUE)

which can be analyzed with:

summary(res.tune.rf)

Parameter tuning of 'randomForest':

- sampling method: 10-fold cross validation

- best parameters:
 mtry ntree

- best performance:

- Detailed performance results:
   mtry ntree error dispersion

plot(res.tune.rf)

[Figure: "Performance of randomForest"; CV error displayed over the (mtry, ntree) grid.]

The best parameters obtained by cross validation are a number of trees equal to 2000 and a number of variables used for every split equal to 15. The CV error is then a bit smaller (about 17%) than the OOB error obtained with the default parameters.
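By default, the tune-based functions of e1071 use 10-fold cross validation; the resampling scheme can be changed through the tunecontrol argument (a sketch switching to 5-fold cross validation; the object name res.tune.cv5 is ours):

# same grid search, but with 5-fold instead of 10-fold cross validation
res.tune.cv5 <- tune.randomForest(sonar.train[,1:60], sonar.train[,61],
                                  mtry=c(5, 7, 10, 15, 20),
                                  ntree=c(500, 1000, 1500, 2000),
                                  tunecontrol=tune.control(sampling="cross", cross=5))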

10. From the previous question, extract the best model (in the object obtained with the function tune.randomForest, it is in $best.model). Compare its OOB error and its evolution, its most important variables and its test error with those of the random forest obtained with default parameters. Comment.

Answer: The best model is obtained with

rf.sonar.2 <- res.tune.rf$best.model
rf.sonar.2

Call:
 best.randomForest(x = sonar.train[, 1:60], y = sonar.train[, 61], mtry = c(5, 7, 10, 15, 20)
               Type of random forest: classification
                     Number of trees: 2000
No. of variables tried at each split: 15

        OOB estimate of error rate: 19%
Confusion matrix:
  M R class.error
M
R

The OOB error is thus equal to 19%, compared to 20% with the default parameters. The evolution of the OOB error is obtained with:

plot(rf.sonar.2)
legend("topright", col=c("black", "red", "green"), lwd=2, lty=1,
       legend=c("overall", "class M", "class R"))

[Figure: "rf.sonar.2"; evolution of the error rates with the legend "overall" / "class M" / "class R".]

which gives results very similar to those of the random forest with default parameters. The important variables are given by the following barplot:

ordered.variables <- order(rf.sonar.2$importance[,3], decreasing=TRUE)
barplot(rf.sonar.2$importance[ordered.variables,3], col="orange", cex.names=0.7,
        main="Variable importance (decreasing order)", las=3)

[Figure: "Variable importance (decreasing order)"; sorted barplot starting with V11, V49, V12, V27 and V9.]

The five most important variables are the same as the ones found in the previous forest (but not in the same order). Finally, the test error rate is obtained with:

pred.test.2 <- predict(rf.sonar.2, sonar.test)
sum(pred.test.2 != sonar.test$Class)/length(sonar.test$Class)
[1]

The test error rate is a bit larger than in the previous case, which is expected: the forest is based on trees that are more flexible (more variables used for the splits) and it thus tends to overfit more than the previous model.
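The agreement between the two importance rankings can be checked programmatically (a small sketch using the two forests above; column 3 of importance is MeanDecreaseAccuracy):

# top five variables of each forest, by mean decrease in accuracy
head(rownames(rf.sonar$importance)[order(rf.sonar$importance[,3], decreasing=TRUE)], 5)
head(rownames(rf.sonar.2$importance)[order(rf.sonar.2$importance[,3], decreasing=TRUE)], 5)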

Exercise 2 - Comparison between random forest and bagging of CART trees

This exercise's aim is to compare random forests and bagging of CART trees from an accuracy point of view. The notion of importance is also introduced for bagged CART trees. The comparison is illustrated on the dataset Vehicle

data(Vehicle)

(package mlbench), whose purpose is to classify a given silhouette as one of four types of vehicle, using a set of features extracted from the silhouette.

1. Using help(Vehicle), read the description of the dataset. How many potential predictors are included in the dataset and what are their types? How many observations are included in the dataset?

Answer: The dataset contains 846 observations on 18 predictors, all of them numerical. The last variable of the dataset, Class, defines the class (bus, opel, saab or van) of the observation, which is to be predicted.

2. Split the data set into a training set and a test set, having respective sizes 400 and 446.

Answer: The data set is split into training and test sets with:

ind.test <- sample(1:nrow(Vehicle), 446, replace=FALSE)
vehicle.test <- Vehicle[ind.test, ]
vehicle.train <- Vehicle[-ind.test, ]

3. Using the function tune.randomForest, train a random forest with possible numbers of trees 500, 1000, 2000 and 3000 and possible numbers of variables selected at every split 4, 5, 6, 8 and 10. Keep the variable importances and use the options xtest and ytest to obtain the test misclassification error directly. Analyze the results of the tuning process.

Warning: When using the options xtest and ytest, the forest is not kept unless keep.forest=TRUE is set, which means that it cannot be used to make predictions on new data. To use it with the method tune, you must thus set keep.forest=TRUE, or the function will return an error while trying to compute the CV error.

Answer: The forest is tuned with the following command line:

vehicle.tune.rf <- tune.randomForest(vehicle.train[,1:18], vehicle.train[,19],
                                     xtest=vehicle.test[,1:18], ytest=vehicle.test[,19],
                                     mtry=c(4, 5, 6, 8, 10),
                                     ntree=c(500, 1000, 2000, 3000),
                                     importance=TRUE, keep.forest=TRUE)

The performances of the different pairs of parameters can be visualized with:

summary(vehicle.tune.rf)

Parameter tuning of 'randomForest':

- sampling method: 10-fold cross validation

- best parameters:
 mtry ntree

- best performance:

- Detailed performance results:
   mtry ntree error dispersion

plot(vehicle.tune.rf)

[Figure: "Performance of randomForest"; CV error displayed over the (mtry, ntree) grid.]

The minimum CV misclassification error is obtained for 1000 trees trained with 4 variables at each split.
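The selected pair of parameters and the corresponding CV error can also be read directly from the tune object, instead of from the plot (both fields are standard components of e1071 tune objects):

# best (mtry, ntree) pair and its cross-validated error
vehicle.tune.rf$best.parameters
vehicle.tune.rf$best.performance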

4. Analyze the best forest obtained with the tuning process of the previous question in terms of out-of-bag error, test error and their evolutions. What are the most important variables in this model?

Answer: The best model obtained by the tuning process is extracted with:

vehicle.rf <- vehicle.tune.rf$best.model
vehicle.rf

Call:
 best.randomForest(x = vehicle.train[, 1:18], y = vehicle.train[, 19], mtry = c(4, 5, 6, 8, 1
               Type of random forest: classification
                     Number of trees: 1000
No. of variables tried at each split: 4

        OOB estimate of error rate: 28.5%
Confusion matrix:
     bus opel saab van class.error
bus
opel
saab
van

                Test set error rate: 25.56%
Confusion matrix:
     bus opel saab van class.error
bus
opel
saab
van

which shows that the OOB misclassification rate is about 28.5%, mainly due to a difficulty in discriminating between the classes opel and saab. The test misclassification error is slightly smaller (about 25.56%), with the same problem in discriminating opel and saab observations. The evolution of the misclassification errors is displayed using

plot(vehicle.rf, main="Error evolutions for the dataset 'Vehicle'")
legend(700, 0.45, col=1:5, lty=1, lwd=2, cex=0.8,
       legend=c("overall", levels(Vehicle$Class)))

[Figure: "Error evolutions for the dataset 'Vehicle'"; curves: overall, bus, opel, saab, van.]

This figure shows that the misclassification rate stabilizes (or even slightly increases) at the end of the training process (when the number of trees in the forest exceeds approximately 500). The two classes opel and saab exhibit the worst misclassification rates, as already explained. Finally, the most important variables are

ordered.variables <- order(vehicle.rf$importance[,5], decreasing=TRUE)
barplot(vehicle.rf$importance[ordered.variables,5], col="orange", cex.names=0.7,
        main="Variable importance (decreasing order)", las=3)

[Figure: "Variable importance (decreasing order)"; sorted barplot starting with Max.L.Ra, Sc.Var.maxis, Elong, Sc.Var.Maxis, D.Circ and Scat.Ra.]

One variable (Max.L.Ra, max. length aspect ratio) is more important than the others, and another set of 5 variables (Sc.Var.maxis, Elong, Sc.Var.Maxis, D.Circ and Scat.Ra) has a large importance compared to the remaining ones.
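The package randomForest also provides a ready-made display of the same information: varImpPlot draws sorted dotcharts of the two overall importance measures (an alternative to the manual barplot above):

# sorted dotcharts of MeanDecreaseAccuracy and MeanDecreaseGini
varImpPlot(vehicle.rf, main="Variable importance for 'Vehicle'")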

5. This question aims at building a simple bagging of trees to compare with the previous random forest. The packages boot and rpart

library(boot)
library(rpart)

will be used. To do so, answer the following questions:

(a) Define a function

boot.vehicle <- function(x, d) {... }

that returns 3 values: the OOB misclassification rate for the classification tree trained with the subsample x[d, ]; the test misclassification rate for the same classification tree; the misclassification rates obtained with the same classification tree on OOB observations when one of the variables has been permuted (it is thus a vector of size 18, one value for each variable).

Answer:

boot.vehicle <- function(x, d) {
  inbag <- x[d, ]
  oobag <- x[-d, ]
  cur.tree <- rpart(Class~., data=inbag, method="class")
  # OOB error
  cur.oob <- predict(cur.tree, oobag)
  cur.oob <- sum(oobag$Class != colnames(cur.oob)[apply(cur.oob, 1, which.max)]) /
    nrow(cur.oob)
  # test error
  cur.test <- predict(cur.tree, vehicle.test)
  cur.test <- sum(vehicle.test$Class != colnames(cur.test)[apply(cur.test, 1, which.max)]) /
    nrow(cur.test)
  # OOB error when each predictor is permuted in turn
  shuffled.oob <- sapply(1:(ncol(oobag)-1), function(ind) {
    new.oob <- oobag
    new.oob[,ind] <- oobag[sample(1:nrow(oobag), nrow(oobag), replace=FALSE), ind]
    cur.oob <- predict(cur.tree, new.oob)
    cur.oob <- sum(oobag$Class != colnames(cur.oob)[apply(cur.oob, 1, which.max)]) /
      nrow(cur.oob)
    return(cur.oob)
  })
  return(c(cur.oob, cur.test, shuffled.oob))
}

(b) Use the function defined in the previous question with boot to obtain bootstrap estimates of the OOB misclassification rate (computed only from the OOB observations in the training data set) and of the test misclassification rate for a bagging of R trees, with R being the optimal number of trees found in question 3.

Answer: Results of the bagging approach are obtained from:

res.bagging <- boot(vehicle.train, boot.vehicle, R=vehicle.tune.rf$best.parameters$ntree)

which gives the following OOB misclassification rate:

mean(res.bagging$t[,1])
[1]

and the following test misclassification rate:

mean(res.bagging$t[,2])
[1]

(c) Use the result of the previous function to compute and represent the variable importances.

Answer: A variable's importance is the mean, over the bootstrap replicates, of the difference between the OOB misclassification rate obtained when the variable is permuted and the original OOB misclassification rate.

difference.acc <- sweep(res.bagging$t[,3:20], 1, res.bagging$t[,1], "-")
cart.importance <- apply(difference.acc, 2, mean)
names(cart.importance) <- colnames(vehicle.test)[1:18]
ordered.variables <- order(cart.importance, decreasing=TRUE)
barplot(cart.importance[ordered.variables], col="orange", cex.names=0.7,
        main="Variable importance (decreasing order)", las=3)

[Figure: "Variable importance (decreasing order)"; sorted barplot starting with Max.L.Ra, Sc.Var.Maxis, Elong and Sc.Var.maxis.]

One variable (Max.L.Ra, max. length aspect ratio) is more important than the others, and another set of 3 variables (Sc.Var.Maxis, Elong and Sc.Var.maxis) has a large importance compared to the remaining ones.

6. Analyze the results (OOB and test errors, important variables) of the bagging of classification trees by comparing them with the ones obtained from the random forest.

Answer: Misclassification rates (for both the OOB and test sets) are smaller with random forests than with bagging of trees. Bagging of trees even shows a tendency to overfit the data (with a larger test error than OOB error). The variable importances are consistent between the two methods, with a larger set of important variables for the random forest than for the bagging of CART trees.
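The four error estimates can be gathered in a small table to support this comparison (a sketch; it assumes the test-set information was kept in vehicle.rf through xtest/ytest as above, and the exact values depend on the random split):

# final OOB and test errors of the forest, and bootstrap means for bagging;
# the first column of vehicle.rf$test$err.rate is the overall test error
rbind(random.forest = c(oob = vehicle.rf$err.rate[vehicle.rf$ntree, "OOB"],
                        test = vehicle.rf$test$err.rate[vehicle.rf$ntree, 1]),
      bagging = c(oob = mean(res.bagging$t[,1]),
                  test = mean(res.bagging$t[,2])))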
