Package extraTrees. February 19, 2015


Version 1.0.5
Date 2014-12-27
Title Extremely Randomized Trees (ExtraTrees) Method for Classification and Regression
Author Jaak Simm, Ildefons Magrans de Abril
Maintainer Jaak Simm <jaak.simm@gmail.com>
Description Classification and regression based on an ensemble of decision trees. The package also provides extensions of ExtraTrees to multi-task learning and quantile regression. Uses a Java implementation of the method.
Depends R (>= 2.7.0), rJava (>= 0.5-0)
Suggests testthat, Matrix
SystemRequirements Java (>= 1.6)
NeedsCompilation no
License Apache License 2.0
URL http://github.com/jaak-s/extraTrees
Repository CRAN
Date/Publication 2014-12-27 23:41:04

R topics documented: extraTrees, predict.extraTrees, prepareForSave, selectTrees, setJavaMemory, toJavaCSMatrix, toJavaMatrix, toJavaMatrix2D, toRMatrix

extraTrees    Function for training an ExtraTree classifier or regressor.

Description:

This function executes the ExtraTree building method (implemented in Java).

Usage:

## Default S3 method:
extraTrees(x, y, ntree=500,
    mtry = if (!is.null(y) && !is.factor(y))
               max(floor(ncol(x)/3), 1)
           else
               floor(sqrt(ncol(x))),
    nodesize = if (!is.null(y) && !is.factor(y)) 5 else 1,
    numRandomCuts = 1, evenCuts = FALSE, numThreads = 1,
    quantile = F, weights = NULL,
    subsetSizes = NULL, subsetGroups = NULL,
    tasks = NULL, probOfTaskCuts = mtry / ncol(x), numRandomTaskCuts = 1,
    na.action = "stop", ...)

Arguments:

x: a numeric input data matrix; each row is an input.
y: a vector of output values: a vector of numbers gives regression, a vector of factors gives classification.
ntree: the number of trees (default 500).
mtry: the number of features tried at each node (default is ncol(x)/3 for regression and sqrt(ncol(x)) for classification).
nodesize: the size of the leaves of the tree (default is 5 for regression and 1 for classification).
numRandomCuts: the number of random cuts for each (randomly chosen) feature (default 1, which corresponds to the official ExtraTrees method). The higher the number of cuts, the higher the chance of a good cut.
evenCuts: if FALSE (default), cutting thresholds are uniformly sampled. If TRUE, the range is split into even intervals (the number of intervals is numRandomCuts) and a cut is uniformly sampled from each interval.
numThreads: the number of CPU threads to use (default is 1).
quantile: if TRUE, quantile regression is performed (default is FALSE); only for regression data. Then use predict(et, newdata, quantile=k) to make predictions for quantile k.
weights: a vector of sample weights, one positive real value for each sample. NULL means standard learning, i.e. equal weights.
subsetSizes: subset size (one integer) or subset sizes (a vector of integers, requires subsetGroups). If supplied, every tree is built from a random subset of size subsetSizes. NULL means no subsetting, i.e. all samples are used.
subsetGroups: list specifying the subset group for each sample: from the samples in group g, each tree randomly selects subsetSizes[g] samples.
tasks: vector of tasks, integers from 1 and up; NULL if no multi-task learning.
probOfTaskCuts: probability of performing a task cut at a node (default mtry / ncol(x)). Used only if tasks is specified.
numRandomTaskCuts: number of times a task cut is performed at a node (default 1). Used only if tasks is specified.
na.action: specifies how to handle NA in x: "stop" (default) gives an error if any NA is present, "zero" sets all NA to zero, and "fuse" builds trees by skipping samples for which the chosen feature is NA.
...: not used currently.

Details:

For classification, ExtraTrees chooses the cut at each node by minimizing the Gini impurity index; for regression, by minimizing the variance. For more details see the package vignette, i.e. vignette("extraTrees").

If Java runs out of memory (java.lang.OutOfMemoryError: Java heap space), then, assuming you have free memory, you can increase the heap size with options( java.parameters = "-Xmx2g" ) before calling library( "extraTrees" ), where 2g requests 2GB of heap. Change it as necessary.

Value:

The trained model from input x and output values y, stored in an ExtraTree object.

See Also:

predict.extraTrees for predicting and prepareForSave for saving ExtraTrees models to disk.
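The heap-size adjustment described in Details can be written out as a short R sketch. The 2g value is only an example; size it to your machine, and note that the option must be set before rJava starts the JVM (i.e. before the package is first loaded in the session):

```r
## request a 2GB JVM heap; must happen BEFORE the JVM is initialized,
## i.e. before library("extraTrees") is called for the first time
options(java.parameters = "-Xmx2g")
library("extraTrees")
```

If the JVM is already running, restarting the R session is the reliable way to make a new heap setting take effect.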

Examples:

## Regression with ExtraTrees:
n <- 1000  ## number of samples
p <- 5     ## number of dimensions
x <- matrix(runif(n*p), ncol=p)
y <- (x[,1]>0.5) + 0.8*(x[,2]>0.6) + 0.5*(x[,3]>0.4) + 0.1*runif(nrow(x))
et <- extraTrees(x, y, nodesize=3, mtry=p, numRandomCuts=2)
yhat <- predict(et, x)

#######################################
## Multi-task regression with ExtraTrees:
n <- 1000  ## number of samples
p <- 5     ## number of dimensions
x <- matrix(runif(n*p), ncol=p)
task <- sample(1:10, size=n, replace=TRUE)
## y depends on the task:
y <- 0.5*(x[,1]>0.5) + 0.6*(x[,2]>0.6) + 0.8*(x[cbind(1:n, (task %% 2) + 3)]>0.4)
et <- extraTrees(x, y, nodesize=3, mtry=p-1, numRandomCuts=2, tasks=task)
yhat <- predict(et, x, newtasks=task)

#######################################
## Classification with ExtraTrees (with test data)
make.data <- function(n) {
    p <- 4
    x <- matrix(runif(n*p), ncol=p)
    f <- function(x) (x[,1]>0.5) + (x[,2]>0.6) + (x[,3]>0.4)
    y <- as.factor(f(x))
    return(list(x=x, y=y))
}
train <- make.data(800)
test <- make.data(500)
et <- extraTrees(train$x, train$y)
yhat <- predict(et, test$x)
## accuracy
mean(test$y == yhat)
## class probabilities
yprob = predict(et, test$x, probability=TRUE)
head(yprob)

#######################################
## Quantile regression with ExtraTrees (with test data)
make.qdata <- function(n) {
    p <- 4
    x <- matrix(runif(n*p), ncol=p)
    f <- function(x) (x[,1]>0.5) + 0.8*(x[,2]>0.6) + 0.5*(x[,3]>0.4)
    y <- as.numeric(f(x))
    return(list(x=x, y=y))
}
train <- make.qdata(400)
test <- make.qdata(200)
## learning extra trees:
et <- extraTrees(train$x, train$y, quantile=TRUE)
## estimate median (0.5 quantile)
yhat0.5 <- predict(et, test$x, quantile = 0.5)
## estimate 0.8 quantile (80%)
yhat0.8 <- predict(et, test$x, quantile = 0.8)

#######################################
## Weighted regression with ExtraTrees
make.wdata <- function(n) {
    p <- 4
    x <- matrix(runif(n*p), ncol=p)
    f <- function(x) (x[,1]>0.5) + 0.8*(x[,2]>0.6) + 0.5*(x[,3]>0.4)
    y <- as.numeric(f(x))
    return(list(x=x, y=y))
}
train <- make.wdata(400)
test <- make.wdata(200)
## first half of the samples have weight 1, the rest 0.3
weights <- rep(c(1, 0.3), each = nrow(train$x) / 2)
et <- extraTrees(train$x, train$y, weights = weights, numRandomCuts = 2)
## estimates of the weighted model
yhat <- predict(et, test$x)

predict.extraTrees    Function for making predictions from a trained ExtraTree object.

Description:

This function makes predictions for regression/classification using the given trained ExtraTree object and the provided input matrix (newdata).

Usage:

## S3 method for class 'extraTrees'
predict(object, newdata, quantile=NULL, allValues=F, probability=F, newtasks=NULL, ...)

Arguments:

object: extraTrees (S3) object, created by extraTrees().
newdata: a new numeric input data matrix; a prediction is made for each row.
quantile: the quantile value between 0.0 and 1.0 for quantile regression, or NULL (default) for standard predictions.
allValues: whether or not to return the outputs of all trees (default FALSE).
probability: whether to return a matrix of class (factor) probabilities (default FALSE). Can only be used in the case of classification. Calculated as the proportion of trees voting for a particular class.

newtasks: list of tasks, one for each input in newdata (default NULL). Must be NULL if no multi-task learning was used at training.
...: not used currently.

Value:

The vector of predictions from the ExtraTree object. The length of the vector is equal to the number of rows in newdata.

Examples:

## Regression with ExtraTrees:
n <- 1000  ## number of samples
p <- 5     ## number of dimensions
x <- matrix(runif(n*p), ncol=p)
y <- (x[,1]>0.5) + 0.8*(x[,2]>0.6) + 0.5*(x[,3]>0.4) + 0.1*runif(nrow(x))
et <- extraTrees(x, y, nodesize=3, mtry=p, numRandomCuts=2)
yhat <- predict(et, x)

prepareForSave    Prepares an ExtraTrees object for the save() function.

Description:

This function prepares an ExtraTrees object for saving by serializing the trees in the Java VM. It is equivalent to calling .jcache(et$jobject). Afterwards the object can be saved by save() (or automatic R session saving) and will be fully recovered after load(). Note: the object can still be used as usual after prepareForSave().

Usage:

prepareForSave(object)

Arguments:

object: extraTrees (S3) object, created by extraTrees().

Value:

Nothing is returned.

Examples:

et <- extraTrees(iris[,1:4], iris$Species)
prepareForSave(et)
## saving to a file
save(et, file="temp.RData")
## testing: remove et and load it back from the file
rm(list = "et")
load("temp.RData")
predict(et, iris[,1:4])

selectTrees    Makes a sub-ExtraTrees object by keeping only selected trees.

Description:

This function creates a sub-ExtraTrees object by keeping only the trees specified by selection.

Usage:

selectTrees(object, selection)

Arguments:

object: extraTrees (S3) object, created by extraTrees().
selection: a list of logicals (T/F) of length object$ntree.

Value:

A new ExtraTrees (S3) object based on the existing object, keeping only the trees present in the selection.

Examples:

## Regression with ExtraTrees:
n <- 1000  ## number of samples
p <- 5     ## number of dimensions
x <- matrix(runif(n*p), ncol=p)
y <- (x[,1]>0.5) + 0.8*(x[,2]>0.6) + 0.5*(x[,3]>0.4) + 0.1*runif(nrow(x))
et <- extraTrees(x, y, nodesize=3, mtry=p, numRandomCuts=2, ntree=500)
## random selection of trees:
trees <- sample(c(FALSE, TRUE), replace=TRUE, et$ntree)
et2 <- selectTrees(et, selection=trees)

setJavaMemory    Utility function for setting Java memory.

Description:

Function for setting the JVM memory, specified in MB. If you get java.lang.OutOfMemoryError, you can use this function to increase the memory available to ExtraTrees.

Usage:

setJavaMemory( memoryInMB )

Arguments:

memoryInMB: integer specifying the amount of memory (MB).

Examples:

## use 2GB of memory
setJavaMemory(2000)

toJavaCSMatrix    Utility function for converting an R SparseMatrix (package Matrix) to a Java (column) sparse matrix.

Description:

Internal function used for converting an R SparseMatrix (package Matrix) to a CSparseMatrix object in Java. CSparseMatrix is a custom Java class used for storing sparse matrices by the Java implementation of ExtraTrees.

Usage:

toJavaCSMatrix( m )

Arguments:

m: matrix of numeric values.

Value:

A reference to a Java matrix with the same contents as the input R matrix.

toJavaMatrix    Utility function for converting an R matrix (numeric matrix) to a Java matrix.

Description:

Internal function used for converting an R matrix to a Matrix object in Java. Matrix is a custom Java class used for storing matrices by the Java implementation of ExtraTrees.

Usage:

toJavaMatrix( m )

Arguments:

m: matrix of numeric values.

Value:

A reference to a Java matrix with the same contents as the input R matrix.

toJavaMatrix2D    Utility function for converting an R matrix (standard matrix or SparseMatrix) to the appropriate Java matrix object.

Description:

Internal function used for converting an R matrix to an appropriate object in Java. It uses toJavaMatrix() and toJavaCSMatrix() underneath and returns a reference to a general matrix representation in Java of type Array2D (an interface).

Usage:

toJavaMatrix2D( m )

Arguments:

m: matrix of numeric values.

Value:

A reference to a Java matrix (dense or sparse) with the same contents as the input R matrix.

toRMatrix    Utility function for converting a Java matrix to an R matrix (matrix of doubles).

Description:

Internal function used for converting a Matrix object from Java to an R matrix. Matrix is a custom Java class used for storing matrices by the Java implementation of ExtraTrees.

Usage:

toRMatrix( javam )

Arguments:

javam: Java matrix (Matrix class).

Value:

An R (double) matrix with the same contents as the input.
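Taken together, the conversion utilities round-trip a matrix between the R and Java representations. A minimal sketch, assuming the extraTrees package and a working Java runtime are available; these are internal functions, so if they are not exported they must be reached with the ::: operator, and this is for illustration only:

```r
library("extraTrees")
m  <- matrix(as.numeric(1:6), nrow=2)  ## a small dense R matrix
jm <- toJavaMatrix(m)                  ## R -> Java Matrix reference
m2 <- toRMatrix(jm)                    ## Java Matrix -> R double matrix
all.equal(m, m2)                       ## compare contents after the round trip
```

Sparse inputs go through toJavaCSMatrix() instead, and toJavaMatrix2D() picks the appropriate conversion automatically.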

Index

Topic java, matrix, conversion:
    toJavaCSMatrix, toJavaMatrix, toJavaMatrix2D, toRMatrix
Topic java.lang.OutOfMemoryError, memory, JVM:
    setJavaMemory
Topic regression, classification, trees:
    extraTrees, predict.extraTrees, selectTrees
Topic save, load, extraTrees:
    prepareForSave

extraTrees
predict.extraTrees
prepareForSave
selectTrees
setJavaMemory
toJavaCSMatrix
toJavaMatrix
toJavaMatrix2D
toRMatrix