The caret Package: A Unified Interface for Predictive Models
1 The caret Package: A Unified Interface for Predictive Models

Max Kuhn
Pfizer Global R&D, Nonclinical Statistics
Groton, CT
[email protected]
February 26, 2014
2 Motivation

Theorem (No Free Lunch): In the absence of any knowledge about the prediction problem, no model can be said to be uniformly better than any other.

Given this, it makes sense to use a variety of different models to find one that best fits the data. R has many packages for predictive modeling (aka machine learning, aka pattern recognition)...
3 Model Function Consistency

Since there are many modeling packages written by different people, there are some inconsistencies in how models are specified and predictions are made. For example, many models have only one method of specifying the model (e.g. the formula method only).

The table below shows the syntax to get class probability estimates from several classification models:

obj class     package   predict function syntax
lda           MASS      predict(obj)  (no options needed)
glm           stats     predict(obj, type = "response")
gbm           gbm       predict(obj, type = "response", n.trees)
mda           mda       predict(obj, type = "posterior")
rpart         rpart     predict(obj, type = "prob")
Weka          RWeka     predict(obj, type = "probability")
LogitBoost    caTools   predict(obj, type = "raw", nIter)
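To make the inconsistency concrete, here is a small, hypothetical illustration (using the built-in iris data, which is not part of the original slides): the same task, class probabilities, needs different arguments in different packages.

library(rpart)

## Binary outcome: is the species something other than setosa?
glmFit   <- glm(Species != "setosa" ~ Sepal.Length, data = iris,
                family = binomial)
rpartFit <- rpart(Species ~ ., data = iris)

glmProbs   <- predict(glmFit, type = "response")  # numeric vector of probabilities
rpartProbs <- predict(rpartFit, type = "prob")    # matrix with one column per class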
4 The caret Package

The caret package was developed to:

- create a unified interface for modeling and prediction (currently 147 different models)
- streamline model tuning using resampling
- provide a variety of helper functions and classes for day-to-day model building tasks
- increase computational efficiency using parallel processing

First commits within Pfizer: 6/2005
First version on CRAN: 10/2007
Website:
JSS Paper: "Building Predictive Models in R Using the caret Package", Journal of Statistical Software, November 2008, Volume 28, Issue 5
Book: Applied Predictive Modeling (AppliedPredictiveModeling.com)
5 Example Data

SGI has several data sets on their homepage for the MLC++ software. One data set can be used to predict telecom customer churn based on information about their account. Let's read in the data first:

> library(C50)
> data(churn)
6 Example Data

> str(churnTrain)
'data.frame':   3333 obs. of  20 variables:
 $ state                        : Factor w/ 51 levels "AK","AL","AR",..: ...
 $ account_length               : int  ...
 $ area_code                    : Factor w/ 3 levels "area_code_408",..: ...
 $ international_plan           : Factor w/ 2 levels "no","yes": ...
 $ voice_mail_plan              : Factor w/ 2 levels "no","yes": ...
 $ number_vmail_messages        : int  ...
 $ total_day_minutes            : num  ...
 $ total_day_calls              : int  ...
 $ total_day_charge             : num  ...
 $ total_eve_minutes            : num  ...
 $ total_eve_calls              : int  ...
 $ total_eve_charge             : num  ...
 $ total_night_minutes          : num  ...
 $ total_night_calls            : int  ...
 $ total_night_charge           : num  ...
 $ total_intl_minutes           : num  ...
 $ total_intl_calls             : int  ...
 $ total_intl_charge            : num  ...
 $ number_customer_service_calls: int  ...
 $ churn                        : Factor w/ 2 levels "yes","no": ...
7 Example Data

Here is a vector of predictor names that we might need later:

> predictors <- names(churnTrain)[names(churnTrain) != "churn"]

Note that the class's leading level is "yes". Many of the functions used here model the probability of the first factor level. In other words, we want to predict churn, not retention.
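A quick sanity check (not in the original slides) that the outcome is coded the way these functions expect; relevel() is a no-op here, but it shows the idiom for repairing the level order when it is wrong:

levels(churnTrain$churn)
## [1] "yes" "no"

## Force "yes" to be the first (modeled) level if it is not already
churnTrain$churn <- relevel(churnTrain$churn, ref = "yes")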
8 Data Splitting

A simple, stratified random split is used here:

> allData <- rbind(churnTrain, churnTest)
>
> set.seed(1)
> inTrainingSet <- createDataPartition(allData$churn,
+                                      p = .75, list = FALSE)
> churnTrain <- allData[ inTrainingSet,]
> churnTest  <- allData[-inTrainingSet,]
>
> ## For this presentation, we will stick with the original split

Other functions: createFolds, createMultiFolds, createResample
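A brief, hedged sketch of the other splitting helpers named above (the arguments shown are illustrative, not from the slides):

set.seed(1)
cvFolds  <- createFolds(allData$churn, k = 10, returnTrain = TRUE)  # one set of 10 folds
repFolds <- createMultiFolds(allData$churn, k = 10, times = 5)      # 5 repeats of 10-fold CV
boots    <- createResample(allData$churn, times = 25)               # 25 bootstrap samples
str(cvFolds, max.level = 1)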
9 Data Pre-Processing Methods

preProcess calculates values from one data set that can then be applied to any other data set (e.g. training, test, unknowns). Current methods: centering, scaling, spatial sign transformation, PCA or ICA signal extraction, imputation, Box-Cox transformations and others.

> numerics <- c("account_length", "total_day_calls", "total_night_calls")
> ## Determine means and sd's
> procValues <- preProcess(churnTrain[, numerics],
+                          method = c("center", "scale", "YeoJohnson"))
> ## Use the predict methods to do the adjustments
> trainScaled <- predict(procValues, churnTrain[, numerics])
> testScaled  <- predict(procValues, churnTest[, numerics])
10 Data Pre-Processing Methods

> procValues
Call:
preProcess.default(x = churnTrain[, numerics], method = c("center",
 "scale", "YeoJohnson"))

Created from 3333 samples and 3 variables
Pre-processing: centered, scaled, Yeo-Johnson transformation

Lambda estimates for Yeo-Johnson transformation:
0.89, 1.17, 0.93

preProcess can also be called within other functions for each resampling iteration (more in a bit).
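As a hedged illustration of that last point, the same transformations can be requested inside train() via its preProc argument, so they are recomputed on each resampled training set. The model and variables here are chosen just for illustration:

## Assumes caret is loaded and churnTrain/numerics exist as above
set.seed(1)
knnTune <- train(x = churnTrain[, numerics],
                 y = churnTrain$churn,
                 method = "knn",
                 preProc = c("center", "scale", "YeoJohnson"))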
11 Boosted Trees (Original AdaBoost Algorithm)

A method to boost weak learning algorithms (e.g. single trees) into strong learning algorithms. Boosted trees try to improve the model fit over different trees by considering past fits (not unlike iteratively reweighted least squares).

The basic tree boosting algorithm:

Initialize equal weights per sample;
for j = 1 ... M iterations do
    Fit a classification tree using the sample weights (denote the model equation as f_j(x));
    forall the misclassified samples do
        increase the sample weight
    end
    Save a stage weight (beta_j) based on the performance of the current model;
end
12 Boosted Trees (Original AdaBoost Algorithm)

In this formulation, the categorical response y_i is coded as either {-1, 1} and the model f_j(x) produces values in {-1, 1}. The final prediction is obtained by first predicting using all M trees, then weighting each prediction:

f(x) = \frac{1}{M} \sum_{j=1}^{M} \beta_j f_j(x)

where f_j is the j-th tree fit and beta_j is the stage weight for that tree. The final class is determined by the sign of the model prediction.

In English: the final prediction is a weighted average of each tree's prediction. The weights are based on the quality of each tree.
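A minimal, illustrative sketch of this algorithm (this is not gbm's implementation; the function names adaSketch/adaPredict are hypothetical, and stumps are fit with rpart for simplicity):

library(rpart)

## y must be coded -1 / +1; x is a data frame of predictors
adaSketch <- function(x, y, M = 10) {
  n <- length(y)
  w <- rep(1 / n, n)                          # initialize equal weights
  fits <- vector("list", M)
  beta <- numeric(M)
  for (j in seq_len(M)) {
    dat <- data.frame(x, y = factor(y))
    fits[[j]] <- rpart(y ~ ., data = dat, weights = w,
                       control = rpart.control(maxdepth = 1))
    pred <- ifelse(predict(fits[[j]], dat)[, "1"] > 0.5, 1, -1)
    err  <- sum(w * (pred != y)) / sum(w)     # weighted error rate
    err  <- min(max(err, 1e-10), 1 - 1e-10)   # guard against 0/1 error
    beta[j] <- log((1 - err) / err)           # stage weight
    w <- w * exp(beta[j] * (pred != y))       # up-weight the misclassified samples
    w <- w / sum(w)
  }
  list(fits = fits, beta = beta)
}

## Weighted average of the stage predictions; the sign gives the class
adaPredict <- function(obj, x) {
  votes <- sapply(obj$fits, function(f)
    ifelse(predict(f, data.frame(x))[, "1"] > 0.5, 1, -1))
  sign(votes %*% obj$beta / length(obj$beta))
}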
13 Boosted Trees Parameters

Most implementations of boosting have three tuning parameters:

- number of iterations (i.e. trees)
- complexity of the tree (i.e. number of splits)
- learning rate (aka shrinkage): how quickly the algorithm adapts

Boosting functions for trees in R: gbm in the gbm package, ada in ada and blackboost in mboost.
14 Using the gbm Package

The gbm function in the gbm package can be used to fit the model, then predict.gbm and other functions are used to predict and evaluate the model.

> library(gbm)
> # The gbm function does not accept factor response values so we
> # will make a copy and modify the outcome variable
> forGBM <- churnTrain
> forGBM$churn <- ifelse(forGBM$churn == "yes", 1, 0)
>
> gbmFit <- gbm(formula = churn ~ .,           # Use all predictors
+               distribution = "bernoulli",    # For classification
+               data = forGBM,
+               n.trees = 2000,                # 2000 boosting iterations
+               interaction.depth = 7,         # How many splits in each tree
+               shrinkage = 0.01,              # learning rate
+               verbose = FALSE)               # Do not print the details
15 Tuning the gbm Model

This code assumes that we know appropriate values of the three tuning parameters. As previously discussed, one approach for model tuning is resampling. We can fit models with different values of the tuning parameters to many resampled versions of the training set and estimate the performance based on the hold-out samples.

From this, a profile of performance can be obtained across different gbm models. An optimal value can be chosen from this profile.
16 Tuning the gbm Model

foreach resampled data set do
    Hold out samples;
    foreach combination of tree depth, learning rate and number of trees do
        Fit the model on the resampled data set;
        Predict the hold-outs and save the results;
    end
    Calculate the average ROC AUC across the hold-out sets of predictions
end
Determine the tuning parameters based on the highest resampled ROC AUC;
17 Model Tuning using train

In a basic call to train, we specify the predictor set, the outcome data and the modeling technique (e.g. boosting via the gbm package):

> gbmTune <- train(x = churnTrain[, predictors],
+                  y = churnTrain$churn,
+                  method = "gbm")
>
> # or, using the formula interface
> gbmTune <- train(churn ~ ., data = churnTrain, method = "gbm")

Note that train uses the original factor input for classification. For binary outcomes, the function models the probability of the first factor level (churn for these data). A numeric outcome would indicate that we are doing regression.

One problem is that gbm spits out a lot of information during model fitting.
18 Passing Model Parameters Through train

We can pass options through train to gbm using the three dots (...). Let's omit the function's logging using the gbm option verbose = FALSE:

gbmTune <- train(churn ~ ., data = churnTrain,
                 method = "gbm",
                 verbose = FALSE)
19 Changing the Resampling Technique

By default, train uses the bootstrap for resampling. We'll switch to 5 repeats of 10-fold cross-validation instead. We can change the type of resampling via the trainControl function:

ctrl <- trainControl(method = "repeatedcv", repeats = 5)

gbmTune <- train(churn ~ ., data = churnTrain,
                 method = "gbm",
                 verbose = FALSE,
                 trControl = ctrl)
20 Using Different Performance Metrics

At this point train will:

1. fit a sequence of different gbm models,
2. estimate performance using resampling, and
3. pick the gbm model with the best performance.

The default metrics for classification problems are accuracy and Cohen's Kappa. Suppose we wanted to estimate sensitivity, specificity and the area under the ROC curve (and pick the model with the largest AUC). We need to tell train to produce class probabilities, estimate these statistics and rank models by the ROC AUC.
21 Using Different Performance Metrics

The twoClassSummary function is defined in caret and calculates the sensitivity, specificity and ROC AUC. Other custom functions can be used (see ?train).

ctrl <- trainControl(method = "repeatedcv", repeats = 5,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)

gbmTune <- train(churn ~ ., data = churnTrain,
                 method = "gbm",
                 metric = "ROC",
                 verbose = FALSE,
                 trControl = ctrl)
22 Expanding the Search Grid

By default, train uses a minimal search grid: 3 values for each tuning parameter. We might also want to expand the scope of the possible gbm models to test. Let's look at tree depths from 1 to 7, boosting iterations from 100 to 1,000 and two different learning rates. We can create a data frame with these parameters to pass to train. The column names should match the tuning parameter names; see ?train for the tunable parameters for each model type.
23 Expanding the Search Grid

ctrl <- trainControl(method = "repeatedcv", repeats = 5,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)

grid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
                    n.trees = seq(100, 1000, by = 50),
                    shrinkage = c(0.01, 0.1))

gbmTune <- train(churn ~ ., data = churnTrain,
                 method = "gbm",
                 metric = "ROC",
                 tuneGrid = grid,
                 verbose = FALSE,
                 trControl = ctrl)
24 Running the Model

> grid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
+                     n.trees = seq(100, 1000, by = 50),
+                     shrinkage = c(0.01, 0.1))
>
> ctrl <- trainControl(method = "repeatedcv", repeats = 5,
+                      summaryFunction = twoClassSummary,
+                      classProbs = TRUE)
>
> set.seed(1)
> gbmTune <- train(churn ~ ., data = churnTrain,
+                  method = "gbm",
+                  metric = "ROC",
+                  tuneGrid = grid,
+                  verbose = FALSE,
+                  trControl = ctrl)
25 Model Tuning

train uses as many tricks as possible to reduce the number of model fits (e.g. using sub-models). It can also estimate some parameters analytically; for example, for RBF kernels it uses the kernlab function sigest to estimate the kernel's scale parameter.

- Currently, there are options for 147 models (see ?train for a list)
- Allows user-defined search grids, performance metrics and selection rules
- Easily integrates with any parallel processing framework that can emulate lapply
- Formula and non-formula interfaces
- Methods: predict, print, plot, ggplot, varImp, resamples, xyplot, densityplot, histogram, stripplot, ...
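A couple of hedged examples of those methods, applied to the gbmTune object created above:

print(gbmTune)        # text summary of the resampling profile
plot(gbmTune)         # lattice plot of performance vs. tuning parameters
varImp(gbmTune)       # variable importance for the final model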
26 Plotting the Results

> ggplot(gbmTune) + theme(legend.position = "top")

[Figure: resampled performance profile. The y-axis is ROC (Repeated Cross-Validation), the x-axis is # Boosting Iterations, with separate curves for each Max Tree Depth.]
27 Prediction and Performance Assessment

The predict method can be used to get results for other data sets:

> gbmPred <- predict(gbmTune, churnTest)
> str(gbmPred)
 Factor w/ 2 levels "yes","no": ...

> gbmProbs <- predict(gbmTune, churnTest, type = "prob")
> str(gbmProbs)
'data.frame':   1667 obs. of  2 variables:
 $ yes: num  ...
 $ no : num  ...
28 Prediction and Performance Assessment

> confusionMatrix(gbmPred, churnTest$churn)
Confusion Matrix and Statistics

          Reference
Prediction  yes   no
       yes  ...  ...
       no   ...  ...

               Accuracy : ...
                 95% CI : (0.94, 0.961)
    No Information Rate : ...
    P-Value [Acc > NIR] : < 2e-16

                  Kappa : ...
 Mcnemar's Test P-Value : 1.15e-12

            Sensitivity : ...
            Specificity : ...
         Pos Pred Value : ...
         Neg Pred Value : ...
             Prevalence : ...
         Detection Rate : ...
   Detection Prevalence : ...
      Balanced Accuracy : ...
29 Test Set ROC Curve via the pROC Package

> rocCurve <- roc(response = churnTest$churn,
+                 predictor = gbmProbs[, "yes"],
+                 levels = rev(levels(churnTest$churn)))
> rocCurve

Call:
roc.default(response = churnTest$churn, predictor = gbmProbs[, "yes"], levels = rev(levels(churnTest$churn)))

Data: gbmProbs[, "yes"] in 1443 controls (churnTest$churn no) < 224 cases (churnTest$churn yes).
Area under the curve: ...

> # plot(rocCurve)
30 Test Set ROC Curve via the pROC Package

[Figure: the test set ROC curve, Sensitivity vs. Specificity, with two labelled cutoffs at (specificity, sensitivity) = (0.962, 0.826) and (0.994, 0.674).]
31 Switching To Other Models

It is simple to go between models using train. A different method value is needed, plus any optional changes to preProc and potential options to pass through via the three dots (...). If we set the seed to the same value prior to calling train, the exact same resamples are used to evaluate the model.
32 Switching To Other Models

Minimal changes are needed to fit different models to the same data:

> ## Using the same seed prior to train() will ensure that
> ## the same resamples are used (even in parallel)
> set.seed(1)
> svmTune <- train(churn ~ ., data = churnTrain,
+                  ## Tell it to fit a SVM model and tune
+                  ## over Cost and the RBF parameter
+                  method = "svmRadial",
+                  ## This pre-processing will be applied to
+                  ## these data and new samples too.
+                  preProc = c("center", "scale"),
+                  ## Tune over different values of cost
+                  tuneLength = 10,
+                  trControl = ctrl,
+                  metric = "ROC")
33 Switching To Other Models

How about:

> set.seed(1)
> fdaTune <- train(churn ~ ., data = churnTrain,
+                  ## Now try a flexible discriminant model
+                  ## using MARS basis functions
+                  method = "fda",
+                  tuneLength = 10,
+                  trControl = ctrl,
+                  metric = "ROC")
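A hedged follow-up sketch (not in the original slides): because the same seed preceded each call to train, the three models share resamples and can be compared directly with the resamples method mentioned earlier.

cvValues <- resamples(list(GBM = gbmTune, SVM = svmTune, FDA = fdaTune))
summary(cvValues)                  # per-model distributions of ROC, Sens, Spec
dotplot(cvValues, metric = "ROC")  # lattice plot of the resampled ROC AUCs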
34 Other Functions and Classes

- nearZeroVar: a function to remove predictors that are sparse and highly unbalanced
- findCorrelation: a function to remove the optimal set of predictors to achieve low pair-wise correlations
- predictors: a class for determining which predictors are included in the prediction equations (e.g. rpart, earth, lars models) (currently 7 methods)
- confusionMatrix, sensitivity, specificity, posPredValue, negPredValue: classes for assessing classifier performance
- varImp: classes for assessing the aggregate effect of a predictor on the model equations (currently 28 methods)
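A hedged sketch of the first two helpers applied to these data (the correlation cutoff is illustrative, not from the slides):

## Column positions of sparse, highly unbalanced predictors
nzv <- nearZeroVar(churnTrain[, predictors])

## Among the numeric predictors, columns to drop for low pair-wise correlation
isNum    <- sapply(churnTrain[, predictors], is.numeric)
highCorr <- findCorrelation(cor(churnTrain[, predictors][, isNum]),
                            cutoff = 0.90)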
35 Other Functions and Classes

- knnreg: nearest-neighbor regression
- plsda, splsda: PLS discriminant analysis
- icr: independent component regression
- pcaNNet: nnet:::nnet with an automatic PCA pre-processing step
- bagEarth, bagFDA: bagging with MARS and FDA models
- normalize2Reference: RMA-like processing of Affy arrays using a training set
- spatialSign: a class for transforming numeric data (x -> x / ||x||)
- maxDissim: a function for maximum dissimilarity sampling
- rfe: a class/framework for recursive feature elimination (RFE) with an external resampling step
- sbf: a class/framework for applying univariate filters prior to predictive modeling, with an external resampling step
- featurePlot: a wrapper for several lattice functions
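For instance, a hedged featurePlot example using the numeric columns defined earlier:

featurePlot(x = churnTrain[, numerics],
            y = churnTrain$churn,
            plot = "density",               # overlaid density plots per class
            auto.key = list(columns = 2))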
36 Thanks

- Ray DiGiacomo
- R Core
- Pfizer's Statistics leadership for providing the time and support to create R packages
- caret contributors: Jed Wing, Steve Weston, Andre Williams, Chris Keefer, Allan Engelhardt, Tony Cooper, Zachary Mayer and the R Core Team
37 Session Info

R Under development (unstable) (r65008), x86_64-apple-darwin

Base packages: base, datasets, graphics, grDevices, methods, parallel, splines, stats, utils

Other packages: C50, caret, class 7.3-9, doMC 1.3.2, e1071, earth 3.2-6, foreach 1.4.1, gbm 2.1, ggplot2, iterators 1.0.6, kernlab, knitr 1.5, lattice, mda 0.4-4, plotmo 1.3-2, plotrix 3.5-2, pls 2.4-3, plyr 1.8, pROC 1.6, randomForest 4.6-7, survival

Loaded via a namespace (and not attached): car, codetools 0.2-8, colorspace 1.2-4, compiler 3.1.0, dichromat 2.0-0, digest 0.6.4, evaluate 0.5.1, formatR 0.10, grid 3.1.0, gtable 0.1.2, highr 0.3, labeling 0.2, MASS, munsell 0.4.2, nnet 7.3-7, proto, RColorBrewer 1.0-5, Rcpp, reshape2, scales 0.2.3, stringr 0.6.2, tools

This presentation was created with knitr at 23:23 on Wednesday, Feb 26, 2014.