Fortgeschrittene Computerintensive Methoden




Fortgeschrittene Computerintensive Methoden
Einheit 5: mlr - Machine Learning in R

Bernd Bischl
Matthias Schmid, Manuel Eugster, Bettina Grün, Friedrich Leisch
Institut für Statistik, LMU München
SoSe 2015

mlr?

- There is no unifying interface for machine learning in R
- Experiments require lengthy, tedious and error-prone code
- Machine learning is (also) an experimental science: we need powerful and flexible tools!
- mlr has now existed for 3-4 years (or longer?) and has grown quite large
- Still heavily in development, but the official releases are stable
- Was used for nearly all of my papers
- We cannot cover everything today: short intro + overview
- Focus: resampling / model selection / benchmarking!

mlr?

- Machine learning experiments are well structured
- Definition by plugging operators together (e.g., Weka or RapidMiner)
- mlr: abstractions, glue code and some own implementations

The mlr Team

- Bernd Bischl (Muenchen)
- Michel Lang (Dortmund)
- Jakob Richter (Dortmund)
- Lars Kotthoff (Cork)
- Julia Schiffner (Duesseldorf)
- Eric Studerus (Basel)

Package + Documentation

Main project page on GitHub: https://github.com/berndbischl/mlr
Contains further links, the tutorial and the issue tracker. Official versions are released to CRAN.

How to install:

install.packages("mlr")
devtools::install_github("mlr", username = "berndbischl")

Documentation:
- Tutorial on the project page (still growing)
- R docs and examples in HTML on the project page (and in the package)

Disclaimer: The slides are here to remind me of what I want to show you, and the covered examples have to be short. Refer to the tutorial, examples and technical docs later! Hence: let's explore the web material a bit!

Features I

- Clear S3 / object-oriented interface
- Easy extension mechanism through S3 inheritance
- Abstract description of learners and data by properties
- Description of data and task
- Many convenience methods, generic building blocks for own experiments
- Resampling like bootstrapping, cross-validation and subsampling
- Easy tuning of hyperparameters
- Variable selection
- Benchmark experiments with 2 levels of resampling

Features II

- Growing tutorial / examples
- Extensive unit testing (testthat)
- Extensive argument checks to help the user with errors
- Parallelization through parallelMap
  - Local, socket, MPI and BatchJobs modes
  - Parallelize experiments without touching the code
  - Job granularity can be changed, e.g., so jobs don't complete too early

Remarks on S3

- Not much is needed (for usage)
- Makes the extension process easy
- Extension is explained in the tutorial
- If you simply use the package, you don't really need to care!

Task Abstractions

- Regression, (cost-sens.) classification, clustering and survival tasks
- Internally: a data frame with annotations: target column(s), weights, misclassification costs, ...

task = makeClassifTask(data = iris, target = "Species")
print(task)
## Supervised task: iris
## Type: classif
## Target: Species
## Observations: 150
## Features:
## numerics  factors  ordered
##        4        0        0
## Missings: FALSE
## Has weights: FALSE
## Has blocking: FALSE
## Classes: 3
##     setosa versicolor  virginica
##         50         50         50
## Positive class: NA

Learner Abstractions

- 49 classification, 6 clustering, 42 regression, 10 survival learners
- Reduction algorithms for cost-sensitive learning
- Internally: functions to train and predict, a parameter set and annotations

lrn = makeLearner("classif.rpart")
print(lrn)
## Learner classif.rpart from package rpart
## Type: classif
## Name: Decision Tree; Short name: rpart
## Class: classif.rpart
## Properties: twoclass,multiclass,missings,numerics,factors,ordered,prob,weights
## Predict-Type: response
## Hyperparameters: xval=0

Learner Abstractions

getParamSet(lrn)
##                    Type len  Def   Constr Req Trafo
## minsplit        integer   -   20 1 to Inf   -     -
## minbucket       integer   -    - 1 to Inf   -     -
## cp              numeric   - 0.01   0 to 1   -     -
## maxcompete      integer   -    4 0 to Inf   -     -
## maxsurrogate    integer   -    5 0 to Inf   -     -
## usesurrogate   discrete   -    2    0,1,2   -     -
## surrogatestyle discrete   -    0      0,1   -     -
## maxdepth        integer   -   30  1 to 30   -     -
## xval            integer   -   10 0 to Inf   -     -
## parms           untyped   -    -        -   -     -

Learner Abstractions

head(listLearners("classif", properties = c("prob", "multiclass")))
## [1] "classif.bdk"        "classif.boosting"
## [3] "classif.cforest"    "classif.ctree"
## [5] "classif.extraTrees" "classif.gbm"

Learner Abstractions

head(listLearners(iris.task))
## [1] "classif.bdk"        "classif.boosting"
## [3] "classif.cforest"    "classif.ctree"
## [5] "classif.extraTrees" "classif.fnn"

Performance Measures

- 21 classification, 7 regression, 1 survival measures
- Internally: a performance function, an aggregation function and annotations

print(mmce)
## Name: Mean misclassification error
## Performance measure: mmce
## Properties: classif,classif.multi,req.pred,req.truth
## Minimize: TRUE
## Best: 0; Worst: 1
## Aggregated by: test.mean
## Note:

print(timetrain)
## Name: Time of fitting the model
## Performance measure: timetrain
## Properties: classif,classif.multi,regr,surv,costsens,cluster,req.model
## Minimize: TRUE
## Best: 0; Worst: Inf
## Aggregated by: test.mean
## Note:

Performance Measures

listMeasures("classif")
##  [1] "f1"             "featperc"       "mmce"
##  [4] "tn"             "tp"             "mcc"
##  [7] "fn"             "fp"             "npv"
## [10] "bac"            "timeboth"       "acc"
## [13] "ppv"            "multiclass.auc" "fnr"
## [16] "auc"            "tnr"            "ber"
## [19] "timepredict"    "fpr"            "gmean"
## [22] "tpr"            "gpr"            "fdr"
## [25] "timetrain"

Overview of Implemented Learners

- Classification: LDA, QDA, RDA, MDA; trees and forests; boosting (different variants); SVMs (different variants); ...
- Regression: linear, lasso and ridge; boosting; trees and forests; Gaussian processes; ...
- Clustering: k-means; EM; DBSCAN; X-means; ...
- Survival analysis: Cox PH; CoxBoost; random survival forest; penalized regression; ...

Much better documented on the web page, let's go there!

Example ex1.r: Training and prediction

Resampling

- Hold-out
- Cross-validation
- Bootstrap
- Subsampling
- Stratification
- Blocking
- and quite a few extensions

Resampling

Resampling techniques: CV, bootstrap, subsampling, ...

cv3f = makeResampleDesc("CV", iters = 3, stratify = TRUE)

10-fold CV of rpart on iris:

lrn = makeLearner("classif.rpart")
cv10f = makeResampleDesc("CV", iters = 10)
measures = list(mmce, acc)
resample(lrn, task, cv10f, measures)$aggr
## mmce.test.mean  acc.test.mean
##        0.06667        0.93333

crossval(lrn, task)$aggr
## mmce.test.mean
##        0.06667

Example ex2.r: Resampling

Benchmarking

- Compare multiple learners on multiple tasks
- Fair comparisons: the same training and test sets for each learner

data("Sonar", package = "mlbench")
tasks = list(
  makeClassifTask(data = iris, target = "Species"),
  makeClassifTask(data = Sonar, target = "Class")
)
learners = list(
  makeLearner("classif.rpart"),
  makeLearner("classif.randomForest"),
  makeLearner("classif.ksvm")
)
benchmark(learners, tasks, cv10f, mmce)
##   task.id           learner.id mmce.test.mean
## 1    iris        classif.rpart        0.06000
## 2    iris classif.randomForest        0.04667
## 3    iris         classif.ksvm        0.04000
## 4   Sonar        classif.rpart        0.30762
## 5   Sonar classif.randomForest        0.20619
## 6   Sonar         classif.ksvm        0.19214

Example ex3.r: Benchmarking

Visualizations

plotLearnerPrediction(makeLearner("classif.randomForest"), task,
  features = c("Sepal.Length", "Sepal.Width"))

[Plot: prediction regions of classif.randomForest on iris, Sepal.Length vs. Sepal.Width, classes setosa / versicolor / virginica; Train: mmce=0.0933; CV: mmce.test.mean=0.273]

Example ex4.r: ROC analysis

Remarks on Model Selection I

Basic machine learning:
- Fit the parameters of a model to predict new data
- The generalisation error is commonly estimated by resampling, e.g. 10-fold cross-validation

2nd, 3rd, ... level of inference:
- Comparing inducers or hyperparameters is harder
- Feature selection happens either on the 2nd level or adds a 3rd one
- Statistical comparisons on the 2nd stage are non-trivial
- Still active research
- Very likely that high-performance computing is needed

Tuning / Grid Search / Random Search

Tuning:
- Used to find the best hyperparameters for a method in a data-dependent way
- Must be done for some methods, like SVMs

Grid search:
- Basic method: exhaustively try all combinations of a finite grid
- Inefficient, combinatorial explosion
- Searches large, irrelevant areas
- Questionable for continuous parameters (requires discretization)
- Still often the default method
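A minimal grid-search sketch with mlr, assuming an SVM from kernlab on the iris task; the grid values for C and sigma are illustrative, not a recommendation:

```r
library(mlr)

task = makeClassifTask(data = iris, target = "Species")
lrn = makeLearner("classif.ksvm")

# Discrete grid over C and sigma (values are illustrative)
ps = makeParamSet(
  makeDiscreteParam("C", values = 2^(-2:2)),
  makeDiscreteParam("sigma", values = 2^(-2:2))
)
ctrl = makeTuneControlGrid()
rdesc = makeResampleDesc("CV", iters = 3)

# Evaluates all 25 grid points by 3-fold CV, returns the best one
res = tuneParams(lrn, task, rdesc, par.set = ps, control = ctrl)
res$x  # best parameter combination found
res$y  # corresponding mmce estimate
```

Note that all 5 x 5 = 25 combinations are evaluated, which is exactly the combinatorial explosion mentioned above.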

Tuning / Grid Search / Random Search

Random search:
- Randomly draw parameter settings
- mlr supports all parameter types and dependencies here
- Scales better than grid search
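The same tuning problem with random search instead of a grid; the numeric ranges and the log-scale trafo are illustrative assumptions:

```r
library(mlr)

task = makeClassifTask(data = iris, target = "Species")
lrn = makeLearner("classif.ksvm")

# Continuous ranges, searched on a log2 scale via trafo
# (ranges and budget are illustrative)
ps = makeParamSet(
  makeNumericParam("C", lower = -5, upper = 5, trafo = function(x) 2^x),
  makeNumericParam("sigma", lower = -5, upper = 5, trafo = function(x) 2^x)
)
ctrl = makeTuneControlRandom(maxit = 20)
rdesc = makeResampleDesc("CV", iters = 3)

# Draws 20 random configurations and evaluates each by 3-fold CV
res = tuneParams(lrn, task, rdesc, par.set = ps, control = ctrl)
```

Unlike grid search, the budget (maxit) is fixed up front and independent of the number of parameters.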

Example ex5.r: Basic tuning: grid search

Remarks on Model Selection II

Salzberg (1997): On comparing classifiers: Pitfalls to avoid and a recommended approach. Many articles do not contain sound statistical methodology.

- Compare against enough reasonable algorithms
- Do not just report mean performance values
- Do not cheat through repeated tuning
- Use double cross-validation for tuning and evaluation
- Apply the correct test (check its assumptions)
- Adjust for multiple testing
- Think about independence and distributions
- Do not rely solely on UCI; we might overfit on it

Nested Resampling

- Ensures unbiased results for meta-model optimization
- Use this for tuning and feature selection
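A sketch of nested resampling via mlr's tuning wrapper: the inner CV selects rpart's cp, the outer CV estimates the generalisation error of the whole tuning procedure. The cp grid and fold counts are illustrative:

```r
library(mlr)

task = makeClassifTask(data = iris, target = "Species")

# Inner loop: 3-fold CV tunes cp over a small grid (values illustrative)
ps = makeParamSet(makeDiscreteParam("cp", values = c(0.001, 0.01, 0.1)))
inner = makeResampleDesc("CV", iters = 3)
lrn = makeTuneWrapper("classif.rpart", resampling = inner,
                      par.set = ps, control = makeTuneControlGrid())

# Outer loop: 5-fold CV around the tuned learner
outer = makeResampleDesc("CV", iters = 5)
r = resample(lrn, task, outer, measures = mmce)
r$aggr  # unbiased mmce estimate of "rpart + tuning"
```

The key point: the outer test sets never touch the tuning, so the reported mmce is not optimistically biased by the hyperparameter search.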

Example ex6.r: Tuning + nested resampling via wrappers

Parallelization

- Activate with parallelMap::parallelStart
- Backends: local, multicore, socket, MPI and BatchJobs

parallelStart("BatchJobs")
benchmark([...])
parallelStop()

Parallelization levels:

parallelGetRegisteredLevels()
## mlr: mlr.benchmark, mlr.resample, mlr.selectFeatures, mlr.tuneParams

- Defaults to the first possible / outermost loop
- Few iterations in benchmark (loop over learners x tasks), many in resampling:

parallelStart("multicore", level = "mlr.resample")

Parallelization

parallelMap is documented here: https://github.com/berndbischl/parallelmap
BatchJobs is documented here: https://github.com/tudo-r/batchjobs
Make sure to read the wiki page for Dortmund!

Example ex7.r: Parallelization

Outlook / Not Shown Today I

- Regression
- Survival analysis
- Clustering
- Regular cost-sensitive learning (class-specific costs)
- Cost-sensitive learning (example-dependent costs)
- Smarter tuning algorithms: CMA-ES, iterated F-racing, model-based optimization, ...
- Multi-criteria optimization
- ...

Outlook / Not Shown Today II

- Variable selection
  - Filters: simply import available R methods
  - Wrappers: forward, backward and stochastic search, GAs
- Bagging for arbitrary base learners
- Generic imputation for missing values
- Wrapping / tuning of preprocessing
- Over-/undersampling for unbalanced class sizes
- mlr connection to OpenML
- ...