Fortgeschrittene Computerintensive Methoden
Unit 3: mlr - Machine Learning in R
Bernd Bischl
Matthias Schmid, Manuel Eugster, Bettina Grün, Friedrich Leisch
Institut für Statistik, LMU München
Summer Semester (SoSe) 2014

Introduction + Motivation
- There is no unifying interface for machine learning in R.
- Experiments require lengthy, tedious and error-prone code.
- Machine learning is (also) an experimental science: we need powerful and flexible tools!
- mlr has existed for 2-3 years now and has grown quite large.
- It is still heavily in development, but the official releases are stable.
- It was used for nearly all of my papers.
- We cannot cover everything today: a short intro + overview.
- Focus: resampling / model selection / benchmarking!

Package + Documentation
- Main project page on GitHub: https://github.com/berndbischl/mlr
  Contains further links, the tutorial and the issue tracker.
- Official versions are released to CRAN.
- How to install:
      install.packages("mlr")
      install_github("mlr", username = "berndbischl")
- Documentation:
  - Tutorial on the project page (still growing)
  - R docs and examples in HTML on the project page (and in the package)

Features I
- Clear S3 / object-oriented interface
- Easy extension mechanism through S3 inheritance
- Abstract description of learners and data by properties
- Description of data and task
- Many convenience methods, generic building blocks for own experiments
- Resampling like bootstrapping, cross-validation and subsampling
- Easy tuning of hyperparameters
- Variable selection
- Benchmark experiments with 2 levels of resampling

Features II
- Growing tutorial / examples
- Extensive unit testing (testthat)
- Extensive argument checks to help the user with errors
- Parallelization through parallelMap (local, socket, MPI and BatchJobs modes); see the sketch below
- Parallelize experiments without touching the code
- Job granularity can be changed, e.g., so jobs don't complete too early
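
The parallelization point deserves a small illustration. The following is a minimal sketch, not taken from the slides, of wrapping an mlr resampling run in parallelMap's socket mode; the learner "classif.rpart", the iris task and the two workers are arbitrary choices for the example.

    library(mlr)
    library(parallelMap)

    # ordinary mlr setup, nothing here is parallel-specific
    task = makeClassifTask(data = iris, target = "Species")
    lrn = makeLearner("classif.rpart")
    rdesc = makeResampleDesc("CV", iters = 10)

    # start 2 local workers; mlr's resampling loop then runs in parallel
    parallelStartSocket(cpus = 2)
    r = resample(lrn, task, rdesc)
    parallelStop()

    r$aggr   # aggregated performance over the folds

Note that the resample() call itself is unchanged, which is exactly the "parallelize experiments without touching code" point.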

Remarks on S3
- Not much is needed (for usage).
- It makes the extension process easy; extension is explained in the tutorial.
- If you simply use the package, you don't really need to care!

Constructors / Accessors:
    # constructors
    task = makeClassifTask(data = iris, target = "Species")
    lrn = makeLearner("classif.lda")
    # accessors
    task$task.desc$target
    getTaskData(task)

Overview of Implemented Learners

Classification:
- LDA, QDA, RDA, MDA
- Logistic / multinomial regression
- k-nearest neighbours
- Naive Bayes
- Decision trees (many variants)
- Random forests
- Boosting (different variants)
- SVMs (different variants)
- Neural networks
- ...

Regression:
- Linear model
- Penalized LM: lasso and ridge
- Boosting
- k-nearest neighbours
- Decision trees, random forests
- SVMs (different variants)
- Neural / RBF networks
- ...
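
The list of integrated learners can also be queried programmatically. A small sketch, assuming the listLearners() helper behaves as documented in the tutorial (its exact arguments and return format may differ between mlr versions):

    library(mlr)

    # all learners integrated in the installed mlr version
    listLearners()

    # only classification learners that can predict posterior probabilities
    listLearners("classif", properties = "prob")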

Examples 1 + 2
- ex1.r: Training and prediction
- ex2.r: Probabilities and ROC
(a minimal sketch of the pattern follows below)
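
ex1.r and ex2.r themselves are not reproduced here; the following sketch shows the basic train/predict pattern they cover, using the iris task and LDA as arbitrary choices. Setting predict.type = "prob" is what makes a learner return posterior probabilities, the prerequisite for ROC analysis.

    library(mlr)

    task = makeClassifTask(data = iris, target = "Species")

    # ex1-style: train a model and predict (here, on the training data)
    lrn = makeLearner("classif.lda")
    mod = train(lrn, task)
    pred = predict(mod, task = task)
    performance(pred, measures = mmce)   # mean misclassification error

    # ex2-style: request posterior probabilities instead of hard labels
    lrn.prob = makeLearner("classif.lda", predict.type = "prob")
    mod.prob = train(lrn.prob, task)
    pred.prob = predict(mod.prob, task = task)
    head(as.data.frame(pred.prob))       # per-class probabilities + response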

Resampling
- Hold-out
- Cross-validation
- Bootstrap
- Subsampling
- ... and quite a few extensions
(see the sketch below for how these are constructed in mlr)
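
The strategies above correspond to resample descriptions in mlr. A minimal sketch with arbitrary iteration numbers and split ratios:

    library(mlr)

    task = makeClassifTask(data = iris, target = "Species")
    lrn = makeLearner("classif.rpart")

    # a few of the available resample descriptions
    holdout = makeResampleDesc("Holdout", split = 2/3)
    cv10 = makeResampleDesc("CV", iters = 10)
    boot = makeResampleDesc("Bootstrap", iters = 30)
    subs = makeResampleDesc("Subsample", iters = 30, split = 0.8)

    # the same resample() call works for all of them
    r = resample(lrn, task, cv10)
    r$aggr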

Example 3
- ex3.r: Resampling + comparison
(a sketch of such a comparison follows below)
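
ex3.r is not shown here; comparing several learners under the same resampling scheme can be sketched with benchmark(), roughly as follows. The chosen learners are arbitrary, and the accessor functions for benchmark results may differ between mlr versions.

    library(mlr)

    task = makeClassifTask(data = iris, target = "Species")
    lrns = list(
      makeLearner("classif.lda"),
      makeLearner("classif.rpart"),
      makeLearner("classif.randomForest")   # needs the randomForest package
    )
    rdesc = makeResampleDesc("CV", iters = 10)

    # run all learners on the task under the same resampling scheme
    bmr = benchmark(lrns, task, rdesc)
    bmr   # printing shows the aggregated performance per learner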

Remarks on Model Selection I
- Basic machine learning: fit the parameters of a model to predict new data.
  The generalisation error is commonly estimated by resampling, e.g. 10-fold cross-validation.
- 2nd, 3rd, ... level of inference: comparing inducers or hyperparameters is harder.
  Feature selection either lives on the 2nd level or adds a 3rd one ...
- Statistical comparisons on the 2nd stage are non-trivial and still active research.
- Very likely that high-performance computing is needed.

Tuning / Grid Search
- Tuning is used to find the best hyperparameters for a method in a data-dependent way.
- Grid search is the most basic method: exhaustively try all combinations of a finite grid.
- Inefficient: it searches large, irrelevant areas and suffers from combinatorial explosion.
- Reasonable for continuous parameters?
- Still often the default method.
(a sketch of grid search in mlr follows below)
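
As an illustration, here is a sketch of grid search in mlr (not the actual ex4.r): the grid is declared as a parameter set and tuneParams() evaluates every grid point by resampling. The rpart hyperparameters and their values are arbitrary choices.

    library(mlr)

    task = makeClassifTask(data = iris, target = "Species")
    lrn = makeLearner("classif.rpart")
    rdesc = makeResampleDesc("CV", iters = 5)

    # finite grid over two rpart hyperparameters
    ps = makeParamSet(
      makeDiscreteParam("cp", values = c(0.001, 0.01, 0.1)),
      makeDiscreteParam("minsplit", values = c(5, 10, 20))
    )
    ctrl = makeTuneControlGrid()

    # exhaustively evaluates all 3 x 3 combinations by 5-fold CV
    res = tuneParams(lrn, task, rdesc, par.set = ps, control = ctrl)
    res$x   # best parameter combination found
    res$y   # its cross-validated performance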

Example 4
- ex4.r: Basic tuning with grid search

Remarks on Model Selection II
Salzberg (1997): "On comparing classifiers: Pitfalls to avoid and a recommended approach"
- Many articles do not contain sound statistical methodology.
- Compare against enough reasonable algorithms.
- Do not just report mean performance values.
- Do not cheat through repeated tuning; use double cross-validation for tuning and evaluation.
- Apply the correct test (check its assumptions) and adjust for multiple testing.
- Think about independence and distributions.
- Do not rely solely on UCI data sets; we might overfit on them.

Nested Resampling

Example 5
- ex5.r: Tuning + nested resampling via wrappers
(a sketch of the wrapper approach follows below)
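
ex5.r is not reproduced here; the wrapper approach can be sketched as follows: the learner is fused with its tuning routine via makeTuneWrapper() (inner resampling), and the fused learner is then evaluated with an ordinary outer resample() call, so tuning is redone inside every outer training set. The parameter grid and fold numbers are arbitrary.

    library(mlr)

    task = makeClassifTask(data = iris, target = "Species")
    lrn = makeLearner("classif.rpart")

    ps = makeParamSet(
      makeDiscreteParam("cp", values = c(0.001, 0.01, 0.1))
    )
    ctrl = makeTuneControlGrid()
    inner = makeResampleDesc("CV", iters = 5)    # inner loop: tuning
    outer = makeResampleDesc("CV", iters = 10)   # outer loop: evaluation

    # fuse learner + tuning; tuning happens on each outer training set only
    lrn.tuned = makeTuneWrapper(lrn, resampling = inner,
                                par.set = ps, control = ctrl)
    r = resample(lrn.tuned, task, outer)
    r$aggr   # performance estimate of the tuned learner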

Outlook / Not Shown Today I
Already available:
- Smarter tuning algorithms: CMA-ES, iterated F-racing, model-based optimization, ...
- Variable selection
  - Filters: simply imported from available R methods
  - Wrappers: forward, backward, stochastic search, GAs
- Bagging for arbitrary base learners
- Generic imputation for missing values
- Wrapping / tuning of preprocessing
- Regular cost-sensitive learning (class-specific costs)
- ...

Outlook / Not Shown Today II
Current work, most of it nearly done:
- Survival analysis: tasks, learners and measures
- Cost-sensitive learning (example-dependent costs)
- Over-/undersampling for unbalanced class sizes
- Multi-criteria optimization
- ...