Fortgeschrittene Computerintensive Methoden
Fortgeschrittene Computerintensive Methoden
Einheit 5: mlr - Machine Learning in R
Bernd Bischl
Matthias Schmid, Manuel Eugster, Bettina Grün, Friedrich Leisch
Institut für Statistik, LMU München, SoSe 2015
mlr?
- No unifying interface for machine learning in R
- Experiments require lengthy, tedious and error-prone code
- Machine learning is (also) experimental science: we need powerful and flexible tools!
- mlr has now existed for 3-4 years (or longer?) and has grown quite large
- Still heavily in development, but official releases are stable
- Was used for nearly all of my papers
- We cannot cover everything today: short intro + overview
- Focus: resampling / model selection / benchmarking!
mlr?
- Machine learning experiments are well structured
- Definition by plugging operators together (e.g., Weka or RapidMiner)
- mlr: abstractions, glue code and some own implementations
The mlr Team
- Bernd Bischl (München)
- Michel Lang (Dortmund)
- Jakob Richter (Dortmund)
- Lars Kotthoff (Cork)
- Julia Schiffner (Düsseldorf)
- Eric Studerus (Basel)
Package + Documentation
- Main project page on GitHub. URL:
- Contains further links, tutorial, issue tracker
- Official versions are released to CRAN

How to install:

    install.packages("mlr")
    install_github("mlr", username = "berndbischl")  # from the devtools package

Documentation:
- Tutorial on project page (still growing)
- R docs and examples in HTML on project page (and in package)
Disclaimer
- Slides are here to remind me of what I want to show you, and the covered examples have to be short
- Refer to tutorial, examples and technical docs later!
- Hence: let's explore the web material a bit!
Features I
- Clear S3 / object-oriented interface
- Easy extension mechanism through S3 inheritance
- Abstract description of learners and data by properties
- Description of data and task
- Many convenience methods, generic building blocks for own experiments
- Resampling like bootstrapping, cross-validation and subsampling
- Easy tuning of hyperparameters
- Variable selection
- Benchmark experiments with 2 levels of resampling
Features II
- Growing tutorial / examples
- Extensive unit testing (testthat)
- Extensive argument checks to help the user with errors
- Parallelization through parallelMap
  - Local, socket, MPI and BatchJobs modes
  - Parallelize experiments without touching the code
  - Job granularity can be changed, e.g., so jobs don't complete too early
Remarks on S3
- Not much needed (for usage)
- Makes the extension process easy
- Extension is explained in the tutorial
- If you simply use the package, you don't really need to care!
Task Abstractions
- Regression, (cost-sens.) classification, clustering and survival tasks
- Internally: data frame with annotations (target column(s), weights, misclassification costs, ...)

    task = makeClassifTask(data = iris, target = "Species")
    print(task)
    ## Supervised task: iris
    ## Type: classif
    ## Target: Species
    ## Observations: 150
    ## Features:
    ## numerics factors ordered
    ##
    ## Missings: FALSE
    ## Has weights: FALSE
    ## Has blocking: FALSE
    ## Classes: 3
    ## setosa versicolor virginica
    ##
    ## Positive class: NA
Learner Abstractions
- 49 classification, 6 clustering, 42 regression, 10 survival methods
- Reduction algorithms for cost-sensitive classification
- Internally: functions to train and predict, parameter set and annotations

    lrn = makeLearner("classif.rpart")
    print(lrn)
    ## Learner classif.rpart from package rpart
    ## Type: classif
    ## Name: Decision Tree; Short name: rpart
    ## Class: classif.rpart
    ## Properties: twoclass,multiclass,missings,numerics,factors,ordered,prob,weights
    ## Predict-Type: response
    ## Hyperparameters: xval=0
Learner Abstractions

    getParamSet(lrn)
    ## Type len Def Constr Req Trafo
    ## minsplit integer to Inf - -
    ## minbucket integer to Inf - -
    ## cp numeric to
    ## maxcompete integer to Inf - -
    ## maxsurrogate integer to Inf - -
    ## usesurrogate discrete - 2 0,1,2 - -
    ## surrogatestyle discrete - 0 0,1 - -
    ## maxdepth integer to
    ## xval integer to Inf - -
    ## parms untyped
Learner Abstractions

    head(listLearners("classif", properties = c("prob", "multiclass")))
    ## [1] "classif.bdk"        "classif.boosting"
    ## [3] "classif.cforest"    "classif.ctree"
    ## [5] "classif.extratrees" "classif.gbm"
Learner Abstractions

    head(listLearners(iris.task))
    ## [1] "classif.bdk"        "classif.boosting"
    ## [3] "classif.cforest"    "classif.ctree"
    ## [5] "classif.extratrees" "classif.fnn"
Performance Measures
- 21 classification, 7 regression, 1 survival
- Internally: performance function, aggregation function and annotations

    print(mmce)
    ## Name: Mean misclassification error
    ## Performance measure: mmce
    ## Properties: classif,classif.multi,req.pred,req.truth
    ## Minimize: TRUE
    ## Best: 0; Worst: 1
    ## Aggregated by: test.mean
    ## Note:

    print(timetrain)
    ## Name: Time of fitting the model
    ## Performance measure: timetrain
    ## Properties: classif,classif.multi,regr,surv,costsens,cluster,req.model
    ## Minimize: TRUE
    ## Best: 0; Worst: Inf
    ## Aggregated by: test.mean
    ## Note:
Performance Measures

    listMeasures("classif")
    ## [1] "f1"          "featperc"       "mmce"
    ## [4] "tn"          "tp"             "mcc"
    ## [7] "fn"          "fp"             "npv"
    ## [10] "bac"        "timeboth"       "acc"
    ## [13] "ppv"        "multiclass.auc" "fnr"
    ## [16] "auc"        "tnr"            "ber"
    ## [19] "timepredict" "fpr"           "gmean"
    ## [22] "tpr"        "gpr"            "fdr"
    ## [25] "timetrain"
Overview of Implemented Learners

Classification:
- LDA, QDA, RDA, MDA
- Trees and forests
- Boosting (different variants)
- SVMs (different variants)
- ...

Regression:
- Linear, lasso and ridge
- Boosting
- Trees and forests
- Gaussian processes
- ...

Clustering:
- K-Means
- EM
- DBSCAN
- X-Means
- ...

Survival Analysis:
- Cox PH
- CoxBoost
- Random survival forest
- Penalized regression
- ...

Much better documented on the web page, let's go there!
Example ex1.r: Training and prediction
Resampling
- Hold-out
- Cross-validation
- Bootstrap
- Subsampling
- Stratification
- Blocking
- ... and quite a few extensions
Resampling
Resampling techniques: CV, bootstrap, subsampling, ...

    cv3f = makeResampleDesc("CV", iters = 3, stratify = TRUE)

10-fold CV of rpart on iris:

    lrn = makeLearner("classif.rpart")
    cv10f = makeResampleDesc("CV", iters = 10)
    measures = list(mmce, acc)
    resample(lrn, task, cv10f, measures)$aggr
    ## mmce.test.mean  acc.test.mean
    ##

    crossval(lrn, task)$aggr
    ## mmce.test.mean
    ##
Example ex2.r: Resampling
Benchmarking
- Compare multiple learners on multiple tasks
- Fair comparisons: same training and test sets for each learner

    data("Sonar", package = "mlbench")
    tasks = list(
      makeClassifTask(data = iris, target = "Species"),
      makeClassifTask(data = Sonar, target = "Class")
    )
    learners = list(
      makeLearner("classif.rpart"),
      makeLearner("classif.randomForest"),
      makeLearner("classif.ksvm")
    )
    benchmark(learners, tasks, cv10f, mmce)
    ## task.id           learner.id mmce.test.mean
    ## 1    iris        classif.rpart
    ## 2    iris classif.randomForest
    ## 3    iris         classif.ksvm
    ## 4   Sonar        classif.rpart
    ## 5   Sonar classif.randomForest
    ## 6   Sonar         classif.ksvm
Example ex3.r: Benchmarking
Visualizations

    plotLearnerPrediction(makeLearner("classif.randomForest"), task,
                          features = c("Sepal.Length", "Sepal.Width"))

(Plot: decision regions of the random forest over Sepal.Length vs. Sepal.Width, colored by Species: setosa, versicolor, virginica; Train: mmce=0.0933.)
Example ex4.r: ROC analysis
Remarks on model selection I

Basic machine learning:
- Fit parameters of a model to predict new data
- Generalisation error commonly estimated by resampling, e.g. 10-fold cross-validation

2nd, 3rd, ... level of inference:
- Comparing inducers or hyperparameters is harder
- Feature selection either happens on the 2nd level or adds a 3rd one ...
- Statistical comparisons on the 2nd stage are non-trivial
- Still active research
- Very likely that high-performance computing is needed
Tuning / Grid Search / Random Search

Tuning:
- Used to find the best hyperparameters for a method in a data-dependent way
- Must be done for some methods, like SVMs

Grid search:
- Basic method: exhaustively try all combinations of a finite grid
- Inefficient, combinatorial explosion
- Searches large, irrelevant areas
- Reasonable for continuous parameters?
- Still often the default method
Tuning / Grid Search / Random Search

Random search:
- Randomly draw parameters
- mlr supports all parameter types and dependencies here
- Scales better than grid search
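A minimal sketch of how such a tuning run can look in mlr (API as of mlr 2.x; the grid values and iteration counts are illustrative, not recommendations):

```r
library(mlr)

# Small discrete grid over the SVM hyperparameters C and sigma
ps = makeParamSet(
  makeDiscreteParam("C", values = 2^(-2:2)),
  makeDiscreteParam("sigma", values = 2^(-2:2))
)
ctrl.grid = makeTuneControlGrid()               # exhaustive grid search
ctrl.rnd  = makeTuneControlRandom(maxit = 20L)  # or: 20 random draws
rdesc = makeResampleDesc("CV", iters = 3)

res = tuneParams("classif.ksvm", task = iris.task, resampling = rdesc,
                 par.set = ps, control = ctrl.grid, measures = mmce)
print(res$x)  # best hyperparameter setting found
```

Swapping `ctrl.grid` for `ctrl.rnd` switches from grid to random search without changing anything else.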
Example ex5.r: Basic tuning: grid search
Remarks on Model Selection II

Salzberg (1997): "On comparing classifiers: Pitfalls to avoid and a recommended approach"
- Many articles do not contain sound statistical methodology
- Compare against enough reasonable algorithms
- Do not just report mean performance values
- Do not cheat through repeated tuning
- Use double cross-validation for tuning and evaluation
- Apply the correct test (assumptions?)
- Adapt for multiple testing
- Think about independence and distributions
- Do not rely solely on UCI: we might overfit on it
Nested Resampling
- Ensures unbiased results for meta-model optimization
- Use this for tuning and feature selection
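In mlr, nested resampling is set up by wrapping the tuner around the learner; the outer resample() call then estimates unbiased performance (a sketch, mlr 2.x API, with illustrative parameter values):

```r
library(mlr)

ps = makeParamSet(makeDiscreteParam("C", values = 2^(-2:2)))
ctrl = makeTuneControlGrid()
inner = makeResampleDesc("CV", iters = 3)   # inner loop: tuning
lrn = makeTuneWrapper("classif.ksvm", resampling = inner,
                      par.set = ps, control = ctrl)

outer = makeResampleDesc("CV", iters = 5)   # outer loop: evaluation
r = resample(lrn, iris.task, outer, extract = getTuneResult)
r$aggr  # unbiased mmce estimate; r$extract holds per-fold tuning results
```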
Example ex6.r: Tuning + nested resampling via wrappers
Parallelization
- Activate with parallelMap::parallelStart
- Backends: local, multicore, socket, MPI and BatchJobs

    parallelStart("BatchJobs")
    benchmark([...])
    parallelStop()

Parallelization levels:

    parallelGetRegisteredLevels()
    ## mlr: mlr.benchmark, mlr.resample, mlr.selectFeatures, mlr.tuneParams

- Defaults to the first possible / outermost loop
- Few iterations in benchmark (loop over learners x tasks), many in resampling:

    parallelStart("multicore", level = "mlr.resample")
Parallelization
- parallelMap is documented here:
- BatchJobs is documented here:
- Make sure to read the Wiki page for Dortmund!
Example ex7.r: Parallelization
Outlook / Not Shown Today I
- Regression
- Survival analysis
- Clustering
- Regular cost-sensitive learning (class-specific costs)
- Cost-sensitive learning (example-dependent costs)
- Smarter tuning algorithms: CMA-ES, iterated F-racing, model-based optimization, ...
- Multi-criteria optimization
- ...
Outlook / Not Shown Today II
- Variable selection
  - Filters: simply imported available R methods
  - Wrappers: forward, backward, stochastic search, GAs
- Bagging for arbitrary base learners
- Generic imputation for missing values
- Wrapping / tuning of preprocessing
- Over-/undersampling for unbalanced class sizes
- mlr connection to OpenML
- ...
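The wrapper-based variable selection mentioned above plugs into resampling exactly like tuning does (a sketch; function names as in mlr 2.x, iteration counts illustrative):

```r
library(mlr)

# Sequential forward search ("sfs") as the selection strategy
ctrl = makeFeatSelControlSequential(method = "sfs")
inner = makeResampleDesc("CV", iters = 3)   # inner loop: feature selection
lrn = makeFeatSelWrapper("classif.rpart", resampling = inner, control = ctrl)

# Outer loop gives an unbiased estimate of the whole selection + fit pipeline
outer = makeResampleDesc("CV", iters = 5)
r = resample(lrn, iris.task, outer, extract = getFeatSelResult)
r$aggr  # performance; r$extract lists the features chosen per fold
```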
II. RELATED WORK. Sentiment Mining
Sentiment Mining Using Ensemble Classification Models Matthew Whitehead and Larry Yaeger Indiana University School of Informatics 901 E. 10th St. Bloomington, IN 47408 {mewhiteh, larryy}@indiana.edu Abstract
CS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning.
Lecture Machine Learning Milos Hauskrecht [email protected] 539 Sennott Square, x5 http://www.cs.pitt.edu/~milos/courses/cs75/ Administration Instructor: Milos Hauskrecht [email protected] 539 Sennott
CLINICAL DECISION SUPPORT FOR HEART DISEASE USING PREDICTIVE MODELS
CLINICAL DECISION SUPPORT FOR HEART DISEASE USING PREDICTIVE MODELS Srpriva Sundaraman Northwestern University [email protected] Sunil Kakade Northwestern University [email protected]
Using multiple models: Bagging, Boosting, Ensembles, Forests
Using multiple models: Bagging, Boosting, Ensembles, Forests Bagging Combining predictions from multiple models Different models obtained from bootstrap samples of training data Average predictions or
Introduction to Machine Learning Using Python. Vikram Kamath
Introduction to Machine Learning Using Python Vikram Kamath Contents: 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. Introduction/Definition Where and Why ML is used Types of Learning Supervised Learning Linear Regression
Introduction to Machine Learning Lecture 1. Mehryar Mohri Courant Institute and Google Research [email protected]
Introduction to Machine Learning Lecture 1 Mehryar Mohri Courant Institute and Google Research [email protected] Introduction Logistics Prerequisites: basics concepts needed in probability and statistics
Is a Data Scientist the New Quant? Stuart Kozola MathWorks
Is a Data Scientist the New Quant? Stuart Kozola MathWorks 2015 The MathWorks, Inc. 1 Facts or information used usually to calculate, analyze, or plan something Information that is produced or stored by
Overview. Evaluation Connectionist and Statistical Language Processing. Test and Validation Set. Training and Test Set
Overview Evaluation Connectionist and Statistical Language Processing Frank Keller [email protected] Computerlinguistik Universität des Saarlandes training set, validation set, test set holdout, stratification
SURVEY REPORT DATA SCIENCE SOCIETY 2014
SURVEY REPORT DATA SCIENCE SOCIETY 2014 TABLE OF CONTENTS Contents About the Initiative 1 Report Summary 2 Participants Info 3 Participants Expertise 6 Suggested Discussion Topics 7 Selected Responses
Fortgeschrittene Computerintensive Methoden: Fuzzy Clustering Steffen Unkel
Fortgeschrittene Computerintensive Methoden: Fuzzy Clustering Steffen Unkel Institut für Statistik LMU München Sommersemester 2013 Outline 1 Setting the scene 2 Methods for fuzzy clustering 3 The assessment
Scalable Developments for Big Data Analytics in Remote Sensing
Scalable Developments for Big Data Analytics in Remote Sensing Federated Systems and Data Division Research Group High Productivity Data Processing Dr.-Ing. Morris Riedel et al. Research Group Leader,
DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES
DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES Vijayalakshmi Mahanra Rao 1, Yashwant Prasad Singh 2 Multimedia University, Cyberjaya, MALAYSIA 1 [email protected]
Data Mining of Web Access Logs
Data Mining of Web Access Logs A minor thesis submitted in partial fulfilment of the requirements for the degree of Master of Applied Science in Information Technology Anand S. Lalani School of Computer
Comparison of Data Mining Techniques used for Financial Data Analysis
Comparison of Data Mining Techniques used for Financial Data Analysis Abhijit A. Sawant 1, P. M. Chawan 2 1 Student, 2 Associate Professor, Department of Computer Technology, VJTI, Mumbai, INDIA Abstract
Machine Learning Capacity and Performance Analysis and R
Machine Learning and R May 3, 11 30 25 15 10 5 25 15 10 5 30 25 15 10 5 0 2 4 6 8 101214161822 0 2 4 6 8 101214161822 0 2 4 6 8 101214161822 100 80 60 40 100 80 60 40 100 80 60 40 30 25 15 10 5 25 15 10
Implementation of Breiman s Random Forest Machine Learning Algorithm
Implementation of Breiman s Random Forest Machine Learning Algorithm Frederick Livingston Abstract This research provides tools for exploring Breiman s Random Forest algorithm. This paper will focus on
W6.B.1. FAQs CS535 BIG DATA W6.B.3. 4. If the distance of the point is additionally less than the tight distance T 2, remove it from the original set
http://wwwcscolostateedu/~cs535 W6B W6B2 CS535 BIG DAA FAQs Please prepare for the last minute rush Store your output files safely Partial score will be given for the output from less than 50GB input Computer
