Using multiple models: Bagging, Boosting, Ensembles, Forests
Bagging
- Combine predictions from multiple models
- The different models are obtained from bootstrap samples of the training data
- Average the predictions (regression) or take a majority vote (classification) across the models
- If different training datasets cause significant differences in the learned model (an unstable learner), then bagging can improve accuracy (a minimal sketch follows)
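As a concrete illustration, here is a minimal bagging sketch in Python with scikit-learn; the library, dataset, and 50-sample setting mirror the tables below, but the slides do not prescribe any particular implementation.

```python
# Minimal bagging sketch: 50 trees, each fit on a bootstrap sample of the
# training data, combined by majority vote. Library and dataset are
# illustrative choices, not from the slides.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

single = DecisionTreeClassifier()                     # unstable base learner
bagged = BaggingClassifier(DecisionTreeClassifier(),
                           n_estimators=50)           # 50 bootstrap samples

print("single tree :", cross_val_score(single, X, y).mean())
print("bagged trees:", cross_val_score(bagged, X, y).mean())
```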
Performance (decision trees)
Misclassification rates (percent), 50 bootstrap samples each:

Data Set         e_S     e_B     Decrease
Waveform         29.0    19.4    33%
Heart            10.0     5.3    47%
Breast cancer     6.0     4.2    30%
Ionosphere       11.2     8.6    23%
Diabetes         23.4    18.8    20%
Glass            32.0    24.9    22%
Soybean          14.5    10.6    27%

(e_S = error of a single tree, e_B = error of the bagged trees.)
Performance (regression trees)
Mean squared test-set error:

Data Set          e_S       e_B       Decrease
Boston Housing    19.1      11.7      39%
Ozone             23.1      18.0      22%
Friedman #1       11.4       6.2      46%
Friedman #2       30,800    21,700    30%
Friedman #3       0.0403    0.0249    38%
Bagging
- Advantage: improved accuracy
- Disadvantage: no simple, interpretable model
- Bagging can improve performance for unstable learning algorithms; the performance of stable learning methods can deteriorate
- Bagging good models can make them optimal; bagging poor models can make them worse
Boosting
- Multiple models are developed in sequence, assigning higher weights (boosting) to the training cases that are difficult to classify
- Generate the first model
- Repeat:
  - Re-weight the training data so that the misclassified cases get higher weights
  - Generate the next model
- Combine the predictions from the individual models, weighted by the accuracy of each model (see the sketch after this list)
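A minimal AdaBoost-style sketch of this loop, assuming -1/+1 labels and decision stumps as the base learner; the function names and the choice of AdaBoost weights are illustrative assumptions, since the slides describe boosting generically.

```python
# AdaBoost-style sketch of the boosting loop: reweight misclassified cases,
# refit, and combine models weighted by their accuracy. Illustrative only.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boost(X, y, n_rounds=10):
    """y must be coded as -1/+1."""
    n = len(y)
    w = np.full(n, 1.0 / n)                      # start with uniform case weights
    models, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)         # generate the next model
        pred = stump.predict(X)
        err = max(np.sum(w[pred != y]), 1e-12)   # weighted training error (w sums to 1)
        if err >= 0.5:                           # no better than chance: stop
            break
        alpha = 0.5 * np.log((1 - err) / err)    # model weight, larger for accurate models
        w *= np.exp(-alpha * y * pred)           # misclassified cases get higher weight
        w /= w.sum()                             # renormalize the case weights
        models.append(stump)
        alphas.append(alpha)
    return models, alphas

def boosted_predict(models, alphas, X):
    # Combine the individual predictions, weighted by model accuracy.
    return np.sign(sum(a * m.predict(X) for m, a in zip(models, alphas)))
```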
Stream Data Mining
Examples:
- Surveillance of trading data for securities fraud
- Mining web click-stream data
- Monitoring network traffic
- Analysis of sensor-network data
- Measuring power consumption

Stream data:
- A continuous flow of information, vs. finite, statically stored data
- Data volumes too large to be stored on permanent devices
- Burstiness: the data rate of the stream isn't constant
- Continuously evolving patterns
Stream data mining - approaches
- Data condensation/summarization: a condensed representation is used to track the changes over time
- Temporal granularity: greater importance is given to more recent data
- Privacy: protecting personal data in the stream; k-anonymity requires at least k other data points from which a given point cannot be distinguished
- Clustering: stream data points are first represented as members of micro-clusters; these micro-clusters are then tracked, rather than the potentially infinite number of individual data points
- Classifier ensembles: separate classifiers are developed on blocks of sequential data, instead of continuously updating a single classifier; each classifier in the ensemble can be weighted by its accuracy on the current test cases (see the sketch after this list)
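A sketch of the classifier-ensemble approach; the class name, base learner, ensemble cap, and the choice to weight by accuracy on the newest block are all illustrative assumptions.

```python
# Classifier-ensemble sketch for streams: one model per sequential data
# block, weighted by accuracy on the most recent block. Illustrative only.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class BlockEnsemble:
    def __init__(self, max_models=10):
        self.models, self.weights = [], []
        self.max_models = max_models

    def update(self, X_block, y_block):
        # Re-weight existing models by accuracy on the newest block ...
        self.weights = [np.mean(m.predict(X_block) == y_block) for m in self.models]
        # ... then train a fresh model on the block, instead of updating one classifier.
        self.models.append(DecisionTreeClassifier().fit(X_block, y_block))
        self.weights.append(1.0)
        # Keep the ensemble bounded as the stream evolves.
        self.models = self.models[-self.max_models:]
        self.weights = self.weights[-self.max_models:]

    def predict(self, X):
        # Weighted vote over the current ensemble (binary 0/1 labels assumed).
        votes = sum(w * m.predict(X) for m, w in zip(self.models, self.weights))
        return (np.asarray(votes) >= 0.5 * sum(self.weights)).astype(int)
```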
Random Forests
- Multiple trees are developed
- Majority voting for the predicted class: each tree votes for the predicted class label of an example
- Developing each tree (N = training data size, M = number of variables):
  - Sample N cases at random, with replacement, and use them as the training data for this tree (a bootstrap sample)
  - Select m << M, and use m randomly chosen variables at each node, selecting the best split amongst the m variables (m is a fixed parameter)
  - Grow the full tree without pruning
- Random Forests: Leo Breiman and Adele Cutler (see the sketch after this list)
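This procedure maps directly onto, for example, scikit-learn's RandomForestClassifier; the library and dataset are assumptions for illustration (Breiman and Cutler's original code was not in Python).

```python
# Random forest sketch: bootstrap samples, m << M variables per node,
# fully grown unpruned trees, majority vote. Illustrative tooling.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

rf = RandomForestClassifier(
    n_estimators=500,      # number of trees in the forest
    max_features="sqrt",   # m << M: m = sqrt(M) randomly chosen variables per node
    bootstrap=True,        # sample N cases with replacement for each tree
    max_depth=None,        # grow each tree fully, without pruning
)
rf.fit(X, y)
print(rf.predict(X[:5]))   # majority vote over the 500 trees
```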
Random forests error rate
The error rate depends on:
- The correlation between any two trees in the forest: higher correlation increases the forest error rate
- The error rate of each individual tree in the forest: a tree with a low error rate decreases the forest error rate
Reducing m reduces the correlation between trees but increases the error rate of the individual trees; there is an optimal m value (usually within a wide range)
Random forest error rate: out-of-bag (OOB) error estimate
- Each bootstrapped sample leaves out about a third of the cases (the OOB cases)
- For each tree, pass each of its OOB cases (say, case q) through the tree and get its class prediction (say, j is the class predicted by the tree)
- The proportion of times that j does not match the true class (out of the number of times that case q is OOB) gives the error rate for this case
- Averaging the error across all training cases gives an unbiased estimate of the error (see the sketch below)
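In scikit-learn (an assumed implementation) the OOB estimate is available directly: with oob_score=True, each case is scored only by the trees whose bootstrap sample left it out.

```python
# OOB error sketch: no separate test set is needed.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
rf = RandomForestClassifier(n_estimators=500, oob_score=True).fit(X, y)
print("OOB error estimate:", 1.0 - rf.oob_score_)  # oob_score_ is OOB accuracy
```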
Random forests variable importance
Importance of a variable v:
- For each tree, run the OOB cases through the tree and find the number of correct classifications (ncorrect_a)
- For variable v, randomly permute its values amongst the OOB cases and run the OOB cases through the tree again; find the number of correct classifications (ncorrect_p)
- Calculate (ncorrect_a - ncorrect_p) for the tree
- Averaging over all trees gives the importance score of variable v (a simplified sketch follows)
Significance:
- If the importance scores are independent across trees, then z-scores can be calculated and a significance level obtained assuming a normal distribution of the scores (the correlations between importance scores across trees are low for many datasets)
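A simplified sketch of the permutation idea: the slide computes (ncorrect_a - ncorrect_p) per tree on that tree's OOB cases, whereas this version, for brevity, permutes each variable over a held-out set and scores the whole forest at once; the function name is illustrative.

```python
# Simplified permutation importance: the drop in correct classifications
# when variable v is permuted measures the importance of v.
import numpy as np

def variable_importance(rf, X, y, rng=np.random.default_rng(0)):
    ncorrect_a = np.sum(rf.predict(X) == y)        # correct count, intact data
    scores = []
    for v in range(X.shape[1]):
        Xp = X.copy()
        Xp[:, v] = rng.permutation(Xp[:, v])       # permute variable v only
        ncorrect_p = np.sum(rf.predict(Xp) == y)   # correct count, v permuted
        scores.append(ncorrect_a - ncorrect_p)     # drop = importance of v
    return np.array(scores)
```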
Random Forests - Proximities
- After each tree is built, pass all the cases through the tree
- If two cases j and k land in the same terminal node, increase their proximity score by 1
- Normalize the proximity values by dividing by the number of trees
- Proximity is a measure of the similarity between two data points
- Proximities can be used for imputing missing values (see the sketch below)
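A sketch of the proximity computation, assuming a fitted scikit-learn forest: rf.apply returns the terminal-node index of every case in every tree, so proximity is the fraction of trees in which two cases share a terminal node.

```python
# Proximity matrix sketch: prox(j, k) = fraction of trees in which
# cases j and k fall into the same terminal node.
import numpy as np

def proximity_matrix(rf, X):
    leaves = rf.apply(X)                       # shape (n_cases, n_trees)
    n = X.shape[0]
    prox = np.zeros((n, n))
    for t in range(leaves.shape[1]):
        same = leaves[:, t][:, None] == leaves[:, t][None, :]
        prox += same                           # +1 when j and k share a terminal node
    return prox / leaves.shape[1]              # normalize by the number of trees
```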
Random Forests - Missing values using proximity
If variable v in the j-th case is missing, an iterative process fills in the missing value:
- Take a weighted average over all cases k where the variable is not missing, weighted by proximity(j, k); for a categorical variable, take the most frequent value, with frequencies weighted by the proximities
- The replaced missing values are used in the next iteration of the random forest, where new proximities are calculated
- Re-calculate the fills for the missing values
- Repeat until there is no further improvement (4-6 iterations are usually adequate); one iteration of the numeric fill is sketched below
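A sketch of one iteration of the numeric fill, assuming a NaN-coded float array and a proximity matrix from the previous slide's sketch; in practice the forest and proximities are re-computed between iterations.

```python
# One iteration of the proximity-weighted fill for a numeric variable v.
import numpy as np

def fill_missing(X, v, prox):
    missing = np.isnan(X[:, v])
    observed = ~missing                          # cases k where v is not missing
    for j in np.where(missing)[0]:
        w = prox[j, observed]                    # weights = proximity(j, k)
        if w.sum() > 0:
            X[j, v] = np.average(X[observed, v], weights=w)
        else:
            X[j, v] = np.mean(X[observed, v])    # fallback: unweighted mean
    return X
```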
[Figure: Random forests - missing value replacement]
Random Forests - Detecting outliers using proximities
- Outlier: a case with low proximity to all other cases
- Measure of outlyingness for a case j: 1 / (sum of squares of prox(j, k) over all other cases k)
- As a rule of thumb, values greater than about 10 flag likely outliers (see the sketch below)
- Mislabeled data can show up as outliers: class labels can be prone to error, since labels are often assigned by hand
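A sketch of the raw outlyingness measure from the slide; Breiman's full recipe computes it within each class and normalizes the scores before applying the rule-of-thumb threshold, a refinement omitted here.

```python
# Raw outlyingness from the proximity matrix: a case with uniformly low
# proximity to all other cases gets a large score. Illustrative sketch.
import numpy as np

def outlyingness(prox):
    n = prox.shape[0]
    scores = np.empty(n)
    for j in range(n):
        others = np.delete(prox[j], j)                      # prox(j, k) for all k != j
        scores[j] = 1.0 / max(np.sum(others ** 2), 1e-12)   # 1 / sum of squares
    return scores
```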
[Figure: Outliers based on proximity (Pima Indians diabetes data, 768 cases)]
[Figure: Mislabeled cases showing up as outliers]
Random Forests - Picturing the data
- Start from the matrix of proximity values prox(j, k)
- 1 - prox(j, k) gives the squared distance between cases j and k in a high-dimensional Euclidean space
- Metric scaling projects the data into a lower-dimensional space while preserving the distances between them; the scaling coordinates are related to the eigenvectors of a modified version of the proximity matrix
- 3 or 4 scaling coordinates are usually enough
- Plot the 1st versus the 2nd scaling coordinate to get a picture of the data (see the sketch below)
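A classical metric scaling sketch: treat 1 - prox(j, k) as a squared distance, double-center it, and read the scaling coordinates off the top eigenvectors; the function name is illustrative.

```python
# Classical metric scaling from a proximity matrix.
import numpy as np

def scaling_coordinates(prox, n_coords=2):
    d2 = 1.0 - prox                              # squared Euclidean distances
    n = d2.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ d2 @ J                        # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)               # eigenvalues in ascending order
    top = np.argsort(vals)[::-1][:n_coords]      # keep the largest eigenvalues
    return vecs[:, top] * np.sqrt(np.maximum(vals[top], 0.0))
```

Plotting column 0 against column 1 of the result gives the 1st-versus-2nd scaling-coordinate picture described above.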
[Figure: Using proximities and scaling coordinates to picture the data]
Random Forests - Unsupervised learning (clustering)
- The data have no class labels; how can they be used to grow trees?
- Label the original data of N cases as class 1
- Create a synthetic dataset of N cases, labeled class 2, by random sampling from the univariate distributions of the original data:
  - Let x(v, j) be the v-th variable in the j-th case of class 1
  - For a case of class 2: select the 1st variable's value at random from the N values x(1, j), j = 1, ..., N; select the 2nd variable's value at random from the N values x(2, j), j = 1, ..., N; and so on
- Class 2 keeps the marginal distributions but destroys the dependencies between the variables present in class 1 (see the sketch below)
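A sketch of generating the synthetic second class and fitting the two-class forest; the function name and parameters are illustrative assumptions.

```python
# Unsupervised RF sketch: each variable of class 2 is drawn at random
# (with replacement) from the corresponding column of the original data,
# which keeps the marginals but destroys the dependencies.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def unsupervised_rf(X, n_trees=500, rng=np.random.default_rng(0)):
    N, M = X.shape
    # Class 2: the v-th variable is sampled from the N values x(v, j).
    X2 = np.column_stack([rng.choice(X[:, v], size=N) for v in range(M)])
    X_all = np.vstack([X, X2])
    y_all = np.r_[np.ones(N), 2 * np.ones(N)]   # class 1 = real, class 2 = synthetic
    return RandomForestClassifier(n_estimators=n_trees, oob_score=True).fit(X_all, y_all)
```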
Random Forests - Clustering
- Run a random forest on the resulting 2-class problem
- An error rate close to 50% implies the forest cannot distinguish the two classes; a low error implies good separation between the classes, i.e., a strong dependency structure amongst the variables in the original data
- All the random forest tools (proximities, scaling views, variable importance, outliers, etc.) can then be used on the original data
- Testing the approach: take a dataset with class labels, remove the labels, apply the clustering, and check whether the clusters correspond to the original classes
[Figure: Metric scaling for supervised learning data]
[Figure: RF for clustering - metric scaling for unsupervised learning data]