Knowledge Discovery and Data Mining




Unit # 11
Sajjad Haider, Fall 2013

Supervised Learning Process
- Data Collection/Preparation
- Data Cleaning
- Discretization (supervised/unsupervised): identification of the right number of discrete states for nominal attributes
- Feature Selection (supervised/unsupervised)
- Classification Techniques: Decision Tree, Neural Networks, Naïve Bayes, Nearest Neighbor
- Model Testing and Evaluation: Accuracy, Recall/Precision/F-Measure, Bagging, Boosting, Cross-Validation, ROC Curve, Lift Charts

The whole activity is not a sequential process. More importantly, you will always be required to make good use of common sense.

The No Free Lunch Theorem
The No Free Lunch Theorem (1996; 1997) implies that it is hopeless to look for a learning algorithm that is consistently better than all other learning algorithms. It is important to notice, however, that the NFL Theorem considers the whole problem space, that is, all possible learning tasks. In real practice we are usually interested in one given task, and in that situation the effort of trying to find the best algorithm for it is valid.

Evaluating the Accuracy of a Classifier
Holdout, random subsampling, cross-validation, and the bootstrap are common techniques for assessing accuracy based on randomly sampled partitions of the given data.

Holdout Method
The holdout method is what we have alluded to so far in our discussions about accuracy. The given data are randomly partitioned into two independent sets, a training set and a test set. Typically, two-thirds of the data are allocated to the training set and the remaining one-third to the test set. The training set is used to derive the model, whose accuracy is estimated with the test set. The estimate is pessimistic because only a portion of the initial data is used to derive the model.

Performance Evaluation
We evaluate the performance of the classifier on the testing set. With a large labeled dataset, we would typically take 2/3 for training and 1/3 for testing. What can we do if we have a small dataset? We cannot afford to set aside 1/3 for testing, and a small test set means the estimated error will be far from the true error.
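A minimal sketch of the holdout estimate described above, assuming scikit-learn is available; the dataset and classifier are illustrative only, not part of the lecture:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
# Random partition: two-thirds for training, one-third held out for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("Holdout accuracy estimate:", accuracy_score(y_test, model.predict(X_test)))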

Cross Validation
Cross-validation gives a more accurate performance evaluation on a small dataset.
K-fold cross-validation:
- Divide the labeled data set into k subsets.
- Repeat k times: take subset i as the test data and the remaining subsets as the training data; train the classifier and assess the testing error on subset i.
- Average the testing error over the k iterations.
Leave-one-out cross-validation is the special case in which k equals the number of examples, so each test fold contains a single example. (A code sketch follows the bootstrap slide below.)

Bootstrap
The bootstrap method samples the given training tuples uniformly with replacement; each time a tuple is selected, it is equally likely to be selected again and re-added to the training set. There are several bootstrap methods. A commonly used one is the .632 bootstrap, which works as follows: given a data set of d tuples, the data set is sampled d times, with replacement, resulting in a bootstrap sample (training set) of d tuples.
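A minimal sketch of the k-fold cross-validation procedure from the slide above, assuming scikit-learn; k = 10 and the classifier are illustrative choices:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

scores = cross_val_score(clf, X, y, cv=10)   # each of the 10 folds serves once as the test set
print("Mean accuracy over 10 folds:", scores.mean())

# Leave-one-out CV is the special case k = n (one fold per example).
loo_scores = cross_val_score(clf, X, y, cv=LeaveOneOut())
print("Leave-one-out accuracy:", loo_scores.mean())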

Bootstrap (Cont'd)
It is very likely that some of the original data tuples will occur more than once in this sample. The data tuples that did not make it into the training set form the test set. If we try this out several times, then on average 63.2% of the original data tuples will end up in the bootstrap sample and the remaining 36.8% will form the test set.

Bootstrap (Cont'd)
We can repeat the sampling procedure k times, where in each iteration we use the current test set to obtain an accuracy estimate of the model built from the current bootstrap sample. The overall accuracy of the model is then estimated as

    Acc(M) = (1/k) * sum_{i=1..k} [ 0.632 * Acc(M_i)_test_set + 0.368 * Acc(M_i)_train_set ]

where Acc(M_i)_test_set is the accuracy of the model obtained with bootstrap sample i when it is applied to test set i, and Acc(M_i)_train_set is the accuracy of that model when it is applied to the original set of data tuples. The bootstrap method works well with small data sets.
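A minimal sketch of the .632 bootstrap estimate above, assuming scikit-learn and NumPy; the dataset, classifier, and k = 20 repetitions are illustrative:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
d, k = len(X), 20
rng = np.random.default_rng(0)

acc_sum = 0.0
for i in range(k):
    boot = rng.integers(0, d, size=d)            # sample d tuples with replacement
    oob = np.setdiff1d(np.arange(d), boot)       # tuples left out form the test set
    model = DecisionTreeClassifier(random_state=0).fit(X[boot], y[boot])
    acc_test = accuracy_score(y[oob], model.predict(X[oob]))   # Acc(M_i) on test set i
    acc_train = accuracy_score(y, model.predict(X))            # Acc(M_i) on the original data
    acc_sum += 0.632 * acc_test + 0.368 * acc_train
print(".632 bootstrap accuracy estimate:", acc_sum / k)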

Ensemble Models
Bagging and boosting are examples of ensemble methods, that is, methods that use a combination of models. Each combines a series of k learned models (classifiers or predictors), M1, M2, ..., Mk, with the aim of creating an improved composite model, M*. Both bagging and boosting can be used for classification as well as prediction.

Bagging
The name Bagging comes from the abbreviation Bootstrap AGGregatING (1996). As the name implies, the two ingredients of Bagging are bootstrap and aggregation. Bagging applies bootstrap sampling to obtain the data subsets for training the base learners: given a training data set containing m examples, a sample of m examples is generated by sampling with replacement.

Bagging (Cont'd)
By applying this process T times, T samples of m training examples are obtained, and for each sample a base learner is trained by applying the learning algorithm. Bagging adopts the most popular strategies for aggregating the outputs of the base learners: voting for classification and averaging for regression.

Bagging
[Figure: the bagging procedure]
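A minimal sketch of bagging as described above, assuming scikit-learn: T bootstrap samples, one tree per sample, majority vote at prediction time. The dataset, base learner, and T = 50 are illustrative choices:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
single = DecisionTreeClassifier(random_state=0)
bagged = BaggingClassifier(DecisionTreeClassifier(random_state=0),
                           n_estimators=50,      # T bootstrap samples / base learners
                           bootstrap=True,       # sample with replacement
                           random_state=0)

print("Single tree accuracy :", cross_val_score(single, X, y, cv=10).mean())
print("Bagged trees accuracy:", cross_val_score(bagged, X, y, cv=10).mean())

Comparing the two cross-validated scores illustrates the variance-reduction effect discussed on the next slide.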

Bagging (Cont'd)
The bagged classifier often has significantly greater accuracy than a single classifier derived from D, the original training data. It will not be considerably worse, and it is more robust to the effects of noisy data. The increased accuracy occurs because the composite model reduces the variance of the individual classifiers. For prediction, it was theoretically proven that a bagged predictor will always have improved accuracy over a single predictor derived from D.

Boosting
In boosting, weights are assigned to each training tuple. A series of k classifiers is iteratively learned. After a classifier Mi is learned, the weights are updated to allow the subsequent classifier, Mi+1, to pay more attention to the training tuples that were misclassified by Mi. The final boosted classifier, M*, combines the votes of each individual classifier, where the weight of each classifier's vote is a function of its accuracy.

Boosting Algorithm
[Figure: pseudocode of the boosting algorithm]

Boosting Algorithm (Cont'd)
[Figure: pseudocode of the boosting algorithm, continued]

Idea Behind AdaBoost
- The algorithm is iterative.
- It maintains a distribution of weights over the training examples.
- Initially the distribution of weights is uniform.
- At successive iterations, the weight of misclassified examples is increased, forcing the weak learner to focus on the hard examples in the training set.
(See the weight-update sketch after the next slide.)

Bagging vs. Boosting
Because of the way boosting focuses on the misclassified tuples, it risks overfitting the resulting composite model to such data; the resulting boosted model may therefore sometimes be less accurate than a single model derived from the same data. Bagging is less susceptible to model overfitting. While both can significantly improve accuracy in comparison to a single model, boosting tends to achieve greater accuracy.
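A compact sketch of the AdaBoost idea above: keep a weight distribution over the training examples, up-weight the ones the current weak learner misclassifies, and combine the learners by accuracy-weighted vote. It assumes scikit-learn and NumPy, maps binary labels to {-1, +1}, and is illustrative rather than an optimized implementation:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
y = np.where(y == 1, 1, -1)                      # work with +/-1 labels
n, k = len(X), 25
w = np.full(n, 1.0 / n)                          # initially uniform weight distribution
learners, alphas = [], []

for i in range(k):
    stump = DecisionTreeClassifier(max_depth=1, random_state=i)
    stump.fit(X, y, sample_weight=w)             # weak learner trained on weighted data
    pred = stump.predict(X)
    err = np.clip(np.sum(w[pred != y]), 1e-10, 1 - 1e-10)   # weighted training error
    alpha = 0.5 * np.log((1 - err) / err)        # vote weight, a function of accuracy
    w *= np.exp(-alpha * y * pred)               # increase weight of misclassified examples
    w /= w.sum()                                 # renormalize the distribution
    learners.append(stump)
    alphas.append(alpha)

# Final boosted classifier: sign of the weighted votes of the k weak learners.
votes = sum(a * m.predict(X) for a, m in zip(alphas, learners))
print("Training accuracy of boosted ensemble:", np.mean(np.sign(votes) == y))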

Random Forest
Random Forest (2001) is a representative of the state-of-the-art ensemble methods. It is an extension of Bagging; the major difference is the incorporation of randomized feature selection. During the construction of a component decision tree, at each split the algorithm first randomly selects a subset of features and then carries out the conventional split selection within that feature subset.

Random Forest (Cont'd)
A parameter K controls the amount of randomness. When K equals the total number of features, the constructed decision tree is identical to a traditional decision tree; when K = 1, a single feature is selected at random at each split. The suggested value of K is the logarithm of the number of features. Notice that randomness is only introduced into the feature selection process, not into the choice of split points on the selected feature.
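A minimal sketch of Random Forest as described above, assuming scikit-learn; the slide's parameter K corresponds to scikit-learn's max_features, and the dataset and number of trees are illustrative:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
# max_features="log2" follows the suggestion that K be the logarithm of the number
# of features; max_features=None would make each tree an ordinary (bagged) decision
# tree, and max_features=1 picks the split feature entirely at random.
rf = RandomForestClassifier(n_estimators=100, max_features="log2", random_state=0)
print("Random Forest accuracy:", cross_val_score(rf, X, y, cv=10).mean())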

Stacking
Stacking is a general procedure in which a learner is trained to combine the individual learners. Here, the individual learners are called the first-level learners, while the combiner is called the second-level learner, or meta-learner. The basic idea is to train the first-level learners on the original training data set and then generate a new data set for training the second-level learner, in which the outputs of the first-level learners are regarded as input features while the original labels are still regarded as the labels of the new training data.

Stacking (Cont'd)
The first-level learners are often generated by applying different learning algorithms, so stacked ensembles are often heterogeneous, though it is also possible to construct homogeneous stacked ensembles.

Stacking (Cont'd)
In the training phase of stacking, a new data set needs to be generated from the first-level classifiers. If exactly the same data used to train the first-level learners are also used to generate the new data set for training the second-level learner, there is a high risk of overfitting. Hence, it is suggested that the instances used for generating the new data set be excluded from the training examples of the first-level learners, for example by using cross-validated (out-of-fold) predictions (see the sketch below).

KNIME Demo
Ensemble Model
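A minimal sketch of the stacking procedure described above (independent of the KNIME demo), assuming scikit-learn; the choice of first- and second-level learners is illustrative:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
first_level = [                                   # heterogeneous first-level learners
    ("tree", DecisionTreeClassifier(random_state=0)),
    ("nb", GaussianNB()),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
]
stack = StackingClassifier(estimators=first_level,
                           final_estimator=LogisticRegression(max_iter=1000),  # meta-learner
                           cv=5)  # out-of-fold predictions feed the second-level learner
print("Stacked ensemble accuracy:", cross_val_score(stack, X, y, cv=10).mean())

The cv parameter realizes the caveat from the slide: the meta-learner is trained on predictions the first-level learners made for instances they did not see during their own training.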