Model Combination. 24 November 2009


 Evelyn Brooks
Transcription
1 Model Combination. 24 November 2009. Datamining
2 Plan
1. Principles of model combination
2. Resampling methods: Bagging, Random Forests, Boosting
3. Hybrid methods: Stacking, Generic algorithm for multistrategy learning
3 Principles of model combination: model selection versus model combination
For a given learning task, we typically train several models.
- Model selection: choose the model that maximizes some performance criterion on a test set.
- Model combination: combine several models into a single aggregate model (also known as an ensemble or committee).
4 Why combine models?
The error of an ensemble is lower than the error of an individual model, provided that:
1. the models are diverse or independent;
2. each model performs at least slightly better than chance: err < 0.5.
Justification: the error probability of a set of J models
1. follows a binomial distribution (this assumes independent models);
2. equals the probability that a majority of the models (more than J/2) are wrong.
5 Ex. An ensemble of 21 models is in error when at least 11 of them make an error.
[Figure: two panels, each for a different individual error (0.3 in the first), plotting the binomial probability of each number of classifiers in error and giving P(maj in error). x-axis: # classifiers in error.]
- p < 0.5: P(ensemble error) < P(individual error)
- p > 0.5: P(ensemble error) > P(individual error)
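The binomial argument can be checked numerically. The sketch below assumes J independent models that all share the same individual error p, and computes $P(\text{ensemble error}) = \sum_{k > J/2} \binom{J}{k} p^k (1-p)^{J-k}$. The function name `majority_error` is chosen here for illustration only.

```python
from math import comb

def majority_error(n_models: int, p: float) -> float:
    """P(more than half of n_models independent models err), each with error rate p."""
    k_min = n_models // 2 + 1  # smallest number of wrong models that forms a majority
    return sum(comb(n_models, k) * p**k * (1 - p)**(n_models - k)
               for k in range(k_min, n_models + 1))

# 21 models, as in the slide: the ensemble beats an individual model only when p < 0.5.
for p in (0.3, 0.5, 0.7):
    print(f"individual error {p}: ensemble error {majority_error(21, p):.4f}")
```

With an odd number of models and p = 0.5 the ensemble error is exactly 0.5, which is the crossover point between the two regimes listed above.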
6 Phases of model combination
1. Diversification: choose diverse models to cover different regions of the instance space.
2. Integration: combine these models to maximize the performance of the ensemble.
7 Model diversity
The diversity of two models $h_1$ and $h_2$ is quantified in terms of the (non)correlation of their errors. Definitions vary; for example, the error correlation of two models is the probability that both make the same error given that at least one of them makes an error:
$C_{err}(h_1, h_2) = P\big(h_1(x_i) = h_2(x_i) \mid h_1(x_i) \neq y_i \lor h_2(x_i) \neq y_i\big)$
Many other measures of diversity have been proposed [Kuncheva, 2003].
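As a rough illustration, the error correlation defined above can be estimated from two prediction vectors and the true targets. This is a minimal sketch; the function name and the convention of returning 0.0 when neither model ever errs are assumptions made here, not part of the slides.

```python
def error_correlation(pred1, pred2, targets):
    """Estimate C_err(h1, h2) = P(h1(x) = h2(x) | h1(x) != y or h2(x) != y)."""
    # Keep only the examples on which at least one model is wrong.
    at_least_one_wrong = [(a, b) for a, b, y in zip(pred1, pred2, targets)
                          if a != y or b != y]
    if not at_least_one_wrong:
        return 0.0  # assumed convention: no errors, hence no shared errors
    # Agreeing on such an example means both models made the same error.
    same = sum(1 for a, b in at_least_one_wrong if a == b)
    return same / len(at_least_one_wrong)

y  = ["a", "b", "a", "b", "a", "b"]
h1 = ["a", "b", "b", "a", "a", "a"]
h2 = ["a", "b", "b", "b", "b", "a"]
print(error_correlation(h1, h2, y))
```

Lower values indicate more diverse models: their mistakes fall on different examples or take different forms.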
8 Diversification techniques
Resampling: vary the data used to train a given algorithm
- Bagging: "bootstrap aggregation"
- Random Forests: bagging + "variable resampling"
- Boosting: resampling through adaptive weighting
Hybrid learning: vary the algorithms trained on a given dataset
- Stacking: "stacked generalization"
- So-called multistrategy models
9 Integration techniques
- Static: the integration procedure is fixed, e.g., vote and retain the majority, or take the mean/median of the individual predictions.
- Dynamic: base predictions are combined using an adaptive procedure, e.g., metalearning.
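The static rules mentioned above fit in a few lines. This is a minimal sketch using only the Python standard library; the tie-breaking behaviour of `majority_vote` (first label encountered wins) is a choice made here.

```python
from collections import Counter
from statistics import mean, median

def majority_vote(labels):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]

# Classification: three base models vote.
print(majority_vote(["spam", "ham", "spam"]))

# Regression: the median is less affected by one outlying base prediction.
predictions = [2.0, 3.0, 10.0]
print(mean(predictions), median(predictions))
```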
11 Resampling techniques: Bagging
Bagging = "bootstrap aggregation"
Diversification via resampling:
- Create T bootstrap replicates of the dataset D (samples of size |D| drawn at random with replacement).
- Apply a given learning algorithm to the T replicates to produce T models or hypotheses.
Static integration:
- Ensemble prediction = mean (regression) or uniform vote (classification) of the T models.
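The two bagging phases can be sketched as follows. The base learner `learn_stump` (a single-threshold classifier on one numeric feature) and the toy dataset are hypothetical stand-ins for "a given learning algorithm"; any function that maps a sample to a hypothesis could be passed instead.

```python
import random
from collections import Counter

def bootstrap(data, rng):
    """Draw len(data) examples from data uniformly at random, with replacement."""
    return [rng.choice(data) for _ in data]

def bagging(data, learn, n_models, seed=0):
    """Diversification: train n_models hypotheses, each on its own bootstrap replicate."""
    rng = random.Random(seed)
    return [learn(bootstrap(data, rng)) for _ in range(n_models)]

def vote(models, x):
    """Static integration: uniform vote of the ensemble (classification)."""
    return Counter(h(x) for h in models).most_common(1)[0][0]

def learn_stump(sample):
    """Hypothetical base learner: best single threshold on a 1-D feature."""
    best = None
    for threshold in sorted({x for x, _ in sample}):
        for left, right in ((0, 1), (1, 0)):
            errors = sum((left if x <= threshold else right) != y for x, y in sample)
            if best is None or errors < best[0]:
                best = (errors, threshold, left, right)
    _, threshold, left, right = best
    return lambda x: left if x <= threshold else right

data = [(x, int(x > 5)) for x in range(11)]   # toy dataset: class 1 when x > 5
ensemble = bagging(data, learn_stump, n_models=25)
print(vote(ensemble, 2), vote(ensemble, 9))
```

For regression, `vote` would be replaced by the mean of the individual predictions.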
12 Bagging schema
Diversification: create bootstrap replicates of D
  D -> Data D_1, Data D_2, ..., Data D_T
  Method L applied to each replicate -> Hypothesis h_1, Hypothesis h_2, ..., Hypothesis h_T
Integration: final hypothesis $H = f(h_1, h_2, \ldots, h_T)$
13 [Figure: C4.5 with and without bagging]
14 Bagging: summary
- Bootstrap resampling yields highly overlapping training sets.
- Reduces error due to variance: improves highly unstable or high-variance algorithms, i.e., those where small variations in the data lead to very different models (e.g., decision trees, neural networks).
- "Reasonable" values of T: 25 for regression, 50 for classification [Breiman, 1996].
16 Random Forests (RF)
- RF combines random instance selection (bagging) with random variable selection.
- Base learning algorithm: decision trees (CART).
- Forest = an ensemble of T trees. [Breiman, 2001]
17 RF algorithm
TRN contains n examples and p variables.
Hyperparameters: T = number of trees to build; m = number of variables to choose at each test node, $m \le p$.
To build each tree:
1. Create a bootstrap replicate $TRN_i$ of TRN.
2. Create a CART tree with the following modifications:
   - at each tree node, randomly choose m candidate variables;
   - do not prune the tree.
Integration method: uniform vote of the T trees.
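The procedure above can be sketched in plain Python. This is an illustrative toy, not Breiman's implementation: it assumes numeric variables, Gini impurity as the split criterion, class labels that are not tuples, and a fallback to all variables when the m sampled ones are constant in a node. All function names are chosen here.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels, variables):
    """Best (score, variable, threshold) among the given variables, or None."""
    best = None
    for v in variables:
        # The largest value is skipped: splitting there leaves the right side empty.
        for t in sorted({r[v] for r in rows})[:-1]:
            left = [y for r, y in zip(rows, labels) if r[v] <= t]
            right = [y for r, y in zip(rows, labels) if r[v] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, v, t)
    return best

def grow_tree(rows, labels, m, rng):
    """Grow an unpruned tree; each node looks at m randomly chosen variables."""
    if len(set(labels)) == 1:
        return labels[0]                                # pure leaf
    p = len(rows[0])
    split = best_split(rows, labels, rng.sample(range(p), m))
    if split is None:                                   # sampled variables are constant here,
        split = best_split(rows, labels, range(p))      # so fall back to all (sketch-level choice)
    if split is None:                                   # identical rows with conflicting labels
        return Counter(labels).most_common(1)[0][0]
    _, v, t = split
    left = [i for i, r in enumerate(rows) if r[v] <= t]
    right = [i for i, r in enumerate(rows) if r[v] > t]
    return (v, t,
            grow_tree([rows[i] for i in left], [labels[i] for i in left], m, rng),
            grow_tree([rows[i] for i in right], [labels[i] for i in right], m, rng))

def predict_tree(tree, row):
    """Follow the splits down to a leaf; internal nodes are tuples, leaves are labels."""
    while isinstance(tree, tuple):
        v, t, left, right = tree
        tree = left if row[v] <= t else right
    return tree

def random_forest(rows, labels, n_trees, m, seed=0):
    """Build n_trees trees, each on a bootstrap replicate of the training set."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # bootstrap replicate TRN_i
        forest.append(grow_tree([rows[i] for i in idx], [labels[i] for i in idx], m, rng))
    return forest

def predict_forest(forest, row):
    """Integration: uniform vote of the trees."""
    return Counter(predict_tree(tree, row) for tree in forest).most_common(1)[0][0]

# Toy data: the class depends on the first variable only; the second is noise.
rows = [(a, b) for a in range(6) for b in range(6)]
labels = [int(a >= 3) for a, _ in rows]
forest = random_forest(rows, labels, n_trees=25, m=1)
print(predict_forest(forest, (0, 2)), predict_forest(forest, (5, 2)))
```

With m = 1 a tree may split on the noise variable first, which is exactly the extra randomization that distinguishes a random forest from plain bagging of trees.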
18 Optimizing RF
The ensemble error $\epsilon_F$ depends on two factors:
- the correlation co among the trees of the forest: if co increases, then $\epsilon_F$ increases;
- the strength $s_i$ of the individual trees (strength reflects prediction accuracy): if $s_i$ increases, then $\epsilon_F$ decreases.
Impact of m, the number of randomly drawn variables:
- Increasing m increases both co (which raises $\epsilon_F$) and s (which lowers $\epsilon_F$): a co versus s tradeoff.
- Choose m by estimating $\epsilon_i$ on $VAL_i = \{x_j \in TRN \mid x_j \notin TRN_i\}$, the examples left out of the bootstrap replicate $TRN_i$.
Impact of T, the number of trees:
- T can be increased without risk of overfitting [Breiman, 2001].
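The validation set $VAL_i$ comes for free with each bootstrap replicate: it is the set of training examples that were never drawn. A minimal sketch, working on example indices (the function name is chosen here):

```python
import random

def bootstrap_with_oob(n, rng):
    """Return (in_bag, out_of_bag) index lists for a training set of n examples."""
    in_bag = [rng.randrange(n) for _ in range(n)]      # TRN_i: n draws with replacement
    out_of_bag = sorted(set(range(n)) - set(in_bag))   # VAL_i: examples never drawn
    return in_bag, out_of_bag

rng = random.Random(0)
in_bag, out_of_bag = bootstrap_with_oob(1000, rng)
print(len(out_of_bag) / 1000)   # fraction of TRN available to validate tree i
```

An example is left out of a replicate with probability $(1 - 1/n)^n$, which tends to $e^{-1} \approx 0.368$ for large n, so roughly a third of the training set is available to estimate each tree's error.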
20 Boosting: the basic idea
Diversification: sequential adaptive resampling by instance weighting
- Initially, all instances have equal weights: 1/|TRN|.
- At each of the T iterations:
  - apply the algorithm and estimate the resubstitution error;
  - increase the weights of misclassified cases and decrease those of correctly classified cases (this focuses learning on difficult cases).
Integration by weighted voting:
- Apply the T base models to the current instance.
- Return the weighted majority prediction (each base classifier's weight reflects its predictive power).
21 Boosting schema
Diversification through instance weighting:
  TRN_1, D_1 -> Method L -> Hypothesis h_1 -> error
  TRN_2, D_2 -> Method L -> Hypothesis h_2 -> error
  ...
  TRN_T, D_T -> Method L -> Hypothesis h_T
Integration: final hypothesis $H = f(h_1, h_2, \ldots, h_T)$
22 Bagging versus boosting
                                Bagging         Boosting
Diversification: resampling
  1.                            in parallel     sequentially
  2.                            random draws    adaptive weighting
Integration: vote
  3.                            uniform         weighted by $1 - \epsilon(h_j)$
23 AdaBoost algorithm
Input: $TRN = \{(x_i, y_i)\}$, $y_i \in \{-1, +1\}$, learning algorithm L
Initialization: $D_1(i) = 1/N$ (uniform distribution)
For t = 1 to T:
1. $h_t \leftarrow L(TRN, D_t)$
2. $\epsilon_t \leftarrow \sum_{i:\, h_t(x_i) \neq y_i} D_t(i)$ (weighted error)
3. if $\epsilon_t > 0.5$ then $T \leftarrow t - 1$; exit
4. Compute $D_{t+1}$ by adjusting the weights of the training cases
Result: $H_{final}(x)$, a weighted combination of the $h_t$'s
[Freund & Schapire]
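24 Algorithmic details
L is a weak learner ($\epsilon < 0.5$).
Computing $D_{t+1}$:
$\alpha_t \leftarrow \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$ (since $\epsilon_t \le 0.5$, $\alpha_t \ge 0$ and $e^{\alpha_t} \ge 1$)
$D_{t+1}(i) \leftarrow \frac{D_t(i)}{Z_t} \times \begin{cases} e^{\alpha_t} & \text{if } y_i \neq h_t(x_i) \text{ (weight increases)} \\ e^{-\alpha_t} & \text{if } y_i = h_t(x_i) \text{ (weight decreases)} \end{cases} = \frac{D_t(i)}{Z_t} \exp(-\alpha_t y_i h_t(x_i))$
$Z_t$ is a normalization factor chosen so that $\sum_i D_{t+1}(i) = 1$.
$H_{final}(x) \leftarrow \mathrm{sgn}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$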
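The AdaBoost update rules can be traced with a small, self-contained sketch. The weak learner `learn_stump` (a threshold on a single numeric feature) and the toy data are hypothetical; the handling of a perfect hypothesis ($\epsilon_t = 0$, where $\alpha_t$ would be infinite) is an assumption added here, since the slides do not cover that case.

```python
import math

def learn_stump(xs, ys, weights):
    """Hypothetical weak learner: 1-D threshold stump minimizing the weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for sign in (1, -1):
            preds = [sign if x <= threshold else -sign for x in xs]
            err = sum(w for p, y, w in zip(preds, ys, weights) if p != y)
            if best is None or err < best[0]:
                best = (err, threshold, sign)
    _, threshold, sign = best
    return lambda x: sign if x <= threshold else -sign

def adaboost(xs, ys, n_rounds):
    """Return a list of (alpha_t, h_t) pairs; labels must be -1 or +1."""
    n = len(xs)
    dist = [1.0 / n] * n                                   # D_1(i) = 1/N
    ensemble = []
    for _ in range(n_rounds):
        h = learn_stump(xs, ys, dist)
        eps = sum(d for x, y, d in zip(xs, ys, dist) if h(x) != y)   # weighted error
        if eps > 0.5:                                      # worse than chance: stop
            break
        if eps == 0:                                       # perfect hypothesis (case assumed here):
            ensemble.append((1.0, h))                      # keep it with a finite weight and stop
            break
        alpha = 0.5 * math.log((1 - eps) / eps)
        # D_{t+1}(i) = D_t(i) * exp(-alpha_t * y_i * h_t(x_i)) / Z_t
        dist = [d * math.exp(-alpha * y * h(x)) for x, y, d in zip(xs, ys, dist)]
        z = sum(dist)                                      # normalization factor Z_t
        dist = [d / z for d in dist]
        ensemble.append((alpha, h))
    return ensemble

def predict(ensemble, x):
    """H_final(x) = sgn(sum_t alpha_t * h_t(x)); a zero sum is mapped to +1 here."""
    score = sum(alpha * h(x) for alpha, h in ensemble)
    return 1 if score >= 0 else -1

# Toy problem that no single stump can solve: +, -, + along the axis.
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, 1]
ensemble = adaboost(xs, ys, n_rounds=3)
print([predict(ensemble, x) for x in xs])
```

On this toy problem the best single stump misclassifies three of the ten points, while the weighted combination of three stumps classifies all ten training points correctly.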
25 [Figure: C4.5 with and without boosting]
26 [Figure: Performance of bagging vs boosting]
27 Boosting: summary
- The power of boosting comes from adaptive resampling.
- Like bagging, boosting reduces variance.
- It also reduces bias by obliging the learner to focus on hard cases, which makes the combined hypothesis more flexible.
- Fast convergence.
- Sensitive to noise: when base learners misclassify noisy examples, the weights of those examples increase, which leads to overfitting to noise.
29 Hybrid methods: Stacking
"Stacked generalization" [Wolpert 1992]
- Level 0: diversification through the use of different learning algorithms
- Level 1: integration through metalearning
30 Stacking schema
Diversification: Data D_0 -> Method L_1, Method L_2, ..., Method L_T -> Hypothesis h_1, Hypothesis h_2, ..., Hypothesis h_T
Integration: Data D_1 = target values of D_0 + predictions of the h_i's -> Method L -> Final hypothesis = L(D_1)
31 Level 0: Diversification
Input: $D_0 = \{(x_i, y_i)\}$, i = 1..N, and J learning algorithms $L_j$
Diversification by K-fold cross-validation:
for k = 1 to K do
  for j = 1 to J
    $h_{kj} \leftarrow L_j(D_0 \setminus TST_k)$
    foreach $x_i \in TST_k$
      $pred_{ij} \leftarrow h_{kj}(x_i)$
Result: J predictions for each instance of $D_0$
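The cross-validation loop above can be sketched as follows. The fold assignment (example i goes to fold i mod K) and the two base learners, a majority-class predictor and a 1-nearest-neighbour predictor on a 1-D feature, are assumptions made for this illustration.

```python
from collections import Counter

def level0_predictions(xs, ys, learners, k_folds):
    """Return z, where z[i][j] is learner j's prediction for x_i,
    made by a model trained without the fold that contains x_i."""
    n = len(xs)
    z = [[None] * len(learners) for _ in range(n)]
    for k in range(k_folds):
        test_idx = [i for i in range(n) if i % k_folds == k]          # TST_k
        train = [(xs[i], ys[i]) for i in range(n) if i % k_folds != k]
        for j, learn in enumerate(learners):
            h = learn(train)                        # h_kj <- L_j(D_0 without TST_k)
            for i in test_idx:
                z[i][j] = h(xs[i])                  # pred_ij <- h_kj(x_i)
    return z

def learn_majority(train):
    """Hypothetical learner: always predicts the most frequent training class."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: label

def learn_1nn(train):
    """Hypothetical learner: 1-nearest neighbour on a 1-D numeric feature."""
    return lambda x: min(train, key=lambda xy: abs(xy[0] - x))[1]

xs = [0, 1, 2, 3, 10, 11, 12, 13]
ys = ["a", "a", "a", "a", "b", "b", "b", "b"]
z = level0_predictions(xs, ys, [learn_majority, learn_1nn], k_folds=4)
d1 = list(zip(z, ys))       # level-1 dataset: base predictions plus target values
print(d1[0], d1[4])
```

Because every prediction comes from a model that never saw the example, the level-1 learner is trained on an honest picture of how each base model behaves on unseen data.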
32 Level 1 metadata
Notation: i = 1...N examples; j = 1...J models
Input: $D_1 = \{(z_i, y_i)\}$, where the $z_i \in \mathbb{R}^J$ are the base predictions and the $y_i$ the target values
- regression: $z_i = [z_{i1}, \ldots, z_{iJ}]$, where $z_{ij}$ = approximation of $y_i$ by model $h_j$
- classification:
  - discrete classifier: $z_i = [z_{i1}, \ldots, z_{iJ}]$, with $z_{ij} \in C$ = class predicted by model $h_j$ for example i
  - probabilistic classifier: $z_i = [z_{i1}, \ldots, z_{iJ}]$, with $z_{ij} = (z_{ij1}, z_{ij2}, \ldots, z_{ij|C|})$ = class probability distribution
33 Examples of metadata
Level 0 (base level): Iris. L_1 = J48, L_2 = IB1, L_3 = NB
Level 1: discrete classifiers
  #    J48        IB1     NB          Target
  1    setosa     setosa  versicolor  setosa
  ...  virginica  setosa  virginica   virginica
Level 1: probabilistic classifiers
  #    J48 (p1 p2 p3)   IB1 (p1 p2 p3)   NB (p1 p2 p3)   Target
  ...  ...              ...              ...             setosa
  ...  ...              ...              ...             virginica
34 Level 1: Integration through metalearning
The level 1 algorithm depends on the variables of $D_1$:
- real-valued predictions (regression or probabilistic classification): $L_M \in$ {trees, IBk, MLP, SVM, linear regressors, ...}
- discrete predictions (discrete classification): $L_M \in$ {trees, IBk, Naïve Bayes, ...}
The combined model: $H \leftarrow L_M(D_1)$
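For discrete base predictions, even a very simple meta-learner shows the idea of $H \leftarrow L_M(D_1)$. The sketch below uses a hypothetical lookup-table learner, chosen here for brevity rather than taken from the list above: for each distinct vector of base predictions it returns the target most often seen with that vector in $D_1$.

```python
from collections import Counter, defaultdict

def learn_meta_table(d1, default):
    """Hypothetical level-1 learner: map each vector of base predictions to the
    most frequent target observed with it; unseen vectors get the default."""
    counts = defaultdict(Counter)
    for z, y in d1:
        counts[tuple(z)][y] += 1
    table = {z: c.most_common(1)[0][0] for z, c in counts.items()}
    return lambda z: table.get(tuple(z), default)

# Meta-examples in the style of the Iris illustration: (J48, IB1, NB) -> target.
d1 = [
    (("setosa", "setosa", "versicolor"), "setosa"),
    (("virginica", "setosa", "virginica"), "virginica"),
]
H = learn_meta_table(d1, default="setosa")
print(H(("virginica", "setosa", "virginica")))
```

A real meta-learner such as a decision tree generalizes to unseen prediction vectors, which this table cannot do; the table only makes the input and output of level 1 concrete.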
35 Using stacked models
After creating the combined model H, create the final base models $h_j$ by training the J algorithms on $D_0$.
In production mode: 2-step prediction given a new case $x \in \mathbb{R}^d$
- Level 0: generate a meta-example $z \in \mathbb{R}^J$: for j = 1 to J, $z_j \leftarrow h_j(x)$
- Level 1: combined prediction H(z)
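The two prediction steps translate directly into code. In this sketch the base models and the meta-model are plain Python callables invented for the example; in practice they would be the trained $h_j$ and H.

```python
def stacked_predict(base_models, meta_model, x):
    """Two-step prediction: build the meta-example z, then apply H."""
    z = [h(x) for h in base_models]     # level 0: z_j <- h_j(x)
    return meta_model(z)                # level 1: combined prediction H(z)

# Hypothetical regression models on a 2-D input.
def h1(x):
    return x[0] + x[1]

def h2(x):
    return 2 * x[0]

def h3(x):
    return x[1] - 1.0

def meta_mean(z):
    """Hypothetical meta-model: a fixed average of the base predictions."""
    return sum(z) / len(z)

print(stacked_predict([h1, h2, h3], meta_mean, (1.0, 3.0)))
```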
36 Stacking: summary
- Improves considerably over cross-validated model selection.
- On average, 3 hypotheses play a significant role in the combined hypothesis.
- Stacking is just one example of hybrid model combination; many other examples have been proposed.
37 Generic multistrategy learning algorithm
Diversification:
- Train M learning algorithms on a base-level dataset and measure their performance on a test set.
- Select J models with minimal error correlation.
Integration:
- Static:
  - continuous predictions: compute the mean, median, linear combination, etc.
  - discrete predictions: uniform or weighted vote
- Dynamic: use metalearning
38 References
- L. Breiman (1996). Bagging Predictors. Machine Learning 24(2).
- L. Breiman (2001). Random Forests. Machine Learning 45.
- T. G. Dietterich (1998). Machine Learning Research: Four Current Directions. AI Magazine 18(4).
- Y. Freund, R. E. Schapire (1996). Experiments with a new boosting algorithm. ICML96.
- L. Kuncheva, C. J. Whitaker (2003). Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy. Machine Learning 51.
- D. Wolpert (1992). Stacked Generalization. Neural Networks 5.
More informationII. RELATED WORK. Sentiment Mining
Sentiment Mining Using Ensemble Classification Models Matthew Whitehead and Larry Yaeger Indiana University School of Informatics 901 E. 10th St. Bloomington, IN 47408 {mewhiteh, larryy}@indiana.edu Abstract
More informationData Mining Classification: Decision Trees
Data Mining Classification: Decision Trees Classification Decision Trees: what they are and how they work Hunt s (TDIDT) algorithm How to select the best split How to handle Inconsistent data Continuous
More informationFilterBoost: Regression and Classification on Large Datasets
FilterBoost: Regression and Classification on Large Datasets Joseph K. Bradley Machine Learning Department Carnegie Mellon University Pittsburgh, PA 523 jkbradle@cs.cmu.edu Robert E. Schapire Department
More informationModel Compression. Cristian Bucilă Computer Science Cornell University. Rich Caruana. Alexandru NiculescuMizil. cristi@cs.cornell.
Model Compression Cristian Bucilă Computer Science Cornell University cristi@cs.cornell.edu Rich Caruana Computer Science Cornell University caruana@cs.cornell.edu Alexandru NiculescuMizil Computer Science
More informationPractical Data Science with Azure Machine Learning, SQL Data Mining, and R
Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be
More informationUNDERSTANDING THE EFFECTIVENESS OF BANK DIRECT MARKETING Tarun Gupta, Tong Xia and Diana Lee
UNDERSTANDING THE EFFECTIVENESS OF BANK DIRECT MARKETING Tarun Gupta, Tong Xia and Diana Lee 1. Introduction There are two main approaches for companies to promote their products / services: through mass
More informationLecture 10: Regression Trees
Lecture 10: Regression Trees 36350: Data Mining October 11, 2006 Reading: Textbook, sections 5.2 and 10.5. The next three lectures are going to be about a particular kind of nonlinear predictive model,
More informationHYBRID PROBABILITY BASED ENSEMBLES FOR BANKRUPTCY PREDICTION
HYBRID PROBABILITY BASED ENSEMBLES FOR BANKRUPTCY PREDICTION Chihli Hung 1, Jing Hong Chen 2, Stefan Wermter 3, 1,2 Department of Management Information Systems, Chung Yuan Christian University, Taiwan
More informationData Mining Classification: Alternative Techniques. InstanceBased Classifiers. Lecture Notes for Chapter 5. Introduction to Data Mining
Data Mining Classification: Alternative Techniques InstanceBased Classifiers Lecture Notes for Chapter 5 Introduction to Data Mining by Tan, Steinbach, Kumar Set of Stored Cases Atr1... AtrN Class A B
More informationEnsemble Approaches for Regression: A Survey
Ensemble Approaches for Regression: A Survey JOÃO MENDESMOREIRA, LIAADINESC TEC, FEUP, Universidade do Porto CARLOS SOARES, INESC TEC, FEP, Universidade do Porto ALíPIO MÁRIO JORGE, LIAADINESC TEC,
More informationCase Study Report: Building and analyzing SVM ensembles with Bagging and AdaBoost on big data sets
Case Study Report: Building and analyzing SVM ensembles with Bagging and AdaBoost on big data sets Ricardo Ramos Guerra Jörg Stork Master in Automation and IT Faculty of Computer Science and Engineering
More informationFortgeschrittene Computerintensive Methoden
Fortgeschrittene Computerintensive Methoden Einheit 3: mlr  Machine Learning in R Bernd Bischl Matthias Schmid, Manuel Eugster, Bettina Grün, Friedrich Leisch Institut für Statistik LMU München SoSe 2014
More informationAn Ensemble Method for Large Scale Machine Learning with Hadoop MapReduce
An Ensemble Method for Large Scale Machine Learning with Hadoop MapReduce by Xuan Liu Thesis submitted to the Faculty of Graduate and Postdoctoral Studies In partial fulfillment of the requirements For
More informationCar Insurance. Havránek, Pokorný, Tomášek
Car Insurance Havránek, Pokorný, Tomášek Outline Data overview Horizontal approach + Decision tree/forests Vertical (column) approach + Neural networks SVM Data overview Customers Viewed policies Bought
More informationOperations Research and Knowledge Modeling in Data Mining
Operations Research and Knowledge Modeling in Data Mining Masato KODA Graduate School of Systems and Information Engineering University of Tsukuba, Tsukuba Science City, Japan 3058573 koda@sk.tsukuba.ac.jp
More informationMicrosoft Azure Machine learning Algorithms
Microsoft Azure Machine learning Algorithms Tomaž KAŠTRUN @tomaz_tsql Tomaz.kastrun@gmail.com http://tomaztsql.wordpress.com Our Sponsors Speaker info https://tomaztsql.wordpress.com Agenda Focus on explanation
More informationAn Overview of Data Mining: Predictive Modeling for IR in the 21 st Century
An Overview of Data Mining: Predictive Modeling for IR in the 21 st Century Nora Galambos, PhD Senior Data Scientist Office of Institutional Research, Planning & Effectiveness Stony Brook University AIRPO
More informationCross Validation. Dr. Thomas Jensen Expedia.com
Cross Validation Dr. Thomas Jensen Expedia.com About Me PhD from ETH Used to be a statistician at Link, now Senior Business Analyst at Expedia Manage a database with 720,000 Hotels that are not on contract
More informationHong Kong Stock Index Forecasting
Hong Kong Stock Index Forecasting Tong Fu Shuo Chen Chuanqi Wei tfu1@stanford.edu cslcb@stanford.edu chuanqi@stanford.edu Abstract Prediction of the movement of stock market is a longtime attractive topic
More informationDistributed Regression For Heterogeneous Data Sets 1
Distributed Regression For Heterogeneous Data Sets 1 Yan Xing, Michael G. Madden, Jim Duggan, Gerard Lyons Department of Information Technology National University of Ireland, Galway Ireland {yan.xing,
More information