Model Combination, 24 November 2009
1 Model Combination. 24 November 2009
2 Plan
1. Principles of model combination
2. Resampling methods: Bagging, Random Forests, Boosting
3. Hybrid methods: Stacking, a generic algorithm for multistrategy learning
3 Model selection versus model combination. For a given learning task, we typically train several models. Model selection: choose the model that maximizes some performance criterion on a test set. Model combination: combine several models into a single aggregate model (also called an ensemble or committee).
4 Why combine models? The error of an ensemble is lower than the error of an individual model provided that: (1) the models are diverse or independent; (2) each model performs at least slightly better than chance (err < 0.5). Justification: for a set of J independent models, the number of models in error follows a binomial distribution, and the ensemble (majority-vote) error equals the probability that at least J/2 models are wrong.
5 Example: an ensemble of 21 models is in error when at least 11 of them make an error. With individual error p, P(majority in error) = Σ_{k=11}^{21} C(21, k) p^k (1 − p)^{21−k}. [Figure: binomial distribution of the number of classifiers in error, for two values of the individual error.] In general, if p < 0.5 then P(ensemble error) < P(individual error); if p > 0.5 then P(ensemble error) > P(individual error).
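The binomial argument on slides 4-5 can be checked directly. The short sketch below is an illustration added to this transcription, not part of the slides; it assumes Python's standard library, and the function name ensemble_error is illustrative. It computes the majority-vote error of J independent classifiers with individual error p.

```python
# Minimal sketch of the binomial argument: the majority vote of J independent
# classifiers, each with error rate p, errs when at least ceil(J/2) + 1 of them err
# (here: at least 11 out of J = 21).
from math import comb

def ensemble_error(J: int, p: float) -> float:
    """Probability that a majority of J independent classifiers are wrong."""
    k_min = J // 2 + 1  # at least 11 wrong classifiers when J = 21
    return sum(comb(J, k) * p**k * (1 - p)**(J - k) for k in range(k_min, J + 1))

print(ensemble_error(21, 0.3))  # ~0.026, well below the individual error 0.3
print(ensemble_error(21, 0.6))  # worse than an individual model once p > 0.5
```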
6 Phases of model combination: (1) Diversification: choose diverse models that cover different regions of the instance space. (2) Integration: combine these models to maximize the performance of the ensemble.
7 Model diversity. The diversity of two models h_1 and h_2 is quantified in terms of their error (non-)correlation. Several definitions exist; e.g., the error correlation of two models is the probability that both make the same error given that at least one of them makes an error: C_err(h_1, h_2) = P(h_1(x_i) = h_2(x_i) | h_1(x_i) ≠ y_i ∨ h_2(x_i) ≠ y_i). Many other measures of diversity have been proposed [Kuncheva, 2003].
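As a companion to this definition, the following sketch (an addition to this write-up, not from the slides; it assumes NumPy, and the function name is illustrative) estimates C_err from the predictions of two models on a labelled sample.

```python
# Estimate C_err(h1, h2) = P(h1(x) = h2(x) | h1(x) != y or h2(x) != y) on a sample.
import numpy as np

def error_correlation(pred1, pred2, y):
    pred1, pred2, y = map(np.asarray, (pred1, pred2, y))
    at_least_one_wrong = (pred1 != y) | (pred2 != y)
    if not at_least_one_wrong.any():
        return 0.0  # no errors observed on this sample
    agree = pred1 == pred2
    return float((agree & at_least_one_wrong).sum() / at_least_one_wrong.sum())

# Example: the two models err on different cases, so their error correlation is 0.
print(error_correlation([0, 1, 1, 0], [0, 0, 1, 1], [0, 1, 1, 1]))
```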
8 Diversification techniques. Resampling (vary the data used to train a given algorithm): Bagging ("bootstrap aggregation"), Random Forests (bagging + variable resampling), Boosting (resampling through adaptive weighting). Hybrid learning (vary the algorithms trained on a given dataset): Stacking ("stacked generalization") and so-called multistrategy models.
9 Integration techniques. Static: the integration procedure is fixed, e.g., vote and retain the majority, or take the mean/median of the individual predictions. Dynamic: base predictions are combined using an adaptive procedure, e.g., meta-learning.
11 Bagging. Bagging = "bootstrap aggregation". Diversification via resampling: create T bootstrap replicates of the dataset D (samples of size |D| drawn at random with replacement), then apply a given learning algorithm to the T replicates to produce T models (hypotheses). Static integration: the ensemble prediction is the mean (regression) or the uniform vote (classification) of the T models.
12 Bagging schema [diagram]. Diversification: create bootstrap replicates D_1, D_2, ..., D_T of D and apply method L to each replicate, obtaining hypotheses h_1, h_2, ..., h_T. Integration: final hypothesis H = f(h_1, h_2, ..., h_T).
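A minimal sketch of this schema, under assumptions not made by the slides: NumPy and scikit-learn as tooling, a decision tree as the base method L, and X, y given as NumPy arrays with integer class labels.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, T=50, random_state=0):
    """Train T trees, each on a bootstrap replicate of (X, y)."""
    rng = np.random.default_rng(random_state)
    n = len(X)
    models = []
    for _ in range(T):
        idx = rng.integers(0, n, size=n)   # |D| random draws with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Uniform vote: the most frequent class prediction per example."""
    preds = np.array([m.predict(X) for m in models])        # shape (T, n_samples)
    return np.array([np.bincount(col).argmax() for col in preds.T])
```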
13 [Figure: C4.5 with and without bagging.]
14 Bagging: summary. Bootstrap resampling yields highly overlapping training sets. Bagging reduces the error due to variance: it improves highly unstable, high-variance algorithms, i.e., those where small data variations lead to very different models (e.g., decision trees, neural networks). "Reasonable" values of T: about 25 for regression, about 50 for classification [Breiman, 1996].
16 Random Forests (RF). RF combines random instance selection (bagging) with random variable selection. Base learning algorithm: decision trees (CART). A forest is an ensemble of T trees. [Breiman, 2001]
17 RF algorithm. The training set TRN contains n examples and p variables. Hyperparameters: T = number of trees to build; m = number of variables to consider at each test node, m ≤ p. To build each tree: (1) create a bootstrap replicate TRN_i of TRN; (2) grow a CART tree with the following modifications: at each tree node, randomly choose m candidate variables, and do not prune the tree. Integration method: uniform vote of the T trees.
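For illustration only, this recipe roughly corresponds to scikit-learn's RandomForestClassifier (an assumption of this transcription, not a tool named by the slides), with T mapped to n_estimators and m to max_features; note that scikit-learn averages class probabilities rather than taking a strict uniform vote of hard predictions, a minor deviation from the slide.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=100,  # T: number of trees
                            max_features=2,    # m: variables drawn at each node (p = 4 here)
                            bootstrap=True,    # one bootstrap replicate per tree
                            random_state=0)
rf.fit(X, y)
print(rf.predict(X[:3]))  # aggregate prediction of the T (unpruned) trees
```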
18 Optimizing RF. The ensemble error ε_F depends on two factors: the correlation co among the trees of the forest (if co increases, ε_F increases) and the strength s_i of the individual trees, i.e., their prediction accuracy (if s_i increases, ε_F decreases). Impact of m, the number of randomly drawn variables: increasing m increases both co (which raises ε_F) and s (which lowers ε_F), hence a co vs. s trade-off. Choose m by estimating the error ε_i on the out-of-bag set VAL_i = {x_j ∈ TRN : x_j ∉ TRN_i}. Impact of T, the number of trees: T can be increased without risk of overfitting [Breiman, 2001].
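One possible way to carry out this choice of m, again assuming scikit-learn: the out-of-bag score plays the role of the validation sets VAL_i described above.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
for m in (1, 2, 3, 4):  # candidate values of m, up to p = 4 variables
    rf = RandomForestClassifier(n_estimators=200, max_features=m,
                                oob_score=True, random_state=0).fit(X, y)
    print(f"m={m}  OOB error={1 - rf.oob_score_:.3f}")  # estimate of the forest error
```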
20 Boosting: the basic idea. Diversification: sequential adaptive resampling by instance weighting. Initially, all instances have equal weight 1/|TRN|. At each of the T iterations: apply the algorithm and estimate the resubstitution error; increase the weights of misclassified cases and decrease those of correctly classified cases, focusing learning on the difficult cases. Integration by weighted voting: apply the T base models to the current instance and return the weighted majority prediction (base classifiers are weighted according to their predictive power).
21 Boosting schema [diagram]. Diversification through instance weighting: the weighted training set (TRN_1, D_1) is given to method L, producing hypothesis h_1; its error is used to reweight the data into (TRN_2, D_2), producing h_2, and so on up to (TRN_T, D_T) and h_T. Integration: final hypothesis H = f(h_1, h_2, ..., h_T).
22 Bagging versus boosting.
Diversification (resampling): Bagging works in parallel, with random draws; Boosting works sequentially, with adaptive weighting.
Integration (vote): Bagging uses a uniform vote; Boosting uses a vote weighted by 1 − ε(h_j).
23 AdaBoost algorithm.
Input: TRN = {(x_i, y_i)}, y_i ∈ {−1, +1}, algorithm L.
Initialization: D_1(i) = 1/N (uniform distribution).
For t = 1 to T:
1. h_t ← L(TRN, D_t)
2. ε_t ← Σ_{i: h_t(x_i) ≠ y_i} D_t(i) (weighted error)
3. if ε_t > 0.5 then T ← t − 1; exit
4. compute D_{t+1} by adjusting the weights of the training cases
Result: H_final(x) = weighted combination of the h_t. [Freund & Schapire, 1996]
24 Algorithmic details. L is a weak learner (ε < 0.5).
Computing D_{t+1}:
α_t = (1/2) ln((1 − ε_t)/ε_t) (since ε_t ≤ 0.5, α_t ≥ 0 and e^{α_t} ≥ 1)
D_{t+1}(i) = (D_t(i)/Z_t) · e^{α_t} if y_i ≠ h_t(x_i) (weight increases), and (D_t(i)/Z_t) · e^{−α_t} if y_i = h_t(x_i) (weight decreases);
equivalently, D_{t+1}(i) = (D_t(i)/Z_t) · exp(−α_t y_i h_t(x_i)), where Z_t is a normalization factor ensuring Σ_i D_{t+1}(i) = 1.
H_final(x) = sgn(Σ_{t=1}^{T} α_t h_t(x))
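These updates can be sketched in a few lines. This is not the authors' code: it assumes NumPy and scikit-learn, uses decision stumps as the weak learner L, and expects labels in {−1, +1} as on slide 23.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    """y must contain labels in {-1, +1}; the weak learner L is a decision stump."""
    X, y = np.asarray(X), np.asarray(y)
    N = len(X)
    D = np.full(N, 1.0 / N)                        # D_1(i) = 1/N (uniform)
    models, alphas = [], []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
        pred = h.predict(X)
        eps = D[pred != y].sum()                   # weighted resubstitution error
        if eps == 0.0 or eps >= 0.5:               # no longer a useful weak learner
            break
        alpha = 0.5 * np.log((1 - eps) / eps)      # alpha_t
        models.append(h)
        alphas.append(alpha)
        D = D * np.exp(-alpha * y * pred)          # exp(-alpha_t * y_i * h_t(x_i))
        D = D / D.sum()                            # divide by Z_t so the weights sum to 1
    return models, np.array(alphas)

def adaboost_predict(models, alphas, X):
    """H_final(x) = sgn(sum_t alpha_t h_t(x))."""
    return np.sign(sum(a * h.predict(X) for a, h in zip(alphas, models)))
```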
25 [Figure: C4.5 with and without boosting.]
26 [Figure: performance of bagging vs boosting.]
27 Boosting: summary. The power of boosting comes from adaptive resampling. Like bagging, boosting reduces variance; it also reduces bias by forcing the learner to focus on hard cases, which makes the combined hypothesis more flexible. Convergence is fast. Boosting is sensitive to noise: when base learners misclassify noisy examples, the weights of those examples keep increasing, leading to overfitting of the noise.
29 Stacking. "Stacked generalization" [Wolpert, 1992]. Level 0: diversification through the use of different learning algorithms. Level 1: integration through meta-learning.
30 Stacking schema [diagram]. Diversification: methods L_1, L_2, ..., L_T are trained on data D_0, yielding hypotheses h_1, h_2, ..., h_T. Integration: a meta-level dataset D_1 is built from the target values of D_0 and the predictions of the h_i, and a method L produces the final hypothesis H = L(D_1).
31 Level 0: Diversification.
Input: D_0 = {(x_i, y_i)}, i = 1..N, and J learning algorithms L_j.
Diversification by K-fold cross-validation:
for k = 1 to K do
  for j = 1 to J do
    h_kj ← L_j(D_0 \ TST_k)
    for each x_i ∈ TST_k: pred_ij ← h_kj(x_i)
Result: J predictions for each instance of D_0.
32 Level 1 meta-data.
Notation: i = 1...N examples; j = 1...J models.
Input: D_1 = {(z_i, y_i)}, where z_i are the base predictions and y_i the target values.
Regression: z_i = [z_i1, ..., z_iJ] ∈ R^J, where z_ij is the approximation of y_i by model h_j.
Classification, discrete classifiers: z_i = [z_i1, ..., z_iJ], with z_ij ∈ C the class predicted by model h_j for example i.
Classification, probabilistic classifiers: z_i = [z_i1, ..., z_iJ], with z_ij = (z_ij1, z_ij2, ..., z_ij|C|) the class probability distribution predicted by h_j.
33 Examples of meta-data. Level 0 (base level): Iris dataset, with L_1 = J48, L_2 = IB1, L_3 = NB. Level 1, discrete classifiers: each meta-example lists the classes predicted by J48, IB1 and NB together with the target class (e.g., example 1: J48 = setosa, IB1 = setosa, NB = versicolor, target = setosa). Level 1, probabilistic classifiers: each meta-example lists the class probability distributions (p1, p2, p3) produced by J48, IB1 and NB together with the target class.
34 Level 1: Integration through meta-learning. The level 1 algorithm L_M depends on the variables of D_1. Real-valued predictions (regression or probabilistic classification): L_M ∈ {trees, IBk, MLP, SVM, linear regressors, ...}. Discrete predictions (discrete classification): L_M ∈ {trees, IBk, Naïve Bayes, ...}. The combined model is H ← L_M(D_1).
35 Using stacked models. After the combined model H has been created, build the final base models h_j by training the J algorithms on all of D_0. In production mode, prediction for a new case x ∈ R^d is a two-step process. Level 0: generate a meta-example z ∈ R^J by setting z_j ← h_j(x) for j = 1 to J. Level 1: the combined prediction is H(z).
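Putting slides 29 to 35 together, here is a hedged end-to-end sketch, not the authors' implementation: it assumes scikit-learn, the Iris data, probabilistic base classifiers loosely mirroring the J48/IB1/NB example, and logistic regression as an illustrative choice of level-1 algorithm L_M. Level-0 meta-data are built by K-fold cross-validation, a meta-learner is trained on them, the base models are refit on all of D_0, and prediction follows the two-step production scheme above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)                           # D_0
base_learners = [DecisionTreeClassifier(random_state=0),    # a J48-like tree
                 KNeighborsClassifier(n_neighbors=1),        # IB1-like
                 GaussianNB()]                               # NB

# Level 0: meta-data D_1 = class-probability predictions obtained by K-fold CV (K = 5)
Z = np.hstack([cross_val_predict(L, X, y, cv=5, method="predict_proba")
               for L in base_learners])

# Level 1: meta-learning on D_1 (the choice of meta-learner is an assumption)
meta = LogisticRegression(max_iter=1000).fit(Z, y)

# Final base models h_j: retrain each algorithm on all of D_0
base_models = [L.fit(X, y) for L in base_learners]

def stacked_predict(x):
    """Production mode: build the level-0 meta-example, then predict at level 1."""
    z = np.hstack([h.predict_proba(x.reshape(1, -1)) for h in base_models])
    return meta.predict(z)

print(stacked_predict(X[0]))
```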
36 Stacking: summary. Stacking improves considerably over cross-validated model selection. On average, 3 base hypotheses play a significant role in the combined hypothesis. Stacking is just one example of hybrid model combination; many others have been proposed.
37 Generic multistrategy learning algorithm. Diversification: train M learning algorithms on a base-level dataset and measure their performance on a test set; select the J models with minimal error correlation. Integration: static, for continuous predictions compute the mean, median, a linear combination, etc., and for discrete predictions use a uniform or weighted vote (see the sketch below); dynamic, use meta-learning.
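The static integration options can be written down directly; the sketch below is an illustration (NumPy assumed, function names illustrative) that combines continuous predictions by their mean and discrete integer-coded predictions by a uniform or weighted vote.

```python
import numpy as np

def combine_continuous(preds):
    """preds: array of shape (J, n_samples); static integration by the mean."""
    return np.mean(preds, axis=0)        # the median or a linear combination also fit here

def combine_discrete(preds, weights=None):
    """preds: integer class labels, shape (J, n_samples); uniform or weighted vote."""
    J, n = preds.shape
    weights = np.ones(J) if weights is None else np.asarray(weights)
    scores = np.zeros((preds.max() + 1, n))
    for j in range(J):
        scores[preds[j], np.arange(n)] += weights[j]   # add model j's (weighted) vote
    return scores.argmax(axis=0)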
38 References
L. Breiman (1996). Bagging Predictors. Machine Learning 24(2).
L. Breiman (2001). Random Forests. Machine Learning 45.
T. G. Dietterich (1998). Machine Learning Research: Four Current Directions. AI Magazine 18(4).
Y. Freund, R. E. Schapire (1996). Experiments with a New Boosting Algorithm. ICML-96.
L. Kuncheva, C. J. Whitaker (2003). Measures of Diversity in Classifier Ensembles and Their Relationship with the Ensemble Accuracy. Machine Learning 51.
D. Wolpert (1992). Stacked Generalization. Neural Networks 5.