# An Introduction to Ensemble Learning in Credit Risk Modelling

Save this PDF as:

Size: px
Start display at page:

## Transcription

1 An Introduction to Ensemble Learning in Credit Risk Modelling October 15, 2014 Han Sheng Sun, BMO Zi Jin, Wells Fargo

2 Disclaimer The opinions expressed in this presentation and on the following slides are solely those of the presenters, and are not necessarily those of the employers of the presenters (BMO or Wells Fargo). The methods presented are not necessarily in use at BMO or Wells Fargo, and the employers of the presenters do not guarantee the accuracy or reliability of the information provided herein. 2

3 Outline Introduction Credit Risk Modelling Ensemble Modelling From CART to Random Forest Why Random Forest works A new algorithm Experimental Results A note on SAS implementation 3

4 Credit Risk Credit Risk refers to the risk that a borrower will default on any type of debt by failing to make required payments. The risk is primarily that of the lender and includes lost principal and interest, disruption to cash flows, and increased collection costs. For example: A consumer may fail to make a payment due on a mortgage loan, credit card, line of credit, or other loans A company is unable to repay asset-secured fixed or floating charge debt An insolvent insurance company does not pay a policy obligation Three key credit risk parameters: probability of default (PD) loss given default (LGD) exposure at default (EAD) 4

5 Credit Risk Modelling Over the past decade, banks have devoted many resources to developing internal models to better quantify their financial risks and assign regulatory and economic capitals. In the context of data mining, the risk parameter modeling is typically casted as a supervised learning problem. Using the historical information as the training data, to predict the future default behaviors of its customers. Typically, regressions models or a single decision tree model for estimating parameters (PD/LGD/EAD) will be considered. Using a combination of several models is not often considered. 5

6 ENSEMBLE 6

7 Ensemble What is an ensemble? The general idea of ensemble modelling is to combine several individual models (base learner) in a reasonable way to obtain a final model that out-performs every one of them. Why ensemble? The fundamental reason for ensemble to work is that the combination of a collection of weak learners will produce a strong prediction result. 7

8 Ensemble Condorcet s jury theorem (French mathematician Marquis de Condorcet in 1785): If each voter has an independent probability p of being correct, and the probability of a majority of voters being correct is E then: p > 0.5 implies E > p E approaches to 1, if for all p > 0.5 as the number of voter approaches infinity. 8

9 Ensemble In 1977, John W. Tukey suggested combining two linear regression models, the first for fitting the original data, the second for fitting the residuals. The resulting model is almost always better than a single regression model. 9

10 CART RANDOM FOREST 10

11 CART to Random Forest Professor Leo Breiman (UC, Berkeley) Classification and Regression tree (CART) a.k.a. Decision tree or regression tree Bootstrap Aggregation (Bagging) An ensemble of CART to increase the performance from a single tree Random Forest (RF) A further improvement of the bagging algorithm CART (1984) Bagging (1996) RF (2001) Adaptive boosting (AdaBoost) formulated by Yoav Freund and Robert Schapire (1996). 11

12 CART (Classification and Regression Tree) 12

13 CART Advantages: Good predictive power Able to handle both numerical and categorical valued inputs Robust to missing values Highly interpretable results Limitations: Greedy algorithm: locally optimal decisions are made at each node and hard to return a globally-optimal tree. Over-fitting: over-complex trees which do not generalize well. 13

14 Bootstrap Aggregating (Bagging) Bagging Algorithm: Given a training set D = {( X, Y )}, i =1,, N: 1. For each b = 1 through B Bootstrap step: Draw a bootstrap sample of D; call it. Tree building: Train a chosen base leaner (e.g. a decision tree) on 2. Output ensemble by combining results of B trained models: B Regression: average 1 F( x) f Classification: take majority vote. i i B b 1 b f b 14

15 Random Forest (RF) Random Forest Algorithm: Given a training set D = {( X, Y )}, i =1,, N: 1. For each b = 1 through B b Bootstrap step: Draw a bootstrap sample of D; call it D. Random subset step: When building f b, randomly select a subset m b of m < M predictors X before making each split, grow fb on D rather than over all possible predictors. 2. Output ensemble by combining results of B trained models: B Regression: average 1 F( x) f B b 1 Classification: take majority vote. i i b 15

16 Why Random Forest Works? 16

17 Random Forest (RF) Remarkable result: ò RF 1 s ( ) 2 s The mean correlation between any two member of the forest; The mean strength s of a typical member of the forest. The generalization error of a random forest is bounded. A good random forest: Small : reduce the correlation between individual base learners/trees. Large s: make each base learner/tree as accurate as possible. 17

18 Random Forest: Equations Definition: The set is called a random forest. Prediction error: ò P ( (, ) 0) (, ) M X Y RF X Y RF where and M (, ) ( ;, ) RF X Y m X Y dp m(, x, y) I( f ( x; ) y) I( f ( x; ) y) Strength of an individual tree: The correlation of any two trees in the forest: 18

19 A New Algorithm Ensemble by PLS 19

20 Motivation 20

21 Partial Least Squares (PLS) The PLS technique works by successively extracting factors from both X and Y such that covariance between the extracted factors is maximized. Linear decomposition of X and Y: T X TP E T Y UQ F 21

22 Ensemble by Partial Least Squares (PLS) Ensemble by Partial Least Squares Algorithm: Given a training set D = {( X, Y )}, i = 1,, N: i i 1. For each b = 1 through B b Bootstrap step: Draw a bootstrap sample of D; call it D. Random subset step: Randomly select a subset of predictors X, train a tree f b m b on D and X rather than over all possible predictors. 2. Output ensemble by combining results of B trained models: PLS: use the output of individual tree f to formula a pseudo ˆb dataset {( fˆb, y i )}, i =1,, N, and fit a partial least square model to make the final prediction. m 22

23 Experimental Results 23

24 LGD data Retail Portfolio LGD data: 3500 accounts with 499 predictive variables Randomly split into 60% training samples (2100 accounts) and 40% validation samples (1400 accounts). Empirical distribution of the target variable: 24

25 Regression Tree Tree setting: MAXBRANCH MAXDEPTH LEAFSIZE Discriminatory Power: RCAP = on Validation RCAP = on Training 25

26 Random Forest Tree setting (Base learner): MAXBRANCH MAXDEPTH LEAFSIZE Discriminatory Power: Num. of Trees RCAP on Validation RCAP on Training

27 Ensemble by PLS Tree setting (Base learner): MAXBRANCH MAXDEPTH LEAFSIZE Discriminatory Power: Num. of Trees Random Forest RCAP on Validation Samples RCAP on Training Samples Ensemble-PLS RCAP on Validation Samples RCAP on Training Samples

28 Summary Result Tree setting (Base learner): Method RCAP on Validation Regression Tree Random Forest Ensemble by PLS

29 Remarks Advantages: Strong model performance : as in summary result Resist to over-fitting/robustness: fitting a larger model on the training data rarely leads to over-fitting on the validation data. Fool proof modelling method: no transformation on the input variables, no strict requirement on variables selections, does not require the pruning of the tree. Disadvantages: Interpretability: (solution) variable importance measure ˆtree Imp ( f ) dˆ I{split at node k is on variable j} 2 j m k 1 Computation difficulty: (solution) parallel computing k 29

30 SAS Implementation PROC SURVEYSELECT PROC ARBORETUM PROC PLS SAS/IML 30

31 Reference [ Marquis de Condorcet Essay on the Application of Analysis to the Probability of Majority Decisions. 1785] [J.W. Tukey(1977) Exploratory data analysis. Addison-Wesley, Reading] [L. Breiman, J. Friedman, C.J. Stone, and R. A. Olshen, Classification and regression trees. Chapman & Hall/CRC, 1984] [L. Breiman Bagging Predictors Machine Learning 24 (2): ] [L. Breiman. Random Forests Machine Learning 45(1):5-32.] [S. Wold, A. Ruhe, H. Wold. The collinearity problem in linear regression: the partial least squares (PLS) approach to generalized inverse. SIAM Journal on Scientific and Statistical Computing 5(3): ] 31

32 Thank you 32

### Comparison of Data Mining Techniques used for Financial Data Analysis

Comparison of Data Mining Techniques used for Financial Data Analysis Abhijit A. Sawant 1, P. M. Chawan 2 1 Student, 2 Associate Professor, Department of Computer Technology, VJTI, Mumbai, INDIA Abstract

### Data Mining Practical Machine Learning Tools and Techniques

Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea

### Model Combination. 24 Novembre 2009

Model Combination 24 Novembre 2009 Datamining 1 2009-2010 Plan 1 Principles of model combination 2 Resampling methods Bagging Random Forests Boosting 3 Hybrid methods Stacking Generic algorithm for mulistrategy

### Data Mining. Nonlinear Classification

Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Nonlinear Classification Classes may not be separable by a linear boundary Suppose we randomly generate a data set as follows: X has range between 0 to 15

### Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Unit # 11 Sajjad Haider Fall 2013 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right

### Classification and Regression Trees

Classification and Regression Trees Bob Stine Dept of Statistics, School University of Pennsylvania Trees Familiar metaphor Biology Decision tree Medical diagnosis Org chart Properties Recursive, partitioning

### Knowledge Discovery and Data Mining. Bootstrap review. Bagging Important Concepts. Notes. Lecture 19 - Bagging. Tom Kelsey. Notes

Knowledge Discovery and Data Mining Lecture 19 - Bagging Tom Kelsey School of Computer Science University of St Andrews http://tom.host.cs.st-andrews.ac.uk twk@st-andrews.ac.uk Tom Kelsey ID5059-19-B &

### Decision Trees from large Databases: SLIQ

Decision Trees from large Databases: SLIQ C4.5 often iterates over the training set How often? If the training set does not fit into main memory, swapping makes C4.5 unpractical! SLIQ: Sort the values

### CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore.

CI6227: Data Mining Lesson 11b: Ensemble Learning Sinno Jialin PAN Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore Acknowledgements: slides are adapted from the lecture notes

### Using multiple models: Bagging, Boosting, Ensembles, Forests

Using multiple models: Bagging, Boosting, Ensembles, Forests Bagging Combining predictions from multiple models Different models obtained from bootstrap samples of training data Average predictions or

### REVIEW OF ENSEMBLE CLASSIFICATION

Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IJCSMC, Vol. 2, Issue.

### Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets http://info.salford-systems.com/jsm-2015-ctw August 2015 Salford Systems Course Outline Demonstration of two classification

### Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05

Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification

### Data Mining Methods: Applications for Institutional Research

Data Mining Methods: Applications for Institutional Research Nora Galambos, PhD Office of Institutional Research, Planning & Effectiveness Stony Brook University NEAIR Annual Conference Philadelphia 2014

### Introduction to Machine Learning and Data Mining. Prof. Dr. Igor Trajkovski trajkovski@nyus.edu.mk

Introduction to Machine Learning and Data Mining Prof. Dr. Igor Trajkovski trajkovski@nyus.edu.mk Ensembles 2 Learning Ensembles Learn multiple alternative definitions of a concept using different training

### Ensemble Data Mining Methods

Ensemble Data Mining Methods Nikunj C. Oza, Ph.D., NASA Ames Research Center, USA INTRODUCTION Ensemble Data Mining Methods, also known as Committee Methods or Model Combiners, are machine learning methods

### Fine Particulate Matter Concentration Level Prediction by using Tree-based Ensemble Classification Algorithms

Fine Particulate Matter Concentration Level Prediction by using Tree-based Ensemble Classification Algorithms Yin Zhao School of Mathematical Sciences Universiti Sains Malaysia (USM) Penang, Malaysia Yahya

### EXPLORING & MODELING USING INTERACTIVE DECISION TREES IN SAS ENTERPRISE MINER. Copyr i g ht 2013, SAS Ins titut e Inc. All rights res er ve d.

EXPLORING & MODELING USING INTERACTIVE DECISION TREES IN SAS ENTERPRISE MINER ANALYTICS LIFECYCLE Evaluate & Monitor Model Formulate Problem Data Preparation Deploy Model Data Exploration Validate Models

### Trees and Random Forests

Trees and Random Forests Adele Cutler Professor, Mathematics and Statistics Utah State University This research is partially supported by NIH 1R15AG037392-01 Cache Valley, Utah Utah State University Leo

### The More Trees, the Better! Scaling Up Performance Using Random Forest in SAS Enterprise Miner

Paper 3361-2015 The More Trees, the Better! Scaling Up Performance Using Random Forest in SAS Enterprise Miner Narmada Deve Panneerselvam, Spears School of Business, Oklahoma State University, Stillwater,

### A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier

A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier G.T. Prasanna Kumari Associate Professor, Dept of Computer Science and Engineering, Gokula Krishna College of Engg, Sullurpet-524121,

### Variable selection using random forests

Pattern Recognition Letters 31 (2010) January 25, 2012 Outline 1 2 Sensitivity to n and p Sensitivity to mtry and ntree 3 Procedure Starting example 4 Prostate data Four high dimensional classication datasets

### Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Evaluating the Accuracy of a Classifier Holdout, random subsampling, crossvalidation, and the bootstrap are common techniques for

### Leveraging Ensemble Models in SAS Enterprise Miner

ABSTRACT Paper SAS133-2014 Leveraging Ensemble Models in SAS Enterprise Miner Miguel Maldonado, Jared Dean, Wendy Czika, and Susan Haller SAS Institute Inc. Ensemble models combine two or more models to

### Better credit models benefit us all

Better credit models benefit us all Agenda Credit Scoring - Overview Random Forest - Overview Random Forest outperform logistic regression for credit scoring out of the box Interaction term hypothesis

### CS570 Data Mining Classification: Ensemble Methods

CS570 Data Mining Classification: Ensemble Methods Cengiz Günay Dept. Math & CS, Emory University Fall 2013 Some slides courtesy of Han-Kamber-Pei, Tan et al., and Li Xiong Günay (Emory) Classification:

### Chapter 6. The stacking ensemble approach

82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described

### Fast Analytics on Big Data with H20

Fast Analytics on Big Data with H20 0xdata.com, h2o.ai Tomas Nykodym, Petr Maj Team About H2O and 0xdata H2O is a platform for distributed in memory predictive analytics and machine learning Pure Java,

### Chapter 11 Boosting. Xiaogang Su Department of Statistics University of Central Florida - 1 -

Chapter 11 Boosting Xiaogang Su Department of Statistics University of Central Florida - 1 - Perturb and Combine (P&C) Methods have been devised to take advantage of the instability of trees to create

### TRANSACTIONAL DATA MINING AT LLOYDS BANKING GROUP

TRANSACTIONAL DATA MINING AT LLOYDS BANKING GROUP Csaba Főző csaba.fozo@lloydsbanking.com 15 October 2015 CONTENTS Introduction 04 Random Forest Methodology 06 Transactional Data Mining Project 17 Conclusions

### Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris

Class #6: Non-linear classification ML4Bio 2012 February 17 th, 2012 Quaid Morris 1 Module #: Title of Module 2 Review Overview Linear separability Non-linear classification Linear Support Vector Machines

### Tree Ensembles: The Power of Post- Processing. December 2012 Dan Steinberg Mikhail Golovnya Salford Systems

Tree Ensembles: The Power of Post- Processing December 2012 Dan Steinberg Mikhail Golovnya Salford Systems Course Outline Salford Systems quick overview Treenet an ensemble of boosted trees GPS modern

### Predicting borrowers chance of defaulting on credit loans

Predicting borrowers chance of defaulting on credit loans Junjie Liang (junjie87@stanford.edu) Abstract Credit score prediction is of great interests to banks as the outcome of the prediction algorithm

### Boosting. riedmiller@informatik.uni-freiburg.de

. Machine Learning Boosting Prof. Dr. Martin Riedmiller AG Maschinelles Lernen und Natürlichsprachliche Systeme Institut für Informatik Technische Fakultät Albert-Ludwigs-Universität Freiburg riedmiller@informatik.uni-freiburg.de

### Selecting Data Mining Model for Web Advertising in Virtual Communities

Selecting Data Mining for Web Advertising in Virtual Communities Jerzy Surma Faculty of Business Administration Warsaw School of Economics Warsaw, Poland e-mail: jerzy.surma@gmail.com Mariusz Łapczyński

### Gerry Hobbs, Department of Statistics, West Virginia University

Decision Trees as a Predictive Modeling Method Gerry Hobbs, Department of Statistics, West Virginia University Abstract Predictive modeling has become an important area of interest in tasks such as credit

### DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES

DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES Vijayalakshmi Mahanra Rao 1, Yashwant Prasad Singh 2 Multimedia University, Cyberjaya, MALAYSIA 1 lakshmi.mahanra@gmail.com

### Data Analytics and Business Intelligence (8696/8697)

http: // togaware. com Copyright 2014, Graham.Williams@togaware.com 1/36 Data Analytics and Business Intelligence (8696/8697) Ensemble Decision Trees Graham.Williams@togaware.com Data Scientist Australian

### ISSN: 2319-5967 ISO 9001:2008 Certified International Journal of Engineering Science and Innovative Technology (IJESIT) Volume 2, Issue 3, May 2013

Study of Data Mining Techniques used for Financial Data Analysis Abhijit A. Sawant and P. M. Chawan Department of Computer Technology, VJTI, Mumbai, INDIA Abstract This paper describes about different

### Mining Direct Marketing Data by Ensembles of Weak Learners and Rough Set Methods

Mining Direct Marketing Data by Ensembles of Weak Learners and Rough Set Methods Jerzy B laszczyński 1, Krzysztof Dembczyński 1, Wojciech Kot lowski 1, and Mariusz Paw lowski 2 1 Institute of Computing

### Ensemble Methods. Adapted from slides by Todd Holloway h8p://abeau

Ensemble Methods Adapted from slides by Todd Holloway h8p://abeau

### Beating the MLB Moneyline

Beating the MLB Moneyline Leland Chen llxchen@stanford.edu Andrew He andu@stanford.edu 1 Abstract Sports forecasting is a challenging task that has similarities to stock market prediction, requiring time-series

### Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification

Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification Henrik Boström School of Humanities and Informatics University of Skövde P.O. Box 408, SE-541 28 Skövde

### Review of the Titanic Competition. 15 th February 2016

Review of the Titanic Competition 15 th February 2016 Agenda Data Analytics in the Society of Actuaries Team ZLAP Deloitte GI Team Where Can I Get More? Disclaimer: The material, content and views in the

### On the effect of data set size on bias and variance in classification learning

On the effect of data set size on bias and variance in classification learning Abstract Damien Brain Geoffrey I Webb School of Computing and Mathematics Deakin University Geelong Vic 3217 With the advent

### Data Mining Practical Machine Learning Tools and Techniques. Slides for Chapter 7 of Data Mining by I. H. Witten and E. Frank

Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 7 of Data Mining by I. H. Witten and E. Frank Engineering the input and output Attribute selection Scheme independent, scheme

### An Overview of Data Mining: Predictive Modeling for IR in the 21 st Century

An Overview of Data Mining: Predictive Modeling for IR in the 21 st Century Nora Galambos, PhD Senior Data Scientist Office of Institutional Research, Planning & Effectiveness Stony Brook University AIRPO

### Analysis of WEKA Data Mining Algorithm REPTree, Simple Cart and RandomTree for Classification of Indian News

Analysis of WEKA Data Mining Algorithm REPTree, Simple Cart and RandomTree for Classification of Indian News Sushilkumar Kalmegh Associate Professor, Department of Computer Science, Sant Gadge Baba Amravati

### Drug Store Sales Prediction

Drug Store Sales Prediction Chenghao Wang, Yang Li Abstract - In this paper we tried to apply machine learning algorithm into a real world problem drug store sales forecasting. Given store information,

### Advanced Ensemble Strategies for Polynomial Models

Advanced Ensemble Strategies for Polynomial Models Pavel Kordík 1, Jan Černý 2 1 Dept. of Computer Science, Faculty of Information Technology, Czech Technical University in Prague, 2 Dept. of Computer

### Data Mining - Evaluation of Classifiers

Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010

### Distributed forests for MapReduce-based machine learning

Distributed forests for MapReduce-based machine learning Ryoji Wakayama, Ryuei Murata, Akisato Kimura, Takayoshi Yamashita, Yuji Yamauchi, Hironobu Fujiyoshi Chubu University, Japan. NTT Communication

### Neural Networks & Boosting

Neural Networks & Boosting Bob Stine Dept of Statistics, School University of Pennsylvania Questions How is logistic regression different from OLS? Logistic mean function for probabilities Larger weight

### Data Mining and Statistics for Decision Making. Wiley Series in Computational Statistics

Brochure More information from http://www.researchandmarkets.com/reports/2171080/ Data Mining and Statistics for Decision Making. Wiley Series in Computational Statistics Description: Data Mining and Statistics

### THE RISE OF THE BIG DATA: WHY SHOULD STATISTICIANS EMBRACE COLLABORATIONS WITH COMPUTER SCIENTISTS XIAO CHENG. (Under the Direction of Jeongyoun Ahn)

THE RISE OF THE BIG DATA: WHY SHOULD STATISTICIANS EMBRACE COLLABORATIONS WITH COMPUTER SCIENTISTS by XIAO CHENG (Under the Direction of Jeongyoun Ahn) ABSTRACT Big Data has been the new trend in businesses.

### L25: Ensemble learning

L25: Ensemble learning Introduction Methods for constructing ensembles Combination strategies Stacked generalization Mixtures of experts Bagging Boosting CSCE 666 Pattern Analysis Ricardo Gutierrez-Osuna

### New Ensemble Combination Scheme

New Ensemble Combination Scheme Namhyoung Kim, Youngdoo Son, and Jaewook Lee, Member, IEEE Abstract Recently many statistical learning techniques are successfully developed and used in several areas However,

### II. RELATED WORK. Sentiment Mining

Sentiment Mining Using Ensemble Classification Models Matthew Whitehead and Larry Yaeger Indiana University School of Informatics 901 E. 10th St. Bloomington, IN 47408 {mewhiteh, larryy}@indiana.edu Abstract

### Building Ensembles of Neural Networks with Class-switching

Building Ensembles of Neural Networks with Class-switching Gonzalo Martínez-Muñoz, Aitor Sánchez-Martínez, Daniel Hernández-Lobato and Alberto Suárez Universidad Autónoma de Madrid, Avenida Francisco Tomás

### Homework Assignment 7

Homework Assignment 7 36-350, Data Mining Solutions 1. Base rates (10 points) (a) What fraction of the e-mails are actually spam? Answer: 39%. > sum(spam\$spam=="spam") [1] 1813 > 1813/nrow(spam) [1] 0.3940448

### When Efficient Model Averaging Out-Performs Boosting and Bagging

When Efficient Model Averaging Out-Performs Boosting and Bagging Ian Davidson 1 and Wei Fan 2 1 Department of Computer Science, University at Albany - State University of New York, Albany, NY 12222. Email:

### REPORT DOCUMENTATION PAGE

REPORT DOCUMENTATION PAGE Form Approved OMB NO. 0704-0188 Public Reporting burden for this collection of information is estimated to average 1 hour per response, including the time for reviewing instructions,

### Predictive Data Mining in Very Large Data Sets: A Demonstration and Comparison Under Model Ensemble

Predictive Data Mining in Very Large Data Sets: A Demonstration and Comparison Under Model Ensemble Dr. Hongwei Patrick Yang Educational Policy Studies & Evaluation College of Education University of Kentucky

### New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Introduction

Introduction New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Predictive analytics encompasses the body of statistical knowledge supporting the analysis of massive data sets.

### Data Mining Part 5. Prediction

Data Mining Part 5. Prediction 5.1 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Classification vs. Numeric Prediction Prediction Process Data Preparation Comparing Prediction Methods References Classification

### Intelligible Models for Classification and Regression

Intelligible Models for Classification and Regression Yin Lou 1 Rich Caruana 2 Johannes Gehrke 1 Department of Computer Science 1 Microsoft Research 2 Cornell University Microsoft Corporation Aug. 13,

### Generalizing Random Forests Principles to other Methods: Random MultiNomial Logit, Random Naive Bayes, Anita Prinzie & Dirk Van den Poel

Generalizing Random Forests Principles to other Methods: Random MultiNomial Logit, Random Naive Bayes, Anita Prinzie & Dirk Van den Poel Copyright 2008 All rights reserved. Random Forests Forest of decision

### THE HYBRID CART-LOGIT MODEL IN CLASSIFICATION AND DATA MINING. Dan Steinberg and N. Scott Cardell

THE HYBID CAT-LOGIT MODEL IN CLASSIFICATION AND DATA MINING Introduction Dan Steinberg and N. Scott Cardell Most data-mining projects involve classification problems assigning objects to classes whether

### ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015 http://intelligentoptimization.org/lionbook Roberto Battiti

### Classification and Regression by randomforest

Vol. 2/3, December 02 18 Classification and Regression by randomforest Andy Liaw and Matthew Wiener Introduction Recently there has been a lot of interest in ensemble learning methods that generate many

### Inductive Learning in Less Than One Sequential Data Scan

Inductive Learning in Less Than One Sequential Data Scan Wei Fan, Haixun Wang, and Philip S. Yu IBM T.J.Watson Research Hawthorne, NY 10532 {weifan,haixun,psyu}@us.ibm.com Shaw-Hwa Lo Statistics Department,

### The Artificial Prediction Market

The Artificial Prediction Market Adrian Barbu Department of Statistics Florida State University Joint work with Nathan Lay, Siemens Corporate Research 1 Overview Main Contributions A mathematical theory

### COMP 598 Applied Machine Learning Lecture 21: Parallelization methods for large-scale machine learning! Big Data by the numbers

COMP 598 Applied Machine Learning Lecture 21: Parallelization methods for large-scale machine learning! Instructor: (jpineau@cs.mcgill.ca) TAs: Pierre-Luc Bacon (pbacon@cs.mcgill.ca) Ryan Lowe (ryan.lowe@mail.mcgill.ca)

### Searching for Gravitational Waves from the Coalescence of High Mass Black Hole Binaries

Searching for Gravitational Waves from the Coalescence of High Mass Black Hole Binaries 2015 SURE Presentation September 22 nd, 2015 Lau Ka Tung Department of Physics, The Chinese University of Hong Kong

### BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL

The Fifth International Conference on e-learning (elearning-2014), 22-23 September 2014, Belgrade, Serbia BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL SNJEŽANA MILINKOVIĆ University

### Credit Scoring Modelling for Retail Banking Sector.

Credit Scoring Modelling for Retail Banking Sector. Elena Bartolozzi, Matthew Cornford, Leticia García-Ergüín, Cristina Pascual Deocón, Oscar Iván Vasquez & Fransico Javier Plaza. II Modelling Week, Universidad

### A Learning Algorithm For Neural Network Ensembles

A Learning Algorithm For Neural Network Ensembles H. D. Navone, P. M. Granitto, P. F. Verdes and H. A. Ceccatto Instituto de Física Rosario (CONICET-UNR) Blvd. 27 de Febrero 210 Bis, 2000 Rosario. República

### Monday Morning Data Mining

Monday Morning Data Mining Tim Ruhe Statistische Methoden der Datenanalyse Outline: - data mining - IceCube - Data mining in IceCube Computer Scientists are different... Fakultät Physik Fakultät Physik

### Efficiency in Software Development Projects

Efficiency in Software Development Projects Aneesh Chinubhai Dharmsinh Desai University aneeshchinubhai@gmail.com Abstract A number of different factors are thought to influence the efficiency of the software

### How Boosting the Margin Can Also Boost Classifier Complexity

Lev Reyzin lev.reyzin@yale.edu Yale University, Department of Computer Science, 51 Prospect Street, New Haven, CT 652, USA Robert E. Schapire schapire@cs.princeton.edu Princeton University, Department

### Actuarial. Modeling Seminar Part 2. Matthew Morton FSA, MAAA Ben Williams

Actuarial Data Analytics / Predictive Modeling Seminar Part 2 Matthew Morton FSA, MAAA Ben Williams Agenda Introduction Overview of Seminar Traditional Experience Study Traditional vs. Predictive Modeling

### CS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning.

Lecture Machine Learning Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square, x5 http://www.cs.pitt.edu/~milos/courses/cs75/ Administration Instructor: Milos Hauskrecht milos@cs.pitt.edu 539 Sennott

### Classification of Bad Accounts in Credit Card Industry

Classification of Bad Accounts in Credit Card Industry Chengwei Yuan December 12, 2014 Introduction Risk management is critical for a credit card company to survive in such competing industry. In addition

### Fraud Detection for Online Retail using Random Forests

Fraud Detection for Online Retail using Random Forests Eric Altendorf, Peter Brende, Josh Daniel, Laurent Lessard Abstract As online commerce becomes more common, fraud is an increasingly important concern.

### Why Ensembles Win Data Mining Competitions

Why Ensembles Win Data Mining Competitions A Predictive Analytics Center of Excellence (PACE) Tech Talk November 14, 2012 Dean Abbott Abbott Analytics, Inc. Blog: http://abbottanalytics.blogspot.com URL:

### Chapter 12 Bagging and Random Forests

Chapter 12 Bagging and Random Forests Xiaogang Su Department of Statistics and Actuarial Science University of Central Florida - 1 - Outline A brief introduction to the bootstrap Bagging: basic concepts

### How to Deploy Models using Statistica SVB Nodes

How to Deploy Models using Statistica SVB Nodes Abstract Dell Statistica is an analytics software package that offers data preparation, statistics, data mining and predictive analytics, machine learning,

### Mining Wiki Usage Data for Predicting Final Grades of Students

Mining Wiki Usage Data for Predicting Final Grades of Students Gökhan Akçapınar, Erdal Coşgun, Arif Altun Hacettepe University gokhana@hacettepe.edu.tr, erdal.cosgun@hacettepe.edu.tr, altunar@hacettepe.edu.tr

### ENHANCED CONFIDENCE INTERPRETATIONS OF GP BASED ENSEMBLE MODELING RESULTS

ENHANCED CONFIDENCE INTERPRETATIONS OF GP BASED ENSEMBLE MODELING RESULTS Michael Affenzeller (a), Stephan M. Winkler (b), Stefan Forstenlechner (c), Gabriel Kronberger (d), Michael Kommenda (e), Stefan

### Bike sharing model reuse framework for tree-based ensembles

Bike sharing model reuse framework for tree-based ensembles Gergo Barta 1 Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics, Magyar tudosok krt. 2.

### Studying Auto Insurance Data

Studying Auto Insurance Data Ashutosh Nandeshwar February 23, 2010 1 Introduction To study auto insurance data using traditional and non-traditional tools, I downloaded a well-studied data from http://www.statsci.org/data/general/motorins.

### ICPSR Summer Program

ICPSR Summer Program Data Mining Tools for Exploring Big Data Department of Statistics Wharton School, University of Pennsylvania www-stat.wharton.upenn.edu/~stine Modern data mining combines familiar

### Random forest algorithm in big data environment

Random forest algorithm in big data environment Yingchun Liu * School of Economics and Management, Beihang University, Beijing 100191, China Received 1 September 2014, www.cmnt.lv Abstract Random forest

### Ensembles and PMML in KNIME

Ensembles and PMML in KNIME Alexander Fillbrunn 1, Iris Adä 1, Thomas R. Gabriel 2 and Michael R. Berthold 1,2 1 Department of Computer and Information Science Universität Konstanz Konstanz, Germany First.Last@Uni-Konstanz.De

### ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA

ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA D.Lavanya 1 and Dr.K.Usha Rani 2 1 Research Scholar, Department of Computer Science, Sree Padmavathi Mahila Visvavidyalayam, Tirupati, Andhra Pradesh,

### Ensemble Learning Better Predictions Through Diversity. Todd Holloway ETech 2008

Ensemble Learning Better Predictions Through Diversity Todd Holloway ETech 2008 Outline Building a classifier (a tutorial example) Neighbor method Major ideas and challenges in classification Ensembles

### Solving Regression Problems Using Competitive Ensemble Models

Solving Regression Problems Using Competitive Ensemble Models Yakov Frayman, Bernard F. Rolfe, and Geoffrey I. Webb School of Information Technology Deakin University Geelong, VIC, Australia {yfraym,brolfe,webb}@deakin.edu.au

### Applying CHAID for logistic regression diagnostics and classification accuracy improvement

MPRA Munich Personal RePEc Archive Applying CHAID for logistic regression diagnostics and classification accuracy improvement Evgeny Antipov and Elena Pokryshevskaya The State University Higher School