Why Ensembles Win Data Mining Competitions
1 Why Ensembles Win Data Mining Competitions
A Predictive Analytics Center of Excellence (PACE) Tech Talk, November 14, 2012
Dean Abbott, Abbott Analytics, Inc.
Email: dean@abbottanalytics.com
2 Outline
- Motivation for Ensembles
- How Ensembles are Built
- Do Ensembles Violate Occam's Razor?
- Why Do Ensembles Win?
3 PAKDD Cup 2007 Results: Score Metric Changes Winner
Ensembles
(The AUCROC Rank and Top Decile Response Rate columns did not survive transcription. Rows are listed in descending AUCROC order; "-" marks a cell with no value in the transcript.)
Modeling Technique | Modeling Implementation | Participant Affiliation Location | Participant Affiliation Type | AUCROC (Trapezoidal Rule) | Top Decile Response Rate Rank
TreeNet + Logistic Regression | Salford Systems | Mainland China | Practitioner | 70.01% | 7
Probit Regression | SAS | USA | Practitioner | 69.99% | 6
MLP + n-Tuple Classifier | - | Brazil | Practitioner | 69.62% | 1
TreeNet | Salford Systems | USA | Practitioner | 69.61% | 4
TreeNet | Salford Systems | Mainland China | Practitioner | 69.42% | 2
Ridge Regression | Rank | Belgium | Practitioner | 69.28% | 9
2-Layer Linear Regression | - | USA | Practitioner | 69.14% | 9
Logistic Regression + Decision Stump + AdaBoost + VFI | - | Mainland China | Academia | 69.10% | 4
Logistic Average of Single Decision Functions | - | Australia | Practitioner | 68.85% | 17
Logistic Regression | Weka | Singapore | Academia | 68.69% | 16
Logistic Regression | - | Mainland China | Practitioner | 68.58% | 9
Decision Tree + Neural Network + Logistic Regression | - | Singapore | - | 68.54% | 7
Scorecard Linear Additive Model | Xeno | USA | Practitioner | 68.28% | 20
Random Forest | Weka | USA | - | 68.04% | 14
Expanding Regression Tree + RankBoost + Bagging | Weka | Mainland China | Academia | 68.02% | 14
Logistic Regression | SAS + Salford Systems | India | Practitioner | 67.58% | 19
J48 + BayesNet | Weka | Mainland China | Academia | 67.56% | 21
Neural Network + General Additive Model | Tiberius | USA | Practitioner | 67.54% | 21
Decision Tree + Neural Network | - | Mainland China | Academia | 67.50% | 9
Decision Tree + Neural Network + Logistic Regression | SAS | USA | Academia | 66.71% | 2
Neural Network | SAS | USA | Academia | 66.36% | 17
Decision Tree + Neural Network + Logistic Regression | SAS | USA | Academia | 65.95% | 21
Neural Network | SAS | USA | Academia | 65.69% | 32
Multi-dimension Balanced Random Forest | - | Mainland China | Academia | 65.42% | 13
Neural Network | SAS | USA | Academia | 65.28% | 26
CHAID Decision Tree | SPSS | Argentina | Academia | 64.53% | 24
Under-Sampling Based on Clustering + CART Decision Tree | - | Taiwan | Academia | 64.45% | 25
Decision Tree + Neural Network + Polynomial Regression | SAS | USA | Academia | 64.26% | 30
4 Netflix Prize
- 2006 Netflix state-of-the-art (Cinematch) RMSE =
- Prize: reduce this RMSE by 10% =>
- Korbell team: Progress Prize winner
  - 107-algorithm ensemble
  - Top algorithm: SVD with RMSE =
  - 2nd algorithm: Restricted Boltzmann Machine with RMSE =
  - Mini-ensemble (SVD+RBM) has RMSE =
5 Common Kinds of Ensembles vs. Single Models
[Figure: ensembles vs. single classifiers]
From Zhuowen Tu, Ensemble Classification Methods: Bagging, Boosting, and Random Forests
6 What are Model Ensembles?
- Combining outputs from multiple models into a single decision
- Models can be created using the same algorithm, or several different algorithms
[Diagram: Decision Logic -> Ensemble Prediction]
7 Creating Model Ensembles
Step 1: Generate Component Models
Can vary data or model parameters:
- Case (record) weights: bootstrapping, sampling
- Data values: add noise, recode data
- Learning parameters: vary learning rates, pruning severity, random seeds
- Variable subsets: vary candidate inputs, features
[Diagram: Single data set -> Multiple models and predictions]
8 Creating Model Ensembles
Step 2: Combining Models
- Combining methods
  - Estimation: average outputs
  - Classification: average probabilities or vote (best M of N)
- Variance reduction
  - Build complex, overfit models
  - All models built in the same manner
- Bias reduction
  - Build simple models
  - Subsequent models weight records with errors more (or model the actual errors)
[Diagram: Multiple models and predictions -> Combine -> Decision or Prediction Value]
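The two combining methods named on this slide can be sketched in a few lines. This is a minimal illustration, not the speaker's implementation; the three models and their scores are hypothetical, and the 0.5 voting cutoff is an assumption.

```python
from collections import Counter

def average_outputs(predictions):
    """Average per-record outputs across models (estimates or probabilities)."""
    n_models = len(predictions)
    return [sum(col) / n_models for col in zip(*predictions)]

def majority_vote(labels):
    """Vote per record across models; a tie goes to the label seen first."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*labels)]

# Hypothetical probabilities from three models scoring the same four records
probs = [
    [0.9, 0.2, 0.6, 0.4],
    [0.8, 0.4, 0.3, 0.6],
    [0.7, 0.3, 0.6, 0.2],
]

avg = average_outputs(probs)
# Turn each model's probability into a 0/1 decision (assumed cutoff 0.5), then vote
votes = majority_vote([[int(p > 0.5) for p in model] for model in probs])
```

Averaging keeps the graded score, which matters when records are later ranked; voting keeps only each model's decision.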
9 How Model Complexity Affects Errors
[Figure from Giovanni Seni, John Elder, Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions, Morgan and Claypool Publishers, 2010]
10 Commonly Used Information-Theoretic Complexity Penalties
- BIC: Bayesian Information Criterion
- AIC: Akaike Information Criterion
- MDL: Minimum Description Length
For a nice summary:
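For reference, the standard textbook forms of the first two penalties are given below; they are not taken from the slide. The symbols are assumptions of this note: $k$ is the number of fitted parameters, $n$ the number of training records, and $\hat{L}$ the maximized likelihood. Lower values are better, and both criteria charge for each added parameter.

```latex
\mathrm{AIC} = 2k - 2\ln\hat{L}
\qquad
\mathrm{BIC} = k\ln n - 2\ln\hat{L}
```

Because $\ln n > 2$ once $n \ge 8$, BIC penalizes extra parameters more heavily than AIC on all but the smallest data sets.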
11 Four Keys to Effective Ensembling
- Diversity of opinion
- Independence
- Decentralization
- Aggregation
From The Wisdom of Crowds, James Surowiecki
12 Bagging
Bagging Method
- Create many data sets by bootstrapping (can also do this with cross-validation)
- Create one decision tree for each data set
- Combine decision trees by averaging (or voting) final decisions
- Primarily reduces model variance rather than bias
Results
- On average, better than any individual tree
[Diagram: trees -> Final Answer (average)]
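The bagging steps above can be sketched as follows. This is a minimal standard-library illustration, not a production implementation: one-split regression trees (stumps) stand in for full decision trees, and the data, `n_models`, and `seed` values are hypothetical.

```python
import random

def fit_stump(xs, ys):
    """Fit a one-split regression tree on a single input; returns a predictor."""
    best = None
    for t in sorted(set(xs))[:-1]:  # split is x <= t, so skip the largest value
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:  # sample had a single distinct x: predict the mean
        mean = sum(ys) / len(ys)
        return lambda x: mean
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def bagged_stumps(xs, ys, n_models=25, seed=0):
    """Fit one stump per bootstrap sample and average their predictions."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(m(x) for m in models) / len(models)

# Hypothetical data: a noisy step from about 0 to about 1
xs = list(range(10))
ys = [0.1, 0.0, 0.2, 0.1, 0.0, 0.9, 1.0, 0.8, 1.0, 0.9]
predict = bagged_stumps(xs, ys)
low, high = predict(0), predict(9)
```

Each bootstrap sample moves the split point and leaf means slightly, and averaging those differences away is the variance reduction the slide refers to.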
13 Boosting (AdaBoost)
Boosting Method
- Create a tree using the training data set
- Score each data point, noting where each incorrect decision is made (errors)
- Retrain, giving rows with incorrect decisions more weight. Repeat
- Final prediction is a weighted average of all models -> model regularization
- Best to create weak models (simple models, just a few splits for a decision tree) and let the boosting iterations find the complexity
- Often used with trees or Naïve Bayes
Results
- Usually better than an individual tree or Bagging
[Diagram: Reweight examples where classification incorrect -> Combine models via weighted sum]
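A minimal sketch of the reweighting loop described above, using the standard discrete AdaBoost update with decision stumps as the weak models. The data set is hypothetical and was chosen so that no single stump can fit it; the function names are illustrative.

```python
import math

def fit_weighted_stump(xs, ys, w):
    """Return (weighted error, threshold, sign) of the best stump; labels are -1/+1.
    A stump predicts `sign` when x <= threshold and `-sign` otherwise."""
    best = None
    for t in sorted(set(xs)):
        for sign in (1, -1):
            err = sum(wi for wi, x, y in zip(w, xs, ys)
                      if (sign if x <= t else -sign) != y)
            if best is None or err < best[0]:
                best = (err, t, sign)
    return best

def adaboost(xs, ys, rounds=3):
    n = len(xs)
    w = [1.0 / n] * n  # start with equal record weights
    ensemble = []
    for _ in range(rounds):
        err, t, sign = fit_weighted_stump(xs, ys, w)
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # model weight
        ensemble.append((alpha, t, sign))
        # Misclassified rows gain weight, correctly classified rows lose weight
        w = [wi * math.exp(-alpha * y * (sign if x <= t else -sign))
             for wi, x, y in zip(w, xs, ys)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, x):
    """Weighted sum of stump votes, thresholded at zero."""
    score = sum(a * (s if x <= t else -s) for a, t, s in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical data: positives sit in the middle, so one split cannot separate them
xs = list(range(10))
ys = [-1, -1, -1, 1, 1, 1, 1, -1, -1, -1]
single_error = fit_weighted_stump(xs, ys, [0.1] * 10)[0]
model = adaboost(xs, ys, rounds=3)
fitted = [predict(model, x) for x in xs]
```

On this toy data the best single stump misclassifies three of ten rows, while the weighted sum of three stumps reproduces every training label.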
14 Random Forest Ensembles
Random Forest (RF) Method
- Exact same methodology as Bagging, but with a twist
- At each split, rather than using the entire set of candidate inputs, use a random subset of candidate inputs
- Generates diversity of samples and inputs (splits)
Results
- On average, better than any individual tree, Bagging, or even Boosting
[Diagram: trees -> Final Answer (average)]
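The "twist" can be sketched for a single split. This is an illustration only: the subset size of roughly the square root of the number of inputs is a common default assumed here, not something the slide specifies, and the data and function names are hypothetical.

```python
import math
import random

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(X, y, candidates):
    """Best (weighted Gini, feature index, threshold) over the candidate features only."""
    best = None
    for j in candidates:
        for t in sorted({row[j] for row in X})[:-1]:  # split is row[j] <= t
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, j, t)
    return best

def random_forest_split(X, y, rng, m=None):
    """Choose a split using only a random subset of m candidate inputs."""
    p = len(X[0])
    m = m or max(1, int(math.sqrt(p)))  # assumed default subset size
    candidates = rng.sample(range(p), m)
    return candidates, best_split(X, y, candidates)

# Hypothetical data: six records, four candidate inputs
X = [[1, 5, 0, 2], [2, 4, 1, 2], [3, 6, 0, 3],
     [4, 1, 1, 5], [5, 2, 0, 6], [6, 3, 1, 7]]
y = [0, 0, 0, 1, 1, 1]
candidates, split = random_forest_split(X, y, random.Random(1))
```

A full forest would repeat this at every node of every tree, with each tree grown on its own bootstrap sample as in Bagging.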
15 Stochastic Gradient Boosting
Implemented in MART (Jerry Friedman) and TreeNet (Salford Systems)
Algorithm
- Begin with a simple model: a constant value
- Build a simple tree (perhaps 6 terminal nodes); now there are 6 possible levels, whereas before there was one level
- Score the model and compute errors. The score is the sum of all previous trees, weighted by a learning rate
- Build a new tree with the errors as the target variable
Results
- TreeNet has won 2 KDD-Cup competitions and numerous others
- It is less sensitive to outliers and less prone to overfitting than AdaBoost
[Diagram: Predict errors in ensemble tree so far -> Combine models via weighted sum -> Build Final Answer (additive model)]
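The algorithm above can be sketched for squared-error regression. This is a simplified illustration and not the MART or TreeNet implementation: stumps (2 terminal nodes) replace the 6-node trees mentioned on the slide, and `n_trees`, `learning_rate`, `subsample`, and the data are assumed values.

```python
import random

def fit_stump(xs, targets):
    """Fit a one-split regression tree on a single input; returns a predictor."""
    best = None
    for t in sorted(set(xs))[:-1]:  # split is x <= t, so skip the largest value
        left = [r for x, r in zip(xs, targets) if x <= t]
        right = [r for x, r in zip(xs, targets) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:  # only one distinct x in the sample
        mean = sum(targets) / len(targets)
        return lambda x: mean
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def gradient_boost(xs, ys, n_trees=50, learning_rate=0.1, subsample=0.5, seed=0):
    rng = random.Random(seed)
    base = sum(ys) / len(ys)  # begin with a constant model
    preds = [base] * len(xs)
    trees = []
    k = max(2, int(subsample * len(xs)))
    for _ in range(n_trees):
        errors = [y - p for y, p in zip(ys, preds)]  # errors of the ensemble so far
        idx = rng.sample(range(len(xs)), k)  # the "stochastic" part: a random subsample
        tree = fit_stump([xs[i] for i in idx], [errors[i] for i in idx])
        trees.append(tree)
        preds = [p + learning_rate * tree(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(t(x) for t in trees)

def mse(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical data: a smooth curve that a single stump cannot follow
xs = list(range(10))
ys = [float(x * x) for x in xs]
constant = lambda x: sum(ys) / len(ys)
boosted = gradient_boost(xs, ys)
mse_constant, mse_boosted = mse(constant, xs, ys), mse(boosted, xs, ys)
```

The small learning rate means each tree corrects only a fraction of the remaining error, so the additive model builds up its complexity gradually.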
16 Ensembles of Trees: Smoothers
Ensembles smooth jagged decision boundaries
Pictures from T.G. Dietterich. Ensemble methods in machine learning. In Multiple Classifier Systems, Cagliari, Italy,
17 Heterogeneous Model Ensembles on Glass Data
[Chart: percent classification error (max, min, and average error) vs. number of models combined]
- Model prediction diversity obtained by using different algorithms: tree, NN, RBF, Gaussian, regression, k-NN
- Combining 3-5 models is on average better than the best single model
- Combining all 6 models is not best (the best is a 3- or 4-model combination), but is close
- This is an example of reducing model variance through ensembles, but not model bias
18 Direct Marketing Example: Considerations for I-Miner
Steps:
1. Join by record all models applied to the same data, in the same row order
2. Change probability names
3. Average probabilities
   1. Decision is avg_prob > threshold
4. Decile probability ranks
From Abbott, D.W., "How to Improve Customer Acquisition Models with Ensembles", presented at Predictive Analytics World Conference, Washington, D.C., October 20,
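Outside a visual workflow tool, the same steps can be sketched in plain Python. This is an assumption-laden illustration, not the I-Miner stream from the talk: the model names, probabilities, and the 0.5 threshold are hypothetical, and decile 1 is taken to mean the highest-scoring tenth.

```python
def ensemble_scores(model_probs, threshold=0.5):
    """model_probs maps a model name to its probabilities, all in the same row order."""
    names = sorted(model_probs)
    n = len(model_probs[names[0]])
    # Step 3: average the probabilities record by record
    avg = [sum(model_probs[m][i] for m in names) / len(names) for i in range(n)]
    # Step 3.1: the decision is avg_prob > threshold
    decisions = [p > threshold for p in avg]
    # Step 4: decile ranks, with decile 1 holding the highest averaged scores
    order = sorted(range(n), key=lambda i: avg[i], reverse=True)
    deciles = [0] * n
    for rank, i in enumerate(order):
        deciles[i] = rank * 10 // n + 1
    return avg, decisions, deciles

# Steps 1 and 2: scores joined by record and keyed by distinct (hypothetical) names
model_probs = {
    "prob_tree": [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0],
    "prob_logit": [0.95, 0.7, 0.65, 0.5, 0.45, 0.3, 0.25, 0.1, 0.05, 0.0],
}
avg_prob, decisions, deciles = ensemble_scores(model_probs)
```

Keying each model's scores by a distinct name plays the role of step 2, where the probability columns are renamed so they do not collide after the join.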
19 Direct Marketing Example: Variable Inclusion in Model Ensembles
- Twenty-five different variables represented in the ten models
- Only five were represented in seven or more models
- Twelve were represented in one or two models
[Chart: # Models with Common Variables (# Models vs. # Variables)]
From Abbott, D.W., "How to Improve Customer Acquisition Models with Ensembles", presented at Predictive Analytics World Conference, Washington, D.C., October 20,
20 Fraud Detection Example: Deployment Stream
Model scoring picks up scores from each model, combines them in an ensemble, and pushes scores back to the database
21 Fraud Detection Example: Overall Model Score on Validation Data
[Chart: normalized total score (from validation population) for Model Ensemble, Average, Average 5 Best, Average 5 Worst, Best Testing, Worst Testing]
- Score weights false alarms and sensitivity
- Overall, the ensemble is clearly best, and much better than the best on testing data
From Abbott, D, and Tom Konchan, Advanced Fraud Detection Techniques for Vendor Payments, Predictive Analytics Summit, San Diego, CA, February 24,
22 Are Ensembles Better?
- Accuracy? Yes
- Interpretability? No
- Do ensembles contradict Occam's Razor?
  - Principle: simpler models generalize better; avoid overfit!
  - They are more complex than single models (RF may have hundreds of trees in the ensemble)
  - Yet these more complex models perform better on held-out data
  - But are they really more complex?
23 Generalized Degrees of Freedom
- Linear regression: a degree of freedom in the model is simply a parameter
  - Does not extrapolate to non-linear methods
  - Number of parameters in non-linear methods can produce more complexity or less
- Enter Generalized Degrees of Freedom (GDF)
  - GDF (Ye 1998) randomly perturbs (adds noise to) the output variable, re-runs the modeling procedure, and measures the changes to the estimates (for the same number of parameters)
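The perturbation procedure can be sketched as below. This is a simplified Monte Carlo reading of the idea, not Ye's exact estimator: for each record, the fitted value is regressed on the noise added to that record's target, and the slopes are summed. The noise level, replication count, data, and function names are assumptions.

```python
import random

def gdf(fit_predict, xs, ys, n_reps=200, noise_sd=0.5, seed=0):
    """Estimate GDF as the summed sensitivity of each fitted value to
    noise added to its own target value."""
    rng = random.Random(seed)
    n = len(ys)
    deltas, fits = [], []
    for _ in range(n_reps):
        d = [rng.gauss(0.0, noise_sd) for _ in range(n)]  # perturb the output variable
        deltas.append(d)
        fits.append(fit_predict(xs, [y + di for y, di in zip(ys, d)]))  # re-run the fit
    total = 0.0
    for i in range(n):
        di = [d[i] for d in deltas]
        fi = [f[i] for f in fits]
        md, mf = sum(di) / n_reps, sum(fi) / n_reps
        cov = sum((a - md) * (b - mf) for a, b in zip(di, fi))
        var = sum((a - md) ** 2 for a in di)
        total += cov / var  # slope of fitted value i on its own perturbation
    return total

def mean_model(xs, ys):
    """One-parameter model: predict the overall mean."""
    m = sum(ys) / len(ys)
    return [m] * len(ys)

def line_model(xs, ys):
    """Two-parameter model: least-squares straight line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [my + slope * (x - mx) for x in xs]

# Hypothetical data
xs = [float(i) for i in range(20)]
ys = [0.5 * x + 1.0 for x in xs]
gdf_mean = gdf(mean_model, xs, ys)
gdf_line = gdf(line_model, xs, ys)
```

For these two linear procedures the estimate should land near the parameter counts (about 1 and about 2); the point of GDF is that the same measurement applies to procedures such as trees or ensembles where counting parameters is not meaningful.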
24 The Math of GDF
[Figure from Giovanni Seni, John Elder, Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions, Morgan and Claypool Publishers, 2010]
25 The Effect of GDF
[Figure from Elder, J.F.E IV, The Generalization Paradox of Ensembles, Journal of Computational and Graphical Statistics, Volume 12, Number 4]
26 Why Ensembles Win
- Performance, performance, performance
  - Most competitions care only about performance, not about interpretation or ability to deploy
- A single model sometimes provides insufficient accuracy
  - Neural networks become stuck in local minima
  - Decision trees
    - Run out of data
    - Are greedy: can get fooled early
- Single algorithms keep pushing performance using the same ideas (basis function / algorithm), and are incapable of thinking outside of their box
- Different algorithms, or algorithms built using resampled data, achieve the same level of accuracy but on different cases: they identify different ways to get the same level of accuracy
27 Conclusion
- Ensembles can achieve significant model performance improvements
- The key to good ensembles is diversity in sampling and variable selection
- Can be applied to a single algorithm, or across multiple algorithms
- Just do it!
28 References
- Giovanni Seni, John Elder, Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions, Morgan and Claypool Publishers, 2010
- Elder, J.F.E IV, The Generalization Paradox of Ensembles, Journal of Computational and Graphical Statistics, Volume 12, Number 4
- Abbott, D.W., The Benefits of Creating Ensembles of Classifiers, Abbott Analytics, Inc.
- Abbott, D.W., A Comparison of Algorithms at PAKDD2007, blog post
29 References
- Tu, Zhuowen, Ensemble Classification Methods: Bagging, Boosting, and Random Forests, _CS_spring/cs269_2010_ensemble.pdf
- Ye, J. (1998), On Measuring and Correcting the Effects of Data Mining and Model Selection, Journal of the American Statistical Association, 93,
Assessing classification methods for churn prediction by composite indicators M. Clemente*, V. Giner-Bosch, S. San Matías Department of Applied Statistics, Operations Research and Quality, Universitat
More informationHadoop s Advantages for! Machine! Learning and. Predictive! Analytics. Webinar will begin shortly. Presented by Hortonworks & Zementis
Webinar will begin shortly Hadoop s Advantages for Machine Learning and Predictive Analytics Presented by Hortonworks & Zementis September 10, 2014 Copyright 2014 Zementis, Inc. All rights reserved. 2
More informationNeural Networks and Support Vector Machines
INF5390 - Kunstig intelligens Neural Networks and Support Vector Machines Roar Fjellheim INF5390-13 Neural Networks and SVM 1 Outline Neural networks Perceptrons Neural networks Support vector machines
More informationtesto dello schema Secondo livello Terzo livello Quarto livello Quinto livello
Extracting Knowledge from Biomedical Data through Logic Learning Machines and Rulex Marco Muselli Institute of Electronics, Computer and Telecommunication Engineering National Research Council of Italy,
More informationIdentifying SPAM with Predictive Models
Identifying SPAM with Predictive Models Dan Steinberg and Mikhaylo Golovnya Salford Systems 1 Introduction The ECML-PKDD 2006 Discovery Challenge posed a topical problem for predictive modelers: how to
More informationData Mining Using SAS Enterprise Miner Randall Matignon, Piedmont, CA
Data Mining Using SAS Enterprise Miner Randall Matignon, Piedmont, CA An Overview of SAS Enterprise Miner The following article is in regards to Enterprise Miner v.4.3 that is available in SAS v9.1.3.
More informationData Mining - Evaluation of Classifiers
Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010
More informationMHI3000 Big Data Analytics for Health Care Final Project Report
MHI3000 Big Data Analytics for Health Care Final Project Report Zhongtian Fred Qiu (1002274530) http://gallery.azureml.net/details/81ddb2ab137046d4925584b5095ec7aa 1. Data pre-processing The data given
More informationCOPYRIGHTED MATERIAL. Contents. List of Figures. Acknowledgments
Contents List of Figures Foreword Preface xxv xxiii xv Acknowledgments xxix Chapter 1 Fraud: Detection, Prevention, and Analytics! 1 Introduction 2 Fraud! 2 Fraud Detection and Prevention 10 Big Data for
More informationMachine Learning Algorithms and Predictive Models for Undergraduate Student Retention
, 225 October, 2013, San Francisco, USA Machine Learning Algorithms and Predictive Models for Undergraduate Student Retention Ji-Wu Jia, Member IAENG, Manohar Mareboyana Abstract---In this paper, we have
More informationData Mining in CRM & Direct Marketing. Jun Du The University of Western Ontario jdu43@uwo.ca
Data Mining in CRM & Direct Marketing Jun Du The University of Western Ontario jdu43@uwo.ca Outline Why CRM & Marketing Goals in CRM & Marketing Models and Methodologies Case Study: Response Model Case
More informationPredicting Student Performance by Using Data Mining Methods for Classification
BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 13, No 1 Sofia 2013 Print ISSN: 1311-9702; Online ISSN: 1314-4081 DOI: 10.2478/cait-2013-0006 Predicting Student Performance
More informationLecture 13: Validation
Lecture 3: Validation g Motivation g The Holdout g Re-sampling techniques g Three-way data splits Motivation g Validation techniques are motivated by two fundamental problems in pattern recognition: model
More informationUsing Random Forest to Learn Imbalanced Data
Using Random Forest to Learn Imbalanced Data Chao Chen, chenchao@stat.berkeley.edu Department of Statistics,UC Berkeley Andy Liaw, andy liaw@merck.com Biometrics Research,Merck Research Labs Leo Breiman,
More informationData Mining Classification: Decision Trees
Data Mining Classification: Decision Trees Classification Decision Trees: what they are and how they work Hunt s (TDIDT) algorithm How to select the best split How to handle Inconsistent data Continuous
More informationWhy is Internal Audit so Hard?
Why is Internal Audit so Hard? 2 2014 Why is Internal Audit so Hard? 3 2014 Why is Internal Audit so Hard? Waste Abuse Fraud 4 2014 Waves of Change 1 st Wave Personal Computers Electronic Spreadsheets
More informationBIOINF 585 Fall 2015 Machine Learning for Systems Biology & Clinical Informatics http://www.ccmb.med.umich.edu/node/1376
Course Director: Dr. Kayvan Najarian (DCM&B, kayvan@umich.edu) Lectures: Labs: Mondays and Wednesdays 9:00 AM -10:30 AM Rm. 2065 Palmer Commons Bldg. Wednesdays 10:30 AM 11:30 AM (alternate weeks) Rm.
More informationAn Overview of Knowledge Discovery Database and Data mining Techniques
An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,
More informationNew Ensemble Combination Scheme
New Ensemble Combination Scheme Namhyoung Kim, Youngdoo Son, and Jaewook Lee, Member, IEEE Abstract Recently many statistical learning techniques are successfully developed and used in several areas However,
More informationMERGING BUSINESS KPIs WITH PREDICTIVE MODEL KPIs FOR BINARY CLASSIFICATION MODEL SELECTION
MERGING BUSINESS KPIs WITH PREDICTIVE MODEL KPIs FOR BINARY CLASSIFICATION MODEL SELECTION Matthew A. Lanham & Ralph D. Badinelli Virginia Polytechnic Institute and State University Department of Business
More informationCross-Validation. Synonyms Rotation estimation
Comp. by: BVijayalakshmiGalleys0000875816 Date:6/11/08 Time:19:52:53 Stage:First Proof C PAYAM REFAEILZADEH, LEI TANG, HUAN LIU Arizona State University Synonyms Rotation estimation Definition is a statistical
More information