Tree Ensembles: The Power of Post- Processing. December 2012 Dan Steinberg Mikhail Golovnya Salford Systems
|
|
- Shawn Johns
- 8 years ago
- Views:
Transcription
1 Tree Ensembles: The Power of Post- Processing December 2012 Dan Steinberg Mikhail Golovnya Salford Systems
2 Course Outline Salford Systems quick overview Treenet an ensemble of boosted trees GPS modern highly versatile regularized regression ISLE an effective way to compress ensemble models RuleLearner extracting most interesting sets of rules Examples of compression
3 Brief Overview of Salford Systems Salford Systems principally well known for three reasons o Products (CART, MARS, TreeNet, RandomForests ) o Brains trust (J. Friedman, L. Breiman, R. Olshen, C. Stone) o Awards (multiple data mining competitions since 2000 until today) Marquee list of major customers in banking, insurance, retail trade bricks-and-mortar and online, pharmaceuticals, etc.) o NAB Bank o American Express o VISA USA (credit cards) o Capital One (credit cards) o Traveler s Insurance (formerly Citigroup) o Johnson & Johnson Partners in China and Japan (Asian language versions)
4 Salford Competitive Awards 2010 Direct Marketing Association. 1 st place* 2009 KDDCup IDAnalytics*, and FEG Japan* 1 st Runner Up 2008 DMA Direct Marketing Association 1 st Runner Up 2007 Pacific Asia PAKDD: Credit Card Cross Sell. 1 st place 2006 DMA Direct Marketing Association: Predictive Modeling* 2006 PAKDD Pacific Asia KDD: Telco Customer Type Profiling 2005 BI-Cup Latin America: Predictive Modeling E-commerce* 1 st place 2004 KDDCup: Predictive Modeling Most Accurate * 2002 NCR/Teradata Duke University: Predictive Modeling-Churn o all four separate predictive modeling challenges 1 st place 2000 KDDCup: Predictive Modeling- Online behavior 1 st place 2000 KDDCup: CRM Analysis 1 st place *Won either by Salford or by client using Salford tools
5 Salford Predictive Modeler SPM Download a current version from our website Version will run without a license key for 10-days Request a license key from unlock@salford-systems.com Request configuration to meet your needs o Data handling capacity o Data mining engines made available
6 Stochastic Gradient Boosting New approach to machine learning /function approximation developed by Jerome H. Friedman at Stanford University o o Co-author of CART with Breiman, Olshen and Stone Author of MARS, PRIM, Projection Pursuit Also known as Treenet Good for classification and regression problems Built on small trees and thus o o o o Fast and efficient Data driven Immune to outliers Invariant to monotone transformations of variables Resistant to over training generalizes very well Can be remarkably accurate with little effort BUT resulting model may be very complex 6
7 The Algorithm Begin with a very small tree as initial model o o o Could be as small as ONE split generating 2 terminal nodes Typical model will have 3-5 splits in a tree, generating 4-6 terminal nodes Model is intentionally weak Compute residuals (prediction errors) for this simple model for every record in data Grow a second small tree to predict the residuals from the first tree Compute residuals from this new 2-tree model and grow a 3 rd tree to predict revised residuals Repeat this process to grow a sequence of trees Tree 1 Tree 2 Tree 3 More trees
8 Illustration: Saddle Function 500 {X1,X2} points randomly drawn from a [-3,+3] box to produce the XOR response surface Y = X1 * X2 Will use 3-node trees to show the evolution of Treenet response surface 1 Tree 2 Trees 3 Trees 4 Trees 10 Trees 20 Trees 30 Trees 40 Trees 100 Trees 195 Trees
9 Notes on Treenet Solution The solution evolves slowly and usually includes hundreds or even thousands of small trees The process is myopic only the next best tree given the current set of conditions is added There is a high degree of similarity and overlap among the resulting trees Very large tree sequences make the model scoring time and resource intensive Thus, ever present need to simplify (reduce) model complexity
10 Real Life Dataset Dataset supplied by 2006 DMC Data Mining competition 8,000 ipod auctions held at ebay from May 2005 to May 2006 in Germany Auction items were grouped into 15 mutually exclusive categories based on size, type (regular, mini, nano), and color Goal: predict whether the closing price will be above or below the category average
11 The Data: Available Fields Variable CATEGORY_NAME LISTING_DURTN_DAYS LISTING_TYPE_CODE QTY_AVAILABLE FEEDBACK_SCORE START_PRICE BUY_IT_NOW_PRICE BUY_IT_NOW_FLAG BOLD_FEE_FLAG CATEGORY_FEATURED_FEE_FLAG GALLERY_FEE_FLAG RESERVE_FEE_FLAG Salford Systems Copyright 2011 Description Product category duration of auction type of auction (normal auction, multi auction, etc) amount of offered items for multi auction feedback- rating of the seller of this auction listing start price in EUR buy it now price in EUR option for buy it now on this auction listing option for bold font on this auction listing show this auction listing on top of category auction listing with picture gallery auction listing with reserve- price Variety of additional fees 11
12 A Tree is a Variable Transformation Node 1 Avg = N = 506 X1 <= 1.47 Node 2 Avg = N = 476 X1 > 1.47 Terminal Node 3 Avg = N = 30 TREE_1 = F(X1,X2) X2 <= Terminal Node 1 Avg = N = 15 X2 > Terminal Node 2 Avg = N = 461 Any tree in a Treenet model can be represented by a derived continuous variable as a function of inputs
13 ISLE Compression of Treenet TN Model ISLE Model The original Treenet model combines all trees with equal coefficients ISLE accomplishes model compression by removing redundant trees and changing the relative contribution of the remaining trees by adjusting the coefficients Regularized Regression methodology provides the required machinery to accomplish this task!
14 OLS versus GPS OLS Regression A Sequence of Linear Models 1-variable model 2-variable model 3-variable model X1, X2, X3, X4, X5, X6, X1, X2, X3, X4, X5, X6, Learn Sample Test Sample GPS Regression Large Collection of Linear Models (Paths) 1-variable models, varying coefficients 2-variable models, varying coefficients 3-variable models, varying coefficients GPS (Generalized Path Seeker) introduced by Jerome Friedman in 2008 Dramatically expands the pool of potential linear models by including different sets of variables in addition to varying the magnitude of coefficients The optimal model of any desirable size can then be selected based on its performance on the TEST sample
15 Path Building Process Zero Coefficient Model Sequence of 1-variable models Sequence of 2-variable models Sequence of 3-variable models OLS Solution A Variable is Selected A Variable is Added A Variable is Added Variable Selection Strategy Elasticity Parameter controls the variable selection strategy along the path (using the LEARN sample only), it can be between 0 and 2, inclusive o Elasticity = 2 fast approximation of Ridge Regression, introduces variables as quickly as possible and then jointly varies the magnitude of coefficients lowest degree of compression o Elasticity = 1 fast approximation of Lasso Regression, introduces variables sparingly letting the current active variables develop their coefficients good degree of compression versus accuracy o Elasticity = 0 fast approximation of Stepwise Regression, introduces new variables only after the current active variables were fully developed excellent degree of compression but may loose accuracy
16 ISLE Compression on ebay Data Test Performance ISLE managed to half the model complexity while slightly improving the TEST sample performance Lasso variable selection strategy turned out to be the winning compression approach here Note that even higher compression ratios can be achieved if one is willing to sacrifice a little bit of accuracy All ISLE models show uniformly better performance than the corresponding TN models
17 A Node is a Variable Transformation Node 1 Avg = N = 506 X1 <= 1.47 Node 2 Avg = N = 476 X1 > 1.47 Terminal Node 3 Avg = N = 30 X2 <= Terminal Node 1 Avg = N = 15 X2 > Terminal Node 2 Avg = N = 461 NODE_X = F(X1,X2) Any node in a Treenet model can be represented by a derived dummy variable as a function of inputs
18 RuleLearner Compression GPS Regression TN Model T1_N1 + T1_T1 + T1_T2 + T1_T3 + T2_N1 + T2_T1 + T2_T2 + T2_T3 + T3_N1 + T3_T1 + Coefficient 1 Rule-set 1 Coefficient 2 Rule-set 2 Coefficient 3 Rule-set 3 Coefficient 4 Rule-set 4 Create an exhaustive set of dummy variables for every node (internal and terminal) and every tree in a TN model Run GPS regularized regression to extract an informative subset of node dummies along with the modeling coefficients Thus, a model compression can be achieved by eliminating redundant nodes Each selected node dummy represents a specific ruleset which can be interpreted directly for further insights
19 ISLE Compression in Adserving Web portal leveraging hundreds of Treenet models to optimize the user experience Optimal models may each have hundreds or thousands of trees Have sufficient time to score no more than 30 trees (each with 6 nodes) to execute page quickly enough Originally simply took the first 30 trees from the model o May have reworked the model manipulating the learn rate With ISLE much better accuracy is possible because we can now work with an optimal subset of 30 trees instead first 30
20 RuleLearner Compression on ebay Data Test Performance RuleLearner has reduced model complexity to 1/10 th while improved its performance A large number of HotSpots has been identified All RuleLearner models show uniformly better performance than the corresponding Treenet models
21 RuleLearner Compression in Banking Loan default dataset from a major bank, 16,000 accounts, 43 predictors, 10% overall default rate 64% compression using RuleLearner, no accuracy loss The top rules have lift of 6 and provide significant insights in terms of hot spots on loan defaults
22 Big Data The Future In model compression we can think of the tree building as simply a search engine for interesting transforms of the raw data Ideal transforms: o Handle missing values o Impervious to predictor outliers o Invariant to how the raw data is expressed (eg. Scale) Transforms can be discovered on very small samples of the data (ideally stratified samples) Second stage regression is much more easily run as a MapReduce job on the transforms o Challenge is that there can be many transforms o Results of such experiments very promising
Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets
Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets http://info.salford-systems.com/jsm-2015-ctw August 2015 Salford Systems Course Outline Demonstration of two classification
More informationIdentifying SPAM with Predictive Models
Identifying SPAM with Predictive Models Dan Steinberg and Mikhaylo Golovnya Salford Systems 1 Introduction The ECML-PKDD 2006 Discovery Challenge posed a topical problem for predictive modelers: how to
More informationChurn Modeling for Mobile Telecommunications:
Churn Modeling for Mobile Telecommunications: Winning the Duke/NCR Teradata Center for CRM Competition N. Scott Cardell, Mikhail Golovnya, Dan Steinberg Salford Systems http://www.salford-systems.com June
More informationData Mining Opportunities in Health Insurance
Data Mining Opportunities in Health Insurance Methods Innovations and Case Studies Dan Steinberg, Ph.D. Copyright Salford Systems 2008 Analytical Challenges for Health Insurance Competitive pressures in
More informationData Mining Approaches to Modeling Insurance Risk. Dan Steinberg, Mikhail Golovnya, Scott Cardell. Salford Systems 2009
Data Mining Approaches to Modeling Insurance Risk Dan Steinberg, Mikhail Golovnya, Scott Cardell Salford Systems 2009 Overview of Topics Covered Examples in the Insurance Industry Predicting at the outset
More informationTHE HYBRID CART-LOGIT MODEL IN CLASSIFICATION AND DATA MINING. Dan Steinberg and N. Scott Cardell
THE HYBID CAT-LOGIT MODEL IN CLASSIFICATION AND DATA MINING Introduction Dan Steinberg and N. Scott Cardell Most data-mining projects involve classification problems assigning objects to classes whether
More informationThe Predictive Data Mining Revolution in Scorecards:
January 13, 2013 StatSoft White Paper The Predictive Data Mining Revolution in Scorecards: Accurate Risk Scoring via Ensemble Models Summary Predictive modeling methods, based on machine learning algorithms
More informationNew Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Introduction
Introduction New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Predictive analytics encompasses the body of statistical knowledge supporting the analysis of massive data sets.
More informationBenchmarking of different classes of models used for credit scoring
Benchmarking of different classes of models used for credit scoring We use this competition as an opportunity to compare the performance of different classes of predictive models. In particular we want
More informationClassification of Bad Accounts in Credit Card Industry
Classification of Bad Accounts in Credit Card Industry Chengwei Yuan December 12, 2014 Introduction Risk management is critical for a credit card company to survive in such competing industry. In addition
More informationBOOSTED REGRESSION TREES: A MODERN WAY TO ENHANCE ACTUARIAL MODELLING
BOOSTED REGRESSION TREES: A MODERN WAY TO ENHANCE ACTUARIAL MODELLING Xavier Conort xavier.conort@gear-analytics.com Session Number: TBR14 Insurance has always been a data business The industry has successfully
More informationHow To Make A Credit Risk Model For A Bank Account
TRANSACTIONAL DATA MINING AT LLOYDS BANKING GROUP Csaba Főző csaba.fozo@lloydsbanking.com 15 October 2015 CONTENTS Introduction 04 Random Forest Methodology 06 Transactional Data Mining Project 17 Conclusions
More informationData Analysis of Trends in iphone 5 Sales on ebay
Data Analysis of Trends in iphone 5 Sales on ebay By Wenyu Zhang Mentor: Professor David Aldous Contents Pg 1. Introduction 3 2. Data and Analysis 4 2.1 Description of Data 4 2.2 Retrieval of Data 5 2.3
More informationEXPLORING & MODELING USING INTERACTIVE DECISION TREES IN SAS ENTERPRISE MINER. Copyr i g ht 2013, SAS Ins titut e Inc. All rights res er ve d.
EXPLORING & MODELING USING INTERACTIVE DECISION TREES IN SAS ENTERPRISE MINER ANALYTICS LIFECYCLE Evaluate & Monitor Model Formulate Problem Data Preparation Deploy Model Data Exploration Validate Models
More informationWhy Ensembles Win Data Mining Competitions
Why Ensembles Win Data Mining Competitions A Predictive Analytics Center of Excellence (PACE) Tech Talk November 14, 2012 Dean Abbott Abbott Analytics, Inc. Blog: http://abbottanalytics.blogspot.com URL:
More informationData Mining - Evaluation of Classifiers
Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010
More informationHeritage Provider Network Health Prize Round 3 Milestone: Team crescendo s Solution
Heritage Provider Network Health Prize Round 3 Milestone: Team crescendo s Solution Rie Johnson Tong Zhang 1 Introduction This document describes our entry nominated for the second prize of the Heritage
More informationData Mining: Overview. What is Data Mining?
Data Mining: Overview What is Data Mining? Recently * coined term for confluence of ideas from statistics and computer science (machine learning and database methods) applied to large databases in science,
More informationData Mining Practical Machine Learning Tools and Techniques
Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea
More informationData Mining for Fun and Profit
Data Mining for Fun and Profit Data mining is the extraction of implicit, previously unknown, and potentially useful information from data. - Ian H. Witten, Data Mining: Practical Machine Learning Tools
More informationSome vendors have a big presence in a particular industry; some are geared toward data scientists, others toward business users.
Bonus Chapter Ten Major Predictive Analytics Vendors In This Chapter Angoss FICO IBM RapidMiner Revolution Analytics Salford Systems SAP SAS StatSoft, Inc. TIBCO This chapter highlights ten of the major
More informationData Mining in Sports Analytics. Salford Systems Dan Steinberg Mikhail Golovnya
Data Mining in Sports Analytics Salford Systems Dan Steinberg Mikhail Golovnya Data Mining Defined Data mining is the search for patterns in data using modern highly automated, computer intensive methods
More informationA Property & Casualty Insurance Predictive Modeling Process in SAS
Paper AA-02-2015 A Property & Casualty Insurance Predictive Modeling Process in SAS 1.0 ABSTRACT Mei Najim, Sedgwick Claim Management Services, Chicago, Illinois Predictive analytics has been developing
More informationMining Direct Marketing Data by Ensembles of Weak Learners and Rough Set Methods
Mining Direct Marketing Data by Ensembles of Weak Learners and Rough Set Methods Jerzy B laszczyński 1, Krzysztof Dembczyński 1, Wojciech Kot lowski 1, and Mariusz Paw lowski 2 1 Institute of Computing
More informationLocation matters. 3 techniques to incorporate geo-spatial effects in one's predictive model
Location matters. 3 techniques to incorporate geo-spatial effects in one's predictive model Xavier Conort xavier.conort@gear-analytics.com Motivation Location matters! Observed value at one location is
More informationBetter credit models benefit us all
Better credit models benefit us all Agenda Credit Scoring - Overview Random Forest - Overview Random Forest outperform logistic regression for credit scoring out of the box Interaction term hypothesis
More informationGetting Even More Out of Ensemble Selection
Getting Even More Out of Ensemble Selection Quan Sun Department of Computer Science The University of Waikato Hamilton, New Zealand qs12@cs.waikato.ac.nz ABSTRACT Ensemble Selection uses forward stepwise
More informationMachine Learning Big Data using Map Reduce
Machine Learning Big Data using Map Reduce By Michael Bowles, PhD Where Does Big Data Come From? -Web data (web logs, click histories) -e-commerce applications (purchase histories) -Retail purchase histories
More informationPackage acrm. R topics documented: February 19, 2015
Package acrm February 19, 2015 Type Package Title Convenience functions for analytical Customer Relationship Management Version 0.1.1 Date 2014-03-28 Imports dummies, randomforest, kernelfactory, ada Author
More informationGerry Hobbs, Department of Statistics, West Virginia University
Decision Trees as a Predictive Modeling Method Gerry Hobbs, Department of Statistics, West Virginia University Abstract Predictive modeling has become an important area of interest in tasks such as credit
More informationData Mining Techniques Chapter 6: Decision Trees
Data Mining Techniques Chapter 6: Decision Trees What is a classification decision tree?.......................................... 2 Visualizing decision trees...................................................
More informationKnowledge Discovery and Data Mining. Bootstrap review. Bagging Important Concepts. Notes. Lecture 19 - Bagging. Tom Kelsey. Notes
Knowledge Discovery and Data Mining Lecture 19 - Bagging Tom Kelsey School of Computer Science University of St Andrews http://tom.host.cs.st-andrews.ac.uk twk@st-andrews.ac.uk Tom Kelsey ID5059-19-B &
More informationAn Overview of Data Mining: Predictive Modeling for IR in the 21 st Century
An Overview of Data Mining: Predictive Modeling for IR in the 21 st Century Nora Galambos, PhD Senior Data Scientist Office of Institutional Research, Planning & Effectiveness Stony Brook University AIRPO
More informationThe Operational Value of Social Media Information. Social Media and Customer Interaction
The Operational Value of Social Media Information Dennis J. Zhang (Kellogg School of Management) Ruomeng Cui (Kelley School of Business) Santiago Gallino (Tuck School of Business) Antonio Moreno-Garcia
More informationPredicting borrowers chance of defaulting on credit loans
Predicting borrowers chance of defaulting on credit loans Junjie Liang (junjie87@stanford.edu) Abstract Credit score prediction is of great interests to banks as the outcome of the prediction algorithm
More informationComparison of Data Mining Techniques used for Financial Data Analysis
Comparison of Data Mining Techniques used for Financial Data Analysis Abhijit A. Sawant 1, P. M. Chawan 2 1 Student, 2 Associate Professor, Department of Computer Technology, VJTI, Mumbai, INDIA Abstract
More informationEnsemble Methods. Adapted from slides by Todd Holloway h8p://abeau<fulwww.com/2007/11/23/ ensemble- machine- learning- tutorial/
Ensemble Methods Adapted from slides by Todd Holloway h8p://abeau
More informationAddressing Analytics Challenges in the Insurance Industry. Noe Tuason California State Automobile Association
Addressing Analytics Challenges in the Insurance Industry Noe Tuason California State Automobile Association Overview Two Challenges: 1. Identifying High/Medium Profit who are High/Low Risk of Flight Prospects
More informationData Mining Methods: Applications for Institutional Research
Data Mining Methods: Applications for Institutional Research Nora Galambos, PhD Office of Institutional Research, Planning & Effectiveness Stony Brook University NEAIR Annual Conference Philadelphia 2014
More informationRegularized Logistic Regression for Mind Reading with Parallel Validation
Regularized Logistic Regression for Mind Reading with Parallel Validation Heikki Huttunen, Jukka-Pekka Kauppi, Jussi Tohka Tampere University of Technology Department of Signal Processing Tampere, Finland
More informationData Mining. Nonlinear Classification
Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Nonlinear Classification Classes may not be separable by a linear boundary Suppose we randomly generate a data set as follows: X has range between 0 to 15
More informationHow To Use Data Mining For Loyalty Based Management
Data Mining for Loyalty Based Management Petra Hunziker, Andreas Maier, Alex Nippe, Markus Tresch, Douglas Weers, Peter Zemp Credit Suisse P.O. Box 100, CH - 8070 Zurich, Switzerland markus.tresch@credit-suisse.ch,
More informationBeating the NCAA Football Point Spread
Beating the NCAA Football Point Spread Brian Liu Mathematical & Computational Sciences Stanford University Patrick Lai Computer Science Department Stanford University December 10, 2010 1 Introduction Over
More informationRisk pricing for Australian Motor Insurance
Risk pricing for Australian Motor Insurance Dr Richard Brookes November 2012 Contents 1. Background Scope How many models? 2. Approach Data Variable filtering GLM Interactions Credibility overlay 3. Model
More informationJournée Thématique Big Data 13/03/2015
Journée Thématique Big Data 13/03/2015 1 Agenda About Flaminem What Do We Want To Predict? What Is The Machine Learning Theory Behind It? How Does It Work In Practice? What Is Happening When Data Gets
More informationCART 6.0 Feature Matrix
CART 6.0 Feature Matri Enhanced Descriptive Statistics Full summary statistics Brief summary statistics Stratified summary statistics Charts and histograms Improved User Interface New setup activity window
More informationData Mining Application in Direct Marketing: Identifying Hot Prospects for Banking Product
Data Mining Application in Direct Marketing: Identifying Hot Prospects for Banking Product Sagarika Prusty Web Data Mining (ECT 584),Spring 2013 DePaul University,Chicago sagarikaprusty@gmail.com Keywords:
More informationJetBlue Airways Stock Price Analysis and Prediction
JetBlue Airways Stock Price Analysis and Prediction Team Member: Lulu Liu, Jiaojiao Liu DSO530 Final Project JETBLUE AIRWAYS STOCK PRICE ANALYSIS AND PREDICTION 1 Motivation Started in February 2000, JetBlue
More informationLeveraging Ensemble Models in SAS Enterprise Miner
ABSTRACT Paper SAS133-2014 Leveraging Ensemble Models in SAS Enterprise Miner Miguel Maldonado, Jared Dean, Wendy Czika, and Susan Haller SAS Institute Inc. Ensemble models combine two or more models to
More informationData Mining Algorithms Part 1. Dejan Sarka
Data Mining Algorithms Part 1 Dejan Sarka Join the conversation on Twitter: @DevWeek #DW2015 Instructor Bio Dejan Sarka (dsarka@solidq.com) 30 years of experience SQL Server MVP, MCT, 13 books 7+ courses
More informationDistributed forests for MapReduce-based machine learning
Distributed forests for MapReduce-based machine learning Ryoji Wakayama, Ryuei Murata, Akisato Kimura, Takayoshi Yamashita, Yuji Yamauchi, Hironobu Fujiyoshi Chubu University, Japan. NTT Communication
More informationDecision Trees from large Databases: SLIQ
Decision Trees from large Databases: SLIQ C4.5 often iterates over the training set How often? If the training set does not fit into main memory, swapping makes C4.5 unpractical! SLIQ: Sort the values
More informationStudying Auto Insurance Data
Studying Auto Insurance Data Ashutosh Nandeshwar February 23, 2010 1 Introduction To study auto insurance data using traditional and non-traditional tools, I downloaded a well-studied data from http://www.statsci.org/data/general/motorins.
More informationComparing the Results of Support Vector Machines with Traditional Data Mining Algorithms
Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Scott Pion and Lutz Hamel Abstract This paper presents the results of a series of analyses performed on direct mail
More informationUsing Adaptive Random Trees (ART) for optimal scorecard segmentation
A FAIR ISAAC WHITE PAPER Using Adaptive Random Trees (ART) for optimal scorecard segmentation By Chris Ralph Analytic Science Director April 2006 Summary Segmented systems of models are widely recognized
More informationUsing multiple models: Bagging, Boosting, Ensembles, Forests
Using multiple models: Bagging, Boosting, Ensembles, Forests Bagging Combining predictions from multiple models Different models obtained from bootstrap samples of training data Average predictions or
More informationDATA MINING TECHNIQUES AND APPLICATIONS
DATA MINING TECHNIQUES AND APPLICATIONS Mrs. Bharati M. Ramageri, Lecturer Modern Institute of Information Technology and Research, Department of Computer Application, Yamunanagar, Nigdi Pune, Maharashtra,
More informationBeating the MLB Moneyline
Beating the MLB Moneyline Leland Chen llxchen@stanford.edu Andrew He andu@stanford.edu 1 Abstract Sports forecasting is a challenging task that has similarities to stock market prediction, requiring time-series
More informationInsurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví. Pavel Kříž. Seminář z aktuárských věd MFF 4.
Insurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví Pavel Kříž Seminář z aktuárských věd MFF 4. dubna 2014 Summary 1. Application areas of Insurance Analytics 2. Insurance Analytics
More informationFraud Detection for Online Retail using Random Forests
Fraud Detection for Online Retail using Random Forests Eric Altendorf, Peter Brende, Josh Daniel, Laurent Lessard Abstract As online commerce becomes more common, fraud is an increasingly important concern.
More informationA Decision Theoretic Approach to Targeted Advertising
82 UNCERTAINTY IN ARTIFICIAL INTELLIGENCE PROCEEDINGS 2000 A Decision Theoretic Approach to Targeted Advertising David Maxwell Chickering and David Heckerman Microsoft Research Redmond WA, 98052-6399 dmax@microsoft.com
More informationWhy do statisticians "hate" us?
Why do statisticians "hate" us? David Hand, Heikki Mannila, Padhraic Smyth "Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data
More informationMachine Learning Methods for Demand Estimation
Machine Learning Methods for Demand Estimation By Patrick Bajari, Denis Nekipelov, Stephen P. Ryan, and Miaoyu Yang Over the past decade, there has been a high level of interest in modeling consumer behavior
More informationModel Combination. 24 Novembre 2009
Model Combination 24 Novembre 2009 Datamining 1 2009-2010 Plan 1 Principles of model combination 2 Resampling methods Bagging Random Forests Boosting 3 Hybrid methods Stacking Generic algorithm for mulistrategy
More informationREVIEW OF ENSEMBLE CLASSIFICATION
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IJCSMC, Vol. 2, Issue.
More informationWinning the Kaggle Algorithmic Trading Challenge with the Composition of Many Models and Feature Engineering
IEICE Transactions on Information and Systems, vol.e96-d, no.3, pp.742-745, 2013. 1 Winning the Kaggle Algorithmic Trading Challenge with the Composition of Many Models and Feature Engineering Ildefons
More informationTable of Contents. June 2010
June 2010 From: StatSoft Analytics White Papers To: Internal release Re: Performance comparison of STATISTICA Version 9 on multi-core 64-bit machines with current 64-bit releases of SAS (Version 9.2) and
More informationSocial Media Mining. Data Mining Essentials
Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers
More informationSEIZE THE DATA. 2015 SEIZE THE DATA. 2015
1 Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. BIG DATA CONFERENCE 2015 Boston August 10-13 Predicting and reducing deforestation
More informationDATA MINING TECHNIQUES SUPPORT TO KNOWLEGDE OF BUSINESS INTELLIGENT SYSTEM
INTERNATIONAL JOURNAL OF RESEARCH IN COMPUTER APPLICATIONS AND ROBOTICS ISSN 2320-7345 DATA MINING TECHNIQUES SUPPORT TO KNOWLEGDE OF BUSINESS INTELLIGENT SYSTEM M. Mayilvaganan 1, S. Aparna 2 1 Associate
More informationPredictive Modeling and Big Data
Predictive Modeling and Presented by Eileen Burns, FSA, MAAA Milliman Agenda Current uses of predictive modeling in the life insurance industry Potential applications of 2 1 June 16, 2014 [Enter presentation
More informationFine Particulate Matter Concentration Level Prediction by using Tree-based Ensemble Classification Algorithms
Fine Particulate Matter Concentration Level Prediction by using Tree-based Ensemble Classification Algorithms Yin Zhao School of Mathematical Sciences Universiti Sains Malaysia (USM) Penang, Malaysia Yahya
More informationSelecting Data Mining Model for Web Advertising in Virtual Communities
Selecting Data Mining for Web Advertising in Virtual Communities Jerzy Surma Faculty of Business Administration Warsaw School of Economics Warsaw, Poland e-mail: jerzy.surma@gmail.com Mariusz Łapczyński
More informationKnowledgeSTUDIO HIGH-PERFORMANCE PREDICTIVE ANALYTICS USING ADVANCED MODELING TECHNIQUES
HIGH-PERFORMANCE PREDICTIVE ANALYTICS USING ADVANCED MODELING TECHNIQUES Translating data into business value requires the right data mining and modeling techniques which uncover important patterns within
More informationHow to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning
How to use Big Data in Industry 4.0 implementations LAURI ILISON, PhD Head of Big Data and Machine Learning Big Data definition? Big Data is about structured vs unstructured data Big Data is about Volume
More informationA Review of Data Mining Techniques
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 4, April 2014,
More informationLIVEPERSON SOLUTIONS BRIEF. Identify Your Highest Value Visitors for Real-Time Engagement and Increased Sales
LIVEPERSON SOLUTIONS BRIEF Identify Your Highest Value Visitors for Real-Time Engagement and Increased Sales 2014 LIVEPERSON SOLUTIONS BRIEF Targeting Targeting technology in LivePerson s LiveEngage digital
More informationHow To Do Data Mining In R
Data Mining with R John Maindonald (Centre for Mathematics and Its Applications, Australian National University) and Yihui Xie (School of Statistics, Renmin University of China) December 13, 2008 Data
More informationSTATISTICA. Financial Institutions. Case Study: Credit Scoring. and
Financial Institutions and STATISTICA Case Study: Credit Scoring STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table of Contents INTRODUCTION: WHAT
More informationChapter 12 Bagging and Random Forests
Chapter 12 Bagging and Random Forests Xiaogang Su Department of Statistics and Actuarial Science University of Central Florida - 1 - Outline A brief introduction to the bootstrap Bagging: basic concepts
More informationGeneralizing Random Forests Principles to other Methods: Random MultiNomial Logit, Random Naive Bayes, Anita Prinzie & Dirk Van den Poel
Generalizing Random Forests Principles to other Methods: Random MultiNomial Logit, Random Naive Bayes, Anita Prinzie & Dirk Van den Poel Copyright 2008 All rights reserved. Random Forests Forest of decision
More informationTitle: Lending Club Interest Rates are closely linked with FICO scores and Loan Length
Title: Lending Club Interest Rates are closely linked with FICO scores and Loan Length Introduction: The Lending Club is a unique website that allows people to directly borrow money from other people [1].
More informationNine Common Types of Data Mining Techniques Used in Predictive Analytics
1 Nine Common Types of Data Mining Techniques Used in Predictive Analytics By Laura Patterson, President, VisionEdge Marketing Predictive analytics enable you to develop mathematical models to help better
More informationAssessing Data Mining: The State of the Practice
Assessing Data Mining: The State of the Practice 2003 Herbert A. Edelstein Two Crows Corporation 10500 Falls Road Potomac, Maryland 20854 www.twocrows.com (301) 983-3555 Objectives Separate myth from reality
More informationPredictive Modeling in Workers Compensation 2008 CAS Ratemaking Seminar
Predictive Modeling in Workers Compensation 2008 CAS Ratemaking Seminar Prepared by Louise Francis, FCAS, MAAA Francis Analytics and Actuarial Data Mining, Inc. www.data-mines.com Louise.francis@data-mines.cm
More informationIncreasing Marketing ROI with Optimized Prediction
Increasing Marketing ROI with Optimized Prediction Yottamine s Unique and Powerful Solution Smart marketers are using predictive analytics to make the best offer to the best customer for the least cost.
More informationMortgage Business Transformation Program Using CART-based Joint Risk Modeling (and including a practical discussion of TreeNet)
For a free evaluation of the software used, visit http://www.salford-systems.com and click the download button on the homepage. Mortgage Business Transformation Program Using CART-based Joint Risk Modeling
More informationPredicting the End-Price of Online Auctions
Predicting the End-Price of Online Auctions Rayid Ghani, Hillery Simmons Accenture Technology Labs, Chicago, IL 60601, USA Rayid.Ghani@accenture.com, Hillery.D.Simmons@accenture.com Abstract. Online auctions
More informationA Comparison of Variable Selection Techniques for Credit Scoring
1 A Comparison of Variable Selection Techniques for Credit Scoring K. Leung and F. Cheong and C. Cheong School of Business Information Technology, RMIT University, Melbourne, Victoria, Australia E-mail:
More informationnot possible or was possible at a high cost for collecting the data.
Data Mining and Knowledge Discovery Generating knowledge from data Knowledge Discovery Data Mining White Paper Organizations collect a vast amount of data in the process of carrying out their day-to-day
More informationData Mining and Exploration. Data Mining and Exploration: Introduction. Relationships between courses. Overview. Course Introduction
Data Mining and Exploration Data Mining and Exploration: Introduction Amos Storkey, School of Informatics January 10, 2006 http://www.inf.ed.ac.uk/teaching/courses/dme/ Course Introduction Welcome Administration
More informationScalable Machine Learning to Exploit Big Data for Knowledge Discovery
Scalable Machine Learning to Exploit Big Data for Knowledge Discovery Una-May O Reilly MIT MIT ILP-EPOCH Taiwan Symposium Big Data: Technologies and Applications Lots of Data Everywhere Knowledge Mining
More informationPredicting required bandwidth for educational institutes using prediction techniques in data mining (Case Study: Qom Payame Noor University)
260 IJCSNS International Journal of Computer Science and Network Security, VOL.11 No.6, June 2011 Predicting required bandwidth for educational institutes using prediction techniques in data mining (Case
More informationD-optimal plans in observational studies
D-optimal plans in observational studies Constanze Pumplün Stefan Rüping Katharina Morik Claus Weihs October 11, 2005 Abstract This paper investigates the use of Design of Experiments in observational
More informationAdvanced Big Data Analytics with R and Hadoop
REVOLUTION ANALYTICS WHITE PAPER Advanced Big Data Analytics with R and Hadoop 'Big Data' Analytics as a Competitive Advantage Big Analytics delivers competitive advantage in two ways compared to the traditional
More informationUnderstanding Characteristics of Caravan Insurance Policy Buyer
Understanding Characteristics of Caravan Insurance Policy Buyer May 10, 2007 Group 5 Chih Hau Huang Masami Mabuchi Muthita Songchitruksa Nopakoon Visitrattakul Executive Summary This report is intended
More informationAdvanced In-Database Analytics
Advanced In-Database Analytics Tallinn, Sept. 25th, 2012 Mikko-Pekka Bertling, BDM Greenplum EMEA 1 That sounds complicated? 2 Who can tell me how best to solve this 3 What are the main mathematical functions??
More informationA Study on Prediction of User s Tendency Toward Purchases in Online Websites Based on Behavior Models
A Study on Prediction of User s Tendency Toward Purchases in Online Websites Based on Behavior Models Emad Gohari Boroujerdi, Soroush Mehri, Saeed Sadeghi Garmaroudi, Mohammad Pezeshki, Farid Rashidi Mehrabadi,
More informationApplying Customer Attitudinal Segmentation to Improve Marketing Campaigns Wenhong Wang, Deluxe Corporation Mark Antiel, Deluxe Corporation
Applying Customer Attitudinal Segmentation to Improve Marketing Campaigns Wenhong Wang, Deluxe Corporation Mark Antiel, Deluxe Corporation ABSTRACT Customer segmentation is fundamental for successful marketing
More informationData Mining Applications in Higher Education
Executive report Data Mining Applications in Higher Education Jing Luan, PhD Chief Planning and Research Officer, Cabrillo College Founder, Knowledge Discovery Laboratories Table of contents Introduction..............................................................2
More information