Predictive Data Mining in Very Large Data Sets: A Demonstration and Comparison Under Model Ensemble




Predictive Data Mining in Very Large Data Sets: A Demonstration and Comparison Under Model Ensemble. Dr. Hongwei Patrick Yang, Educational Policy Studies & Evaluation, College of Education, University of Kentucky, Lexington, KY. Presented at the 2014 conference.

Overview
The study demonstrates predictive data mining models under model ensemble in the context of analyzing large data sets. Data mining is usually defined as the data-driven process of discovering meaningful hidden patterns in large amounts of data through automatic as well as manual means.

Overview
Many industries use data mining to address business problems such as bankruptcy prediction, risk management, and fraud detection. Such applications typically use predictive data mining models as learning machines, with a primary focus on making good predictions.

Overview
Among the many types of predictive data mining models are decision trees, neural networks, and (traditional) regression models:
Decision tree: identifies the most significant split of the outcome at each layer
Neural network: models nonlinear associations
For each of these models/learning machines, the outcome can be either categorical or numerical.

Overview
Model ensemble techniques, on the other hand, have recently become popular thanks to growing computational power. Bagging and boosting are two of the most popular ensemble techniques.
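Bagging (bootstrap aggregating) fits the same base learner to repeated bootstrap resamples of the training data and averages the results. The study itself ran in SAS Enterprise Miner; the sketch below is only an illustrative Python version of the idea (the function name `bagged_predict` and all settings are assumptions, not the paper's code).

```python
# Minimal bagging sketch: average trees fit on bootstrap resamples.
# Illustrative only; names and hyperparameters are assumed, not from the paper.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_predict(X_train, y_train, X_test, n_models=25, seed=0):
    """Average the predictions of trees fit on bootstrap resamples."""
    rng = np.random.default_rng(seed)
    n = len(y_train)
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)  # bootstrap sample (with replacement)
        tree = DecisionTreeRegressor(max_depth=5).fit(X_train[idx], y_train[idx])
        preds.append(tree.predict(X_test))
    return np.mean(preds, axis=0)  # committee average of the base models
```

Averaging over resamples mainly reduces the variance of an unstable base learner such as a deep tree, which is why bagging tends to help trees more than linear models.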

Overview
Model ensemble techniques are designed to create a model ensemble/committee containing multiple component/base models. The committee of models is averaged or pooled in a certain manner to improve the stability and accuracy of predictions.

Overview
Model ensemble techniques can be incorporated into many types of predictive models/learning machines (tree, neural network, regression, etc.). Ensemble-based modeling can also be combined with common feature/subset selection procedures (genetic algorithms, stepwise methods, all-possible-subsets, etc.).

Numerical examples
To demonstrate the effectiveness of predictive data mining in discovering meaningful information from large data, the study chooses the three commonly used types of predictive models and analyzes them in two large-scale applications.

Numerical examples
To further improve the predictions from each type of model, model ensemble is implemented during the modeling process to pool the predictions from the individual component models. For comparison purposes, all models are also fitted without creating any model ensemble.
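Pooling predictions across different model types can be sketched as follows. This loosely mirrors the combined regression + tree + neural network ensemble compared in the tables below (labeled EnRegTreeNN), but it is a rough Python approximation under assumed settings, not the SAS Enterprise Miner configuration the study used.

```python
# Sketch of pooling predictions from three component model types
# (regression, decision tree, neural network) by simple averaging.
# All model settings here are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.neural_network import MLPRegressor

def ensemble_predict(X_train, y_train, X_test):
    """Fit three heterogeneous models and average their predictions."""
    models = [
        LinearRegression(),
        DecisionTreeRegressor(max_depth=6, random_state=0),
        MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
    ]
    preds = [m.fit(X_train, y_train).predict(X_test) for m in models]
    return np.mean(preds, axis=0)  # pooled (averaged) prediction
```

Fitting the same three models individually and comparing their errors to the pooled prediction reproduces, in miniature, the with/without-ensemble comparison described above.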

Numerical examples
In addition, each model is evaluated for goodness of fit and performance at the final stage using various fit statistics, including average squared error, ROC index, misclassification rate, Gini coefficient, and the Kolmogorov-Smirnov (K-S) statistic, as applicable. The entire analysis is performed in SAS Enterprise Miner 7.1.
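The fit statistics named above have simple hand-computed forms for a binary outcome, shown in the sketch below (a rough illustration of the definitions, not SAS Enterprise Miner's exact binned implementations; the function name `fit_statistics` is assumed). Note the direct relationship Gini = 2 × ROC index − 1, which can be verified against the rows of Table 2.

```python
# Hand-computed fit statistics for a binary outcome with predicted
# probabilities p_hat: average squared error, misclassification rate,
# ROC index (AUC), Gini coefficient, and the K-S statistic.
import numpy as np
from sklearn.metrics import roc_auc_score

def fit_statistics(y_true, p_hat, cutoff=0.5):
    ase = np.mean((y_true - p_hat) ** 2)                  # average squared error
    miscls = np.mean((p_hat >= cutoff) != (y_true == 1))  # misclassification rate
    auc = roc_auc_score(y_true, p_hat)                    # ROC index
    gini = 2 * auc - 1                                    # Gini coefficient
    # K-S statistic: maximum gap between the score CDFs of events and non-events
    thresholds = np.unique(p_hat)
    cdf1 = np.array([(p_hat[y_true == 1] <= t).mean() for t in thresholds])
    cdf0 = np.array([(p_hat[y_true == 0] <= t).mean() for t in thresholds])
    ks = np.max(np.abs(cdf1 - cdf0))
    return {"ASE": ase, "misclassification": miscls,
            "ROC index": auc, "Gini": gini, "KS": ks}
```

For a perfectly separated toy case such as y = [0, 0, 1, 1] with scores [0.1, 0.2, 0.8, 0.9], the ROC index, Gini, and K-S all reach 1.0 and the misclassification rate is 0.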

Numerical examples
Example one: physicochemical properties of protein tertiary structure data; a numerical outcome, 45,730 cases.
Example two: bank marketing data; a categorical outcome, 41,188 cases.
Both data sets are retrieved from the UC Irvine (UCI) Machine Learning Repository.

Example one: Numerical outcome

Example one: Numerical outcome
Table 1. Comparison of Models Based on Training Data under a Numerical Outcome.

Model         Average Squared Error   Root Average Squared Error   Maximum Absolute Error
EnRegTreeNN   21.338                  4.619                        15.000
EnReg         22.874                  4.783                        14.818
EnNN          23.122                  4.809                        16.556
EnTree        25.193                  5.019                        16.131
NN            23.591                  4.857                        19.663
Reg           23.574                  4.855                        19.668
Tree          24.103                  4.910                        17.412

Example one: Numerical outcome
Ensemble models tend to be more effective in reducing errors, although this is not guaranteed (in Table 1, for instance, EnTree trails the single Tree):
Average squared error: lower is better
Root average squared error: lower is better
Maximum absolute error: lower is better

Example two: Categorical outcome

Example two: Categorical outcome
Table 2. Comparison of Models Based on Training Data under a Categorical Outcome.

Model         RASE    Miscls.  ROC Index  Gini    K-S     BinK-S  Gain      Cum. Lift  Cum. % Capt.
EnRegTreeNN   0.237   0.078    0.947      0.894   0.780   0.772   504.305   6.043      60.541
EnReg         0.241   0.081    0.935      0.871   0.719   0.717   455.744   5.557      55.676
EnNN          0.252   0.086    0.919      0.838   0.682   0.681   428.767   5.288      52.973
EnTree        0.270   0.101    0.801      0.602   0.579   0.576   395.325   4.953      49.623
Tree          0.254   0.090    0.900      0.800   0.697   0.692   441.595   5.416      54.179
NN            0.261   0.098    0.912      0.823   0.675   0.670   400.087   5.001      50.027
Reg           0.261   0.097    0.912      0.823   0.668   0.666   408.710   5.087      50.889

(RASE = root average squared error; Miscls. = misclassification rate; K-S = Kolmogorov-Smirnov statistic; BinK-S = bin-based two-way Kolmogorov-Smirnov statistic; Cum. % Capt. = cumulative percent captured response.)

Example two: Categorical outcome
The ensemble models generally show better discriminatory power than their single-model counterparts, as indicated by the criteria below:
Misclassification rate: lower is better
ROC index: higher is better
Gini coefficient: higher is better
K-S statistic: higher is better
Cumulative lift: higher is better
Cumulative percent captured response: higher is better

Conclusions
The study presents initial evidence for the effectiveness of model ensemble in improving the performance of an individual learning machine (model) of a given type. The study needs to be supplemented with additional information on the use of (real) bagging and boosting to improve the performance of individual learning machines.

Conclusions
The study provides applied researchers with more options beyond traditional regression modeling when reliable predictions are needed in their research. It also serves as the foundation for a future research topic that adds feature selection to predictive data mining modeling under model ensemble for analyzing very large data sets.

References
Ao, S. (2008). Data mining and applications in genomics. Berlin, Heidelberg, Germany: Springer Science+Business Media.
Bache, K., & Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
Barutcuoglu, Z., & Alpaydin, E. (2003). A comparison of model aggregation methods for regression. In O. Kaynak, E. Alpaydin, E. Oja, & L. Xu (Eds.), Artificial Neural Networks and Neural Information Processing - ICANN/ICONIP 2003 (pp. 76-83). New York, NY: Springer.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123-140.
Cerrito, P. B. (2006). Introduction to data mining: Using SAS Enterprise Miner. Cary, NC: SAS Institute Inc.
Drucker, H. (1997). Improving regressors using boosting techniques. Proceedings of the 14th International Conference on Machine Learning, 107-115.
Freund, Y. (1995). Boosting a weak learning algorithm by majority. Information and Computation, 121, 256-285.
Freund, Y., & Schapire, R. E. (1996). Experiments with a new boosting algorithm. Machine Learning: Proceedings of the Thirteenth International Conference, 148-156.
Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55, 119-139.
Hill, C. M., Malone, L. C., & Trocine, L. (2004). Data mining and traditional regression. In H. Bozdogan (Ed.), Statistical data mining and knowledge discovery (pp. 233-249). London, UK: Chapman and Hall/CRC.
Larose, D. T. (2005). Discovering knowledge in data: An introduction to data mining. Hoboken, NJ: John Wiley & Sons, Inc.
Liu, B., Cui, Q., Jiang, T., & Ma, S. (2004). A combinational feature selection and ensemble neural network method for classification of gene expression data. BMC Bioinformatics, 5, 136.
Oza, N. C. (2005). Ensemble data mining methods. In J. Wang (Ed.), Encyclopedia of Data Warehousing and Mining (pp. 448-453). Hershey, PA: Information Science Reference.
Schapire, R. E. (1990). The strength of weak learnability. Machine Learning, 5, 197-227.
Schapire, R. E. (2002). The boosting approach to machine learning: An overview. In D. D. Denison, M. H. Hansen, C. C. Holmes, B. Mallick, & B. Yu (Eds.), MSRI Workshop on Nonlinear Estimation and Classification. New York, NY: Springer.