Advanced Ensemble Strategies for Polynomial Models




Pavel Kordík 1, Jan Černý 2
1 Dept. of Computer Science, Faculty of Information Technology, Czech Technical University in Prague
2 Dept. of Computer Science and Engineering, FEE, CTU in Prague, Karlovo nám. 13, 121 35 Praha 2, Czech Republic
kordikp@fit.cvut.cz

Abstract. Recently, ensemble methods have been shown to improve the accuracy of weak learners and to reduce the overfitting of models with high plasticity. In this paper, we experiment with various state-of-the-art ensemble strategies applied to polynomial models. We also explore the efficiency of ensembling when applied to polynomial models of increasing plasticity. The results of our experiments show that the effect of ensembling is extremely data dependent. For the artificial Donoho-Johnstone benchmarking data, we interpret the results of all the ensembles we have experimented with.

Keywords: Inductive modelling, Meta-learning, Ensemble strategies, Bagging, Boosting, Stacking, Cascade generalization, Polynomial models.

1 Introduction

The combination (ensembling, blending) of diverse models (classifiers, regression models) has become mainstream in data mining. The increasing popularity of these methods is significantly influenced by their success in various data mining competitions [3, 11]. The better generalization performance of an ensemble can be explained by the bias-variance error decomposition [10]. Well-established methods such as Bagging, Boosting, and Stacking have been applied to combine most of the existing data mining algorithms. Ensembles such as random forests, neural network ensembles, and ensembles of classifiers are also widely used. Ensembling often increases the plasticity of weak learners and improves the generalization of overfitted base models. Novel methods for the combination of classifiers are summarized in [4, 12]. The related problem of combining regression models has been targeted by individual studies, but no comprehensive summary is available.
In this paper, we present several ensemble techniques for regression problems. We also study the effect of ensembling for base models with increasing plasticity.

1.1 Ensembles and inductive modelling

In our study, we use polynomial base models whose plasticity is regulated by the maximal degree of the polynomial. Coefficients are estimated by the standard least mean squares method. Ensembling is not new to the field of inductive modelling. The combination of weak learners (polynomial models) was used in the MIA GMDH algorithm [9], which is closely related to the modern Cascade Generalization algorithm [8, 5]. Modern ensembles promote the diversity of base models. Bagging trains each base model on data selected at random, with replacement, from the original training data; this allows base models to specialize on the particular data samples included in their training sets. Another approach that actively promotes diversity in the ensemble is the negative correlation algorithm [13].
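As an illustration of such a base learner, the following is a minimal sketch (our own, using NumPy; the function names are not from the paper) of a polynomial model whose plasticity is controlled by the maximal degree, with coefficients estimated by least squares:

```python
import numpy as np

def fit_polynomial(x, y, degree):
    """Fit a 1-D polynomial of the given maximal degree by least squares."""
    # Vandermonde design matrix: columns x^0, x^1, ..., x^degree.
    X = np.vander(x, degree + 1, increasing=True)
    # Ordinary least-squares estimate of the coefficients.
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs

def predict_polynomial(coeffs, x):
    X = np.vander(x, len(coeffs), increasing=True)
    return X @ coeffs

# Example: a higher degree means higher plasticity (and a higher risk
# of overfitting noisy data).
x = np.linspace(0, 1, 50)
y = np.sin(2 * np.pi * x)
coeffs = fit_polynomial(x, y, degree=9)
```

Raising the degree lets the model track the target more closely, which is exactly the plasticity knob the experiments below vary.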

In the original MIA GMDH algorithm, the diversity of base models was promoted by providing different input features to each unit (base model). The same approach is used in the Random Forest algorithm. In the next section, the individual ensemble methods are briefly described.

2 Ensemble methods

Most of the ensemble methods we have implemented in our FAKE GAME open source software project [2] are well known for classifiers; we had to adjust them for regression purposes. Often, the difference is minor.

In the Bagging algorithm [7, 6, 5], several training subsets are created from the original training set and a model is built for each subset. The response of the ensemble to an unknown instance is the average of the base-model responses.

In the Boosting algorithm [14, 5], there is a single model in the beginning. In the regions where it fails to respond correctly, the instances are boosted: their weights are increased and a second model is created. The second model therefore focuses on the instances where the first model failed. In the regions where the second model fails, the weights of the instances are increased again, a third model is introduced, and so on. The output for an unknown instance is given by a weighted average, where the weight of each model is calculated from its performance (the better the model, the greater the weight).

Stacking [15, 5] uses several different models trained on a single training set. The responses of these models form meta-data, which serve as inputs to a single final model; this final model computes the output for unknown instances.

Cascade Generalization [8, 5] represents a sequential approach to the combination of models. A sequence of learners is created, and the inputs of learner A_i consist of the base input data together with the outputs of the previous models A_1, ..., A_{i-1}. Each model in the sequence is therefore partly a base-learner and partly a meta-learner. Unknown instances pass through the sequence, and the output of the last model becomes the output of the whole ensemble.
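To make the regression variant of Bagging concrete, here is a minimal self-contained sketch (our own illustration, not the FAKE GAME implementation) that trains polynomial models on bootstrap samples and averages their responses:

```python
import numpy as np

def bagging_fit(x, y, n_models=5, degree=3, rng=None):
    """Train n_models polynomial models, each on a bootstrap sample."""
    if rng is None:
        rng = np.random.default_rng(0)
    models = []
    for _ in range(n_models):
        # Sample with replacement from the original training set.
        idx = rng.integers(0, len(x), size=len(x))
        X = np.vander(x[idx], degree + 1, increasing=True)
        coeffs, *_ = np.linalg.lstsq(X, y[idx], rcond=None)
        models.append(coeffs)
    return models

def bagging_predict(models, x):
    # The ensemble output is the average of the base-model responses.
    preds = [np.vander(x, len(c), increasing=True) @ c for c in models]
    return np.mean(preds, axis=0)
```

Because each bootstrap sample omits roughly a third of the original instances, the base models differ, which is the source of the diversity Bagging relies on.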
We have also designed and implemented two local methods: Area Specialization and Divide ensemble.

2.1 Area Specialization

Area Specialization uses the same learning procedure as Boosting by default, although any other learning method can be used. The essence of Area Specialization lies in the output computation for unknown instances: in short, it returns the output of the model that is best in the area where the unknown vector lies. First, the distance of the unknown vector to all learning vectors is calculated, and the closest N vectors are taken into the next step (N is an algorithm parameter called area, which determines how smooth the transitions between the areas of the different models will be). Then, the best model is chosen for every learning vector selected in the first phase. Next, for each selected learning vector, a difference is computed between its target value and the output of its best model for the unknown vector. These differences are inverted using a Gaussian function and summed up over all N learning vectors into the weights of the corresponding models. The model weights are finally used in a weighted-average output.

2.2 Divide ensemble

The Divide ensemble divides the learning data into clusters (for example using the k-means algorithm) and assigns one model to each cluster. The response to an unknown instance is given by the model in charge of the cluster to which the unknown instance belongs. This approach has two main advantages. First, a model trained on a smaller area of the learning data will probably perform better, because it has a higher chance to adapt to it. Second, dividing the data into smaller chunks and learning a greater number of smaller models may speed up learning (if the learning algorithm has more than linear complexity in the number of learning vectors). To reduce a model's unexpected behaviour near the edge of its cluster, where there are usually few or no learning vectors, we use a clustering modification that enlarges all clusters by a certain amount. Function 2 describes the process. Its inputs are the coordinates of the cluster centroids and an array containing the indexes of the vectors of each cluster.
Each vector is checked by comparing its distance to the other centers with the distance to its own center. If the ratio of these distances is below a certain threshold, the vector is added to the cluster belonging to the other center (which means a vector can be in more than one cluster simultaneously). This feature improves the model error, reduces outlier values, and makes the transitions between models (clusters) smoother.

Function 1 AreaSpecialization.getOutput(unknownVector)
  /* Compute the Euclidean distances of the learning vectors to unknownVector. */
  distance[] <- computeDistanceToLearningVectors(unknownVector)
  /* Get the indexes of the learning vectors sorted by distance. */
  indexes <- sort(distance)
  for i = 0 to area do
    /* Take the i-th closest learning vector and find the model with the
       smallest error on that learning vector. */
    bestModelIndex <- getBestModelIndex(learningVectors[indexes[i]])
    /* Compute the difference between the target value of the closest learning
       vector and the output for unknownVector of the model which is best for
       that learning vector. */
    diff <- targetOutput[indexes[i]] - model[bestModelIndex].getOutput(unknownVector)
    /* Accumulate model weights according to how well the model performs on
       the learning vector. */
    modelWeights[bestModelIndex] += gaussian(diff)
  end for
  modelWeights <- normalize(modelWeights)
  ensembleOutput <- 0
  /* Weighted average. */
  for i = 0 to modelsNumber do
    ensembleOutput += model[i].getOutput(unknownVector) * modelWeights[i]
  end for

Function 2 DivideEnsemble.roughClustering(clusterCenters[], clusterIndexes[][])
  for each vector in data do
    distance[] <- computeDistanceToCenters(vector)
    /* Get the indexes of the centers sorted by distance. */
    indexes[] <- sort(distance)
    closestDistance <- distance[indexes[0]]
    for i = 1 to clusterCenters.length do
      currentDistance <- distance[indexes[i]]
      if distance[indexes[i]] > centerDistance[indexes[0]][indexes[i]] then
        /* If the distance of the vector to the other center is greater than
           the distance between the two centers, skip that vector (it is
           located in the half of its cluster that lies behind its own
           center). */
        continue
      end if
      /* clusterSizeMultiplier is an algorithm parameter which determines how
         much the clusters will be resized. */
      if currentDistance / closestDistance < clusterSizeMultiplier then
        addVectorToCluster(i, vector)
      end if
    end for
  end for
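For readers who prefer executable code, the output computation of Function 1 can be sketched in Python as follows. This is our own reconstruction under stated assumptions: models are plain callables, `gaussian` is taken to be `exp(-diff^2)`, and all names are ours, not the paper's.

```python
import numpy as np

def area_specialization_output(unknown, learn_x, learn_y, models, area=10):
    """Weighted-average ensemble output; each model is weighted by how
    well it performs on the learning vectors nearest the unknown one."""
    # Euclidean distance of every learning vector to the unknown vector.
    dist = np.linalg.norm(learn_x - unknown, axis=1)
    indexes = np.argsort(dist)
    weights = np.zeros(len(models))
    for i in indexes[:area]:
        # Model with the smallest error on this learning vector.
        errors = [abs(m(learn_x[i]) - learn_y[i]) for m in models]
        best = int(np.argmin(errors))
        # Difference between the target value and that model's output
        # for the unknown vector, inverted by a Gaussian.
        diff = learn_y[i] - models[best](unknown)
        weights[best] += np.exp(-diff ** 2)
    weights /= weights.sum()
    # Weighted average of the model outputs.
    return sum(w * m(unknown) for w, m in zip(weights, models))
```

The `area` parameter plays the role described in Section 2.1: a larger value smooths the transition between the regions where different models dominate.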

We have benchmarked these ensemble methods using polynomial base models on a synthetic data set.

3 Experimental setup

The Donoho-Johnstone benchmark contains four synthetic time series (bump, block, doppler, heavysine) with no, medium, or high noise added (see Figure 1).

Fig. 1. The synthetic time series for benchmarking nonlinear regression models, with medium noise added.

In the following experiment, we used all four series with medium noise. For each series and ensemble method (5 base models in the ensemble by default), ten-fold cross-validation was repeated three times. The resulting root mean squared (RMS) error of one ensemble, modelling a single series, is therefore averaged over 30 measurements. High-degree polynomials can overfit the data severely, making the resulting error extremely high. Therefore, in the few cases where this was observed, we capped the maximal error at three times the standard deviation of the error.

4 Results

Figure 2 shows the ensemble results on the Donoho-Johnstone benchmarks, where the errors of the individual ensembles were averaged over all series with medium noise. In the graph (Figure 2), Boosting reduces the variance (in line with theoretical expectations) of ensembles with overfitted base models (degree 12 and above). With higher polynomial degrees, irregularities can be observed near the boundaries of the clusters used in the Divide (DIV) ensemble method. Stacking reduces overfitting here and makes the error development smoother. Area Specialization (AS) stays very close to the minimum of Stacking and of a single regular polynomial model. The combination of the AS and DIV ensemble methods achieves the best results on this data set.
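The evaluation protocol above (three repetitions of ten-fold cross-validation, RMS error averaged over the 30 resulting measurements) can be sketched as follows; this is our own illustration of the protocol, not the authors' code:

```python
import numpy as np

def repeated_cv_rmse(x, y, fit, predict, folds=10, repeats=3, seed=0):
    """Average RMS error over repeats x folds train/test splits."""
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(repeats):
        # A fresh random partition of the data for each repetition.
        perm = rng.permutation(len(x))
        for chunk in np.array_split(perm, folds):
            test = np.zeros(len(x), dtype=bool)
            test[chunk] = True
            # Train on the other folds, evaluate on the held-out fold.
            model = fit(x[~test], y[~test])
            resid = predict(model, x[test]) - y[test]
            errors.append(np.sqrt(np.mean(resid ** 2)))
    # Averaged over repeats * folds (here 30) measurements.
    return np.mean(errors)
```

Any `fit`/`predict` pair can be plugged in, e.g. `fit = lambda a, b: np.polyfit(a, b, 1)` and `predict = lambda m, a: np.polyval(m, a)` for a single polynomial model.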

International Conference on Inductive Modelling ICIM 2010

Fig. 2. RMSE of the polynomial ensembles (AS[DIV[P]], AS[P], BAG[P], BST[P], CG[P], P[QN], ST[P,L], DIV[P]) on the Donoho-Johnstone benchmarks with medium noise, for polynomial degrees 1 to 30.

The behaviour of the CG method is rather strange, and we have to analyze it in more detail. For higher polynomial degrees (15+) the variance is reduced, but for less plastic models the error is much higher than that of an individual model.

Fig. 3. Error of individual ensembles for base models with increasing plasticity on the Heart dataset from the UCI machine learning repository [1].

The last graph (see Figure 3) shows the performance of the ensembles on a data set with significantly different properties; the results are also significantly different. We performed several experiments and found that for data with similar properties (e.g. the probability distribution of the data, the number of features, the input-output relationship) we obtain similar behaviour of all the ensembles.

5 Conclusion

We have implemented several modern ensembling methods and experimented with polynomial units as base models. We can conclude that the behaviour of ensembles is extremely data dependent. In the future, we will also compare the properties of inductive ensembles such as MIA GMDH or GAME.

Acknowledgements

This research is partially supported by the grant Novel Model Ensembling Algorithms (SGS10/307/OHK3/3T/181) of the Czech Technical University in Prague and by the research program Transdisciplinary Research in the Area of Biomedical Engineering II (MSM6840770012) sponsored by the Ministry of Education, Youth and Sports of the Czech Republic.

References

[1] UCI Machine Learning Repository. Available at http://www.ics.uci.edu/~mlearn/MLSummary.html, September 2006.
[2] The FAKE GAME environment for automatic knowledge extraction. Available online at http://www.sourceforge.net/projects/fakegame, November 2008.
[3] J. Bennett and S. Lanning. The Netflix Prize. In KDD Cup and Workshop in conjunction with KDD, 2007.
[4] P. Brazdil, C. Giraud-Carrier, C. Soares, and R. Vilalta. Metalearning: Applications to Data Mining. Cognitive Technologies. Springer, January 2009.
[5] P. Brazdil, C. Giraud-Carrier, C. Soares, and R. Vilalta. Metalearning: Applications to Data Mining. Cognitive Technologies. Springer Berlin Heidelberg, 2009.
[6] L. Breiman. Bagging predictors. Mach. Learn., 24(2):123-140, 1996.
[7] Y. Freund and R. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In Proceedings of the Second European Conference on Computational Learning Theory, pages 23-37. Springer-Verlag, 1995.
[8] J. Gama and P. Brazdil. Cascade generalization. Mach. Learn., 41(3):315-343, 2000.
[9] A. G. Ivakhnenko. Polynomial theory of complex systems. IEEE Transactions on Systems, Man, and Cybernetics, SMC-1(1):364-378, 1971.
[10] R. A. Jacobs. Bias/variance analyses of mixtures-of-experts architectures. Neural Comput., 9(2):369-383, 1997.
[11] Y. Koren. Collaborative filtering with temporal dynamics. In KDD '09: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 447-456, New York, NY, USA, 2009. ACM.
[12] L. Kuncheva. Combining Pattern Classifiers: Methods and Algorithms. John Wiley and Sons, New York, 2004.
[13] Y. Liu and X. Yao. Ensemble learning via negative correlation. Neural Networks, 12:1399-1404, 1999.
[14] R. E. Schapire. The strength of weak learnability. Mach. Learn., 5(2):197-227, 1990.
[15] D. H. Wolpert. Stacked generalization. Neural Networks, 5:241-259, 1992.