Advanced Ensemble Strategies for Polynomial Models




Pavel Kordík 1, Jan Černý 2
1 Dept. of Computer Science, Faculty of Information Technology, Czech Technical University in Prague
2 Dept. of Computer Science and Engineering, FEE, CTU in Prague, Karlovo nám. 13, 121 35 Praha 2, Czech Republic
kordikp@fit.cvut.cz

Abstract. Recently, ensemble methods have been shown to improve the accuracy of weak learners and to reduce the overfitting of models with high plasticity. In this paper, we experiment with various state-of-the-art ensemble strategies applied to polynomial models. We also explore the efficiency of ensembling when applied to polynomial models of increasing plasticity. The results of our experiments show that the effect of ensembling is extremely data dependent. For the artificial Donoho-Johnstone benchmarking data, we interpret the results of all the ensembles we have experimented with.

Keywords: Inductive modelling, Meta-learning, Ensemble strategies, Bagging, Boosting, Stacking, Cascade generalization, Polynomial models.

1 Introduction

The combination (ensembling, blending) of diverse models (classifiers, regression models) has become mainstream in data mining. The increasing popularity of these methods is significantly influenced by their success in various data mining competitions [3, 11]. The better generalization performance of an ensemble can be explained by the bias-variance error decomposition [10]. Well-established methods such as Bagging, Boosting, and Stacking have been applied to combine most of the existing data mining algorithms. Ensembles such as random forests, neural network ensembles, and ensembles of classifiers are also widely used. Ensembling often increases the plasticity of weak learners and improves the generalization of overfitted base models. Novel methods for the combination of classifiers are summarized in [4, 12]. The related problem of combining regression models has been targeted by individual studies, but no comprehensive summary is available.
In this paper, we present several ensemble techniques for regression problems. We also study the effect of ensembling for base models with increasing plasticity.

1.1 Ensembles and inductive modelling

In our study, we use polynomial base models whose plasticity is regulated by the maximal degree of the polynomial. Coefficients are estimated by the standard least mean squares method. Ensembling is not new to the field of inductive modelling. The combination of weak learners (polynomial models) was used in the MIA GMDH algorithm [9], which is closely related to the modern Cascade Generalization algorithm [8, 5]. Modern ensembles promote the diversity of base models. Bagging trains each base model on data selected at random, with replacement, from the original training data; this allows base models to specialize on the particular data samples included in their training sets. Another approach that actively promotes diversity in the ensemble is the negative correlation algorithm [13].
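As an illustration of such a base learner, the following is a minimal sketch (our own, using NumPy; the function names are not from the paper) of a polynomial model whose plasticity is controlled by the maximal degree, with coefficients estimated by least squares:

```python
import numpy as np

def fit_polynomial(x, y, degree):
    """Fit a 1-D polynomial of the given maximal degree by least squares."""
    # Vandermonde design matrix: columns x^0, x^1, ..., x^degree.
    X = np.vander(x, degree + 1, increasing=True)
    # Ordinary least-squares estimate of the coefficients.
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs

def predict_polynomial(coeffs, x):
    X = np.vander(x, len(coeffs), increasing=True)
    return X @ coeffs

# Example: a higher degree means higher plasticity (and a higher risk
# of overfitting noisy data).
x = np.linspace(0, 1, 50)
y = np.sin(2 * np.pi * x)
coeffs = fit_polynomial(x, y, degree=9)
```

Raising the degree lets the model track the target more closely, which is exactly the plasticity knob the experiments below vary.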

In the original MIA GMDH algorithm, the diversity of base models was promoted by providing different input features to each unit (base model). The same approach is used in the Random Forest algorithm. In the next section, the individual ensemble methods are briefly described.

2 Ensemble methods

Most of the ensemble methods we have implemented in our FAKE GAME open source software project [2] are well known for classifiers; we had to adjust them for regression purposes. Often, the difference is minor.

In the Bagging algorithm [7, 6, 5], several training subsets are created from the original training set and a model is built for each subset. The response of the ensemble to an unknown instance is the average of the base-model responses.

In the Boosting algorithm [14, 5], there is a single model in the beginning. In the regions where it fails to respond correctly, the instances are boosted: their weights are increased and a second model is created. The second model therefore focuses on the instances where the first model failed. In the regions where the second model fails, the weights of the instances are increased again, a third model is introduced, and so on. The output for an unknown instance is given by a weighted average, where the weight of each model is calculated from its performance (the better the model, the greater the weight).

Stacking [15, 5] uses several different models trained on a single training set. The responses of these models form meta-data, which serve as inputs to a single final model; this final model computes the output for unknown instances.

Cascade Generalization [8, 5] represents a sequential approach to the combination of models. A sequence of learners is created, and the inputs of learner A_i consist of the base input data together with the outputs of the previous models A_1, ..., A_{i-1}. Each model in the sequence is therefore partly a base-learner and partly a meta-learner. Unknown instances pass through the sequence, and the output of the last model becomes the output of the whole ensemble.
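To make the regression variant of Bagging concrete, here is a minimal self-contained sketch (our own illustration, not the FAKE GAME implementation) that trains polynomial models on bootstrap samples and averages their responses:

```python
import numpy as np

def bagging_fit(x, y, n_models=5, degree=3, rng=None):
    """Train n_models polynomial models, each on a bootstrap sample."""
    if rng is None:
        rng = np.random.default_rng(0)
    models = []
    for _ in range(n_models):
        # Sample with replacement from the original training set.
        idx = rng.integers(0, len(x), size=len(x))
        X = np.vander(x[idx], degree + 1, increasing=True)
        coeffs, *_ = np.linalg.lstsq(X, y[idx], rcond=None)
        models.append(coeffs)
    return models

def bagging_predict(models, x):
    # The ensemble output is the average of the base-model responses.
    preds = [np.vander(x, len(c), increasing=True) @ c for c in models]
    return np.mean(preds, axis=0)
```

Because each bootstrap sample omits roughly a third of the original instances, the base models differ, which is the source of the diversity Bagging relies on.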
We have also designed and implemented two local methods: Area Specialization and Divide ensemble.

2.1 Area Specialization

Area Specialization uses the same learning procedure as Boosting by default, although any other learning method can be used. The essence of Area Specialization lies in the output computation for unknown instances: in short, it returns the output of the model that is best in the area where the unknown vector lies. First, the distance of the unknown vector to all learning vectors is calculated, and the closest N vectors are taken into the next step (N is an algorithm parameter called area, which determines how smooth the transitions between the areas of the different models will be). Then, the best model is chosen for every learning vector selected in the first phase. Next, for each selected learning vector, a difference is computed between its target value and the output of its best model for the unknown vector. These differences are inverted using a Gaussian function and summed up over all N learning vectors into the weights of the corresponding models. The model weights are finally used in a weighted-average output.

2.2 Divide ensemble

The Divide ensemble divides the learning data into clusters (for example using the k-means algorithm) and assigns one model to each cluster. The response to an unknown instance is given by the model in charge of the cluster to which the unknown instance belongs. This approach has two main advantages. First, a model trained on a smaller area of the learning data will probably perform better, because it has a higher chance to adapt to it. Second, dividing the data into smaller chunks and learning a greater number of smaller models may speed up learning (if the learning algorithm has more than linear complexity in the number of learning vectors). To reduce a model's unexpected behaviour near the edge of its cluster, where there are usually few or no learning vectors, we use a clustering modification that enlarges all clusters by a certain amount. Function 2 describes the process. Its inputs are the coordinates of the cluster centroids and an array containing the indexes of the vectors of each cluster.
Each vector is checked by comparing its distance to the other centers with the distance to its own center. If the ratio of these distances is below a certain threshold, the vector is added to the cluster belonging to the other center (which means a vector can be in more than one cluster simultaneously). This feature improves the model error, reduces outlier values, and makes the transitions between models (clusters) smoother.

Function 1 AreaSpecialization.getOutput(unknownVector)
  /* Compute the Euclidean distances of the learning vectors to unknownVector. */
  distance[] <- computeDistanceToLearningVectors(unknownVector)
  /* Get the indexes of the learning vectors sorted by distance. */
  indexes <- sort(distance)
  for i = 0 to area do
    /* Take the i-th closest learning vector and find the model with the
       smallest error on that learning vector. */
    bestModelIndex <- getBestModelIndex(learningVectors[indexes[i]])
    /* Compute the difference between the target value of the closest learning
       vector and the output for unknownVector of the model which is best for
       that learning vector. */
    diff <- targetOutput[indexes[i]] - model[bestModelIndex].getOutput(unknownVector)
    /* Accumulate model weights according to how well the model performs on
       the learning vector. */
    modelWeights[bestModelIndex] += gaussian(diff)
  end for
  modelWeights <- normalize(modelWeights)
  ensembleOutput <- 0
  /* Weighted average. */
  for i = 0 to modelsNumber do
    ensembleOutput += model[i].getOutput(unknownVector) * modelWeights[i]
  end for

Function 2 DivideEnsemble.roughClustering(clusterCenters[], clusterIndexes[][])
  for each vector in data do
    distance[] <- computeDistanceToCenters(vector)
    /* Get the indexes of the centers sorted by distance. */
    indexes[] <- sort(distance)
    closestDistance <- distance[indexes[0]]
    for i = 1 to clusterCenters.length do
      currentDistance <- distance[indexes[i]]
      if distance[indexes[i]] > centerDistance[indexes[0]][indexes[i]] then
        /* If the distance of the vector to the other center is greater than
           the distance between the two centers, skip that vector (it is
           located in the half of its cluster that lies behind its own
           center). */
        continue
      end if
      /* clusterSizeMultiplier is an algorithm parameter which determines how
         much the clusters will be resized. */
      if currentDistance / closestDistance < clusterSizeMultiplier then
        addVectorToCluster(i, vector)
      end if
    end for
  end for
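For readers who prefer executable code, the output computation of Function 1 can be sketched in Python as follows. This is our own reconstruction under stated assumptions: models are plain callables, `gaussian` is taken to be `exp(-diff^2)`, and all names are ours, not the paper's.

```python
import numpy as np

def area_specialization_output(unknown, learn_x, learn_y, models, area=10):
    """Weighted-average ensemble output; each model is weighted by how
    well it performs on the learning vectors nearest the unknown one."""
    # Euclidean distance of every learning vector to the unknown vector.
    dist = np.linalg.norm(learn_x - unknown, axis=1)
    indexes = np.argsort(dist)
    weights = np.zeros(len(models))
    for i in indexes[:area]:
        # Model with the smallest error on this learning vector.
        errors = [abs(m(learn_x[i]) - learn_y[i]) for m in models]
        best = int(np.argmin(errors))
        # Difference between the target value and that model's output
        # for the unknown vector, inverted by a Gaussian.
        diff = learn_y[i] - models[best](unknown)
        weights[best] += np.exp(-diff ** 2)
    weights /= weights.sum()
    # Weighted average of the model outputs.
    return sum(w * m(unknown) for w, m in zip(weights, models))
```

The `area` parameter plays the role described in Section 2.1: a larger value smooths the transition between the regions where different models dominate.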

We have benchmarked these ensemble methods using polynomial base models on a synthetic data set.

3 Experimental setup

The Donoho-Johnstone benchmark contains four synthetic time series (bump, block, doppler, heavysine) with no, medium, or high noise added (see Figure 1).

Fig. 1. The synthetic time series for benchmarking nonlinear regression models, with medium noise added.

In the following experiment, we used all four series with medium noise. For each series and ensemble method (5 base models in the ensemble by default), ten-fold cross-validation was repeated three times. The resulting root mean squared (RMS) error of one ensemble, modelling a single series, is therefore averaged over 30 measurements. High-degree polynomials can overfit the data severely, making the resulting error extremely high. Therefore, in the few cases where this was observed, we capped the maximal error at three times the standard deviation of the error.

4 Results

Figure 2 shows the ensemble results on the Donoho-Johnstone benchmarks, where the errors of the individual ensembles were averaged over all series with medium noise. In the graph (Figure 2), Boosting reduces the variance (in line with theoretical expectations) of ensembles with overfitted base models (degree 12 and above). With higher polynomial degrees, irregularities can be observed near the boundaries of the clusters used in the Divide (DIV) ensemble method. Stacking reduces overfitting here and makes the error development smoother. Area Specialization (AS) stays very close to the minimum of Stacking and of a single regular polynomial model. The combination of the AS and DIV ensemble methods achieves the best results on this data set.
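The evaluation protocol above (three repetitions of ten-fold cross-validation, RMS error averaged over the 30 resulting measurements) can be sketched as follows; this is our own illustration of the protocol, not the authors' code:

```python
import numpy as np

def repeated_cv_rmse(x, y, fit, predict, folds=10, repeats=3, seed=0):
    """Average RMS error over repeats x folds train/test splits."""
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(repeats):
        # A fresh random partition of the data for each repetition.
        perm = rng.permutation(len(x))
        for chunk in np.array_split(perm, folds):
            test = np.zeros(len(x), dtype=bool)
            test[chunk] = True
            # Train on the other folds, evaluate on the held-out fold.
            model = fit(x[~test], y[~test])
            resid = predict(model, x[test]) - y[test]
            errors.append(np.sqrt(np.mean(resid ** 2)))
    # Averaged over repeats * folds (here 30) measurements.
    return np.mean(errors)
```

Any `fit`/`predict` pair can be plugged in, e.g. `fit = lambda a, b: np.polyfit(a, b, 1)` and `predict = lambda m, a: np.polyval(m, a)` for a single polynomial model.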

International Conference on Inductive Modelling ICIM 2010

Fig. 2. RMSE of the polynomial ensembles (AS[DIV[P]], AS[P], BAG[P], BST[P], CG[P], P[QN], ST[P,L], DIV[P]) on the Donoho-Johnstone benchmarks with medium noise, for polynomial degrees 1 to 30.

The behaviour of the CG method is rather strange, and we have to analyze it in more detail. For higher polynomial degrees (15+) the variance is reduced, but for less plastic models the error is much higher than that of an individual model.

Fig. 3. Error of individual ensembles for base models with increasing plasticity on the Heart dataset from the UCI machine learning repository [1].

The last graph (see Figure 3) shows the performance of the ensembles on a data set with significantly different properties; the results are also significantly different. We performed several experiments and found that for data with similar properties (e.g. the probability distribution of the data, the number of features, the input-output relationship) we obtain similar behaviour of all the ensembles.

5 Conclusion

We have implemented several modern ensembling methods and experimented with polynomial units as base models. We can conclude that the behaviour of ensembles is extremely data dependent. In the future, we will also compare the properties of inductive ensembles such as MIA GMDH or GAME.

Acknowledgements

This research is partially supported by the grant Novel Model Ensembling Algorithms (SGS10/307/OHK3/3T/181) of the Czech Technical University in Prague and by the research program Transdisciplinary Research in the Area of Biomedical Engineering II (MSM6840770012) sponsored by the Ministry of Education, Youth and Sports of the Czech Republic.

References

[1] UCI Machine Learning Repository. Available at http://www.ics.uci.edu/~mlearn/MLSummary.html, September 2006.
[2] The FAKE GAME environment for automatic knowledge extraction. Available online at http://www.sourceforge.net/projects/fakegame, November 2008.
[3] J. Bennett and S. Lanning. The Netflix Prize. In KDD Cup and Workshop in conjunction with KDD, 2007.
[4] P. Brazdil, C. Giraud-Carrier, C. Soares, and R. Vilalta. Metalearning: Applications to Data Mining. Cognitive Technologies. Springer, January 2009.
[5] P. Brazdil, C. Giraud-Carrier, C. Soares, and R. Vilalta. Metalearning: Applications to Data Mining. Cognitive Technologies. Springer Berlin Heidelberg, 2009.
[6] L. Breiman. Bagging predictors. Mach. Learn., 24(2):123-140, 1996.
[7] Y. Freund and R. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In Proceedings of the Second European Conference on Computational Learning Theory, pages 23-37. Springer-Verlag, 1995.
[8] J. Gama and P. Brazdil. Cascade generalization. Mach. Learn., 41(3):315-343, 2000.
[9] A. G. Ivakhnenko. Polynomial theory of complex systems. IEEE Transactions on Systems, Man, and Cybernetics, SMC-1(1):364-378, 1971.
[10] R. A. Jacobs. Bias/variance analyses of mixtures-of-experts architectures. Neural Comput., 9(2):369-383, 1997.
[11] Y. Koren. Collaborative filtering with temporal dynamics. In KDD '09: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 447-456, New York, NY, USA, 2009. ACM.
[12] L. Kuncheva. Combining Pattern Classifiers: Methods and Algorithms. John Wiley and Sons, New York, 2004.
[13] Y. Liu and X. Yao. Ensemble learning via negative correlation. Neural Networks, 12:1399-1404, 1999.
[14] R. E. Schapire. The strength of weak learnability. Mach. Learn., 5(2):197-227, 1990.
[15] D. H. Wolpert. Stacked generalization. Neural Networks, 5:241-259, 1992.