Metalearning for Dynamic Integration in Ensemble Methods
Fábio Pinto
12 July 2013
Faculdade de Engenharia da Universidade do Porto
Ph.D. in Informatics Engineering
Supervisor: Doutor Carlos Soares
Co-supervisor: Doutor João Mendes Moreira
Outline

Abstract
1 Introduction
  1.1 Problem Statement
  1.2 Research Approach and Expected Contributions
  1.3 Proposal Organization
2 Ensemble Learning
  2.1 Error, Accuracy and Diversity
  2.2 Bagging, Boosting and other Popular Ensemble Algorithms
  2.3 Ensemble Generation
  2.4 Ensemble Pruning
  2.5 Ensemble Integration and the Dynamic Approach
3 Metalearning
  3.1 Metadata
  3.2 Metalearning for Data Mining Applications
  3.3 Metalearning for Ensemble Methods
4 Research Plan
  Task 1. Foundations of Ensemble Learning and Metalearning
  Task 2. Metafeatures for homogeneous ensembles
  Task 3. Metalearning for dynamic selection of homogeneous ensembles
  Task 4. Metalearning for dynamic integration of homogeneous ensembles
  Task 5. Metalearning for dynamic selection and integration of heterogeneous ensembles
  Task 6. Dissertation
5 Final Remarks
Abstract

Ensemble methods have been receiving an increasing amount of attention, especially because of their successful application to high visibility problems (e.g., the Netflix prize). An important challenge in ensemble learning (EL) is the management of the set of models to ensure a high level of accuracy, particularly with a large number of models and in highly dynamic environments [49]. One approach to deal with these problems in the context of EL is the dynamic approach, which consists in the selection and combination of the best subset of model(s) for each test instance. An alternative approach to find the models that are most suitable for a given set of data is metalearning (MtL). MtL uses data from past experiments to build models that relate the characteristics of learning problems with the behaviour of algorithms [5]. Thus, the general goal of this project is to investigate the use of MtL for dynamic integration approaches to EL.
1 Introduction

The world is deluged by data. The dissemination of the Internet around the globe, together with the development of ubiquitous information-sensing mobile devices, wireless sensor networks and information storage capacity, has enhanced the need to understand and extract value from the data that is being generated. Data Science, a newly coined term that brings together Statistics, Machine Learning (more particularly, Data Mining) and Computer Science, emerges as the field that can assist humans in this task [53]. Data Mining is the automated or convenient extraction of patterns representing knowledge implicitly stored or captured in large databases, data warehouses, the Web, other massive information repositories, or data streams [30]. A Data Mining project usually includes one or more of the following tasks [21]:

- Regression
- Classification
- Anomaly detection
- Association rule learning
- Clustering
- Summarization

In this project, we will focus on the first two, namely, regression and classification. In a typical regression problem (here we follow [49] very closely), we have a dataset that consists of a set of $n$ instances: $\{(x_1, f(x_1)), \ldots, (x_n, f(x_n))\}$. The objective is to induce a function $\hat{f}$ from the data, $\hat{f} : X \rightarrow \mathbb{R}$, such that

$\hat{f}(x) = f(x), \forall x \in X$, (1)

where $f$ represents the unknown true function. The algorithm used to obtain $\hat{f}$ is called the induction algorithm or learner. The function $\hat{f}$ is called the model or predictor. The usual goal for regression is to minimize a squared error loss function, namely the mean squared error (MSE),

$MSE = \frac{1}{n} \sum_{i=1}^{n} \left( \hat{f}(x_i) - f(x_i) \right)^2$ (2)
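As a concrete illustration, the MSE of Equation (2) can be computed in a few lines of Python. This is a sketch with toy data, not part of the proposal itself:

```python
def mse(predictions, targets):
    """Mean squared error: the average of the squared differences
    between the model outputs f_hat(x_i) and the true values f(x_i)."""
    n = len(predictions)
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / n

# toy example: three instances where the model is slightly off
print(mse([1.0, 2.0, 3.0], [1.0, 2.5, 2.0]))  # (0 + 0.25 + 1.0) / 3 = 0.4166...
```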
For classification, the concept is very similar. The goal is also to induce a function $\hat{f}$ from a set of training examples. However, in classification the output of $\hat{f}(x)$ is a categorical variable instead of a numeric one. This has several implications that differentiate classification from regression, one of them being, naturally, the error loss function to minimize. While in regression the majority of the evaluation measures minimize a squared error loss function, in classification the evaluation measures necessarily need to be different: accuracy, precision, recall, F-score, AUC, to name a few. See [30] for further details.

Ensemble Learning (EL) is a process that uses a set of models (regression or classification), each of them obtained by applying a learning process to a given problem. This set of models is integrated in some way to obtain the final prediction [49]. EL has become increasingly popular both for regression and classification tasks. Besides the extensive research that reports great results with ensemble algorithms in a wide variety of problems [85], data mining competitions with great media coverage (e.g., the Netflix prize, the Heritage Health prize) proved that ensembles constitute the best technique for predictive modeling when the main goal lies in accuracy.

1.1 Problem Statement

The great disadvantage in applying ensemble methods is their black box nature [82]. When combining several models for a prediction task, it is very difficult to understand how the ensemble works and to extract knowledge from the system. For instance, a decision tree is an algorithm that besides accuracy also provides inner knowledge, obtained by inspecting the structure of the tree. This knowledge can be very useful to understand the domain of the prediction task and even to improve the final model. Along with the comprehensibility issue, using an ensemble to obtain predictions is a very blind process.
If we have an ensemble that our evaluation methodology says to be accurate, we are going to apply that ensemble to any instance that we want to predict, regardless of its characteristics. We believe that a more dynamic approach can enhance predictive accuracy and, at the same time, provide interesting and useful knowledge. The dynamic approach can be divided into two different steps: the selection of a subset of models within an ensemble, and their integration (in other words, how to combine their predictions) to make a final prediction [48]. Both processes are carried out at prediction time and according to the characteristics of the instance that one is trying to predict. This problem has been addressed in the literature by a few researchers [64][76][50][48]. However, their approaches are very sparse and the challenge remains on how to dynamically apply an ensemble
approach for predictive tasks that obtains competitive accuracy and provides insight concerning its behavior.

1.2 Research Approach and Expected Contributions

We propose to combine Metalearning techniques with Ensemble Learning in order to test the hypothesis that a better dynamic selection and integration of ensembles can be achieved. Metalearning is the study of principled methods that exploit metaknowledge to obtain efficient models and solutions by adapting machine learning and data mining processes [5]. Our approach can also give important and useful insights about the relation between data and the performance of the ensembles, for a better understanding of their behavior. Further details on the research plan are given in Section 4.

Expected Contributions:

- Metafeatures: algorithm-dependent (for homogeneous ensembles) and algorithm-independent metafeatures (for heterogeneous ensembles); domain-independent and domain-dependent aggregated metafeatures; combination of aggregated metafeatures with metafeatures describing single instances.
- Application of Metalearning to Ensemble Methods: use a metamodel that relates data characteristics, ensemble characteristics, individual model characteristics and ensemble performance in order to improve prediction accuracy and gain domain insights.
- Dynamic Selection of Ensemble Models: use Metalearning to dynamically select the best subset of models within an ensemble for a given instance.
- Dynamic Integration of Ensemble Models: use Metalearning to dynamically combine the best subset of models within an ensemble for a given instance.

1.3 Proposal Organization

This thesis proposal is organized as follows. Section 2 presents the state of the art for Ensemble Learning, giving particular attention to contributions related to the dynamic selection and integration of ensemble models. A selected overview of Metalearning is provided in Section 3. The research plan is specified in Section 4.
Finally, the proposal ends with some final remarks in Section 5.
2 Ensemble Learning

Ensemble Learning (EL) is a process that uses a set of models, each of them obtained by applying a learning process to a given problem. This set of models is integrated in some way to obtain the final prediction [49]. The two pioneering works that laid the roots for EL research presented different perspectives on the same topic: Hansen and Salamon [31] published an empirical work in which it was found that predictions made by an ensemble of classifiers (neural networks) are often more accurate than those of the best single classifier. On the other hand, Schapire [68] showed theoretically that weak learners can be combined to form a strong learner.

EL is divided into three sub-processes: ensemble generation, ensemble pruning and ensemble integration [49]. In this state of the art, we will focus particularly on ensemble integration, given that our research plan will hopefully provide contributions for that sub-process. However, an important overview is also presented for ensemble generation, and for the related topics of error decomposition and diversity. Finally, a very brief description of the ensemble pruning state of the art is also provided.

2.1 Error, Accuracy and Diversity

Good ensembles must present specific characteristics: accurate predictors that make errors in different parts of the input space. The generalization error decomposition for regression ensembles was a very important step in understanding the behavior of such systems, and its contributions helped to guide the research in ensemble generation. For classification, there is no such unifying theory. However, there is work in progress in the field [16]. It is well accepted in the Machine Learning community that generating diverse individual classifiers is good practice to achieve accurate ensembles. Again, diversity measures for regression are necessarily different from the classification ones.
Although there are proven connections between diversity measures and accuracy, there is also evidence that raises doubts about the usefulness of such metrics in building ensembles [41]. This is a research field that still lacks a completely grounded framework.

Regression

The literature regarding the error decomposition of regression ensembles has two main schemes: the Error-Ambiguity and the Bias-Variance-Covariance decomposition.
The Error-Ambiguity decomposition was proposed by Krogh and Vedelsby [40] for an ensemble of $k$ neural networks. Assuming that $\hat{f}(x) = \sum_{i=1}^{k} \alpha_i \hat{f}_i(x)$, where $\sum_{i=1}^{k} \alpha_i = 1$ and $\alpha_i \geq 0$, $i = 1, \ldots, k$, they show that the error for one example is

$(\hat{f} - f)^2 = \sum_{i=1}^{k} \alpha_i (\hat{f}_i(x) - f)^2 - \sum_{i=1}^{k} \alpha_i (\hat{f}_i(x) - \hat{f})^2$ (3)

The first term of the equation is the bias component (the error of the individual learners, given their generalization ability) and the second one is the ambiguity component (it measures the variability among the predictions of the individual learners, depending on ensemble diversity). The expression shows clearly that the ensemble generalization error is less than or equal to the generalization error of a randomly selected single predictor. This is true because the ambiguity component is always non-negative. This decomposition also shows that it is possible to reduce the ensemble generalization error by increasing the ambiguity without increasing the bias.

Ueda and Nakano [79] proposed the Bias-Variance-Covariance decomposition. Here, it is assumed that $\hat{f}(x) = \frac{1}{k} \sum_{i=1}^{k} \hat{f}_i(x)$; then

$E[(\hat{f} - f)^2] = \overline{bias}^2 + \frac{1}{k}\,\overline{var} + \left(1 - \frac{1}{k}\right) \overline{covar}$ (4)

The expression shows that the error of the ensemble depends strongly on the covariance component, which captures the correlation between the individual learners. If the learners make similar errors, this component will be large. Therefore, this expression shows that diversity is very important for accurate ensembles. Later, a study of the relation between the Error-Ambiguity and Bias-Variance-Covariance decompositions showed that it is not possible to maximize the ambiguity component and minimize the bias component simultaneously [17]. Thus, generating diverse learners is a complex challenge. The Bias-Variance-Covariance decomposition provides a powerful measure of regression ensemble diversity: the covariance term [17].
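The Error-Ambiguity identity of Equation (3) can be checked numerically for a single example. The sketch below uses hypothetical toy predictions and uniform weights:

```python
# Numerical check of the Error-Ambiguity decomposition for one example.
preds = [2.0, 3.0, 4.5]        # individual predictions f_i(x) (toy values)
alphas = [1 / 3, 1 / 3, 1 / 3] # convex combination weights
f_true = 3.2                   # true target f(x)

f_ens = sum(a * p for a, p in zip(alphas, preds))   # ensemble prediction
ensemble_error = (f_ens - f_true) ** 2
weighted_error = sum(a * (p - f_true) ** 2 for a, p in zip(alphas, preds))
ambiguity = sum(a * (p - f_ens) ** 2 for a, p in zip(alphas, preds))

# the identity: ensemble error = weighted individual error - ambiguity
assert abs(ensemble_error - (weighted_error - ambiguity)) < 1e-9
# the ambiguity term is never negative
assert ambiguity >= 0.0
```

Since the ambiguity term is subtracted and is non-negative, the ensemble error on this example cannot exceed the weighted average of the individual errors, exactly as the decomposition states.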
This component was already integrated into one successful ensemble generation algorithm, Negative Correlation Learning [45]. By adding a penalty term associated with the covariance component to the mean squared error function of the ensemble, the algorithm, together with an evolutionary framework, automatically searches for learners that are not correlated with those already present in the ensemble. Moreover, one should acknowledge that the presented error decomposition schemes assume that the integration function of the learners is averaging. In the case of the non-constant weighting functions
presented in Section 2.5, these theories do not hold. Another important topic for ensemble regression is the multicollinearity problem. This statistical phenomenon, in the context of EL, refers to the situation in which the predictions of two or more individual learners of an ensemble are highly correlated. Given the exposition provided previously, it is straightforward to conclude that this can be problematic. However, if the already mentioned principles of diversity are guaranteed, then it is possible, if not to avoid the problem completely, at least to mitigate it in the ensemble generation or ensemble pruning phase [49].

Classification

Error decomposition in regression ensembles is a very well solved problem. In classification, more research is needed. One can find work in progress trying to adapt the concepts present in regression to classification problems by choosing to approximate the class posterior probabilities [78][24]. However, for some learning algorithms, like decision trees, it is not possible to extract those probabilities: the outputs have no intrinsic ordinality. The work in this topic is divided into two directions: ordinal outputs (in which the outputs of the classifiers are taken as probabilities, as mentioned before) and non-ordinal outputs. We follow Brown very closely [16].

For ordinal outputs, the theoretical framework for analysing a classifier's error when its predictions are posterior probabilities was proposed by Tumer and Ghosh [78]. Figure 1 shows their framework. For a one-dimensional feature vector x, the solid curves show the true posterior probabilities of classes a and b, $P(a)$ and $P(b)$, respectively. The dotted curves show estimates of the posterior probabilities, from one of the predictors, $\hat{P}(a)$ and $\hat{P}(b)$. The solid vertical line at x indicates the optimal decision boundary, and the dark shaded area is named the Bayes error. This error cannot be reduced. The dotted vertical line at $\hat{x}$ indicates the boundary placed by our predictor.
The light shaded area indicates the added error that our predictor makes in addition to the Bayes error. Tumer and Ghosh show that the expected added error, if the decision boundary is instead placed by an ensemble, is

$E_{add}^{ens} = E_{add} \left( \frac{1 + \delta (M - 1)}{M} \right)$ (5)

where $M$ is the number of classifiers and $E_{add}$ is the expected added error of the individual classifiers (they are assumed to have the same error). The $\delta$ is a correlation coefficient measuring the correlation between errors in approximating the posterior probabilities and is, thus, a diversity measure. However,
to achieve this expression, the authors make some critical assumptions. For example, they assume that the errors of the different classifiers have the same variance. Later, this work was extended by Roli and Fumera [24], where some of the assumptions were discarded, one of them being the uniformly weighted combination of the posterior probabilities.

Figure 1: Tumer and Ghosh's framework [78][16] for analysing classifier error.

For non-ordinal outputs, the state of the art still does not provide a unifying satisfactory theory. Ideally, we should have an expression that, similarly to the error decomposition in regression, decomposes the classification error rate into the error rates of the individual learners and a term that quantifies their diversity. The lack of an error decomposition for classification in a context of non-ordinal outputs has led to several diversity measures being proposed in the literature: Disagreement, Q-statistic, Kappa-statistic, Kohavi-Wolpert variance, to name a few (we refer the reader to [85] for further details). However, their usefulness has been highly questioned. Kuncheva and Whitaker [41] showed, through a broad range of experiments, that the existing diversity measures do not provide a clear relation between those diversity measurements and the ensemble accuracy. Tang et al. [73] gave evidence that, compared to algorithms that seek diversity implicitly, exploiting diversity measures explicitly is ineffective for constructing strong ensembles. They also showed that diversity measures do not provide reliable information on whether the ensembles achieve good generalization performance and, at the same time, are highly correlated with average individual accuracies, which is not desirable. More recently, two new research directions emerged for understanding ensemble diversity in a classification context: Brown and Kuncheva's [15] good and bad diversity, and information theoretic diversity.
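As an illustration of the pairwise measures listed above, the disagreement measure (the fraction of instances on which two classifiers predict differently) can be sketched as follows; the predictions are a hypothetical toy example:

```python
def disagreement(preds_a, preds_b):
    """Pairwise disagreement: fraction of instances where two classifiers differ."""
    return sum(a != b for a, b in zip(preds_a, preds_b)) / len(preds_a)

# toy predictions of two classifiers on five instances
clf_1 = [0, 1, 1, 0, 1]
clf_2 = [0, 1, 0, 0, 0]
print(disagreement(clf_1, clf_2))  # differs on 2 of 5 instances: 0.4
```

As Kuncheva and Whitaker [41] argue, however, a high value of such a measure does not by itself guarantee a more accurate ensemble.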
Brown and Kuncheva adopt the perspective that a diversity measure should be naturally defined
as a consequence of two decisions in the design of the ensemble learning problem: the choice of error function and the choice of integration function; more particularly, a zero-one loss function and a majority voting integration scheme. The authors derive a decomposition of the majority vote error into three terms: average individual accuracy, good diversity and bad diversity. The good diversity measures the disagreement on datapoints where the ensemble is correct. The bad diversity measures the disagreement on datapoints where the ensemble is incorrect.

Based on interaction information (a multivariate generalization of mutual information; see [85] for further details), Brown [14] presented a decomposition of the conditional interaction information between a set of predictors and a target variable. His mathematical formulation proposes to decompose classifier ensemble diversity into three components: relevancy (the sum of the mutual information between each classifier and the target), redundancy (which measures the dependency, independent of the target variable, among all possible subsets of classifiers) and conditional redundancy (which measures the dependency among the classifiers given the class label). The main problem of this decomposition is that there is no effective process for estimating the diversity terms. Zhou and Li [86] provided a mathematical simplification of Brown's contribution and a complex estimation method for the diversity terms, with promising results.

2.2 Bagging, Boosting and other Popular Ensemble Algorithms

Research in EL has generated some algorithms that, due to their simplicity and effectiveness, have been widely adopted by the Machine Learning community and even in industry. This section gives a brief overview of the most popular ones. Bagging stands for bootstrap aggregating and is due to Breiman [8]. This technique plays a central role in Random Forests [12], one of the most popular ensemble learning algorithms.
Algorithm 1 shows the pseudocode for the bagging algorithm. Generically, given a data set containing $n$ training instances, a sample $D_{bs}$ of $n$ training instances is drawn with replacement. The process is repeated $T$ times, so that $T$ samples of $n$ training instances are obtained. Then, from each sample, a model $\hat{f}$ is generated by applying a base learning algorithm $A$. In terms of aggregating the outputs of the base learners and building the ensemble $E$, bagging adopts two of the most common schemes: voting (the most voted label is the final prediction) for classification, and averaging (the predictions of all the base learners are averaged to form the ensemble prediction) for regression [85]. The precision of the base learners can be estimated using the out-of-bag examples (the ones
that were not selected for training) in each iteration, allowing one to compute the error of the bagged ensemble.

Algorithm 1: Bagging pseudocode. Source: [85].
input: data set $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$;
       base learning algorithm $A$;
       number of base learners $T$
for $t = 1$ to $T$ do
    $D_{bs}$ = bootstrap sample of $D$
    $\hat{f}_t = A(D_{bs})$    % train learner on the bootstrap sample
end
output: $E(\hat{f}_1, \ldots, \hat{f}_T)$

Schapire [68] published a seminal paper in which he theoretically proved that any weak learner can potentially be boosted to a strong learner. This thesis originated the family of boosting algorithms. Synthetically, a boosting algorithm works by sequentially training learners and combining their outputs for a final prediction. However, each learner is forced to focus more on the instances poorly predicted by the previously generated learners (if any), by assigning a weight to each instance based on the error $e_r$. This weight then influences the instances selected for $D_{r+1}$. Algorithm 2 presents the pseudocode for a boosting algorithm.

Algorithm 2: Boosting pseudocode. Source: [85].
input: sample distribution $d$;
       base learning algorithm $A$;
       number of learning rounds $R$
$D_1 = d$
for $r = 1$ to $R$ do
    $\hat{f}_r = A(D_r)$    % train weak learner
    $e_r$ = EvaluateError($\hat{f}_r$)
    $D_{r+1}$ = AdjustDistribution($D_r$, $e_r$)
end
output: $E$ = CombineOutputs($\hat{f}_1, \ldots, \hat{f}_R$)

By analysing the pseudocode presented in Algorithm 2, one can see that there are two main design choices: how to adjust the distribution and how to combine the outputs. There are several boosting algorithms in the literature with different procedures for these tasks [85], AdaBoost [23] being the most influential one.

Bagging and boosting exploit variation in the data in order to achieve greater diversity (and accuracy) in the final predictions. However, other ensemble methods exploit differences among learners. The concept of stacked generalization is due to Wolpert [82]. Stacking starts by generating a set of models ($\hat{f}_1, \ldots, \hat{f}_t$) from a set of learning algorithms ($A_1, \ldots, A_t$) and a dataset $D$. Then, a meta-dataset is generated by replacing each base-level instance by the predictions of the models (this can lead to overfitting; to avoid the problem, it is often recommended to exclude the base-level instances from the meta-dataset and train the stacked model on new data [85]). This new dataset is then presented to a learning algorithm that relates the predictions of the base-level models with the target output. A prediction from a stacking model is obtained by making the base-level models predict an output, building a meta-instance and feeding it to the meta-learner, which provides the final prediction. The stacking framework was later improved with important contributions at the level of meta-feature extraction and the selection of the meta-level algorithm [20].

Cascade generalization originated from the work of Gama and Brazdil [27]. Here, the models are used in sequence rather than in parallel as in stacking: the output of the first generated model feeds the second model; the outputs of the first and second models feed the third model, and so on. Meta-decision Trees are due to Todorovski and Dzeroski [74]. In this method, a decision tree is generated where each node corresponds to a model. This metamodel is induced using as attributes the class distribution properties extracted from the base-level examples, allowing one meta-example per base-level example. One of the advantages of Meta-decision Trees is that they provide some insight about base-level learning and the area of expertise of each model. This work is further detailed in Section 3.3. Other ensemble methods present in the literature, although with less impact in the research community, are cascading [2], delegating [22] and arbitrating [54].

2.3 Ensemble Generation

The first phase of developing an ensemble is model generation. If the models are generated using the same induction algorithm, the ensemble is called homogeneous; otherwise, in case the models are generated using different induction algorithms, the ensemble is heterogeneous. Higher diversity is expected when developing heterogeneous ensembles, thus assuring a more accurate ensemble [28].
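Algorithm 1 can be sketched as a minimal Python implementation of bagging with averaging integration. The base learner here is deliberately trivial (it predicts the mean target of its bootstrap sample), and all names are illustrative, not from the proposal:

```python
import random

def bagging(train, learn, T, seed=0):
    """Train T models on bootstrap samples of `train` and integrate by averaging.
    `learn` maps a data set [(x, y), ...] to a prediction function."""
    rng = random.Random(seed)
    n = len(train)
    models = [learn([train[rng.randrange(n)] for _ in range(n)]) for _ in range(T)]
    return lambda x: sum(m(x) for m in models) / T  # averaging for regression

def mean_learner(sample):
    """Toy base learner: always predicts the mean target of its sample."""
    m = sum(y for _, y in sample) / len(sample)
    return lambda x: m

data = [(i, float(i)) for i in range(10)]   # targets 0.0 ... 9.0
ensemble = bagging(data, mean_learner, T=25)
prediction = ensemble(0)                    # close to the overall mean, 4.5
```

A real implementation would use an unstable base learner (e.g., a decision tree), since the method relies on the sensitivity of the learner to changes in the training sample.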
However, obtaining that diversity with different induction algorithms can be more difficult than with just one. Diversity is achieved either by manipulating the data or by manipulating the model generation process.

Data manipulation

Data manipulation for ensemble generation can be divided into three different sub-groups: subsampling from the training set, input features manipulation and output targets manipulation. The first consists in using different subsamples from the training set to generate different models. This
method takes advantage of the instability of some learning algorithms [9]. Given some randomness in the inductive process of a learning algorithm and its sensitivity to changes in the training set, one can manipulate the generation of models to obtain a diverse ensemble. Two well known ensemble learning techniques that use this method are boosting [68] and bagging [8]. Several methods were developed for input features manipulation, the simplest one being random feature selection. More complex techniques are noise injection [47], which consists in adding Gaussian noise to the inputs; iterative search methods for feature selection [81]; and rotation forests [63], a method that combines selection and transformation of features using Principal Component Analysis (PCA). Output targets manipulation is a research field with very few publications. Leo Breiman authored the most important contributions: output noise injection [11], which essentially consists in adding Gaussian noise to the output variable of the training set; and iterated bagging [13]. The latter technique consists of initially generating a model and computing its residuals; a second model is then generated with the output target being the residuals of the first model; this iterative process is repeated several times to develop the ensemble.

Model generation

Achieving diversity by manipulating model generation can be done through three techniques: different parameter sets, induction algorithm manipulation or final model manipulation. The vast majority of learning algorithms are sensitive to parameter changes. The number of parameters is highly dependent on the selected algorithm. In order to achieve a diverse set of models, one must focus on the most sensitive parameters of the algorithm. Works on neural network [59] and k-nearest neighbors [84] ensemble generation show the effectiveness of this technique.
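The parameter-set technique just described can be sketched with a toy k-nearest-neighbors regressor, where varying the sensitive parameter k yields a homogeneous but diverse ensemble (pure Python, illustrative names only):

```python
def knn_regressor(train, k):
    """Toy 1-D k-nearest-neighbors regressor: averages the targets
    of the k training points closest to the query."""
    def predict(x):
        neighbors = sorted(train, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in neighbors) / k
    return predict

train = [(0, 0.0), (1, 1.0), (2, 4.0), (3, 9.0), (4, 16.0)]   # y = x^2
ensemble = [knn_regressor(train, k) for k in (1, 3, 5)]       # same algorithm, varied parameter
prediction = sum(m(2.5) for m in ensemble) / len(ensemble)    # averaged ensemble output
```

The three members place their errors in different parts of the input space, which is precisely the diversity that parameter manipulation is meant to induce.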
Approaches for ensemble generation by manipulation of the induction algorithm fall into two main categories: sequential and parallel. In sequential approaches [67][33], the generated models are only influenced by the previous ones. The main feature of these techniques is the use of a decorrelation penalty term in the error function of the ensemble to increase diversity. Making use of the decomposition of the generalization error of an ensemble, the training of each network tries to minimize a function that has a covariance component, thus decreasing the generalization error. In parallel approaches, the generation of the models includes an exchange of information and is usually guided by an evolutionary framework [45]. Two distinct parallel techniques are the infinite ensemble of Support Vector Machine models [44] (the main concept is to create a kernel that gathers all the possible models in the hypothesis space) and Random Forests, which combines the bagging method with
random feature selection on the generated trees. Model manipulation is a less studied topic. This group of techniques focuses on modifying a model in some way so that its performance is boosted (e.g., given a set of rules produced by one single learning process, one can repeatedly sample the set of rules and build $n$ models [34]).

2.4 Ensemble Pruning

Ensemble pruning consists of eliminating models from the ensemble, with the aim of improving its predictive ability or reducing computational costs. Research on ensemble pruning is divided into five categories: exponential search, randomized search, sequential search, ranking pruning and clustering pruning. Here, we follow [49] very closely.

Exponential search pruning refers to the group of algorithms that try to find the optimal set of $k$ models from a pool of $K$ models to integrate an ensemble. The search space of this problem is very large and the problem is NP-complete. For small values of $k$, this can be a good approach. However, in most cases, this approach gives poor results in comparison with other pruning algorithms and has a very high computational cost. Randomized search pruning algorithms integrate an evolutionary framework in their process to search for a solution that is better than a random one. The GASEN (Genetic Algorithm based Selective ENsemble) [87] algorithm presented very promising results in a classification context.

The search algorithm of a sequential pruning process can be called forward (if the search begins with an empty ensemble and adds models to the ensemble in each iteration), backward (if the search begins with all the models in the ensemble and eliminates models from the ensemble in each iteration) or forward-backward (if the selection can have both forward and backward steps). Comparative studies show that CwE (Constructive with Exploration) [19] presents very robust results.
In this algorithm, each time a new candidate model is to be added to the ensemble, all candidates are tested and the one that leads to the maximal improvement of the ensemble performance is selected. When no model in the pool improves the ensemble performance, the selection stops. Ranking pruning algorithms sort the models according to a certain criterion and generate an ensemble containing the top $k$ models in the ranking. Most of the algorithms in this category are rather simple and they do not seem to be competitive with state-of-the-art pruning techniques. Clustering algorithms for ensemble pruning rely on grouping the models into several clusters and choosing representative models (one or more) from each cluster. A good example of this type of algorithm is ARIA (Adaptive Radius Immune Algorithm) [19]. Here, just the most accurate model
from each cluster is selected.

2.5 Ensemble Integration and the Dynamic Approach

Ensemble integration focuses on how to combine the outputs of the models previously generated for an ensemble in order to obtain one final prediction. Techniques for ensemble integration are divided into two main categories: constant weighting functions and non-constant weighting functions [49]. In the former, the weights assigned to each model in the ensemble are constant values; in the latter, the weights vary according to the instance to be predicted. We will pay particular attention to the non-constant weighting functions and, more particularly, the dynamic approach.

Constant weighting functions

Naturally, techniques for ensemble integration differ substantially between regression and classification. In the case of classification, the most frequent techniques are majority voting (for binary problems, the final prediction is the label that received more than half of the votes; otherwise, the output is the rejection option, usually the most frequent class), plurality voting (the final prediction is the label with the largest number of votes), weighted voting (a weight is assigned to each learner according to its past performance) and soft voting (here, the output of the classifiers is a probability instead of a label, so the techniques used in regression can be applied). The most frequent techniques for regression are averaging (given a set of base learners, the final prediction is the average of the predictions made by the learners) and weighted averaging (given a set of base learners, the final prediction is obtained by averaging the outputs of the different learners with different weights, implying different importance). Usually the weights are estimated from the past performance of the base learners on some validation data. The great drawback of the most simplistic techniques for ensemble integration is the multicollinearity problem [56].
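The simplest constant weighting functions are easy to state in code; below is a sketch of plurality voting and weighted averaging with toy inputs (illustrative names):

```python
from collections import Counter

def plurality_vote(labels):
    """Classification: the label with the largest number of votes is the prediction."""
    return Counter(labels).most_common(1)[0][0]

def weighted_average(preds, weights):
    """Regression: combine the learners' outputs using fixed importance weights."""
    return sum(w * p for w, p in zip(weights, preds)) / sum(weights)

print(plurality_vote(["a", "b", "a", "c", "a"]))                     # a
print(round(weighted_average([1.0, 2.0, 4.0], [0.5, 0.3, 0.2]), 2))  # 1.9
```

Note that the weights are constant per model, regardless of the instance being predicted; the non-constant weighting functions drop exactly this restriction.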
However, several techniques have been proposed that circumvent this problem. Caruana et al. [18] combined the ensemble integration phase with the ensemble generation one by implicitly calculating the weights as the number of times that each model is selected over the total number of models in the ensemble. Breiman [10] presented a regression version of the original stacking framework. To avoid the multicollinearity problem, he used ridge regression as the stacked model under the constraint that the coefficients of the regression (in other words, the weights for each model in the ensemble) need to
be non-negative. Although the results were not great, an important contribution of Breiman is the empirical observation that most of the weights are equal to zero, which reinforces the need for ensemble pruning. Merz and Pazzani [52] presented a technique that uses principal component analysis for ensemble integration, named PCR*. After the principal components (PC) are obtained, the method orders the PCs as a function of the variance they explain, making the selection of the PCs much easier. In an important study covering several constant weighting function techniques, PCR* showed very consistent results [51].

Non-constant weighting functions

This category of weighting functions can be divided into static (defined at learning time) and dynamic (defined at prediction time). The most important contribution for static non-constant weighting functions was previously mentioned in Section 2.2 (and further detailed in Section 3.3): Meta-Decision Trees. However, one must acknowledge that this work was developed for classification; a regression version would require a different choice of meta-attributes. The dynamic approach for non-constant weighting functions has been receiving an increasing amount of attention in the research community [49]. The motivation for this technique is that different models in the ensemble may perform differently on different regions of the input space. One must distinguish the concepts of dynamic selection (DS) and dynamic weighting (DW): while the former concerns the selection of the models in an ensemble that are going to make a prediction, the latter focuses on how to combine the predictions of those models. Figure 2 shows a scheme for dynamic selection of models. The technique suggests that, given an input X, similar data is selected from a validation set. This process is usually guided by some distance metric, such as the Euclidean distance with the k-nearest neighbors algorithm.
Then, one or more models are selected from the ensemble given their past performance on the similar data. After model selection, the predictions can be combined in some way to make the final prediction. The first paper concerning DS of classifiers is due to Ho et al. [32]. In this work, the authors proposed a selection based on a partition of the training examples. The individual classifiers are evaluated on each partition to find the best one for each. Then, the test instance to be predicted is categorized into a partition and classified by the corresponding best classifier. A full dynamic approach was introduced by Merz [50] in a paper in which DS of classifiers was combined with DW of the predictions. Results showed that a simple majority combination was superior to their dynamic approach. Woods [83] used a very similar approach but the results (with
Figure 2: Dynamic selection. Source: [48].

4 different datasets) were better. Tsymbal [75][77] combined dynamic integration with classifier ensembles using bagging and boosting algorithms. Results suggest that dynamic integration significantly improves the performance of the ensembles compared to the more typical majority voting integration. Tsymbal also presented experiments in which a dynamic integration approach was better than the simple majority combination in Random Forests on some datasets [76]. Rooney et al. [65] extended dynamic integration to regression problems. They claim that dynamic integration techniques are as effective for regression as stacked regression when the base models are simple. In another paper by the same authors [66], they combined the random subspace method (the training data is transformed to contain different random subsets of the variables) with stacked regression and dynamic integration. Again, for simple models like linear regression and k-nearest neighbors, these techniques are more effective than bagging and boosting. Later, Rooney and Paterson [64] proposed a combination of stacking and dynamic integration for regression problems named wmetacomb. Preliminary results were promising. Ko et al. [38] and Moreira et al. [48] presented studies in which several variants of dynamic selection and integration are experimented with. The former showed comparisons of dynamic classifier selection and dynamic ensemble selection; the results (no statistical verification was carried out) suggested that, using weak classifiers, dynamic ensemble selection can marginally improve accuracy, but does not always perform better than dynamic classifier selection. The latter, in a regression task, also found evidence that dynamically selecting several models for the prediction task increases prediction accuracy compared to selecting just one model. They also claim that using similarity measures according to the target values improves results.
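The dynamic selection scheme of Figure 2 can be sketched in a few lines. This is a minimal illustration under my own assumptions (Euclidean k-nearest neighbors on a validation set, local accuracy as the competence measure, plurality vote over the selected models), not the method of any specific paper cited above:

```python
import numpy as np

def dynamic_select_predict(models, X_val, y_val, x, k=5, n_select=1):
    """For a query x: find its k nearest validation instances, rank the models
    by accuracy on that neighbourhood, keep the n_select most competent ones,
    and combine their predictions by plurality vote."""
    dists = np.linalg.norm(X_val - x, axis=1)          # Euclidean distances
    nn = np.argsort(dists)[:k]                         # k nearest neighbours
    competence = np.array([(m.predict(X_val[nn]) == y_val[nn]).mean()
                           for m in models])
    chosen = np.argsort(competence)[::-1][:n_select]   # most competent models
    votes = np.array([int(models[i].predict(x[None, :])[0]) for i in chosen])
    return int(np.bincount(votes).argmax())
```

Setting n_select=1 gives dynamic classifier selection; n_select>1 gives a small dynamically selected ensemble, mirroring the DS/DES distinction discussed above.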
Liyanage et al. [46] proposed a dynamically weighted ensemble classification (DWEC) framework whereby an ensemble of multiple classifiers is trained on clustered features. The decisions from these multiple classifiers are dynamically combined based on the distances of the cluster centres to each test data sample being classified. Results showed that their method is significantly better than a Support Vector Machine baseline classifier.
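Dynamic weighting, the DW counterpart of the selection sketch above, can be illustrated analogously for regression. This is my own minimal sketch, assuming per-instance weights inversely proportional to each model's error on the query's validation neighbourhood; it is not the DWEC method itself:

```python
import numpy as np

def dynamic_weighted_predict(models, X_val, y_val, x, k=5, eps=1e-12):
    """Dynamic weighting sketch for regression: weight each model by the
    inverse of its mean absolute error on the k validation instances nearest
    to x, then return the weighted average of the models' predictions."""
    nn = np.argsort(np.linalg.norm(X_val - x, axis=1))[:k]
    local_err = np.array([np.abs(m.predict(X_val[nn]) - y_val[nn]).mean()
                          for m in models])
    w = 1.0 / (local_err + eps)                 # lower local error, higher weight
    w /= w.sum()
    preds = np.array([float(m.predict(x[None, :])[0]) for m in models])
    return float((w * preds).sum())
```

Unlike the constant weighting functions of the previous subsection, the weights here are recomputed for every test instance, which is exactly what makes the approach dynamic.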
3 Metalearning

Metalearning (MtL) is the study of principled methods that exploit metaknowledge to obtain efficient models and solutions by adapting machine learning and data mining processes [5]. Rendell and Cho [61] published the first experiments at the meta-level for classification. They characterized a classification problem and studied its impact on algorithm behavior, extracting metafeatures related to the size and concentration of the labels. In the following years, research in MtL was boosted by two European projects: StatLog and METAL. The former provided an assessment of the strengths and weaknesses of several classification techniques, while the latter focused on the development of a MtL assistant for providing user support in machine learning and data mining, both for classification and regression problems [70]. The key issue in MtL is metaknowledge: the experience or knowledge gained from one or more data mining tasks. Typically, this knowledge is not made available to improve the task or to assist in the following data mining tasks. Therefore, MtL concentrates on the effective application of knowledge about learning systems to understand and improve their performance. Figure 3 shows a typical MtL process of knowledge acquisition. Initially (from A to B), the process starts with the user having a group of datasets and the extraction of data characteristics or metafeatures (this topic will be detailed in 3.1). Then follow the steps of a normal data mining project: preprocessing and experiments (C), choosing a learning strategy (D) and, finally, the evaluation phase (E). Both stages D and E also provide metaknowledge to be stored in a meta-dataset (F).

Figure 3: Metalearning: knowledge acquisition. Adapted from: [80].
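The knowledge-acquisition loop of Figure 3 amounts to pairing each dataset's metafeatures with each algorithm's estimated performance. A minimal sketch (all names and the callable interfaces are my own illustrative assumptions):

```python
def build_metadataset(datasets, algorithms, extract_metafeatures, evaluate):
    """Knowledge-acquisition sketch (steps A to F): for each base-level dataset,
    pair its metafeatures with each algorithm's estimated performance, yielding
    one meta-example per (dataset, algorithm) pair.

    datasets: list of (name, X, y) tuples.
    algorithms: dict mapping an algorithm name to an estimator factory.
    extract_metafeatures: callable (X, y) -> dict of metafeature values.
    evaluate: callable (factory, X, y) -> performance estimate.
    """
    meta_examples = []
    for name, X, y in datasets:
        metafeatures = extract_metafeatures(X, y)        # step B
        for algo_name, factory in algorithms.items():    # steps C-D
            meta_examples.append({
                "dataset": name,
                **metafeatures,
                "algorithm": algo_name,
                "performance": evaluate(factory, X, y),  # step E
            })
    return meta_examples                                 # step F: the metadata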
Some authors consider that concepts like boosting, bagging or stacking (in other words, model combination methods) are a form of MtL (or meta-learning). However, in this work, we refer to
MtL as the process of using metadata for improving and better understanding data mining processes. Within this notion of MtL, one of the key concepts is learning bias: it refers to any preference for choosing one hypothesis explaining the data over other (equally acceptable) hypotheses, where such preference is based on extra-evidential information independent of the data. MtL studies how to choose the most adequate bias dynamically. Bias can be divided into two schemes: declarative and procedural. The former refers to the representation of the space of hypotheses and affects the size of the search space (e.g., specifying the use of linear functions). The latter imposes constraints on the ordering of the inductive hypotheses (e.g., preferring smaller hypotheses) [5]. Another key point in MtL is metadata, namely the metatarget and the extracted metafeatures (or data characteristics). The metatarget needs to be a variable that carries information about the performance of a given algorithm on a given dataset. On the other hand, metafeatures need to properly characterize the datasets in order to achieve an accurate and reliable meta-level model. These can be divided into three types: simple, statistical and information-theoretic measures; model-based measures; and landmarkers. Further details on this subject are presented in 3.1. MtL applications cover several distinct tasks, namely algorithm recommendation, development of systems to support the KDD process, combination of base-learners, bias management in data streams, transfer of knowledge and development of complex systems with domain-specific metaknowledge [5]. In this work, we will focus on the topic of recommendations for data mining, given that it is the most related to our project goals. For a full exposition of MtL applications, we refer the reader to [5]. Last but not least, there is some work in the state of the art that relates ensemble methods with MtL. This is, however, a very restricted group.
We will provide some insights into those works in 3.3 and expand on their contributions in our future work.

3.1 Metadata

Generating the metadata is the most important step in a MtL process. Besides choosing the appropriate metatarget for the task, it is crucial to select meaningful metafeatures that contain information for successfully achieving the main goal. For that, it is very important to take into consideration both task-dependent and algorithm-specific metafeatures [5]. For instance, if the base-level task is classification, the choice of metafeatures should be different than for a regression problem. Moreover, one should also consider the set of algorithms that the MtL system needs to relate. For example, the proportion of symbolic features should be meaningful to differentiate between a neural network
and a naïve Bayes. It is acknowledged that neural networks usually present good performance when the dataset contains several numeric variables. On the other hand, naïve Bayes is better suited to symbolic attributes. Each example in a meta-dataset represents a learning problem. As in any other learning task, MtL needs a satisfactory number of examples in order to induce a reliable model. The number of meta-examples is often seen as a problem for MtL [5].

Metatarget

Concerning the development of a MtL system, the first decision that must be made is about the type of metatarget, in other words, the dependent variable of the meta-level learning process. This variable can take several forms depending on the main goal of the MtL system and the nature of the base-level task (i.e., classification, regression, etc.). The simplest form of metatarget is a classification scheme (binary or multi-class, depending on the number of algorithms) in which, for a given dataset, the metamodel predicts the class that represents the algorithm with the best performance from a set. The great disadvantage of this type of metatarget is that if the metamodel fails in its prediction, the costs can be very high. Another type of metatarget is one in which, instead of a single recommendation, the metamodel suggests a subset of algorithms. Given the algorithm with the expected best performance, a heuristic measure can be defined to indicate the algorithms that also perform well in comparison with the best one. Typically, these metamodels are induced with rules or Inductive Logic Programming [37]. The previous type of metatarget provides several recommendations to the user. However, they are not ordered, which can negatively influence the data mining process. Therefore, algorithm recommendation in the form of rankings seems a good alternative. A MtL method that provides recommendations in the form of rankings is proposed in [7].
The system includes an adaptation of the k-nearest neighbors algorithm that identifies algorithms which are expected to tie, providing a reduced ranking by only including one of them in the recommendation. Here, the metamodel also takes the form of a classifier. Finally, if one is interested in a concrete value for the performance of an algorithm on a dataset, rather than the relative performance of a set of algorithms, the metatarget can be defined as estimates of performance. In this case, the MtL problem takes the form of a regression, one for each base-algorithm. Besides providing more detailed information to the user, this type of metatarget also allows transforming the output of the metamodels into any of the recommendation forms mentioned above.
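The relationship between the metatarget forms can be made concrete: given per-algorithm performance estimates for one dataset, the other forms fall out directly. A minimal sketch (the function name and the near-best tolerance heuristic are my own illustrative assumptions):

```python
import numpy as np

def metatargets_from_estimates(estimates, tol=0.02):
    """From per-algorithm performance estimates for one dataset, derive the
    three metatarget forms discussed above: the single best algorithm, the
    subset of algorithms within `tol` of the best, and a full ranking."""
    names = list(estimates)
    scores = np.array([estimates[n] for n in names])
    ranking = [names[i] for i in np.argsort(scores)[::-1]]   # descending
    best = ranking[0]
    near_best = [n for n in names if estimates[n] >= estimates[best] - tol]
    return best, near_best, ranking
```

This is the direction discussed next: whether such transformed estimates beat metamodels trained directly on the ranking or classification metatarget is an empirical question.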
There is a certain lack of studies comparing the forms of metatarget. In fact, the work done on this subject is contradictory. Köpf et al. [39] found evidence that the transformation of the estimates of performance provided by regression metamodels is not a good option. On the other hand, Bensusan and Kalousis [4] provide evidence that better rankings can be obtained through the transformation of estimates of performance than by using a ranking algorithm. Further research is needed for a full comparison of methods.

Metafeatures

Defining the metafeatures is probably the most important task in a MtL problem. If the data characterization does not provide useful information, the probability of success of the MtL system is highly reduced. Brazdil et al. [5] defined three fundamental requirements that every metafeature should fulfill:

Discriminative power. The metafeatures need to contain information that distinguishes between the base-algorithms in terms of their performance.

Computational complexity. The computation of the metafeatures should not be too demanding; otherwise, it may not pay off to build a MtL system when one could instead spend those resources exploring all the hypotheses for a given learning problem. Pfahringer et al. [58] suggested that the computational complexity of extracting metafeatures should be at most O(n log n).

Dimensionality. Given that the number of meta-examples in a MtL problem is usually small, the number of metafeatures should not be too large or overfitting may occur. Kalousis and Hilario [36] found evidence that feature selection can improve a MtL process, which supports this claim.

As mentioned before, metafeatures can be divided into three types (Figure 4):

Simple, statistical and information-theoretic. These are the most common type of metafeatures, extracted using descriptive statistics and information-theoretic measures.
Some examples: number of features/examples, number of instances with missing values (simple); mean skewness of numeric features, mean value of correlation (statistical); class entropy, mutual information of symbolic features (information-theoretic). We refer the reader to [39][26] for more examples.

Model-based. Here, metafeatures are extracted based on properties of the induced model. Examples: the number of leaf nodes in a decision tree [55] or the mean of the off-diagonal values of a kernel matrix in a Support Vector Machine model [71].
Landmarkers. This type of metafeature consists of quick estimates of an algorithm's performance. They can be obtained in three different ways: through the run of simplified versions of an algorithm [3][58] (e.g., a decision stump); through quick performance estimates on a sample of the data, also called subsampling landmarkers [25]; and, finally, through an ordered sequence of subsampling landmarkers for a single algorithm, which allows forming the so-called learning curve of an algorithm. In this case, not only the estimates can be used as metafeatures but also the shape of the curve [43].

Figure 4: Metafeatures. Source: adapted from [5].

3.2 Metalearning for Data Mining Applications

Rendell et al. [62][61] published the earliest works in which the expression meta-learning is used in Machine Learning. In the first paper [62], they proposed the Variable Bias Management System (VBMS). Here, the problem of algorithm recommendation is studied for the first time and the need for methods that develop models with different biases is identified. However, the experiment is rather preliminary: only the execution time of the algorithms is considered, the metafeatures are very simple (e.g., number of examples) and the evaluation carried out is clearly insufficient. In the second paper [61], the data characterization was more detailed and set the roots for the MtL research of the following years, boosted by the already mentioned European projects, StatLog and METAL.
In the StatLog project, besides great developments in the scope of data characterization [26], MtL was used to predict the applicability of learning algorithms to a given dataset [6]. By applicability, we mean assessing whether the performance of a learning algorithm is significantly different from that of the best algorithm on the corresponding dataset. The METAL project allowed the research on MtL to develop, focusing more on the problem of ranking recommendations of algorithms [7]. Smith et al. [69] used a self-organising map to cluster 57 classification datasets based only on metafeatures. Each cluster was then inspected to identify common metafeatures and to evaluate the performance of different algorithms within each cluster. Rules were extracted from a statistical analysis of the clusters. Kalousis et al. [35] published a very interesting work in which the authors looked for similarities between algorithms by means of error correlation, and similarities between datasets based on patterns of error correlation and relative performance of algorithms. Their main goal was not predictive performance at the meta-level, but rather to gain understandable insights. They found that the most discriminatory variables were:

- data availability, curse of dimensionality: number of examples, ratio of number of examples to number of classes
- class distribution: class entropy and normalized class entropy
- information content: uncertainty coefficient of attributes and class

However, the authors highlight that these findings must be viewed carefully. They only studied the problem from a classification perspective with 80 datasets, mainly from UCI. Also in the scope of classification, Ali and Smith [1] used 112 datasets from UCI to induce a C4.5 decision tree, producing rules for each algorithm with an average accuracy of 10-fold cross-validation testing exceeding 80% in predicting the best algorithm.
Most of the metafeatures in that study were simple, statistical and information-theoretic. Recently, the problem of meta-example generation has been addressed by Prudêncio and Ludermir [60] as an Active Learning task, more particularly, Active Metalearning. They used this method to reduce the set of meta-examples by selecting only the most relevant problems for meta-example generation. The combination of different Uncertainty Sampling methods to select the most informative meta-examples, together with a previous application of an Outlier Detection method to remove outliers, presented gains in MtL performance.
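Many of the metafeatures recurring in the studies above (number of examples, class entropy, skewness of numeric features) are cheap to compute, which is precisely why they satisfy the computational-complexity requirement of Section 3.1. A minimal sketch (function names are my own), assuming numeric features:

```python
import numpy as np

def class_entropy(y):
    """Information-theoretic metafeature: entropy of the class distribution (bits)."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def simple_statistical_metafeatures(X, y):
    """A handful of simple, statistical and information-theoretic metafeatures
    for a numeric dataset."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    skew = (centered ** 3).mean(axis=0) / (X.std(axis=0) ** 3 + 1e-12)
    return {
        "n_examples": X.shape[0],                         # simple
        "n_features": X.shape[1],                         # simple
        "mean_abs_skewness": float(np.abs(skew).mean()),  # statistical
        "class_entropy": class_entropy(y),                # information-theoretic
    }
```

Model-based metafeatures and landmarkers would require fitting base-level models first, so they trade extra computation for (potentially) more discriminative power.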
For the problem of algorithm parameter recommendation, Soares et al. [72] proposed a MtL method to recommend values for the width of the Gaussian kernel in a SVM. Later [71], they extended their work by showing that significant improvements could be achieved on the same problem after integrating metafeatures based on the kernel matrix. More recently, this problem of parameter recommendation for SVM has been extended by combining a MtL method with metaheuristics [29].

3.3 Metalearning for Ensemble Methods

There are very few publications regarding the application of MtL methods to EL. However, we can identify some papers that provide contributions that can be useful for further research on that particular topic. The most widely known application of MtL to EL is due to Todorovski and Džeroski [74], with the already mentioned Meta-Decision Trees (MDT). They presented an algorithm, based on C4.5, for learning a decision tree whose leaves, instead of making a prediction, specify which classifier should be used to obtain one. Their study comprised 21 classification datasets and 5 base-level classifiers, namely two algorithms for learning decision trees, a rule learning algorithm, a nearest neighbor algorithm and a naive Bayes algorithm. They tested MDTs using two types of attributes: ordinary base-level attributes and class distribution properties (CDP). The latter reflect the certainty and confidence of the predictions and can be considered metafeatures. The simplicity of the approach makes MDTs easy to interpret, and useful metaknowledge can be extracted by inspecting the trees. The CDPs used were:

- maxprob(x, C): the highest class probability (i.e.
the probability of the predicted class) predicted by the base-level classifier C for example x
- entropy(x, C): the entropy of the class probability distribution predicted by the classifier C for example x
- weight(x, C): the fraction of the training examples used by the classifier C to estimate the class distribution for example x

Experimental results show that MDTs induced from CDPs perform much better and are much more concise than MDTs induced from ordinary base-level attributes. In comparison with other EL methods, MDTs also perform better than the SCANN method for combining classifiers and the method of selecting the best single classifier. Finally, MDTs induced from CDPs perform better than boosting and bagging of decision trees and are thus competitive with state-of-the-art methods
for learning ensembles. However, the metafeatures used could further improve MDTs if other types of features were considered, namely landmarkers or simple measures. Furthermore, an adaptation to regression tasks could be an interesting line of research. Peterson and Martinez [57] propose a distance metric for finding similarity between hypotheses and learning algorithms, named Classifier Output Difference (COD). The authors define the COD distance between two hypotheses as the frequency (a real value between 0 and 1) with which they disagree on the classification of patterns. This distance between two hypotheses over a particular data set can be estimated by observing the frequency with which the hypotheses disagree with each other on the classification of the patterns from the given set, therefore

$$\widehat{COD}_T(\hat{f}_1, \hat{f}_2) = \frac{\sum_{x \in T_s} \mathbb{1}\big[\hat{f}_1(x) \neq \hat{f}_2(x)\big]}{|T_s|} \qquad (6)$$

where $T_s$ is the test set. They claim that this measure can be used to predict the potential for combining hypotheses in an ensemble to improve accuracy. Later, Lee and Giraud-Carrier [42] published a paper on unsupervised MtL in which they study the application of several ensemble learning diversity measures as distance functions for clustering learning algorithms. In their experiments, only one measure, COD, presented results indicating that it can be a good measure for this kind of task. The analysis of their results, after clustering 21 learning algorithms on 129 classification datasets, shows that this clustering differs from a clustering based on accuracy and reveals interesting similarities among learning algorithms. This can be a good line of research for seeking interpretability of ensemble performance.
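Equation (6) translates directly into code. A minimal sketch of the COD estimate, taking the two hypotheses' predictions on the test set as arrays:

```python
import numpy as np

def cod(pred_1, pred_2):
    """Classifier Output Difference (Eq. 6): fraction of test-set instances
    on which two hypotheses disagree; a real value between 0 and 1."""
    pred_1, pred_2 = np.asarray(pred_1), np.asarray(pred_2)
    return float((pred_1 != pred_2).mean())
```

A COD of 0 means the two hypotheses are behaviourally identical, so pairs with high COD are the more promising candidates for combination.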
4 Research Plan

The general goal of this project is to investigate the use of MtL for dynamic integration approaches to EL containing a very large number of models. The goals we plan to achieve are:

- use MtL as a dynamic selection method, i.e., to select a subset of a large set of models which is expected to make the most accurate prediction for a given instance.
- use MtL as a dynamic integration method, i.e., not only to select but also to combine the models (e.g., by giving different weights to different models).

Concerning the algorithms used to generate the models, two different scenarios will be investigated:

- homogeneous ensembles, i.e., models generated by varying the parameters of a single algorithm.
- heterogeneous ensembles, i.e., models generated with different algorithms.

The approaches developed will be applied to:

- benchmark problems typically used in machine learning.
- real-world applications which we are working on, including sensors in industrial equipment and trip time prediction in public transportation.

One of the main challenges of MtL is the design of suitable data characteristics (i.e., metafeatures). Useful metafeatures contain information about the data that affects the behaviour of the learning algorithm(s). The goals we plan to achieve are to extend existing work on:

- algorithm-dependent (for homogeneous ensembles) and algorithm-independent metafeatures (for heterogeneous ensembles).
- domain-independent and domain-dependent aggregated metafeatures (i.e., describing groups of instances).
- combination of aggregated metafeatures with metafeatures describing single instances.

Figure 5 shows a scheme that summarizes our approach. Our starting point will be generating ensembles.
This ensemble generation phase shall be assisted by the literature presented in Section 2.3. (Hopefully, our contributions will improve some of the state-of-the-art ensemble algorithms; there are some indicators of that possibility [76].) Then, the extraction of metadata follows: simple, statistical and information-theoretic measures from the base-level datasets; model-based (or ensemble-based) metafeatures; landmarkers (through
Figure 5: Metalearning for dynamic integration scheme.

quick estimates on a validation set) and a metatarget (here, there are several options to research, as discussed in Section 3.1). The generated metadata shall allow inducing one or several metamodels (in this case, each metamodel may have different tasks or purposes). Finally, the metamodel(s) and ensemble(s) shall be combined in order to achieve the desired dynamic selection and integration, hopefully resulting in a system with improved accuracy and improved comprehensibility of ensemble performance. The following subsections detail our research plan by decomposing it into 6 tasks.

4.1 Task 1. Foundations of Ensemble Learning and Metalearning

The application of MtL techniques and concepts can help to unveil the reasons behind the success (or failure) of multiple combinations of models and integration methods. As mentioned in Section 2, the construction of an ensemble method involves three different steps: ensemble generation, ensemble
pruning and ensemble integration. MtL can be particularly helpful in two phases (generation and integration) and can indirectly simplify the pruning process (given an accurate direction in the generation phase, the number of redundant models should be smaller). We will study the state of the art in each sub-field further. We will pay particular attention to the analysis of metafeatures proposed in the literature. This will be the basis for the study in task 2. Implementations of selected approaches will be made. This will enable us to gain a better understanding of those approaches and their behavior. It will also be the basis for the rest of the project. Expected time for this task: 6 months.

1.1. Ensemble Learning: in this sub-task we will study the state of the art of ensemble learning, with particular attention to metalearning applications for EL. Expected duration: 3 months.
1.2. Metalearning: this sub-task concerns the study of the state of the art on MtL. Expected duration: 2 months.
1.3. Report concerning the development of task 1. Expected duration: 1 month.

4.2 Task 2. Metafeatures for homogeneous ensembles

Based on the analysis carried out in task 1, we will develop metafeatures containing useful information about the data that describes the behavior of the learning algorithm(s). We will address the problem of designing new metafeatures for homogeneous ensembles first, due to the different nature of the ensembles. In this case, metafeatures must discriminate between different parametrizations of a single algorithm and are, thus, very specific. We will implement them and then empirically validate them for MtL on applications we are working on, including prediction of trip time duration of buses and soft sensors. We will also test the approach on general benchmarks (e.g., UCI data). We will write one conference paper about this work. Expected time for this task: 6 months.

2.1. Development of the metafeatures for homogeneous ensembles. Expected duration: 3 months.
2.2. Empirical validation of the developed metafeatures. Expected duration: 2 months.
2.3. Writing of the conference paper concerning task 2. Expected duration: 1 month.

For this particular task, we anticipate some lines of research with great potential. More particularly, the combination of different types of metafeatures still needs further research: most papers only deal with one or two types of metafeatures [7][26][3], and we plan to merge several approaches and study the impact on the MtL process. Furthermore, we want to open new research directions in the topic of metafeature extraction by proposing the first set of metafeatures specifically designed to characterize ensembles. For homogeneous ensembles, a starting point could be averaging the values of model-based characteristics [55], e.g., the average number of nodes of the trees in the ensemble. Another important topic to explore is the extraction of metafeatures based on the error decomposition theory for regression tasks and on error/diversity measures for classification tasks, as exposed in Section 2.1.

4.3 Task 3. Metalearning for dynamic selection of homogeneous ensembles

We will develop a MtL approach to select, from a very large set of models, which ones to combine. Combination will be done with methods typically used in EL. The implementation will be based on the implementations made in task 1 and the metafeatures will be selected from the ones developed in task 2. Empirical validation will be done on the same problems used in task 2. A conference paper will be written. Expected time for this task: 5 months.

3.1. Development of the MtL approach for dynamic selection of homogeneous ensembles. Expected duration: 2 months.
3.2. Empirical validation of the MtL approach developed in this task. Expected duration: 1 month.
3.3. Writing of the conference paper concerning task 3. Expected duration: 1 month.
In this task, several issues from the EL literature must be acknowledged, namely the multicollinearity problem (for regression problems), error decomposition and ensemble diversity. The selection of the models must take these fundamental principles into account. Several papers in the state of the art can give important initial insights [38][48], and experiments with the COD distance metric [57] can be an interesting line of research. While developing this task, we should also acknowledge the importance of choosing a learning algorithm that provides comprehensibility. We consider that providing metaknowledge through the application of MtL to ensemble methods is as important as developing systems that present superior predictive performance. Therefore, and given results already present in the literature [74], decision trees seem a good option. We would also like to explore some of the techniques mentioned in Section 2.3 in order to boost the predictive capacity of the metamodel(s).

4.4 Task 4. Metalearning for dynamic integration of homogeneous ensembles

In this task, we will go one step further, based on the results of task 3. We will use MtL not only to select the base models but also to decide how to combine them (e.g., which weight to assign to each model). The approach will be developed based on the work done in tasks 1 and 3. Empirical validation will be similar to the one carried out in task 3. A journal paper and a conference paper will be written. Expected time for this task: 5 months.

4.1. Development of the MtL approach for dynamic integration of homogeneous ensembles. Expected duration: 2 months.
4.2. Empirical validation of the developed MtL approach for task 4. Expected duration: 1 month.
4.3. Writing of the conference and journal paper concerning task 4. Expected duration: 1 month.

Literature suggests that dynamic integration of ensembles has better results for classification tasks than for regression tasks [65][66][75][77][76].
Therefore, it is important to consider this issue when planning the application of MtL methods for dynamic integration of classifiers, and while generating the metafeatures for regression problems. Here, the work of Rooney and Patterson [64] can give an important initial insight.

4.5 Task 5. Metalearning for dynamic selection and integration of heterogeneous ensembles

We will adapt the work carried out in the previous tasks to heterogeneous ensembles. The biggest challenge is to design new metafeatures for heterogeneous ensembles, which are necessarily different from the ones designed for homogeneous ensembles (task 2) due to the different nature of the two approaches. For heterogeneous ensembles, metafeatures must discriminate between the characteristics of possibly very diverse algorithms. Empirical validation will be similar to the one carried out in the previous tasks but, due to the use of diverse algorithms, more effort will be required. A journal paper and a conference paper will be written. Expected time for this task: 8 months.

5.1. Development of metafeatures for heterogeneous ensembles. Expected duration: 2 months.
5.2. Development of the metalearning approach for dynamic selection and integration of heterogeneous ensembles. Expected duration: 2 months.
5.3. Empirical validation of the developed MtL approach for task 5. Expected duration: 2 months.
5.4. Writing of the conference and journal papers concerning task 5. Expected duration: 1 month.

If we achieve success with the previous tasks, we should have important insights that can put us in the right direction. A simple but effective approach present in the literature is Todorovski and Džeroski's meta decision trees (MDTs) [74]. Their work, together with the adaptation of our previous contributions, may set the roots for the development of the methods for this task.
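To ground that starting point, here is a minimal sketch of combining a heterogeneous pool of base learners under a comprehensible decision-tree metamodel. Note that this is plain stacked generalization [82] with a tree at the meta level, considerably simpler than Todorovski and Džeroski's actual MDTs [74]; scikit-learn's `StackingClassifier` and the particular base algorithms are assumed implementation choices, not prescriptions of the proposal.

```python
# Sketch: heterogeneous ensemble with a decision-tree metamodel,
# i.e., plain stacking with a comprehensible learner at the meta level.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=2)

# Heterogeneous pool: algorithms of a very different nature.
base = [("lr", LogisticRegression(max_iter=1000)),
        ("nb", GaussianNB()),
        ("knn", KNeighborsClassifier())]

# A shallow tree at the meta level keeps the combination comprehensible.
stack = StackingClassifier(estimators=base,
                           final_estimator=DecisionTreeClassifier(max_depth=3))
stack.fit(X_tr, y_tr)
score = stack.score(X_te, y_te)
```

Inspecting the fitted tree (`stack.final_estimator_`) then yields metaknowledge about when each base algorithm's predictions dominate, which is exactly the kind of comprehensibility we argued for above.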
4.6 Task 6. Dissertation

The thesis will be based on the papers written in the other tasks. Expected duration: 9 months.

Figure 6: Schedule of the research plan.
5 Final Remarks

This thesis proposal presents a research plan for the application of Metalearning techniques to the dynamic selection and integration of ensemble models. We include a state of the art for the fields of Ensemble Learning and Metalearning, highlighting the contributions that can be relevant for our future work. The literature review allowed us to identify some lines of research that we plan to pursue.

The application of Metalearning techniques to Ensemble Methods is, per se, a topic with few publications. We believe that our research can contribute to the combination of these two fields. More particularly, we look forward to investigating causal relations that we may find between ensemble characteristics, data characteristics and ensemble performance. These can be important steps towards a better understanding of the inner mechanisms of ensemble methods.

Concerning our future work on a dynamic approach to ensemble methods, the literature still does not provide a satisfactory framework for this topic. The application of Metalearning techniques has the potential for interesting contributions. To the best of our knowledge, this is the first attempt at this type of approach. Moreover, the application of Metalearning techniques to Ensemble Learning is a very narrow field. We must pay particular attention to the very few publications already present in the literature and consider our options carefully.

In conclusion, we believe that this thesis proposal has set the roots for our research plan and has opened promising lines of work.
Bibliography

[1] Shawkat Ali and Kate A. Smith. On learning algorithm selection for classification. Applied Soft Computing, 6(2).
[2] Ethem Alpaydin and Cenk Kaynak. Cascading classifiers. Kybernetika.
[3] Hilan Bensusan and Christophe Giraud-Carrier. Discovering task neighbourhoods through landmark learning performances. In Principles of Data Mining and Knowledge Discovery. Springer.
[4] Hilan Bensusan and Alexandros Kalousis. Estimating the predictive accuracy of a classifier. In Machine Learning: ECML 2001. Springer.
[5] Pavel Brazdil, Christophe Giraud-Carrier, Carlos Soares, and Ricardo Vilalta. Metalearning: Applications to Data Mining. Springer.
[6] Pavel Brazdil, João Gama, and Bob Henery. Characterizing the applicability of classification algorithms using meta-level learning. In Machine Learning: ECML-94. Springer.
[7] Pavel Brazdil, Carlos Soares, and Joaquim Pinto da Costa. Ranking learning algorithms: Using IBL and meta-learning on accuracy and time results. Machine Learning, 50(3).
[8] Leo Breiman. Bagging predictors. Machine Learning, 24(2).
[9] Leo Breiman. Heuristics of instability and stabilization in model selection. The Annals of Statistics, 24(6).
[10] Leo Breiman. Stacked regressions. Machine Learning, 24(1):49-64.
[11] Leo Breiman. Randomizing outputs to increase prediction accuracy. Machine Learning, 40(3).
[12] Leo Breiman. Random forests. Machine Learning, 45(1):5-32.
[13] Leo Breiman. Using iterated bagging to debias regressions. Machine Learning, 45(3).
[14] Gavin Brown. An information theoretic perspective on multiple classifier systems. In Multiple Classifier Systems. Springer.
[15] Gavin Brown and Ludmila I. Kuncheva. "Good" and "bad" diversity in majority vote ensembles. In Multiple Classifier Systems. Springer.
[16] Gavin Brown, Jeremy Wyatt, Rachel Harris, and Xin Yao. Diversity creation methods: a survey and categorisation. Information Fusion, 6(1):5-20.
[17] Gavin Brown, Jeremy L. Wyatt, and Peter Tiňo. Managing diversity in regression ensembles. The Journal of Machine Learning Research, 6.
[18] Rich Caruana, Alexandru Niculescu-Mizil, Geoff Crew, and Alex Ksikes. Ensemble selection from libraries of models. In Proceedings of the Twenty-First International Conference on Machine Learning, page 18. ACM.
[19] Guilherme P. Coelho and F. J. Von Zuben. The influence of the pool of candidates on the performance of selection and combination techniques in ensembles. In International Joint Conference on Neural Networks (IJCNN 2006). IEEE.
[20] Sašo Džeroski and Bernard Ženko. Is combining classifiers with stacking better than selecting the best one? Machine Learning, 54(3).
[21] Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth. From data mining to knowledge discovery in databases. AI Magazine, 17(3):37.
[22] César Ferri, Peter Flach, and José Hernández-Orallo. Delegating classifiers. In Proceedings of the Twenty-First International Conference on Machine Learning, page 37. ACM.
[23] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1).
[24] Giorgio Fumera and Fabio Roli. A theoretical and experimental analysis of linear combiners for multiple classifier systems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6).
[25] Johannes Fürnkranz and Johann Petrak. An evaluation of landmarking variants. In Working Notes of the ECML/PKDD 2000 Workshop on Integrating Aspects of Data Mining, Decision Support and Meta-Learning, pages 57-68.
[26] João Gama and Pavel Brazdil. Characterization of classification algorithms. In Progress in Artificial Intelligence. Springer.
[27] João Gama and Pavel Brazdil. Cascade generalization. Machine Learning, 41(3).
[28] Mike Gashler, Christophe Giraud-Carrier, and Tony Martinez. Decision tree ensemble: Small heterogeneous is better than large homogeneous. In Seventh International Conference on Machine Learning and Applications (ICMLA 2008). IEEE.
[29] Taciana A. F. Gomes, Ricardo B. C. Prudêncio, Carlos Soares, André L. D. Rossi, and André Carvalho. Combining meta-learning and search techniques to select parameters for support vector machines. Neurocomputing, 75(1):3-13.
[30] Jiawei Han, Micheline Kamber, and Jian Pei. Data Mining: Concepts and Techniques. Morgan Kaufmann.
[31] Lars Kai Hansen and Peter Salamon. Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(10).
[32] Tin Kam Ho, Jonathan J. Hull, and Sargur N. Srihari. Decision combination in multiple classifier systems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(1):66-75.
[33] Md M. Islam, Xin Yao, and Kazuyuki Murase. A constructive algorithm for training cooperative neural network ensembles. IEEE Transactions on Neural Networks, 14(4).
[34] Alípio M. Jorge and Paulo J. Azevedo. An experiment with association rules and classification: Post-bagging and conviction. In Discovery Science. Springer.
[35] Alexandros Kalousis, João Gama, and Melanie Hilario. On data and algorithms: Understanding inductive performance. Machine Learning, 54(3).
[36] Alexandros Kalousis and Melanie Hilario. Feature selection for meta-learning. In Advances in Knowledge Discovery and Data Mining. Springer.
[37] Alexandros Kalousis and Theoharis Theoharis. Noemon: Design, implementation and performance results of an intelligent assistant for classifier selection. Intelligent Data Analysis, 3(5).
[38] Albert H. R. Ko, Robert Sabourin, and Alceu Souza Britto Jr. From dynamic classifier selection to dynamic ensemble selection. Pattern Recognition, 41(5).
[39] Christian Köpf, Charles Taylor, and Jörg Keller. Meta-analysis: From data characterisation for meta-learning to meta-regression. In Proceedings of the PKDD-00 Workshop on Data Mining, Decision Support, Meta-Learning and ILP.
[40] Anders Krogh and Jesper Vedelsby. Neural network ensembles, cross validation, and active learning. In Advances in Neural Information Processing Systems.
[41] Ludmila I. Kuncheva and Christopher J. Whitaker. Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy. Machine Learning, 51(2).
[42] Jun Won Lee and Christophe Giraud-Carrier. A metric for unsupervised metalearning. Intelligent Data Analysis, 15(6).
[43] Rui Leite and Pavel Brazdil. Predicting relative performance of classifiers from samples. In Proceedings of the 22nd International Conference on Machine Learning. ACM.
[44] Hsuan-Tien Lin and Ling Li. Infinite ensemble learning with support vector machines. Springer.
[45] Yong Liu, Xin Yao, and Tetsuya Higuchi. Evolutionary ensembles with negative correlation learning. IEEE Transactions on Evolutionary Computation, 4(4).
[46] Sidath Ravindra Liyanage, Cuntai Guan, Haihong Zhang, Kai Keng Ang, JianXin Xu, and Tong Heng Lee. Dynamically weighted ensemble classification for non-stationary EEG processing. Journal of Neural Engineering, 10(3):036007.
[47] Kiyotoshi Matsuoka. Noise injection into inputs in back-propagation learning. IEEE Transactions on Systems, Man and Cybernetics, 22(3).
[48] João Mendes-Moreira, Alípio Mário Jorge, Carlos Soares, and Jorge Freire de Sousa. Ensemble learning: A study on different variants of the dynamic selection approach. In Machine Learning and Data Mining in Pattern Recognition. Springer.
[49] João Mendes-Moreira, Carlos Soares, Alípio Mário Jorge, and Jorge Freire de Sousa. Ensemble approaches for regression: A survey. ACM Computing Surveys, 45(1):10.
[50] Christopher J. Merz. Dynamical selection of learning algorithms. In Learning from Data. Springer.
[51] Christopher J. Merz. Classification and regression by combining models. PhD thesis, University of California.
[52] Christopher J. Merz and Michael J. Pazzani. A principal components approach to combining regression estimates. Machine Learning, 36(1-2):9-32.
[53] Claire Cain Miller. Data science: The numbers of our lives. New York Times, April.
[54] Julio Ortega, Moshe Koppel, and Shlomo Argamon. Arbitrating among competing classifiers using learned referees. Knowledge and Information Systems, 3(4).
[55] Yonghong Peng, Peter A. Flach, Pavel Brazdil, and Carlos Soares. Decision tree-based data characterization for meta-learning. In Proceedings of the ECML/PKDD 2002 Workshop on Integration and Collaboration Aspects of Data Mining, Decision Support and Meta-Learning.
[56] Michael P. Perrone and Leon N. Cooper. When networks disagree: Ensemble methods for hybrid neural networks. Technical report, DTIC Document.
[57] Adam H. Peterson and T. R. Martinez. Estimating the potential for combining learning models. In Proceedings of the ICML Workshop on Meta-Learning, pages 68-75.
[58] Bernhard Pfahringer, Hilan Bensusan, and Christophe Giraud-Carrier. Meta-learning by landmarking various learning algorithms. In Proceedings of the Seventeenth International Conference on Machine Learning. Morgan Kaufmann.
[59] Jordan B. Pollack. Backpropagation is sensitive to initial conditions. Complex Systems, 4.
[60] Ricardo B. C. Prudêncio and Teresa B. Ludermir. Combining uncertainty sampling methods for supporting the generation of meta-examples. Information Sciences, 196:1-14.
[61] Larry Rendell and Howard Cho. Empirical learning as a function of concept character. Machine Learning, 5(3).
[62] Larry Rendell, Raj Seshu, and David Tcheng. More robust concept learning using dynamically-variable bias. In Proceedings of the Fourth International Workshop on Machine Learning, pages 66-78.
[63] Juan J. Rodríguez, Ludmila I. Kuncheva, and Carlos J. Alonso. Rotation forest: A new classifier ensemble method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(10).
[64] Niall Rooney and David Patterson. A weighted combination of stacking and dynamic integration. Pattern Recognition, 40(4).
[65] Niall Rooney, David Patterson, Sarab Anand, and Alexey Tsymbal. Dynamic integration of regression models. In Multiple Classifier Systems. Springer.
[66] Niall Rooney, David Patterson, Alexey Tsymbal, and Sarab Anand. Random subspacing for regression ensembles. In Proceedings of the 17th International Florida Artificial Intelligence Research Society Conference (FLAIRS 2004), Miami Beach, Florida.
[67] Bruce E. Rosen. Ensemble learning using decorrelated neural networks. Connection Science, 8(3-4).
[68] Robert E. Schapire. The strength of weak learnability. Machine Learning, 5(2).
[69] Kate A. Smith, Frederick Woo, Victor Ciesielski, and Remzi Ibrahim. Matching data mining algorithm suitability to data characteristics using a self-organising map. In Hybrid Information Systems. Physica-Verlag, Heidelberg.
[70] Kate A. Smith-Miles. Cross-disciplinary perspectives on meta-learning for algorithm selection. ACM Computing Surveys, 41(1):6.
[71] Carlos Soares and Pavel B. Brazdil. Selecting parameters of SVM using meta-learning and kernel matrix-based meta-features. In Proceedings of the 2006 ACM Symposium on Applied Computing. ACM.
[72] Carlos Soares, Pavel B. Brazdil, and Petr Kuba. A meta-learning method to select the kernel width in support vector regression. Machine Learning, 54(3).
[73] E. K. Tang, P. N. Suganthan, and X. Yao. An analysis of diversity measures. Machine Learning, 65(1).
[74] Ljupčo Todorovski and Sašo Džeroski. Combining classifiers with meta decision trees. Machine Learning, 50(3).
[75] Alexey Tsymbal. Decision committee learning with dynamic integration of classifiers. In Current Issues in Databases and Information Systems. Springer.
[76] Alexey Tsymbal, Mykola Pechenizkiy, and Pádraig Cunningham. Dynamic integration with random forests. In Machine Learning: ECML 2006. Springer.
[77] Alexey Tsymbal and Seppo Puuronen. Bagging and boosting with dynamic integration of classifiers. In Principles of Data Mining and Knowledge Discovery. Springer.
[78] Kagan Tumer and Joydeep Ghosh. Error correlation and error reduction in ensemble classifiers. Connection Science, 8(3-4).
[79] Naonori Ueda and Ryohei Nakano. Generalization error of ensemble estimators. In IEEE International Conference on Neural Networks, volume 1. IEEE.
[80] Ricardo Vilalta, Christophe Giraud-Carrier, Pavel Brazdil, and Carlos Soares. Using meta-learning to support data mining. International Journal of Computer Science and Applications, 1(1):31-45.
[81] Xiangyang Wang, Jie Yang, Xiaolong Teng, Weijun Xia, and Richard Jensen. Feature selection based on rough sets and particle swarm optimization. Pattern Recognition Letters, 28(4).
[82] David H. Wolpert. Stacked generalization. Neural Networks, 5(2).
[83] Kevin Woods, W. Philip Kegelmeyer Jr., and Kevin Bowyer. Combination of multiple classifiers using local accuracy estimates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4).
[84] Dragomir Yankov, Dennis DeCoste, and Eamonn Keogh. Ensembles of nearest neighbor forecasts. In Machine Learning: ECML 2006. Springer.
[85] Zhi-Hua Zhou. Ensemble Methods: Foundations and Algorithms. Chapman & Hall/CRC Data Mining and Knowledge Discovery Series. Taylor & Francis.
[86] Zhi-Hua Zhou and Nan Li. Multi-information ensemble diversity. In Multiple Classifier Systems. Springer.
[87] Zhi-Hua Zhou, Jianxin Wu, and Wei Tang. Ensembling neural networks: Many could be better than all. Artificial Intelligence, 137(1).
Penalized regression: Introduction
Penalized regression: Introduction Patrick Breheny August 30 Patrick Breheny BST 764: Applied Statistical Modeling 1/19 Maximum likelihood Much of 20th-century statistics dealt with maximum likelihood
Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification
Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification Henrik Boström School of Humanities and Informatics University of Skövde P.O. Box 408, SE-541 28 Skövde
Predict Influencers in the Social Network
Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, [email protected] Department of Electrical Engineering, Stanford University Abstract Given two persons
Classification of Bad Accounts in Credit Card Industry
Classification of Bad Accounts in Credit Card Industry Chengwei Yuan December 12, 2014 Introduction Risk management is critical for a credit card company to survive in such competing industry. In addition
Linear Threshold Units
Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear
The Data Mining Process
Sequence for Determining Necessary Data. Wrong: Catalog everything you have, and decide what data is important. Right: Work backward from the solution, define the problem explicitly, and map out the data
DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES
DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES Vijayalakshmi Mahanra Rao 1, Yashwant Prasad Singh 2 Multimedia University, Cyberjaya, MALAYSIA 1 [email protected]
E-commerce Transaction Anomaly Classification
E-commerce Transaction Anomaly Classification Minyong Lee [email protected] Seunghee Ham [email protected] Qiyi Jiang [email protected] I. INTRODUCTION Due to the increasing popularity of e-commerce
Handling missing data in large data sets. Agostino Di Ciaccio Dept. of Statistics University of Rome La Sapienza
Handling missing data in large data sets Agostino Di Ciaccio Dept. of Statistics University of Rome La Sapienza The problem Often in official statistics we have large data sets with many variables and
NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )
Chapter 340 Principal Components Regression Introduction is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates
Decompose Error Rate into components, some of which can be measured on unlabeled data
Bias-Variance Theory Decompose Error Rate into components, some of which can be measured on unlabeled data Bias-Variance Decomposition for Regression Bias-Variance Decomposition for Classification Bias-Variance
MHI3000 Big Data Analytics for Health Care Final Project Report
MHI3000 Big Data Analytics for Health Care Final Project Report Zhongtian Fred Qiu (1002274530) http://gallery.azureml.net/details/81ddb2ab137046d4925584b5095ec7aa 1. Data pre-processing The data given
Lecture 10: Regression Trees
Lecture 10: Regression Trees 36-350: Data Mining October 11, 2006 Reading: Textbook, sections 5.2 and 10.5. The next three lectures are going to be about a particular kind of nonlinear predictive model,
New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Introduction
Introduction New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Predictive analytics encompasses the body of statistical knowledge supporting the analysis of massive data sets.
Sanjeev Kumar. contribute
RESEARCH ISSUES IN DATAA MINING Sanjeev Kumar I.A.S.R.I., Library Avenue, Pusa, New Delhi-110012 [email protected] 1. Introduction The field of data mining and knowledgee discovery is emerging as a
Statistics Graduate Courses
Statistics Graduate Courses STAT 7002--Topics in Statistics-Biological/Physical/Mathematics (cr.arr.).organized study of selected topics. Subjects and earnable credit may vary from semester to semester.
A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions
A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions Jing Gao Wei Fan Jiawei Han Philip S. Yu University of Illinois at Urbana-Champaign IBM T. J. Watson Research Center
An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015
An Introduction to Data Mining for Wind Power Management Spring 2015 Big Data World Every minute: Google receives over 4 million search queries Facebook users share almost 2.5 million pieces of content
Why Ensembles Win Data Mining Competitions
Why Ensembles Win Data Mining Competitions A Predictive Analytics Center of Excellence (PACE) Tech Talk November 14, 2012 Dean Abbott Abbott Analytics, Inc. Blog: http://abbottanalytics.blogspot.com URL:
BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts
BIDM Project Predicting the contract type for IT/ITES outsourcing contracts N a n d i n i G o v i n d a r a j a n ( 6 1 2 1 0 5 5 6 ) The authors believe that data modelling can be used to predict if an
Azure Machine Learning, SQL Data Mining and R
Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:
Component Ordering in Independent Component Analysis Based on Data Power
Component Ordering in Independent Component Analysis Based on Data Power Anne Hendrikse Raymond Veldhuis University of Twente University of Twente Fac. EEMCS, Signals and Systems Group Fac. EEMCS, Signals
Risk pricing for Australian Motor Insurance
Risk pricing for Australian Motor Insurance Dr Richard Brookes November 2012 Contents 1. Background Scope How many models? 2. Approach Data Variable filtering GLM Interactions Credibility overlay 3. Model
Data Mining Analytics for Business Intelligence and Decision Support
Data Mining Analytics for Business Intelligence and Decision Support Chid Apte, T.J. Watson Research Center, IBM Research Division Knowledge Discovery and Data Mining (KDD) techniques are used for analyzing
STATISTICA Formula Guide: Logistic Regression. Table of Contents
: Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary
Detection. Perspective. Network Anomaly. Bhattacharyya. Jugal. A Machine Learning »C) Dhruba Kumar. Kumar KaKta. CRC Press J Taylor & Francis Croup
Network Anomaly Detection A Machine Learning Perspective Dhruba Kumar Bhattacharyya Jugal Kumar KaKta»C) CRC Press J Taylor & Francis Croup Boca Raton London New York CRC Press is an imprint of the Taylor
On the application of multi-class classification in physical therapy recommendation
RESEARCH Open Access On the application of multi-class classification in physical therapy recommendation Jing Zhang 1,PengCao 1,DouglasPGross 2 and Osmar R Zaiane 1* Abstract Recommending optimal rehabilitation
SPATIAL DATA CLASSIFICATION AND DATA MINING
, pp.-40-44. Available online at http://www. bioinfo. in/contents. php?id=42 SPATIAL DATA CLASSIFICATION AND DATA MINING RATHI J.B. * AND PATIL A.D. Department of Computer Science & Engineering, Jawaharlal
Adaptive Demand-Forecasting Approach based on Principal Components Time-series an application of data-mining technique to detection of market movement
Adaptive Demand-Forecasting Approach based on Principal Components Time-series an application of data-mining technique to detection of market movement Toshio Sugihara Abstract In this study, an adaptive
