Metalearning for Dynamic Integration in Ensemble Methods
Fábio Pinto
12 July 2013
Faculdade de Engenharia da Universidade do Porto
Ph.D. in Informatics Engineering
Supervisor: Doutor Carlos Soares
Co-supervisor: Doutor João Mendes Moreira
Outline

Abstract
1 Introduction
  1.1 Problem Statement
  1.2 Research Approach and Expected Contributions
  1.3 Proposal Organization
2 Ensemble Learning
  2.1 Error, Accuracy and Diversity
  2.2 Bagging, Boosting and other Popular Ensemble Algorithms
  2.3 Ensemble Generation
  2.4 Ensemble Pruning
  2.5 Ensemble Integration and the Dynamic Approach
3 Metalearning
  3.1 Metadata
  3.2 Metalearning for Data Mining Applications
  3.3 Metalearning for Ensemble Methods
4 Research Plan
  Task 1. Foundations of Ensemble Learning and Metalearning
  Task 2. Metafeatures for homogeneous ensembles
  Task 3. Metalearning for dynamic selection of homogeneous ensembles
  Task 4. Metalearning for dynamic integration of homogeneous ensembles
  Task 5. Metalearning for dynamic selection and integration of heterogeneous ensembles
  Task 6. Dissertation
5 Final Remarks
Abstract

Ensemble methods have been receiving an increasing amount of attention, especially because of their successful application to high visibility problems (e.g., the Netflix prize). An important challenge in ensemble learning (EL) is the management of the set of models to ensure a high level of accuracy, particularly with a large number of models and in highly dynamic environments [49]. One approach to deal with these problems in the context of EL is the dynamic approach, which consists in the selection and combination of the best subset of model(s) for each test instance. An alternative approach to find the models that are most suitable for a given set of data is metalearning (MtL). MtL uses data from past experiments to build models that relate the characteristics of learning problems with the behaviour of algorithms [5]. Thus, the general goal of this project is to investigate the use of MtL for dynamic integration approaches to EL.
1 Introduction

The world is deluged by data. The dissemination of the Internet around the globe, together with the development of ubiquitous information-sensing mobile devices, wireless sensor networks and information storage capacity, has enhanced the need to understand and extract value from the data that is being generated. Data Science, a newly coined term that brings together Statistics, Machine Learning (more particularly, Data Mining) and Computer Science, emerges as the field that can assist humans in this task [53]. Data Mining is the automated or convenient extraction of patterns representing knowledge implicitly stored or captured in large databases, data warehouses, the Web, other massive information repositories, or data streams [30]. A Data Mining project usually includes one or more of the following tasks [21]:

- Regression
- Classification
- Anomaly detection
- Association rule learning
- Clustering
- Summarization

In this project, we will focus on the first two, namely, regression and classification. In a typical regression problem (here we follow [49] very closely), we have a dataset that consists of a set of $n$ instances: $\{(x_1, f(x_1)), \ldots, (x_n, f(x_n))\}$. The objective is to induce a function $\hat{f}$ from the data, $\hat{f} : X \rightarrow \mathbb{R}$, such that

$\hat{f}(x) = f(x), \forall x \in X$, (1)

where $f$ represents the unknown true function. The algorithm used to obtain $\hat{f}$ is called the induction algorithm or learner. The function $\hat{f}$ is called the model or predictor. The usual goal for regression is to minimize a squared error loss function, namely the mean squared error (MSE),

$MSE = \frac{1}{n} \sum_{i=1}^{n} \left( \hat{f}(x_i) - f(x_i) \right)^2$ (2)
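As a concrete illustration, the MSE of Equation (2) can be computed in a few lines of Python. This is a sketch with toy data, not part of the proposal itself:

```python
def mse(predictions, targets):
    """Mean squared error: the average of the squared differences
    between the model outputs f_hat(x_i) and the true values f(x_i)."""
    n = len(predictions)
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / n

# toy example: three instances where the model is slightly off
print(mse([1.0, 2.0, 3.0], [1.0, 2.5, 2.0]))  # (0 + 0.25 + 1.0) / 3 = 0.4166...
```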
For classification, the concept is very similar. The goal is also to induce a function $\hat{f}$ from a set of training examples. However, in classification the output of $\hat{f}(x)$ is a categorical variable instead of a numeric one. This has several implications that differentiate classification from regression, one of them being, naturally, the error loss function to minimize. While in regression the majority of the evaluation measures minimize a squared error loss function, in classification the evaluation measures necessarily need to be different: accuracy, precision, recall, F-score, AUC, to name a few. See [30] for further details.

Ensemble Learning (EL) is a process that uses a set of models (regression or classification), each of them obtained by applying a learning process to a given problem. This set of models is integrated in some way to obtain the final prediction [49]. EL has become increasingly popular both for regression and classification tasks. Besides the extensive research that reports great results with ensemble algorithms in a wide variety of problems [85], data mining competitions with great media coverage (e.g., the Netflix prize, the Heritage Health prize) proved that ensembles constitute the best technique for predictive modeling when the main goal lies in accuracy.

1.1 Problem Statement

The great disadvantage in applying ensemble methods is their black box nature [82]. When combining several models for a prediction task, it is very difficult to understand how the ensemble works and to extract knowledge from the system. For instance, a decision tree is an algorithm that besides accuracy also provides inner knowledge, obtained by inspecting the structure of the tree. This knowledge can be very useful to understand the domain of the prediction task and even to improve the final model. Along with the comprehensibility issue, using an ensemble to obtain predictions is a very blind process.
If we have an ensemble that our evaluation methodology says to be accurate, we are going to apply that ensemble to any instance that we want to predict, regardless of its characteristics. We believe that a more dynamic approach can enhance predictive accuracy and, at the same time, provide interesting and useful knowledge. The dynamic approach can be divided into two different steps: the selection of a subset of models within an ensemble, and their integration (in other words, how to combine their predictions) to make a final prediction [48]. Both processes are carried out at prediction time and according to the characteristics of the instance that one is trying to predict. This problem has been addressed in the literature by a few researchers [64][76][50][48]. However, their approaches are very sparse and the challenge remains on how to dynamically apply an ensemble
approach for predictive tasks that obtains competitive accuracy and provides insight concerning its behavior.

1.2 Research Approach and Expected Contributions

We propose to combine Metalearning techniques with Ensemble Learning in order to test the hypothesis that a better dynamic selection and integration of ensembles can be achieved. Metalearning is the study of principled methods that exploit metaknowledge to obtain efficient models and solutions by adapting machine learning and data mining processes [5]. Our approach can also give important and useful insights about the relation between data and the performance of the ensembles, for a better understanding of their behavior. Further details on the research plan are given in Section 4.

Expected Contributions:

- Metafeatures: algorithm-dependent (for homogeneous ensembles) and algorithm-independent metafeatures (for heterogeneous ensembles); domain-independent and domain-dependent aggregated metafeatures; combination of aggregated metafeatures with metafeatures describing single instances.
- Application of Metalearning to Ensemble Methods: use a metamodel that relates data characteristics, ensemble characteristics, individual model characteristics and ensemble performance in order to improve prediction accuracy and gain domain insights.
- Dynamic Selection of Ensemble Models: use Metalearning to dynamically select the best subset of models within an ensemble for a given instance.
- Dynamic Integration of Ensemble Models: use Metalearning to dynamically combine the best subset of models within an ensemble for a given instance.

1.3 Proposal Organization

This thesis proposal is organized as follows. Section 2 presents the state of the art for Ensemble Learning, giving particular attention to contributions related to the dynamic selection and integration of ensemble models. A selected overview of Metalearning is provided in Section 3. The research plan is specified in Section 4.
Finally, the proposal ends with some final remarks in Section 5.
2 Ensemble Learning

Ensemble Learning (EL) is a process that uses a set of models, each of them obtained by applying a learning process to a given problem. This set of models is integrated in some way to obtain the final prediction [49]. The two pioneering works that laid the roots for EL research presented different perspectives on the same topic: Hansen and Salamon [31] published an empirical work in which it was found that predictions made by an ensemble of classifiers (neural networks) are often more accurate than those of the best single classifier. On the other hand, Schapire [68] showed theoretically that weak learners can be combined to form a strong learner.

EL is divided into three sub-processes: ensemble generation, ensemble pruning and ensemble integration [49]. In this state of the art, we will focus particularly on ensemble integration, given that our research plan will hopefully provide contributions for that sub-process. However, an important overview is also presented for ensemble generation, and for the related topics of error decomposition and diversity. Finally, a very brief description of the ensemble pruning state of the art is also provided.

2.1 Error, Accuracy and Diversity

Good ensembles must present specific characteristics: accurate predictors that make errors in different parts of the input space. The generalization error decomposition for regression ensembles was a very important step in understanding the behavior of such systems, and its contributions helped to guide the research in ensemble generation. For classification, there is no such unifying theory. However, there is work in progress in the field [16]. It is well accepted in the Machine Learning community that generating diverse individual classifiers is good practice to achieve accurate ensembles. Again, diversity measures for regression are necessarily different from the classification ones.
Although there are proven connections between diversity measures and accuracy, there is also evidence that raises doubts about the usefulness of such metrics in building ensembles [41]. This is a research field that still lacks a completely grounded framework.

Regression

The literature regarding the error decomposition of regression ensembles has two main schemes: the Error-Ambiguity and the Bias-Variance-Covariance decomposition.
The Error-Ambiguity decomposition was proposed by Krogh and Vedelsby [40] for an ensemble of $k$ neural networks. Assuming that $\hat{f}(x) = \sum_{i=1}^{k} \alpha_i \hat{f}_i(x)$, where $\sum_{i=1}^{k} \alpha_i = 1$ and $\alpha_i \geq 0$, $i = 1, \ldots, k$, they show that the error for one example is

$(\hat{f} - f)^2 = \sum_{i=1}^{k} \alpha_i (\hat{f}_i(x) - f)^2 - \sum_{i=1}^{k} \alpha_i (\hat{f}_i(x) - \hat{f})^2$ (3)

The first term of the equation is the bias component (the error of the individual learners, given their generalization ability) and the second one is the ambiguity component (it measures the variability among the predictions of the individual learners, depending on ensemble diversity). The expression shows clearly that the ensemble generalization error is less than or equal to the generalization error of a randomly selected single predictor. This is true because the ambiguity component is always non-negative. This decomposition also shows that it is possible to reduce the ensemble generalization error by increasing the ambiguity without increasing the bias.

Ueda and Nakano [79] proposed the Bias-Variance-Covariance decomposition. Here, it is assumed that $\hat{f}(x) = \frac{1}{k} \sum_{i=1}^{k} \hat{f}_i(x)$; then

$E[(\hat{f} - f)^2] = \overline{bias}^2 + \frac{1}{k}\,\overline{var} + \left(1 - \frac{1}{k}\right) \overline{covar}$ (4)

The expression shows that the error of the ensemble depends strongly on the covariance component, which captures the correlation between the individual learners. If the learners make similar errors, this component will be large. Therefore, this expression shows that diversity is very important for accurate ensembles. Later, a study of the relation between the Error-Ambiguity and Bias-Variance-Covariance decompositions showed that it is not possible to maximize the ambiguity component and minimize the bias component simultaneously [17]. Thus, generating diverse learners is a complex challenge. The Bias-Variance-Covariance decomposition provides a powerful measure of regression ensemble diversity: the covariance term [17].
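The Error-Ambiguity identity of Equation (3) can be checked numerically for a single example. The sketch below uses hypothetical toy predictions and uniform weights:

```python
# Numerical check of the Error-Ambiguity decomposition for one example.
preds = [2.0, 3.0, 4.5]        # individual predictions f_i(x) (toy values)
alphas = [1 / 3, 1 / 3, 1 / 3] # convex combination weights
f_true = 3.2                   # true target f(x)

f_ens = sum(a * p for a, p in zip(alphas, preds))   # ensemble prediction
ensemble_error = (f_ens - f_true) ** 2
weighted_error = sum(a * (p - f_true) ** 2 for a, p in zip(alphas, preds))
ambiguity = sum(a * (p - f_ens) ** 2 for a, p in zip(alphas, preds))

# the identity: ensemble error = weighted individual error - ambiguity
assert abs(ensemble_error - (weighted_error - ambiguity)) < 1e-9
# the ambiguity term is never negative
assert ambiguity >= 0.0
```

Since the ambiguity term is subtracted and is non-negative, the ensemble error on this example cannot exceed the weighted average of the individual errors, exactly as the decomposition states.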
This component was already integrated into one successful ensemble generation algorithm, Negative Correlation Learning [45]. By adding a penalty term associated with the covariance component to the mean squared error function of the ensemble, the algorithm, together with an evolutionary framework, automatically searches for learners that are not correlated with those already present in the ensemble. Moreover, one should acknowledge that the presented error decomposition schemes assume that the integration function of the learners is averaging. In the case of the non-constant weighting functions
presented in Section 2.5, these theories do not hold. Another important topic for ensemble regression is the multicollinearity problem. This statistical phenomenon, in the context of EL, refers to the situation in which the predictions of two or more individual learners of an ensemble are highly correlated. Given the exposition provided previously, it is straightforward to conclude that this can be problematic. However, if the already mentioned principles of diversity are guaranteed, then it is possible, if not to avoid the problem completely, at least to mitigate it in the ensemble generation or ensemble pruning phase [49].

Classification

Error decomposition in regression ensembles is a very well solved problem. In classification, more research is needed. One can find work in progress trying to adapt the concepts present in regression to classification problems by choosing to approximate the class posterior probabilities [78][24]. However, for some learning algorithms, like decision trees, it is not possible to extract those probabilities: the outputs have no intrinsic ordinality. The work in this topic is divided into two directions: ordinal outputs (in which the outputs of the classifiers are taken as probabilities, as mentioned before) and non-ordinal outputs. We follow Brown very closely [16].

For ordinal outputs, the theoretical framework for analysing a classifier's error when its predictions are posterior probabilities was proposed by Tumer and Ghosh [78]. Figure 1 shows their framework. For a one-dimensional feature vector x, the solid curves show the true posterior probabilities of classes a and b, $P(a)$ and $P(b)$, respectively. The dotted curves show estimates of the posterior probabilities, from one of the predictors, $\hat{P}(a)$ and $\hat{P}(b)$. The solid vertical line at x indicates the optimal decision boundary, and the dark shaded area is named the Bayes error. This error cannot be reduced. The dotted vertical line at $\hat{x}$ indicates the boundary placed by our predictor.
The light shaded area indicates the added error that our predictor makes in addition to the Bayes error. Tumer and Ghosh show that the expected added error, if the decision boundary is instead placed by an ensemble, is

$E_{add}^{ens} = E_{add} \left( \frac{1 + \delta (M - 1)}{M} \right)$ (5)

where $M$ is the number of classifiers and $E_{add}$ is the expected added error of the individual classifiers (they are assumed to have the same error). The $\delta$ is a correlation coefficient measuring the correlation between errors in approximating the posterior probabilities and is, thus, a diversity measure. However,
to achieve this expression, the authors make some critical assumptions. For example, they assume that the errors of the different classifiers have the same variance. Later, this work was extended by Roli and Fumera [24], where some of the assumptions were discarded, one of them being the uniformly weighted combination of the posterior probabilities.

Figure 1: Tumer and Ghosh's framework [78][16] for analysing classifier error.

For non-ordinal outputs, the state of the art still does not provide a unifying satisfactory theory. Ideally, we should have an expression that, similarly to the error decomposition in regression, decomposes the classification error rate into the error rates of the individual learners and a term that quantifies their diversity. The lack of an error decomposition for classification in a context of non-ordinal outputs has led to several diversity measures being proposed in the literature: Disagreement, Q-statistic, Kappa-statistic, Kohavi-Wolpert variance, to name a few (we refer the reader to [85] for further details). However, their usefulness has been highly questioned. Kuncheva and Whitaker [41] showed, through a broad range of experiments, that the existing diversity measures do not provide a clear relation between those diversity measurements and the ensemble accuracy. Tang et al. [73] gave evidence that, compared to algorithms that seek diversity implicitly, exploiting diversity measures explicitly is ineffective for constructing strong ensembles. They also showed that diversity measures do not provide reliable information on whether the ensembles achieve good generalization performance and, at the same time, are highly correlated with average individual accuracies, which is not desirable. More recently, two new research directions emerged for understanding ensemble diversity in a classification context: Brown and Kuncheva's [15] good and bad diversity, and information theoretic diversity.
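As an illustration of the pairwise measures listed above, the disagreement measure (the fraction of instances on which two classifiers predict differently) can be sketched as follows; the predictions are a hypothetical toy example:

```python
def disagreement(preds_a, preds_b):
    """Pairwise disagreement: fraction of instances where two classifiers differ."""
    return sum(a != b for a, b in zip(preds_a, preds_b)) / len(preds_a)

# toy predictions of two classifiers on five instances
clf_1 = [0, 1, 1, 0, 1]
clf_2 = [0, 1, 0, 0, 0]
print(disagreement(clf_1, clf_2))  # differs on 2 of 5 instances: 0.4
```

As Kuncheva and Whitaker [41] argue, however, a high value of such a measure does not by itself guarantee a more accurate ensemble.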
Brown and Kuncheva adopt the perspective that a diversity measure should be naturally defined
as a consequence of two decisions in the design of the ensemble learning problem: the choice of error function and the choice of integration function; more particularly, a zero-one loss function and a majority voting integration scheme. The authors derive a decomposition of the majority vote error into three terms: average individual accuracy, good diversity and bad diversity. The good diversity measures the disagreement on datapoints where the ensemble is correct. The bad diversity measures the disagreement on datapoints where the ensemble is incorrect.

Based on interaction information (a multivariate generalization of mutual information; see [85] for further details), Brown [14] presented a decomposition of the conditional interaction information between a set of predictors and a target variable. His mathematical formulation proposes to decompose classifier ensemble diversity into three components: relevancy (the sum of the mutual information between each classifier and the target), redundancy (which measures the dependency, independent of the target variable, among all possible subsets of classifiers) and conditional redundancy (which measures the dependency among the classifiers given the class label). The main problem of this decomposition is that there is no effective process for estimating the diversity terms. Zhou and Li [86] provided a mathematical simplification of Brown's contribution and a complex estimation method for the diversity terms, with promising results.

2.2 Bagging, Boosting and other Popular Ensemble Algorithms

Research in EL has generated some algorithms that, due to their simplicity and effectiveness, have been widely adopted by the Machine Learning community and even in industry. This section gives a brief overview of the most popular ones. Bagging stands for bootstrap aggregating and is due to Breiman [8]. This technique plays a central role in Random Forests [12], one of the most popular ensemble learning algorithms.
Algorithm 1 shows the pseudocode for the bagging algorithm. Generically, given a data set containing $n$ training instances, a sample $D_{bs}$ of $n$ training instances is drawn with replacement. The process is repeated $T$ times, so that $T$ samples of $n$ training instances are obtained. Then, from each sample, a model $\hat{f}$ is generated by applying a base learning algorithm $A$. In terms of aggregating the outputs of the base learners and building the ensemble $E$, bagging adopts two of the most common schemes: voting (the most voted label is the final prediction) for classification, and averaging (the predictions of all the base learners are averaged to form the ensemble prediction) for regression [85]. The precision of the base learners can be estimated using the out-of-bag examples (the ones
that were not selected for training) in each iteration, allowing one to compute the error of the bagged ensemble.

Algorithm 1: Bagging pseudocode. Source: [85].
input: data set $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$;
       base learning algorithm $A$;
       number of base learners $T$
for $t = 1$ to $T$ do
    $D_{bs}$ = bootstrap sample of $D$
    $\hat{f}_t = A(D_{bs})$    % train learner on the bootstrap sample
end
output: $E(\hat{f}_1, \ldots, \hat{f}_T)$

Schapire [68] published a seminal paper in which he theoretically proved that any weak learner can potentially be boosted to a strong learner. This thesis originated the family of boosting algorithms. Synthetically, a boosting algorithm works by sequentially training learners and combining their outputs for a final prediction. However, each learner is forced to focus more on the instances poorly predicted by the previously generated learners (if any), by assigning a weight to each instance based on the error $e_r$. This weight then influences the instances selected for $D_{r+1}$. Algorithm 2 presents the pseudocode for a boosting algorithm.

Algorithm 2: Boosting pseudocode. Source: [85].
input: sample distribution $d$;
       base learning algorithm $A$;
       number of learning rounds $R$
$D_1 = d$
for $r = 1$ to $R$ do
    $\hat{f}_r = A(D_r)$    % train weak learner
    $e_r$ = EvaluateError($\hat{f}_r$)
    $D_{r+1}$ = AdjustDistribution($D_r$, $e_r$)
end
output: $E$ = CombineOutputs($\hat{f}_1, \ldots, \hat{f}_R$)

By analysing the pseudocode presented in Algorithm 2, one can see that there are two main design choices: how to adjust the distribution and how to combine the outputs. There are several boosting algorithms in the literature with different procedures for these tasks [85], AdaBoost [23] being the most influential one.

Bagging and boosting exploit variation in the data in order to achieve greater diversity (and accuracy) in the final predictions. However, other ensemble methods exploit differences among learners. The concept of stacked generalization is due to Wolpert [82]. Stacking starts by generating a set of models ($\hat{f}_1, \ldots, \hat{f}_t$) from a set of learning algorithms ($A_1, \ldots, A_t$) and a dataset $D$. Then, a meta-dataset is generated by replacing each base-level instance by the predictions of the models (this can lead to overfitting; to avoid the problem, it is often recommended to exclude the base-level instances from the meta-dataset and train the stacked model on new data [85]). This new dataset is then presented to a learning algorithm that relates the predictions of the base-level models with the target output. A prediction from a stacking model is obtained by making the base-level models predict an output, building a meta-instance and feeding it to the meta-learner, which provides the final prediction. The stacking framework was later improved with important contributions at the level of meta-feature extraction and the selection of the meta-level algorithm [20].

Cascade generalization originated from the work of Gama and Brazdil [27]. Here, the models are used in sequence rather than in parallel as in stacking: the output of the first generated model feeds the second model; the outputs of the first and second models feed the third model, and so on. Meta-decision Trees are due to Todorovski and Dzeroski [74]. In this method, a decision tree is generated where each node corresponds to a model. This metamodel is induced using as attributes the class distribution properties extracted from the base-level examples, allowing one meta-example per base-level example. One of the advantages of Meta-decision Trees is that they provide some insight about base-level learning and the area of expertise of each model. This work is further detailed in Section 3.3. Other ensemble methods present in the literature, although with less impact in the research community, are cascading [2], delegating [22] and arbitrating [54].

2.3 Ensemble Generation

The first phase of developing an ensemble is model generation. If the models are generated using the same induction algorithm, the ensemble is called homogeneous; otherwise, in case the models are generated using different induction algorithms, the ensemble is heterogeneous. Higher diversity is expected when developing heterogeneous ensembles, thus assuring a more accurate ensemble [28].
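Algorithm 1 can be sketched as a minimal Python implementation of bagging with averaging integration. The base learner here is deliberately trivial (it predicts the mean target of its bootstrap sample), and all names are illustrative, not from the proposal:

```python
import random

def bagging(train, learn, T, seed=0):
    """Train T models on bootstrap samples of `train` and integrate by averaging.
    `learn` maps a data set [(x, y), ...] to a prediction function."""
    rng = random.Random(seed)
    n = len(train)
    models = [learn([train[rng.randrange(n)] for _ in range(n)]) for _ in range(T)]
    return lambda x: sum(m(x) for m in models) / T  # averaging for regression

def mean_learner(sample):
    """Toy base learner: always predicts the mean target of its sample."""
    m = sum(y for _, y in sample) / len(sample)
    return lambda x: m

data = [(i, float(i)) for i in range(10)]   # targets 0.0 ... 9.0
ensemble = bagging(data, mean_learner, T=25)
prediction = ensemble(0)                    # close to the overall mean, 4.5
```

A real implementation would use an unstable base learner (e.g., a decision tree), since the method relies on the sensitivity of the learner to changes in the training sample.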
However, obtaining that diversity with different induction algorithms can be more difficult than with just one. Diversity is achieved either by manipulating the data or by manipulating the model generation process.

Data manipulation

Data manipulation for ensemble generation can be divided into three different sub-groups: subsampling from the training set, input features manipulation and output targets manipulation. The first consists in using different subsamples from the training set to generate different models. This
method takes advantage of the instability of some learning algorithms [9]. Given some randomness in the inductive process of a learning algorithm and its sensitivity to changes in the training set, one can manipulate the generation of models to obtain a diverse ensemble. Two well known ensemble learning techniques that use this method are boosting [68] and bagging [8]. Several methods were developed for input features manipulation, the simplest one being random feature selection. More complex techniques are noise injection [47], which consists in adding Gaussian noise to the inputs; iterative search methods for feature selection [81]; and rotation forests [63], a method that combines selection and transformation of features using Principal Component Analysis (PCA). Output targets manipulation is a research field with very few publications. Leo Breiman authored the most important contributions: output noise injection [11], which essentially consists in adding Gaussian noise to the output variable of the training set; and iterated bagging [13]. The latter technique consists of initially generating a model and computing its residuals; a second model is then generated with the output target being the residuals of the first model; this iterative process is repeated several times to develop the ensemble.

Model generation

Achieving diversity by manipulating model generation can be done through three techniques: different parameter sets, induction algorithm manipulation or final model manipulation. The vast majority of learning algorithms are sensitive to parameter changes. The number of parameters is highly dependent on the selected algorithm. In order to achieve a diverse set of models, one must focus on the most sensitive parameters of the algorithm. Works on neural network [59] and k-nearest neighbors [84] ensemble generation show the effectiveness of this technique.
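The parameter-set technique just described can be sketched with a toy k-nearest-neighbors regressor, where varying the sensitive parameter k yields a homogeneous but diverse ensemble (pure Python, illustrative names only):

```python
def knn_regressor(train, k):
    """Toy 1-D k-nearest-neighbors regressor: averages the targets
    of the k training points closest to the query."""
    def predict(x):
        neighbors = sorted(train, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in neighbors) / k
    return predict

train = [(0, 0.0), (1, 1.0), (2, 4.0), (3, 9.0), (4, 16.0)]   # y = x^2
ensemble = [knn_regressor(train, k) for k in (1, 3, 5)]       # same algorithm, varied parameter
prediction = sum(m(2.5) for m in ensemble) / len(ensemble)    # averaged ensemble output
```

The three members place their errors in different parts of the input space, which is precisely the diversity that parameter manipulation is meant to induce.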
Approaches for ensemble generation by manipulation of the induction algorithm fall into two main categories: sequential and parallel. In sequential approaches [67][33], the generated models are only influenced by the previous ones. The main feature of these techniques is the use of a decorrelation penalty term in the error function of the ensemble to increase diversity. Making use of the decomposition of the generalization error of an ensemble, the training of each network tries to minimize a function that has a covariance component, thus decreasing the generalization error. In parallel approaches, the generation of the models includes an exchange of information and is usually guided by an evolutionary framework [45]. Two distinct parallel techniques are the infinite ensemble of Support Vector Machine models [44] (the main concept is to create a kernel that gathers all the possible models in the hypothesis space) and Random Forests, which combines the bagging method with
random feature selection on the generated trees. Model manipulation is a less studied topic. This group of techniques focuses on modifying a model in some way so that its performance is boosted (e.g., given a set of rules produced by one single learning process, one can repeatedly sample the set of rules and build $n$ models [34]).

2.4 Ensemble Pruning

Ensemble pruning consists of eliminating models from the ensemble, with the aim of improving its predictive ability or reducing computational costs. Research on ensemble pruning is divided into five categories: exponential search, randomized search, sequential search, ranking pruning and clustering pruning. Here, we follow [49] very closely.

Exponential search pruning refers to the group of algorithms that try to find the optimal set of $k$ models from a pool of $K$ models to integrate an ensemble. The search space of this problem is very large and the problem is NP-complete. For small values of $k$, this can be a good approach. However, in most cases, this approach gives poor results in comparison with other pruning algorithms and has a very high computational cost. Randomized search pruning algorithms integrate an evolutionary framework in their process to search for a solution that is better than a random one. The GASEN (Genetic Algorithm based Selective ENsemble) [87] algorithm presented very promising results in a classification context.

The search algorithm of a sequential pruning process can be called forward (if the search begins with an empty ensemble and adds models to the ensemble in each iteration), backward (if the search begins with all the models in the ensemble and eliminates models from the ensemble in each iteration) or forward-backward (if the selection can have both forward and backward steps). Comparative studies show that CwE (Constructive with Exploration) [19] presents very robust results.
In this algorithm, each time a new candidate model is to be added to the ensemble, all candidates are tested and the one that leads to the maximal improvement of the ensemble performance is selected. When no model in the pool improves the ensemble performance, the selection stops. Ranking pruning algorithms sort the models according to a certain criterion and generate an ensemble containing the top $k$ models in the ranking. Most of the algorithms in this category are rather simple and they do not seem to be competitive with state-of-the-art pruning techniques. Clustering algorithms for ensemble pruning rely on grouping the models into several clusters and choosing representative models (one or more) from each cluster. A good example of this type of algorithm is ARIA (Adaptive Radius Immune Algorithm) [19]. Here, just the most accurate model
from each cluster is selected.

2.5 Ensemble Integration and the Dynamic Approach

Ensemble integration focuses on how to combine the outputs of the models previously generated for an ensemble in order to obtain one final prediction. Techniques for ensemble integration are divided into two main categories: constant weighting functions and non-constant weighting functions [49]. In the former, the weights assigned to each model in the ensemble are constant values; in the latter, the weights vary according to the instance to be predicted. We will pay particular attention to the non-constant weighting functions and, more particularly, the dynamic approach.

Constant weighting functions

Naturally, techniques for ensemble integration differ substantially between regression and classification. In the case of classification, the most frequent techniques are majority voting (for binary problems, the final prediction is the label that received more than half of the votes; otherwise, the output is the rejection option, usually the most frequent class), plurality voting (the final prediction is the label with the largest number of votes), weighted voting (a weight is assigned to each learner according to its past performance) and soft voting (here, the output of the classifiers is a probability instead of a label, so the techniques used in regression can be applied). The most frequent techniques for regression are averaging (given a set of base learners, the final prediction is the average of the predictions made by the learners) and weighted averaging (given a set of base learners, the final prediction is obtained by averaging the outputs of the different learners with different weights, implying different importance). Usually the weights are estimated from the past performance of the base learners on some validation data. The great drawback of the most simplistic techniques for ensemble integration is the multicollinearity problem [56].
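The simplest constant weighting functions are easy to state in code; below is a sketch of plurality voting and weighted averaging with toy inputs (illustrative names):

```python
from collections import Counter

def plurality_vote(labels):
    """Classification: the label with the largest number of votes is the prediction."""
    return Counter(labels).most_common(1)[0][0]

def weighted_average(preds, weights):
    """Regression: combine the learners' outputs using fixed importance weights."""
    return sum(w * p for w, p in zip(weights, preds)) / sum(weights)

print(plurality_vote(["a", "b", "a", "c", "a"]))                     # a
print(round(weighted_average([1.0, 2.0, 4.0], [0.5, 0.3, 0.2]), 2))  # 1.9
```

Note that the weights are constant per model, regardless of the instance being predicted; the non-constant weighting functions drop exactly this restriction.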
However, several techniques have been proposed that circumvent this problem. Caruana et al. [18] combined the ensemble integration phase with the ensemble generation one by implicitly calculating the weights as the number of times that each model is selected over the total number of models in the ensemble. Breiman [10] presented a regression version of the original stacking framework. To avoid the multicollinearity problem, he used ridge regression as the stacked model under the constraint that the coefficients of the regression (in other words, the weights for each model in the ensemble) need to
be non-negative. Although the results were not great, an important contribution of Breiman is the empirical observation that most of the weights are equal to zero, which reinforces the need for ensemble pruning. Merz and Pazzani [52] presented a technique that uses principal component analysis for ensemble integration, named PCR*. After the principal components (PC) are obtained, the method orders the PCs as a function of the variance they explain, making the selection of the PCs much easier. In an important study covering several constant weighting function techniques, PCR* showed very consistent results [51].

Non-constant weighting functions

This category of weighting functions can be divided into static (defined at learning time) and dynamic (defined at prediction time). The most important contribution for static non-constant weighting functions was previously mentioned in Section 2.2 (and further detailed in Section 3.3): Meta-Decision Trees. However, one must acknowledge that this work was developed for classification; a regression version would require a different choice of meta-attributes. The dynamic approach for non-constant weighting functions has been receiving an increasing amount of attention in the research community [49]. The motivation for this technique is that different models in the ensemble may perform differently on different regions of the input space. One must distinguish the concepts of dynamic selection (DS) and dynamic weighting (DW): while the former concerns the selection of the models in an ensemble that are going to make a prediction, the latter focuses on how to combine the predictions of those models. Figure 2 shows a scheme for dynamic selection of models. The technique suggests that, given an input X, similar data is selected from a validation set. This process is usually guided by some distance metric, such as the Euclidean distance with the k-nearest neighbors algorithm.
Then, one or more models are selected from the ensemble given their past performance on the similar data. After model selection, the predictions can be combined in some way to make the final prediction. The first paper concerning DS of classifiers is due to Ho et al. [32]. In this work, the authors proposed a selection based on a partition of the training examples. The individual classifiers are evaluated on each partition to find the best one for each. Then, the test instance to be predicted is categorized into a partition and classified by the corresponding best classifier. A full dynamic approach was introduced by Merz [50] in a paper in which DS of classifiers was combined with DW of the predictions. Results showed that a simple majority combination was superior to their dynamic approach. Woods [83] used a very similar approach but the results (with
Figure 2: Dynamic selection. Source: [48].

4 different datasets) were better. Tsymbal [75][77] combined dynamic integration with classifier ensembles using bagging and boosting algorithms. Results suggest that dynamic integration significantly improves the performance of the ensembles compared to the more typical majority voting integration. Tsymbal also presented experiments in which a dynamic integration approach was better than the simple majority combination in Random Forests on some datasets [76]. Rooney et al. [65] extended dynamic integration to regression problems. They claim that dynamic integration techniques are as effective for regression as stacked regression when the base models are simple. In another paper by the same authors [66], they combined the random subspace method (the training data is transformed to contain different random subsets of the variables) with stacked regression and dynamic integration. Again, for simple models like linear regression and k-nearest neighbors, these techniques are more effective than bagging and boosting. Later, Rooney and Paterson [64] proposed a combination of stacking and dynamic integration for regression problems named wmetacomb. Preliminary results were promising. Ko et al. [38] and Moreira et al. [48] presented studies in which several variants of dynamic selection and integration are experimented with. The former showed comparisons of dynamic classifier selection and dynamic ensemble selection; the results (no statistical verification was carried out) suggested that, using weak classifiers, dynamic ensemble selection can marginally improve accuracy, but does not always perform better than dynamic classifier selection. The latter, in a regression task, also found evidence that dynamically selecting several models for the prediction task increases prediction accuracy compared to selecting just one model. They also claim that using similarity measures according to the target values improves results.
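The dynamic selection scheme of Figure 2 can be sketched in a few lines. This is a minimal illustration under my own assumptions (Euclidean k-nearest neighbors on a validation set, local accuracy as the competence measure, plurality vote over the selected models), not the method of any specific paper cited above:

```python
import numpy as np

def dynamic_select_predict(models, X_val, y_val, x, k=5, n_select=1):
    """For a query x: find its k nearest validation instances, rank the models
    by accuracy on that neighbourhood, keep the n_select most competent ones,
    and combine their predictions by plurality vote."""
    dists = np.linalg.norm(X_val - x, axis=1)          # Euclidean distances
    nn = np.argsort(dists)[:k]                         # k nearest neighbours
    competence = np.array([(m.predict(X_val[nn]) == y_val[nn]).mean()
                           for m in models])
    chosen = np.argsort(competence)[::-1][:n_select]   # most competent models
    votes = np.array([int(models[i].predict(x[None, :])[0]) for i in chosen])
    return int(np.bincount(votes).argmax())
```

Setting n_select=1 gives dynamic classifier selection; n_select>1 gives a small dynamically selected ensemble, mirroring the DS/DES distinction discussed above.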
Liyanage et al. [46] proposed a dynamically weighted ensemble classification (DWEC) framework whereby an ensemble of multiple classifiers is trained on clustered features. The decisions from these multiple classifiers are dynamically combined based on the distances of the cluster centres to each test data sample being classified. Results showed that their method is significantly better than a Support Vector Machine baseline classifier.
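Dynamic weighting, the DW counterpart of the selection sketch above, can be illustrated analogously for regression. This is my own minimal sketch, assuming per-instance weights inversely proportional to each model's error on the query's validation neighbourhood; it is not the DWEC method itself:

```python
import numpy as np

def dynamic_weighted_predict(models, X_val, y_val, x, k=5, eps=1e-12):
    """Dynamic weighting sketch for regression: weight each model by the
    inverse of its mean absolute error on the k validation instances nearest
    to x, then return the weighted average of the models' predictions."""
    nn = np.argsort(np.linalg.norm(X_val - x, axis=1))[:k]
    local_err = np.array([np.abs(m.predict(X_val[nn]) - y_val[nn]).mean()
                          for m in models])
    w = 1.0 / (local_err + eps)                 # lower local error, higher weight
    w /= w.sum()
    preds = np.array([float(m.predict(x[None, :])[0]) for m in models])
    return float((w * preds).sum())
```

Unlike the constant weighting functions of the previous subsection, the weights here are recomputed for every test instance, which is exactly what makes the approach dynamic.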
3 Metalearning

Metalearning (MtL) is the study of principled methods that exploit metaknowledge to obtain efficient models and solutions by adapting machine learning and data mining processes [5]. Rendell and Cho [61] published the first experiments at the meta-level for classification. They characterized a classification problem and studied its impact on algorithm behavior, extracting metafeatures related to the size and concentration of the labels. In the following years, research in MtL was boosted by two European projects: StatLog and METAL. The former provided an assessment of the strengths and weaknesses of several classification techniques, while the latter focused on the development of a MtL assistant for providing user support in machine learning and data mining, both for classification and regression problems [70]. The key issue in MtL is metaknowledge: the experience or knowledge gained from one or more data mining tasks. Typically, this knowledge is not made available to improve the task or to assist in the following data mining tasks. Therefore, MtL concentrates on the effective application of knowledge about learning systems to understand and improve their performance. Figure 3 shows a typical MtL process of knowledge acquisition. Initially (from A to B), the process starts with the user having a group of datasets and the extraction of data characteristics or metafeatures (this topic will be detailed in 3.1). Then follow the steps of a normal data mining project: preprocessing and experiments (C), choosing a learning strategy (D) and, finally, the evaluation phase (E). Both stages D and E also provide metaknowledge to be stored in a meta-dataset (F).

Figure 3: Metalearning: knowledge acquisition. Adapted from: [80].
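The knowledge-acquisition loop of Figure 3 amounts to pairing each dataset's metafeatures with each algorithm's estimated performance. A minimal sketch (all names and the callable interfaces are my own illustrative assumptions):

```python
def build_metadataset(datasets, algorithms, extract_metafeatures, evaluate):
    """Knowledge-acquisition sketch (steps A to F): for each base-level dataset,
    pair its metafeatures with each algorithm's estimated performance, yielding
    one meta-example per (dataset, algorithm) pair.

    datasets: list of (name, X, y) tuples.
    algorithms: dict mapping an algorithm name to an estimator factory.
    extract_metafeatures: callable (X, y) -> dict of metafeature values.
    evaluate: callable (factory, X, y) -> performance estimate.
    """
    meta_examples = []
    for name, X, y in datasets:
        metafeatures = extract_metafeatures(X, y)        # step B
        for algo_name, factory in algorithms.items():    # steps C-D
            meta_examples.append({
                "dataset": name,
                **metafeatures,
                "algorithm": algo_name,
                "performance": evaluate(factory, X, y),  # step E
            })
    return meta_examples                                 # step F: the metadata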
Some authors consider that concepts like boosting, bagging or stacking (in other words, model combination methods) are a form of MtL (or meta-learning). However, in this work, we refer to
MtL as the process of using metadata for improving and better understanding data mining processes. Within this notion of MtL, one of the key concepts is learning bias: it refers to any preference for choosing one hypothesis explaining the data over other (equally acceptable) hypotheses, where such preference is based on extra-evidential information independent of the data. MtL studies how to choose the most adequate bias dynamically. Bias can be divided into two schemes: declarative and procedural. The former refers to the representation of the space of hypotheses and affects the size of the search space (e.g., specifying the use of linear functions). The latter imposes constraints on the ordering of the inductive hypotheses (e.g., preferring smaller hypotheses) [5]. Another key point in MtL is metadata, namely the metatarget and the extracted metafeatures (or data characteristics). The metatarget needs to be a variable that carries information about the performance of a given algorithm on a given dataset. On the other hand, metafeatures need to properly characterize the datasets in order to achieve an accurate and reliable meta-level model. These can be divided into three types: simple, statistical and information-theoretic measures; model-based measures; and landmarkers. Further details on this subject are presented in 3.1. MtL applications cover several distinct tasks, namely algorithm recommendation, development of systems to support the KDD process, combination of base-learners, bias management in data streams, transfer of knowledge and development of complex systems with domain-specific metaknowledge [5]. In this work, we will focus on the topic of recommendations for data mining, given that it is the most related to our project goals. For a full exposition of MtL applications, we refer the reader to [5]. Last but not least, there is some work in the state of the art that relates ensemble methods with MtL. This is, however, a very restricted group.
We will provide some insights into those works in 3.3 and expand on their contributions in our future work.

3.1 Metadata

Generating the metadata is the most important step in a MtL process. Besides choosing the appropriate metatarget for the task, it is crucial to select meaningful metafeatures that contain information for successfully achieving the main goal. For that, it is very important to take into consideration both task-dependent and algorithm-specific metafeatures [5]. For instance, if the base-level task is classification, the choice of metafeatures should be different than for a regression problem. Moreover, one should also consider the set of algorithms that the MtL system needs to relate. For example, the proportion of symbolic features should be meaningful to differentiate between a neural network
and a naïve Bayes. It is acknowledged that neural networks usually present good performance when the dataset contains several numeric variables. On the other hand, naïve Bayes is better suited to symbolic attributes. Each example in a meta-dataset represents a learning problem. As in any other learning task, MtL needs a satisfactory number of examples in order to induce a reliable model. The number of meta-examples is often seen as a problem for MtL [5].

Metatarget

Concerning the development of a MtL system, the first decision that must be made is about the type of metatarget, in other words, the dependent variable of the meta-level learning process. This variable can take several forms depending on the main goal of the MtL system and the nature of the base-level task (i.e., classification, regression, etc.). The simplest form of metatarget is a classification scheme (binary or multi-class, depending on the number of algorithms) in which, for a given dataset, the metamodel predicts the class that represents the algorithm with the best performance from a set. The great disadvantage of this type of metatarget is that if the metamodel fails in its prediction, the costs can be very high. Another type of metatarget is one in which, instead of a single recommendation, the metamodel suggests a subset of algorithms. Given the algorithm with the expected best performance, a heuristic measure can be defined to indicate the algorithms that also perform well in comparison with the best one. Typically, these metamodels are induced with rules or Inductive Logic Programming [37]. The previous type of metatarget provides several recommendations to the user. However, they are not ordered, which can negatively influence the data mining process. Therefore, algorithm recommendation in the form of rankings seems a good alternative. A MtL method that provides recommendations in the form of rankings is proposed in [7].
The system includes an adaptation of the k-nearest neighbors algorithm that identifies algorithms which are expected to tie, providing a reduced ranking by only including one of them in the recommendation. Here, the metamodel also takes the form of a classifier. Finally, if one is interested in a concrete value for the performance of an algorithm on a dataset, rather than the relative performance of a set of algorithms, the metatarget can be defined as estimates of performance. In this case, the MtL problem takes the form of a regression, one for each base-algorithm. Besides providing more detailed information to the user, this type of metatarget also allows transforming the output of the metamodels into any of the recommendation forms mentioned above.
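The relationship between the metatarget forms can be made concrete: given per-algorithm performance estimates for one dataset, the other forms fall out directly. A minimal sketch (the function name and the near-best tolerance heuristic are my own illustrative assumptions):

```python
import numpy as np

def metatargets_from_estimates(estimates, tol=0.02):
    """From per-algorithm performance estimates for one dataset, derive the
    three metatarget forms discussed above: the single best algorithm, the
    subset of algorithms within `tol` of the best, and a full ranking."""
    names = list(estimates)
    scores = np.array([estimates[n] for n in names])
    ranking = [names[i] for i in np.argsort(scores)[::-1]]   # descending
    best = ranking[0]
    near_best = [n for n in names if estimates[n] >= estimates[best] - tol]
    return best, near_best, ranking
```

This is the direction discussed next: whether such transformed estimates beat metamodels trained directly on the ranking or classification metatarget is an empirical question.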
There is a certain lack of studies comparing the forms of metatarget. In fact, the work done on this subject is contradictory. Köpf et al. [39] found evidence that the transformation of the estimates of performance provided by regression metamodels is not a good option. On the other hand, Bensusan and Kalousis [4] provide evidence that better rankings can be obtained through the transformation of estimates of performance than by using a ranking algorithm. Further research is needed for a full comparison of methods.

Metafeatures

Defining the metafeatures is probably the most important task in a MtL problem. If the data characterization does not provide useful information, the probability of success of the MtL system is highly reduced. Brazdil et al. [5] defined three fundamental requirements that every metafeature should fulfill:

Discriminative power. The metafeatures need to contain information that distinguishes between the base-algorithms in terms of their performance.

Computational complexity. The computation of the metafeatures should not be too demanding; otherwise, it may not pay off to build a MtL system when one could instead spend those resources exploring all the hypotheses for a given learning problem. Pfahringer et al. [58] suggested that the computational complexity of extracting metafeatures should be at most O(n log n).

Dimensionality. Given that the number of meta-examples in a MtL problem is usually small, the number of metafeatures should not be too large or overfitting may occur. Kalousis and Hilario [36] found evidence that feature selection can improve a MtL process, which supports this claim.

As mentioned before, metafeatures can be divided into three types (Figure 4):

Simple, statistical and information-theoretic. These are the most common type of metafeatures, extracted using descriptive statistics and information-theoretic measures.
Some examples: number of features/examples, number of instances with missing values (simple); mean skewness of numeric features, mean value of correlation (statistical); class entropy, mutual information of symbolic features (information-theoretic). We refer the reader to [39][26] for more examples.

Model-based. Here, metafeatures are extracted based on properties of the induced model. Examples: the number of leaf nodes in a decision tree [55] or the mean of the off-diagonal values of a kernel matrix in a Support Vector Machine model [71].
Landmarkers. This type of metafeature consists of quick estimates of an algorithm's performance. They can be obtained in three different ways: through the run of simplified versions of an algorithm [3][58] (e.g., a decision stump); through quick performance estimates on a sample of the data, also called subsampling landmarkers [25]; and, finally, through an ordered sequence of subsampling landmarkers for a single algorithm, which allows forming the so-called learning curve of an algorithm. In this case, not only the estimates can be used as metafeatures but also the shape of the curve [43].

Figure 4: Metafeatures. Source: adapted from [5].

3.2 Metalearning for Data Mining Applications

Rendell et al. [62][61] published the earliest works in which the expression meta-learning is used in Machine Learning. In the first paper [62], they proposed the Variable Bias Management System (VBMS). Here, the problem of algorithm recommendation is studied for the first time and the need for methods that develop models with different biases is identified. However, the experiment is rather preliminary: only the execution time of the algorithms is considered, the metafeatures are very simple (e.g., number of examples) and the evaluation carried out is clearly insufficient. In the second paper [61], the data characterization was more detailed and set the roots for the MtL research of the following years, boosted by the already mentioned European projects, StatLog and METAL.
In the StatLog project, besides great developments in the scope of data characterization [26], MtL was used to predict the applicability of learning algorithms to a given dataset [6]. By applicability, we mean assessing whether the performance of a learning algorithm is significantly different from that of the best algorithm on the corresponding dataset. The METAL project allowed the research on MtL to develop, focusing more on the problem of ranking recommendations of algorithms [7]. Smith et al. [69] used a self-organising map to cluster 57 classification datasets based only on metafeatures. Each cluster was then inspected to identify common metafeatures and to evaluate the performance of different algorithms within each cluster. Rules were extracted from a statistical analysis of the clusters. Kalousis et al. [35] published a very interesting work in which the authors looked for similarities between algorithms by means of error correlation, and similarities between datasets based on patterns of error correlation and relative performance of algorithms. Their main goal was not predictive performance at the meta-level, but rather to gain understandable insights. They found that the most discriminatory variables were:

- data availability, curse of dimensionality: number of examples, ratio of number of examples to number of classes
- class distribution: class entropy and normalized class entropy
- information content: uncertainty coefficient of attributes and class

However, the authors highlight that these findings must be viewed carefully. They only studied the problem from a classification perspective with 80 datasets, mainly from UCI. Also in the scope of classification, Ali and Smith [1] used 112 datasets from UCI to induce a C4.5 decision tree, producing rules for each algorithm with an average accuracy of 10-fold cross-validation testing exceeding 80% in predicting the best algorithm.
Most of the metafeatures in that study were simple, statistical and information-theoretic. Recently, the problem of meta-example generation has been addressed by Prudêncio and Ludermir [60] as an Active Learning task, more particularly, Active Metalearning. They used this method to reduce the set of meta-examples by selecting only the most relevant problems for meta-example generation. The combination of different Uncertainty Sampling methods to select the most informative meta-examples, together with a previous application of an Outlier Detection method to remove outliers, presented gains in MtL performance.
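Many of the metafeatures recurring in the studies above (number of examples, class entropy, skewness of numeric features) are cheap to compute, which is precisely why they satisfy the computational-complexity requirement of Section 3.1. A minimal sketch (function names are my own), assuming numeric features:

```python
import numpy as np

def class_entropy(y):
    """Information-theoretic metafeature: entropy of the class distribution (bits)."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def simple_statistical_metafeatures(X, y):
    """A handful of simple, statistical and information-theoretic metafeatures
    for a numeric dataset."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    skew = (centered ** 3).mean(axis=0) / (X.std(axis=0) ** 3 + 1e-12)
    return {
        "n_examples": X.shape[0],                         # simple
        "n_features": X.shape[1],                         # simple
        "mean_abs_skewness": float(np.abs(skew).mean()),  # statistical
        "class_entropy": class_entropy(y),                # information-theoretic
    }
```

Model-based metafeatures and landmarkers would require fitting base-level models first, so they trade extra computation for (potentially) more discriminative power.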
For the problem of algorithm parameter recommendation, Soares et al. [72] proposed a MtL method to recommend values for the width of the Gaussian kernel in a SVM. Later [71], they extended their work by showing that significant improvements could be achieved on the same problem after integrating metafeatures based on the kernel matrix. More recently, this problem of parameter recommendation for SVM has been extended by combining a MtL method with metaheuristics [29].

3.3 Metalearning for Ensemble Methods

There are very few publications regarding the application of MtL methods to EL. However, we can identify some papers that provide contributions that can be useful for further research on that particular topic. The most widely known application of MtL to EL is due to Todorovski and Džeroski [74], with the already mentioned Meta-Decision Trees (MDT). They presented an algorithm, based on C4.5, for learning a decision tree whose leaves, instead of making a prediction, specify which classifier should be used to obtain one. Their study comprised 21 classification datasets and 5 base-level classifiers, namely two algorithms for learning decision trees, a rule learning algorithm, a nearest neighbor algorithm and a naive Bayes algorithm. They tested MDTs using two types of attributes: ordinary base-level attributes and class distribution properties (CDP). The latter reflect the certainty and confidence of the predictions and can be considered metafeatures. The simplicity of the approach makes MDTs easy to interpret, and useful metaknowledge can be extracted by inspecting the trees. The CDPs used were:

- maxprob(x, C): the highest class probability (i.e.
the probability of the predicted class) predicted by the base-level classifier C for example x
- entropy(x, C): the entropy of the class probability distribution predicted by the classifier C for example x
- weight(x, C): the fraction of the training examples used by the classifier C to estimate the class distribution for example x

Experimental results show that MDTs induced from CDPs perform much better and are much more concise than MDTs induced from ordinary base-level attributes. In comparison with other EL methods, MDTs also perform better than the SCANN method for combining classifiers and the method of selecting the best single classifier. Finally, MDTs induced from CDPs perform better than boosting and bagging of decision trees and are thus competitive with state-of-the-art methods
for learning ensembles. However, the metafeatures used could further improve MDTs if other types of features were considered, namely landmarkers or simple measures. Furthermore, an adaptation to regression tasks could be an interesting line of research. Peterson and Martinez [57] propose a distance metric for finding similarity between hypotheses and learning algorithms, named Classifier Output Difference (COD). The authors define the COD distance between two hypotheses as the frequency (a real value between 0 and 1) with which they disagree on the classification of patterns. This distance between two hypotheses over a particular data set can be estimated by observing the frequency with which the hypotheses disagree with each other on the classification of the patterns from the given set, therefore

$$\widehat{COD}_T(\hat{f}_1, \hat{f}_2) = \frac{\sum_{x \in T_s} \mathbb{1}\big[\hat{f}_1(x) \neq \hat{f}_2(x)\big]}{|T_s|} \qquad (6)$$

where $T_s$ is the test set. They claim that this measure can be used to predict the potential for combining hypotheses in an ensemble to improve accuracy. Later, Lee and Giraud-Carrier [42] published a paper on unsupervised MtL in which they study the application of several ensemble learning diversity measures as distance functions for clustering learning algorithms. In their experiments, only one measure, COD, presented results indicating that it can be a good measure for this kind of task. The analysis of their results, after clustering 21 learning algorithms on 129 classification datasets, shows that this clustering differs from a clustering based on accuracy and reveals interesting similarities among learning algorithms. This can be a good line of research for seeking interpretability of ensemble performance.
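Equation (6) translates directly into code. A minimal sketch of the COD estimate, taking the two hypotheses' predictions on the test set as arrays:

```python
import numpy as np

def cod(pred_1, pred_2):
    """Classifier Output Difference (Eq. 6): fraction of test-set instances
    on which two hypotheses disagree; a real value between 0 and 1."""
    pred_1, pred_2 = np.asarray(pred_1), np.asarray(pred_2)
    return float((pred_1 != pred_2).mean())
```

A COD of 0 means the two hypotheses are behaviourally identical, so pairs with high COD are the more promising candidates for combination.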
4 Research Plan

The general goal of this project is to investigate the use of MtL for dynamic integration approaches to EL containing a very large number of models. The goals we plan to achieve are:

- use MtL as a dynamic selection method, i.e., to select a subset of a large set of models which is expected to make the most accurate prediction for a given instance.
- use MtL as a dynamic integration method, i.e., not only to select but also to combine the models (e.g., by giving different weights to different models).

Concerning the algorithms used to generate the models, two different scenarios will be investigated:

- homogeneous ensembles, i.e., models generated by varying the parameters of a single algorithm.
- heterogeneous ensembles, i.e., models generated with different algorithms.

The approaches developed will be applied to:

- benchmark problems typically used in machine learning.
- real-world applications which we are working on, including sensors in industrial equipment and trip time prediction in public transportation.

One of the main challenges of MtL is the design of suitable data characteristics (i.e., metafeatures). Useful metafeatures contain information about the data that affects the behaviour of the learning algorithm(s). The goals we plan to achieve are to extend existing work on:

- algorithm-dependent (for homogeneous ensembles) and algorithm-independent metafeatures (for heterogeneous ensembles).
- domain-independent and domain-dependent aggregated metafeatures (i.e., describing groups of instances).
- combination of aggregated metafeatures with metafeatures describing single instances.

Figure 5 shows a scheme that summarizes our approach. Our starting point will be generating ensembles.
This ensemble generation phase shall be assisted by the literature presented in Section 2.3. (Hopefully, our contributions will improve some of the state-of-the-art ensemble algorithms; there are some indicators of that possibility [76].) Then, the extraction of metadata follows: simple, statistical and information-theoretic measures from the base-level datasets; model-based (or ensemble-based) metafeatures; landmarkers (through
Figure 5: Metalearning for dynamic integration scheme.

quick estimates on a validation set) and a metatarget (here, there are several options to research, as discussed in Section 3.1). The generated metadata shall allow inducing one or several metamodels (in this case, each metamodel may have different tasks or purposes). Finally, the metamodel(s) and ensemble(s) shall be combined in order to achieve the desired dynamic selection and integration, hopefully resulting in a system with improved accuracy and improved comprehensibility of ensemble performance. The following subsections detail our research plan by decomposing it into 6 tasks.

4.1 Task 1. Foundations of Ensemble Learning and Metalearning

The application of MtL techniques and concepts can help to unveil the reasons behind the success (or failure) of multiple combinations of models and integration methods. As mentioned in Section 2, the construction of an ensemble method involves three different steps: ensemble generation, ensemble
pruning and ensemble integration. MtL can be particularly helpful in two phases (generation and integration) and can indirectly simplify the pruning process (given an accurate direction in the generation phase, the number of redundant models should be smaller). We will study the state of the art in each sub-field further. We will pay particular attention to the analysis of metafeatures proposed in the literature. This will be the basis for the study in task 2. Implementations of selected approaches will be made. This will enable us to gain a better understanding of those approaches and their behavior. It will also be the basis for the rest of the project. Expected time for this task: 6 months.

1.1. Ensemble Learning: in this sub-task we will study the state of the art of ensemble learning, with particular attention to metalearning applications for EL. Expected duration: 3 months.
1.2. Metalearning: this sub-task concerns the study of the state of the art on MtL. Expected duration: 2 months.
1.3. Report concerning the development of task 1. Expected duration: 1 month.

4.2 Task 2. Metafeatures for homogeneous ensembles

Based on the analysis carried out in task 1, we will develop metafeatures containing useful information about the data that describes the behavior of the learning algorithm(s). We will address the problem of designing new metafeatures for homogeneous ensembles first, due to the different nature of the ensembles. In this case, metafeatures must discriminate between different parametrizations of a single algorithm and are, thus, very specific. We will implement them and then empirically validate them for MtL on applications we are working on, including prediction of trip time duration of buses and soft sensors. We will also test the approach on general benchmarks (e.g., UCI data). We will write one conference paper about this work. Expected time for this task: 6 months.

2.1. Development of the metafeatures for homogeneous ensembles. Expected duration: 3 months.
2.2. Empirical validation of the developed metafeatures. Expected duration: 2 months.
2.3. Writing of the conference paper concerning task 2. Expected duration: 1 month.

For this particular task, we anticipate some lines of research with great potential. More particularly, the combination of different types of metafeatures still needs further research: most papers only deal with one or two types of metafeatures [7][26][3], and we plan to merge several approaches and study the impact on the MtL process. Furthermore, we want to open new research directions in the topic of metafeature extraction by proposing the first set of metafeatures specifically designed to characterize ensembles. For homogeneous ensembles, a starting point could be averaging the values of model-based characteristics [55], e.g., the average number of nodes of the trees in the ensemble. Another important topic to explore is the extraction of metafeatures based on the error decomposition theory for regression tasks and on error/diversity measures for classification tasks, as exposed in Section 2.1.

4.3 Task 3. Metalearning for dynamic selection of homogeneous ensembles

We will develop a MtL approach to select, from a very large set of models, which ones to combine. Combination will be done with methods typically used in EL. The implementation will be based on the implementations made in task 1 and the metafeatures will be selected from the ones developed in task 2. Empirical validation will be done on the same problems used in task 2. A conference paper will be written. Expected time for this task: 5 months.

3.1. Development of the MtL approach for dynamic selection of homogeneous ensembles. Expected duration: 2 months.
3.2. Empirical validation of the MtL approach developed in this task. Expected duration: 1 month.
3.3. Writing of the conference paper concerning task 3. Expected duration: 1 month.
In this task, several issues from the EL literature must be acknowledged, namely the multicollinearity problem (for regression problems), error decomposition and ensemble diversity. The selection of the models must take these fundamental principles into account. Several papers in the state of the art can give important initial insights [38][48], and experiments with the COD distance metric [57] can be an interesting line of research. While developing this task, we should also acknowledge the importance of choosing a learning algorithm that provides comprehensibility. We consider that providing metaknowledge through the application of MtL to ensemble methods is as important as developing systems that present superior predictive performance. Therefore, and given results already present in the literature [74], decision trees seem a good option. We would also like to explore some of the techniques mentioned in Section 2.3 in order to boost the predictive capacity of the metamodel(s).

4.4 Task 4. Metalearning for dynamic integration of homogeneous ensembles

In this task, we will go one step further, based on the results of task 3. We will use MtL not only to select the base models but also to decide how to combine them (e.g., which weight to assign to each model). The approach will be developed based on the work done in tasks 1 and 3. Empirical validation will be similar to the one carried out in task 3. A journal paper and a conference paper will be written. Expected time for this task: 5 months.

4.1. Development of the MtL approach for dynamic integration of homogeneous ensembles. Expected duration: 2 months.
4.2. Empirical validation of the developed MtL approach for task 4. Expected duration: 1 month.
4.3. Writing of the conference and journal paper concerning task 4. Expected duration: 1 month.

Literature suggests that dynamic integration of ensembles has better results for classification tasks than for regression tasks [65][66][75][77][76].
Therefore, it is important to consider this issue when planning the application of MtL methods for dynamic integration of classifiers, and while generating the metafeatures for regression problems. Here, the work of Rooney and Patterson [64] can give an important initial insight.

4.5 Task 5. Metalearning for dynamic selection and integration of heterogeneous ensembles

We will adapt the work carried out in the previous tasks to heterogeneous ensembles. The biggest challenge is to design new metafeatures for heterogeneous ensembles, which are necessarily different from the ones designed for homogeneous ensembles (task 2) due to the different nature of the two approaches. For heterogeneous ensembles, metafeatures must discriminate between the characteristics of possibly very diverse algorithms. Empirical validation will be similar to the one carried out in the previous tasks but, due to the use of diverse algorithms, more effort will be required. A journal paper and a conference paper will be written. Expected time for this task: 8 months.

5.1. Development of metafeatures for heterogeneous ensembles. Expected duration: 2 months.
5.2. Development of the metalearning approach for dynamic selection and integration of heterogeneous ensembles. Expected duration: 2 months.
5.3. Empirical validation of the developed MtL approach for task 5. Expected duration: 2 months.
5.4. Writing of the conference and journal papers concerning task 5. Expected duration: 1 month.

If we achieve success with the previous tasks, we should have important insights that can put us in the right direction. A simple but effective approach present in the literature is Todorovski and Džeroski's meta decision trees (MDTs) [74]. Their work, together with the adaptation of our previous contributions, may set the roots for the development of the methods for this task.
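To ground that starting point, here is a minimal sketch of combining a heterogeneous pool of base learners under a comprehensible decision-tree metamodel. Note that this is plain stacked generalization [82] with a tree at the meta level, considerably simpler than Todorovski and Džeroski's actual MDTs [74]; scikit-learn's `StackingClassifier` and the particular base algorithms are assumed implementation choices, not prescriptions of the proposal.

```python
# Sketch: heterogeneous ensemble with a decision-tree metamodel,
# i.e., plain stacking with a comprehensible learner at the meta level.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=2)

# Heterogeneous pool: algorithms of a very different nature.
base = [("lr", LogisticRegression(max_iter=1000)),
        ("nb", GaussianNB()),
        ("knn", KNeighborsClassifier())]

# A shallow tree at the meta level keeps the combination comprehensible.
stack = StackingClassifier(estimators=base,
                           final_estimator=DecisionTreeClassifier(max_depth=3))
stack.fit(X_tr, y_tr)
score = stack.score(X_te, y_te)
```

Inspecting the fitted tree (`stack.final_estimator_`) then yields metaknowledge about when each base algorithm's predictions dominate, which is exactly the kind of comprehensibility we argued for above.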
4.6 Task 6. Dissertation

The thesis will be based on the papers written in the other tasks. Expected duration: 9 months.

Figure 6: Schedule of the research plan.
5 Final Remarks

This thesis proposal presents a research plan for the application of Metalearning techniques to the dynamic selection and integration of ensemble models. We include a state of the art for the fields of Ensemble Learning and Metalearning, highlighting the contributions that can be relevant for our future work. The literature review allowed us to identify some lines of research that we plan to pursue.

The application of Metalearning techniques to Ensemble Methods is, per se, a topic with few publications. We believe that our research can contribute to the combination of these two fields. More particularly, we look forward to investigating causal relations that we may find between ensemble characteristics, data characteristics and ensemble performance. These can be important steps towards a better understanding of the inner mechanisms of ensemble methods.

Concerning our future work on a dynamic approach to ensemble methods, the literature still does not provide a satisfactory framework for this topic. The application of Metalearning techniques has the potential for interesting contributions. To the best of our knowledge, this is the first attempt at this type of approach. Moreover, the application of Metalearning techniques to Ensemble Learning is a very narrow field. We must pay particular attention to the very few publications already present in the literature and consider our options carefully.

In conclusion, we believe that this thesis proposal has set the roots for our research plan and has opened promising lines of work.
Bibliography

[1] Shawkat Ali and Kate A. Smith. On learning algorithm selection for classification. Applied Soft Computing, 6(2).
[2] Ethem Alpaydin and Cenk Kaynak. Cascading classifiers. Kybernetika.
[3] Hilan Bensusan and Christophe Giraud-Carrier. Discovering task neighbourhoods through landmark learning performances. In Principles of Data Mining and Knowledge Discovery. Springer.
[4] Hilan Bensusan and Alexandros Kalousis. Estimating the predictive accuracy of a classifier. In Machine Learning: ECML 2001. Springer.
[5] Pavel Brazdil, Christophe Giraud-Carrier, Carlos Soares, and Ricardo Vilalta. Metalearning: Applications to Data Mining. Springer.
[6] Pavel Brazdil, João Gama, and Bob Henery. Characterizing the applicability of classification algorithms using meta-level learning. In Machine Learning: ECML-94. Springer.
[7] Pavel Brazdil, Carlos Soares, and Joaquim Pinto da Costa. Ranking learning algorithms: Using IBL and meta-learning on accuracy and time results. Machine Learning, 50(3).
[8] Leo Breiman. Bagging predictors. Machine Learning, 24(2).
[9] Leo Breiman. Heuristics of instability and stabilization in model selection. The Annals of Statistics, 24(6).
[10] Leo Breiman. Stacked regressions. Machine Learning, 24(1):49-64.
[11] Leo Breiman. Randomizing outputs to increase prediction accuracy. Machine Learning, 40(3).
[12] Leo Breiman. Random forests. Machine Learning, 45(1):5-32.
[13] Leo Breiman. Using iterated bagging to debias regressions. Machine Learning, 45(3).
[14] Gavin Brown. An information theoretic perspective on multiple classifier systems. In Multiple Classifier Systems. Springer.
[15] Gavin Brown and Ludmila I. Kuncheva. "Good" and "bad" diversity in majority vote ensembles. In Multiple Classifier Systems. Springer.
[16] Gavin Brown, Jeremy Wyatt, Rachel Harris, and Xin Yao. Diversity creation methods: a survey and categorisation. Information Fusion, 6(1):5-20.
[17] Gavin Brown, Jeremy L. Wyatt, and Peter Tiňo. Managing diversity in regression ensembles. The Journal of Machine Learning Research, 6.
[18] Rich Caruana, Alexandru Niculescu-Mizil, Geoff Crew, and Alex Ksikes. Ensemble selection from libraries of models. In Proceedings of the Twenty-First International Conference on Machine Learning, page 18. ACM.
[19] Guilherme P. Coelho and F. J. Von Zuben. The influence of the pool of candidates on the performance of selection and combination techniques in ensembles. In International Joint Conference on Neural Networks (IJCNN 2006). IEEE.
[20] Sašo Džeroski and Bernard Ženko. Is combining classifiers with stacking better than selecting the best one? Machine Learning, 54(3).
[21] Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth. From data mining to knowledge discovery in databases. AI Magazine, 17(3):37.
[22] César Ferri, Peter Flach, and José Hernández-Orallo. Delegating classifiers. In Proceedings of the Twenty-First International Conference on Machine Learning, page 37. ACM.
[23] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1).
[24] Giorgio Fumera and Fabio Roli. A theoretical and experimental analysis of linear combiners for multiple classifier systems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6).
[25] Johannes Fürnkranz and Johann Petrak. An evaluation of landmarking variants. In Working Notes of the ECML/PKDD 2000 Workshop on Integrating Aspects of Data Mining, Decision Support and Meta-Learning, pages 57-68.
[26] João Gama and Pavel Brazdil. Characterization of classification algorithms. In Progress in Artificial Intelligence. Springer.
[27] João Gama and Pavel Brazdil. Cascade generalization. Machine Learning, 41(3).
[28] Mike Gashler, Christophe Giraud-Carrier, and Tony Martinez. Decision tree ensemble: Small heterogeneous is better than large homogeneous. In Seventh International Conference on Machine Learning and Applications (ICMLA 2008). IEEE.
[29] Taciana A. F. Gomes, Ricardo B. C. Prudêncio, Carlos Soares, André L. D. Rossi, and André Carvalho. Combining meta-learning and search techniques to select parameters for support vector machines. Neurocomputing, 75(1):3-13.
[30] Jiawei Han, Micheline Kamber, and Jian Pei. Data Mining: Concepts and Techniques. Morgan Kaufmann.
[31] Lars Kai Hansen and Peter Salamon. Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(10).
[32] Tin Kam Ho, Jonathan J. Hull, and Sargur N. Srihari. Decision combination in multiple classifier systems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(1):66-75.
[33] Md M. Islam, Xin Yao, and Kazuyuki Murase. A constructive algorithm for training cooperative neural network ensembles. IEEE Transactions on Neural Networks, 14(4).
[34] Alípio M. Jorge and Paulo J. Azevedo. An experiment with association rules and classification: Post-bagging and conviction. In Discovery Science. Springer.
[35] Alexandros Kalousis, João Gama, and Melanie Hilario. On data and algorithms: Understanding inductive performance. Machine Learning, 54(3).
[36] Alexandros Kalousis and Melanie Hilario. Feature selection for meta-learning. In Advances in Knowledge Discovery and Data Mining. Springer.
[37] Alexandros Kalousis and Theoharis Theoharis. Noemon: Design, implementation and performance results of an intelligent assistant for classifier selection. Intelligent Data Analysis, 3(5).
[38] Albert H. R. Ko, Robert Sabourin, and Alceu Souza Britto Jr. From dynamic classifier selection to dynamic ensemble selection. Pattern Recognition, 41(5).
[39] Christian Köpf, Charles Taylor, and Jörg Keller. Meta-analysis: From data characterisation for meta-learning to meta-regression. In Proceedings of the PKDD-00 Workshop on Data Mining, Decision Support, Meta-Learning and ILP.
[40] Anders Krogh and Jesper Vedelsby. Neural network ensembles, cross validation, and active learning. In Advances in Neural Information Processing Systems.
[41] Ludmila I. Kuncheva and Christopher J. Whitaker. Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy. Machine Learning, 51(2).
[42] Jun Won Lee and Christophe Giraud-Carrier. A metric for unsupervised metalearning. Intelligent Data Analysis, 15(6).
[43] Rui Leite and Pavel Brazdil. Predicting relative performance of classifiers from samples. In Proceedings of the 22nd International Conference on Machine Learning. ACM.
[44] Hsuan-Tien Lin and Ling Li. Infinite ensemble learning with support vector machines. Springer.
[45] Yong Liu, Xin Yao, and Tetsuya Higuchi. Evolutionary ensembles with negative correlation learning. IEEE Transactions on Evolutionary Computation, 4(4).
[46] Sidath Ravindra Liyanage, Cuntai Guan, Haihong Zhang, Kai Keng Ang, JianXin Xu, and Tong Heng Lee. Dynamically weighted ensemble classification for non-stationary EEG processing. Journal of Neural Engineering, 10(3):036007.
[47] Kiyotoshi Matsuoka. Noise injection into inputs in back-propagation learning. IEEE Transactions on Systems, Man and Cybernetics, 22(3).
[48] João Mendes-Moreira, Alípio Mário Jorge, Carlos Soares, and Jorge Freire de Sousa. Ensemble learning: A study on different variants of the dynamic selection approach. In Machine Learning and Data Mining in Pattern Recognition. Springer.
[49] João Mendes-Moreira, Carlos Soares, Alípio Mário Jorge, and Jorge Freire de Sousa. Ensemble approaches for regression: A survey. ACM Computing Surveys, 45(1):10.
[50] Christopher J. Merz. Dynamical selection of learning algorithms. In Learning from Data. Springer.
[51] Christopher J. Merz. Classification and regression by combining models. PhD thesis, University of California.
[52] Christopher J. Merz and Michael J. Pazzani. A principal components approach to combining regression estimates. Machine Learning, 36(1-2):9-32.
[53] Claire Cain Miller. Data science: The numbers of our lives. New York Times, April.
[54] Julio Ortega, Moshe Koppel, and Shlomo Argamon. Arbitrating among competing classifiers using learned referees. Knowledge and Information Systems, 3(4).
[55] Yonghong Peng, Peter A. Flach, Pavel Brazdil, and Carlos Soares. Decision tree-based data characterization for meta-learning. In Proceedings of the ECML/PKDD 2002 Workshop on Integration and Collaboration Aspects of Data Mining, Decision Support and Meta-Learning.
[56] Michael P. Perrone and Leon N. Cooper. When networks disagree: Ensemble methods for hybrid neural networks. Technical report, DTIC Document.
[57] Adam H. Peterson and T. R. Martinez. Estimating the potential for combining learning models. In Proceedings of the ICML Workshop on Meta-Learning, pages 68-75.
[58] Bernhard Pfahringer, Hilan Bensusan, and Christophe Giraud-Carrier. Meta-learning by landmarking various learning algorithms. In Proceedings of the Seventeenth International Conference on Machine Learning. Morgan Kaufmann.
[59] Jordan B. Pollack. Backpropagation is sensitive to initial conditions. Complex Systems, 4.
[60] Ricardo B. C. Prudêncio and Teresa B. Ludermir. Combining uncertainty sampling methods for supporting the generation of meta-examples. Information Sciences, 196:1-14.
[61] Larry Rendell and Howard Cho. Empirical learning as a function of concept character. Machine Learning, 5(3).
[62] Larry Rendell, Raj Seshu, and David Tcheng. More robust concept learning using dynamically-variable bias. In Proceedings of the Fourth International Workshop on Machine Learning, pages 66-78.
[63] Juan J. Rodríguez, Ludmila I. Kuncheva, and Carlos J. Alonso. Rotation forest: A new classifier ensemble method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(10).
[64] Niall Rooney and David Patterson. A weighted combination of stacking and dynamic integration. Pattern Recognition, 40(4).
[65] Niall Rooney, David Patterson, Sarab Anand, and Alexey Tsymbal. Dynamic integration of regression models. In Multiple Classifier Systems. Springer.
[66] Niall Rooney, David Patterson, Alexey Tsymbal, and Sarab Anand. Random subspacing for regression ensembles. In Proceedings of the 17th International Florida Artificial Intelligence Research Society Conference (FLAIRS 2004), Miami Beach, Florida.
[67] Bruce E. Rosen. Ensemble learning using decorrelated neural networks. Connection Science, 8(3-4).
[68] Robert E. Schapire. The strength of weak learnability. Machine Learning, 5(2).
[69] Kate A. Smith, Frederick Woo, Victor Ciesielski, and Remzi Ibrahim. Matching data mining algorithm suitability to data characteristics using a self-organising map. In Hybrid Information Systems. Physica-Verlag, Heidelberg.
[70] Kate A. Smith-Miles. Cross-disciplinary perspectives on meta-learning for algorithm selection. ACM Computing Surveys, 41(1):6.
[71] Carlos Soares and Pavel B. Brazdil. Selecting parameters of SVM using meta-learning and kernel matrix-based meta-features. In Proceedings of the 2006 ACM Symposium on Applied Computing. ACM.
[72] Carlos Soares, Pavel B. Brazdil, and Petr Kuba. A meta-learning method to select the kernel width in support vector regression. Machine Learning, 54(3).
[73] E. K. Tang, P. N. Suganthan, and X. Yao. An analysis of diversity measures. Machine Learning, 65(1).
[74] Ljupčo Todorovski and Sašo Džeroski. Combining classifiers with meta decision trees. Machine Learning, 50(3).
[75] Alexey Tsymbal. Decision committee learning with dynamic integration of classifiers. In Current Issues in Databases and Information Systems. Springer.
[76] Alexey Tsymbal, Mykola Pechenizkiy, and Pádraig Cunningham. Dynamic integration with random forests. In Machine Learning: ECML 2006. Springer.
[77] Alexey Tsymbal and Seppo Puuronen. Bagging and boosting with dynamic integration of classifiers. In Principles of Data Mining and Knowledge Discovery. Springer.
[78] Kagan Tumer and Joydeep Ghosh. Error correlation and error reduction in ensemble classifiers. Connection Science, 8(3-4).
[79] Naonori Ueda and Ryohei Nakano. Generalization error of ensemble estimators. In IEEE International Conference on Neural Networks, volume 1. IEEE.
[80] Ricardo Vilalta, Christophe Giraud-Carrier, Pavel Brazdil, and Carlos Soares. Using meta-learning to support data mining. International Journal of Computer Science and Applications, 1(1):31-45.
[81] Xiangyang Wang, Jie Yang, Xiaolong Teng, Weijun Xia, and Richard Jensen. Feature selection based on rough sets and particle swarm optimization. Pattern Recognition Letters, 28(4).
[82] David H. Wolpert. Stacked generalization. Neural Networks, 5(2).
[83] Kevin Woods, W. Philip Kegelmeyer Jr., and Kevin Bowyer. Combination of multiple classifiers using local accuracy estimates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4).
[84] Dragomir Yankov, Dennis DeCoste, and Eamonn Keogh. Ensembles of nearest neighbor forecasts. In Machine Learning: ECML 2006. Springer.
[85] Zhi-Hua Zhou. Ensemble Methods: Foundations and Algorithms. Chapman & Hall/CRC Data Mining and Knowledge Discovery Series. Taylor & Francis.
[86] Zhi-Hua Zhou and Nan Li. Multi-information ensemble diversity. In Multiple Classifier Systems. Springer.
[87] Zhi-Hua Zhou, Jianxin Wu, and Wei Tang. Ensembling neural networks: Many could be better than all. Artificial Intelligence, 137(1).
Penalized regression: Introduction
Penalized regression: Introduction Patrick Breheny August 30 Patrick Breheny BST 764: Applied Statistical Modeling 1/19 Maximum likelihood Much of 20th-century statistics dealt with maximum likelihood
Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification
Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification Henrik Boström School of Humanities and Informatics University of Skövde P.O. Box 408, SE-541 28 Skövde
Predict Influencers in the Social Network
Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, [email protected] Department of Electrical Engineering, Stanford University Abstract Given two persons
Classification of Bad Accounts in Credit Card Industry
Classification of Bad Accounts in Credit Card Industry Chengwei Yuan December 12, 2014 Introduction Risk management is critical for a credit card company to survive in such competing industry. In addition
Linear Threshold Units
Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear
The Data Mining Process
Sequence for Determining Necessary Data. Wrong: Catalog everything you have, and decide what data is important. Right: Work backward from the solution, define the problem explicitly, and map out the data
DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES
DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES Vijayalakshmi Mahanra Rao 1, Yashwant Prasad Singh 2 Multimedia University, Cyberjaya, MALAYSIA 1 [email protected]
E-commerce Transaction Anomaly Classification
E-commerce Transaction Anomaly Classification Minyong Lee [email protected] Seunghee Ham [email protected] Qiyi Jiang [email protected] I. INTRODUCTION Due to the increasing popularity of e-commerce
Handling missing data in large data sets. Agostino Di Ciaccio Dept. of Statistics University of Rome La Sapienza
Handling missing data in large data sets Agostino Di Ciaccio Dept. of Statistics University of Rome La Sapienza The problem Often in official statistics we have large data sets with many variables and
NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )
Chapter 340 Principal Components Regression Introduction is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates
Decompose Error Rate into components, some of which can be measured on unlabeled data
Bias-Variance Theory Decompose Error Rate into components, some of which can be measured on unlabeled data Bias-Variance Decomposition for Regression Bias-Variance Decomposition for Classification Bias-Variance
MHI3000 Big Data Analytics for Health Care Final Project Report
MHI3000 Big Data Analytics for Health Care Final Project Report Zhongtian Fred Qiu (1002274530) http://gallery.azureml.net/details/81ddb2ab137046d4925584b5095ec7aa 1. Data pre-processing The data given
Lecture 10: Regression Trees
Lecture 10: Regression Trees 36-350: Data Mining October 11, 2006 Reading: Textbook, sections 5.2 and 10.5. The next three lectures are going to be about a particular kind of nonlinear predictive model,
New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Introduction
Introduction New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Predictive analytics encompasses the body of statistical knowledge supporting the analysis of massive data sets.
Sanjeev Kumar. contribute
RESEARCH ISSUES IN DATAA MINING Sanjeev Kumar I.A.S.R.I., Library Avenue, Pusa, New Delhi-110012 [email protected] 1. Introduction The field of data mining and knowledgee discovery is emerging as a
Statistics Graduate Courses
Statistics Graduate Courses STAT 7002--Topics in Statistics-Biological/Physical/Mathematics (cr.arr.).organized study of selected topics. Subjects and earnable credit may vary from semester to semester.
A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions
A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions Jing Gao Wei Fan Jiawei Han Philip S. Yu University of Illinois at Urbana-Champaign IBM T. J. Watson Research Center
An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015
An Introduction to Data Mining for Wind Power Management Spring 2015 Big Data World Every minute: Google receives over 4 million search queries Facebook users share almost 2.5 million pieces of content
Why Ensembles Win Data Mining Competitions
Why Ensembles Win Data Mining Competitions A Predictive Analytics Center of Excellence (PACE) Tech Talk November 14, 2012 Dean Abbott Abbott Analytics, Inc. Blog: http://abbottanalytics.blogspot.com URL:
BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts
BIDM Project Predicting the contract type for IT/ITES outsourcing contracts N a n d i n i G o v i n d a r a j a n ( 6 1 2 1 0 5 5 6 ) The authors believe that data modelling can be used to predict if an
Azure Machine Learning, SQL Data Mining and R
Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:
Component Ordering in Independent Component Analysis Based on Data Power
Component Ordering in Independent Component Analysis Based on Data Power Anne Hendrikse Raymond Veldhuis University of Twente University of Twente Fac. EEMCS, Signals and Systems Group Fac. EEMCS, Signals
Risk pricing for Australian Motor Insurance
Risk pricing for Australian Motor Insurance Dr Richard Brookes November 2012 Contents 1. Background Scope How many models? 2. Approach Data Variable filtering GLM Interactions Credibility overlay 3. Model
Data Mining Analytics for Business Intelligence and Decision Support
Data Mining Analytics for Business Intelligence and Decision Support Chid Apte, T.J. Watson Research Center, IBM Research Division Knowledge Discovery and Data Mining (KDD) techniques are used for analyzing
STATISTICA Formula Guide: Logistic Regression. Table of Contents
: Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary
Detection. Perspective. Network Anomaly. Bhattacharyya. Jugal. A Machine Learning »C) Dhruba Kumar. Kumar KaKta. CRC Press J Taylor & Francis Croup
Network Anomaly Detection A Machine Learning Perspective Dhruba Kumar Bhattacharyya Jugal Kumar KaKta»C) CRC Press J Taylor & Francis Croup Boca Raton London New York CRC Press is an imprint of the Taylor
On the application of multi-class classification in physical therapy recommendation
RESEARCH Open Access On the application of multi-class classification in physical therapy recommendation Jing Zhang 1,PengCao 1,DouglasPGross 2 and Osmar R Zaiane 1* Abstract Recommending optimal rehabilitation
SPATIAL DATA CLASSIFICATION AND DATA MINING
, pp.-40-44. Available online at http://www. bioinfo. in/contents. php?id=42 SPATIAL DATA CLASSIFICATION AND DATA MINING RATHI J.B. * AND PATIL A.D. Department of Computer Science & Engineering, Jawaharlal
Adaptive Demand-Forecasting Approach based on Principal Components Time-series an application of data-mining technique to detection of market movement
Adaptive Demand-Forecasting Approach based on Principal Components Time-series an application of data-mining technique to detection of market movement Toshio Sugihara Abstract In this study, an adaptive
