Metalearning for Dynamic Integration in Ensemble Methods


Fábio Pinto
12 July 2013
Faculdade de Engenharia da Universidade do Porto
Ph.D. in Informatics Engineering
Supervisor: Doutor Carlos Soares
Co-supervisor: Doutor João Mendes Moreira

Outline

Abstract
1 Introduction
  1.1 Problem Statement
  1.2 Research Approach and Expected Contributions
  1.3 Proposal Organization
2 Ensemble Learning
  2.1 Error, Accuracy and Diversity
  2.2 Bagging, Boosting and other Popular Ensemble Algorithms
  2.3 Ensemble Generation
  2.4 Ensemble Pruning
  2.5 Ensemble Integration and the Dynamic Approach
3 Metalearning
  3.1 Metadata
  3.2 Metalearning for Data Mining Applications
  3.3 Metalearning for Ensemble Methods
4 Research Plan
  Task 1. Foundations of Ensemble Learning and Metalearning
  Task 2. Metafeatures for homogeneous ensembles
  Task 3. Metalearning for dynamic selection of homogeneous ensembles
  Task 4. Metalearning for dynamic integration of homogeneous ensembles
  Task 5. Metalearning for dynamic selection and integration of heterogeneous ensembles
  Task 6. Dissertation
5 Final Remarks

Abstract

Ensemble methods have been receiving an increasing amount of attention, especially because of their successful application to high-visibility problems (e.g., the Netflix prize). An important challenge in ensemble learning (EL) is the management of the set of models to ensure a high level of accuracy, particularly with large numbers of models and in highly dynamic environments [49]. One approach to deal with these problems in the context of EL is the dynamic approach, which consists in the selection and combination of the best subset of model(s) for each test instance. An alternative approach to find the models that are most suitable for a given set of data is metalearning (MtL). MtL uses data from past experiments to build models that relate the characteristics of learning problems with the behaviour of algorithms [5]. Thus, the general goal of this project is to investigate the use of MtL for dynamic integration approaches to EL.

1 Introduction

The world is deluged by data. The dissemination of the Internet around the globe, together with the development of ubiquitous information-sensing mobile devices, wireless sensor networks and information storage capacity, has enhanced the need to understand and extract value from the data that is being generated. Data Science, a newly coined term that brings together Statistics, Machine Learning (more particularly, Data Mining) and Computer Science, emerges as the field that can assist humans in this task [53].

Data Mining is the automated or convenient extraction of patterns representing knowledge implicitly stored or captured in large databases, data warehouses, the Web, other massive information repositories, or data streams [30]. A Data Mining project usually includes one or more of the following tasks [21]:

- Regression
- Classification
- Anomaly detection
- Association rule learning
- Clustering
- Summarization

In this project, we will focus on the first two, namely, regression and classification. In a typical regression problem¹ we have a dataset that consists of a set of n instances: {(x_1, f(x_1)), ..., (x_n, f(x_n))}. The objective is to induce a function \hat{f} from the data, where \hat{f} : X \to \mathbb{R}, such that

\hat{f}(x) \approx f(x), \quad \forall x \in X,   (1)

where f represents the unknown true function. The algorithm used to obtain the \hat{f} function is called the induction algorithm or learner. The \hat{f} function is called the model or predictor. The usual goal for regression is to minimize a squared error loss function, namely the mean squared error (MSE):

MSE = \frac{1}{n} \sum_{i=1}^{n} \left( \hat{f}(x_i) - f(x_i) \right)^2   (2)

¹ We follow [49] very closely.
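As a concrete illustration of Eq. (2), a minimal Python sketch of the MSE computation (the prediction and target values below are made up for illustration):

```python
def mse(predictions, targets):
    """Mean squared error, Eq. (2): the average squared deviation of the
    model's predictions f_hat(x_i) from the true values f(x_i)."""
    n = len(targets)
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / n

# (0^2 + 0.5^2 + 1^2) / 3 = 1.25 / 3
print(mse([1.0, 2.0, 3.0], [1.0, 2.5, 2.0]))
```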

For classification, the concept is very similar. The goal is also to induce a function \hat{f} from a set of training examples. However, in classification the output of \hat{f}(x) is a categorical variable instead of a numeric one. This has several implications that differentiate classification from regression, one of them being, naturally, the error loss function to minimize. While in regression the majority of the evaluation measures minimize a squared error loss function, in classification the evaluation measures forcibly need to be different: accuracy, precision, recall, F-score and AUC, to name a few. See [30] for further details.

Ensemble Learning is a process that uses a set of models (regression or classification), each of them obtained by applying a learning process to a given problem. This set of models is integrated in some way to obtain the final prediction [49]. EL has become increasingly popular both for regression and classification tasks. Besides the extensive research that reports great results with ensemble algorithms in a wide variety of problems [85], data mining competitions with great media coverage (e.g., the Netflix prize and the Heritage Health prize) proved that ensembles constitute the best technique for predictive modeling when the main goal lies in accuracy.

1.1 Problem Statement

The great disadvantage in applying ensemble methods is their black-box nature [82]. When combining several models for a prediction task, it is very difficult to understand how the ensemble works and to extract knowledge from the system. For instance, a decision tree is an algorithm that, besides accuracy, also provides inner knowledge through inspection of the structure of the tree. This knowledge can be very useful to understand the domain of the prediction task and even to improve the final model. Along with the comprehensibility issue, using an ensemble to obtain predictions is a very blind process.
If we have an ensemble that our evaluation methodology says to be accurate, we are going to apply that ensemble to any instance that we want to predict, regardless of its characteristics. We believe that a more dynamic approach can enhance predictive accuracy and, at the same time, provide interesting and useful knowledge. The dynamic approach can be divided into two different steps: the selection of a subset of models within an ensemble, and their integration (in other words, how to combine their predictions) to make a final prediction [48]. Both processes are carried out at prediction time, according to the characteristics of the instance that one is trying to predict. This problem has been addressed in the literature by a few researchers [64][76][50][48]. However, these approaches are sparse and the challenge remains on how to dynamically apply an ensemble

approach for predictive tasks that obtains competitive accuracies and provides insight concerning their behavior.

1.2 Research Approach and Expected Contributions

We propose to combine Metalearning techniques with Ensemble Learning in order to test the hypothesis that a better dynamic selection and integration of ensembles can be achieved. Metalearning is the study of principled methods that exploit metaknowledge to obtain efficient models and solutions by adapting machine learning and data mining processes [5]. Our approach can also give important and useful insights about the relation between the data and the performance of the ensembles, for a better understanding of their behavior. Further details on the research plan are given in Section 4.

Expected Contributions

- Metafeatures: algorithm-dependent (for homogeneous ensembles) and algorithm-independent metafeatures (for heterogeneous ensembles); domain-independent and domain-dependent aggregated metafeatures; combination of aggregated metafeatures with metafeatures describing single instances.
- Application of Metalearning to Ensemble Methods: use a metamodel that relates data characteristics, ensemble characteristics, individual model characteristics and ensemble performance in order to improve prediction accuracy and gain domain insights.
- Dynamic Selection of Ensemble Models: use Metalearning to dynamically select the best subset of models within an ensemble for a given instance.
- Dynamic Integration of Ensemble Models: use Metalearning to dynamically combine the best subset of models within an ensemble for a given instance.

1.3 Proposal Organization

This thesis proposal is organized as follows. Section 2 presents the state of the art for Ensemble Learning, giving particular attention to contributions related to the dynamic selection and integration of ensemble models. A selected overview of Metalearning is provided in Section 3. The research plan is specified in Section 4.
Finally, the proposal ends with some final remarks in Section 5.

2 Ensemble Learning

Ensemble Learning (EL) is a process that uses a set of models, each of them obtained by applying a learning process to a given problem. This set of models is integrated in some way to obtain the final prediction [49]. The two pioneering works that laid the roots for EL research presented different perspectives on the same topic: Hansen and Salamon [31] published an empirical work in which it was found that predictions made by an ensemble of classifiers (neural networks) are often more accurate than those of the best single classifier. On the other hand, Schapire [68] showed theoretically that weak learners can be combined to form a strong learner.

EL is divided into three sub-processes: ensemble generation, ensemble pruning and ensemble integration [49]. In this state of the art, we will focus particularly on ensemble integration, given that our research plan will hopefully provide contributions for that sub-process. However, an important overview is also presented for ensemble generation, and the related topics of error decomposition and diversity. Finally, a very brief description of the ensemble pruning state of the art is also provided.

2.1 Error, Accuracy and Diversity

Good ensembles must present specific characteristics: accurate predictors whose errors fall in different parts of the input space. The generalization error decomposition for regression ensembles was a very important step in understanding the behavior of such systems, and its contributions helped to guide the research in ensemble generation. For classification, there is no such unifying theory. However, there is work in progress in the field [16]. It is well accepted in the Machine Learning community that generating diverse individual classifiers is good practice to achieve accurate ensembles. Again, diversity measures for regression are forcibly different from the classification ones.
Although there are proven connections between diversity measures and accuracy, there is also evidence that raises doubts about the usefulness of such metrics in building ensembles [41]. This is a research field that still lacks a completely grounded framework.

Regression

The literature regarding the error decomposition of regression ensembles has two main schemes: the Error-Ambiguity and the Bias-Variance-Covariance decomposition.

The Error-Ambiguity decomposition was proposed by Krogh and Vedelsby [40] for an ensemble of k neural networks. Assuming that \hat{f}_f(x) = \sum_{i=1}^{k} \alpha_i \hat{f}_i(x), where \sum_{i=1}^{k} \alpha_i = 1 and \alpha_i \geq 0, i = 1, ..., k, they show that the error for one example is

(\hat{f}_f - f)^2 = \sum_{i=1}^{k} \alpha_i (\hat{f}_i(x) - f)^2 - \sum_{i=1}^{k} \alpha_i (\hat{f}_i(x) - \hat{f}_f)^2   (3)

The first term of the equation is the bias component (the error of the individual learners, given their generalization ability) and the second one is the ambiguity component (it measures the variability among the predictions of the individual learners, depending on ensemble diversity). The expression shows clearly that the ensemble generalization error is less than or equal to the generalization error of a randomly selected single predictor. This is true because the ambiguity component is always non-negative. This decomposition also shows that it is possible to reduce the ensemble generalization error by increasing the ambiguity without increasing the bias.

Ueda and Nakano [79] proposed the Bias-Variance-Covariance decomposition. Here, it is assumed that \hat{f}_f(x) = \frac{1}{k} \sum_{i=1}^{k} \hat{f}_i(x); then

E[(\hat{f}_f - f)^2] = \overline{bias}^2 + \frac{1}{k} \overline{var} + \left(1 - \frac{1}{k}\right) \overline{covar}   (4)

The expression shows that the error of the ensemble depends strongly on the covariance component, which relates to the correlation between the individual learners. If the learners make similar errors, this component will be large. Therefore, this expression shows that diversity is very important for accurate ensembles. Later, a study of the relation between the Error-Ambiguity and Bias-Variance-Covariance decompositions showed that it is not possible to maximize the ambiguity component and minimize the bias component simultaneously [17]. Thus, generating diverse learners is a complex challenge. The Bias-Variance-Covariance decomposition provides a powerful measure of regression ensemble diversity: the covariance term [17].
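The Error-Ambiguity identity in Eq. (3) can be checked numerically. The sketch below uses made-up predictions and weights, and verifies that the ensemble error equals the weighted individual error minus the ambiguity:

```python
# Hypothetical predictions f_i(x), convex weights alpha_i and true value f(x).
preds = [1.2, 0.8, 1.1]
alphas = [0.5, 0.3, 0.2]
f_true = 1.0

f_ens = sum(a * p for a, p in zip(alphas, preds))  # ensemble prediction f_f(x)
ensemble_error = (f_ens - f_true) ** 2             # left-hand side of Eq. (3)
weighted_error = sum(a * (p - f_true) ** 2 for a, p in zip(alphas, preds))
ambiguity = sum(a * (p - f_ens) ** 2 for a, p in zip(alphas, preds))

# Eq. (3): ensemble error = weighted individual error - ambiguity.
assert abs(ensemble_error - (weighted_error - ambiguity)) < 1e-12
print(ensemble_error <= weighted_error)  # True: the ambiguity is non-negative
```

Since the ambiguity term is non-negative, the ensemble error never exceeds the weighted average error of its members, as the text notes; the covariance term of Eq. (4) plays the analogous diversity role for the uniformly averaged ensemble.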
This component has already been integrated into one successful ensemble generation algorithm (Negative Correlation Learning [45]). By adding a penalty term associated with the covariance component to the mean squared error function of the ensemble, the algorithm, together with an evolutionary framework, automatically searches for learners that are not correlated with those already present in the ensemble. Moreover, one should acknowledge that the presented error decomposition schemes assume that the integration function of the learners is averaging. In the case of the non-constant weighting functions

presented in Section 2.5, these theories do not hold.

Another important topic for ensemble regression is the multicollinearity problem. This statistical phenomenon, in the context of EL, refers to the situation in which the predictions of two or more individual learners of an ensemble are highly correlated. Given the exposition provided previously, it is straightforward to conclude that this can be problematic. However, if the already mentioned principles of diversity are guaranteed, then it is possible, if not to avoid the problem completely, at least to mitigate it in the ensemble generation or ensemble pruning phase [49].

Classification

Error decomposition in regression ensembles is a well-solved problem. In classification, more research is needed. One can find work in progress trying to adapt the concepts present in regression to classification problems by choosing to approximate the class posterior probabilities [78][24]. However, for some learning algorithms, like decision trees, it is not possible to extract those probabilities: the outputs have no intrinsic ordinality. The work on this topic is divided into two directions: ordinal outputs (in which the outputs of the classifiers are taken as probabilities, as mentioned before) and non-ordinal outputs. We follow Brown very closely [16].

For ordinal outputs, the theoretical framework for analysing a classifier's error when its predictions are posterior probabilities was proposed by Tumer and Ghosh [78]. Figure 1 shows their framework. For a one-dimensional feature vector x, the solid curves show the true posterior probabilities of classes a and b, P(a) and P(b), respectively. The dotted curves show the estimates of the posterior probabilities, from one of the predictors, \hat{P}(a) and \hat{P}(b). The solid vertical line at x indicates the optimal decision boundary, and the dark shaded area is called the Bayes error. This error cannot be reduced. The dotted vertical line at \hat{x} indicates the boundary placed by our predictor.
The light shaded area indicates the added error that our predictor makes in addition to the Bayes error. Tumer and Ghosh show that the expected added error, if the decision boundary is instead placed by an ensemble, is

E_{add}^{ens} = E_{add} \left( \frac{1 + \delta(M - 1)}{M} \right)   (5)

where M is the number of classifiers and E_{add} is the expected added error of the individual classifiers (they are assumed to have the same error). The \delta term is a correlation coefficient measuring the correlation between the errors in approximating the posterior probabilities and is, thus, a diversity measure. However,

to achieve this expression, the authors make some critical assumptions. For example, they assume that the errors of the different classifiers have the same variance. Later, this work was extended by Roli and Fumera [24], where some of the assumptions were discarded, one of them being the uniformly weighted combination of the posterior probabilities.

Figure 1: Tumer and Ghosh's framework [78][16] for analysing classifier error.

For non-ordinal outputs, the state of the art still does not provide a satisfactory unifying theory. Ideally, we should have an expression that, similarly to the error decomposition in regression, decomposes the classification error rate into the error rates of the individual learners and a term that quantifies their diversity.

The lack of an error decomposition for classification in the context of non-ordinal outputs has led to several diversity measures being proposed in the literature: Disagreement, Q-statistic, Kappa-statistic and Kohavi-Wolpert variance, to name a few.² However, their usefulness has been highly questioned. Kuncheva and Whitaker [41] showed through a broad range of experiments that the existing diversity measures do not exhibit a clear relation with ensemble accuracy. Tang et al. [73] gave evidence that, compared to algorithms that seek diversity implicitly, exploiting diversity measures explicitly is ineffective for constructing strong ensembles. They also showed that diversity measures do not provide reliable information on whether the ensembles achieve good generalization performance and, at the same time, are highly correlated with average individual accuracies, which is not desirable.

More recently, two new research directions emerged for understanding ensemble diversity in a classification context: Brown and Kuncheva's [15] good and bad diversity, and information-theoretic diversity.
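Two of the pairwise measures above are simple to state. The sketch below (an illustrative implementation, not code from the proposal) follows the oracle-output formulation used by Kuncheva and Whitaker [41], where each classifier is reduced to a boolean vector marking which test instances it classified correctly:

```python
def pairwise_diversity(correct_i, correct_j):
    """Disagreement and Q-statistic for a pair of classifiers, from the
    2x2 table of joint correct/incorrect counts (n11, n00, n10, n01)."""
    pairs = list(zip(correct_i, correct_j))
    n11 = sum(a and b for a, b in pairs)          # both correct
    n00 = sum(not a and not b for a, b in pairs)  # both wrong
    n10 = sum(a and not b for a, b in pairs)      # only the first correct
    n01 = sum(not a and b for a, b in pairs)      # only the second correct
    disagreement = (n01 + n10) / len(pairs)
    q = (n11 * n00 - n01 * n10) / (n11 * n00 + n01 * n10)
    return disagreement, q

# Two classifiers that disagree on half of four test instances:
d, q = pairwise_diversity([True, True, False, True], [True, False, True, True])
print(d, q)  # 0.5 -1.0
```

Note that the Q-statistic is undefined when n11·n00 + n01·n10 = 0, one small sign of why these measures are awkward to use in practice.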
² We refer the reader to [85] for further details.

Brown and Kuncheva adopt the perspective that a diversity measure should be naturally defined as a consequence of two decisions in the design of the ensemble learning problem: the choice of error function and the choice of integration function; more particularly, a zero-one loss function and a majority voting integration scheme. The authors derive a decomposition of the majority vote error into three terms: average individual accuracy, good diversity and bad diversity. The good diversity measures the disagreement on the datapoints where the ensemble is correct; the bad diversity measures the disagreement on the datapoints where the ensemble is incorrect.

Based on interaction information (a multivariate generalization of mutual information; see [85] for further details), Brown [14] presented a decomposition of the conditional interaction information between a set of predictors and a target variable. His mathematical formulation decomposes classifier ensemble diversity into three components: relevancy (the sum of the mutual information between each classifier and the target), redundancy (which measures the dependency, independent of the target variable, among all possible subsets of classifiers) and conditional redundancy (which measures the dependency among the classifiers given the class label). The main problem of this decomposition is that there is no effective process for estimating the diversity terms. Zhou and Li [86] provided a mathematical simplification of Brown's contribution and a complex estimation method for the diversity terms, with promising results.

2.2 Bagging, Boosting and other Popular Ensemble Algorithms

Research in EL has generated algorithms that, due to their simplicity and effectiveness, have been widely adopted by the Machine Learning community and even in industry. This section gives a brief overview of the most popular ones.

Bagging stands for bootstrap aggregating and is due to Breiman [8]. This technique plays a central role in Random Forests [12], one of the most popular ensemble learning algorithms.
Algorithm 1 shows the pseudocode for the bagging algorithm. Generically, given a data set containing n training instances, a sample D_bs of n training instances is drawn with replacement. The process is repeated T times, and T samples of n training instances are obtained. Then, from each sample, a model \hat{f} is generated by applying a base learning algorithm A. To aggregate the outputs of the base learners and build the ensemble E, bagging adopts two of the most common integration schemes: voting for classification (the most voted label is the final prediction) and averaging for regression (the predictions of all the base learners are averaged to form the ensemble prediction) [85]. The precision of the base learners can be estimated using the out-of-bag examples (the ones

that were not selected for training) in each iteration, allowing one to compute the error of the bagged ensemble.

input : Data set D = {(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)};
        Base learning algorithm A;
        Number of base learners T
for t = 1 to T do
    D_bs = BootstrapSample(D)   % sample m instances from D with replacement
    \hat{f}_t = A(D_bs)         % train learner
end
output: E(\hat{f}_1, ..., \hat{f}_T)

Algorithm 1: Bagging pseudocode. Source: [85].

Schapire [68] published a seminal paper in which he theoretically proved that any weak learner can potentially be boosted into a strong learner. This result originated the family of boosting algorithms. Synthetically, a boosting algorithm works by sequentially training learners and combining their outputs for a final prediction. However, each learner is forced to focus more on the instances poorly predicted by the previously generated learners (if any), by assigning each instance a weight based on the error e_r. This weight then influences the instances selected for D_{r+1}. Algorithm 2 presents the pseudocode for a boosting algorithm.

input : Sample distribution D_1;
        Base learning algorithm A;
        Number of learning rounds R
for r = 1 to R do
    \hat{f}_r = A(D_r)                        % train weak learner
    e_r = EvaluateError(\hat{f}_r)
    D_{r+1} = AdjustDistribution(D_r, e_r)
end
output: E = CombineOutputs(\hat{f}_1, ..., \hat{f}_R)

Algorithm 2: Boosting pseudocode. Source: [85].

By analysing the pseudocode in Algorithm 2, one can see that there are two main subjective processes: adjusting the distribution and combining the outputs. There are several boosting algorithms in the literature with different procedures for these tasks [85], AdaBoost [23] being the most influential one.

Bagging and boosting exploit variation in the data in order to achieve greater diversity (and accuracy) in the final predictions. However, other ensemble methods exploit differences among learners. The concept of stacked generalization is due to Wolpert [82]. Stacking starts by generating a set of models (\hat{f}_1, ..., \hat{f}_t) from a set of learning algorithms (A_1, ..., A_t) and a dataset D. Then, a meta-dataset is generated by replacing each base-level instance by the predictions of the models.³ This new dataset is then presented to a learning algorithm that relates the predictions of the base-level models with the target output. A prediction from a stacking model is obtained by making the base-level models predict an output, building a meta-instance and feeding it to the meta-learner, which provides the final prediction. The stacking framework was later improved with important contributions at the level of meta-feature extraction and the selection of the meta-level algorithm [20].

Cascade generalization originated from the work of Gama and Brazdil [27]. Here, the models are used in sequence rather than in parallel as in stacking: the output of the first generated model feeds the second model; the outputs of the first and second models feed the third model, and so on.

Meta-decision Trees are due to Todorovski and Dzeroski [74]. In this method, a decision tree is generated where each node corresponds to a model. This metamodel is induced using as attributes the class distribution properties extracted from the base-level examples, allowing one meta-example per base-level example. One of the advantages of Meta-decision Trees is that they provide some insight about base-level learning and the area of expertise of each model. This work is further detailed in Section 3.3.

Other ensemble methods present in the literature, although with less impact in the research community, are cascading [2], delegating [22] and arbitrating [54].

2.3 Ensemble Generation

The first phase of developing an ensemble is model generation. If the models are generated using the same induction algorithm, the ensemble is called homogeneous; otherwise, if the models are generated using different induction algorithms, the ensemble is heterogeneous. Higher diversity is expected when developing heterogeneous ensembles, thus assuring a more accurate ensemble [28].
However, obtaining that diversity with different induction algorithms can be more difficult than with just one. Diversity is achieved either by manipulating the data or the model generation process.

Data manipulation

Data manipulation for ensemble generation can be divided into three different sub-groups: sub-sampling from the training set, input features manipulation and output targets manipulation. The first consists in using different subsamples from the training set to generate different models. This method takes advantage of the instability of some learning algorithms [9]. Given some randomness in the inductive process of a learning algorithm and its sensitivity to changes in the training set, one can manipulate the generation of models to obtain a diverse ensemble. Two well-known ensemble learning techniques that use this method are boosting [68] and bagging [8].

Several methods were developed for input features manipulation, the simplest one being random feature selection. More complex techniques are noise injection [47], which consists in adding Gaussian noise to the inputs; iterative search methods for feature selection [81]; and rotation forests [63], a method that combines selection and transformation of features using Principal Component Analysis (PCA).

Output targets manipulation is a research field with very few publications. Leo Breiman authored the most important contributions: output noise injection [11], which essentially consists in adding Gaussian noise to the output variable of the training set; and iterated bagging [13]. The latter technique consists of initially generating a model and computing its residuals; a second model is then generated with the output target being the residuals of the first model; this iterative process is repeated several times to develop the ensemble.

Model generation

Achieving diversity by manipulating model generation can be done through three techniques: different parameter sets, induction algorithm manipulation or final model manipulation. The vast majority of learning algorithms are sensitive to parameter changes. The number of parameters is highly dependent on the selected algorithm. In order to achieve a diverse set of models, one must focus on the most sensitive parameters of the algorithm. Works on neural network [59] and k-nearest neighbor [84] ensemble generation show the effectiveness of this technique.

³ This can lead to overfitting. To avoid this problem, it is often recommended to exclude the base-level instances from the meta-dataset and train the stacked model on new data [85].
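The two generation strategies just described can be combined in a few lines. The sketch below is illustrative only; the toy data, the choice of k-nearest-neighbour members and the specific k values are assumptions, not taken from the proposal. Diversity comes from bootstrap sub-samples (data manipulation) and from a different k per member (parameter-set manipulation):

```python
import random

def knn_predict(train, x, k):
    """k-nearest-neighbour regression on 1-D inputs: average the targets
    of the k training points closest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def generate_ensemble(data, ks, rng):
    """One member per k value; each is trained on its own bootstrap sample."""
    members = []
    for k in ks:
        sample = [rng.choice(data) for _ in data]  # n draws with replacement
        members.append((sample, k))
    return members

def ensemble_predict(members, x):
    """Simple averaging integration of the members' predictions."""
    return sum(knn_predict(s, x, k) for s, k in members) / len(members)

rng = random.Random(0)
data = [(i / 10, (i / 10) ** 2) for i in range(21)]  # f(x) = x^2 on [0, 2]
members = generate_ensemble(data, ks=[1, 3, 5], rng=rng)
print(ensemble_predict(members, 1.0))  # close to f(1.0) = 1.0
```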
Approaches for ensemble generation by manipulation of the induction algorithm fall into two main categories: sequential and parallel. In sequential approaches [67][33], the generated models are only influenced by the previous ones. The main feature of these techniques is the use of a decorrelation penalty term in the error function of the ensemble to increase diversity. Making use of the decomposition of the generalization error of an ensemble, the training of each network tries to minimize a function that has a covariance component, thus decreasing the generalization error. In parallel approaches, the generation of the models includes an exchange of information and is usually guided by an evolutionary framework [45]. Two distinct parallel techniques are the infinite ensemble of Support Vector Machine models [44] (the main concept is to create a kernel that gathers all the possible models in the hypothesis space) and Random Forests, which combines the bagging method with

random feature selection on the generated trees.

Model manipulation is a less studied topic. This group of techniques focuses on modifying a model in some way so that its performance is boosted (e.g., given a set of rules produced by one single learning process, one can repeatedly sample the set of rules and build n models [34]).

2.4 Ensemble Pruning

Ensemble pruning consists of eliminating models from the ensemble, with the aim of improving its predictive ability or reducing computational costs. Research on ensemble pruning is divided into five categories: exponential search, randomized search, sequential search, ranking pruning and clustering pruning. Here, we follow [49] very closely.

Exponential search pruning refers to the group of algorithms that try to find the optimal subset of k models, from a pool of K models, to integrate an ensemble. The search space of this problem is very large and the task is NP-complete. For small values of k, this can be a good approach. However, in most cases, this approach gives poor results in comparison with other pruning algorithms and has a very high computational cost.

Randomized search pruning algorithms integrate an evolutionary framework in their process to search for a solution that is better than a random one. The GASEN (Genetic Algorithm based Selective ENsemble) [87] algorithm presented very promising results in a classification context.

The search algorithm of a sequential pruning process can be forward (if the search begins with an empty ensemble and adds models to it in each iteration), backward (if the search begins with all the models in the ensemble and eliminates models in each iteration) or forward-backward (if the selection can have both forward and backward steps). Comparative studies show that CwE (Constructive with Exploration) [19] presents very robust results.
In this algorithm, each time a new candidate model is to be added to the ensemble, all candidates are tested and the one that leads to the maximal improvement of the ensemble performance is selected. When no model in the pool improves the ensemble performance, the selection stops.

Ranking pruning algorithms sort the models according to a certain criterion and generate an ensemble containing the top k models in the ranking. Most of the algorithms in this category are rather simple and they do not seem to be competitive with state-of-the-art pruning techniques.

Clustering algorithms for ensemble pruning rely on grouping the models into several clusters and choosing representative models (one or more) from each cluster. A good example of this type of algorithm is ARIA (Adaptive Radius Immune Algorithm) [19]. Here, just the most accurate model

from each cluster is selected.

2.5 Ensemble Integration and the Dynamic Approach

Ensemble integration focuses on how to combine the outputs of the models previously generated for an ensemble in order to obtain one final prediction. Techniques for ensemble integration are divided into two main categories: constant weighting functions and non-constant weighting functions [49]. In the former, the weights assigned to each model in the ensemble are constant values; in the latter, the weights vary according to the instance to be predicted. We will pay particular attention to the non-constant weighting functions and, more particularly, the dynamic approach.

Constant weighting functions

Naturally, techniques for ensemble integration differ substantially between regression and classification. In the case of classification, the most frequent techniques are majority voting (for binary problems, the final prediction is the label that received more than half of the votes; otherwise, the output is the rejection option, usually the most frequent class), plurality voting (the final prediction is the label with the largest number of votes), weighted voting (a weight is assigned to each learner according to its past performance) and soft voting (here, the output of the classifiers is a probability instead of a label, so the techniques used in regression can be applied). The most frequent techniques for regression are averaging (given a set of base learners, the final prediction is the average of the predictions made by the learners) and weighted averaging (given a set of base learners, the final prediction is obtained by averaging the outputs of the different learners with different weights, implying different importance). Usually, the weights are estimated from the past performance of the base learners on some validation data.

The great drawback of the most simplistic techniques for ensemble integration is the multicollinearity problem [56].
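The constant weighting schemes above are straightforward to implement. A minimal sketch (the vote labels, predictions and weights are made-up examples):

```python
from collections import Counter

def plurality_vote(labels):
    """Plurality voting: the label with the largest number of votes wins."""
    return Counter(labels).most_common(1)[0][0]

def weighted_average(preds, weights):
    """Weighted averaging for regression; the weights would normally be
    estimated from each learner's past performance on validation data."""
    return sum(w * p for w, p in zip(weights, preds)) / sum(weights)

print(plurality_vote(["a", "b", "a", "c"]))
print(weighted_average([1.0, 2.0, 4.0], [0.5, 0.3, 0.2]))
```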
However, several techniques have been proposed that circumvent this problem. Caruana et al. [18] combined the ensemble integration phase with the ensemble generation one by implicitly calculating the weights as the number of times that each model is selected over the total number of models in the ensemble. Breiman [10] presented a regression version of the original stacking framework. To avoid the multicollinearity problem, he used ridge regression as the stacked model under the constraint that the coefficients of the regression (in other words, the weights for each model in the ensemble) must be non-negative. Although the results were not great, an important contribution of Breiman is the empirical observation that most of the weights are equal to zero, which reinforces the need for ensemble pruning. Merz and Pazzani [52] presented a technique that uses principal component analysis for ensemble integration, named PCR*. After the principal components (PC) are obtained, the method orders the PC as a function of the variation they can explain, making the selection of the PC much easier. In an important study covering several constant weighting function techniques, PCR* showed very consistent results [51].

Non-constant weighting functions

This category of weighting functions can be divided into static (defined at learning time) and dynamic (defined at prediction time). The most important contribution for static non-constant weighting functions was previously mentioned in Section 2.2 (and is further detailed in Section 3.3): Metadecision Trees. One must acknowledge, however, that this work was developed for classification; a regression version would require a different choice of meta-attributes. The dynamic approach for non-constant weighting functions has been receiving an increasing amount of attention in the research community [49]. The motivation for this technique is that different models in the ensemble may have different performances on different regions of the input space. One must distinguish the concepts of dynamic selection (DS) and dynamic weighting (DW): while the former concerns the selection of the models in an ensemble that are going to make a prediction, the latter focuses on how to combine the predictions of the models. Figure 2 shows a scheme for dynamic selection of models. The technique suggests that, given an input X, similar data is selected from a validation set. This process is usually guided by some distance metric, such as the Euclidean distance with the k-nearest neighbors algorithm.
Then, one or more models are selected from the ensemble given their past performance on the similar data. After model selection, the predictions can be combined in some way to make the final prediction.

Figure 2: Dynamic selection. Source: [48].

The first paper concerning DS of classifiers is due to Ho et al. [32]. In this work, the authors proposed a selection based on a partition of the training examples. The individual classifiers are evaluated on each partition to find the best one for each. Then, the test instance to be predicted is categorized into a partition and classified by the corresponding best classifier. A fully dynamic approach was introduced by Merz [50] in a paper in which DS of classifiers was combined with DW of the predictions. Results showed that a simple majority combination was superior to their dynamic approach. Woods [83] used a very similar approach but the results (with 4 different datasets) were better. Tsymbal [75][77] combined dynamic integration with classifier ensembles using bagging and boosting algorithms. Results suggest that dynamic integration significantly improves the performance of the ensembles compared with the more typical majority voting integration. Tsymbal also presented experiments in which a dynamic integration approach, instead of the simple majority combination in Random Forests, was better on some datasets [76]. Rooney et al. [65] extended dynamic integration to regression problems. They claim that dynamic integration techniques are as effective for regression as stacked regression when the base models are simple. In another paper by the same authors [66], they combined the random subspace method (the training data is transformed to contain different random subsets of the variables) with stacked regression and dynamic integration. Again, for simple models like linear regression and k-nearest neighbors, these techniques are more effective than bagging and boosting. Later, Rooney and Paterson [64] proposed a combination of stacking and dynamic integration for regression problems named wmetacomb. Preliminary results were promising. Ko et al. [38] and Moreira et al. [48] presented studies in which several variants of dynamic selection and integration are evaluated. The former showed comparisons of dynamic classifier selection and dynamic ensemble selection; the results (no statistical verification was carried out) suggest that, using weak classifiers, dynamic ensemble selection can marginally improve the accuracy, but it does not always perform better than dynamic classifier selection. The latter, in a regression task, also found evidence that dynamically selecting several models for the prediction task increases prediction accuracy compared with the selection of just one model. They also claim that using similarity measures based on the target values improves results.
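A minimal sketch of the dynamic selection and weighting scheme of Figure 2, for regression. The callable-model interface and the inverse-error weighting are assumptions made for illustration, not part of any of the cited methods.

```python
import numpy as np

def dynamic_predict(x, models, X_val, y_val, k=5, n_select=3):
    """Dynamic selection (DS) plus dynamic weighting (DW) for one test instance.

    For instance x: (1) find its k nearest neighbours in a held-out validation
    set, (2) rank the models by their error on that neighbourhood, (3) combine
    the n_select locally best models, weighted by inverse local error.
    """
    # 1. similar data: k nearest validation instances (Euclidean distance)
    dists = np.linalg.norm(X_val - x, axis=1)
    nn = np.argsort(dists)[:k]

    # 2. local competence: mean absolute error of each model on the neighbours
    errors = np.array([np.mean(np.abs(m(X_val[nn]) - y_val[nn])) for m in models])
    best = np.argsort(errors)[:n_select]

    # 3. dynamic weighting: inverse-error weights over the selected models
    w = 1.0 / (errors[best] + 1e-12)
    w /= w.sum()
    return sum(wi * models[i](x[None, :])[0] for wi, i in zip(w, best))
```

Here each model is assumed to be a callable mapping an (n, d) array to (n,) predictions; with n_select=1 the function reduces to pure dynamic selection.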

Liyanage et al. [46] proposed a dynamically weighted ensemble classification (DWEC) framework whereby an ensemble of multiple classifiers is trained on clustered features. The decisions from these multiple classifiers are dynamically combined based on the distances of the cluster centres to each test data sample being classified. Results showed that their method is significantly better than a Support Vector Machine baseline classifier.

3 Metalearning

Metalearning (MtL) is the study of principled methods that exploit metaknowledge to obtain efficient models and solutions by adapting machine learning and data mining processes [5]. Rendell and Cho [61] published the first experiments at the meta-level for classification. They characterized a classification problem and studied its impact on algorithm behavior, extracting metafeatures related to the size and concentration of the labels. In the following years, research in MtL was boosted by two European projects: StatLog and METAL. The former provided an assessment of the strengths and weaknesses of several classification techniques, while the latter focused on the development of a MtL assistant providing user support in machine learning and data mining, both for classification and regression problems [70]. The key issue in MtL is metaknowledge: the experience or knowledge gained from one or more data mining tasks. Typically, this knowledge is not generally available to improve the task or to assist in the following data mining tasks. Therefore, MtL concentrates on the effective application of knowledge about learning systems to understand and improve their performance. Figure 3 shows a typical MtL process of knowledge acquisition. Initially (from A to B), the process starts with the user having a group of datasets and the extraction of data characteristics or metafeatures (this topic will be detailed in 3.1). Then follow the steps of a normal data mining project: preprocessing and experiments (C), the choice of a learning strategy (D) and finally the evaluation phase (E). Both stages D and E also provide metaknowledge to be stored in a meta-dataset (F).

Figure 3: Metalearning: knowledge acquisition. Adapted from [80].
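To make the knowledge acquisition process concrete, a minimal sketch of stages B through F, using synthetic datasets and two scikit-learn algorithms chosen purely for illustration; the particular metafeatures and algorithms are assumptions of the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# two base-level algorithms whose behaviour we want to relate to the data
algorithms = {"tree": DecisionTreeClassifier(random_state=0), "nb": GaussianNB()}

meta_dataset = []
for seed in range(3):  # three synthetic base-level datasets
    X, y = make_classification(n_samples=200, n_features=8, random_state=seed)
    # stage B: extract metafeatures characterizing the dataset
    p = np.bincount(y) / len(y)
    metafeatures = {"n_examples": X.shape[0], "n_features": X.shape[1],
                    "class_entropy": -np.sum(p * np.log2(p))}
    # stages C-E: run the learning techniques and evaluate their performance
    performance = {name: cross_val_score(clf, X, y, cv=3).mean()
                   for name, clf in algorithms.items()}
    # stage F: one meta-example per dataset, stored in the meta-dataset
    meta_dataset.append({**metafeatures, **performance})
```

A metamodel can then be induced on `meta_dataset`, relating the metafeature columns to the recorded performances.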
Some authors consider that concepts like boosting, bagging or stacking (in other words, model combination methods) are a form of MtL (or meta-learning). However, in this work, we refer to MtL as the process of using metadata to improve and better understand data mining processes. Within this notion of MtL, one of the key concepts is learning bias: it refers to any preference for choosing one hypothesis explaining the data over other (equally acceptable) hypotheses, where such preference is based on extra-evidential information independent of the data. MtL studies how to choose the most adequate bias dynamically. Bias can be divided into two schemes: declarative and procedural. The former refers to the representation of the space of hypotheses and affects the size of the search space (e.g., specifying the use of linear functions). The latter imposes constraints on the ordering of the inductive hypotheses (e.g., preferring smaller hypotheses) [5]. One of the key points in MtL is the metadata, namely the metatarget and the extracted metafeatures (or data characteristics). The metatarget needs to be a variable that carries information about the performance of a given algorithm on a given dataset. The metafeatures, on the other hand, need to properly characterize the datasets in order to achieve an accurate and reliable meta-level model. They can be divided into three types: simple, statistical and information-theoretic measures; model-based measures; and landmarkers. Further details on this subject are given in 3.1. MtL applications cover several distinct tasks, namely algorithm recommendation, development of systems to support the KDD process, combination of base-learners, bias management in data streams, transfer of knowledge and development of complex systems with domain-specific metaknowledge [5]. In this work, we will focus on the topic of recommendations for data mining, given that it is more closely related to our project goals. For a full exposition of MtL applications, we refer the reader to [5]. Last but not least, there is some work in the state of the art that relates ensemble methods with MtL. This is, however, a very restricted group.
We will provide some insights into those works in 3.3 and expand their contributions in our future work.

3.1 Metadata

Generating the metadata is the most important step in a MtL process. Besides choosing the appropriate metatarget for the task, it is crucial to select meaningful metafeatures that contain information for successfully achieving the main goal. For that, it is very important to take into consideration both task-dependent and algorithm-specific metafeatures [5]. For instance, if the base-level task is classification, the choice of metafeatures should be different than for a regression problem. Moreover, one should also consider the set of algorithms that the MtL system needs to relate. For example, the proportion of symbolic features should be meaningful to differentiate between a neural network and a naïve Bayes classifier. It is acknowledged that neural networks usually present good performance when the dataset contains several numeric variables; naïve Bayes, on the other hand, is better suited to symbolic attributes. Each example in a meta-dataset represents a learning problem. As in any other learning task, MtL needs a satisfactory number of examples in order to induce a reliable model, and the number of meta-examples is often seen as a problem for MtL [5].

Metatarget

Concerning the development of a MtL system, the first decision that must be made is about the type of metatarget, in other words, the dependent variable of the meta-level learning process. This variable can take several forms depending on the main goal of the MtL system and the nature of the base-level task (i.e., classification, regression, etc.). The simplest form of metatarget is a classification scheme (binary or multi-class, depending on the number of algorithms) in which, for a given dataset, the metamodel predicts the class that represents the algorithm with the best performance from a set. The great disadvantage of this type of metatarget is that if the metamodel fails its prediction, the costs can be very high. With another type of metatarget, instead of a single recommendation, the metamodel suggests a subset of algorithms. Given the algorithm with the expected best performance, a heuristic measure can be defined to indicate the algorithms that also perform well in comparison with the best algorithm. Typically, these metamodels are induced with rules or Inductive Logic Programming [37]. The previous type of metatarget provides several recommendations for the user; however, they are not ordered, which can negatively influence the data mining process. Therefore, algorithm recommendation in the form of rankings seems a good alternative. A MtL method that provides recommendations in the form of rankings is proposed in [7].
The system includes an adaptation of the k-nearest neighbors algorithm that identifies algorithms which are expected to tie, providing a reduced ranking by only including one of them in the recommendation. Here, the metamodel also takes the form of a classifier. Finally, if one is interested in a concrete value for the performance of an algorithm on a dataset, rather than the relative performance of a set of algorithms, the metatarget can be defined as estimates of performance. In this case, the MtL problem takes the form of a regression problem, one for each base-algorithm. Besides providing more detailed information to the user, this type of metatarget also allows transforming the output of the metamodels into any of the recommendation forms mentioned above.
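The different metatarget forms are easy to relate in code. Given performance estimates produced by per-algorithm regression metamodels (the algorithm names and accuracy values below are purely hypothetical), the other recommendation forms follow directly:

```python
# hypothetical predicted accuracies from four regression metamodels
estimates = {"rf": 0.91, "svm": 0.88, "knn": 0.84, "nb": 0.79}

# single best algorithm (classification-style metatarget)
best = max(estimates, key=estimates.get)

# full ordering of the algorithms (ranking-style metatarget)
ranking = sorted(estimates, key=estimates.get, reverse=True)

# subset of algorithms within a heuristic margin of the best one
margin = 0.05
subset = [a for a in estimates if estimates[best] - estimates[a] <= margin]
```

The reverse transformation is not possible: a predicted ranking carries no information about the absolute performance values.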

There is a certain lack of studies comparing the different forms of metatarget and, in fact, the work done on this subject is contradictory. Köpf et al. [39] found evidence that the transformation of the estimates of performance provided by regression metamodels is not a good option. On the other hand, Bensusan and Kalousis [4] provide evidence that better rankings can be obtained by transforming estimates of performance than by using a ranking algorithm. Further research is needed for a full comparison of methods.

Metafeatures

Defining the metafeatures is probably the most important task in a MtL problem. If the data characterization does not provide useful information, the probability of success of the MtL system is highly reduced. Brazdil et al. [5] defined three fundamental requirements that every metafeature should satisfy:

Discriminative power. The metafeatures need to contain information that distinguishes between the base-algorithms in terms of their performance.

Computational complexity. The computation of the metafeatures should not be too demanding; otherwise, generating a MtL system may not pay off if one can save resources by simply exploring all the hypotheses for a given learning problem. Pfahringer et al. [58] suggested that the computational complexity of extracting metafeatures should be at most O(n log n).

Dimensionality. Given that the number of meta-examples in a MtL problem is usually small, the number of metafeatures should not be too large or overfitting may occur. Kalousis and Hilario [36] found evidence that feature selection can improve a MtL process, which supports this claim.

As mentioned before, metafeatures can be divided into three types (Figure 4):

Simple, statistical and information-theoretic. These are the most common type of metafeatures, extracted using descriptive statistics and information-theoretic measures.
Some examples: number of features/examples, number of instances with missing values (simple); mean skewness of the numeric features, mean value of correlation (statistical); class entropy, mutual information of the symbolic features (information-theoretic). We refer the reader to [39][26] for more examples.

Model-based. Here, metafeatures are extracted from properties of the induced model. Examples: the number of leaf nodes in a decision tree [55] or the mean of the off-diagonal values of a kernel matrix in a Support Vector Machine model [71].

Landmarkers. This type of metafeature consists of quick estimates of an algorithm's performance. They can be obtained in three different ways: by running simplified versions of an algorithm [3][58] (e.g., a decision stump); through quick performance estimates on a sample of the data, also called subsampling landmarkers [25]; and, finally, through an ordered sequence of subsampling landmarkers for a single algorithm, which allows forming the so-called learning curve of the algorithm. In this case, not only the estimates but also the shape of the curve can be used as metafeatures [43].

Figure 4: Metafeatures. Source: adapted from [5].

3.2 Metalearning for Data Mining Applications

Rendell et al. [62][61] published the earliest works in which the expression meta-learning is used in Machine Learning. In the first paper [62], they proposed the Variable Bias Management System (VBMS). Here, the problem of algorithm recommendation is studied for the first time and the need for methods that develop models with different biases is identified. However, the experiments are rather preliminary: only the execution time of the algorithms is considered, the metafeatures are very simple (e.g., number of examples) and the evaluation carried out is clearly insufficient. In the second paper [61], the data characterization was more detailed and set the roots for the MtL research of the following years, boosted by the already mentioned European projects, StatLog and METAL.
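As an illustration of the first and third types of metafeatures, the sketch below computes a few simple, statistical and information-theoretic measures plus a decision-stump landmarker. scikit-learn and the iris dataset are used purely as an example; the particular choice of measures is an assumption, not a prescribed set.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# simple measures
n_examples, n_features = X.shape

# statistical measures: mean absolute skewness and mean absolute correlation
def skewness(v):
    return np.mean((v - v.mean()) ** 3) / v.std() ** 3

mean_skewness = np.mean([abs(skewness(X[:, j])) for j in range(n_features)])
corr = np.corrcoef(X, rowvar=False)
mean_correlation = np.mean(np.abs(corr[np.triu_indices(n_features, k=1)]))

# information-theoretic measure: class entropy
p = np.bincount(y) / len(y)
class_entropy = -np.sum(p * np.log2(p))

# landmarker: cross-validated accuracy of a simplified algorithm (a stump)
stump = DecisionTreeClassifier(max_depth=1, random_state=0)
stump_landmarker = cross_val_score(stump, X, y, cv=5).mean()
```

Each dataset processed this way yields one row of metafeatures for the meta-dataset.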


More information

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Knowledge Discovery and Data Mining Unit # 10 Sajjad Haider Fall 2012 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right

More information

Data Mining Classification: Decision Trees

Data Mining Classification: Decision Trees Data Mining Classification: Decision Trees Classification Decision Trees: what they are and how they work Hunt s (TDIDT) algorithm How to select the best split How to handle Inconsistent data Continuous

More information

CS Master Level Courses and Areas COURSE DESCRIPTIONS. CSCI 521 Real-Time Systems. CSCI 522 High Performance Computing

CS Master Level Courses and Areas COURSE DESCRIPTIONS. CSCI 521 Real-Time Systems. CSCI 522 High Performance Computing CS Master Level Courses and Areas The graduate courses offered may change over time, in response to new developments in computer science and the interests of faculty and students; the list of graduate

More information

Chapter 11 Boosting. Xiaogang Su Department of Statistics University of Central Florida - 1 -

Chapter 11 Boosting. Xiaogang Su Department of Statistics University of Central Florida - 1 - Chapter 11 Boosting Xiaogang Su Department of Statistics University of Central Florida - 1 - Perturb and Combine (P&C) Methods have been devised to take advantage of the instability of trees to create

More information

Classification algorithm in Data mining: An Overview

Classification algorithm in Data mining: An Overview Classification algorithm in Data mining: An Overview S.Neelamegam #1, Dr.E.Ramaraj *2 #1 M.phil Scholar, Department of Computer Science and Engineering, Alagappa University, Karaikudi. *2 Professor, Department

More information

II. RELATED WORK. Sentiment Mining

II. RELATED WORK. Sentiment Mining Sentiment Mining Using Ensemble Classification Models Matthew Whitehead and Larry Yaeger Indiana University School of Informatics 901 E. 10th St. Bloomington, IN 47408 {mewhiteh, larryy}@indiana.edu Abstract

More information

Chapter 12 Bagging and Random Forests

Chapter 12 Bagging and Random Forests Chapter 12 Bagging and Random Forests Xiaogang Su Department of Statistics and Actuarial Science University of Central Florida - 1 - Outline A brief introduction to the bootstrap Bagging: basic concepts

More information

Learning bagged models of dynamic systems. 1 Introduction

Learning bagged models of dynamic systems. 1 Introduction Learning bagged models of dynamic systems Nikola Simidjievski 1,2, Ljupco Todorovski 3, Sašo Džeroski 1,2 1 Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia 2 Jožef Stefan

More information

Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j

Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j What is Kiva? An organization that allows people to lend small amounts of money via the Internet

More information

BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL

BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL The Fifth International Conference on e-learning (elearning-2014), 22-23 September 2014, Belgrade, Serbia BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL SNJEŽANA MILINKOVIĆ University

More information

BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES

BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES 123 CHAPTER 7 BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES 7.1 Introduction Even though using SVM presents

More information

The primary goal of this thesis was to understand how the spatial dependence of

The primary goal of this thesis was to understand how the spatial dependence of 5 General discussion 5.1 Introduction The primary goal of this thesis was to understand how the spatial dependence of consumer attitudes can be modeled, what additional benefits the recovering of spatial

More information

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches PhD Thesis by Payam Birjandi Director: Prof. Mihai Datcu Problematic

More information

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets http://info.salford-systems.com/jsm-2015-ctw August 2015 Salford Systems Course Outline Demonstration of two classification

More information

Support Vector Machines with Clustering for Training with Very Large Datasets

Support Vector Machines with Clustering for Training with Very Large Datasets Support Vector Machines with Clustering for Training with Very Large Datasets Theodoros Evgeniou Technology Management INSEAD Bd de Constance, Fontainebleau 77300, France theodoros.evgeniou@insead.fr Massimiliano

More information

Machine Learning. Mausam (based on slides by Tom Mitchell, Oren Etzioni and Pedro Domingos)

Machine Learning. Mausam (based on slides by Tom Mitchell, Oren Etzioni and Pedro Domingos) Machine Learning Mausam (based on slides by Tom Mitchell, Oren Etzioni and Pedro Domingos) What Is Machine Learning? A computer program is said to learn from experience E with respect to some class of

More information

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not. Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C

More information

Penalized regression: Introduction

Penalized regression: Introduction Penalized regression: Introduction Patrick Breheny August 30 Patrick Breheny BST 764: Applied Statistical Modeling 1/19 Maximum likelihood Much of 20th-century statistics dealt with maximum likelihood

More information

Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification

Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification Henrik Boström School of Humanities and Informatics University of Skövde P.O. Box 408, SE-541 28 Skövde

More information

Predict Influencers in the Social Network

Predict Influencers in the Social Network Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, lyzhou@stanford.edu Department of Electrical Engineering, Stanford University Abstract Given two persons

More information

Classification of Bad Accounts in Credit Card Industry

Classification of Bad Accounts in Credit Card Industry Classification of Bad Accounts in Credit Card Industry Chengwei Yuan December 12, 2014 Introduction Risk management is critical for a credit card company to survive in such competing industry. In addition

More information

Linear Threshold Units

Linear Threshold Units Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear

More information

The Data Mining Process

The Data Mining Process Sequence for Determining Necessary Data. Wrong: Catalog everything you have, and decide what data is important. Right: Work backward from the solution, define the problem explicitly, and map out the data

More information

DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES

DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES Vijayalakshmi Mahanra Rao 1, Yashwant Prasad Singh 2 Multimedia University, Cyberjaya, MALAYSIA 1 lakshmi.mahanra@gmail.com

More information

E-commerce Transaction Anomaly Classification

E-commerce Transaction Anomaly Classification E-commerce Transaction Anomaly Classification Minyong Lee minyong@stanford.edu Seunghee Ham sham12@stanford.edu Qiyi Jiang qjiang@stanford.edu I. INTRODUCTION Due to the increasing popularity of e-commerce

More information

Handling missing data in large data sets. Agostino Di Ciaccio Dept. of Statistics University of Rome La Sapienza

Handling missing data in large data sets. Agostino Di Ciaccio Dept. of Statistics University of Rome La Sapienza Handling missing data in large data sets Agostino Di Ciaccio Dept. of Statistics University of Rome La Sapienza The problem Often in official statistics we have large data sets with many variables and

More information

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( ) Chapter 340 Principal Components Regression Introduction is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates

More information

Decompose Error Rate into components, some of which can be measured on unlabeled data

Decompose Error Rate into components, some of which can be measured on unlabeled data Bias-Variance Theory Decompose Error Rate into components, some of which can be measured on unlabeled data Bias-Variance Decomposition for Regression Bias-Variance Decomposition for Classification Bias-Variance

More information

MHI3000 Big Data Analytics for Health Care Final Project Report

MHI3000 Big Data Analytics for Health Care Final Project Report MHI3000 Big Data Analytics for Health Care Final Project Report Zhongtian Fred Qiu (1002274530) http://gallery.azureml.net/details/81ddb2ab137046d4925584b5095ec7aa 1. Data pre-processing The data given

More information

Lecture 10: Regression Trees

Lecture 10: Regression Trees Lecture 10: Regression Trees 36-350: Data Mining October 11, 2006 Reading: Textbook, sections 5.2 and 10.5. The next three lectures are going to be about a particular kind of nonlinear predictive model,

More information

New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Introduction

New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Introduction Introduction New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Predictive analytics encompasses the body of statistical knowledge supporting the analysis of massive data sets.

More information

Sanjeev Kumar. contribute

Sanjeev Kumar. contribute RESEARCH ISSUES IN DATAA MINING Sanjeev Kumar I.A.S.R.I., Library Avenue, Pusa, New Delhi-110012 sanjeevk@iasri.res.in 1. Introduction The field of data mining and knowledgee discovery is emerging as a

More information

Statistics Graduate Courses

Statistics Graduate Courses Statistics Graduate Courses STAT 7002--Topics in Statistics-Biological/Physical/Mathematics (cr.arr.).organized study of selected topics. Subjects and earnable credit may vary from semester to semester.

More information

Data Analytics and Business Intelligence (8696/8697)

Data Analytics and Business Intelligence (8696/8697) http: // togaware. com Copyright 2014, Graham.Williams@togaware.com 1/36 Data Analytics and Business Intelligence (8696/8697) Ensemble Decision Trees Graham.Williams@togaware.com Data Scientist Australian

More information

A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions

A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions Jing Gao Wei Fan Jiawei Han Philip S. Yu University of Illinois at Urbana-Champaign IBM T. J. Watson Research Center

More information

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015 An Introduction to Data Mining for Wind Power Management Spring 2015 Big Data World Every minute: Google receives over 4 million search queries Facebook users share almost 2.5 million pieces of content

More information

Why Ensembles Win Data Mining Competitions

Why Ensembles Win Data Mining Competitions Why Ensembles Win Data Mining Competitions A Predictive Analytics Center of Excellence (PACE) Tech Talk November 14, 2012 Dean Abbott Abbott Analytics, Inc. Blog: http://abbottanalytics.blogspot.com URL:

More information

Latest Results on outlier ensembles available at http://www.charuaggarwal.net/theory.pdf (Clickable Link) Outlier Ensembles.

Latest Results on outlier ensembles available at http://www.charuaggarwal.net/theory.pdf (Clickable Link) Outlier Ensembles. Outlier Ensembles [Position Paper] Charu C. Aggarwal IBM T. J. Watson Research Center Yorktown Heights, NY charu@us.ibm.com ABSTRACT Ensemble analysis is a widely used meta-algorithm for many data mining

More information

BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts

BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts BIDM Project Predicting the contract type for IT/ITES outsourcing contracts N a n d i n i G o v i n d a r a j a n ( 6 1 2 1 0 5 5 6 ) The authors believe that data modelling can be used to predict if an

More information

Azure Machine Learning, SQL Data Mining and R

Azure Machine Learning, SQL Data Mining and R Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:

More information

Component Ordering in Independent Component Analysis Based on Data Power

Component Ordering in Independent Component Analysis Based on Data Power Component Ordering in Independent Component Analysis Based on Data Power Anne Hendrikse Raymond Veldhuis University of Twente University of Twente Fac. EEMCS, Signals and Systems Group Fac. EEMCS, Signals

More information

CHAPTER VII CONCLUSIONS

CHAPTER VII CONCLUSIONS CHAPTER VII CONCLUSIONS To do successful research, you don t need to know everything, you just need to know of one thing that isn t known. -Arthur Schawlow In this chapter, we provide the summery of the

More information

Risk pricing for Australian Motor Insurance

Risk pricing for Australian Motor Insurance Risk pricing for Australian Motor Insurance Dr Richard Brookes November 2012 Contents 1. Background Scope How many models? 2. Approach Data Variable filtering GLM Interactions Credibility overlay 3. Model

More information

Multiple Classifiers -Integration and Selection

Multiple Classifiers -Integration and Selection 1 A Dynamic Integration Algorithm with Ensemble of Classifiers Seppo Puuronen 1, Vagan Terziyan 2, Alexey Tsymbal 2 1 University of Jyvaskyla, P.O.Box 35, FIN-40351 Jyvaskyla, Finland sepi@jytko.jyu.fi

More information

Data Mining Analytics for Business Intelligence and Decision Support

Data Mining Analytics for Business Intelligence and Decision Support Data Mining Analytics for Business Intelligence and Decision Support Chid Apte, T.J. Watson Research Center, IBM Research Division Knowledge Discovery and Data Mining (KDD) techniques are used for analyzing

More information

STATISTICA Formula Guide: Logistic Regression. Table of Contents

STATISTICA Formula Guide: Logistic Regression. Table of Contents : Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary

More information

Detection. Perspective. Network Anomaly. Bhattacharyya. Jugal. A Machine Learning »C) Dhruba Kumar. Kumar KaKta. CRC Press J Taylor & Francis Croup

Detection. Perspective. Network Anomaly. Bhattacharyya. Jugal. A Machine Learning »C) Dhruba Kumar. Kumar KaKta. CRC Press J Taylor & Francis Croup Network Anomaly Detection A Machine Learning Perspective Dhruba Kumar Bhattacharyya Jugal Kumar KaKta»C) CRC Press J Taylor & Francis Croup Boca Raton London New York CRC Press is an imprint of the Taylor

More information

On the application of multi-class classification in physical therapy recommendation

On the application of multi-class classification in physical therapy recommendation RESEARCH Open Access On the application of multi-class classification in physical therapy recommendation Jing Zhang 1,PengCao 1,DouglasPGross 2 and Osmar R Zaiane 1* Abstract Recommending optimal rehabilitation

More information

SPATIAL DATA CLASSIFICATION AND DATA MINING

SPATIAL DATA CLASSIFICATION AND DATA MINING , pp.-40-44. Available online at http://www. bioinfo. in/contents. php?id=42 SPATIAL DATA CLASSIFICATION AND DATA MINING RATHI J.B. * AND PATIL A.D. Department of Computer Science & Engineering, Jawaharlal

More information

Adaptive Demand-Forecasting Approach based on Principal Components Time-series an application of data-mining technique to detection of market movement

Adaptive Demand-Forecasting Approach based on Principal Components Time-series an application of data-mining technique to detection of market movement Adaptive Demand-Forecasting Approach based on Principal Components Time-series an application of data-mining technique to detection of market movement Toshio Sugihara Abstract In this study, an adaptive

More information