Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification
|
|
|
- Jean Johnson
- 10 years ago
- Views:
Transcription
1 Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification Henrik Boström School of Humanities and Informatics University of Skövde P.O. Box 408, SE Skövde Sweden [email protected] Abstract - Two strategies for fusing information from multiple sources when generating predictive models in the domain of pesticide classification are investigated: i) fusing different sets of features (molecular descriptors) before building a model and ii) fusing the classifiers built from the individual descriptor sets. An empirical investigation demonstrates that the choice of strategy can have a significant impact on the predictive performance. Furthermore, the experiment shows that the best strategy is dependent on the type of predictive model considered. When generating a decision tree for pesticide classification, a statistically significant difference in accuracy is observed in favor of combining predictions from the individual models compared to generating a single model from the fused set of molecular descriptors. On the other hand, when the model consists of an ensemble of decision trees, a statistically significant difference in accuracy is observed in favor of building the model from the fused set of descriptors compared to fusing ensemble models built from the individual sources. Keywords: feature fusion, classifier fusion, decision fusion, chemoinformatics 1 Introduction Data mining techniques have become standard tools to develop predictive and descriptive models in situations where one wants to exploit data collected from earlier observations in order optimize future decision making [1]. In the case of predictive modeling, one typically tries to estimate the expected value of a particular variable (called the dependent variable), given the values of a set of other (independent) variables. In the case of a nominal dependent variable (i.e., the possible values are not given any particular order), the prediction task is usually referred to as classification, while the corresponding task when having a numerical dependent variable is referred to as regression. One usually wants the model to be as correct as possible when evaluated on independent test data, and several suggestions for how to measure this have been proposed. For classification, such measures include accuracy, i.e., the percentage of correctly classified test examples, and the area under the ROC curve (AUC), i.e., the probability that a test example belonging to a class is ranked as being more likely belonging to the class than a test example not belonging to the class [2]. Besides the ability to make correct predictions, one is also often interested in obtaining a comprehensible model, so that the reasons behind a particular classification can be understood, and also that one may gain insights into what factors are important for the classification in general. Examples of such comprehensible models are decision trees and rules, e.g. [3], while examples of models not belonging to this group, often called black-box, or opaque, models, include artificial neural networks and support vector machines (see e.g. [4]). A central issue when developing predictive models from multiple sources is how to best integrate or fuse the information from these sources. Should one fuse all available data and then generate a predictive model, or should one generate models from the different sources and then fuse the models? In this work we address this general question by studying a particular problem within the domain of chemoinformatics pesticide classification. This problem concerns classifying molecules into being a pesticide or not, based on different molecular descriptors. What is interesting from an information fusion perspective with this and many similar problems in chemoinformatics is that there is no (known) single source of molecular descriptors that always is optimal, but there is a large number of different sources developed for different purposes. Hence, the question for a particular application in this area is how to best combine the information from these sources when generating a predictive model. In this study, we consider two basic strategies for fusing this information: i) fusing the sources before generating the model, and ii) generating models from each source and then fusing the models into a global predictive model. The research question of this study is whether or not the choice of strategy has any impact on the resulting model w.r.t. predictive performance in the domain of pesticide classification. In the next section, we describe the problem of pesticide classification in a little more detail, and point out the sources of information used in the study. In section three,
2 we briefly describe the employed predictive data mining techniques as well as the methods for fusing features and classifiers. In section four, we present the experimental setup and the results from the experiment. Finally, in section five we give concluding remarks and point out directions for future work. 2 Data Sources The particular class of molecules that we consider in this study are so-called pesticides, i.e., molecules that may be used for preventing or destroying pests, such as microbes, insects and other organisms that are considered to be harmful. The set of pesticides used in this study was selected from the e-pesticide Manual [5]. The structures were converted into SMILES strings [6] and standardized in order to allow further processing such as removing duplicates and descriptor calculations. After removing duplicate structures, we retained 1613 unique compounds with molecular weight (MW) between Dalton. A similar sized counter set (i.e., molecules that have not been classified as pesticides) was selected from a set of compounds obtained from external vendors and compiled at AstraZeneca. The selection was done by matching the MW distribution between the two sets using a genetic algorithm, see Figure 1. The algorithm minimizes the F- test statistics between two datasets in order to obtain a non-significant difference between the normal distributions of a desired property. This was done in order to avoid trivial classifications based on molecular size. % compounds MW distribution pesticides CSET-GA CSET-rand Figure 1. Molecular weight distributions for the pesticides and counter sets. CSET-rand is randomly selected while CSET-GA is selected using a genetic algorithm. The following 2D molecular descriptors have been calculated for both sets with DRAGON [7]: B01: 48 constitutional descriptors (the so-called 0D descriptors independent of molecular connectivity such as MW, number of atoms, sum of atomic properties, etc.) B17: 152 chemical functional group counts B18: 120 atom centered fragments. Although this set was initially proposed for predicting molecular hydrophobicity (lipophilicity) [8], it has proved to be very useful for several other binary classification tasks [9,10] B20: 28 molecular properties calculated from models and empirical descriptors. For a detailed description of the selected molecular descriptors, see [7] and the references therein. 3 Methods 3.1 Decision trees and ensembles Techniques for generating decision trees are perhaps among the most well-known methods for predictive data mining. Early systems for generating decision trees include CART [11] and ID3 [12], the latter being followed by the later versions C4.5 [3] and C5.0 [13]. The basic strategy that is employed when generating decision trees is called recursive partitioning, or divide-and-conquer. It works by partitioning the examples by choosing a set of conditions on an independent variable (e.g., the variable has a value less than a particular threshold, or a value greater or equal to this threshold), and the choice is usually made such that the error on the dependent variable is minimized within each group. The process continues recursively with each subgroup until certain conditions are met, such as that the error cannot be further reduced (e.g., all examples in a group belong to the same class). The resulting decision tree is a graph that contains one node for each subgroup considered, where the node corresponding to the initial set of examples is called the root, and for all nodes there is an edge to each subgroup generated from it, labeled with the chosen condition for that subgroup. A decision tree is normally depicted with the root at the top having the ancestor nodes below it see Fig. 2 for a decision tree generated within the domain of pesticide classification. An example is classified by the tree by following a path from the root to a leaf node, such that all conditions along the path are fulfilled by the example, where the conditions are formed from the variable names directly below each node and from the edge labels (e.g., the condition B01-nCIC 1.5 follows from the root and the leftmost edge from the root). The estimated class probabilities at the reached leaf node are used to assign the most probable class to the example. The relative sizes of the sectors of each pie chart in Fig. 2 correspond to estimated class probabilities, where a dark sector corresponds to the probability of belonging to the class of pesticides, and a light sector corresponds to the
3 Figure 2. Example of a decision tree for classifying molecules as pesticides (dark color) or non-pesticides (light color) based on chemical descriptors. probability of belonging to the class of non-pesticides 1. Decision trees have many attractive features, such as allowing for human interpretation and hence making it possible for a decision maker to gain insights into what factors are important for particular classifications. However, recent research has shown that significant improvements in predictive performance can be achieved by generating large sets of models, or ensembles, which are used to form a collective vote on the value for the dependent variable [14]. It can be shown that as long as each single model performs better than random, and the models make independent errors, the resulting error can in theory be made arbitrarily small by increasing the size of the ensemble. However, in practice it is not possible to completely fulfill these conditions, but several methods have been proposed that try to approximate independence, and still maintain sufficient accuracy of each model, by introducing randomness in the process of selecting examples and conditions when building each individual model. One popular method of introducing randomness in the selection of training examples is bootstrap aggregating, 1 The plus and minus signs below each node indicate whether (+) or not (-) the tree could be further expanded in the particular tool that has been used to create the tree [19]. or bagging, as introduced by Breiman [15]. It works by randomly selecting n examples with replacement from the initial set of n examples, leading to that some examples are duplicated while others are excluded. Typically a large number (at least 25-50) of such sets are sampled from which each individual model is generated. Yet another popular method of introducing randomness when generating decision trees is to consider only a small subset of all available independent variables at each node when forming the tree. When combined with bagging, the resulting models are referred to as random forests [16], and these are widely considered to be among the most competitive and robust of current methods for predictive data mining. The drawback of ensemble models are however that they can no longer be easily interpreted and hence provide less guidance into how classifications are made. 3.2 Feature and classifier fusion Feature fusion concerns how to generate and select a single set of features for a set of objects to which several sets of features are associated. The purpose of feature fusion is typically to obtain a representation that allows for more effective analysis (cf. fusing pixel information into segments to improve image classification).
4 Normally the fused vector results in loss of information. In case all objects have the same sets of features associated to them, one could however perform feature fusion with no loss of information by simply concatenating the feature vectors. Although such a concatenation may clearly not be the most effective method for high-dimensional feature vectors, it may in fact be the most effective alternative for lowdimensional data. Since we in this study only are concerned with moderately-sized feature vectors 2, this method was chosen as a base-line for feature fusion. The combination of multiple classifiers generated from different sources or in different ways into a global model is often referred to as classifier fusion [17]. The purpose of classifier fusion is to improve predictive performance, typically by combining the output of each of the fused classifiers, where the output either is a single class label or a probability distribution over all class labels 3. One may consider classifier fusion to be a special case of decision fusion [18], since the latter normally is not restricted to combining multiple decisions only, but also measures of confidence, probabilities etc. A large number of approaches have been proposed for decision fusion in general, and for classifier fusion in particular. The latter include both ways of combining the output from multiple classifiers and ways of choosing classifiers to include in the combination, see [17] for an extensive characterization of different techniques for classifier fusion. In this work, we have chosen to use Bayes average as a base-line method for obtaining a class probability distribution from the fused classifiers: K k= 1 P ( x C x) = A P ( x C x) where x is an example to be classified, C is one of the classes it may belong to, and P k, k=1,, K, are the probability distributions for the K classifiers. k K 4 Empirical Evaluation We first state the experimental hypothesis, then describe the experimental setting, and finally present the results together with conclusions. 4.1 Experimental hypothesis The null hypothesis of this experiment is that the choice of fusion strategy (fusing descriptors before building a model vs. fusing models built from each set of descriptors) has no impact on the predictive performance. 4.2 Experimental setting We took the entire data set of compounds, of which 50% were classified as pesticides and 50% as non-pesticides, and performed a stratified split into two sets one consisting of 70% (2255 compounds) to be used for training (model construction) and the other consisting of 30% (966 compounds) for testing. The size of the test set was considered to be sufficiently large by this division to allow detection of any significant differences between the strategies if the original set of compounds would have been significantly smaller, cross-validation could have been considered as an alternative. Two techniques for generating predictive models were considered decision trees and ensembles (random forests) as implemented in the Rule Discovery System, v [19]. The former is an example of a technique resulting in a comprehensible model, while the latter results in what can be considered to be a noncomprehensible model (each ensemble in the experiment consists of 50 non-pruned trees). We considered generating models from each of the four sets of descriptors (B01, B17, B18 and B20) as well as generating a fused model by averaging the class probabilities from each individual model. The resulting models were compared to the model obtained by first fusing all descriptors into a global set, and then generating a predictive model. 2 The feature vector obtained by concatenating all four molecular descriptor sets contains 348 features. 3 The methods for generating ensemble models described in the previous section are hence examples of methods for classifier fusion. 4 Of the original set of 3226 compounds, 5 were removed due to failure to calculate molecular descriptors.
5 Tree model B01 B17 B18 B20 Fused Fused Model Descriptors No. errors Accuracy AUC No. rules Table 1. Results for the decision tree method. Ensemble B01 B17 B18 B20 Fused Fused model Model Descriptors No. errors Accuracy AUC No. rules Table 2. Results for the ensemble method. 4.3 Experimental results The predictive performance on the 966 test examples for the decision tree method is presented in Table 1, where the number of errors, accuracy, AUC and number of rules (leaf nodes) is presented for each set of descriptors (B01, B17, B18 and B20) as well as for the fused model, and the model generated from the fused descriptor set. The fused model clearly outperforms each individual model generated from a single set of descriptors when comparing the predictive performance (when measured both as accuracy and AUC). The fused model furthermore significantly outperforms the model generated from the fused descriptor set hence demonstrating that the choice of strategy indeed may have an impact on the resulting predictive performance. According to McNemar s test, the double-sided tail probability for the observed difference in accuracy is 0.006, hence allowing the null hypothesis to be safely rejected. The improved performance is however associated with a cost regarding the comprehensibility, since the size of the fused model is equal to the sum of the sizes of the included models. Interestingly, the tree generated from the fused set of descriptors is the smallest (and still slightly more accurate than the models built from separate descriptor sets). This might be due to that more compact models can be found when combinations of descriptors from different sources are allowed. However, another possible explanation is that the increased dimensionality may lead to more extensive pruning, since the risk of being mislead by spurious correlations when growing the tree increases with the number of dimensions. The predictive performance on the 966 test examples for the ensemble method is presented in Table 2, where again the number of errors, accuracy, AUC and number of rules (leaf nodes) is presented for each set of descriptors (B01, B17, B18 and B20) as well as for the fused model, and the model generated from the fused descriptor set. In contrast to the results for the decision tree method, building an ensemble model from a fused set of descriptors is more effective than fusing ensemble models built from the individual descriptor sets. According to McNemar s test, the double-sided tail probability for the observed difference in accuracy is This means that the null hypothesis can again be safely rejected the choice of strategy is indeed important but this time the best strategy is to fuse the descriptor sets. It should also be noted that in contrast to the experiment with decision trees, the model built from one of the individual descriptor sets (B18) actually outperforms one of the fusion strategies.
6 5 Concluding remarks We have investigated two strategies for fusing information from multiple sources when generating predictive models in the domain of pesticide classification by fusing features from all sources before building a model and by fusing models built from the individual sources. The experiment demonstrated that the choice of strategy does indeed have an impact on the predictive performance. In case the models consist of decision trees, it is clearly more beneficial to combine the predictions from the individual models instead of generating a single model from the fused set of descriptors. However, when the models consist of ensembles of decision trees, it is significantly more effective to build the model from a fused set of descriptors than fusing ensemble models built from the individual sources. Whether this finding holds also for other domains remains to be investigated, but this study has nevertheless demonstrated that significant gains in predictive performance can be obtained by considering different fusion strategies for different types of model. This finding also calls for investigating more elaborate fusion methods than the two basic strategies considered in this study, both regarding fusion of features and fusion of classifiers (cf. [17]). Another line of future research concerns investigating what effect the choice of fusion strategy has on comprehensibility (not only counting the number of rules) and how this affects the possibilities of gaining new insights in the domain of application. Acknowledgements Many thanks to Dr. Sorel Muresan, AstraZeneca R&D, Mölndal, Sweden, for providing the data sets, together with descriptions of these, as well as for valuable discussions. This work was supported by the Information Fusion Research Program ( at the University of Skövde, Sweden, in partnership with the Swedish Knowledge Foundation under grant 2003/0104. References [1] Witten I. and Frank E., Data Mining: Practical Machine Learning Tools and Techniques (Second Edition), Morgan Kaufmann (2005) [2] Provost F., Fawcett T. and Kohavi R., The case against accuracy estimation for comparing induction algorithms, Proc. Fifteenth Intl. Conf. Machine Learning, (1998) [3] Quinlan J.R., C4.5: Programs for Machine Learning, Morgan Kauffman, 1993 [4] Hastie T., Tibshirani R. and Friedman J., The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer-Verlag (2001) [5] Tomlin C.D.S. (Ed.), The e-pesticide Manual, 13th edition, BCPC Publications (2003) [6] James C. A., Weininger D. and Delaney J., Daylight Theory Manual, Daylight Chemical Information Systems, (accessed Feb. 21, 2007) [7] DRAGON for Windows (Software for Molecular Descriptor Calculations) version , Talete SRL, (accessed Feb. 21, 2007) [8] Ghose A. K., Viswanadhan, V. N. and Wendoloski J. J., Prediction of Hydrophobic (Lipophilic) Properties of Small Organic Molecules Using Fragmental Methods: An Analysis of ALOGP and CLOGP Methods, Journal of Physical Chemistry A, vol. 102 (1998) [9] Sadowski J. and Kubinyi H., A scoring scheme for discriminating between drugs and nondrugs., J. Med. Chem., vol. 41 (1998) [10] Muresan S. and Sadowski J., In-House likeness : Comparison of large compound collections using artificial neural networks, J. Chem. Inf. Model., vol. 45 (2005) [11] Breiman L., Friedman J.H., Olshen R.A. and Stone C.J., Classification and Regression Trees, Wadsworth (1984) [12] Quinlan J.R., Induction of decision trees, Machine Learning, vol. 1 (1986) [13] Quinlan J.R., Data Mining Tools See5 and C5.0, (accessed Feb. 21, 2007) [14] Bauer E. and Kohavi R., An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants, Machine Learning, vol. 36 (1999) [15] Breiman L., Bagging Predictors, Machine Learning, vol. 24 (1996) [16] Breiman L., Random Forests, Machine Learning, vol. 45 (2001) 5-32 [17] Ruta D. and Gabrys B., An Overview of Classifier Fusion Methods, Computing and Information Systems, vol. 7 (2000) 1-10 [18] Dasarathy B.V., Decision Fusion, IEEE Computer Society Press (1994)
7 [19] Rule Discovery System, v , Compumine AB, (accessed Feb. 21, 2007)
Data Mining - Evaluation of Classifiers
Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010
The INFUSIS Project Data and Text Mining for In Silico Modeling
The INFUSIS Project Data and Text Mining for In Silico Modeling Henrik Boström 1,2, Ulf Norinder 3, Ulf Johansson 4, Cecilia Sönströd 4, Tuve Löfström 4, Elzbieta Dura 5, Ola Engkvist 6, Sorel Muresan
BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL
The Fifth International Conference on e-learning (elearning-2014), 22-23 September 2014, Belgrade, Serbia BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL SNJEŽANA MILINKOVIĆ University
Gerry Hobbs, Department of Statistics, West Virginia University
Decision Trees as a Predictive Modeling Method Gerry Hobbs, Department of Statistics, West Virginia University Abstract Predictive modeling has become an important area of interest in tasks such as credit
Data Mining. Nonlinear Classification
Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Nonlinear Classification Classes may not be separable by a linear boundary Suppose we randomly generate a data set as follows: X has range between 0 to 15
On the effect of data set size on bias and variance in classification learning
On the effect of data set size on bias and variance in classification learning Abstract Damien Brain Geoffrey I Webb School of Computing and Mathematics Deakin University Geelong Vic 3217 With the advent
Chapter 6. The stacking ensemble approach
82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described
Knowledge Discovery and Data Mining
Knowledge Discovery and Data Mining Unit # 11 Sajjad Haider Fall 2013 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right
DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES
DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES Vijayalakshmi Mahanra Rao 1, Yashwant Prasad Singh 2 Multimedia University, Cyberjaya, MALAYSIA 1 [email protected]
Knowledge Discovery and Data Mining
Knowledge Discovery and Data Mining Unit # 10 Sajjad Haider Fall 2012 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right
TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM
TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM Thanh-Nghi Do College of Information Technology, Cantho University 1 Ly Tu Trong Street, Ninh Kieu District Cantho City, Vietnam
Knowledge Discovery and Data Mining. Bootstrap review. Bagging Important Concepts. Notes. Lecture 19 - Bagging. Tom Kelsey. Notes
Knowledge Discovery and Data Mining Lecture 19 - Bagging Tom Kelsey School of Computer Science University of St Andrews http://tom.host.cs.st-andrews.ac.uk [email protected] Tom Kelsey ID5059-19-B &
Decision Tree Learning on Very Large Data Sets
Decision Tree Learning on Very Large Data Sets Lawrence O. Hall Nitesh Chawla and Kevin W. Bowyer Department of Computer Science and Engineering ENB 8 University of South Florida 4202 E. Fowler Ave. Tampa
ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA
ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA D.Lavanya 1 and Dr.K.Usha Rani 2 1 Research Scholar, Department of Computer Science, Sree Padmavathi Mahila Visvavidyalayam, Tirupati, Andhra Pradesh,
Mining Direct Marketing Data by Ensembles of Weak Learners and Rough Set Methods
Mining Direct Marketing Data by Ensembles of Weak Learners and Rough Set Methods Jerzy B laszczyński 1, Krzysztof Dembczyński 1, Wojciech Kot lowski 1, and Mariusz Paw lowski 2 1 Institute of Computing
Data quality in Accounting Information Systems
Data quality in Accounting Information Systems Comparing Several Data Mining Techniques Erjon Zoto Department of Statistics and Applied Informatics Faculty of Economy, University of Tirana Tirana, Albania
Leveraging Ensemble Models in SAS Enterprise Miner
ABSTRACT Paper SAS133-2014 Leveraging Ensemble Models in SAS Enterprise Miner Miguel Maldonado, Jared Dean, Wendy Czika, and Susan Haller SAS Institute Inc. Ensemble models combine two or more models to
Weather forecast prediction: a Data Mining application
Weather forecast prediction: a Data Mining application Ms. Ashwini Mandale, Mrs. Jadhawar B.A. Assistant professor, Dr.Daulatrao Aher College of engg,karad,[email protected],8407974457 Abstract
ANALYSIS OF FEATURE SELECTION WITH CLASSFICATION: BREAST CANCER DATASETS
ANALYSIS OF FEATURE SELECTION WITH CLASSFICATION: BREAST CANCER DATASETS Abstract D.Lavanya * Department of Computer Science, Sri Padmavathi Mahila University Tirupati, Andhra Pradesh, 517501, India [email protected]
Studying Auto Insurance Data
Studying Auto Insurance Data Ashutosh Nandeshwar February 23, 2010 1 Introduction To study auto insurance data using traditional and non-traditional tools, I downloaded a well-studied data from http://www.statsci.org/data/general/motorins.
Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms
Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Scott Pion and Lutz Hamel Abstract This paper presents the results of a series of analyses performed on direct mail
Data, Measurements, Features
Data, Measurements, Features Middle East Technical University Dep. of Computer Engineering 2009 compiled by V. Atalay What do you think of when someone says Data? We might abstract the idea that data are
Decision Trees from large Databases: SLIQ
Decision Trees from large Databases: SLIQ C4.5 often iterates over the training set How often? If the training set does not fit into main memory, swapping makes C4.5 unpractical! SLIQ: Sort the values
Data Mining Practical Machine Learning Tools and Techniques
Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea
Benchmarking Open-Source Tree Learners in R/RWeka
Benchmarking Open-Source Tree Learners in R/RWeka Michael Schauerhuber 1, Achim Zeileis 1, David Meyer 2, Kurt Hornik 1 Department of Statistics and Mathematics 1 Institute for Management Information Systems
A Serial Partitioning Approach to Scaling Graph-Based Knowledge Discovery
A Serial Partitioning Approach to Scaling Graph-Based Knowledge Discovery Runu Rathi, Diane J. Cook, Lawrence B. Holder Department of Computer Science and Engineering The University of Texas at Arlington
D-optimal plans in observational studies
D-optimal plans in observational studies Constanze Pumplün Stefan Rüping Katharina Morik Claus Weihs October 11, 2005 Abstract This paper investigates the use of Design of Experiments in observational
REVIEW OF ENSEMBLE CLASSIFICATION
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IJCSMC, Vol. 2, Issue.
HYBRID PROBABILITY BASED ENSEMBLES FOR BANKRUPTCY PREDICTION
HYBRID PROBABILITY BASED ENSEMBLES FOR BANKRUPTCY PREDICTION Chihli Hung 1, Jing Hong Chen 2, Stefan Wermter 3, 1,2 Department of Management Information Systems, Chung Yuan Christian University, Taiwan
The Predictive Data Mining Revolution in Scorecards:
January 13, 2013 StatSoft White Paper The Predictive Data Mining Revolution in Scorecards: Accurate Risk Scoring via Ensemble Models Summary Predictive modeling methods, based on machine learning algorithms
Comparison of Data Mining Techniques used for Financial Data Analysis
Comparison of Data Mining Techniques used for Financial Data Analysis Abhijit A. Sawant 1, P. M. Chawan 2 1 Student, 2 Associate Professor, Department of Computer Technology, VJTI, Mumbai, INDIA Abstract
L25: Ensemble learning
L25: Ensemble learning Introduction Methods for constructing ensembles Combination strategies Stacked generalization Mixtures of experts Bagging Boosting CSCE 666 Pattern Analysis Ricardo Gutierrez-Osuna
Roulette Sampling for Cost-Sensitive Learning
Roulette Sampling for Cost-Sensitive Learning Victor S. Sheng and Charles X. Ling Department of Computer Science, University of Western Ontario, London, Ontario, Canada N6A 5B7 {ssheng,cling}@csd.uwo.ca
Model Trees for Classification of Hybrid Data Types
Model Trees for Classification of Hybrid Data Types Hsing-Kuo Pao, Shou-Chih Chang, and Yuh-Jye Lee Dept. of Computer Science & Information Engineering, National Taiwan University of Science & Technology,
Getting Even More Out of Ensemble Selection
Getting Even More Out of Ensemble Selection Quan Sun Department of Computer Science The University of Waikato Hamilton, New Zealand [email protected] ABSTRACT Ensemble Selection uses forward stepwise
Data Quality Mining: Employing Classifiers for Assuring consistent Datasets
Data Quality Mining: Employing Classifiers for Assuring consistent Datasets Fabian Grüning Carl von Ossietzky Universität Oldenburg, Germany, [email protected] Abstract: Independent
How To Identify A Churner
2012 45th Hawaii International Conference on System Sciences A New Ensemble Model for Efficient Churn Prediction in Mobile Telecommunication Namhyoung Kim, Jaewook Lee Department of Industrial and Management
Introducing diversity among the models of multi-label classification ensemble
Introducing diversity among the models of multi-label classification ensemble Lena Chekina, Lior Rokach and Bracha Shapira Ben-Gurion University of the Negev Dept. of Information Systems Engineering and
On Cross-Validation and Stacking: Building seemingly predictive models on random data
On Cross-Validation and Stacking: Building seemingly predictive models on random data ABSTRACT Claudia Perlich Media6 New York, NY 10012 [email protected] A number of times when using cross-validation
II. RELATED WORK. Sentiment Mining
Sentiment Mining Using Ensemble Classification Models Matthew Whitehead and Larry Yaeger Indiana University School of Informatics 901 E. 10th St. Bloomington, IN 47408 {mewhiteh, larryy}@indiana.edu Abstract
Fine Particulate Matter Concentration Level Prediction by using Tree-based Ensemble Classification Algorithms
Fine Particulate Matter Concentration Level Prediction by using Tree-based Ensemble Classification Algorithms Yin Zhao School of Mathematical Sciences Universiti Sains Malaysia (USM) Penang, Malaysia Yahya
Selecting Data Mining Model for Web Advertising in Virtual Communities
Selecting Data Mining for Web Advertising in Virtual Communities Jerzy Surma Faculty of Business Administration Warsaw School of Economics Warsaw, Poland e-mail: [email protected] Mariusz Łapczyński
Using multiple models: Bagging, Boosting, Ensembles, Forests
Using multiple models: Bagging, Boosting, Ensembles, Forests Bagging Combining predictions from multiple models Different models obtained from bootstrap samples of training data Average predictions or
Classification of Bad Accounts in Credit Card Industry
Classification of Bad Accounts in Credit Card Industry Chengwei Yuan December 12, 2014 Introduction Risk management is critical for a credit card company to survive in such competing industry. In addition
Ensemble Data Mining Methods
Ensemble Data Mining Methods Nikunj C. Oza, Ph.D., NASA Ames Research Center, USA INTRODUCTION Ensemble Data Mining Methods, also known as Committee Methods or Model Combiners, are machine learning methods
Experiments in Web Page Classification for Semantic Web
Experiments in Web Page Classification for Semantic Web Asad Satti, Nick Cercone, Vlado Kešelj Faculty of Computer Science, Dalhousie University E-mail: {rashid,nick,vlado}@cs.dal.ca Abstract We address
Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets
Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets http://info.salford-systems.com/jsm-2015-ctw August 2015 Salford Systems Course Outline Demonstration of two classification
Knowledge Discovery and Data Mining
Knowledge Discovery and Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Evaluating the Accuracy of a Classifier Holdout, random subsampling, crossvalidation, and the bootstrap are common techniques for
Visualizing class probability estimators
Visualizing class probability estimators Eibe Frank and Mark Hall Department of Computer Science University of Waikato Hamilton, New Zealand {eibe, mhall}@cs.waikato.ac.nz Abstract. Inducing classifiers
Data Mining Methods: Applications for Institutional Research
Data Mining Methods: Applications for Institutional Research Nora Galambos, PhD Office of Institutional Research, Planning & Effectiveness Stony Brook University NEAIR Annual Conference Philadelphia 2014
CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore.
CI6227: Data Mining Lesson 11b: Ensemble Learning Sinno Jialin PAN Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore Acknowledgements: slides are adapted from the lecture notes
THE HYBRID CART-LOGIT MODEL IN CLASSIFICATION AND DATA MINING. Dan Steinberg and N. Scott Cardell
THE HYBID CAT-LOGIT MODEL IN CLASSIFICATION AND DATA MINING Introduction Dan Steinberg and N. Scott Cardell Most data-mining projects involve classification problems assigning objects to classes whether
Classification and Prediction
Classification and Prediction Slides for Data Mining: Concepts and Techniques Chapter 7 Jiawei Han and Micheline Kamber Intelligent Database Systems Research Lab School of Computing Science Simon Fraser
Social Media Mining. Data Mining Essentials
Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers
Azure Machine Learning, SQL Data Mining and R
Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:
EXPLORING & MODELING USING INTERACTIVE DECISION TREES IN SAS ENTERPRISE MINER. Copyr i g ht 2013, SAS Ins titut e Inc. All rights res er ve d.
EXPLORING & MODELING USING INTERACTIVE DECISION TREES IN SAS ENTERPRISE MINER ANALYTICS LIFECYCLE Evaluate & Monitor Model Formulate Problem Data Preparation Deploy Model Data Exploration Validate Models
EMPIRICAL STUDY ON SELECTION OF TEAM MEMBERS FOR SOFTWARE PROJECTS DATA MINING APPROACH
EMPIRICAL STUDY ON SELECTION OF TEAM MEMBERS FOR SOFTWARE PROJECTS DATA MINING APPROACH SANGITA GUPTA 1, SUMA. V. 2 1 Jain University, Bangalore 2 Dayanada Sagar Institute, Bangalore, India Abstract- One
Predicting the Risk of Heart Attacks using Neural Network and Decision Tree
Predicting the Risk of Heart Attacks using Neural Network and Decision Tree S.Florence 1, N.G.Bhuvaneswari Amma 2, G.Annapoorani 3, K.Malathi 4 PG Scholar, Indian Institute of Information Technology, Srirangam,
Generalizing Random Forests Principles to other Methods: Random MultiNomial Logit, Random Naive Bayes, Anita Prinzie & Dirk Van den Poel
Generalizing Random Forests Principles to other Methods: Random MultiNomial Logit, Random Naive Bayes, Anita Prinzie & Dirk Van den Poel Copyright 2008 All rights reserved. Random Forests Forest of decision
Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05
Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification
The Scientific Data Mining Process
Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In
MHI3000 Big Data Analytics for Health Care Final Project Report
MHI3000 Big Data Analytics for Health Care Final Project Report Zhongtian Fred Qiu (1002274530) http://gallery.azureml.net/details/81ddb2ab137046d4925584b5095ec7aa 1. Data pre-processing The data given
Data Mining for Knowledge Management. Classification
1 Data Mining for Knowledge Management Classification Themis Palpanas University of Trento http://disi.unitn.eu/~themis Data Mining for Knowledge Management 1 Thanks for slides to: Jiawei Han Eamonn Keogh
How To Solve The Kd Cup 2010 Challenge
A Lightweight Solution to the Educational Data Mining Challenge Kun Liu Yan Xing Faculty of Automation Guangdong University of Technology Guangzhou, 510090, China [email protected] [email protected]
REPORT DOCUMENTATION PAGE
REPORT DOCUMENTATION PAGE Form Approved OMB NO. 0704-0188 Public Reporting burden for this collection of information is estimated to average 1 hour per response, including the time for reviewing instructions,
Predictive Modeling for Collections of Accounts Receivable Sai Zeng IBM T.J. Watson Research Center Hawthorne, NY, 10523. Abstract
Paper Submission for ACM SIGKDD Workshop on Domain Driven Data Mining (DDDM2007) Predictive Modeling for Collections of Accounts Receivable Sai Zeng [email protected] Prem Melville Yorktown Heights, NY,
Better credit models benefit us all
Better credit models benefit us all Agenda Credit Scoring - Overview Random Forest - Overview Random Forest outperform logistic regression for credit scoring out of the box Interaction term hypothesis
A NEW DECISION TREE METHOD FOR DATA MINING IN MEDICINE
A NEW DECISION TREE METHOD FOR DATA MINING IN MEDICINE Kasra Madadipouya 1 1 Department of Computing and Science, Asia Pacific University of Technology & Innovation ABSTRACT Today, enormous amount of data
CLUSTERING AND PREDICTIVE MODELING: AN ENSEMBLE APPROACH
CLUSTERING AND PREDICTIVE MODELING: AN ENSEMBLE APPROACH Except where reference is made to the work of others, the work described in this thesis is my own or was done in collaboration with my advisory
STATISTICA. Financial Institutions. Case Study: Credit Scoring. and
Financial Institutions and STATISTICA Case Study: Credit Scoring STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table of Contents INTRODUCTION: WHAT
International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015
RESEARCH ARTICLE OPEN ACCESS Data Mining Technology for Efficient Network Security Management Ankit Naik [1], S.W. Ahmad [2] Student [1], Assistant Professor [2] Department of Computer Science and Engineering
The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2
2nd International Conference on Advances in Mechanical Engineering and Industrial Informatics (AMEII 2016) The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2 1 School of
BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts
BIDM Project Predicting the contract type for IT/ITES outsourcing contracts N a n d i n i G o v i n d a r a j a n ( 6 1 2 1 0 5 5 6 ) The authors believe that data modelling can be used to predict if an
An innovative application of a constrained-syntax genetic programming system to the problem of predicting survival of patients
An innovative application of a constrained-syntax genetic programming system to the problem of predicting survival of patients Celia C. Bojarczuk 1, Heitor S. Lopes 2 and Alex A. Freitas 3 1 Departamento
EFFICIENCY OF DECISION TREES IN PREDICTING STUDENT S ACADEMIC PERFORMANCE
EFFICIENCY OF DECISION TREES IN PREDICTING STUDENT S ACADEMIC PERFORMANCE S. Anupama Kumar 1 and Dr. Vijayalakshmi M.N 2 1 Research Scholar, PRIST University, 1 Assistant Professor, Dept of M.C.A. 2 Associate
Consolidated Tree Classifier Learning in a Car Insurance Fraud Detection Domain with Class Imbalance
Consolidated Tree Classifier Learning in a Car Insurance Fraud Detection Domain with Class Imbalance Jesús M. Pérez, Javier Muguerza, Olatz Arbelaitz, Ibai Gurrutxaga, and José I. Martín Dept. of Computer
Impact of Boolean factorization as preprocessing methods for classification of Boolean data
Impact of Boolean factorization as preprocessing methods for classification of Boolean data Radim Belohlavek, Jan Outrata, Martin Trnecka Data Analysis and Modeling Lab (DAMOL) Dept. Computer Science,
Benchmarking of different classes of models used for credit scoring
Benchmarking of different classes of models used for credit scoring We use this competition as an opportunity to compare the performance of different classes of predictive models. In particular we want
Analyzing PETs on Imbalanced Datasets When Training and Testing Class Distributions Differ
Analyzing PETs on Imbalanced Datasets When Training and Testing Class Distributions Differ David Cieslak and Nitesh Chawla University of Notre Dame, Notre Dame IN 46556, USA {dcieslak,nchawla}@cse.nd.edu
Predicting required bandwidth for educational institutes using prediction techniques in data mining (Case Study: Qom Payame Noor University)
260 IJCSNS International Journal of Computer Science and Network Security, VOL.11 No.6, June 2011 Predicting required bandwidth for educational institutes using prediction techniques in data mining (Case
COMP3420: Advanced Databases and Data Mining. Classification and prediction: Introduction and Decision Tree Induction
COMP3420: Advanced Databases and Data Mining Classification and prediction: Introduction and Decision Tree Induction Lecture outline Classification versus prediction Classification A two step process Supervised
Using Random Forest to Learn Imbalanced Data
Using Random Forest to Learn Imbalanced Data Chao Chen, [email protected] Department of Statistics,UC Berkeley Andy Liaw, andy [email protected] Biometrics Research,Merck Research Labs Leo Breiman,
Analysis of WEKA Data Mining Algorithm REPTree, Simple Cart and RandomTree for Classification of Indian News
Analysis of WEKA Data Mining Algorithm REPTree, Simple Cart and RandomTree for Classification of Indian News Sushilkumar Kalmegh Associate Professor, Department of Computer Science, Sant Gadge Baba Amravati
Random Forest Based Imbalanced Data Cleaning and Classification
Random Forest Based Imbalanced Data Cleaning and Classification Jie Gu Software School of Tsinghua University, China Abstract. The given task of PAKDD 2007 data mining competition is a typical problem
An Overview of Knowledge Discovery Database and Data mining Techniques
An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,
Logistic Model Trees
Logistic Model Trees Niels Landwehr 1,2, Mark Hall 2, and Eibe Frank 2 1 Department of Computer Science University of Freiburg Freiburg, Germany [email protected] 2 Department of Computer
Introduction to Machine Learning and Data Mining. Prof. Dr. Igor Trajkovski [email protected]
Introduction to Machine Learning and Data Mining Prof. Dr. Igor Trajkovski [email protected] Ensembles 2 Learning Ensembles Learn multiple alternative definitions of a concept using different training
Using Data Mining for Mobile Communication Clustering and Characterization
Using Data Mining for Mobile Communication Clustering and Characterization A. Bascacov *, C. Cernazanu ** and M. Marcu ** * Lasting Software, Timisoara, Romania ** Politehnica University of Timisoara/Computer
Identifying At-Risk Students Using Machine Learning Techniques: A Case Study with IS 100
Identifying At-Risk Students Using Machine Learning Techniques: A Case Study with IS 100 Erkan Er Abstract In this paper, a model for predicting students performance levels is proposed which employs three
Predicting Student Performance by Using Data Mining Methods for Classification
BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 13, No 1 Sofia 2013 Print ISSN: 1311-9702; Online ISSN: 1314-4081 DOI: 10.2478/cait-2013-0006 Predicting Student Performance
Practical Data Science with Azure Machine Learning, SQL Data Mining, and R
Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be
Mining Wiki Usage Data for Predicting Final Grades of Students
Mining Wiki Usage Data for Predicting Final Grades of Students Gökhan Akçapınar, Erdal Coşgun, Arif Altun Hacettepe University [email protected], [email protected], [email protected]
Classification On The Clouds Using MapReduce
Classification On The Clouds Using MapReduce Simão Martins Instituto Superior Técnico Lisbon, Portugal [email protected] Cláudia Antunes Instituto Superior Técnico Lisbon, Portugal [email protected]
Graph Mining and Social Network Analysis
Graph Mining and Social Network Analysis Data Mining and Text Mining (UIC 583 @ Politecnico di Milano) References Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", The Morgan Kaufmann
How To Make A Credit Risk Model For A Bank Account
TRANSACTIONAL DATA MINING AT LLOYDS BANKING GROUP Csaba Főző [email protected] 15 October 2015 CONTENTS Introduction 04 Random Forest Methodology 06 Transactional Data Mining Project 17 Conclusions
