Classification and Regression Tree Analysis
Technical Report No. 1
May 8, 2014

Classification and Regression Tree Analysis

Jake Morgan ([email protected])

This paper was published in fulfillment of the requirements for PM931 Directed Study in Health Policy and Management under Professor Cindy Christiansen's ([email protected]) direction. Michal Horný, Marina Soley Bori, Meng-Yun Lin, and Kyung Min Lee provided helpful reviews and comments.
Contents

- Executive summary
- 1 Background and Application
  - 1.1 CART in Public Health
- 2 Theory and Implementation
  - 2.1 Implementation in R
  - 2.2 An Applied Example
  - 2.3 Stata and SAS in Brief
- 3 Extensions: Bagging and Boosting
- 4 Required Reading
  - 4.1 Introduction to CART
  - 4.2 Advanced Readings
  - 4.3 Implementation References
- References
- Glossary
Executive summary

Classification and Regression Tree (CART) analysis is a simple yet powerful analytic tool that helps determine the most important variables in a particular dataset (based on explanatory power), and can help researchers craft a potent explanatory model. This technical report outlines a brief background of CART, followed by practical applications and suggested implementation strategies in R. A short example demonstrates the potential usefulness of CART in a research setting.

Keywords: CART; public health; regression tree; classification
1 Background and Application

As computing power and statistical insight have grown, increasingly complex and detailed regression techniques have emerged to analyze data. While this expanding set of techniques has proved beneficial in properly modeling certain data, it has also increased the burden on statistical practitioners in choosing appropriate techniques. Arguably an even heavier burden has been placed on non-statistician health practitioners in university, government, and private sectors, where statistical software allows for immediate implementation of complex regression techniques without interpretation or guidance. In response to this growing complexity, a simple tree system, Classification and Regression Tree (CART) analysis, has become increasingly popular, and is particularly valuable in multidisciplinary fields.

1.1 CART in Public Health

CART can statistically demonstrate which factors are particularly important in a model or relationship, in terms of explanatory power and variance. This process is mathematically identical to certain familiar regression techniques, but presents the data in a way that is easily interpreted by those not well versed in statistical analysis. In this way, CART presents a sophisticated snapshot of the relationships among variables in the data, and can be used as a first step in constructing an informative model or as a final visualization of important associations. In a large public health project, statisticians can use CART to present preliminary data to clinicians or other project stakeholders, who can comment on the statistical results with practice knowledge and intuition. This process of reconciling the clinical and statistical relevance of variables in the data ultimately yields a better-informed and more statistically informative model than either a singularly clinical or a singularly statistical approach.
For example, complex regression models are routinely presented in the economics literature with little introduction or explanation, as the audience is familiar with these techniques and is more interested in the specific application of a particular technique or an unexpected result. In public health, however, this method of presentation is not motivating for practitioners who lack statistical expertise and who need to know the mechanism of the health effect in order to determine clinical relevance or craft an effective intervention. On the other hand, if the data were explained purely by narrative description and anecdote, or if variables with statistically significant explanatory power were excluded without reason, the analysis would be interpreted as lacking scientific rigor. The benefit of CART is to visually bridge interpretation and statistical rigor and to facilitate relevant and valid model design.

While CART may be unfamiliar in some areas of public health, the concept is firmly rooted in health practice, particularly in epidemiological and clinical settings. The seminal work on classification and regression trees was published in book form by Breiman, Friedman, Olshen, and Stone in 1984 under the informative title Classification and Regression Trees, and they open the text with an example from clinical practice seeking to identify high-risk patients within 24 hours of hospital admission for a myocardial infarction [10]. This example proved to be particularly relevant: hundreds of studies have since emerged using CART analysis in clinical settings investigating myocardial infarctions (over 300 results in PubMed, over 700 in Google Scholar). Often, these studies have many
confounding variables which independently may not strongly predict a given outcome, such as heart attack, but together are important. CART analysis can guide medical researchers in isolating which of these variables is most important as a potential site of intervention. Public health researchers, especially those studying behavioral factors, often have some intuition regarding the most important predictors, which may explain why the method has often been absent from the public health literature. One review of CART in public health offers a much more pessimistic rationale for the lack of use: "a general lack of awareness of the utility of CART procedures and, among those who are aware of these procedures, an uncertainty concerning their statistical properties" [6]. This report will highlight some of the possible CART procedures and present their statistical properties in a way that is helpful to public health researchers.

2 Theory and Implementation

The statistical processes behind classification and regression in tree analysis are very similar, as we will see, but it is important to clearly distinguish the two. For a response variable which has classes, often 0/1 binary, we want to organize the dataset into groups by the response variable classification. When our response variable is instead numeric or continuous, we wish to use the data to predict the outcome, and will use regression trees in this situation. A richer mathematical explanation follows, but, essentially, a classification tree splits the data based on homogeneity: grouping similar data and filtering out the noise makes each group more "pure," hence the concept of a purity criterion. In the case where the response variable does not have classes, a regression model is fit to each of the independent variables, isolating these variables as nodes where their inclusion decreases error. A useful visualization of this process, created by Majid (2013), is Figure 1.
Figure 1: The Structure of CART

The premise of our investigation is fairly simple: given factors $x_1, x_2, x_3, \dots, x_n$ in the domain $X$, we want to predict the outcome of interest, $Y$. (Technically, classification under CART has a binary response variable, and a variation on the algorithm called C4.5 is used for multi-category variables; this distinction is not of supreme importance in this discussion.) In Figure 1, the graphic is the
domain of all factors associated with our $Y$, in descending order of importance. In traditional regression models, linear or polynomial, we develop a single equation (or model) to represent the entire data set. CART is an alternative approach, in which the data space is partitioned into smaller sections where variable interactions are clearer. CART analysis uses this recursive partitioning to create a tree where each node T in Figure 1 represents a cell of the partition. Each cell has attached to it a simplified model which applies to that cell only, and it is useful to draw an analogy here to conditional modeling: as we move down the nodes, or leaves, of the tree, we are conditioning on a certain variable, perhaps patient age or the presence of a particular comorbidity. The final split or node is sometimes referred to as a leaf. In Figure 1, A, B, and C are each terminal nodes (leaves), implying that after this split, further splitting of the data does not explain enough of the variance to be relevant in describing $Y$.

Using the notation from the Encyclopedia of Statistics in Quality and Reliability [7], we can represent this process mathematically. Recalling that we want to find a function $d(x)$ mapping our domain $X$ to our response variable $Y$, we assume the existence of a sample of $n$ observations $L = \{(x_1, y_1), \dots, (x_n, y_n)\}$. As is standard in regression, our criterion for choosing $d(x)$ will be the mean squared prediction error $E\{d(x) - E(y \mid x)\}^2$, or the expected misclassification cost in the case of the classification tree. For each leaf node $l$ with $c$ training samples in the regression tree, our model is just $\hat{y}_l = \frac{1}{c} \sum_{i=1}^{c} y_i$, the sample mean of the response variable in that cell [12], which creates a piecewise constant model.
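The piecewise-constant model is easy to see in miniature. The report's examples use R's tree(); the sketch below uses Python and invented numbers purely to show the arithmetic (the function names leaf_means and predict are hypothetical): a single split partitions the sample into two cells, and each cell predicts the sample mean of its own responses.

```python
# The piecewise-constant model in miniature, in Python rather than the
# report's R (data invented). One split partitions the sample into two
# cells; each cell's prediction is the sample mean of its own responses.

def leaf_means(xs, ys, threshold):
    """Fit one split at `threshold`: each leaf predicts its mean response."""
    left = [y for x, y in zip(xs, ys) if x < threshold]
    right = [y for x, y in zip(xs, ys) if x >= threshold]
    return sum(left) / len(left), sum(right) / len(right)

def predict(x, threshold, left_mean, right_mean):
    """Piecewise-constant prediction: drop x into its cell, return the mean."""
    return left_mean if x < threshold else right_mean

xs = [1, 2, 3, 10, 11, 12]               # a single predictor
ys = [0.2, 0.3, 0.1, 0.9, 1.0, 0.8]      # a continuous response
lm, rm = leaf_means(xs, ys, threshold=5)
print(round(lm, 3), round(rm, 3))        # 0.2 0.9
```

A real tree simply repeats this split-and-average step recursively within each cell.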
In the case of the classification tree with leaf node $l$, training sample $c$, and $p(c \mid l)$ the probability that an observation in leaf $l$ belongs to class $c$, the Gini index node impurity criterion $1 - \sum_{c=1}^{C} p^2(c \mid l)$ defines the node splits, where each split maximizes the decrease in impurity (entropy measures are functionally equivalent). Whether using classification or regression, reducing error, in classification or in prediction respectively, is the principal statistical mantra driving CART.

2.1 Implementation in R

Implementation of CART is primarily carried out through the R statistical software and is fairly easy to execute. The two families of commands used for CART are tree and rpart. tree is the simpler of the commands, and will be the focus of the applied examples in this report due to ease of use. The benefit of rpart is that it is user-extensible (that is, R is designed so that any user can expand or add to its capabilities with the proper programming knowledge, a desired commodity in the R community) and can be modified to incorporate surrogate variables more effectively. Additionally, tree has a useful option to more easily compare deviance to GLM or GAM models, which is more difficult in the rpart formulation. The tree command is summarized as follows (adapted from the R help files):

tree(formula, data, weights, subset,
     na.action = na.pass, control = tree.control(nobs, ...),
     method = "recursive.partition",
     split = c("deviance", "gini"),
     model = FALSE, x = FALSE, y = TRUE, wts = TRUE, ...)

where formula defines the classification or regression model to fit, in the form Y ~ x1 + ... + xn, and data refers to the dataset to use. This level of specification is adequate for many simple tree analyses, but it is clear that the other options allow tree to be a powerful tool. weights allows a user to assign a vector of non-negative observational weights, subset allows a subset of the data to be used, na.action is the missing-data handler (the default, na.pass, passes missing values through), control allows users to manually determine certain aspects of the fit, method allows users to assign other CART methods, where the default is the previously discussed recursive.partition, split allows users to define alternative node-splitting criteria, and model can be used to define the numerical node endpoint, for example labeling a binary response of y as True. rpart is similar, with the general command structure:

rpart(formula, data, weights, subset,
      na.action = na.rpart, method,
      model = FALSE, x = FALSE, y = TRUE,
      parms, control, cost, ...)

where the definitions are the same as above. The implementation is almost identical, except for subtle shifts in coding. For example, in tree one specifies classification analysis in the formula line by specifying the dependent variable as a factor, i.e.:

tree(factor(y) ~ X, data = mydata)

while in rpart classification is specified in the method option:

rpart(y ~ X, method = "class", data = mydata)

Similarly, the only difference between the regression implementation in tree:

tree(y ~ X, data = mydata)

and in rpart:

rpart(y ~ X, method = "anova", data = mydata)

is removing the factor specification from tree and changing the method option in rpart to "anova". These methods are easily applied to a sample data set.
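Before turning to the applied example, the Gini splitting rule from Section 2 can be made concrete. This is an illustrative sketch in Python rather than the report's R (tree's split = "gini" option performs the analogous computation internally); the data and the function names gini and split_gain are invented for illustration.

```python
# The Gini splitting rule sketched by hand. For a candidate split, CART
# compares the parent node's impurity, 1 - sum_c p(c|l)^2, with the
# weighted impurities of the two child nodes, and prefers the split
# with the largest decrease in impurity.

def gini(labels):
    """Gini impurity of a node: 1 minus the sum of squared class shares."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def split_gain(xs, ys, threshold):
    """Decrease in impurity from splitting the node at `threshold`."""
    left = [y for x, y in zip(xs, ys) if x < threshold]
    right = [y for x, y in zip(xs, ys) if x >= threshold]
    n = len(ys)
    return gini(ys) - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)

xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]          # perfectly separable at 3.5
print(gini(ys))                  # 0.5: a balanced binary node
print(split_gain(xs, ys, 3.5))   # 0.5: this split removes all impurity
```

In practice CART evaluates this gain for every candidate variable and threshold and splits on the best one, then repeats the process within each child node.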
2.2 An Applied Example

Using a dataset of patient visits for upper respiratory infection at Boston Medical Center, we can demonstrate the potential benefits of using CART analysis. Consider the issue of overprescribing of antibiotics for viral respiratory infections. A natural question concerns the factors associated with a physician's rate of prescribing: if we were to construct an intervention against high prescribing, where might be a good location? To answer this question, the dataset provides each physician's prescribing rate, a continuous variable, so we will implement the tree function in R:
Regression Tree:

mytree <- tree(prescribing_rate ~ patient_gender + physician_specialty +
    patient_ethnicity + patient_insurance + comorbidites, data = mydata)

The command summary(mytree) gives us the relevant statistical results:

Regression tree:
tree(formula = prescribing_rate ~ patient_gender + physician_specialty +
    patient_ethnicity + patient_insurance + comorbidites, data = mydata)
Variables actually used in tree construction:
[1] "private_insurance" "black_race"
Number of terminal nodes: 3
Residual mean deviance: = / 6864
Distribution of residuals:
   Min. 1st Qu. Median Mean 3rd Qu. Max.

And the associated visualization in Figure 2:

Figure 2: Regression Tree Analysis

Here, CART reveals that the most important factors associated with higher prescribing rates are the percent of privately insured patients a physician has, the percent of black patients a physician has, and whether or not the physician specializes in family medicine. For each node, the right branch of the node is conditional on the node being true, and the left is
conditional on the node being false. For example, the first node is the proportion of the physician's patients who are privately insured, so the right branch contains only those data with privately insured patients, and the left includes patients without private insurance. Conditional on the private insurance proportion, physician specialty in family medicine or the race mix of patients becomes important. The numbers at the bottom of the terminal branches indicate the mean prescribing rate in each data subset. For example, privately insured and black patient populations had a 48% average physician rate of prescribing, while visits without private insurance and with a family medicine doctor were associated with a 54% average physician prescribing rate. Both terminal branches conditional on the presence of private insurance are higher than those associated with other types of (often public) insurance, which could indicate that prescriptions are easier to cover under a private plan, or might indicate different rates of demand among those who hold private insurance. Among those without private insurance, the presence of a family medicine doctor raises the average provider prescribing rate by approximately 10%, indicating that family medicine doctors systematically prescribe more antibiotics than non-family medicine doctors. The summary of the model corresponds to a root mean square of 0.16 (16%), which indicates that our model is actually pretty good given its simplicity, suggesting that the nodes identified explain a good deal of the variation in prescribing rates. This also helps to explain why our tree has only two levels: most of the variation has been described by this relationship. Our regression tree analysis suggested that family medicine specialists may prescribe systematically differently than other specialties.
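The "root mean square" quoted above is just the square root of the residual mean deviance (the residual sum of squares divided by its degrees of freedom). A quick sanity check of that relation, using a hypothetical mean deviance value chosen to reproduce the reported 16%:

```python
# Sanity check of the deviance-to-root-mean-square relation used above.
# The mean deviance value here (0.0256) is hypothetical, chosen so that
# the root mean square works out to the 16% quoted in the text; the
# degrees of freedom (6864) are as reported by summary(mytree).
import math

df = 6864
residual_mean_deviance = 0.0256        # hypothetical: total deviance / df
rms = math.sqrt(residual_mean_deviance)
print(round(rms, 2))  # 0.16
```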
We may want to know what type of patients choose to see a family medicine doctor, to understand the degree to which patient characteristics may be driving the difference among family medicine doctors. For this question, we can use a classification tree based on the binary response of whether or not a physician is a family medicine practitioner. Figure 3 depicts this relationship.

myclass <- tree(family_med ~ patient_gender + patient_age +
    patient_ethnicity + patient_insurance + comorbidites, data = mydata)

The command summary(myclass) gives us the relevant statistical results:

Classification tree:
tree(formula = family_med ~ patient_gender + patient_age +
    patient_ethnicity + patient_insurance + comorbidites, data = mydata)
Variables actually used in tree construction:
[1] "black" "private_insurance" "patient_over_45"
Number of terminal nodes: 4
Residual mean deviance: 1.054 = 7234 / 6863
Misclassification error rate: 0.2403 = 1650 / 6867

Figure 3 indicates that the most important predictors of choosing a family medicine doctor are a patient's race, age, and insurance status. The largest patient share of family medicine doctors is young, black patients, among whom 42% have a family medicine physician. Among black patients over 45, that number is almost halved, at 24%. Among non-black patients, those without private health insurance are more likely to choose a family medicine practitioner than those with private insurance. These results are somewhat more difficult to interpret because of the nature of the family medicine doctor. For example, we might expect more
Figure 3: Classification Tree Analysis

young people to see family medicine doctors (family medicine treats children) if they have not been motivated to seek a different PCP. High prescribing among family medicine doctors is consistent with the published literature, which suggests that the broader training associated with family medicine (compared to specialized internal medicine practitioners) may lead to more liberal prescribing habits. The interventions suggested by our classification diagram may include more rigorous training programs during family medicine residency, or some sort of continuing education program. It might also be important to make sure patients are connected with an appropriate PCP and develop closer connections to care. While a classification tree analysis alone may not be convincing enough to implement certain interventions, this exercise is hypothesis-generating, suggesting further areas of empirical research. The misclassification error rate here is 24%, which, while slightly high, is not uncommon in published studies using classification trees.

2.3 Stata and SAS in Brief

Review of the literature reveals that R is by far the most widely used statistical implementation of CART. The reasons for this are varied: R is freely available and user-modifiable, so the constantly changing landscape of CART is quickly rolled out as a
program extension; R was the first widely available program to handle CART, so it has the most robust collection of troubleshooting guides and technical support; and the implementation is simple: three lines of code will return a robust CART analysis. Increasingly, however, implementations for Stata and SAS are emerging, although these programs are often long and cumbersome. For SAS, implementation must occur in a separate module, SAS Enterprise Miner (EM) [17]. EM is driven by a point-and-click interface and a digital workspace where users drag nodes from a toolbar to specify splitting criteria, partition rules, and other program rules. The program then implements CART to statistically grow a tree. For some, the simple interface and colorful flowchart of decision-making rules may make the programming of CART more intuitive and clear. EM, however, is not included with the base version of SAS; the desktop module costs around $20,000 for a single-seat license [18]. Stata has the capability to implement a form of CART for failure data [19], but does not have the ability to run CART on cross-sectional data, as is widely popular. Nick Cox, lecturer at Durham University and one of the most prolific and well-known Stata programmers, noted in 2010 that "in general, the scope for this kind [CART] of work in Stata falls much below that in other software such as R" [20], and little has changed since then. The reference section provides the relevant literature for interested readers.

3 Extensions: Bagging and Boosting

There are many slight variations of CART analysis, based on almost any data-structure permutation we could imagine. If you decide to apply CART in your own work, Google and R web forums will prove very helpful in determining which slight adjustments need to be made, now that you are familiar with the basic structure and terminology associated with the procedure. Here it is appropriate to discuss two commonly used variations: bagging and boosting.
Bagging was developed by Breiman shortly after his seminal work defining the field, and has gained traction following the increased interest in bootstrapping and similar procedures in statistical analysis; it is useful to think of it as bootstrapping for tree analysis (the other common bootstrapping procedure for trees, the so-called random forest, is similar and will not be covered in depth here). The name derives from "bootstrap aggregating," and the method involves creating multiple similar datasets, re-running the tree analysis on each, and then collecting the aggregate of the results and re-calculating the tree and associated statistics based on this aggregate. This technique is often used as cross-validation for larger trees a user wishes to prune, and where different versions of the same tree have vastly different rates of misclassification. In general, the procedure will improve the results of a highly unstable tree but may decrease the performance of a stable tree. Using the package ipred, the procedure is easy to implement, and we briefly present it here using the data from our applied example:

mybag <- bagging(family_med ~ patient_gender + patient_age +
    patient_ethnicity + patient_insurance + comorbidites,
    data = mydata, nbagg = 30, coob = TRUE)
where nbagg specifies that the procedure will create 30 bootstrap datasets to aggregate, and coob = TRUE requests an out-of-bag estimate of the misclassification error. Calling the command mybag$err will return the new misclassification error, in our case 23.9%, an unimpressive 0.1-percentage-point decrease in misclassification.

Boosting is a technique that seeks to reduce misclassification via a recursive tree model. Here, classifiers are iteratively created from weighted versions of the sample, where the weights are adjusted at each iteration based on which cases were misclassified in the previous step: many mini-trees which exhibit continuously decreasing misclassification. This technique is often applied to data which have high misclassification because they are largely uninformative, a "weak learner." In these data sets, classification is only slightly better than a random guess (think misclassification only slightly less than 50%, since a random guess will be correct 50% of the time), because the data are so loosely related to the outcome, perhaps because of confounders [15]. Implementing boosting is done either through a series of packages in R or through third-party software specializing in decision tree analysis. Our results indicate that our data are not weak learners, so we will not implement boosting here; in our case, bagging is much more appropriate. Generally, the data structure will indicate whether boosting or bagging is more appropriate. For more information on boosting in R, consult the adabag package.
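The mechanics of bagging can also be sketched outside of ipred. The Python toy below uses invented data, and the bootstrap resamples are fixed rather than drawn randomly, purely for reproducibility (ipred would draw nbagg of them at random); the names fit_stump and bagged_predict are hypothetical. Each resample gets its own one-split "stump" classifier, and predictions are aggregated by majority vote, which is the aggregation bagging performs for classification.

```python
# Bagging by hand: fit the same simple classifier to several bootstrap
# resamples of the data, then aggregate predictions by majority vote.

xs = [1, 2, 3, 4, 5, 6, 7, 8]        # a single predictor
ys = [0, 0, 0, 0, 1, 1, 1, 1]        # a binary response

def fit_stump(sx, sy):
    """Threshold on x that minimizes training errors; predicts 1 when x >= t."""
    def errors(t):
        return sum((x >= t) != y for x, y in zip(sx, sy))
    return min(sorted(set(sx)), key=lambda t: (errors(t), t))

def bagged_predict(x, stumps):
    """Majority vote over the bagged stumps."""
    votes = sum(x >= t for t in stumps)
    return int(votes * 2 >= len(stumps))

# Three bootstrap resamples (indices drawn with replacement; fixed here).
resamples = [[0, 1, 2, 3, 4, 5, 6, 7],
             [0, 0, 2, 3, 5, 5, 6, 7],
             [1, 2, 2, 4, 4, 6, 7, 7]]
stumps = [fit_stump([xs[i] for i in r], [ys[i] for i in r]) for r in resamples]
print(stumps)                                                # [5, 6, 5]
print(bagged_predict(2, stumps), bagged_predict(7, stumps))  # 0 1
```

Because each resample sees slightly different data, the individual stumps disagree at the margin; averaging their votes is what stabilizes an otherwise unstable tree.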
4 Required Reading

4.1 Introduction to CART

- Simple overview of CART with step-by-step intuition: Loh, Wei-Yin. "Classification and regression tree methods." Encyclopedia of Statistics in Quality and Reliability (2008).
- R implementation with additional tools for robustness analysis: Shalizi, Cosma. "Classification and Regression Trees." Unpublished report on data mining procedures, available at cshalizi/350/lectures/22/lecture-22.pdf
- A beginner's introduction to bagging and boosting: Sutton, Clifton D. "Classification and regression trees, bagging, and boosting." Handbook of Statistics 24 (2005).

4.2 Advanced Readings

- Application to pharmaceuticals: Deconinck, E., et al. "Classification of drugs in absorption classes using the classification and regression trees (CART) methodology." Journal of Pharmaceutical and Biomedical Analysis 39.1 (2005).
- A review of CART in public health: Lemon, Stephenie C., Peter Friedmann, and William Rakowski. "Classification and regression tree analysis in public health: methodological review and comparison with logistic regression." Annals of Behavioral Medicine 26.3 (2003).
- Applications in epidemiology: Marshall, Roger J. "The use of classification and regression trees in clinical epidemiology." Journal of Clinical Epidemiology 54.6 (2001).
- Applied public health example (smoking): Nishida, Nobuko, et al. "Determination of smoking and obesity as periodontitis risks using the classification and regression tree method." Journal of Periodontology 76.6 (2005).
- The classic: Breiman, Leo, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone. Classification and Regression Trees. Wadsworth International Group (1984).

4.3 Implementation References

- SAS: Gordon, Leonard. "Using Classification and Regression Trees (CART) in SAS Enterprise Miner for Applications in Public Health." SAS Global Forum 2013.
- Stata (failure data only, includes code): van Putten, Wim. "CART: Stata module to perform Classification and Regression Tree analysis." Statistical Software Components (2006).
- R: Official help guide for rpart (code and associated materials available online).
References

1. Deconinck, E., et al. "Classification of drugs in absorption classes using the classification and regression trees (CART) methodology." Journal of Pharmaceutical and Biomedical Analysis 39.1 (2005).
2. Friedman, Jerome H., and Jacqueline J. Meulman. "Multiple additive regression trees with application in epidemiology." Statistics in Medicine 22.9 (2003).
3. Hess, Kenneth R., et al. "Classification and regression tree analysis of 1000 consecutive patients with unknown primary carcinoma." Clinical Cancer Research 5.11 (1999).
4. Kramer, Stefan, and Gerhard Widmer. "Inducing classification and regression trees in first order logic." Relational Data Mining. Springer-Verlag New York, Inc.
5. Lawrence, Rick L., and Andrea Wright. "Rule-based classification systems using classification and regression tree (CART) analysis." Photogrammetric Engineering & Remote Sensing (2001).
6. Lemon, Stephenie C., Peter Friedmann, and William Rakowski. "Classification and regression tree analysis in public health: methodological review and comparison with logistic regression." Annals of Behavioral Medicine 26.3 (2003).
7. Loh, Wei-Yin. "Classification and regression tree methods." Encyclopedia of Statistics in Quality and Reliability (2008).
8. Marshall, Roger J. "The use of classification and regression trees in clinical epidemiology." Journal of Clinical Epidemiology 54.6 (2001).
9. Nishida, Nobuko, et al. "Determination of smoking and obesity as periodontitis risks using the classification and regression tree method." Journal of Periodontology 76.6 (2005).
10. Breiman, Leo, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone. Classification and Regression Trees. Wadsworth International Group (1984).
11. Pang, Herbert, et al. "Pathway analysis using random forests classification and regression." Bioinformatics (2006).
12. Shalizi, Cosma. "Classification and Regression Trees." Unpublished report on data mining procedures, available at cshalizi/350/lectures/22/lecture-22.pdf
13. Stark, Katharina D. C., and Dirk U. Pfeiffer.
"The application of non-parametric techniques to solve classification problems in complex data sets in veterinary epidemiology: an example." Intelligent Data Analysis 3.1 (1999).
14. Steinberg, Dan, and Phillip Colla. "CART: classification and regression trees." The Top Ten Algorithms in Data Mining. Chapman & Hall/CRC Data Mining and Knowledge Discovery Series.
15. Sutton, Clifton D. "Classification and regression trees, bagging, and boosting." Handbook of Statistics 24 (2005).
16. Temkin, Nancy R., et al. "Classification and regression trees (CART) for prediction of function at 1 year following head trauma." Journal of Neurosurgery 82.5 (1995).
17. Gordon, Leonard. "Using Classification and Regression Trees (CART) in SAS Enterprise Miner for Applications in Public Health." SAS Global Forum 2013 (2013).
18. Thompson, Wayne. Comment on a LinkedIn SAS forum. Accessed 2013.
19. van Putten, Wim. "CART: Stata module to perform Classification And Regression Tree analysis." Statistical Software Components (2006).
20. Cox, Nicholas J. Response to a Stata listserv question (2010).
Glossary

Gini Impurity Criterion: The Gini index in this context is a measure of misclassification, calculating how often a random element from the data set would be incorrectly labeled if it were labeled randomly; in other words, how much better the classification is than random assignment.

Leaf: Lowest node on the tree, also called a terminal node.

Node: A variable that significantly contributes to model explanatory power (based on the splitting criterion).

Recursive Partitioning: The method by which CART operates, successively dividing the data into smaller portions to isolate the most important variables.

Root: Topmost node on the tree, representing the most important explanatory variable.

Splitting (Purity) Criterion: In CART, some criterion is needed to determine the best way to split the data. While different criteria are available, the Gini (im)purity criterion is the most widely used. In CART, this process of splitting makes the tree more pure.
D-optimal plans in observational studies Constanze Pumplün Stefan Rüping Katharina Morik Claus Weihs October 11, 2005 Abstract This paper investigates the use of Design of Experiments in observational
STATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and
Clustering Techniques and STATISTICA Case Study: Defining Clusters of Shopping Center Patrons STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table
Decision Trees from large Databases: SLIQ
Decision Trees from large Databases: SLIQ C4.5 often iterates over the training set How often? If the training set does not fit into main memory, swapping makes C4.5 unpractical! SLIQ: Sort the values
Social Media Mining. Data Mining Essentials
Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers
Leveraging Ensemble Models in SAS Enterprise Miner
ABSTRACT Paper SAS133-2014 Leveraging Ensemble Models in SAS Enterprise Miner Miguel Maldonado, Jared Dean, Wendy Czika, and Susan Haller SAS Institute Inc. Ensemble models combine two or more models to
Data Mining with R. Decision Trees and Random Forests. Hugh Murrell
Data Mining with R Decision Trees and Random Forests Hugh Murrell reference books These slides are based on a book by Graham Williams: Data Mining with Rattle and R, The Art of Excavating Data for Knowledge
A Comparison of Decision Tree and Logistic Regression Model Xianzhe Chen, North Dakota State University, Fargo, ND
Paper D02-2009 A Comparison of Decision Tree and Logistic Regression Model Xianzhe Chen, North Dakota State University, Fargo, ND ABSTRACT This paper applies a decision tree model and logistic regression
Data Mining - Evaluation of Classifiers
Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010
Didacticiel Études de cas
1 Theme Data Mining with R The rattle package. R (http://www.r project.org/) is one of the most exciting free data mining software projects of these last years. Its popularity is completely justified (see
Benchmarking Open-Source Tree Learners in R/RWeka
Benchmarking Open-Source Tree Learners in R/RWeka Michael Schauerhuber 1, Achim Zeileis 1, David Meyer 2, Kurt Hornik 1 Department of Statistics and Mathematics 1 Institute for Management Information Systems
COMP3420: Advanced Databases and Data Mining. Classification and prediction: Introduction and Decision Tree Induction
COMP3420: Advanced Databases and Data Mining Classification and prediction: Introduction and Decision Tree Induction Lecture outline Classification versus prediction Classification A two step process Supervised
THE RISE OF THE BIG DATA: WHY SHOULD STATISTICIANS EMBRACE COLLABORATIONS WITH COMPUTER SCIENTISTS XIAO CHENG. (Under the Direction of Jeongyoun Ahn)
THE RISE OF THE BIG DATA: WHY SHOULD STATISTICIANS EMBRACE COLLABORATIONS WITH COMPUTER SCIENTISTS by XIAO CHENG (Under the Direction of Jeongyoun Ahn) ABSTRACT Big Data has been the new trend in businesses.
Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05
Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification
BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL
The Fifth International Conference on e-learning (elearning-2014), 22-23 September 2014, Belgrade, Serbia BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL SNJEŽANA MILINKOVIĆ University
Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification
Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification Henrik Boström School of Humanities and Informatics University of Skövde P.O. Box 408, SE-541 28 Skövde
Knowledge Discovery and Data Mining
Knowledge Discovery and Data Mining Unit # 11 Sajjad Haider Fall 2013 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right
Data Mining Classification: Decision Trees
Data Mining Classification: Decision Trees Classification Decision Trees: what they are and how they work Hunt s (TDIDT) algorithm How to select the best split How to handle Inconsistent data Continuous
CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore.
CI6227: Data Mining Lesson 11b: Ensemble Learning Sinno Jialin PAN Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore Acknowledgements: slides are adapted from the lecture notes
Decision Tree Learning on Very Large Data Sets
Decision Tree Learning on Very Large Data Sets Lawrence O. Hall Nitesh Chawla and Kevin W. Bowyer Department of Computer Science and Engineering ENB 8 University of South Florida 4202 E. Fowler Ave. Tampa
Statistical Data Mining. Practical Assignment 3 Discriminant Analysis and Decision Trees
Statistical Data Mining Practical Assignment 3 Discriminant Analysis and Decision Trees In this practical we discuss linear and quadratic discriminant analysis and tree-based classification techniques.
REPORT DOCUMENTATION PAGE
REPORT DOCUMENTATION PAGE Form Approved OMB NO. 0704-0188 Public Reporting burden for this collection of information is estimated to average 1 hour per response, including the time for reviewing instructions,
Learning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal
Learning Example Chapter 18: Learning from Examples 22c:145 An emergency room in a hospital measures 17 variables (e.g., blood pressure, age, etc) of newly admitted patients. A decision is needed: whether
Classification and Regression Trees (CART) Theory and Applications
Classification and Regression Trees (CART) Theory and Applications A Master Thesis Presented by Roman Timofeev (188778) to Prof. Dr. Wolfgang Härdle CASE - Center of Applied Statistics and Economics Humboldt
Classification/Decision Trees (II)
Classification/Decision Trees (II) Department of Statistics The Pennsylvania State University Email: [email protected] Right Sized Trees Let the expected misclassification rate of a tree T be R (T ).
Data mining techniques: decision trees
Data mining techniques: decision trees 1/39 Agenda Rule systems Building rule systems vs rule systems Quick reference 2/39 1 Agenda Rule systems Building rule systems vs rule systems Quick reference 3/39
Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris
Class #6: Non-linear classification ML4Bio 2012 February 17 th, 2012 Quaid Morris 1 Module #: Title of Module 2 Review Overview Linear separability Non-linear classification Linear Support Vector Machines
EXPLORING & MODELING USING INTERACTIVE DECISION TREES IN SAS ENTERPRISE MINER. Copyr i g ht 2013, SAS Ins titut e Inc. All rights res er ve d.
EXPLORING & MODELING USING INTERACTIVE DECISION TREES IN SAS ENTERPRISE MINER ANALYTICS LIFECYCLE Evaluate & Monitor Model Formulate Problem Data Preparation Deploy Model Data Exploration Validate Models
Chapter 12 Bagging and Random Forests
Chapter 12 Bagging and Random Forests Xiaogang Su Department of Statistics and Actuarial Science University of Central Florida - 1 - Outline A brief introduction to the bootstrap Bagging: basic concepts
Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms
Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Scott Pion and Lutz Hamel Abstract This paper presents the results of a series of analyses performed on direct mail
Using multiple models: Bagging, Boosting, Ensembles, Forests
Using multiple models: Bagging, Boosting, Ensembles, Forests Bagging Combining predictions from multiple models Different models obtained from bootstrap samples of training data Average predictions or
Model-Based Recursive Partitioning for Detecting Interaction Effects in Subgroups
Model-Based Recursive Partitioning for Detecting Interaction Effects in Subgroups Achim Zeileis, Torsten Hothorn, Kurt Hornik http://eeecon.uibk.ac.at/~zeileis/ Overview Motivation: Trees, leaves, and
THE HYBRID CART-LOGIT MODEL IN CLASSIFICATION AND DATA MINING. Dan Steinberg and N. Scott Cardell
THE HYBID CAT-LOGIT MODEL IN CLASSIFICATION AND DATA MINING Introduction Dan Steinberg and N. Scott Cardell Most data-mining projects involve classification problems assigning objects to classes whether
Ensemble Data Mining Methods
Ensemble Data Mining Methods Nikunj C. Oza, Ph.D., NASA Ames Research Center, USA INTRODUCTION Ensemble Data Mining Methods, also known as Committee Methods or Model Combiners, are machine learning methods
ENHANCED CONFIDENCE INTERPRETATIONS OF GP BASED ENSEMBLE MODELING RESULTS
ENHANCED CONFIDENCE INTERPRETATIONS OF GP BASED ENSEMBLE MODELING RESULTS Michael Affenzeller (a), Stephan M. Winkler (b), Stefan Forstenlechner (c), Gabriel Kronberger (d), Michael Kommenda (e), Stefan
A Property & Casualty Insurance Predictive Modeling Process in SAS
Paper AA-02-2015 A Property & Casualty Insurance Predictive Modeling Process in SAS 1.0 ABSTRACT Mei Najim, Sedgwick Claim Management Services, Chicago, Illinois Predictive analytics has been developing
Mining Direct Marketing Data by Ensembles of Weak Learners and Rough Set Methods
Mining Direct Marketing Data by Ensembles of Weak Learners and Rough Set Methods Jerzy B laszczyński 1, Krzysztof Dembczyński 1, Wojciech Kot lowski 1, and Mariusz Paw lowski 2 1 Institute of Computing
Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.
Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C
How can you unlock the value in real-world data? A novel approach to predictive analytics could make the difference.
How can you unlock the value in real-world data? A novel approach to predictive analytics could make the difference. What if you could diagnose patients sooner, start treatment earlier, and prevent symptoms
Predictive Data modeling for health care: Comparative performance study of different prediction models
Predictive Data modeling for health care: Comparative performance study of different prediction models Shivanand Hiremath [email protected] National Institute of Industrial Engineering (NITIE) Vihar
STATISTICA Formula Guide: Logistic Regression. Table of Contents
: Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary
Knowledge Discovery and Data Mining
Knowledge Discovery and Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Evaluating the Accuracy of a Classifier Holdout, random subsampling, crossvalidation, and the bootstrap are common techniques for
Classification and Regression by randomforest
Vol. 2/3, December 02 18 Classification and Regression by randomforest Andy Liaw and Matthew Wiener Introduction Recently there has been a lot of interest in ensemble learning methods that generate many
STATISTICA. Financial Institutions. Case Study: Credit Scoring. and
Financial Institutions and STATISTICA Case Study: Credit Scoring STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table of Contents INTRODUCTION: WHAT
REVIEW OF ENSEMBLE CLASSIFICATION
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IJCSMC, Vol. 2, Issue.
Homework Assignment 7
Homework Assignment 7 36-350, Data Mining Solutions 1. Base rates (10 points) (a) What fraction of the e-mails are actually spam? Answer: 39%. > sum(spam$spam=="spam") [1] 1813 > 1813/nrow(spam) [1] 0.3940448
Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012
Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization GENOME 560, Spring 2012 Data are interesting because they help us understand the world Genomics: Massive Amounts
Azure Machine Learning, SQL Data Mining and R
Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:
A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier
A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier G.T. Prasanna Kumari Associate Professor, Dept of Computer Science and Engineering, Gokula Krishna College of Engg, Sullurpet-524121,
Business Intelligence. Tutorial for Rapid Miner (Advanced Decision Tree and CRISP-DM Model with an example of Market Segmentation*)
Business Intelligence Professor Chen NAME: Due Date: Tutorial for Rapid Miner (Advanced Decision Tree and CRISP-DM Model with an example of Market Segmentation*) Tutorial Summary Objective: Richard would
PharmaSUG2011 Paper HS03
PharmaSUG2011 Paper HS03 Using SAS Predictive Modeling to Investigate the Asthma s Patient Future Hospitalization Risk Yehia H. Khalil, University of Louisville, Louisville, KY, US ABSTRACT The focus of
Insurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví. Pavel Kříž. Seminář z aktuárských věd MFF 4.
Insurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví Pavel Kříž Seminář z aktuárských věd MFF 4. dubna 2014 Summary 1. Application areas of Insurance Analytics 2. Insurance Analytics
Why do statisticians "hate" us?
Why do statisticians "hate" us? David Hand, Heikki Mannila, Padhraic Smyth "Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data
Practical Data Science with Azure Machine Learning, SQL Data Mining, and R
Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be
DECISION TREE ANALYSIS: PREDICTION OF SERIOUS TRAFFIC OFFENDING
DECISION TREE ANALYSIS: PREDICTION OF SERIOUS TRAFFIC OFFENDING ABSTRACT The objective was to predict whether an offender would commit a traffic offence involving death, using decision tree analysis. Four
Model Combination. 24 Novembre 2009
Model Combination 24 Novembre 2009 Datamining 1 2009-2010 Plan 1 Principles of model combination 2 Resampling methods Bagging Random Forests Boosting 3 Hybrid methods Stacking Generic algorithm for mulistrategy
TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM
TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM Thanh-Nghi Do College of Information Technology, Cantho University 1 Ly Tu Trong Street, Ninh Kieu District Cantho City, Vietnam
Data Mining Algorithms Part 1. Dejan Sarka
Data Mining Algorithms Part 1 Dejan Sarka Join the conversation on Twitter: @DevWeek #DW2015 Instructor Bio Dejan Sarka ([email protected]) 30 years of experience SQL Server MVP, MCT, 13 books 7+ courses
Data Mining Techniques Chapter 6: Decision Trees
Data Mining Techniques Chapter 6: Decision Trees What is a classification decision tree?.......................................... 2 Visualizing decision trees...................................................
Better credit models benefit us all
Better credit models benefit us all Agenda Credit Scoring - Overview Random Forest - Overview Random Forest outperform logistic regression for credit scoring out of the box Interaction term hypothesis
International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014
RESEARCH ARTICLE OPEN ACCESS A Survey of Data Mining: Concepts with Applications and its Future Scope Dr. Zubair Khan 1, Ashish Kumar 2, Sunny Kumar 3 M.Tech Research Scholar 2. Department of Computer
Paper AA-08-2015. Get the highest bangs for your marketing bucks using Incremental Response Models in SAS Enterprise Miner TM
Paper AA-08-2015 Get the highest bangs for your marketing bucks using Incremental Response Models in SAS Enterprise Miner TM Delali Agbenyegah, Alliance Data Systems, Columbus, Ohio 0.0 ABSTRACT Traditional
ASSIGNMENT 4 PREDICTIVE MODELING AND GAINS CHARTS
DATABASE MARKETING Fall 2015, max 24 credits Dead line 15.10. ASSIGNMENT 4 PREDICTIVE MODELING AND GAINS CHARTS PART A Gains chart with excel Prepare a gains chart from the data in \\work\courses\e\27\e20100\ass4b.xls.
Comparison of Data Mining Techniques used for Financial Data Analysis
Comparison of Data Mining Techniques used for Financial Data Analysis Abhijit A. Sawant 1, P. M. Chawan 2 1 Student, 2 Associate Professor, Department of Computer Technology, VJTI, Mumbai, INDIA Abstract
On the effect of data set size on bias and variance in classification learning
On the effect of data set size on bias and variance in classification learning Abstract Damien Brain Geoffrey I Webb School of Computing and Mathematics Deakin University Geelong Vic 3217 With the advent
ECLT5810 E-Commerce Data Mining Technique SAS Enterprise Miner -- Regression Model I. Regression Node
Enterprise Miner - Regression 1 ECLT5810 E-Commerce Data Mining Technique SAS Enterprise Miner -- Regression Model I. Regression Node 1. Some background: Linear attempts to predict the value of a continuous
The Predictive Data Mining Revolution in Scorecards:
January 13, 2013 StatSoft White Paper The Predictive Data Mining Revolution in Scorecards: Accurate Risk Scoring via Ensemble Models Summary Predictive modeling methods, based on machine learning algorithms
Using Excel for Data Manipulation and Statistical Analysis: How-to s and Cautions
2010 Using Excel for Data Manipulation and Statistical Analysis: How-to s and Cautions This document describes how to perform some basic statistical procedures in Microsoft Excel. Microsoft Excel is spreadsheet
A Comparison of Variable Selection Techniques for Credit Scoring
1 A Comparison of Variable Selection Techniques for Credit Scoring K. Leung and F. Cheong and C. Cheong School of Business Information Technology, RMIT University, Melbourne, Victoria, Australia E-mail:
CONTENTS OF DAY 2. II. Why Random Sampling is Important 9 A myth, an urban legend, and the real reason NOTES FOR SUMMER STATISTICS INSTITUTE COURSE
1 2 CONTENTS OF DAY 2 I. More Precise Definition of Simple Random Sample 3 Connection with independent random variables 3 Problems with small populations 8 II. Why Random Sampling is Important 9 A myth,
A NEW DECISION TREE METHOD FOR DATA MINING IN MEDICINE
A NEW DECISION TREE METHOD FOR DATA MINING IN MEDICINE Kasra Madadipouya 1 1 Department of Computing and Science, Asia Pacific University of Technology & Innovation ABSTRACT Today, enormous amount of data
Data Analysis of Trends in iphone 5 Sales on ebay
Data Analysis of Trends in iphone 5 Sales on ebay By Wenyu Zhang Mentor: Professor David Aldous Contents Pg 1. Introduction 3 2. Data and Analysis 4 2.1 Description of Data 4 2.2 Retrieval of Data 5 2.3
Efficiency in Software Development Projects
Efficiency in Software Development Projects Aneesh Chinubhai Dharmsinh Desai University [email protected] Abstract A number of different factors are thought to influence the efficiency of the software
Data Mining for Knowledge Management. Classification
1 Data Mining for Knowledge Management Classification Themis Palpanas University of Trento http://disi.unitn.eu/~themis Data Mining for Knowledge Management 1 Thanks for slides to: Jiawei Han Eamonn Keogh
Using Excel for Statistical Analysis
2010 Using Excel for Statistical Analysis Microsoft Excel is spreadsheet software that is used to store information in columns and rows, which can then be organized and/or processed. Excel is a powerful
Simple Predictive Analytics Curtis Seare
Using Excel to Solve Business Problems: Simple Predictive Analytics Curtis Seare Copyright: Vault Analytics July 2010 Contents Section I: Background Information Why use Predictive Analytics? How to use
The Data Mining Process
Sequence for Determining Necessary Data. Wrong: Catalog everything you have, and decide what data is important. Right: Work backward from the solution, define the problem explicitly, and map out the data
Data Mining Approaches to Modeling Insurance Risk. Dan Steinberg, Mikhail Golovnya, Scott Cardell. Salford Systems 2009
Data Mining Approaches to Modeling Insurance Risk Dan Steinberg, Mikhail Golovnya, Scott Cardell Salford Systems 2009 Overview of Topics Covered Examples in the Insurance Industry Predicting at the outset
Studying Auto Insurance Data
Studying Auto Insurance Data Ashutosh Nandeshwar February 23, 2010 1 Introduction To study auto insurance data using traditional and non-traditional tools, I downloaded a well-studied data from http://www.statsci.org/data/general/motorins.
S03-2008 The Difference Between Predictive Modeling and Regression Patricia B. Cerrito, University of Louisville, Louisville, KY
S03-2008 The Difference Between Predictive Modeling and Regression Patricia B. Cerrito, University of Louisville, Louisville, KY ABSTRACT Predictive modeling includes regression, both logistic and linear,
Analysis of WEKA Data Mining Algorithm REPTree, Simple Cart and RandomTree for Classification of Indian News
Analysis of WEKA Data Mining Algorithm REPTree, Simple Cart and RandomTree for Classification of Indian News Sushilkumar Kalmegh Associate Professor, Department of Computer Science, Sant Gadge Baba Amravati
2 Decision tree + Cross-validation with R (package rpart)
1 Subject Using cross-validation for the performance evaluation of decision trees with R, KNIME and RAPIDMINER. This paper takes one of our old study on the implementation of cross-validation for assessing
Decision Trees What Are They?
Decision Trees What Are They? Introduction...1 Using Decision Trees with Other Modeling Approaches...5 Why Are Decision Trees So Useful?...8 Level of Measurement... 11 Introduction Decision trees are a
