Machine Learning Algorithms for Predicting Severe Crises of Sickle Cell Disease
Clara Allayous (Département de Biologie, Université des Antilles et de la Guyane)
Stéphan Clémençon (MODALX, Université Paris X; Unité Méta@risk, INRA; LPMA, UMR CNRS 7599)
Bassirou Diagne (MAPMO, UMR CNRS, Université d'Orléans)
Richard Emilion (MAPMO, UMR CNRS, Université d'Orléans)
Thérèse Marianne (Département de Biologie, Université des Antilles et de la Guyane)

July 25, 2008

Abstract

The main purpose of this paper is to demonstrate how recent advances in predictive machine learning can be used for quantifying the risk of an acute splenic sequestration crisis (ASSC), a serious symptom of sickle cell disease (SCD). Precisely, based on a training data set, the goal here is to learn how to predict (estimate) the level of severity of the crisis, represented as a binary random variable Y ("mild" vs "severe"), from collected medical measurements X = (X_1, ..., X_D). In practice, the prediction rule is obtained by thresholding a certain diagnostic test statistic s(x) at a certain cutoff point t: patients with a test score above the selected cutoff point are predicted as "prone to a severe crisis", the others as "prone to a mild crisis". Recently, new machine learning techniques have emerged for constructing a "good predictor", when the ability to distinguish between the two populations is measured either by the standard prediction error or by the AUC criterion, a measure of quality extensively used for evaluating diagnostic tests in a wide variety of applications. In this article we show how machine learning methods such as boosting or ranking can be used for accurately evaluating the risk of a severe ASSC, while avoiding the question of modeling the underlying distribution of the response Y conditioned upon the explanatory observation X, in contrast to the logistic regression approach.

Keywords and phrases: sickle cell disease, ranking, ROC curve, AUC criterion, boosting, decision trees, bootstrap.

AMS 2000 Mathematics Subject Classification: 60G70, 60J10, 60K20.
1 Introduction

Sickle cell disease is an inherited disease of the red blood cells, characterized by pain episodes, anemia (a shortage of red blood cells), serious infections and damage to vital organs. The symptoms of sickle cell disease are caused by abnormal hemoglobin, the main protein inside red blood cells, which carries oxygen from the lungs to every part of the body. Normally, red blood cells are round and flexible and flow easily through blood vessels. But in sickle cell disease, the abnormal hemoglobin causes red blood cells to become stiff and, under the microscope, they may look like a C-shaped farm tool called a "sickle". These stiffer red blood cells can get stuck in tiny blood vessels, cutting off the blood supply to nearby tissues. This is what causes pain (usually termed a "sickle cell pain episode", or "crisis" for short) and sometimes organ damage. Sickle-shaped red blood cells also die and break down more quickly than normal red blood cells, resulting in anemia. There are several common forms of sickle cell disease, called SS (the child inherits one sickle cell gene from each parent), SC (the child inherits one sickle cell gene and one gene for another abnormal type of hemoglobin called "C") and S-beta thalassemia (the child inherits one sickle cell gene and one gene for beta thalassemia, another inherited anemia). The effects of sickle cell disease vary greatly from one person to another. Some affected children/adults are usually healthy, while others are frequently hospitalized. In order to describe the important clinical variability among such crises, medical experts distinguish between two symptom classes: mild and severe. In this paper, we tackle the problem of explaining membership in a given symptom class, based on several medical measurements, from a predictive perspective. There are a wide variety of ways one can go about trying to determine a good prediction rule.
One may, for instance, seek the opinions of medical experts, or else exploit a training data set consisting of previously observed cases for which both the binary response and the predictor variables have been collected at the Caribbean Center for Sickle Cell Disease over the last ten years (refer to for further details). We investigate the performance of recent machine learning procedures such as boosting and bagging (applied to decision trees, on interpretability grounds) for building predictive models. As a first go, the quality of the obtained prediction functions is measured by the (expected) prediction error in a standard fashion. However, most commonly used learning procedures output prediction rules based on comparing a certain test statistic to an adequate cutoff value, and a standard approach to the statistical evaluation of such "diagnostic tests" involves computing summary measures based on the so-termed Receiver Operating Characteristic curve (ROC curve, in abbreviated form), such as the AUC criterion (AUC standing for "area under the ROC curve"). One may refer to [13, 16, 15, 11] for instance; see also [14] for a recent account oriented to medical applications. Hence, the predictive power of the diagnostic tests output by the machine learning algorithms considered in this article is also evaluated in terms of the ROC curve. In this respect, the approach developed in [8], based on pairwise classification (see also [6, 7]) for refining standard classification methods, is shown to yield significant improvements.

The paper is organized as follows. A precise formulation of the medical diagnostic problem considered in this paper is given in Section 2 from the predictive machine learning angle, together with a brief description of the algorithms and data used in our experiments. In Section 3, the results of the various learning procedures we considered are compared with respect to their accuracy and interpretability.

2 Background: Data & Methods

In this section, we first set out the notation and recall certain key notions arising from predictive (machine) learning. The candidate learning algorithms considered in this article for building a good diagnostic test function, competitive in predicting severe acute splenic sequestration crises, are next briefly reviewed. Then, the data used for learning how to predict the level of severity of the next ASSC are detailed.

2.1 Formulation of the prediction problem: the bipartite setup

In the bipartite setup, the goal is to predict a binary output random variable Y, taking its values in {-1, +1} say, from the observation of a set of explanatory variables X = (X_1, ..., X_d). In the present application, Y represents the level of severity of the ASSC, mild and severe being respectively coded as -1 and +1, and X the collection of medical measurements described in Table 1 used for predicting Y. Using a training sample {(x_i, y_i)}, 1 <= i <= n, of n cases for which the values of both the explanatory and the output variables have been jointly observed, the matter is to build a prediction rule C that maps any point x = (x_1, ..., x_d) in the space X of all values of the input random vector to a point C(x) in the space of response values, namely {-1, +1}, with a predictive risk

    R(C) = P(Y ≠ C(X))

as low as possible, the probability being taken over the joint distribution of (X, Y).

2.2 The database

We applied the algorithms described below to a data set of 42 children described by 15 characteristics.
These data were collected over 10 years at the Centre Caribéen de la Drépanocytose (Pointe-à-Pitre, Guadeloupe, French West Indies). The structure of this data table is displayed below in Table 1.
Variable      Description
GR            basic red blood cell count
HT            basic hematocrit
LEU           leucocyte count
NEU           neutrophil count
PLT           blood platelet count
HYGRO         hygrometry
HB            basic hemoglobin level
AGE           age of the child
LIVER         liver
SPLEEN        spleen
DELTALEU      variation of the leucocyte count
DELTANEU      variation of the neutrophil count
DELTAHB       variation of the hemoglobin level
DELTAPLT      variation of the blood platelet count
PERIODYEAR    period of the crisis within the year
CLASSE        equals +1 when the hemoglobin level is below 4 g/dl (severe crisis) and -1 otherwise

Table 1: Description of the input variables.

As it was observed that any ASSC is mainly characterized by a decrease of the hemoglobin level, we decided that crises with a hemoglobin level below 4 g/dl are severe and that the others are mild. We did not include HB as a predictor in the analyses, because our definition of a severe ASSC was based on the hemoglobin level. Also, GR and DELTAHB were correlated with HT (respectively t = 12.36, p < 2.2e-16; t = 9.45, p < 2.2e-16), and DELTALEU and DELTANEU were strongly positively correlated with NEU (respectively t = 38.8, p < 2.2e-16; t = 7.38, p < 2.2e-16), in the total sample of 132 subjects. Therefore we eliminated HB, GR, DELTAHB, DELTALEU, DELTANEU and DELTASPLEEN. We ran the learning algorithms described below and compared their performance in order to choose the variables that yield a good prediction of the ASSC. We used the R software, where most of these methods are available in statistical packages, but we implemented the ranking tree, ranking boosting and ranking bagging ourselves. The methods used here belong to the family of supervised learning methods, which are suitable for classification, ranking and prediction tasks. Supervised learning methods require a data set S of the form S = {(x_i, y_i)}, where the x_i are the medical measurements observed on the patients (together with the age variable) while y_i is the class label corresponding to the mild or severe symptom. This set of classified instances is used to learn the model and is therefore called the training set.
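The correlation screening above relies on the classical statistic t = r * sqrt(n - 2) / sqrt(1 - r^2) for a Pearson correlation coefficient r between two predictors. A minimal pure-Python sketch of this computation (the toy predictors below are illustrative, not the study's measurements):

```python
import math

def pearson_t(x, y):
    """Pearson correlation r between two predictors, and the statistic
    t = r * sqrt(n - 2) / sqrt(1 - r^2) used to test its significance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    r = sxy / math.sqrt(sxx * syy)
    t = r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)
    return r, t

# Toy predictors; a large |t| (hence a small p-value) flags a pair of
# variables as redundant, as HT and GR were in our data.
r, t = pearson_t([1.0, 2.0, 3.0, 4.0, 5.0], [1.0, 3.0, 2.0, 5.0, 4.0])
```

A predictor whose t statistic against an already retained variable is highly significant carries little additional information and can be dropped, which is how HB, GR, DELTAHB, DELTALEU and DELTANEU were eliminated.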
A similar data set, called the test set, is used for evaluating the performance of the methods. The classification model assigns each feature vector x_i of the test set to a class c_i, which is then compared to the true class y_i of the instance. If c_i = y_i, the instance is considered correctly classified; otherwise it counts as a misclassification. In the case of ranking, however, the goal is to rank the positive instances higher than the negative instances instead of simply classifying them.

2.3 Methods

Adaboost

Adaboost, a general discriminative learning algorithm invented by Freund and Schapire (1997), can be described as follows:

    F_0 = 0
    for t = 1, ..., T:
        w_i^t = exp(-y_i F_{t-1}(x_i)),  i = 1, ..., n
        get h_t from the weak learner
        α_t = (1/2) ln( Σ_{i: h_t(x_i)=1, y_i=+1} w_i^t / Σ_{i: h_t(x_i)=1, y_i=-1} w_i^t )
        F_{t+1} = F_t + α_t h_t

    Figure 1. The Adaboost algorithm.

The basic idea of Adaboost is to repeatedly apply a simple learning algorithm, called the weak or base learner, to different weightings of the same training set. In its simplest form, Adaboost is intended for binary prediction problems where the training set consists of pairs (x_1, y_1), (x_2, y_2), ..., (x_n, y_n); x_i corresponds to the features of an example, and y_i ∈ {-1, 1} is the binary label to be predicted. A weighting of the training examples is an assignment of a non-negative real value w_i to each example (x_i, y_i). On iteration t of the boosting process, the weak learner is applied to the training set with a set of weights w_1^t, ..., w_n^t and produces a prediction rule h_t that maps x to {0, 1}. The requirement on the weak learner is for h_t(x) to have a small but significant correlation with the example labels y when measured using the current weighting of the examples. After the rule h_t is generated, the example weights are changed so that the weak predictions h_t(x) and the labels y are uncorrelated. The weak learner is then called with the new weights over the training examples, and the process repeats. Finally, all of the weak prediction rules are combined into a single strong rule using a weighted majority vote. One can prove that if the rules generated in the iterations are all slightly correlated with the label, then the strong rule will have a very high correlation with the label; in other words, it will predict the label very accurately.
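The reweighting loop described above can be sketched in a few lines. The sketch below uses {-1, +1}-valued decision stumps as the weak learner and the common error-based weight α_t = (1/2) ln((1 - ε_t)/ε_t), an equivalent variant of the update in Figure 1; the stump learner and the toy data are ours, not the paper's:

```python
import math

def stump(X, y, w):
    """Weak learner: the best single-feature threshold rule under weights w."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({x[j] for x in X}):
            for sign in (+1, -1):
                pred = [sign if x[j] <= thr else -sign for x in X]
                err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    err, j, thr, sign = best
    return (lambda x: sign if x[j] <= thr else -sign), err

def adaboost(X, y, T=10):
    """Reweight the examples at each round and combine the weak rules
    by a weighted majority vote."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(T):
        h, err = stump(X, y, w)
        err = max(err, 1e-10)          # guard against a perfect stump
        if err >= 0.5:                 # weak learner no better than chance
            break
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, h))
        # Increase the weight of misclassified examples, then renormalize.
        w = [wi * math.exp(-alpha * yi * h(xi)) for wi, xi, yi in zip(w, X, y)]
        z = sum(w)
        w = [wi / z for wi in w]
    return lambda x: 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1

# Toy data: the two classes are separated on the single feature.
X = [[1.0], [2.0], [3.0], [8.0], [9.0]]
y = [-1, -1, -1, 1, 1]
F = adaboost(X, y)
```

The returned rule is the weighted majority vote sign(Σ_t α_t h_t(x)), i.e. the sign of the score function F described in the text.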
The whole process can be seen as a variational method in which an approximation F(x) is repeatedly improved by adding to it small corrections given by the weak prediction functions. The strong prediction rule learned by Adaboost is sign(F(x)).

Bagging

Consider a learning sample S consisting of the data {(x_i, y_i), i = 1, ..., N}, where y ∈ {1, ..., J} denotes the class variable. Bootstrap aggregating, or bagging (Breiman, 1996), works by learning multiple classifiers on bootstrap samples. For each trial t = 1, ..., T, a sample of size N is selected randomly with replacement from the original learning set S, and a classifier φ_t(x) = φ(x, S_t^(B)) based on the bootstrap training set S_t^(B) is formed. The bagged classifier φ_B is then constructed by combining the φ_t through simple voting. More precisely, the bagging algorithm is as follows:

1. A sequence of bootstrap samples {S_t^(B)}, t = 1, ..., T, is drawn from the learning sample S, for a fixed number T, and the classifiers φ_t(x) = φ(x, S_t^(B)) are generated.

2. Let N_j be the number of times that an instance x is classified as class j, that is,

       N_j = Σ_{t=1}^T I[φ(x, S_t^(B)) = j]   for j ∈ {1, ..., J},

   where I stands for the indicator function.

3. The bagged predictor is created by simple voting, that is, φ_B(x) = arg max_j N_j.

Boosting Tree Ranking

The existing methods for growing decision trees typically use splitting criteria based on error/accuracy or discrimination. In this method, we use an AUC splitting criterion.

Logistic Regression

Denote the gold standard by the random variable Y and the other medical features by X_1, X_2, ..., X_n. Let Y = 1 when a patient has a very severe crisis and Y = 0 when he has a mild one. The logistic model is of the form

    log[ Pr(Y = 1 | X) / (1 - Pr(Y = 1 | X)) ] = α + βX,

where the random vector X consists of X_1, X_2, ..., X_n and their interaction terms. The stepwise selection procedure starts from a null model. At each step, it adds the variable with the most significant score statistic among those not in the model, then sequentially removes the variable with the least score statistic among those in the model whose score statistics are not significant. The process terminates if no further variable can be added to the model, or if the variable just entered into the model is the only variable removed in the subsequent elimination. Here, the score statistic measures the significance of the effect of a variable.

Performance measurement: the ROC curve

The Receiver Operating Characteristic (ROC) curve (Zhou et al.)
is a graphical representation used to assess the discriminatory ability of a dichotomous classifier by showing the trade-offs between sensitivity and specificity for every possible cutoff. Sensitivity is calculated by dividing the number of true positives (TP) by the number of all positives, which equals the sum of the true positives and the false negatives (FN). Specificity is calculated by dividing the number of true negatives (TN) by the number of all negatives, which equals the sum of the true negatives and the false positives (FP):

    Sensitivity = TP / (TP + FN),    Specificity = TN / (TN + FP).

The ROC curve plots 1 - specificity on the X-axis and sensitivity on the Y-axis. A good classifier has its ROC curve climbing rapidly towards the upper left-hand corner of the graph. This can also be quantified by measuring the area under the curve: the closer the area is to 1, the better the classifier, while the closer the area is to 0.5, the worse the classifier. We also use the ROC curve to compare the performance of the learning algorithms.

2.4 Results

Adaboost

With Adaboost, the bootstrap error rate of the model is 9%. This was evaluated by doing 100 runs, where each base learner receives a different training set of n instances. With this sampling, called bootstrap replication (Efron and Tibshirani, 1993), it has been observed that, on average, 36.8% of the training instances are not used for building each tree; these serve as a test set, and the test set errors are then averaged. We chose the four most important variables, which are HT, NEU, DELTAPLT and HYGRO, by ranking the variables according to their frequency weighted by the position of the node.

Bagging

The results mostly agree with those of Adaboost: the first two features found by Adaboost are also found to be very important. The bootstrap error rate of the model is 6%, evaluated by doing 100 runs. Bagging gives HT, HYGRO, PLT and NEU as the four most important variables (see Figure 1).

Boosting Tree Ranking and Ranking Bagging

The existing methods for growing decision trees typically use splitting criteria based on error/accuracy or discrimination. In this method we use an AUC splitting criterion, and the algorithm builds a binary tree-structured scoring function.
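The sensitivity and specificity defined above, and the AUC used both as an evaluation measure and as a splitting criterion, can be computed directly from test scores. A minimal sketch, with the AUC in its pairwise (Mann-Whitney) form, the fraction of correctly ranked positive/negative pairs; the function names are ours:

```python
def roc_points(scores, labels):
    """ROC curve points (1 - specificity, sensitivity) for every cutoff."""
    P = sum(1 for l in labels if l == 1)   # number of positives
    N = len(labels) - P                    # number of negatives
    tp = fp = 0
    pts = [(0.0, 0.0)]
    # Lower the cutoff one score at a time, from the highest score down.
    for s, l in sorted(zip(scores, labels), reverse=True):
        if l == 1:
            tp += 1
        else:
            fp += 1
        pts.append((fp / N, tp / P))
    return pts

def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked in the
    right order, ties counting one half."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l != 1]
    wins = sum(1.0 if p > n else (0.5 if p == n else 0.0)
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A scoring function that ranks every positive instance above every negative one attains an AUC of 1, which is exactly the bipartite ranking objective pursued by the tree methods above.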
This ranking tree mimics a recursive approximation procedure for the optimal ROC curve. Ranking boosting and ranking bagging use this tree as the base learner, with the same constructions as Adaboost and bagging respectively, to obtain the final methods.
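The bootstrap-and-vote construction shared by bagging and ranking bagging can be sketched as follows. The midpoint-threshold base learner and the toy sample below are illustrative stand-ins of our own, not the paper's trees:

```python
import random

def bootstrap_sample(data, rng):
    """Draw N cases with replacement from the learning sample (N = len(data))."""
    return [rng.choice(data) for _ in data]

def train_threshold(sample):
    """Toy base learner: threshold at the midpoint of the class means
    on feature 0, predicting the class whose mean lies on that side."""
    pos = [x[0] for x, y in sample if y == 1]
    neg = [x[0] for x, y in sample if y == -1]
    if not pos or not neg:
        # Degenerate bootstrap sample: predict the only class present.
        only = 1 if pos else -1
        return lambda x: only
    thr = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    side = 1 if sum(pos) / len(pos) > thr else -1
    return lambda x: side if x[0] > thr else -side

def bagging(data, T=25, seed=0):
    """Learn T classifiers on bootstrap samples; aggregate by simple voting."""
    rng = random.Random(seed)
    models = [train_threshold(bootstrap_sample(data, rng)) for _ in range(T)]
    return lambda x: 1 if sum(m(x) for m in models) >= 0 else -1

data = [((1.0,), -1), ((2.0,), -1), ((3.0,), -1), ((7.0,), 1), ((8.0,), 1)]
phi_B = bagging(data)
```

Ranking bagging follows the same scheme with the AUC-split ranking tree in place of the toy threshold learner, and the votes aggregated into a scoring function rather than a class label.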
Logistic Regression

We use the likelihood ratio to test the significance of the predictor variables included in the model. The likelihood ratio statistic is given by the following equation:

    D = -2 ln[L(b)].

This statistic, D, is known as the deviance. First, we determine the deviance of the model without any of the predictor variables (i.e., with the intercept only), and compare this value with that of models consisting of different combinations of variables. As we add variables, we can evaluate the p-value of the deviance, which tests the significance of that particular combination of predictor variables. A low p-value justifies the rejection of the null hypothesis that all of the beta coefficients are equal to zero (i.e., that all of the predictor variables are independent of the response variable). The rejection of the null hypothesis means that the variables included in the model are significant. Backward regression is the method we use to find the overall best model. That is, we run a model in the R software containing all of the predictor variables and then analyze the p-values of the predictor variables to determine which of them should be removed. We also note the deviance of the model, along with its associated p-value. To refine the model, we remove the factors exhibiting the highest p-values and then reevaluate the model. We repeat the removal process until the remaining predictor variables are all significant at the α = 0.05 level; HT and PLT are the remaining risk variables. Using this model, we calculated the predictive error rate of logistic regression on the data set and obtained 11.36%. This was evaluated by doing 100 bootstraps and then averaging over the test set errors.

Figure 1: Variables Importance

The results presented show that all these methods
discriminate very well between the severe and the mild crises. This is an indication that there is a clear distinction between the two classes in terms of the 9 attributes that were used. The results of Adaboost and bagging coincide in terms of what the most important variables are: the first four variables according to both Adaboost and bagging are HT, NEU, DELTAPLT and HYGRO. Also, the first three variables chosen by ranking bagging are ranked among the four most important variables according to boosting tree ranking.

2.5 ROC curves

The ROC curves constructed for these methods show that they all improve on the default hypothesis (which would correspond to the diagonal line TP = FP). Moreover, these curves confirm that Adaboost and RankTree give the best predictions.

Methods               AUC
Adaboost              0.92
RankTree              0.90
RankBagg              0.88
Bagging               0.87
Logistic regression   0.80

Table 2: Area under the ROC curve (AUC).

[Figure: ROC curves (true positive rate against false positive rate) for RankTree, RankBagg, Adaboost, Bagging and logistic regression.]

Conclusion

In this paper, we have explored the Adaboost, RankTree, RankBagg, bagging and logistic regression methods for diagnosing the severity of ASSC, a serious symptom of sickle cell disease. The five methods were evaluated on the diagnostic problem, at a fixed level of hemoglobin, of classifying patients between mild and severe symptoms. The first interesting observation is the high accuracy of all the methods. In particular, Adaboost and RankTree achieved an AUC of 92% and 90% respectively, thus providing a very good model of the diagnostic process. This indicates that we can choose the HT, NEU, DELTAPLT and HYGRO medical measurements to predict the severity of an ASSC.

References

[1] Liaw, A. and Wiener, M. (2002). Classification and regression by randomForest. R News, 2/3, 18-22.
[2] Breiman, L., Friedman, J., Olshen, R. and Stone, C. (1984). Classification and Regression Trees. Wadsworth, Belmont, California.
[3] Breiman, L. (1996). Bagging predictors. Machine Learning, 24,
[4] Breiman, L. (2001). Random forests. Machine Learning, 45, 5-32.
[5] Breiman, L. (2003). How to Use Survival Forests. Technical Report, Department of Statistics, University of California, Berkeley.
[6] Clémençon, S., Lugosi, G. and Vayatis, N. (2005). Ranking and scoring using empirical risk minimization. In Proc. of COLT 2005, Bertinoro, Italy, June 27-30, Lecture Notes in Computer Science 3559, Springer, 1-15.
[7] Clémençon, S., Lugosi, G. and Vayatis, N. (2006). From classification to ranking: a statistical view. In Proc. of the 29th Annual Conference of the German Classification Society, GfKl 2005, 'Studies in Classification, Data Analysis and Knowledge Organization' series, Vol. 30. Springer-Verlag.
[8] Clémençon, S., Lugosi, G. and Vayatis, N. (2007). U-processes and nonparametric scoring. To appear in Annals of Statistics.
[9] Efron, B. and Tibshirani, R.J. (1993). An Introduction to the Bootstrap. London: Chapman & Hall.
[10] Freund, Y. and Schapire, R.E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1),
[11] Goddard, M.J. and Hinberg, I. (1990). Receiver Operator Characteristic curves and non-normal data: an empirical study. Statistics in Medicine, 9,
[12] Hastie, T., Tibshirani, R. and Friedman, J. (2003). The Elements of Statistical Learning. New York: Springer.
[13] Hsieh, F. and Turnbull, B.W. (1996). Nonparametric methods for evaluating diagnostic tests. Statistica Sinica, 6,
[14] Pepe, M.S. (2003). The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford: Oxford University Press.
[15] Swets, J.A. (1988). Measuring the accuracy of diagnostic systems. Science, 240,
[16] Swets, J.A. and Pickett, R.M. (1982). Evaluation of Diagnostic Systems: Methods from Signal Detection Theory. Academic Press, New York.
[17] Zhou, X.H., Obuchowski, N.A. and McClish, D.K. (2002). Statistical Methods in Diagnostic Medicine. New York: Wiley & Sons.
The Artificial Prediction Market
The Artificial Prediction Market Adrian Barbu Department of Statistics Florida State University Joint work with Nathan Lay, Siemens Corporate Research 1 Overview Main Contributions A mathematical theory
Chapter 11 Boosting. Xiaogang Su Department of Statistics University of Central Florida - 1 -
Chapter 11 Boosting Xiaogang Su Department of Statistics University of Central Florida - 1 - Perturb and Combine (P&C) Methods have been devised to take advantage of the instability of trees to create
Knowledge Discovery and Data Mining
Knowledge Discovery and Data Mining Unit # 11 Sajjad Haider Fall 2013 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right
On the effect of data set size on bias and variance in classification learning
On the effect of data set size on bias and variance in classification learning Abstract Damien Brain Geoffrey I Webb School of Computing and Mathematics Deakin University Geelong Vic 3217 With the advent
Identifying SPAM with Predictive Models
Identifying SPAM with Predictive Models Dan Steinberg and Mikhaylo Golovnya Salford Systems 1 Introduction The ECML-PKDD 2006 Discovery Challenge posed a topical problem for predictive modelers: how to
Boosting. [email protected]
. Machine Learning Boosting Prof. Dr. Martin Riedmiller AG Maschinelles Lernen und Natürlichsprachliche Systeme Institut für Informatik Technische Fakultät Albert-Ludwigs-Universität Freiburg [email protected]
Health Care and Life Sciences
Sensitivity, Specificity, Accuracy, Associated Confidence Interval and ROC Analysis with Practical SAS Implementations Wen Zhu 1, Nancy Zeng 2, Ning Wang 2 1 K&L consulting services, Inc, Fort Washington,
Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris
Class #6: Non-linear classification ML4Bio 2012 February 17 th, 2012 Quaid Morris 1 Module #: Title of Module 2 Review Overview Linear separability Non-linear classification Linear Support Vector Machines
Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus
Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus 1. Introduction Facebook is a social networking website with an open platform that enables developers to extract and utilize user information
Performance Measures in Data Mining
Performance Measures in Data Mining Common Performance Measures used in Data Mining and Machine Learning Approaches L. Richter J.M. Cejuela Department of Computer Science Technische Universität München
Copyright 2006, SAS Institute Inc. All rights reserved. Predictive Modeling using SAS
Predictive Modeling using SAS Purpose of Predictive Modeling To Predict the Future x To identify statistically significant attributes or risk factors x To publish findings in Science, Nature, or the New
Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms
Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Scott Pion and Lutz Hamel Abstract This paper presents the results of a series of analyses performed on direct mail
Data quality in Accounting Information Systems
Data quality in Accounting Information Systems Comparing Several Data Mining Techniques Erjon Zoto Department of Statistics and Applied Informatics Faculty of Economy, University of Tirana Tirana, Albania
2. Simple Linear Regression
Research methods - II 3 2. Simple Linear Regression Simple linear regression is a technique in parametric statistics that is commonly used for analyzing mean response of a variable Y which changes according
Bootstrapping Big Data
Bootstrapping Big Data Ariel Kleiner Ameet Talwalkar Purnamrita Sarkar Michael I. Jordan Computer Science Division University of California, Berkeley {akleiner, ameet, psarkar, jordan}@eecs.berkeley.edu
On Adaboost and Optimal Betting Strategies
On Adaboost and Optimal Betting Strategies Pasquale Malacaria School of Electronic Engineering and Computer Science Queen Mary, University of London Email: [email protected] Fabrizio Smeraldi School of
Using Random Forest to Learn Imbalanced Data
Using Random Forest to Learn Imbalanced Data Chao Chen, [email protected] Department of Statistics,UC Berkeley Andy Liaw, andy [email protected] Biometrics Research,Merck Research Labs Leo Breiman,
Azure Machine Learning, SQL Data Mining and R
Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:
L25: Ensemble learning
L25: Ensemble learning Introduction Methods for constructing ensembles Combination strategies Stacked generalization Mixtures of experts Bagging Boosting CSCE 666 Pattern Analysis Ricardo Gutierrez-Osuna
Hyperspectral images retrieval with Support Vector Machines (SVM)
Hyperspectral images retrieval with Support Vector Machines (SVM) Miguel A. Veganzones Grupo Inteligencia Computacional Universidad del País Vasco (Grupo Inteligencia SVM-retrieval Computacional Universidad
Ensemble Data Mining Methods
Ensemble Data Mining Methods Nikunj C. Oza, Ph.D., NASA Ames Research Center, USA INTRODUCTION Ensemble Data Mining Methods, also known as Committee Methods or Model Combiners, are machine learning methods
Data Mining Methods: Applications for Institutional Research
Data Mining Methods: Applications for Institutional Research Nora Galambos, PhD Office of Institutional Research, Planning & Effectiveness Stony Brook University NEAIR Annual Conference Philadelphia 2014
Selecting Data Mining Model for Web Advertising in Virtual Communities
Selecting Data Mining for Web Advertising in Virtual Communities Jerzy Surma Faculty of Business Administration Warsaw School of Economics Warsaw, Poland e-mail: [email protected] Mariusz Łapczyński
Risk pricing for Australian Motor Insurance
Risk pricing for Australian Motor Insurance Dr Richard Brookes November 2012 Contents 1. Background Scope How many models? 2. Approach Data Variable filtering GLM Interactions Credibility overlay 3. Model
Personalized Predictive Medicine and Genomic Clinical Trials
Personalized Predictive Medicine and Genomic Clinical Trials Richard Simon, D.Sc. Chief, Biometric Research Branch National Cancer Institute http://brb.nci.nih.gov brb.nci.nih.gov Powerpoint presentations
Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification
Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification Tina R. Patil, Mrs. S. S. Sherekar Sant Gadgebaba Amravati University, Amravati [email protected], [email protected]
Insurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví. Pavel Kříž. Seminář z aktuárských věd MFF 4.
Insurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví Pavel Kříž Seminář z aktuárských věd MFF 4. dubna 2014 Summary 1. Application areas of Insurance Analytics 2. Insurance Analytics
1. What is the critical value for this 95% confidence interval? CV = z.025 = invnorm(0.025) = 1.96
1 Final Review 2 Review 2.1 CI 1-propZint Scenario 1 A TV manufacturer claims in its warranty brochure that in the past not more than 10 percent of its TV sets needed any repair during the first two years
NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )
Chapter 340 Principal Components Regression Introduction is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates
BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL
The Fifth International Conference on e-learning (elearning-2014), 22-23 September 2014, Belgrade, Serbia BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL SNJEŽANA MILINKOVIĆ University
Classification and Regression by randomforest
Vol. 2/3, December 02 18 Classification and Regression by randomforest Andy Liaw and Matthew Wiener Introduction Recently there has been a lot of interest in ensemble learning methods that generate many
Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data
CMPE 59H Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data Term Project Report Fatma Güney, Kübra Kalkan 1/15/2013 Keywords: Non-linear
Virtual Site Event. Predictive Analytics: What Managers Need to Know. Presented by: Paul Arnest, MS, MBA, PMP February 11, 2015
Virtual Site Event Predictive Analytics: What Managers Need to Know Presented by: Paul Arnest, MS, MBA, PMP February 11, 2015 1 Ground Rules Virtual Site Ground Rules PMI Code of Conduct applies for this
Package acrm. R topics documented: February 19, 2015
Package acrm February 19, 2015 Type Package Title Convenience functions for analytical Customer Relationship Management Version 0.1.1 Date 2014-03-28 Imports dummies, randomforest, kernelfactory, ada Author
Supervised Learning (Big Data Analytics)
Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used
Didacticiel Études de cas
1 Theme Data Mining with R The rattle package. R (http://www.r project.org/) is one of the most exciting free data mining software projects of these last years. Its popularity is completely justified (see
Educator s Guide to Sickle Cell Disease
Educator s Guide to Sickle Cell Disease Educator s Guide to Sickle Cell Disease Sickle cell disease is an inherited blood disorder affecting about one out of every 350 African Americans. Most children
Nominal and ordinal logistic regression
Nominal and ordinal logistic regression April 26 Nominal and ordinal logistic regression Our goal for today is to briefly go over ways to extend the logistic regression model to the case where the outcome
Cross Validation techniques in R: A brief overview of some methods, packages, and functions for assessing prediction models.
Cross Validation techniques in R: A brief overview of some methods, packages, and functions for assessing prediction models. Dr. Jon Starkweather, Research and Statistical Support consultant This month
Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm
Mgt 540 Research Methods Data Analysis 1 Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm http://web.utk.edu/~dap/random/order/start.htm
Generalized Linear Models
Generalized Linear Models We have previously worked with regression models where the response variable is quantitative and normally distributed. Now we turn our attention to two types of models where the
THE HYBRID CART-LOGIT MODEL IN CLASSIFICATION AND DATA MINING. Dan Steinberg and N. Scott Cardell
THE HYBID CAT-LOGIT MODEL IN CLASSIFICATION AND DATA MINING Introduction Dan Steinberg and N. Scott Cardell Most data-mining projects involve classification problems assigning objects to classes whether
Gamma Distribution Fitting
Chapter 552 Gamma Distribution Fitting Introduction This module fits the gamma probability distributions to a complete or censored set of individual or grouped data values. It outputs various statistics
Performance Metrics for Graph Mining Tasks
Performance Metrics for Graph Mining Tasks 1 Outline Introduction to Performance Metrics Supervised Learning Performance Metrics Unsupervised Learning Performance Metrics Optimizing Metrics Statistical
STATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and
Clustering Techniques and STATISTICA Case Study: Defining Clusters of Shopping Center Patrons STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table
CS570 Data Mining Classification: Ensemble Methods
CS570 Data Mining Classification: Ensemble Methods Cengiz Günay Dept. Math & CS, Emory University Fall 2013 Some slides courtesy of Han-Kamber-Pei, Tan et al., and Li Xiong Günay (Emory) Classification:
Data Mining and Data Warehousing. Henryk Maciejewski. Data Mining Predictive modelling: regression
Data Mining and Data Warehousing Henryk Maciejewski Data Mining Predictive modelling: regression Algorithms for Predictive Modelling Contents Regression Classification Auxiliary topics: Estimation of prediction
Agenda. Mathias Lanner Sas Institute. Predictive Modeling Applications. Predictive Modeling Training Data. Beslutsträd och andra prediktiva modeller
Agenda Introduktion till Prediktiva modeller Beslutsträd Beslutsträd och andra prediktiva modeller Mathias Lanner Sas Institute Pruning Regressioner Neurala Nätverk Utvärdering av modeller 2 Predictive
PREDICTING SUCCESS IN THE COMPUTER SCIENCE DEGREE USING ROC ANALYSIS
PREDICTING SUCCESS IN THE COMPUTER SCIENCE DEGREE USING ROC ANALYSIS Arturo Fornés [email protected], José A. Conejero [email protected] 1, Antonio Molina [email protected], Antonio Pérez [email protected],
Practical Data Science with Azure Machine Learning, SQL Data Mining, and R
Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be
Role of Customer Response Models in Customer Solicitation Center s Direct Marketing Campaign
Role of Customer Response Models in Customer Solicitation Center s Direct Marketing Campaign Arun K Mandapaka, Amit Singh Kushwah, Dr.Goutam Chakraborty Oklahoma State University, OK, USA ABSTRACT Direct
A Comparison of Decision Tree and Logistic Regression Model Xianzhe Chen, North Dakota State University, Fargo, ND
Paper D02-2009 A Comparison of Decision Tree and Logistic Regression Model Xianzhe Chen, North Dakota State University, Fargo, ND ABSTRACT This paper applies a decision tree model and logistic regression
Introduction to General and Generalized Linear Models
Introduction to General and Generalized Linear Models General Linear Models - part I Henrik Madsen Poul Thyregod Informatics and Mathematical Modelling Technical University of Denmark DK-2800 Kgs. Lyngby
Methods for Interaction Detection in Predictive Modeling Using SAS Doug Thompson, PhD, Blue Cross Blue Shield of IL, NM, OK & TX, Chicago, IL
Paper SA01-2012 Methods for Interaction Detection in Predictive Modeling Using SAS Doug Thompson, PhD, Blue Cross Blue Shield of IL, NM, OK & TX, Chicago, IL ABSTRACT Analysts typically consider combinations
