Random Forests for Regression and Classification. Adele Cutler Utah State University
1 Random Forests for Regression and Classification Adele Cutler Utah State University September 5-7, 2 Ovronnaz, Switzerland
2 Leo Breiman, 1928-2005. PhD Berkeley (mathematics). UCLA (mathematics). Consultant. Berkeley (statistics). 1984 Classification & Regression Trees (with Friedman, Olshen, Stone). 1996 Bagging. 2001 Random Forests.
3 Random Forests for Regression and Classification
4 Outline Background. Trees. Bagging predictors. Random Forests algorithm. Variable importance. Proximity measures. Visualization. Partial plots and interpretation of effects.
5 What is Regression? Given data on predictor variables (inputs, X) and a continuous response variable (output, Y), build a model for: Predicting the value of the response from the predictors. Understanding the relationship between the predictors and the response. e.g. predict a person's systolic blood pressure based on their age, height, weight, etc.
6 Regression Examples Y: income; X: age, education, sex, occupation, ... Y: crop yield; X: rainfall, temperature, humidity, ... Y: test scores; X: teaching method, age, sex, ability, ... Y: selling price of homes; X: size, age, location, quality, ...
7 Regression Background Linear regression. Multiple linear regression. Nonlinear regression (parametric). Nonparametric regression (smoothing): kernel smoothing, B-splines, smoothing splines, wavelets.
8-11 Regression Picture (four figure slides; plots not transcribed)
12 What is Classification? Given data on predictor variables (inputs, X) and a categorical response variable (output, Y), build a model for: Predicting the value of the response from the predictors. Understanding the relationship between the predictors and the response. e.g. predict a person's 5-year survival (yes/no) based on their age, height, weight, etc.
13 Classification Examples Y: presence/absence of disease; X: diagnostic measurements. Y: land cover (grass, trees, water, roads, ...); X: satellite image data (frequency bands). Y: loan defaults (yes/no); X: credit score, own or rent, age, marital status, ... Y: dementia status; X: scores on a battery of psychological tests.
14 Classification Background Linear discriminant analysis (1930s). Logistic regression (1944). Nearest neighbor classifiers (1951).
15-23 Classification Picture (nine figure slides; plots not transcribed)
24 Regression and Classification Given data D = {(x_i, y_i), i = 1,...,n}, where x_i = (x_i1,...,x_ip), build a model f-hat so that Y-hat = f-hat(X) for random variables X = (X_1,...,X_p) and Y. Then f-hat will be used for: Predicting the value of the response from the predictors: y0-hat = f-hat(x0), where x0 = (x_01,...,x_0p). Understanding the relationship between the predictors and the response.
25 Assumptions Independent observations: not autocorrelated over time or space; not usually from a designed experiment; not matched case-control. Goal is prediction and (sometimes) understanding: Which predictors are useful? How? Where? Is there interesting structure?
26 Predictive Accuracy Regression: expected mean squared error. Classification: expected (classwise) error rate.
27 Estimates of Predictive Accuracy Resubstitution: use the accuracy on the training set as an estimate of generalization error. AIC etc.: use assumptions about the model. Crossvalidation: randomly select a training set, use the rest as the test set. 10-fold crossvalidation.
28 10-Fold Crossvalidation Divide the data at random into 10 pieces, D_1,...,D_10. Fit the predictor to D_2,...,D_10; predict D_1. Fit the predictor to D_1,D_3,...,D_10; predict D_2. Fit the predictor to D_1,D_2,D_4,...,D_10; predict D_3. ... Fit the predictor to D_1,D_2,...,D_9; predict D_10. Compute the estimate using the assembled predictions and their observed values.
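The steps above can be sketched in a few lines of Python. This is an illustrative sketch, not from the slides; `kfold_indices` and `cross_validate` are our own helper names:

```python
import random

def kfold_indices(n, k=10, seed=0):
    """Split indices 0..n-1 at random into k disjoint folds D_1,...,D_k."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(fit, predict, X, y, k=10):
    """Hold out each fold in turn, fit on the rest, predict the held-out fold,
    and assemble the predictions in the original case order."""
    n = len(y)
    preds = [None] * n
    for fold in kfold_indices(n, k):
        held = set(fold)
        train = [i for i in range(n) if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train])
        for i in fold:
            preds[i] = predict(model, X[i])
    return preds
```

The assembled `preds` can then be compared against `y` to compute the crossvalidated error estimate.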
29 Estimates of Predictive Accuracy Typically, resubstitution estimates are optimistic compared to crossvalidation estimates. Crossvalidation estimates tend to be pessimistic because they are based on smaller samples. Random Forests has its own way of estimating predictive accuracy ("out-of-bag" estimates).
30 Case Study: Cavity Nesting birds in the Uintah Mountains, Utah Red-naped sapsucker (Sphyrapicus nuchalis) (n = 42 nest sites). Mountain chickadee (Parus gambeli) (n = 42 nest sites). Northern flicker (Colaptes auratus) (n = 23 nest sites). n = 106 non-nest sites.
31 Case Study: Cavity Nesting birds in the Uintah Mountains, Utah Response variable is the presence (coded 1) or absence (coded 0) of a nest. Predictor variables (measured on .4 ha plots around the sites) are: Numbers of trees in various size classes, from less than 1 inch in diameter at breast height to greater than 15 inches in diameter. Number of snags and number of downed snags. Percent shrub cover. Number of conifers. Stand type, coded as 0 for pure aspen and 1 for mixed aspen and conifer.
32 Assessing Accuracy in Classification
                  Predicted Class
Actual Class      Absence (0)   Presence (1)   Total
Absence (0)       a             b              a+b
Presence (1)      c             d              c+d
Total             a+c           b+d            n
33 Assessing Accuracy in Classification (same table as above). Error rate = (c + b) / n
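In code, the error rate from this 2x2 table is just the off-diagonal count over n. A trivial sketch (the function name is ours):

```python
def error_rate(a, b, c, d):
    """Misclassification rate from the 2x2 confusion matrix above:
    a = absences predicted absent,  b = absences predicted present,
    c = presences predicted absent, d = presences predicted present."""
    n = a + b + c + d
    return (c + b) / n
```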
34 Resubstitution Accuracy (fully grown tree)
                  Predicted Class
Actual Class      Absence (0)   Presence (1)   Total
Absence (0)       105           1              106
Presence (1)      0             107            107
Error rate = (0 + 1)/213 = (approx) 0.005, or 0.5%
35 Crossvalidation Accuracy (fully grown tree) Same table layout; cell counts not transcribed. Error rate = (approx) 0.12, or 12%
36 Outline Background. Trees. Bagging predictors. Random Forests algorithm. Variable importance. Proximity measures. Visualization. Partial plots and interpretation of effects.
37 Classification and Regression Trees Pioneers: Morgan and Sonquist (1963). Breiman, Friedman, Olshen, Stone (1984): CART. Quinlan (1993): C4.5.
38 Classification and Regression Trees Grow a binary tree. At each node, split the data into two daughter nodes. Splits are chosen using a splitting criterion. Bottom nodes are terminal nodes. For regression, the predicted value at a node is the average response for all observations in the node. For classification, the predicted class is the most common class in the node (majority vote). For classification trees, one can also get an estimated probability of membership in each of the classes.
39 A Classification Tree Predict hepatitis (0 = absent, 1 = present) using protein and alkaline phosphate. "Yes" goes left at each split. (Tree diagram: splits on protein and alkphos; split values and leaf class counts not fully transcribed.)
40 Splitting criteria Regression: residual sum of squares. RSS = sum over left (y_i - yL*)^2 + sum over right (y_i - yR*)^2, where yL* = mean y-value for the left node and yR* = mean y-value for the right node. Classification: Gini criterion. Gini = N_L sum_{k=1,...,K} p_kL (1 - p_kL) + N_R sum_{k=1,...,K} p_kR (1 - p_kR), where p_kL = proportion of class k in the left node and p_kR = proportion of class k in the right node.
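Both criteria can be coded straight from these formulas. A minimal Python sketch (helper names are ours; it scans every midpoint between sorted distinct values, which a real CART implementation would do incrementally):

```python
def rss_split(y_left, y_right):
    """Residual sum of squares for a candidate regression split."""
    def rss(ys):
        m = sum(ys) / len(ys)
        return sum((y - m) ** 2 for y in ys)
    return rss(y_left) + rss(y_right)

def gini_split(y_left, y_right):
    """Gini criterion: N_L sum_k p_kL (1 - p_kL) + N_R sum_k p_kR (1 - p_kR)."""
    def node_gini(ys):
        n = len(ys)
        return n * sum((ys.count(k) / n) * (1 - ys.count(k) / n) for k in set(ys))
    return node_gini(y_left) + node_gini(y_right)

def best_split(x, y, criterion):
    """Return (cutpoint, score) minimizing the criterion over one variable."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    xs = [x[i] for i in order]
    ys = [y[i] for i in order]
    best = None
    for j in range(1, len(xs)):
        if xs[j] == xs[j - 1]:
            continue                      # no cut between tied values
        cut = (xs[j - 1] + xs[j]) / 2
        score = criterion(ys[:j], ys[j:])
        if best is None or score < best[1]:
            best = (cut, score)
    return best
```

With `rss_split` this finds the best regression split on a single variable, as in the slides that follow; with `gini_split` it finds the best classification split.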
41 Choosing the best horizontal split Best horizontal split is at 3.67.
42 Choosing the best vertical split Best vertical split is at .5.
43 Regression tree (prostate cancer) (figure slide)
44 Choosing the best split in the left node Best horizontal split is at 3.66 with RSS = 6..
45 Choosing the best split in the left node Best vertical split is at -.48 with RSS = 3.6.
46 Regression tree (prostate cancer) (figure slide)
47 Choosing the best split in the right node Best horizontal split is at 3.7.
48 Choosing the best split in the right node Best vertical split is at 2.79 with RSS = 25..
49 Regression tree (prostate cancer) (figure slide)
50 Choosing the best split in the third node Best horizontal split is at 3.7 with RSS = 4.42, but this is too close to the edge. Use 3.46 with RSS = 6.4.
51 Choosing the best split in the third node Best vertical split is at 2.46.
52 Regression tree (prostate cancer) (figure slide)
53 Regression tree (prostate cancer) (figure slide)
54 Regression tree (prostate cancer) (3-D figure: lpsa vs. lweight and lcavol)
55-59 Classification tree (hepatitis): the fitted tree shown step by step over five slides (splits on protein and alkphos; split values and leaf counts not fully transcribed).
60 Pruning If the tree is too big, the lower branches are modeling noise in the data ("overfitting"). The usual paradigm is to grow the trees large and prune back unnecessary splits. Methods for pruning trees have been developed; most use some form of crossvalidation. Tuning may be necessary.
61 Case Study: Cavity Nesting birds in the Uintah Mountains, Utah Choose cp =.35
62 Crossvalidation Accuracy (cp = .35) Same table layout; cell counts not transcribed. Error rate = (approx) 0.09, or 9%
63 Classification and Regression Trees Advantages: Applicable to both regression and classification problems. Handle categorical predictors naturally. Computationally simple and quick to fit, even for large problems. No formal distributional assumptions (non-parametric). Can handle highly non-linear interactions and classification boundaries. Automatic variable selection. Handle missing values through surrogate variables. Very easy to interpret if the tree is small.
64 Classification and Regression Trees Advantages (ctnd): The picture of the tree can give valuable insights into which variables are important and where. The terminal nodes suggest a natural clustering of data into homogeneous groups. (Tree diagram not fully transcribed.)
65 Classification and Regression Trees Disadvantages: Accuracy - current methods, such as support vector machines and ensemble classifiers, often have 30% lower error rates than CART. Instability - if we change the data a little, the tree picture can change a lot, so the interpretation is not as straightforward as it appears. Today, we can do better! Random Forests.
66 Outline Background. Trees. Bagging predictors. Random Forests algorithm. Variable importance. Proximity measures. Visualization. Partial plots and interpretation of effects.
67 Data and Underlying Function (figure slide)
68 Single Regression Tree (figure slide)
69 10 Regression Trees (figure slide)
70 Average of 10 Regression Trees (figure slide)
71 Hard problem for a single tree (figure slide)
72 Single tree (figure slide)
73 25 Averaged Trees (figure slide)
74 25 Voted Trees (figure slide)
75 Bagging (Bootstrap Aggregating) Breiman, "Bagging Predictors", Machine Learning, 1996. Fit classification or regression models to bootstrap samples from the data and combine by voting (classification) or averaging (regression). Bootstrap sample 1 -> f_1(x); bootstrap sample 2 -> f_2(x); bootstrap sample 3 -> f_3(x); ... bootstrap sample M -> f_M(x). Model averaging: combine f_1(x),...,f_M(x) into f(x). The f_i(x)'s are base learners.
76 Bagging (Bootstrap Aggregating) A bootstrap sample is chosen at random with replacement from the data. Some observations end up in the bootstrap sample more than once, while others are not included ("out of bag"). Bagging reduces the variance of the base learner but has limited effect on the bias. It's most effective if we use strong base learners that have very little bias but high variance (unstable learners), e.g. trees. Both bagging and boosting are examples of ensemble learners that were popular in machine learning in the 1990s.
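A bootstrap sample and its out-of-bag complement are easy to generate. The sketch below is illustrative (the function name is ours); it also shows why roughly a third of cases end up out of bag, since each case is missed with probability (1 - 1/N)^N, which is about e^-1 = 0.368:

```python
import random

def bootstrap_sample(n, rng):
    """Draw n case indices at random with replacement; also return the
    out-of-bag set (cases never drawn)."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    oob = set(range(n)) - set(in_bag)
    return in_bag, oob
```

Bagging then fits one base learner per bootstrap sample and averages (regression) or votes (classification) across them.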
77 Bagging CART Error rates for CART vs. bagged CART on benchmark datasets (Waveform, Heart, Breast Cancer, Ionosphere, Diabetes, Glass, Soybean), with the percent decrease in error from bagging; table values not transcribed. Leo Breiman (1996) Bagging Predictors, Machine Learning, 24, 123-140.
78 Outline Background. Trees. Bagging predictors. Random Forests algorithm. Variable importance. Proximity measures. Visualization. Partial plots and interpretation of effects.
79 Random Forests Error rates for CART, bagged CART, and Random Forests on benchmark datasets (Waveform, Breast Cancer, Ionosphere, Diabetes, Glass); table values not transcribed. Leo Breiman (2001) Random Forests, Machine Learning, 45, 5-32.
80 Random Forests Grow a forest of many trees (the R default is 500). Grow each tree on an independent bootstrap sample* from the training data. At each node: 1. Select m variables at random out of all M possible variables (independently for each node). 2. Find the best split on the selected m variables. Grow the trees to maximum depth (classification). Vote/average the trees to get predictions for new data. *Sample N cases at random with replacement.
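The whole algorithm fits in well under a hundred lines. Below is an illustrative pure-Python sketch of the steps above (function names are ours; Gini splitting, no pruning); it mirrors the recipe on this slide, not the reference FORTRAN/R implementation:

```python
import random
from collections import Counter

def gini(ys):
    """N * sum_k p_k (1 - p_k) for one node (0 for an empty node)."""
    n = len(ys)
    if n == 0:
        return 0.0
    return n * sum((c / n) * (1 - c / n) for c in Counter(ys).values())

def grow_tree(X, y, m, rng, depth=0, max_depth=10):
    """At each node, try m randomly chosen variables; keep the best Gini split."""
    if len(set(y)) == 1 or depth == max_depth:
        return ("leaf", Counter(y).most_common(1)[0][0])
    best = None
    for j in rng.sample(range(len(X[0])), m):
        vals = sorted(set(row[j] for row in X))
        for lo, hi in zip(vals, vals[1:]):
            cut = (lo + hi) / 2
            left = [i for i in range(len(y)) if X[i][j] < cut]
            right = [i for i in range(len(y)) if X[i][j] >= cut]
            score = gini([y[i] for i in left]) + gini([y[i] for i in right])
            if best is None or score < best[0]:
                best = (score, j, cut, left, right)
    if best is None:  # no selected variable could be split
        return ("leaf", Counter(y).most_common(1)[0][0])
    _, j, cut, left, right = best
    grow = lambda idx: grow_tree([X[i] for i in idx], [y[i] for i in idx],
                                 m, rng, depth + 1, max_depth)
    return ("node", j, cut, grow(left), grow(right))

def tree_predict(tree, x):
    while tree[0] == "node":
        _, j, cut, l, r = tree
        tree = l if x[j] < cut else r
    return tree[1]

def random_forest(X, y, n_trees=100, m=None, seed=0):
    """Grow each tree on a bootstrap sample, trying m of the M variables per node."""
    rng = random.Random(seed)
    m = m or max(1, int(len(X[0]) ** 0.5))
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(y)) for _ in range(len(y))]  # bootstrap sample
        forest.append(grow_tree([X[i] for i in idx], [y[i] for i in idx], m, rng))
    return forest

def forest_predict(forest, x):
    """Majority vote over the trees."""
    return Counter(tree_predict(t, x) for t in forest).most_common(1)[0][0]
```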
81 Random Forests Inherit many of the advantages of CART: Applicable to both regression and classification problems. Yes. Handle categorical predictors naturally. Yes. Computationally simple and quick to fit, even for large problems. Yes. No formal distributional assumptions (non-parametric). Yes. Can handle highly non-linear interactions and classification boundaries. Yes. Automatic variable selection. Yes, but need variable importance too. Handles missing values through surrogate variables. Using proximities. Very easy to interpret if the tree is small. NO!
82 Random Forests But do not inherit: The picture of the tree can give valuable insights into which variables are important and where. NO! The terminal nodes suggest a natural clustering of data into homogeneous groups. NO! (Large tree diagram not transcribed.)
83 Random Forests Improve on CART with respect to: Accuracy - Random Forests is competitive with the best known machine learning methods (but note the "no free lunch" theorem). Instability - if we change the data a little, the individual trees may change, but the forest is relatively stable because it is a combination of many trees.
84 Two Natural Questions 1. Why bootstrap? (Why subsample?) Bootstrapping gives out-of-bag data, hence an estimated error rate and confusion matrix, and variable importance. 2. Why trees? Trees give proximities, hence missing value fill-in, outlier detection, and illuminating pictures of the data (clusters, structure, outliers).
85 The RF Predictor A case in the training data is not in the bootstrap sample for about one third of the trees (we say the case is "out of bag" or "oob"). Vote (or average) the predictions of these trees to give the RF predictor. The oob error rate is the error rate of the RF predictor. The oob confusion matrix is obtained from the RF predictor. For new cases, vote (or average) all the trees to get the RF predictor.
86 The RF Predictor For example, suppose we fit 1000 trees, and a case is out-of-bag in 339 of them, of which 283 say class 1 and 56 say class 2. The RF predictor for this case is class 1. The oob error gives an estimate of test set error (generalization error) as trees are added to the ensemble.
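As a sketch, the oob vote for that case (with `oob_prediction` as our own name):

```python
from collections import Counter

def oob_prediction(oob_votes):
    """RF predictor for a training case: majority vote over only those
    trees in which the case was out of bag."""
    return Counter(oob_votes).most_common(1)[0][0]
```

For the example above, `oob_prediction([1] * 283 + [2] * 56)` returns class 1.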
87 RFs do not overfit as we fit more trees (figure: oob and test error rates vs. number of trees)
88 RF handles thousands of predictors Ramón Díaz-Uriarte, Sara Alvarez de Andrés, Bioinformatics Unit, Spanish National Cancer Center, March 2005. Compared: SVM (linear kernel); KNN/crossvalidation (Dudoit et al., JASA, 2002); DLDA; Shrunken Centroids (Tibshirani et al., PNAS, 2002); Random Forests. "Given its performance, random forest and variable selection using random forest should probably become part of the standard tool-box of methods for the analysis of microarray data."
89 Microarray Datasets Datasets (with number of genes P, number of cases N, and number of classes): Leukemia, Breast, Breast, NCI, Adenocar, Brain, Colon, Lymphoma, Prostate, Srbct; table values not transcribed.
90 Microarray Error Rates Error rates for SVM, KNN, DLDA, SC, and RF on the datasets above, with RF's rank per dataset and the mean over datasets; table values not transcribed.
91 RF handles thousands of predictors Add noise to some standard datasets and see how well Random Forests: predicts; detects the important variables.
92 RF error rates (%) Error rates with no noise added, with 10 noise variables, and with 100 noise variables (plus the ratios to the no-noise error rate) for breast, diabetes, ecoli, german, glass, image, iono, liver, sonar, soy, vehicle, votes, vowel; table values not transcribed.
93 RF error rates Error rates (%) by number of noise variables (no noise added; 1,000; 10,000) for breast, glass, and votes; table values not transcribed.
94 Outline Background. Trees. Bagging predictors. Random Forests algorithm. Variable importance. Proximity measures. Visualization. Partial plots and interpretation of effects.
95 Variable Importance RF computes two measures of variable importance, one based on a rough-and-ready measure (Gini for classification) and the other based on permutations. To understand how permutation importance is computed, we need to understand local variable importance. But first...
96 RF variable importance For each dataset with m meaningful variables: how many of the top-ranked m variables (and what percent) are the meaningful ones, with 10 and with 100 added noise variables, for breast, diabetes, ecoli, german, glass, image, ionosphere, liver, sonar, soy, vehicle, votes, vowel; table values not transcribed.
97 RF variable importance Number in top m by number of noise variables (1,000; 10,000) for breast, glass, and votes; table values not transcribed.
98 Local Variable Importance We usually think about variable importance as an overall measure. In part, this is probably because we fit models with global structure (linear regression, logistic regression). In CART, variable importance is local.
99 Local Variable Importance Different variables are important in different regions of the data. If protein is high, we don't care about alkaline phosphate; similarly if protein is low. But for intermediate values of protein, alkaline phosphate is important. (Hepatitis tree diagram not fully transcribed.)
100 Local Variable Importance For each tree, look at the out-of-bag data: randomly permute the values of variable j; pass these perturbed data down the tree and save the classes. For case i and variable j, find: (error rate with variable j permuted) minus (error rate with no permutation), where the error rates are taken over all trees for which case i is out-of-bag.
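The permutation idea can be sketched in a few lines. This is a simplified illustration (function name ours): it uses one fixed prediction function and one evaluation set, rather than the per-tree oob computation the slide describes:

```python
import random

def permutation_importance(predict, X, y, j, rng):
    """Error rate after randomly permuting column j, minus the unpermuted
    error rate. A positive value means variable j matters to the predictor."""
    def err(rows):
        return sum(predict(r) != t for r, t in zip(rows, y)) / len(y)
    col = [row[j] for row in X]
    rng.shuffle(col)  # destroy the association between variable j and y
    permuted = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
    return err(permuted) - err(X)
```

Permuting a variable the predictor ignores leaves the error rate unchanged, so its importance is exactly zero; permuting a variable the predictor relies on drives the error up.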
101 Local importance for a class 2 case Per-tree table: the predicted class with no permutation and with variable 1, ..., variable m permuted; the % error over the case's oob trees rises sharply (e.g. to 35%) for the variables that matter for this case. (Table values not fully transcribed.)
102 Outline Background. Trees. Bagging predictors. Random Forests algorithm. Variable importance. Proximity measures. Visualization. Partial plots and interpretation of effects.
103 Proximities The proximity of two cases is the proportion of the time that they end up in the same node. The proximities don't just measure similarity of the variables - they also take into account the importance of the variables. Two cases that have quite different predictor variables might have large proximity if they differ only on variables that are not important. Two cases that have quite similar values of the predictor variables might have small proximity if they differ on inputs that are important.
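Given each case's terminal node in each tree, the proximity matrix is a direct count. An illustrative sketch (function name ours; the double loop is O(T n^2), fine for small n):

```python
def proximities(leaf_ids):
    """leaf_ids[t][i] = terminal node of case i in tree t.
    proximity(i, j) = fraction of trees in which i and j share a terminal node."""
    n_trees, n = len(leaf_ids), len(leaf_ids[0])
    prox = [[0.0] * n for _ in range(n)]
    for leaves in leaf_ids:
        for i in range(n):
            for j in range(n):
                if leaves[i] == leaves[j]:
                    prox[i][j] += 1.0 / n_trees
    return prox
```

By construction the matrix is symmetric with ones on the diagonal.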
104 Visualizing using Proximities To look at the data, we use classical multidimensional scaling (MDS) to get a picture in 2-D or 3-D: proximities -> MDS -> scaling variables. Might see clusters, outliers, unusual structure. Can also use nonmetric MDS.
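Classical MDS itself is a short computation: treat 1 - proximity as a squared distance, double-center it, and keep the top eigenvectors. A numpy sketch (function name ours, assuming only numpy):

```python
import numpy as np

def classical_mds(prox, dim=2):
    """Coordinates whose pairwise squared distances approximate 1 - proximity."""
    d2 = 1.0 - np.asarray(prox, dtype=float)   # squared "distances"
    n = d2.shape[0]
    j = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    b = -0.5 * j @ d2 @ j                      # double-centered Gram matrix
    w, v = np.linalg.eigh(b)                   # eigenvalues in ascending order
    top = np.argsort(w)[::-1][:dim]            # keep the dim largest
    return v[:, top] * np.sqrt(np.clip(w[top], 0.0, None))
```

Plotting the resulting 2-D coordinates, colored by class, gives the pictures shown in the case studies below.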
105 Visualizing using Proximities At-a-glance information about which classes overlap and which classes differ; find clusters within classes; find easy/hard/unusual cases. With a good tool we can also: identify characteristics of unusual points; see which variables are locally important; see how clusters or unusual points differ.
106 Visualizing using Proximities Synthetic data: 6 cases, 2 meaningful variables, 48 noise variables, 3 classes. (Figure slide.)
107 The Problem with Proximities Proximities based on all the data overfit! e.g. two cases from different classes must have proximity zero if trees are grown deep. (Figures: data and MDS plots.)
108 Proximity-weighted Nearest Neighbors RF is like a nearest-neighbor classifier: use the proximities as weights for nearest neighbors, classify the training data, and compute the error rate. We want the error rate to be close to the RF oob error rate. BAD NEWS! If we compute proximities from trees in which both cases are oob, we don't get good accuracy when we use the proximities for prediction!
109 Proximity-weighted Nearest Neighbors RF oob error vs. proximity-weighted nearest-neighbor error for breast, diabetes, ecoli, german, glass, image, iono, liver, sonar, soy, vehicle, votes, vowel; table values not fully transcribed.
110 Proximity-weighted Nearest Neighbors The same comparison for Waveform, Twonorm, Threenorm, Ringnorm; table values not transcribed.
111 New Proximity Method Start with P = I, the identity matrix. For each observation i: for each tree in which case i is oob: pass case i down the tree and note which terminal node it falls into; increase the proximity between observation i and each of the k in-bag cases in the same terminal node by the amount 1/k. One can show that, except for ties, this gives the same error rate as RF when used as a proximity-weighted nearest-neighbor classifier.
112 New Method RF oob error vs. the new proximity-weighted nearest-neighbor error for breast, diabetes, ecoli, german, glass, image, iono, liver, sonar, soy, vehicle, votes, vowel; table values not transcribed.
113 New Method The same comparison for Waveform, Twonorm, Threenorm, Ringnorm; table values not transcribed.
114 But... The new proximity matrix is not symmetric! There are methods for doing multidimensional scaling on asymmetric matrices.
115 Other Uses for Random Forests Missing data imputation. Feature selection (before using a method that cannot handle high dimensionality). Unsupervised learning (cluster analysis). Survival analysis without making the proportional hazards assumption.
116 Missing Data Imputation Fast way: replace missing values for a given variable using the median of the non-missing values (or the most frequent value, if categorical). Better way (using proximities): 1. Start with the fast way. 2. Get proximities. 3. Replace missing values in case i by a weighted average of non-missing values, with weights proportional to the proximity between case i and the cases with the non-missing values. Repeat steps 2 and 3 a few times (5 or 6).
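Step 3 is just a proximity-weighted average. An illustrative sketch for one numeric variable (function name ours):

```python
def impute_step(values, missing, prox):
    """One pass of step 3: fill each missing entry of one numeric variable
    with the proximity-weighted average of the non-missing entries."""
    donors = [j for j in range(len(values)) if j not in missing]
    filled = list(values)
    for i in missing:
        total = sum(prox[i][j] for j in donors)
        filled[i] = sum(prox[i][j] * values[j] for j in donors) / total
    return filled
```

In the full procedure this pass is alternated with refitting the forest and recomputing proximities, so the fills sharpen over a handful of iterations.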
117 Feature Selection Ramón Díaz-Uriarte: varSelRF R package. In the NIPS 2003 feature selection competition, several of the top entries used RF for feature selection.
118 Unsupervised Learning "Global histone modification patterns predict risk of prostate cancer recurrence", David B. Seligson, Steve Horvath, Tao Shi, Hong Yu, Sheila Tze, Michael Grunstein and Siavash K. Kurdistani (all at UCLA). Used RF clustering of 83 tissue microarrays to find two disease subgroups with distinct risks of tumor recurrence. re3672.html
119 Survival Analysis Hemant Ishwaran and Udaya B. Kogalur: randomSurvivalForest R package.
120 Outline Background. Trees. Bagging predictors. Random Forests algorithm. Variable importance. Proximity measures. Visualization.
121 Case Study: Cavity Nesting birds in the Uintah Mountains, Utah Red-naped sapsucker (Sphyrapicus nuchalis) (n = 42 nest sites). Mountain chickadee (Parus gambeli) (n = 42 nest sites). Northern flicker (Colaptes auratus) (n = 23 nest sites). n = 106 non-nest sites.
122 Case Study: Cavity Nesting birds in the Uintah Mountains, Utah Response variable is the presence (coded 1) or absence (coded 0) of a nest. Predictor variables (measured on .4 ha plots around the sites) are: Numbers of trees in various size classes, from less than 1 inch in diameter at breast height to greater than 15 inches in diameter. Number of snags and number of downed snags. Percent shrub cover. Number of conifers. Stand type, coded as 0 for pure aspen and 1 for mixed aspen and conifer.
123 Autism Data courtesy of J.D. Odell and R. Torres, USU. 154 subjects (308 chromosomes). 7 variables, all categorical (up to 3 categories). 2 classes: normal, blue (69 subjects); autistic, red (85 subjects).
124 Brain Cancer Microarrays Pomeroy et al., Nature, 2002. Dettling and Bühlmann, Genome Biology, 2002. 42 cases, 5,597 genes, 5 tumor types: 10 medulloblastomas (BLUE), 10 malignant gliomas (PALE BLUE), 10 atypical teratoid/rhabdoid tumors (AT/RTs) (GREEN), 4 human cerebella (ORANGE), 8 PNETs (RED).
125 Dementia Data courtesy of J.T. Tschanz, USU. 516 subjects. 28 variables. 2 classes: no cognitive impairment, blue (372 people); Alzheimer's, red (144 people).
126 Metabolomics (Lou Gehrig's disease) Data courtesy of Metabolon (Chris Beecham). 63 subjects. 37 variables. 3 classes: blue (22 subjects) ALS, no meds; green (9 subjects) ALS, on meds; red (32 subjects) healthy.
127 Random Forests Software Free, open-source code (FORTRAN, java). Commercial version (academic discounts). R interface, independent development (Andy Liaw and Matthew Wiener).
128 Java Software Raft uses VisAD and ImageJ. These are both open-source projects with great mailing lists and helpful developers.
129 References Leo Breiman, Jerome Friedman, Richard Olshen, Charles Stone (1984) Classification and Regression Trees (Wadsworth). Leo Breiman (1996) Bagging Predictors, Machine Learning, 24, 123-140. Leo Breiman (2001) Random Forests, Machine Learning, 45, 5-32. Trevor Hastie, Rob Tibshirani, Jerome Friedman (2009) The Elements of Statistical Learning (Springer).
Distances, Clustering, and Classification. Heatmaps
Distances, Clustering, and Classification Heatmaps 1 Distance Clustering organizes things that are close into groups What does it mean for two genes to be close? What does it mean for two samples to be
Better credit models benefit us all
Better credit models benefit us all Agenda Credit Scoring - Overview Random Forest - Overview Random Forest outperform logistic regression for credit scoring out of the box Interaction term hypothesis
Leveraging Ensemble Models in SAS Enterprise Miner
ABSTRACT Paper SAS133-2014 Leveraging Ensemble Models in SAS Enterprise Miner Miguel Maldonado, Jared Dean, Wendy Czika, and Susan Haller SAS Institute Inc. Ensemble models combine two or more models to
Risk pricing for Australian Motor Insurance
Risk pricing for Australian Motor Insurance Dr Richard Brookes November 2012 Contents 1. Background Scope How many models? 2. Approach Data Variable filtering GLM Interactions Credibility overlay 3. Model
ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA
ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA D.Lavanya 1 and Dr.K.Usha Rani 2 1 Research Scholar, Department of Computer Science, Sree Padmavathi Mahila Visvavidyalayam, Tirupati, Andhra Pradesh,
Package trimtrees. February 20, 2015
Package trimtrees February 20, 2015 Type Package Title Trimmed opinion pools of trees in a random forest Version 1.2 Date 2014-08-1 Depends R (>= 2.5.0),stats,randomForest,mlbench Author Yael Grushka-Cockayne,
Package bigrf. February 19, 2015
Version 0.1-11 Date 2014-05-16 Package bigrf February 19, 2015 Title Big Random Forests: Classification and Regression Forests for Large Data Sets Maintainer Aloysius Lim OS_type
Applied Multivariate Analysis - Big data analytics
Applied Multivariate Analysis - Big data analytics Nathalie Villa-Vialaneix [email protected] http://www.nathalievilla.org M1 in Economics and Economics and Statistics Toulouse School of
Fast Analytics on Big Data with H20
Fast Analytics on Big Data with H20 0xdata.com, h2o.ai Tomas Nykodym, Petr Maj Team About H2O and 0xdata H2O is a platform for distributed in memory predictive analytics and machine learning Pure Java,
Comparison of Data Mining Techniques used for Financial Data Analysis
Comparison of Data Mining Techniques used for Financial Data Analysis Abhijit A. Sawant 1, P. M. Chawan 2 1 Student, 2 Associate Professor, Department of Computer Technology, VJTI, Mumbai, INDIA Abstract
Knowledge Discovery and Data Mining
Knowledge Discovery and Data Mining Unit # 11 Sajjad Haider Fall 2013 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right
The Data Mining Process
Sequence for Determining Necessary Data. Wrong: Catalog everything you have, and decide what data is important. Right: Work backward from the solution, define the problem explicitly, and map out the data
Data Mining Practical Machine Learning Tools and Techniques
Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea
6 Classification and Regression Trees, 7 Bagging, and Boosting
hs24 v.2004/01/03 Prn:23/02/2005; 14:41 F:hs24011.tex; VTEX/ES p. 1 1 Handbook of Statistics, Vol. 24 ISSN: 0169-7161 2005 Elsevier B.V. All rights reserved. DOI 10.1016/S0169-7161(04)24011-1 1 6 Classification
BIOINF 585 Fall 2015 Machine Learning for Systems Biology & Clinical Informatics http://www.ccmb.med.umich.edu/node/1376
Course Director: Dr. Kayvan Najarian (DCM&B, [email protected]) Lectures: Labs: Mondays and Wednesdays 9:00 AM -10:30 AM Rm. 2065 Palmer Commons Bldg. Wednesdays 10:30 AM 11:30 AM (alternate weeks) Rm.
Classification of Bad Accounts in Credit Card Industry
Classification of Bad Accounts in Credit Card Industry Chengwei Yuan December 12, 2014 Introduction Risk management is critical for a credit card company to survive in such competing industry. In addition
An Overview of Data Mining: Predictive Modeling for IR in the 21 st Century
An Overview of Data Mining: Predictive Modeling for IR in the 21 st Century Nora Galambos, PhD Senior Data Scientist Office of Institutional Research, Planning & Effectiveness Stony Brook University AIRPO
Model Combination. 24 Novembre 2009
Model Combination 24 Novembre 2009 Datamining 1 2009-2010 Plan 1 Principles of model combination 2 Resampling methods Bagging Random Forests Boosting 3 Hybrid methods Stacking Generic algorithm for mulistrategy
Social Media Mining. Data Mining Essentials
Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers
Supervised Feature Selection & Unsupervised Dimensionality Reduction
Supervised Feature Selection & Unsupervised Dimensionality Reduction Feature Subset Selection Supervised: class labels are given Select a subset of the problem features Why? Redundant features much or
How To Make A Credit Risk Model For A Bank Account
TRANSACTIONAL DATA MINING AT LLOYDS BANKING GROUP Csaba Főző [email protected] 15 October 2015 CONTENTS Introduction 04 Random Forest Methodology 06 Transactional Data Mining Project 17 Conclusions
Chapter 12 Discovering New Knowledge Data Mining
Chapter 12 Discovering New Knowledge Data Mining Becerra-Fernandez, et al. -- Knowledge Management 1/e -- 2004 Prentice Hall Additional material 2007 Dekai Wu Chapter Objectives Introduce the student to
Data Mining Techniques Chapter 6: Decision Trees
Data Mining Techniques Chapter 6: Decision Trees What is a classification decision tree?.......................................... 2 Visualizing decision trees...................................................
THE HYBRID CART-LOGIT MODEL IN CLASSIFICATION AND DATA MINING. Dan Steinberg and N. Scott Cardell
THE HYBID CAT-LOGIT MODEL IN CLASSIFICATION AND DATA MINING Introduction Dan Steinberg and N. Scott Cardell Most data-mining projects involve classification problems assigning objects to classes whether
Data Mining - Evaluation of Classifiers
Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010
The treatment of missing values and its effect in the classifier accuracy
The treatment of missing values and its effect in the classifier accuracy Edgar Acuña 1 and Caroline Rodriguez 2 1 Department of Mathematics, University of Puerto Rico at Mayaguez, Mayaguez, PR 00680 [email protected]
Big Data Analytics CSCI 4030
High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Web advertising
Unsupervised and supervised dimension reduction: Algorithms and connections
Unsupervised and supervised dimension reduction: Algorithms and connections Jieping Ye Department of Computer Science and Engineering Evolutionary Functional Genomics Center The Biodesign Institute Arizona
Knowledge Discovery and Data Mining
Knowledge Discovery and Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Evaluating the Accuracy of a Classifier Holdout, random subsampling, crossvalidation, and the bootstrap are common techniques for
Predicting borrowers chance of defaulting on credit loans
Predicting borrowers chance of defaulting on credit loans Junjie Liang ([email protected]) Abstract Credit score prediction is of great interests to banks as the outcome of the prediction algorithm
MS1b Statistical Data Mining
MS1b Statistical Data Mining Yee Whye Teh Department of Statistics Oxford http://www.stats.ox.ac.uk/~teh/datamining.html Outline Administrivia and Introduction Course Structure Syllabus Introduction to
Marrying Prediction and Segmentation to Identify Sales Leads
Marrying Prediction and Segmentation to Identify Sales Leads Jim Porzak, The Generations Network Alex Kriney, Sun Microsystems February 19, 2009 1 Business Challenge In 2005 Sun Microsystems began the
M1 in Economics and Economics and Statistics Applied multivariate Analysis - Big data analytics Worksheet 3 - Random Forest
Nathalie Villa-Vialaneix Année 2014/2015 M1 in Economics and Economics and Statistics Applied multivariate Analysis - Big data analytics Worksheet 3 - Random Forest This worksheet s aim is to learn how
STATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and
Clustering Techniques and STATISTICA Case Study: Defining Clusters of Shopping Center Patrons STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table
Fraud Detection for Online Retail using Random Forests
Fraud Detection for Online Retail using Random Forests Eric Altendorf, Peter Brende, Josh Daniel, Laurent Lessard Abstract As online commerce becomes more common, fraud is an increasingly important concern.
Implementation of Breiman s Random Forest Machine Learning Algorithm
Implementation of Breiman s Random Forest Machine Learning Algorithm Frederick Livingston Abstract This research provides tools for exploring Breiman s Random Forest algorithm. This paper will focus on
EXPLORING & MODELING USING INTERACTIVE DECISION TREES IN SAS ENTERPRISE MINER. Copyr i g ht 2013, SAS Ins titut e Inc. All rights res er ve d.
EXPLORING & MODELING USING INTERACTIVE DECISION TREES IN SAS ENTERPRISE MINER ANALYTICS LIFECYCLE Evaluate & Monitor Model Formulate Problem Data Preparation Deploy Model Data Exploration Validate Models
Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.
Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C
CS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning.
Lecture Machine Learning Milos Hauskrecht [email protected] 539 Sennott Square, x5 http://www.cs.pitt.edu/~milos/courses/cs75/ Administration Instructor: Milos Hauskrecht [email protected] 539 Sennott
Lecture 10: Regression Trees
Lecture 10: Regression Trees 36-350: Data Mining October 11, 2006 Reading: Textbook, sections 5.2 and 10.5. The next three lectures are going to be about a particular kind of nonlinear predictive model,
A Review of Missing Data Treatment Methods
A Review of Missing Data Treatment Methods Liu Peng, Lei Lei Department of Information Systems, Shanghai University of Finance and Economics, Shanghai, 200433, P.R. China ABSTRACT Missing data is a common
Chapter 11 Boosting. Xiaogang Su Department of Statistics University of Central Florida - 1 -
Chapter 11 Boosting Xiaogang Su Department of Statistics University of Central Florida - 1 - Perturb and Combine (P&C) Methods have been devised to take advantage of the instability of trees to create
Supervised Learning (Big Data Analytics)
Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used
Data Mining Classification: Decision Trees
Data Mining Classification: Decision Trees Classification Decision Trees: what they are and how they work Hunt s (TDIDT) algorithm How to select the best split How to handle Inconsistent data Continuous
Statistics Graduate Courses
Statistics Graduate Courses STAT 7002--Topics in Statistics-Biological/Physical/Mathematics (cr.arr.).organized study of selected topics. Subjects and earnable credit may vary from semester to semester.
Decompose Error Rate into components, some of which can be measured on unlabeled data
Bias-Variance Theory Decompose Error Rate into components, some of which can be measured on unlabeled data Bias-Variance Decomposition for Regression Bias-Variance Decomposition for Classification Bias-Variance
The Operational Value of Social Media Information. Social Media and Customer Interaction
The Operational Value of Social Media Information Dennis J. Zhang (Kellogg School of Management) Ruomeng Cui (Kelley School of Business) Santiago Gallino (Tuck School of Business) Antonio Moreno-Garcia
Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j
Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j What is Kiva? An organization that allows people to lend small amounts of money via the Internet
L25: Ensemble learning
L25: Ensemble learning Introduction Methods for constructing ensembles Combination strategies Stacked generalization Mixtures of experts Bagging Boosting CSCE 666 Pattern Analysis Ricardo Gutierrez-Osuna
A Comparison of Decision Tree and Logistic Regression Model Xianzhe Chen, North Dakota State University, Fargo, ND
Paper D02-2009 A Comparison of Decision Tree and Logistic Regression Model Xianzhe Chen, North Dakota State University, Fargo, ND ABSTRACT This paper applies a decision tree model and logistic regression
D-optimal plans in observational studies
D-optimal plans in observational studies Constanze Pumplün Stefan Rüping Katharina Morik Claus Weihs October 11, 2005 Abstract This paper investigates the use of Design of Experiments in observational
Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05
Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification
Supervised and unsupervised learning - 1
Chapter 3 Supervised and unsupervised learning - 1 3.1 Introduction The science of learning plays a key role in the field of statistics, data mining, artificial intelligence, intersecting with areas in
Local classification and local likelihoods
Local classification and local likelihoods November 18 k-nearest neighbors The idea of local regression can be extended to classification as well The simplest way of doing so is called nearest neighbor
Identifying SPAM with Predictive Models
Identifying SPAM with Predictive Models Dan Steinberg and Mikhaylo Golovnya Salford Systems 1 Introduction The ECML-PKDD 2006 Discovery Challenge posed a topical problem for predictive modelers: how to
New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Introduction
Introduction New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Predictive analytics encompasses the body of statistical knowledge supporting the analysis of massive data sets.
DATA MINING SPECIES DISTRIBUTION AND LANDCOVER. Dawn Magness Kenai National Wildife Refuge
DATA MINING SPECIES DISTRIBUTION AND LANDCOVER Dawn Magness Kenai National Wildife Refuge Why Data Mining Random Forest Algorithm Examples from the Kenai Species Distribution Model Pattern Landcover Model
Statistical Data Mining. Practical Assignment 3 Discriminant Analysis and Decision Trees
Statistical Data Mining Practical Assignment 3 Discriminant Analysis and Decision Trees In this practical we discuss linear and quadratic discriminant analysis and tree-based classification techniques.
Location matters. 3 techniques to incorporate geo-spatial effects in one's predictive model
Location matters. 3 techniques to incorporate geo-spatial effects in one's predictive model Xavier Conort [email protected] Motivation Location matters! Observed value at one location is
Mining Direct Marketing Data by Ensembles of Weak Learners and Rough Set Methods
Mining Direct Marketing Data by Ensembles of Weak Learners and Rough Set Methods Jerzy B laszczyński 1, Krzysztof Dembczyński 1, Wojciech Kot lowski 1, and Mariusz Paw lowski 2 1 Institute of Computing
CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore.
CI6227: Data Mining Lesson 11b: Ensemble Learning Sinno Jialin PAN Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore Acknowledgements: slides are adapted from the lecture notes
Data Mining for Knowledge Management. Classification
1 Data Mining for Knowledge Management Classification Themis Palpanas University of Trento http://disi.unitn.eu/~themis Data Mining for Knowledge Management 1 Thanks for slides to: Jiawei Han Eamonn Keogh
HYBRID PROBABILITY BASED ENSEMBLES FOR BANKRUPTCY PREDICTION
HYBRID PROBABILITY BASED ENSEMBLES FOR BANKRUPTCY PREDICTION Chihli Hung 1, Jing Hong Chen 2, Stefan Wermter 3, 1,2 Department of Management Information Systems, Chung Yuan Christian University, Taiwan
HT2015: SC4 Statistical Data Mining and Machine Learning
HT2015: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Bayesian Nonparametrics Parametric vs Nonparametric
Beating the MLB Moneyline
Beating the MLB Moneyline Leland Chen [email protected] Andrew He [email protected] 1 Abstract Sports forecasting is a challenging task that has similarities to stock market prediction, requiring time-series
Mathematical Models of Supervised Learning and their Application to Medical Diagnosis
Genomic, Proteomic and Transcriptomic Lab High Performance Computing and Networking Institute National Research Council, Italy Mathematical Models of Supervised Learning and their Application to Medical
Event driven trading new studies on innovative way. of trading in Forex market. Michał Osmoła INIME live 23 February 2016
Event driven trading new studies on innovative way of trading in Forex market Michał Osmoła INIME live 23 February 2016 Forex market From Wikipedia: The foreign exchange market (Forex, FX, or currency
Data Mining Techniques for Prognosis in Pancreatic Cancer
Data Mining Techniques for Prognosis in Pancreatic Cancer by Stuart Floyd A Thesis Submitted to the Faculty of the WORCESTER POLYTECHNIC INSTITUE In partial fulfillment of the requirements for the Degree
Data Mining and Data Warehousing. Henryk Maciejewski. Data Mining Predictive modelling: regression
Data Mining and Data Warehousing Henryk Maciejewski Data Mining Predictive modelling: regression Algorithms for Predictive Modelling Contents Regression Classification Auxiliary topics: Estimation of prediction
Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification
Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification Henrik Boström School of Humanities and Informatics University of Skövde P.O. Box 408, SE-541 28 Skövde
Beating the NCAA Football Point Spread
Beating the NCAA Football Point Spread Brian Liu Mathematical & Computational Sciences Stanford University Patrick Lai Computer Science Department Stanford University December 10, 2010 1 Introduction Over
BOOSTED REGRESSION TREES: A MODERN WAY TO ENHANCE ACTUARIAL MODELLING
BOOSTED REGRESSION TREES: A MODERN WAY TO ENHANCE ACTUARIAL MODELLING Xavier Conort [email protected] Session Number: TBR14 Insurance has always been a data business The industry has successfully
Package acrm. R topics documented: February 19, 2015
Package acrm February 19, 2015 Type Package Title Convenience functions for analytical Customer Relationship Management Version 0.1.1 Date 2014-03-28 Imports dummies, randomforest, kernelfactory, ada Author
CART 6.0 Feature Matrix
CART 6.0 Feature Matri Enhanced Descriptive Statistics Full summary statistics Brief summary statistics Stratified summary statistics Charts and histograms Improved User Interface New setup activity window
Data Mining: An Overview. David Madigan http://www.stat.columbia.edu/~madigan
Data Mining: An Overview David Madigan http://www.stat.columbia.edu/~madigan Overview Brief Introduction to Data Mining Data Mining Algorithms Specific Eamples Algorithms: Disease Clusters Algorithms:
Insurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví. Pavel Kříž. Seminář z aktuárských věd MFF 4.
Insurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví Pavel Kříž Seminář z aktuárských věd MFF 4. dubna 2014 Summary 1. Application areas of Insurance Analytics 2. Insurance Analytics
Classification and Regression Trees (CART) Theory and Applications
Classification and Regression Trees (CART) Theory and Applications A Master Thesis Presented by Roman Timofeev (188778) to Prof. Dr. Wolfgang Härdle CASE - Center of Applied Statistics and Economics Humboldt
An Overview and Evaluation of Decision Tree Methodology
An Overview and Evaluation of Decision Tree Methodology ASA Quality and Productivity Conference Terri Moore Motorola Austin, TX [email protected] Carole Jesse Cargill, Inc. Wayzata, MN [email protected]
COMP3420: Advanced Databases and Data Mining. Classification and prediction: Introduction and Decision Tree Induction
COMP3420: Advanced Databases and Data Mining Classification and prediction: Introduction and Decision Tree Induction Lecture outline Classification versus prediction Classification A two step process Supervised
