Multinomial Logistic Regression Ensembles
Abstract

This article proposes a method for multiclass classification problems using ensembles of multinomial logistic regression models. A multinomial logit model is used as a base classifier in ensembles built from random partitions of the predictors. The multinomial logit model can be applied to each mutually exclusive subset of the feature space without variable selection. By combining multiple models, the proposed method can handle very large databases without the dimensionality constraints that ordinarily limit high-dimensional analysis, and the random partition can improve prediction accuracy by reducing the correlation among base classifiers. The proposed method is implemented in R, and its performance, including overall prediction accuracy, sensitivity, and specificity for each category, is evaluated on two real data sets and on simulated data sets. To investigate the quality of prediction in terms of sensitivity and specificity, the area under the ROC curve (AUC) is also examined. The proposed model is compared to a single multinomial logit model and shows a substantial improvement in overall prediction accuracy. The proposed method is also compared with other classification methods such as Random Forest, Support Vector Machines, and the Random Multinomial Logit model.

Keywords: Class prediction; Ensemble; Logistic regression; Majority voting; Multinomial logit; Random partition.
1 Introduction

Acute gastrointestinal bleeding (GIB) is a growing healthcare problem due to rising NSAID (non-steroidal anti-inflammatory drug) use in an aging population (Rockall et al., 1995). In the emergency room (ER), the ER physician can misdiagnose a GIB patient at least 50% of the time (Kollef et al., 1997). While it would be best for a gastroenterologist to diagnose GIB patients, this is not feasible due to time and resource constraints. Classification models can be used to assist the ER physician in diagnosing GIB patients more efficiently and effectively, directing scarce healthcare resources to those who need them the most. Chu et al. (2008) evaluated eight different classification models on a 121-patient GIB database. Using clinical and laboratory information available within a few hours of patient presentation, the models can be used to predict the source of bleeding, need for intervention, and disposition in patients with acute upper, mid, and lower GIB.

Another application we consider in this paper is the determination of stromal signatures in breast carcinoma. West et al. (2005) examined two types of tumors with fibroblastic features, solitary fibrous tumor (SFT) and desmoid-type fibromatosis (DTF), by DNA microarray analysis and found that the two tumor types differ in their patterns of expression in various functional categories of genes. Their findings suggest that gene expression patterns of soft tissue tumors can be used to discover new markers for normal connective tissue cells. Compared to the GIB data, the breast cancer data (West et al., 2005) are high-dimensional: the data set contains 4148 variables on 57 tumor patients. A classification method can be applied to classify the data into DTF, SFT, and other types of tumors.

Logistic regression is a model that fits the log odds of the response to a linear combination of the explanatory variables. It is used mainly for binary responses, although there are extensions for multi-way responses as well. An advantage of logistic regression is that the model can be represented clearly and succinctly. Logistic regression is widely used in areas such as the medical and social sciences.
LORENS (Logistic Regression Ensembles: Lim et al., 2010) uses the CERP (Classification by Ensembles from Random Partitions: Ahn et al., 2007) algorithm to classify binary responses based on high-dimensional data, using the logistic regression model as a base classifier. CERP is similar to the random subspace method (Ho, 1998), but differs in that the base classifiers in a CERP ensemble are obtained from mutually exclusive sets of predictors to increase diversity, whereas in random subspace they are obtained by random selection with overlap. Although logistic regression is known to be robust as a classification method and is widely used, it requires more observations than predictors. Thus, in general, to use logistic regression on a high-dimensional predictor space, variable selection is unavoidable. In LORENS, however, each base classifier is constructed from a different set of predictors determined by a random partition of the entire set of predictors, so that there are always more observations than predictors. Hence the logistic regression model can be used without variable selection. LORENS generates multiple ensembles with different random partitions and conducts majority voting over the individual ensembles to further improve overall accuracy.

LORENS is a useful classification method for high-dimensional data, but it is designed for binary responses. Multiclass problems are common; thus we developed a method comparable to LORENS for multi-way classification. In this study, we expanded LORENS to multiclass problems (mlorens) and compare its performance with that of a single multinomial logistic regression model. The multinomial logistic regression (MLR) model can be easily implemented and is less computer-intensive than the tree-based CERP. Lim et al. (2010) showed that the prediction accuracy of LORENS is as good as that of RF (Random Forest: Breiman, 2001) or SVM (Support Vector Machines: Vapnik, 1999) on several real data sets and in a simulation study.

Prinzie and Van den Poel (2008) proposed the Random Multinomial Logit (RMNL) model, which fits multinomial logit models on different bootstrap samples. They borrowed the structure of RF and used the idea of bagging.
RF generates diversity by randomly selecting variables at each node of a base tree, so that various variables are involved in each tree classifier. In RMNL, however, one base classifier is built from only one random selection of variables.

In this paper we investigate the improvement of mlorens over MLR, which has not been studied before. The GIB data, the breast carcinoma data, and simulated data are used. We focus on the improvement of mlorens over MLR, but to evaluate mlorens further, it is also compared to RF, SVM, and RMNL. RMNL is included in the comparison because it is also an ensemble classification method using the logistic regression model as a base classifier. We implemented the mlorens algorithm in R. The software will be available upon request.

2 Methods

2.1 LORENS (Logistic regression ensembles)

Based on the CERP algorithm, Lim et al. (2010) developed LORENS by using logistic regression models as base classifiers. To minimize the correlation among the classifiers in an ensemble, the feature space is randomly partitioned into K subspaces of roughly equal size. Since the subspaces are randomly chosen from the same distribution, we assume that there is no bias in the selection of the predictor variables in each subspace. In each of these subspaces, we fit a full logistic regression model without variable selection. Due to the randomness, we expect nearly equal classification error probabilities among the K classifiers. LORENS combines the results of these multiple logistic regression models to achieve improved accuracy in classifying a patient's outcome by taking the average of the predicted values within an ensemble. The predicted values from all the base classifiers (logistic regression models) in an ensemble are averaged, and the sample is classified as either 0 or 1 by applying a threshold to this average. Details of the method can be found in Lim et al. (2010).
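As an illustration of this scheme, a minimal sketch in R is given below, assuming a 0/1 response vector y and numeric predictor matrices x (training) and newx (test). The function name, the partition size K, and the 0.5 threshold are our own illustrative choices, not the tuned values used in the paper.

    # Minimal LORENS-style ensemble: random partition plus logistic base classifiers.
    # x, newx: numeric matrices with the same columns; y: 0/1 response vector.
    lorens_fit_predict <- function(x, y, newx, K = 10, threshold = 0.5) {
      p <- ncol(x)
      # Randomly partition the p predictors into K mutually exclusive subsets
      groups <- split(sample(seq_len(p)), rep(seq_len(K), length.out = p))
      # Fit a full logistic regression on each subset (no variable selection)
      probs <- sapply(groups, function(g) {
        fit <- glm(y ~ ., data = data.frame(y = y, x[, g, drop = FALSE]),
                   family = binomial)
        predict(fit, newdata = data.frame(newx[, g, drop = FALSE]),
                type = "response")
      })
      # Average the K predicted probabilities and apply the threshold
      as.integer(rowMeans(probs) >= threshold)
    }

Because each of the K subsets contains only about p/K predictors, every base model sees more observations than predictors, which is the point of the partition.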
2.2 mlorens for the multinomial logistic regression model

Suppose Y is a response variable with J > 2 categories. Let {π_1, ..., π_J} be the response probabilities, satisfying Σ_{j=1}^{J} π_j = 1. When one takes n independent observations based on these probabilities, the probability distribution for the number of outcomes occurring in each of the J categories is multinomial. If one category is fixed as a baseline, we have J - 1 log odds paired with the baseline category. When the last category is the baseline, the baseline-category logits are log(π_j / π_J), j = 1, ..., J - 1. The logit model using baseline-category logits with a predictor x has the form

    log(π_j / π_J) = α_j + β_j x,   j = 1, ..., J - 1.

The model consists of J - 1 logit equations with separate parameters. By fitting these J - 1 logit equations simultaneously, estimates of the model parameters can be obtained, and the same parameter estimates occur for a pair of categories regardless of the baseline category (Agresti, 1996). The estimates of the response probabilities can be expressed as

    π_j = exp(α_j + β_j x) / Σ_h exp(α_h + β_h x),   j = 1, ..., J - 1.

Some alternative approaches can be considered instead of the above multinomial logistic regression (MLR) model. One is nested binary models (Jobson, 1992): for each class, a logistic regression model is fit for that class versus the remaining classes combined, and the class with the highest estimated response is chosen. We compared the MLR model with this alternative, but we do not report the results in this paper because they were not significantly different.

We developed mlorens, which uses MLR as a base classifier. We anticipate that the proposed method will possess the nice properties of MLR and simultaneously inherit the advantage of CERP in handling high-dimensional data sets.
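The mlorens idea can be sketched along the same lines; the hedged illustration below assumes the nnet package is available. The helper name and the fixed partition size K are ours, and the class with the highest averaged probability estimate is chosen, as described above.

    library(nnet)  # provides multinom()

    # mlorens-style ensemble: fit an MLR base classifier on each random subset
    # of predictors and average the predicted class probabilities.
    mlorens_fit_predict <- function(x, y, newx, K = 10) {
      y <- factor(y)
      groups <- split(sample(seq_len(ncol(x))),
                      rep(seq_len(K), length.out = ncol(x)))
      prob_list <- lapply(groups, function(g) {
        train <- data.frame(y = y, x[, g, drop = FALSE])
        fit <- multinom(y ~ ., data = train, trace = FALSE)
        predict(fit, newdata = data.frame(newx[, g, drop = FALSE]),
                type = "probs")
      })
      avg <- Reduce(`+`, prob_list) / length(prob_list)  # average over base classifiers
      levels(y)[max.col(avg)]                            # class with highest estimate
    }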
3 Applications

The improvement of mlorens over a single MLR model is investigated, and its performance is compared to RF, SVM, and RMNL. The methods are applied to real data sets and to simulated data. Twenty repetitions of 10-fold cross-validation (CV) were conducted for each model. For mlorens, the optimal partition sizes were determined in the training phase. For a single MLR model, variable selection is necessary when the sample size is smaller than the number of predictors; a fixed number of predictors was preassigned, and the variables were selected from the training data. Variable selection was not performed for mlorens, as the random partition makes it unnecessary.

ACC (overall accuracy), SENS (sensitivity), SPEC (specificity), PPV (positive predictive value: rate of true positives among positive predictions), and NPV (negative predictive value: rate of true negatives among negative predictions) were compared. Receiver operating characteristic (ROC) curves were created, and the areas under the curves (AUC) were compared to assess the performance of the models in terms of sensitivity and specificity. The AUC is a scalar measure gauging the performance of the ROC curve: an AUC of 0.5 represents a random prediction, and an AUC of 1 represents a perfect prediction. The Mann-Whitney statistic, which is equivalent to the AUC (DeLong et al., 1988), was calculated.

Prediction accuracy is calculated as the total number of correct predictions divided by the total number of predictions, and sensitivity and specificity are calculated for each category as follows. If the number of classes is three, for example, the following table shows the counts of predicted versus true classifications.

                            True class
                           1    2    3
    Predicted class   1    a    d    g
                      2    b    e    h
                      3    c    f    i

The sensitivity for class 1 is calculated as a/(a + b + c), for class 2 as e/(d + e + f), and for class 3 as i/(g + h + i). The specificity for class 1 is (e + f + h + i)/(d + e + f + g + h + i), for class 2 is (a + c + g + i)/(a + b + c + g + h + i), and for class 3 is (a + b + d + e)/(a + b + c + d + e + f).
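For example, these per-class quantities can be computed directly from such a confusion matrix; the helper below is our own illustration, with predicted classes in rows and true classes in columns as in the table above.

    # Per-class sensitivity and specificity from a J x J confusion matrix
    # (rows = predicted class, columns = true class, as in the table above).
    class_sens_spec <- function(tab) {
      J <- nrow(tab)
      sens <- diag(tab) / colSums(tab)        # e.g. a / (a + b + c) for class 1
      spec <- sapply(seq_len(J), function(j) {
        sum(tab[-j, -j]) / sum(tab[, -j])     # true negatives / all non-class-j cases
      })
      list(sensitivity = sens, specificity = spec)
    }

    # Small example with the 3-class layout above (toy predictions):
    tab <- table(predicted = c(1, 1, 2, 3, 2, 3), true = c(1, 2, 2, 3, 2, 1))
    class_sens_spec(tab)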
The MLR model was implemented within mlorens in R using the multinom function in the nnet package. The multinom function fits the multinomial logit model via neural networks and is therefore computer-intensive. For RF, the randomForest package in R is used; by default, 500 trees are generated in the forest, and sqrt(m) of the m variables are randomly selected at each node of a base tree. The e1071 package in R is used for SVM. The RMNL program was implemented in R by the first author. Averages over 20 repetitions of 10-fold CV were used in all comparisons for each model.
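For reference, the corresponding fitting calls look roughly as follows; the data frame name train and the factor response class are placeholders, and the randomForest settings shown simply restate the defaults described above.

    library(nnet)          # multinom()
    library(randomForest)  # randomForest()
    library(e1071)         # svm()

    # train: a data frame with a factor response `class` and m predictors.
    fit_mlr <- multinom(class ~ ., data = train, trace = FALSE)

    fit_rf <- randomForest(class ~ ., data = train,
                           ntree = 500,                          # default forest size
                           mtry = floor(sqrt(ncol(train) - 1)))  # sqrt(m) per node

    fit_svm_lin <- svm(class ~ ., data = train, kernel = "linear")
    fit_svm_rad <- svm(class ~ ., data = train, kernel = "radial")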
3.1 GIB data

Patients with acute GIB were identified from a hospital medical records database of 121 patients. Chu et al. (2008) classified the sample by bleeding source, need for urgent blood resuscitation, need for urgent endoscopy, and disposition. In this paper, we focus on the source of bleeding, classified into three classes: upper, mid, and lower intestine. The definitive source of bleeding was the irrefutable identification of a bleeding source at upper endoscopy, colonoscopy, small bowel enteroscopy, or capsule endoscopy. The twenty input variables used to predict the source of GIB included prior history of GIB, hematochezia, hematemesis, melena, syncope/presyncope, risk for stress ulceration, cirrhosis, ASA/NSAID use, systolic and diastolic blood pressures, heart rate, orthostasis, NG (nasogastric) lavage, rectal exam, platelet count, creatinine, BUN (blood urea nitrogen), and INR (International Normalized Ratio). For mlorens, a random partition into 2 subsets of predictors was used without a search for an optimal partition size; since the number of predictor variables is only 20, a fixed partition size sufficed for this example. Among the 121 subjects, 81 bled from the upper, 29 from the lower, and the remaining 11 from the mid intestine. The class with the highest probability estimate was chosen as the predicted class, without considering a threshold for the decision.

The prediction accuracy, along with sensitivity and specificity, is provided in Table 1. Although both MLR and mlorens worked well on the GIB data, mlorens was better in prediction accuracy: 93% compared to 89% for MLR. According to McNemar's test, the accuracy of mlorens was significantly higher than that of MLR (p-value < .0001), SVM (linear kernel: p-value < .0001; radial kernel: p-value .025), and RMNL (p-value .015). The accuracies of mlorens and RF were not significantly different (p-value .115). Throughout the paper, the p-values are for two-sided tests. For each of the three categories, the AUC was obtained by grouping the data into the given category versus the rest, and the mean AUC was obtained by averaging these three AUCs (Hand and Till, 2001). Sensitivity and specificity were not significantly different between MLR and mlorens, but mlorens showed higher sensitivities for the large classes (upper and lower), while MLR showed higher sensitivity for the small class (mid). Specificity for the largest class was higher for MLR than for mlorens. The mean AUC was high for both mlorens and MLR. The performances of RF and RMNL were similar to that of mlorens, while SVM performed worse than mlorens.
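The per-category AUCs above can be obtained from the Mann-Whitney statistic; the sketch below is our own helper, assuming probs is a matrix of predicted class probabilities with the class labels as column names and truth is the vector of true labels.

    # AUC for one class vs. the rest via the Mann-Whitney statistic:
    # AUC = (sum of ranks of positives - n1(n1+1)/2) / (n1 * n0).
    auc_mw <- function(score, positive) {
      r  <- rank(score)
      n1 <- sum(positive)
      n0 <- sum(!positive)
      (sum(r[positive]) - n1 * (n1 + 1) / 2) / (n1 * n0)
    }

    # Mean AUC: average the one-vs-rest AUCs over the classes (Hand and Till, 2001).
    mean_auc <- function(probs, truth) {
      mean(sapply(colnames(probs), function(cl) {
        auc_mw(probs[, cl], truth == cl)
      }))
    }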
3.2 Breast cancer data

West et al. (2005) studied the determination of stromal signatures in breast carcinoma, using DNA microarray analysis to examine two fibroblastic tumors: solitary fibrous tumor (SFT) and desmoid-type fibromatosis (DTF). The data contain 57 subjects. Ten cases of DTF and 13 cases of benign SFT were compared to 34 other previously examined soft tissue tumors, with expression profiling on 42,000-element cDNA microarrays corresponding to approximately 36,000 unique gene sequences. The data were obtained from stanford.edu/cgi-bin/publication/viewpublication.pl?pub no=436. For these data, 4148 predictors were used in the analysis.

Table 2 compares mlorens and the other classification methods on these data. The partition sizes of mlorens were obtained in the training phase; the average number of mutually exclusive subsets in a partition was 207, resulting in approximately 20 variables per subset on average. Since MLR requires variable selection, 10, 20, or 30 variables were selected in the training phase. The variable selection was performed in the learning set using the BW ratio (ratio of between-group to within-group sums of squares), defined as

    BW(j) = [Σ_i Σ_k I(y_i = k)(x̄_kj - x̄_.j)²] / [Σ_i Σ_k I(y_i = k)(x_ij - x̄_kj)²],

where x̄_.j = Σ_i x_ij / N and x̄_kj = Σ_{i: y_i = k} x_ij / N_k, for sample i and gene j. This criterion has been shown to be reliable for variable selection from high-dimensional data (Dudoit et al., 2002).

The overall accuracy of mlorens (.92, sd .02) was higher than that of MLR (.69, sd .05, when 30 variables were selected). mlorens also outperformed MLR in mean AUC (.92 versus .78), and its ACC was significantly higher according to McNemar's test (p-value < .0001). These results show that, without any variable selection, mlorens performs significantly better than MLR, which requires variable selection. In accuracy, mlorens was comparable to RMNL (.93, sd .02), worse than SVM with a linear kernel (.94, sd .01, p-value .002), but significantly better (p-value < .0001) than RF (.86, sd .02) and SVM with a radial kernel (.86, sd .00). We observed the same pattern in AUC.
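A direct R translation of the BW criterion above might look as follows; the function and variable names are ours, and the commented selection step assumes training objects x_train and y_train.

    # BW ratio for each column of x: between-group over within-group
    # sums of squares, used to rank genes in the learning set.
    bw_ratio <- function(x, y) {
      overall <- colMeans(x)
      num <- 0; den <- 0
      for (k in unique(y)) {
        xk  <- x[y == k, , drop = FALSE]
        ck  <- colMeans(xk)                       # class-k mean of each gene
        num <- num + nrow(xk) * (ck - overall)^2  # between-group sum of squares
        den <- den + colSums(sweep(xk, 2, ck)^2)  # within-group sum of squares
      }
      num / den
    }

    # Select, e.g., the 30 genes with the largest BW ratio in the training data:
    # top30 <- order(bw_ratio(x_train, y_train), decreasing = TRUE)[1:30]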
3.3 Simulation

We conducted a simulation study to evaluate mlorens in comparison with MLR on high-dimensional simulated data, and we also compared it to RF, SVM, and RMNL. We generated two data sets with 120 subjects and 500 independent predictors, one for training and the other for testing, with class sizes of 40:40:40. One hundred pairs of these training and test data sets were generated. Fifty of the predictor variables were generated from three different normal distributions, and the remaining 450 predictor variables were generated from a single normal distribution; the latter served as noise. The predictors were generated from normal distributions with standard deviation (σ) of 1, 2, or 3. Figure 1 shows the simulation design. Among the 50 significant variables, the first 10 were generated from N(0, σ²) for class 1, N(1, σ²) for class 2, and N(2, σ²) for class 3. The next 10 variables were generated from N(0, σ²) for class 1 and N(1, σ²) for classes 2 and 3. For the other significant variables, see Figure 1. The 450 noise variables were independently generated from N(0, σ²).
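A sketch of this data-generating step in R is given below. Only the first two blocks of significant variables, whose class means are stated explicitly in the text, are filled in; the remaining blocks from Figure 1 would have to be appended to block_means to reach the full 50 significant variables.

    # Generate one simulated data set: 120 subjects (40 per class),
    # blocks of 10 significant predictors plus 450 noise predictors.
    # Rows of block_means give the class-1/2/3 means of one block; only the
    # two blocks described in the text are shown (the rest follow Figure 1).
    simulate_data <- function(sigma = 1,
                              block_means = rbind(c(0, 1, 2),    # variables 1-10
                                                  c(0, 1, 1))) { # variables 11-20
      y <- rep(1:3, each = 40)
      sig <- do.call(cbind, lapply(seq_len(nrow(block_means)), function(b) {
        mu <- block_means[b, y]  # class-specific mean for each subject
        matrix(rnorm(120 * 10, mean = mu, sd = sigma), nrow = 120)
      }))
      noise <- matrix(rnorm(120 * 450, mean = 0, sd = sigma), nrow = 120)
      list(x = cbind(sig, noise), y = factor(y))
    }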
For each simulated data set, the model fit on the training set was applied to the test set to evaluate the performance of each classification method. The average accuracies over the 100 pairs of simulated data sets are provided in Tables 3 through 5, for σ = 1, σ = 2, and σ = 3, respectively. mlorens included all 500 variables without variable selection. An ideal model would be the MLR model consisting of the 50 significant predictors, assuming these variables could be identified accurately. Since the significant variables are unknown in practice, we compared mlorens and the ideal MLR model with MLR models using variable selection. The variable selection was performed in the learning set using the BW ratio. Since the number of significant variables is unknown in practice, the numbers of selected variables were preset to 10, 30, 50, and 70 for MLR, with the given numbers of variables selected in the training phase. For mlorens, we tried both fixed numbers of partitions and an optimal number of partitions searched in the training phase; the results were good with the optimal number of partitions from the search method.

mlorens showed substantially higher accuracy than the MLR models. As expected, the data with smaller standard deviation gave much higher accuracy than the data with larger standard deviation. For the data with standard deviation 1, the MLR model with 10 variables gave a better result than the ideal MLR model consisting of all significant variables (Table 3). This may be because the 50 significant variables contain repetition: in fact, 5 to 10 variables were generated from each of the 7 distinct significant patterns, so an MLR model with more variables, or even the correct model, may be redundant. According to the results, mlorens appears to handle this redundancy effectively through the random partition. Since the feature space is randomly partitioned into about 30 subspaces on average, the MLR model on each subset contains nearly two significant variables on average. This minimizes the chance of redundancy, and the ensemble of these models gains an advantage over the single MLR model. When the standard deviation was 3, the MLR model with the known 50 significant variables performed significantly better than MLR with variable selection, due to the ambiguity of the data (Table 5). The accuracies of mlorens and the ideal MLR model with all the significant variables were not significantly different when the standard deviation was 3; otherwise, the accuracy of mlorens was significantly higher than that of the best MLR model and RMNL, with p-values less than .0001 for all values of the standard deviation, by McNemar's test. The mean AUC was significantly higher for mlorens than for any single MLR model. Compared to RF and SVM, mlorens was significantly worse when the standard deviation was 1 (p-value < .0001), comparable when the standard deviation was 2, and significantly better when the standard deviation was 3 (p-value < .0001).

The variable selection in MLR was highly accurate for the data with standard deviation 1 but not for the data with standard deviation 3. Most of the significant variables were selected for the data with standard deviation 1, but for the data with standard deviation 3 the average numbers of significant variables selected were 7.4, 13.6, 27.4, and 28.4 when the model contained 10, 30, 50, and 70 variables, respectively. We also conducted a simulation with the same design but with correlated variables; the results were similar except for lower overall accuracy in general, so we do not report them here. The run time of mlorens was approximately 2 hours for the 100 repetitions on the high-dimensional simulated data of this section, on a 3.0 GHz Windows machine.
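The paired accuracy comparisons above rely on McNemar's test applied to the two methods' correct/incorrect indicators on the same test cases; a small sketch with illustrative toy vectors:

    # Paired comparison of two classifiers' accuracies with McNemar's test.
    # pred_a, pred_b: predicted classes of two methods on the same test cases.
    truth  <- c(1, 2, 3, 1, 2, 3, 1, 2)
    pred_a <- c(1, 2, 3, 1, 2, 1, 1, 2)
    pred_b <- c(1, 2, 2, 3, 2, 3, 2, 2)
    correct_a <- pred_a == truth
    correct_b <- pred_b == truth
    mcnemar.test(table(correct_a, correct_b))  # 2 x 2 table of agreement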
4 Conclusion and Discussion

A major advantage of LORENS over CERP or other aggregation methods is that its base classifier, logistic regression, is widely used and well understood by a broad base of users in science and medicine. Moreover, LORENS is more computationally efficient than the tree-based CERP, and due to the random partition it is free of constraints on the dimension of the data. Multi-class classification problems, in addition to binary ones, are common in biomedical applications, and in this paper we expanded LORENS to multi-class classification problems. Our results show that the performance of mlorens is superior to that of MLR in terms of achieving significantly higher prediction accuracy, although mlorens appears to favor the majority class when class sizes are very unbalanced. mlorens also shows higher AUC than a single MLR model for the high-dimensional data in this paper. Since randomly selected mutually exclusive subsets of predictors are assigned to each part of the random partition, redundancy in the data is reduced, as discussed in Section 3.3. Furthermore, variable selection is not necessary in mlorens, and the correlation between the classifiers is lowered by random partitioning. By integrating these advantages through ensemble majority voting, the accuracy of the method is greatly improved.

Although our focus in this paper was the improvement of mlorens over MLR, we also compared it with other popular existing classification methods. The performance of mlorens was significantly better than that of RF and SVM for hard-to-classify data with larger standard deviation in our simulation study. Furthermore, mlorens is less computer-intensive than these methods.

In this paper, the performance of mlorens was evaluated and compared with that of a single MLR model. The class with the highest estimate was chosen as the predicted class without considering a threshold for the decision. Lim et al. (2010) developed a threshold search for two-way classification in LORENS. For a multi-way classification, the search for optimal thresholds is multi-dimensional, which is complicated and computer-intensive. In a future study, we plan to develop a method to determine the optimal thresholds in the learning phase, which is expected to improve the balance between sensitivity and specificity when class sizes are very unbalanced.
Logistic regression is a standard and commonly used model for binary classification, and MLR is its extension to multi-way classification. We used the multinom function in R for MLR. This function uses an artificial neural network approach and is therefore computer-intensive. In a future study, we plan to develop a more efficient and less computer-intensive algorithm for MLR by applying a standard procedure.

References

1. Agresti, A. (1996). Multicategory logit models, in: An Introduction to Categorical Data Analysis. New York: Wiley-Interscience.
2. Ahn, H., Moon, H., Fazzari, M. J., Lim, N., Chen, J. J., Kodell, R. L. (2007). Classification by ensembles from random partitions of high-dimensional data. Computational Statistics and Data Analysis 51.
3. Breiman, L. (2001). Random forests. Machine Learning 45:5-32.
4. Chu, A., Ahn, H., Halwan, B., Kalmin, B., Artifon, E., Barkun, A., Lagoudakis, M., Kumar, A. (2008). A decision support system to facilitate management of patients with acute gastrointestinal bleeding. Artificial Intelligence in Medicine 42.
5. DeLong, E. R., DeLong, D. M., Clarke-Pearson, D. L. (1988). Comparing the areas under two or more correlated receiver operating characteristic curves: A nonparametric approach. Biometrics 44:837-845.
6. Dudoit, S., Fridlyand, J., Speed, T. P. (2002). Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association 97:77-87.
7. Hand, D. J., Till, R. J. (2001). A simple generalisation of the area under the curve for multiple class classification problems. Machine Learning 45:171-186.
8. Ho, T. K. (1998). The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(8):832-844.
9. Jobson, J. D. (1992). Applied Multivariate Data Analysis. Volume 1: Regression and Experimental Design. New York: Springer.
10. Kollef, M. H., O'Brien, J. D., Zuckerman, G. R., Shannon, W. (1997). A classification tool to predict outcomes in patients with acute upper and lower gastrointestinal hemorrhage. Critical Care Medicine 25.
11. Lim, N., Ahn, H., Moon, H., Chen, J. J. (2010). Classification of high-dimensional data with ensemble of logistic regression models. Journal of Biopharmaceutical Statistics 20.
12. Prinzie, A., Van den Poel, D. (2008). Random forests for multiclass classification: Random MultiNomial Logit. Expert Systems with Applications 34.
13. Rockall, T. A., Logan, R. F. A., Devlin, H. B., Northfield, T. C. (1995). Incidence of and mortality from acute upper gastrointestinal haemorrhage in the United Kingdom. British Medical Journal 311.
14. Vapnik, V. (1999). The Nature of Statistical Learning Theory. New York: Springer-Verlag.
15. West, R. B., Nuyten, D. S., Subramanian, S., Nielsen, T. O., Corless, C. L., Rubin, B. P., Montgomery, K., Zhu, S., Patel, R., Hernandez-Boussard, T., Goldblum, J. R., Brown, P. O., van de Vijver, M., van de Rijn, M. (2005). Determination of stromal signatures in breast carcinoma. PLoS Biology 3(6):e187.
Table 1: Comparison of mlorens and MLR for the GIB data (sd in parentheses). Fixed partition sizes are used in mlorens.

Method       #part.(a)  ACC        Mean AUC(b)          Upper      Lower       Mid
mlorens      2          .93 (.01)  .90 (.02)    SENS:   .98 (.01)  .89 (.03)   .68 (.09)
                                                SPEC:   .93 (.02)  .97 (.01)   .98 (.01)
                                                AUC:    .96 (.01)  .93 (.02)   .83 (.05)
MLR                     .89 (.02)  .90 (.02)    SENS:   .92 (.02)  .86 (.04)   .75 (.09)
                                                SPEC:   .97 (.03)  .95 (.01)   .94 (.01)
                                                AUC:    .95 (.01)  .90 (.02)   .85 (.05)
RMNL                    .92 (.02)  .90 (.03)    SENS:   .96 (.01)  .91 (.02)   .66 (.13)
                                                SPEC:   .96 (.03)  .95 (.02)   .97 (.01)
                                                AUC:    .96 (.01)  .93 (.02)   .81 (.07)
RF                      .94 (.01)  .90 (.02)    SENS:   .98 (.01)  .94 (.02)   .61 (.08)
                                                SPEC:   .93 (.01)  .97 (.01)   .99 (.00)
                                                AUC:    .96 (.01)  .95 (.01)   .80 (.04)
SVM Linear              .90 (.02)  .88 (.03)    SENS:   .95 (.01)  .86 (.04)   .64 (.13)
Kernel                                          SPEC:   .91 (.04)  .96 (.02)   .96 (.01)
                                                AUC:    .93 (.02)  .91 (.02)   .80 (.06)
SVM Radial              .92 (.01)  .88 (.01)    SENS:   .99 (.00)  .90 (.00)   .54 (.07)
Kernel                                          SPEC:   .88 (.01)  .95 (.01)  1.00 (.00)
                                                AUC:    .93 (.00)  .93 (.00)   .77 (.04)

(a) pre-determined number of mutually exclusive subsets in the partition
(b) mean of the AUCs from the three classes
Table 2: Comparison of mlorens and MLR for the breast cancer data (sd in parentheses). The partition sizes for mlorens are determined in the learning phase.

Method       #var.(a)  #part.(b)  ACC        Mean AUC(c)          DTF          SFT         Other
mlorens      all       207 (79)   .92 (.02)  .92 (.07)    SENS:   1.00 (.00)   .68 (.08)   .99 (.02)
                                                          SPEC:    .99 (.01)  1.00 (.00)   .82 (.04)
                                                          AUC:    1.00 (.01)   .84 (.04)   .91 (.03)
MLR          10                   (.07)      (.14)        SENS:    .87 (.15)   .43 (.14)   .69 (.08)
                                                          SPEC:    .93 (.04)   .83 (.05)   .64 (.12)
                                                          AUC:     .90 (.07)   .63 (.07)   .67 (.08)
             20                   (.07)      (.11)        SENS:    .84 (.13)   .58 (.14)   .66 (.09)
                                                          SPEC:    .93 (.03)   .79 (.06)   .73 (.10)
                                                          AUC:     .89 (.06)   .69 (.07)   .70 (.07)
             30                   .69 (.05)  .78 (.10)    SENS:    .88 (.09)   .66 (.12)   .65 (.08)
                                                          SPEC:    .92 (.04)   .79 (.06)   .79 (.09)
                                                          AUC:     .90 (.04)   .73 (.05)   .72 (.05)
RMNL                   (2.9)      .93 (.02)  (.02)        SENS:   1.00 (.02)   .75 (.07)   .98 (.02)
                                                          SPEC:   1.00 (.01)   .99 (.01)   .86 (.04)
                                                          AUC:    1.00 (.01)   .87 (.03)   .92 (.02)
RF                                .86 (.02)  (.02)        SENS:   1.00 (.00)   .47 (.07)   .97 (.02)
                                                          SPEC:    .99 (.01)   .98 (.01)   .70 (.04)
                                                          AUC:    1.00 (.01)   .72 (.03)   .83 (.02)
SVM Linear                        .94 (.01)  (.01)        SENS:   1.00 (.00)   .83 (.04)   .97 (.02)
Kernel                                                    SPEC:    .98 (.01)  1.00 (.00)   .90 (.02)
                                                          AUC:     .99 (.01)   .92 (.02)   .94 (.01)
SVM Radial                        .86 (.00)  (.00)        SENS:    .90 (.00)   .46 (.00)  1.00 (.01)
Kernel                                                    SPEC:   1.00 (.00)  1.00 (.00)   .65 (.00)
                                                          AUC:     .95 (.00)   .73 (.00)   .83 (.00)

(a) number of selected variables chosen in the training phase
(b) mean (sd) number of mutually exclusive subsets of predictors in a partition, chosen in the training phase
(c) mean of the AUCs from the three classes
Table 3: Performance of mlorens, multivariate logit models with variable selection, and the multivariate logit model with all significant variables for the simulated data with 500 predictors. Predictors were generated from normal distributions with standard deviation 1.

Method      #var.(a)  #sig.var.(b)  #part.(c)  ACC    Mean AUC(d)         Class 1    Class 2    Class 3
mlorens     All                     (7.6)      (.02)  (.05)       SENS:   .99 (.01)  .76 (.07)  .99 (.01)
                                                                  SPEC:   .94 (.03)  .99 (.01)  .94 (.03)
                                                                  AUC:    .97 (.01)  .88 (.03)  .97 (.01)
MLR with    10        (0)                      (.04)  (.06)       SENS:   .89 (.07)  .75 (.08)  .86 (.08)
variable                                                          SPEC:   .94 (.03)  .88 (.05)  .93 (.03)
selection                                                         AUC:    .92 (.04)  .82 (.04)  .89 (.04)
            30        (.8)                     (.05)  (.07)       SENS:   .88 (.07)  .72 (.10)  .86 (.08)
                                                                  SPEC:   .90 (.04)  .88 (.05)  .95 (.03)
                                                                  AUC:    .89 (.04)  .80 (.06)  .90 (.04)
            50        (.8)                     (.04)  (.06)       SENS:   .80 (.08)  .69 (.08)  .74 (.09)
                                                                  SPEC:   .88 (.04)  .81 (.05)  .91 (.04)
                                                                  AUC:    .84 (.04)  .75 (.04)  .82 (.05)
            70        (.7)                     (.06)  (.07)       SENS:   .72 (.09)  .63 (.09)  .66 (.10)
                                                                  SPEC:   .87 (.05)  .76 (.06)  .87 (.04)
                                                                  AUC:    .80 (.05)  .70 (.05)  .77 (.06)
MLR         50                                 (.05)  (.06)       SENS:   .80 (.09)  .68 (.10)  .73 (.10)
w/signif.                                                         SPEC:   .88 (.05)  .82 (.06)  .91 (.04)
variables                                                         AUC:    .84 (.05)  .75 (.05)  .82 (.06)
RMNL                                           (.04)  (.03)       SENS:   .96 (.05)  .66 (.10)  .95 (.05)
                                                                  SPEC:   .91 (.04)  .96 (.04)  .92 (.04)
                                                                  AUC:    .94 (.03)  .81 (.05)  .94 (.03)
RF                                             (.02)  (.02)       SENS:   .99 (.01)  .87 (.06)  .99 (.02)
                                                                  SPEC:   .97 (.02)  .99 (.01)  .96 (.02)
                                                                  AUC:    .98 (.01)  .93 (.03)  .98 (.01)
SVM Linear                                     (.02)  (.01)       SENS:   .96 (.03)  .90 (.05)  .95 (.03)
Kernel                                                            SPEC:   .97 (.02)  .96 (.02)  .98 (.02)
                                                                  AUC:    .97 (.02)  .93 (.02)  .96 (.02)
SVM Radial                                     (.02)  (.02)       SENS:   .97 (.03)  .92 (.05)  .96 (.03)
Kernel                                                            SPEC:   .98 (.02)  .97 (.02)  .98 (.02)
                                                                  AUC:    .97 (.02)  .94 (.03)  .97 (.02)

(a) number of selected variables chosen in the training phase
(b) number of significant variables among the selected variables
(c) mean (sd) number of mutually exclusive subsets of predictors in a partition, chosen in the training phase
(d) mean of the AUCs from the three classes
Table 4: Performance of mlorens, multivariate logit models with variable selection, and the multivariate logit model with all significant variables for the simulated data with 500 predictors. Predictors were generated from normal distributions with standard deviation 2.

Method      #var.(a)  #sig.var.(b)  #part.(c)   ACC    Mean AUC(d)         Class 1    Class 2    Class 3
mlorens     All                     29.6 (8.8)  (.04)  (.11)       SENS:   .86 (.06)  .40 (.08)  .86 (.06)
                                                                   SPEC:   .84 (.04)  .87 (.04)  .84 (.04)
                                                                   AUC:    .85 (.03)  .64 (.04)  .85 (.03)
MLR with    10        (.6)                      (.05)  (.09)       SENS:   .69 (.08)  .46 (.09)  .68 (.09)
variable                                                           SPEC:   .83 (.05)  .76 (.06)  .83 (.05)
selection                                                          AUC:    .76 (.05)  .61 (.05)  .75 (.06)
            30        (1.9)                     (.06)  (.09)       SENS:   .68 (.09)  .46 (.11)  .67 (.09)
                                                                   SPEC:   .83 (.05)  .74 (.07)  .84 (.05)
                                                                   AUC:    .75 (.05)  .60 (.06)  .75 (.05)
            50        (2.0)                     (.04)  (.08)       SENS:   .70 (.09)  .50 (.09)  .61 (.10)
                                                                   SPEC:   .82 (.05)  .73 (.06)  .86 (.05)
                                                                   AUC:    .76 (.05)  .62 (.05)  .73 (.05)
            70        (2.0)                     (.05)  (.08)       SENS:   .63 (.09)  .45 (.09)  .56 (.10)
                                                                   SPEC:   .79 (.05)  .70 (.06)  .83 (.05)
                                                                   AUC:    .71 (.05)  .58 (.05)  .69 (.05)
MLR         50                                  (.06)  (.08)       SENS:   .70 (.11)  .50 (.10)  .63 (.10)
w/signif.                                                          SPEC:   .82 (.06)  .74 (.07)  .86 (.05)
variables                                                          AUC:    .76 (.06)  .62 (.05)  .75 (.05)
RMNL                                            (.04)  (.03)       SENS:   .79 (.08)  .38 (.08)  .74 (.09)
                                                                   SPEC:   .81 (.05)  .80 (.06)  .84 (.05)
                                                                   AUC:    .80 (.04)  .59 (.04)  .79 (.04)
RF                                              (.04)  (.03)       SENS:   .83 (.07)  .45 (.09)  .83 (.06)
                                                                   SPEC:   .85 (.04)  .85 (.05)  .86 (.04)
                                                                   AUC:    .84 (.03)  .65 (.04)  .84 (.04)
SVM Linear                                      (.05)  (.03)       SENS:   .75 (.08)  .55 (.08)  .73 (.08)
Kernel                                                             SPEC:   .88 (.04)  .76 (.05)  .88 (.04)
                                                                   AUC:    .81 (.04)  .66 (.05)  .81 (.04)
SVM Radial                                      (.04)  (.03)       SENS:   .76 (.09)  .59 (.09)  .75 (.09)
Kernel                                                             SPEC:   .89 (.04)  .77 (.06)  .89 (.04)
                                                                   AUC:    .83 (.04)  .68 (.04)  .82 (.04)

(a) number of selected variables chosen in the training phase
(b) number of significant variables among the selected variables
(c) mean (sd) number of mutually exclusive subsets of predictors in a partition, determined in the learning phase
(d) mean of the AUCs from the three classes
Table 5: Performance of mlorens, multivariate logit models with variable selection, and the multivariate logit model with all significant variables for the simulated data with 500 predictors. Predictors were generated from normal distributions with standard deviation 3.

Method      #var.(a)  #sig.var.(b)  #part.(c)   ACC    Mean AUC(d)         Class 1    Class 2    Class 3
mlorens     All                     27.8 (8.5)  (.04)  (.03)       SENS:   .68 (.09)  .36 (.08)  .69 (.08)
                                                                   SPEC:   .80 (.05)  .77 (.06)  .80 (.05)
                                                                   AUC:    .74 (.05)  .56 (.04)  .74 (.04)
MLR with    10        7.4 (1.5)                 (.05)  (.04)       SENS:   .57 (.10)  .35 (.09)  .55 (.10)
variable                                                           SPEC:   .76 (.06)  .73 (.06)  .76 (.06)
selection                                                          AUC:    .67 (.05)  .54 (.05)  .66 (.06)
            30        13.6 (2.3)                (.05)  (.04)       SENS:   .55 (.11)  .37 (.09)  .54 (.10)
                                                                   SPEC:   .77 (.07)  .70 (.07)  .76 (.06)
                                                                   AUC:    .66 (.06)  .54 (.05)  .65 (.05)
            50        27.4 (3.0)                (.05)  (.04)       SENS:   .57 (.10)  .38 (.09)  .54 (.11)
                                                                   SPEC:   .79 (.06)  .69 (.07)  .76 (.06)
                                                                   AUC:    .68 (.06)  .54 (.05)  .65 (.06)
            70        28.4 (3.0)                (.05)  (.04)       SENS:   .53 (.11)  .37 (.09)  .50 (.10)
                                                                   SPEC:   .76 (.06)  .68 (.07)  .76 (.06)
                                                                   AUC:    .64 (.06)  .52 (.05)  .63 (.05)
MLR         50                                  (.06)  (.04)       SENS:   .67 (.10)  .43 (.10)  .60 (.10)
w/signif.                                                          SPEC:   .80 (.06)  .72 (.06)  .83 (.05)
variables                                                          AUC:    .74 (.05)  .58 (.06)  .71 (.06)
RMNL                                            (.05)  (.04)       SENS:   .59 (.12)  .36 (.10)  .57 (.09)
                                                                   SPEC:   .76 (.06)  .72 (.08)  .78 (.05)
                                                                   AUC:    .68 (.05)  .54 (.05)  .68 (.05)
RF                                              (.05)  (.03)       SENS:   .60 (.09)  .37 (.09)  .63 (.09)
                                                                   SPEC:   .79 (.05)  .72 (.07)  .79 (.05)
                                                                   AUC:    .70 (.05)  .55 (.05)  .71 (.05)
SVM Linear                                      (.05)  (.04)       SENS:   .61 (.09)  .45 (.09)  .60 (.09)
Kernel                                                             SPEC:   .82 (.05)  .68 (.07)  .83 (.05)
                                                                   AUC:    .71 (.05)  .57 (.05)  .72 (.05)
SVM Radial                                      (.05)  (.04)       SENS:   .60 (.11)  .46 (.11)  .62 (.10)
Kernel                                                             SPEC:   .84 (.05)  .68 (.08)  .82 (.05)
                                                                   AUC:    .72 (.05)  .57 (.05)  .72 (.05)

(a) number of selected variables chosen in the training phase
(b) number of significant variables among the selected variables
(c) mean (sd) number of mutually exclusive subsets of predictors in a partition, determined in the learning phase
(d) mean of the AUCs from the three classes
[Figure 1: Simulation design for a comparison of mlorens and MLR. The diagram lays out, for the three classes of 40 samples each, which of N(0, σ²), N(1, σ²), and N(2, σ²) generates each block of the 50 significant variables, together with the 450 noise variables drawn from N(0, σ²).]