Chapter 6. Machine Learning Classifiers for Protein-Protein Interaction Prediction


6.1 Introduction

In the previous chapter, I reported that the PPIs predicted by the GN, GC, ES, PP and GM methods show marginal overlap and hence can complement each other. Therefore, integration of these methods is likely to provide better information than any single method alone. The integrative approach requires tools and methods capable of transforming the heterogeneous scores generated by these prediction methods into a meaningful combination. Owing to their rich mathematical and statistical foundations, Machine Learning Classifiers (MLCs) allow us to go beyond mere linear score-cutoff-based predictions and capture hidden trends in the data, in the form of testable models built from gold standard data. These models can then be used to predict whether unknown protein pairs belong to the positive or the negative category. A number of studies have shown the feasibility of PPI prediction using MLCs [54, 80, 88, 90, 91, 167]. It has been suggested that the performance of MLCs can be improved by appropriate integration and selection of features, rather than by adding as many features as possible [80-82, 143, 168, 169]. However, previous reports lack a critical comparative assessment of various MLCs and of data dependency. Furthermore, although combinations of MLCs have been widely used for classification problems in the non-biological sciences, few studies have evaluated multiple classifiers for PPI prediction [86, 88, 91, 170]. A majority of these studies have been performed in Yeast as a model organism [ ]. Some of the issues worth exploring for E. coli-encoded proteins are: 1) which MLC is best suited for PPI prediction in E. coli, 2) the feasibility of multiple-MLC-based PPI prediction, and 3) the choice of an appropriate Gold Standard (GS) dataset for learning. In this chapter, I have investigated the aforementioned issues by predicting genome-wide PPI networks using seven different MLCs trained on four GS datasets.
The resulting PPI networks provide a highly accurate protein-protein functional interaction map, which can be used for system-level analysis of the Escherichia coli proteome.

6.2 Material and Methods

6.2.1 Gold standard datasets

All MLCs applied to the PPI prediction task in this work are supervised learning approaches. Therefore, these algorithms require GS datasets for training and testing. The analysis was carried out using positive GS datasets composed of the Operon, DIP and Complex datasets, because the PPI prediction methods were able to strongly discriminate these datasets from the negative examples (Chapter 4). Considering the high classification accuracy achieved on the Operon and Complex datasets, it was decided to combine them in order to understand the discriminative power of MLCs on a mixture of two different types of PPIs, i.e. functional (co-operonic) and physical (co-complex) PPIs. The negative examples were randomly chosen from the high-confidence negative dataset generated for testing the PPI prediction methods (Chapter 2, section 2.2.2). Each positive dataset was combined with a negative subset five times the size of the positive set, in order to predict physiologically meaningful PPIs; performance evaluation of the MLCs, by contrast, was carried out using positive and negative datasets of equal size, to obtain a fair judgment of prediction accuracy.

6.2.2 Data feature encoding

The interaction scores (i.e. features) for all possible pairs of E. coli proteins (including the gold standard protein pairs) were calculated by the GN, GC, ES, PP and GM methods. There are two possible ways of encoding these scores for machine learning. In the first, the scores generated by the five PPI prediction methods for a particular protein pair are represented by a single value, called Summary [88]. In the second, the same information source is described by five separate values, called Detailed [88]. For the present analysis, the Detailed method of feature encoding was used, owing to the small number of available features.
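The two encoding schemes can be sketched as follows. This is a minimal illustration, assuming each method yields one numeric score per protein pair; the score values, and the use of a plain average as the summary statistic, are illustrative assumptions rather than details taken from the study.

```python
# Sketch of the two feature-encoding schemes described above.
# Assumes each prediction method returns one score per protein pair.

METHODS = ["GN", "GC", "ES", "PP", "GM"]

def encode_detailed(scores):
    """Detailed encoding: one feature per prediction method (5 values)."""
    return [scores[m] for m in METHODS]

def encode_summary(scores):
    """Summary encoding: the five scores collapsed into a single value.
    A plain average is used here for illustration; the original summary
    statistic may differ."""
    return [sum(scores[m] for m in METHODS) / len(METHODS)]

# Hypothetical scores for one protein pair:
pair_scores = {"GN": 0.8, "GC": 0.1, "ES": 0.6, "PP": 0.0, "GM": 0.9}
detailed = encode_detailed(pair_scores)  # 5-dimensional feature vector
summary = encode_summary(pair_scores)    # 1-dimensional feature vector
```

With only five underlying features, the Detailed vector keeps each method's evidence separate, which is what allows a classifier to weigh the methods differently.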

6.2.3 Machine learning classifiers

A set of seven MLCs was used in this study: Support Vector Machine (SVM), Decision Tree (DT), Random Forest (RF), Naïve Bayes (NB), Bayesian Network (BN), Neural Network (NN) and Logistic Regression (LR). The LibSVM package was used for the SVM-based predictions and the WEKA toolbox for the other machine learning algorithms. The cost and gamma parameters of the SVM were optimized using the grid.py script provided in the LibSVM package. The J48 variant of the decision tree was used because it handles numerical attributes and allows post-pruning after induction of the trees. The RF classifier is based on multiple DTs: a total of 100 DTs were grown, where each node uses a random subset of the features. To classify a new instance, the input vector is analyzed by each of the trees in the forest, and the output decision is based on a majority vote over all the trees. For the Naïve Bayes classifier, the kernel estimation parameter was switched on. All other classifiers were applied with the default parameters in WEKA.

6.2.4 Cross-validation and performance assessment of MLCs

The positive and negative GS datasets were randomly divided into five equal-sized subsets. The protein pairs present in the positive and negative datasets are referred to as true positives (i.e. interacting) and true negatives (i.e. non-interacting), respectively. Five-fold cross-validation was performed on all the MLCs using the above-mentioned gold standard datasets. In each round of cross-validation, a random combination of four subsets was used for training the MLC and the remaining subset was used for testing the model. This process was repeated five times using different combinations of training and testing subsets. For each round, the sensitivity (or TPR), specificity, and positive predictive value (PPV, or precision) were calculated as measures of the quality of the binary (two-class) classification by each MLC.
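The training and cross-validation setup above can be sketched in present-day terms. The study itself used LibSVM and WEKA; the sketch below uses scikit-learn analogues (a 100-tree random forest and a Gaussian Naïve Bayes, which is not identical to WEKA's kernel-estimation NB) on synthetic data standing in for the real five-feature interaction scores.

```python
# Illustrative scikit-learn analogue of the classifier setup described above.
# Synthetic data; labels are a toy stand-in for interacting / non-interacting.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((200, 5))               # five features: GN, GC, ES, PP, GM scores
y = (X.sum(axis=1) > 2.5).astype(int)  # toy binary labels

classifiers = {
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),  # 100 trees, as in the study
    "NB": GaussianNB(),
}

results = {}
for name, clf in classifiers.items():
    # five-fold cross-validation, averaged, as described above
    results[name] = cross_val_score(clf, X, y, cv=5).mean()
```

Each value in `results` corresponds to one averaged five-fold accuracy, the quantity reported per classifier in this chapter.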
The equations for the sensitivity and PPV measures have been described in Chapter 2. Specificity is defined as

    Specificity = TN / (TN + FP)

For all performance measures, TP is the number of positive-dataset protein pairs that are correctly predicted as interacting by a given MLC, and FN is the number of positive-dataset protein pairs that are incorrectly predicted as not interacting. Similarly, TN is the number of negative-dataset protein pairs that are correctly predicted as not interacting, and FP is the number of negative-dataset protein pairs that are incorrectly predicted as interacting. The average performance over the five rounds of cross-validation was taken as the measure of prediction performance. The TPR and FPR were calculated using the decision values generated for protein pairs by the SVM; for the other MLCs, the TPR and FPR values generated by WEKA were used. The TPR and FPR values were used to plot ROC curves.

6.2.5 Genome-wide PPI prediction

The models generated by each of the seven MLCs using the aforementioned four gold standard datasets were used for genome-wide PPI prediction in E. coli. The resulting 28 PPI networks were used for further analysis. The topological properties of the networks were calculated as described in Chapter 5.

6.2.6 Comparison of predicted PPI networks with experimental and functional PPI datasets

The overlap between a predicted network (PPINet) and a gold standard dataset (GS) was measured using the Jaccard coefficient,

    JC(PPINet, GS) = 100 * |PPINet ∩ GS| / |PPINet ∪ GS|
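As a minimal sketch, the four confusion-matrix counts above map onto the performance measures as follows; the counts in the example are hypothetical, not taken from the study's cross-validation runs.

```python
# Performance measures computed from the four confusion-matrix counts
# defined above (TP, FN, TN, FP).

def sensitivity(tp, fn):   # TPR = TP / (TP + FN)
    return tp / (tp + fn)

def specificity(tn, fp):   # TN / (TN + FP)
    return tn / (tn + fp)

def ppv(tp, fp):           # precision = TP / (TP + FP)
    return tp / (tp + fp)

def fpr(fp, tn):           # FP / (FP + TN) = 1 - specificity
    return fp / (fp + tn)

# Hypothetical counts from one cross-validation round:
tp, fn, tn, fp = 81, 19, 97, 3
print(sensitivity(tp, fn), specificity(tn, fp), ppv(tp, fp))
```

Note that FPR is the complement of specificity, which is why the ROC curves (TPR vs. FPR) and the specificity bars in this chapter carry related but differently presented information.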

The gold standards representing experimentally characterized and functional PPIs are the same as described in Table .

6.3 Results and Discussion

The 28 whole-genome PPI maps of E. coli were reconstructed using models generated by seven MLCs on five data features, namely the interaction scores generated by GN, GC, ES, PP and GM. In Chapter 4, it was found that these prediction methods were able to discriminate Operon, Complex and DIP PPIs from negatives with very high accuracy (Figures 4.1 and 4.2). Hence, these three datasets were used as gold standard (GS) positive datasets to generate MLC models, along with a fourth GS, called Complex_Operon, which is the union of the Operon and Complex PPIs. The performance on the Operon and Complex datasets was relatively better than that on the other datasets, and it was assumed that their combination would provide enough training examples of functional (Operon) and physical (Complex) PPIs to generate a better MLC model for prediction.

6.3.1 MLCs have predicted PPI networks with very high accuracy

A combination of five data features was used to train and test the seven MLCs using the four GS datasets. Performance measures were averaged over five-fold cross-validation for each MLC. As shown in Figure 6.1, the ROC curves for MLCs trained on the four GS datasets show extremely good sensitivities/TPRs at the cost of less than 0.05 FPR (i.e. 5%), suggesting that only 5% of the predictions are likely to be false. At the cost of 5% FPR, the MLCs were able to discriminate 65, 80, 95 and 85 percent of the PPIs (true positives) of DIP, Complex, Operon and Complex_Operon, respectively, from the negative training examples (Figure 6.1). One of the previous studies on PPI prediction in Yeast reported that RF consistently ranked as one of the top two classifiers [88]. Figure 6.1 shows that the performance of RF is indeed better than that of the other six classifiers on three out of the four

GS datasets evaluated in this study. NB is the best performer when the DIP GS was used for training.

Figure 6.1 Performance accuracy measured as ROC curves for seven machine learning classifiers trained on four gold standard datasets. Each solid line depicts a machine learning classifier (MLC). The colors of the lines correspond to the Bayesian Network (BN), Decision Tree (DT), Logistic Regression (LR), Naïve Bayes (NB), Neural Network (NN), Random Forest (RF), and Support Vector Machine (SVM) classifiers. MLCs were trained on gold standard datasets from A) the DIP database, B) co-complex PPIs, C) co-operonic PPIs, D) the union of (B) and (C). The performance of all MLCs on all four gold standard datasets is excellent. The RF classifier outperformed the others on the (B), (C) and (D) gold standard datasets, whereas NB did so on DIP. Note that the y-axis starts at 0.4 TPR while the x-axis ends at 0.05 FPR.

An independent study of SVM-based PPI prediction in E. coli was performed by Yellaboina and coworkers [54]. The authors reported an average five-fold cross-validation accuracy (i.e. the average of sensitivity and specificity) of 0.89 for an SVM trained on GN, GC and PP features using a Complex dataset. The present study was performed on five data features, which include the scores generated by ES and GM in addition to the three features mentioned above. However, the accuracy obtained in the present study is very similar to that reported by these authors. It was expected that the performance accuracy of the classifiers would excel with the two additional features as compared to the three used by them, but it did not. One possible explanation for the similar results could be the composition of negatives and positives in their GS dataset: they used 13 times more negative examples than positives for prediction, which raised the specificity of their cross-validation analysis to one, as compared to the 0.97 achieved in this study. They achieved a sensitivity of 0.79, compared with 0.81 in this study [54].

To a large extent, ROC curves are insensitive to the absolute number of false positives, since the calculations are based solely on the ranks of positives relative to those of negatives [167]. ROC curves alone can therefore be misleading when the absolute number of false positives becomes relevant. The number of possible protein pairs in any organism is very large compared to the estimated number of PPIs: out of the roughly 8 million possible pairs of E. coli proteins, only about 40,000 are estimated to be probable interacting candidates, so the absolute number of false positives matters equally. To ensure the quality of the final PPI predictions, the PPV (precision) and specificity of the five-fold cross-validation were calculated and are shown in Figure 6.2. Figure 6.2 A reflects the outperformance of NB on all GS datasets: only two percent of the negative examples were predicted as interacting by NB. The second-best performance is observed for BN, followed by the SVM classifier. The specificities of NB trained on the four GS datasets lead to observations similar to the PPV plot (Figure 6.2 B). Although the performance of the RF classifier measured by ROC is better than that of the other classifiers, the ability to detect negatives is substantially higher for NB.
Hence, when the total space of possible protein pairs in an organism is set against the estimated number of actual PPIs, the NB predictions are biologically more meaningful than those of RF.
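The argument above can be checked with back-of-the-envelope arithmetic. The FPR and TPR below are illustrative operating points in the range reported for the ROC curves, not exact figures from any one classifier:

```python
# Why a 5% FPR is huge in absolute terms when the full space of
# E. coli protein pairs is scored.
total_pairs = 8_000_000  # ~ all possible E. coli protein pairs
true_ppis = 40_000       # estimated number of real interactions
negatives = total_pairs - true_ppis

fpr_, tpr_ = 0.05, 0.85  # illustrative ROC operating point

false_pos = negatives * fpr_   # absolute false positives at this FPR
true_pos = true_ppis * tpr_    # absolute true positives at this TPR
precision = true_pos / (true_pos + false_pos)
print(f"false positives: {false_pos:,.0f}, precision: {precision:.3f}")
```

Even with a seemingly strict 5% FPR, the false positives (hundreds of thousands) dwarf the roughly 34,000 recovered true interactions, leaving a precision below 0.1 — which is why PPV and specificity, not ROC alone, are emphasized here.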

Figure 6.2 Performance accuracy measured as specificity and positive predictive value for seven machine learning classifiers trained on four gold standard datasets. Each bar depicts the average five-fold cross-validation accuracy of a machine learning classifier trained on the corresponding gold standard dataset. The machine learning classifiers are Bayesian Network (BN), Decision Tree (DT), Logistic Regression (LR), Naïve Bayes (NB), Neural Network (NN), Random Forest (RF), and Support Vector Machine (SVM). A) Accuracy measured as specificity. B) Accuracy measured as positive predictive value. The performance of the NB classifier trained on the four gold standard datasets is the highest among the classifiers. Note that the y-axis does not start at zero.

6.3.2 Numbers of PPIs in the networks predicted by various MLCs varied greatly

A total of 28 PPI networks were predicted using the models generated by the seven MLCs trained on the four GS datasets. The number of PPIs predicted by each classifier, and their overlaps, are shown in Table 6.1. The numbers of predicted PPIs show substantial differences. The total number of interactions among the proteins of an organism is difficult, if not impossible, to determine experimentally. A range of 16,000 to 37,000 PPIs for the Yeast proteome of ~6,300 proteins was estimated by two independent computational analyses. To the best of our knowledge, no such analysis has been carried out for E. coli. Although E. coli, unlike Yeast, lacks sub-cellular compartments, its 4,132 proteins suggest that the number of interacting pairs may be in a similar range. However, as shown in Table 6.1, 15 of the 28 PPI networks predicted by the seven MLCs using the four gold standard datasets contain more than 100,000 interactions, whereas the numbers of interactions in the remaining networks range from ~44,004 to 87,277.

Table 6.1 Statistics on the numbers of protein-protein interactions predicted by seven Machine Learning Classifiers (MLCs) trained on four different gold standard datasets

[Table body not recovered in text extraction: for each gold standard dataset (DIP, Complex, Operon and Complex_Operon), pairwise PPI overlap counts among the classifiers BN, DT, LR, NB, NN, SVM and RF.]

Notes: The MLCs are Bayesian Network (BN), Decision Tree (DT), Logistic Regression (LR), Naïve Bayes (NB), Neural Network (NN), Support Vector Machine (SVM) and Random Forest (RF). DIP, Complex, Operon and Complex_Operon are gold standard protein-protein interaction datasets. The numbers in bold face represent the total number of PPIs

predicted by the corresponding classifier. NB predicts the smallest numbers of PPIs, whereas RF predicts the largest.

The majority of MLCs predicted more than 100,000 interactions using the Complex and DIP gold standard datasets. Although the RF classifier outperformed the others on three GS datasets, the number of interactions predicted by each of its models was more than 200,000. These numbers are far higher than the estimated numbers and hence may include many false predictions. On the other hand, NB predicted 78,956, 57,806, and PPIs using the DIP, Complex, Operon and Complex_Operon GS datasets, respectively. These numbers are quite close to the estimated number of PPIs in an organism and, considering their small size, these networks are likely to contain many genuine interactions.

6.3.3 Functional coverage of the predicted networks is very low

The 28 PPI networks were reconstructed using classifiers that gave a five-fold cross-validation PPV of more than 85%, meaning that 85% of the PPIs predicted using these models are expected to be true interactions. The coverage of various known PPIs in these networks is one way to validate the training accuracies of the MLCs. The plots in Figure 6.3 show the coverage of 12 datasets representing known experimental (six sets) and functional (six sets) PPIs in the form of box-and-whisker plots. Boxplots represent differences between groups without making any assumptions about the underlying statistical distribution, and hence are non-parametric; the spacing between the different parts of the box indicates the degree of dispersion and skewness in the data. Here the groups are the various MLCs, and each box represents the distribution of Jaccard coefficients (JC) between the PPIs predicted by the corresponding MLC and each known PPI dataset. The JC is a coefficient of similarity between two datasets (scaled between 0 and 100) that takes their size difference into account.
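The JC between a predicted network and a known dataset can be sketched directly from its definition; the protein names below are hypothetical and only illustrate the unordered-pair handling.

```python
# Jaccard coefficient (scaled 0-100) between two PPI networks,
# each given as a collection of unordered protein pairs.

def jc(net_a, net_b):
    """100 * |A ∩ B| / |A ∪ B|; 0 = no overlap, 100 = identical."""
    a = {frozenset(p) for p in net_a}  # frozenset: (x, y) == (y, x)
    b = {frozenset(p) for p in net_b}
    return 100 * len(a & b) / len(a | b)

# Hypothetical networks (gene names for illustration only):
predicted = [("ftsZ", "ftsA"), ("recA", "lexA"), ("dnaA", "dnaN")]
known = [("ftsA", "ftsZ"), ("lexA", "recA"), ("rpoB", "rpoC")]
print(jc(predicted, known))  # 2 shared pairs out of 4 distinct -> 50.0
```

Because the union appears in the denominator, a very large predicted network can score a low JC against a small known dataset even if it contains every known pair, which is the size-difference sensitivity noted above.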
Even though the MLCs show very high sensitivity values, the coverage of known PPIs in the predicted networks is very low (Figure 6.3). The coverage of experimental (Figure 6.3 A) and functional (Figure 6.3 B) PPIs in the networks predicted using the DIP, Complex and Operon GS is almost similar. Every known PPI dataset is underrepresented in the predicted networks, irrespective of the training gold standard dataset and the MLC used for prediction. The JC, scaled between zero and 100, did not rise above 5 for a single classifier (Figure 6.3); a JC score of zero reflects no overlap between two PPI networks, while a score of 100 means complete overlap.

Figure 6.3 Coverage of 12 known experimental and functional protein-protein interaction datasets in the networks predicted by seven machine learning classifiers trained on four different gold standard datasets. Each boxplot depicts the distribution of Jaccard similarity coefficients calculated between the network predicted by the corresponding machine learning classifier and 6 datasets of known physical and functional protein-protein interactions (PPIs). The machine learning classifiers are Bayesian Network (BN), Decision Tree (DT), Logistic Regression (LR), Naïve Bayes (NB), Neural Network (NN), Random Forest (RF), and Support Vector Machine (SVM). A) Coverage of experimentally characterized PPI datasets, which include His-tagged bait, TAP-tagged baits (2 sets), co-presence of proteins in the same protein complex, the DIP database, and synthetic lethality analysis. B) Coverage of functional PPI datasets, which include protein pairs that belong to the same KEGG pathway, GO term, COG functional category, Operon, EcoCyc functional category, and transcriptional regulatory associations. Coverage of the experimental as well as the functional datasets is highest in the networks predicted by the NB classifier for all gold standard training datasets. Every known PPI dataset is underrepresented in the networks predicted by the different MLCs.

This is consistent with previous observations that computational PPI prediction complements the existing knowledge of PPIs [54, 56, 86, 173, 174]. It has also been observed that the coverage of known indirect interactions is better than that of direct PPIs in predicted networks [54]; here, an indirect interaction is defined as a link between two proteins via their adjacent neighbors in the network rather than a direct interaction. Therefore, the marginal coverage of known PPIs in the networks predicted in the present study is not an artifact but a reflection of the limited available knowledge of biological systems. Unexpectedly, the coverage of the experimental PPI datasets is relatively higher in the networks predicted by the NB classifier (Figure 6.3 A). These experimental PPI datasets include PPIs derived from His-tagged bait, TAP-tagged baits (2 sets), co-presence of proteins in the same protein complex, the DIP database, and synthetic lethality analysis. The coverage of these datasets in the NB networks is substantially higher than that of the RF classifier, even though RF was the best performer in terms of the ROC measure on three GS datasets. This reflects a shortcoming of the routinely used ROC performance measure for PPI prediction. As suggested by Park [167], the ROC measure should be complemented with other performance measures to ensure a high percentage of positives among the predicted PPIs. The present study suggests that PPV and specificity plots can also be useful for such cross-validation; NB outperformed the others when performance was measured by PPV and specificity alone. The coverage of the available functional PPI datasets leads to similar observations (Figure 6.3 B). The coverage of the functional PPI datasets is slightly higher than that of the experimentally derived ones, but all JC scores are still below 5 (Figure 6.3 B). Furthermore, the distributions of JCs differed comparably across the various MLCs.
Again, the NB-predicted PPI networks have higher coverage of functionally linked proteins than the other classifiers, except BN (Figure 6.3 B). BN also has relatively better coverage of the functional PPI datasets under consideration. BN represents the probabilistic relationships between data

features/variables and the corresponding labels/evidence. Given the data features, a Bayesian network can be used to compute the probability of the label associated with them. NB is a special case of BN with strong independence (naïve) assumptions. Hence, these results suggest that classifiers based on probabilistic models are better predictors of PPIs than the other classifiers used in this study, such as RF and SVM. In the literature, such approaches have indeed been favored over other classifiers for PPI prediction [86, ].

6.3.4 Networks predicted by the NB classifier are better than previous reports

With very limited knowledge of the total PPIs existing in E. coli, it is challenging to compare these predictions with previously reported computational predictions. Furthermore, not only does the methodology differ between reports, but so does the GS used for evaluation; hence a direct comparison of the various approaches used so far is often difficult [87]. In addition, prediction performance is often biased towards the GS used for validation of the predictions. Considering the combination of five PPI prediction scores (i.e. data features) and multiple MLC models trained on four GS datasets, it was expected that the predicted PPIs would be of high quality with respect to false predictions and protein coverage. In light of the aforementioned concerns, it was decided to use three datasets that are not related to the MLC training in the present study and that, at the same time, are likely to be free of experimental noise. The first two datasets consist of PPIs detected by systematic large-scale tandem-affinity purifications used to isolate multiprotein complexes in two independent studies [86, 109, 110]. The experimental setup used by the authors minimizes spurious non-specific protein associations and enables recovery of native protein complexes at near-endogenous levels.
The third dataset was created using pairs of proteins that have been associated with each other in at least two

experimental analyses. Then, the percentage of each of these datasets covered by the networks predicted in the present study was calculated, and likewise for four computational genome-scale PPI networks reported in the literature [54, 81, 82, 86]. Most of the networks predicted by the various MLCs had limited coverage of these known datasets as compared to the networks reported in previous studies. Some networks had higher coverage, likely due to their larger size. Others, such as the NB- and BN-predicted networks, had much better coverage of these experimentally derived PPIs than the existing networks. The comparison of these networks is shown in Figure 6.4. These networks are smaller than the existing genome-scale PPI datasets, yet their coverage of experimentally derived PPIs is substantially better (Figure 6.4).

Figure 6.4 Comparison of the predicted networks with previously published PPI networks. The x-axis represents computationally predicted genome-scale protein-protein interaction (PPI) networks; networks marked with an asterisk were predicted in the present study. Each bar represents the percentage of the corresponding experimentally characterized PPIs present in the predicted network. The Exp2 dataset represents PPIs that have been characterized by at least two different experimental analyses. The networks predicted in the present study have substantially high coverage of the three PPI datasets used for evaluation; only the network predicted by Yellaboina and coworkers comes close. cnb, onb and conb stand for the networks predicted by models generated by the Naïve Bayes classifier using Complex, Operon, and the union of Complex & Operon as the positive gold

standard dataset. Similarly, the cobn network was predicted by the Bayesian Network classifier using the union of Complex & Operon as the positive gold standard.

6.4 Summary

The present study describes the prediction of genome-wide PPI networks using seven MLCs trained on four different GS datasets. It was observed that combining the confidence scores generated by the five PPI prediction methods effectively increased classifier accuracy. The majority of classifiers show high training accuracies, but the predicted networks lack coverage of known PPIs. The probabilistic-model-based NB classifier, which assumes independence of the data features, recovers a substantially higher number of known PPIs. These results led me to the conclusion that probability-based classifiers that assume feature independence are best suited for PPI prediction in E. coli. The reason behind their better coverage of known PPIs could be the independent assumptions underlying the five PPI prediction methods: each feature has complemented the others and enabled recovery of biologically meaningful interactions when the NB classifier was used. The combination of five PPI prediction methods, seven MLCs, and four GS datasets led to the prediction of 28 networks, none larger than 320,000 PPIs, with the majority containing between 100,000 and 200,000 PPIs. These comparisons encourage me to speculate that the total number of PPIs in a bacterial cell could be around 150,000, which is much higher than previous estimates.


More information

Knowledge Discovery and Data Mining. Bootstrap review. Bagging Important Concepts. Notes. Lecture 19 - Bagging. Tom Kelsey. Notes

Knowledge Discovery and Data Mining. Bootstrap review. Bagging Important Concepts. Notes. Lecture 19 - Bagging. Tom Kelsey. Notes Knowledge Discovery and Data Mining Lecture 19 - Bagging Tom Kelsey School of Computer Science University of St Andrews http://tom.host.cs.st-andrews.ac.uk twk@st-andrews.ac.uk Tom Kelsey ID5059-19-B &

More information

Question 2 Naïve Bayes (16 points)

Question 2 Naïve Bayes (16 points) Question 2 Naïve Bayes (16 points) About 2/3 of your email is spam so you downloaded an open source spam filter based on word occurrences that uses the Naive Bayes classifier. Assume you collected the

More information

Beating the MLB Moneyline

Beating the MLB Moneyline Beating the MLB Moneyline Leland Chen llxchen@stanford.edu Andrew He andu@stanford.edu 1 Abstract Sports forecasting is a challenging task that has similarities to stock market prediction, requiring time-series

More information

Predict Influencers in the Social Network

Predict Influencers in the Social Network Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, lyzhou@stanford.edu Department of Electrical Engineering, Stanford University Abstract Given two persons

More information

BIOINF 585 Fall 2015 Machine Learning for Systems Biology & Clinical Informatics http://www.ccmb.med.umich.edu/node/1376

BIOINF 585 Fall 2015 Machine Learning for Systems Biology & Clinical Informatics http://www.ccmb.med.umich.edu/node/1376 Course Director: Dr. Kayvan Najarian (DCM&B, kayvan@umich.edu) Lectures: Labs: Mondays and Wednesdays 9:00 AM -10:30 AM Rm. 2065 Palmer Commons Bldg. Wednesdays 10:30 AM 11:30 AM (alternate weeks) Rm.

More information

Data Mining in Weka Bringing It All together

Data Mining in Weka Bringing It All together Data Mining in Weka Bringing It All together Predictive Analytics Center of Excellence (PACE) San Diego Super Computer Center, UCSD Data Mining Boot Camp 1 Introduction The project assignment demonstrates

More information

Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms

Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Scott Pion and Lutz Hamel Abstract This paper presents the results of a series of analyses performed on direct mail

More information

Data Mining Techniques for Prognosis in Pancreatic Cancer

Data Mining Techniques for Prognosis in Pancreatic Cancer Data Mining Techniques for Prognosis in Pancreatic Cancer by Stuart Floyd A Thesis Submitted to the Faculty of the WORCESTER POLYTECHNIC INSTITUE In partial fulfillment of the requirements for the Degree

More information

APPLICATION OF DATA MINING TECHNIQUES FOR DIRECT MARKETING. Anatoli Nachev

APPLICATION OF DATA MINING TECHNIQUES FOR DIRECT MARKETING. Anatoli Nachev 86 ITHEA APPLICATION OF DATA MINING TECHNIQUES FOR DIRECT MARKETING Anatoli Nachev Abstract: This paper presents a case study of data mining modeling techniques for direct marketing. It focuses to three

More information

Azure Machine Learning, SQL Data Mining and R

Azure Machine Learning, SQL Data Mining and R Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:

More information

Data Mining Algorithms Part 1. Dejan Sarka

Data Mining Algorithms Part 1. Dejan Sarka Data Mining Algorithms Part 1 Dejan Sarka Join the conversation on Twitter: @DevWeek #DW2015 Instructor Bio Dejan Sarka (dsarka@solidq.com) 30 years of experience SQL Server MVP, MCT, 13 books 7+ courses

More information

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Knowledge Discovery and Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Evaluating the Accuracy of a Classifier Holdout, random subsampling, crossvalidation, and the bootstrap are common techniques for

More information

Data Mining as a tool to Predict the Churn Behaviour among Indian bank customers

Data Mining as a tool to Predict the Churn Behaviour among Indian bank customers Data Mining as a tool to Predict the Churn Behaviour among Indian bank customers Manjit Kaur Department of Computer Science Punjabi University Patiala, India manjit8718@gmail.com Dr. Kawaljeet Singh University

More information

Hong Kong Stock Index Forecasting

Hong Kong Stock Index Forecasting Hong Kong Stock Index Forecasting Tong Fu Shuo Chen Chuanqi Wei tfu1@stanford.edu cslcb@stanford.edu chuanqi@stanford.edu Abstract Prediction of the movement of stock market is a long-time attractive topic

More information

T-61.3050 : Email Classification as Spam or Ham using Naive Bayes Classifier. Santosh Tirunagari : 245577

T-61.3050 : Email Classification as Spam or Ham using Naive Bayes Classifier. Santosh Tirunagari : 245577 T-61.3050 : Email Classification as Spam or Ham using Naive Bayes Classifier Santosh Tirunagari : 245577 January 20, 2011 Abstract This term project gives a solution how to classify an email as spam or

More information

Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval

Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval Information Retrieval INFO 4300 / CS 4300! Retrieval models Older models» Boolean retrieval» Vector Space model Probabilistic Models» BM25» Language models Web search» Learning to Rank Search Taxonomy!

More information

Classification algorithm in Data mining: An Overview

Classification algorithm in Data mining: An Overview Classification algorithm in Data mining: An Overview S.Neelamegam #1, Dr.E.Ramaraj *2 #1 M.phil Scholar, Department of Computer Science and Engineering, Alagappa University, Karaikudi. *2 Professor, Department

More information

L25: Ensemble learning

L25: Ensemble learning L25: Ensemble learning Introduction Methods for constructing ensembles Combination strategies Stacked generalization Mixtures of experts Bagging Boosting CSCE 666 Pattern Analysis Ricardo Gutierrez-Osuna

More information

Machine learning for algo trading

Machine learning for algo trading Machine learning for algo trading An introduction for nonmathematicians Dr. Aly Kassam Overview High level introduction to machine learning A machine learning bestiary What has all this got to do with

More information

Cross-Validation. Synonyms Rotation estimation

Cross-Validation. Synonyms Rotation estimation Comp. by: BVijayalakshmiGalleys0000875816 Date:6/11/08 Time:19:52:53 Stage:First Proof C PAYAM REFAEILZADEH, LEI TANG, HUAN LIU Arizona State University Synonyms Rotation estimation Definition is a statistical

More information

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data CMPE 59H Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data Term Project Report Fatma Güney, Kübra Kalkan 1/15/2013 Keywords: Non-linear

More information

An Approach to Detect Spam Emails by Using Majority Voting

An Approach to Detect Spam Emails by Using Majority Voting An Approach to Detect Spam Emails by Using Majority Voting Roohi Hussain Department of Computer Engineering, National University of Science and Technology, H-12 Islamabad, Pakistan Usman Qamar Faculty,

More information

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be

More information

Predicting Flight Delays

Predicting Flight Delays Predicting Flight Delays Dieterich Lawson jdlawson@stanford.edu William Castillo will.castillo@stanford.edu Introduction Every year approximately 20% of airline flights are delayed or cancelled, costing

More information

HYBRID PROBABILITY BASED ENSEMBLES FOR BANKRUPTCY PREDICTION

HYBRID PROBABILITY BASED ENSEMBLES FOR BANKRUPTCY PREDICTION HYBRID PROBABILITY BASED ENSEMBLES FOR BANKRUPTCY PREDICTION Chihli Hung 1, Jing Hong Chen 2, Stefan Wermter 3, 1,2 Department of Management Information Systems, Chung Yuan Christian University, Taiwan

More information

BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts

BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts BIDM Project Predicting the contract type for IT/ITES outsourcing contracts N a n d i n i G o v i n d a r a j a n ( 6 1 2 1 0 5 5 6 ) The authors believe that data modelling can be used to predict if an

More information

Data Mining Application in Direct Marketing: Identifying Hot Prospects for Banking Product

Data Mining Application in Direct Marketing: Identifying Hot Prospects for Banking Product Data Mining Application in Direct Marketing: Identifying Hot Prospects for Banking Product Sagarika Prusty Web Data Mining (ECT 584),Spring 2013 DePaul University,Chicago sagarikaprusty@gmail.com Keywords:

More information

The Artificial Prediction Market

The Artificial Prediction Market The Artificial Prediction Market Adrian Barbu Department of Statistics Florida State University Joint work with Nathan Lay, Siemens Corporate Research 1 Overview Main Contributions A mathematical theory

More information

Equity forecast: Predicting long term stock price movement using machine learning

Equity forecast: Predicting long term stock price movement using machine learning Equity forecast: Predicting long term stock price movement using machine learning Nikola Milosevic School of Computer Science, University of Manchester, UK Nikola.milosevic@manchester.ac.uk Abstract Long

More information

Model Combination. 24 Novembre 2009

Model Combination. 24 Novembre 2009 Model Combination 24 Novembre 2009 Datamining 1 2009-2010 Plan 1 Principles of model combination 2 Resampling methods Bagging Random Forests Boosting 3 Hybrid methods Stacking Generic algorithm for mulistrategy

More information

Classification and Regression by randomforest

Classification and Regression by randomforest Vol. 2/3, December 02 18 Classification and Regression by randomforest Andy Liaw and Matthew Wiener Introduction Recently there has been a lot of interest in ensemble learning methods that generate many

More information

Machine Learning. Mausam (based on slides by Tom Mitchell, Oren Etzioni and Pedro Domingos)

Machine Learning. Mausam (based on slides by Tom Mitchell, Oren Etzioni and Pedro Domingos) Machine Learning Mausam (based on slides by Tom Mitchell, Oren Etzioni and Pedro Domingos) What Is Machine Learning? A computer program is said to learn from experience E with respect to some class of

More information

Predict the Popularity of YouTube Videos Using Early View Data

Predict the Popularity of YouTube Videos Using Early View Data 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050

More information

Data Mining for Knowledge Management. Classification

Data Mining for Knowledge Management. Classification 1 Data Mining for Knowledge Management Classification Themis Palpanas University of Trento http://disi.unitn.eu/~themis Data Mining for Knowledge Management 1 Thanks for slides to: Jiawei Han Eamonn Keogh

More information

Statistical Models in Data Mining

Statistical Models in Data Mining Statistical Models in Data Mining Sargur N. Srihari University at Buffalo The State University of New York Department of Computer Science and Engineering Department of Biostatistics 1 Srihari Flood of

More information

Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification

Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification Henrik Boström School of Humanities and Informatics University of Skövde P.O. Box 408, SE-541 28 Skövde

More information

Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification

Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification Tina R. Patil, Mrs. S. S. Sherekar Sant Gadgebaba Amravati University, Amravati tnpatil2@gmail.com, ss_sherekar@rediffmail.com

More information

Experiments in Web Page Classification for Semantic Web

Experiments in Web Page Classification for Semantic Web Experiments in Web Page Classification for Semantic Web Asad Satti, Nick Cercone, Vlado Kešelj Faculty of Computer Science, Dalhousie University E-mail: {rashid,nick,vlado}@cs.dal.ca Abstract We address

More information

Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus

Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus 1. Introduction Facebook is a social networking website with an open platform that enables developers to extract and utilize user information

More information

New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Introduction

New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Introduction Introduction New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Predictive analytics encompasses the body of statistical knowledge supporting the analysis of massive data sets.

More information

ROC Curve, Lift Chart and Calibration Plot

ROC Curve, Lift Chart and Calibration Plot Metodološki zvezki, Vol. 3, No. 1, 26, 89-18 ROC Curve, Lift Chart and Calibration Plot Miha Vuk 1, Tomaž Curk 2 Abstract This paper presents ROC curve, lift chart and calibration plot, three well known

More information

Statistics Graduate Courses

Statistics Graduate Courses Statistics Graduate Courses STAT 7002--Topics in Statistics-Biological/Physical/Mathematics (cr.arr.).organized study of selected topics. Subjects and earnable credit may vary from semester to semester.

More information

Social Media Mining. Data Mining Essentials

Social Media Mining. Data Mining Essentials Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

More information

Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j

Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j What is Kiva? An organization that allows people to lend small amounts of money via the Internet

More information

A Logistic Regression Approach to Ad Click Prediction

A Logistic Regression Approach to Ad Click Prediction A Logistic Regression Approach to Ad Click Prediction Gouthami Kondakindi kondakin@usc.edu Satakshi Rana satakshr@usc.edu Aswin Rajkumar aswinraj@usc.edu Sai Kaushik Ponnekanti ponnekan@usc.edu Vinit Parakh

More information

Identifying At-Risk Students Using Machine Learning Techniques: A Case Study with IS 100

Identifying At-Risk Students Using Machine Learning Techniques: A Case Study with IS 100 Identifying At-Risk Students Using Machine Learning Techniques: A Case Study with IS 100 Erkan Er Abstract In this paper, a model for predicting students performance levels is proposed which employs three

More information

Predicting Good Probabilities With Supervised Learning

Predicting Good Probabilities With Supervised Learning Alexandru Niculescu-Mizil Rich Caruana Department Of Computer Science, Cornell University, Ithaca NY 4853 Abstract We examine the relationship between the predictions made by different learning algorithms

More information

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical

More information

Spam Detection System Combining Cellular Automata and Naive Bayes Classifier

Spam Detection System Combining Cellular Automata and Naive Bayes Classifier Spam Detection System Combining Cellular Automata and Naive Bayes Classifier F. Barigou*, N. Barigou**, B. Atmani*** Computer Science Department, Faculty of Sciences, University of Oran BP 1524, El M Naouer,

More information

Data, Measurements, Features

Data, Measurements, Features Data, Measurements, Features Middle East Technical University Dep. of Computer Engineering 2009 compiled by V. Atalay What do you think of when someone says Data? We might abstract the idea that data are

More information

Simple linear regression

Simple linear regression Simple linear regression Introduction Simple linear regression is a statistical method for obtaining a formula to predict values of one variable from another where there is a causal relationship between

More information

An analysis of suitable parameters for efficiently applying K-means clustering to large TCPdump data set using Hadoop framework

An analysis of suitable parameters for efficiently applying K-means clustering to large TCPdump data set using Hadoop framework An analysis of suitable parameters for efficiently applying K-means clustering to large TCPdump data set using Hadoop framework Jakrarin Therdphapiyanak Dept. of Computer Engineering Chulalongkorn University

More information

Understanding the dynamics and function of cellular networks

Understanding the dynamics and function of cellular networks Understanding the dynamics and function of cellular networks Cells are complex systems functionally diverse elements diverse interactions that form networks signal transduction-, gene regulatory-, metabolic-

More information

E-commerce Transaction Anomaly Classification

E-commerce Transaction Anomaly Classification E-commerce Transaction Anomaly Classification Minyong Lee minyong@stanford.edu Seunghee Ham sham12@stanford.edu Qiyi Jiang qjiang@stanford.edu I. INTRODUCTION Due to the increasing popularity of e-commerce

More information

Measures of diagnostic accuracy: basic definitions

Measures of diagnostic accuracy: basic definitions Measures of diagnostic accuracy: basic definitions Ana-Maria Šimundić Department of Molecular Diagnostics University Department of Chemistry, Sestre milosrdnice University Hospital, Zagreb, Croatia E-mail

More information

Discovering process models from empirical data

Discovering process models from empirical data Discovering process models from empirical data Laura Măruşter (l.maruster@tm.tue.nl), Ton Weijters (a.j.m.m.weijters@tm.tue.nl) and Wil van der Aalst (w.m.p.aalst@tm.tue.nl) Eindhoven University of Technology,

More information

Comparison of machine learning methods for intelligent tutoring systems

Comparison of machine learning methods for intelligent tutoring systems Comparison of machine learning methods for intelligent tutoring systems Wilhelmiina Hämäläinen 1 and Mikko Vinni 1 Department of Computer Science, University of Joensuu, P.O. Box 111, FI-80101 Joensuu

More information

Data quality in Accounting Information Systems

Data quality in Accounting Information Systems Data quality in Accounting Information Systems Comparing Several Data Mining Techniques Erjon Zoto Department of Statistics and Applied Informatics Faculty of Economy, University of Tirana Tirana, Albania

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct

More information

Data Mining Analysis (breast-cancer data)

Data Mining Analysis (breast-cancer data) Data Mining Analysis (breast-cancer data) Jung-Ying Wang Register number: D9115007, May, 2003 Abstract In this AI term project, we compare some world renowned machine learning tools. Including WEKA data

More information

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not. Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C

More information

STATISTICA Formula Guide: Logistic Regression. Table of Contents

STATISTICA Formula Guide: Logistic Regression. Table of Contents : Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary

More information

COMPARING NEURAL NETWORK ALGORITHM PERFORMANCE USING SPSS AND NEUROSOLUTIONS

COMPARING NEURAL NETWORK ALGORITHM PERFORMANCE USING SPSS AND NEUROSOLUTIONS COMPARING NEURAL NETWORK ALGORITHM PERFORMANCE USING SPSS AND NEUROSOLUTIONS AMJAD HARB and RASHID JAYOUSI Faculty of Computer Science, Al-Quds University, Jerusalem, Palestine Abstract This study exploits

More information

Leveraging Ensemble Models in SAS Enterprise Miner

Leveraging Ensemble Models in SAS Enterprise Miner ABSTRACT Paper SAS133-2014 Leveraging Ensemble Models in SAS Enterprise Miner Miguel Maldonado, Jared Dean, Wendy Czika, and Susan Haller SAS Institute Inc. Ensemble models combine two or more models to

More information

Multiple Kernel Learning on the Limit Order Book

Multiple Kernel Learning on the Limit Order Book JMLR: Workshop and Conference Proceedings 11 (2010) 167 174 Workshop on Applications of Pattern Analysis Multiple Kernel Learning on the Limit Order Book Tristan Fletcher Zakria Hussain John Shawe-Taylor

More information

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Knowledge Discovery and Data Mining Unit # 10 Sajjad Haider Fall 2012 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right

More information

Maximizing Precision of Hit Predictions in Baseball

Maximizing Precision of Hit Predictions in Baseball Maximizing Precision of Hit Predictions in Baseball Jason Clavelli clavelli@stanford.edu Joel Gottsegen joeligy@stanford.edu December 13, 2013 Introduction In recent years, there has been increasing interest

More information

Journal of Engineering Science and Technology Review 7 (4) (2014) 89-96

Journal of Engineering Science and Technology Review 7 (4) (2014) 89-96 Jestr Journal of Engineering Science and Technology Review 7 (4) (2014) 89-96 JOURNAL OF Engineering Science and Technology Review www.jestr.org Applying Fuzzy Logic and Data Mining Techniques in Wireless

More information

Mining the Software Change Repository of a Legacy Telephony System

Mining the Software Change Repository of a Legacy Telephony System Mining the Software Change Repository of a Legacy Telephony System Jelber Sayyad Shirabad, Timothy C. Lethbridge, Stan Matwin School of Information Technology and Engineering University of Ottawa, Ottawa,

More information

Predictive Data modeling for health care: Comparative performance study of different prediction models

Predictive Data modeling for health care: Comparative performance study of different prediction models Predictive Data modeling for health care: Comparative performance study of different prediction models Shivanand Hiremath hiremat.nitie@gmail.com National Institute of Industrial Engineering (NITIE) Vihar

More information

Creditworthiness Analysis in E-Financing Businesses - A Cross-Business Approach

Creditworthiness Analysis in E-Financing Businesses - A Cross-Business Approach Creditworthiness Analysis in E-Financing Businesses - A Cross-Business Approach Kun Liang 1,2, Zhangxi Lin 2, Zelin Jia 2, Cuiqing Jiang 1,Jiangtao Qiu 2,3 1 Shcool of Management, Hefei University of Technology,

More information

Using Random Forest to Learn Imbalanced Data

Using Random Forest to Learn Imbalanced Data Using Random Forest to Learn Imbalanced Data Chao Chen, chenchao@stat.berkeley.edu Department of Statistics,UC Berkeley Andy Liaw, andy liaw@merck.com Biometrics Research,Merck Research Labs Leo Breiman,

More information

Binary Logistic Regression

Binary Logistic Regression Binary Logistic Regression Main Effects Model Logistic regression will accept quantitative, binary or categorical predictors and will code the latter two in various ways. Here s a simple model including

More information

Machine Learning. Chapter 18, 21. Some material adopted from notes by Chuck Dyer

Machine Learning. Chapter 18, 21. Some material adopted from notes by Chuck Dyer Machine Learning Chapter 18, 21 Some material adopted from notes by Chuck Dyer What is learning? Learning denotes changes in a system that... enable a system to do the same task more efficiently the next

More information

Better credit models benefit us all

Better credit models benefit us all Better credit models benefit us all Agenda Credit Scoring - Overview Random Forest - Overview Random Forest outperform logistic regression for credit scoring out of the box Interaction term hypothesis

More information

AUTOMATION OF ENERGY DEMAND FORECASTING. Sanzad Siddique, B.S.

AUTOMATION OF ENERGY DEMAND FORECASTING. Sanzad Siddique, B.S. AUTOMATION OF ENERGY DEMAND FORECASTING by Sanzad Siddique, B.S. A Thesis submitted to the Faculty of the Graduate School, Marquette University, in Partial Fulfillment of the Requirements for the Degree

More information

Visualizing class probability estimators

Visualizing class probability estimators Visualizing class probability estimators Eibe Frank and Mark Hall Department of Computer Science University of Waikato Hamilton, New Zealand {eibe, mhall}@cs.waikato.ac.nz Abstract. Inducing classifiers

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015 RESEARCH ARTICLE OPEN ACCESS Data Mining Technology for Efficient Network Security Management Ankit Naik [1], S.W. Ahmad [2] Student [1], Assistant Professor [2] Department of Computer Science and Engineering

More information