Hawaii International Conference on System Sciences

A New Ensemble Model for Efficient Churn Prediction in Mobile Telecommunication

Namhyoung Kim, Jaewook Lee
Department of Industrial and Management Engineering, POSTECH, Pohang, South Korea
{skagud,jaewookl}@postech.ac.kr

Kyu-Hwan Jung
SK Telecom, Seoul, South Korea
Onlyou7@postech.ac.kr

Yong Seog Kim
Department of Management Information Systems, Utah State University, Logan, UT 84322, USA
yong.kim@usu.edu

Abstract

This paper explores the application of a single SVM classifier and its variants to the churner identification problem in the mobile telecommunication industry, where customer retention programs have become more important than ever because of a very competitive business environment. In particular, this study introduces a uniformly subsampled ensemble model of SVM classifiers combined with principal component analysis (PCA), not only to reduce the high dimensionality of the data but also to boost the reliability and accuracy of models calibrated on data sets with highly skewed class distributions. According to our experiments, the performance of the uniformly subsampled ensemble (USE) SVM model with PCA is superior to that of all compared models, and the number of principal components (PCs) affects the accuracy of the ensemble models.

1. Introduction

The availability of cheap disk space and the expansion of data collection technologies allow many companies to easily monitor and visualize customers' daily purchase and usage patterns through online transaction processing (OLTP) databases [5]. Consequently, most companies now have plenty of data. However, data itself is not information; data must be turned into information so that users can answer their own questions with the right information at the right time and in the right place. In this paper, we consider an imaginary company in the mobile telecommunications industry that faces steep competition and is therefore compelled to capture, understand, and harness these customer-related data sets, both to seek new business opportunities with new customers and to retain current customers through improved business operations. Many companies in the telecommunications industry have been suffering from extremely high churn rates, with between 20% and 40% of customers leaving their current service provider in a given year, mainly because relatively homogeneous technologies and services force them to compete on lower service charges. In such a setting, the role of marketing becomes a key success factor. In particular, it is well known that, given rising marketing costs, it is much more profitable for a company to retain a current and loyal customer than to recruit a new one. Micro or target marketing programs with tailored messages are also far more cost effective than mass marketing programs through traditional channels such as TV and newspapers. Therefore, companies in such a competitive business environment are strongly advised to operate their own customer relationship management (CRM) systems, equipped with business intelligence and data mining tools, to identify the group of customers who are most likely to terminate their relationship with the current service provider. Churn identification and prevention is a critical issue because the mobile phone market has already reached saturation and each company strives to attract new subscribers while retaining its current profitable customers [19].
To support such an effort, in this paper we introduce one such micro-marketing tool suited for churner identification on behalf of companies in the telecommunications industry. We first note that churn management should start with an accurate identification of churners, possibly coupled with detailed profiling of their demographic information and their behavioral and transactional patterns.

While developing retention strategies and management practices targeted at the identified likely churners would complete a churn management system, we limit our interest to developing a new SVM ensemble model that accurately identifies possible churners from their service usage patterns collected over a certain period.

The remainder of this paper is organized as follows. Section 2 describes the original data set and the preprocessing procedure. We then introduce the Uniformly Subsampled Ensemble (USE) method in Section 3 and present experimental results in Section 4. Finally, Section 5 concludes the paper and suggests future research directions.

2. Data Description and Evaluation Metrics

2.1. Telecommunications Market Data

The data sets used in this paper are the customer records of a major wireless telecommunications company, provided by the Teradata Center for CRM at Duke University [9]. The data were collected during the second half of the year, and active customers who had been with the company for at least 6 months were sampled. The original data set contains 171 predictor variables and 100,000 samples. The predictors include four types of variables: demographics such as age, location, and the number and ages of children; financial information such as credit score and credit card ownership; product details such as handset price and handset capabilities; and phone usage information.

To predict churn, we first have to set the criterion for churn. We classified customers who left the company within the 60 days following sampling as churners. The actual ratio of churners in a given month is approximately 1.8%, but churners in the original training data set were oversampled to 50%. The test data set contains 51,036 observations with 924 churners, which reflects the real churn rate of approximately 1.8% per month. Fig. 1 shows a plot of the training data set using two features selected by feature selection; the churners and non-churners are highly overlapped.

Figure 1. Plot of training dataset

2.2. Data Preprocessing

Before applying the proposed method, we preprocessed the raw data as follows. First, we eliminated continuous variables with more than 20% missing values. Second, categorical variables with a high missing rate were also eliminated, because each categorical variable generally has very little predictive power [17]; moreover, encoding categorical variables into multiple binary variables increases the dimensionality. Thus only 11 categorical variables, either indicator variables or count variables, were included. Finally, we removed observations with missing values. After these preprocessing steps, we have 123 predictors, consisting of 11 categorical variables and 112 continuous variables. The training data set has 67,181 observations with 32,862 churners, a churn rate of approximately 49%. The test set has 34,986 observations with 619 churners, a churn rate of approximately 1.8%.
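To make the preprocessing steps concrete, the following sketch shows one possible implementation in pandas. It is illustrative only, not the authors' code: it assumes the raw data is a DataFrame with a binary "churn" label column, and it reuses the 20% threshold for categorical variables because the paper does not state the exact cutoff for their "high missing rate".

    # Illustrative preprocessing sketch (not the authors' original code).
    # Assumes a pandas DataFrame `raw` containing the predictors plus a
    # binary "churn" label; the 20% missing-value threshold follows the paper.
    import pandas as pd

    def preprocess(raw: pd.DataFrame, missing_threshold: float = 0.20) -> pd.DataFrame:
        df = raw.copy()
        num_cols = df.select_dtypes(include="number").columns.drop("churn", errors="ignore")
        cat_cols = df.columns.difference(num_cols).drop("churn", errors="ignore")

        # 1) Drop continuous variables with more than 20% missing values.
        keep_num = [c for c in num_cols if df[c].isna().mean() <= missing_threshold]
        # 2) Drop categorical variables with a high missing rate (exact cutoff
        #    not stated in the paper; the same threshold is reused here as an assumption).
        keep_cat = [c for c in cat_cols if df[c].isna().mean() <= missing_threshold]

        # 3) Keep the surviving predictors plus the label, then drop observations
        #    that still contain missing values.
        return df[keep_num + keep_cat + ["churn"]].dropna(axis=0)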
2.3. Evaluation

We used the hit rate as the evaluation metric for our research. The hit rate is a popular measure for numerically evaluating the predictive power of models in the marketing field [18]. It is calculated as

    Hit rate = \sum_{i=1}^{n} H_i / n    (1)

where H_i is 1 if the i-th prediction is correct and 0 otherwise, and n is the number of samples in the data set. In other words, the hit rate represents the percentage of correctly predicted churners among the churner candidates. The hit rate is associated with a target point: the hit rate at a target point of x% is the hit rate when only the top x% of customers, ranked by their estimated churn probabilities, are considered for evaluation. For example, with 10,000 observations, the hit rate at a target point of 10% is the percentage of correctly predicted churners among the 1,000 customers judged most likely to churn. Considering hit rates at target points is important because marketing managers can focus only on the top percentage of customers, due to limited budgets and time constraints. Our target point is therefore 30%.
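As a concrete illustration of Eq. (1) evaluated at a target point, the short sketch below ranks customers by estimated churn probability and computes the fraction of actual churners in the top share. The function and variable names are ours, not from the paper, and churners are assumed to be labeled 1.

    # Sketch of the hit rate at a target point, per Eq. (1); names are ours.
    import numpy as np

    def hit_rate_at_target(y_true: np.ndarray, churn_scores: np.ndarray,
                           target: float = 0.30) -> float:
        """Fraction of actual churners among the top `target` share of
        customers ranked by estimated churn probability."""
        n_top = int(np.ceil(target * len(y_true)))
        top_idx = np.argsort(-churn_scores)[:n_top]   # highest scores first
        return float(np.mean(y_true[top_idx] == 1))

    # Example: with 10,000 customers, target=0.10 evaluates the top 1,000.
    # rate = hit_rate_at_target(y_test, scores, target=0.10)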

3. Proposed Ensemble Method

In this section, we present the structure of our new ensemble model, the USE, and describe its unique characteristics in terms of sampling and weighting schemes. Figure 2 graphically presents the structure of the USE model. The first step in building the USE is to partition the data set into subsets, each of which is used to train a single corresponding classifier. Once each classifier is calibrated to produce an estimated score (e.g., the probability of churning) for each customer record in its partition, the USE ensemble model aggregates the scores of the classifiers and produces the final score of the ensemble.

Figure 2. The structure of the proposed ensemble method

3.1. Weighting methods

To generate a collective decision, we consider several ways to aggregate the predictions of the trained classification models through various weighting schemes: uniform weights, weights based on classification performance, and weights based on hit rate. The simplest scheme is the uniform weight method, which applies the same weight (1/M) to the predictions of all M classifiers. Alternatively, the prediction of each individual classifier may be weighted by its binary classification performance or by its hit rate on validation data sampled from the training data. In the weighting scheme based on classification performance, the classification accuracy of each classifier on the validation data is normalized so that the weights sum to 1, and the final prediction on the test data set is weighted accordingly. In the weighting scheme based on hit rate, the hit rates at 10%, 20%, and 30% are summed to measure performance and then normalized so that the weights sum to 1. The final prediction on the test data set is then the weighted combination

    f(x) = \sum_{m=1}^{M} w_m \hat{f}_m(x)    (4)

where \hat{f}_m is the m-th classifier and w_m its normalized weight.
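The following sketch spells out the three weighting schemes and the aggregation of Eq. (4). It is a simplified reading of the description above, not the paper's implementation, and it reuses the hit_rate_at_target helper from the sketch in Section 2.3.

    # Sketch of the aggregation in Eq. (4) under the three weighting schemes
    # described above (illustrative only).
    import numpy as np

    def combine(member_scores: np.ndarray, weights: np.ndarray) -> np.ndarray:
        """member_scores: (M, n) churn scores from M classifiers; returns (n,) ensemble scores."""
        return weights @ member_scores

    def uniform_weights(M: int) -> np.ndarray:
        return np.full(M, 1.0 / M)

    def accuracy_weights(val_accuracies: np.ndarray) -> np.ndarray:
        # Normalize each member's validation accuracy so the weights sum to 1.
        return val_accuracies / val_accuracies.sum()

    def hit_rate_weights(member_val_scores: np.ndarray, y_val: np.ndarray) -> np.ndarray:
        # Sum each member's hit rates at the 10%, 20%, and 30% target points on
        # validation data, then normalize so the weights sum to 1.
        sums = np.array([
            sum(hit_rate_at_target(y_val, s, t) for t in (0.10, 0.20, 0.30))
            for s in member_val_scores
        ])
        return sums / sums.sum()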
3.2. Bagging and Boosting vs. USE

To build an accurate ensemble model with the proposed USE method, we divide the entire training data set into M equally sized, non-overlapping subsamples using a random sampler. A single classifier (e.g., an SVM classifier) can then be calibrated on each subsampled data set to discover hidden patterns. Finally, the predictions of all classifiers are aggregated via a weighted summation to construct the final ensemble prediction for each record in the test data. In this sense, the proposed USE method is very similar to two popular ensemble methods, bagging [2] and boosting [6], which are known to perform better than single classifiers [1], [3]. For example, ensemble models based on bagging train each classifier on a randomly drawn training set that contains the same number of examples as the original training set, with each example having an equal probability of being drawn. Since samples are drawn with replacement, some examples may be selected multiple times whereas others may not be selected at all. Bagging then combines the predictions of the classifiers by voting with equal weights. In short, the major difference between bagging and the proposed USE method is whether samples are drawn with replacement and whether the size of the sampled training set for each single classifier equals the size of the original training set.
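The sampling difference just described can be summarized in a few lines of code. This is an illustrative sketch under our own naming, not the authors' implementation.

    # Sampling difference between bagging and USE, as described above (sketch only).
    import numpy as np

    rng = np.random.default_rng(0)

    def bagging_indices(n: int, M: int):
        # Bagging: each bootstrap sample has the original size n and is drawn WITH
        # replacement, so some rows repeat and others are left out.
        return [rng.integers(0, n, size=n) for _ in range(M)]

    def use_indices(n: int, M: int):
        # USE: shuffle once, then split into M equally sized, non-overlapping
        # subsamples, each of size roughly n / M, drawn WITHOUT replacement.
        perm = rng.permutation(n)
        return np.array_split(perm, M)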

Furthermore, our proposed USE method differs from the boosting [6] method, which produces a series of classifiers whose training sets are constructed based on the performance of the previous classifiers. Through adaptive resampling in boosting, examples that are incorrectly predicted by previous classifiers are sampled more frequently, whereas uniform subsampling without replacement is used in the USE. Overall, each classifier in the USE model is calibrated on a smaller training set than the classifiers in bagging and boosting, which requires less CPU power and main memory, yet the USE model can still reduce the expected prediction error of a single predictor. All three ensemble models, bagging, boosting, and USE, share a common characteristic: the effectiveness and improved accuracy of the ensemble come primarily from the diversity introduced by resampling training examples.

While it is perfectly reasonable to calibrate each single classifier on a sampled training set without further preprocessing, we also consider a data dimension reduction method, principal component analysis (PCA). PCA is a mathematical procedure that transforms a set of correlated predictors into a set of new uncorrelated variables called principal components (PCs) that capture the maximum amount of variation in the data. Since the number of PCs is less than or equal to the number of original variables, and each PC is uncorrelated with the other PCs, PCA is particularly useful for reducing the high dimensionality of data sets in which many input variables are correlated. Dimensionality reduction is accomplished by selecting fewer PCs than original input variables, and three methods are widely used for determining the number of PCs. The first criterion, the "eigenvalue-one" or Kaiser-Guttman criterion [8], selects all PCs with an eigenvalue greater than 1. The second approach is based on the scree test [4] and selects all PCs up to a definitive break between the sorted eigenvalues. The last criterion retains components until they account for a specified proportion of the variance in the data, where the proportion explained by a component is calculated as

    Proportion = (eigenvalue of the component of interest) / (total of the eigenvalues of the correlation matrix).

In the actual implementation of the USE model in the present paper, we build an ensemble of SVM classifiers. The SVM classifier is used mainly because of its popularity among researchers, its superior performance compared with other classifiers [10], [11], and the authors' familiarity with it; in principle, the proposed USE method can be combined with any other classifier. In addition, SVM classifiers often require substantial computing power and show poor performance when applied to large-scale data [7], [12], [14], [15]. The SVM classifier is therefore a natural candidate for testing the effectiveness of the USE method, in which data subsampling reduces the requirement for high computing power.

4. Experimental results

In this section, we present the process and results of applying the proposed Uniformly Subsampled Ensemble SVM to the telecommunications market data. Fig. 3 shows the correlation matrix of the variables; there are high correlations among the features, which supports the need to extract uncorrelated new features.

Figure 3. The correlation matrix with values higher than 0.5

We applied PCA for data dimension reduction. As noted in Section 3, several methods exist for selecting the number of PCs; we considered the three most commonly used approaches.
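The sketch below shows how the first and third PC-selection criteria might be computed with scikit-learn (the scree test is judged visually from the eigenvalue plot). It is an illustration under our assumptions, not the authors' code; standardizing the predictors makes PCA operate on the correlation structure, as the eigenvalue-one criterion implies.

    # Sketch of the PC-selection criteria described above (illustrative only).
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    def select_num_pcs(X: np.ndarray, variance_target: float = 0.95):
        # Standardize so PCA reflects the correlation matrix of the predictors.
        Xs = StandardScaler().fit_transform(X)
        pca = PCA().fit(Xs)
        eigvals = pca.explained_variance_

        kaiser = int(np.sum(eigvals > 1.0))                 # eigenvalue-one criterion
        cum_var = np.cumsum(pca.explained_variance_ratio_)
        proportion = int(np.searchsorted(cum_var, variance_target) + 1)  # variance proportion
        # The scree test is applied by inspecting the sorted eigenvalues (cf. Fig. 4).
        return kaiser, proportion, eigvals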

Figure 4. Plot of eigenvalues

Fig. 4 shows the plot of eigenvalues. The numbers of PCs obtained from each approach are as follows: the eigenvalue-one criterion [8] yields 27 PCs, the scree test [4] yields 4 PCs, and the proportion-of-variance criterion yields 36 PCs (90%) or 48 PCs (95%). We applied the proposed method with each of these numbers of PCs and compared the resulting hit rates. The results for the different numbers of PCs are presented in Fig. 5. As shown in the graph, the hit rate at 30% is highest when 48 PCs are used and tends to increase as the number of PCs increases. Thus 48 PCs are selected in this study.

Figure 5. Effect of the number of PCs

After choosing the number of PCs, the optimal number of SVMs, M, must be determined. We explored the effect of the number of classifiers on predictive accuracy while the number of PCs was fixed at 48; the training data set is divided into M groups by a random sampler. The hit rate at 10% is highest when M is 49 (i.e., 49 SVMs), but 25 SVMs give a better hit rate at 30%. Thus our final model is an ensemble of 25 SVMs with 48 PCs.

Figure 6. Effect of the number of classifiers

We also analyzed the effect of the weighting methods. Fig. 7 shows the cumulative hit rate for the different weighting methods ("PCA" in the graph denotes the uniform weight method). As shown in the figure, the weighting methods do not greatly affect performance. However, the uniform weight method is easy to apply and performs slightly better than the other methods, so we apply the proposed method with uniform weights.

Figure 7. Effect of weighting methods
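Putting the pieces together, the following end-to-end sketch reflects the configuration selected above (48 PCs, 25 SVMs, uniform weights) using scikit-learn. It is an illustration of the described procedure under our assumptions, not the authors' implementation; it assumes numpy arrays with churners labeled 1 and reuses hit_rate_at_target from Section 2.3.

    # End-to-end sketch of the selected USE SVM + PCA configuration (illustrative only).
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import SVC

    def train_use_svm(X_train, y_train, n_pcs=48, n_svms=25, seed=0):
        rng = np.random.default_rng(seed)
        reducer = make_pipeline(StandardScaler(), PCA(n_components=n_pcs))
        Z = reducer.fit_transform(X_train)

        members = []
        # Partition the training set into n_svms disjoint, equally sized subsamples.
        for idx in np.array_split(rng.permutation(len(Z)), n_svms):
            clf = SVC(probability=True)   # scores read as churn probabilities
            clf.fit(Z[idx], y_train[idx])
            members.append(clf)
        return reducer, members

    def predict_use_svm(reducer, members, X_test):
        Z = reducer.transform(X_test)
        # Uniform weights: simple average of the members' churn probabilities.
        return np.mean([m.predict_proba(Z)[:, 1] for m in members], axis=0)

    # Usage (hypothetical arrays):
    # reducer, members = train_use_svm(X_train, y_train)
    # churn_scores = predict_use_svm(reducer, members, X_test)
    # print(hit_rate_at_target(y_test, churn_scores, target=0.30))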

Figure 8. Gain by PCA and ensemble

To explore how much the proposed PCA and ensemble model contribute to the increase in performance over a single SVM, we compared the performance of five models: USE SVM + PCA, USE SVM, single SVM + PCA, single SVM, and a random model. As shown in Fig. 8, the hit rates were noticeably improved in both the USE SVM and the USE SVM + PCA cases. Using a single SVM with PCA increased performance only slightly compared with the previous two methods.

The performance of the proposed method was also compared with that of other classifiers. The given data set is large scale and highly imbalanced, with only 1.8% of the observations being churners, so simple conventional methods do not work properly. In previous studies, the partial least squares (PLS) model and the logistic model, both popular in the marketing area, have been proposed to solve this problem [13]. We also applied an ensemble multi-SVDD (Support Vector Domain Description) model to our problem. Fig. 9 presents the hit rates of five models: USE SVM + PCA, ensemble multi-SVDD, PLS, the logistic model, and a random model. The proposed USE SVM + PCA outperformed the other methods and shows a larger performance improvement at low proportions. As mentioned before, the hit rate at a low proportion is a more important measure than that at a high proportion, so the proposed USE method outperforms the conventional methods not only theoretically but also practically.

Figure 9. Comparison with other methods

5. Conclusions

In this paper, we proposed the Uniformly Subsampled Ensemble (USE) method for churn management and showed that USE SVM enhances churn prediction performance. New features were extracted using PCA. We also investigated the effects of the number of classifiers and the number of principal components and gave guidelines for selecting them. Different aggregation methods were considered as well, but they did not greatly affect the results. The performance of the USE SVM proposed in this research is superior to that of all compared models. For further research, ensembles of heterogeneous classifiers can be considered: in the proposed methodology only a single type of classifier, the SVM, is used for prediction, but other heterogeneous classifiers could also be calibrated. The effect of the label distribution could also be analyzed, in addition to the effects of the number of classifiers and the number of PCs.

6. References

[1] E. Bauer and R. Kohavi, "An empirical comparison of voting classification algorithms: Bagging, boosting, and variants", Machine Learning, 36(1-2).
[2] L. Breiman, "Bagging predictors", Machine Learning, 24(2).
[3] L. Breiman, "Stacked regression", Machine Learning, 24(1):49-64.
[4] R. B. Cattell, "The scree test for the number of factors", Multivariate Behavioral Research, 1.
[5] S. Chaudhuri and U. Dayal, "An overview of data warehousing and OLAP technology", SIGMOD Record, 26:65-74.
[6] Y. Freund and R. Schapire, "Experiments with a new boosting algorithm", in Proc. of the 13th Int'l Conf. on Machine Learning, Bari, Italy.
[7] K.-H. Jung, D. Lee, and J. Lee, "Fast support-based clustering method for large-scale problems", Pattern Recognition, 43.

[8] H. Kaiser, "The application of electronic computers to factor analysis", Educational and Psychological Measurement, 20.
[9] Y. Kim, "Toward a successful CRM: Variable selection, sampling, and ensemble", Decision Support Systems, 41(2).
[10] D. Lee and J. Lee, "Domain described support vector classifier for multi-class classification problems", Pattern Recognition, 40:41-51.
[11] D. Lee and J. Lee, "Equilibrium-based support vector machine for semisupervised classification", IEEE Trans. on Neural Networks, 18(2).
[12] D. Lee and J. Lee, "Dynamic dissimilarity measure for support-based clustering", IEEE Trans. on Knowledge and Data Engineering, 22(6).
[13] H. Lee, Y. Kim, Y. Lee, and H. Cho, "Toward optimal churn management: A partial least square (PLS) model", in Proc. of the 16th AMCIS, Paper 78, pages 1-10.
[14] J. Lee and D. Lee, "An improved cluster labeling method for support vector clustering", IEEE Trans. on Pattern Analysis and Machine Intelligence, 27(3).
[15] J. Lee and D. Lee, "Dynamic characterization of cluster structures for robust and inductive support vector clustering", IEEE Trans. on Pattern Analysis and Machine Intelligence, 28(11).
[16] S. Rosset, E. Neumann, U. Eick, N. Vatnik, and I. Idan, "Evaluation of prediction models for marketing campaigns", in Proc. of the 7th Int'l Conf. on Knowledge Discovery & Data Mining (KDD-01).
[17] P. E. Rossi, R. McCulloch, and G. Allenby, "The value of household information in target marketing", Marketing Science, 15(3).
[18] P. Vassiliadis, A. Simitsis, and S. Skiadopoulos, "Conceptual modeling for ETL processes", in Proc. of the 5th ACM International Workshop on Data Warehousing and OLAP (DOLAP '02), pages 14-21, New York, NY, USA, ACM.
[19] L. Wright, "The CRM imperative: Practice vs theory in the telecommunications industry", The Journal of Database Marketing, 9.
