Distributed Regression For Heterogeneous Data Sets 1


Yan Xing, Michael G. Madden, Jim Duggan, Gerard Lyons
Department of Information Technology, National University of Ireland, Galway, Ireland
{yan.xing, michael.madden, jim.duggan, …}

Abstract. Existing meta-learning based distributed data mining approaches do not explicitly address context heterogeneity across individual sites. This limitation constrains their application where distributed data are not identically and independently distributed. By modeling heterogeneously distributed data with hierarchical models, this paper extends traditional meta-learning techniques so that they can be used successfully in distributed scenarios with context heterogeneity.

1 Introduction

Distributed data mining (DDM) is an active research sub-area of data mining. It has been applied successfully in cases where data are inherently distributed among different loosely coupled sites connected by a network [1-3]. By transmitting high-level information, DDM techniques can discover new knowledge from dispersed data. Such high-level information not only has reduced storage and bandwidth requirements, but also maintains the privacy of individual records. Most DDM algorithms for regression or classification fit within a meta-learning framework based on ensemble learning. They accept the implicit assumption that the probability distributions of the dispersed data are homogeneous: the data are merely geographically or physically distributed across various sites, and there are no differences among the sites themselves. Although a distinction is made between homogeneous and heterogeneous data, it is limited to database schemata. In reality, heterogeneity across sites is often the rule rather than the exception [4]. One example is a health virtual organization comprising several hospitals, which differ in expertise, skills, equipment and treatment. Another example is a loosely coupled commercial network consisting of several retailers, which adopt different business policies, sell different products and set different price standards. Wirth describes this kind of scenario as one where "distribution is part of the semantics" [5], and it has seldom been discussed or addressed.

1 The support of the Informatics Research Initiative of Enterprise Ireland is gratefully acknowledged.

2 Model of Distributed Data Across Heterogeneous Sites

Distributed data across homogeneous sites may be regarded as random samples from the same underlying population, even though the probability distribution of the population is unknown. The differences among the data sets are just random sampling errors. From a statistical perspective, the distributed data are identically and independently distributed (IID) [6, 7].

The data set stored at the k-th site consists of data {(y_ik, x_ik), i = 1, ..., N_k}, where the y's are numerical responses for regression problems, N_k is the sample size at the k-th site, k = 1, ..., K, and K is the total number of individual sites. If all the sites are homogeneous, IID data following a normal distribution with mean θ and variance σ² can be expressed as:

    y_ik ~ IID N(θ, σ²)    (1)

If e_ik is the random sampling error at the k-th site, and f(x_ik) is the real global regression model, then equation (1) can be rewritten as:

    y_ik = θ + e_ik = f(x_ik) + e_ik,    e_ik ~ IID N(0, σ²)    (2)

When there is heterogeneity across the different sites, the distributed data are not IID. The differences across the various sites are not only sampling errors, but also context heterogeneities caused by different features of the sites. In practice, it is difficult to determine the exact sources of the context heterogeneity. Considering the earlier example of the hospital domain, the mixed effects of the differences in hospital expertise, equipment, skills and treatment cause context heterogeneity. It is hard or even impossible to measure how much of the heterogeneity is caused by any one of these factors. Furthermore, there may be differences across the hospitals that are unobservable. The same holds for other domains, such as the network of retailers. In this case, we need to model the context heterogeneity in a reasonable way.

In statistical meta-analysis, a popular way to model unobservable or immeasurable context heterogeneity is to assume that the heterogeneity across different sites is random. In other words, context heterogeneity derives from essentially random differences among sites whose sources cannot be identified or are unobservable [7]. Distributed data across various sites having randomly distributed context heterogeneity are often regarded as conditionally IID [4]. This leads to a two-level hierarchical model which describes context heterogeneity with mixture models and employs latent variables in a hierarchical structure [4, 8, 9]. Assuming that the context θ_k of the different sites is distributed normally with mean θ and variance τ², and that the data y_ik at the k-th site are distributed normally with mean θ_k and variance σ², the two-level hierarchical model of distributed data across heterogeneous sites is:

    Between-sites level:    θ_k ~ IID N(θ, τ²)
    Within the k-th site:   (y_ik | θ_k) ~ IID N(θ_k, σ²)    (3)

If t_k is the random sampling error of context across the different sites, and the residuals at the two levels, t_k and e_ik, are independent, then equation (3) can be rewritten as:

    y_ik = θ + e_ik + t_k = f(x_ik) + e_ik + t_k,    e_ik ~ IID N(0, σ²),  t_k ~ IID N(0, τ²)    (4)

When τ² = 0, then t_k = 0, equation (3) becomes the same as equation (1), and equation (4) becomes the same as equation (2). So a distributed scenario with homogeneous sites is just a special case of the generic situation.

In the theory of hierarchical modeling, the intra-class correlation (ICC) measures how much of the total variance of a model is caused by context heterogeneity. ICC is calculated by:

    ICC = τ² / (τ² + σ²)    (5)

When the various sites are homogeneous, ICC = 0. The larger the value of ICC, the larger the share of model variance caused by context heterogeneity.

3 Towards Context-based Meta-Learning

Once distributed data across heterogeneous sites are modeled as in equation (4), the main task of distributed regression is to obtain the global regression model f(x_ik). In hierarchical modeling, statistical linear and nonlinear multilevel model fitting is done by iterative maximum likelihood based algorithms [4, 9]. Unfortunately, all of these algorithms are designed for centralized data. Even if we modified them for a distributed environment, their iterative nature would place a significant communication burden on the various sites [10]. Since most existing DDM approaches fit within a meta-learning framework, it is necessary to extend the state-of-the-art techniques so that they can be used successfully in distributed scenarios with heterogeneous sites.

3.1 Traditional Distributed Meta-Learning

When the different sites are homogeneous, equations (1) and (2) apply. The implicit assumption for an ensemble of learners to be more accurate than the average performance of its individual members is satisfied [11, 12]. Assuming the base models (or learners) generated at the different sites are f_k(x_ik), k = 1, 2, ..., K, the final ensemble model (meta-learner) is
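As a minimal illustration of the intra-class correlation ICC = τ²/(τ² + σ²) in equation (5), the following helper (ours, not code from the paper) computes it from the two variance components:

```python
def icc(tau2: float, sigma2: float) -> float:
    """Intra-class correlation, equation (5): the share of the total
    variance (tau^2 + sigma^2) contributed by the between-site
    (context) variance tau^2."""
    return tau2 / (tau2 + sigma2)
```

For example, `icc(0.0, 1.0)` is 0 (homogeneous sites), while `icc(3.0, 1.0)` is 0.75, a scenario dominated by context heterogeneity.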

f_A(x_ik) = E_k[f_k(x_ik)], where E_k denotes the expectation over k, and the subscript A in f_A denotes aggregation. Distributed meta-learning takes f_A(x_ik) as the estimate of the real global model f(x_ik).

Meta-learning based distributed regression follows three main steps. First, generate base regression models at each site using a learning algorithm. Second, collect the base models at a central site, and produce meta-level data from a separate validation set and the predictions generated by the base models on it. Last, generate the final regression model from the meta-level data via a combiner (un-weighted or weighted averaging).

3.2 Context-based Meta-Learning

When the different sites are heterogeneous, equations (3) and (4) apply. The variance of data within a given site is σ², but the variance of data across different sites is σ² + τ². So the distributed data are not IID, and the criterion for the success of meta-learning is not satisfied. We need an approach that deals with the context variance τ² within the meta-learning framework. We call it the context-based meta-learning approach.

3.2.1 Global Model Estimation

According to equation (4), given the context θ_k of the k-th site, data within that site can be expressed as:

    (y_ik | θ_k) = θ_k + e_ik,    e_ik ~ IID N(0, σ²)    (6)

and {θ_k, k = 1, 2, ..., K} has the following distribution:

    θ_k = θ + t_k = f(x_ik) + t_k,    t_k ~ IID N(0, τ²)    (7)

So the base models f_k(x_ik) generated at the local sites are estimates of θ_k, k = 1, 2, ..., K. Given θ_k and x_ik, let θ̂_k be the estimate of the real θ_k; then:

    f_k(x_ik) = θ̂_k = θ + t_k = f(x_ik) + t_k    (8)

Then the final ensemble model f_A(x_ik) is:

    f_A(x_ik) = E_k[f_k(x_ik)] = f(x_ik) + E_k(t_k)    (9)

Because E_k(t_k) = 0, we can use f_A(x_ik) to estimate the real global model f(x_ik).
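The unweighted-averaging combiner f_A(x) = E_k[f_k(x)] described above is simple to sketch; the function and model names below are ours, and the base models are hypothetical stand-ins for site-level learners:

```python
def ensemble_predict(base_models, x):
    """Unweighted-averaging combiner: estimates f_A(x) = E_k[f_k(x)]
    as the mean of the K base models' predictions at x."""
    predictions = [f(x) for f in base_models]
    return sum(predictions) / len(predictions)

# Hypothetical base models from three sites, each shifted by its own
# context residual t_k; averaging cancels residuals that sum to zero.
base_models = [lambda x, t=t: 2.0 * x + t for t in (-0.5, 0.0, 0.5)]
```

Here `ensemble_predict(base_models, 1.0)` returns 2.0: the per-site shifts −0.5, 0.0 and +0.5 average out, mirroring E_k(t_k) = 0 in equation (9).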

3.2.2 Context Residual Estimation

Since there is context heterogeneity, when we use the ensemble model for prediction at a given k-th site, we need to add the value of the context residual at that site, t_k. From equations (8) and (9), we have:

    t_k = (1 / N_k) Σ_{i=1}^{N_k} [f_k(x_ik) − f_A(x_ik)]    (10)

With equation (10), t_k will never be exactly zero, even when the real context residual is zero. So when we obtain the value of t_k from equation (10), we use a two-tailed t-test to check the null hypothesis H_0: t_k = 0, and calculate the context residual of the k-th site by:

    t_k = 0,    if H_0 is accepted given α
    t_k = t_k,  if H_0 is rejected given α    (11)

where α is the level of significance. Once we have all the values {t_k, k = 1, 2, ..., K}, we can calculate the context-level variance τ².

3.2.3 Algorithm for Context-based Meta-Learning

Our context-based algorithm for distributed regression follows six steps:

1. At each site, use cross-validation to generate a base regression model.
2. At each site, collect the base models and produce meta-level data from the predictions generated by the base models on the local data set.
3. At each site, generate the ensemble model from the meta-level data via a combiner (un-weighted or weighted averaging).
4. At each site, calculate its context residual with equations (10) and (11).
5. At each site, generate the final regression model of this site by equation (8).
6. Collect all the base models and context residuals at a central site, and calculate the context-level variance.

4 Simulation Experiment

In practice, different distributed scenarios have different levels of intra-class correlation, ranging from ICC = 0, the homogeneous case, to ICC close to 1, where the variance is mainly caused by context heterogeneity. In order to evaluate our approach over different values of ICC, we use simulation data sets, because it is difficult to obtain real-world distributed data sets that satisfy all our requirements.
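The context residual estimation of equations (10) and (11) can be sketched as follows. Since the paper does not give implementation details, the use of a normal approximation to the two-tailed t-test (reasonable for large N_k) and the default α = 0.05 are our assumptions:

```python
import math
import statistics

def context_residual(diffs, alpha=0.05):
    """Context residual t_k of one site, per equations (10) and (11).

    diffs holds f_k(x_ik) - f_A(x_ik) for the site's N_k examples;
    their mean is the raw t_k of equation (10).  H0: t_k = 0 is then
    checked with a two-tailed test (normal approximation assumed
    here), and t_k is zeroed out when H0 cannot be rejected,
    per equation (11)."""
    n = len(diffs)
    t_k = sum(diffs) / n
    sd = statistics.stdev(diffs)
    if sd == 0.0:
        return t_k
    z = t_k / (sd / math.sqrt(n))
    z_crit = statistics.NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return t_k if abs(z) > z_crit else 0.0
```

A site whose base-model predictions hover symmetrically around the ensemble's gets t_k = 0; a site whose predictions sit consistently above it keeps its non-zero residual.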

4.1 Simulation Data Sets

The three simulation data sets we use are Friedman's data sets [13]. They were originally generated by Friedman and used by Breiman in his oft-cited paper about bagging [12]. We have made some modifications to the three data sets so that they are compatible with our distributed scenarios.

Friedman #1: there are ten independent predictor variables x_1, ..., x_10, each uniformly distributed over [0, 1]. We set the total number of sites to K = 10. At each site, the sample size is 200 for the training set and 1000 for the testing set. The response is given by:

    #1: y_ik = 10 sin(π x_1 x_2) + 20(x_3 − 0.5)² + 10 x_4 + 5 x_5 + e_ik + t_k,
        e_ik ~ IID N(0, 1),  t_k ~ IID N(0, τ²)    (12)

We adjust the value of τ² so that we obtain ICC = 0.0, 0.1, ..., 0.9 respectively.

Friedman #2, #3: these two examples have four predictor variables, with

    #2: y_ik = (x_1² + (x_2 x_3 − 1/(x_2 x_4))²)^(1/2) + e_ik + t_k,
        e_ik ~ IID N(0, σ_2²),  t_k ~ IID N(0, τ_2²)    (13)

    #3: y_ik = tan⁻¹[(x_2 x_3 − 1/(x_2 x_4)) / x_1] + e_ik + t_k,
        e_ik ~ IID N(0, σ_3²),  t_k ~ IID N(0, τ_3²)    (14)

where x_1, x_2, x_3, x_4 are uniformly distributed as 0 ≤ x_1 ≤ 100, 20 ≤ x_2/(2π) ≤ 280, 0 ≤ x_3 ≤ 1 and 1 ≤ x_4 ≤ 11. The total number of sites and the training and testing sample sizes at each site are the same as for #1. The parameters σ_2, σ_3 are selected to give 3:1 signal/noise ratios, and τ_2, τ_3 are adjusted for the same purpose as in #1.

4.2 Simulation Results

To compare the prediction accuracy of the traditional meta-learning approach and our context-based meta-learning approach under different values of ICC, we implemented both algorithms. The base learner we use is a regression tree algorithm implemented in Weka [14]; the meta-learner we use is un-weighted averaging, with a fixed significance level α. We evaluate our approach from two angles: the whole virtual organization and its individual sites.

4.2.1 The Whole Organization

From the point of view of the whole organization, the target of DDM is to discover the global trend. Fig. 1 shows the global prediction accuracy under different ICC values.
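The per-site data generation for the modified Friedman #1 set can be sketched as below; the function name and the way the site's context residual t_k is passed in are our choices, not the paper's:

```python
import math
import random

def friedman1_site(n, t_k, sigma=1.0, seed=0):
    """Generate n (x, y) pairs for one site: ten U[0,1] predictors,
    the Friedman #1 response, IID noise e_ik with std sigma, plus the
    site's fixed context residual t_k."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = [rng.random() for _ in range(10)]
        y = (10.0 * math.sin(math.pi * x[0] * x[1])
             + 20.0 * (x[2] - 0.5) ** 2
             + 10.0 * x[3] + 5.0 * x[4]
             + rng.gauss(0.0, sigma) + t_k)
        data.append((x, y))
    return data
```

Drawing t_k ~ N(0, τ²) once per site, with τ² tuned to hit a target ICC, reproduces the heterogeneous scenario described above.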

From Fig. 1 we can see that, as ICC increases, the prediction accuracy of the traditional meta-learning approach decreases, while the prediction accuracy of our approach remains essentially the same. When ICC = 0, the prediction accuracies of the two approaches are almost identical.

4.2.2 Individual Sites

From the point of view of each individual site, the goal of DDM is to obtain a more accurate model than the local model (base model) created only from its local data. In practice, when ICC > 0.5, an individual site usually uses only its local model, because the context heterogeneity is too large.

Fig. 1. Global prediction accuracy under different values of ICC, for Friedman #1, #2 and #3 (average of 20 runs)

Fig. 2. Prediction accuracy of individual sites when ICC = 0.3, for Friedman #1, #2 and #3 (average of 20 runs)

Fig. 2 compares the prediction accuracy of the local model, the meta-learning model and the model created with our context-based meta-learning approach when ICC = 0.3. For those sites (5th, 8th, 10th for #1; 1st, 4th, 5th, 8th for #2; 3rd, 4th, 7th for #3) with relatively larger context residuals, meta-learning behaves worse than our approach, and sometimes even worse than the local models (1st, 4th, 5th, 8th for #2; 4th, 7th for #3). For those sites with very small context residuals, the performance of our approach is no worse than that of meta-learning. So the overall performance of our approach is the best.

5 Discussion of Our Approach

There are two important issues related to our approach: the number of sites and the sample size at individual sites.

5.1 Number of Sites

In practical distributed scenarios, the number of sites is usually much smaller than the number of data points at each individual site. The extreme case is that there are only two sites.

From a statistical perspective, the larger the number of sites, the more accurately we can estimate the quantity of context heterogeneity. When the number of sites is extremely low, we usually underestimate the quantity of context heterogeneity [9]. This kind of underestimation erodes the advantage of our approach. We repeated our simulation experiment with the number of sites set to 5 and 2 respectively, and the results demonstrate this.

Fig. 3. Comparison of global prediction accuracy when the number of sites is extremely small (Friedman #3, sites = 2 and sites = 5, average of 20 runs)

Fig. 3 shows the global prediction accuracy for the Friedman #3 data when the number of sites is extremely small. When the number of sites is 5, the advantage of our approach is still obvious. But when the number of sites is 2, the advantage of our approach is less obvious.

5.2 Sample Size of Data at Individual Sites

At each individual site, as the sample size of the training data set increases, the accuracy of using θ̂_k to estimate θ_k in equation (8) increases. So we can obtain a more accurate estimate of the context residual of a site if we have more local data, and thus finally obtain higher prediction accuracy.

Comparing the two graphs in Fig. 4, it can be seen that the advantage of our approach is more pronounced when there are more training data at each individual site.

Fig. 4. Comparison of prediction accuracy with different training sizes (Friedman #1, train size = 50 and train size = …, testing size = 1000 for each site, ICC = 0.3, average of 20 runs)

6 Related Work

Meta-learning is one popular approach among DDM techniques. The most successful meta-learning based DDM application is by Prodromidis in the domain of credit card fraud detection [3]. However, most existing DDM applications within the meta-learning framework do not explicitly address context heterogeneity across individual sites. Wirth defines distributed scenarios with context heterogeneity as ones where "distribution is part of the semantics" in his work [5], but does not give an approach to address it. The only work we found that explicitly addresses context heterogeneity is by Páircéir [15], where statistical hierarchical models are used to discover multi-level association rules from dispersed hierarchical data. In our previous work, we used hierarchical modeling to address context heterogeneity in the domain of virtual organizations [10]. Although we obtained some encouraging results, we also realized that iteration-based algorithms cause heavy communication traffic among the individual sites.

7 Summary and Future Work

Through analysis of the limitations of the distributed meta-learning approach, we model distributed data across heterogeneously distributed sites with two-level statistical hierarchical models, and extend the traditional meta-learning approach to suit non-IID distributed data. We successfully use our context-based meta-learning

approach on several simulation data sets for distributed regression. We also discuss the important issues related to our approach. In future work, we will use real-world distributed data sets to test our approach. We then plan to extend our approach to distributed classification problems.

References

1. Provost, F.: Distributed Data Mining: Scaling Up and Beyond. In: Kargupta, H., Chan, P.K. (eds.): Advances in Distributed and Parallel Knowledge Discovery. AAAI/MIT Press (2000)
2. Park, B.-H., Kargupta, H.: Distributed Data Mining: Algorithms, Systems, and Applications. In: Ye, N. (ed.): Data Mining Handbook
3. Prodromidis, A.L., Chan, P.K., Stolfo, S.J.: Meta-Learning in Distributed Data Mining Systems: Issues and Approaches. In: Kargupta, H., Chan, P.K. (eds.): Advances in Distributed and Parallel Knowledge Discovery, Chapter 3. AAAI/MIT Press (2000)
4. Draper, D.: Bayesian Hierarchical Modeling. Tutorial at ISBA 2000, Crete, Greece (2000)
5. Wirth, R., Borth, M., Hipp, J.: When Distribution is Part of the Semantics: A New Problem Class for Distributed Knowledge Discovery. In: Workshop on Ubiquitous Data Mining for Mobile and Distributed Environments, 5th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'01) (2001)
6. Brandt, S.: Data Analysis: Statistical and Computational Methods for Scientists and Engineers, 3rd edn. Springer
7. Lipsey, M.W., Wilson, D.B.: Practical Meta-Analysis. SAGE Publications
8. Kreft, I., de Leeuw, J.: Introducing Multilevel Modeling. Sage Publications
9. Goldstein, H.: Multilevel Statistical Models, 2nd edn. Arnold
10. Xing, Y., Duggan, J., Madden, M.G., Lyons, G.J.: A Multi-Agent System for Customer Behavior Prediction in Virtual Organization. Technical Report for Enterprise Ireland
11. Dietterich, T.G.: Ensemble Methods in Machine Learning. Lecture Notes in Computer Science, Vol. 1857. Springer (2000)
12. Breiman, L.: Bagging Predictors. Machine Learning 24 (1996)
13. Friedman, J.H.: Multivariate Adaptive Regression Splines. Annals of Statistics 19 (1991)
14. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Academic Press
15. Páircéir, R., McClean, S., Scotney, B.: Discovery of Multi-level Rules and Exceptions from a Distributed Database. In: Sixth ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD 2000), Boston, MA, USA (2000)


More information

ENHANCED CONFIDENCE INTERPRETATIONS OF GP BASED ENSEMBLE MODELING RESULTS

ENHANCED CONFIDENCE INTERPRETATIONS OF GP BASED ENSEMBLE MODELING RESULTS ENHANCED CONFIDENCE INTERPRETATIONS OF GP BASED ENSEMBLE MODELING RESULTS Michael Affenzeller (a), Stephan M. Winkler (b), Stefan Forstenlechner (c), Gabriel Kronberger (d), Michael Kommenda (e), Stefan

More information

Credit Card Fraud Detection and Concept-Drift Adaptation with Delayed Supervised Information

Credit Card Fraud Detection and Concept-Drift Adaptation with Delayed Supervised Information Credit Card Fraud Detection and Concept-Drift Adaptation with Delayed Supervised Information Andrea Dal Pozzolo, Giacomo Boracchi, Olivier Caelen, Cesare Alippi, and Gianluca Bontempi 15/07/2015 IEEE IJCNN

More information

BIOINF 585 Fall 2015 Machine Learning for Systems Biology & Clinical Informatics http://www.ccmb.med.umich.edu/node/1376

BIOINF 585 Fall 2015 Machine Learning for Systems Biology & Clinical Informatics http://www.ccmb.med.umich.edu/node/1376 Course Director: Dr. Kayvan Najarian (DCM&B, kayvan@umich.edu) Lectures: Labs: Mondays and Wednesdays 9:00 AM -10:30 AM Rm. 2065 Palmer Commons Bldg. Wednesdays 10:30 AM 11:30 AM (alternate weeks) Rm.

More information

Introduction to Multilevel Modeling Using HLM 6. By ATS Statistical Consulting Group

Introduction to Multilevel Modeling Using HLM 6. By ATS Statistical Consulting Group Introduction to Multilevel Modeling Using HLM 6 By ATS Statistical Consulting Group Multilevel data structure Students nested within schools Children nested within families Respondents nested within interviewers

More information

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Knowledge Discovery and Data Mining Unit # 10 Sajjad Haider Fall 2012 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right

More information

Introduction to Data Analysis in Hierarchical Linear Models

Introduction to Data Analysis in Hierarchical Linear Models Introduction to Data Analysis in Hierarchical Linear Models April 20, 2007 Noah Shamosh & Frank Farach Social Sciences StatLab Yale University Scope & Prerequisites Strong applied emphasis Focus on HLM

More information

Data Mining as Exploratory Data Analysis. Zachary Jones

Data Mining as Exploratory Data Analysis. Zachary Jones Data Mining as Exploratory Data Analysis Zachary Jones The Problem(s) presumptions social systems are complex causal identification is difficult/impossible with many data sources theory not generally predictively

More information

COPYRIGHTED MATERIAL. Contents. List of Figures. Acknowledgments

COPYRIGHTED MATERIAL. Contents. List of Figures. Acknowledgments Contents List of Figures Foreword Preface xxv xxiii xv Acknowledgments xxix Chapter 1 Fraud: Detection, Prevention, and Analytics! 1 Introduction 2 Fraud! 2 Fraud Detection and Prevention 10 Big Data for

More information

Gerry Hobbs, Department of Statistics, West Virginia University

Gerry Hobbs, Department of Statistics, West Virginia University Decision Trees as a Predictive Modeling Method Gerry Hobbs, Department of Statistics, West Virginia University Abstract Predictive modeling has become an important area of interest in tasks such as credit

More information

Machine Learning and Data Mining. Fundamentals, robotics, recognition

Machine Learning and Data Mining. Fundamentals, robotics, recognition Machine Learning and Data Mining Fundamentals, robotics, recognition Machine Learning, Data Mining, Knowledge Discovery in Data Bases Their mutual relations Data Mining, Knowledge Discovery in Databases,

More information

Studying Auto Insurance Data

Studying Auto Insurance Data Studying Auto Insurance Data Ashutosh Nandeshwar February 23, 2010 1 Introduction To study auto insurance data using traditional and non-traditional tools, I downloaded a well-studied data from http://www.statsci.org/data/general/motorins.

More information

CS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning.

CS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning. Lecture Machine Learning Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square, x5 http://www.cs.pitt.edu/~milos/courses/cs75/ Administration Instructor: Milos Hauskrecht milos@cs.pitt.edu 539 Sennott

More information

An Analysis of Missing Data Treatment Methods and Their Application to Health Care Dataset

An Analysis of Missing Data Treatment Methods and Their Application to Health Care Dataset P P P Health An Analysis of Missing Data Treatment Methods and Their Application to Health Care Dataset Peng Liu 1, Elia El-Darzi 2, Lei Lei 1, Christos Vasilakis 2, Panagiotis Chountas 2, and Wei Huang

More information

not possible or was possible at a high cost for collecting the data.

not possible or was possible at a high cost for collecting the data. Data Mining and Knowledge Discovery Generating knowledge from data Knowledge Discovery Data Mining White Paper Organizations collect a vast amount of data in the process of carrying out their day-to-day

More information

Predict the Popularity of YouTube Videos Using Early View Data

Predict the Popularity of YouTube Videos Using Early View Data 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050

More information

Ensemble Learning Better Predictions Through Diversity. Todd Holloway ETech 2008

Ensemble Learning Better Predictions Through Diversity. Todd Holloway ETech 2008 Ensemble Learning Better Predictions Through Diversity Todd Holloway ETech 2008 Outline Building a classifier (a tutorial example) Neighbor method Major ideas and challenges in classification Ensembles

More information

Multiple Classifiers -Integration and Selection

Multiple Classifiers -Integration and Selection 1 A Dynamic Integration Algorithm with Ensemble of Classifiers Seppo Puuronen 1, Vagan Terziyan 2, Alexey Tsymbal 2 1 University of Jyvaskyla, P.O.Box 35, FIN-40351 Jyvaskyla, Finland sepi@jytko.jyu.fi

More information

Revenue Management with Correlated Demand Forecasting

Revenue Management with Correlated Demand Forecasting Revenue Management with Correlated Demand Forecasting Catalina Stefanescu Victor DeMiguel Kristin Fridgeirsdottir Stefanos Zenios 1 Introduction Many airlines are struggling to survive in today's economy.

More information

Learning bagged models of dynamic systems. 1 Introduction

Learning bagged models of dynamic systems. 1 Introduction Learning bagged models of dynamic systems Nikola Simidjievski 1,2, Ljupco Todorovski 3, Sašo Džeroski 1,2 1 Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia 2 Jožef Stefan

More information

Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms

Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Scott Pion and Lutz Hamel Abstract This paper presents the results of a series of analyses performed on direct mail

More information

Utility-Based Fraud Detection

Utility-Based Fraud Detection Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence Utility-Based Fraud Detection Luis Torgo and Elsa Lopes Fac. of Sciences / LIAAD-INESC Porto LA University of

More information

Ensemble Data Mining Methods

Ensemble Data Mining Methods Ensemble Data Mining Methods Nikunj C. Oza, Ph.D., NASA Ames Research Center, USA INTRODUCTION Ensemble Data Mining Methods, also known as Committee Methods or Model Combiners, are machine learning methods

More information

ICPSR Summer Program

ICPSR Summer Program ICPSR Summer Program Data Mining Tools for Exploring Big Data Department of Statistics Wharton School, University of Pennsylvania www-stat.wharton.upenn.edu/~stine Modern data mining combines familiar

More information

Data Mining. Concepts, Models, Methods, and Algorithms. 2nd Edition

Data Mining. Concepts, Models, Methods, and Algorithms. 2nd Edition Brochure More information from http://www.researchandmarkets.com/reports/2171322/ Data Mining. Concepts, Models, Methods, and Algorithms. 2nd Edition Description: This book reviews state-of-the-art methodologies

More information

Toward Scalable Learning with Non-uniform Class and Cost Distributions: A Case Study in Credit Card Fraud Detection

Toward Scalable Learning with Non-uniform Class and Cost Distributions: A Case Study in Credit Card Fraud Detection Toward Scalable Learning with Non-uniform Class and Cost Distributions: A Case Study in Credit Card Fraud Detection Philip K. Chan Computer Science Florida Institute of Technolog7 Melbourne, FL 32901 pkc~cs,

More information

Identifying At-Risk Students Using Machine Learning Techniques: A Case Study with IS 100

Identifying At-Risk Students Using Machine Learning Techniques: A Case Study with IS 100 Identifying At-Risk Students Using Machine Learning Techniques: A Case Study with IS 100 Erkan Er Abstract In this paper, a model for predicting students performance levels is proposed which employs three

More information

Welcome. Data Mining: Updates in Technologies. Xindong Wu. Colorado School of Mines Golden, Colorado 80401, USA

Welcome. Data Mining: Updates in Technologies. Xindong Wu. Colorado School of Mines Golden, Colorado 80401, USA Welcome Xindong Wu Data Mining: Updates in Technologies Dept of Math and Computer Science Colorado School of Mines Golden, Colorado 80401, USA Email: xwu@ mines.edu Home Page: http://kais.mines.edu/~xwu/

More information

Introduction to Longitudinal Data Analysis

Introduction to Longitudinal Data Analysis Introduction to Longitudinal Data Analysis Longitudinal Data Analysis Workshop Section 1 University of Georgia: Institute for Interdisciplinary Research in Education and Human Development Section 1: Introduction

More information

Perspectives on Data Mining

Perspectives on Data Mining Perspectives on Data Mining Niall Adams Department of Mathematics, Imperial College London n.adams@imperial.ac.uk April 2009 Objectives Give an introductory overview of data mining (DM) (or Knowledge Discovery

More information

Leveraging Ensemble Models in SAS Enterprise Miner

Leveraging Ensemble Models in SAS Enterprise Miner ABSTRACT Paper SAS133-2014 Leveraging Ensemble Models in SAS Enterprise Miner Miguel Maldonado, Jared Dean, Wendy Czika, and Susan Haller SAS Institute Inc. Ensemble models combine two or more models to

More information

Available online at www.sciencedirect.com Available online at www.sciencedirect.com

Available online at www.sciencedirect.com Available online at www.sciencedirect.com Available online at www.sciencedirect.com Available online at www.sciencedirect.com Procedia Procedia Engineering Engineering 00 (0 9 (0 000 000 340 344 Procedia Engineering www.elsevier.com/locate/procedia

More information

AUTOMATION OF ENERGY DEMAND FORECASTING. Sanzad Siddique, B.S.

AUTOMATION OF ENERGY DEMAND FORECASTING. Sanzad Siddique, B.S. AUTOMATION OF ENERGY DEMAND FORECASTING by Sanzad Siddique, B.S. A Thesis submitted to the Faculty of the Graduate School, Marquette University, in Partial Fulfillment of the Requirements for the Degree

More information

Eliminating Class Noise in Large Datasets

Eliminating Class Noise in Large Datasets Eliminating Class Noise in Lar Datasets Xingquan Zhu Xindong Wu Qijun Chen Department of Computer Science, University of Vermont, Burlington, VT 05405, USA XQZHU@CS.UVM.EDU XWU@CS.UVM.EDU QCHEN@CS.UVM.EDU

More information

Bootstrapping Big Data

Bootstrapping Big Data Bootstrapping Big Data Ariel Kleiner Ameet Talwalkar Purnamrita Sarkar Michael I. Jordan Computer Science Division University of California, Berkeley {akleiner, ameet, psarkar, jordan}@eecs.berkeley.edu

More information

Data Mining Methods: Applications for Institutional Research

Data Mining Methods: Applications for Institutional Research Data Mining Methods: Applications for Institutional Research Nora Galambos, PhD Office of Institutional Research, Planning & Effectiveness Stony Brook University NEAIR Annual Conference Philadelphia 2014

More information

Decision Tree Learning on Very Large Data Sets

Decision Tree Learning on Very Large Data Sets Decision Tree Learning on Very Large Data Sets Lawrence O. Hall Nitesh Chawla and Kevin W. Bowyer Department of Computer Science and Engineering ENB 8 University of South Florida 4202 E. Fowler Ave. Tampa

More information

Chapter 12 Bagging and Random Forests

Chapter 12 Bagging and Random Forests Chapter 12 Bagging and Random Forests Xiaogang Su Department of Statistics and Actuarial Science University of Central Florida - 1 - Outline A brief introduction to the bootstrap Bagging: basic concepts

More information

Web Hosting Service Level Agreements

Web Hosting Service Level Agreements Chapter 5 Web Hosting Service Level Agreements Alan King (Mentor) 1, Mehmet Begen, Monica Cojocaru 3, Ellen Fowler, Yashar Ganjali 4, Judy Lai 5, Taejin Lee 6, Carmeliza Navasca 7, Daniel Ryan Report prepared

More information

Finding statistical patterns in Big Data

Finding statistical patterns in Big Data Finding statistical patterns in Big Data Patrick Rubin-Delanchy University of Bristol & Heilbronn Institute for Mathematical Research IAS Research Workshop: Data science for the real world (workshop 1)

More information

Using multiple models: Bagging, Boosting, Ensembles, Forests

Using multiple models: Bagging, Boosting, Ensembles, Forests Using multiple models: Bagging, Boosting, Ensembles, Forests Bagging Combining predictions from multiple models Different models obtained from bootstrap samples of training data Average predictions or

More information

On Cross-Validation and Stacking: Building seemingly predictive models on random data

On Cross-Validation and Stacking: Building seemingly predictive models on random data On Cross-Validation and Stacking: Building seemingly predictive models on random data ABSTRACT Claudia Perlich Media6 New York, NY 10012 claudia@media6degrees.com A number of times when using cross-validation

More information

Statistical Machine Learning

Statistical Machine Learning Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes

More information

Handling missing data in large data sets. Agostino Di Ciaccio Dept. of Statistics University of Rome La Sapienza

Handling missing data in large data sets. Agostino Di Ciaccio Dept. of Statistics University of Rome La Sapienza Handling missing data in large data sets Agostino Di Ciaccio Dept. of Statistics University of Rome La Sapienza The problem Often in official statistics we have large data sets with many variables and

More information

Social Media Mining. Data Mining Essentials

Social Media Mining. Data Mining Essentials Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

More information

REVIEW OF ENSEMBLE CLASSIFICATION

REVIEW OF ENSEMBLE CLASSIFICATION Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IJCSMC, Vol. 2, Issue.

More information

Weather forecast prediction: a Data Mining application

Weather forecast prediction: a Data Mining application Weather forecast prediction: a Data Mining application Ms. Ashwini Mandale, Mrs. Jadhawar B.A. Assistant professor, Dr.Daulatrao Aher College of engg,karad,ashwini.mandale@gmail.com,8407974457 Abstract

More information

Tree Ensembles: The Power of Post- Processing. December 2012 Dan Steinberg Mikhail Golovnya Salford Systems

Tree Ensembles: The Power of Post- Processing. December 2012 Dan Steinberg Mikhail Golovnya Salford Systems Tree Ensembles: The Power of Post- Processing December 2012 Dan Steinberg Mikhail Golovnya Salford Systems Course Outline Salford Systems quick overview Treenet an ensemble of boosted trees GPS modern

More information

Chapter 11 Boosting. Xiaogang Su Department of Statistics University of Central Florida - 1 -

Chapter 11 Boosting. Xiaogang Su Department of Statistics University of Central Florida - 1 - Chapter 11 Boosting Xiaogang Su Department of Statistics University of Central Florida - 1 - Perturb and Combine (P&C) Methods have been devised to take advantage of the instability of trees to create

More information

Longitudinal Meta-analysis

Longitudinal Meta-analysis Quality & Quantity 38: 381 389, 2004. 2004 Kluwer Academic Publishers. Printed in the Netherlands. 381 Longitudinal Meta-analysis CORA J. M. MAAS, JOOP J. HOX and GERTY J. L. M. LENSVELT-MULDERS Department

More information

Regularized Logistic Regression for Mind Reading with Parallel Validation

Regularized Logistic Regression for Mind Reading with Parallel Validation Regularized Logistic Regression for Mind Reading with Parallel Validation Heikki Huttunen, Jukka-Pekka Kauppi, Jussi Tohka Tampere University of Technology Department of Signal Processing Tampere, Finland

More information

A Hybrid Approach to Learn with Imbalanced Classes using Evolutionary Algorithms

A Hybrid Approach to Learn with Imbalanced Classes using Evolutionary Algorithms Proceedings of the International Conference on Computational and Mathematical Methods in Science and Engineering, CMMSE 2009 30 June, 1 3 July 2009. A Hybrid Approach to Learn with Imbalanced Classes using

More information

Introduction to Data Mining

Introduction to Data Mining Introduction to Data Mining 1 Why Data Mining? Explosive Growth of Data Data collection and data availability Automated data collection tools, Internet, smartphones, Major sources of abundant data Business:

More information

Using Mining@Home for Distributed Ensemble Learning

Using Mining@Home for Distributed Ensemble Learning Using Mining@Home for Distributed Ensemble Learning Eugenio Cesario 1, Carlo Mastroianni 1, and Domenico Talia 1,2 1 ICAR-CNR, Italy {cesario,mastroianni}@icar.cnr.it 2 University of Calabria, Italy talia@deis.unical.it

More information

STATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and

STATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and Clustering Techniques and STATISTICA Case Study: Defining Clusters of Shopping Center Patrons STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table

More information

Data-driven Multi-touch Attribution Models

Data-driven Multi-touch Attribution Models Data-driven Multi-touch Attribution Models Xuhui Shao Turn, Inc. 835 Main St. Redwood City, CA 94063 xuhui.shao@turn.com Lexin Li Department of Statistics North Carolina State University Raleigh, NC 27695

More information

How To Identify A Churner

How To Identify A Churner 2012 45th Hawaii International Conference on System Sciences A New Ensemble Model for Efficient Churn Prediction in Mobile Telecommunication Namhyoung Kim, Jaewook Lee Department of Industrial and Management

More information

Statistics for BIG data

Statistics for BIG data Statistics for BIG data Statistics for Big Data: Are Statisticians Ready? Dennis Lin Department of Statistics The Pennsylvania State University John Jordan and Dennis K.J. Lin (ICSA-Bulletine 2014) Before

More information

Comparison of K-means and Backpropagation Data Mining Algorithms

Comparison of K-means and Backpropagation Data Mining Algorithms Comparison of K-means and Backpropagation Data Mining Algorithms Nitu Mathuriya, Dr. Ashish Bansal Abstract Data mining has got more and more mature as a field of basic research in computer science and

More information

Chapter ML:XI (continued)

Chapter ML:XI (continued) Chapter ML:XI (continued) XI. Cluster Analysis Data Mining Overview Cluster Analysis Basics Hierarchical Cluster Analysis Iterative Cluster Analysis Density-Based Cluster Analysis Cluster Evaluation Constrained

More information