Distributed Regression For Heterogeneous Data Sets
Yan Xing, Michael G. Madden, Jim Duggan, Gerard Lyons
Department of Information Technology
National University of Ireland, Galway, Ireland
{yan.xing, michael.madden, jim.duggan,

Abstract. Existing meta-learning based distributed data mining approaches do not explicitly address context heterogeneity across individual sites. This limitation constrains their application where distributed data are not identically and independently distributed. By modeling heterogeneously distributed data with hierarchical models, this paper extends traditional meta-learning techniques so that they can be used successfully in distributed scenarios with context heterogeneity.

1 Introduction

Distributed data mining (DDM) is an active research sub-area of data mining. It has been applied successfully in cases where data are inherently distributed among different loosely coupled sites connected by a network [1-3]. By transmitting high-level information, DDM techniques can discover new knowledge from dispersed data. Such high-level information not only has reduced storage and bandwidth requirements, but also maintains the privacy of individual records. Most DDM algorithms for regression or classification fit within a meta-learning framework based on ensemble learning. They accept the implicit assumption that the probability distributions of the dispersed data are homogeneous: the data are merely geographically or physically distributed across various sites, and there are no differences among the sites themselves. Although a distinction is commonly made between homogeneous and heterogeneous data, it is limited to database schemata. In reality, heterogeneity across sites is often the rule rather than the exception [4]. One example is a health virtual organization comprising several hospitals, which differ in expertise, skills, equipment and treatment.
Another example is a loosely coupled commercial network of several retailers, which adopt different business policies, sell different products and apply different price standards. Wirth describes this kind of scenario as "distribution is part of the semantics" [5], a setting that has seldom been discussed and addressed.

1 The support of the Informatics Research Initiative of Enterprise Ireland is gratefully acknowledged.
2 Model of Distributed Data Across Heterogeneous Sites

Distributed data across homogeneous sites may be regarded as random samples from the same underlying population, even though the probability distribution of the population is unknown. The differences among the data sets are just random sampling errors; from a statistical perspective, the distributed data are identically and independently distributed (IID) [6, 7]. The data set stored at the k-th site consists of data {(y_ki, x_ki), i = 1, ..., N_k}, where the y_ki are numerical responses for regression problems, N_k is the sample size at the k-th site, k = 1, ..., K, and K is the total number of individual sites. If all the sites are homogeneous, IID data drawn from a normal distribution with mean θ and variance σ² can be expressed as:

    y_ki ~(IID) N(θ, σ²)    (1)

If e_ki is the random sampling error at the k-th site, and f(x_ki) is the real global regression model, then equation (1) can be rewritten as:

    y_ki = θ + e_ki = f(x_ki) + e_ki,  e_ki ~(IID) N(0, σ²)    (2)

When there is heterogeneity across the different sites, the distributed data are not IID. The differences across the various sites comprise not only sampling errors, but also context heterogeneity caused by the differing features of the sites. In practice, it is difficult to determine the exact sources of this context heterogeneity. In the earlier hospital example, the mixed effects of differences in hospital expertise, equipment, skills and treatment cause context heterogeneity, and it is hard or even impossible to measure how much of the heterogeneity is caused by any one of these factors. Furthermore, some differences across the hospitals may be unobservable. The same holds for other domains such as the network of retailers. We therefore need to model context heterogeneity in a principled way.
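To make the homogeneous case of equations (1) and (2) concrete, the following minimal NumPy sketch (all variable names and numeric values are our own illustration, not the paper's) draws K sites from a single population and confirms that site means differ only by sampling error:

```python
import numpy as np

rng = np.random.default_rng(0)

# Eq. (1): homogeneous sites all sample the same population N(theta, sigma^2);
# site-to-site differences are random sampling error only.
theta, sigma, K, N = 5.0, 1.0, 10, 200
sites = [theta + sigma * rng.standard_normal(N) for _ in range(K)]

# Each site mean deviates from theta by O(sigma / sqrt(N)) ~ 0.07 here,
# so the spread of site means is small relative to sigma.
site_means = np.array([s.mean() for s in sites])
spread = site_means.std(ddof=1)
```

Under heterogeneity, as the next section shows, this spread acquires an extra between-site component τ² on top of the σ²/N sampling term.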
In statistical meta-analysis, a popular way to model unobservable or immeasurable context heterogeneity is to assume that the heterogeneity across different sites is random; in other words, context heterogeneity derives from essentially random differences among sites whose sources cannot be identified or are unobservable [7]. Distributed data across sites with randomly distributed context heterogeneity are often regarded as conditionally IID [4]. This leads to a two-level hierarchical model that describes context heterogeneity with mixture models and employs latent variables in a hierarchical structure [4, 8, 9]. Assuming that the contexts θ_k of the different sites are normally distributed with mean θ and variance τ², and that the data y_ki at the k-th site are normally distributed with mean θ_k and variance σ², the two-level hierarchical model of distributed data across heterogeneous sites is:
    Between-sites level:   θ_k ~(IID) N(θ, τ²)
    Within the k-th site:  (y_ki | θ_k) ~(IID) N(θ_k, σ²)    (3)

If t_k is the random sampling error of context across the different sites, and the residuals t_k and e_ki at the two levels are independent, then equation (3) can be rewritten as:

    y_ki = θ + e_ki + t_k = f(x_ki) + e_ki + t_k    (4)
    e_ki ~(IID) N(0, σ²),  t_k ~(IID) N(0, τ²)

When τ² = 0, then t_k = 0, equation (3) reduces to equation (1), and equation (4) reduces to equation (2). A distributed scenario with homogeneous sites is therefore just a special case of the generic situation. In the theory of hierarchical modeling, the intra-class correlation (ICC) measures how much of the total variance of a model is caused by context heterogeneity. It is calculated by:

    ICC = τ² / (τ² + σ²)    (5)

When the various sites are homogeneous, ICC = 0. The larger the ICC, the more of the model variance is caused by context heterogeneity.

3 Towards Context-based Meta-learning

Once distributed data across heterogeneous sites are modeled as in equation (4), the main task of distributed regression is to obtain the global regression model f(x_ki). In hierarchical modeling, statistical linear and nonlinear multilevel model fitting is done by iterative maximum-likelihood based algorithms [4, 9]. Unfortunately, all of these algorithms are designed for centralized data. Even if we modified them for a distributed environment, their iterative nature would place a significant communication burden on the various sites [10]. Since most existing DDM approaches fit within a meta-learning framework, it is worthwhile to extend the state-of-the-art techniques so that they can be used successfully in distributed scenarios with heterogeneous sites.

3.1 Traditional Distributed Meta-learning

When the different sites are homogeneous, equations (1) and (2) apply, and the implicit assumption required for an ensemble of learners to be more accurate than the average performance of its individual members is satisfied [11, 12].
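The two-level model of equation (3) and the ICC of equation (5) can be simulated directly. In this sketch (NumPy; the moment-based estimator and all names are our own framing, not the paper's), the variance of the site means is approximately τ² + σ²/N, which lets us recover the ICC from the data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two-level model, eq. (3)/(4): site contexts theta_k ~ N(theta, tau^2),
# observations within site k ~ N(theta_k, sigma^2).
theta, sigma, tau, K, N = 5.0, 1.0, 1.0, 100, 200   # true ICC = 0.5
theta_k = theta + tau * rng.standard_normal(K)       # between-sites level
sites = [tk + sigma * rng.standard_normal(N) for tk in theta_k]

# Moment-based recovery of eq. (5): within-site variance estimates sigma^2;
# the variance of site means estimates tau^2 + sigma^2/N.
within = np.mean([s.var(ddof=1) for s in sites])
between = np.var([s.mean() for s in sites], ddof=1)
tau2_hat = between - within / N
icc_hat = tau2_hat / (tau2_hat + within)   # close to the true ICC of 0.5
```

This is the quantity that the simulation experiments later sweep from 0.0 to 0.9 by adjusting τ².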
Assuming the base models (or learners) generated at the different sites are f_k(x_i), k = 1, 2, ..., K, the final ensemble model (meta-learner) is
f_A(x_i) = E_k[f_k(x_i)], where E_k denotes the expectation over k, and the subscript A in f_A denotes aggregation. Distributed meta-learning takes f_A(x_i) as the estimate of the real global model f(x_i). Meta-learning based distributed regression follows three main steps. First, generate base regression models at each site using a learning algorithm. Second, collect the base models at a central site, and produce meta-level data from a separate validation set and the predictions generated by the base models on it. Last, generate the final regression model from the meta-level data via a combiner (un-weighted or weighted averaging).

3.2 Context-based Meta-learning

When the different sites are heterogeneous, equations (3) and (4) apply. The variance of data within a given site is σ², but the variance of data from different sites is σ² + τ². The distributed data are therefore not IID, and the criterion for the success of meta-learning is not satisfied. We need an approach that deals with the context variance τ² within the meta-learning framework; we call it the context-based meta-learning approach.

3.2.1 Global Model Estimation

According to equation (4), given the context θ_k of the k-th site, data within that site can be expressed as:

    (y_ki | θ_k) = θ_k + e_ki,  e_ki ~(IID) N(0, σ²)    (6)

and {θ_k, k = 1, 2, ..., K} has the following distribution:

    θ_k = θ + t_k = f(x_i) + t_k,  t_k ~(IID) N(0, τ²)    (7)

So the base models f_k(x_i) generated at the local sites are estimates of θ_k, k = 1, 2, ..., K. Given θ_k and x_i, suppose θ̂_k is the estimate of the real θ_k; then:

    f_k(x_i) = θ̂_k ≈ θ_k = θ + t_k = f(x_i) + t_k    (8)

The final ensemble model f_A(x_i) is then:

    f_A(x_i) = E_k[f_k(x_i)] ≈ f(x_i) + E_k(t_k)    (9)

Because E_k(t_k) = 0, we can use f_A(x_i) to estimate the real global model f(x_i).
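The three steps of meta-learning based distributed regression can be sketched as follows. This is a simplified illustration, not the paper's implementation: it uses least-squares base learners in place of the regression trees used in the experiments, and combines them with the un-weighted-average combiner f_A(x) = (1/K) Σ_k f_k(x):

```python
import numpy as np

rng = np.random.default_rng(2)

def fit_base(X, y):
    """Least-squares base model for one site; returns a predict function."""
    Xb = np.c_[np.ones(len(X)), X]                 # add intercept column
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return lambda Xq: np.c_[np.ones(len(Xq)), Xq] @ w

# Step 1: each site fits a base model on its local data.
# Illustrative true global model: f(x) = 1 + 3x, homogeneous sites.
K, N = 5, 100
base_models = []
for _ in range(K):
    X = rng.uniform(0.0, 1.0, (N, 1))
    y = 1.0 + 3.0 * X[:, 0] + 0.1 * rng.standard_normal(N)
    base_models.append(fit_base(X, y))

# Steps 2-3: collect the base models and combine their predictions
# with the un-weighted-average combiner f_A(x) = (1/K) * sum_k f_k(x).
def f_A(Xq):
    return np.mean([m(Xq) for m in base_models], axis=0)

pred = f_A(np.array([[0.0], [1.0]]))   # approximately [1.0, 4.0]
```

With homogeneous sites, as here, the averaged ensemble tracks the true f(x); the sections below address what changes when each site carries a context residual t_k.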
3.2.2 Context Residual Estimation

Because of context heterogeneity, when we use the ensemble model for prediction at a given k-th site, we need to add the context residual t_k of that site. From equations (8) and (9), we have:

    t_k = (1/N_k) Σ_{i=1}^{N_k} [f_k(x_ki) − f_A(x_ki)]    (10)

With equation (10), t_k will never be exactly zero even when the real context residual is zero. So once we obtain t_k from equation (10), we use a two-tailed t-test to check the null hypothesis H0: t_k = 0, and calculate the context residual of the k-th site as:

    t_k = 0,    if H0 is accepted given α
    t_k = t_k,  if H0 is rejected given α    (11)

where α is the level of significance. Once we have all the values {t_k, k = 1, 2, ..., K}, we can calculate the context-level variance τ².

3.2.3 Algorithm for Context-based Meta-learning

Our context-based algorithm for distributed regression follows six steps:
1. At each site, use cross-validation to generate a base regression model.
2. At each site, collect the base models and produce meta-level data from the predictions generated by the base models on the local data set.
3. At each site, generate the ensemble model from the meta-level data via a combiner (un-weighted or weighted averaging).
4. At each site, calculate its context residual with equations (10) and (11).
5. At each site, generate the final regression model of this site by equation (8).
6. Collect all the base models and context residuals at a central site, and calculate the context-level variance.

4 Simulation Experiment

In practice, different distributed scenarios have different levels of intra-class correlation, ranging from ICC = 0, the homogeneous case, to ICC → 1, where the variance is caused mainly by context heterogeneity. To evaluate our approach across different values of ICC, we use simulated data sets, because it is difficult to obtain real-world distributed data sets that satisfy all our requirements.
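Equations (10) and (11) amount to a one-sample t-test on the per-point gaps between the local model's and the ensemble's predictions on the site's own data. A minimal sketch (NumPy; the function name, the manual t-statistic, and the large-sample critical value 1.96 approximating α = 0.05 are our assumptions, not the paper's):

```python
import numpy as np

def context_residual(local_pred, ensemble_pred, t_crit=1.96):
    """Eq. (10)/(11): mean gap f_k(x_ki) - f_A(x_ki) over the site's data,
    zeroed unless a two-tailed t-test rejects H0: t_k = 0.
    t_crit = 1.96 approximates alpha = 0.05 for large N_k."""
    gaps = np.asarray(local_pred) - np.asarray(ensemble_pred)
    n = len(gaps)
    t_k = gaps.mean()                          # eq. (10)
    se = gaps.std(ddof=1) / np.sqrt(n)         # standard error of the mean
    t_stat = t_k / se if se > 0 else float("inf")
    return t_k if abs(t_stat) > t_crit else 0.0   # eq. (11)
```

A site whose local predictions sit systematically above the ensemble keeps its mean gap as t_k; a site whose gaps fluctuate around zero gets t_k = 0, so the ensemble is used unmodified there.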
4.1 Simulation Data Sets

The three simulation data sets we use are Friedman's data sets [13]. They were originally generated by Friedman and used by Breiman in his oft-cited paper about bagging [12]. We have modified the three data sets so that they are compatible with our distributed scenarios.

Friedman #1: there are ten independent predictor variables x_1, ..., x_10, each uniformly distributed over [0, 1]. We set the total number of sites to K = 10. At the k-th site, the sample size is 200 for the training set and 1000 for the testing set. The response is given by:

    #1: y_ki = 10 sin(π x_1 x_2) + 20(x_3 − 0.5)² + 10 x_4 + 5 x_5 + e_ki + t_k    (12)
        e_ki ~(IID) N(0, 1),  t_k ~(IID) N(0, τ²)

We adjust the value of τ² so that we obtain ICC = 0.0, 0.1, ..., 0.9 respectively.

Friedman #2, #3: these two examples are four-variable data sets with:

    #2: y_ki = (x_1² + (x_2 x_3 − 1/(x_2 x_4))²)^(1/2) + e_ki + t_k    (13)
        e_ki ~(IID) N(0, σ_2²),  t_k ~(IID) N(0, τ_2²)

    #3: y_ki = tan⁻¹[(x_2 x_3 − 1/(x_2 x_4)) / x_1] + e_ki + t_k    (14)
        e_ki ~(IID) N(0, σ_3²),  t_k ~(IID) N(0, τ_3²)

where x_1, x_2, x_3, x_4 are uniformly distributed as 0 ≤ x_1 ≤ 100, 20 ≤ x_2/(2π) ≤ 280, 0 ≤ x_3 ≤ 1 and 1 ≤ x_4 ≤ 11. The total number of sites and the training and testing sample sizes of each site are the same as for #1. The parameters σ_2, σ_3 are selected to give 3:1 signal/noise ratios, and τ_2, τ_3 are adjusted for the same purpose as in #1.

4.2 Simulation Results

To compare the prediction accuracy of the traditional meta-learning approach and our context-based meta-learning approach under different values of ICC, we implemented both algorithms. The base learner is a regression-tree algorithm implemented in Weka [14]; the meta-learner is un-weighted averaging, with the significance level α held fixed. We evaluate our approach from two angles: the whole virtual organization and its individual sites.

4.2.1 The Whole Organization

From the point of view of the whole organization, the target of DDM is to discover the global trend. Fig. 1 shows the global prediction accuracy under different ICC values.
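Before turning to the results, the modified Friedman #1 generator of equation (12) can be sketched as below (NumPy; the helper name and the way τ is solved from a target ICC via equation (5) are our own framing of the setup described in Sect. 4.1):

```python
import numpy as np

rng = np.random.default_rng(3)

def friedman1_site(n, t_k, rng):
    """One site's sample of the modified Friedman #1 data, eq. (12):
    ten U[0,1] predictors (only x1..x5 enter the response), unit
    sampling noise e_ki, plus the site's context residual t_k."""
    X = rng.uniform(0.0, 1.0, (n, 10))
    y = (10.0 * np.sin(np.pi * X[:, 0] * X[:, 1])
         + 20.0 * (X[:, 2] - 0.5) ** 2
         + 10.0 * X[:, 3]
         + 5.0 * X[:, 4]
         + rng.standard_normal(n)    # e_ki ~ N(0, 1)
         + t_k)                      # context residual of this site
    return X, y

# tau is tuned from the target ICC via eq. (5): tau^2 = ICC * sigma^2 / (1 - ICC).
icc_target, sigma2, K = 0.3, 1.0, 10
tau = (icc_target * sigma2 / (1.0 - icc_target)) ** 0.5
t = tau * rng.standard_normal(K)                      # t_k ~ N(0, tau^2)
train = [friedman1_site(200, t_k, rng) for t_k in t]  # one (X, y) per site
```

Sweeping icc_target over 0.0, 0.1, ..., 0.9 reproduces the range of heterogeneity levels examined in the experiments.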
From Fig. 1 we can see that, as ICC increases, the prediction accuracy of the traditional meta-learning approach decreases, while the prediction accuracy of our approach remains essentially the same. When ICC = 0, the prediction accuracies of the two approaches are almost identical.

4.2.2 Individual Sites

From the point of view of each individual site, the goal of DDM is to obtain a more accurate model than the local model (the base model created only from its local data). In practice, when ICC > 0.5, an individual site usually uses only its local model, because the context heterogeneity is too large.

Fig. 1. Global prediction accuracy under different ICC values (Friedman #1, #2 and #3, average of 20 runs)
Fig. 2. Prediction accuracy of individual sites when ICC = 0.3 (Friedman #1, #2 and #3, average of 20 runs)

Fig. 2 compares the prediction accuracy of the local model, the meta-learning model and the model created with our context-based meta-learning approach when ICC = 0.3. For those sites with relatively larger context residuals (5th, 8th, 10th for #1; 1st, 4th, 5th, 8th for #2; 3rd, 4th, 7th for #3), meta-learning behaves worse than our approach, and sometimes even worse than the local models (1st, 4th, 5th, 8th for #2; 4th, 7th for #3). For those sites with very small context residuals, the performance of our approach is no worse than meta-learning. So the overall performance of our approach is the best.

5 Discussion of Our Approach

Two important issues relate to our approach: the number of sites and the sample size at individual sites.

5.1 Number of Sites

In practical distributed scenarios, the number of sites is usually much smaller than the number of data points at each individual site; the extreme case is only two sites.
From the perspective of statistics, the larger the number of sites, the more accurately we can estimate the quantity of context heterogeneity. When the number of sites is extremely low, we usually underestimate the quantity of context heterogeneity [9]. This kind of underestimation erodes the advantage of our approach. We repeated our simulation experiment with 5 and 2 sites respectively, and the results demonstrate this.

Fig. 3. Comparison of global prediction accuracy when the number of sites is extremely small, sites = 2 and 5 (Friedman #3, average of 20 runs)

Fig. 3 shows the global prediction accuracy for the Friedman #3 data when the number of sites is extremely small. When the number of sites is 5, the advantage of our approach is still obvious; when the number of sites is 2, the advantage is less obvious.

5.2 Sample Size of Data at Individual Sites

At each individual site, as the sample size of the training data set increases, the accuracy of using θ̂_k to estimate θ_k in equation (8) increases. So we can obtain a more accurate estimate of a site's context residual when we have more local data, and thus finally achieve higher prediction accuracy.
Comparing the two graphs in Fig. 4, it can be seen that the advantage of our approach is more pronounced when there are more training data at each individual site.

Fig. 4. Comparison of prediction accuracy of individual sites with different training sizes, 50 and 200 (Friedman #1, testing size = 1000 for each site, ICC = 0.3, average of 20 runs)

6 Related Work

Meta-learning is one popular approach among DDM techniques. The most successful meta-learning based DDM application is that of Prodromidis in the domain of credit card fraud detection [3]. However, most existing DDM applications within the meta-learning framework do not explicitly address context heterogeneity across individual sites. Wirth defines distributed scenarios with context heterogeneity as "distribution is part of the semantics" in his work [5], but does not give an approach to handle them. The only work we found that explicitly addresses context heterogeneity is that of Páircéir [15], where statistical hierarchical models are used to discover multi-level association rules from dispersed hierarchical data. In our previous work, we used hierarchical modeling to address context heterogeneity in the domain of virtual organizations [10]. Although we obtained some encouraging results, we also found that iteration-based algorithms cause heavy communication traffic among the individual sites.

7 Summary and Future Work

Through an analysis of the limitations of the distributed meta-learning approach, we model distributed data across heterogeneously distributed sites with two-level statistical hierarchical models, and extend the traditional meta-learning approach to suit non-IID distributed data. We successfully apply our context-based meta-learning
approach to several simulated data sets for distributed regression, and discuss the important issues related to the approach. In future work we will use real-world distributed data sets to test our approach, and then extend it to distributed classification problems.

References

1. Provost, F.: Distributed Data Mining: Scaling Up and Beyond. In: Kargupta, H., Chan, P.K. (eds.): Advances in Distributed and Parallel Knowledge Discovery. AAAI/MIT Press (2000)
2. Park, B.-H., Kargupta, H.: Distributed Data Mining: Algorithms, Systems, and Applications. In: Ye, N. (ed.): Data Mining Handbook (2002)
3. Prodromidis, A.L., Chan, P.K., Stolfo, S.J.: Meta-Learning in Distributed Data Mining Systems: Issues and Approaches. In: Kargupta, H., Chan, P.K. (eds.): Advances in Distributed and Parallel Knowledge Discovery, Chapter 3. AAAI/MIT Press (2000)
4. Draper, D.: Bayesian Hierarchical Modeling. Tutorial at ISBA2000, Crete, Greece (2000)
5. Wirth, R., Borth, M., Hipp, J.: When Distribution is Part of the Semantics: A New Problem Class for Distributed Knowledge Discovery. In: Workshop on Ubiquitous Data Mining for Mobile and Distributed Environments, 5th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'01) (2001)
6. Brandt, S.: Data Analysis: Statistical and Computational Methods for Scientists and Engineers, 3rd edn. Springer
7. Lipsey, M.W., Wilson, D.B.: Practical Meta-Analysis. Sage Publications (2001)
8. Kreft, I., De Leeuw, J.: Introducing Multilevel Modeling. Sage Publications (1998)
9. Goldstein, H.: Multilevel Statistical Models, 2nd edn. Arnold
10. Xing, Y., Duggan, J., Madden, M.G., Lyons, G.J.: A Multi-Agent System for Customer Behavior Prediction in a Virtual Organization. Technical Report for Enterprise Ireland
11. Dietterich, T.G.: Ensemble Methods in Machine Learning. Lecture Notes in Computer Science (2000)
12. Breiman, L.: Bagging Predictors. Machine Learning 24 (1996)
13. Friedman, J.H.: Multivariate Adaptive Regression Splines. Annals of Statistics 19 (1991)
14. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann (2000)
15. Páircéir, R., McClean, S., Scotney, B.: Discovery of Multi-level Rules and Exceptions from a Distributed Database. In: Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD2000), Boston, MA, USA (2000)
More informationMachine Learning and Data Mining. Fundamentals, robotics, recognition
Machine Learning and Data Mining Fundamentals, robotics, recognition Machine Learning, Data Mining, Knowledge Discovery in Data Bases Their mutual relations Data Mining, Knowledge Discovery in Databases,
More informationStudying Auto Insurance Data
Studying Auto Insurance Data Ashutosh Nandeshwar February 23, 2010 1 Introduction To study auto insurance data using traditional and non-traditional tools, I downloaded a well-studied data from http://www.statsci.org/data/general/motorins.
More informationCS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning.
Lecture Machine Learning Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square, x5 http://www.cs.pitt.edu/~milos/courses/cs75/ Administration Instructor: Milos Hauskrecht milos@cs.pitt.edu 539 Sennott
More informationAn Analysis of Missing Data Treatment Methods and Their Application to Health Care Dataset
P P P Health An Analysis of Missing Data Treatment Methods and Their Application to Health Care Dataset Peng Liu 1, Elia El-Darzi 2, Lei Lei 1, Christos Vasilakis 2, Panagiotis Chountas 2, and Wei Huang
More informationnot possible or was possible at a high cost for collecting the data.
Data Mining and Knowledge Discovery Generating knowledge from data Knowledge Discovery Data Mining White Paper Organizations collect a vast amount of data in the process of carrying out their day-to-day
More informationPredict the Popularity of YouTube Videos Using Early View Data
000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050
More informationEnsemble Learning Better Predictions Through Diversity. Todd Holloway ETech 2008
Ensemble Learning Better Predictions Through Diversity Todd Holloway ETech 2008 Outline Building a classifier (a tutorial example) Neighbor method Major ideas and challenges in classification Ensembles
More informationMultiple Classifiers -Integration and Selection
1 A Dynamic Integration Algorithm with Ensemble of Classifiers Seppo Puuronen 1, Vagan Terziyan 2, Alexey Tsymbal 2 1 University of Jyvaskyla, P.O.Box 35, FIN-40351 Jyvaskyla, Finland sepi@jytko.jyu.fi
More informationRevenue Management with Correlated Demand Forecasting
Revenue Management with Correlated Demand Forecasting Catalina Stefanescu Victor DeMiguel Kristin Fridgeirsdottir Stefanos Zenios 1 Introduction Many airlines are struggling to survive in today's economy.
More informationLearning bagged models of dynamic systems. 1 Introduction
Learning bagged models of dynamic systems Nikola Simidjievski 1,2, Ljupco Todorovski 3, Sašo Džeroski 1,2 1 Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia 2 Jožef Stefan
More informationComparing the Results of Support Vector Machines with Traditional Data Mining Algorithms
Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Scott Pion and Lutz Hamel Abstract This paper presents the results of a series of analyses performed on direct mail
More informationUtility-Based Fraud Detection
Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence Utility-Based Fraud Detection Luis Torgo and Elsa Lopes Fac. of Sciences / LIAAD-INESC Porto LA University of
More informationEnsemble Data Mining Methods
Ensemble Data Mining Methods Nikunj C. Oza, Ph.D., NASA Ames Research Center, USA INTRODUCTION Ensemble Data Mining Methods, also known as Committee Methods or Model Combiners, are machine learning methods
More informationICPSR Summer Program
ICPSR Summer Program Data Mining Tools for Exploring Big Data Department of Statistics Wharton School, University of Pennsylvania www-stat.wharton.upenn.edu/~stine Modern data mining combines familiar
More informationData Mining. Concepts, Models, Methods, and Algorithms. 2nd Edition
Brochure More information from http://www.researchandmarkets.com/reports/2171322/ Data Mining. Concepts, Models, Methods, and Algorithms. 2nd Edition Description: This book reviews state-of-the-art methodologies
More informationToward Scalable Learning with Non-uniform Class and Cost Distributions: A Case Study in Credit Card Fraud Detection
Toward Scalable Learning with Non-uniform Class and Cost Distributions: A Case Study in Credit Card Fraud Detection Philip K. Chan Computer Science Florida Institute of Technolog7 Melbourne, FL 32901 pkc~cs,
More informationIdentifying At-Risk Students Using Machine Learning Techniques: A Case Study with IS 100
Identifying At-Risk Students Using Machine Learning Techniques: A Case Study with IS 100 Erkan Er Abstract In this paper, a model for predicting students performance levels is proposed which employs three
More informationWelcome. Data Mining: Updates in Technologies. Xindong Wu. Colorado School of Mines Golden, Colorado 80401, USA
Welcome Xindong Wu Data Mining: Updates in Technologies Dept of Math and Computer Science Colorado School of Mines Golden, Colorado 80401, USA Email: xwu@ mines.edu Home Page: http://kais.mines.edu/~xwu/
More informationIntroduction to Longitudinal Data Analysis
Introduction to Longitudinal Data Analysis Longitudinal Data Analysis Workshop Section 1 University of Georgia: Institute for Interdisciplinary Research in Education and Human Development Section 1: Introduction
More informationPerspectives on Data Mining
Perspectives on Data Mining Niall Adams Department of Mathematics, Imperial College London n.adams@imperial.ac.uk April 2009 Objectives Give an introductory overview of data mining (DM) (or Knowledge Discovery
More informationLeveraging Ensemble Models in SAS Enterprise Miner
ABSTRACT Paper SAS133-2014 Leveraging Ensemble Models in SAS Enterprise Miner Miguel Maldonado, Jared Dean, Wendy Czika, and Susan Haller SAS Institute Inc. Ensemble models combine two or more models to
More informationAvailable online at www.sciencedirect.com Available online at www.sciencedirect.com
Available online at www.sciencedirect.com Available online at www.sciencedirect.com Procedia Procedia Engineering Engineering 00 (0 9 (0 000 000 340 344 Procedia Engineering www.elsevier.com/locate/procedia
More informationAUTOMATION OF ENERGY DEMAND FORECASTING. Sanzad Siddique, B.S.
AUTOMATION OF ENERGY DEMAND FORECASTING by Sanzad Siddique, B.S. A Thesis submitted to the Faculty of the Graduate School, Marquette University, in Partial Fulfillment of the Requirements for the Degree
More informationEliminating Class Noise in Large Datasets
Eliminating Class Noise in Lar Datasets Xingquan Zhu Xindong Wu Qijun Chen Department of Computer Science, University of Vermont, Burlington, VT 05405, USA XQZHU@CS.UVM.EDU XWU@CS.UVM.EDU QCHEN@CS.UVM.EDU
More informationBootstrapping Big Data
Bootstrapping Big Data Ariel Kleiner Ameet Talwalkar Purnamrita Sarkar Michael I. Jordan Computer Science Division University of California, Berkeley {akleiner, ameet, psarkar, jordan}@eecs.berkeley.edu
More informationData Mining Methods: Applications for Institutional Research
Data Mining Methods: Applications for Institutional Research Nora Galambos, PhD Office of Institutional Research, Planning & Effectiveness Stony Brook University NEAIR Annual Conference Philadelphia 2014
More informationDecision Tree Learning on Very Large Data Sets
Decision Tree Learning on Very Large Data Sets Lawrence O. Hall Nitesh Chawla and Kevin W. Bowyer Department of Computer Science and Engineering ENB 8 University of South Florida 4202 E. Fowler Ave. Tampa
More informationChapter 12 Bagging and Random Forests
Chapter 12 Bagging and Random Forests Xiaogang Su Department of Statistics and Actuarial Science University of Central Florida - 1 - Outline A brief introduction to the bootstrap Bagging: basic concepts
More informationWeb Hosting Service Level Agreements
Chapter 5 Web Hosting Service Level Agreements Alan King (Mentor) 1, Mehmet Begen, Monica Cojocaru 3, Ellen Fowler, Yashar Ganjali 4, Judy Lai 5, Taejin Lee 6, Carmeliza Navasca 7, Daniel Ryan Report prepared
More informationFinding statistical patterns in Big Data
Finding statistical patterns in Big Data Patrick Rubin-Delanchy University of Bristol & Heilbronn Institute for Mathematical Research IAS Research Workshop: Data science for the real world (workshop 1)
More informationUsing multiple models: Bagging, Boosting, Ensembles, Forests
Using multiple models: Bagging, Boosting, Ensembles, Forests Bagging Combining predictions from multiple models Different models obtained from bootstrap samples of training data Average predictions or
More informationOn Cross-Validation and Stacking: Building seemingly predictive models on random data
On Cross-Validation and Stacking: Building seemingly predictive models on random data ABSTRACT Claudia Perlich Media6 New York, NY 10012 claudia@media6degrees.com A number of times when using cross-validation
More informationStatistical Machine Learning
Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes
More informationHandling missing data in large data sets. Agostino Di Ciaccio Dept. of Statistics University of Rome La Sapienza
Handling missing data in large data sets Agostino Di Ciaccio Dept. of Statistics University of Rome La Sapienza The problem Often in official statistics we have large data sets with many variables and
More informationSocial Media Mining. Data Mining Essentials
Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers
More informationREVIEW OF ENSEMBLE CLASSIFICATION
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IJCSMC, Vol. 2, Issue.
More informationWeather forecast prediction: a Data Mining application
Weather forecast prediction: a Data Mining application Ms. Ashwini Mandale, Mrs. Jadhawar B.A. Assistant professor, Dr.Daulatrao Aher College of engg,karad,ashwini.mandale@gmail.com,8407974457 Abstract
More informationTree Ensembles: The Power of Post- Processing. December 2012 Dan Steinberg Mikhail Golovnya Salford Systems
Tree Ensembles: The Power of Post- Processing December 2012 Dan Steinberg Mikhail Golovnya Salford Systems Course Outline Salford Systems quick overview Treenet an ensemble of boosted trees GPS modern
More informationChapter 11 Boosting. Xiaogang Su Department of Statistics University of Central Florida - 1 -
Chapter 11 Boosting Xiaogang Su Department of Statistics University of Central Florida - 1 - Perturb and Combine (P&C) Methods have been devised to take advantage of the instability of trees to create
More informationLongitudinal Meta-analysis
Quality & Quantity 38: 381 389, 2004. 2004 Kluwer Academic Publishers. Printed in the Netherlands. 381 Longitudinal Meta-analysis CORA J. M. MAAS, JOOP J. HOX and GERTY J. L. M. LENSVELT-MULDERS Department
More informationRegularized Logistic Regression for Mind Reading with Parallel Validation
Regularized Logistic Regression for Mind Reading with Parallel Validation Heikki Huttunen, Jukka-Pekka Kauppi, Jussi Tohka Tampere University of Technology Department of Signal Processing Tampere, Finland
More informationA Hybrid Approach to Learn with Imbalanced Classes using Evolutionary Algorithms
Proceedings of the International Conference on Computational and Mathematical Methods in Science and Engineering, CMMSE 2009 30 June, 1 3 July 2009. A Hybrid Approach to Learn with Imbalanced Classes using
More informationIntroduction to Data Mining
Introduction to Data Mining 1 Why Data Mining? Explosive Growth of Data Data collection and data availability Automated data collection tools, Internet, smartphones, Major sources of abundant data Business:
More informationUsing Mining@Home for Distributed Ensemble Learning
Using Mining@Home for Distributed Ensemble Learning Eugenio Cesario 1, Carlo Mastroianni 1, and Domenico Talia 1,2 1 ICAR-CNR, Italy {cesario,mastroianni}@icar.cnr.it 2 University of Calabria, Italy talia@deis.unical.it
More informationSTATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and
Clustering Techniques and STATISTICA Case Study: Defining Clusters of Shopping Center Patrons STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table
More informationData-driven Multi-touch Attribution Models
Data-driven Multi-touch Attribution Models Xuhui Shao Turn, Inc. 835 Main St. Redwood City, CA 94063 xuhui.shao@turn.com Lexin Li Department of Statistics North Carolina State University Raleigh, NC 27695
More informationHow To Identify A Churner
2012 45th Hawaii International Conference on System Sciences A New Ensemble Model for Efficient Churn Prediction in Mobile Telecommunication Namhyoung Kim, Jaewook Lee Department of Industrial and Management
More informationStatistics for BIG data
Statistics for BIG data Statistics for Big Data: Are Statisticians Ready? Dennis Lin Department of Statistics The Pennsylvania State University John Jordan and Dennis K.J. Lin (ICSA-Bulletine 2014) Before
More informationComparison of K-means and Backpropagation Data Mining Algorithms
Comparison of K-means and Backpropagation Data Mining Algorithms Nitu Mathuriya, Dr. Ashish Bansal Abstract Data mining has got more and more mature as a field of basic research in computer science and
More informationChapter ML:XI (continued)
Chapter ML:XI (continued) XI. Cluster Analysis Data Mining Overview Cluster Analysis Basics Hierarchical Cluster Analysis Iterative Cluster Analysis Density-Based Cluster Analysis Cluster Evaluation Constrained
More information