Distributed Regression For Heterogeneous Data Sets
Yan Xing, Michael G. Madden, Jim Duggan, Gerard Lyons
Department of Information Technology
National University of Ireland, Galway, Ireland
{yan.xing, michael.madden, jim.duggan,

Abstract. Existing meta-learning based distributed data mining approaches do not explicitly address context heterogeneity across individual sites. This limitation constrains their application where distributed data are not identically and independently distributed. By modeling heterogeneously distributed data with hierarchical models, this paper extends traditional meta-learning techniques so that they can be used successfully in distributed scenarios with context heterogeneity.

1 Introduction

Distributed data mining (DDM) is an active research sub-area of data mining. It has been applied successfully in cases where data are inherently distributed among different loosely coupled sites connected by a network [1-3]. By transmitting high-level information, DDM techniques can discover new knowledge from dispersed data. Such high-level information not only has reduced storage and bandwidth requirements, but also maintains the privacy of individual records. Most DDM algorithms for regression or classification fit within a meta-learning framework based on ensemble learning. They accept the implicit assumption that the probability distributions of the dispersed data are homogeneous: the data are merely geographically or physically distributed across various sites, and there are no differences among the sites themselves. Although a distinction is commonly made between homogeneous and heterogeneous data, it is limited to database schemata. In reality, heterogeneity across sites is often the rule rather than the exception [4]. One example is a health virtual organization comprising several hospitals, which differ in expertise, skills, equipment and treatment.
Another example is a loosely coupled commercial network of several retailers, which adopt different business policies, sell different products and apply different price standards. Wirth describes this kind of scenario as "distribution is part of the semantics" [5], a setting that has seldom been discussed and addressed.

1 The support of the Informatics Research Initiative of Enterprise Ireland is gratefully acknowledged.
2 Model of Distributed Data Across Heterogeneous Sites

Distributed data across homogeneous sites may be regarded as random samples from the same underlying population, even though the probability distribution of the population is unknown. The differences among the data sets are just random sampling errors; from a statistical perspective, the distributed data are identically and independently distributed (IID) [6, 7]. The data set stored at the k-th site consists of data {(y_ki, x_ki), i = 1, ..., N_k}, where the y_ki are numerical responses for regression problems, N_k is the sample size at the k-th site, k = 1, ..., K, and K is the total number of individual sites. If all the sites are homogeneous, IID data drawn from a normal distribution with mean θ and variance σ² can be expressed as:

    y_ki ~(IID) N(θ, σ²)    (1)

If e_ki is the random sampling error at the k-th site, and f(x_ki) is the real global regression model, then equation (1) can be rewritten as:

    y_ki = θ + e_ki = f(x_ki) + e_ki,  e_ki ~(IID) N(0, σ²)    (2)

When there is heterogeneity across the different sites, the distributed data are not IID. The differences across the various sites comprise not only sampling errors, but also context heterogeneity caused by the differing features of the sites. In practice, it is difficult to determine the exact sources of this context heterogeneity. In the earlier hospital example, the mixed effects of differences in hospital expertise, equipment, skills and treatment cause context heterogeneity, and it is hard or even impossible to measure how much of the heterogeneity is caused by any one of these factors. Furthermore, some differences across the hospitals may be unobservable. The same holds for other domains such as the network of retailers. We therefore need to model context heterogeneity in a principled way.
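To make the homogeneous case of equations (1) and (2) concrete, the following minimal NumPy sketch (all variable names and numeric values are our own illustration, not the paper's) draws K sites from a single population and confirms that site means differ only by sampling error:

```python
import numpy as np

rng = np.random.default_rng(0)

# Eq. (1): homogeneous sites all sample the same population N(theta, sigma^2);
# site-to-site differences are random sampling error only.
theta, sigma, K, N = 5.0, 1.0, 10, 200
sites = [theta + sigma * rng.standard_normal(N) for _ in range(K)]

# Each site mean deviates from theta by O(sigma / sqrt(N)) ~ 0.07 here,
# so the spread of site means is small relative to sigma.
site_means = np.array([s.mean() for s in sites])
spread = site_means.std(ddof=1)
```

Under heterogeneity, as the next section shows, this spread acquires an extra between-site component τ² on top of the σ²/N sampling term.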
In statistical meta-analysis, a popular way to model unobservable or immeasurable context heterogeneity is to assume that the heterogeneity across different sites is random; in other words, context heterogeneity derives from essentially random differences among sites whose sources cannot be identified or are unobservable [7]. Distributed data across sites with randomly distributed context heterogeneity are often regarded as conditionally IID [4]. This leads to a two-level hierarchical model that describes context heterogeneity with mixture models and employs latent variables in a hierarchical structure [4, 8, 9]. Assuming that the contexts θ_k of the different sites are normally distributed with mean θ and variance τ², and that the data y_ki at the k-th site are normally distributed with mean θ_k and variance σ², the two-level hierarchical model of distributed data across heterogeneous sites is:
    Between-sites level:   θ_k ~(IID) N(θ, τ²)
    Within the k-th site:  (y_ki | θ_k) ~(IID) N(θ_k, σ²)    (3)

If t_k is the random sampling error of context across the different sites, and the residuals t_k and e_ki at the two levels are independent, then equation (3) can be rewritten as:

    y_ki = θ + e_ki + t_k = f(x_ki) + e_ki + t_k    (4)
    e_ki ~(IID) N(0, σ²),  t_k ~(IID) N(0, τ²)

When τ² = 0, then t_k = 0, equation (3) reduces to equation (1), and equation (4) reduces to equation (2). A distributed scenario with homogeneous sites is therefore just a special case of the generic situation. In the theory of hierarchical modeling, the intra-class correlation (ICC) measures how much of the total variance of a model is caused by context heterogeneity. It is calculated by:

    ICC = τ² / (τ² + σ²)    (5)

When the various sites are homogeneous, ICC = 0. The larger the ICC, the more of the model variance is caused by context heterogeneity.

3 Towards Context-based Meta-learning

Once distributed data across heterogeneous sites are modeled as in equation (4), the main task of distributed regression is to obtain the global regression model f(x_ki). In hierarchical modeling, statistical linear and nonlinear multilevel model fitting is done by iterative maximum-likelihood based algorithms [4, 9]. Unfortunately, all of these algorithms are designed for centralized data. Even if we modified them for a distributed environment, their iterative nature would place a significant communication burden on the various sites [10]. Since most existing DDM approaches fit within a meta-learning framework, it is worthwhile to extend the state-of-the-art techniques so that they can be used successfully in distributed scenarios with heterogeneous sites.

3.1 Traditional Distributed Meta-learning

When the different sites are homogeneous, equations (1) and (2) apply, and the implicit assumption required for an ensemble of learners to be more accurate than the average performance of its individual members is satisfied [11, 12].
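The two-level model of equation (3) and the ICC of equation (5) can be simulated directly. In this sketch (NumPy; the moment-based estimator and all names are our own framing, not the paper's), the variance of the site means is approximately τ² + σ²/N, which lets us recover the ICC from the data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two-level model, eq. (3)/(4): site contexts theta_k ~ N(theta, tau^2),
# observations within site k ~ N(theta_k, sigma^2).
theta, sigma, tau, K, N = 5.0, 1.0, 1.0, 100, 200   # true ICC = 0.5
theta_k = theta + tau * rng.standard_normal(K)       # between-sites level
sites = [tk + sigma * rng.standard_normal(N) for tk in theta_k]

# Moment-based recovery of eq. (5): within-site variance estimates sigma^2;
# the variance of site means estimates tau^2 + sigma^2/N.
within = np.mean([s.var(ddof=1) for s in sites])
between = np.var([s.mean() for s in sites], ddof=1)
tau2_hat = between - within / N
icc_hat = tau2_hat / (tau2_hat + within)   # close to the true ICC of 0.5
```

This is the quantity that the simulation experiments later sweep from 0.0 to 0.9 by adjusting τ².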
Assuming the base models (or learners) generated at the different sites are f_k(x_i), k = 1, 2, ..., K, the final ensemble model (meta-learner) is
f_A(x_i) = E_k[f_k(x_i)], where E_k denotes the expectation over k, and the subscript A in f_A denotes aggregation. Distributed meta-learning takes f_A(x_i) as the estimate of the real global model f(x_i). Meta-learning based distributed regression follows three main steps. First, generate base regression models at each site using a learning algorithm. Second, collect the base models at a central site, and produce meta-level data from a separate validation set and the predictions generated by the base models on it. Last, generate the final regression model from the meta-level data via a combiner (un-weighted or weighted averaging).

3.2 Context-based Meta-learning

When the different sites are heterogeneous, equations (3) and (4) apply. The variance of data within a given site is σ², but the variance of data from different sites is σ² + τ². The distributed data are therefore not IID, and the criterion for the success of meta-learning is not satisfied. We need an approach that deals with the context variance τ² within the meta-learning framework; we call it the context-based meta-learning approach.

3.2.1 Global Model Estimation

According to equation (4), given the context θ_k of the k-th site, data within that site can be expressed as:

    (y_ki | θ_k) = θ_k + e_ki,  e_ki ~(IID) N(0, σ²)    (6)

and {θ_k, k = 1, 2, ..., K} has the following distribution:

    θ_k = θ + t_k = f(x_i) + t_k,  t_k ~(IID) N(0, τ²)    (7)

So the base models f_k(x_i) generated at the local sites are estimates of θ_k, k = 1, 2, ..., K. Given θ_k and x_i, suppose θ̂_k is the estimate of the real θ_k; then:

    f_k(x_i) = θ̂_k ≈ θ_k = θ + t_k = f(x_i) + t_k    (8)

The final ensemble model f_A(x_i) is then:

    f_A(x_i) = E_k[f_k(x_i)] ≈ f(x_i) + E_k(t_k)    (9)

Because E_k(t_k) = 0, we can use f_A(x_i) to estimate the real global model f(x_i).
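The three steps of meta-learning based distributed regression can be sketched as follows. This is a simplified illustration, not the paper's implementation: it uses least-squares base learners in place of the regression trees used in the experiments, and combines them with the un-weighted-average combiner f_A(x) = (1/K) Σ_k f_k(x):

```python
import numpy as np

rng = np.random.default_rng(2)

def fit_base(X, y):
    """Least-squares base model for one site; returns a predict function."""
    Xb = np.c_[np.ones(len(X)), X]                 # add intercept column
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return lambda Xq: np.c_[np.ones(len(Xq)), Xq] @ w

# Step 1: each site fits a base model on its local data.
# Illustrative true global model: f(x) = 1 + 3x, homogeneous sites.
K, N = 5, 100
base_models = []
for _ in range(K):
    X = rng.uniform(0.0, 1.0, (N, 1))
    y = 1.0 + 3.0 * X[:, 0] + 0.1 * rng.standard_normal(N)
    base_models.append(fit_base(X, y))

# Steps 2-3: collect the base models and combine their predictions
# with the un-weighted-average combiner f_A(x) = (1/K) * sum_k f_k(x).
def f_A(Xq):
    return np.mean([m(Xq) for m in base_models], axis=0)

pred = f_A(np.array([[0.0], [1.0]]))   # approximately [1.0, 4.0]
```

With homogeneous sites, as here, the averaged ensemble tracks the true f(x); the sections below address what changes when each site carries a context residual t_k.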
3.2.2 Context Residual Estimation

Because of context heterogeneity, when we use the ensemble model for prediction at a given k-th site, we need to add the context residual t_k of that site. From equations (8) and (9), we have:

    t_k = (1/N_k) Σ_{i=1}^{N_k} [f_k(x_ki) − f_A(x_ki)]    (10)

With equation (10), t_k will never be exactly zero even when the real context residual is zero. So once we obtain t_k from equation (10), we use a two-tailed t-test to check the null hypothesis H0: t_k = 0, and calculate the context residual of the k-th site as:

    t_k = 0,    if H0 is accepted given α
    t_k = t_k,  if H0 is rejected given α    (11)

where α is the level of significance. Once we have all the values {t_k, k = 1, 2, ..., K}, we can calculate the context-level variance τ².

3.2.3 Algorithm for Context-based Meta-learning

Our context-based algorithm for distributed regression follows six steps:
1. At each site, use cross-validation to generate a base regression model.
2. At each site, collect the base models and produce meta-level data from the predictions generated by the base models on the local data set.
3. At each site, generate the ensemble model from the meta-level data via a combiner (un-weighted or weighted averaging).
4. At each site, calculate its context residual with equations (10) and (11).
5. At each site, generate the final regression model of this site by equation (8).
6. Collect all the base models and context residuals at a central site, and calculate the context-level variance.

4 Simulation Experiment

In practice, different distributed scenarios have different levels of intra-class correlation, ranging from ICC = 0, the homogeneous case, to ICC → 1, where the variance is caused mainly by context heterogeneity. To evaluate our approach across different values of ICC, we use simulated data sets, because it is difficult to obtain real-world distributed data sets that satisfy all our requirements.
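Equations (10) and (11) amount to a one-sample t-test on the per-point gaps between the local model's and the ensemble's predictions on the site's own data. A minimal sketch (NumPy; the function name, the manual t-statistic, and the large-sample critical value 1.96 approximating α = 0.05 are our assumptions, not the paper's):

```python
import numpy as np

def context_residual(local_pred, ensemble_pred, t_crit=1.96):
    """Eq. (10)/(11): mean gap f_k(x_ki) - f_A(x_ki) over the site's data,
    zeroed unless a two-tailed t-test rejects H0: t_k = 0.
    t_crit = 1.96 approximates alpha = 0.05 for large N_k."""
    gaps = np.asarray(local_pred) - np.asarray(ensemble_pred)
    n = len(gaps)
    t_k = gaps.mean()                          # eq. (10)
    se = gaps.std(ddof=1) / np.sqrt(n)         # standard error of the mean
    t_stat = t_k / se if se > 0 else float("inf")
    return t_k if abs(t_stat) > t_crit else 0.0   # eq. (11)
```

A site whose local predictions sit systematically above the ensemble keeps its mean gap as t_k; a site whose gaps fluctuate around zero gets t_k = 0, so the ensemble is used unmodified there.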
4.1 Simulation Data Sets

The three simulation data sets we use are Friedman's data sets [13]. They were originally generated by Friedman and used by Breiman in his oft-cited paper about bagging [12]. We have modified the three data sets so that they are compatible with our distributed scenarios.

Friedman #1: there are ten independent predictor variables x_1, ..., x_10, each uniformly distributed over [0, 1]. We set the total number of sites to K = 10. At the k-th site, the sample size is 200 for the training set and 1000 for the testing set. The response is given by:

    #1: y_ki = 10 sin(π x_1 x_2) + 20(x_3 − 0.5)² + 10 x_4 + 5 x_5 + e_ki + t_k    (12)
        e_ki ~(IID) N(0, 1),  t_k ~(IID) N(0, τ²)

We adjust the value of τ² so that we obtain ICC = 0.0, 0.1, ..., 0.9 respectively.

Friedman #2, #3: these two examples are four-variable data sets with:

    #2: y_ki = (x_1² + (x_2 x_3 − 1/(x_2 x_4))²)^(1/2) + e_ki + t_k    (13)
        e_ki ~(IID) N(0, σ_2²),  t_k ~(IID) N(0, τ_2²)

    #3: y_ki = tan⁻¹[(x_2 x_3 − 1/(x_2 x_4)) / x_1] + e_ki + t_k    (14)
        e_ki ~(IID) N(0, σ_3²),  t_k ~(IID) N(0, τ_3²)

where x_1, x_2, x_3, x_4 are uniformly distributed as 0 ≤ x_1 ≤ 100, 20 ≤ x_2/(2π) ≤ 280, 0 ≤ x_3 ≤ 1 and 1 ≤ x_4 ≤ 11. The total number of sites and the training and testing sample sizes of each site are the same as for #1. The parameters σ_2, σ_3 are selected to give 3:1 signal/noise ratios, and τ_2, τ_3 are adjusted for the same purpose as in #1.

4.2 Simulation Results

To compare the prediction accuracy of the traditional meta-learning approach and our context-based meta-learning approach under different values of ICC, we implemented both algorithms. The base learner is a regression-tree algorithm implemented in Weka [14]; the meta-learner is un-weighted averaging, with the significance level α held fixed. We evaluate our approach from two angles: the whole virtual organization and its individual sites.

4.2.1 The Whole Organization

From the point of view of the whole organization, the target of DDM is to discover the global trend. Fig. 1 shows the global prediction accuracy under different ICC values.
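Before turning to the results, the modified Friedman #1 generator of equation (12) can be sketched as below (NumPy; the helper name and the way τ is solved from a target ICC via equation (5) are our own framing of the setup described in Sect. 4.1):

```python
import numpy as np

rng = np.random.default_rng(3)

def friedman1_site(n, t_k, rng):
    """One site's sample of the modified Friedman #1 data, eq. (12):
    ten U[0,1] predictors (only x1..x5 enter the response), unit
    sampling noise e_ki, plus the site's context residual t_k."""
    X = rng.uniform(0.0, 1.0, (n, 10))
    y = (10.0 * np.sin(np.pi * X[:, 0] * X[:, 1])
         + 20.0 * (X[:, 2] - 0.5) ** 2
         + 10.0 * X[:, 3]
         + 5.0 * X[:, 4]
         + rng.standard_normal(n)    # e_ki ~ N(0, 1)
         + t_k)                      # context residual of this site
    return X, y

# tau is tuned from the target ICC via eq. (5): tau^2 = ICC * sigma^2 / (1 - ICC).
icc_target, sigma2, K = 0.3, 1.0, 10
tau = (icc_target * sigma2 / (1.0 - icc_target)) ** 0.5
t = tau * rng.standard_normal(K)                      # t_k ~ N(0, tau^2)
train = [friedman1_site(200, t_k, rng) for t_k in t]  # one (X, y) per site
```

Sweeping icc_target over 0.0, 0.1, ..., 0.9 reproduces the range of heterogeneity levels examined in the experiments.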
From Fig. 1 we can see that, as ICC increases, the prediction accuracy of the traditional meta-learning approach decreases, while the prediction accuracy of our approach remains essentially the same. When ICC = 0, the prediction accuracies of the two approaches are almost identical.

4.2.2 Individual Sites

From the point of view of each individual site, the goal of DDM is to obtain a more accurate model than the local model (the base model created only from its local data). In practice, when ICC > 0.5, an individual site usually uses only its local model, because the context heterogeneity is too large.

Fig. 1. Global prediction accuracy under different ICC values (Friedman #1, #2 and #3, average of 20 runs)
Fig. 2. Prediction accuracy of individual sites when ICC = 0.3 (Friedman #1, #2 and #3, average of 20 runs)

Fig. 2 compares the prediction accuracy of the local model, the meta-learning model and the model created with our context-based meta-learning approach when ICC = 0.3. For those sites with relatively larger context residuals (5th, 8th, 10th for #1; 1st, 4th, 5th, 8th for #2; 3rd, 4th, 7th for #3), meta-learning behaves worse than our approach, and sometimes even worse than the local models (1st, 4th, 5th, 8th for #2; 4th, 7th for #3). For those sites with very small context residuals, the performance of our approach is no worse than meta-learning. So the overall performance of our approach is the best.

5 Discussion of Our Approach

Two important issues relate to our approach: the number of sites and the sample size at individual sites.

5.1 Number of Sites

In practical distributed scenarios, the number of sites is usually much smaller than the number of data points at each individual site; the extreme case is only two sites.
From the perspective of statistics, the larger the number of sites, the more accurately we can estimate the quantity of context heterogeneity. When the number of sites is extremely low, we usually underestimate the quantity of context heterogeneity [9]. This kind of underestimation erodes the advantage of our approach. We repeated our simulation experiment with 5 and 2 sites respectively, and the results demonstrate this.

Fig. 3. Comparison of global prediction accuracy when the number of sites is extremely small, sites = 2 and 5 (Friedman #3, average of 20 runs)

Fig. 3 shows the global prediction accuracy for the Friedman #3 data when the number of sites is extremely small. When the number of sites is 5, the advantage of our approach is still obvious; when the number of sites is 2, the advantage is less obvious.

5.2 Sample Size of Data at Individual Sites

At each individual site, as the sample size of the training data set increases, the accuracy of using θ̂_k to estimate θ_k in equation (8) increases. So we can obtain a more accurate estimate of a site's context residual when we have more local data, and thus finally achieve higher prediction accuracy.
Comparing the two graphs in Fig. 4, it can be seen that the advantage of our approach is more pronounced when there are more training data at each individual site.

Fig. 4. Comparison of prediction accuracy of individual sites with different training sizes, 50 and 200 (Friedman #1, testing size = 1000 for each site, ICC = 0.3, average of 20 runs)

6 Related Work

Meta-learning is one popular approach among DDM techniques. The most successful meta-learning based DDM application is that of Prodromidis in the domain of credit card fraud detection [3]. However, most existing DDM applications within the meta-learning framework do not explicitly address context heterogeneity across individual sites. Wirth defines distributed scenarios with context heterogeneity as "distribution is part of the semantics" in his work [5], but does not give an approach to handle them. The only work we found that explicitly addresses context heterogeneity is that of Páircéir [15], where statistical hierarchical models are used to discover multi-level association rules from dispersed hierarchical data. In our previous work, we used hierarchical modeling to address context heterogeneity in the domain of virtual organizations [10]. Although we obtained some encouraging results, we also found that iteration-based algorithms cause heavy communication traffic among the individual sites.

7 Summary and Future Work

Through an analysis of the limitations of the distributed meta-learning approach, we model distributed data across heterogeneously distributed sites with two-level statistical hierarchical models, and extend the traditional meta-learning approach to suit non-IID distributed data. We successfully apply our context-based meta-learning
approach to several simulated data sets for distributed regression, and discuss the important issues related to the approach. In future work we will use real-world distributed data sets to test our approach, and then extend it to distributed classification problems.

References

1. Provost, F.: Distributed Data Mining: Scaling Up and Beyond. In: Kargupta, H., Chan, P.K. (eds.): Advances in Distributed and Parallel Knowledge Discovery. AAAI/MIT Press (2000)
2. Park, B.-H., Kargupta, H.: Distributed Data Mining: Algorithms, Systems, and Applications. In: Ye, N. (ed.): Data Mining Handbook (2002)
3. Prodromidis, A.L., Chan, P.K., Stolfo, S.J.: Meta-Learning in Distributed Data Mining Systems: Issues and Approaches. In: Kargupta, H., Chan, P.K. (eds.): Advances in Distributed and Parallel Knowledge Discovery, Chapter 3. AAAI/MIT Press (2000)
4. Draper, D.: Bayesian Hierarchical Modeling. Tutorial at ISBA2000, Crete, Greece (2000)
5. Wirth, R., Borth, M., Hipp, J.: When Distribution is Part of the Semantics: A New Problem Class for Distributed Knowledge Discovery. In: Workshop on Ubiquitous Data Mining for Mobile and Distributed Environments, 5th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'01) (2001)
6. Brandt, S.: Data Analysis: Statistical and Computational Methods for Scientists and Engineers, 3rd edn. Springer
7. Lipsey, M.W., Wilson, D.B.: Practical Meta-Analysis. Sage Publications (2001)
8. Kreft, I., De Leeuw, J.: Introducing Multilevel Modeling. Sage Publications (1998)
9. Goldstein, H.: Multilevel Statistical Models, 2nd edn. Arnold
10. Xing, Y., Duggan, J., Madden, M.G., Lyons, G.J.: A Multi-Agent System for Customer Behavior Prediction in a Virtual Organization. Technical Report for Enterprise Ireland
11. Dietterich, T.G.: Ensemble Methods in Machine Learning. Lecture Notes in Computer Science (2000)
12. Breiman, L.: Bagging Predictors. Machine Learning 24 (1996)
13. Friedman, J.H.: Multivariate Adaptive Regression Splines. Annals of Statistics 19 (1991)
14. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann (2000)
15. Páircéir, R., McClean, S., Scotney, B.: Discovery of Multi-level Rules and Exceptions from a Distributed Database. In: Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD2000), Boston, MA, USA (2000)
More informationMachine Learning and Data Mining. Fundamentals, robotics, recognition
Machine Learning and Data Mining Fundamentals, robotics, recognition Machine Learning, Data Mining, Knowledge Discovery in Data Bases Their mutual relations Data Mining, Knowledge Discovery in Databases,
More informationStudying Auto Insurance Data
Studying Auto Insurance Data Ashutosh Nandeshwar February 23, 2010 1 Introduction To study auto insurance data using traditional and non-traditional tools, I downloaded a well-studied data from http://www.statsci.org/data/general/motorins.
More informationCS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning.
Lecture Machine Learning Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square, x5 http://www.cs.pitt.edu/~milos/courses/cs75/ Administration Instructor: Milos Hauskrecht milos@cs.pitt.edu 539 Sennott
More informationAn Analysis of Missing Data Treatment Methods and Their Application to Health Care Dataset
P P P Health An Analysis of Missing Data Treatment Methods and Their Application to Health Care Dataset Peng Liu 1, Elia El-Darzi 2, Lei Lei 1, Christos Vasilakis 2, Panagiotis Chountas 2, and Wei Huang
More informationnot possible or was possible at a high cost for collecting the data.
Data Mining and Knowledge Discovery Generating knowledge from data Knowledge Discovery Data Mining White Paper Organizations collect a vast amount of data in the process of carrying out their day-to-day
More informationPredict the Popularity of YouTube Videos Using Early View Data
000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050
More informationEnsemble Learning Better Predictions Through Diversity. Todd Holloway ETech 2008
Ensemble Learning Better Predictions Through Diversity Todd Holloway ETech 2008 Outline Building a classifier (a tutorial example) Neighbor method Major ideas and challenges in classification Ensembles
More informationMultiple Classifiers -Integration and Selection
1 A Dynamic Integration Algorithm with Ensemble of Classifiers Seppo Puuronen 1, Vagan Terziyan 2, Alexey Tsymbal 2 1 University of Jyvaskyla, P.O.Box 35, FIN-40351 Jyvaskyla, Finland sepi@jytko.jyu.fi
More informationRevenue Management with Correlated Demand Forecasting
Revenue Management with Correlated Demand Forecasting Catalina Stefanescu Victor DeMiguel Kristin Fridgeirsdottir Stefanos Zenios 1 Introduction Many airlines are struggling to survive in today's economy.
More informationLearning bagged models of dynamic systems. 1 Introduction
Learning bagged models of dynamic systems Nikola Simidjievski 1,2, Ljupco Todorovski 3, Sašo Džeroski 1,2 1 Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia 2 Jožef Stefan
More informationComparing the Results of Support Vector Machines with Traditional Data Mining Algorithms
Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Scott Pion and Lutz Hamel Abstract This paper presents the results of a series of analyses performed on direct mail
More informationUtility-Based Fraud Detection
Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence Utility-Based Fraud Detection Luis Torgo and Elsa Lopes Fac. of Sciences / LIAAD-INESC Porto LA University of
More informationEnsemble Data Mining Methods
Ensemble Data Mining Methods Nikunj C. Oza, Ph.D., NASA Ames Research Center, USA INTRODUCTION Ensemble Data Mining Methods, also known as Committee Methods or Model Combiners, are machine learning methods
More informationICPSR Summer Program
ICPSR Summer Program Data Mining Tools for Exploring Big Data Department of Statistics Wharton School, University of Pennsylvania www-stat.wharton.upenn.edu/~stine Modern data mining combines familiar
More informationData Mining. Concepts, Models, Methods, and Algorithms. 2nd Edition
Brochure More information from http://www.researchandmarkets.com/reports/2171322/ Data Mining. Concepts, Models, Methods, and Algorithms. 2nd Edition Description: This book reviews state-of-the-art methodologies
More informationToward Scalable Learning with Non-uniform Class and Cost Distributions: A Case Study in Credit Card Fraud Detection
Toward Scalable Learning with Non-uniform Class and Cost Distributions: A Case Study in Credit Card Fraud Detection Philip K. Chan Computer Science Florida Institute of Technolog7 Melbourne, FL 32901 pkc~cs,
More informationIdentifying At-Risk Students Using Machine Learning Techniques: A Case Study with IS 100
Identifying At-Risk Students Using Machine Learning Techniques: A Case Study with IS 100 Erkan Er Abstract In this paper, a model for predicting students performance levels is proposed which employs three
More informationWelcome. Data Mining: Updates in Technologies. Xindong Wu. Colorado School of Mines Golden, Colorado 80401, USA
Welcome Xindong Wu Data Mining: Updates in Technologies Dept of Math and Computer Science Colorado School of Mines Golden, Colorado 80401, USA Email: xwu@ mines.edu Home Page: http://kais.mines.edu/~xwu/
More informationIntroduction to Longitudinal Data Analysis
Introduction to Longitudinal Data Analysis Longitudinal Data Analysis Workshop Section 1 University of Georgia: Institute for Interdisciplinary Research in Education and Human Development Section 1: Introduction
More informationPerspectives on Data Mining
Perspectives on Data Mining Niall Adams Department of Mathematics, Imperial College London n.adams@imperial.ac.uk April 2009 Objectives Give an introductory overview of data mining (DM) (or Knowledge Discovery
More informationLeveraging Ensemble Models in SAS Enterprise Miner
ABSTRACT Paper SAS133-2014 Leveraging Ensemble Models in SAS Enterprise Miner Miguel Maldonado, Jared Dean, Wendy Czika, and Susan Haller SAS Institute Inc. Ensemble models combine two or more models to
More informationAvailable online at www.sciencedirect.com Available online at www.sciencedirect.com
Available online at www.sciencedirect.com Available online at www.sciencedirect.com Procedia Procedia Engineering Engineering 00 (0 9 (0 000 000 340 344 Procedia Engineering www.elsevier.com/locate/procedia
More informationAUTOMATION OF ENERGY DEMAND FORECASTING. Sanzad Siddique, B.S.
AUTOMATION OF ENERGY DEMAND FORECASTING by Sanzad Siddique, B.S. A Thesis submitted to the Faculty of the Graduate School, Marquette University, in Partial Fulfillment of the Requirements for the Degree
More informationEliminating Class Noise in Large Datasets
Eliminating Class Noise in Lar Datasets Xingquan Zhu Xindong Wu Qijun Chen Department of Computer Science, University of Vermont, Burlington, VT 05405, USA XQZHU@CS.UVM.EDU XWU@CS.UVM.EDU QCHEN@CS.UVM.EDU
More informationBootstrapping Big Data
Bootstrapping Big Data Ariel Kleiner Ameet Talwalkar Purnamrita Sarkar Michael I. Jordan Computer Science Division University of California, Berkeley {akleiner, ameet, psarkar, jordan}@eecs.berkeley.edu
More informationData Mining Methods: Applications for Institutional Research
Data Mining Methods: Applications for Institutional Research Nora Galambos, PhD Office of Institutional Research, Planning & Effectiveness Stony Brook University NEAIR Annual Conference Philadelphia 2014
More informationDecision Tree Learning on Very Large Data Sets
Decision Tree Learning on Very Large Data Sets Lawrence O. Hall Nitesh Chawla and Kevin W. Bowyer Department of Computer Science and Engineering ENB 8 University of South Florida 4202 E. Fowler Ave. Tampa
More informationChapter 12 Bagging and Random Forests
Chapter 12 Bagging and Random Forests Xiaogang Su Department of Statistics and Actuarial Science University of Central Florida - 1 - Outline A brief introduction to the bootstrap Bagging: basic concepts
More informationWeb Hosting Service Level Agreements
Chapter 5 Web Hosting Service Level Agreements Alan King (Mentor) 1, Mehmet Begen, Monica Cojocaru 3, Ellen Fowler, Yashar Ganjali 4, Judy Lai 5, Taejin Lee 6, Carmeliza Navasca 7, Daniel Ryan Report prepared
More informationFinding statistical patterns in Big Data
Finding statistical patterns in Big Data Patrick Rubin-Delanchy University of Bristol & Heilbronn Institute for Mathematical Research IAS Research Workshop: Data science for the real world (workshop 1)
More informationUsing multiple models: Bagging, Boosting, Ensembles, Forests
Using multiple models: Bagging, Boosting, Ensembles, Forests Bagging Combining predictions from multiple models Different models obtained from bootstrap samples of training data Average predictions or
More informationOn Cross-Validation and Stacking: Building seemingly predictive models on random data
On Cross-Validation and Stacking: Building seemingly predictive models on random data ABSTRACT Claudia Perlich Media6 New York, NY 10012 claudia@media6degrees.com A number of times when using cross-validation
More informationStatistical Machine Learning
Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes
More informationHandling missing data in large data sets. Agostino Di Ciaccio Dept. of Statistics University of Rome La Sapienza
Handling missing data in large data sets Agostino Di Ciaccio Dept. of Statistics University of Rome La Sapienza The problem Often in official statistics we have large data sets with many variables and
More informationSocial Media Mining. Data Mining Essentials
Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers
More informationREVIEW OF ENSEMBLE CLASSIFICATION
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IJCSMC, Vol. 2, Issue.
More informationWeather forecast prediction: a Data Mining application
Weather forecast prediction: a Data Mining application Ms. Ashwini Mandale, Mrs. Jadhawar B.A. Assistant professor, Dr.Daulatrao Aher College of engg,karad,ashwini.mandale@gmail.com,8407974457 Abstract
More informationTree Ensembles: The Power of Post- Processing. December 2012 Dan Steinberg Mikhail Golovnya Salford Systems
Tree Ensembles: The Power of Post- Processing December 2012 Dan Steinberg Mikhail Golovnya Salford Systems Course Outline Salford Systems quick overview Treenet an ensemble of boosted trees GPS modern
More informationChapter 11 Boosting. Xiaogang Su Department of Statistics University of Central Florida - 1 -
Chapter 11 Boosting Xiaogang Su Department of Statistics University of Central Florida - 1 - Perturb and Combine (P&C) Methods have been devised to take advantage of the instability of trees to create
More informationLongitudinal Meta-analysis
Quality & Quantity 38: 381 389, 2004. 2004 Kluwer Academic Publishers. Printed in the Netherlands. 381 Longitudinal Meta-analysis CORA J. M. MAAS, JOOP J. HOX and GERTY J. L. M. LENSVELT-MULDERS Department
More informationRegularized Logistic Regression for Mind Reading with Parallel Validation
Regularized Logistic Regression for Mind Reading with Parallel Validation Heikki Huttunen, Jukka-Pekka Kauppi, Jussi Tohka Tampere University of Technology Department of Signal Processing Tampere, Finland
More informationA Hybrid Approach to Learn with Imbalanced Classes using Evolutionary Algorithms
Proceedings of the International Conference on Computational and Mathematical Methods in Science and Engineering, CMMSE 2009 30 June, 1 3 July 2009. A Hybrid Approach to Learn with Imbalanced Classes using
More informationIntroduction to Data Mining
Introduction to Data Mining 1 Why Data Mining? Explosive Growth of Data Data collection and data availability Automated data collection tools, Internet, smartphones, Major sources of abundant data Business:
More informationUsing Mining@Home for Distributed Ensemble Learning
Using Mining@Home for Distributed Ensemble Learning Eugenio Cesario 1, Carlo Mastroianni 1, and Domenico Talia 1,2 1 ICAR-CNR, Italy {cesario,mastroianni}@icar.cnr.it 2 University of Calabria, Italy talia@deis.unical.it
More informationSTATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and
Clustering Techniques and STATISTICA Case Study: Defining Clusters of Shopping Center Patrons STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table
More informationData-driven Multi-touch Attribution Models
Data-driven Multi-touch Attribution Models Xuhui Shao Turn, Inc. 835 Main St. Redwood City, CA 94063 xuhui.shao@turn.com Lexin Li Department of Statistics North Carolina State University Raleigh, NC 27695
More informationHow To Identify A Churner
2012 45th Hawaii International Conference on System Sciences A New Ensemble Model for Efficient Churn Prediction in Mobile Telecommunication Namhyoung Kim, Jaewook Lee Department of Industrial and Management
More informationStatistics for BIG data
Statistics for BIG data Statistics for Big Data: Are Statisticians Ready? Dennis Lin Department of Statistics The Pennsylvania State University John Jordan and Dennis K.J. Lin (ICSA-Bulletine 2014) Before
More informationComparison of K-means and Backpropagation Data Mining Algorithms
Comparison of K-means and Backpropagation Data Mining Algorithms Nitu Mathuriya, Dr. Ashish Bansal Abstract Data mining has got more and more mature as a field of basic research in computer science and
More informationChapter ML:XI (continued)
Chapter ML:XI (continued) XI. Cluster Analysis Data Mining Overview Cluster Analysis Basics Hierarchical Cluster Analysis Iterative Cluster Analysis Density-Based Cluster Analysis Cluster Evaluation Constrained
More information