A 2^d-Tree-Based Blocking Method for Microaggregating Very Large Data Sets

Agusti Solanas, Antoni Martínez-Ballesté, Josep Domingo-Ferrer and Josep M. Mateo-Sanz
Universitat Rovira i Virgili, Dept. of Computer Engineering and Maths,
Av. Països Catalans 26, E-43007 Tarragona, Catalonia, Spain
{agusti.solanas,antoni.martinez}@urv.net, {josep.domingo,josepmaria.mateo}@urv.net

(This work was partially supported by the Government of Catalonia under grant 2005 SGR and by the Spanish Ministry of Science and Education under project SEG C04-1 PROPRIETAS.)

Abstract

Blocking is a well-known technique used to partition a set of records into several subsets of manageable size. The standard approach to blocking is to split the records according to the values of one or several attributes (called blocking attributes). This paper presents a new blocking method based on 2^d-trees for intelligently partitioning very large data sets for microaggregation. A number of experiments have been carried out in order to compare our method with the most typical univariate one.

Keywords: Microaggregation, Blocking, Statistical Disclosure Control, Privacy.

1 Introduction

Due to the new information technologies and the massive use of computers for organizing and distributing data, the amount of stored data has increased spectacularly in a very diverse set of fields, such as finance, health care and engineering. This impressive amount of data has to be managed in order to perform a wide variety of tasks, from knowledge discovery to statistical disclosure control. Managing such large amounts of data is not straightforward, and data sets must generally be divided into smaller ones so that different techniques can be applied to each generated subset. To this end, a well-known statistical technique called blocking is used to divide a huge data set into a number of smaller ones. This technique is based on selecting a number of attributes of the data set, called blocking attributes (i.e., some columns or variables), sorting the records by those attributes and splitting them as many times as needed to obtain manageable subsets.

Some drawbacks of blocking using blocking attributes are the following: the choice of blocking attributes may not be obvious; blocking may fail to adapt to the distribution of the data (very heterogeneous blocks); and the size of some blocks may be too small or too big for some purposes (e.g., privacy preservation).

In this paper, we propose a method based on a 2^d-tree for dividing very large data sets into a number of smaller ones, and we concentrate on using this method to adapt classical microaggregation so that it can handle these huge data sets. The proposed method considers a user-definable number of dimensions d in order to divide the original data set into a number of smaller ones.

1.1 The microaggregation problem

Microaggregation is a statistical disclosure control (SDC) technique whose purpose is to cluster a set of elements into groups of at least k elements, where k is a user-definable parameter. As a result, a set of groups called a k-partition is obtained. Once all the elements in the data set are clustered, a microaggregated data set is built by replacing each original element by the centroid of the group to which it belongs. After this process, the microaggregated data set can be released and the privacy of the individuals who form the original data set can be guaranteed, because the aggregated data set will contain at least k identical and indistinguishable elements.
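The following minimal C sketch illustrates the replacement step just described: given a k-partition encoded as an array of group labels, every record is overwritten with the centroid of its group. The data layout (row-major array of doubles) and all names are assumptions made for this illustration, not the authors' code.

    #include <stdlib.h>

    /* Replace every record by the centroid of its group (the released,
     * microaggregated data set).  X: n x d row-major; group_of[i] in 0..G-1. */
    void replace_by_centroids(double *X, int n, int d, const int *group_of, int G)
    {
        double *centroid = calloc((size_t)G * d, sizeof(double));
        int *count = calloc(G, sizeof(int));

        for (int i = 0; i < n; i++) {               /* accumulate per-group sums */
            count[group_of[i]]++;
            for (int j = 0; j < d; j++)
                centroid[group_of[i] * d + j] += X[i * d + j];
        }
        for (int g = 0; g < G; g++)                 /* turn sums into centroids  */
            for (int j = 0; j < d; j++)
                if (count[g] > 0) centroid[g * d + j] /= count[g];
        for (int i = 0; i < n; i++)                 /* release the centroid in   */
            for (int j = 0; j < d; j++)             /* place of the record       */
                X[i * d + j] = centroid[group_of[i] * d + j];

        free(centroid);
        free(count);
    }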

1.2 Related work

Microaggregation has been used for several years in different countries: it started at Eurostat [3] in the early nineties, and has since then been used in Germany [17] and several other countries [9]. These years of experience and practice devoted to microaggregation have bequeathed us a variety of approaches, which we next briefly summarize.

Optimal methods:

Univariate case: In [12] a polynomial algorithm for optimal univariate microaggregation was presented.

Multivariate case: Optimal multivariate microaggregation was shown to be NP-hard in [16], so the only practical multivariate microaggregation methods are heuristic.

Heuristic methods:

Fixed-size heuristics: The best-known example of this class is Maximum Distance to Average Vector (MDAV) [5, 7, 13]. MDAV produces groups of fixed cardinality k and, when the number of records is not divisible by k, one group with a cardinality between k and 2k-1. MDAV has proven to be the best performer in terms of time and one of the best regarding the homogeneity of the resulting groups.

Variable-size heuristics: These yield k-partitions with group sizes varying between k and 2k-1. Such flexibility can be exploited to achieve higher within-group homogeneity. Variable-size methods include the genetic-inspired approach for small data sets in [18] and also [4, 14, 6].

Typically, the clusters that a microaggregation technique generates are evaluated with a compactness or homogeneity measure. The homogeneity of a group can be measured as the inverse of its sum of squared errors (SSE), which is computed as

    SSE = \sum_{i=1}^{s} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)' (x_{ij} - \bar{x}_i)    (1)

where s is the number of generated sets, n_i is the number of elements in the i-th set, x_{ij} is the j-th element in the i-th set and \bar{x}_i is the centroid of the i-th set. SSE is the most widely used measure in clustering and related areas [8, 10, 11, 15, 19]. The main objective of microaggregation is to reduce the SSE and, thus, maximize the within-group homogeneity.

Table 1. Table of main symbols

    Symbol               Meaning
    d                    Dimensions
    D                    A data set
    D_1, D_2, ..., D_G   The generated subsets
    E_g                  Number of elements in group g
    G                    Number of generated subsets
    k                    Min. elements per group
    L                    Max. elements per leaf
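As a minimal illustration of how Eq. (1) is evaluated, the following C sketch computes the SSE of a k-partition given as an array of group labels. The data layout (row-major doubles) and the function name are assumptions made for this example, not code from the papers cited above.

    #include <stdlib.h>

    /* SSE of a partition, Eq. (1): sum over all records of the squared
     * Euclidean distance to the centroid of the record's group.
     * X: n x d row-major; group_of[i] in 0..G-1. */
    double partition_sse(const double *X, int n, int d, const int *group_of, int G)
    {
        double *centroid = calloc((size_t)G * d, sizeof(double));
        int *count = calloc(G, sizeof(int));
        double sse = 0.0;

        for (int i = 0; i < n; i++) {               /* group centroids */
            count[group_of[i]]++;
            for (int j = 0; j < d; j++)
                centroid[group_of[i] * d + j] += X[i * d + j];
        }
        for (int g = 0; g < G; g++)
            for (int j = 0; j < d; j++)
                if (count[g] > 0) centroid[g * d + j] /= count[g];

        for (int i = 0; i < n; i++)                 /* squared deviations */
            for (int j = 0; j < d; j++) {
                double diff = X[i * d + j] - centroid[group_of[i] * d + j];
                sse += diff * diff;
            }

        free(centroid);
        free(count);
        return sse;
    }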
1.3 Contribution and plan of the paper

In this paper a new method for microaggregating very large data sets is presented and compared with the previously used univariate blocking technique. The most relevant aspects of our method regarding group size constraints and SSE have been studied in detail.

The rest of the paper is organized as follows. In Section 2 our method is detailed. Next, Section 3 shows the experimental results. Finally, Section 4 concludes the article.

2 The Method

2.1 Problem definition

We are addressing a problem that consists of dividing a large data set D into a number of smaller ones D_1, D_2, ..., D_G. We claim that, in order to maintain the spatial relations that may exist between the elements in the original data set, it is necessary to consider several dimensions when partitioning the data set into smaller ones. Thus, we propose to use a well-known data structure, the 2^d-tree, to divide the data taking into account a user-definable number d of dimensions. We do not only want to divide the original data: we also want to obtain a microaggregated data set that can be safely released, with the SSE minimized. To that end, the leaves of the generated 2^d-tree must be modified so that a microaggregation technique can be properly applied over each one.

Once all leaves are prepared, a microaggregation method is applied. Specifically, the MDAV heuristic [5, 7, 13] has been used to obtain the empirical results presented in Section 3. A complete explanation of how to design and implement a 2^d-tree can be found in [2]; thus, we will only elaborate on the most relevant aspects of the 2^d-tree generation and the other main steps of the method. Table 1 shows the notation used in the paper.
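For reference, the following is a compact C sketch of the fixed-size MDAV heuristic as it is usually described in the literature: while at least 3k records remain, two groups of k records are built around the two most mutually distant records, and the leftover records form one or two final groups. It is a sketch under our own assumptions (Euclidean distance, row-major double data, n >= k); it is not the implementation used for the experiments in this paper.

    #include <stdlib.h>
    #include <string.h>

    static double sq_dist(const double *a, const double *b, int d)
    {
        double s = 0.0;
        for (int j = 0; j < d; j++) { double t = a[j] - b[j]; s += t * t; }
        return s;
    }

    /* Assign the k unassigned records closest to record `seed` to group `label`. */
    static void take_group(const double *X, int n, int d, int k, int seed,
                           int *assigned, int *group_of, int label)
    {
        for (int taken = 0; taken < k; taken++) {
            int best = -1; double best_d = 0.0;
            for (int i = 0; i < n; i++) {
                if (assigned[i]) continue;
                double dd = sq_dist(&X[i * d], &X[seed * d], d);
                if (best < 0 || dd < best_d) { best = i; best_d = dd; }
            }
            assigned[best] = 1;
            group_of[best] = label;
        }
    }

    /* Centroid of the still-unassigned records. */
    static void centroid_of_rest(const double *X, int n, int d, const int *assigned,
                                 int remaining, double *centroid)
    {
        memset(centroid, 0, (size_t)d * sizeof(double));
        for (int i = 0; i < n; i++)
            if (!assigned[i])
                for (int j = 0; j < d; j++) centroid[j] += X[i * d + j];
        for (int j = 0; j < d; j++) centroid[j] /= remaining;
    }

    /* Unassigned record furthest from `point`. */
    static int furthest_from(const double *X, int n, int d, const int *assigned,
                             const double *point)
    {
        int r = -1; double dr = -1.0;
        for (int i = 0; i < n; i++) {
            if (assigned[i]) continue;
            double dd = sq_dist(&X[i * d], point, d);
            if (dd > dr) { dr = dd; r = i; }
        }
        return r;
    }

    void mdav(const double *X, int n, int d, int k, int *group_of)
    {
        int *assigned = calloc(n, sizeof(int));
        double *centroid = malloc((size_t)d * sizeof(double));
        int remaining = n, label = 0;

        while (remaining >= 3 * k) {
            centroid_of_rest(X, n, d, assigned, remaining, centroid);
            int r = furthest_from(X, n, d, assigned, centroid);   /* furthest from centroid */
            int s = furthest_from(X, n, d, assigned, &X[r * d]);  /* furthest from r        */
            take_group(X, n, d, k, r, assigned, group_of, label++);
            take_group(X, n, d, k, s, assigned, group_of, label++);
            remaining -= 2 * k;
        }
        if (remaining >= 2 * k) {          /* between 2k and 3k-1 records left */
            centroid_of_rest(X, n, d, assigned, remaining, centroid);
            int r = furthest_from(X, n, d, assigned, centroid);
            take_group(X, n, d, k, r, assigned, group_of, label++);
        }
        for (int i = 0; i < n; i++)        /* last group: between k and 2k-1 records */
            if (!assigned[i]) group_of[i] = label;

        free(assigned);
        free(centroid);
    }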

2.2 2^d-tree generation

The first step of our method consists of building a 2^d-tree. This kind of structure is commonly found in image analysis with d equal to two (i.e., a quadtree) and in virtual reality or satellite image analysis with d equal to three (i.e., an octree). The number of dimensions d is a parameter of our method and must be defined by the user. Although any value of d is possible, it is not a good idea to use high values of d because the number of children of each node grows exponentially with d (i.e., each node has 2^d children). In this paper we have used data sets with a dimensionality between 2 and 10; thus, all the dimensions were used to build the 2^d-tree. When the number of dimensions is bigger, the user must specify which dimensions have to be used to generate the 2^d-tree. Figure 1 shows a typical partition obtained by applying a 2^d-tree with d equal to 2.

[Figure 1. Example of a quad-tree partition of a data set of 500 elements with L = 50.]

In Table 2 our node structure is shown in C style. All the elements in the structure are common to any tree structure in which there must be relations between the parent and the children, a node identifier, etc. However, the most relevant element in the node structure of a 2^d-tree is the dimension limits vector. The dimension limits vector contains an upper bound and a lower bound for each dimension; thus, all the records contained in a node must fall inside these limits.

Table 2. A node structure

    struct node {
        int nodeid;            // Node identifier
        int level;             // Depth of the node
        int Nrecords;          // Number of records
        int *records;          // Reference to the records
        int Nchildren;         // Number of children
        struct node *children; // A reference to the children
        struct node *parent;   // The parent node
        float **dimlimits;     // The dimensional limits
    };

Initially, the 2^d-tree has a single leaf, which is divided into 2^d leaves if the number of records inside its bounds is bigger than L. This division of leaves into 2^d children is recursively applied until every leaf holds fewer than L records within its dimension limits.

The dimension limits vectors of child nodes are computed as follows. The range of each of the d dimensions in the parent node is split into two subranges: one from the parent's lower bound for that dimension to the midpoint of the parent's range, and another from the midpoint to the parent's upper bound. This is done for all d dimensions, so that eventually 2^d d-dimensional subranges are obtained, each of which is assigned to a child node (a code sketch of this splitting is given at the end of this subsection).

Moreover, there are some computational issues worth emphasizing:

- Each node in the 2^d-tree must know which records fall inside the space region that it represents. However, each record is not actually stored in terms of the values of all its dimensions; instead, a reference to its position (i.e., an ordinal) in the original data set is stored. This results in a better use of the physical memory.

- During this step no distance computations are needed, and the whole process can be completed with simple and computationally cheap comparisons.
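The splitting of a parent node's dimension limits can be sketched as follows; the bit-mask encoding of the 2^d children and all names are our own illustrative choices, not the authors' code.

    /* Split a parent node's dimension limits into those of its 2^d children.
     * Child c (0 <= c < 2^d) takes, for dimension j, the lower half of the
     * parent range if bit j of c is 0 and the upper half if it is 1.
     * child_lo/child_hi are (2^d x d) row-major arrays. */
    void split_limits(const double *lo, const double *hi, int d,
                      double *child_lo, double *child_hi)
    {
        int nchildren = 1 << d;
        for (int c = 0; c < nchildren; c++)
            for (int j = 0; j < d; j++) {
                double mid = 0.5 * (lo[j] + hi[j]);
                if (c & (1 << j)) { child_lo[c * d + j] = mid;   child_hi[c * d + j] = hi[j]; }
                else              { child_lo[c * d + j] = lo[j]; child_hi[c * d + j] = mid;  }
            }
    }

    /* Index of the child whose subrange contains record x: only comparisons
     * against the midpoints are needed, no distance computations. */
    int child_of(const double *lo, const double *hi, int d, const double *x)
    {
        int c = 0;
        for (int j = 0; j < d; j++)
            if (x[j] >= 0.5 * (lo[j] + hi[j]))
                c |= (1 << j);
        return c;
    }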
To that end, we first detect the leaves with low cardinality (i.e., with fewer than k elements), as shown in Figure 2. Second, the centroid of each non-empty leaf is computed using the following equation:

    C_g = \frac{\sum_{i=1}^{E_g} x_i^g}{E_g},    (2)

where x_i^g is the i-th record of leaf g and E_g is the number of records in the leaf. After computing the centroids, the closest centroid to each low-cardinality leaf centroid is found and the corresponding leaves are fused, as shown in Figure 3. This process is repeated until there are no low-cardinality leaves. Note that the number of elements contained in a leaf after fusing two leaves can be bigger than L. Thus, we can only guarantee the existence of at least k elements; we can say nothing about the size of the finally obtained leaves. (If the goal is to guarantee the upper limit L, then the fusing step should be avoided or supervised.)

[Figure 2. Example of low-cardinality leaf detection with k = 10. Highlighted squares indicate low-cardinality leaves.]

[Figure 3. Example of centroid computations and leaf fusion. The lines link the leaves to be fused.]
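A sketch of the fusing step (Eq. (2) followed by a nearest-centroid search) is given below. The per-record leaf labels, the squared Euclidean distance between centroids and all names are assumptions made for this example only.

    #include <stdlib.h>

    /* For a low-cardinality leaf g, return the index of the nearest other
     * non-empty leaf (the leaf it should be fused with).  X: n x d row-major;
     * leaf_of[i] gives the leaf of record i. */
    int nearest_leaf(const double *X, int n, int d, const int *leaf_of,
                     int nleaves, int g)
    {
        double *centroid = calloc((size_t)nleaves * d, sizeof(double));
        int *count = calloc(nleaves, sizeof(int));

        for (int i = 0; i < n; i++) {                 /* centroids, Eq. (2) */
            count[leaf_of[i]]++;
            for (int j = 0; j < d; j++)
                centroid[leaf_of[i] * d + j] += X[i * d + j];
        }
        for (int l = 0; l < nleaves; l++)
            if (count[l] > 0)
                for (int j = 0; j < d; j++)
                    centroid[l * d + j] /= count[l];

        int best = -1;                                /* closest non-empty leaf */
        double best_d = 0.0;
        for (int l = 0; l < nleaves; l++) {
            if (l == g || count[l] == 0) continue;
            double dd = 0.0;
            for (int j = 0; j < d; j++) {
                double diff = centroid[l * d + j] - centroid[g * d + j];
                dd += diff * diff;
            }
            if (best < 0 || dd < best_d) { best = l; best_d = dd; }
        }
        free(centroid);
        free(count);
        return best;   /* fuse g into this leaf; repeat while low-cardinality leaves remain */
    }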

2.4 MDAV microaggregation

In this final step we apply the MDAV [5, 7, 13] microaggregation algorithm to each group of records obtained from each non-empty leaf. The resulting SSE_{D_i} values obtained from the application of MDAV to each group D_i are accumulated to obtain the final SSE, which is the measure we use to evaluate the result of the microaggregation.

3 Experimental Results

The experiments we have carried out consist of microaggregating a large microdata set using: i) univariate blocking, as described in the Introduction; and ii) 2^d-tree-based blocking. In order to compare our method with the plain univariate blocking approach, we have generated large data sets. Attribute values were drawn from the [-10000, 10000] range by simple random sampling. Given a block size L, plain univariate blocking of a data set with n records divides the range of the blocking attribute into n/L equal intervals; then the records whose blocking attribute falls into the i-th interval are included in the i-th block. The number of records in the simulated data sets ranges from n = to n = , and the number of dimensions d ranges from 2 to 10.

In Section 3.3 we analyze the influence of blocking on the SSE by studying its effects on a small data set that can be microaggregated without blocking. Note that this comparison can only be done using a small data set: MDAV cannot deal with very large data sets in a reasonable time without storing a matrix of distances in memory, so the size of the data set is limited by the available memory.
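The univariate blocking rule just described (n/L equal intervals over the range of the blocking attribute) can be sketched as follows; the function name and signature are ours, not the authors' code.

    /* Block index of one record under plain univariate blocking: the range
     * [minv, maxv] of the blocking attribute is split into nblocks equal
     * intervals (nblocks would be n/L as described above) and the record is
     * assigned the index of the interval its value falls into. */
    int univariate_block(double value, double minv, double maxv, int nblocks)
    {
        double width = (maxv - minv) / nblocks;   /* width of each interval  */
        int b = (int)((value - minv) / width);    /* 0-based interval index  */
        if (b < 0) b = 0;                         /* clamp boundary cases    */
        if (b >= nblocks) b = nblocks - 1;        /* value == maxv goes last */
        return b;
    }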

3.1 Study of L

L is an important value that has to be tuned by the user in order to properly obtain the groups for microaggregation. Table 3 shows that, while plain univariate blocking always yields n/L blocks, the 2^d-tree method described in this paper results in a number of blocks higher than n/L. Certainly, there are empty blocks among those generated by the 2^d-tree method. However, if we look only at the useful blocks, their number is still higher than n/L, as shown in Figure 4.

[Table 3. Number of blocks for a data set with n = records, d = 2 attributes and minimum block size k = 3. Columns: L, univariate blocking, 2^d-tree.]

[Figure 4. Number of useful blocks for a data set with n = records, d = 2 attributes and minimum block size k = 3. The continuous line represents univariate blocking, while the dashed line represents 2^d-tree-based blocking.]

If the purpose of blocking is to apply microaggregation on the resulting blocks, we must bear in mind that the best microaggregation heuristics (e.g., MDAV) have a computational cost of O(n^2). Thus, block sizes should be as similar as possible, without causing a dramatic increase of the within-group sum of squares SSE as a result of blocking. Our 2^d-tree blocking achieves a trade-off between size balance and SSE after microaggregation. For MDAV microaggregation, Figure 5 shows that the SSE is actually much lower than the one obtained with univariate blocking: the reason is that our method yields blocks which are much more homogeneous than the blocks formed by using one arbitrarily chosen blocking attribute.

[Figure 5. SSE evolution for different L values applied to data sets of three different sizes with 2 dimensions, for k = 3. The continuous line represents the univariate blocking approach; the dashed line represents the 2^d-tree blocking approach.]

In order to analyze the behavior of L, we have studied how the obtained SSE evolves depending on the number of records and the blocking technique used. Specifically, Figure 6 shows that quite similar values of SSE are obtained for a given L whatever the number of records is, when the blocking method used is based on a 2^d-tree. On the other hand, Figure 7 shows that the SSE grows with the number of records when univariate blocking is used, especially when L is small.

[Figure 6. SSE evolution for different values of L and different numbers of records, using k = 3 and the 2^d-tree approach before microaggregating.]

[Figure 7. SSE evolution for different values of L and different numbers of records, using k = 3 and the univariate blocking approach before microaggregating.]

3.2 The importance of d

The dimensionality d is a key factor in our algorithm because the number of leaves/blocks increases exponentially with d. In fact, as d increases, the advantage of our method over univariate blocking in terms of block homogeneity dramatically increases. Figure 8 and Table 4 show the evolution of SSE for different dimensionalities of the data set. It can be observed that both blocking techniques result in a growth of the SSE of the microaggregated data set. However, microaggregation with the 2^d-tree blocking method always yields a lower SSE.

[Table 4. SSE of the microaggregated data set for k = 3 and 2, 3, 4, 5 and 10 dimensions. The data set has 2.5 million records. Columns: d, 2^d-tree, univariate blocking.]

3.3 The influence of blocking

In the previous sections we have analyzed the differences between univariate blocking and our proposed 2^d-tree-based blocking over very large data sets. As explained in the Introduction, blocking is necessary when working with large data sets. However, it can unintentionally break some natural clusters during the partition of the space. Thus, the resulting SSE could become worse than the one obtained without blocking. In order to study this hypothesis, we use a small data set (the EIA data set), which can be microaggregated without blocking, and we compare the SSEs obtained with each technique. The EIA data set [1] has 4096 records with 11 attributes, and it has become a usual reference data set for testing multivariate microaggregation [5, 6, 14]. When MDAV is applied over the whole data set with k = 3, an SSE of is obtained; when k = 5, the resulting SSE is . Table 5 shows the SSE results over this data set using the studied blocking techniques for different values of L and k.

[Table 5. SSE results for the EIA data set (4096 records and d = 11). Columns: L, k, SSE with univariate blocking, SSE with 2^d-tree blocking.]

Comparing the results in Table 5 with the SSE obtained without blocking, we can see that univariate blocking is clearly disruptive, while our 2^d-tree approach makes the SSE worse for k = 3 but improves it for k = 5. This behavior of the SSE can be explained because the EIA data set is known to be naturally clustered for k = 5, and the partition that the 2^d-tree-based blocking obtains helps MDAV to improve the resulting SSE. From these results we can conclude that the univariate blocking technique is more disruptive than the 2^d-tree-based one. Moreover, in some cases in which the data set behaves as a clustered data set, the 2^d-tree-based blocking can help MDAV to improve the SSE.

[Figure 8. SSE evolution for different d values applied to a data set with dimensions from 2 to 10, for k = 3. The continuous line represents the univariate blocking approach; the dashed line represents the 2^d-tree blocking approach.]

4 Conclusions

We have presented a new blocking method based on 2^d-trees for intelligently dividing a very large data set prior to microaggregation. We have shown that our method outperforms the previous one based on univariate blocking in terms of SSE. Thus, it can be considered an appropriate tool for microaggregating huge data sets. Some lines for further research remain open:

- Extend the method to non-numerical data sets.
- Find the optimal value of L for a given data set.
- Analyze the impact of the selected dimensions d on the obtained blocking.

References

[1] R. Brand, J. Domingo-Ferrer, and J. M. Mateo-Sanz. Reference data sets to test and compare SDC methods for protection of numerical microdata. European Project IST CASC.
[2] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. MIT Press, Cambridge, MA, USA.
[3] D. Defays and P. Nanopoulos. Panels of enterprises and confidentiality: the small aggregates method. In Proc. of 92 Symposium on Design and Analysis of Longitudinal Surveys, Ottawa, Statistics Canada.
[4] J. Domingo-Ferrer, A. Martínez-Ballesté, and J. M. Mateo-Sanz. Efficient multivariate data-oriented microaggregation. Manuscript.
[5] J. Domingo-Ferrer and J. M. Mateo-Sanz. Practical data-oriented microaggregation for statistical disclosure control. IEEE Transactions on Knowledge and Data Engineering, 14(1).
[6] J. Domingo-Ferrer, F. Sebé, and A. Solanas. A polynomial-time approximation to optimal multivariate microaggregation. Manuscript.
[7] J. Domingo-Ferrer and V. Torra. Ordinal, continuous and heterogeneous k-anonymity through microaggregation. Data Mining and Knowledge Discovery, 11(2).
[8] A. W. F. Edwards and L. L. Cavalli-Sforza. A method for cluster analysis. Biometrics, 21.
[9] Economic Commission for Europe. Statistical data confidentiality in the transition countries: 2000/2001 winter survey. In Joint ECE/Eurostat Work Session on Statistical Data Confidentiality, Invited paper n. 43.
[10] A. D. Gordon and J. T. Henderson. An algorithm for Euclidean sum of squares classification. Biometrics, 33.
[11] P. Hansen, B. Jaumard, and N. Mladenovic. Minimum sum of squares clustering in a low dimensional space. Journal of Classification, 15:37-55.
[12] S. L. Hansen and S. Mukherjee. A polynomial algorithm for optimal univariate microaggregation. IEEE Transactions on Knowledge and Data Engineering, 15(4), July-August.
[13] A. Hundepool, A. V. de Wetering, R. Ramaswamy, L. Franconi, A. Capobianchi, P.-P. DeWolf, J. Domingo-Ferrer, V. Torra, R. Brand, and S. Giessing. μ-Argus version 4.0 Software and User's Manual. Statistics Netherlands, Voorburg NL, May.
[14] M. Laszlo and S. Mukherjee. Minimum spanning tree partitioning algorithm for microaggregation. IEEE Transactions on Knowledge and Data Engineering, 17(7).
[15] J. B. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, volume 1.
[16] A. Oganian and J. Domingo-Ferrer. On the complexity of optimal microaggregation for statistical disclosure control. Statistical Journal of the United Nations Economic Commission for Europe, 18(4).
[17] M. Rosemann. Erste Ergebnisse von vergleichenden Untersuchungen mit anonymisierten und nicht anonymisierten Einzeldaten am Beispiel der Kostenstrukturerhebung und der Umsatzsteuerstatistik. In G. Ronning and R. Gnoss (editors), Anonymisierung wirtschaftsstatistischer Einzeldaten, Wiesbaden: Statistisches Bundesamt.
[18] A. Solanas, A. Martínez-Ballesté, J. M. Mateo-Sanz, and J. Domingo-Ferrer. Multivariate microaggregation based on a genetic algorithm. Manuscript.
[19] J. H. Ward. Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58, 1963.
