Performance of KDB-Trees with Query-Based Splitting*
Yves Lépouchard, Dept. of Computer Science, University of Virginia, Charlottesville, VA
Ratko Orlandic, Dept. of Computer Science, Illinois Institute of Technology, Chicago, IL
John L. Pfaltz, Dept. of Computer Science, University of Virginia, Charlottesville, VA

Abstract

While the persistent data of many advanced database applications, such as OLAP and scientific studies, are characterized by very high dimensionality, typical queries posed on these data appeal to a small number of relevant dimensions. Unfortunately, the multidimensional access methods designed for high-dimensional data perform rather poorly for these partially specified queries. A potentially very appealing idea, frequently suggested in the literature, is to adopt a node-splitting policy that takes into account the importance of individual dimensions, which could be determined either a priori or through a statistical sampling of actual queries. This paper presents the results of some carefully controlled experiments conducted to observe the effects of query-based splitting on the performance of KDB-trees. The strategy is compared to a splitting policy that selects the split dimensions in a cyclic fashion, which has been shown to be very effective, especially in high-dimensional situations. Based on the results, query-based splitting does not appear to be a very appealing splitting strategy for KDB-trees.

Keywords: information databases, multi-dimensional databases, access methods, data dimensionality.

1. Introduction

Typical retrieval mechanisms are based on a subdivision of the search space into finer and finer subspaces organized into a tree structure. Each subspace is represented by an index node (page) of the tree structure. The leaf nodes correspond to the smallest regions in which the desired items are to be found. With large amounts of data, the structure must reside on secondary storage.
While an exact-match search typically follows a single path of the tree, range-search queries usually require access to a possibly large number of nodes. To search a 2- or 3-dimensional space based on region queries, one would divide the space into index regions (typically rectangular ones) through a splitting process that partitions individual dimensions. How this process takes place may have a significant impact on the retrieval performance. In part because of this, we have today a variety of different structures for spatial (2- or 3-dimensional) retrieval [2].

The real problem arises when we consider data in higher-dimensional spaces. These spaces naturally occur when data is regarded with respect to a multi-dimensional parameter space. Examples include information retrieval, data mining, OLAP, multimedia systems and numerous scientific simulations, such as high-energy physics, long-term environmental observations, as well as genome and protein studies. Unfortunately, the traditional multi-dimensional access methods do not scale well to spaces with many dimensions. Their performance rapidly deteriorates as the number of dimensions grows. As a result, they impose a practical limit on the number of dimensions, which is typically quite low. The limitations of contemporary multi-dimensional access methods in spaces with many dimensions have attracted considerable attention lately [1,3,5,9].

Further complicating the matter, the multidimensional queries of many applications tend to specify only a relatively small subset of parameters of interest (dimensions). For example, the study reported in [8] cites large data spaces with more than 2 data dimensions. However, the dimensionality of queries is typically about 2 to 4. While the events of interest for experimental high-energy physics may have more than 1 dimensional properties, the number of properties specified by the queries is much smaller, typically about 1 to 8 [4]. Since the traditional multi-dimensional access methods are generally designed assuming fully specified search predicates, they perform poorly for these partial queries.

* Partly funded by the DOE grant no. DE-FG295-ER25254.

Given this situation, a potentially very appealing approach is to take into account the importance of individual parameters/dimensions in the process of node splitting, so as to minimize the probability of overlap between the index regions and the queries and, thereby, reduce the number of page accesses. The importance of the dimensions could be determined either a priori or through a statistical sampling of actual queries. This idea is frequently suggested in the literature as a way of increasing the retrieval performance of multi-dimensional access methods. For example, in [7], Robinson suggested query-based splitting as a possible space-partitioning strategy that could improve the performance of KDB-trees, but the idea was never pursued.

The KDB-tree has become a point access method of choice in many applications. While the structure was originally designed for low-dimensional data, in [5], we have shown that it can serve as a basis for an effective retrieval mechanism in high-dimensional spaces as well. The topic of this paper arose from a larger investigation, in which we studied the effects of various splitting policies on the performance of KDB-trees in high-dimensional situations. The goal of this paper is to present the results of some carefully controlled experiments conducted to observe the effects of query-based splitting on the performance of KDB-trees. The strategy is compared to a splitting policy that selects the split dimensions in a cyclic fashion, which has been shown to be very effective, especially in high-dimensional situations [5].
Based on the results, we will argue that query-based splitting does not appear to be a very appealing splitting strategy for KDB-trees.

The rest of the paper is organized as follows. In Section 2, we review the structure of KDB-trees and its policy of cyclic splitting. Section 3 discusses the idea of query-based splitting, introducing some relevant terminology. Section 4 presents the results of the experimental study conducted to compare the performance of KDB-trees with query-based and cyclic splitting. Section 5 concludes the paper by summarizing the results.

2. Cyclic Splitting of KDB-Trees

A KDB-tree is a height-balanced hierarchy of nodes (pages), in which each node represents a portion of space. At every level of the structure, the d-dimensional universe is recursively divided into hyper-rectangles by means of (d-1)-dimensional hyper-planes, each of which is perpendicular to one of the axes. The root node represents the entire universe, which itself is a multidimensional rectangle. Figure 1 illustrates a portion of a KDB-tree and its partition of a 2-dimensional space. Observe that each rectangular subspace has been split first with respect to one, and then another dimension. This is the characteristic of cyclic splitting.

Figure 1. A 2-dimensional KDB-tree.

The leaf nodes of KDB-trees, also called point pages, contain actual data objects, i.e. points in space. In the conceptual subdivision of the space corresponding to this level of the structure, the directions of the dividing hyper-planes alternate among individual dimensions. Every interior node, called a region page, maintains index entries, each representing a child node at the level below.

With cyclic splitting of point pages, the splitting dimension of a point page is determined as (splitting dimension of the old page + 1) MOD d, where d is the dimensionality of the universe. We enhance this policy with a splitting strategy for region pages, called first-division splitting [5].
According to this strategy, a region page R is split along the dividing hyper-plane by which the index region of R was split for the first time (the first-division plane). As a result, this policy follows the partitioning sequence at the leaf level, selecting the splitting dimensions of the region pages in a cyclic fashion. As shown in [5], this strategy significantly improves the performance of KDB-trees in high-dimensional spaces.

3. Query-Based Splitting

In [7], Robinson suggested a finer splitting policy that can take advantage of the actual query patterns. In the following, we say that a query is partially specified if it restricts only a subset of dimensions, leaving the rest of the dimensions unspecified. Of particular interest for our analysis will be mono-specified queries, whose result set can be formally defined as

$R = \{\, x \in S \mid x_{\min} \le x.a_i \le x_{\max} \,\}$,

where $S$ is the given set of points, $a_i$ is the coordinate of a point object along the $i$-th dimension, and $x_{\min}$ and $x_{\max}$ are two scalar values. Figure 2 shows a mono-specified query (gray volume) defined on a 3-dimensional space.

Figure 2. A mono-specified query.

We say that a query is fully specified when all dimensions are restricted by the query predicate. Reusing the above notation, the result set can be defined as

$R = \{\, x \in S \mid \forall i \in [0, d),\ x_{\min}.a_i \le x.a_i \le x_{\max}.a_i \,\}$.

Note that now $x_{\min}$ and $x_{\max}$ represent vectors, not scalar values as in the previous formula.

Since queries of typical multi-dimensional applications tend to be partially specified, we can perform a statistical analysis of the dimensions that are specified most often. If such an analysis is possible, one can compute the probability of specifying each dimension and build the tree structure accordingly. Whenever a split of a point page occurs, we pick the splitting dimension in relation to the probability distribution resulting from the anticipated query pattern. This is the underlying idea of query-based (QB-) splitting. Note that this policy applies to point pages. The splitting of region pages follows the first-division splitting strategy described earlier.

4. Experimental Evidence

In order to compare cyclic and query-based splitting policies, we implemented two versions of KDB-trees that differ in the way the splitting dimensions are selected. We also constructed three different test cases.
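The two variants differ only in how the split dimension of a point page is chosen. The following is a minimal sketch of that choice, not the authors' implementation. The function names `cyclic_split_dim` and `query_based_split_dim` are hypothetical, and drawing the dimension at random in proportion to the query probabilities is an assumption: the text only says the dimension is picked "in relation to" the distribution.

```python
import random


def cyclic_split_dim(old_split_dim: int, d: int) -> int:
    """Cyclic rule from Section 2: (splitting dimension of the old page + 1) MOD d."""
    return (old_split_dim + 1) % d


def query_based_split_dim(probabilities, rng: random.Random) -> int:
    """Pick a split dimension with probability proportional to how often
    queries specify it (assumed interpretation of QB-splitting)."""
    dims = range(len(probabilities))
    return rng.choices(dims, weights=probabilities, k=1)[0]


if __name__ == "__main__":
    d = 3
    # Cyclic splitting visits the dimensions in turn: 0 -> 1 -> 2 -> 0.
    dim = 0
    for _ in range(d):
        dim = cyclic_split_dim(dim, d)
    print("cyclic returns to dimension", dim)

    # With all probability mass on dimension 0, every QB split uses dimension 0.
    rng = random.Random(0)
    print({query_based_split_dim([1.0, 0.0, 0.0], rng) for _ in range(100)})
```

When a single dimension carries probability 1, the sketch always splits along it, which corresponds to the behavior described for the first test case.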
The first test case compares the two variants of KDB-trees for queries that are always specified with respect to one particular dimension. In the second test case, the queries are mono-specified but with respect to different dimensions. The third case compares cyclic splitting with QB-splitting for fully specified queries. For all test cases, the input was the same set of 1, randomly generated points. In each test case, we performed 1, queries and measured the average number of page accesses per query as dimensionality increases from 2 to 16.

In the first two test cases, the queries were mono-specified and the probability distribution used to determine the split axes of the KDB-tree with QB-splitting was the same as the importance of the dimensions implied by the queries. In the first mono-specified case, the query dimension was constant. In other words, all queries specified only this single dimension. In the second case, one dimension was clearly dominant and it was specified by most queries, but some queries specified other dimensions as well. The third case was constructed to observe the performance of QB-splitting when the actual queries do not behave as anticipated. In this test case, the queries were fully specified.

Obviously, these experiments were constructed for some extreme scenarios. For mono-specified queries, a traditional B-tree index would be far superior to any multi-dimensional structure. However, the test cases were constructed so that they can reveal the promise of query-based splitting. Certainly, if the policy does not perform well with mono-specified queries, for which the distribution of splits can be selected to match perfectly the implied relevance of the dimensions, it is unlikely that it can perform well for partial queries that specify more than one dimension.

In the first test case, whose results are shown in Figure 3, the same dimension was specified by all queries.
Thus, the probability of specifying the predominant dimension was 1.0, whereas the probability of specifying any other dimension was 0.0. In this case, query-based splitting guarantees that all splits occur along the predominant dimension. As one can see from Figure 3, for this scenario, QB-splitting is clearly superior to cyclic splitting, especially in high-dimensional spaces.

Figure 3. Test Case 1: Mono-specified queries along one exclusively predominant dimension.

It is unrealistic to have all queries specify only one dimension. A more realistic scenario might have 60% of all queries specify one dimension, 30% specify a second, and only 10% specify a third. However, an arbitrary distribution need not scale with data dimensionality. Here, we need a density distribution that remains self-similar as dimensionality grows. Thus, in the second test case, we applied a continuous square-root function to compute the probability distribution for any number of dimensions. The square-root function $F: x \mapsto x^{1/\alpha}$, where $\alpha = 2$, was used on the real interval $[0, 1]$. For example, in a 5-dimensional space, we calculate the probability $p(i)$ that each dimension $i$ is specified as follows, where $F(i)$ abbreviates $F$ evaluated at $(i+1)/5$:

$F(0) \approx 0.447$, $p(0) = F(0) \approx 0.447$;
$F(1) \approx 0.632$, $p(1) = F(1) - F(0) \approx 0.185$;
$F(2) \approx 0.775$, $p(2) = F(2) - F(1) \approx 0.143$;
$F(3) \approx 0.894$, $p(3) = F(3) - F(2) \approx 0.119$;
$F(4) = 1$, $p(4) = F(4) - F(3) \approx 0.106$.

Observe that the first dimension is still clearly predominant, as it is specified in more than 40% of all queries. The second dimension is specified in nearly 20% of queries, with the remaining dimensions appearing in about 10-15% of queries. The strong bias toward the first dimension should still make QB-splitting desirable. Nevertheless, Figure 4 reveals little difference between QB- and cyclic splitting, even though the KDB-tree structure has a distribution of splits that perfectly matches the priorities of the dimensions. This is because, whenever a split occurs along a certain dimension, all queries that discriminate along other dimensions are penalized by this choice. Thus, even though the QB-splitting policy pays off in some extreme cases, for more general cases, it does not appear to bring any improvement that could justify the effort of forecasting the actual queries.

Figure 4. Test Case 2: Skewed importance of the dimensions.

The queries used for test case 3, shown in Figure 5, were fully specified. Therefore, all dimensions are important. However, even though other dimensions may appear in the queries, this test case forces splits only along a single dimension. While this test case is somewhat contrived, its purpose is to show what could happen when the actual queries do not behave as expected. Clearly, the QB-splitting policy adapts very poorly to unexpected query patterns. Cyclic splitting is a clear winner in this situation.

In summary, the idea of adapting the splitting policy to suit the collected statistics about queries is intuitively a fine concept. It is backed up by at least one scenario investigated in this paper. However, for more general cases, query-based splitting does not bring a significant improvement that can justify the effort of forecasting the actual queries. As appealing as it may sound, QB-splitting does not appear to be an effective splitting strategy for KDB-trees.
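The square-root distribution used in the second test case can be reproduced with a few lines of code. This is a sketch under the assumption that dimension i receives the increment of F(x) = x^(1/α) over the interval [i/d, (i+1)/d], which is consistent with the 5-dimensional values quoted in the text. The function name `dimension_probabilities` is hypothetical.

```python
def dimension_probabilities(d: int, alpha: float = 2.0) -> list:
    """Return p(0..d-1), where p(i) = F((i+1)/d) - F(i/d) and F(x) = x**(1/alpha).

    Assumed reading of the construction in test case 2; alpha = 2 gives
    the square-root function.
    """
    def F(x: float) -> float:
        return x ** (1.0 / alpha)

    return [F((i + 1) / d) - F(i / d) for i in range(d)]


if __name__ == "__main__":
    probs = dimension_probabilities(5)
    for i, p in enumerate(probs):
        print(f"p({i}) = {p:.3f}")
    # The increments telescope, so the probabilities sum to F(1) - F(0) = 1.
    print("sum =", round(sum(probs), 6))
```

Because the example in the text subtracts already-rounded values of F, its figures can differ from the unrounded results in the third decimal place. The same function applies unchanged to any dimensionality d, which is the self-similarity property the test case relies on.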
Figure 5. Test Case 3: Fully specified queries and only one splitting dimension.

5. Summary and Discussion

The problem of accessing data in high-dimensional spaces has attracted considerable attention. The proposed solutions generally assume well-defined queries that restrict the values of all dimensions in the universe. However, in advanced applications such as scientific studies and OLAP, typical queries posed on high-dimensional data restrict a relatively small subset of dimensions, leaving the rest of the dimensions unspecified. The contemporary multi-dimensional access methods tend to perform poorly for these partially specified queries.

In this paper, we investigated the idea of adopting a splitting policy that takes into account the priorities of individual dimensions, which we call query-based splitting. We presented the results of an experimental study conducted to observe the effects of query-based splitting on the performance of KDB-trees. The strategy was compared to a splitting policy that selects the split dimensions in a cyclic fashion, which has been shown to be very effective, especially in high-dimensional situations. Based on the results, query-based splitting does not appear to be a very appealing splitting strategy for KDB-trees.

In [6], we have proposed a much more effective retrieval technique for partial queries. The idea is to apply an elaborate storage organization, called the inverted space (IS), which assigns to a high-dimensional universe one data store and a number of multidimensional indexes, each supporting efficient selections on a subset of dimensions. This organization allows the system administrator to control the size of individual indexes and avoid the negative impact of very high data dimensionality on the retrieval performance. To support the IS storage organization, we have also developed a new point access method, called the KDB_HD-tree [6].
Together, the two solutions can enable efficient access to persistent data of high dimensionality based on partially specified queries.

6. References

[1] S. Berchtold, C. Böhm, and H.-P. Kriegel, "The Pyramid-Technique: Towards Breaking the Curse of Dimensionality," Proc. ACM SIGMOD Int. Conf. on Management of Data.
[2] V. Gaede and O. Günther, "Multidimensional Access Methods," ACM Computing Surveys 30(2):170-231.
[3] K.I. Lin, H.V. Jagadish, and C. Faloutsos, "The TV-Tree: An Index Structure for High-Dimensional Data," VLDB Journal 3(4).
[4] R. Orlandic, J. Lukaszuk, and C. Swietlik, "The Design of a Retrieval Technique for High-Dimensional Data on Tertiary Storage," SIGMOD Record, 2002 (in press).
[5] R. Orlandic and B. Yu, "Implementing KDB-trees to Support High-Dimensional Data," Proc. Int. Database Engineering and Applications Symposium IDEAS '01, 2001.
[6] R. Orlandic and B. Yu, "A Retrieval Technique for High-Dimensional Data and Partially Specified Queries," Data and Knowledge Eng., 2002 (in press).
[7] J.T. Robinson, "The K-D-B Tree: A Search Structure for Large Multidimensional Dynamic Indexes," Proc. ACM SIGMOD Int. Conf. on Management of Data, pp. 10-18.
[8] K.A. Ross and K.A. Zaman, "Optimizing Selections over Databases," Proc. 12th Int. Conf. on Scientific and Statistical Database Management, 2000.
[9] R. Weber, H.-J. Schek, and S. Blott, "A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces," Proc. 24th Int. Conf. on VLDB, 1998.
More informationMultimedia Databases. Wolf-Tilo Balke Philipp Wille Institut für Informationssysteme Technische Universität Braunschweig http://www.ifis.cs.tu-bs.
Multimedia Databases Wolf-Tilo Balke Philipp Wille Institut für Informationssysteme Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de 14 Previous Lecture 13 Indexes for Multimedia Data 13.1
More informationLess naive Bayes spam detection
Less naive Bayes spam detection Hongming Yang Eindhoven University of Technology Dept. EE, Rm PT 3.27, P.O.Box 53, 5600MB Eindhoven The Netherlands. E-mail:h.m.yang@tue.nl also CoSiNe Connectivity Systems
More informationis in plane V. However, it may be more convenient to introduce a plane coordinate system in V.
.4 COORDINATES EXAMPLE Let V be the plane in R with equation x +2x 2 +x 0, a two-dimensional subspace of R. We can describe a vector in this plane by its spatial (D)coordinates; for example, vector x 5
More informationINTEROPERABILITY IN DATA WAREHOUSES
INTEROPERABILITY IN DATA WAREHOUSES Riccardo Torlone Roma Tre University http://torlone.dia.uniroma3.it/ SYNONYMS Data warehouse integration DEFINITION The term refers to the ability of combining the content
More informationInternational Journal of Scientific & Engineering Research, Volume 4, Issue 11, November-2013 349 ISSN 2229-5518
International Journal of Scientific & Engineering Research, Volume 4, Issue 11, November-2013 349 Load Balancing Heterogeneous Request in DHT-based P2P Systems Mrs. Yogita A. Dalvi Dr. R. Shankar Mr. Atesh
More informationXM-Tree, a new index for Web Information Retrieval
XM-Tree, a new index for Web Information Retrieval Claudia Deco, Guillermo Pierángeli, Cristina Bender Departamento de Sistemas e Informática Facultad de Ciencias Exactas, Ingeniería y Agrimensura Universidad
More informationSpace-filling Techniques in Visualizing Output from Computer Based Economic Models
Space-filling Techniques in Visualizing Output from Computer Based Economic Models Richard Webber a, Ric D. Herbert b and Wei Jiang bc a National ICT Australia Limited, Locked Bag 9013, Alexandria, NSW
More informationVisualization Techniques in Data Mining
Tecniche di Apprendimento Automatico per Applicazioni di Data Mining Visualization Techniques in Data Mining Prof. Pier Luca Lanzi Laurea in Ingegneria Informatica Politecnico di Milano Polo di Milano
More informationA REVIEW PAPER ON MULTIDIMENTIONAL DATA STRUCTURES
A REVIEW PAPER ON MULTIDIMENTIONAL DATA STRUCTURES Kujani. T *, Dhanalakshmi. T +, Pradha. P # Asst. Professor, Department of Computer Science and Engineering, SKR Engineering College, Chennai, TamilNadu,
More informationUniversity of Gaziantep, Department of Business Administration
University of Gaziantep, Department of Business Administration The extensive use of information technology enables organizations to collect huge amounts of data about almost every aspect of their businesses.
More informationAuthors. Data Clustering: Algorithms and Applications
Authors Data Clustering: Algorithms and Applications 2 Contents 1 Grid-based Clustering 1 Wei Cheng, Wei Wang, and Sandra Batista 1.1 Introduction................................... 1 1.2 The Classical
More informationMetaGame: An Animation Tool for Model-Checking Games
MetaGame: An Animation Tool for Model-Checking Games Markus Müller-Olm 1 and Haiseung Yoo 2 1 FernUniversität in Hagen, Fachbereich Informatik, LG PI 5 Universitätsstr. 1, 58097 Hagen, Germany mmo@ls5.informatik.uni-dortmund.de
More informationGerry Hobbs, Department of Statistics, West Virginia University
Decision Trees as a Predictive Modeling Method Gerry Hobbs, Department of Statistics, West Virginia University Abstract Predictive modeling has become an important area of interest in tasks such as credit
More informationSpeed Up Your Moving Object Using Spatio-Temporal Predictors
Time-Series Prediction with Applications to Traffic and Moving Objects Databases Bo Xu Department of Computer Science University of Illinois at Chicago Chicago, IL 60607, USA boxu@cs.uic.edu Ouri Wolfson
More informationSOLUTIONS TO ASSIGNMENT 1 MATH 576
SOLUTIONS TO ASSIGNMENT 1 MATH 576 SOLUTIONS BY OLIVIER MARTIN 13 #5. Let T be the topology generated by A on X. We want to show T = J B J where B is the set of all topologies J on X with A J. This amounts
More informationEfficient Integration of Data Mining Techniques in Database Management Systems
Efficient Integration of Data Mining Techniques in Database Management Systems Fadila Bentayeb Jérôme Darmont Cédric Udréa ERIC, University of Lyon 2 5 avenue Pierre Mendès-France 69676 Bron Cedex France
More informationCopyright 2007 Ramez Elmasri and Shamkant B. Navathe. Slide 29-1
Slide 29-1 Chapter 29 Overview of Data Warehousing and OLAP Chapter 29 Outline Purpose of Data Warehousing Introduction, Definitions, and Terminology Comparison with Traditional Databases Characteristics
More informationDesigning an Object Relational Data Warehousing System: Project ORDAWA * (Extended Abstract)
Designing an Object Relational Data Warehousing System: Project ORDAWA * (Extended Abstract) Johann Eder 1, Heinz Frank 1, Tadeusz Morzy 2, Robert Wrembel 2, Maciej Zakrzewicz 2 1 Institut für Informatik
More informationOptimized Data Indexing Algorithms for OLAP Systems
Database Systems Journal vol. I, no. 2/200 7 Optimized Data Indexing Algoritms for OLAP Systems Lucian BORNAZ Faculty of Cybernetics, Statistics and Economic Informatics Academy of Economic Studies, Bucarest
More informationIndexing Techniques for Data Warehouses Queries. Abstract
Indexing Techniques for Data Warehouses Queries Sirirut Vanichayobon Le Gruenwald The University of Oklahoma School of Computer Science Norman, OK, 739 sirirut@cs.ou.edu gruenwal@cs.ou.edu Abstract Recently,
More informationIndexing and Retrieval of Historical Aggregate Information about Moving Objects
Indexing and Retrieval of Historical Aggregate Information about Moving Objects Dimitris Papadias, Yufei Tao, Jun Zhang, Nikos Mamoulis, Qiongmao Shen, and Jimeng Sun Department of Computer Science Hong
More informationRELEVANT TO ACCA QUALIFICATION PAPER P3. Studying Paper P3? Performance objectives 7, 8 and 9 are relevant to this exam
RELEVANT TO ACCA QUALIFICATION PAPER P3 Studying Paper P3? Performance objectives 7, 8 and 9 are relevant to this exam Business forecasting and strategic planning Quantitative data has always been supplied
More informationPersuasion by Cheap Talk - Online Appendix
Persuasion by Cheap Talk - Online Appendix By ARCHISHMAN CHAKRABORTY AND RICK HARBAUGH Online appendix to Persuasion by Cheap Talk, American Economic Review Our results in the main text concern the case
More informationCharacterizing the Performance of Dynamic Distribution and Load-Balancing Techniques for Adaptive Grid Hierarchies
Proceedings of the IASTED International Conference Parallel and Distributed Computing and Systems November 3-6, 1999 in Cambridge Massachusetts, USA Characterizing the Performance of Dynamic Distribution
More informationA Non-Linear Schema Theorem for Genetic Algorithms
A Non-Linear Schema Theorem for Genetic Algorithms William A Greene Computer Science Department University of New Orleans New Orleans, LA 70148 bill@csunoedu 504-280-6755 Abstract We generalize Holland
More informationEfficient Structure Oriented Storage of XML Documents Using ORDBMS
Efficient Structure Oriented Storage of XML Documents Using ORDBMS Alexander Kuckelberg 1 and Ralph Krieger 2 1 Chair of Railway Studies and Transport Economics, RWTH Aachen Mies-van-der-Rohe-Str. 1, D-52056
More informationThe Scientific Data Mining Process
Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In
More informationIndexing Spatio-Temporal archive As a Preprocessing Alsuccession
The VLDB Journal manuscript No. (will be inserted by the editor) Indexing Spatio-temporal Archives Marios Hadjieleftheriou 1, George Kollios 2, Vassilis J. Tsotras 1, Dimitrios Gunopulos 1 1 Computer Science
More information2 Associating Facts with Time
TEMPORAL DATABASES Richard Thomas Snodgrass A temporal database (see Temporal Database) contains time-varying data. Time is an important aspect of all real-world phenomena. Events occur at specific points
More informationWhat is Visualization? Information Visualization An Overview. Information Visualization. Definitions
What is Visualization? Information Visualization An Overview Jonathan I. Maletic, Ph.D. Computer Science Kent State University Visualize/Visualization: To form a mental image or vision of [some
More informationSmart-Sample: An Efficient Algorithm for Clustering Large High-Dimensional Datasets
Smart-Sample: An Efficient Algorithm for Clustering Large High-Dimensional Datasets Dudu Lazarov, Gil David, Amir Averbuch School of Computer Science, Tel-Aviv University Tel-Aviv 69978, Israel Abstract
More informationGAZETRACKERrM: SOFTWARE DESIGNED TO FACILITATE EYE MOVEMENT ANALYSIS
GAZETRACKERrM: SOFTWARE DESIGNED TO FACILITATE EYE MOVEMENT ANALYSIS Chris kankford Dept. of Systems Engineering Olsson Hall, University of Virginia Charlottesville, VA 22903 804-296-3846 cpl2b@virginia.edu
More informationDATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS
DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS 1 AND ALGORITHMS Chiara Renso KDD-LAB ISTI- CNR, Pisa, Italy WHAT IS CLUSTER ANALYSIS? Finding groups of objects such that the objects in a group will be similar
More informationPhysical Data Organization
Physical Data Organization Database design using logical model of the database - appropriate level for users to focus on - user independence from implementation details Performance - other major factor
More informationCREATING MINIMIZED DATA SETS BY USING HORIZONTAL AGGREGATIONS IN SQL FOR DATA MINING ANALYSIS
CREATING MINIMIZED DATA SETS BY USING HORIZONTAL AGGREGATIONS IN SQL FOR DATA MINING ANALYSIS Subbarao Jasti #1, Dr.D.Vasumathi *2 1 Student & Department of CS & JNTU, AP, India 2 Professor & Department
More informationTHE FUNDAMENTAL THEOREM OF ALGEBRA VIA PROPER MAPS
THE FUNDAMENTAL THEOREM OF ALGEBRA VIA PROPER MAPS KEITH CONRAD 1. Introduction The Fundamental Theorem of Algebra says every nonconstant polynomial with complex coefficients can be factored into linear
More informationTHE concept of Big Data refers to systems conveying
EDIC RESEARCH PROPOSAL 1 High Dimensional Nearest Neighbors Techniques for Data Cleaning Anca-Elena Alexandrescu I&C, EPFL Abstract Organisations from all domains have been searching for increasingly more
More information1 Representation of Games. Kerschbamer: Commitment and Information in Games
1 epresentation of Games Kerschbamer: Commitment and Information in Games Game-Theoretic Description of Interactive Decision Situations This lecture deals with the process of translating an informal description
More informationDATA MINING - 1DL360
DATA MINING - 1DL360 Fall 2013" An introductory class in data mining http://www.it.uu.se/edu/course/homepage/infoutv/per1ht13 Kjell Orsborn Uppsala Database Laboratory Department of Information Technology,
More informationCarnegie Mellon University. Extract from Andrew Moore's PhD Thesis: Ecient Memory-based Learning for Robot Control
An intoductory tutorial on kd-trees Andrew W. Moore Carnegie Mellon University awm@cs.cmu.edu Extract from Andrew Moore's PhD Thesis: Ecient Memory-based Learning for Robot Control PhD. Thesis Technical
More informationDescriptive statistics Statistical inference statistical inference, statistical induction and inferential statistics
Descriptive statistics is the discipline of quantitatively describing the main features of a collection of data. Descriptive statistics are distinguished from inferential statistics (or inductive statistics),
More informationPersonalization of Web Search With Protected Privacy
Personalization of Web Search With Protected Privacy S.S DIVYA, R.RUBINI,P.EZHIL Final year, Information Technology,KarpagaVinayaga College Engineering and Technology, Kanchipuram [D.t] Final year, Information
More informationClustering. Data Mining. Abraham Otero. Data Mining. Agenda
Clustering 1/46 Agenda Introduction Distance K-nearest neighbors Hierarchical clustering Quick reference 2/46 1 Introduction It seems logical that in a new situation we should act in a similar way as in
More informationOn the k-path cover problem for cacti
On the k-path cover problem for cacti Zemin Jin and Xueliang Li Center for Combinatorics and LPMC Nankai University Tianjin 300071, P.R. China zeminjin@eyou.com, x.li@eyou.com Abstract In this paper we
More informationEM Clustering Approach for Multi-Dimensional Analysis of Big Data Set
EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set Amhmed A. Bhih School of Electrical and Electronic Engineering Princy Johnson School of Electrical and Electronic Engineering Martin
More informationStandardization and Its Effects on K-Means Clustering Algorithm
Research Journal of Applied Sciences, Engineering and Technology 6(7): 399-3303, 03 ISSN: 040-7459; e-issn: 040-7467 Maxwell Scientific Organization, 03 Submitted: January 3, 03 Accepted: February 5, 03
More information