Improving Sort-Tile-Recursive Algorithm for R-tree Packing in Indexing Time Series

Improving Sort-Tile-Recursive Algorithm for R-tree Packing in Indexing Time Series
Bui Cong Giao, Duong Tuan Anh
Presenter: Bui Cong Giao

Contents
1. Introduction
2. Preliminaries
3. Strategies for improving STR
4. Experimental Evaluation
5. Conclusions & Future work
26/01/2015

1. Introduction

R-tree
- The most popular index structure for spatial data management
- A collection of nodes organized hierarchically as a search tree
Fig. 0 Node structure of R-tree
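The node structure can be sketched in a few lines of Python. This is a minimal illustration, assuming the usual R-tree layout in which every entry pairs a minimum bounding rectangle (MBR) with either an object identifier (leaf) or a child node (inner node). The names `MBR`, `Node`, `low`, `high`, and `entries` are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass, field
from typing import Tuple


@dataclass(frozen=True)
class MBR:
    """Minimum bounding rectangle: lower and upper corner, one value per axis."""
    low: Tuple[float, ...]
    high: Tuple[float, ...]

    def union(self, other: "MBR") -> "MBR":
        # Smallest rectangle that encloses both rectangles.
        return MBR(
            tuple(min(a, b) for a, b in zip(self.low, other.low)),
            tuple(max(a, b) for a, b in zip(self.high, other.high)),
        )


@dataclass
class Node:
    """Assumed layout: entries are (MBR, object_id) in a leaf, (MBR, Node) otherwise."""
    is_leaf: bool
    entries: list = field(default_factory=list)

    def mbr(self) -> MBR:
        # The MBR of a node encloses the MBRs of all its entries.
        boxes = [box for box, _ in self.entries]
        result = boxes[0]
        for box in boxes[1:]:
            result = result.union(box)
        return result


leaf = Node(is_leaf=True, entries=[
    (MBR((0, 0), (1, 1)), "a"),
    (MBR((2, 1), (3, 4)), "b"),
])
print(leaf.mbr())
```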

R-tree
Basic operations on R-tree:
- Insert (many versions)
- Search
- Delete
R*-tree is a well-known improvement of R-tree.
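Of the three operations, Search is the easiest to sketch: descend from the root and follow only those entries whose MBR intersects the query box. The Python below is an illustrative sketch, not the implementation used in the experiments. The dictionary layout (`leaf`, `entries`) and the function names are assumptions, and Insert and Delete are omitted.

```python
def intersects(a, b):
    """True if boxes a and b, each given as (low, high), overlap on every axis."""
    return all(al <= bh and bl <= ah
               for al, ah, bl, bh in zip(a[0], a[1], b[0], b[1]))


def search(node, query):
    """Collect the ids of all leaf entries whose MBR intersects the query box."""
    hits = []
    for box, child in node["entries"]:
        if not intersects(box, query):
            continue                      # prune this entry and its subtree
        if node["leaf"]:
            hits.append(child)            # child is an object id
        else:
            hits.extend(search(child, query))
    return hits


leaf1 = {"leaf": True, "entries": [(((0, 0), (1, 1)), "a"), (((2, 2), (3, 3)), "b")]}
leaf2 = {"leaf": True, "entries": [(((5, 5), (6, 6)), "c")]}
root = {"leaf": False, "entries": [(((0, 0), (3, 3)), leaf1), (((5, 5), (6, 6)), leaf2)]}

print(search(root, ((0.5, 0.5), (2.5, 2.5))))
```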

Fig. 1 A set of minimum bounding rectangles (MBRs) and an R-tree built for them

Bulk-loading
Bulk-loading builds an R-tree from all spatial objects at once instead of inserting them one by one.
Advantages:
- Faster loading of the R-tree, since all spatial objects are available at once
- Less empty space in the nodes of the R-tree
- Better partitioning of spatial objects into the nodes of the R-tree

Sort-based loading
- A kind of bulk-loading for R-tree
- Simple to implement, yet gives good query performance
- The only method commonly used in DBMSs and GIS at the moment
- Sort-Tile-Recursive (STR) is a typical example of sort-based loading

Motivations
- The paper that introduced STR did not compare it with R*-tree.
- Researchers continuously attempt to improve the R-tree in both performance and space usage.

Main contributions
Two strategies that determine how to partition spatial objects into the nodes of an R-tree and how to connect the ends of consecutive runs into a suboptimal space-filling curve.

2. Preliminaries

STR with d = 2
- Set $P = \lceil r/M \rceil$, the number of leaf nodes needed for $r$ rectangles with node capacity $M$.
- Sort the rectangles by x-coordinate and partition them into $S = \lceil \sqrt{P} \rceil$ vertical slices. A slice consists of a run of $S \cdot M$ consecutive rectangles.
- Sort the rectangles of each slice by y-coordinate.
- Pack them into nodes in groups of $M$.
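The two-dimensional procedure is short enough to write out. The sketch below follows the steps on this slide, assuming each rectangle is represented by its center point. The function name `str_pack_2d` and the tuple representation are illustrative choices, not taken from the slides.

```python
import math


def str_pack_2d(centers, M):
    """Group 2-D rectangle centers into leaves of at most M entries (STR, d = 2)."""
    if not centers:
        return []
    r = len(centers)
    P = math.ceil(r / M)             # number of leaf nodes
    S = math.ceil(math.sqrt(P))      # number of vertical slices
    by_x = sorted(centers, key=lambda c: c[0])
    leaves = []
    for i in range(0, r, S * M):     # each slice is a run of S * M rectangles
        vertical_slice = sorted(by_x[i:i + S * M], key=lambda c: c[1])
        for j in range(0, len(vertical_slice), M):
            leaves.append(vertical_slice[j:j + M])
    return leaves


grid = [(x, y) for x in range(4) for y in range(4)]   # 16 centers
for leaf in str_pack_2d(grid, M=4):
    print(leaf)
```

With 16 centers and M = 4 this gives P = 4 and S = 2, so each of the two slices holds 8 centers and is cut into two leaves.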

STR with d > 2
- Sort the hyper-rectangles by the first coordinate and partition them into $S = \lceil P^{1/d} \rceil$ slices. A slice consists of a run of $M \cdot \lceil P^{(d-1)/d} \rceil$ sorted consecutive hyper-rectangles.
- Each slice is then processed recursively using the remaining $d - 1$ coordinates.
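The recursion can be sketched as a function that returns the centers in STR order, so that consecutive chunks of M form the leaves. This is a sketch under two assumptions: each hyper-rectangle is represented by its center, and P is recomputed for every slice when recursing. The name `str_order` is hypothetical.

```python
import math


def str_order(centers, M, axis=0):
    """Return centers in STR order; consecutive chunks of M are the leaves."""
    if not centers:
        return []
    remaining = len(centers[0]) - axis        # coordinates still to process
    ordered = sorted(centers, key=lambda c: c[axis])
    if remaining <= 1:
        return ordered                        # last coordinate: just sort
    P = math.ceil(len(ordered) / M)           # leaves needed for this slice
    # Slice length M * ceil(P^((d-1)/d)), with d = remaining coordinates.
    run = M * math.ceil(P ** ((remaining - 1) / remaining))
    out = []
    for i in range(0, len(ordered), run):
        out.extend(str_order(ordered[i:i + run], M, axis + 1))
    return out


points = [(x, y, z) for x in range(3) for y in range(3) for z in range(3)]
order = str_order(points, M=3)
leaves = [order[i:i + 3] for i in range(0, len(order), 3)]
print(len(leaves))
```

For two coordinates the slice length reduces to M times the ceiling of the square root of P, which matches the d = 2 case.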

Build an R-tree from a collection of sorted spatial objects
1. The r rectangles are ordered into P consecutive groups of M rectangles, and each group of M is placed in the same leaf-level node.
2. Group M successive leaf nodes into a parent node.
3. Recursively pack these parent nodes into nodes at the next higher level, proceeding upwards, until the root node is created.
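The three steps above can be sketched as a bottom-up loop over an already sorted list. For brevity, nodes are plain Python lists and MBR bookkeeping is left out. The name `pack_levels` is an assumption made for this example.

```python
def pack_levels(sorted_items, M):
    """Pack sorted items into nodes of at most M entries, level by level.

    Returns the levels from the leaf level up to the root level.
    """
    # Step 1: consecutive groups of M items become the leaf nodes.
    level = [sorted_items[i:i + M] for i in range(0, len(sorted_items), M)]
    levels = [level]
    # Steps 2 and 3: group M successive nodes into a parent until one root remains.
    while len(level) > 1:
        level = [level[i:i + M] for i in range(0, len(level), M)]
        levels.append(level)
    return levels


levels = pack_levels(list(range(10)), M=3)
print([len(level) for level in levels])
```

Ten items with M = 3 give four leaves, two parent nodes, and one root.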

Space-Filling Curves
- A space-filling curve visits all the points in a d-dimensional grid exactly once and never crosses itself.
- Space-filling curves are often used to cluster nearby points into the same leaf nodes.
- The Z-order and the Hilbert curve are typical examples.
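As a concrete case, the Z-order position of a two-dimensional grid cell is obtained by interleaving the bits of its coordinates. The sketch below assumes non-negative integer coordinates. The function name `z_order_key` and the default of 16 bits per coordinate are illustrative assumptions.

```python
def z_order_key(x, y, bits=16):
    """Interleave the bits of x and y (x in even positions, y in odd positions)."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key


cells = [(x, y) for y in range(2) for x in range(2)]
# Sorting by the key visits the 2 x 2 grid in a Z-shaped pattern.
print(sorted(cells, key=lambda c: z_order_key(*c)))
```

Sorting points by such a key before packing them in groups of M is one common way to place nearby points in the same leaf.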

Space-Filling Curves (cont.)

3. Strategies for improving STR

Intuitive improvements on STR
- STR sorts the hyper-rectangles according to the first coordinate of their centers. We instead choose the longest coordinate, that is, the one along which the two most distant centers of hyper-rectangles lie farthest apart.
- STR connects the ends of the runs of consecutive slices into a space-filling curve similar to the Z-order. We connect the ends of the runs of consecutive slices into a suboptimal space-filling curve.

First strategy (ISTR1)
- Order the axes by descending distance between the two most distant centers of hyper-rectangles along each axis.
- Connect the ends of the runs of consecutive slices according to the rule that the two ends that are nearer to each other along the partitioned axis are linked.
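The axis ordering in the first step amounts to ranking the axes by the spread of the center coordinates. The sketch below reads "distance between the two most distant centers along an axis" as the maximum minus the minimum center coordinate on that axis. The function name `axes_by_extent` is hypothetical.

```python
def axes_by_extent(centers):
    """Return axis indices sorted by descending spread of the center coordinates."""
    d = len(centers[0])
    extent = [
        max(c[k] for c in centers) - min(c[k] for c in centers)
        for k in range(d)
    ]
    return sorted(range(d), key=lambda k: extent[k], reverse=True)


centers = [(0.0, 0.0, 0.0), (1.0, 10.0, 5.0), (2.0, 3.0, 4.0)]
# Spreads per axis are 2, 10 and 5, so axis 1 comes first.
print(axes_by_extent(centers))
```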

Fig. 3 An illustration of the connection of two runs in the first strategy

Second strategy (ISTR2)
- The longest coordinate is determined first and slices are created along that axis; each slice then determines its own longest coordinate among the remaining coordinates.
- The runs of the slices are connected according to the minimum Euclidean distance between the ends of the runs.
- Note that this distance is calculated over all axes, not along a single axis as in the first strategy.
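One possible reading of the connection rule is a greedy pass over the runs: a run is reversed whenever its far end lies closer to the tail of the curve built so far than its near end. The sketch below is an interpretation of the slides rather than the authors' code. The names `connect_runs` and `axis_distance` are assumptions, and the distance function is a parameter so that both the single-axis rule of the first strategy and the Euclidean rule of the second strategy can be tried.

```python
import math


def connect_runs(runs, dist):
    """Concatenate runs, reversing a run when that brings its start nearer the tail."""
    out = []
    for run in runs:
        if not run:
            continue
        if out and dist(out[-1], run[-1]) < dist(out[-1], run[0]):
            run = run[::-1]          # link the two nearer ends
        out.extend(run)
    return out


def axis_distance(k):
    """Distance measured along axis k only (first-strategy style)."""
    return lambda a, b: abs(a[k] - b[k])


runs = [[(0, 0), (0, 1), (0, 2)], [(1, 0), (1, 1), (1, 2)]]
print(connect_runs(runs, math.dist))          # Euclidean, second-strategy style
print(connect_runs(runs, axis_distance(1)))   # along the y-axis only
```

In this example both distance functions reverse the second run, so the curve continues from (0, 2) to (1, 2) instead of jumping back to (1, 0).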

Fig. 4 An illustration of the connection of two runs in the second strategy

4. Experimental Evaluation

Platform
- Intel Dual Core i3 M350 2.27 GHz, 4 GB RAM PC
- C#

How to assess the improved STR
- Use a benchmark program that implements the solution from "Efficient Similarity Search for Static Queries in Streaming Time Series" (*) with DFT, Haar DWT, and PAA.
- Compare ISTR1 and ISTR2 with Quadratic R-tree, R*-tree, and STR.
(*) In Proceedings of the 2014 International Conference on Green and Human Information Technology, Ho Chi Minh City, 2014, pp. 259-265.

Fig. 5 Query filter through resolution levels

Datasets
- Dataset 2 is created by randomly generating integers.
- Dataset 3 is created by generating integers in a random-walk fashion.
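Synthetic series of these two kinds can be generated as sketched below. The slides do not give the value range, the step size, or the random seed, so the parameters `low`, `high`, `step`, and `seed`, as well as both function names, are assumptions made only for illustration.

```python
import random


def random_integers(n, low=0, high=100, seed=0):
    """n integers drawn independently and uniformly from [low, high]."""
    rng = random.Random(seed)
    return [rng.randint(low, high) for _ in range(n)]


def random_walk(n, start=0, step=1, seed=0):
    """n integers where each value is the previous one plus or minus step."""
    rng = random.Random(seed)
    series, value = [], start
    for _ in range(n):
        value += rng.choice((-step, step))
        series.append(value)
    return series


print(random_integers(8))
print(random_walk(8))
```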

Other parameters
- 10,000 text files are created from each dataset to serve as static queries.
- The size of the queries varies from 8 to 512.
- R-tree with m = 8 and M = 20 (minimum and maximum number of entries per node)
- The search radius is ε = 0.01.

Results of range search
The true hits for each dataset are the same in all cases.

Space efficiency of the index structures
- Since STR, ISTR1, and ISTR2 build an R-tree in the same way, their R-trees have the same total number of nodes and the same percentage of full nodes.
- STR requires less memory than Quadratic R-tree and R*-tree. R*-tree, in turn, often uses less memory than R-tree.
- STR has the largest percentages of full nodes, all above 98.5%.

CPU times to build the arrays of R-trees (figure panels: Dataset 1, Dataset 2, Dataset 3)

CPU times for range search on arrays of R-trees (figure panels: Dataset 1, Dataset 2, Dataset 3)

Runtime efficiency of the index structures
- ISTR1 has the lowest CPU search times, and ISTR2 is nearly equal to ISTR1.
- These strategies partition point objects in R-trees better than STR, R*-tree, and Quadratic R-tree.
- ISTR2 incurs more overlap between the nodes of its R-trees than ISTR1 does and, as a result, is somewhat inferior to ISTR1 in CPU search time.

Summary of experimental evaluations
1. ISTR1 produces the best partition of the R-tree.
2. The bulk-loading methods often perform faster than Quadratic R-tree and R*-tree.
3. The bulk-loading methods use memory more efficiently than Quadratic R-tree and R*-tree.
4. Close-reinsert is not superior to far-reinsert in the insert operation of R*-tree.

5. Conclusions & Future work

Conclusions
- The work presents two heuristic strategies to improve STR for similarity search over time series based on R-trees.
- These strategies determine how to choose the longest axis in each step and how to connect the ends of the runs of consecutive slices.
- Extensive experiments show that these strategies, especially the first, significantly improve STR and outperform Quadratic R-tree and R*-tree.

Future Work
Compare our solution with RR*-tree and other bulk-loading methods.

References
1. S. T. Leutenegger, J. M. Edgington, and M. A. Lopez, "STR: A simple and efficient algorithm for R-tree packing," in Proceedings of the 13th International Conference on Data Engineering, 1997, pp. 497-506.
2. A. Guttman, "R-trees: A dynamic index structure for spatial searching," in Proceedings of the 1984 ACM SIGMOD International Conference on Management of Data, 1984, pp. 47-57.
3. N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger, "The R*-tree: An efficient and robust access method for points and rectangles," in ACM SIGMOD International Conference on Management of Data, Atlantic City, New Jersey, USA, May 23-25, 1990, pp. 322-331.
4. N. Mamoulis, Spatial Data Management, M. T. Özsu, Ed. Morgan & Claypool, 2012.
5. D. Greene, "An implementation and performance analysis of spatial data access methods," in Proceedings of the 5th International Conference on Data Engineering, Los Angeles, CA, USA, 1989, pp. 606-615.
6. B. C. Giao and D. T. Anh, "Efficient similarity search for static queries in streaming time series," in Proceedings of the 2014 International Conference on Green and Human Information Technology, Ho Chi Minh City, 2014, pp. 259-265.

Thanks for listening
Questions & Answers