Improving Sort-Tile-Recursive Algorithm for R-tree Packing in Indexing Time Series Bui Cong Giao, Duong Tuan Anh Presenter: Bui Cong Giao
Contents 1. Introduction 2. Preliminaries 3. Strategies for improving STR 4. Experimental Evaluation 5. Conclusions & Future work 26/01/2015 2
R-tree The most popular index structure for spatial data management. A collection of nodes that are hierarchically organized as a search tree. Fig. 0 Node structure of R-tree
R-tree Basic operations on R-tree: Insert (many versions), Search, Delete. R*-tree is a well-known improvement of R-tree.
Fig. 1 A set of MBRs and an R-tree built for it
Bulk-loading Bulk-loading builds an R-tree from all the spatial objects at once instead of inserting them one at a time. Advantages: faster loading of the R-tree with all spatial objects at once; fewer empty spaces in the nodes of the R-tree; better partitioning of spatial objects into the nodes of the R-tree.
Sort-based loading A kind of bulk-loading for R-tree. Simple to implement, yet gives good query performance. The only method commonly used in DBMS and GIS at the moment. Sort-Tile-Recursive (STR) is a typical example of sort-based loading.
Motivations STR was not compared with R*-tree in the paper that introduced it. Researchers continuously attempt to improve R-tree in performance and space.
Main contributions Two strategies that determine how to partition spatial objects into the nodes of R-tree and how to connect the ends of consecutive runs into a suboptimal space-filling curve.
STR with d = 2 Set $P = \lceil r/M \rceil$. Sort the rectangles by x-coordinate and partition them into $S = \lceil \sqrt{P} \rceil$ vertical slices. A slice consists of a run of $S \cdot M$ consecutive rectangles. Sort the rectangles of each slice by y-coordinate. Pack them into nodes by grouping them in runs of M.
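The two-dimensional tiling step can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function name `str_pack_2d` is hypothetical, and each rectangle is assumed to be represented only by its center `(x, y)`.

```python
import math

def str_pack_2d(centers, M):
    """Group 2-D rectangle centers into leaf nodes of at most M entries (STR, d = 2)."""
    r = len(centers)
    P = math.ceil(r / M)           # number of leaf nodes needed
    S = math.ceil(math.sqrt(P))    # number of vertical slices
    by_x = sorted(centers, key=lambda c: c[0])
    leaves = []
    # Each vertical slice is a run of S * M consecutive centers in x-order.
    for i in range(0, r, S * M):
        slice_ = sorted(by_x[i:i + S * M], key=lambda c: c[1])
        # Within a slice, pack consecutive groups of M centers in y-order.
        for j in range(0, len(slice_), M):
            leaves.append(slice_[j:j + M])
    return leaves

grid = [(x, y) for x in range(6) for y in range(6)]
print(len(str_pack_2d(grid, 4)))   # 36 centers with M = 4 give 9 leaves
```

On a 6 x 6 grid with M = 4, P = 9 and S = 3, so each slice holds 12 centers and yields three full leaves.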
STR with d > 2 Sort the hyper-rectangles by the first coordinate and partition them into $S = \lceil P^{1/d} \rceil$ slices. A slice consists of a run of $M \cdot \lceil P^{(d-1)/d} \rceil$ sorted consecutive hyper-rectangles. Each slice is now processed recursively using the remaining $d - 1$ coordinates.
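A recursive sketch of the general case is given below. It is an illustration under stated assumptions, not the authors' code: the name `str_tile` is hypothetical, hyper-rectangles are represented by their centers, and P is recomputed from the size of each slice at every recursion level.

```python
import math

def str_tile(centers, M, axis=0):
    """Recursively tile d-dimensional centers into leaf groups of at most M entries."""
    d = len(centers[0])
    ordered = sorted(centers, key=lambda c: c[axis])
    remaining = d - axis                     # coordinates not yet processed
    if remaining == 1:
        # Last coordinate: pack consecutive groups of M.
        return [ordered[i:i + M] for i in range(0, len(ordered), M)]
    P = math.ceil(len(ordered) / M)          # leaves needed for this subset
    # Run length M * ceil(P^((k-1)/k)) with k = remaining; the rounding only
    # guards against floating-point noise in the fractional power.
    run = M * math.ceil(round(P ** ((remaining - 1) / remaining), 9))
    leaves = []
    for i in range(0, len(ordered), run):
        leaves.extend(str_tile(ordered[i:i + run], M, axis + 1))
    return leaves

cube = [(x, y, z) for x in range(4) for y in range(4) for z in range(4)]
print(len(str_tile(cube, 2)))                # 64 centers with M = 2 give 32 leaves
```

With two coordinates the run length reduces to $M \cdot \lceil \sqrt{P} \rceil$, which matches the d = 2 case.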
Build an R-tree from a collection of sorted spatial objects 1. The r rectangles are ordered in P consecutive groups of M rectangles, where each group of M is placed in the same leaf-level node. 2. Group M successive leaf nodes into a parent node. 3. Recursively pack these parent nodes into the nodes at the higher level, proceeding upwards, until the root node is created.
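The three steps above can be sketched as a bottom-up loop. The representation is an assumption made for illustration: rectangles are `(xmin, ymin, xmax, ymax)` tuples, nodes are plain dictionaries, and the names `mbr_of` and `build_packed_rtree` are hypothetical.

```python
def mbr_of(boxes):
    """Minimum bounding rectangle of a list of (xmin, ymin, xmax, ymax) boxes."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))

def build_packed_rtree(sorted_rects, M):
    """Pack already-sorted rectangles into an R-tree; return (root, height)."""
    if not sorted_rects:
        raise ValueError("at least one rectangle is required")
    # Step 1: consecutive groups of M rectangles become leaf nodes.
    level = [{"mbr": mbr_of(sorted_rects[i:i + M]),
              "children": sorted_rects[i:i + M]}
             for i in range(0, len(sorted_rects), M)]
    height = 1
    # Steps 2 and 3: group M successive nodes into a parent, level by level.
    while len(level) > 1:
        groups = [level[i:i + M] for i in range(0, len(level), M)]
        level = [{"mbr": mbr_of([n["mbr"] for n in g]), "children": g}
                 for g in groups]
        height += 1
    return level[0], height

rects = [(i, 0, i + 1, 1) for i in range(10)]
root, height = build_packed_rtree(rects, 3)
print(height, root["mbr"])   # 10 rectangles with M = 3: 4 leaves, 2 parents, 1 root
```

Only the last node of each level can be partly filled, which is why packed trees have a high percentage of full nodes.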
Space-Filling Curves A space-filling curve visits all the points in a d-dimensional grid exactly once and never crosses itself. Space-filling curves are often used to cluster nearby points into the same leaf nodes. The Z-order and the Hilbert curve are typical examples.
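As a small illustration of how a space-filling curve orders grid points, the sketch below computes a two-dimensional Z-order (Morton) key by interleaving the bits of the coordinates. The function name `z_order_key` and the 16-bit default are assumptions for this example; coordinates are assumed to be non-negative integers.

```python
def z_order_key(x, y, bits=16):
    """Interleave the low `bits` bits of x and y into a single Z-order key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)        # x supplies the even bit positions
        key |= ((y >> i) & 1) << (2 * i + 1)    # y supplies the odd bit positions
    return key

# Sorting by the key visits a 2 x 2 grid in the characteristic "Z" pattern.
cells = [(0, 0), (1, 0), (0, 1), (1, 1)]
print(sorted(cells, key=lambda c: z_order_key(*c)))
```

Sorting points by such a key and then packing consecutive runs into leaves is one way nearby points end up in the same node.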
Intuitive improvements on STR STR sorts the hyper-rectangles according to the first coordinate of their centers. We instead choose the longest coordinate, that is, the coordinate along which the two most distant hyper-rectangle centers are farthest apart. STR connects the ends of the runs of consecutive slices to make a space-filling curve similar to the Z-order. We connect the ends of the runs of consecutive slices to make a suboptimal space-filling curve.
First strategy (ISTR1) Order the axes by decreasing distance between the two most distant hyper-rectangle centers along each axis. Connect the ends of the runs of consecutive slices by linking the two ends that are nearer to each other along the partitioned axis.
Fig. 3 An illustration of the connection of two runs in the first strategy
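One possible reading of the first strategy is sketched below. It is an interpretation, not the authors' code: the names `axes_by_spread` and `connect_on_axis` are hypothetical, and the connection rule is assumed to mean that the next run is reversed whenever its far end lies closer, along the given axis, to the tail of the previous run.

```python
def axes_by_spread(centers):
    """Axis indices in descending order of the spread of the centers along each axis."""
    d = len(centers[0])
    spread = [max(c[k] for c in centers) - min(c[k] for c in centers)
              for k in range(d)]
    return sorted(range(d), key=lambda k: spread[k], reverse=True)

def connect_on_axis(prev_run, next_run, axis):
    """Orient next_run so that it starts at the end nearer to prev_run's tail on `axis`."""
    tail = prev_run[-1][axis]
    if abs(next_run[-1][axis] - tail) < abs(next_run[0][axis] - tail):
        return next_run[::-1]      # the far end is nearer, so traverse the run backwards
    return next_run

print(axes_by_spread([(0, 0, 0), (1, 10, 2), (3, 5, 9)]))
print(connect_on_axis([(0, 0), (0, 5)], [(1, 0), (1, 6)], axis=1))
```

In the second call the previous run ends at coordinate 5 on axis 1, so the next run is traversed from `(1, 6)` down to `(1, 0)`.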
Second strategy (ISTR2) The longest coordinate is determined first, and slices are created along that axis; each slice then has its own longest coordinate, chosen from the remaining coordinates. The runs of the slices are connected according to the minimum Euclidean distance between the ends of the runs. Note that the distance is calculated over all axes, not along a single axis as in the first strategy.
Fig. 4 An illustration of the connection of two runs in the second strategy
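A comparable sketch for the second strategy follows. Again this is an interpretation under stated assumptions: `longest_axis` and `connect_euclidean` are hypothetical names, the previous run is taken as already oriented, and only the next run may be reversed. It relies on `math.dist`, available from Python 3.8.

```python
import math

def longest_axis(centers, candidates):
    """Among the candidate axes, pick the one with the largest spread of centers."""
    return max(candidates,
               key=lambda k: max(c[k] for c in centers) - min(c[k] for c in centers))

def connect_euclidean(prev_run, next_run):
    """Orient next_run so that it starts at the end nearer to prev_run's tail."""
    tail = prev_run[-1]
    # The distance is measured over all axes, not along a single axis.
    if math.dist(tail, next_run[-1]) < math.dist(tail, next_run[0]):
        return next_run[::-1]
    return next_run

print(longest_axis([(0, 0, 0), (1, 10, 2), (3, 5, 9)], candidates=[0, 2]))
print(connect_euclidean([(0, 0), (0, 5)], [(1, 0), (1, 6)]))
```

Calling `longest_axis` separately on each slice mirrors the idea that every slice chooses its own longest coordinate from the remaining ones.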
Platform Intel Dual Core i3 M350 2.27 GHz PC, 4 GB RAM; C#
How to assess the improved STR Use a benchmark program that implements the solution in "Efficient similarity search for static queries in streaming time series" (*) with DFT, Haar DWT, and PAA. Compare ISTR1 and ISTR2 with Quadratic R-tree, R*-tree, and STR. (*) In Proceedings of the 2014 International Conference on Green and Human Information Technology, HoChiMinh City, 2014, pp. 259-265.
Fig. 5 Query filter through resolution levels
Datasets Dataset 2 is created by randomly generating integers. Dataset 3 is created by generating integers in random walk fashion.
Other parameters 10,000 text files are created from every dataset to serve as static queries. The size of the queries varies from 8 to 512. R-tree with m = 8 and M = 20. The search radius is ε = 0.01.
Results of range search The true hits for each dataset in all cases are the same.
Space efficiency of the index structures Since STR, ISTR1, and ISTR2 build an R-tree in the same way, their R-trees have the same total number of nodes and the same percentage of full nodes. STR requires less memory than Quadratic R-tree and R*-tree. Also, R*-tree is often better than R-tree in saving memory. STR has the largest percentage of full nodes, over 98.5% in all cases.
CPU times to build the arrays of R-trees (figure panels: Dataset 1, Dataset 2, Dataset 3)
CPU times for range search on arrays of R-trees (figure panels: Dataset 1, Dataset 2, Dataset 3)
Runtime efficiency of the index structures ISTR1 has the lowest CPU search times, and ISTR2 is nearly the same as ISTR1. These strategies partition point objects in R-trees better than STR, R*-tree, and Quadratic R-tree. ISTR2 incurs more overlap between the nodes of R-trees than ISTR1 does and, as a result, is somewhat slower than ISTR1 in CPU search time.
Summary of experimental evaluations 1. ISTR1 makes the best partition on R-tree. 2. The bulk-loading methods often perform faster than Quadratic R-tree and R*-tree. 3. The bulk-loading methods use memory more efficiently than Quadratic R-tree and R*-tree. 4. Close-reinsert is not superior to far-reinsert in the insert operation of R*-tree.
Conclusions The work presents two heuristic strategies to improve STR in similarity search over time series based on R-trees. These strategies determine how to choose the longest axis in each step and how to connect the ends of the runs of consecutive slices. Extensive experiments show that these strategies, especially the first, significantly improve STR and outperform Quadratic R-tree and R*-tree.
Future Work Compare our solution with RR*-tree and other bulk-loading methods
References
1. S. T. Leutenegger, J. M. Edgington, and M. A. Lopez, "STR: A simple and efficient algorithm," in Proceedings 13th International Conference on Data Engineering, 1997, pp. 497-506.
2. A. Guttman, "R-tree: A dynamic index structure for spatial searching," in Proceedings of the 1984 ACM SIGMOD International Conference on Management of Data, 1984, pp. 47-57.
3. N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger, "The R*-tree: An efficient and robust access method for points and rectangles," in ACM SIGMOD International Conference on Management of Data, Atlantic City, New Jersey, USA, May 23-25, 1990, pp. 322-331.
4. N. Mamoulis, Spatial Data Management, M. T. Özsu, Ed. Morgan & Claypool, 2012.
5. D. Greene, "An implementation and performance analysis of spatial data access methods," in Proceedings the 5th International Conference on Data Engineering, Los Angeles, CA, USA, 1989, pp. 606-615.
6. B. C. Giao and D. T. Anh, "Efficient similarity search for static queries in streaming time series," in Proceedings of the 2014 International Conference on Green and Human Information Technology, HoChiMinh City, 2014, pp. 259-265.
Thanks for listening Questions & Answers