Similarity Search for Numerous Patterns in Multiple High-Speed Time-Series Streams


1 Similarity Search for Numerous Patterns in Multiple High-Speed Time-Series Streams Bui Cong Giao, Duong Tuan Anh Presenter: Bui Cong Giao

2 Contents
1. Introduction
2. Preliminaries
3. The Proposed Method
4. Experimental Evaluation
5. Conclusions

3 Introduction
- Pattern discovery by similarity search in a streaming context, where new values are continuously appended as time progresses.
- Retrieval of newly arriving subsequences of streaming time series that approximately match static time-series patterns under the Euclidean distance (ED).
- An important scenario: incoming data come from many concurrent time-series streams at high rates, and there are numerous patterns.

4 Main contributions
- A novel multi-scale representation of time-series data for similarity search in a streaming context.
- Range search over streaming time series for numerous patterns, in which every pattern has its own search radius.

5 Preliminaries
Two ways to search for patterns in time-series sequences under ED:
- Whole matching: the sequences to be compared have the same length, e.g. UCR-ED (UCR-Euclidean Distance).
- Subsequence matching: the sequence is partitioned into many segments, and the search proceeds from the first segment to the last one, e.g. SS-NOS (Similar Search using Non-Overlapped Segmentation).
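Whole matching under ED can be sketched as follows. This is a minimal illustration, not code from the talk: the function names and the choice to z-normalize both sequences before comparing are assumptions, and a constant sequence is mapped to all zeros purely for convenience.

```python
import math


def z_normalize(seq):
    """Return seq shifted to mean 0 and scaled to standard deviation 1."""
    n = len(seq)
    mean = sum(seq) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in seq) / n)
    if std == 0.0:
        # Assumption: a constant sequence normalizes to all zeros.
        return [0.0] * n
    return [(x - mean) / std for x in seq]


def euclidean_distance(a, b):
    """Euclidean distance between two sequences of equal length."""
    if len(a) != len(b):
        raise ValueError("whole matching requires equal lengths")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def whole_match(candidate, pattern, radius):
    """True if the z-normalized candidate lies within radius of the pattern."""
    return euclidean_distance(z_normalize(candidate), z_normalize(pattern)) <= radius
```

Because both sequences are z-normalized, a candidate that differs from the pattern only by offset and scale, such as [10, 20, 30, 40] against [1, 2, 3, 4], is reported as a match even with a very small radius.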

6 UCR-ED
- Introduced by Rakthanmanon et al. in 2012 [2].
- Conducts similarity search for patterns in static time-series sequences.
- Reads the time-series sequence in many big sections. UCR-ED then applies z-normalization incrementally while the window slides over each big section to find matching pairs.
- UCR-ED is modified to support multi-threading; the modified method is referred to as TUCR-ED.
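Incremental z-normalization over a sliding window is usually done by maintaining a running sum and a running sum of squares, so the mean and standard deviation are updated in constant time per new point. The sketch below illustrates that idea only; the class name `SlidingStats` and its methods are hypothetical and are not taken from UCR-ED.

```python
import math
from collections import deque


class SlidingStats:
    """Running mean/std of the last `window` values, updated in O(1) per push."""

    def __init__(self, window):
        self.window = window
        self.buf = deque()
        self.total = 0.0      # running sum of values in the window
        self.total_sq = 0.0   # running sum of squared values in the window

    def push(self, x):
        self.buf.append(x)
        self.total += x
        self.total_sq += x * x
        if len(self.buf) > self.window:
            old = self.buf.popleft()
            self.total -= old
            self.total_sq -= old * old

    def mean_std(self):
        n = len(self.buf)
        mean = self.total / n
        # Clamp at zero to guard against tiny negative values from rounding.
        var = max(self.total_sq / n - mean * mean, 0.0)
        return mean, math.sqrt(var)

    def normalized(self):
        """z-normalized copy of the current window (zeros if it is constant)."""
        mean, std = self.mean_std()
        if std == 0.0:
            return [0.0] * len(self.buf)
        return [(x - mean) / std for x in self.buf]
```

Only the two running sums change when the window slides, so the statistics never have to be recomputed from scratch for each new subsequence.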

7 SS-NOS
- Similar Search using Non-Overlapped Segmentation.
- Introduced by us in 2014 [1].
- Similarity search for patterns over streaming time series using non-overlapped segmentation.
Fig. 1 The non-overlapped segmentation of a time-series pattern

8 SS-NOS (cont.)
- Phase 1: Retrieve the coefficient vectors of the z-normalized non-overlapped segments of patterns by DFT, Haar DWT, or PAA. Store the coefficient vectors in an array of R-trees as a multi-resolution index structure.
- Phase 2: Equipped with multi-threading, SS-NOS carries out similarity search in streaming time series using the array of R-trees.
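Two of the transforms named above, PAA and Haar DWT, can be sketched as follows. The sketch shows the transforms on a single sequence and the standard PAA lower bound on ED; it does not reproduce the segmentation or the R-tree indexing of SS-NOS, and all function names are illustrative assumptions.

```python
import math


def paa(seq, w):
    """Piecewise Aggregate Approximation: means of w equal-length frames."""
    n = len(seq)
    if n % w != 0:
        raise ValueError("this sketch requires len(seq) to be a multiple of w")
    size = n // w
    return [sum(seq[i * size:(i + 1) * size]) / size for i in range(w)]


def paa_lower_bound(a, b, w):
    """sqrt(n/w) * ED(PAA(a), PAA(b)), which never exceeds ED(a, b)."""
    n = len(a)
    pa, pb = paa(a, w), paa(b, w)
    return math.sqrt(n / w) * math.sqrt(sum((x - y) ** 2 for x, y in zip(pa, pb)))


def haar_dwt(seq):
    """Orthonormal Haar wavelet transform of a power-of-two-length sequence."""
    n = len(seq)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = [float(x) for x in seq]
    length = n
    root2 = math.sqrt(2.0)
    while length > 1:
        half = length // 2
        avg = [(out[2 * i] + out[2 * i + 1]) / root2 for i in range(half)]
        diff = [(out[2 * i] - out[2 * i + 1]) / root2 for i in range(half)]
        out[:length] = avg + diff  # coarse coefficients first, details after
        length = half
    return out
```

Both transforms support filtering without false dismissals: the PAA bound never exceeds the true ED, and the orthonormal Haar transform preserves ED, so a distance computed on a prefix of the coefficients can only underestimate it.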

9 Restrictions of SS-NOS
- If the remainder of a pattern is long, the filtering process is likely to be inefficient for that pattern, since it can fail to filter out unpromising patterns.
- SS-NOS performs range search with one search radius for all time-series patterns, which is inflexible and rather impractical.

10 The Proposed Method
- Similarity search for patterns over streaming time series using overlapped segmentation: Similar Search using Overlapped Segmentation (SS-OS).
Fig. 2 The overlapped segmentation of a time-series pattern

11 The Proposed Method (cont.)
- SS-OS is basically similar to SS-NOS in how it conducts similarity search.
Fig. 3 SS-OS conducts similarity search for patterns in a time-series stream.

12 The Proposed Method (cont.)
Algorithm RangeSearch(S)
When a new data point T_n of S arrives: // Phase 2
 1. postcheckset ← ∅ // the set of patterns for post-checking
 2. pset ← P // the set of potential patterns
 3. for i = 1 to maxlevel
 4.     Incrementally normalize s_i
 5. for i = 1 to maxlevel
 6.     Retrieve v_i
 7.     pset ← SearchInRtree(R-tree[i], pset, v_i)
 8.     if pset = ∅ then
 9.         break // go to the post-checking phase
10.     foreach (p in pset)
11.         if i is the maximum filter level of p then
12.             postcheckset ← postcheckset ∪ {p}
13.             Remove p from pset
14. foreach (p in postcheckset) // post-checking phase
15.     Normalize c
16.     Compute the ED between np and the z-normalized c to check whether the distance is within p.r
The core subroutine, SearchInRtree, searches for patterns whose i-th coefficient vector is similar to v_i within their own search radius. The range search takes place in the R-tree of the i-th filter level.
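The filter-then-post-check structure of RangeSearch can be illustrated with a simplified sketch. It is not SS-OS itself: a linear scan stands in for the R-trees, PAA over the whole candidate stands in for the overlapped segmentation, and the names `Pattern`, `LEVEL_WIDTHS`, and `range_search` are hypothetical. What it does keep is the multi-level filtering, the per-pattern maximum filter level, and the per-pattern search radius.

```python
import math
from dataclasses import dataclass


def z_normalize(seq):
    n = len(seq)
    mean = sum(seq) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in seq) / n)
    return [0.0] * n if std == 0.0 else [(x - mean) / std for x in seq]


def ed(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def paa(seq, w):
    size = len(seq) // w
    return [sum(seq[i * size:(i + 1) * size]) / size for i in range(w)]


@dataclass
class Pattern:
    name: str
    values: list
    radius: float  # each pattern carries its own search radius


LEVEL_WIDTHS = (2, 4, 8)  # PAA sizes per filter level (illustrative assumption)


def range_search(buffer, patterns):
    """Names of patterns matched by the newest subsequence of the stream buffer."""
    pset = [p for p in patterns if len(p.values) <= len(buffer)]
    cand = {p.name: z_normalize(buffer[-len(p.values):]) for p in pset}
    norm = {p.name: z_normalize(p.values) for p in pset}
    postcheckset = []
    for w in LEVEL_WIDTHS:
        survivors = []
        for p in pset:
            n = len(p.values)
            if w >= n or n % w != 0:
                postcheckset.append(p)  # p has passed its maximum filter level
                continue
            # PAA lower bound on ED: pruning here causes no false dismissals.
            lb = math.sqrt(n / w) * ed(paa(cand[p.name], w), paa(norm[p.name], w))
            if lb <= p.radius:
                survivors.append(p)
        pset = survivors
        if not pset:
            break  # nothing left to filter; go to post-checking
    postcheckset.extend(pset)  # patterns that passed every filter level
    # Post-checking: the full distance decides, against each pattern's own radius.
    return [p.name for p in postcheckset if ed(cand[p.name], norm[p.name]) <= p.radius]
```

A real implementation would precompute the pattern-side vectors once and index them, as the slides describe; the sketch recomputes them on every call only to stay short.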

13 Experimental Evaluation
- Platform: Intel Dual Core i3 M GHz, 4GB RAM PC; C#.
- Parameters: the circular buffers of the time-series streams have a size of 1,024. The minimum node occupancy of R-trees is 4 and the maximum node occupancy is

14 Three query sets were created from the time-series dataset. The number of queries in each query set is The length of the query sequences varies from 8 to

15 Experimental Evaluation
- Implement range search by UCR-ED, TUCR-ED, SS-NOS, and SS-OS on the three pattern sets with the same search radius (0.01).
- Use Haar DWT in SS-NOS and SS-OS.
- Compare the search methods in terms of their precision, the number of distance function calls in post-processing, and wall-clock time.

16 Experimental Results
- SS-OS has the same precision as UCR-ED and SS-NOS.
- The numbers of distance function calls of UCR-ED and TUCR-ED are very large, while SS-OS and SS-NOS use multi-scale filtering, so their numbers are very small.
- The pruning power of SS-OS is over 99.92%, whereas that of SS-NOS is over 99.89%.
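The slides do not define pruning power. A common definition, assumed here, is the fraction of candidate comparisons discarded by the filter before the full distance is computed. Under that assumption, the small gap between the two percentages corresponds to a noticeably larger gap in post-processing work. The counts in the usage lines are made-up illustrations, not experimental data.

```python
def pruning_power(distance_calls, total_candidates):
    """Fraction of candidates discarded without a full distance computation."""
    if total_candidates <= 0:
        raise ValueError("total_candidates must be positive")
    return 1.0 - distance_calls / total_candidates


def remaining_fraction(power):
    """Fraction of candidates that still reach the post-processing phase."""
    return 1.0 - power


# Illustrative counts only: 8 full distance calls out of 10,000 candidates.
example = pruning_power(8, 10_000)

# Ratio of post-processing work implied by the two reported lower bounds.
ratio = remaining_fraction(0.9989) / remaining_fraction(0.9992)
```

With the reported bounds taken at face value, `ratio` is about 1.375, i.e. SS-NOS would send roughly 1.4 times as many candidates to post-processing as SS-OS.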

17 Experimental Results
Fig. 4 The number of distance function calls in the post-processing phase

18 Experimental Results
- On average, the wall-clock times of SS-OS and SS-NOS are small, varying from 16 seconds to 19 seconds.
- The wall-clock times of UCR-ED for the three pattern sets are roughly 10 minutes, 13 minutes, and 11 minutes, respectively.
- The wall-clock times of TUCR-ED for the three pattern sets are roughly 2 minutes each.

19 Experimental Results
- SearchInRtree in Algorithm RangeSearch performs range search in R-trees precisely.
- The average CPU time for RangeSearch to process a newly arriving data point is tiny in all cases, varying from 2,000 ticks (*) to 2,600 ticks.
- PAA has the best run-time performance.
- The method is able to perform similarity search for numerous patterns over multiple high-speed time-series streams.
(*) 1 millisecond = 10,000 ticks
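Using the conversion stated in the footnote, the reported tick counts can be turned into milliseconds and into a rough upper bound on throughput. The throughput figure assumes points are processed strictly one after another on one thread, which is an assumption made here for illustration and not a claim from the slides.

```python
TICKS_PER_MS = 10_000  # from the footnote: 1 millisecond = 10,000 ticks


def ticks_to_ms(ticks):
    """Convert a tick count to milliseconds."""
    return ticks / TICKS_PER_MS


def max_points_per_second(ticks_per_point):
    """Upper bound on data points per second if processed sequentially."""
    return 1000 * TICKS_PER_MS / ticks_per_point


fastest_ms = ticks_to_ms(2_000)   # 0.2 ms per data point
slowest_ms = ticks_to_ms(2_600)   # 0.26 ms per data point
best_rate = max_points_per_second(2_000)
worst_rate = max_points_per_second(2_600)
```

So the reported range corresponds to 0.2 to 0.26 ms per data point, or on the order of 3,800 to 5,000 data points per second under the sequential assumption.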

20 Conclusions
- Propose an efficient multi-scale representation of time-series data, the overlapped segmentation, for similarity search.
- Perform range search for time-series patterns in which each pattern has its own search radius.
- Work precisely and respond quickly while dealing with multiple streaming time series at high-speed rates.

21 References
[1] B. C. Giao and D. T. Anh, "Efficient similarity search for static queries in streaming time series," in Proceedings of International Conference on Green and Human Information Technology (ICGHIT) 2014, HoChiMinh City, 2014, pp.
[2] T. Rakthanmanon, B. Campana, A. Mueen, G. Batista, B. Westover, Q. Zhu, J. Zakaria, and E. Keogh, "Searching and mining trillions of time series subsequences under Dynamic Time Warping," in Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, August 12-16, 2012, pp.
[3] R. Agrawal, C. Faloutsos, and A. Swami, "Efficient similarity search in sequence databases," in Proceedings of the 4th International Conference on Foundations of Data Organization and Algorithms (FODO '93), Chicago, Illinois, USA, October 13-15, 1993, pp.
[4] K.-p. Chan and A. W.-c. Fu, "Efficient time series matching by wavelets," in Proceedings of the 15th IEEE International Conference on Data Engineering, March 23-26, 1999, pp.
[5] A. Guttman, "R-tree: A dynamic index structure for spatial searching," in Proceedings of the 1984 ACM SIGMOD International Conference on Management of Data, New York, NY, USA, 1984, pp.
[6] E. Keogh, K. Chakrabarti, S. Mehrotra, and M. Pazzani, "Locally adaptive dimensionality reduction for indexing large time series databases," in Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data, May 2001, pp.
[7] E. Keogh. The UCR time series classification/clustering page. [Online].

22 Thanks for listening Questions & Answers
