Time series databases. Indexing Time Series. Time series data. Time series are ubiquitous

Time series databases
Indexing Time Series

A time series is a sequence of real numbers representing the measurements of a real variable at equal time intervals, for example:
  Stock prices
  Volume of sales over time
  Daily temperature readings
  ECG data
A time series database is a large collection of time series.

Time series data
A time series is a collection of observations made sequentially in time.
[Figure: an example time series, value axis against time axis]

Time series are ubiquitous
People measure things...
  The president's approval rating.
  Their blood pressure.
  The annual rainfall in Santa Barbara.
  The value of their Yahoo stock.
  The number of web hits per second.
...and things change over time.
Thus time series occur in numerous medical, scientific and business applications.

Applications
Search: similarity search in time series
  A doctor searching for a particular pattern (one that implies a heart irregularity) in the ECG database, for diagnosis.
  A stock analyst searching for a particular stock price pattern in the stock database, for prediction.
Data mining
  Classification: classify patients based on ECG patterns.
  Clustering: group websites with similar traffic patterns.
  Association rules: if we see this peak followed by this plateau, then we will have a downward trend, with a given confidence and support.

Time series problems (from a databases perspective)
The similarity problem
  Given $X = x_1, x_2, \ldots, x_n$ and $Y = y_1, y_2, \ldots, y_n$, define and compute Sim(X, Y).
  E.g. do stocks X and Y have similar movements?
Retrieve similar time series (similarity queries)

Types of queries
  Whole match vs. sub-pattern match
  Range query vs. k-NN
  Join (self)

Examples
  Find companies with similar stock prices over a time interval.
  Find products with similar sell cycles.
  Cluster users with similar credit card utilization.
  Find similar subsequences in DNA sequences.
  Find scenes in video streams.

Problems
Define the similarity (or distance) function.
Find an efficient algorithm to retrieve similar time series from a database (faster than a sequential scan).
The similarity function depends on the application.
[Figure: three stock series, price ($) against day over 365 days; the distance function is chosen by an expert (e.g., Euclidean distance)]

Euclidean similarity measure
Euclidean model
View each sequence as a point in n-dimensional Euclidean space (n = length of each sequence).
Define the (dis-)similarity between sequences X and Y as
  $L_p(X, Y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$
  p = 1: Manhattan distance
  p = 2: Euclidean distance

Euclidean distance between two time series $Q = \{q_1, q_2, \ldots, q_n\}$ and $S = \{s_1, s_2, \ldots, s_n\}$:
  $D(Q, S) = \sqrt{\sum_{i=1}^{n} (q_i - s_i)^2}$
[Figure: a query Q of n datapoints compared against each database series S of n datapoints; the database is listed with Distance and Rank columns]
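The $L_p$ definition can be tried out with a minimal Python sketch. The function name `lp_distance` and the sample values are illustrative and do not come from the slides.

```python
def lp_distance(x, y, p=2):
    """L_p distance between two equal-length sequences (p=1 Manhattan, p=2 Euclidean)."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


# Illustrative values only
q = [1.0, 2.0, 3.0, 4.0]
s = [2.0, 2.0, 1.0, 4.0]
print(lp_distance(q, s, p=1))  # Manhattan: 1 + 0 + 2 + 0 = 3.0
print(lp_distance(q, s, p=2))  # Euclidean: square root of 1 + 4
```

Each call touches every element once, which is the O(n) cost that makes this measure cheap compared with the warping-based measures.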

Advantages of the Euclidean distance
Easy to compute: O(n).
Allows scalable solutions to other problems, such as indexing, clustering, etc.

Disadvantages of the Euclidean distance
Does not allow for different baselines: e.g., stocks X and Y fluctuate around different price levels.
Does not allow for different scales: e.g., stocks X and Y fluctuate over price ranges of different widths.
Synchronization of signals
Missing points

Dynamic Time Warping [Berndt, Clifford, 1994]
Allows acceleration and deceleration of signals along the time dimension.
Basic idea
  Consider $X = x_1, x_2, \ldots, x_n$ and $Y = y_1, y_2, \ldots, y_n$.
  We are allowed to extend each sequence by repeating elements.
  The Euclidean distance is now calculated between the extended sequences X' and Y'.
  Matrix M, where $m_{ij} = d(x_i, y_j)$.

Example
[Figure: Euclidean distance vs DTW]

Dynamic Time Warping [Berndt, Clifford, 1994]
[Figure: a warping path through the matrix of X ($x_1, x_2, x_3, \ldots$) against Y ($y_1, y_2, y_3, \ldots$), confined between the lines j = i + w and j = i - w]

Restrictions on warping paths
Monotonicity: the path should not go down or to the left.
Continuity: no elements may be skipped in a sequence.
Warping window: $|i - j| \le w$

Formulation
Let D(i, j) refer to the dynamic time warping distance between the subsequences $x_1, x_2, \ldots, x_i$ and $y_1, y_2, \ldots, y_j$:
  $D(i, j) = |x_i - y_j| + \min \{ D(i-1, j),\ D(i-1, j-1),\ D(i, j-1) \}$

Solution by Dynamic Programming
Basic implementation: $O(n^2)$, where n is the length of the sequences; the problem has to be solved for each (i, j) pair.
If a warping window is specified: O(nw); solve only for the (i, j) pairs where $|i - j| \le w$.
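The recurrence and the warping window can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the name `dtw`, the sentinel row and column, and the sample sequences are assumptions made here.

```python
import math


def dtw(x, y, w=None):
    """DTW distance with local cost |x_i - y_j| and an optional warping window w."""
    n, m = len(x), len(y)
    if w is None:
        w = max(n, m)            # no window: every (i, j) pair is allowed
    w = max(w, abs(n - m))       # the window must at least cover the length gap
    # D[i][j] = DTW distance between x[:i] and y[:j]; row 0 and column 0 are sentinels
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        # Only cells with |i - j| <= w are filled, giving O(n*w) work
        for j in range(max(1, i - w), min(m, i + w) + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i - 1][j - 1], D[i][j - 1])
    return D[n][m]


# Illustrative values: y has the same shape as x, delayed by one step
x = [0, 1, 2, 3, 2, 1, 0]
y = [0, 0, 1, 2, 3, 2, 1]
print(dtw(x, y))        # 1.0: warping absorbs the delay, only the final 0 vs 1 remains
print(dtw(x, y, w=0))   # 6.0: a zero-width window leaves only the diagonal path
```

With `w=0` the path cannot leave the diagonal, so the result is the plain point-by-point sum of absolute differences; widening the window lets the delayed peak line up.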

Longest Common Subsequence Measures
(Allowing for Gaps in Sequences)

Basic LCS Idea
  X = 3, , 5, 7, 4, 8, , 7
  Y = , 5, 4, 7, 3, , 8, 6
  LCS = , 5, 7,
  Sim(X, Y) = |LCS| or Sim(X, Y) = |LCS| / n
[Figure: the matched elements of two sequences, with a skipped gap marked]

LCS for time series
We can extend the idea of LCS to time series using a threshold ε for matching.
The LCS for two subsequences $x_1, x_2, \ldots, x_i$ and $y_1, y_2, \ldots, y_j$ is:
  $LCS[i, j] = \begin{cases} 0 & \text{if } i = 0 \text{ or } j = 0 \\ 1 + LCS[i-1, j-1] & \text{if } |x_i - y_j| < \varepsilon \\ \max(LCS[i-1, j],\ LCS[i, j-1]) & \text{otherwise} \end{cases}$

Metric properties and indexing
Edit distance is a metric, so we can use metric indexes.
DTW and LCS are not metrics, so we have to use specialized indexes.
How to convert this into a distance?
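The thresholded LCS recurrence translates directly into a table-filling routine. The Python sketch below is illustrative only: the name `lcss`, the threshold value and the sample sequences are assumptions, not taken from the slides.

```python
def lcss(x, y, eps):
    """Length of the longest common subsequence, where x_i matches y_j if |x_i - y_j| < eps."""
    n, m = len(x), len(y)
    # L[i][j] = LCS length of x[:i] and y[:j]; row 0 and column 0 stay 0
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if abs(x[i - 1] - y[j - 1]) < eps:
                L[i][j] = 1 + L[i - 1][j - 1]
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return L[n][m]


# Illustrative values: 9.0 is an outlier that the LCS simply skips
x = [1.0, 2.0, 9.0, 3.0, 4.0]
y = [1.1, 2.1, 3.1, 4.1, 5.0]
length = lcss(x, y, eps=0.5)
print(length)            # 4
print(length / len(x))   # 0.8, the normalized similarity |LCS| / n
```

The outlier costs one unmatched element instead of dominating the score, which is the robustness to gaps that motivates the measure.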

Similarity retrieval
Range query: find all time series S where $D(Q, S) \le \varepsilon$.
Nearest neighbor query: find the k time series most similar to Q.
A method to answer the above queries: linear scan (very slow).

A better approach
DFT (other distance-preserving transforms are also possible)
  Preserves the $L_2$ norm.
  Choose a few coefficients.
F-index
  Two-step process to remove false hits.
Discussion of F-index [AFS93]

Subsequence matching
Given a query Q and a data sequence S, search for similarity over all subsequences of S of size length(Q).
Sequential scan
Atomic subsequence approach
  Divide Q and S into atomic subsequences.
  Index the atomic subsequences and look for a match.

Window-based approach
Generate information about a window of length w at each point in the data sequence.
Use the DFT and choose the first few coefficients.
Generate a trail by plotting the information about each point.
Divide the trail into sub-trails.
Generate an MBR for each sub-trail.
Build an index on these MBRs, possibly from different sequences.

Division into sub-trails
I-naïve: index the individual points.
I-fixed: a fixed number of points per MBR.
I-adaptive: compute the marginal cost of adding the next point to the existing sub-trail; if the marginal cost increases, start a new sub-trail; otherwise, assign the point to the existing sub-trail.
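The two-step idea behind the F-index (filter on a few DFT coefficients, then remove false hits with the true distance) can be sketched as below. This is a simplified illustration under stated assumptions: it scans the feature vectors linearly instead of using a spatial index, it uses an orthonormal DFT written by hand so that Parseval's theorem applies, and the names `dft`, `features` and `range_query` are hypothetical.

```python
import cmath
import math


def dft(x):
    """Orthonormal DFT (scaled by 1/sqrt(n)), so Euclidean distances are preserved."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n)) / math.sqrt(n)
        for f in range(n)
    ]


def features(x, k):
    """Keep only the first k DFT coefficients."""
    return dft(x)[:k]


def dist(a, b):
    """Euclidean distance; works for real and for complex vectors."""
    return math.sqrt(sum(abs(u - v) ** 2 for u, v in zip(a, b)))


def range_query(query, database, eps, k=2):
    fq = features(query, k)
    # Step 1 (filter): dropping coefficients can only shrink the distance,
    # so no true answer is dismissed, but false hits may pass
    candidates = [s for s in database if dist(fq, features(s, k)) <= eps]
    # Step 2 (refine): remove false hits with the true distance
    return [s for s in candidates if dist(query, s) <= eps]


# Illustrative data
db = [
    [1, 2, 3, 4, 5, 6, 7, 8],
    [1, 2, 3, 4, 5, 6, 7, 9],
    [8, 7, 6, 5, 4, 3, 2, 1],
]
q = [1, 2, 3, 4, 5, 6, 7, 8.5]
print(range_query(q, db, eps=1.0))   # the two rising series; the falling one is rejected
```

In a real F-index the feature vectors would be stored in a multidimensional index and only the refinement step would touch the full sequences.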

What if the query length is greater than the window length w?
Q is the query, S is a database sequence, and |Q| = kw.

Prefix search
If $d(Q, S[i..i+kw-1]) \le \varepsilon$, then $d(Q[j..j+w-1], S[i+j..i+j+w-1]) \le \varepsilon$ for all $0 \le j \le (k-1)w$.
Search the database using any atomic subsequence of the query Q.

Multi-piece search
If $d(Q, S[i..i+kw-1]) \le \varepsilon$, then there exists j such that $d(Q[jw..jw+w-1], S[i+jw..i+jw+w-1]) \le \varepsilon/\sqrt{k}$.
Search the database using each piece of the query Q within a radius of $\varepsilon/\sqrt{k}$.
If each pair of corresponding atomic subsequences of Q and S[i..i+kw-1] is at a distance greater than $\varepsilon/\sqrt{k}$, then the two sequences are at a distance greater than ε: the squared distance is the sum of the k squared piece distances, so it would exceed $k \cdot \varepsilon^2/k = \varepsilon^2$.
Equivalently, if the two sequences are at a distance of at most ε and we query using the atomic subsequences of Q, then we will find an atomic subsequence of S[i..i+kw-1] within a distance of $\varepsilon/\sqrt{k}$.

Searching with a smaller sequence
Searching with an atomic subsequence of Q whose size is k times smaller, within a radius of r, is equivalent to searching with a radius of $r\sqrt{k}$ with the whole Q.
For prefix search, the resulting search volume is $(\sqrt{k})^d$ times the original volume, where d is the number of embedding dimensions.
For multi-piece search, the resulting search volume for each piece is the same as the original search volume. The combined search volume is k times the original search volume.
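The multi-piece rule can be checked with a short Python sketch. It assumes the Euclidean distance and a query length that is an exact multiple of w, and it scans the sequence directly where a real system would probe the index; the names `pieces` and `multipiece_candidates` are hypothetical.

```python
import math


def dist(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def pieces(seq, w):
    """Split seq into consecutive, non-overlapping pieces of length w."""
    return [seq[j:j + w] for j in range(0, len(seq), w)]


def multipiece_candidates(q, s, w, eps):
    """Offsets i where some piece of q lies within eps/sqrt(k) of the aligned piece of s."""
    k = len(q) // w
    radius = eps / math.sqrt(k)
    q_pieces = pieces(q, w)
    hits = []
    for i in range(len(s) - len(q) + 1):
        s_pieces = pieces(s[i:i + len(q)], w)
        if any(dist(qp, sp) <= radius for qp, sp in zip(q_pieces, s_pieces)):
            hits.append(i)
    return hits


# Illustrative data: k = 2 pieces of length w = 3
s = [0, 1, 2, 3, 4, 5, 4, 3, 2, 1, 0, 1]
q = [2, 3, 4, 5, 4, 3.5]
eps = 0.6
true_matches = [i for i in range(len(s) - len(q) + 1) if dist(q, s[i:i + len(q)]) <= eps]
candidates = multipiece_candidates(q, s, w=3, eps=eps)
print(true_matches, candidates)
# Every true match is among the candidates: no false dismissals
```

Candidates that are not true matches are possible in general and have to be removed by computing the full distance.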

Shift and scale invariance
$x' = a x + b$
[Figure: sequences drawn as vectors in the space with axes $x_1$, $x_2$, $x_3$]

Shift and scale invariance: an example
  $V_1 = [2, 6, 4, 10]$
  $V_2 = [1, 3, 2, 5]$
  $V_3 = [5, 9, 7, 13]$
  $V_4 = [5, 7, 6, 9]$
  $V_1 = 2 V_2$, $V_3 = V_1 + 3I$, $V_4 = V_2 + 4I$
[Figure: the four sequences plotted as x against t]

Shift and scale invariant time sequences
Motivation:
  The expression level of genes may be scaled or shifted due to experimental conditions.
  Different countries may have different units (e.g. Fahrenheit vs. Celsius).
  There can be linear relationships between time sequences (stock market data).

z-normalization [GK95]
Z(x) = (x - mean(x)) / std(x)
Does not minimize distance under shifting and scaling!
  u = [ , , , ], v = [3, , , ]
  d(Z(u), Z(v)) = .94
  d(u, -.5v + .5) = .5
The distance between the transformed sequences is higher than the distance after scaling and shifting.
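A short Python sketch of z-normalization shows both what it achieves and where it falls short. The population standard deviation is an assumption (the slides do not say which variant is meant), and the vectors are illustrative rather than the ones used in the slides.

```python
import math


def znorm(x):
    """Z(x) = (x - mean(x)) / std(x), using the population standard deviation (assumed)."""
    mean = sum(x) / len(x)
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / len(x))
    if std == 0:
        raise ValueError("a constant sequence cannot be z-normalized")
    return [(v - mean) / std for v in x]


def dist(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


# A shift and a positive scale are removed by z-normalization
v = [1.0, 3.0, 2.0, 5.0]
scaled_shifted = [2.0 * a + 3.0 for a in v]
print(dist(znorm(v), znorm(scaled_shifted)))   # approximately 0

# A negative scale is not: u equals -w + 3 exactly, yet the normalized forms are far apart
u = [0.0, 1.0, 2.0, 3.0]
w = [3.0, 2.0, 1.0, 0.0]
print(dist(znorm(u), znorm(w)))                # approximately 4
print(dist(u, [-a + 3.0 for a in w]))          # 0.0
```

The second case mirrors the point made above: the distance between the normalized sequences can be higher than the distance reachable by an explicit scale and shift.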

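SE-plane [CW99]
To find the shortest distance between the lines SH(v) and SC(u), project them onto the shift-eliminated (SE) plane and measure the distance between the point $T_{SE}(SH(v))$ and the line $T_{SE}(SC(u))$.
[Figure: u, v, N = [1, 1, ..., 1], the lines SH(v) and SC(u), and their projections $T_{SE}(SH(v))$ and $T_{SE}(SC(u))$ on the SE-plane]

Problem with this scheme
The result depends on which sequence is transformed, so d(u, v) and d(v, u) can differ.
  u = [ , , , ], v = [3, , , ]
  a = - , b = .5: d(u, v) =
  a = -.5, b = .5: d(v, u) = .5

Shift & Scale Invariant Distance
$D(u, v) = \min\{d(u, v),\ d(v, u)\} = \min\{\|T_{SE}(u)\|,\ \|T_{SE}(v)\|\} \sin(\theta_{uv})$
where $\theta_{uv}$ is the angle between $T_{SE}(u)$ and $T_{SE}(v)$.
[Figure: $T_{SE}(u)$ and $T_{SE}(v)$ on the SE-plane, separated by the angle θ]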
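The symmetric distance can be computed without any trigonometry by projecting onto the SE-plane and taking a least-squares fit in each direction. The Python sketch below is an illustration under assumptions made here: $T_{SE}$ is taken to be mean removal (projection orthogonal to the all-ones vector), scaling factors may be negative, and the names `t_se`, `one_sided` and `invariant_distance` are hypothetical.

```python
import math


def t_se(x):
    """Project x onto the shift-eliminated plane by removing its mean (assumed form of T_SE)."""
    m = sum(x) / len(x)
    return [v - m for v in x]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def one_sided(u, v):
    """min over a, b of ||a*u + b - v||: distance from the point T_SE(v) to the line through T_SE(u)."""
    tu, tv = t_se(u), t_se(v)
    if norm(tu) == 0:
        return norm(tv)
    a = dot(tu, tv) / dot(tu, tu)          # best scaling factor (least squares)
    return norm([q - a * p for p, q in zip(tu, tv)])


def invariant_distance(u, v):
    """Symmetric shift and scale invariant distance: the smaller of the two one-sided fits."""
    return min(one_sided(u, v), one_sided(v, u))


# Illustrative values
u = [1.0, 3.0, 2.0, 5.0]
v = [-0.5 * x + 1.5 for x in u]            # an exact scaled and shifted copy of u
print(invariant_distance(u, v))            # approximately 0

c = [2.0, 1.0, 4.0, 3.0]
tu, tc = t_se(u), t_se(c)
cos_theta = dot(tu, tc) / (norm(tu) * norm(tc))
closed_form = min(norm(tu), norm(tc)) * math.sqrt(1.0 - cos_theta ** 2)
print(invariant_distance(u, c), closed_form)   # the two values agree
```

The agreement of the last two numbers is the identity stated above: the point-to-line distance on the SE-plane equals the norm of the projected vector times $\sin(\theta_{uv})$.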

CS-index Structure
[Figure, shown over four slides: the CS-index built over the SE-plane; labels: Fanout = 3, Page size = p]

Locally Adaptive Dimensionality Reduction for Indexing Large Time Series Databases
Keogh, Chakrabarti, Mehrotra, Pazzani, SIGMOD 2001

Dimensionality curse
n (the query length) is large.
Index structures perform poorly at such high dimensionalities.
Solution: first reduce the dimensionality from n to an n' that can be efficiently handled by the index structure, then build the index on the n'-dimensional data (GEMINI framework, Faloutsos et al.).
Correctness criterion: $D_{indexspace}(A, B) \le D_{true}(A, B)$, so there are no false dismissals.

Dimensionality reduction techniques
DFT, DWT, SVD
[Figure: a series X of n datapoints and its approximation X' under each technique, with the Haar wavelets (DWT) and the eigenwaves (SVD) used as basis functions]

Piecewise Aggregate Approximation (PAA)
Original time series (n-dimensional vector): $S = \{s_1, s_2, \ldots, s_n\}$
n'-segment PAA representation (n'-d vector): $S = \{sv_1, sv_2, \ldots, sv_{n'}\}$
[Figure: a series approximated by eight equal-length segments $sv_1, \ldots, sv_8$; value axis against time axis]
The PAA representation satisfies the lower bounding property (Keogh, Chakrabarti, Mehrotra and Pazzani; Yi and Faloutsos).
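The lower bounding property of PAA can be illustrated with a small Python sketch. It assumes that n is an exact multiple of n' and uses the commonly stated bound $\sqrt{n/n'}$ times the Euclidean distance between the PAA vectors; the names `paa` and `paa_lower_bound` and the sample series are illustrative.

```python
import math


def paa(s, n_seg):
    """Mean of each of n_seg equal-length segments; len(s) must be a multiple of n_seg."""
    n = len(s)
    if n % n_seg != 0:
        raise ValueError("series length must be a multiple of the number of segments")
    seg = n // n_seg
    return [sum(s[i * seg:(i + 1) * seg]) / seg for i in range(n_seg)]


def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def paa_lower_bound(q_paa, s_paa, n):
    """sqrt(n / n') times the Euclidean distance between the two PAA vectors."""
    return math.sqrt(n / len(q_paa)) * euclidean(q_paa, s_paa)


# Illustrative values: n = 8 reduced to n' = 4
q = [1, 2, 3, 4, 5, 6, 7, 8]
s = [2, 2, 4, 4, 5, 7, 7, 9]
lb = paa_lower_bound(paa(q, 4), paa(s, 4), len(q))
print(lb, euclidean(q, s))   # the first value never exceeds the second
```

Because the reduced-space distance never exceeds the true distance, an index built on the PAA vectors satisfies the correctness criterion: it may return false hits, but it cannot dismiss a true answer.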

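Can we improve upon PAA?
n'-segment PAA representation (n'-d vector): $S = \{sv_1, sv_2, \ldots, sv_{n'}\}$
[Figure: the series approximated by eight equal-length segments $sv_1, \ldots, sv_8$]
APCA approximates the original signal better than PAA.
Improvement factor = (reconstruction error of PAA) / (reconstruction error of APCA)

Adaptive Piecewise Constant Approximation (APCA)
n'/2-segment APCA representation (n'-d vector): $S = \{sv_1, sr_1, sv_2, sr_2, \ldots, sv_M, sr_M\}$
(M is the number of segments = n'/2)
[Figure: the series approximated by four variable-length segments with values $sv_1, \ldots, sv_4$ and right endpoints $sr_1, \ldots, sr_4$]

Construction of APCA
Apply the wavelet transform.
Keep the highest coefficients.
Reconstruct the series to obtain the intervals.
Use the actual means.
Merge adjacent regions as needed.

Distance measure
Exact (Euclidean) distance D(Q, S):
  $D(Q, S) = \sqrt{\sum_{i=1}^{n} (q_i - s_i)^2}$
Lower bounding distance $D_{LB}(Q', S)$:
  $D_{LB}(Q', S) = \sqrt{\sum_{i=1}^{M} (sr_i - sr_{i-1}) (qv_i - sv_i)^2}$
where $qv_i$ is the mean of Q over the i-th segment of S and $sr_0 = 0$.
[Figure: Q and S with the area measured by D(Q, S), and Q' and S with the area measured by $D_{LB}(Q', S)$]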
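The lower bounding distance can be sketched in Python for a series whose segmentation is already known. The sketch does not perform the wavelet-based construction: the segment boundaries are supplied by hand, which is an assumption made for illustration, and the names `apca_from_boundaries` and `apca_lower_bound` are hypothetical.

```python
import math


def apca_from_boundaries(s, right_ends):
    """APCA pairs (sv_i, sr_i): the actual mean of each segment and its right endpoint (1-based)."""
    rep, start = [], 0
    for sr in right_ends:
        seg = s[start:sr]
        rep.append((sum(seg) / len(seg), sr))
        start = sr
    return rep


def apca_lower_bound(q, s_apca):
    """D_LB(Q', S): Q is averaged over the segments of S, then compared segment by segment."""
    total, start = 0.0, 0
    for sv, sr in s_apca:
        qv = sum(q[start:sr]) / (sr - start)       # mean of Q over this segment of S
        total += (sr - start) * (qv - sv) ** 2     # weighted by the segment length
        start = sr
    return math.sqrt(total)


def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


# Illustrative values: three segments ending at positions 4, 6 and 8
s = [1, 1, 1, 1, 5, 6, 9, 9]
q = [2, 1, 0, 1, 6, 6, 8, 10]
s_apca = apca_from_boundaries(s, [4, 6, 8])
print(s_apca)                                        # [(1.0, 4), (5.5, 6), (9.0, 8)]
print(apca_lower_bound(q, s_apca), euclidean(q, s))  # the bound stays below the exact distance
```

The bound relies on each stored value being the actual mean of its segment, which is why the construction steps end with "use the actual means".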

Index on 2M-dimensional APCA space
[Figure: data items S1 to S9 in 2M-dimensional APCA space, grouped into MBRs R1 to R4, and the corresponding index tree with the data items at the leaves]
Any index structure can be used.

k-nearest neighbor algorithm
[Figure: a query Q with MINDIST(Q, R2), MINDIST(Q, R3) and MINDIST(Q, R4) drawn to the MBRs]
For any node U of the index structure with MBR R, $MINDIST(Q, R) \le D(Q, S)$ for any data item S under U.

Index modification for MINDIST computation
APCA point: $S = \{sv_1, sr_1, sv_2, sr_2, \ldots, sv_M, sr_M\}$
APCA rectangle: S = (L, H), where
  $L = \{smin_1, sr_1, smin_2, sr_2, \ldots, smin_M, sr_M\}$ and
  $H = \{smax_1, sr_1, smax_2, sr_2, \ldots, smax_M, sr_M\}$
[Figure: a four-segment series with $smin_i$, $sv_i$ and $smax_i$ marked on each segment and right endpoints $sr_1, \ldots, sr_4$; the index over the resulting rectangles]

MBR representation in time-value space
We can view the MBR R = (L, H) of any node U as two APCA representations
  $L = \{l_1, l_2, \ldots, l_{N-1}, l_N\}$ and $H = \{h_1, h_2, \ldots, h_{N-1}, h_N\}$
[Figure: an MBR with $L = \{l_1, \ldots, l_6\}$ and $H = \{h_1, \ldots, h_6\}$ drawn as REGION 1, REGION 2 and REGION 3; value axis against time axis]

Regions
M regions are associated with each MBR. The boundaries of the i-th region are:
  value range from $l_{2i-1}$ to $h_{2i-1}$
  time range from $l_{2i-2} + 1$ to $h_{2i}$
[Figure: REGION 1, REGION 2 and REGION 3 of an MBR with $L = \{l_1, \ldots, l_6\}$ and $H = \{h_1, \ldots, h_6\}$; value axis against time axis]

Regions
The i-th region is active at time instant t if it spans across t.
The value $s_t$ of any time series S under node U at time instant t must lie in one of the regions active at t.
[Figure: the same three regions with two time instants marked on the time axis]

MINDIST Computation
For time instant t:
  $MINDIST(Q, R, t) = \min_{\text{region } G \text{ active at } t} MINDIST(Q, G, t)$
In the figure's example, regions 1 and 2 are both active at the marked time instant t and $q_t$ lies above both:
  $MINDIST(Q, R, t) = \min(MINDIST(Q, Region_1, t),\ MINDIST(Q, Region_2, t)) = \min((q_t - h_1)^2,\ (q_t - h_3)^2) = (q_t - h_1)^2$
Over the whole query:
  $MINDIST(Q, R) = \sqrt{\sum_{t=1}^{n} MINDIST(Q, R, t)}$
Lemma: $MINDIST(Q, R) \le D(Q, C)$ for any time series C under node U.

Experimental Results
Compare APCA with SVD, DFT, DWT and PAA in terms of:
  Time to compute the reduced representation
  Pruning power (quality of approximation)
  Query performance (in terms of I/O and CPU costs)
A hybrid tree is used to index the reduced space.
Datasets: Electrocardiogram data and Mixed Bag data, each with varying n (query length).
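The region-based MINDIST computation can be sketched in Python. Several details are assumptions made for this illustration: MINDIST(Q, G, t) is taken to be the squared distance from $q_t$ to the value range of region G (zero when $q_t$ lies inside it), L and H are stored as flat lists, time instants are 1-based, and all function names are hypothetical.

```python
import math


def apca_rectangle(s, right_ends):
    """(L, H) of one series: per-segment min and max values interleaved with right endpoints."""
    L, H, start = [], [], 0
    for sr in right_ends:
        seg = s[start:sr]
        L += [min(seg), sr]
        H += [max(seg), sr]
        start = sr
    return L, H


def mbr(rectangles):
    """Smallest (L, H) enclosing all given APCA rectangles."""
    Ls, Hs = zip(*rectangles)
    return [min(c) for c in zip(*Ls)], [max(c) for c in zip(*Hs)]


def regions(L, H):
    """Regions as (t_start, t_end, v_low, v_high): time l_(2i-2)+1 .. h_(2i), values l_(2i-1) .. h_(2i-1)."""
    out = []
    for i in range(len(L) // 2):
        t_start = (L[2 * i - 1] if i > 0 else 0) + 1   # l_0 is taken as 0
        out.append((t_start, H[2 * i + 1], L[2 * i], H[2 * i]))
    return out


def mindist_t(q_t, t, regs):
    """Smallest squared distance from q_t to any region active at time t."""
    best = math.inf
    for t_start, t_end, lo, hi in regs:
        if t_start <= t <= t_end:                      # the region is active at t
            gap = max(lo - q_t, q_t - hi, 0.0)
            best = min(best, gap ** 2)
    return best


def mindist(q, L, H):
    regs = regions(L, H)
    return math.sqrt(sum(mindist_t(q[t - 1], t, regs) for t in range(1, len(q) + 1)))


# Illustrative node holding two series with different segmentations
s1, s2 = [1, 1, 2, 5, 6, 6], [2, 3, 3, 3, 7, 8]
L, H = mbr([apca_rectangle(s1, [3, 6]), apca_rectangle(s2, [4, 6])])
q = [0, 0, 0, 4, 9, 9]
d = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
print(mindist(q, L, H), d(q, s1), d(q, s2))   # MINDIST stays below both exact distances
```

At t = 4 both regions are active in this example, and the smaller of the two squared gaps is used, as in the per-instant minimum described above.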

Time to Compute Reduced Representation
[Figure: time (sec) against database size and dimensionality (n), one panel each for SVD, APCA, Fourier, Wavelet and PAA]

Pruning power
Number of objects examined for an NN query
[Figure: # objects examined against original dimensionality (n) and reduced dimensionality (n') for Fourier, Wavelet/PAA and APCA]

Search Performance (I/O Cost)
[Figure: # random disk accesses against original dimensionality (n) and reduced dimensionality (n') for Linear Scan, Fourier, Wavelet/PAA and APCA]

Search Performance (CPU Cost)
[Figure: CPU time (sec) against original dimensionality (n) and reduced dimensionality (n') for Linear Scan, Fourier, Wavelet/PAA and APCA]

References
Efficient Similarity Search in Sequence Databases, Rakesh Agrawal, Christos Faloutsos and Arun Swami, FODO conference, Evanston, Illinois, Oct. 1993.
Fast Subsequence Matching in Time-Series Databases, Christos Faloutsos, M. Ranganathan and Yannis Manolopoulos, SIGMOD 1994.
Variable Length Queries for Time Series Data, Tamer Kahveci and Ambuj K. Singh, ICDE 2001.
Fast Time-Series Searching with Scaling and Shifting, Kelvin Kam Wing Chu and Man Hon Wong, ACM Symposium on Principles of Database Systems (PODS), 1999.
Locally Adaptive Dimensionality Reduction for Indexing Large Time Series Databases, Eamonn Keogh, Kaushik Chakrabarti, Sharad Mehrotra and Michael Pazzani, ACM SIGMOD 2001.
On Similarity Queries for Time-Series Data: Constraint Specification and Implementation, D. Goldin and P. Kanellakis, Proceedings of the 1st International Conference on Principles and Practice of Constraint Programming, 1995.
Similarity Searching for Multi-Attribute Sequences, T. Kahveci, A. K. Singh and A. Gurel, SSDBM 2002.