Time series databases. Indexing Time Series. Time series data. Time series are ubiquitous



Transcription:

Time series databases

Indexing Time Series

A time series is a sequence of real numbers representing the measurements of a real variable at equal time intervals. Examples:
  Stock prices
  Volume of sales over time
  Daily temperature readings
  ECG data
A time series database is a large collection of time series.

Time series data

A time series is a collection of observations made sequentially in time.
[Figure: a column of sampled values and the corresponding curve; value axis vs. time axis]

Time series are ubiquitous

People measure things...
  The president's approval rating.
  Their blood pressure.
  The annual rainfall in Santa Barbara.
  The value of their Yahoo stock.
  The number of web hits per second.
...and things change over time.
Thus time series occur in numerous medical, scientific and business applications.

Applications

Search: similarity search in time series
  A doctor searching the ECG database for a particular pattern (one that implies a heart irregularity) for diagnosis.
  A stock analyst searching the stock database for a particular stock price pattern for prediction.
Data mining
  Classification: classify patients based on ECG patterns.
  Clustering: group websites with similar traffic patterns.
  Association rules: if we see this peak followed by this plateau, then we will have a downward trend, with a stated confidence and support.

Time series problems (from a databases perspective)

The Similarity Problem
  Given $X = x_1, x_2, \ldots, x_n$ and $Y = y_1, y_2, \ldots, y_n$, define and compute Sim(X, Y).
  E.g. do stocks X and Y have similar movements?
Retrieve similar time series (Similarity Queries).

Types of queries

  Whole match vs. sub-pattern match
  Range query vs. k-NN
  Join (self)

Examples

  Find companies with similar stock prices over a time interval.
  Find products with similar sell cycles.
  Cluster users with similar credit card utilization.
  Find similar subsequences in DNA sequences.
  Find scenes in video streams.

Problems

[Figure: three stock price curves, price in dollars against day, over 365 days; distance function chosen by an expert (e.g. Euclidean distance)]
  Define the similarity (or distance) function.
  Find an efficient algorithm to retrieve similar time series from a database (faster than a sequential scan).
The similarity function depends on the application.

Euclidean similarity measure

View each sequence as a point in n-dimensional Euclidean space (n = length of each sequence).
Define the (dis-)similarity between sequences X and Y as
  $L_p(X, Y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$
  p = 1: Manhattan distance
  p = 2: Euclidean distance

Euclidean model

Euclidean distance between two time series $Q = \{q_1, q_2, \ldots, q_n\}$ and $S = \{s_1, s_2, \ldots, s_n\}$:
  $D(Q, S) = \sqrt{\sum_{i=1}^{n} (q_i - s_i)^2}$
[Figure: a query Q of n datapoints compared with each database sequence of n datapoints; the table lists the distance and the resulting rank of each sequence]
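The L_p definition above can be sketched in a few lines of Python. This is a minimal illustration, not code from the slides: the function name lp_distance and the sample sequences q and s are assumptions made for the example.

```python
def lp_distance(x, y, p=2):
    """L_p distance between two equal-length sequences (p=1 Manhattan, p=2 Euclidean)."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

# Hypothetical sample data, chosen only to make the arithmetic easy to follow.
q = [1.0, 2.0, 3.0]
s = [2.0, 4.0, 3.0]
manhattan = lp_distance(q, s, p=1)  # |1-2| + |2-4| + |3-3|
euclidean = lp_distance(q, s, p=2)  # sqrt(1 + 4 + 0)
```

Each call is a single pass over the n points, which matches the O(n) cost quoted for the Euclidean model.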

Advantages

  Easy to compute: O(n).
  Allows scalable solutions to other problems, such as indexing, clustering, etc.

Disadvantages

  Does not allow for different baselines: stock X fluctuates at 100 dollars, stock Y at 30 dollars.
  Does not allow for different scales: stock X fluctuates between 95 and 105 dollars, stock Y between 20 and 40 dollars.
  Synchronization of signals.
  Missing points.

Dynamic Time Warping [Berndt, Clifford, 1994]

Allows acceleration and deceleration of signals along the time dimension.
Basic idea
  Consider $X = x_1, x_2, \ldots, x_n$ and $Y = y_1, y_2, \ldots, y_n$.
  We are allowed to extend each sequence by repeating elements.
  The Euclidean distance is now calculated between the extended sequences X' and Y'.
  Matrix M, where $m_{ij} = d(x_i, y_j)$.

Example

[Figure: Euclidean distance vs. DTW alignment of two signals]

Dynamic Time Warping [Berndt, Clifford, 1994]

Restrictions on warping paths
  Monotonicity: the path should not go down or to the left.
  Continuity: no elements may be skipped in a sequence.
  Warping window: $|i - j| \le w$.
[Figure: warping path through the matrix of X ($x_1, x_2, x_3, \ldots$) against Y ($y_1, y_2, y_3, \ldots$), confined to the band between the lines j = i + w and j = i - w]

Formulation

Let D(i, j) refer to the dynamic time warping distance between the subsequences $x_1, x_2, \ldots, x_i$ and $y_1, y_2, \ldots, y_j$:
  $D(i, j) = |x_i - y_j| + \min\{ D(i-1, j),\ D(i-1, j-1),\ D(i, j-1) \}$

Solution by Dynamic Programming

Basic implementation: $O(n^2)$, where n is the length of the sequences, because the problem must be solved for each (i, j) pair.
If a warping window is specified: O(nw), because only the (i, j) pairs with $|i - j| \le w$ are solved.
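The recurrence and the warping window above can be sketched as a small dynamic program. This is a minimal sketch under stated assumptions: the function name dtw, the choice D(0, 0) = 0 with all other border cells infinite, and the widening of w to at least |n - m| for unequal lengths are illustrative choices, not details given on the slides.

```python
import math

def dtw(x, y, w=None):
    """DTW distance with an optional warping window |i - j| <= w."""
    n, m = len(x), len(y)
    if w is None:
        w = max(n, m)            # no effective window constraint
    w = max(w, abs(n - m))       # assumption: the window must reach cell (n, m)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        # Only cells inside the band are solved: O(n*w) work overall.
        for j in range(max(1, i - w), min(m, i + w) + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i - 1][j - 1], D[i][j - 1])
    return D[n][m]

# Hypothetical data: y is x with its step shifted by one position.
x = [0, 0, 1]
y = [0, 1, 1]
unconstrained = dtw(x, y)      # warping absorbs the shift
diagonal_only = dtw(x, y, w=0) # w = 0 forces a point-by-point comparison
```

With w = 0 the path is confined to the diagonal, so the result coincides with the Manhattan distance between the two sequences.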

Longest Common Subsequence Measures (Allowing for Gaps in Sequences)

Basic LCS idea
  X = 3, 2, 5, 7, 4, 8, 10, 7
  Y = 2, 5, 4, 7, 3, 10, 8, 6
  LCS = 2, 5, 7, 10
  Sim(X, Y) = |LCS| or Sim(X, Y) = |LCS| / n
[Figure: two aligned sequences with a skipped gap]

LCS for time series

We can extend the idea of LCS to time series using a threshold $\varepsilon$ for matching.
The LCS for two subsequences $x_1, x_2, \ldots, x_i$ and $y_1, y_2, \ldots, y_j$ is:
  $LCS[i, j] = 0$ if $i = 0$ or $j = 0$
  $LCS[i, j] = 1 + LCS[i-1, j-1]$ if $|x_i - y_j| < \varepsilon$
  $LCS[i, j] = \max(LCS[i-1, j],\ LCS[i, j-1])$ otherwise
How to convert this into a distance?

Metric properties and indexing

Edit distance is a metric, so we can use metric indexes.
DTW and LCS are not metrics, so we have to use specialized indexes.
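The thresholded LCS recurrence above translates directly into a table-filling routine. A minimal sketch follows; the name lcss_length, the sample sequences, and the value eps = 0.5 are assumptions for illustration only.

```python
def lcss_length(x, y, eps):
    """Length of the longest common subsequence where x_i matches y_j if |x_i - y_j| < eps."""
    n, m = len(x), len(y)
    L = [[0] * (m + 1) for _ in range(n + 1)]  # row 0 and column 0 stay 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if abs(x[i - 1] - y[j - 1]) < eps:
                L[i][j] = 1 + L[i - 1][j - 1]
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return L[n][m]

# Illustrative integer sequences; with eps = 0.5 only equal values match,
# so the result is the classical LCS length.
X = [3, 2, 5, 7, 4, 8, 10, 7]
Y = [2, 5, 4, 7, 3, 10, 8, 6]
length = lcss_length(X, Y, eps=0.5)
similarity = length / len(X)  # Sim(X, Y) = |LCS| / n
```

A larger eps lets nearby real values count as matches, which is what makes the measure usable on noisy time series.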

Similarity retrieval

Range query: find all time series S where $D(Q, S) \le \varepsilon$.
Nearest neighbor query: find the k time series most similar to Q.
A method to answer the above queries: linear scan, which is very slow.
A better approach: the F-index.

Discussion of F-index [AFS93]

  Uses the DFT (other distance-preserving transforms are also possible).
  The DFT preserves the $L_2$ norm.
  Choose a few coefficients.
  Two-step process to remove false hits.

Subsequence matching

Given a query Q and a data sequence S, search for similarity over all subsequences of S of size length(Q).
  Sequential scan.
  Atomic subsequence approach: divide Q and S into atomic subsequences, index the atomic subsequences, and look for a match.

Window-based approach

  Generate information about a window of length w at each point in the data sequence.
  Use the DFT and choose the first few coefficients.
  Generate a trail by plotting the information about each point.
  Divide the trail into sub-trails.
  Generate an MBR for each sub-trail.
  Build an index on these MBRs, possibly from different sequences.

Division into sub-trails

  I-naïve: index points.
  I-fixed: a fixed number of points per MBR.
  I-adaptive: compute the marginal cost of adding the next point to the existing sub-trail. If the marginal cost increases, start a new sub-trail; otherwise, assign the point to the existing sub-trail.
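The two-step F-index idea above (filter on a few DFT coefficients, then remove false hits) can be sketched as follows. This is a sketch under stated assumptions rather than the [AFS93] implementation: it recomputes features on the fly instead of storing them in a spatial index, it uses a naive O(n^2) DFT scaled by 1/sqrt(n) so that the transform preserves the L2 norm, and the names dft_prefix, dist and range_query are hypothetical.

```python
import cmath
import math

def dft_prefix(x, k):
    """First k coefficients of the orthonormal (1/sqrt(n)-scaled) DFT of x."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n)) / math.sqrt(n)
        for f in range(k)
    ]

def dist(a, b):
    """Euclidean distance; abs() also handles complex coefficients."""
    return math.sqrt(sum(abs(p - q) ** 2 for p, q in zip(a, b)))

def range_query(db, q, eps, k=2):
    fq = dft_prefix(q, k)
    # Step 1 (filter): dropping coefficients can only shrink the distance,
    # so no qualifying sequence is dismissed; false hits may remain.
    candidates = [s for s in db if dist(dft_prefix(s, k), fq) <= eps]
    # Step 2 (refine): remove false hits using the true distance.
    return [s for s in candidates if dist(s, q) <= eps]

# Hypothetical database of three length-4 sequences.
db = [[1, 2, 3, 4], [1, 2, 3, 5], [9, 9, 9, 9]]
query = [1, 2, 3, 4]
hits = range_query(db, query, eps=1.5)
```

In floating-point arithmetic the feature distance can exceed the true distance by rounding error, so a production filter would likely compare against eps plus a small tolerance.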

What if the query length is greater than the window length w?

Q is the query, S is a database sequence, and |Q| = kw.

Prefix search
  If $d(Q, S[i..i+kw-1]) \le \varepsilon$, then $d(Q[j..j+w-1], S[i+j..i+j+w-1]) \le \varepsilon$ for all $j \le (k-1)w$.
  Search the database using any atomic subsequence of query Q.

Multi-piece search
  If $d(Q, S[i..i+kw-1]) \le \varepsilon$, then there exists j such that $d(Q[jw..jw+w-1], S[i+jw..i+jw+w-1]) \le \varepsilon/\sqrt{k}$.
  Search the database using each piece of query Q within a radius of $\varepsilon/\sqrt{k}$.
  If each pair of corresponding atomic subsequences of Q and S[i..i+kw-1] is at a distance of $\varepsilon/\sqrt{k}$, then the two sequences are at a distance of $\varepsilon$.
  If the two sequences are at a distance of $\varepsilon$, then querying with the atomic pieces of Q finds at least one atomic subsequence of S[i..i+kw-1] within a distance of $\varepsilon/\sqrt{k}$.

Searching with a smaller sequence

Searching with an atomic subsequence of Q whose size is k times smaller, within a radius of r, is equivalent to searching with a radius of $r\sqrt{k}$ with the whole Q.
For prefix search, the resulting search volume is $(\sqrt{k})^d$ times the original volume, where d is the number of embedding dimensions.
For multi-piece search, the resulting search volume for each piece is the same as the original search volume. The combined search volume is k times the original search volume.
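The multi-piece bound above follows from a short pigeonhole argument on squared Euclidean distances. The derivation below is a sketch; the shorthand $Q_j$ and $S_j$ for the j-th pieces of length w is notation introduced here, not taken from the slides.

```latex
\begin{align*}
Q_j &= Q[jw \,..\, jw+w-1], \qquad S_j = S[i+jw \,..\, i+jw+w-1], \qquad j = 0, \ldots, k-1 \\
d\bigl(Q, S[i \,..\, i+kw-1]\bigr)^2 &= \sum_{j=0}^{k-1} d(Q_j, S_j)^2 \\
\text{If } d(Q_j, S_j) > \frac{\varepsilon}{\sqrt{k}} \text{ for every } j:\quad
\sum_{j=0}^{k-1} d(Q_j, S_j)^2 &> k \cdot \frac{\varepsilon^2}{k} = \varepsilon^2 \\
\text{Hence } d\bigl(Q, S[i \,..\, i+kw-1]\bigr) \le \varepsilon
&\;\Longrightarrow\; \min_{j} d(Q_j, S_j) \le \frac{\varepsilon}{\sqrt{k}}
\end{align*}
```

The same identity read in the other direction gives the converse statement on the slide: if every piece is within $\varepsilon/\sqrt{k}$, the whole sequences are within $\varepsilon$.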

Shift and scale invariance

[Figure: a sequence x and its transform $x' = a x + b$, drawn in the space with axes $x_1, x_2, x_3$ and as curves against time t]

Shift and scale invariance: an example

  $V_1 = [2, 6, 4, 10]$
  $V_2 = [1, 3, 2, 5]$
  $V_3 = [5, 9, 7, 13]$
  $V_4 = [5, 7, 6, 9]$
  $V_3 = V_1 + 3I$, $V_1 = 2 V_2$, $V_4 = V_2 + 4I$

Shift and scale invariant time sequences

Motivation:
  The expression level of genes may be scaled or shifted due to experimental conditions.
  Different countries may have different units (e.g. Fahrenheit vs. Celsius).
  There can be linear relationships between time sequences (stock market data).

z-normalization [GK95]

  $Z(x) = (x - \mathrm{mean}(x)) / \mathrm{std}(x)$
Does not minimize distance under shifting and scaling!
The slide's counterexample takes two sequences u and v and compares $d(Z(u), Z(v))$ with $d(u, a v + b)$ for a fitted scale a and shift b: the distance between the z-normalized sequences is higher than the distance after scaling and shifting.
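To make the comparison between z-normalization and an explicit scale-and-shift fit concrete, here is a small sketch. It assumes the fit is an ordinary least-squares line, which is one natural reading of "minimize distance under shifting and scaling"; the function names are hypothetical, the vectors are the four example vectors with $V_1 = 2 V_2$ and $V_3 = V_1 + 3I$, and constant sequences (zero standard deviation) are not handled.

```python
import math

def mean(x):
    return sum(x) / len(x)

def znorm(x):
    """Z(x) = (x - mean(x)) / std(x), using the population standard deviation."""
    m = mean(x)
    sd = math.sqrt(sum((v - m) ** 2 for v in x) / len(x))
    return [(v - m) / sd for v in x]  # assumes x is not constant

def dist(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def best_scale_shift(u, v):
    """Least-squares a, b minimizing ||u - (a*v + b)||."""
    mu, mv = mean(u), mean(v)
    a = sum((x - mu) * (y - mv) for x, y in zip(u, v)) / sum((y - mv) ** 2 for y in v)
    return a, mu - a * mv

v1, v2, v3 = [2, 6, 4, 10], [1, 3, 2, 5], [5, 9, 7, 13]
a12, b12 = best_scale_shift(v1, v2)   # recovers V1 = 2 * V2
a31, b31 = best_scale_shift(v3, v1)   # recovers V3 = V1 + 3
z_gap = dist(znorm(v1), znorm(v2))    # exact scaled copies normalize to the same sequence
```

For sequences that are exact linear images of each other both measures give zero; they differ only when the relationship is approximate, which is the case the counterexample targets.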

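SE-plane [CW99]

To find the shortest distance between the lines SH(v) and SC(u), project them onto the shift-eliminated plane (SE-plane) and measure the distance between the point $T_{SE}(SH(v))$ and the line $T_{SE}(SC(u))$.
[Figure: the shift line SH(v) along N = [1, 1, ..., 1], the scale line SC(u), and their projections $T_{SE}(SH(v))$ and $T_{SE}(SC(u))$ on the SE-plane]

Problem with this scheme

The distance is not symmetric. In the slide's example, fitting a scale a and shift b in one direction gives d(u, v), while fitting in the other direction gives a different value d(v, u).
[Figure: the two fits for the example sequences u and v]

Shift & Scale Invariant Distance

  $D(u, v) = \min\{ d(u, v),\ d(v, u) \} = \min\{ \|T_{SE}(u)\|,\ \|T_{SE}(v)\| \} \cdot \sin(\theta_{uv})$
where $\theta_{uv}$ is the angle between $T_{SE}(u)$ and $T_{SE}(v)$ on the SE-plane.
[Figure: $T_{SE}(u)$ and $T_{SE}(v)$ on the SE-plane, separated by the angle $\theta$]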
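The closed form for the shift and scale invariant distance can be sketched numerically. This sketch assumes that the projection $T_{SE}$ onto the SE-plane amounts to subtracting the mean (removing the component along N = [1, ..., 1]); that reading, the function names, and the sample vectors are assumptions, not details confirmed by the slides.

```python
import math

def t_se(x):
    """Project x onto the plane orthogonal to N = [1, ..., 1] (assumed form of T_SE)."""
    m = sum(x) / len(x)
    return [v - m for v in x]

def norm(x):
    return math.sqrt(sum(v * v for v in x))

def shift_scale_distance(u, v):
    """D(u, v) = min(||T_SE(u)||, ||T_SE(v)||) * sin(theta_uv)."""
    pu, pv = t_se(u), t_se(v)
    nu, nv = norm(pu), norm(pv)
    if nu == 0.0 or nv == 0.0:
        return 0.0  # a constant sequence is matched exactly by scale 0 plus a shift
    cos_t = sum(a * b for a, b in zip(pu, pv)) / (nu * nv)
    cos_t = max(-1.0, min(1.0, cos_t))  # guard against rounding just outside [-1, 1]
    return min(nu, nv) * math.sqrt(1.0 - cos_t * cos_t)

# V1 = 2 * V2 and V4 = V2 + 4, so V1 and V4 differ only by scale and shift.
d_related = shift_scale_distance([2, 6, 4, 10], [5, 7, 6, 9])
# Two zero-mean sequences with orthogonal projections: theta = 90 degrees.
d_orthogonal = shift_scale_distance([1, -1, 0, 0], [0, 0, 1, -1])
```

Because the formula takes the smaller of the two projected norms, the result is the same whichever sequence is treated as the query, which addresses the asymmetry noted above.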

CS-index Structure

[Figure, repeated on four slides: the CS-index built over the SE-plane; fanout = 3; page size = p]

Locally Adaptive Dimensionality Reduction for Indexing Large Time Series Databases
Keogh, Chakrabarti, Mehrotra, Pazzani, SIGMOD 2001

Dimensionality curse

n (the query length) is large.
Index structures perform poorly at such high dimensionalities.
Solution: first reduce the dimensionality from n to an n' that can be efficiently handled by the index structure, then build the index on the n'-dimensional data (GEMINI framework, Faloutsos et al.).
Correctness criterion: $D_{indexspace}(A, B) \le D_{true}(A, B)$, so there are no false dismissals.

Dimensionality reduction techniques

[Figure: a series X of n datapoints and its reconstruction X' under three techniques: DFT, DWT (basis functions Haar 0 to Haar 7), and SVD (basis functions eigenwave 0 to eigenwave 7)]

Piecewise Aggregate Approximation (PAA)

Original time series (n-dimensional vector): $S = \{s_1, s_2, \ldots, s_n\}$
n'-segment PAA representation (n'-dimensional vector): $S' = \{sv_1, sv_2, \ldots, sv_{n'}\}$
[Figure: a series approximated by eight equal-length segments with values $sv_1, \ldots, sv_8$; value axis vs. time axis]
The PAA representation satisfies the lower bounding property (Keogh, Chakrabarti, Mehrotra and Pazzani, 2000; Yi and Faloutsos, 2000).
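PAA and its lower-bounding property can be checked with a short sketch. It assumes n is a multiple of n' and uses the standard scaled distance sqrt(n / n') times the Euclidean distance between the PAA vectors; the function names and sample data are hypothetical.

```python
import math

def paa(s, n_seg):
    """n_seg-segment PAA: the mean of each equal-length segment."""
    if len(s) % n_seg != 0:
        raise ValueError("this sketch assumes len(s) is a multiple of n_seg")
    w = len(s) // n_seg
    return [sum(s[i * w:(i + 1) * w]) / w for i in range(n_seg)]

def paa_lower_bound(q_paa, s_paa, n):
    """sqrt(n / n') * ||Q' - S'||, a lower bound on the Euclidean distance D(Q, S)."""
    w = n // len(q_paa)
    return math.sqrt(w * sum((a - b) ** 2 for a, b in zip(q_paa, s_paa)))

def euclid(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

# Hypothetical length-8 series reduced to 4 segments.
q = [1, 3, 5, 7, 2, 2, 8, 0]
s = [0, 0, 4, 4, 6, 2, 1, 1]
q_paa, s_paa = paa(q, 4), paa(s, 4)
lb = paa_lower_bound(q_paa, s_paa, len(q))
true_d = euclid(q, s)
```

Since lb never exceeds true_d, filtering candidates by lb cannot dismiss a true match, which is the correctness criterion of the GEMINI framework.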

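Can we improve upon PAA?

n'-segment PAA representation (n'-dimensional vector): $S' = \{sv_1, sv_2, \ldots, sv_{n'}\}$
[Figure: PAA with eight equal-length segments $sv_1, \ldots, sv_8$]

Adaptive Piecewise Constant Approximation (APCA)

n'/2-segment APCA representation (n'-dimensional vector):
  $S = \{sv_1, sr_1, sv_2, sr_2, \ldots, sv_M, sr_M\}$
where M is the number of segments (M = n'/2), $sv_i$ is the value of segment i, and $sr_i$ is its right endpoint.
[Figure: APCA with four variable-length segments, values $sv_1, \ldots, sv_4$ and right endpoints $sr_1, \ldots, sr_4$]

APCA approximates the original signal better than PAA.
  Improvement factor = (reconstruction error of PAA) / (reconstruction error of APCA)
[Figure: improvement factors on several sample signals]

Construction of APCA

  Apply the wavelet transform.
  Keep the highest coefficients.
  Reconstruct the series to obtain the intervals.
  Use the actual means.
  Merge adjacent regions as needed.

Distance measure

Exact (Euclidean) distance D(Q, S):
  $D(Q, S) = \sqrt{\sum_{i=1}^{n} (q_i - s_i)^2}$
Lower bounding distance $D_{LB}(Q, S)$:
  $D_{LB}(Q', S) = \sqrt{\sum_{i=1}^{M} (sr_i - sr_{i-1}) (qv_i - sv_i)^2}$
where $qv_i$ is the mean of Q over the i-th segment of S and $sr_0 = 0$.
[Figure: Q against S for the exact distance, and the segment means of Q against the APCA of S for the lower bound]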
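The lower bounding distance for APCA can be sketched as below. To keep the example short, the APCA of S is built from segment boundaries supplied by hand rather than by the wavelet-based construction listed above; that shortcut, the function names, and the data are assumptions for illustration.

```python
import math

def apca_from_boundaries(s, right_ends):
    """APCA as a list of (sv_i, sr_i); sr_i is a 1-based inclusive right endpoint."""
    rep, start = [], 0
    for r in right_ends:
        seg = s[start:r]
        rep.append((sum(seg) / len(seg), r))  # actual mean of the segment
        start = r
    return rep

def d_lb(q, s_apca):
    """sqrt(sum_i (sr_i - sr_{i-1}) * (qv_i - sv_i)^2), with qv_i the mean of q on segment i."""
    total, start = 0.0, 0
    for sv, sr in s_apca:
        qv = sum(q[start:sr]) / (sr - start)
        total += (sr - start) * (qv - sv) ** 2
        start = sr
    return math.sqrt(total)

def euclid(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

# Hypothetical series with three natural plateaus of lengths 3, 2 and 3.
s = [1, 1, 1, 5, 5, 9, 9, 9]
q = [2, 0, 1, 4, 8, 9, 9, 6]
s_apca = apca_from_boundaries(s, [3, 5, 8])
lower = d_lb(q, s_apca)
exact = euclid(q, s)
```

The query is averaged over the segments of S, not over its own segments, which is what keeps the value a lower bound on the exact distance.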

Index on M-dimensional APCA space

[Figure: data points S1 to S9 in the M-dimensional APCA space, grouped into MBRs R1 to R4, and the corresponding index tree whose leaves hold S1 to S9]
Any index structure can be used.

k-nearest neighbor algorithm

[Figure: a query Q with MINDIST(Q, R2), MINDIST(Q, R3) and MINDIST(Q, R4) drawn to the MBRs]
For any node U of the index structure with MBR R, $MINDIST(Q, R) \le D(Q, S)$ for any data item S under U.

Index modification for MINDIST computation

APCA point: $S = \{sv_1, sr_1, sv_2, sr_2, \ldots, sv_M, sr_M\}$
APCA rectangle: S = (L, H), where
  $L = \{smin_1, sr_1, smin_2, sr_2, \ldots, smin_M, sr_M\}$ and
  $H = \{smax_1, sr_1, smax_2, sr_2, \ldots, smax_M, sr_M\}$
[Figure: each segment i drawn with its value $sv_i$ between $smin_i$ and $smax_i$, ending at $sr_i$]

MBR representation in time-value space

We can view the MBR R = (L, H) of any node U as two APCA representations
  $L = \{l_1, l_2, \ldots, l_{N-1}, l_N\}$ and $H = \{h_1, h_2, \ldots, h_{N-1}, h_N\}$
[Figure: for N = 6, $L = \{l_1, \ldots, l_6\}$ and $H = \{h_1, \ldots, h_6\}$ delimit REGION 1, REGION 2 and REGION 3; value axis vs. time axis]

Regions

M regions are associated with each MBR. The boundaries of the ith region are:
  value axis: from $l_{2i-1}$ to $h_{2i-1}$
  time axis: from $l_{2i-2} + 1$ to $h_{2i}$
The ith region is active at time instant t if it spans across t.
The value $s_t$ of any time series S under node U at time instant t must lie in one of the regions active at t.
[Figure: REGION 1, REGION 2 and REGION 3 of an MBR with bounds $l_1, \ldots, l_6$ and $h_1, \ldots, h_6$, with two marked time instants; value axis vs. time axis]

MINDIST Computation

For time instant t,
  $MINDIST(Q, R, t) = \min_{\text{region } G \text{ active at } t} MINDIST(Q, G, t)$
Example, for a time instant t at which regions 1 and 2 are active:
  $MINDIST(Q, R, t) = \min( MINDIST(Q, \text{Region 1}, t),\ MINDIST(Q, \text{Region 2}, t) ) = \min( (q_t - h_1)^2,\ (q_t - h_3)^2 ) = (q_t - h_1)^2$
Over all time instants:
  $MINDIST(Q, R) = \sqrt{\sum_{t=1}^{n} MINDIST(Q, R, t)}$
Lemma: $MINDIST(Q, R) \le D(Q, C)$ for any time series C under node U.

Experimental Results

Compare APCA with SVD, DFT, DWT and PAA in terms of:
  Time to compute the reduced representation
  Pruning power (quality of approximation)
  Query performance (in terms of I/O and CPU costs)
A hybrid tree is used to index the reduced space.
Datasets: Electrocardiogram data and Mixed Bag data, each tested over a range of query lengths n.
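The MINDIST computation described above can be sketched with regions given directly as rectangles in time-value space. The tuple layout (t_lo, t_hi, v_lo, v_hi), the 1-based inclusive time bounds, the function names and the sample regions are all assumptions for illustration; deriving the rectangles from L and H is left out.

```python
import math

def mindist_at(q_t, t, regions):
    """Smallest squared gap between q_t and any region active at time t."""
    best = math.inf
    for t_lo, t_hi, v_lo, v_hi in regions:
        if t_lo <= t <= t_hi:            # the region spans across t
            if q_t < v_lo:
                gap = v_lo - q_t
            elif q_t > v_hi:
                gap = q_t - v_hi
            else:
                gap = 0.0                # q_t lies inside the region's value range
            best = min(best, gap * gap)
    return best  # stays infinite if no region is active (not expected for a valid MBR)

def mindist(q, regions):
    """MINDIST(Q, R): square root of the summed per-instant squared gaps."""
    return math.sqrt(sum(mindist_at(q_t, t, regions) for t, q_t in enumerate(q, start=1)))

# Two hypothetical overlapping regions over four time instants.
regions = [(1, 3, 0.0, 2.0), (2, 4, 5.0, 6.0)]
q = [3.0, 4.0, 1.0, 9.0]
per_instant = [mindist_at(q_t, t, regions) for t, q_t in enumerate(q, start=1)]
total = mindist(q, regions)
```

At t = 2 both regions are active and the closer one determines the term, mirroring the slide's example where the minimum over two active regions is taken.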

[Figure: Time to Compute Reduced Representation. Time (sec) against database size for SVD, APCA, Fourier, Wavelet and PAA]
[Figure: Pruning power. Number of objects examined for an NN query, for Fourier, Wavelet/PAA and APCA, against original dimensionality (n) and reduced dimensionality (n')]
[Figure: Search Performance (I/O Cost). Number of random disk accesses for Linear Scan, Fourier, Wavelet/PAA and APCA, against original dimensionality (n) and reduced dimensionality (n')]
[Figure: Search Performance (CPU Cost). CPU time (sec) for Linear Scan, Fourier, Wavelet/PAA and APCA, against original dimensionality (n) and reduced dimensionality (n')]

References

Efficient Similarity Search in Sequence Databases, Rakesh Agrawal, Christos Faloutsos and Arun Swami, FODO conference, Evanston, Illinois, Oct. 13-15, 1993.
Fast Subsequence Matching in Time-Series Databases, Christos Faloutsos, M. Ranganathan and Yannis Manolopoulos, SIGMOD 1994, pages 419-429.
Variable Length Queries for Time Series Data, Tamer Kahveci and Ambuj K. Singh, ICDE 2001, pages 273-282.
Fast Time-Series Searching with Scaling and Shifting, Kelvin Kam Wing Chu and Man Hon Wong, ACM Symposium on Principles of Database Systems (PODS), 1999, pages 237-248.
Locally Adaptive Dimensionality Reduction for Indexing Large Time Series Databases, Eamonn Keogh, Kaushik Chakrabarti, Sharad Mehrotra and Michael Pazzani, ACM SIGMOD 2001.
On Similarity Queries for Time-Series Data: Constraint Specification and Implementation, D. Goldin and P. Kanellakis, Proceedings of the 1st International Conference on Principles and Practice of Constraint Programming, 1995.
Similarity Searching for Multi-Attribute Sequences, T. Kahveci, A. K. Singh and A. Gurel, SSDBM 2002.