An Order-Invariant Time Series Distance Measure [Position on Recent Developments in Time Series Analysis]



Similar documents
Categorical Data Visualization and Clustering Using Subjective Factors

MatArcs: An Exploratory Data Analysis of Recurring Patterns in Multivariate Time Series

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

The Scientific Data Mining Process

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

SoSe 2014: M-TANI: Big Data Analytics

Method for detecting software anomalies based on recurrence plot analysis

Clustering Artificial Intelligence Henry Lin. Organizing data into clusters such that there is

The Value of Visualization 2

Class-specific Sparse Coding for Learning of Object Representations

Protein Protein Interaction Networks

Data Mining Clustering (2) Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining

Big Data Analytics of Multi-Relationship Online Social Network Based on Multi-Subnet Composited Complex Network

Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca

Chapter 7. Cluster Analysis

SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING

Efficient on-line Signature Verification System

Data Mining Project Report. Document Clustering. Meryem Uzun-Per

Analyzing The Role Of Dimension Arrangement For Data Visualization in Radviz

Chapter ML:XI (continued)

Evaluation of Heart Rate Variability Using Recurrence Analysis

. Learn the number of classes and the structure of each class using similarity between unlabeled training patterns

Clustering Data Streams

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

A SOCIAL NETWORK ANALYSIS APPROACH TO ANALYZE ROAD NETWORKS INTRODUCTION

Visualization of large data sets using MDS combined with LVQ.

Local outlier detection in data forensics: data mining approach to flag unusual schools

Visualizing non-hierarchical and hierarchical cluster analyses with clustergrams

TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM

Clustering Time Series Based on Forecast Distributions Using Kullback-Leibler Divergence

Principles of Data Mining by Hand&Mannila&Smyth

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS

Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10

Time series clustering and the analysis of film style

Robust Outlier Detection Technique in Data Mining: A Univariate Approach

VISUALIZING HIERARCHICAL DATA. Graham Wills SPSS Inc.,

Classifying Manipulation Primitives from Visual Data

Discovery of User Groups within Mobile Data

An Analysis of Missing Data Treatment Methods and Their Application to Health Care Dataset

Unsupervised learning: Clustering

Exploratory Spatial Data Analysis

The Theory of Concept Analysis and Customer Relationship Mining

Hierarchical Cluster Analysis Some Basics and Algorithms

Project Participants

Temporal Data Mining for Small and Big Data. Theophano Mitsa, Ph.D. Independent Data Mining/Analytics Consultant

A Comparison Framework of Similarity Metrics Used for Web Access Log Analysis

Machine Learning for Medical Image Analysis. A. Criminisi & the InnerEye MSRC

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Clustering Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

9. Text & Documents. Visualizing and Searching Documents. Dr. Thorsten Büring, 20. Dezember 2007, Vorlesung Wintersemester 2007/08

A Comparative Study of clustering algorithms Using weka tools

Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Cluster Analysis: Advanced Concepts

Step-by-Step Guide to Bi-Parental Linkage Mapping WHITE PAPER

Clustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016

Offline Word Spotting in Handwritten Documents

APPLICATION OF DATA MINING TECHNIQUES FOR BUILDING SIMULATION PERFORMANCE PREDICTION ANALYSIS.

Graph Mining and Social Network Analysis

Visual Data Mining with Pixel-oriented Visualization Techniques

STATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and

A NURSING CARE PLAN RECOMMENDER SYSTEM USING A DATA MINING APPROACH

The Graphical Method: An Example

CONTENTS PREFACE 1 INTRODUCTION 1 2 DATA VISUALIZATION 19

Comparative Analysis of EM Clustering Algorithm and Density Based Clustering Algorithm Using WEKA tool.

Tutorial for proteome data analysis using the Perseus software platform

Example 4.1 (nonlinear pendulum dynamics with friction) Figure 4.1: Pendulum. asin. k, a, and b. We study stability of the origin x

Competitive Analysis of On line Randomized Call Control in Cellular Networks

Chapter 6. The stacking ensemble approach

Extend Table Lens for High-Dimensional Data Visualization and Classification Mining

Alignment and Preprocessing for Data Analysis

Project and Production Management Prof. Arun Kanda Department of Mechanical Engineering Indian Institute of Technology, Delhi

Exploratory Data Analysis with MATLAB

The Role of Size Normalization on the Recognition Rate of Handwritten Numerals

Ontology-Based Discovery of Workflow Activity Patterns

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM

An Overview of Knowledge Discovery Database and Data mining Techniques

Module II: Multimedia Data Mining

Methodology for Emulating Self Organizing Maps for Visualization of Large Datasets

Topological Data Analysis Applications to Computer Vision

Going Big in Data Dimensionality:

Florida International University - University of Miami TRECVID 2014

There are a number of different methods that can be used to carry out a cluster analysis; these methods can be classified as follows:

International Journal of World Research, Vol: I Issue XIII, December 2008, Print ISSN: X DATA MINING TECHNIQUES AND STOCK MARKET

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

Echidna: Efficient Clustering of Hierarchical Data for Network Traffic Analysis

Probabilistic Latent Semantic Analysis (plsa)

Computer Networks and Internets, 5e Chapter 6 Information Sources and Signals. Introduction

Dotted Chart and Control-Flow Analysis for a Loan Application Process

Private Record Linkage with Bloom Filters

Comparison of Elastic Matching Algorithms for Online Tamil Handwritten Character Recognition

Social Media Mining. Data Mining Essentials

USING THE AGGLOMERATIVE METHOD OF HIERARCHICAL CLUSTERING AS A DATA MINING TOOL IN CAPITAL MARKET 1. Vera Marinova Boncheva

DATA MINING TECHNIQUES AND APPLICATIONS

Solving Simultaneous Equations and Matrices

72. Ontology Driven Knowledge Discovery Process: a proposal to integrate Ontology Engineering and KDD

How To Cluster

Example: Document Clustering. Clustering: Definition. Notion of a Cluster can be Ambiguous. Types of Clusterings. Hierarchical Clustering

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015

BASIC STATISTICAL METHODS FOR GENOMIC DATA ANALYSIS

Transcription:

An Order-Invariant Time Series Distance Measure [Position on Recent Developments in Time Series Analysis] Stephan Spiegel and Sahin Albayrak DAI-Lab, Technische Universität Berlin, Ernst-Reuter-Platz 7, 10587 Berlin, Germany {spiegel, albayrak}@dai-lab.de Keywords: Abstract: Time Series : Distance Measure : Recurrence Plots Although there has been substantial progress in time series analysis in recent years, time series distance measures still remain a topic of interest with a lot of potential for improvements. In this paper we introduce a novel Order Invariant Distance measure which is able to determine the (dis)similarity of time series that exhibit similar sub-sequences at arbitrary positions. Additionally, we demonstrate the practicality of the proposed measure on a sample data set of synthetic time series with artificially implanted patterns, and discuss the implications for real-life data mining applications. 1 INTRODUCTION The distance between time series needs to be carefully defined in order to reflect the underlying (dis)similarity of such data. This is particularly desirable for similarity-based retrieval, classification, clustering, segmentation and other mining procedures of time series (Ding et al., 2008). The choice of time series distance measure depends on the invariance required by the domain (Batista and Wang, 2011). Recent work has introduced techniques designed to efficiently measure similarity between time series with invariance to (various combinations of) the distortions of warping, uniform scaling, offset, amplitude scaling, phase, occlusions, uncertainty and wandering baseline (Batista and Wang, 2011). In this study we propose a novel Order-Invariant Distance (OID) measure which is able to determine the (dis)similarity of time series that exhibit similar sub-sequences at arbitrary positions. We claim that order invariance is an important consideration for domains such as automotive engineering and smart home environments, where multiple sensors log contextual patterns in their naturally occurring order, and time series are compared to discriminate complex situations (Spiegel et al., 2011a; Spiegel et al., 2011b). To ensure the validity of our claim, we demonstrate the practicability and capabilities of the introduced OID measure on a sample data set of synthetic time series with artificially implanted pattens. The rest of the paper is structured as followed. Section 2 presents popular distance measures which are frequently used to compare time series. Known invariance and important considerations for the design of time series distance measures are discussed in Section 3. Our proposed Order-Invariant Distance is introduced in Section 4. The capabilities of our presented OID measure are demonstrated in Section 5. Finally we conclude the paper in Section 6. 2 DISTANCE MEASURES Suppose we have two time series, Q and C, of length n. To measure their similarity, we can use the Euclidean Distance (ED): ED(Q,C)= 2 n i=1(q i c i ) 2 (1) While ED is a simple measure, it is suitable for many problems (Lin et al., 2004). Nevertheless, in many domains the data is distorted in some way, and either the distortion must be removed before using Euclidean Distance, or a more robust measure must be used instead (Batista and Wang, 2011). Dynamic Time Warping (DTW) allows a more intuitive distance measure for time series that have similar shape, but are not aligned in time (Keogh and Ratanamahatana, 2005). To align two sequences using DTW we construct a matrix which contains the

distances between any two points. A warping path W = w 1...w K is a contiguous set of matrix elements that defines a mapping between Q and C under several constrains (Keogh and Ratanamahatana, 2005): { DTW(Q,C)=min 2 K k=1 w k (2) This path can be found using dynamic programming to evaluate the following recurrence which defines the cumulative distance (i, j) as the distance d(i, j) found in the current cell and the minimum of the cumulative distances of the adjacent elements (Keogh and Ratanamahatana, 2005): (i, j)=d(q i,c j )+min{ (i 1, j 1), (i 1, j), (i, j 1)} (3) The Euclidean Distance between two sequences can be seen as a special case of DTW where both time series have same length and the warping path complies with the main diagonal of the distance matrix. However, the superiority of DTW over ED has been demonstrated by several authors (Ding et al., 2008; Keogh, 2003) for many data mining applications. 3 KNOWN INVARIANCE Important considerations regarding time series distance include amplitude invariance and offset invariance (Batista and Wang, 2011). If we try to compare two time series measured on different scales they will not match well, even if they have similar shapes. Similarly, even if two time series have identical amplitudes, they may have different offsets. To measure the true underlying similarity we must first center and scale the time series (by trivial z-normalization). Furthermore, local scaling invariance or rather warping invariance (Batista and Wang, 2011) should be taken into account when comparing time series. This invariance is necessary in almost all biological signals, which match only when one is locally warped to align with the other. Recent empirical evidence strongly suggests that Dynamic Time Warping is a robust distance measure which works exceptionally well (Ding et al., 2008). In contrast to the localized scaling that DTW deals with, in many data sets we must account for uniform scaling invariance (Batista and Wang, 2011), where we try to match a shorter time series against the prefix of a longer one. The main difficulty in creating uniform scaling invariance is that we typically do not know the scaling factor in advance, and are thus condemned to testing all possibilities within a given range (Keogh, 2003). Phase invariance (Batista and Wang, 2011) is important when matching periodic time series such as heart beats. By holding one time series fixed, and testing all circular shifts of the other, we can achieve phase invariance. In domains where a small sub-sequence of a time series may be missing we must consider occlusion invariance (Batista and Wang, 2011). This form of invariance can be achieve by selectively declining to match subsections of a time series. However, most real-life problems require multiple invariance. 4 ORDER-INVARIANT DISTANCE This paper introduces an Order-Invariant Distance (OID) measure which is able to determine the (dis)similarity of time series which have different shapes, but exhibit similar sub-sequences in arbitrary order. For example, we can imagine the speed recorded during two different car drives, from home to the convenience store and back, where the signals exhibit the same location-dependent traffic situations (e.g. crosswalk, intersection, driveway, traffic light) in reverse order (refer to Figure 1) and are therefore similar in regard of order invariance. Although order invariance may be an important consideration for other real-life data mining applications, relevant literature (Batista and Wang, 2011) is lacking a time series distance measure which is able to determine the (dis)similarity of signals that contain multiple similar events at arbitrary positions in time. Commonly used measures like ED and DTW are not designed to deal with order invariance, because they discriminate time series according to their shapes and fail to recognize cross-alignments between unordered sub-sequences. To this end, we developed an Order Invariant Distance measure which matches similar sub-sequences regardless of their order or location. Our proposed OID measure is based on the Cross Recurrence Plot (CRP) approach (Marwan, 2008; Marwan et al., 2007) which tests for occurrences of similar states in two different systems, or rather time series (with same physical units). The data length of both time series can differ, leading to a non-square recurrence matrix R: R i, j = ( q i c j ) (4) where represents the Heaviside function (i.e. (x)=0 if x < 0, and (x)=1 otherwise), is a

A B C D D C B A A D *BC* B C Figure 1: Sample data set of normally distributed pseudo-random time series (named as,, and *BC*, illustrated left) with artificially implanted sinus patterns (labeled as A to D, presented in their occurring order on the right). *BC* *BC* 19 19.5 20 20.5 21 0.005 0.01 0.015 0.02 0.025 Figure 2: Agglomerative hierarchical cluster tree (dendrogram) of synthetic time series data (introduced in Figure 1) according to the Dynamic Time Warping (left) and our proposed Order-Invariant Distance (right), where the x-axis reveals the distance between the time series being merged and the y-axis illustrates the corresponding name and shape of the signal. norm (i.e. L 2 -norm) and is a threshold distance that determines the radius of the similarity neighborhood. A closer inspection of the Cross Recurrence Plot (matrix R) reveals small-scale structures, which can be classified in single dots and lines of different direction (Marwan, 2008; Marwan et al., 2007). We are especially interested in lines that run parallel to the main diagonal, which occur when the trajectories of two sub-sequences are similar. In order to go beyond the visual impressions yielded by Cross Recurrence Plots, several measures of complexity which quantify the small-scale structures have been introduced and are known as Recurrence Quantification Analysis (RQA) (Marwan, 2008; Marwan et al., 2007). These measures are based on the recurrence matrix R( ) considering a certain - neighborhood. Commonly used measures include recurrence rate, determinism, entropy as well as average diagonal line length L: L(,l min )= N l=l min l P(,l) N l=l min P(,l) (5) where P(,l)= N i, j=1 {(1 R i 1, j 1 ( )) (1 R i+l, j+l ( )) l 1 k=0 R i+k, j+k ( )} (6) is the histogram of diagonal line of length l. The frequency and length of the diagonal lines are obviously related to a certain similarity between the dynamics of both time series. A measure based on the lengths of such lines can be used to find non-linear interrelations between two time series, which cannot be detected by the common cross-correlation function (Marwan et al., 2007). To this end, we propose the reciprocal average diagonal line length of a Cross Recurrence Plot as an Order-Invariant Distance (OID) measure for time series that exhibit similar sub-sequences at arbitrary positions in time. OID = 1/L (7)

Figure 3: Cross Recurrence Plot (CRP) of synthetic time series and (left) as well as and (right) introduced in Figure 1. Note that the main diagonal runs from upper left to bottom right. 5 CASE STUDY In this section we demonstrate the practicality of our proposed Order-Invariant Distance measure on a sample data set of synthetic time series. As illustrated in Figure 1, we consider four different normally distributed pseudo-random time series with artificially implanted sinus patterns. The first two time series comprise the same sub-sequences in reverse order, whereas the last two time series contain a subset of the artificially implanted signals. Figure 2 shows a direct comparison of Dynamic Time Warping and our introduced Order-Invariant Distance measure. As expected, the hierarchical cluster tree generated by means of DTW indicates a relatively small distance between the time series, and *BC*, because they exhibit similar subsequences at the same positions. However, DTW treats the time series as an outlier, due to the artificially implanted patterns occurring in reverse order. In contrast, the OID measure considers the time series and as most similar, because the order of the matched patterns is disregarded. Furthermore, the dendrogram generated by means of OID reveals that the time series and *BC* are dissimilar to and, which is due to the fact that the overlap of same or similar sub-sequences is relatively small ( 50%). Figure 3 illustrates the Cross Recurrence Plots of the time series and as well as and introduced in Figure 1. Lines parallel to the main diagonal (from upper left to bottom right) indicate similar sub-sequences in both time series. The average diagonal line length L is higher for the Cross Recurrence Plot of the time series and than for the CRP of the pair and. Since we want similar time series to have a small distance, our proposed Order Invariant Distance measure is defined as the reciprocal of the average diagonal line length (refer to Equation 7). The presented results serve to demonstrate the capabilities of the proposed Order-Invariant Distance measure, rather than to draw any conclusions without full evaluation. However, we strongly believe that the introduced OID measure is suitable to determine the (dis)similarity of time series which exhibit same or similar sub-sequences at arbitrary positions in time. 6 CONCLUSION In this paper we introduced order invariance for time series, which has, to our knowledge, been missed by the community. Hence, we proposed a novel Order Invariant Distance measure which is able to determine the (dis)similarity of time series with similar sub-sequences at arbitrary positions in time. In addition, we demonstrated the practicality of our proposed OID measure on a sample data set of synthetic time series with artificially implanted patterns. We strongly believe that order invariance is an important consideration for many real-life data mining applications. Our future work will include a full evaluation on publicly available data.

REFERENCES Batista, G. and Wang, X. (2011). A complexity-invariant distance measure for time series. SIAM Internation Conference on Data Mining (SDM)on Data Mining, Philadelphia, PA, USA. Ding, H., Trajcevski, G., Scheuermann, P., Wang, X., and Keogh, E. (2008). Querying and mining of time series data: experimental comparison of representations and distance measures. Time, 1(2):15421552. Keogh, E. (2003). Efficiently finding arbitrarily scaled patterns in massive time series databases. volume 2838 of Lecture Notes in Computer Science, pages 253 265. Springer Berlin / Heidelberg. Keogh, E. and Ratanamahatana, C. A. (2005). Exact indexing of dynamic time warping. Knowl. Inf. Syst., 7(3):358 386. Lin, J., Keogh, E., and Lonardi, S. (2004). Visualizing and discovering non-trivial patterns in large time series databases. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, volume 4, page 61. SAGE Publications. Marwan, N. (2008). A historical review of recurrence plots. The European Physical Journal Special Topics, 164(1):3 12. Marwan, N., Carmenromano, M., Thiel, M., and Kurths, J. (2007). Recurrence plots for the analysis of complex systems. Physics Reports, 438(5-6):237 329. Spiegel, S., Gaebler, J., Lommatzsch, A., De Luca, E., and Albayrak, S. (2011a). Pattern recognition and classification for multivariate time series. In KDD-2011: Proceeding of ACM International Workshop on Knowledge Discovery from Sensor Data (SensorKDD-2011), San Diego, CA, USA. ACM. Spiegel, S., Jain, B.-J., De Luca, E., and Albayrak, S. (2011b). Pattern recognition in multivariate time series - dissertation proposal. In CIKM 2011: Proceedings of 4th Workshop for Ph.D. Students in Information and Knowledge Management (PIKM 2011), Glasgow, UK. ACM.