Cluster analysis of time series data

Similar documents

Java Modules for Time Series Analysis

Algebra Academic Content Standards Grade Eight and Grade Nine Ohio. Grade Eight. Number, Number Sense and Operations Standard

Advanced Ensemble Strategies for Polynomial Models

A New Concept of PTP Vector Network Analyzer

Standardization and Its Effects on K-Means Clustering Algorithm

Clustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016

Data Exploration Data Visualization

Analysis of algorithms of time series analysis for forecasting sales

Machine Learning Logistic Regression

Algebra 1 Course Title

How To Use Neural Networks In Data Mining

An Introduction to Data Mining

How To Predict Web Site Visits

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

CS 2750 Machine Learning. Lecture 1. Machine Learning. CS 2750 Machine Learning.

Algebra I Vocabulary Cards

CRLS Mathematics Department Algebra I Curriculum Map/Pacing Guide

3. Data Analysis, Statistics, and Probability

High Productivity Data Processing Analytics Methods with Applications

Machine Learning and Data Mining. Regression Problem. (adapted from) Prof. Alexander Ihler

1. Classification problems

MATH BOOK OF PROBLEMS SERIES. New from Pearson Custom Publishing!

CLUSTERING LARGE DATA SETS WITH MIXED NUMERIC AND CATEGORICAL VALUES *

Support Vector Machines with Clustering for Training with Very Large Datasets

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

CS Introduction to Data Mining Instructor: Abdullah Mueen

Visualization of large data sets using MDS combined with LVQ.

Parallel Data Selection Based on Neurodynamic Optimization in the Era of Big Data

DRAFT. Algebra 1 EOC Item Specifications

Simple and efficient online algorithms for real world applications

MEASURES OF VARIATION

An Overview of Knowledge Discovery Database and Data mining Techniques

Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca

ITSM-R Reference Manual

Application of Data Mining Techniques in Intrusion Detection

CURRICULUM VITAE PETROS KARVELIS

A C T R esearcli R e p o rt S eries Using ACT Assessment Scores to Set Benchmarks for College Readiness. IJeff Allen.

MATLAB-based Applications for Image Processing and Image Quality Assessment Part II: Experimental Results

Social Media Mining. Data Mining Essentials

What are the place values to the left of the decimal point and their associated powers of ten?

How To Identify Noisy Variables In A Cluster

Overview Accounting for Business (MCD1010) Introductory Mathematics for Business (MCD1550) Introductory Economics (MCD1690)...

Statistics Graduate Courses

Diploma of Business Unit Guide 2015

Bachelor's Degree in Business Administration and Master's Degree course description

Pre-Algebra Academic Content Standards Grade Eight Ohio. Number, Number Sense and Operations Standard. Number and Number Systems

Advice for Students completing the B.S. degree in Computer Science based on Quarters How to Satisfy Computer Science Related Electives

DISCRETE EVENT SIMULATION HELPDESK MODEL IN SIMPROCESS

Exploratory data analysis (Chapter 2) Fall 2011

AP Physics 1 and 2 Lab Investigations

Identification algorithms for hybrid systems

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set

This unit will lay the groundwork for later units where the students will extend this knowledge to quadratic and exponential functions.

Grid Density Clustering Algorithm

Machine Learning. Mausam (based on slides by Tom Mitchell, Oren Etzioni and Pedro Domingos)

Thnkwell s Homeschool Precalculus Course Lesson Plan: 36 weeks

Introduction to Machine Learning Using Python. Vikram Kamath

Decision Support System Methodology Using a Visual Approach for Cluster Analysis Problems

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015

Master Specialization in Knowledge Engineering

Tracking and Recognition in Sports Videos

Geostatistics Exploratory Analysis

Digital Imaging and Multimedia. Filters. Ahmed Elgammal Dept. of Computer Science Rutgers University

Univariate Regression

Efficient Curve Fitting Techniques

BNG 202 Biomechanics Lab. Descriptive statistics and probability distributions I

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

Lecture 2: Descriptive Statistics and Exploratory Data Analysis

A Regression Approach for Forecasting Vendor Revenue in Telecommunication Industries

Forecasting of Economic Quantities using Fuzzy Autoregressive Model and Fuzzy Neural Network

The Enron Corpus: A New Dataset for Classification Research

Data Preprocessing. Week 2

Wireless Sensor Networks Coverage Optimization based on Improved AFSA Algorithm

SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING

Parameter identification of a linear single track vehicle model

Mobile Phone APP Software Browsing Behavior using Clustering Analysis

Algebra 1 Course Information

PERFORMANCE ANALYSIS OF CLUSTERING ALGORITHMS IN DATA MINING IN WEKA

School of Mathematics, Computer Science and Engineering Department or equivalent School of Engineering and Mathematical Sciences Programme code

A STUDY REGARDING INTER DOMAIN LINKED DOCUMENTS SIMILARITY AND THEIR CONSEQUENT BOUNCE RATE

GRADES 7, 8, AND 9 BIG IDEAS

Using Mixtures-of-Distributions models to inform farm size selection decisions in representative farm modelling. Philip Kostov and Seamus McErlean

Algebra and Geometry Review (61 topics, no due date)

KNIME TUTORIAL. Anna Monreale KDD-Lab, University of Pisa

COMMON CORE STATE STANDARDS FOR

QUALITY ENGINEERING PROGRAM

Data Mining: A Preprocessing Engine

Mathematics Georgia Performance Standards

T : Classification as Spam or Ham using Naive Bayes Classifier. Santosh Tirunagari :

Adaptive Equalization of binary encoded signals Using LMS Algorithm

Comparison of K-means and Backpropagation Data Mining Algorithms

Data Mining - Evaluation of Classifiers

1 Prior Probability and Posterior Probability

Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms

Scalable Developments for Big Data Analytics in Remote Sensing

Course Text. Required Computing Software. Course Description. Course Objectives. StraighterLine. Business Statistics

Basic Data Analysis Tutorial

Transcription:

Cluster analysis of time series data Supervisor: Ing. Pavel Kordík, Ph.D. Department of Theoretical Computer Science Faculty of Information Technology Czech Technical University in Prague January 5, 2012

The Problem The Problem find suitable method for identification of patterns assign samples into (unknown) groups

The Problem Expected results

Goals The Problem Goals capture global trends absolute values (sometimes) doesn t matter signals are not periodical discover unknown patterns

The Problem Biological background

Data mining Phases of clustering process 1 data cleaning 2 data integration 3 data selection 4 data transformation 5 clustering 6 pattern evaluation 7 knowledge representation

Data mining Clustering Clustering

Data mining Clustering No correct clustering exists

Data mining Clustering Definition Those methods concerned in some way with the identification of homogeneous groups of objects [Arabie et al., 1996] Definition A cluster is a set of entities that are alike, and entities from different clusters are not alike [Everitt, 1993]

Data mining Clustering clustering can be used for understanding data to perform clustering you need to understand data

Data mining Determine number of clusters Clustering

k = 3 Data mining Clustering

k = 6 Data mining Clustering

k = 9 Data mining Clustering

k = 14 Data mining Clustering

Data mining Clustering Determining the number of clusters in a data set is challenging [Mufti et al., 2005]

www.zo.utexas.edu/faculty/antisense/tree.pdf

Data mining Clustering from Chinese encyklopedia Heavenly Emporium of Benevolent Knowledge. Animals are divided into [Borges, 1952]: those that belong to the emperor embalmed ones those that are trained suckling pigs mermaids fabulous ones those that are included in this classification innumerable ones etcetera

Data mining Clustering Clustering is ill-defined [Caruana et al., 2006] All we care about is the usefulness of the clustering for achieving our final goal [Guyon et al., 2009]

Time series Time series

Problem Time series sensitive to small changes sum of distance does not capture shape of curve computationally expensive redundant information

Time series Autoregressive model predict an output of a system based on the previous outputs p X t = c + ϕ i X t i + ɛ t i=1 ϕ i parameters of the AR model X t amplitude of the signal ɛ t white noises

Time series Moving average q X t = µ + ɛ t + φ i ɛ t i i=1 φ i parameters of the AR model µ expectations of X t (often assumed to equal 0) ɛ t white noises

Time series Autoregressive moving-average model putting all together: p q X t = c + ɛ t + ϕ i X t i + ɛ t + φ i ɛ t i i=1 i=1 ARMA(p, q) refers to the model with p autoregressive terms and q moving-average terms in Matlab function armax[time-domain, data object]

Time series Fitting Exponential Polynomial

Time series Fitting Representation of inputs Measured values Approximated model too many inputs does not represent patterns only 5 parameters describing whole curve represent patterns

Time series Fitting How many parameters do we need? 2 parameters 5 parameters

Time series Fitting Which parameters to choose? on previous slide input parameters were following: mean minimum maximum linear coefficient quadratic coefficient for EEG clustering is [Siuly et al., 2011] using: minimum maximum mean median modus first quartile third quartile inter-quartile range standard deviation

Dendrogram Time series Visualizations

Comparing clusterings C-Index The C-index was reviewed in Hubert and Levin [1976] p c index = d w min(d w ) max(d w ) min(d w ) where d w is the sum of the within cluster distances. Gamma p gamma = s(+) + s( ) s(+) s( ) where s(+) represents the number of consistent comparisons involving between and within cluster distances, and s( ) represents the number of inconsistent outcomes Milligan and Cooper [1985]

Comparing clusterings The strive for objectivity, repeatability, testability etc. is perfectly right attitude as long as their proper place in the hierarchy of aims is maintained, but becomes very harmful if these tools dominate over the purpose of scientific research. [Holynski, 2005, p. 487]

Questions? Comparing clusterings Thanks for your attention! tomas.barton@fit.cvut.cz

Comparing clusterings References I P. Arabie, L. J. Hubert, and G. D. Soete. Clustering and classification. World Scientific, 1996. J.L. Borges. El idioma analítico de john wilkins. Obras completas, 1952. Rich Caruana, Mohamed Elhawary, Nam Nguyen, and Casey Smith. Meta clustering. In Proceedings of the Sixth International Conference on Data Mining, ICDM 06, pages 107 118, Washington, DC, USA, 2006. IEEE Computer Society. ISBN 0-7695-2701-9. doi: 10.1109/ICDM.2006.103. URL http://dx.doi.org/10.1109/icdm.2006.103. B. S. Everitt. Cluster Analysis. Edward Arnold, 1993.

References II Comparing clusterings I. Guyon, U. Von Luxburg, and R.C. Williamson. Clustering: Science or art. In NIPS 2009 Workshop on Clustering Theory, 2009. Roman B. Holynski. Philosophy of science from a taxonomist s perspective. Genus, 16(4):469 502, 2005. L.J. Hubert and J.R. Levin. A general statistical framework for assessing categorical clustering in free recall. Psychological Bulletin, 83(6):1072, 1976. Glenn W. Milligan and Martha C. Cooper. An examination of procedures for determining the number of clusters in a dataset. Psychometrika, 50(2):159 179, June 1985.

References III Comparing clusterings G. Bel Mufti, P. Bertrand, and L. El Moubarki. Determining the number of groups from measures of cluster stability. In Proceedings of International Symposium on Applied Stochastic Models and Data Analysi, pages 404 412, 2005. Siuly, Yan Li, and Peng (Paul) Wen. Clustering technique-based least square support vector machine for eeg signal classification. Computer Methods and Programs in Biomedicine, 104(3): 358 372, 2011. URL http://dblp.uni-trier.de/db/ journals/cmpb/cmpb104.html#siulylw11;http://dx.doi. org/10.1016/j.cmpb.2010.11.014;http://www.bibsonomy. org/bibtex/2976ac83c8e51dd3ff108ee52da22902d/dblp.