Ludwig-Maximilians-Universität München, Institute for Informatics, Database Systems Group
Going Big in Data Dimensionality: Challenges and Solutions for Mining High-Dimensional Data
Peer Kröger, Lehrstuhl für Datenbanksysteme, Ludwig-Maximilians-Universität München
Going Big in Data Dimensionality?
The three Vs of big data: Volume, Velocity, Variety.
Let us talk about variety before talking about velocity and volume. An aspect of variety (and volume) is high dimensionality:
- An archaeological finding can have 10+ attributes/dimensions.
- Recommendation data can feature 100+ dimensions per object.
- Microarray data may contain 1,000+ dimensions per object.
- TF vectors may contain 10,000+ dimensions per object.
Note: we are talking about structured data!
Outline
- Why Bother?
- Solutions
- Perspectives
- Open Issues
Why Bother?
Clustering: automatically partition the data into clusters of similar objects.
Diverse applications, ranging from business (e.g., customer segmentation) to science (e.g., molecular biology).
Similarity? Objects are points in some feature space (let us assume a Euclidean space for convenience).
Similarity can then be defined as the vicinity of two feature vectors in the feature space, usually using a distance function such as an Lp norm, the cosine distance, etc.
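As a quick illustration of the distance functions mentioned here, a minimal sketch (function names are my own, not from the slides):

```python
import numpy as np

def lp_distance(x, y, p=2):
    """Minkowski (Lp) distance between two feature vectors."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(np.sum(np.abs(x - y) ** p) ** (1.0 / p))

def cosine_distance(x, y):
    """1 minus the cosine similarity of two feature vectors."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))
```

Note that the cosine distance ignores vector length, which is why it is popular for TF vectors, while Lp norms do not.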
Why Bother?
But: in the early days of data mining, the relevance of features was carefully analyzed before recording them, because data acquisition was costly.
So far so good: only relevant features were recorded.
Nowadays, data acquisition is cheap and easy (mobile devices, sensors, modern machines, etc.): everyone measures everything, everywhere.
Consequence: sure, big data, but what is the problem?
- The relevance of a given feature to the analysis task is not necessarily clear.
- Data is high dimensional, containing a possibly large number of irrelevant attributes.
OK, but why does this bother us?
Why Bother?
High-dimensional data problems.
General: the curse of dimensionality. For broad classes of data distributions, the relative contrast of distances vanishes as the dimensionality grows:
∀ε > 0: lim_{d→∞} P[ Dmax_d ≤ (1 + ε) · Dmin_d ] = 1
where Dmax_d = distance to the farthest neighbor, Dmin_d = distance to the nearest neighbor, and d = dimensionality of the data space.
The relative contrast of distances decreases; no clusters can be found any more, only noise.
Special/additional problems: clusters in subspaces, local feature relevance, local feature correlation.
[Figure: a cluster that is compact in a relevant attribute/relevant subspace but smeared along an irrelevant attribute.]
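The vanishing distance contrast is easy to observe empirically. A small simulation sketch (sample sizes and the uniform distribution are my illustrative choices):

```python
import numpy as np

def relative_contrast(d, n=1000, seed=0):
    """(Dmax - Dmin) / Dmin for the distances from the origin to n
    points drawn uniformly from the unit cube [0, 1]^d."""
    rng = np.random.default_rng(seed)
    dist = np.linalg.norm(rng.random((n, d)), axis=1)
    return float((dist.max() - dist.min()) / dist.min())

for d in (2, 10, 100, 1000):
    print(d, relative_contrast(d))   # the contrast shrinks as d grows
```

In low dimensions the nearest and farthest neighbor differ by orders of magnitude; in 1000 dimensions all distances concentrate around the same value.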
Why Bother?
Example: local feature correlation. Customers (C1-C10) rate products (P1-P3):

      P1  P2  P3
C1     3   2   1
C2     2   3   4
C3     0   1   2
C4     3   4   5
C5     5   5   5
C6     1   3  10
C7     4   6   8
C8     0   2   3
C9     7   9   1
C10    3   5   5

One group of customers (C1-C5, C7) satisfies P1 - 2·P2 + P3 = 0; another group (C6-C10) satisfies P1 - P2 + 2 = 0.
[Figure: 3D plot with axes P1, P2, P3, showing one relevant subspace that is a 2D plane (perpendicular to a line) and one that is a 1D line (perpendicular to a plane).]
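The two linear dependencies in the rating example (P1 - 2·P2 + P3 = 0 and P1 - P2 + 2 = 0, as reconstructed from the slide) can be verified directly against the table:

```python
ratings = {  # customer -> (P1, P2, P3), values from the rating table
    "C1": (3, 2, 1), "C2": (2, 3, 4), "C3": (0, 1, 2), "C4": (3, 4, 5),
    "C5": (5, 5, 5), "C6": (1, 3, 10), "C7": (4, 6, 8), "C8": (0, 2, 3),
    "C9": (7, 9, 1), "C10": (3, 5, 5),
}

# Customers on the 2D plane P1 - 2*P2 + P3 = 0:
plane = sorted(c for c, (p1, p2, p3) in ratings.items() if p1 - 2 * p2 + p3 == 0)
# Customers satisfying P1 - P2 + 2 = 0:
other = sorted(c for c, (p1, p2, p3) in ratings.items() if p1 - p2 + 2 == 0)
print(plane)
print(other)
```

Note that C7 satisfies both equations, i.e., it lies on the intersection of the two correlation clusters.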
Why Bother?
High-dimensional data problems (continued).
General: the curse of dimensionality (the relative contrast of distances decreases; no clusters to find any more, only noise).
Special/additional: clusters in subspaces, local feature relevance, local feature correlation.
Consequences:
- We should care about accuracy before caring about scalability.
- If we take a traditional clustering method and optimize it for efficiency, we will most likely not get the desired results.
Why Bother?
General solution: correlation clustering.
Search for clusters in arbitrarily oriented subspaces.
An affine subspace S + a, with S ⊆ R^d and affinity a ∈ R^d, is interesting if a set of points clusters (is dense) within this subspace.
The points of the cluster may exhibit high variance in the perpendicular subspace (R^d \ S) + a; the points then form a hyper-plane along S + a.
[Figure: two point sets, each forming a hyper-plane S + a through affinity a.]
Why Bother?
Back to the definition of clustering.
Clustering: partition the data into clusters of similar objects.
Traditionally, similarity means similar values in all attributes (this obviously does not account for high-dimensional data).
Now, similarity is defined as a common correlation in a given (sub)set of features (actually, that is not too far apart).
Why Bother?
Why not feature selection?
(Unsupervised) feature selection (e.g., PCA, SVD, ...) is global; it always returns only one (reduced) feature space.
The local feature relevance/correlation problem states that we usually need multiple feature spaces (possibly one for each cluster).
Example: simplified metabolic screening data (here: 2D; 43D in reality), with clusters corresponding to different disorders.
Why Bother?
Use feature selection before clustering?
[Figure: PCA, then DBSCAN on the projection onto the first principal component.]
Why Bother?
Two tasks:
1. We still need to search for clusters (depends on the cluster model); e.g., the minimal cut of the similarity graph is NP-complete.
2. But now, we also need to search for arbitrarily oriented subspaces (the search space is probably infinite).
Naïve solution: given a cluster criterion and a database of n points, compute for each subset of at least k points the subspace in which these points cluster, and test the cluster criterion in this subspace.
Search space: (n choose k) + (n choose k+1) + ... + (n choose n) ∈ O(2^n) candidate subsets.
How can we compute the subspace of a cluster? => see later. What is a cluster criterion? => see task 1.
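The size of the naïve search space can be tallied explicitly; a quick sketch (function name mine):

```python
from math import comb

def naive_candidates(n, k_min):
    """Number of point subsets of size at least k_min among n points,
    i.e., the candidate sets the naive solution would have to examine."""
    return sum(comb(n, k) for k in range(k_min, n + 1))

print(naive_candidates(30, 3))  # already more than a billion for n = 30
```

Even for a tiny database of 30 points the naïve approach is hopeless, which is why the heuristics discussed below are unavoidable.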
Why Bother?
Even worse: a circular dependency. Both tasks depend on each other:
- In order to determine the correct subspace of a cluster, we need to know (at least some) cluster members.
- In order to determine the correct cluster memberships, we need to know the subspaces of all clusters.
How to solve the circular dependency problem? Integrate subspace search into the clustering process.
Due to the complexity of both tasks, we need heuristics; these heuristics should simultaneously solve the clustering problem and the subspace search problem.
Outline
- Why Bother?
- Solutions
- Perspectives
- Open Issues
Solutions
Finding clusters in arbitrarily oriented subspaces.
Given a set D of points (e.g., a potential cluster): how can we determine the subspace in which these points cluster?
Principal Component Analysis (PCA) determines the directions of highest variance.
Compute the covariance matrix of D, with centroid x̄_D:
Σ_D = (1/|D|) · Σ_{x ∈ D} (x − x̄_D)(x − x̄_D)^T
Obtain the eigenvalue matrix E_D and the eigenvector matrix V_D of Σ_D:
Σ_D = V_D · E_D · V_D^T
V_D is a new basis whose first eigenvector is the direction of highest variance; E_D is the covariance matrix of D in the new coordinate system.
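The covariance matrix and its eigendecomposition translate directly into code; a minimal NumPy sketch (function name mine):

```python
import numpy as np

def pca(D):
    """Return (Sigma_D, eigenvalues, eigenvectors) for a point set D
    (one row per point), eigenvalues sorted in decreasing order."""
    D = np.asarray(D, float)
    X = D - D.mean(axis=0)                 # center at the centroid
    sigma = X.T @ X / len(D)               # covariance matrix Sigma_D
    evals, evecs = np.linalg.eigh(sigma)   # eigh returns ascending order
    order = np.argsort(evals)[::-1]        # first eigenvector = highest variance
    return sigma, evals[order], evecs[:, order]
```

For points lying exactly on a line, the second eigenvalue is (numerically) zero and the first eigenvector points along the line.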
Solutions
If the points in D form a λ-dimensional hyper-plane, then this hyper-plane is spanned by the first λ eigenvectors.
The relevant subspace in which the points cluster is spanned by the remaining d − λ eigenvectors, collected in V̂_D.
The sum of the smallest d − λ eigenvalues, Σ_{i=λ+1}^{d} e_i, is minimal w.r.t. all possible basis transformations: the points cluster optimally in this subspace.
Model for a correlation cluster: the λ-dimensional hyper-plane accommodating the cluster C ⊆ R^d is defined by a system of d − λ equations for d variables and an affinity (e.g., the centroid x̄_C of the cluster):
V̂^T · x = V̂^T · x̄_C
The equation system is approximately fulfilled by all x ∈ C.
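The equation system of a correlation cluster can be extracted as follows; a sketch under the assumptions above (names mine):

```python
import numpy as np

def cluster_equations(C, lam):
    """For a cluster C on a lam-dimensional hyper-plane in R^d, return
    (V_hat, b) with V_hat holding the d - lam weakest eigenvectors, so
    that V_hat.T @ x ~= b = V_hat.T @ centroid for all x in C."""
    C = np.asarray(C, float)
    centroid = C.mean(axis=0)
    X = C - centroid
    evals, evecs = np.linalg.eigh(X.T @ X / len(C))  # ascending eigenvalues
    d = C.shape[1]
    V_hat = evecs[:, : d - lam]   # directions of smallest variance
    return V_hat, V_hat.T @ centroid
```

For points on the plane z = x + y in R^3 (λ = 2), this yields a single equation whose residual is zero for every cluster member.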
Solutions
Correlation clustering methods based on PCA:
- Integrate (local) PCA into existing clustering algorithms.
- Learn a distance measure that reflects the subspace of points and/or parts of clusters (typically a specialized Mahalanobis distance).
- Conquer the circular dependency of the two tasks by the so-called locality assumption: a local selection of points (e.g., the k-nearest neighbors of a potential cluster center) represents the hyper-plane of the corresponding cluster; applying PCA to this local selection yields the subspace of the corresponding cluster.
But what about the curse of dimensionality? (In high-dimensional spaces, k-nearest-neighbor selections are themselves affected by the vanishing distance contrast.)
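A sketch of the locality assumption in code; k and the variance threshold are illustrative choices of mine, not prescribed by the slides:

```python
import numpy as np

def local_subspace(data, center_idx, k, var_threshold=0.9):
    """Run PCA on the k-nearest neighbors of one point and keep the
    strong eigenvectors that explain var_threshold of the local variance."""
    data = np.asarray(data, float)
    dists = np.linalg.norm(data - data[center_idx], axis=1)
    knn = data[np.argsort(dists)[:k]]           # local selection of points
    X = knn - knn.mean(axis=0)
    evals, evecs = np.linalg.eigh(X.T @ X / k)
    evals, evecs = evals[::-1], evecs[:, ::-1]  # decreasing variance
    lam = int(np.searchsorted(np.cumsum(evals) / evals.sum(), var_threshold)) + 1
    return evecs[:, :lam]    # basis of the locally estimated hyper-plane
```

If the neighborhood really does lie on the cluster's hyper-plane, the returned basis spans that hyper-plane; if the neighborhood is contaminated by noise or by members of other clusters, the estimate degrades, which is exactly the weakness of the locality assumption.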
Solutions
Many methods rely on a local application of PCA to sets of potential cluster members, i.e., on the locality assumption; an alternative is random sampling.
How can we avoid the locality assumption / random sampling?
CASH: Clustering in Arbitrary Subspaces based on the Hough transform.
Solutions
Basic idea of CASH: transform each object into a so-called parameter space representing all possible subspaces accommodating this object (i.e., all hyper-planes through this object).
This parameter space is a continuum of all these subspaces; the subspaces are represented by a small number of parameters.
This transform is a generalization of the Hough transform (which is designed to detect linear structures in 2D images) to arbitrary dimensions.
Solutions
The transform: for each d-dimensional point p there is an infinite number of (d−1)-dimensional hyper-planes through p.
Each of these hyper-planes s is defined by (p, α_1, ..., α_{d−1}), where the angles α_1, ..., α_{d−1} determine the normal vector n_s of the hyper-plane s.
The function f_p(α_1, ..., α_{d−1}) = δ_s = ⟨p, n_s⟩ maps the angles onto the distance δ_s of the hyper-plane s to the origin.
The parameter space plots the graph of this function.
[Figure: points p_1, p_2, p_3 and a hyper-plane s with normal vector n_s and distance δ_s in the data space; the corresponding curves f_{p_1}, f_{p_2}, f_{p_3} and the point (α_s, δ_s) in the parameter space.]
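In the 2D special case this reduces to the classical Hough parametrization of lines; a small sketch (names mine):

```python
import numpy as np

def f_p(p, alphas):
    """delta = <p, n> for the lines through p with unit normal
    n = (cos(alpha), sin(alpha)): the point's curve in parameter space."""
    x, y = p
    return x * np.cos(alphas) + y * np.sin(alphas)

alphas = np.linspace(0.0, np.pi, 181)   # angle axis of the parameter space
curves = [f_p(p, alphas) for p in [(1, 1), (2, 2), (3, 3)]]
# The three points lie on the line y = x, so all three curves pass through
# delta = 0 at alpha = 3*pi/4 (index 135): a common intersection point.
```

This is exactly the property the next slides exploit: collinear points in the data space become curves with a common intersection in the parameter space.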
Solutions
Properties of this transform:
- A point in the data space corresponds to a sinusoidal curve in the parameter space.
- A point in the parameter space corresponds to a hyper-plane in the data space.
- Points on a common hyper-plane in the data space (a cluster) correspond to sinusoidal curves intersecting in one point of the parameter space.
- An intersection of sinusoidal curves in the parameter space corresponds to a hyper-plane accommodating the corresponding points in the data space.
Solutions
Detecting clusters: determine all intersection points of at least m curves in the parameter space; each such intersection yields a (d−1)-dimensional cluster.
The exact solution (checking all pairwise intersections) is too costly, so heuristics are employed: a grid-based bisecting search finds cells containing at least m curves (dense regions in the grid correspond to clusters, e.g., C1 and C2).
Determining the curves that intersect a given cell is in O(d^3); the number of cells is in O(r^d), where r is the resolution of the grid, and a high value of r is necessary.
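A toy 2D version of the grid idea (resolution and parameters are illustrative, and a fixed accumulator grid stands in for CASH's bisecting search):

```python
import numpy as np

def dense_cells(points, r, m, n_alpha=360):
    """Rasterize each point's parameter-space curve
    delta(alpha) = x*cos(alpha) + y*sin(alpha) into an r x r accumulator
    grid; cells crossed by at least m curves indicate cluster hyper-planes."""
    alphas = np.linspace(0.0, np.pi, n_alpha, endpoint=False)
    pts = np.asarray(points, float)
    span = np.abs(pts).sum(axis=1).max() + 1e-9    # |delta| <= |x| + |y|
    acc = np.zeros((r, r), dtype=int)
    cols = (np.arange(n_alpha) * r) // n_alpha
    for x, y in pts:
        deltas = x * np.cos(alphas) + y * np.sin(alphas)
        rows = np.clip(((deltas + span) / (2 * span) * r).astype(int), 0, r - 1)
        for cell in set(zip(rows.tolist(), cols.tolist())):
            acc[cell] += 1                         # one vote per curve per cell
    return acc

acc = dense_cells([(1, 1), (2, 2), (3, 3), (5, 0)], r=16, m=3)
# The three collinear points share at least one cell, so acc.max() >= 3.
```

The O(r^d) cell count mentioned above is visible here: the accumulator already has r^2 cells in 2D, and a full grid becomes infeasible in high dimensions, which motivates the bisecting search.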
Solutions
Complexity (c = the number of clusters found; not an input parameter!):
- Bisecting search: O(s · c).
- Determination of the curves in a cell: O(n · d^3).
- Overall: O(s · c · n · d^3) (note that algorithms for PCA are also in O(d^3)).
The method is robust against noise.
Outline
- Why Bother?
- Solutions
- Perspectives
- Open Issues
Perspectives / Open Issues
What is next? There is still a lot to take care of at the accuracy end!
Big data?
- Examining the results (are they significant? how do we evaluate them?).
- Novel heuristics with new assumptions (what are their limitations?).
- Other patterns, such as outlier detection.
- Variety: non-linear correlations, non-numeric data, relational data, ...
- Velocity: dynamic data, data streams, ...
- Volume: scalability (size/dimensionality), approximate solutions, ...
The End
Edmonton, 15.02.2011