Clustering and mapper



June 17th, 2014

Overview

Goal of talk: explain Mapper, which is the most widely used and most successful TDA technique. (It is at the core of Ayasdi, the TDA company founded by Gunnar Carlsson.)

Basic idea: perform clustering at different scales and track how clusters change as the scale varies.

Motivation:
1. Coarser than manifold learning, but still works in nonlinear situations.
2. Still retains meaningful geometric information about the data set.
3. Efficiently computable (and so can be applied to very large data sets).

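Morse theory

Basic idea: describe the topology of a smooth manifold $M$ using level sets of a suitable function $h : M \to \mathbb{R}$. We recover $M$ by looking at the sublevel sets $h^{-1}((-\infty, t])$ as $t$ scans over the range of $h$. The topology of these sublevel sets changes only when $t$ passes a critical value of $h$, that is, at critical points of $h$.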
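As a small illustration of topology changing at critical points, here is a minimal sketch, assuming a circle sampled at n points with the height function h = y. The function name `sublevel_betti` and the sampling scheme are hypothetical choices for this example, not part of the talk. It reports the number of connected components (b0) and independent loops (b1) of the sublevel set {h <= t}.

```python
import math

def sublevel_betti(n, t):
    """Return (b0, b1) of the sublevel set {h <= t} on a circle sampled
    at n >= 3 points, where h is the height (y-coordinate)."""
    h = [math.sin(2 * math.pi * k / n) for k in range(n)]
    verts = [k for k in range(n) if h[k] <= t]
    inside = set(verts)
    # Keep an edge between consecutive samples only if both endpoints are in the set.
    edges = [(k, (k + 1) % n) for k in range(n)
             if k in inside and (k + 1) % n in inside]

    # Union-find to count connected components.
    parent = {k: k for k in verts}

    def find(k):
        while parent[k] != k:
            parent[k] = parent[parent[k]]
            k = parent[k]
        return k

    for a, b in edges:
        parent[find(a)] = find(b)

    b0 = len({find(k) for k in verts})
    # Euler characteristic of a graph: V - E = b0 - b1.
    b1 = len(edges) - len(verts) + b0
    return b0, b1

below_min = sublevel_betti(40, -1.5)   # nothing has appeared yet
mid_level = sublevel_betti(40, 0.05)   # a single arc, born at the minimum
above_max = sublevel_betti(40, 1.5)    # the loop closes at the maximum
```

In this sketch the sublevel set changes only twice: a component appears when t passes the minimum of h, and a loop appears when t passes the maximum.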

Reeb graphs

Convenient simplification:
1. For each $t \in \mathbb{R}$, contract each connected component of the level set $f^{-1}(t)$ to a point.
2. The resulting structure is a graph.

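Mapper

The Mapper algorithm is a generalization of this procedure. [Singh-Memoli-Carlsson]

Assume we are given a data set $X$.
1. Choose a filter function $f : X \to \mathbb{R}$.
2. Choose a cover $\{U_\alpha\}$ of the range of $f$ (for example, overlapping intervals), so that the inverse images $f^{-1}(U_\alpha)$ cover $X$.
3. Cluster each inverse image $f^{-1}(U_\alpha)$.
4. Form a graph where:
   1. Clusters are vertices.
   2. An edge connects two clusters $C \subseteq f^{-1}(U_\alpha)$ and $C' \subseteq f^{-1}(U_{\alpha'})$ if both $U_\alpha \cap U_{\alpha'} \neq \emptyset$ and $C \cap C' \neq \emptyset$.
5. Color vertices according to the average value of $f$ in the cluster.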
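The steps of the algorithm can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the reference implementation: the cover consists of equal-length overlapping intervals, the clusterer is single linkage at a fixed distance threshold `eps`, and every name (`mapper`, `single_linkage`, `n_intervals`, `overlap`, `eps`) is hypothetical.

```python
import math
from itertools import combinations

def single_linkage(points, idxs, eps):
    """Cluster the points indexed by idxs: connected components of the
    graph joining points at distance <= eps (an assumed clusterer)."""
    idxs = list(idxs)
    parent = {i: i for i in idxs}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(idxs, 2):
        if math.dist(points[i], points[j]) <= eps:
            parent[find(i)] = find(j)
    clusters = {}
    for i in idxs:
        clusters.setdefault(find(i), set()).add(i)
    return list(clusters.values())

def mapper(points, f, n_intervals=4, overlap=0.25, eps=0.3):
    values = [f(p) for p in points]                      # step 1: filter
    lo, hi = min(values), max(values)
    length = (hi - lo) / n_intervals
    vertices = []                                        # (interval index, cluster)
    for a in range(n_intervals):                         # step 2: overlapping cover
        left = lo + a * length - overlap * length
        right = lo + (a + 1) * length + overlap * length
        preimage = [i for i, v in enumerate(values) if left <= v <= right]
        for c in single_linkage(points, preimage, eps):  # step 3: cluster preimage
            vertices.append((a, c))
    edges = set()                                        # step 4: shared points
    for (u, (a, cu)), (v, (b, cv)) in combinations(enumerate(vertices), 2):
        # A shared point has its filter value in both intervals,
        # so the two cover sets necessarily intersect as well.
        if a != b and cu & cv:
            edges.add((u, v))
    colors = [sum(values[i] for i in c) / len(c) for _, c in vertices]  # step 5
    return vertices, edges, colors

# Usage: 40 points on the unit circle, filtered by height.
circle = [(math.cos(2 * math.pi * k / 40), math.sin(2 * math.pi * k / 40))
          for k in range(40)]
vertices, edges, colors = mapper(circle, f=lambda p: p[1])
```

With these parameters the output graph has as many edges as vertices and forms a single cycle, so it recovers the loop in the data, much as the Reeb graph of the height function on a circle would.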

Filter functions

Clearly, the choice of filter function is essential. Typical choices:
- Some kind of density measure.
- A score measuring the difference (distance) from some baseline.
- An eccentricity measure.
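The three kinds of filter listed above could be computed along the following lines. This is a sketch under assumptions: the Gaussian kernel, the `bandwidth` parameter, the use of mean distance as eccentricity, and all function names are illustrative choices that the talk does not specify.

```python
import math

def pairwise(points):
    """Full matrix of Euclidean distances between points."""
    return [[math.dist(p, q) for q in points] for p in points]

def density_filter(points, bandwidth=1.0):
    """Gaussian kernel density estimate at each point (unnormalised)."""
    D = pairwise(points)
    return [sum(math.exp(-(d * d) / (2 * bandwidth ** 2)) for d in row) / len(points)
            for row in D]

def eccentricity_filter(points):
    """Mean distance to the points of the data set; large for points far from the bulk."""
    D = pairwise(points)
    return [sum(row) / len(points) for row in D]

def baseline_filter(points, baseline):
    """Distance of each point from a chosen baseline point."""
    return [math.dist(p, baseline) for p in points]

# Three nearby points and one outlier.
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
dens = density_filter(pts)
ecc = eccentricity_filter(pts)
base = baseline_filter(pts, (0.0, 0.0))
```

On this toy data the outlier gets the lowest density and the highest eccentricity, which is the sort of contrast a filter is meant to expose.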

Breast cancer example

A highly successful example of real data analysis. [Nicolau, Carlsson, Levine]
- Working with vectors of gene expression data.
- Distance metric is correlation.
- Filter is a measure of (unsigned) deviation of expression from normal tissue.
- Results identify a previously unknown c-myb+ group of tumors, which are very different from normal tissue but have very high survival rates.
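To make the two ingredients concrete, here is a hedged sketch of a correlation-based distance and a deviation-from-normal filter. Both definitions are assumptions made for illustration: the slide does not say which correlation convention or which deviation measure the study used, so `correlation_distance` (one minus the Pearson correlation) and `deviation_filter` (Euclidean norm of the difference from a mean normal profile) are hypothetical stand-ins.

```python
import math

def correlation_distance(x, y):
    """1 - Pearson correlation of two expression vectors (assumed convention).
    Undefined for constant vectors, whose standard deviation is zero."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)

def deviation_filter(sample, normal_mean):
    """Unsigned deviation of a sample from a mean normal-tissue profile
    (assumed here to be the Euclidean norm of the difference)."""
    return math.sqrt(sum((s - m) ** 2 for s, m in zip(sample, normal_mean)))

same_shape = correlation_distance([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])   # close to 0
opposite = correlation_distance([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])     # close to 2
deviation = deviation_filter([3.0, 4.0], [0.0, 0.0])
```

Under this convention two profiles with the same shape are at distance near 0 regardless of scale, while the filter still records how far a sample sits from normal tissue in absolute terms.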

NBA example

A clever application to sports analytics. [Alagappan]
- Data set consists of vectors of player statistics (points scored, rebounds, etc.).
- Distance metric is Euclidean.
- Filter is points per minute.
- Results identify many new positions beyond the traditional five.

[Figure: Mapper graph of NBA players, with regions labeled Role Player, One-of-a-Kind, Scoring Rebounder, Role-Playing Ball-Handler, 3PT Rebounder, Shooting Ball-Handler, All-NBA 1st Team, All-NBA 2nd Team, Offensive Ball-Handler, Combo Ball-Handler, Defensive Ball-Handler, Scoring Paint Protector, Paint Protector.]

Summary

Claim: Mapper can be successfully applied to the analysis of geometric structures in large data sets from a wide variety of domains.

Key idea: cluster across scales, and represent the relationships between clusters as the scale varies.

The choice of filter function(s) is critical to successful application.
