Generalized association plots (GAP): Dimension free information visualization environment for multivariate data structure



Similar documents
Data, Measurements, Features

Rank one SVD: un algorithm pour la visualisation d une matrice non négative

Data Mining: Algorithms and Applications Matrix Math Review

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

DATA SCIENCE Workshop November 12-13, 2015

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu

Exploratory Data Analysis with MATLAB

Chapter ML:XI (continued)

Service courses for graduate students in degree programs other than the MS or PhD programs in Biostatistics.

Singular Spectrum Analysis with Rssa

CONTENTS PREFACE 1 INTRODUCTION 1 2 DATA VISUALIZATION 19

The Value of Visualization 2

MATH 423 Linear Algebra II Lecture 38: Generalized eigenvectors. Jordan canonical form (continued).

There are a number of different methods that can be used to carry out a cluster analysis; these methods can be classified as follows:

Visual Data Mining. Motivation. Why Visual Data Mining. Integration of visualization and data mining : Chidroop Madhavarapu CSE 591:Visual Analytics

Principal Component Analysis

5: Magnitude 6: Convert to Polar 7: Convert to Rectangular

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS. + + x 2. x n. a 11 a 12 a 1n b 1 a 21 a 22 a 2n b 2 a 31 a 32 a 3n b 3. a m1 a m2 a mn b m

Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP

Learning outcomes. Knowledge and understanding. Competence and skills

Clustering & Visualization

How to report the percentage of explained common variance in exploratory factor analysis

STATISTICA Formula Guide: Logistic Regression. Table of Contents

CHAPTER 4 EXAMPLES: EXPLORATORY FACTOR ANALYSIS

Cluster Analysis. Isabel M. Rodrigues. Lisboa, Instituto Superior Técnico

Performance Metrics for Graph Mining Tasks

Math 1050 Khan Academy Extra Credit Algebra Assignment

CLUSTER ANALYSIS FOR SEGMENTATION

Principle Component Analysis and Partial Least Squares: Two Dimension Reduction Techniques for Regression

Partial Least Squares (PLS) Regression.

Multivariate Normal Distribution

Lecture 9: Introduction to Pattern Analysis

Current Standard: Mathematical Concepts and Applications Shape, Space, and Measurement- Primary

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012

Categorical Data Visualization and Clustering Using Subjective Factors

Solving Systems of Linear Equations

MULTIPLE-OBJECTIVE DECISION MAKING TECHNIQUE Analytical Hierarchy Process

A MULTIVARIATE OUTLIER DETECTION METHOD

Nonlinear Iterative Partial Least Squares Method

1 Example of Time Series Analysis by SSA 1

Introduction to Principal Components and FactorAnalysis

Using multiple models: Bagging, Boosting, Ensembles, Forests

3.1 State Space Models

Introduction to Data Analysis in Hierarchical Linear Models

Introduction to Clustering

Factor Analysis. Chapter 420. Introduction

Overview. Essential Questions. Precalculus, Quarter 4, Unit 4.5 Build Arithmetic and Geometric Sequences and Series

Data Exploration Data Visualization

Estimation of Fractal Dimension: Numerical Experiments and Software

Chapter 7 Factor Analysis SPSS

Regression Clustering

Performing a data mining tool evaluation

Environmental Remote Sensing GEOG 2021

General Framework for an Iterative Solution of Ax b. Jacobi s Method

QUALITY ENGINEERING PROGRAM

Decision Support System Methodology Using a Visual Approach for Cluster Analysis Problems

Data Mining Clustering (2) Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining

Elementary Matrices and The LU Factorization

Social Media Mining. Data Mining Essentials

CHAPTER 9 EXAMPLES: MULTILEVEL MODELING WITH COMPLEX SURVEY DATA

Introduction to Matrix Algebra

Manifold Learning Examples PCA, LLE and ISOMAP

Part 2: Community Detection

Vectors 2. The METRIC Project, Imperial College. Imperial College of Science Technology and Medicine, 1996.

AN EXPERT SYSTEM TO ANALYZE HOMOGENEITY IN FUEL ELEMENT PLATES FOR RESEARCH REACTORS

Methodology for Emulating Self Organizing Maps for Visualization of Large Datasets

Grade 6 Mathematics Performance Level Descriptors

Goodness of fit assessment of item response theory models

An example. Visualization? An example. Scientific Visualization. This talk. Information Visualization & Visual Analytics. 30 items, 30 x 3 values

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Multivariate Analysis of Ecological Data

Multivariate Analysis of Variance (MANOVA): I. Theory

Data Mining 5. Cluster Analysis

Master of Arts in Mathematics

CHAPTER 3 EXAMPLES: REGRESSION AND PATH ANALYSIS

Bernice E. Rogowitz and Holly E. Rushmeier IBM TJ Watson Research Center, P.O. Box 704, Yorktown Heights, NY USA

Linear Threshold Units

NEW YORK STATE TEACHER CERTIFICATION EXAMINATIONS

Data exploration with Microsoft Excel: analysing more than one variable

Principles of Dat Da a t Mining Pham Tho Hoan hoanpt@hnue.edu.v hoanpt@hnue.edu. n

Biomarker Discovery and Data Visualization Tool for Ovarian Cancer Screening

Common factor analysis

Computer program review

Factorization Theorems

Integration of Cluster Analysis and Visualization Techniques for Visual Data Analysis

Big Data: Rethinking Text Visualization

Similar matrices and Jordan form

The PageRank Citation Ranking: Bring Order to the Web

D-optimal plans in observational studies

Subspace Analysis and Optimization for AAM Based Face Alignment

Variance Reduction. Pricing American Options. Monte Carlo Option Pricing. Delta and Common Random Numbers

December 4, 2013 MATH 171 BASIC LINEAR ALGEBRA B. KITCHENS

Transcription:

Generalized association plots (GAP): Dimension free information visualization environment for multivariate data structure Chun-houh Chen, hun-chuan Chang, Yueh-Yun Chi, and Chih-Wen Ou-Young Academia inica, Taiwan Abstract: GAP is a dimension free visualization environment for multivariate data structure. Given a multivariate data set, GAP first compute the proximity matrices for variables as well as for subjects. Proper seriations (permutations) are searched for rearrange these two matrices to satisfy certain properties. Double sorted raw data matrix together with two sorted proximity matrices are then projected through appropriate color spectrums to create matrix maps. These three maps should be cross-examined to identify important information structure embedded in the raw data matrix. GAP has unique seriation algorithm for identifying permutations of matrices with better global sense and clustering algorithm for finding multiple clustering patterns co-exist in proximity matrices. Extensions have been added to GAP for analyzing more complicated formats such as categorical, longitudinal, and multiple-set structure. CONVERGING EQUENCE OF CORRELATION MATRICE The computation kernel of GAP is a converging sequence of iteratively generated correlation matrices, Figure. This sequence of correlation matrices is used to dynamically identify seriation and clustering for a given proximity matrix. D R () R () R (2) 237 55 934 24 R (3) R (4) R (5) R (6) 33 383 58 76 R (7) R (8 ) R (9) R ( ) 29 2437 2499 25 DL DL ρ ( D) = 5 ρ ( ) = 49 ρ ( ) = 49 ρ (2 ) = 4 DL DL AH2 AH3 DL AH DL6 TH NE NA BE NC NB4 NA7 ND NA5 NB2 NB ρ (3 ) = 2 ρ (4 ) = 7 ρ (5 ) = 3 ρ (6 ) = 2 DL6 AH3 AH2AHDL TH NE BE NC NA NA7 NB NB2 NB4 NA5 ND DL DL DL DL6 AH2 AH3 DL AH DL DL6 AH2 AH3 AHDL BE NE TH NC NB2 NB NB4 NA7 NA NA5 ND TH NE BE NC NA NA5 NA7 NB NB2 NB4 ND DL DL DL6 AH3 AH2 DL DL TH NE BE AH DL DL6 AH2 AH3 AH DL NC NB2 NA NB NA5 NA7 NB4 ND TH NE BE NA NA5 NA7 NB NB2 NC ND NB4 DL DL DL DL DL6 AH3 AH2 AH DL DL6 AH3 AH2AH DL TH NE BE NC NB2 NA NB NA5 NA7 NB4 ND TH NE BE NA NA5 NA7 NB NB2 NB4 NC ND ρ (7 ) = 2 ρ (8) = 2 ρ (9 ) = 2 ρ ( ) = DL DL AH2 AH3 DL6 AH DL TH NA NA5 NA7 NB NB2 BE NB4 ND NC NE DL6 DL DL AH AH2 AH3 DL BE TH NA NA5 NA7 NB NB2 NB4 NC ND NE AH AH2 AH3 DL6 DL DL AH AH2 AH3 DL6 DL DL NA NA TH NA5 NA7 NB NB2 NB4 NC ND NE TH NA5 NA7 NB NB2 NB4 NC ND NE BE BE AH AH2 AH3 DL6 DL DL BE TH NA NA5 NA7 NB NB2 NB4 NC ND NE (a) (b) Figure. Converging equence of Iteratively Generated Correlation Matrices. (a). orted Color Maps for the Correlation Matrices; (b). First Two Eigen-Vectors for the Correlation Matrices. 2 BAIC PRINCIPLE OF GAP There are three major pieces of information contained in any multivariate data set with n subjects and p variables:. the linkage amongst n subject points in the p-dimensional space; 2. the

linkage between p variable vectors in the n-dimensional space; and 3. the interaction linkage between the sets of subjects and variables. We use a psychosis disorder data with 95 patients on 5 symptoms to illustrate the framework of a standard GAP analysis, in Figure 2. GAP integrates the following four major steps to extract and summarize information embedded in a multivariate data set. 2. Raw Data and Proximity Matrix Maps with uitable Color Projection The raw data matrix is denoted as D a in Figure 2a. A gray spectrum is applied to project numerical numbers into gray dots with different intensities. The correlation matrix is calculated as the proximity matrix V a for the 5 symptoms. For the 95 patients, also the correlation matrix is used as the proximity matrix. The diverging blue-red color scheme is used to represent the bi-directional property of the correlation coefficients. (a) (b) V a V b D a a D b b (c) (d) V c V d v v2 v3 v4 v5 D c c v v2 v3 v4 v5 D d d Figure 2. tandard GAP Procedure.

(a). Original Raw Data and Proximity Matrices Maps; (b). orted Data Matrix Map and Proximity Matrix Maps; (c). Clustering for Variables and ubjects; (d). ufficient Graph with Three Multivariate Linkages. 2.2 The orted Matrix Maps with the Principle of Geometry The next step is to find proper seriations for V a and a respectively. eriation is a data analytic tool for finding a permutation or ordering of a set of objects using a data matrix. The seriations are then applied to arrange the two correlation matrices V a and a into V b and b in Figure 2b. The same seriations are also used to reshape the raw data matrix D a into D b. The difference between V a and V b is not much since V a is already grouped by the symptom tables. However, there is a dramatic change from a to b since the patients are admitted in a random order. There is a clear latent structure in D b. A band of dark gray dots moves from the upper right corner to the lower left corner. 2.3 Partitioned Matrix Maps with near tationary Iterations After the seriations have been applied to the matrix maps, the potential groups of variable and clusters of subject are identified using dynamic clustering procedures (to be discussed in ection 2) as displayed in Figure 2c. The sorted proximity matrix maps for variables and subjects are then partitioned into squares on the diagonal for within-group mean correlation structure and rectangles off diagonal for between-group mean correlation relationship, Figure 2c. The double sorted raw data matrix map is also partitioned into rectangles to represent the mean interaction effect between each subject-cluster on every variable-group. 2.4 The ufficient Graph with Three Multivariate Linkages In order to extract and summarize the visualized information in Figure 2b, we can further convert these matrix maps into a simplified version. Illustrated in Figure 2d are the mean-structure maps of the three matrices for raw data and proximities. These three mosaic-displays in Figure 2d contain the principal structural information embedded in the original data set. The mean function in Figure 2d can be replaced with any statistic for displaying desired information structure. We shall name these three mosaic displays the sufficient graph for a multivariate data set. The sufficient graph is then used to answer the three multivariate problems raised by the psychiatrist. Fifty symptoms are divided into five symptom-groups with different within and between group structure.

Ninety-five patients are also grouped into three clusters. The general behavior of these three patient-clusters on each of the five symptom-groups can now be easily comprehended. 3 EXTENION OF GAP ection 2 introduced a typical GAP procedure for a data set with continuous variables measured at a single time point. We have developed several modules of GAP for handling many possible variants of data structure. 3. Categorical GAP Figure 3 displays four matrix maps with different data type. Categorical GAP with dual scaling can be used for information visualization for binary and nominal data matrices. Continuous Ordinal Binary Categorical 2 3 4 5 6 7 - ACGT Figure 3. Information Visualization for Data Matrices with Different Data Types. 3.2 Longitudinal GAP It is possible that data profile may be measured at multiple time points as illustrated in Figure 4. When the data profile is observed at multiple time points, we have an extra piece of linkage for subjects and variables over time. This extra linkage makes the visualization of longitudinal multivariate data profiles a much harder statistical challenge than

DL6 DL DL AH3 AH AH2 DL ND NA7 NA5 NA NB4 NB NB2 NC NE BE TH BPR7 BPR3 BPR4 BPR4 BPR3 BPR6 BPR9 BPR6 BPR5 BPR BPR5 BPR BPR BPR8 DL6 DL DL AH3 AH AH2 DL ND NA7 NA5 NA NB4 NB NB2 NC NE BE TH BPR7 BPR3 BPR4 BPR4 BPR3 BPR6 BPR9 BPR6 BPR5 BPR2 BPR BPR2 BPR5 BPR BPR BPR8 the single profile GAP procedure. Longitudinal GAP is created for visualizing data profiles with longitudinal structure. t t2 t3 Figure 4. Composed Color Profile with eriation for Data Matrices at Multiple Time Points 3.3 Canonical GAP Given two data sets with different sets of variables for the same subjects, we would like to explore the relationship with Canonical GAP, Figure 5. Corr(X) Corr(X,Y) Corr((XY)') Corr(Y,X) BPR2 BPR2 Corr(Y) Corr((XY)') X Y Figure 5. Information Visualization for Comparing Two ets of Variables with Canonical GAP.

3.4 Cartograph GAP When each subject in a multivariate categorical data set is affiliated to a district in a map, Cartography GAP extents the power of Categorical GAP to systematically identify suitable color scheme for coloring districts in the map, Figure 6. x _ ŽØ ÚŽ s À x š- q x n x _ø yƒðø ÆÁŽÈø s Àø ]ƽø x ø 5 4.5 4 3 2 Û ß s \ßÐ Û ß s \ ßÐ š ²ø nßîø Ž Lø š- qø x nø ŽØø à Fø x Fø ž ø ºÍ Úø ø s øø Figure 6. Cartography GAP for Taiwan Presidential Election 2.