On Some Mathematics for Visualizing High Dimensional Data




Edward J. Wegman and Jeffrey L. Solka
Center for Computational Statistics
George Mason University
Fairfax, VA 22030

This paper is dedicated to Professor C. R. Rao on the occasion of his 80th birthday.

Abstract: The analysis of high-dimensional data offers a great challenge to the analyst because human intuition about the geometry of high dimensions fails. We have found that a combination of three basic techniques proves to be extraordinarily effective for visualizing large, high-dimensional data sets. Two important methods for visualizing high-dimensional data involve the parallel coordinate system and the grand tour. Another technique, which we have dubbed saturation brushing, is the third method. The parallel coordinate system involves methods in high-dimensional Euclidean geometry, projective geometry, and graph theory, while the grand tour involves high-dimensional space-filling curves, differential geometry, and fractal geometry. This paper describes a synthesis of these techniques into an approach that helps build the intuition of the analyst. The emphasis in this paper is on the underlying mathematics.

1. Introduction

Seeing objects in high-dimensional spaces is often a fantasy of and a wish for many mathematicians. Often the revered mathematicians of the past were reputed to be able to see high-dimensional objects in their minds, which gave them insight beyond that of more ordinary mathematicians. Achieving such a feat has intrigued the present authors since we were mathematical juveniles. The combination of two techniques, known respectively as parallel coordinates and the $d$-dimensional grand tour, allows us to actually gain visual insight into hyperspace, much like those revered mathematical geniuses of the past. A third method, known as saturation brushing, allows for visualization of relatively massive data sets.
We have used these techniques in conjunction with each other for the analysis of data sets containing as many as 250,000 observations in as high as 18-dimensional space. These techniques have been combined in an evolutionary series of software packages that one of us (EJW) has helped to design over the past decade or so. Our first effort, dating from circa 1988, was a DOS program known as Mason Hypergraphics. This was followed around 1992 by a UNIX program known as ExplorN, and most recently in 2000 by a Windows 95/98/NT program known as Crystal Vision.

We have written extensively about aspects of these ideas. See for example Wegman and Bolorforoush (1988), Wegman (1990), Miller and Wegman (1991), Wegman (1991), Wegman and Carr (1993), Wegman and Shen (1993), Wegman and Luo (1997), Solka, Wegman, Rogers, Poston (1997), Wegman, Poston and Solka (1998), Solka, Wegman, Reid and Poston (1998), and Wilhelm, Wegman and Symanzik (1999). Many of our ideas have appeared only in proceedings papers and technical reports, or have never been published at all. This paper is not intended as a general review of all high-dimensional data visualization methods, but as a synthesis of our approaches to visualizing high-dimensional data. Wegman, Carr and Luo (1993) provides a convenient introduction to many other high-dimensional data visualization techniques.

The present paper is intended to focus, not on the applications nor on the methodology itself, but rather on the underlying mathematics. The intent is to consolidate the mathematical developments that have been scattered over a wide variety of literature into one convenient paper. As indicated above, this paper is not an attempt to review all approaches to high-dimensional visualization, but to synthesize specific techniques we have found useful. A companion paper focusing on applications of these methodologies to achieve a number of statistical tasks is being planned. However, the combination of mathematical tools, mainly geometric tools, is itself interesting, and in our view it is worthwhile to document these. It is especially of interest in a paper in honor of Professor C. R. Rao, who has contributed so much to make geometric methods an integral part of statistical analysis.

2. Parallel Coordinates

Parallel coordinate displays are a tool for visualizing multivariate data. They were introduced into the mathematics literature by Inselberg (1985) and suggested as a tool for high-dimensional data analysis by Wegman (1990). Since then many additional refinements have been suggested. However, the basic parallel coordinate plot remains an intriguing mathematical object. Traditional Cartesian coordinates suffer from the fact that we live in three spatial dimensions.
Beyond three dimensions, all manner of artifices have been invented to represent higher-dimensional data, using time, color, glyphs, and so on. Regrettably these do not treat all variables in the same way and, hence, make comparing variables with different representations extremely difficult. The parallel coordinate plot device is based on the observation that the problems associated with Cartesian plotting arise because of the orthogonality constraint. Because this is the case, in parallel coordinates we simply give up orthogonality and draw the axes as parallel. Any number of parallel axes can be drawn in a plane [1]. If the data is $d$-dimensional, simply draw $d$ parallel axes. A data vector $x = (x_1, x_2, \ldots, x_d)$ is drawn by locating $x_i$ on the $i$-th coordinate axis and simply joining $x_i$ to $x_{i+1}$ by a line segment for $i = 1, \ldots, d-1$. There is in principle no upper bound on the dimension of the data that can be represented, although there are practical limits related to the resolution available on a computer screen and to the human eye. See Wegman (1990) for much more detail on interpretation and usage of these displays.

[1] Suggestions have also been made to replace the plane with a cylinder. In an attempt to overcome the pairwise adjacencies problem described in section 2.3, parallel axes are straight lines drawn on the cylinder. In an attempt to deal with circular data, the parallel axes are actually drawn as parallel circles on the cylinder.
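The construction just described can be sketched in a few lines of Python. This is only a minimal illustration, not code from any of the software packages mentioned above: the function names, the unit spacing between axes, the choice of drawing axis $i$ as the horizontal line at height $i - 1$, and the sample observation are all assumptions made for the example.

```python
def polyline_vertices(x, spacing=1.0):
    """Locate each component of x on its own parallel axis.

    Assumed layout: axis i (0-based) is the horizontal line at height
    i * spacing, and the value x[i] is the position along that axis.
    """
    return [(float(value), i * spacing) for i, value in enumerate(x)]


def polyline_segments(x, spacing=1.0):
    """Join the point on axis i to the point on axis i + 1, for every adjacent pair."""
    vertices = polyline_vertices(x, spacing)
    return list(zip(vertices[:-1], vertices[1:]))


# A made-up four-dimensional observation, used only for illustration.
observation = [2.0, -1.0, 0.5, 3.0]
segments = polyline_segments(observation)
print(len(segments))  # a d-dimensional vector yields d - 1 segments
```

Each observation becomes one polyline, so a data set of $n$ observations is drawn as $n$ polylines over the same $d$ axes.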

The power of and motivation for using parallel coordinate displays derives from the underlying connection with projective geometry. Axiomatic synthetic projective geometry is motivated by the asymmetry in Euclidean geometry induced by the parallel lines axiom. That is, most pairs of lines in a two-plane meet in a point, except if the lines are parallel. However, all pairs of points determine a line. In synthetic projective geometry, the parallel lines axiom is replaced with the axiom: every pair of lines meets in a point. Together with the other axioms of projective geometry, this axiom has the effect that any statement that is true about lines and points is also true when the words "line" and "point" are interchanged. This notion of duality between points and lines induces all types of additional dualities in a projective plane. Nondegenerate mappings between projective planes have the property of preserving certain geometric structures. In the case of transformations from Cartesian coordinate geometry to parallel coordinate geometry, this implies that structures in Cartesian coordinates have dual structures in parallel coordinates. The implication is that not only does a parallel coordinate display have the ability to uniquely map high-dimensional points into a planar diagram, but also that the parallel coordinate display can be interpreted geometrically.

It is worth making a couple of observations. First, synthetic projective geometry is an abstract mathematical construct. One model for synthetic projective geometry is the ordinary Euclidean plane supplemented by so-called ideal points. The ideal points are in one-to-one correspondence with the slopes of ordinary lines. The idea of this model is that "parallel lines meet at infinity." Hence all parallel lines having the same slope will meet at the same ideal point. The set of ideal points forms the so-called ideal line. In this model, the projective plane thus has "regular" points, i.e. those from the Euclidean plane, and "ideal" points, which we have just described.
Another model for the projective plane is the "cross cap," which is a hemisphere with opposite points on the equator topologically identified. Neither of these models for synthetic projective geometry is entirely satisfactory, since they make distinctions among certain types of points. However, the model which regards the projective plane as an extended Euclidean plane is extremely useful for data visualization purposes. Just as synthetic Euclidean geometry can be tied to the Cartesian coordinate system to form an analytic geometry, synthetic projective geometry can be tied to a coordinate system known as natural homogeneous coordinates to form an analytic projective geometry. We discuss these coordinates in section 2.2. For more details on projective geometry, see Wegman (2000).

Thus the intriguing aspect of parallel coordinate plots is the mathematical duality between Cartesian plots and parallel coordinate plots. For the purposes of the present mathematical discussion, we focus on just two dimensions. In this context we have just suggested that a point in an ordinary Cartesian plot is represented by a line in a parallel coordinate plot. Indeed, if we conceive of both the Cartesian two-dimensional plot and the parallel coordinate plot as representing two projective two-planes, we derive a number of interesting dualities.

2.1 Parallel Coordinate Geometry.

The parallel coordinate representation enjoys some elegant duality properties with the usual Cartesian orthogonal coordinate representation. Consider a line $L$ in the Cartesian coordinate plane given by $L: y = mx + b$ and consider two points lying on that line, say $(a, ma + b)$ and $(c, mc + b)$. For simplicity of computation we consider the $xy$ Cartesian axes mapped into the $xy$ parallel axes as described in Figure 2.1. We superimpose Cartesian coordinate axes $tu$ on the $xy$ parallel axes so that the $y$ parallel axis has the equation $u = 1$. The point $(a, ma + b)$ in the $xy$ Cartesian system maps into the line joining $(a, 0)$ to $(ma + b, 1)$ in the $tu$ coordinate axes. Similarly, $(c, mc + b)$ maps into the line joining $(c, 0)$ to $(mc + b, 1)$. It is a straightforward computation to show that these two lines intersect at a point (in the $tu$ plane) given by $\bar{L}: (b(1-m)^{-1}, (1-m)^{-1})$. Notice that this point in the parallel coordinate plot depends only on $m$ and $b$, the parameters of the original line in the Cartesian plot. Thus $\bar{L}$ is the dual of $L$, and we have the interesting duality result that points in Cartesian coordinates map into lines in parallel coordinates while lines in Cartesian coordinates map into points in parallel coordinates.

Figure 2.1 Illustrating the duality between points and lines in Cartesian and parallel coordinate plots

For $0 < (1-m)^{-1} < 1$, $m$ is negative and the intersection occurs between the parallel coordinate axes. For $m = -1$, the intersection is exactly midway. A ready statistical interpretation can be given. For highly negatively correlated pairs, the dual line segments in parallel coordinates will tend to cross near a single point between the two parallel coordinate axes. The scale of one of the variables may be transformed in such a way that the intersection occurs midway between the two parallel coordinate axes, in which case the slope of the linear relationship is negative one.

In the case that $(1-m)^{-1} < 0$ or $(1-m)^{-1} > 1$, $m$ is positive and the intersection occurs external to the region between the two parallel axes. In the special case $m = 1$, this formulation breaks down. However, it is clear that the point pairs are $(a, a + b)$ and $(c, c + b)$. The dual lines to these points are the lines in parallel coordinate space with slope $b^{-1}$ and intercepts $-ab^{-1}$ and $-cb^{-1}$ respectively. Thus the duals of these points in parallel coordinate space are parallel lines with slope $b^{-1}$. We thus append the ideal points to the parallel coordinate plane to obtain a projective plane. These parallel lines intersect at the ideal point in direction $b^{-1}$. In the statistical setting, we have the following interpretation. For highly positively correlated data, we will tend to have lines not intersecting between the parallel coordinate axes. By suitable linear rescaling of one of the variables, the lines may be made approximately parallel in direction with slope $b^{-1}$. In this case the slope of the linear relationship between the rescaled variables is one.

2.2 Natural Homogeneous Coordinates and Conics.

The point-line, line-point duality seen in the transformation from Cartesian to parallel coordinates extends to conic sections. To see this, consider both the $xy$ plane and the $tu$ plane to be augmented by suitable ideal points so that we may regard both as projective planes. The representation of points in parallel coordinates is thus a transformation from one projective plane to another. Computation is simplified by an analytic representation. However, the usual coordinate pair, $(x, y)$, is not sufficient to represent ideal points. Thus, for purposes of analytic projective geometry, we represent points in the projective plane by triples $(x, y, z)$. As motivation for this representation, consider two distinct parallel lines having equations in the projective plane

$$ax + by + cz = 0 \quad \text{and} \quad ax + by + c'z = 0. \tag{2.1}$$

Simultaneous solution yields $(c - c')z = 0$, so that $z = 0$. Thus when $z = 0$, the triple $(x, y, z)$, i.e. $(x, y, 0)$, describes ideal points. The representation of points in the projective plane is by triples, $(x, y, z)$, which are called natural homogeneous coordinates. If $z = 1$, the resulting equation is $ax + by + c = 0$ and so $(x, y, 1)$ is the natural representation of a point $(x, y)$ in Cartesian coordinates lying on $ax + by + c = 0$. Notice that if $(px, py, p)$ is any multiple of $(x, y, 1)$ on $ax + by + c = 0$, we have

$$apx + bpy + cp = p(ax + by + c) = p \cdot 0 = 0. \tag{2.2}$$

Thus the triple $(px, py, p)$ equally well represents the Cartesian point $(x, y)$ lying on $ax + by + c = 0$, so that the representation of a point in natural homogeneous coordinates is not unique. However, if $p$ is not 1 or 0, we can simply rescale the natural homogeneous triple to have a 1 for the $z$-component and thus read off the Cartesian coordinates directly. If the $z$-component is zero, we know immediately that we have an ideal point. Notice that we could equally well consider the triples $(a, b, c)$ as natural homogeneous coordinates of a line. Thus, triples can represent either points or lines, reiterating the fundamental duality between points and lines in the projective plane.

Recall now that the line $L: y = mx + b$ mapped into the point $\bar{L}: (b(1-m)^{-1}, (1-m)^{-1})$ in parallel coordinates. In natural homogeneous coordinates, $L$ is represented by the triple $(m, -1, b)$ and the point $\bar{L}$ by the triple $(b(1-m)^{-1}, (1-m)^{-1}, 1)$ or equivalently by $(b, 1, 1-m)$. The latter yields the appropriate ideal point when $m = 1$. A straightforward computation shows for

$$A = \begin{pmatrix} 0 & 0 & -1 \\ 0 & -1 & -1 \\ 1 & 0 & 0 \end{pmatrix} \tag{2.3}$$

that $t = xA$, or $(b, 1, 1-m) = (m, -1, b)A$. Thus the transformation from lines in orthogonal coordinates to points in parallel coordinates is a particularly simple projective transformation with the rather nice computational property of having only adds and subtracts. Similarly, a point $(x_1, x_2, 1)$ expressed in natural homogeneous coordinates maps into the line represented by $(1, x_1 - x_2, -x_1)$ in natural homogeneous coordinates. Another straightforward computation shows that the linear transformation given by $t = xB$, or $(1, x_1 - x_2, -x_1) = (x_1, x_2, 1)B$, where

$$B = \begin{pmatrix} 0 & 1 & -1 \\ 0 & -1 & 0 \\ 1 & 0 & 0 \end{pmatrix} \tag{2.4}$$

describes the projective transformation of points in Cartesian coordinates to lines in parallel coordinates. Because these are nonsingular linear transformations, hence projective transformations, it follows from the elementary theory of projective geometry that conics are mapped into conics. This is straightforward to see since an elementary quadratic form in the original space, say $xCx' = 0$ where $x'$ denotes $x$ transpose, represents the general conic. Clearly then, since $t = xB$ with $B$ nonsingular, we have $x = tB^{-1}$, so that $tB^{-1}C(B^{-1})'t' = 0$ is a quadratic form in the image space. An instructive computation involves computing the image of an ellipse $ax^2 + by^2 - cz^2 = 0$ with $a, b, c > 0$. The image in the parallel coordinate space is a quadratic relation among $(t + u)^2$, $u^2$ and $v^2$, a general hyperbolic form.

Figure 2.2a: One scatterplot of five-dimensional data showing elliptical cross section.
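The two transformations (2.3) and (2.4) are easy to check numerically. The sketch below is an illustration only: the helper names `row_times_matrix` and `incident`, and the particular line $y = -2x + 3$, are assumptions chosen for the example, and the matrix entries are those forced by the identities $(b, 1, 1-m) = (m, -1, b)A$ and $(1, x_1 - x_2, -x_1) = (x_1, x_2, 1)B$. It uses exact rational arithmetic to confirm that the dual point of a Cartesian line lies on the dual line of every Cartesian point on that line.

```python
from fractions import Fraction

# Cartesian lines -> parallel coordinate points, as in (2.3)
A = [[0, 0, -1],
     [0, -1, -1],
     [1, 0, 0]]
# Cartesian points -> parallel coordinate lines, as in (2.4)
B = [[0, 1, -1],
     [0, -1, 0],
     [1, 0, 0]]


def row_times_matrix(row, matrix):
    """Row vector times a 3x3 matrix, matching the convention t = xA."""
    return [sum(row[k] * matrix[k][j] for k in range(3)) for j in range(3)]


def incident(point, line):
    """A point lies on a line when their homogeneous triples have zero dot product."""
    return sum(p * l for p, l in zip(point, line)) == 0


m, b = Fraction(-2), Fraction(3)                 # an arbitrary line y = mx + b
dual_point = row_times_matrix([m, -1, b], A)     # homogeneous triple (b, 1, 1 - m)

# Three Cartesian points on the line, each mapped to its dual line in the tu plane.
dual_lines = [row_times_matrix([a, m * a + b, 1], B)
              for a in map(Fraction, (-1, 0, 4))]

print(all(incident(dual_point, line) for line in dual_lines))
# Rescaling by the third component recovers (b/(1-m), 1/(1-m)).
print(dual_point[0] / dual_point[2], dual_point[1] / dual_point[2])

# For m = 1 the third component vanishes, i.e. the dual is an ideal point.
ideal = row_times_matrix([Fraction(1), -1, b], A)
print(ideal[2] == 0)
```

Because both maps are plain matrix products with entries in $\{-1, 0, 1\}$, the whole computation involves only additions and subtractions, as noted in the text.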

It should be noted that the solution to the image equation of the ellipse is not a locus of points, but the natural homogeneous coordinates of a locus of lines, a line conic. The envelope of this line conic is a point conic. In the case of this computation, the point conic in the original Cartesian coordinate plane is an ellipse; the image in the parallel coordinate plane is, as we have just seen, a line hyperbola with a point hyperbola as envelope.

Figure 2.2b: Parallel coordinate plot of the same five-dimensional data showing the hyperbolic dual structure.

We mentioned the duality between points and lines and between conics and conics. It is worthwhile to point out two other nice dualities. Rotations in Cartesian coordinates become translations in parallel coordinates and vice versa. Perhaps more interesting from a statistical point of view is that points of inflection in Cartesian space become cusps in parallel coordinate space and vice versa. Thus the relatively hard-to-detect inflection point property of a function becomes the notably easier-to-detect cusp in the parallel coordinate representation. Inselberg (1985) discusses these properties in detail. It is well worth noting that the natural homogeneous coordinate representation is a standard device in computer graphics.

2.3 Permutation of the Axes for Pairwise Comparisons.

One of the most common objections to parallel coordinate displays is the preferential positioning of adjacent axes. If the parallel coordinate axes are ordered from 1 through $d$, then there is an easy pairwise comparison of 1 with 2, 2 with 3 and so on. However, the pairwise comparison of 1 with 3, 2 with 5 and so on is not easily done because these axes are not adjacent. One simple mathematical question then is what is the minimal number of permutations of the axes needed to guarantee all possible pairwise adjacencies. Although there are $d!$ permutations, many of these duplicate adjacencies. Actually far fewer permutations are required.

Figure 2.3 Illustrating the graph labeling for determining parallel coordinate permutations

A construction for determining the permutations is represented in Figure 2.3. A graph is drawn with vertices representing coordinate axes, labeled clockwise 1 to $d$. Edges represent adjacencies, so that vertex one connected to vertex two by an edge means axis one is placed adjacent to axis two. To construct a minimal set of permutations that completes the graph is equivalent to finding a minimal set of orderings of the axes so that every possible adjacency is present. Figure 2.3b illustrates the basic zig-zag pattern used in the construction. This creates an ordering which in the example of Figure 2.3b is 1 2 7 3 6 4 5. For $d$ even this general sequence can be written as $1, 2, d, 3, d-1, 4, d-2, \ldots, (d+2)/2$ and for $d$ odd as $1, 2, d, 3, d-1, 4, d-2, \ldots, (d+3)/2$. An even simpler formulation is

$$d_{k+1} = \left(d_k + (-1)^{k+1} k\right) \bmod d, \qquad k = 1, 2, \ldots, d-1 \tag{2.5}$$

with $d_1 = 1$. Here it is understood that $0 \bmod d = d \bmod d = d$. This zig-zag pattern can be recursively applied to complete the graph. That is to say, if we let $d_k^{(1)} = d_k$, we may define

$$d_k^{(j+1)} = \left(d_k^{(j)} + 1\right) \bmod d, \qquad j = 1, 2, \ldots, \left[\tfrac{d-1}{2}\right] \tag{2.6}$$

where $[\cdot]$ is the greatest integer function. For $d$ even, it follows that this construction generates each edge in one and only one permutation. Thus $d/2$ is the minimal number of permutations needed to assure that every edge appears in the graph or, equivalently, that every adjacency occurs in the parallel coordinate representation. For $d$ odd, the result is not exactly the same. We will not have any duplication of adjacencies for $j < \left[\tfrac{d-1}{2}\right]$. However, $\left[\tfrac{d-1}{2}\right]$ permutations will not provide a complete graph. The case $j = \left[\tfrac{d-1}{2}\right]$ in equation (2.6) will complete the graph, but also create some redundancies. Nevertheless, it is clear that $\left[\tfrac{d+1}{2}\right]$ permutations are the minimal number needed to complete the graph and thus provide every adjacency in the parallel coordinate representation. Thus we have that the minimal number of permutations of the parallel coordinate axes needed to ensure adjacency of every pair of axes is $\left[\tfrac{d+1}{2}\right]$. These permutations may be constructed using formulas (2.5) and (2.6).

It is worthwhile to point out that all possible pairs may be found in only $\left[\tfrac{d+1}{2}\right]$ distinct parallel coordinate plots, but for a scatterplot matrix, $\binom{d}{2} = d(d-1)/2$ plots are required. One practical consequence is that for a fixed computer screen size, elements in the scatterplot matrix become difficult to see much more rapidly than the parallel coordinate plots. In general such permutation arguments are rendered unnecessary with the introduction of the grand tour.

3. The Grand Tour in $d$-dimensions

The grand tour is, in a sense, the generalization of rotations in high-dimensional space and is an invaluable tool for animating high-dimensional visualization. When used in conjunction with scatterplot matrix displays or with parallel coordinate displays, the grand tour allows the data analyst a variety of views for exploring the structure of data. The basic idea, introduced by Asimov (1985) and Buja and Asimov (1985), is to capture the popular sense of a grand tour. That is, to fully understand a subject item, one must examine it from all possible angles.
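Returning briefly to the axis permutations of Section 2.3, the zig-zag construction of formulas (2.5) and (2.6) can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the function names are invented for the example, and the repeated shift of (2.6) is applied directly as "add the shift, modulo $d$" with the stated convention that a residue of 0 is read as $d$.

```python
from itertools import combinations


def zigzag_order(d):
    """Base ordering from (2.5): d_1 = 1 and d_{k+1} = (d_k + (-1)^(k+1) * k) mod d."""
    order = [1]
    for k in range(1, d):
        value = (order[-1] + (-1) ** (k + 1) * k) % d
        order.append(value if value != 0 else d)  # 0 mod d is read as d
    return order


def axis_permutations(d):
    """Shift the base ordering as in (2.6), giving [(d + 1) / 2] orderings in total."""
    base = zigzag_order(d)
    count = (d + 1) // 2
    return [[(axis + shift - 1) % d + 1 for axis in base] for shift in range(count)]


def adjacent_pairs(orderings):
    """Collect every unordered pair of axes that is adjacent in some ordering."""
    pairs = set()
    for order in orderings:
        pairs.update(frozenset(p) for p in zip(order[:-1], order[1:]))
    return pairs


def all_pairs(d):
    """All d(d-1)/2 unordered pairs of axes, as needed by a scatterplot matrix."""
    return {frozenset(p) for p in combinations(range(1, d + 1), 2)}


for order in axis_permutations(7):
    print(order)
print(adjacent_pairs(axis_permutations(7)) == all_pairs(7))
```

For $d = 7$ the first ordering printed is the 1 2 7 3 6 4 5 sequence of Figure 2.3b, and four orderings in total suffice to place every pair of axes side by side, compared with the 21 panels of the corresponding scatterplot matrix.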
In mathematical terms, the grand tour idea translates into examining the data cloud from all possible angles. In the formulation introduced by Asimov and Buja, this meant projecting into a set of two-planes dense in the $d$-dimensional space of the data. The idea is to move from one two-plane to the next so as to see the data from all possible angles. Not only should the set of two-planes be dense in the data space, but it is also required to move continuously (smoothly) from one two-plane to the next so that the human visual system can smoothly interpolate the data and track individual points and structures in the data. Hence the mathematics of the Asimov-Buja grand tour requires a continuous, space-filling path through the set of two-planes in the $d$-dimensional data space. The idea then is to project the data onto the two-planes and view them in a time-sequenced set of two-dimensional images. The practical implementation is to step through the set of two-planes with a small step size in time rather than to move through the set of two-planes in some continuous sense. This type of grand tour was also studied by Buja, Hurley, and McDonald (1986), Cook, Buja and Cabrera (1991), Cook et al. (1993), Cook et al. (1995), Cook and Buja (1997), Furnas and Buja (1994), and Hurley and Buja (1990).

Wegman (1991) formally suggested replacing the manifold of two-planes with a manifold of $k$-planes where $k \le d$, $d$ being the dimension of the data space, and discussed adapting the methods of Asimov-Buja for constructing a space-filling curve through the manifold of $k$-planes. The data is then projected into the $k$-plane and visualized using either a parallel coordinate display or a scatterplot matrix display. This method was actually implemented in the Mason Hypergraphics software (Wegman and Bolorforoush, 1988) much earlier.

The approach formulated by Asimov is known as the torus method for reasons that shall become clear during our development of the grand tour mathematics. The geometric form of the standard two-torus embedded in three-space is of course the traditional doughnut shape. The generalization of the two-torus to higher-dimensional space is harder to visualize. For this reason, it is somewhat easier to conceive of the basic structure of interest as a multidimensional hypercube. The torus, then, is a hypercube with opposite faces identified. Because we want to deal with angles, we can think of the length of each side of the cube as $2\pi$. We shall first formulate the Asimov-Buja winding algorithm, then a random curve algorithm, and finally a fractal algorithm. We also present a two-dimensional pseudo grand tour.

3.1 The Asimov-Buja Winding Algorithm in $d$-space.

Let $e_j = (0, 0, \ldots, 0, 1, 0, \ldots, 0)$ be the canonical basis vector of length $d$. The 1 is in the $j$-th position. The $e_j$ are the unit vectors for each of the coordinate axes in the initial position. We want to do a general rigid rotation of these axes into a new position with basis vectors $a_j(t) = (a_j^1(t), a_j^2(t), \ldots, a_j^d(t))$, where $t$ is the time index. The strategy then is to take the inner product of each data point, say $x_i$, $i = 1, \ldots, n$, with the basis vectors $a_j(t)$. This operation projects the data into the rotated coordinate system. By convention, $d$ will refer to the dimension of the data and $n$ will refer to the sample size of the data set.
Of course, the subscript $j$ on $a_j(t)$ means that $a_j(t)$ is the image under the generalized rotation of the canonical basis vector $e_j$. Thus the data vector $x_i$ is $(x_i^1, x_i^2, \ldots, x_i^d)$, so that the representation of $x_i$ in the $a_j$ coordinate system is

$$y_i(t) = (y_i^1(t), y_i^2(t), \ldots, y_i^d(t)), \qquad i = 1, 2, \ldots, n \tag{3.1}$$

with

$$y_i^j(t) = \sum_{k=1}^{d} x_i^k a_j^k(t), \qquad j = 1, \ldots, d \ \text{ and } \ i = 1, \ldots, n. \tag{3.2}$$

The vector $y_i(t)$ is a linear combination of the basis vectors representing the $i$-th data point in the rotated coordinate system at time $t$. It is also worth pointing out that $y_i^j(t)$ is also a linear combination of the data. If one component of the vector is held out from the grand tour (i.e. a partial grand tour), then the partial grand tour lends itself to an interpretation in terms of multiple linear regression.

The general goal then is to find a generalized rotation $Q$ such that $Q(e_j) = a_j$. We can conceive of $Q$ as either a function on the space of basis vectors or as a $d \times d$ matrix $Q$ where $e_j \times Q = a_j$. We implement this by choosing $Q$ as an element of the special orthogonal group, denoted by $SO(d)$, of orthogonal $d \times d$ matrices having determinant $+1$. Thus we must find a continuous space-filling curve through $SO(d)$. We shall do this by a composite mapping from the real line, $\mathbb{R}$, to the $p$-dimensional hypercube $[0, 2\pi]^p$, i.e. $\alpha: \mathbb{R} \to [0, 2\pi]^p$, where $p = d(d-1)/2$. The components of $\alpha(t)$ are taken to be angles. The mapping $\beta$ from $[0, 2\pi]^p$ onto $SO(d)$ is given by

$$\beta(\theta_{1,2}, \theta_{1,3}, \ldots, \theta_{d-1,d}) = R_{12}(\theta_{12}) \times R_{13}(\theta_{13}) \times \cdots \times R_{d-1,d}(\theta_{d-1,d}). \tag{3.3}$$

The fact that this is an onto mapping guarantees that the curve is space filling. See Asimov (1985). There are $p = d(d-1)/2$ factors in the expression (3.3). These correspond to the $\binom{d}{2} = d(d-1)/2$ distinct two-flats formed by the canonical basis vectors. In a general $d$-dimensional space, there are $d - 2$ axes orthogonal to each two-flat. Thus rather than rotating around an axis as we conventionally do in three-dimensional geometry, we must rotate in a two-plane in $d$-dimensional space. We let $R_{ij}(\theta)$ be the element of $SO(d)$ which rotates the $e_i e_j$ plane through an angle of $\theta$. Thus

$$R_{ij}(\theta) = \begin{pmatrix}
1 & \cdots & 0 & \cdots & 0 & \cdots & 0 \\
\vdots & \ddots & \vdots & \ddots & \vdots & \ddots & \vdots \\
0 & \cdots & \cos(\theta) & \cdots & -\sin(\theta) & \cdots & 0 \\
\vdots & \ddots & \vdots & \ddots & \vdots & \ddots & \vdots \\
0 & \cdots & \sin(\theta) & \cdots & \cos(\theta) & \cdots & 0 \\
\vdots & \ddots & \vdots & \ddots & \vdots & \ddots & \vdots \\
0 & \cdots & 0 & \cdots & 0 & \cdots & 1
\end{pmatrix}, \tag{3.4}$$

where the cosines and sines are respectively in the $i$-th and $j$-th columns and rows. The restrictions on $\theta_{ij}$ are $0 \le \theta_{ij} \le 2\pi$, $1 \le i < j \le d$. The angles $\theta_{ij}$ are called the Euler angles.

Finally, we construct $\alpha(t) = (\lambda_1 t, \lambda_2 t, \ldots, \lambda_p t)$ as the mapping from $\mathbb{R}$ to $[0, 2\pi]^p$ where, of course, $\lambda_i t$ is taken modulo $2\pi$. The $\lambda_1, \ldots, \lambda_p$ are taken to be linearly independent real numbers over the rational numbers. Thus we define $Q_t = \beta(\alpha(t))$. The fact that the $\lambda_i$ are linearly independent over the rationals guarantees that no $\lambda_i$ can be written in terms of the remaining $\lambda_j$. This guarantees that they are mutually irrational and that slopes through the hypercube cannot be multiples of one another. As mentioned earlier, opposite faces of a $p$-hypercube are topologically identified to construct a $p$-dimensional torus. Hence the terminology for the torus method. It is easy to see in two dimensions that opposite sides of a rectangle may be identified to form an ordinary torus.
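The winding algorithm assembled above can be sketched in plain Python. Everything named here is an assumption for illustration: the helper names, the choice of square roots of distinct primes as the frequencies $\lambda_i$ (such numbers are linearly independent over the rationals, although their floating-point values are of course only approximations), the placement of the minus sign in the plane rotation, the step size, and the sample data point. The sketch builds $Q_t = \beta(\alpha(t))$ as the product of plane rotations and projects a point onto the rotated basis vectors, taking $a_j$ to be row $j$ of $Q_t$.

```python
import math
from itertools import combinations


def identity(d):
    return [[1.0 if r == c else 0.0 for c in range(d)] for r in range(d)]


def plane_rotation(d, i, j, theta):
    """R_ij(theta): identity except for a rotation by theta in the e_i e_j plane (0-based i < j)."""
    rot = identity(d)
    rot[i][i], rot[i][j] = math.cos(theta), -math.sin(theta)
    rot[j][i], rot[j][j] = math.sin(theta), math.cos(theta)
    return rot


def matmul(p, q):
    n = len(p)
    return [[sum(p[r][k] * q[k][c] for k in range(n)) for c in range(n)] for r in range(n)]


def alpha(t, lambdas):
    """alpha(t) = (lambda_1 t, ..., lambda_p t), each component taken modulo 2*pi."""
    return [(lam * t) % (2.0 * math.pi) for lam in lambdas]


def beta(d, thetas):
    """Product of the d(d-1)/2 plane rotations, in the order (1,2), (1,3), ..., (d-1,d)."""
    q = identity(d)
    for (i, j), theta in zip(combinations(range(d), 2), thetas):
        q = matmul(q, plane_rotation(d, i, j, theta))
    return q


def project(x, q):
    """Inner product of x with each rotated basis vector a_j (row j of q)."""
    return [sum(xk * ajk for xk, ajk in zip(x, row)) for row in q]


d = 4
lambdas = [math.sqrt(prime) for prime in (2, 3, 5, 7, 11, 13)]  # p = 6 frequencies for d = 4
x = [1.0, 2.0, 3.0, 4.0]            # a made-up data point
for step in range(3):
    t = 0.05 * step                 # small time step between successive views
    y = project(x, beta(d, alpha(t, lambdas)))
    print([round(v, 3) for v in y])
```

At $t = 0$ the rotation is the identity, so the first view reproduces the original coordinates; successive small steps in $t$ give the smoothly changing views that make up the tour. Displaying all $d$ rotated coordinates in a parallel coordinate plot, or only the first two in a scatterplot, is a separate choice not made here.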
If we take $\lambda_1 = 1$ and $\lambda_2$ to be any irrational number, then the curve on the two-torus described by $\alpha(t)$ simply winds around the torus in a space-filling curve. This is the origin of the idea of the winding algorithm. The mapping from the $p$-torus to $SO(d)$ is onto, which guarantees that the images of the above curves are space filling. But there is a potential problem. We do not in general know how close to uniformly distributed the mapped curve is on $SO(d)$. The curve on the $p$-torus is equi-distributed, but in general its image on $SO(d)$ may not be. In the two-plane formulation, this has been empirically a problem because some implementations of the torus method have given grand tours that behaved very non-uniformly. In particular, the tour appears to dwell for a long time near certain axes while others appear rarely. Our experience with using this algorithm in the higher-dimensional formulation suggests empirically that this is less of a problem. While selected pairs of axes may appear relatively static, others are actually moving quite significantly. Thus the overall dynamic when visualizing high-dimensional plots is more satisfactory than when viewing 2-dimensional versions. Nonetheless, an adequate theoretical understanding of the overall dynamics of the torus algorithm is still an open question.

It is also worth pointing out that for a $k$-dimensional grand tour, one only needs the first $k$ columns of a matrix in $SO(d)$. Thus from a computational point of view, there is no need to formulate the rotations in 2-planes that are to stay fixed. This simplifies the computational complexity somewhat, although it makes the overall algorithm more complicated, with different algorithms for the computation of matrices in $SO(d)$ for each sub-dimension. An interesting research question, posed by one of the referees, is whether overparametrizing improves the uniformity properties of the resulting grand tour. This also is an open question.

3.2 The Random Curve Algorithm. The key to the winding algorithm is the construction of the function $\alpha(t)$, which creates a space-filling curve through the $p$-dimensional hypercube (or $p$-torus). The composition of $\alpha$ with $\beta$ creates a space-filling curve through $SO(d)$. Alternate constructions which create a space-filling curve through the $p$-dimensional hypercube can also be used to effect a space-filling curve through $SO(d)$. A simple way of doing this is to choose points at random in the hypercube.
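As a rough illustration of this random-point idea, the sketch below draws target points uniformly in $[0, 2\pi)^p$ and moves between them in equal steps. It is a minimal sketch under stated assumptions, not the ExplorN or CrystalVision implementation: the names `torus_step` and `random_curve` are hypothetical, the step count per leg is an arbitrary parameter, and each coordinate difference is wrapped into $[-\pi, \pi)$ so that the path follows a shortest route on the torus rather than staying strictly inside the hypercube.

```python
import math
import random

TWO_PI = 2.0 * math.pi


def torus_step(current, target, fraction):
    """Point a given fraction of the way from current to target on the torus."""
    out = []
    for c, g in zip(current, target):
        # Wrapped difference in [-pi, pi): may cross a face of the hypercube.
        delta = (g - c + math.pi) % TWO_PI - math.pi
        out.append((c + fraction * delta) % TWO_PI)
    return out


def random_curve(p, n_targets, steps_per_leg, rng):
    """Yield angle vectors along a piecewise-linear path between random points."""
    current = [rng.uniform(0.0, TWO_PI) for _ in range(p)]
    for _ in range(n_targets):
        target = [rng.uniform(0.0, TWO_PI) for _ in range(p)]
        for step in range(1, steps_per_leg + 1):
            yield torus_step(current, target, step / steps_per_leg)
        current = target


# Example: 5 legs of 10 steps each through the 3-torus (as for d = 3, p = 3).
path = list(random_curve(3, 5, 10, random.Random(0)))
```

Each yielded angle vector would then be passed through the mapping $\beta$ of (3.3) to obtain a rotation. Dropping the wrapping in `torus_step` gives the simpler variant that interpolates strictly inside the hypercube.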
One initiates this algorithm by choosing two points at random in the hypercube, say $s_1$ and $s_2$, and creating a linear interpolant between them going from $s_1$ to $s_2$. Upon arriving at $s_2$, we choose a third point, $s_3$, and form the linear interpolant from $s_2$ to $s_3$. In general we have a sequence of points, $s_i$, chosen randomly with linear interpolants between them. For any $x \in [0, 2\pi]^p$ and any given $\epsilon > 0$, eventually with probability one, for some $i$, $\|x - s_i\| < \epsilon$. Thus eventually the random curve will pass arbitrarily close to any point in the $p$-hypercube.

Two caveats must be mentioned. Since opposite faces are identified, the shortest path between two points may not be through the hypercube but across a face of the cube. Since we are really interested in geodesics on the $p$-torus, one must not think in terms of staying strictly within the hypercube. This involves a slightly more complicated algorithm. In practice, a strict interpolation path within the hypercube also seems to be a quite satisfactory approach and will still pass within $\epsilon$ of any point with probability one. The second point to make is that, as with the winding algorithm, the random curve algorithm can, in principle, continue forever. Our original code in Mason Hypergraphics circa 1988 contained the winding algorithm. Our more recent codes in ExplorN and CrystalVision are based on the random curve algorithm. In practice, although theoretically these algorithms can go on forever, our experience has been that most structure in high-dimensional data shows up very rapidly, say within five or ten minutes of viewing the grand tour, and that it is unnecessary to continue the grand tour beyond that time.

3.3 The Fractal Curve Algorithm. In 1878, Georg Cantor demonstrated that any two finite-dimensional, smooth manifolds have the same cardinality regardless of their dimensions. In principle, therefore, $[0, 1]$ can be mapped bijectively onto $[0, 1]^d$. Many attempts to do this have been created; most notable among these methods are the Peano curves and the Hilbert curves. The advantage of these curves is that they are fractal in character and hence, for a fixed fractal level, have a finite length and a fixed accuracy. By following a fractal curve through the hypercube, one can preselect an accuracy level and guarantee that the grand tour will terminate within a known time. To illustrate the computation, we will describe a two-dimensional Peano curve, a three-dimensional Hilbert curve, and our generalization to the $d$-dimensional Hilbert curve.

Figure 3.1 Third level Peano curve through $[0, 1]^2$.

We shall be concerned with ternary and octal expansions of fractional numbers between 0 and 1. We adopt the following notation for a ternary expansion:

$$ 0.t_1 t_2 t_3 \ldots = \frac{t_1}{3} + \frac{t_2}{3^2} + \frac{t_3}{3^3} + \cdots, \quad t_j = 0, 1, \text{ or } 2. \tag{3.5} $$

Similarly, for an octal expansion,

$$ 0.\omega_1 \omega_2 \omega_3 \ldots = \frac{\omega_1}{8} + \frac{\omega_2}{8^2} + \frac{\omega_3}{8^3} + \cdots, \quad \omega_j = 0, 1, 2, \ldots, 7. \tag{3.6} $$

The two-dimensional Peano curve of level $L$ is given by the two-vector

$$ p(0.t_1 t_2 \ldots t_L) = \begin{pmatrix} 0.t_1 \, (k^{t_2} t_3) \, (k^{t_2 + t_4} t_5) \cdots \\ 0.(k^{t_1} t_2) \, (k^{t_1 + t_3} t_4) \cdots \end{pmatrix}. \tag{3.7} $$

Here $k(t) = 2 - t$, $t = 0, 1, 2$, and $k^{\nu}$ is the $\nu$-th iterate of $k$. Of course the computation is carried out until the level $L$ is reached. To make this computation concrete, let us consider a level 3 example. Implicitly the fourth position in the ternary expansion is 0. For illustration, [the digits of the worked level 3 example are illegible in this copy]. The resulting sequence of points when joined determines a curve through, in this case, the square. Because there is considerable overlapping of points, a useful strategy is to join the midpoints of the line segments, which results in the derived Peano curve in Figure 3.2.

Figure 3.2 Derived Peano curve of level three.

The three-dimensional formulation of the Hilbert curve is somewhat more tedious. We define the following matrices: $H_0, H_1, \ldots, H_7$ [the entries of these eight $3 \times 3$ matrices are illegible in this copy]. We also construct the following column vectors: $h_0, h_1, \ldots, h_7$ [the entries of these eight column vectors are illegible in this copy]. Then the three-dimensional Hilbert curve of level $L$ is given by

$$ f_h(0.\omega_1 \omega_2 \ldots \omega_L) = \sum_{j=1}^{L} \frac{1}{2^j} H_{\omega_0} H_{\omega_1} H_{\omega_2} \cdots H_{\omega_{j-1}} h_{\omega_j}, \tag{3.8} $$

where $H_{\omega_0}$ is the identity matrix. A three-dimensional level 2 Hilbert curve is given in Figure 3.3.

Figure 3.3 A 3-dimensional, level 2 Hilbert curve.

Both the Hilbert curve and the Peano curve can be extended to higher dimensions. See for example Solka et al. (1998). We give briefly the algorithm here. First let $k(x, n) = x$ if $n$ is even and $k(x, n) = 2 - x$ if $n$ is odd. Let $t_i$ be the $i$-th ternary digit of $x$. Let $d$ be the dimension and let $L$ be the level we desire. Let

$$ a_i^{(j)} = k\Bigl( t_{(i-1)d + j}, \ \sum_{m < (i-1)d + j, \ m \not\equiv j \ (\mathrm{mod}\ d)} t_m \Bigr), \quad j = 1, \ldots, d. \tag{3.9} $$

Then

$$ p(x) = \Bigl[ \sum_{i=1}^{L} \frac{a_i^{(1)}}{3^i}, \ \sum_{i=1}^{L} \frac{a_i^{(2)}}{3^i}, \ \ldots, \ \sum_{i=1}^{L} \frac{a_i^{(d)}}{3^i} \Bigr]. \tag{3.10} $$

As before this describes a series of points in $[0, 1]^d$. These can be joined by line segments to form the general $d$-dimensional Peano curve, and the midpoints of the line segments can be joined to form the $d$-dimensional derived Peano curve. By increasing the level of the Peano curve we can come as close as desired to every point in the $d$-dimensional hypercube $[0, 1]^d$. Further reading on Peano curves and related fractal curves can be found in Peano (1890), Steinhaus (1936) and Sagan (1994).
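The digit-flipping rule behind the $d$-dimensional Peano curve can be sketched as follows. This is an illustrative sketch rather than the authors' algorithm as published: it uses the standard generalization of (3.7), in which each digit is flipped by $k$ according to the parity of the sum of all earlier digits belonging to the other coordinates, and it assumes that the input supplies exactly $d \cdot L$ ternary digits. The function names are hypothetical, and exact fractions are used so that the grid points can be compared without rounding error.

```python
from fractions import Fraction


def ternary_digits(index, n_digits):
    """Base-3 digits of index (0 <= index < 3**n_digits), most significant first."""
    digits = []
    for _ in range(n_digits):
        index, remainder = divmod(index, 3)
        digits.append(remainder)
    return digits[::-1]


def k(x, n):
    """k(x, n) = x if n is even, 2 - x if n is odd."""
    return x if n % 2 == 0 else 2 - x


def peano_point(digits, d):
    """Map ternary digits t_1..t_{dL} to a point of the Peano curve in [0, 1]^d."""
    levels = len(digits) // d
    point = []
    for j in range(d):
        coord = Fraction(0)
        for i in range(levels):
            pos = i * d + j  # zero-based position of this coordinate's digit
            # Sum of earlier digits that belong to the other coordinates.
            flips = sum(digits[q] for q in range(pos) if q % d != j)
            coord += Fraction(k(digits[pos], flips), 3 ** (i + 1))
        point.append(coord)
    return point


def peano_curve(d, levels):
    """All 3**(d*levels) points of the level-`levels` curve, in traversal order."""
    n_digits = d * levels
    return [peano_point(ternary_digits(n, n_digits), d)
            for n in range(3 ** n_digits)]


curve = peano_curve(2, 2)  # 81 points on a 9 x 9 grid in the unit square
```

Consecutive points of `curve` differ by one grid step in a single coordinate, which is the continuity property that makes joining them by line segments meaningful.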

3.4 Andrews Plots, Parallel Coordinate Plots, and Tours. The Andrews plot (Andrews, 1972) was an early attempt to give a two-dimensional plot of multidimensional data. As such, the Andrews plots have an interesting interpretation in connection with both parallel coordinates and the grand tour. To describe the Andrews plot, let $x_i = (x_1^i, x_2^i, \ldots, x_d^i)$ be the data vectors; then the Andrews plot is given by

$$ y_i(t) = \frac{x_1^i}{\sqrt{2}} + x_2^i \sin(t) + x_3^i \cos(t) + x_4^i \sin(2t) + x_5^i \cos(2t) + \cdots. \tag{3.11} $$

Traditionally, the Andrews plot was constructed by plotting $y_i(t)$ versus $t$ for $i = 1, \ldots, n$. In brief, the Andrews plot is a finite, hence periodic, Fourier expansion with the weights given by the components of the data vector. By plotting $y_i(t)$ versus $t$ for every $i$ in a static plot, one could group points that had similar curves. Because of the applicability of the Parseval relation, Andrews plots also have the property of preserving $L_2$ distances.

Andrews plots can be recognized in another sense. If one considers a one-dimensional plot of the $y_i(t)$ animated as a function of $t$ (and this is a view recognized in Andrews, 1972), then the Andrews plot can be regarded as a one-dimensional tour. As a tour, the Andrews plot is a series of interpolations between various one-dimensional views of the data. In a similar way, the parallel coordinate plot can be viewed as a series of linear interpolations between one-dimensional projections of the data. Although these two plot devices have some similarity of interpretation in this sense, this interpretation misses the powerful geometric structure which motivated the parallel coordinate plot and lies at its intellectual roots. Of course, the tour view of the Andrews plot also has a connection with the grand tour notion we have been examining. The Asimov-Buja grand tour was originally formulated as a series of projections into two-dimensional planes, not one-dimensional lines. The availability of multi-dimensional representations such as scatterplot matrices and parallel coordinates suggested the possibility of full-dimensional grand tours.
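The Andrews curve of (3.11) and its distance-preserving property are easy to check numerically. The following is a minimal sketch; the function names and the sample vectors are made up for illustration. By the Parseval relation, the integral of $(y_a(t) - y_b(t))^2$ over one period $[-\pi, \pi]$ equals $\pi \|a - b\|^2$, and for a finite trigonometric sum an equally spaced Riemann sum with enough points reproduces that integral up to rounding.

```python
import math


def andrews_curve(x, t):
    """x1/sqrt(2) + x2 sin(t) + x3 cos(t) + x4 sin(2t) + x5 cos(2t) + ..."""
    value = x[0] / math.sqrt(2.0)
    for idx, coef in enumerate(x[1:], start=1):
        freq = (idx + 1) // 2  # 1, 1, 2, 2, 3, 3, ...
        wave = math.sin(freq * t) if idx % 2 == 1 else math.cos(freq * t)
        value += coef * wave
    return value


def squared_curve_distance(a, b, n_grid=64):
    """Riemann sum of (y_a(t) - y_b(t))**2 over one period [-pi, pi)."""
    h = 2.0 * math.pi / n_grid
    return h * sum(
        (andrews_curve(a, -math.pi + m * h) - andrews_curve(b, -math.pi + m * h)) ** 2
        for m in range(n_grid)
    )


a = [1.0, 2.0, 3.0, 4.0, 5.0]   # illustrative data vectors, not from the paper
b = [0.0, -1.0, 2.0, 0.5, 3.0]
curve_dist = squared_curve_distance(a, b)
euclid_dist = math.pi * sum((u - v) ** 2 for u, v in zip(a, b))
```

Here `curve_dist` and `euclid_dist` agree to within floating-point error, so points that are close in Euclidean distance have curves that are close in the $L_2$ sense.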
However, the two major criteria for grand tours are continuity and space-filling. The Andrews plot is a continuous tour, but as we shall see in the next section, it is not space filling.

3.5 A Pseudo-Grand Tour. As recently as 1990, the Andrews plot was characterized as a one-dimensional grand tour. See for example Crawford and Fall (1990). However, because of the familiar trigonometric identities,

$$ \sin(t_1 + t_2) = \sin(t_1)\cos(t_2) + \cos(t_1)\sin(t_2); \tag{3.12} $$

$$ \cos(t_1 + t_2) = \cos(t_1)\cos(t_2) - \sin(t_1)\sin(t_2) \tag{3.13} $$

and

$$ \sin^2(t) + \cos^2(t) = 1, \tag{3.14} $$

Wegman and Shen (1993) showed that the Andrews plot was not a one-dimensional grand tour because it was not anywhere nearly space filling even in only one dimension. However, motivated by the Andrews plot, Wegman and Shen suggested a very fast algorithm for computing an approximate two-dimensional grand tour. Consider the $d$-dimensional data vector $x = (x_1, x_2, \ldots, x_d)$. If $d$ is not even, augment the vector by one additional 0. We assume without loss of generality that $d$ is even. Let

$$ a_1(t) = \sqrt{2/d} \, \bigl( \sin(\lambda_1 t), \cos(\lambda_1 t), \ldots, \sin(\lambda_{d/2} t), \cos(\lambda_{d/2} t) \bigr) \tag{3.15} $$

and

$$ a_2(t) = \sqrt{2/d} \, \bigl( \cos(\lambda_1 t), -\sin(\lambda_1 t), \ldots, \cos(\lambda_{d/2} t), -\sin(\lambda_{d/2} t) \bigr). \tag{3.16} $$

Here the $\lambda_j$ are, as before, linearly independent over the rationals. Note that

$$ \|a_1(t)\|^2 = \frac{2}{d} \sum_{j=1}^{d/2} \bigl( \sin^2(\lambda_j t) + \cos^2(\lambda_j t) \bigr) = 1, $$

$$ \|a_2(t)\|^2 = \frac{2}{d} \sum_{j=1}^{d/2} \bigl( \cos^2(\lambda_j t) + (-1)^2 \sin^2(\lambda_j t) \bigr) = 1, $$

and

$$ \langle a_1(t), a_2(t) \rangle = \frac{2}{d} \sum_{j=1}^{d/2} \bigl( \sin(\lambda_j t)\cos(\lambda_j t) - \cos(\lambda_j t)\sin(\lambda_j t) \bigr) = 0. $$

Thus $a_1(t)$ and $a_2(t)$ form an orthonormal basis for two-planes. They are not quite space filling because of the dependence between $\sin(\lambda_j t)$ and $\cos(\lambda_j t)$ implied by (3.14). However, the algorithm based on (3.15) and (3.16) is much more computationally convenient than the torus-based winding algorithms. Of course, it does not generalize to a full $d$-dimensional grand tour. A two-dimensional projection of the data onto the $a_1$-$a_2$ plane can be accomplished by taking the inner product as in equation (3.2) with $j = 1, 2$.

4. Saturation Brushing

We have earlier mentioned saturation brushing as a technique for dealing with large data sets. A basic exposition of the saturation brushing idea can be found in Wegman and Luo (1997). While there is little in the way of mathematical underpinnings for the idea, it is appropriate for the sake of completeness to briefly describe the idea here. When dealing with large data sets, overplotting becomes a serious problem. It is difficult to tell whether a pixel represents a single observation or perhaps hundreds or thousands of observations. The idea of saturation brushing is to desaturate a brushing color until it contains only a very small component of color and hence is very nearly black. Most modern computers have a so-called $\alpha$-channel which allows for compositing of overplots. The $\alpha$-channel is used in computer graphics as a device for incorporating transparency. However, by using such a device to build up color intensity, we can obtain a visual indication of how much overplotting there is at a pixel. In effect, the brighter and more saturated a pixel is, the more overplotting there is.

5. Conclusions

We have used this combination of methods, i.e. parallel coordinate plots, scatterplot matrices, and full $d$-dimensional grand tours as well as partial grand tours, to analyze data sets ranging in dimension from 4 to 68 and ranging in data set size from as few as 22 points to as many as 280,000 points. An amazing amount of visual insight can be gained when these methods are applied in practical settings. Applications have included discovery of reasons for bank failures, discovery of hidden pricing mechanisms for commercial products such as cereals, discovery of physical structure of pi meson-proton collisions, creation of detection schemes for chemical and biological warfare agents, creation of the ability to detect buried landmines, demonstration of the impossibility of finding linear predictors of cost in a certain class of software development tools, and a host of other practical and interesting applications. We believe these combinations of techniques are both incredibly powerful from an applications point of view and have very interesting mathematical underpinnings.

Acknowledgements

The work of Dr.
Wegman was supported by the Army Research Office under Grant DAAG55-98-1-0404, by the Office of Naval Research under Grant DAAD19-99-1-0314 administered by the Army Research Office, and by the Defense Advanced Research Projects Agency under Agreement 8905-48174 with The Johns Hopkins University. The paper was completed while Dr. Wegman was an ASA/NSF/BLS Senior Research Fellow at the Bureau of Labor Statistics. Any opinions expressed in this paper are those of the authors and do not constitute policy of the Bureau of Labor Statistics. The work of Dr. Solka was supported by the NSWC ILIR Program through the Office of Naval Research and by the SWT Block Program of the Office of Naval Research. The authors would like to express their gratitude to the referees, who made insightful and very useful suggestions for improvements to the paper and to the presentation of this material.

References

Andrews, D. F. (1972), "Plots of high dimensional data," Biometrics, 28, 125-136.

Asimov, D. (1985), "The grand tour: a tool for viewing multidimensional data," SIAM J. Scient. Statist. Comput., 6, 128-143.
Buja, A. and Asimov, D. (1985), "Grand tour methods: an outline," Computer Science and Statistics: Proceedings of the Seventeenth Symposium on the Interface, D. Allen, ed., Amsterdam: North Holland, 63-67.
Buja, A., Hurley, C. and McDonald, J. A. (1986), "A data viewer for multivariate data," Computer Science and Statistics: Proceedings of the Eighteenth Symposium on the Interface, T. Boardman, ed., Alexandria, VA: American Statistical Association, 171-174.
Cook, D., Buja, A., and Cabrera, J. (1991), "Direction and motion control in the grand tour," Computing Science and Statistics, 23, 180-183.
Cook, D., Buja, A., Cabrera, J. and Swayne, D. (1993), Grand Tour and Projection Pursuit (a video), ASA Statistical Graphics Video Lending Library.
Cook, D., Buja, A., Cabrera, J., and Hurley, C. (1995), "Grand tour and projection pursuit," Journal of Computational and Graphical Statistics, 4(3), 155-172.
Cook, D. and Buja, A. (1997), "Manual controls for high-dimensional data projections," J. Computat. Graph. Statist., 6, 464-480.
Crawford, S. L. and Fall, T. C. (1990), "Projection pursuit techniques for visualizing high-dimensional data sets," in Visualization in Scientific Computing, (G. M. Nielson, B. Shriver, L. J. Rosenblum, editors), 94-108, Los Alamitos, CA: IEEE Computer Society Press.
Furnas, G. and Buja, A. (1994), "Prosection views: Dimensional inference through sections and projections," J. Computat. Graph. Statist., 3, 323-353.
Hurley, C. and Buja, A. (1990), "Analyzing high dimensional data with motion graphics," SIAM J. Sci. Statist. Comput., 11, 1193-1211.
Inselberg, A. (1985), "The plane with parallel coordinates," The Visual Computer, 1, 69-91.
Miller, J. J. and Wegman, E. J. (1991), "Construction of line densities for parallel coordinate plots," in Computing and Graphics in Statistics (A. Buja and P. A. Tukey, eds.), New York: Springer-Verlag, 107-123.
Peano, G. (1890), "Sur une courbe qui remplit toute une aire plane," Math. Annln., 36, 157-160.
Sagan, H. (1994), Space-Filling Curves, New York: Springer-Verlag.

Solka, J. L., Wegman, E. J., Reid, L. and Poston, W. L. (1998), "Explorations of the space of orthogonal transformations from $\mathbb{R}^p$ to $\mathbb{R}^p$ using space-filling curves," Computing Science and Statistics, 30, 494-498.
Solka, J. L., Wegman, E. J., Rogers, G. W., and Poston, W. L. (1997), "Parallel coordinate plot analysis of polarimetric NASA/JPL AIRSAR imagery," Automatic Target Recognition VII - Proceedings of SPIE, 3069, 175-184.
Steinhaus, H. (1936), "La courbe de Peano et les fonctions indépendantes," C. R. Acad. Sci., Paris, 202, 1961-1963.
Wegman, E. J. (1990), "Hyperdimensional data analysis using parallel coordinates," Journal of the American Statistical Association, 85, 664-675.
Wegman, E. J. (1991), "The grand tour in k-dimensions," Computing Science and Statistics: Proceedings of the 22nd Symposium on the Interface, 127-136.
Wegman, E. J. (2000), Geometric Methods in Statistics, Lecture Notes available at http://www.galaxy.gmu.edu.
Wegman, E. J. and Bolorforoush, M. (1988), "On some graphical representations of multivariate data," Computing Science and Statistics: Proceedings of the 20th Symposium on the Interface, 121-126.
Wegman, E. J. and Carr, D. B. (1993), "Statistical graphics and visualization," in Handbook of Statistics 9: Computational Statistics, (Rao, C. R., ed.), Amsterdam: North Holland, 857-958.
Wegman, E. J., Carr, D. B. and Luo, Q. (1993), "Visualizing multivariate data," in Multivariate Analysis: Future Directions, (Rao, C. R., ed.), Amsterdam: North Holland, 423-466.
Wegman, E. J. and Luo, Q. (1997), "High dimensional clustering using parallel coordinates and the grand tour," Computing Science and Statistics, 28, 352-360, republished in Classification and Knowledge Organization, (R. Klar and O. Opitz, eds.), Berlin: Springer-Verlag, 93-101, 1997.
Wegman, E. J., Poston, W. L., and Solka, J. L. (1998), "Image grand tour," Automatic Target Recognition VIII - Proceedings of SPIE, 3371, 286-294.
Wegman, E. J. and Shen, J. (1993), "Three-dimensional Andrews plots and the grand tour," Computing Science and Statistics, 25, 284-288.

Wilhelm, A. F. X., Wegman, E. J., and Symanzik, J. (1999), "Visual clustering and classification: The Oronsay particle size data set revisited," Computational Statistics, 14(1), 109-146.

Software References

Mason Hypergraphics, copyright (c) 1988, 1989 by Edward J. Wegman and Masood Bolorforoush, an MS-DOS package for high-dimensional data analysis. Originally programmed in Turbo Pascal, the software contained parallel coordinate plots, scatterplot matrices, grand tour linked to parallel coordinates, anaglyph stereo 3-D scatterplots, and parallel coordinate density plots. It is still available at ftp://www.galaxy.gmu.edu/pub/software/hypergra.zip.

ExplorN, copyright (c) 1992 by Daniel B. Carr, Qiang Luo, Edward J. Wegman, and Ji Shen, a UNIX package for Silicon Graphics workstations incorporating scatterplot matrices, stereo ray glyph plots, parallel coordinates, and the $d$-dimensional grand tour. Recent versions also include saturation brushing. The code was written using Silicon Graphics proprietary GL graphics subroutines and, hence, only runs on Silicon Graphics workstations. The package is available at ftp://www.galaxy.gmu.edu/pub/software/explorn_v1.tar.

CrystalVision, copyright (c) 2000 by Crystal Data Technologies, is a Windows 95/98/NT package for Wintel computers. The software incorporates scatterplot matrices, stereoscopic 3-D scatterplots using Crystal Eyes technology, parallel coordinate plots, $d$-dimensional grand tours and partial grand tours, saturation brushing, and density plots. The code was constructed using OpenGL and will run on any modern Wintel computer. A demonstration version of CrystalVision is available at ftp://www.galaxy.gmu.edu/pub/software/crystalvisiondemo.exe