Interactive Clustering for Data Exploration

Joel R. Brandt, Jiayi Chong, Sean Rosenbaum
Stanford University

Figure 1: A complete view of our system. In the top left, the Solution Explorer is shown. Below this is the Member Table, and to the right are the visualizations of two solutions.

ABSTRACT

Clustering algorithms are widely used in data analysis systems. However, these systems are largely static in nature. There may be interaction with the resulting visualization of the clustering, but there is rarely interaction with the process. Here, we describe a system for visual data exploration using clustering. The system makes the exploration and understanding of moderately-large ( instances) multidimensional (10-20 dimensions) data sets easier.

Clustering is a natural part of the human cognitive process. When shown a set of objects, an individual naturally groups and organizes these items within his or her mind. In the domain of machine learning, unsupervised clustering algorithms have been developed to mimic this intrinsic process. Yet these algorithms are usually employed within rigid frameworks, removing the fluid, exploratory nature of human clustering. In this paper, we present a visual system that uses clustering algorithms to aid data exploration in a natural, fluid way.

CR Categories: H.5.0 [Information Systems]: Information Interfaces and Presentation - General; I.5.3 [Computing Methodologies]: Pattern Recognition - Clustering

Keywords: clustering, data exploration, interaction

1 INTRODUCTION

1.1 Motivation

Clustering techniques are widely used to analyze large, multidimensional data sets [1, 2]. However, this use is typically static in nature: the user loads a data set, selects a few parameters, runs a clustering algorithm, and then views the results. The process stops there; clustering is used simply to analyze the data, not to explore it.
We believe that with the right visualization environment, clustering can be used to provide a very natural way for users to explore complex data sets. For example, when given a small, low-dimensional data set to explore (such as a collection of objects on a table), a typical individual intuitively groups, or clusters, similar items mentally. The individual may then compare clusters, break up individual groups by different attributes, completely re-cluster the set based on different attributes, and so on. Without aid, however, both the size of the data set and the types of attributes that an individual can operate on are quite limited. Our system allows the user to perform these intrinsic operations within a much larger space.

1.2 Major Contributions

Our system makes contributions in two main areas: data exploration techniques and visualization. More specifically, our system provides an intuitive mechanism for visually exploring moderately-large multi-dimensional data sets; supports a fluid, iterative process of clustering, refinement, and re-clustering to enable this exploration; and proposes a novel, faithful visualization of high-level characteristics of the clustering results.

1.3 Organization

The rest of this paper proceeds as follows. In Section 2 we begin with an analysis of prior work in this area. We then detail our data exploration and visualization contributions in Sections 3 and 4 respectively. In Section 5, we give a complete example of the system in use. Finally, we conclude in Section 6 with a plan for future work.

2 PRIOR WORK

A great deal of prior work has been done in the areas of visualizing clustering results and interacting with these visualizations. Comparatively little work has been done on interacting with the clustering process itself. We examine each of these areas in turn, and then consider some related work that leverages techniques other than clustering.

2.1 Visualization of Clustering Results

Given the complex output of clustering algorithms, good visualizations are necessary for users to interpret the results accurately.
Visualizations generally display either micro-level or macro-level characteristics of the clustering. Micro-level visualizations show explicitly the pairwise relations between instances. Conversely, macro-level visualizations attempt to express the quality of the clustering result, such as the size and compactness of each cluster and the separation of clusters relative to each other. We believe that understanding macro-level characteristics is most useful for data exploration, whereas interpreting micro-level characteristics lies in the domain of data analysis. This distinction is described in detail in Section 3.1. Here, we examine existing visualizations for each class of characteristics separately.

2.1.1 Micro-level Visualizations

Many micro-level visualizations begin by projecting the clusters into 2-dimensional or 3-dimensional space [5, 8]. The projection is chosen, as far as possible, such that nearby items lie in the same cluster, and nearby clusters are similar. However, 3-dimensional visualizations are often unintuitive due to occlusion and depth-cuing problems. Likewise, 2-dimensional projections are often problematic because of the difficulty of accurately projecting high-dimensional data into a low-dimensional space. Lighthouse [5] takes an interesting approach by allowing the user to switch between 2- and 3-dimensional views. The usability of this feature, however, is not well studied.

A colored matrix representation is another widely used method for visualizing clustering results [1, 2, 9, 10]. In this representation, instances lie on one axis, and features lie on the other. Each cell (corresponding to an instance/feature pair) is colored according to the value of that feature for that instance. The instances are ordered by a hierarchical clustering method so that similar rows appear next to each other.
gcluto [9] takes this a step further by allowing the user to also cluster the transpose of the data set, and sort the features by similarity.

Alternatively, this same colored matrix representation is often used to express pairwise distances between all instances. In this visualization, each instance lies on both axes. Each cell is colored according to the relative distance between the two instances represented. Hierarchical clustering methods are used to produce an ordering on the axes, so that the majority of cells corresponding to nearby points lie near the diagonal.

2.1.2 Macro-level Visualizations

Comparatively little effort has been devoted to producing compelling macro-level visualizations. gcluto presents a Mountain visualization technique. The centroid of each cluster is projected into the plane as mentioned above. Then, at each centroid, a mountain is formed. Attributes of the mountain are mapped to attributes of the clusters: the height is mapped to the internal similarity of the cluster, the volume is mapped to the number of objects in the cluster, and the color of the peak is mapped to the internal standard deviation of the cluster's objects. While these are all important attributes to consider, the method of displaying them is arguably somewhat unintuitive.

2.2 Interaction with Clustering

Work on interaction with clustering algorithms is best divided into two categories: interaction with the result set and interaction with the clustering process.

2.2.1 Interaction with the Result Set

Most commonly, systems provide a means of interacting with the result set. Because these result sets are often too large to be represented in their entirety on a typical display, these interactions usually center around hiding data. When hierarchical clustering is performed, visualization tools often provide a means for collapsing portions of the hierarchy.
The Hierarchical Clustering Explorer [10] supports this through dynamic queries, and gcluto [9] supports this through a typical expandable tree structure. While these methods are effective for reducing screen clutter, little semantic meaning is tied to the directional branching of the tree, so it can be difficult to select only the regions of interest.

Domain-specific methods of interacting with the result set are also common [5, 8]. For example, Lighthouse produces a visualization of web search results using clustering, and then allows the user to select a point in the representation to visit the corresponding site. Such domain-specific interactions are of little interest in this work.

Finally, many clustering visualizations use detail-on-demand techniques [5, 9, 10]. Positioning a cursor over a particular data point, for example, often brings up a small window with metadata about the corresponding instance. Such techniques are necessary because of the large amount of data being displayed.

Figure 2: The data exploration pipeline for our system (stages: Visualize Solutions, Subset Data, Generate Sub-problem, Cluster).

2.2.2 Interaction with the Clustering Process

Systems that allow interaction with the clustering process are somewhat rarer. Many systems let users define initial cluster centroids in a visual way, rather than choosing them randomly. The value of such a system, however, is unclear: if it is easy for the user to select centroids, it is probably unnecessary to cluster the data! Some systems go a bit further and allow the user to interact with the clustering process as it is occurring. For example, Looney's work [6] allows the user to eliminate or merge clusters at various steps in the algorithm. While this work takes strides to solve some of the major problems with clustering, it requires that the user understand the data set in order to produce a result. We seek exactly the opposite paradigm: the user iteratively produces results in order to understand the data set.

2.3 Alternatives to Clustering

Self-organizing maps are an unsupervised learning technique built on neural networks. They have been widely explored as a tool to aid both visualization and data exploration [3, 4]. They are typically employed to aid in the production of low-dimensional visualizations of high-dimensional data. The benefits of self-organizing maps are somewhat contrary to the goals of this work: they automatically reduce the dimensionality of the data, while providing little evidence of why a particular projection was chosen. Instead, we enable the user to explore and re-weight dimensions at his or her discretion, helping the user to understand links between these dimensions.

3 DATA EXPLORATION

The principal goal of our system is to enable intuitive data exploration of moderately-large, multi-dimensional data sets. In this paper, we center our data exploration process around k-means, a straightforward clustering algorithm [7].
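As a point of reference for the clustering step, the following is a minimal sketch of k-means in which each attribute carries a user-chosen weight. The paper does not say how weights enter the computation; scaling each squared difference by its weight is an assumption made here for illustration, and the names `weighted_distance` and `kmeans` are hypothetical rather than taken from the system.

```python
import math
import random


def weighted_distance(a, b, weights):
    """Euclidean distance with each squared difference scaled by its weight.

    Assumption: this is one plausible way to apply attribute weights; a weight
    of 0 removes an attribute from the clustering entirely.
    """
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))


def kmeans(points, k, weights, iterations=50, seed=0):
    """Lloyd-style k-means over a list of points; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Start from k distinct instances chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every instance to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: weighted_distance(p, centroids[c], weights))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return centroids, labels
```

Calling `kmeans(points, 5, weights)` with a weight of 0 on every attribute outside the first month would correspond to clustering on the first month's attributes only, under the weighting assumption above.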
We believe, however, that our data exploration pipeline applies when user interaction is coupled with any automated technique. We begin by explaining the difference between data exploration and data analysis. Then, we discuss the details of our data exploration pipeline.

3.1 Data Exploration versus Data Analysis

The distinction between data analysis and data exploration seems subtle at first. Most simply, in data analysis, the user knows what he or she is looking for; in data exploration, the user does not.

Data analysis tasks typically investigate specific data instances, and their relation to other instances. The analyst usually has a good understanding of the structure of the data set he or she is working with. That is, the relations between attributes are typically well understood, or at least the characteristics of a particular attribute are well known. The examination of a gene array clustering, for example, is a typical data analysis task [1, 2]. The analyst knows what each gene is, and what each experiment is, and is attempting to determine which genes respond in similar ways to particular experiments. Such a task is completely static: a clustering is produced, visualized, and analyzed.

Data exploration tasks are those which attempt to uncover the general structure of the data. Here, the user may not know which attributes best separate or explain the data, may not know the relationships between attributes, and may not even know which attributes are useful. However, the user is likely to have domain knowledge about the data set being explored. For example, the user may have high-level knowledge about instances in the data set, and may be interested in determining which attributes are most useful in predicting or explaining that knowledge.

Data exploration is an iterative process of discovery. As such, tools for data exploration must support this iterative search.
Specifically, we believe tools for data exploration must make it easy for the user to explore the data along multiple paths, create branches in those exploration paths, and compare various exploration paths.

3.2 The Data Exploration Pipeline

The use of our system centers around our Data Exploration Pipeline, shown in Figure 2. The user explores the data by iterating through this pipeline. After loading the data set, the user is presented with a visualization of a solution to a trivial clustering problem: clustering all of the data into one cluster. From this visualization, the exploration begins:

1. The user explores a solution, visualizing it through the techniques described in Section 4.
2. The user selects a subset of the data to continue exploring. This subset may, of course, be the entire set.
3. The user generates a sub-problem using this subset of the data. This involves choosing the values of several parameters, such as the number of clusters to form, which attributes to use when clustering, and the relative weights of each of those attributes.
4. The clustering is performed and the sub-solution is stored.
5. The process repeats using the new sub-solution.

Of course, the user has more control over the pipeline than what is given here. For example, the user can generate several different sub-problems from any solution, varying the parameters (and even the sets) in each sub-problem. As a result of this flexibility, the pipeline results in a hierarchy of clustering solutions, where each sub-solution is a refinement of a subset of its parent solution. Furthermore, as will be discussed in Section 4, we allow the user to open visualizations of as many solutions as desired, and to link the display of these solutions so that similarities and differences can easily be seen. In this way, it is easy for the user to explore the effects of clustering using different features and parameters.

When each solution is generated, we keep track of the parameters used in the clustering, as well as the parameters defining the subset of instances to be clustered. With this information, if new data is added to the system (for example, if we want to classify additional instances), we can place the new instances in the appropriate clusters in all solutions within the system.

Figure 3: A view of an individual cluster. The centroid is shown in red. The currently selected point is shown in blue. Points that lie close to the selected point (in high-dimensional space) are shown in gray.

4 VISUALIZATION TECHNIQUES

In this section, we examine the visualization techniques used to support the data exploration pipeline discussed in Section 3. We devote the majority of our attention to the techniques used to visualize a particular solution, and to compare several solutions, as this is the novel portion of our system. However, in Section 4.2.1, we discuss the interfaces for managing a hierarchy of solutions and for generating new solutions.

The visualization techniques presented here have been developed with the goals of effectively representing macro-level characteristics of clustering results and enabling intuitive comparison of multiple clustering solutions. These are the techniques required for data exploration. While we provide some drill-down into the micro-level characteristics of a solution as a part of our brushing and highlighting techniques, these characteristics are not our primary concern.
Investigation of micro-level characteristics lies mainly in the domain of data analysis rather than data exploration. We believe that much of the prior work discussed in Section 2 accomplishes the data analysis task successfully. So, in a complete system for both data exploration and analysis, we propose marrying our new techniques with extensions of existing analysis techniques. This is discussed further in our section on future work (Section 6.1).

4.1 Small Multiple Histograms for Cluster Visualization

In the simplest sense, we visualize a clustering solution as a collection of histograms. Each cluster is represented by a histogram, as shown in Figure 3. The centroid of the cluster is placed on the left of the histogram, and the instances are arranged according to their Euclidean distance from the centroid. All of the histogram axes are scaled the same within one clustering solution. Complete sets of small multiples for a solution can be seen in Figures 1 and 7.

Figure 4: Four views of the same centroid histogram. In each histogram, the centroid indicated is used as the basis.

4.1.1 Centroid Histogram

Furthermore, we produce a histogram layout of the centroids that mimics the individual cluster histograms. The user is able to select the centroid to serve as the basis, and the other centroids are placed in the histogram according to their Euclidean distance from the basis centroid. (This rocking of the basis element is discussed further in Section 4.1.3.) Examples of this visualization can be seen in Figure 4.

Together, these histograms provide an intuitive summary of macro-level cluster characteristics. Cluster size and distribution can easily be seen within one cluster histogram. Comparing cluster histograms gives an understanding of the relative compactness of each cluster. Finally, the centroid histogram summarizes the inter-cluster separation.

4.1.2 One Dimension versus Two Dimensions

Our histograms can be thought of as a one-dimensional projection of the data.
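The projection behind these histograms can be sketched in a few lines: every instance is reduced to its Euclidean distance from a basis point, and the distances are binned on an axis shared by all clusters of a solution. This is an illustrative sketch only; the function names, the fixed number of bins, and the choice of the largest distance as the axis extent are assumptions, not details given in the paper.

```python
import math


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length coordinate sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distances_from(basis, points):
    """Distance of every point from the basis (for example, the cluster centroid)."""
    return [euclidean(basis, p) for p in points]


def histogram(distances, bin_width, num_bins):
    """Count distances per fixed-width bin; bin 0 sits next to the basis."""
    counts = [0] * num_bins
    for d in distances:
        # Clamp so the largest distance falls in the last bin.
        counts[min(int(d // bin_width), num_bins - 1)] += 1
    return counts


def solution_histograms(clusters, num_bins=10):
    """clusters: list of (centroid, members) pairs for one solution.

    Returns (bin_width, histograms) with one histogram per cluster.
    """
    per_cluster = [distances_from(c, members) for c, members in clusters]
    # One shared axis scale per solution, so clusters can be compared directly.
    max_d = max((d for ds in per_cluster for d in ds), default=0.0)
    bin_width = max_d / num_bins if max_d > 0 else 1.0
    return bin_width, [histogram(ds, bin_width, num_bins) for ds in per_cluster]
```

The centroid histogram follows from the same helper: `distances_from(basis_centroid, other_centroids)` gives the position of each remaining centroid relative to the chosen basis.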
At first consideration, it may seem that projecting into one dimension (instead of two or three) gives up a great deal of flexibility. However, we believe that when combined with the decoupling of inter- and intra-cluster characteristics mentioned above, our histograms lead to a more faithful representation of the macro-level characteristics than would be possible in a typical two-dimensional projection.

Consider a projection of all instances into two dimensions using a technique such as multi-dimensional scaling. In such a technique, one attempts to find the projection that preserves the pairwise distances between instances to the greatest extent possible. In most cases, the actual distance between two points will be well reflected by their distance in projected space. However, for some pairs of instances, such results may be impossible to achieve. It may be that the best projection overall still places a significant number of points close to other points that are actually far apart. Such misrepresentations make understanding macro-level characteristics of the clustering result more difficult using this representation.

Furthermore, even without these misrepresentations, the decoupling of inter- and intra-cluster characteristics in our method is not easily achievable in a two-dimensional projection of all data. Instead, the user must segment the space mentally to perform such comparisons. Our decoupling allows the user to more easily examine the characteristics he or she is concerned about, without having to block out additional information.

As has been mentioned in Section 2, other two-dimensional techniques, such as colored matrix representations, exist for expressing cluster results. These techniques, however, are better suited to micro-level examination, which is outside the scope of this work.

4.1.3 Rocking

Rocking is a technique often used to solve depth-cuing problems when visualizing three-dimensional point data in two dimensions. If the point set is rotated slightly, the necessary depth cue is provided: points in front move one direction while points in back move the opposite direction.

We borrow this idea of rocking to improve our histogram displays. In our histograms, it is often the case that two distant points will end up nearly the same distance from the centroid, and thus in the same place on the histogram. (Note that this is not a misrepresentation; we make no claim that distant points will end up far apart in the histogram.) We allow the user to select two points, one of which may be the centroid. After doing so, the first point is used as the basis for computing distances to build the histogram. A slider can be used to rock the basis point along the line between the first and second points. As before, points that move left are closer to the second point than the first, and conversely for points that move right. An example of this is shown in Figure 5.

As mentioned earlier, we allow a similar type of rocking within the centroid display. We allow the user to select a basis centroid by clicking.
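The rocking of a basis point between two selected points can be sketched as a linear interpolation followed by a recomputation of distances. The slider parameter `t` and the function names below are hypothetical; the paper describes the slider but not its parameterization, so mapping 0 to the first point and 1 to the second is an assumption.

```python
import math


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length coordinate sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rocked_basis(first, second, t):
    """Point on the line through the two selected points.

    Assumption: t = 0 gives the first point and t = 1 gives the second.
    """
    return [a + t * (b - a) for a, b in zip(first, second)]


def rocked_positions(points, first, second, t):
    """Histogram position (distance from the rocked basis) for every point."""
    basis = rocked_basis(first, second, t)
    return [euclidean(basis, p) for p in points]


def movement(points, first, second, t):
    """Signed shift of each point relative to t = 0; negative means it moved left."""
    before = rocked_positions(points, first, second, 0.0)
    after = rocked_positions(points, first, second, t)
    return [a - b for a, b in zip(after, before)]
```

For example, with `first = (0, 0)` and `second = (10, 0)`, the points `(9, 1)` and `(-9, 1)` start at the same histogram position, yet a small positive `t` moves the former left and the latter right, which separates two instances that the static histogram places together.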
When the user changes basis centroids, new positions for each centroid in the histogram are computed, and the movement between old and new positions is animated. As before, the movements express the relative locations of other centroids as compared to the two bases. Figure 4 shows all possible rockings of a single centroid histogram. From these views, it is clear that clusters 2 and 3 are located quite close to each other, whereas 1 and 4 are both far from each other and far from 2 and 3.

Figure 5: Rocking of a cluster. The two points used to compute the rocking line are shown in yellow. The amount of rocking is controlled by moving the slider.

4.2 Data Subsetting

A crucial part of the data exploration pipeline is the subsetting and reprocessing of data. We support data subsetting through dynamic queries as shown in Figure 6. The user makes a range selection simply by dragging the range selection brackets so that they enclose only the points of interest. Note that this range selection may be made with any point selected as the basis for distance calculation (and even while the basis point is being rocked). This allows the user to select points that are close to (or far from) any point, rather than making selections with respect to the centroid only.

Figure 6: A dynamic query within a cluster histogram, used for both data subsetting and range selection in view linking.

4.2.1 Generating Solution Hierarchies

Once the desired subset is selected within each cluster histogram, a new sub-solution may be generated. The Sub-Solution Generator window (not shown) presents a simple user interface to select the attributes for clustering, assign weights to those attributes, choose the number of clusters to be formed, and initiate the clustering.

When a new clustering solution is generated, it is placed in the Solution Explorer (shown in Figure 1) as a child of its parent solution. Any number of child and grandchild solutions may be created from any parent solution. In this way, the user is able to traverse multiple exploration paths, branching as desired. The Solution Explorer keeps each solution organized, providing a summary of the attributes used to generate the clustering.

4.3 Brushing and View Linking

With the generation of multiple solutions comes the need to explore them in concert. We support this exploration through brushing and view linking.

As shown in Figure 3, brushing is used to highlight instances near a selected instance (in high-dimensional space). When a user hovers over an instance, it is highlighted in blue. Additionally, all nearby instances are colored gray. Note that this highlighting is somewhat akin to rocking: only the instances that are actually near the instance of interest are highlighted.

In addition to highlighting close-by instances within the cluster, we also highlight instances corresponding to the selected instance in all other views. Similarly, we highlight all instances contained within the active region of the currently selected cluster in all solution visualizations. (The currently selected cluster is defined by the chosen basis centroid in the currently focused window.) This highlighting is shown in Figure 7.

View linking allows the user to easily visualize the relative consistency of clustering between different solutions using different attributes and weights. In this way, relations between attributes can be easily found and understood, helping the user uncover the structure of the data, the ultimate goal of data exploration.

4.4 Micro-level Data Examination

A minimal amount of support is provided within the system for micro-level data examination. The member table, shown in Figure 1, lists all instances in the currently selected cluster.
Brushing is supported between the member table and solution visualization: the selected instance in the visualization is highlighted in the member table, and likewise, a selected instance in the member table is highlighted in the solution visualization. Finally, if a range selection is made within the active cluster, only the selected instances are shown in the member table.

Support for micro-level data examination could be greatly enhanced by building upon much of the prior work mentioned in Section 2. We discuss our plans for this further in Section 6.1. Nonetheless, we believe that the linking of brushing between macro- and micro-level visualizations presented here would prove to be a very useful feature regardless of the micro-level visualization used.

5 EXAMPLE USE

In this section, we present a brief example of one possible use of our system. Consider a network administrator who is attempting to locate machines that are behaving atypically. The administrator believes that the usage patterns of most machines stay consistent from month to month, and that a change in behavior might be an indication of an intrusion or other exploit. However, she has a large number of traffic metrics available to her, and is not sure which of these metrics best express a machine's usage pattern.

The network administrator begins her data exploration by compiling a variety of traffic metrics for each month for each machine. Each of these values becomes an attribute. Each instance (a machine) has a group of attributes for each month of data, resulting in n × m attributes, where n is the number of traffic metrics, and m is the number of months. All of this data is loaded into the system.

She first decides to cluster the entire data set into 5 clusters using all of the first month's attributes. To do this, she opens the initial solution (a clustering of everything into one cluster) present in the Solution Explorer.
She does not need to subset the data, so she simply uses the Sub-Problem Generator to define her clustering problem: she selects the attributes of interest and chooses 5 clusters.

After the clustering is completed, she visualizes the new solution. She observes that two clusters contain most of the instances. She quickly hovers over the few instances that lie in the other three clusters. She discovers that all of these machines are servers of one sort or another. Since she watches these machines closely using other tools, she decides to exclude them from her exploration. She makes an empty range selection in each of these clusters, leaves the entire range selected in the two dense clusters, and opens up the Sub-Problem Generator again.

She decides now to try to confirm her theory that the behavior of most machines does not change from month to month. Using the subset of machines selected, she produces two new sub-solutions: one using the first month's attributes, and one using the second month's attributes. She opens up both solutions, and utilizes the view linking features to explore their similarity. This exploration is shown in Figure 7. She selects each cluster in the first month's visualization in turn. As she does so, the corresponding instances in the second month's visualization are highlighted. As is shown in Figure 7, the clusters stay relatively consistent.

She quickly examines those instances that change clusters. For some, she easily observes that the machines are outliers in both clustering solutions. This suggests that the attributes or weights in use may not be optimal. For others, she uses her domain knowledge to explain the differences. For example, perhaps one of the instances is a machine that was added in the middle of the first month. For a few others, she decides to do further investigation.

From this point, her data exploration process could go in any number of directions.
She could produce several clustering solutions for the same month using different attributes to explore links between the attributes. She could produce clustering solutions of varying sizes for the same attributes, giving her a clearer picture of how many types of machines she really has. The system easily supports these and many other tasks. Furthermore, once she has determined the set of attributes and weights that best characterizes her data, she can carry this information over into the data analysis domain and use these same clustering techniques in her daily network monitoring.

6 CONCLUSION

We have presented a visualization system that harnesses clustering algorithms to make exploration of moderately-large, high-dimensional data sets more intuitive. An iterative process of visualization, query refinement, and re-processing was presented that we believe accurately represents the ideal data exploration process. Furthermore, we proposed a novel method for visualizing macro-level clustering characteristics. Finally, we showed an example use of our system.

6.1 Future Work

We believe that a complete solution would provide means for both data exploration and data analysis. In this work, we have only explored the domain of data exploration. We are interested in augmenting the visualizations presented here with adaptations of some of the techniques presented in Section 2 to produce such a complete system.

We also plan to characterize more clearly the trade-offs between existing two-dimensional visualizations and the one-dimensional approach presented here when visualizing macro-level characteristics of clustering results.

Furthermore, we plan to explore ways to afford the user more control over the visualizations produced using our techniques. For example, the user could control the distance metric used to perform the layout separately from the weights used in clustering. This would allow exploration of the tightness of various dimensions. Similarly, we would like to investigate ways to make rocking more general and intuitive.

We also plan to explore the affinity of other clustering techniques (as well as other non-clustering-based machine learning techniques) to our data exploration framework.

Ultimately, an in-depth user study of an extended version of this system would be quite valuable. This is the only way to accurately judge the usefulness and applicability of these techniques.

Figure 7: An example of view linking. In each group, the same two solutions are shown. The cluster indicated at the bottom of each group is selected in the left visualization. Instances in the right visualization are colored blue if they are members of the cluster selected on the left.

7 ACKNOWLEDGEMENTS

We would like to thank Alexis Battle, Dan Ramage, and Ling Xiao for their helpful insights. We would also like to thank Pat Hanrahan and all of the Winter 2006 CS448b class for their useful comments.

REFERENCES

[1] U. Alon, N. Barkai, D. A. Notterman, K. Gish, S. Ybarra, D. Mack, and A. J. Levine. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. In Proceedings of the National Academy of Sciences, volume 96.
[2] Amir Ben-Dor, Ron Shamir, and Zohar Yakhini. Clustering gene expression patterns. Journal of Computational Biology, 6(3/4).
[3] Arthur Flexer. On the use of self-organizing maps for clustering and visualization. In Principles of Data Mining and Knowledge Discovery, pages 80-88.
[4] M.Y. Kiang, U.R. Kulkarni, and Y.T. Kar. Self-organizing map network as an interactive clustering tool: an application to group technology. Decision Support Systems, 15(4), December.
[5] Anton Leuski and James Allan. Lighthouse: Showing the way to relevant information. In INFOVIS.
[6] Carl G. Looney. Interactive clustering and merging with a new fuzzy expected value. Pattern Recognition, 35(11), November.
[7] J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, volume 1.
[8] Sougata Mukherjea, James D. Foley, and Scott E. Hudson. Interactive clustering for navigating in hypermedia systems. In ECHT '94: Proceedings of the 1994 ACM European conference on Hypermedia technology, New York, NY. ACM Press.
[9] Matt Rasmussen and George Karypis. gcluto: An interactive clustering, visualization, and analysis system. Technical Report TR-04021, University of Minnesota.
[10] Jinwook Seo and Ben Shneiderman. Interactively exploring hierarchical clustering results. IEEE Computer, 35(7):80-86, July 2002.