2D, 3D and High-Dimensional Data and Information Visualization


University of Hannover
Institut für Wirtschaftsinformatik (IWI)

Kim Bartke
Tutor: Prof. Michael H. Breitner
Seminar on Data and Information Management, SS 05

Contents

1 Introduction
2 Data visualization in knowledge discovery
2.1 Data types
3 Data visualization techniques
3.1 Two-dimensional data
3.2 Three-dimensional data
3.3 High-dimensional data
3.3.1 Icon-based methods
3.3.2 Hierarchical methods
3.3.3 Geometrical methods
4 Projection techniques
4.1 Multi-dimensional scaling (MDS)
4.2 Self-organizing maps (SOMs)
5 Interaction
5.1 Filtering
5.2 Linking and brushing
5.3 Zooming
5.4 Manipulation of data
6 Software and Applications
7 Outlook and Conclusion

List of Figures

1 Scatterplot of car data set [Hoffmann99]
2 Total crimes
3 3D linegraph (surface) [generated with Matlab]
4 Chernoff faces [Ward99]
5 Star glyphs [Oellien03]
6 Dimensional stacking
7 Fractal foam [Hoffmann99]
8 Parallel coordinates for points A, B, C
9 Parallel coordinates (Iris data set) [Grinstein01]
10 Parallel coordinates (Iris data set) [Hoffmann99]
11 Andrews curves for data points A, B, C
12 Andrews curves (Iris data set) [Hoffmann99]
13 Scatter plot matrix (Car data set) [Hoffmann99]
14 RadViz (Iris data set) [Grinstein01]
15 PolyViz (Iris data set) [Grinstein01]
16 Hyperbolic patch embedded into $\mathbb{R}^3$
17 Circle Limit III (1958)
18 Self organizing map of the Iris data set [Grinstein01]
19 Tramnetwork F+C technique [Keahey]

1 Introduction

We do it in the city, in the countryside and on mountains; we even do it in space. You may be asking yourself what I am talking about: we collect data. Since the beginning of mankind, people have gathered data. They used to do it by hand, counting mammoths or cattle, watching the sun and the clouds. Nowadays we use electronic devices to inspect and monitor our environment. Police departments keep huge databases about criminals and crime. All over the planet weather stations collect meteorological data, and in space large space shuttles examine planets, their atmospheres and the Earth. These electronic devices enable us to collect a vast amount of data, and today's computers make it possible to store and process it.

The reason we make the effort to accumulate data in such quantities is the information it can yield. Police officers, for example, endeavour to find patterns in criminals' behaviour so that they can react to crimes more quickly and more effectively, meteorological stations are used to predict the weather and, who knows, we may even find signs of life in outer space. So we have two main goals in our data analysis: the generation of hypotheses and their verification. This means that we use the gathered information either to describe the actual situation or to make predictions about the behaviour of future data.

As mentioned before, computers allow us not only to store data but also to process large amounts of it. In order to discover knowledge we often use numerical solutions such as data mining algorithms. But as both the number of variables (or dimensions) and the number of cases increase, it becomes very demanding to recognize patterns in the data. It is also difficult to understand the data structure itself and the exploration process. Numerical solutions are, moreover, very susceptible to noise and incorrect data. Presenting the data visually can remedy these problems.

Data visualization is the mapping of data into a Cartesian space. This method integrates a person's creativity and expertise into the knowledge discovery process and therefore allows a symbiosis of computational power and our visual abilities. Visualization also gives the user the opportunity to interact with the computer, e.g. to change the data and observe the reactions in order to gain a better insight into the data structure.

The greatest challenge in visualizing data is to find a good spatial representation. Most of the time, the data that needs to be presented is high-dimensional. In order to communicate the information we therefore have to scale the data so that we can display it in two-dimensional (2D) or three-dimensional (3D) coordinate systems while losing as little information as possible. This is fairly easy for scientific data. Problems arise if the data depends on nonspatial variables such as customer satisfaction or purchase channels.

In this paper I will present the uses of data and information visualization in the knowledge discovery process by explaining and evaluating data visualization techniques for both 2D and 3D representation. I will then deal with human-machine interaction methods and provide some examples of software packages and their application. I will conclude with a preview of the direction of further research, e.g. the further development of visualization tools.

2 Data visualization in knowledge discovery

As mentioned before, knowledge discovery in databases (KDD) is the process of gaining interpretable information from a set of raw data. Fayyad et al. [Rhodes02] distinguish between KDD, the overall process of discovering knowledge, and data mining, the method of extracting patterns from the data. Bearing this in mind, the following steps in the KDD process can be identified [Oellien03, p. 2].

The first step is pre-processing and data preparation. This includes selecting the part of the data set that seems suitable for the subsequent process. It is also very important to apply noise reduction algorithms and to exclude false data points so that the result reflects the nature of the data as well as possible. The data set might also need to be adjusted. Normalization is a very useful and frequently performed transformation. Another key technique is the projection of high-dimensional data into two- or three-dimensional space.

The second step is the actual use of data mining algorithms. The goal of this step is the classification, clustering and summarization of the data [Fayyad02]. It culminates in pattern recognition.

Finally, the information gained from the data needs to be communicated to the user(s).

All these elements of the analytic process are often carried out with the help of visualization. It is this visualization that this paper will focus on. Since the techniques always depend on the data type, this chapter starts with an overview of data types.

2.1 Data types

Every data set has a general structure. It is always characterised by a group of variables (also called dimensions) and the records the database contains. One way of categorizing data is to differentiate between sets that can be described by dimensionality and sets that cannot [Keim02].

The first group consists of one-dimensional, two-dimensional, three-dimensional and high-dimensional data sets. The variable in one-dimensional data is usually time; an example is the log of interrupts in a processor. Two-dimensional data can often be found in statistics, such as the number of financial transactions in a certain period of time. Three-dimensional data can be positions in three-dimensional space or points on a surface that vary with time (the third dimension). High-dimensional data comprises all data sets that have more than three variables under consideration. Examples are locations in space that vary with time (here time is the fourth dimension) or any other combination of more than three variables, e.g. product, channel, territory, period and customer's income.

In the second group we distinguish between text and graphs. Especially since the birth of the World Wide Web, analysing text (or in this case hypertext) has become more and more important. Text itself is not easily analysed by data visualization techniques, so a transformation into numbers is necessary. One technique for this change from textual to numerical data is word counting. A graph is a set of objects, called nodes, and connections between these objects, called edges [Keim02]. Relational databases are examples of this type of data set.

In the following chapters I will present several data visualization techniques for two-, three- and high-dimensional data sets. I will also give an overview of two non-linear projection techniques that reduce the size of high-dimensional data, namely multi-dimensional scaling (MDS) and Kohonen's Self-Organizing Maps (SOM) as an example of a neural network algorithm.

© 2005 Kim Bartke: 2D, 3D and High-Dimensional Data and Information Visualization

3 Data visualization techniques

3.1 Two-dimensional data

Two-dimensional data can be visualized in different ways. A very common form is the scatterplot. In a scatterplot the frame for the data presentation is a Cartesian coordinate system in which the axes correspond to the two dimensions. The data is usually represented by points in the first quadrant of the coordinate system (assuming the data values are not negative). If two or more data sets are displayed in the same coordinate system, different colours can be used to distinguish between the plots. A problem with this way of displaying data arises when the number of data points becomes very high, as the points become too dense. In order to avoid this, Becker suggests binning the data set [Sahling03]. The quality of the visualization then depends on the number of bins and their sizes. Figure 1 shows the distribution of miles per gallon (MPG) vs. horsepower for American (red), European (blue) and Japanese (green) cars.

Figure 1: Scatterplot of car data set [Hoffmann99]

Another important visualization technique for two-dimensional data is the linegraph. The difference from scatterplots is that here the relation between the dimension on the horizontal axis and the one on the vertical axis is definite. Figure 2 shows an example of a linegraph displaying the number of crimes in Niedersachsen for a period beginning in 1993. Extensions of linegraphs are survey plots. They can be obtained by turning the plot 90 degrees clockwise, halving the length of the rays and adding this half on the other side of the now vertical axis.

The last technique I would like to mention here is the visualization of data as barcharts.
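Binning, suggested above for dense scatterplots, amounts to counting the points that fall into each cell of a grid; the counts can then be shown by colour or as bars. The following is a minimal sketch under stated assumptions: the function name `bin_points`, the regular grid and the sample values are illustrative and are not taken from [Sahling03] or from the car data set.

```python
from collections import Counter

def bin_points(points, x_range, y_range, nx, ny):
    """Count how many (x, y) points fall into each cell of an nx-by-ny grid."""
    x_min, x_max = x_range
    y_min, y_max = y_range
    counts = Counter()
    for x, y in points:
        # Map each coordinate to a bin index; the upper edge goes into the last bin.
        i = min(int((x - x_min) * nx / (x_max - x_min)), nx - 1)
        j = min(int((y - y_min) * ny / (y_max - y_min)), ny - 1)
        counts[(i, j)] += 1
    return counts

# Hypothetical (horsepower, MPG) pairs, invented for illustration.
cars = [(60, 35), (65, 33), (70, 31), (155, 15), (160, 14), (200, 10)]
grid = bin_points(cars, x_range=(50, 200), y_range=(10, 40), nx=3, ny=3)
```

With fewer and larger bins the plot becomes less crowded but loses detail, which is the dependence on the number and size of the bins noted above.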
Applied to the data of Figure 2, a barchart representation would look the same as the linegraph but with the area under the graph filled in. Histograms are particular barcharts in which each bar stands for the sum of the data points in a class [Hoffmann02].

3.2 Three-dimensional data

The two-dimensional techniques can easily be extended to three dimensions. In scatterplots and barcharts the third dimension is obtained by adding a further axis, orthogonal to the other two. In a linegraph the additional dimension has the effect that the resulting plot is a surface. Figure 3 shows an example that has been generated with Matlab.

Figure 2: Total crimes

Figure 3: 3D linegraph (surface) [generated with Matlab]

A very widespread technique for visualizing the third dimension in a two-dimensional coordinate system is the use of colour or a variation of the data point size. Another very interesting technique is animation, for instance to show how the plot varies with time.

3.3 High-dimensional data

The visualization of high-dimensional data raises a very severe problem: the visualization space is limited to three dimensions, or even to only two, since data is usually displayed on screens or paper. One of the obstacles to discovering information in high-dimensional data sets that Mihalisin [Mihalisin02] points out is that techniques for extracting low-dimensional information and displaying it cannot automatically be employed for high-dimensional data, as the data set size is too large.

Next we have to study how the number of possible data sets grows if we increase the number of variables or the number of values they can hold. Consider the following example [Mihalisin02]: We have a data set consisting of six columns, which represent the attributes product, territory, sales channel, method of payment, time of payment and a unique identifier. Furthermore we have 100,000 rows representing the records. Our company sells five products in five different territories via two sales channels. We also offer two distinct methods of payment, and everything is divided into five quarters. This means there are $5 \cdot 5 \cdot 2 \cdot 2 \cdot 5 = 500$ possible cell results. 100,000 records, each having one of the 500 cell results, lead to

$$\frac{100{,}499!}{100{,}000!\,499!}$$

different data sets. This is a huge number (larger than the number of atoms in the universe!), and it is only a very small database.

Coming now to the different visualization techniques, we distinguish between icon-based, hierarchical and geometrical methods.

3.3.1 Icon-based methods

Icon-based methods are approaches that use icons (or glyphs) to represent high-dimensional data. They map data components to graphical attributes. The most famous technique is the use of Chernoff faces [Hoffmann02]. Here each data point is represented by an individual face whose features map the data dimensions. Five different eye sizes could correspond to the five products of the example above, and the mouth might symbolize the two methods of payment. This scheme exploits a person's ability to recognize faces. Figure 4 shows examples of Chernoff faces.

Figure 4: Chernoff faces [Ward99]

Probably the most common icon-based technique is the use of star glyphs to denote data points. A star glyph consists of a centre point with equally angled rays. These rays correspond to the different dimensions, and the length of each ray marks the value of that dimension for the data point under study. A polygon line connects the outer ends of the rays [Oellien03]. Figure 5 illustrates the star glyph approach.

These icon-based techniques are very vivid but have several disadvantages.
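The star glyph construction described above can be sketched in a few lines. This is an illustration under stated assumptions: the function name `star_glyph`, the scaling of each ray by a per-dimension maximum and the choice to let the first ray point along the positive x-axis are mine and are not specified in [Oellien03].

```python
import math

def star_glyph(values, max_values):
    """Return the polygon vertices of a star glyph for one data point."""
    n = len(values)
    vertices = []
    for k, (value, max_value) in enumerate(zip(values, max_values)):
        angle = 2 * math.pi * k / n    # rays are equally angled around the centre
        length = value / max_value     # ray length encodes the value of dimension k
        vertices.append((length * math.cos(angle), length * math.sin(angle)))
    return vertices

# One hypothetical four-dimensional data point and assumed axis maxima.
glyph = star_glyph([5, 1, 5, 1], max_values=[5, 2, 5, 2])
```

Connecting consecutive vertices, and the last back to the first, yields the polygon line of the glyph.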
A severe problem with icon-based techniques is the organisation of the glyphs on the screen, as no coordinate system representing two of the dimensions is provided. Even if you decided to use a Cartesian system, it would put more weight on these two dimensions and so probably distort the data pattern. Another obstacle is the number of variables and the size of the data set itself. If the number of rays becomes too high, the different spokes and the values they represent can no longer be distinguished. A similarly unclear picture emerges if the number of data points exceeds a certain amount.

Figure 5: Star glyphs [Oellien03]

3.3.2 Hierarchical methods

The most important representative of the group of hierarchical visualization techniques is dimensional stacking. It is a method of embedding coordinate systems recursively into each other [Grinstein02a]. Consider again the example with the five products, five territories, two sales channels, two methods of payment and five quarters [Mihalisin02]. First of all you have to select the two outermost dimensions. We choose the quarters and the pay types. Our horizontal axis is now divided into five parts, while the vertical axis is halved. We now decide that we would like the sales channel to be embedded into the method of payment, so each part of the pay type axis is further divided into two parts that represent the different channels. The axis corresponding to the quarters will embed the products, so these elements are subdivided as well. Finally the vertical axis also takes up the five territories. The resulting combined coordinate system can be seen in Figure 6. It shows that goods of product type four, sold in quarter one in territory four via the first sales channel and the first type of payment, are represented by the coloured rectangle. In order to visualize the number of data points you can use a colour or grey scale. According to the colour scale drawn next to the plot, the filled rectangle represents fewer than 40,000 items. This value is binned, since otherwise a clear visualization would not be possible.

The common depiction of the dimensional stacking technique is a bit more compact and not as nicely presented as the one above. In Figure 6 the rectangles are spaced apart to make it easier to distinguish the attribute combinations; usually they lie directly next to each other, separated only by a thicker line.
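The recursive embedding above is a mixed-radix numbering of the cells along each axis. The sketch below, with the hypothetical helper `stacked_index` and 0-based attribute values as assumptions of mine, locates the highlighted rectangle of the example.

```python
def stacked_index(levels):
    """Combine (value, size) pairs, outermost dimension first, into one axis position."""
    index = 0
    for value, size in levels:
        index = index * size + value   # each inner dimension subdivides the outer one
    return index

# 0-based values: quarter one -> 0, product four -> 3,
# pay type one -> 0, sales channel one -> 0, territory four -> 3.
column = stacked_index([(0, 5), (3, 5)])        # quarters embed products
row = stacked_index([(0, 2), (0, 2), (3, 5)])   # pay types embed channels, which embed territories
n_columns, n_rows = 5 * 5, 2 * 2 * 5            # size of the stacked grid
```

The grid has 25 columns and 20 rows, which together give the 500 cells of the example.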

Figure 6: Dimensional stacking

Dimensional stacking is very useful for hierarchical data sets that only have a small number of dimensions, as otherwise the embedding process makes the resulting plot too crowded. A great challenge is the question of labelling. The way chosen in the example is one possibility of naming the different variables in the plot.

A technique that recursively displays the correlation between dimensions (not the data itself!) [Hoffmann02] is the fractal foam. The starting point is a chosen dimension that is depicted by a coloured circle. Attached to this circle are further circles, which symbolize the other dimensions. The size of these circles corresponds to the correlation between the inner circle and the attached ones: a high correlation results in a large circle. Attached to the second layer of circles is a third layer, which describes the correlations of these dimensions, and so on. An example of fractal foam can be found in Figure 7.

3.3.3 Geometrical methods

Geometrical methods are a very large group of visualization techniques. Probably the easiest and most commonly used one is the method of parallel coordinates. Here the dimensions are represented by parallel lines, which are equally spaced. They are linearly scaled so that the bottom of each axis stands for the lowest possible value, whereas the top corresponds to the highest value. A data point is drawn into this system of axes as a polygonal line, which crosses each axis at the value the data point holds for that dimension. A simple example with three points and four dimensions is shown in Figure 8. The points displayed are A = (1; 3; 2; 5), B = (2; 4; 1; 6) and C = (1; 4; 3; 5).

This method is not limited to data sets that are as simple as this one. One of the familiar high-dimensional data sets used to explain data visualization techniques is the Iris data set.
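The construction of the polygonal lines for the points A, B and C can be sketched as follows. As an assumption of mine, each axis is scaled to the lowest and highest value observed in the data rather than to the lowest and highest possible value, and the function name `parallel_coordinates` is illustrative.

```python
def parallel_coordinates(points):
    """Return, for each point, its polyline vertices (axis index, scaled height)."""
    dims = range(len(points[0]))
    lows = [min(p[d] for p in points) for d in dims]
    highs = [max(p[d] for p in points) for d in dims]
    lines = []
    for p in points:
        line = []
        for d in dims:
            span = highs[d] - lows[d]
            # Bottom of the axis is 0.0, top is 1.0; a constant axis is centred.
            height = (p[d] - lows[d]) / span if span else 0.5
            line.append((d, height))
        lines.append(line)
    return lines

A, B, C = (1, 3, 2, 5), (2, 4, 1, 6), (1, 4, 3, 5)
line_a, line_b, line_c = parallel_coordinates([A, B, C])
```

Swapping two columns of the input before calling the function corresponds to the rearrangement of axes that parallel coordinates permit.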
The Iris data set consists of three different Iris types, namely Iris Setosa, Iris Versicolor and Iris Virginica. The variables of this data set are the sepal length, the sepal width, the petal length and the petal width, all measured in millimetres.

Figure 7: Fractal foam (sepal length - centre (white), petal length - right (red), petal width - top (yellow), sepal width - bottom (green)) [Hoffmann99]

Figure 8: Parallel coordinates for points A, B, C

Figure 9: Parallel coordinates (Iris data set) [Grinstein01]

As you can see in Figure 9, the parallel coordinate technique is a tool that enables you to find attributes that allow a categorization of the different flower types. In the diagram the petal width seems to be a good classifier for the red Iris type. It is also a fairly good attribute for distinguishing between the violet and the green flower category. A very significant feature of this visualization technique is that the dimensions are treated equally. This characteristic permits a rearrangement of the displayed dimensions, which gives another view of the data and might therefore reveal patterns (or classification attributes) that would otherwise be hidden in the current arrangement. Figure 10 shows the same Iris data set, but this time normalized and with the dimensions sepal width and sepal length swapped. The resulting graph looks very different and much clearer.

Another interesting geometrical visualization technique is the use of Andrews curves [Hoffmann02]. This method plots each data point as a function of its data values using a specific equation. The curves are usually drawn in the interval $-\pi < t < \pi$. The function that draws these curves is

$$f(t) = \frac{x_1}{\sqrt{2}} + x_2 \sin(t) + x_3 \cos(t) + x_4 \sin(2t) + x_5 \cos(2t) + \dots,$$

where $x = (x_1, x_2, \dots, x_n)$ and $x_k$ is the value of the data point in dimension $k$. Consider the three data points already used to explain the parallel coordinates technique (A = (1; 3; 2; 5), B = (2; 4; 1; 6) and C = (1; 4; 3; 5)).

For data point A, the function is $f_A(t) = \frac{1}{\sqrt{2}} + 3 \sin(t) + 2 \cos(t) + 5 \sin(2t)$.
For data point B, the function is $f_B(t) = \frac{2}{\sqrt{2}} + 4 \sin(t) + 1 \cos(t) + 6 \sin(2t)$.
For data point C, the function is $f_C(t) = \frac{1}{\sqrt{2}} + 4 \sin(t) + 3 \cos(t) + 5 \sin(2t)$.

If you plot these three data points into one coordinate system using Matlab, you obtain the result depicted in Figure 11. Applying this algorithm to the Iris data set mentioned before results in a graph (Figure 12) that looks slightly more complex.
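The three curves can also be evaluated outside Matlab. The following sketch implements the formula above for a data point of any dimension; the function name `andrews` and the sampling of 201 values of $t$ are illustrative choices, not part of [Hoffmann02].

```python
import math

def andrews(x, t):
    """Evaluate the Andrews curve of data point x at parameter t."""
    value = x[0] / math.sqrt(2)
    for k, coefficient in enumerate(x[1:], start=1):
        harmonic = (k + 1) // 2                       # 1, 1, 2, 2, 3, 3, ...
        trig = math.sin if k % 2 == 1 else math.cos   # sin, cos, sin, cos, ...
        value += coefficient * trig(harmonic * t)
    return value

A, B, C = (1, 3, 2, 5), (2, 4, 1, 6), (1, 4, 3, 5)
ts = [-math.pi + 2 * math.pi * i / 200 for i in range(201)]
curve_a = [andrews(A, t) for t in ts]   # sampled values of f_A, ready for plotting
curve_b = [andrews(B, t) for t in ts]
curve_c = [andrews(C, t) for t in ts]
```

At $t = 0$ only the constant and cosine terms remain, so for example $f_A(0) = 1/\sqrt{2} + 2$.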

Figure 10: Parallel coordinates (Iris data set) [Hoffmann99]

Figure 11: Andrews curves for data points A, B, C

Figure 12: Andrews curves (Iris data set) [Hoffmann99]

The advantage of Andrews curves is that they are easily applied to data with a large number of dimensions. The disadvantage is the long computation time, as every data point requires the evaluation of trigonometric functions [Hoffmann02].

A very basic technique for visualizing high-dimensional data is the application of multiple views. They are often used with scatterplots or barcharts, leading to an $n \times n$ cell matrix, where $n$ is the number of dimensions. Each cell of this matrix is then a scatterplot or a barchart respectively. This method is widely employed for data sets that contain diverse attributes. It reveals correlations and disparities between variables, since presenting the different component combinations next to each other allows a visual comparison of the possible connections. In the next example the method has been applied to the car data set, another set widely used to demonstrate visualization techniques. This table contains the combinations of miles per gallon (MPG), year of manufacture, cylinders, acceleration, horsepower and weight for three different car types. The red spots in Figure 13 symbolize American cars, the green ones Japanese cars and the blue ones European cars. The figure clearly shows a positive correlation between horsepower and weight, whereas the combination of MPG and weight reveals a negative correlation [Hoffmann02].

Figure 13: Scatter plot matrix (Car data set) [Hoffmann99]

Even though this method is a very functional tool in the visualization of data, it does have several disadvantages. A very problematic one is that the user becomes overwhelmed by the number of charts they have to evaluate and keep in mind while doing so. The use of space is a more practical aspect that needs consideration. The car example, with its six attributes, produces a matrix that is still manageable both to work with and to display. If the data set were extended to ten dimensions, for instance, the corresponding graph could no longer be presented clearly.

The last two techniques I would like to present in this paper belong to the group of anchor visualization methods. Both are fairly new approaches to the problem, the second being a further development of the first.

Radial Coordinate Visualization (RadViz) uses the spring paradigm [Hoffmann02]. From a centre point, $n$ equally spaced rays of the same length spread out, each representing one dimension. The ends of the rays mark the dimensional anchors (DAs) of the respective variables, which are connected to form a circle. Before the data points can be visualized by this technique they need to be normalized. After that, one end of a spring is fastened to each dimensional anchor and the other end to the data point. The spring constant of each spring is the value of the data point in the respective dimension. The data point is located where the sum of the spring forces equals zero. If you apply this method to the well-known Iris data set you obtain Figure 14. An advantage of RadViz is that it preserves certain symmetries of the data set [Hoffmann02]. The major disadvantage is the overlap of points.

Figure 14: RadViz (Iris data set) [Grinstein01]

The second dimensional anchor technique, which has been named PolyViz, remedies this. The resulting plot is a combination of RadViz and the barchart technique. It illustrates the DAs not as points, as in RadViz, but as lines, so that the graph becomes a polygon. This technique still shows the clustering of the data points in the middle of the polygon, as it uses the same spring paradigm. But it also makes it possible to study the distribution along the different dimensions, since it plots this scattering along the axes using the barchart technique [Hoffmann02] (Figure 15).

Figure 15: PolyViz (Iris data set) [Grinstein01]

All the techniques explained above visualize data sets without trying to change them in order to simplify the visualization. In the following chapter I will introduce non-linear projection methods that reduce the size of the dimension vector so that the data sets become easier to display.

4 Projection techniques

The general goal of projection techniques is to reduce the dimensionality of data in order to obtain a spatial mapping of the data set in the available space. I will distinguish between two methods that differ in their side condition. The first method, multi-dimensional scaling, tries to preserve the distances between data points, whereas neural networks focus on preserving structure.

4.1 Multi-dimensional scaling (MDS)

As stated above, multi-dimensional scaling focuses on the preservation of distance. The distance $d_{ij}$ is the Euclidean distance between the data points $x_i$ and $x_j$ in the $n$-dimensional space,

$$d_{ij} = \lVert x_i - x_j \rVert, \quad x_i \in \mathbb{R}^n, \quad i, j \in \{1, 2, \dots, N\}.$$

Multi-dimensional scaling attempts to reconstruct the distances between the data points in the $n$-dimensional space by determining dissimilarity vectors $\delta_{ij}$ for the subvector space. With non-linear MDS the relationship between the distance vectors and the dissimilarity vectors is not proportional. In order to determine the dissimilarity vector $\delta_{ij}$ we need to apply a monotone transformation $D(\cdot)$, which results in the disparity matrix $D_{ij} = D(\delta_{ij})$ [Walter02].

An extensively used algorithm of multi-dimensional scaling is Sammon's algorithm [Walter02]. Sammon came up with the following equation:

$$E(\{x_i\}) = \sum_{i=1}^{N} \sum_{j>i} w_{ij} \, (d_{ij} - D_{ij})^2$$

This formula describes a minimization problem: a sum over the weighted squares of the differences between distances and disparities. The values of $w_{ij}$ serve to normalize the cost function as well as to weight the different disparities; they depend on the normalization technique (local, intermediate or global normalization). To minimize the cost, or stress, $E$, Sammon recommends iterative methods such as the Newton method.

A different approach to the problem of preserving distance is hyperbolic multi-dimensional scaling (H-MDS). Before explaining the basic idea of H-MDS I would like to give an overview of hyperbolic space. When people think of geometry and spaces they usually assume Euclidean geometry, which is flat and has zero curvature. Spherical geometry has positive curvature, which results in surfaces like that of the moon. Just as there is geometry with positive curvature, there is also geometry with negative curvature; its two-dimensional form is the hyperbolic plane, H2. The hyperbolic plane can be characterized by two important equations [Walter02]. The area $a$ and the circumference $c$ of a circle of radius $r$ in H2 are given by

$$a(r) = 4\pi \sinh^2\!\left(\frac{r}{2}\right) \quad \text{and} \quad c(r) = 2\pi \sinh(r).$$

Figure 16: Hyperbolic patch embedded into $\mathbb{R}^3$. The circumference and area grow exponentially in the drawn circle. The sum of angles in the triangle is smaller than 180° [Walter].

The equations for the area and the circumference of a circle in H2 hold an amazing feature of the hyperbolic plane. For small values of $r$, $\sinh^2\!\left(\frac{r}{2}\right) \approx \frac{r^2}{4}$ and $\sinh(r) \approx r$, so that $a(r) \approx \pi r^2$ and $c(r) \approx 2\pi r$. This is identical to the description of circles in Euclidean geometry. For larger values of $r$, though, both the area and the circumference grow exponentially. In order to imagine this exponential growth you can think of H2 as a ball of crumpled paper. If you draw a circle on this wrinkled sheet of paper and unfold it afterwards, the area and the circumference of the resulting object will be much larger than they appeared when drawn. Figure 16 shows the embedding of an extract of H2 into $\mathbb{R}^3$.

There are several approaches to mapping the hyperbolic plane onto a Euclidean surface. I will give a rough overview of the Poincaré model, as it is the most widely used method for this task. Basic features of this projection are its display compatibility, which means that the entire H2 space fits into the Poincaré disk (PD), and the fact that the rim of the disk lies at infinity. The first of these characteristics inspired M. C. Escher to create the picture in Figure 17. Note that in the drawing the white lines are always perpendicular to the rim and represent straight lines in H2. The picture also clearly illustrates the fish-eye effect: the images in the middle appear larger than the ones in the outer parts of the circle, even though they are all the same size.

Coming back to hyperbolic multi-dimensional scaling, we now have to adapt the distance $d_{ij}$ to the new assumptions. The distance according to Riemann [Walter02, cf. 8] is now given by

$$d_{ij} = \operatorname{arctanh}\!\left(\frac{\lvert x_i - x_j \rvert}{\lvert 1 - x_i \bar{x}_j \rvert}\right), \quad x_i, x_j \in PD.$$

I will not go into detail about how to determine the cost function resulting from this type of distance, but would like to mention that the non-linearity of the distance $d_{ij}$ has a greater influence on the transformation in H2 than in the Euclidean case [Walter02]. This suggests that the choice of $D(\cdot)$ needs a lot more consideration.
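To make the cost function of this section concrete, the sketch below evaluates the stress $E$ for a given layout and offers both distance measures. Several points are assumptions of mine rather than statements from [Walter02]: the helper names, the identity transformation for $D(\cdot)$, unit weights as the default, the classic Sammon weighting $w_{ij} = 1 / (D_{ij} \sum_{k<l} D_{kl})$ as one possible normalization, and the `scale` parameter. The formula as transcribed above carries no prefactor, whereas the Poincaré disk of curvature $-1$ is usually written with a factor 2. The sketch only evaluates $E$; it does not perform the iterative minimization.

```python
import math

def euclidean(p, q):
    return math.dist(p, q)

def poincare_distance(p, q, scale=1.0):
    """Distance between two points inside the unit disk, treated as complex numbers."""
    z1, z2 = complex(*p), complex(*q)
    return scale * math.atanh(abs(z1 - z2) / abs(1 - z1 * z2.conjugate()))

def pairwise(points, dist):
    """Distances for all index pairs (i, j) with j > i."""
    n = len(points)
    return {(i, j): dist(points[i], points[j])
            for i in range(n) for j in range(i + 1, n)}

def sammon_weights(D):
    """Classic Sammon normalization (assumption); requires non-zero disparities."""
    total = sum(D.values())
    return {pair: 1.0 / (total * value) for pair, value in D.items()}

def stress(d, D, w=None):
    """E = sum over pairs of w_ij * (d_ij - D_ij)^2; unit weights if w is None."""
    return sum((w[pair] if w else 1.0) * (d[pair] - D[pair]) ** 2 for pair in d)

high = [(0, 0, 0), (1, 0, 0), (0, 1, 0)]   # hypothetical data in three dimensions
D = pairwise(high, euclidean)              # disparities, with D(.) taken as the identity
layout_exact = [(0, 0), (1, 0), (0, 1)]    # 2D layout that preserves all distances
layout_off = [(0, 0), (2, 0), (0, 1)]      # 2D layout that distorts two distances
e_exact = stress(pairwise(layout_exact, euclidean), D)
e_off = stress(pairwise(layout_off, euclidean), D, sammon_weights(D))
```

For H-MDS the layout points would lie inside the unit disk and `pairwise(layout, poincare_distance)` would replace the Euclidean distances of the layout.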

Figure 17: Circle Limit III (1958)

4.2 Self-organizing maps (SOMs)

The second method I would like to familiarise you with is the self-organizing map, a type of artificial neural network. Neural networks are adapted from neurobiological models [Oellien03]. In order to reduce dimensionality they use a combination of analytic and graphical techniques to group data while preserving the data structure [Grinstein01]. A neural network consists of several layers that are made up of neurons. The data is presented to the input layer, processed inside the network and then returned from the output layer. During the processing phase various rules are applied to the given data set; these constitute the learning algorithm.

Neural networks can further be divided into supervised and unsupervised nets. Self-organizing maps (SOMs) are the most famous example of unsupervised learning. Teuvo Kohonen developed this method, which is why they are often referred to as Kohonen's SOMs. The term unsupervised refers to the fact that only the input data set is presented to the neural network. The algorithm then determines similarities automatically and returns a map as the output of the process. This output map is characterized by numerous clusters in which similar objects lie close to each other, thereby showing the relationship between the input variables. Since there is no need to know anything about the underlying relationships of the data points, this technique is ideal for data sets whose structure or system is unknown. Figure 18 shows the self-organizing map of the Iris data set.

In the next chapter I will deal with the need to interact with the visualized data and will present the most essential techniques.
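The unsupervised learning described in Section 4.2 can be illustrated with a very small one-dimensional map. Everything specific in this sketch is an assumption of mine and is not taken from [Oellien03] or [Grinstein01]: the linear initialisation of the weights, the Gaussian neighbourhood, the linearly decaying learning rate and radius, and the function names. It needs at least two units.

```python
import math
import random

def best_matching_unit(weights, x):
    """Index of the unit whose weight vector is closest to x."""
    return min(range(len(weights)),
               key=lambda k: sum((w - v) ** 2 for w, v in zip(weights[k], x)))

def train_som(data, n_units, epochs=50, lr0=0.5, seed=0):
    rng = random.Random(seed)
    dim = len(data[0])
    lows = [min(x[d] for x in data) for d in range(dim)]
    highs = [max(x[d] for x in data) for d in range(dim)]
    # Assumption: units start evenly spread between the data minimum and maximum.
    weights = [[lows[d] + (highs[d] - lows[d]) * k / (n_units - 1) for d in range(dim)]
               for k in range(n_units)]
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        order = list(data)
        rng.shuffle(order)                 # present the inputs in random order
        for x in order:
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)        # learning rate decays to zero
            sigma = max(n_units / 2 * (1.0 - frac), 0.5)   # neighbourhood radius shrinks
            bmu = best_matching_unit(weights, x)
            for k in range(n_units):
                # Units close to the winner on the map are pulled more strongly.
                h = math.exp(-((k - bmu) ** 2) / (2 * sigma ** 2))
                for d in range(dim):
                    weights[k][d] += lr * h * (x[d] - weights[k][d])
            step += 1
    return weights

# Two hypothetical clusters; only the inputs are given, no class labels.
samples = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (1.0, 1.0), (0.9, 1.0), (1.0, 0.9)]
som = train_som(samples, n_units=4)
unit_low = best_matching_unit(som, (0.0, 0.0))
unit_high = best_matching_unit(som, (1.0, 1.0))
```

After training, inputs from the two clusters are assigned to different units of the map, which is the clustering behaviour described above.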

Figure 18: Self organizing map of the Iris data set [Grinstein01]

5 Interaction

Interaction plays an important role in understanding the knowledge discovery process and the data itself. The methods I present cover the most basic but crucial techniques that are necessary to become an active data analyst. What distinguishes an active from a passive user is the ability to discuss problems that arise [Thearling02] and to find solutions for them by working on the data set (in the visualization mode). The order in which I present the techniques, and the grouping I chose, is not the only possible one, but it reflects the normal flow of interactive applications.

5.1 Filtering

Filtering is the interaction technique that performs the central task of selecting data points. Selection is used either to eliminate impossible or false data points or to focus on a particular cluster that needs further consideration. As an example of impossible data points consider the following case: the data set is the number of crimes over the age of the population. If the plot indicates crimes committed by three-year-old children or by people who are 150 years old, we can assume that these points have been entered incorrectly into the system and should be eliminated.

Within the category of filtering techniques we distinguish between browsing and querying. Browsing is the direct selection of data points. This is a suitable method for identifying clusters and, for instance, choosing them for further investigation. Querying, however, offers the opportunity of directly entering specifications that the data points are required to meet. This could be used to remove the unrealistic data points mentioned in the example above.
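The difference between the two filtering styles can be sketched in a few lines of Python. The `Record` type, the function names and the age limits of 10 and 110 years are hypothetical choices made for this illustration; a real application would take the limits from the analyst.

```python
from dataclasses import dataclass


@dataclass
class Record:
    age: int      # age of the offender in years
    crimes: int   # number of recorded crimes


def query(records, min_age=10, max_age=110):
    """Querying: keep only the records that meet an entered specification."""
    return [r for r in records if min_age <= r.age <= max_age]


def browse(records, selected_indices):
    """Browsing: keep the records the user picked directly, e.g. by clicking."""
    chosen = set(selected_indices)
    return [r for i, r in enumerate(records) if i in chosen]


if __name__ == "__main__":
    records = [Record(3, 2), Record(25, 40), Record(150, 1), Record(61, 7)]
    print(query(records))           # drops the implausible ages 3 and 150
    print(browse(records, [1, 3]))  # keeps the two hand-picked records
```

Querying scales to large data sets because the condition is stated once, whereas browsing relies on the analyst seeing and picking the points of interest.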

5.2 Linking and brushing

Other selection tools are linking and brushing. Even though these procedures can also be used to delete data points completely, this is not their primary goal. Linking refers to the connection between different plots. In multiple views (refer to Geometrical methods/multiple views) the manipulation of data in one plot automatically affects the respective data in the other (linked) graphs. Highlighting data points is called brushing. It is often used in connection with linking so that the user can observe the effect of highlighting data in one graph on the other views.

Consider the following example. The data set is three-dimensional with the dimensions product type, territory and number of sales. As a visualization technique we use scatterplots in multiple views. The first scatterplot shows the distribution of the number of sales over the different product types, whereas the second one presents the number of sales over the territories. Because the graphs are linked, highlighting the data points for the first product in the first graph also highlights the respective data points in the second plot. The highlighted cluster in the linked graph may now indicate, for example, that the chosen product is mostly sold in a particular territory. This connection between the two attributes might not have been seen if the graphs had not been linked. This is a very simple example, but the principle can easily be applied to higher-dimensional data sets. It is not limited to multiple views of the same visualization technique either, but is also very useful for multiple views of different plot types.

5.3 Zooming

Zooming is the method of showing a particular part of the data set in detail. The problem that arises with this way of focussing on portions of the data is that you might lose the big picture. One remedy is non-linear magnification.
Methods that fall within this category are also known as focus+context (F+C) techniques [Sahling03]. This advanced form of zooming tries to expand the selected area while still showing the original context. Examples of non-linear magnification are the fish-eye lens and the hyperbolic space. Data viewed through a fish-eye lens appears as if seen through a wide-angle camera lens [Sahling03]. The example in figure 19 shows the tram network in Washington D.C., USA. Due to its properties (see also 4.1 Multi-dimensional scaling (MDS)/HMDS) the hyperbolic space is particularly well suited to zooming interactions.

5.4 Manipulation of data

Manipulation of data, mainly input data, or the removal of outliers is a very important part of the whole knowledge discovery process. Observing how the output data (or the neural network) changes in response to a change in the input data helps the analyst to get a feel for the basics. It also allows the study of what-if questions on the particular data set.

None of these methods is tied to a particular stage of the visualization or interaction process in KDD; they are applicable at any time and therefore indispensable.
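The fish-eye idea from section 5.3 can be illustrated with a one-dimensional sketch. It assumes the widely used distortion function g(x) = (d + 1) x / (d x + 1), where x is the normalised distance from the focus and d is a distortion factor; neither this function nor the names `fisheye`, `lo`, `hi` and `d` come from the sources cited here.

```python
def fisheye(p: float, focus: float, lo: float = 0.0, hi: float = 1.0,
            d: float = 3.0) -> float:
    """Map coordinate p in [lo, hi] through a 1D fish-eye lens centred on focus."""
    if not lo <= focus <= hi or not lo <= p <= hi:
        raise ValueError("focus and p must lie inside [lo, hi]")
    if p == focus:
        return p
    edge = hi if p > focus else lo
    span = edge - focus          # signed distance from the focus to the border
    x = (p - focus) / span       # normalised distance from the focus, in (0, 1]
    g = (d + 1.0) * x / (d * x + 1.0)
    return focus + g * span


if __name__ == "__main__":
    for p in (0.0, 0.4, 0.5, 0.6, 0.9, 1.0):
        print(p, "->", round(fisheye(p, focus=0.5), 3))
```

Points close to the focus are pushed apart, points near the border are compressed, and the borders themselves stay fixed, so the context remains visible. For a two-dimensional plot the same mapping could be applied to each axis in turn; larger values of `d` give a stronger magnification.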

Figure 19: Tramnetwork F+C technique [Keahey]

6 Software and Applications

Several software packages are currently available. Most of them are commercial, but there are a couple of public-domain tools as well. I would like to give an overview of two packages as examples.

XGobi is a data visualization tool that has been developed by Deborah F. Swayne, Di Cook and Andreas Buja. It handles multivariate data presentation using scatterplots. The software also offers projection techniques to reduce the dimensionality of data sets and supports basic interaction methods such as brushing and zooming. A very important feature of this tool is its handling of missing values. GGobi is an extension and further development of XGobi. It provides the user with a new and clearer interface and supports newer technologies such as XML and database systems.

Xmdv is a public-domain software package. The main focus of this tool is the interaction between user and machine. It handles multidimensional data sets by applying scatterplots, star glyphs, parallel coordinates and dimensional stacking (all presented in this paper). Since this software concentrates on interaction, it supports the methods I dealt with in the last chapter. Furthermore, it offers the opportunity to mask dimensions in order to study their impact on the clustering in the data set. A unique feature of this tool is that it clusters data sets in order to visualize them without overloading the screen and therefore the user. This benefits clarity, but it excludes the user from the process.

Both packages can be used in many industries. Often there are modules that adapt the general idea of the tool to different areas of interest. Xmdv, for instance, offers modules for fields such as finance and geochemistry (cf. Xmdv).
A very interesting application for such tools is investigative visualization in the medical field. On the one hand graphical representations of statistics such as the number of breast cancer patients for the last twenty years are essential to reveal trends. On the other hand, especially useful for research, the graphs can be used to

do studies on the causes of diseases. The occurrence of certain illnesses such as breast cancer depends on various variables. In order to find out which combination of these variables leads to the actual disease, a graphical representation is crucial.

I would like to conclude my paper with a glance at future research.

7 Outlook and Conclusion

Even though there has been a large amount of research on human-machine integration, it is still an area that needs further improvement and development in order to use the power resulting from this symbiosis more effectively and efficiently. Keim [Keim02] is of the opinion that it is not only computers in general that need to be integrated with the human. He states that two components, namely visualization techniques and the well-known methods applied in areas such as statistics and operations research, have to be brought together more extensively. From this point of view, research on human perception should also be intensified [Keahey99]. Methods that do not reduce dimensionality in particular can overload the user and therefore have the opposite effect, as the user might not be able to see the overall picture.

Another question that arises [Brodbeck97] is whether to present data mainly in 2D space, to use three-dimensional visualization techniques, or perhaps even to use animations. Animation is a very functional visualization technique. It is applicable to many kinds of data sets, as the additional dimension allows the visualization of a flow in a variable, usually time. An advantage of 3D is that the human world is three-dimensional, so this kind of visualization seems more natural to the human user than any of the others. The problem, though, is again the overload that 3D data produces. Humans still have to break down the visualized data into two dimensions and compare it in their minds. Intensive interaction, such as rotation of the data set as well as the tools already presented, could be helpful.
Until now, 2D representation has been the most extensively used form of visualization. Hyperbolic planes are also an area that visualization tools should exploit further. Their infinite representation space and focus+context capabilities are very important features that will be greatly needed in the future. A hindrance could be the fact that hyperbolic planes are geometries we are not familiar with. This might displease users, but I think that professional education and information about this view of the world will improve the chance of a smooth introduction of hyperbolic planes as the basis of data representation.

To conclude this paper, I can say that the existing techniques take very distinct approaches to the problems. Together they offer the analyst a choice, since, as mentioned before, different data types need different graphical representations. As I pointed out above, there is still a lot of research that needs to be done, but I think the need has been identified, and we can therefore look forward to a large number of new and innovative techniques for the visualization of data and information in the future.

References

[Brodbeck97] Brodbeck, D. et al.: Domesticating Bead: Adapting an Information Visualization System to a Financial Institution. matthew/papers/infovis97.pdf, 1997, printed

[Docherty02] Docherty, P., Beck, A. (edt): A Visual Metaphor for Knowledge Discovery. In: Information Visualization in Data Mining and Knowledge Discovery. Morgan Kaufmann, San Francisco (CA) 2002

[Fayyad02] Fayyad, U., Grinstein, G. (editors): Introduction. In: Information Visualization in Data Mining and Knowledge Discovery. Morgan Kaufmann, San Francisco (CA) 2002

[Grinstein01] Grinstein, G., Trutschl, M., Cvek, U.: High-Dimensional Visualizations. mtrutsch/research/high- Dimensional Visualizations-KDD2001-color.pdf, 2001, printed

[Grinstein02a] Grinstein, G., Hoffmann, P., Pickett, R. (edt): Benchmark Development for the Evaluation of Visualization for Data Mining. In: Information Visualization in Data Mining and Knowledge Discovery. Morgan Kaufmann, San Francisco (CA) 2002

[Grinstein02b] Grinstein, G., Ward, M. (edt): Introduction to Data Visualization. In: Information Visualization in Data Mining and Knowledge Discovery. Morgan Kaufmann, San Francisco (CA) 2002

[Hoffmann99] Hoffmann, P. E.: Table Visualizations: A Formal Model and Its Applications. peh2.hoffman/tablevizx.pdf, 1999, printed

[Hoffmann02] Hoffmann, P., Grinstein, G. (edt): A Survey of Visualizations for High-Dimensional Data Mining. In: Information Visualization in Data Mining and Knowledge Discovery. Morgan Kaufmann, San Francisco (CA) 2002

[Kaidi00] Kaidi, Z.: Data visualization. kzhao/papers/ 00 course Data visualization.pdf, 2000, printed

[Keahey99] Keahey, T. A.: Visualization of High-Dimensional Clusters Using Nonlinear Magnification. spie.pdf, 1999, printed

[Keahey] Keahey, T. A.: A Brief Tour of Nonlinear Magnification. tkeahey/research/nlm/nlmtour.html

[Keim02] Keim, D. A.: Information Visualization and Visual Data Mining. arnumber=981847&isnumber=21152, 2002, printed

[Kontkanen99] Kontkanen, P. et al.: Supervised model-based visualization of high-dimensional data, printed

[Koua03] Koua, E. L.: Using Self-Organizing Maps for Information Visualization and Knowledge Discovery in Complex Geospatial Datasets. /art proc/koua.pdf, 2003, printed

[Rhodes02] Rhodes, P. (edt): Discovering New Relationships. In: Information Visualization in Data Mining and Knowledge Discovery. Morgan Kaufmann, San Francisco (CA) 2002

[Mihalisin02] Mihalisin, T. (edt): Data Warfare and Multidimensional Education. In: Information Visualization in Data Mining and Knowledge Discovery. Morgan Kaufmann, San Francisco (CA) 2002

[Oellien03] Oellien, F.: Data Mining und Datenvisualisierung (ch 5 of Algorithmen und Applikationen zur interaktiven Visualisierung und Analyse chemiespezifischer Datensätze). Oellien/diss/index.html, 2003, printed

[Sahling03] Sahling, G. N.: Interactive 3D Scatterplots From High Dimensional Data to Insight. NSahling/masterthesis.html, 2003

[Thearling02] Thearling, K. et al. (edt): Visualizing Data Mining Models. In: Information Visualization in Data Mining and Knowledge Discovery. Morgan Kaufmann, San Francisco (CA) 2002

[Walter] Walter, J. A.: Interactive Visualization and Navigation using the Hyperbolic Space. walter/h2vis/

[Walter02] Walter, J. A., Ritter, H.: On Interactive Visualization of High-dimensional Data using the Hyperbolic Plane. walter/pub/walter02-kdd.pdf, 2002, printed

[Ward99] Ward, M. O.: A Taxonomy of Glyph Placement Strategies for Multidimensional Data Visualization. matt/courses/glyphs/, 1999

[Wezel03] Wezel, M. C. van, Kosters, W. A.: Nonmetric multidimensional scaling: Neural networks versus traditional techniques. kosters/ida00191.pdf, 2003, printed