GEO-VISUALIZATION SUPPORT FOR MULTIDIMENSIONAL CLUSTERING



Similar documents
Exploratory Data Analysis for Ecological Modelling and Decision Support

Visualization methods for patent data

Interactive Analysis of Event Data Using Space-Time Cube

DATA MINING TECHNOLOGY. Keywords: data mining, data warehouse, knowledge discovery, OLAP, OLAM.

Blending Aggregation and Selection: Adapting Parallel Coordinates for the Visualization of Large Datasets

DATA VISUALIZATION GABRIEL PARODI STUDY MATERIAL: PRINCIPLES OF GEOGRAPHIC INFORMATION SYSTEMS AN INTRODUCTORY TEXTBOOK CHAPTER 7

A Spatial Decision Support System for Property Valuation

Impact of Data and Task Characteristics on Design of Spatio-Temporal Data Visualization Tools

Pentaho Data Mining Last Modified on January 22, 2007

DMDSS: Data Mining Based Decision Support System to Integrate Data Mining and Decision Support

Scalable Cluster Analysis of Spatial Events

A Review of Data Mining Techniques

STATISTICA. Financial Institutions. Case Study: Credit Scoring. and

MAP GENERALIZATION FOR OSMASTERMAP DATA IN LOCATION BASED SERVICES & MOBILE GIS APPLICATIONS

Implementation of a Flow Map Demonstrator for Analyzing Commuting and Migration Flow Statistics Data

USING SELF-ORGANIZING MAPS FOR INFORMATION VISUALIZATION AND KNOWLEDGE DISCOVERY IN COMPLEX GEOSPATIAL DATASETS

An Overview of Knowledge Discovery Database and Data mining Techniques

Exploratory spatio-temporal visualization: an analytical review

Big Data: Rethinking Text Visualization

The STC for Event Analysis: Scalability Issues

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

Introduction to Data Mining and Machine Learning Techniques. Iza Moise, Evangelos Pournaras, Dirk Helbing

A STUDY ON DATA MINING INVESTIGATING ITS METHODS, APPROACHES AND APPLICATIONS

Visibility optimization for data visualization: A Survey of Issues and Techniques

Getting Started With Mortgage MarketSmart

Extend Table Lens for High-Dimensional Data Visualization and Classification Mining

APPLICATION OF DATA MINING TECHNIQUES FOR BUILDING SIMULATION PERFORMANCE PREDICTION ANALYSIS.

Financial Trading System using Combination of Textual and Numerical Data

How To Predict Web Site Visits

GENNADY L. ANDRIENKO and NATALIA V. ANDRIENKO

Geointelligence New Opportunities and Research Challenges in Spatial Mining and Business Intelligence

Data Mining and Soft Computing. Francisco Herrera

Data Mining for Manufacturing: Preventive Maintenance, Failure Prediction, Quality Control

Data Mining Applications in Higher Education

Perspectives on Data Mining

Exploratory Spatial Data Analysis

Adobe Insight, powered by Omniture

An Introduction to Data Mining

Quality Control of National Genetic Evaluation Results Using Data-Mining Techniques; A Progress Report

TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM

SPATIAL DATA CLASSIFICATION AND DATA MINING

ON INTEGRATING UNSUPERVISED AND SUPERVISED CLASSIFICATION FOR CREDIT RISK EVALUATION

ICT Perspectives on Big Data: Well Sorted Materials

Cleaned Data. Recommendations

Information Visualization WS 2013/14 11 Visual Analytics

FIVE STEPS FOR DELIVERING SELF-SERVICE BUSINESS INTELLIGENCE TO EVERYONE CONTENTS

Utilizing spatial information systems for non-spatial-data analysis

Digital Cadastral Maps in Land Information Systems

WEKA Explorer User Guide for Version 3-4-3

Geovisualization. Geovisualization, cartographic transformation, cartograms, dasymetric maps, scientific visualization (ViSC), PPGIS

VISUALIZATION OF GEOSPATIAL METADATA FOR SELECTING GEOGRAPHIC DATASETS

Working with telecommunications

GIS Tutorial 1. Lecture 2 Map design

Location Analytics for Financial Services. An Esri White Paper October 2013

Enhanced Boosted Trees Technique for Customer Churn Prediction Model

Data Visualization. Brief Overview of ArcMap

ASSOCIATION RULE MINING ON WEB LOGS FOR EXTRACTING INTERESTING PATTERNS THROUGH WEKA TOOL

2.0 COMMON FORMS OF DATA VISUALIZATION

Architecture and Applications. of a Geovisual Analytics Framework

Introduction to Data Mining

Improving Decision Making and Managing Knowledge

Management Decision Making. Hadi Hosseini CS 330 David R. Cheriton School of Computer Science University of Waterloo July 14, 2011

Introduction. A. Bellaachia Page: 1

A Statistical Text Mining Method for Patent Analysis

How To Learn Data Analytics

DESCARTES & EFIMOD: An Integrated System for Simulation Modelling and Exploration Data Analysis for Decision Support in Sustainable Forestry

Data Mining in Telecommunication

Comparison of K-means and Backpropagation Data Mining Algorithms

Prediction of Heart Disease Using Naïve Bayes Algorithm

KNOWLEDGE BASE DATA MINING FOR BUSINESS INTELLIGENCE

Security visualisation

Data Mining Solutions for the Business Environment

Data Mining System, Functionalities and Applications: A Radical Review

Introduction to Data Mining

ECLT 5810 E-Commerce Data Mining Techniques - Introduction. Prof. Wai Lam

CHAPTER-24 Mining Spatial Databases

PRACTICAL DATA MINING IN A LARGE UTILITY COMPANY

Data Visualization. Prepared by Francisco Olivera, Ph.D., Srikanth Koka Department of Civil Engineering Texas A&M University February 2004

Example application (1) Telecommunication. Lecture 1: Data Mining Overview and Process. Example application (2) Health

Tracking System for GPS Devices and Mining of Spatial Data

Data Mining with Weka

STATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and

CSU, Fresno - Institutional Research, Assessment and Planning - Dmitri Rogulkin

Transcription:

Geoinformatics 2004 Proc. 12th Int. Conf. on Geoinformatics Geospatial Information Research: Bridging the Pacific and Atlantic University of Gävle, Sweden, 7-9 June 2004 GEO-VISUALIZATION SUPPORT FOR MULTIDIMENSIONAL CLUSTERING Gennady Andrienko and Natalia Andrienko Fraunhofer AiS - Autonomous Intelligent Systems Institute, Schloss Birlinghoven, Sankt-Augustin, D-53754 Germany, http://www.ais.fraunhofer.de/and/, gennady.andrienko@ais.fraunhofer.de, Tel +49-2241-142486, Fax +49-2241-142072 Abstract In this paper we consider how multidimensional clustering can be complemented by interactive visualization. We propose a link between geovisualization and data mining systems for supporting an iterative analysis cycle, including data pre-processing and visual exploration, automatic detection of clusters in multidimensional space of user-selected attributes, and visual analysis of cluster analysis results. INTRODUCTION CommonGIS (Andrienko et al., 2003) (formerly DESCARTES, Andrienko and Andrienko, 1999) is a system for visual analysis of spatially referenced data. CommonGIS is unique as a combination of well-integrated instruments, which can complement and enhance each other thus allowing sophisticated analyses. The system includes a variety of methods for cartographic visualisation, non-spatial graphs, tools for querying, search, classification, and computation-enhanced visual techniques. A common feature of all the tools is their high user interactivity, which is essential for exploratory data analysis (EDA). Although our primary focus is visual data analysis supported by graphical and cartographical data displays, we realise that analytical capabilities of humans are very limited in terms of the volume of data that can be analysed and the complexity of the relevant information that may be hidden in the data. Thus, increasing the number of objects leads to heavy cluttering and overlapping of graphical symbols on maps and graphs. On a map, the objects may appear so small that they are hardly visible. Zooming does not completely solve the problem because it prevents the overall view on a territory. Redrawing of maps and graphs containing multitudes of objects is rather slow, which makes the displays less dynamic and less reactive to user s actions. Visual analysis becomes especially problematic with increasing the number of attributes. It is unpractical to visualise and explore them one by one for detecting interesting distributional features. It is also practically impossible to test all combinations of the attributes for revealing relationships between them. For analysis of large data sets with great number of objects and/or attributes, visual tools need to be complemented by appropriate computational facilities. Data mining (Witten and Frank, 1999) is a suitable apparatus for such purposes. Data mining is defined as the nontrivial process of identifying valid, novel, potentially useful, and understandable patterns in data, in particular, in very large datasets. While in exploratory data analysis it is the task of a human analyst to uncover important characteristics of data, and the role of computers is to 329

G. Andrienko and N. Andrienko facilitate this task, data mining methods are designed for the automatic extraction of knowledge from data. From a broader perspective, data mining techniques are intended for quite the same purposes as the conventional EDA tools based on data visualization, namely, for acquiring useful information from data. Hence, these two groups of methods are complementary, and one would expect a benefit from combining data mining methods with visualisation tools. Few years ago we investigated possibilities of linking geo-visualizations in Descartes with data mining tools Kepler (Andrienko et al., 2001). The combined system supported the iterative process of data pre-processing and visual analysis in Descartes (for example, constructing qualitative attributes using classification maps on the basis of a single numeric attribute or groups of attributes), followed by computations in Kepler (for example, building classification trees or evaluating relevance of attributes), and visual analysis of the data mining results on maps and statistical graphics again in Descartes. Now we develop a link between CommonGIS (successor of Descartes) and public domain data mining software WEKA (Witten and Frank, 1999). The link supports the iterative process of exploratory data analysis. On the initial stage, data visualisation in CommonGIS is used for data preview as well as for encoding spatial information in a symbolic form suitable for data mining (e.g. in the form of classes of geographical objects). Then, data mining techniques are applied with the aim to reveal previously unknown regularities and relationships in the dataset, in particular, relate spatial and non-spatial characteristics of geographical objects. The results of data mining are interpreted with the help of visual displays of CommonGIS that provide the spatial reference to them or allow the analyst to verify the findings. However, it is naïve to expect that just a single run of one of the existing data mining methods will discover all important facts about data. Therefore, after interpreting and verifying the initial data mining results, it is useful to apply other data mining methods, or change the parameters of the previously used method, or modify the input data passed to the method. Hence, the process of data exploration is iterative. ANALYSIS SCENARIO Let us consider a scenario of analysing the age structure of USA population at the county level using interactive visual displays in combination with data mining. An analyst starts with visualizing proportions of different age groups on multiple choropleth maps (Figure 1). He notices that the values for some age groups form spatial clusters. For example, proportions of 18-29 years and 30-49 years age groups are especially high on the south-west, while high proportions of population aged from 5 to 17 years are concentrated closer to the centre of the country. To make the patterns more prominent, the analyst replaces the original values by the differences to the mean value for each age group. The results are presented in Figure 2. The shades of brown are used to represent positive differences and the shades of blue represent negative differences. The contrast of the brown and blue colours makes the spatial clusters better visible. 330

Geo-visualization support for multidimensional clustering Figure 1: The proportions of different age groups in US population at the county level (3140 counties) are shown using juxtaposed choropleth maps. Figure 2: The proportions of the age groups are shown in comparison to the mean value for each age group over all counties. Although the visualization of transformed data exposed interesting spatial patterns in value distribution of each individual age group, it is difficult to integrate this information into an overall picture of the variation of age structure over the country. Therefore the analyst decides to apply one of the methods for multidimensional clustering - SimpleKmeans 331

G. Andrienko and N. Andrienko available in the WEKA system. After selecting the attributes to be taken into account and the desired number of clusters (Figure 3), WEKA produces three clusters, which are immediately visualized in CommonGIS (Figure 4). Figure 3: The user interface for choosing attributes and setting parameters for cluster analysis. Figure 4: Area colouring shows the results of division of the counties into 3 clusters according to the proportions of different age groups in their population. To investigate the characteristics of the resulting clusters, the analyst requests the system to build a panel with value frequency histograms for the source attributes. Then he uses the broadcast classification tool of CommonGIS to make the clusters represented on the histograms. This is done by means of dividing the bars into coloured segments with the lengths proportional to the number of counties belonging to the corresponding clusters (Figure 5). One can see that class 1 (red) is characterized by high proportion of old population (AGE_65_UP %), while classes 2 (yellow) and 3 (blue) are mostly distinguished according to values of the attributes AGE_5_17 % and AGE_30_49 %. Class 3 can be characterised as having unusually high proportions of children while class 2 consists of the counties with higher proportions of middle-age population. 332

Geo-visualization support for multidimensional clustering Figure 5: Profiles of the clusters are shown on histograms. For a more comprehensive analysis, the analyst wishes to compare the results of clustering into 3 classes with that into 4 classes. Figure 6 represents the results of division into 4 clusters made by the same SimpleKmeans method. It can be seen that the fourth cluster contains only 171 counties (about 5 % of the total number of counties), while numbers of counties in classes 1, 2, and 3 have not significantly changed except for a small decrease of the number of class 1 members and some increase in class 2. It is interesting that a spatial cluster of counties belonging to class 2 (yellow) has appeared on the north-west of the continental part of the USA (not including Alaska). With the division into 3 classes, this area was more patchy : counties from classes 1 and 2 were mixed (see Figure 4) Figure 6: Results of division into 4 clusters. A cross-classification map (Figure 7) allows the analyst to see which counties have moved to another class in the result of division into 4 clusters. Here the red, yellow, and blue colours are used for painting the counties that had not change their class. All transitions between classes are depicted in colours obtained by mixing the original class colours. The map legend (Figure 7, left) and the bar chart (right) provide the statistics of the transitions. One can see that the 4th class is formed mostly by 164 counties that belonged formerly to class 2. Other major exchanges between classes include class 1 -> class 2 (266 counties) and class 3 -> class 2 (149 counties). 333

G. Andrienko and N. Andrienko Figure 7: A cross-classification maps supports comparison of results of two runs of the clustering algorithm with different input parameters. To understand the meaning of the new classes, the analyst projects them on the same histograms as were used before. The result is shown in Figure 8. Figure 8: The results of division into 4 classes are projected on histograms. It may be seen from the histograms that the new class 4 (green) is characterised by low proportions of the age groups 50-64 years and 65 and more years and high proportions of people from 18 to 29 years. The observation of the characteristics of the other classes is consistent with that made after the first run of the clustering method: the counties of class 1 (red) tend to be older than others, class 2 (yellow) is middle-aged, and class 3 (blue) consists of counties with high proportions of children. It seems that the clustering algorithm has raised the age threshold for assigning counties to class 1, and this accounts for the movement of some counties from class 1 to class 2. 334

Geo-visualization support for multidimensional clustering CONCLUSIONS From the given example scenario one can see that the computation power of data mining methods is a valuable complement to visual techniques of data analysis, and vice versa, interactive and dynamic visualization is an effective analytical instrument that can significantly help in interpretation of data mining results. The research on synergistic use of visual and computational analysis methods is very promising and worth continuation. In the nearest future we are going to experiment with linking the visual tools of CommonGIS to other data mining methods. ACKNOWLEDGEMENTS This work was partly supported by the European Commission in project GIMMI (IST-2001-34245). We are grateful to Mr. Mark Ostrovsky for his help in implementation of CommonGIS components considered in the paper. REFERENCES Andrienko, G. and Andrienko, N. 1999: Interactive maps for visual data exploration. International Journal Geographical Information Science, v. 13(4), 355-374. Andrienko, N., Andrienko, G., Savinov, A., Voss, H., and Wettschereck, D. 2001: Exploratory Analysis of Spatial Data Using Interactive Maps and Data Mining. Cartography and Geographic Information Science, v. 28(3), 151-165. Andrienko, G. Andrienko, N. and Voss. H. 2003: GIS for Everyone: the CommonGIS project and beyond. In M.Peterson (ed.), Maps and the Internet. Elsevier Science, 131-146. Witten, I.H. and Frank, E., 1999: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann. 335