# Variable Binned Scatter Plots

Save this PDF as:

Size: px
Start display at page:

## Transcription

1 Variable Binned Scatter Plots Ming C. Hao, Umeshwar Dayal, Ratnesh K. Sharma Hewlett Packard Laboratories, Palo Alto, CA Daniel A. Keim, Halldór Janetzko University of Konstanz, Germany ABSTRACT The scatter plot is a well-known method of visualizing pairs of two continuous variables. Scatter plots are intuitive and easy-to-use, but often have a high degree of overlap which may occlude a significant portion of the data. To analyze a dense non-uniform dataset, a recursive drill-down is required for detailed analysis. In this paper, we propose variable binned scatter plots to allow the visualization of large amounts of data without overlapping. The basic idea is to use a non-uniform (variable) binning of the x and y dimensions and to plot all data points that are located within each bin into the corresponding squares. In the visualization, each data point is then represented by a small cell (pixel). Users are able to interact with individual data points for record level information. To analyze an interesting area of the scatter plot, the variable binned scatter plots with a refined scale for the subarea can be generated recursively as needed. Furthermore, we map a third attribute to color to obtain a visual clustering. We have applied variable binned scatter plots to solve real-world problems in the areas of credit card fraud and data center energy consumption to visualize their data distributions and causeeffect relationships among multiple attributes. A comparison of our methods with two recent scatter plot variants is included. Keywords: Variable Binned Scatter Plots, Correlations, Patterns, Cause-Effect, Data Distribution 1. INTRODUCTION 1.1 Motivation Scatter plot are one of the most powerful tools for data analysis in daily business operations. Analysts face the challenge of understanding underlying data and finding important relationships from which to draw conclusions, such as answering questions on how one variable is affected by another. For example, in credit card fraud analysis, business analysts want to know fraud impact factors (i.e., amount, count, and region) and distributions. Data center managers want to find the cause-effect of resource consumption to increase energy savings. A scatter plot of power consumptions against temperature can show the impact between these two variables for administrators to improve cooling efficiency. Scatter plots are widely used, intuitive, and easy to understand. However, scatter plots often have a high number of overlapping data points. When there are many data points and significant overlap, scatter plots become less useful. Besides, in analyzing a very dense dataset, the current scatter plots are too coarse and recursion in certain areas is needed. For example, the traditional scatter plot in Figure 1A shows 70,465 fraud observations, but only about 200 distinct data points are visible in the scatter plot, which may mislead the user in judging the density of the data. In addition, users can not distinguish the details in the left bottom corner. There are several approaches that can be used when this occurs (for details see section 2). But the difficulties still remain, especially when visualizing very large multi-dimensional high density data sets. Current scatter plots do not provide a complete picture of the data regarding: - Detailed information at record level. - Overlapping data points. - Distribution and clusters in the high density areas. 1.2 Our Contribution In order to visualize the data distributions and discover cause-effect between attributes (variables), we have to solve the overlap problem. Our solution is the variable binned scatter plot. First, the dataset is binned into proper value ranges. Then, we use density estimation with distortion techniques to place overlapping data points that fall within each bin into corresponding square. The bin size is variable and is computed from the data density. The degree of variation is optimized based on the number of overlapping data points and the available space. Further, our variable binned scatter plot uses the value of a third attribute as the color of the data points. With the color, data points can be classified into different groups (clusters). This feature is especially useful in placing 1

2 overlapping data points by certain categories (i.e., sales regions). Overlapping data points are sorted by the value of the third attribute and then placed together to form clusters. Binned scatter plots can be extended into a variable binned scatter plot matrix to display pair-wise relations between multiple attributes. Each data point is represented by a pixel [10, 11]. Because a pixel is the smallest element on the screen, large volumes of data points can be displayed in a single view. Variable binned scatter plots are interactive. Analysts can rubber-band an interesting area and zoom into detailed information. Using a recursive drill down, users are able to select one or multiple bins in the variable binned scatter plot to generate a new variable binned scatter plot with refined data value range of x-axis and y-axis for detailed analysis. Variable binned scatter plots have been applied with success to real-world credit card fraud analysis and data center thermal management applications. Both applications use variable binned scatter plots to visualize data distributions and impacts among various factors. This paper is structured as follows: Section 2 provides an overview of related work. Section 3 introduces the variable binned scatter plots basic idea and three basic techniques: pixel cell-based representation, binning, data point placement and grouping, and recursive scatter plot generation. In section 4, we present application examples in which real-world data are used to demonstrate the effectiveness of our techniques. An evaluation of the strengths and weaknesses of our approach versus other variants is presented in section RELATED WORK The scatter plot is a well-known data analysis method to show how much one variable is affected by another. Overlap is always a problem in visualizing high density data sets using scatter plots. In 1984, Cleveland [5] introduced sunflowers to draw overlapping points and superposition of smoothing methods for enhancing the x- and y-axes in scatter plots. Cleveland s ideas are great improvements of scatter plots, but they do not solve the overlap problem. In 1999, Lee Wilkinson [13] suggested the usage of semi-transparency to make overlapping data points partially visible. In the book by Antony Unwin et al. [12], a number of interesting visualization techniques were introduced regarding scatter plots, such as drawing overlap points with slightly bigger sizes and reducing the x and y axes by certain factors. JMP 8 Software [9] generates scatter plots with nonparametric density contours and marginal distributions to show where the data is most dense. Each contour line in the curved shape encloses 5% of the data. Carr [4] uses a hexagonal-shaped symbol whose size increases monotonically as the number of observations in the associated bin increases, and HexBin scatter plots [8] determine the brightness value of each HexBin cell depending on the number of data points in the cell. All three techniques, Unwin s distortion, Carr s binning and the HexBin visualization techniques, are close to the method presented in this paper. Bowman s smooth contour scatter plot [2, 3] applies smoothing techniques for data analysis. Bachthaler continuous scatter plots [1] are different from the above scatter plots. Continuous scatter plots are used for visualizing spatially continuous input data instead of discrete data values. The above approaches provide excellent methods for data correlation analysis. However, analysts are not able to see and access all data points, especially if the third variable mapped to color is of high importance. In this paper, we introduce the variable binned scatter plot technique with recursive visual analytics. We combine the best features of the above methods (e.g., binning and zooming) with distortion to find the best placement for the overlapping data points to enable analysts to quickly discover distribution and clusters. Also, we use color to visualize the third attribute, while the previous approaches use color to represent density. This feature helps the users to quickly identify patterns and clusters. 3.1 Basic Idea of Variable Binned Scatter Plots 3. OUR APPROACH Binning is an efficient approach [5] to reduce the complexity of large volumes of multi-dimensional data by dividing the plot area into a number of value ranges. We introduce the concept of the variable binned scatter plot to manage large volumes of data that overlap. Variable binned scatter plots are derived from traditional scatter plots to address the issue of overlapping. Variable binned scatter plots group data in a two-dimensional space based on the densities of pairs of variables. Each data point defined as (x i, y i, z i ) for i from 1 to n consists of a pair of two variables, x and y. A scatter plot of x i against y i shows the relationship between x and y; and color z to show a third variable. Variable binned scatter plots employ color (z i ) to cluster related data points. 2

3 Figure 1: From Traditional Scatters Plot to Variable Binned Scatters Plot Without Overlapping (x-axis: fraud count, y-axis: fraud amount, color: #of overlapping data points/fraud \$amount) Figures 1A to 1D illustrate the progression from traditional scatter plots to variable binned scatter plots. The scatter plot in Figure 1A has many overlapping points. There is no indication as to which data points are overlapping, potentially resulting in a misleading data representation. In Figure 1B, the overlapping data points use the color to denote the number of data points which have the same (x i, y i ) position, but the plot area is too dense to see all the colored data points. In the scatter plot in Figure 1C, the first bin is slightly enlarged resulting in less overlap. More data points become visible, but it is still not possible to see all data points with their distributions and patterns. Variable binned scatter plot in Figure 1D uses a non-uniform (variable) binning of the x and y dimensions and plots all the data points that fall within each bin into the corresponding square area. These square areas are scaled to allow each data point to be shown without any overlap. The relative position of a data point within a bin is retained as accurately as possible. Users are now able to visualize all the data points without losing information. Users are able to visualize the impact between two variables accurately and quickly, and without misrepresentations of the data. Variable binned scatter plots enhance the traditional scatter plots in analyzing very large and dense datasets. 3.2 Construction of Variable Binned Scatter Plots Figure 2 illustrates a pipeline on how to construct a variable binned scatter plot using the following techniques: 1) Use of pixel cells to represent data points in a binned scatter plot Variable binned scatter plots use the smallest element on the screen, such as a pixel, to represent a data point. Analysts are able to view large volumes of data points in a single display. Data points of binned scatter plots are interactive. Analysts can zoom-in on a data point to view specific data attributes. Intelligent visual queries [6] are also provided for analysts to select a focused area in a scatter plot. 3

4 Figure 2: A Variable Binned Scatter Plots Construction Pipe Line Then apply automated analysis methods to identify characteristics of the selected data as well as their relationships to other attributes and data points. 2) Binning & Distortion Based on Carr et al. s definition [5], binning is an approximation for density of the joint distribution of two variables. Our variable binned scatter plots use non-uniform bin sizes to display high density areas. A bin contains the data points which have their (x, y) -coordinates within defined x and y value ranges. The binning of the x- and y-axes is determined according to the data value ranges which are computed from the incoming data and their density distribution. When the data size grows, the bin size will be enlarged accordingly. Therefore, there are no overlapping data points in the variable binned scatter plots. The following illustrates the overall binning algorithm using a non-uniform graphical density display [7]: - Determine the density distribution and value ranges in x and y directions. - Assign the number of bins in x and y directions and compute the bin size based on data density distributions and value ranges. - Determine bin width according to the total window width divided by the number of bins on the x-axis. - Determine bin height for each row according to the maximum number of data points of all bins in the corresponding row. In our current application, the bins on the x-axis use equal widths based on the window size. The bins on the y-axis have different heights according to the maximum number of data points within the bins in the row. For example, in a fraud dataset, a data point P with (x, y) = (172K, 35M) is positioned within the bin (20K-525.9K, 1M-66.55M) where x=20k-525.9k and y=1m-66.55m in Figures 3 and 5. Figure 3: From Tradition Scatter Plots with Overlapping to Variable Binned Scatter Plots without Overlapping 4

5 3) Placement and Grouping Variable binned scatter plots place the non-overlapping data points according to their x and y coordinates within the corresponding bins. The overlapping data points are sorted according to the value of the third attribute to form groups in two-dimensional space. The placement algorithm uses the available space around the already occupied data points to compute the best location for the data points that would otherwise be overlapping in a traditional scatter plot. Data points with the same x and y coordinates are sorted and placed in nearby neighborhood according to the similarity of the third attribute (color). Figure 3A illustrates 11 data points with the same (x, y) coordinates. The data point P is overlapped by the data points P1 through P10. Overlap causes two problems in visualizing data distributions and patterns: (1) the number of overlaid data points is unknown and (2) the value of overlaid data points is not visible. Figure 3B shows how to place the overlapping data points to form a square group around the data point P ordered by the third variable values. If the neighborhood position is already occupied (such as two data points are conflict/interest each other), then the bin axes will be proportionally enlarged and will push the already occupied data points away along the x (toward right) and y (toward top) directions. As the result of this placement process, a red square for sales region 6 is constructed from the 11 overlapped data points. 4) Recursive Variable Binned Scatter Plots Users can select one or more adjacent bins to construct a subset of the variable binned scatter plot. The algorithm uses the data points from the selected bins to generate a new variable binned scatter plot for the selected data points with a new binning adapted to the data distribution of the selected data. The (x, y) coordinates are defined by the value range of the data points from the selected bins. The color remains the same. The scale is computed with a refined scale based on the new data ranges. Figure 4 shows a sequence of recursive variable binned scatter plots (plots #1, #2 and #3) in a credit card fraud analysis application (section 4.1), generated by the user rubber-banding a high density area (x=1-100, y=2k-10k, color=region). In the variable binned scatter plots#2, users are able to detect the correlation between fraud amount and fraud count using the refined scale from fraud amounts in the ranges 2K, 2.2K, 2.5K, to10k. The analyst can further select another group of bins in plot #2 to compare fraud distribution in the refined data ranges, such as x=2k-5k and y=1-20. The result is shown in the plot #3. Figure 4: A Credit Card Fraud Analysis Variable Binned Scatter Plot (x-axis: Fraud Count, y-axis: Fraud Amount, Color: Region 1-6) 5

6 4. APPLICATIONS 4.1 Credit Card Fraud Analysis Fraud is one of the major problems faced by many companies in the banking, insurance, and telephony industries. Large volumes of dollars in fraudulent transactions are processed yearly on credit card payments. Transforming raw transaction data into valuable business intelligence to support fraud analysis will save companies millions of dollars. Fraud analysis specialists require visual analytics tools that help them to better understand fraud behavior, geographical locations, and correlated factors as well as identify exceptions. Typical questions in fraud analysis are: Q1. What is the fraud distribution and which are the most correlated attributes? Q2. Are there any outliers and what are their causes? Q3. Which sales regions and purchase amount have the most fraud? Plot#1 in Figure 4 shows a binned scatter plot with 70,465 fraud records. Analysts use it to analyze fraud distributions and correlations among different attributes (i.e., amount, count, and region) to answer the first question. In a variable binned scatter plot, each fraud data point is represented by a pixel. Because there are no overlapping data points in a variable binned scatter plot, analysts are able to visualize fraud distribution at each data point along the x and y directions. The binning of the x and y direction is determined according to the fraud amount and fraud count. The color of a data point represents the sales region (1-6) where the fraud occurred. Plot#1 shows that the fraud amount is almost increases linearly with fraud count. The fraud amount is highly impacted by the fraud count. However, there is an outlier at the top left bin (x=1-100, y=1m-66.55m) with a low fraud count of 5 but a very high fraud amount of \$ M that occurred in region 6 (red). To answer the second question, analysts can drill down to the credit card payment which might be a potential problem or error. Data points that would be overlapped in traditional scatter plots are now represented as clusters distributed inside a bin. Bin(x=1-100, y=1-1k) has three large clusters from regions 1-5 (orange, yellow, and green) In order to answer the third question on finding which sales region in which fraud amount range (bin) has the most fraud, we optimize data point placement, so that data points from the same sales regions (colors) are placed together (the technique is described in Section 3) for analysts to see the fraud regional distribution. There are three large clusters (orange, yellow, and green) in bin (x=1-100, y=1-1k). Sales region 1 (green) has the lowest fraud amount and count. The smallest cluster is sales region 6 (red) but with the highest fraud amount and count. To find which fraud amount and region have the most fraud, analysts can first select a bin, such as bin (x=20k K, y=1m-66.55) and then recursively drilldown to generate the binned scatter plots #2 and #3 with refined scales. From plot #2 and plot#3, analysts can learn that the most frauds came from region 6 (red) and the purchase amount is above \$50M. Using the above information, the company is able to place strict control on certain sales regions and purchase amount, such as sales region 6 and a purchase amount above \$1M. 4.2 Data Center Thermal Monitoring Cooling is the major operational cost in a data center. The chiller consumes over 600 KW of power in order to keep a normal temperature for the daily IT load in a data center with 500 racks and 11 air conditioning units. Chillers consume power to extract heat from the warm water and provide cold water to the air conditioning units to keep the data center temperature cool. Visual monitoring of the utilization of chillers and power consumption and their impacts on temperatures can greatly reduce operating expenses and equipment downtime. Questions of a data center service manager s frequent concern are: Q1. What is our daily power consumption? How do we optimize the cooling system performance? Q2. How is the chiller operating? Are there any problems? Q3 What are the cause-effects of the power consumption on temperature? To answer the above three questions, we have used variable binned scatter plots to enable administrators to visualize the relationships and impacts among these three thermal factors: temperature, power consumption, and chiller utilization. Figure 5A shows a time series scatter plot. Each data point is represented by a pixel and defined with three attributes (x, y, color) where the x-axis represents the time line from 9/3 to 9/8. The y-axis represents temperature. The color of the data point represents the power consumption which runs the chiller, from low (green) to medium (yellow, orange) to high (red). With variable binned scatter plots, the overlapping data points are sorted and placed close to their original locations as described in Section 3. 6

7 5A: Data Center Temperature Time Series Visual Analysis using Variable Binned Scatter Plots (x-axis: days (9/3 to 9/8, y-axis: Temperature O F Power Consumption KW, color: Power Consumption KW) 5B: Power Consumption Visual Analytics using Recursive Binned Scatter Plots Generated from the three rubber-band areas in Figure 5A (x-axis: days 9/3 to 9/5, y-axis: Temperature O F, color: Power Consumption KW) 5C: Visual Analysis of Hourly Temperature and Power Cost-Effect using Recursive Binned Scatter Plots Generated from the 9/3 time series in Figure 5A (x-axis: day 9/3, y-axis: Temperature O F, color: Power Consumption KW) Figure 5: Data Center Thermal Monitoring Variable Binned Scatter Plots 7

9 4. Variable binned scatter plots provide a recursive drill down capability which allows analysts to view detailed information by using refined scales. Figure 6: Evaluation of HexBin (6A, 6B) and Smoothed Contour (6C, 6D) Scatter Plots with Variable Binned Scatter Plots (6E, 6F) 9

10 The weaknesses of the variable binned scatter plot include: 1. The HexBin and smooth contour scatter plots show a better trend line than the variable binned scatter plot. Trend line is visible with the variable binned scatter plots, but requires users to follow the bins (value ranges). 2. The HexBin and smooth contour scatter plots use different shading to visualize data density while the variable binned scatter plots introduce some distortion to visualize all data points. In summary, both HexBin and smooth contour scatter plots are able to provide a quick overview of data density and correlations. Variable binned scatter plots visualize the entire data distribution but also allow access to each individual data point for users to retrieve information at the record level. These three variants of scatter plots complement one another. 6. CONCLUSION In this paper, we introduce variable binned scatter plots with recursive drill down for a visual analysis of data distributions and cause-effect of multi-dimensional data in addition to correlation between pairs of variables. Variable binned scatter plots resolve the overlapping issue and allow users to visually analyze large data sets at the record level. A minimal distortion is introduced to provide space for the overlapping data points. We are able to use color for a third attribute of the data to help analysts quickly identify patterns and clusters. The recursive drill down capability of the variable binned scatter plots provides detailed information with refined scales for analysts to further analyze important areas in a plot. An evaluation of the recent HexBin and smooth contour scatter plots demonstrates the benefits of using variable binned scatter plots to reveal clusters and distribution in a high density area. Our future work will be in the area of visual prediction using scatter plots to study the trends. ACKNOWLEDGMENTS The authors wish to thank Meichun Hsu for her encouragements, and thank Alex Zhang, Manish Marwah and Cullen E. Bash for providing comments and suggestions. REFERENCES [1] Bachthaler S. and Weiskopf D. Continuous Scatterplots, IEEE Transactions on Visualization and Computer Graphics, Vol. 14, No. 6, November/December [2] Bowman, A.W. and Azzalini, A. (1997). Applied Smoothing Techniques for Data Analysis: the Kernel Approach With S-Plus Illustrations. Oxford University Press, Oxford. [3] Bowman, A.W. and Azzalini, A. (2003). Computational aspects of nonparametric smoothing with illustrations from the sm library. Computational Statistics and Data Analysis, 42, [4] Carr, D. B., Littlefield, R. J., Nicholson W. L., and Kuttkefuekdm J. S. Scatterplot Matrix Techniques for Large N, Journal of American Statistics Association 82, [5] Cleveland W. S., The Many Faces of a Scatterplot. Robert McGill Journal of the American Statistical Association, Vol. 79, No. 388 (Dec. 1984), pp [6] Hao, M., Dayal, U., Keim, D. A., and Morent, D., Intelligent Visual Analytics Queries. IEEE Symposium on Visual Analytics Science and Technology, pp , [7] Hao, M., Dayal, U., Method for visualizing graphical data sets having a non-uniform graphical density for display. US patent number 7,046,247 issued in May, [8] HexBin Scatter Plot released by R System in January, 2009, documented [9] JMP 8 Software. new 64-bit computers and visual analytics tools. [10] Keim, D. A.: Designing Pixel-oriented Visualization Techniques: Theory and Applications, IEEE Transactions on Visualization and Computer Graphics (TVCG), Vol. 6, No. 1, pp , January- March, [11] Keim, D. A., Kriegel, H. P., and Ankerst, M. Recursive pattern: A Technique for Visualizing Very Large Amounts of Data. Proc. IEEE Visualization, pp , [12] Unwin A., Martin T., Heike H, Graphics of Large Datasets, Springer, NY, 2006, pp [13] Wilkirson, L. The grammar of Graphics, New York, Springer, Singh, Mala, MrExcel, Ohio, USA. 10

### Visual Mining of E-Customer Behavior Using Pixel Bar Charts

Visual Mining of E-Customer Behavior Using Pixel Bar Charts Ming C. Hao, Julian Ladisch*, Umeshwar Dayal, Meichun Hsu, Adrian Krug Hewlett Packard Research Laboratories, Palo Alto, CA. (ming_hao, dayal)@hpl.hp.com;

### ABSTRACT. 1. Introduction

A Visual Analysis of Multi-Attribute Data Using Pixel Matrix Displays Ming C. Hao, Umeshwar Dayal, Daniel Keim*, Tobias Schreck* Hewlett Packard Laboratories, Palo Alto, CA (ming.hao, umeshwar.dayal)@hp.com

### Mobile Visual Analytics for Data Center Power and Cooling Management

Mobile Visual Analytics for Data Center Power and Cooling Management Ratnesh Sharma, Ming Hao, Ravigopal Vennelakanti, Manish Gupta, Umeshwar Dayal, Cullen Bash, Chandrakant Patel, Deepa Naik, A Jayakumar,

### Application of Visual Analytics For Thermal State Management in Large Data Centers

Eurographics/ IEEE-VGTC Symposium on Visualization (2009), pp. 1-8 Hans-Christian Hege, Ingrid Hotz, and Tamara Munzner (Co-Chairs) M. C. Hao 1, R. K. Sharma 1, U. Dayal 1, C. Patel 1, R. Vennelakanti

### Pushing the limit in Visual Data Exploration: Techniques and Applications

Pushing the limit in Visual Data Exploration: Techniques and Applications Daniel A. Keim 1, Christian Panse 1, Jörn Schneidewind 1, Mike Sips 1, Ming C. Hao 2, and Umeshwar Dayal 2 1 University of Konstanz,

### Extend Table Lens for High-Dimensional Data Visualization and Classification Mining

Extend Table Lens for High-Dimensional Data Visualization and Classification Mining CPSC 533c, Information Visualization Course Project, Term 2 2003 Fengdong Du fdu@cs.ubc.ca University of British Columbia

### THE rapid increase of automatically collected data has led

IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, VOL. 13, NO. 4, JULY/AUGUST 2007 1 Value-Cell Bar Charts for Visualizing Large Transaction Data Sets Daniel A. Keim, Member, IEEE Computer Society,

### Information Visualization Multivariate Data Visualization Krešimir Matković

Information Visualization Multivariate Data Visualization Krešimir Matković Vienna University of Technology, VRVis Research Center, Vienna Multivariable >3D Data Tables have so many variables that orthogonal

### Data Exploration Data Visualization

Data Exploration Data Visualization What is data exploration? A preliminary exploration of the data to better understand its characteristics. Key motivations of data exploration include Helping to select

### Visual exploration of frequent patterns in multivariate time series

430769IVI0010.1177/1473871611430769Hao et al.information Visualization 2011 Article Visual exploration of frequent patterns in multivariate time series Information Visualization 0(0) 1 13 The Author(s)

### AMPLIO VQA A Web Based Visual Query Analysis System for Micro Grid Energy Mix Planning

International Workshop on Visual Analytics (2012) K. Matkovic and G. Santucci (Editors) AMPLIO VQA A Web Based Visual Query Analysis System for Micro Grid Energy Mix Planning A. Stoffel 1 and L. Zhang

### Visual Exploration of Frequent Patterns in Multivariate Time Series

Visual Exploration of Frequent Patterns in Multivariate Time Series Ming C. Hao 1, Manish Marwah 1, Halldór Janetzko 2, Umeshwar Dayal 1, Daniel A. Keim 2, Debprakash Patnaik 3, Naren Ramakrishnan 3 and

### RnavGraph: A visualization tool for navigating through high-dimensional data

Int. Statistical Inst.: Proc. 58th World Statistical Congress, 2011, Dublin (Session IPS117) p.1852 RnavGraph: A visualization tool for navigating through high-dimensional data Waddell, Adrian University

### 3D Interactive Information Visualization: Guidelines from experience and analysis of applications

3D Interactive Information Visualization: Guidelines from experience and analysis of applications Richard Brath Visible Decisions Inc., 200 Front St. W. #2203, Toronto, Canada, rbrath@vdi.com 1. EXPERT

### Data Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining

Data Mining: Exploring Data Lecture Notes for Chapter 3 Introduction to Data Mining by Tan, Steinbach, Kumar What is data exploration? A preliminary exploration of the data to better understand its characteristics.

### Data Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining

Data Mining: Exploring Data Lecture Notes for Chapter 3 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 8/05/2005 1 What is data exploration? A preliminary

### Clustering & Visualization

Chapter 5 Clustering & Visualization Clustering in high-dimensional databases is an important problem and there are a number of different clustering paradigms which are applicable to high-dimensional data.

### Exploratory Spatial Data Analysis

Exploratory Spatial Data Analysis Part II Dynamically Linked Views 1 Contents Introduction: why to use non-cartographic data displays Display linking by object highlighting Dynamic Query Object classification

### Visual Data Mining with Pixel-oriented Visualization Techniques

Visual Data Mining with Pixel-oriented Visualization Techniques Mihael Ankerst The Boeing Company P.O. Box 3707 MC 7L-70, Seattle, WA 98124 mihael.ankerst@boeing.com Abstract Pixel-oriented visualization

### Visualization Techniques in Data Mining

Tecniche di Apprendimento Automatico per Applicazioni di Data Mining Visualization Techniques in Data Mining Prof. Pier Luca Lanzi Laurea in Ingegneria Informatica Politecnico di Milano Polo di Milano

Table of contents: Access Data for Analysis Data file types Format assumptions Data from Excel Information links Add multiple data tables Create & Interpret Visualizations Table Pie Chart Cross Table Treemap

### Easy Data Discovery with Smart Data Transitions

Thought Leadership Easy Data Discovery with Smart Data Transitions Data Animations and Transitions What does your data communicate? Can you see the correlations and insights hiding in your data? Visual

### Appendix E: Graphing Data

You will often make scatter diagrams and line graphs to illustrate the data that you collect. Scatter diagrams are often used to show the relationship between two variables. For example, in an absorbance

### Visualization methods for patent data

Visualization methods for patent data Treparel 2013 Dr. Anton Heijs (CTO & Founder) Delft, The Netherlands Introduction Treparel can provide advanced visualizations for patent data. This document describes

### COM CO P 5318 Da t Da a t Explora Explor t a ion and Analysis y Chapte Chapt r e 3

COMP 5318 Data Exploration and Analysis Chapter 3 What is data exploration? A preliminary exploration of the data to better understand its characteristics. Key motivations of data exploration include Helping

### Microsoft Business Intelligence Visualization Comparisons by Tool

Microsoft Business Intelligence Visualization Comparisons by Tool Version 3: 10/29/2012 Purpose: Purpose of this document is to provide a quick reference of visualization options available in each tool.

### Robust Panoramic Image Stitching

Robust Panoramic Image Stitching CS231A Final Report Harrison Chau Department of Aeronautics and Astronautics Stanford University Stanford, CA, USA hwchau@stanford.edu Robert Karol Department of Aeronautics

### Maps are an excellent means of presenting statistical information. Not only are they visually attractive, they also:

Statistical Maps: Best Practice Why draw a map? Maps are an excellent means of presenting statistical information. Not only are they visually attractive, they also: make it easier for users to relate data

### Seeing by Degrees: Programming Visualization From Sensor Networks

Seeing by Degrees: Programming Visualization From Sensor Networks Da-Wei Huang Michael Bobker Daniel Harris Engineer, Building Manager, Building Director of Control Control Technology Strategy Development

### Biostatistics: A QUICK GUIDE TO THE USE AND CHOICE OF GRAPHS AND CHARTS

Biostatistics: A QUICK GUIDE TO THE USE AND CHOICE OF GRAPHS AND CHARTS 1. Introduction, and choosing a graph or chart Graphs and charts provide a powerful way of summarising data and presenting them in

### Big Data: Rethinking Text Visualization

Big Data: Rethinking Text Visualization Dr. Anton Heijs anton.heijs@treparel.com Treparel April 8, 2013 Abstract In this white paper we discuss text visualization approaches and how these are important

### Specific Usage of Visual Data Analysis Techniques

Specific Usage of Visual Data Analysis Techniques Snezana Savoska 1 and Suzana Loskovska 2 1 Faculty of Administration and Management of Information systems, Partizanska bb, 7000, Bitola, Republic of Macedonia

### Density Distribution Sunflower Plots

Density Distribution Sunflower Plots William D. Dupont* and W. Dale Plummer Jr. Vanderbilt University School of Medicine Abstract Density distribution sunflower plots are used to display high-density bivariate

### Scientific Graphing in Excel 2010

Scientific Graphing in Excel 2010 When you start Excel, you will see the screen below. Various parts of the display are labelled in red, with arrows, to define the terms used in the remainder of this overview.

### CSU, Fresno - Institutional Research, Assessment and Planning - Dmitri Rogulkin

My presentation is about data visualization. How to use visual graphs and charts in order to explore data, discover meaning and report findings. The goal is to show that visual displays can be very effective

### Principles of Data Visualization for Exploratory Data Analysis. Renee M. P. Teate. SYS 6023 Cognitive Systems Engineering April 28, 2015

Principles of Data Visualization for Exploratory Data Analysis Renee M. P. Teate SYS 6023 Cognitive Systems Engineering April 28, 2015 Introduction Exploratory Data Analysis (EDA) is the phase of analysis

### Using Data Mining for Mobile Communication Clustering and Characterization

Using Data Mining for Mobile Communication Clustering and Characterization A. Bascacov *, C. Cernazanu ** and M. Marcu ** * Lasting Software, Timisoara, Romania ** Politehnica University of Timisoara/Computer

### Presenting numerical data

Student Learning Development Presenting numerical data This guide offers practical advice on how to incorporate numerical information into essays, reports, dissertations, posters and presentations. The

### TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM

TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM Thanh-Nghi Do College of Information Technology, Cantho University 1 Ly Tu Trong Street, Ninh Kieu District Cantho City, Vietnam

### Chapter 2 - Graphical Summaries of Data

Chapter 2 - Graphical Summaries of Data Data recorded in the sequence in which they are collected and before they are processed or ranked are called raw data. Raw data is often difficult to make sense

### Environmental Remote Sensing GEOG 2021

Environmental Remote Sensing GEOG 2021 Lecture 4 Image classification 2 Purpose categorising data data abstraction / simplification data interpretation mapping for land cover mapping use land cover class

### Visualization Quick Guide

Visualization Quick Guide A best practice guide to help you find the right visualization for your data WHAT IS DOMO? Domo is a new form of business intelligence (BI) unlike anything before an executive

### Image Content-Based Email Spam Image Filtering

Image Content-Based Email Spam Image Filtering Jianyi Wang and Kazuki Katagishi Abstract With the population of Internet around the world, email has become one of the main methods of communication among

### Iris Sample Data Set. Basic Visualization Techniques: Charts, Graphs and Maps. Summary Statistics. Frequency and Mode

Iris Sample Data Set Basic Visualization Techniques: Charts, Graphs and Maps CS598 Information Visualization Spring 2010 Many of the exploratory data techniques are illustrated with the Iris Plant data

### Supervised and unsupervised learning - 1

Chapter 3 Supervised and unsupervised learning - 1 3.1 Introduction The science of learning plays a key role in the field of statistics, data mining, artificial intelligence, intersecting with areas in

### Excel Tutorial. Bio 150B Excel Tutorial 1

Bio 15B Excel Tutorial 1 Excel Tutorial As part of your laboratory write-ups and reports during this semester you will be required to collect and present data in an appropriate format. To organize and

### Cumulative Curves for Exploration of Demographic Data: a Case Study of Northwest England

Cumulative Curves for Exploration of Demographic Data: a Case Study of Northwest England Natalia Andrienko and Gennady Andrienko Fraunhofer AIS - Autonomous Intelligent Systems Institute, SPADE Spatial

### Visually driven analysis of movement data by progressive clustering

Visually driven analysis of movement data by progressive clustering Extended Abstract S. Rinzivillo D. Pedreschi M. Nanni F. Giannotti N. Andrienko G. Andrienko KDD Lab, University of Pisa {rinziv,padre}@di.unipi.it

### 20 A Visualization Framework For Discovering Prepaid Mobile Subscriber Usage Patterns

20 A Visualization Framework For Discovering Prepaid Mobile Subscriber Usage Patterns John Aogon and Patrick J. Ogao Telecommunications operators in developing countries are faced with a problem of knowing

### Enhancing the Visual Clustering of Query-dependent Database Visualization Techniques using Screen-Filling Curves

Enhancing the Visual Clustering of Query-dependent Database Visualization Techniques using Screen-Filling Curves Extended Abstract Daniel A. Keim Institute for Computer Science, University of Munich Leopoldstr.

### Multi-Dimensional Data Visualization. Slides courtesy of Chris North

Multi-Dimensional Data Visualization Slides courtesy of Chris North What is the Cleveland s ranking for quantitative data among the visual variables: Angle, area, length, position, color Where are we?!

### VISUALIZING HIERARCHICAL DATA. Graham Wills SPSS Inc., http://willsfamily.org/gwills

VISUALIZING HIERARCHICAL DATA Graham Wills SPSS Inc., http://willsfamily.org/gwills SYNONYMS Hierarchical Graph Layout, Visualizing Trees, Tree Drawing, Information Visualization on Hierarchies; Hierarchical

### Using kernel methods to visualise crime data

Submission for the 2013 IAOS Prize for Young Statisticians Using kernel methods to visualise crime data Dr. Kieran Martin and Dr. Martin Ralphs kieran.martin@ons.gov.uk martin.ralphs@ons.gov.uk Office

### Lecture 20: Clustering

Lecture 20: Clustering Wrap-up of neural nets (from last lecture Introduction to unsupervised learning K-means clustering COMP-424, Lecture 20 - April 3, 2013 1 Unsupervised learning In supervised learning,

### Kriging Interpolation

Kriging Interpolation Kriging is a geostatistical interpolation technique that considers both the distance and the degree of variation between known data points when estimating values in unknown areas.

### Introduction. One of the most convincing and appealing ways in which statistical results may be presented is through diagrams and graphs.

Introduction One of the most convincing and appealing ways in which statistical results may be presented is through diagrams and graphs. Just one diagram is enough to represent a given data more effectively

### Data representation and analysis in Excel

Page 1 Data representation and analysis in Excel Let s Get Started! This course will teach you how to analyze data and make charts in Excel so that the data may be represented in a visual way that reflects

### Visual quality metrics and human perception: an initial study on 2D projections of large multidimensional data.

Visual quality metrics and human perception: an initial study on 2D projections of large multidimensional data. Andrada Tatu Institute for Computer and Information Science University of Konstanz Peter

### 9. Text & Documents. Visualizing and Searching Documents. Dr. Thorsten Büring, 20. Dezember 2007, Vorlesung Wintersemester 2007/08

9. Text & Documents Visualizing and Searching Documents Dr. Thorsten Büring, 20. Dezember 2007, Vorlesung Wintersemester 2007/08 Slide 1 / 37 Outline Characteristics of text data Detecting patterns SeeSoft

### EXCEL EXERCISE AND ACCELERATION DUE TO GRAVITY

EXCEL EXERCISE AND ACCELERATION DUE TO GRAVITY Objective: To learn how to use the Excel spreadsheet to record your data, calculate values and make graphs. To analyze the data from the Acceleration Due

### GEO-VISUALIZATION SUPPORT FOR MULTIDIMENSIONAL CLUSTERING

Geoinformatics 2004 Proc. 12th Int. Conf. on Geoinformatics Geospatial Information Research: Bridging the Pacific and Atlantic University of Gävle, Sweden, 7-9 June 2004 GEO-VISUALIZATION SUPPORT FOR MULTIDIMENSIONAL

### The Role of Visualization in Effective Data Cleaning

The Role of Visualization in Effective Data Cleaning Yu Qian Dept. of Computer Science The University of Texas at Dallas Richardson, TX 75083-0688, USA qianyu@student.utdallas.edu Kang Zhang Dept. of Computer

### GIS Tutorial 1. Lecture 2 Map design

GIS Tutorial 1 Lecture 2 Map design Outline Choropleth maps Colors Vector GIS display GIS queries Map layers and scale thresholds Hyperlinks and map tips 2 Lecture 2 CHOROPLETH MAPS Choropleth maps Color-coded

### PRACTICAL DATA MINING IN A LARGE UTILITY COMPANY

QÜESTIIÓ, vol. 25, 3, p. 509-520, 2001 PRACTICAL DATA MINING IN A LARGE UTILITY COMPANY GEORGES HÉBRAIL We present in this paper the main applications of data mining techniques at Electricité de France,

### DATA VISUALIZATION GABRIEL PARODI STUDY MATERIAL: PRINCIPLES OF GEOGRAPHIC INFORMATION SYSTEMS AN INTRODUCTORY TEXTBOOK CHAPTER 7

DATA VISUALIZATION GABRIEL PARODI STUDY MATERIAL: PRINCIPLES OF GEOGRAPHIC INFORMATION SYSTEMS AN INTRODUCTORY TEXTBOOK CHAPTER 7 Contents GIS and maps The visualization process Visualization and strategies

### Visualizing Relationships and Connections in Complex Data Using Network Diagrams in SAS Visual Analytics

Paper 3323-2015 Visualizing Relationships and Connections in Complex Data Using Network Diagrams in SAS Visual Analytics ABSTRACT Stephen Overton, Ben Zenick, Zencos Consulting Network diagrams in SAS

### Interactive Information Visualization of Trend Information

Interactive Information Visualization of Trend Information Yasufumi Takama Takashi Yamada Tokyo Metropolitan University 6-6 Asahigaoka, Hino, Tokyo 191-0065, Japan ytakama@sd.tmu.ac.jp Abstract This paper

### Visualization of Breast Cancer Data by SOM Component Planes

International Journal of Science and Technology Volume 3 No. 2, February, 2014 Visualization of Breast Cancer Data by SOM Component Planes P.Venkatesan. 1, M.Mullai 2 1 Department of Statistics,NIRT(Indian

### 1. Microsoft Excel Basics:

1. Microsoft Excel Basics: This section introduces data manipulation using Microsoft Excel, including importing, copying and pasting data and entering equations. A basic understanding of computer operating

### Graphical methods for presenting data

Chapter 2 Graphical methods for presenting data 2.1 Introduction We have looked at ways of collecting data and then collating them into tables. Frequency tables are useful methods of presenting data; they

### VisualCalc Dashboard: Google Analytics Comparison Whitepaper Rev 3.0 October 2007

VisualCalc Dashboard: Google Analytics Comparison Whitepaper Rev 3.0 October 2007 5047 Robert J Mathews Pkwy, Suite 200 El Dorado Hills, California 95762 916.939.2020 www.visualcalc.com Introduction Google

### Natural Neighbour Interpolation

Natural Neighbour Interpolation DThe Natural Neighbour method is a geometric estimation technique that uses natural neighbourhood regions generated around each point in the data set. The method is particularly

### INTRODUCTION. The principles of which are to:

Taking the Pain Out of Chromatographic Peak Integration Shaun Quinn, 1 Peter Sauter, 1 Andreas Brunner, 1 Shawn Anderson, 2 Fraser McLeod 1 1 Dionex Corporation, Germering, Germany; 2 Dionex Corporation,

### Data Visualization Handbook

SAP Lumira Data Visualization Handbook www.saplumira.com 1 Table of Content 3 Introduction 20 Ranking 4 Know Your Purpose 23 Part-to-Whole 5 Know Your Data 25 Distribution 9 Crafting Your Message 29 Correlation

### ENHANCING INTELLIGENCE SUCCESS: DATA CHARACTERIZATION Francine Forney, Senior Management Consultant, Fuel Consulting, LLC May 2013

ENHANCING INTELLIGENCE SUCCESS: DATA CHARACTERIZATION, Fuel Consulting, LLC May 2013 DATA AND ANALYSIS INTERACTION Understanding the content, accuracy, source, and completeness of data is critical to the

### FREE FALL. Introduction. Reference Young and Freedman, University Physics, 12 th Edition: Chapter 2, section 2.5

Physics 161 FREE FALL Introduction This experiment is designed to study the motion of an object that is accelerated by the force of gravity. It also serves as an introduction to the data analysis capabilities

### What is Visualization? Information Visualization An Overview. Information Visualization. Definitions

What is Visualization? Information Visualization An Overview Jonathan I. Maletic, Ph.D. Computer Science Kent State University Visualize/Visualization: To form a mental image or vision of [some

### Galaxy Morphological Classification

Galaxy Morphological Classification Jordan Duprey and James Kolano Abstract To solve the issue of galaxy morphological classification according to a classification scheme modelled off of the Hubble Sequence,

### International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

RESEARCH ARTICLE OPEN ACCESS A Survey of Data Mining: Concepts with Applications and its Future Scope Dr. Zubair Khan 1, Ashish Kumar 2, Sunny Kumar 3 M.Tech Research Scholar 2. Department of Computer

### Comparative Analysis of EM Clustering Algorithm and Density Based Clustering Algorithm Using WEKA tool.

International Journal of Engineering Research and Development e-issn: 2278-067X, p-issn: 2278-800X, www.ijerd.com Volume 9, Issue 8 (January 2014), PP. 19-24 Comparative Analysis of EM Clustering Algorithm

### Storage Management for High Energy Physics Applications

Abstract Storage Management for High Energy Physics Applications A. Shoshani, L. M. Bernardo, H. Nordberg, D. Rotem, and A. Sim National Energy Research Scientific Computing Division Lawrence Berkeley

### an introduction to VISUALIZING DATA by joel laumans

an introduction to VISUALIZING DATA by joel laumans an introduction to VISUALIZING DATA iii AN INTRODUCTION TO VISUALIZING DATA by Joel Laumans Table of Contents 1 Introduction 1 Definition Purpose 2 Data

### University of Arkansas Libraries ArcGIS Desktop Tutorial. Section 2: Manipulating Display Parameters in ArcMap. Symbolizing Features and Rasters:

: Manipulating Display Parameters in ArcMap Symbolizing Features and Rasters: Data sets that are added to ArcMap a default symbology. The user can change the default symbology for their features (point,

### Files Used in this Tutorial

Generate Point Clouds Tutorial This tutorial shows how to generate point clouds from IKONOS satellite stereo imagery. You will view the point clouds in the ENVI LiDAR Viewer. The estimated time to complete

### Interactive Flag Identification using Image Retrieval Techniques

Interactive Flag Identification using Image Retrieval Techniques Eduardo Hart, Sung-Hyuk Cha, Charles Tappert CSIS, Pace University 861 Bedford Road, Pleasantville NY 10570 USA E-mail: eh39914n@pace.edu,

### Formulas, Functions and Charts

Formulas, Functions and Charts :: 167 8 Formulas, Functions and Charts 8.1 INTRODUCTION In this leson you can enter formula and functions and perform mathematical calcualtions. You will also be able to

### RUN-LENGTH ENCODING FOR VOLUMETRIC TEXTURE

RUN-LENGTH ENCODING FOR VOLUMETRIC TEXTURE Dong-Hui Xu, Arati S. Kurani, Jacob D. Furst, Daniela S. Raicu Intelligent Multimedia Processing Laboratory, School of Computer Science, Telecommunications, and

### Visibility optimization for data visualization: A Survey of Issues and Techniques

Visibility optimization for data visualization: A Survey of Issues and Techniques Ch Harika, Dr.Supreethi K.P Student, M.Tech, Assistant Professor College of Engineering, Jawaharlal Nehru Technological

### DRAFT. Encoding Images. Teaching Summary

6 Unit 1: A Bit of Everything Lesson 6 Encoding Images Lesson time: 180 Minutes (4 days) LESSON OVERVIEW: In this lesson, students will explore images and participate in creating an image file format.

### CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES 2: Data Pre-Processing Instructor: Yizhou Sun yzsun@ccs.neu.edu September 7, 2014 2: Data Pre-Processing Getting to know your data Basic Statistical Descriptions of Data

### Visual Discovery in Multivariate Binary Data

Visual Discovery in Multivariate Binary Data Boris Kovalerchuk a*, Florian Delizy a, Logan Riggs a, Evgenii Vityaev b a Dept. of Computer Science, Central Washington University, Ellensburg, WA, 9896-7520,

### Microsoft Excel 2010 Charts and Graphs

Microsoft Excel 2010 Charts and Graphs Email: training@health.ufl.edu Web Page: http://training.health.ufl.edu Microsoft Excel 2010: Charts and Graphs 2.0 hours Topics include data groupings; creating

### Data Visualization Techniques

Data Visualization Techniques From Basics to Big Data with SAS Visual Analytics WHITE PAPER SAS White Paper Table of Contents Introduction.... 1 Generating the Best Visualizations for Your Data... 2 The

### Session 7 Bivariate Data and Analysis

Session 7 Bivariate Data and Analysis Key Terms for This Session Previously Introduced mean standard deviation New in This Session association bivariate analysis contingency table co-variation least squares

### Excel: Creating Charts

Excel: Creating Charts TABLE OF CONTENTS CHARTING IN EXCEL...1 WHICH CHART TO USE?!...1 SOME DEFINITIONS OF TERMINOLOGY RELATING TO CHARTS...3 CREATING A CHART...5 CHART WIZARD...5 ADD DATA TO A CHART...7

### SuperViz: An Interactive Visualization of Super-Peer P2P Network

SuperViz: An Interactive Visualization of Super-Peer P2P Network Anthony (Peiqun) Yu pqyu@cs.ubc.ca Abstract: The Efficient Clustered Super-Peer P2P network is a novel P2P architecture, which overcomes

### HDDVis: An Interactive Tool for High Dimensional Data Visualization

HDDVis: An Interactive Tool for High Dimensional Data Visualization Mingyue Tan Department of Computer Science University of British Columbia mtan@cs.ubc.ca ABSTRACT Current high dimensional data visualization