Lecture 26: Data visualization

Similar documents
Diagrams and Graphs of Statistical Data

Pie Charts. proportion of ice-cream flavors sold annually by a given brand. AMS-5: Statistics. Cherry. Cherry. Blueberry. Blueberry. Apple.

Examples of Data Representation using Tables, Graphs and Charts

Choosing a successful structure for your visualization

CSU, Fresno - Institutional Research, Assessment and Planning - Dmitri Rogulkin

Unit 9 Describing Relationships in Scatter Plots and Line Graphs

This file contains 2 years of our interlibrary loan transactions downloaded from ILLiad. 70,000+ rows, multiple fields = an ideal file for pivot

an introduction to VISUALIZING DATA by joel laumans

Principles of Data Visualization

Chapter 1: Looking at Data Section 1.1: Displaying Distributions with Graphs

CS171 Visualization. The Visualization Alphabet: Marks and Channels. Alexander Lex [xkcd]

Visualization Software

Data Visualization. BUS 230: Business and Economic Research and Communication

Visualizing Data. Contents. 1 Visualizing Data. Anthony Tanbakuchi Department of Mathematics Pima Community College. Introductory Statistics Lectures

R Graphics Cookbook. Chang O'REILLY. Winston. Tokyo. Beijing Cambridge. Farnham Koln Sebastopol

Summarizing and Displaying Categorical Data

Tutorial 3: Graphics and Exploratory Data Analysis in R Jason Pienaar and Tom Miller

Quantitative vs. Categorical Data: A Difference Worth Knowing Stephen Few April 2005

Scatter Plots with Error Bars

Chapter 2: Frequency Distributions and Graphs

Data Visualization Best Practice. Sophie Sparkes Data Analyst

Density Distribution Sunflower Plots

Mathematical goals. Starting points. Materials required. Time needed

What Does the Normal Distribution Sound Like?

Table of Contents Find the story within your data

Visualization Quick Guide

Describing, Exploring, and Comparing Data

Exercise 1.12 (Pg )

Number Visualization

Designing science graphs for data analysis and presentation

Data Visualization. Introductions

Lecture 2: Descriptive Statistics and Exploratory Data Analysis

Using SPSS, Chapter 2: Descriptive Statistics

CHARTS AND GRAPHS INTRODUCTION USING SPSS TO DRAW GRAPHS SPSS GRAPH OPTIONS CAG08

3D Interactive Information Visualization: Guidelines from experience and analysis of applications

Principles of Data Visualization for Exploratory Data Analysis. Renee M. P. Teate. SYS 6023 Cognitive Systems Engineering April 28, 2015

Common Mistakes in Data Presentation Stephen Few September 4, 2004

MARS STUDENT IMAGING PROJECT

Data Visualization. Scientific Principles, Design Choices and Implementation in LabKey. Cory Nathe Software Engineer, LabKey

GRAPHING DATA FOR DECISION-MAKING

Valor Christian High School Mrs. Bogar Biology Graphing Fun with a Paper Towel Lab

Good Graphs: Graphical Perception and Data Visualization

Sometimes We Must Raise Our Voices

The Big Picture. Describing Data: Categorical and Quantitative Variables Population. Descriptive Statistics. Community Coalitions (n = 175)

Variable: characteristic that varies from one individual to another in the population

Module 2: Introduction to Quantitative Data Analysis

Common Tools for Displaying and Communicating Data for Process Improvement

Descriptive statistics Statistical inference statistical inference, statistical induction and inferential statistics

Part 1: Background - Graphing

Biology 300 Homework assignment #1 Solutions. Assignment:

Demographics of Atlanta, Georgia:

Iris Sample Data Set. Basic Visualization Techniques: Charts, Graphs and Maps. Summary Statistics. Frequency and Mode

Visualizing Multidimensional Data Through Time Stephen Few July 2005

Data representation and analysis in Excel

Multi-Dimensional Data Visualization. Slides courtesy of Chris North

STATS8: Introduction to Biostatistics. Data Exploration. Babak Shahbaba Department of Statistics, UCI

The right edge of the box is the third quartile, Q 3, which is the median of the data values above the median. Maximum Median

Gestation Period as a function of Lifespan

Data Visualization in R

Statistics Revision Sheet Question 6 of Paper 2

Data Visualization Handbook

Data Visualization. Steve Marschner Cornell CS 3220

Data Exploration Data Visualization

DESCRIPTIVE STATISTICS. The purpose of statistics is to condense raw data to make it easier to answer specific questions; test hypotheses.

A Correlation of. to the. South Carolina Data Analysis and Probability Standards

Session 7 Bivariate Data and Analysis

Excel Chart Best Practices

Exploratory Spatial Data Analysis

Fairfield Public Schools

Data Visualization Techniques

Creating Bar Charts and Pie Charts Excel 2010 Tutorial (small revisions 1/20/14)

Dr. Lisa White

Graphing in SAS Software

Information visualization examples

HISTOGRAMS, CUMULATIVE FREQUENCY AND BOX PLOTS

Quantitative Displays for Combining Time-Series and Part-to-Whole Relationships

Best Practices in Data Visualizations. Vihao Pham 2014

AP Statistics Solutions to Packet 2

Best Practices in Data Visualizations. Vihao Pham January 29, 2014

Numbers as pictures: Examples of data visualization from the Business Employment Dynamics program. October 2009

Bar Graphs and Dot Plots

ABOUT THIS DOCUMENT ABOUT CHARTS/COMMON TERMINOLOGY

Continuous variables can have a very large number of possible values, potentially any value in an interval.

Microsoft Excel 2010 Charts and Graphs

How To Design A Graph

Visualization. For Novices. ( Ted Hall ) University of Michigan 3D Lab Digital Media Commons, Library

Part 2: Data Visualization How to communicate complex ideas with simple, efficient and accurate data graphics

1) Write the following as an algebraic expression using x as the variable: Triple a number subtracted from the number

Information Visualization Multivariate Data Visualization Krešimir Matković

Exploratory Data Analysis

Sta 309 (Statistics And Probability for Engineers)

Microsoft Business Intelligence Visualization Comparisons by Tool

MTH 140 Statistics Videos

Disseminating Research and Writing Research Proposals

Unit 7: Normal Curves

Introduction to Data Visualization

6. Decide which method of data collection you would use to collect data for the study (observational study, experiment, simulation, or survey):

Expression. Variable Equation Polynomial Monomial Add. Area. Volume Surface Space Length Width. Probability. Chance Random Likely Possibility Odds

Creating Charts in Microsoft Excel A supplement to Chapter 5 of Quantitative Approaches in Business Studies

Transcription:

Lecture 26: Data visualization STATS 202: Data mining and analysis December 5, 2016

Why does visualization matter? Large size of data makes it necessary to provide summaries. People prefer to look at figures rather than numbers. The two most important functions of visualizations are: 1. Communicate information 2. Support reasoning about data

Communicate information

Support reasoning about about data On January 28, 1986, the space shuttle Challenger exploded because two rubber O-rings leaked due to the very cold temperatures at launch day. This potential problem was discussed the day before the launch: Engineers opposed launching based on data from previous launches, and provided 13 charts to NASA to support their case. However, it is difficult to assess the relationship between temperature and O-ring damage based on these charts.

Support reasoning about about data A visual display of the data from the investigation after the launch. The poor design and use of chart junk makes it difficult to assess the relationship between temperature and O-ring damage.

Support reasoning about about data A simple but pertinent display supports reasoning about the data. (from Tufte)

Elementary ways to display univariate data Pie chart: the size of the angle is proportional to the relative frequency Bar graph: the height of the bar is proportional to the frequency General rule: Use a bar graph whenever the data can be ordered, e.g. numbers. For qualitative data, e.g. colors, a pie chart may be more appropriate.

A histogram displays quantitative data like a bar graph, but it allows for unequal block lengths. Main principle: Area is proportional to frequency. So the percentage falling into a block can be figured without a vertical scale since the total area equals 100%. But it s helpful to have a vertical scale (density scale). Its unit is % per unit, so % per year in above example.

The histogram gives two kinds of information about the data: 1. Density (crowding): The height of the bar tells how many subjects are for one unit on the horizontal scale. For example, the highest density is around age 19 as.04 = 4% of all subjects have ages in the range 19 20 years. In contrast, only about 0.7% of subjects fall into each one year range for ages 60 80.

The histogram gives two kinds of information about the data: 2. Percentages (relative frequences): Those are given by area = height x width. For example, about 14% of all subjects fall into the age range 60 80, because the corresponding area is (20 years) x (0.7 % per year)=14 %. Alternatively, you can find this answer by eyeballing that the area makes up roughly 1/7 of the total area of the histogram, so roughly 1/7=14% of all subjects fall in that range.

The box plot (or: box-and-whisker plot) is a more compact graphical summary depicting 5 numbers: the minimum and maximum, and the 1st, 2nd and 3rd quartiles. The box plot gives an indication of skewness in the data.

The compact display of the box plot makes it well suited for comparing several data sets: This is an example of the principle of small multiples, which we will discuss later.

For bivariate data, the standard display is the scatterplot. It visualizes association between the two variables.

If there are many points then one needs to use shading to avoid big blobs: Scatterplot of latitude and longitude of visitors to a web site. Note how faint grid lines help to interpret the data.

For more than 2 variables sometimes a useful visualization obtains by using scatter + (size,color,... ) mtcars in R with ggplot

An alternative is the scatter plot matrix:

The principle of displaying small multiples is well suited for visualizing differences in objects or changes over time as the human brain is good at processing small multiples.

Graphics software makes it tempting to produce showy but poor visualizations from the New York Times (4/26/2010): We Have Met the Enemy and He Is PowerPoint.

Graphics software makes it tempting to produce showy but poor visualizations

What is good way to visualize data? There is a large number of ways to display data. Which displays are good and why? Cleveland (who coined the name Data Science around the time you were born) and McGill developed a simple taxonomy based on graphical perception: They identified ten elementary perceptual tasks and then they did experiments to find out which tasks are easy or difficult for the human brain. This provides a guideline for the use of existing and the construction of new visualization techniques.

The 10 elementary perceptual tasks

Examples of the elementary perceptual tasks Position on a common scale: bar graph Angle: pie chart Shading: statistical maps display information as a function of geographical location:

Ranking of the 10 elementary tasks 1. Position along a common scale 2. Position along nonaligned scales 3. Length, direction, angle 4. Area 5. Volume, curvature 6. Shading, color saturation

For an illustration, note that assessing length is not ranked near the top. This task comes up for curve-difference charts, where the vertical difference between two curves has to be assessed:

Assessing length is difficult for the eye-brain system: The plots in the right panel show the differences between the pairs of curves in the left panel.

The next example illustrates why it is easier to judge position on an nonaligned scale than to assess length: This fact can also be derived from psychophysical theory: It follows from Weber s law that the difficulty in comparing lengths depends on the ratio of the lengths, not their difference. Using a bounding box helps as the ratio of the white bars is clearly different from 1, while that of the black bars isn t.

This ordering of the elementary tasks leads to the following important principle for choosing or developing methods for visualization: Whenever possible, employ tasks that are ranked high on the list: 1. Position along a common scale 2. Position along nonaligned scales 3. Length, direction, angle 4. Area 5. Volume, curvature 6. Shading, color saturation

For example, Cleveland McGill advocate replacing a pie chart with a bar graph or a dot chart, because perceiving position on a common scale is easier than perceiving angles:

Positioning on a nonaligned scale in a divided bar chart can be replaced by a dot chart with grouping, which uses positioning on a common scale:

Judging length in a curve-difference chart can be replaced by plotting the difference on a common scale:

Shading can be replaced by positioning on a nonaligned scale:

Some important design considerations Use perceptually effective encodings per the above ranking Visualize the facts and only the facts Reduce overhead Context is essential for graphical integrity

Some important design considerations Use perceptually effective encodings per the above ranking. Note that color is ranked low and should be avoided. Some guidelines for using color: Hue (color) is perceived as unordered and should only be used to label qualitative data. Value (brightness) is perceived as ordered and can be used to code quantitative data if the range in values is small.

Some important design considerations Visualize the facts and only the facts Avoid chartjunk Connecting the dots is often a borderline case: It suggests there is more information displayed than there really is, but it helps to visualize trends.

Some important design considerations Reduce overhead: Avoid legend lookup if direct labeling works Use faint grid lines: They provide more information if needed and stay in the background otherwise. Scatterplot of latitude and longitude of visitors to a web site.

Some important design considerations

from Tufte: The Visual Display of Quantitative Information