USING GRAPHS EFFECTIVELY FOR LEARNING: FROM DATA

Similar documents
Diagrams and Graphs of Statistical Data

R Graphics Cookbook. Chang O'REILLY. Winston. Tokyo. Beijing Cambridge. Farnham Koln Sebastopol

Glencoe. correlated to SOUTH CAROLINA MATH CURRICULUM STANDARDS GRADE 6 3-3, , , 4-9

Current Standard: Mathematical Concepts and Applications Shape, Space, and Measurement- Primary

CONTENTS PREFACE 1 INTRODUCTION 1 2 DATA VISUALIZATION 19

Big Ideas in Mathematics

Regression III: Advanced Methods

MTH 140 Statistics Videos

Tutorial 3: Graphics and Exploratory Data Analysis in R Jason Pienaar and Tom Miller

Scatter Plots with Error Bars

Exercise 1.12 (Pg )

Lecture 2: Descriptive Statistics and Exploratory Data Analysis

Factor Analysis. Chapter 420. Introduction

Principles of Data Visualization for Exploratory Data Analysis. Renee M. P. Teate. SYS 6023 Cognitive Systems Engineering April 28, 2015

3D Interactive Information Visualization: Guidelines from experience and analysis of applications

Examples of Data Representation using Tables, Graphs and Charts

Exploratory data analysis (Chapter 2) Fall 2011

Data Exploration Data Visualization

The Forgotten JMP Visualizations (Plus Some New Views in JMP 9) Sam Gardner, SAS Institute, Lafayette, IN, USA

Excel -- Creating Charts

Information Visualization Multivariate Data Visualization Krešimir Matković

Module 2: Introduction to Quantitative Data Analysis

Visualization Quick Guide

Spreadsheet software for linear regression analysis

Common Core Unit Summary Grades 6 to 8

096 Professional Readiness Examination (Mathematics)

Summarizing and Displaying Categorical Data

HISTOGRAMS, CUMULATIVE FREQUENCY AND BOX PLOTS

Data representation and analysis in Excel

Getting started with qplot

How To Use Statgraphics Centurion Xvii (Version 17) On A Computer Or A Computer (For Free)

Principles of Data Visualization

Exploratory Data Analysis with MATLAB

Session 7 Bivariate Data and Analysis

What Does the Normal Distribution Sound Like?

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012

Descriptive statistics Statistical inference statistical inference, statistical induction and inferential statistics

Freehand Sketching. Sections

Formulas, Functions and Charts

Data Visualization Techniques

Scientific Graphing in Excel 2010

Measurement with Ratios

Simple Predictive Analytics Curtis Seare

Fairfield Public Schools

Mathematics. Mathematical Practices

CHARTS AND GRAPHS INTRODUCTION USING SPSS TO DRAW GRAPHS SPSS GRAPH OPTIONS CAG08

Unit 9 Describing Relationships in Scatter Plots and Line Graphs

Choosing a successful structure for your visualization

Data exploration with Microsoft Excel: analysing more than one variable

Algebra 2 Chapter 1 Vocabulary. identity - A statement that equates two equivalent expressions.

Multi-Dimensional Data Visualization. Slides courtesy of Chris North

Data Visualization Handbook

Gestation Period as a function of Lifespan

Chapter 1: Looking at Data Section 1.1: Displaying Distributions with Graphs

Microsoft Business Intelligence Visualization Comparisons by Tool

DESCRIPTIVE STATISTICS. The purpose of statistics is to condense raw data to make it easier to answer specific questions; test hypotheses.

Quantitative vs. Categorical Data: A Difference Worth Knowing Stephen Few April 2005

business statistics using Excel OXFORD UNIVERSITY PRESS Glyn Davis & Branko Pecar

7 Time series analysis

Data Visualization Techniques

Nonlinear Iterative Partial Least Squares Method

Chapter 10. Key Ideas Correlation, Correlation Coefficient (r),

Week 1. Exploratory Data Analysis

Clustering & Visualization

GeoGebra. 10 lessons. Gerrit Stols

Volumes of Revolution

Gates/filters in Flow Cytometry Data Visualization

Expression. Variable Equation Polynomial Monomial Add. Area. Volume Surface Space Length Width. Probability. Chance Random Likely Possibility Odds

CRLS Mathematics Department Algebra I Curriculum Map/Pacing Guide

CSU, Fresno - Institutional Research, Assessment and Planning - Dmitri Rogulkin

Good Graphs: Graphical Perception and Data Visualization

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining

Chapter 11. Correspondence Analysis

TEXT-FILLED STACKED AREA GRAPHS Martin Kraus

A simple three dimensional Column bar chart can be produced from the following example spreadsheet. Note that cell A1 is left blank.

Dimensionality Reduction: Principal Components Analysis

Assumptions. Assumptions of linear models. Boxplot. Data exploration. Apply to response variable. Apply to error terms from linear model

Extend Table Lens for High-Dimensional Data Visualization and Classification Mining

Regression III: Advanced Methods

Demographics of Atlanta, Georgia:


FREE FALL. Introduction. Reference Young and Freedman, University Physics, 12 th Edition: Chapter 2, section 2.5

Density Distribution Sunflower Plots

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )

RnavGraph: A visualization tool for navigating through high-dimensional data

Iris Sample Data Set. Basic Visualization Techniques: Charts, Graphs and Maps. Summary Statistics. Frequency and Mode

Solving Simultaneous Equations and Matrices

Welcome to CorelDRAW, a comprehensive vector-based drawing and graphic-design program for the graphics professional.

This unit will lay the groundwork for later units where the students will extend this knowledge to quadratic and exponential functions.

Describing, Exploring, and Comparing Data

an introduction to VISUALIZING DATA by joel laumans

Multivariate Analysis of Ecological Data

the points are called control points approximating curve

1) Write the following as an algebraic expression using x as the variable: Triple a number subtracted from the number

CS171 Visualization. The Visualization Alphabet: Marks and Channels. Alexander Lex [xkcd]

(Refer Slide Time: 1:42)

Everyday Mathematics CCSS EDITION CCSS EDITION. Content Strand: Number and Numeration

Visualization methods for patent data

Everyday Mathematics GOALS

Graphing in SAS Software

Transcription:

USING GRAPHS EFFECTIVELY FOR LEARNING: FROM DATA William G. Jacoby ICPSR and Michigan State University Indiana University http://polisci.msu.edu/jacoby/iu/graphics I. Visual Representations of Quantitative Information A. Graphs have always been an important part of the statistical sciences B. Historical development and past uses C. Currently used often in data analysis and statistics II. Evolution of Statistical Graphics A. Major advances in graphical methodologies B. Research on psychology of graphical perception C. Evolution and wide availability of powerful computing equipment and high-resolution graphical displays III. Two Types of Graphics A. Used for different purposes B. Analytical graphs C. Presentational Graphs IV. Objectives of Graphical Approaches to Data Analysis A. Exploring the contents of a dataset B. Finding structure in data C. Checking assumptions of statistical models D. Communicating results

Using Graphs Effectively Introduction Page 2 V. Advantages of Graphical Displays A. Useful summaries for large, complex datasets B. Graphs not as reliant upon underlying assumptions as are numerical summaries C. Greater interaction between researcher and the data VI. Two Interacting Components A. Graphs encode information B. Human perception and cognition decode information VII. Graphs Versus Tables A. Small number of specific values use a table B. Large number of values and/or focus on patterns use a graph VIII. Graphical Perception A. Pattern perception versus table look-up B. Cleveland s theory of pattern perception IX. A Graph is Always a Model of the Data A. Objective is to highlight salient points without distorting features or imposing undue assumptions B. Avoid excessive detail of individual data points, focus on overall structure within the data set

GRAPHICAL DISPLAYS FOR UNIVARIATE DATA William G. Jacoby ICPSR and Michigan State University Indiana University http://polisci.msu.edu/jacoby/iu/graphics I. General Considerations A. Many different displays available for univariate data each has its own advantages and disadvantages B. Two general kinds of data to be considered 1. Values of a single quantitative variable 2. Labeled quantitative values II. Univariate Scatterplot A. Each observation plotted as point along axis representing range of variable values B. Overplotting problems C. Advantages and disadvantages of unvariate scatterplot III. The Histogram A. Most commonly-used univariate graph B. Histogram is a two-dimensional diagram, with one axis representing the range of the variable, and the other axis representing the data density at positions within the range C. If data are relatively continuous, then observations are grouped into mutually exclusive and exhaustive categories, called bins D. Histograms provide a great deal of information about the distribution and data values E. Disadvantages of the histogram, most arising from the binning process IV. Smoothed Histograms A. Shows local densities as a smooth, continuous function of the original data values B. Comparison to traditional histogram and constructing a smoothed histogram

Univariate Graphs Page 2 C. The kernel function D. Effect of specific kernel function on shape of smoothed histogram E. Effect of bandwidth on shape of smoothed histogram V. The Quantile Plot A. Quantiles are the order statistics for a data distribution B. Quantile plot is a two-dimensional display a scatterplot of data values versus their position within the ordered distribution C. Comparison to histogram, for various distribution shapes D. Interpretation, and advantages, of a quantile plot VI. Box Plots (Box-and-Whisker Diagrams) A. Focuses primarily on important quartiles B. Definitions and components of the box plot C. Box plot shows main features of a data distribution D. This kind of display has many advantages, with only one real disadvantage VII. Dot Plots for Labeled Data Values A. Useful display when data values are associated with identifying information, such as a label or an index number. B. Two-dimensional plot of data labels versus data values C. Dot plots present a great deal of information and are very easy to interpret D. Dot plots can be used in a variety of data analysis contexts E. The Dot plot has several advantages over its competitors, pie charts and bar charts F. Dot plots and arbitrary origin of data values VIII. Conclusions: How Does One Select the Appropriate Univariate Display? A. Subjective, somewhat idiosyncratic factors play a big role B. Different kinds of displays better for different aspects of visual perception C. Despite advantages of other methods, the histogram will probably continue to dominate univariate statistical graphics

GRAPHICAL DISPLAYS FOR BIVARIATE DATA William G. Jacoby ICPSR and Michigan State University Indiana University http://polisci.msu.edu/jacoby/iu/graphics I. Scatterplot is the Basic Graphical Display for Bivariate Data A. Two-dimensional display, showing joint distribution of observations on two variables B. Very useful because it avoids the implicit assumptions of statistical models II. General Guidelines A. Use axes on all four sides to enclose the plotting region B. Clear axis labels C. Rectangular grid lines within the plotting region are usually unnecessary D. Plotting symbols should be visually prominent and resistant to overplotting E. Tick marks should point outward, rather than inward and relatively few tick marks should be used on each axis F. Data rectangle should be slightly smaller than the scale rectangle G. Scale rectangle versus data rectangle H. If necessary, transform data values (alternatively, used transformed scales on coordinate axes) so plotted points fill up as much of the data rectangle as possible III. Enhancements for Scatterplots A. Jittering for alleviating overplotting and dealing with repeated data points B. Marginal displays of univariate information C. Labeling data points use labels sparingly!

Bivariate Graphs Page 2 IV. Slicing a Scatterplot A. Scatterplots are usually used to examine functional dependence between variables B. Visual assessment of functional dependence in raw data is problematic C. Dividing the plotting region of the scatterplot into a series of vertical slices enables visual display of conditional Y distributions D. It is useful to employ a univariate graphical display within each slice E. Compare salient characteristics across conditional Y distributions to assess functional dependence V. Scatterplot Smoothing A. Simplest type of nonparametric regression examine the relationship between two variables without specifying functional form in advance of the analysis B. Generate a smooth curve that follows center of conditional Y distribution, across the range of X values C. Many methods exist for nonparametric regression and scatterplot smoothing; Loess (or Lowess) is most popular and has nice properties VI. Loess, or Local regression A. Loess carries out a series of regressions within a window that moves across the range of X values. B. Define m equally-spaced locations (called v j, with j ranging from 1 to m) along the range of X C. At each v j, perform a regression, using WLS to estimate the coefficients D. In the WLS, observations are weighted inversely with their distance from the current v j (hence, local regression ) E. The local regression is used to find a predicted value for Y at the current location along the X axis. This is called the fitted (or predicted) value for v j and is designated g(v j ) F. Plot the point, [v j, g(v j )] G. Perform m local regressions, calculate fitted values for each, and plot the points, [v j, g(v j )] for each of the m locations along the X axis H. Adjacent [v j, g(v j )] points are connected with line segments to form a relatively smooth curve VII. Considerations in Using Loess A. Smoothing parameter, α, gives proportion of data within each fitting window B. Degree of polynomial used in local fitting, λ C. Robustness iterations

Bivariate Graphs Page 3 D. Loess residuals can be used to guide specification of loess parameters E. Residual-fit spread plot for goodness-of-fit (display can also be used for parametric regression models VIII. Aspect Ratio and Banking A. Aspect ratio is the physical size of the vertical scale, relative to the physical size of the horizontal scale B. Banking to 45 degrees is the procedure for setting the aspect ratio to maximize differences in slopes of line segments within a smooth curve C. Banking optimizes visual perception of functional dependence, and can reveal details otherwise hidden in the graphical display

GRAPHICAL DISPLAYS FOR MULTIVARIATE DATA William G. Jacoby ICPSR and Michigan State University Indiana University http://polisci.msu.edu/jacoby/iu/graphics I. The Challenge of Multivariate Data A. Multivariate data require multiple dimensions data space B. Most display media are two-dimensional C. The curse of dimensionality D. Objective is to find the most useful and productive view of data space II. Three-Dimensional Plots for Trivariate Data A. Three-dimensional scatterplot provides a direct physical represeentation of trivariate data space 1. Direct extension of the bivariate scatterplot, simply using three scale axes rather than only two, a scale cube rather than a scale rectangle, and so on 2. Physical representation is a rotation from the usual orthogonal view of a scatterplot 3. Angular separation of axes in display, along with tilt in plotting symbols, encourage threedimensional illusion B. Three-dimensional scatterplot remains unsatisfactory as an analytical graphic. 1. Detection: Heavy overplotting will very likely occur 2. Assembly: Difficult to make out anything more than very general patterns in the plotted points 3. Estimation: Difficult to make accurate visual judgments about data values C. Still, 3-D scatterplots can be useful in certain situations. 1. Presentational graphics 2. Situations where there is very little noise in the data 3. Graphing surfaces rather than data points 4. When motion can be added to the display

Multivariate Graphs Page 2 III. The Scatterplot Matrix A. Attempt to represent physically the contents of the multidimensional data space 1. Moves away from tools employed by the representational artist (which were used for 3-D plots) and toward those employed in drafting and technical illustration 2. Scatterplot matrix is composed of orthogonal views of all primary directions within data space 3. Here, primary directions refer to planes formed by pairs of variables. B. Definitions of the scatterplot matrix 1. Square, symmetric table containing k rows and columns 2. Each cell contains a bivariate scatterplot for the variables represented by the row and column 3. Each cell contains a bivariate scatterplot for the variables represented by the row and column 4. The scatterplot matrix contains an enormous amount of information careful, systematic layout makes this information comprehensible and interpretable C. Main diagonal of the scatterplot matrix 1. Runs from lower-left to upper-right in many scatterplot matrices 2. This is opposite to usual practice with tables (matrices) of numerical information D. Important features of the basic scatterplot matrix 1. Display contains virtually no embellishment or extra graphical features (e.g., no tick marks, scale values, etc.) 2. All information involving any variable can be discerned by simply scanning across one row (or down one column) of the scatterplot matrix E. Interpretation of the scatterplot matrix 1. Scatterplot matrix is the graphical and hence, more flexible analogue to a correlation matrix. It is very useful for discerning structure within data 2. Specific uses vary depending upon circumstances 3. Often, start by looking for problems within the data 4. Examine orientation and shapes of point clouds 5. Look for unusual observations and features of the data F. Potential disadvantages of scatterplot matrix 1. Visual resolution of panels will decrease as the number of variables increases 2. Scatterplot matrix shows multivariate data, but does not really show multivariate structure within the data IV. Conditioning Plot A. Shows conditional dependence, or functional relationship between two variables while holding constant (i.e., conditioning upon ) the values of one or more other variables.

Multivariate Graphs Page 3 B. Basic definition and terminology: 1. Multipanel display; separate panels are identified with the same coordinate system as that introduced for scatterplot matrices 2. Distinctions between dependent variable, panel variable, and one or more conditioning variables 3. Values of each conditioning variable are divided into a series of intervals or slices. 4. Each slice (or combination of slices, if there is more than one conditioning variable) produces one panel in the conditioning plot 5. Each resultant conditioning panel only displays the data points that fall within that slice of the conditioning variable C. Tufte s Principle of Small Multiples 1. Regularity in displays across conditioning panels; scale rectangles and other graphical details remain fixed from one panel to the next. 2. The only thing that changes from one panel to the next is the data 3. Analogous to frames of movie film moving before the lens of a projector 4. Similarly, the analyst scans across panels of a conditioning plot to see how the data move relative to the fixed reference axes of the scale rectangle D. Quantitative, continuous conditioning variables 1. Somewhat contradictory requirements for slices: Wide enough to include enough observations in each slice, but narrow enough to discern any changes that occur among other variables, across range of conditioning variable. 2. Often impossible to meet both of the preceding criteria with mutually exclusive conditioning slices 3. Solution: Use overlapping slices, which provide moving snapshots of the data across the range of the conditioning variable 4. Number of slices and amount of overlap is relatively arbitrary and, within reason, has little effect on the information gained from the final conditioning plot E. Equal count algorithm: Produces slices with approximately equal numbers of observations within each slice and equal proportion of overlap between adjacent slices 1. User specifies the desired number of slices and the desired amount of overlap 2. Result: Series of overlapping slices (sometimes called shingles ) 3. Width of slices varies, but content and overlap is approximately the same across slices 4. Width of each slice is directly related to the local density of the conditioning variable s distribution 5. Graphical display showing the conditioning slices is called a given plot F. Trellis display strategy 1. Information about conditioning intervals is given in strip labels at the top of each dependence panel 2. A shaded rectangle shown in each strip label conveys the slice of the conditioning variable that is shown in that panel 3. Multiple conditioning variables are handled by using multiple strips at the top of each panel

Multivariate Graphs Page 4 G. Applications of conditioning plots 1. Explicating bivariate relationships 2. Examining interactions between explanatory variables 3. Testing additive, linear functional form for multiple regression models 4. Examining conditional linear relationships