How to interpret scientific & statistical graphs


Theresa A Scott, MS
Department of Biostatistics
theresa.scott@vanderbilt.edu
http://biostat.mc.vanderbilt.edu/theresascott

A brief introduction

Graphics: One of the most important aspects of the presentation and analysis of data; they help reveal structure and patterns.
Graphical perception (ie, interpretation of a graph): The visual decoding of the quantitative and qualitative information encoded on graphs.
Objective: To discuss how to interpret some common graphs.

Sidebar: Types of variables

Continuous (quantitative data): Have any number of possible values (eg, weight).
Discrete numeric: The set of possible values is a finite (ordered) sequence of numbers (eg, a pain scale of 1, 2, ..., 10).
Categorical (qualitative data): Have only certain possible values (eg, race); often not numeric.
Binary (dichotomous): A categorical variable with only two possible values (eg, gender).
Ordinal: A categorical variable for which there is a definite ordering of the categories (eg, severity of lower back pain as none, mild, moderate, and severe).

Graphs for a single variable's distribution

Histograms

Continuous variable.
Values are divided into a series of intervals, usually of equal length.
Data are displayed as a series of vertical bars whose heights indicate the number (count) or proportion (percentage) of values in each interval.
What is the overall shape? Is it symmetric? Is it skewed? The apparent shape is affected by the size of the interval.
Is there more than one peak?
What is the range of the intervals?
Is the shape wide or tight (ie, what's the variability)?
Look for concentration of points and/or outliers, which can distort the graph.
[Figure: histogram; vertical axis: Frequency]

Boxplots

Continuous variable.
Displays a numerical summary of the distribution.
Most include the 25th, 50th (median), and 75th percentiles.
Optionally includes the mean (average).
Whiskers may extend to the min & max, or a rule may be used to indicate outliers.
Graphed either horizontally or vertically.
Interpretation: What statistics are displayed?
Most often, the central box includes the middle 50% of the values.
Whiskers (& outliers) show the range.
Symmetry is indicated by the box & whiskers and by the location of the median (and mean).
[Figure: boxplot of Variable X]
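Two points above can be checked numerically: the interval size changes a histogram's apparent shape, and a boxplot is a handful of percentiles plus an outlier rule. The sketch below uses made-up values and assumes the common 1.5 x IQR fence as the outlier rule; the slides do not say which rule their figures use, and the function names are illustrative only.

```python
import statistics

def bin_counts(values, width, start=0.0):
    """Count values per interval [left, left + width), keyed by the left edge."""
    counts = {}
    for v in values:
        left = start + ((v - start) // width) * width
        counts[left] = counts.get(left, 0) + 1
    return dict(sorted(counts.items()))

def boxplot_summary(values):
    """25th/50th/75th percentiles plus points beyond the 1.5 * IQR fences.

    The 1.5 * IQR rule is an assumption; other rules exist.
    """
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {"q1": q1, "median": median, "q3": q3,
            "outliers": [v for v in values if v < low or v > high]}

# Hypothetical data with one extreme value.
values = [120, 150, 180, 200, 210, 240, 260, 300, 320, 1150]

wide = bin_counts(values, width=200)    # a few broad bars
narrow = bin_counts(values, width=100)  # same data, more and narrower bars
summary = boxplot_summary(values)

print(wide)
print(narrow)
print(summary)
```

With the wider interval most values pile into one bar; the narrower interval spreads them out, while the single extreme value stays isolated in both and is the only point flagged by the fence rule.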

Boxplot with raw data

Going one step beyond just a boxplot.
The boxplot is overlaid with the raw values of the continuous variable.
It therefore displays both a numerical summary and the actual data.
Gives a better idea of the number of values the numerical summary (ie, boxplot) is based on and where they occur.
Raw values are often jittered: in order to visually depict multiple occurrences of the same value, a random amount of noise is added in the horizontal direction if the boxplot is vertical (in the vertical direction if the boxplot is horizontal).
Look for concentration of points and (as before) outliers.
[Figure: boxplot of Variable X overlaid with raw values]

Barplots (aka, bar charts)

Categorical variable.
Data are displayed as a series of vertical (or horizontal) bars whose heights indicate the number (count) or proportion (percentage) of values in each category.
Visual representation of a table.
How do the heights of the bars compare? Which is largest? Smallest?
[Figure: barplot of Proportion for the categories Censored and Dead]
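The jittering described for the boxplot with raw data can be sketched as follows. This is a minimal illustration for a vertical boxplot, assuming uniform noise; the function name, the noise width, and the data are hypothetical.

```python
import random

def jitter_positions(values, center=1.0, amount=0.1, seed=0):
    """Return (x, y) plotting positions for a vertical boxplot.

    y is the raw value, left unchanged; x is the boxplot's position plus a
    small random horizontal offset so that repeated values do not overlap.
    """
    rng = random.Random(seed)  # fixed seed so the layout is reproducible
    return [(center + rng.uniform(-amount, amount), v) for v in values]

# Hypothetical raw values with three ties at 3.5.
values = [3.5, 3.5, 3.5, 3.6, 4.0]
points = jitter_positions(values)

for x, y in points:
    print(f"x = {x:.3f}, y = {y}")
```

Only the position along the axis that carries no data is perturbed, so the values themselves are still read correctly off the data axis.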

Dot plots (aka, dotcharts)

Categorical variable.
Alternative to a barplot (bar chart).
The height of each (vertical) bar is instead indicated with a dot (or some other character) on an (often horizontal, dotted) line.
The dot's position along the line gives the count or percentage.
Same interpretation as barplot (bar chart).
[Figure: dot plot of Proportion for the categories Dead and Censored]

Graphs for the association/relation between two variables

Side-by-side boxplots

A continuous variable and a categorical variable.
Displays the distribution of the continuous variable within each category of the categorical variable.
Width of the boxes can also be made proportional to the number of values in each category.
Here, side-by-side boxplots are overlaid with the raw values.
How does the symmetry of each boxplot differ across categories?
How do they compare to the boxplot of the continuous variable ignoring the categorical variable?
Is there a concentration of points and/or outliers in one particular category?
Is the number of values in each category fairly consistent?
[Figure: side-by-side boxplots of Serum Albumin by Histological stage of disease (1 to 4)]

Barplots

Two categorical variables.
Visual representation of a two-way table.
Bars are most often nested. The count/proportion of the 2nd variable's categories is displayed within each of the 1st variable's categories.
This allows you to compare the 2nd variable's categories (1) within each of the 1st variable's categories, and (2) across the 1st variable's categories.
Bars can also be stacked. A single bar is constructed for each category of the 1st variable and divided into segments, which are proportional to the count/percentage of values in each category of the 2nd variable.
Counts should sum to the number of values in the dataset; percentages should sum to 100%.
Unlike nested (side-by-side) bars, the segments of a stacked bar do not have a common axis, which makes it difficult to compare segment sizes across bars.
[Figures: nested and stacked barplots of Proportion by Treatment (D-penicillamine, Placebo) and Stage (1 to 4)]
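The two-way table behind a nested or stacked barplot can be sketched in a few lines. The records below are invented (only the category labels come from the figures), and the function name is hypothetical; the sketch computes the proportion of each 2nd-variable category within each 1st-variable category and checks that each bar's segments sum to 100%.

```python
from collections import Counter

def proportions_within(pairs):
    """Proportion of each (first, second) cell within its first-variable category."""
    cell_counts = Counter(pairs)
    row_totals = Counter(first for first, _ in pairs)
    return {cell: n / row_totals[cell[0]] for cell, n in cell_counts.items()}

# Hypothetical (treatment, stage) records, not the data from the slides.
records = (
    [("Placebo", "Stage 1")] * 2
    + [("Placebo", "Stage 3")] * 6
    + [("D-penicillamine", "Stage 2")] * 4
    + [("D-penicillamine", "Stage 4")] * 4
)

props = proportions_within(records)
for (treatment, stage), p in sorted(props.items()):
    print(f"{treatment:16s} {stage}: {p:.2f}")

# Each stacked bar (one per treatment) should sum to 1.
placebo_total = sum(p for (t, _), p in props.items() if t == "Placebo")
```

In a nested barplot each of these proportions gets its own bar on a shared axis; in a stacked barplot they become segments of one bar per treatment.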

Dot plots

Two categorical variables.
Alternative visual representation of a two-way table.
Like barplots, can be nested. There is a separate line for each category of the 2nd variable, grouped within each category of the 1st variable.
Can also be stacked. All values for a given category of the 2nd variable are shown on a single line (one line for each category of the 2nd variable); the 1st variable's categories are distinguished with different symbols.
Unlike stacked barplots, stacked dot plots do have a common axis for comparisons.
Same interpretation as barplot (bar chart). Same comparisons within and across categories.
[Figures: nested and stacked dot plots of Proportion by Treatment (D-penicillamine, Placebo) and Stage (1 to 4)]

Scatterplots

Two continuous variables.
Usually, the response variable (ie, outcome) is plotted along the vertical (y) axis and the explanatory variable (ie, predictor; risk factor) is plotted along the horizontal (x) axis.
The choice of axis doesn't matter if there is no response/explanatory distinction between the two variables.
Each subject is represented by a point.
Often include lines depicting an estimate of the linear/non-linear relation/association, and/or confidence bands.
http://www.stat.sfu.ca/~cschwarz/stat-201/handouts/node41.html

What to look for:
Overall pattern: Positive association/relation? Negative association/relation? No association/relation?
Form of the association/relation: Linear? Non-linear (ie, a curve)?
Strength of the relation/association: How tightly clustered are the points (ie, how variable is the relation/association)?
Outliers.
Lurking variables: A 3rd (continuous or categorical) variable that is related to both continuous variables and may confound the association/relation. Often incorporated into the graph; see the "Graphs for multivariate data" slides.
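The direction and strength questions for a scatterplot have simple numerical counterparts when the form is linear. The sketch below computes Pearson's correlation coefficient and a least-squares line from first principles; the slides do not name a specific measure, so using Pearson's r here is an assumption, and the data and function names are made up.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: sign gives direction, magnitude gives linear strength."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def least_squares_line(xs, ys):
    """Slope and intercept of the straight line often drawn through the points."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical explanatory (x) and response (y) values.
xs = [1, 2, 3, 4, 5]
tight = [2, 4, 6, 8, 10]  # points fall exactly on a rising line
loose = [5, 3, 6, 2, 4]   # weak, slightly negative pattern

r_tight = pearson_r(xs, tight)
r_loose = pearson_r(xs, loose)
slope, intercept = least_squares_line(xs, tight)
print(r_tight, r_loose, slope, intercept)
```

A value of r near 0 only rules out a linear pattern; a strong curved relation can still be present, which is why the form of the association is checked by eye first.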

Example scatterplots

[Figures: scatterplot of Weight versus Height; scatterplot of Y versus X]

Graphs for multivariate data (ie, more than two variables)

(More complex) Scatterplots

Two continuous variables and a categorical variable.
Often, the categorical variable is a confounder: the association/relation between the two continuous variables is (possibly) different between the categories of the categorical variable.
The categorical variable is incorporated using different symbols and/or line types for each category.
What to look for: Same as mentioned for the general scatterplot.
Does the association/relation between the two continuous variables differ between the categories of the categorical variable? If so, how?
[Figure: scatterplot of Serum Albumin versus Serum Cholesterol, by treatment (D-penicillamine, Placebo)]

Examples of other graphs you might encounter

Modified side-by-side boxplot (great alternative to a dynamite plot; see next slide)

[Figure: Mean and SD of Age Across Stage of Disease; vertical axis: Age (years); horizontal axis: Histological stage of disease (Stage 1 to Stage 4)]

Dynamite plot (often, height of bar = mean; error bar = standard deviation)

IMPORTANT: Even though commonly seen, not a good graph to generate.
Only the height of the bar is of interest (the rest of the bar is just unnecessary ink).
You have no idea how many values the mean and standard deviation are based on (often quite few) or how the raw values are distributed. Both affect the values of the mean and standard deviation.
Bars can also be hanging, which may represent negative values; this is very confusing.
[Figure: dynamite plot of Expression of protein by Type of mouse (Wild Type, Knockout)]
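The complaint that a dynamite plot hides how the raw values are distributed can be made concrete with two small, invented samples. They are constructed to share the same mean and standard deviation, so their dynamite plots would be identical, even though one is skewed by a single large value and the other is symmetric.

```python
import math
import statistics

def mean_sd(values):
    """The only two numbers a dynamite plot displays."""
    return statistics.mean(values), statistics.stdev(values)

# Hypothetical samples chosen to have the same mean and sample SD.
skewed = [2, 2, 2, 2, 7]      # four identical values and one large one
symmetric = [0, 2, 3, 4, 6]   # spread evenly around the centre

skewed_mean, skewed_sd = mean_sd(skewed)
symmetric_mean, symmetric_sd = mean_sd(symmetric)

print(skewed_mean, round(skewed_sd, 3))
print(symmetric_mean, round(symmetric_sd, 3))
```

Overlaying the five raw points on a boxplot, as in the modified side-by-side boxplot, would show the difference immediately.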

Survival & Hazard plots

Survival plot: Each step down represents one or more deaths; + signs represent censoring.
[Figure: Probability of Survival versus Months, for Maintenance and No Maintenance]

Hazard plot: Each step up represents one or more deaths; + signs represent censoring.
[Figure: Cumulative Hazard versus Months, for Maintenance and No Maintenance]

Spaghetti & Line plots

Spaghetti plot: Each line plots the raw data points of a single subject.
Line plot: Each line plots summary measures (eg, mean) from a group of subjects.
[Figures: Red cell folate at Baseline, 6 mos post-op, and 12 mos post-op; the line plot is grouped by Treatment Group (Placebo, Drug A, Drug B)]
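The step pattern of the survival plot can be reproduced by hand. The sketch below assumes the curve is a Kaplan-Meier estimate, which the slides do not state explicitly, and uses invented follow-up times; the function name is hypothetical. It shows why the curve steps down only at deaths, while a censored subject simply leaves the group at risk.

```python
def kaplan_meier(times, died):
    """Return [(time, survival probability)] at each time with a death.

    times: follow-up time for each subject.
    died:  True if the subject died at that time, False if censored.
    """
    subjects = list(zip(times, died))
    at_risk = len(subjects)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for time, d in subjects if time == t and d)
        leaving = sum(1 for time, _ in subjects if time == t)
        if deaths:
            # Step down: multiply by the fraction of those at risk who survive t.
            survival *= (at_risk - deaths) / at_risk
            curve.append((t, survival))
        # Deaths and censored subjects both leave the risk set after time t.
        at_risk -= leaving
    return curve

# Hypothetical follow-up in months; False marks a censored subject ("+" on the plot).
times = [5, 8, 8, 12, 15, 20]
died = [True, True, False, True, False, True]

curve = kaplan_meier(times, died)
for t, s in curve:
    print(f"month {t}: survival = {s:.3f}")
```

The subject censored at month 15 produces no step, but the next death is then divided by a smaller number at risk, so the final steps are larger than the early ones.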

WARNING: Very easy for a graph to lie

What are the limits of the axis/axes? Is the scale consistent?
How do the height and width of the graph compare to each other? Is the graph a square? A rectangle (ie, short & wide; tall & skinny)?
If two or more graphs are shown together (eg, side-by-side, or in a 2x2 matrix), do all of the axes have the same limits? Same scale? Do they have the same relative dimensions?
Are there two x- or y-axes in the same graph? If so, do they have the same scale?
Can you get a feel for the raw data? The number of data points?
Does a graph of a continuous variable show outliers?
Does the data look too pretty?

General steps

Do I understand this graph? If NO: (1) it might be a really bad graph; or (2) it might be a type of graph you don't know about.
Carefully examine the axes and legends, noting any oddities.
Scan over the whole graph to see what it is saying, generally.
If necessary, look at each portion of the graph.
Re-ask "Do I understand this graph?" If YES, what is it saying? If NO, why not?

Overview of Statistical Graphs, Peter Flom