Introduction to Exploratory Data Analysis




A SpaceStat Software Tutorial

Copyright 2013, BioMedware, Inc. (www.biomedware.com). All rights reserved. SpaceStat and BioMedware are trademarks of BioMedware, Inc. SpaceStat is protected by U.S. patents 6,360,184, 6,460,011, 6,704,686, 6,738,729, and 6,985,829, with other patents pending.

Principal Investigators: Pierre Goovaerts and Geoffrey Jacquez.

SpaceStat Team: Eve Do, Pierre Goovaerts, Sue Hinton, Geoff Jacquez, Andy Kaufmann, Sharon Matthews, Susan Maxwell, Kristin Michael, Yanna Pallicaris, Jawaid Rasul, and Robert Rommel.

SpaceStat was supported by grant CA92669 from the National Cancer Institute (NCI) and grant ES10220 from the National Institute for Environmental Health Sciences (NIEHS) to BioMedware, Inc. The software and help contents are solely the responsibility of the authors and do not necessarily represent the official views of the NCI or NIEHS.

MATERIALS
EDA Tutorial.SPT SpaceStat project

ESTIMATED TIME
30 minutes

OBJECTIVE
This tutorial conducts an exploratory data analysis of geographically and temporally varying data in SpaceStat. The data are age-standardized lung cancer mortality rates from 1950 to 1995. In this exploratory analysis you will look for patterns and trends in the data as well as statistical outliers.

WHAT IS EXPLORATORY DATA ANALYSIS?
Unlike hypothesis-driven data analyses, which are motivated by a specific question, exploratory data analyses are not driven by a question posed in advance. Instead, they search for patterns and relationships within data. At the simplest level, an exploratory data analysis might reveal that the variable you are interested in is increasing or decreasing with the passage of time. Further exploratory analysis might yield a basic description of how this variable is distributed across geographic space or how it is related to another variable.

Exploratory data analyses are typically used in the early stages of a study, when you are trying to get a basic understanding of the data. The insights you gain from such an analysis might lead you to formulate questions appropriate for a hypothesis-driven data analysis. A typical project might begin by simply looking at the data values, proceed to mapping the data, then conduct some basic exploratory pattern description, and finally progress to a more sophisticated hypothesis-driven analysis.

ABOUT THE DATA
The data used for this tutorial are lung cancer mortality rates across the United States at three spatial scales: counties, state economic areas (SEAs), and states. The data come from the National Cancer Institute's National Atlas of Cancer Mortality and are age-standardized mortality rates per 100,000. The populations are white males, black males, white females, and black females, and the data span 1950 to 1995. Further details on the data can be found at http://www3.cancer.gov/atlasplus.

STEP 1: LOAD THE PROJECT

To open this project, go to File -> Open, and then browse to EDA Tutorial. Click the OK button and the project will load. Look at the datasets listed in the Data view. You will notice that the datasets are organized into three geographies: counties, SEAs, and states. The dataset names come from the original files from the National Cancer Institute. Dataset names that begin with R hold the estimated rates, that is, the age-adjusted mortality rates per 100,000. Names that begin with LBR hold the lower bound of the 95% confidence interval of the rate, and names that begin with UBR hold the upper bound. Names that begin with C hold the mortality counts. The final letters designate the population: WM for white males, BM for black males, WF for white females, and BF for black females. For example, the dataset labeled RBM holds the age-adjusted lung cancer mortality rates per 100,000 black males.

STEP 2: LOOKING AT THE DATA
First, open a table where you can look at the data values. To do this, click the Table button on the SpaceStat main toolbar. Choose the SEA geography from the dropdown control and click the OK button. A window will open showing the data values in a table. Like all of SpaceStat's time-enabled visualizations, the table has a time slider and playback controls that you can use to control the time at which you are viewing the data. Also, selecting cells in the table will select the corresponding geographic objects in any linked visualizations, such as maps or statistical graphs.

[Figure: Table of the SEA geography]
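The dataset naming convention described in Step 1 can be sketched in a few lines of Python. This is only an illustration of the convention, not part of SpaceStat; the function name `decode_dataset_name` and the two lookup tables are hypothetical.

```python
# Hypothetical helper: decode a dataset name such as "RBM" or "UBRWF"
# using the prefix/suffix convention described in Step 1.
PREFIXES = {
    "R": "age-adjusted mortality rate per 100,000",
    "LBR": "lower bound of the 95% confidence interval of the rate",
    "UBR": "upper bound of the 95% confidence interval of the rate",
    "C": "mortality count",
}
POPULATIONS = {
    "WM": "white males",
    "BM": "black males",
    "WF": "white females",
    "BF": "black females",
}


def decode_dataset_name(name):
    """Return (measure, population) for a dataset name, or raise ValueError."""
    prefix, suffix = name[:-2], name[-2:]  # the last two letters are the population
    if prefix not in PREFIXES or suffix not in POPULATIONS:
        raise ValueError(f"unrecognized dataset name: {name!r}")
    return PREFIXES[prefix], POPULATIONS[suffix]


for name in ("RBM", "UBRWF", "CWM"):
    measure, population = decode_dataset_name(name)
    print(f"{name}: {measure}, {population}")
```

Splitting off the last two letters first avoids any ambiguity between the one-letter prefixes (R, C) and the three-letter ones (LBR, UBR).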

Sorting the data values in the table may help you find outliers. We will first look at the rate of lung cancer among white males (RWM). Locate the column labeled RWM and right-click on its column header. Choose Sort->Descending from the popup menu. This sorts the data values so that the SEA with the highest rate comes first. Note that the sort applies to the values at the current time only. If the values change over time, the objects stay in the same order in the table, so the column may no longer be in descending order at other times.

STEP 3: MAPPING THE VARIABLES
You can close the table you created before to free up some space for this step. You may also wish to close the Log View at the bottom of the SpaceStat interface.

Click the Map icon in the SpaceStat main toolbar. Type RWM as the map name. Make sure that SEA is selected in the Geography dropdown control. Click on the Fill color dataset dropdown control and select the RWM dataset. Make sure that Add to Map shows New Map. Now click the OK button to create the map.

The default map coloring scheme isn't the best for this dataset. A classified coloring scheme will make it easier to see patterns in the data. Click on the Map Properties icon on the map toolbar. Make sure the Polygon fill attributes tab is selected. Choose Classified from the Color mode dropdown. The default setting of 10 quantiles is fine. Click the OK button to apply the changes to the map.

Create 3 other maps in the same manner using the RWF, RBM, and RBF datasets for the fill color, naming them appropriately. Note that the classes created for each map display do not share the same scale, although you could manually enter the same scale if you wished. To arrange these maps so that you can see all four easily, select Window->Tile from the SpaceStat main menu. Lastly, select Window->Time-synchronize all views from the SpaceStat main menu so that the maps will animate together. Your maps should look similar to the screenshot below:

[Figure: Maps of the four datasets]
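To make the "10 quantiles" classification concrete, here is a minimal Python sketch of quantile classification using only the standard library. It is an illustration of the general technique, not SpaceStat's implementation; the function name `classify_quantiles` and the example rates are hypothetical, and SpaceStat may compute its cut points differently.

```python
import bisect
import statistics


def classify_quantiles(values, k=10):
    """Assign each value a class from 0 (lowest) to k-1 (highest).

    The k-1 cut points are chosen so that each class holds roughly the
    same number of observations.
    """
    cuts = statistics.quantiles(values, n=k)  # k-1 cut points
    return [bisect.bisect_right(cuts, v) for v in values]


# Hypothetical rates for 20 areas (not taken from the tutorial data).
rates = list(range(1, 21))
classes = classify_quantiles(rates)
print(classes)
```

Because the cut points are computed from each dataset separately, class 9 on one map need not cover the same range of rates as class 9 on another. This is the caveat noted above about the map classes not sharing a scale.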

At this stage you can begin to observe some patterns in the data. Ask yourself whether there are geographic patterns in the data, i.e. areas where lung cancer rates are higher or lower. Look for differences in the patterns between black males and white males, or between white males and white females.

Looking at a single static time limits your understanding to that time period. You can observe temporal trends in the data by animating the maps. Each of the maps has an animation toolbar. Since the maps are time-synchronized, using any of the animation controls on one map will affect all maps. Click the Play button on the animation toolbar to begin the map animation. By default the animation will run through time once and stop. Note that the data for black males and black females were not recorded before 1970. Each of the animations can be exported as an AVI file so that you can use it outside SpaceStat, for example in a PowerPoint presentation. To do this, select Map->Export animation from the menu of the visualization that you wish to export.

STEP 4: CREATING HISTOGRAMS
For now, close all the maps except the RWM map. Click the Histogram button on the main SpaceStat toolbar. Select SEA from the Geography dropdown control. Select RWM from the dataset dropdown. Click the OK button to create the histogram. The histogram shows the Graph Statistics pane by default, but if you have closed it you can reopen it by choosing Graph->Graph statistics in the histogram's menu.

The histogram shows the distribution of data values in a single dataset at the current time. It is important to check whether the data are distributed normally, skewed to the right (many more observations on the left, with a long tail to the right), or skewed to the left, because the shape of the distribution has implications for the kinds of statistical tests and modeling approaches that can be used later.

You can click and drag to select observations (SEAs) from the histogram. For instance, if you select the bins with the highest values, you can see on the map that some of the SEAs with the highest lung cancer rates for white males in 1950 are in Louisiana, on the Carolina coast, in Maryland, and in New York City. This process of selecting objects in one visualization to show results in a linked visualization is called brushing.

[Figure: Using brushing to show where the high rates for white male lung cancer mortality are in 1950]
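The visual check for skew described in this step can be complemented with a simple numeric one. The sketch below computes the moment coefficient of skewness in plain Python. The function names, the example values, and the cutoff of 0.5 are assumptions chosen for illustration; SpaceStat's Graph Statistics pane may report different statistics.

```python
def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    if m2 == 0:
        raise ValueError("all values are identical; skewness is undefined")
    return m3 / m2 ** 1.5


def describe_shape(values, tolerance=0.5):
    """Label a distribution using an arbitrary rule-of-thumb tolerance."""
    g1 = skewness(values)
    if g1 > tolerance:
        return "skewed to the right"
    if g1 < -tolerance:
        return "skewed to the left"
    return "roughly symmetric"


# Hypothetical values with a long right tail (not taken from the tutorial data).
sample = [1, 1, 2, 2, 2, 3, 3, 4, 10, 20]
print(describe_shape(sample))
```

A positive coefficient corresponds to the "skewed to the right" case above: most observations sit on the left and a few large values stretch the tail to the right.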

Select Window->Time-synchronize all views from the SpaceStat main menu to synchronize the animation of the histogram and the map. Now you can use the animation controls to animate both the histogram and the map. Look at how the measures of central tendency (mean, median, and mode) change through time. Also look at how dispersion (standard deviation and coefficient of variation) changes through time. You can also observe what happens to the SEAs that were highest at the beginning of the time period. Are these SEAs also the highest at the end? You can repeat this exercise for the other 3 variables (RWF, RBM, RBF).

STEP 5: FINDING STATISTICAL OUTLIERS WITH BOXPLOTS
Click the Boxplot button on the main SpaceStat toolbar. Select SEA in the Geography dropdown control. Select RWM in the dataset dropdown control. Click the OK button to create the boxplot.

The boxplot is a unique visualization. Data observations are shown as points along a vertical line. There is a horizontal black-filled box that is usually towards the center of the plot. This box is the interquartile range; it contains 50% of the values, specifically the 25% above the median and the 25% below the median. A black line extends through the center of this box, representing the median value. There are also two additional lines on the boxplot, above and below the interquartile range, called whiskers. These mark values +/- 2 times the interquartile range from the median (by comparison, the box itself extends roughly +/- 0.5 times the interquartile range from the median).

[Figure: A boxplot with labeled parts]

Any values that fall outside the whiskers are candidates to be considered outliers. For instance, there are a few values that fall above the top whisker in 1950. You can brush select these values to see where they are on the map. Advance the time slider to 1970 and brush select the possible outliers at this time. At this stage in the exploratory analysis you might try to find explanations for why these values are higher or lower than the others.

STEP 6: CREATING AN UNCERTAINTY DATASET
The data contain two variables for each population that can be used to create a measure of uncertainty. For this part of the tutorial, you will focus on the RBM and RBF datasets. For RBM, the UBRBM and LBRBM datasets represent the upper and lower bounds of the 95% confidence interval around the black male lung cancer mortality rate. The difference between these two values, that is, the width of the confidence interval, is a measure of the uncertainty of the lung cancer mortality rate.

SpaceStat can create new datasets from existing data using the Derive New Dataset operation. Start the procedure by selecting Tools->Derive new dataset from the SpaceStat main menu. Name the new dataset uncertrbm. Select SEA from the geography dropdown. Next choose UBRBM from the Insert Dataset dropdown. Then choose - from the Insert Function dropdown. Choose LBRBM from the Insert Dataset dropdown. The dialog should look like the following screenshot (pay special attention to the dataset definition):

[Figure: Creating an uncertainty dataset]

If this is the case, click the OK button to create the dataset. If you have made an error, you can either edit the text manually or delete the definition and start over. Repeat this process to create an uncertrbf dataset from the UBRBF and LBRBF datasets.
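The derived dataset is simply the upper bound minus the lower bound for each SEA. A minimal Python sketch of that calculation follows. The function name `derive_uncertainty`, the example values, and the use of `None` for a missing observation are assumptions for illustration; they do not describe how SpaceStat stores or derives data internally.

```python
def derive_uncertainty(upper, lower):
    """Width of the 95% confidence interval for each area: upper - lower.

    A missing bound (represented here as None, an assumption) yields a
    missing result rather than an error.
    """
    if len(upper) != len(lower):
        raise ValueError("upper and lower must have the same length")
    return [
        None if u is None or l is None else u - l
        for u, l in zip(upper, lower)
    ]


# Hypothetical bounds for three SEAs (not taken from the tutorial data).
ubrbm = [60.0, 45.5, None]
lbrbm = [40.0, 30.5, 10.0]
uncertrbm = derive_uncertainty(ubrbm, lbrbm)
print(uncertrbm)
```

A wider interval means a less certain rate, which typically happens where the underlying population is small.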

STEP 7: MULTIVARIATE RELATIONSHIPS
With the four variables we have been looking at, there are 6 possible pairs of variables. In this tutorial you will look at the relationship between RBM and RBF, but you could look at the other combinations on your own. The scatterplot tool in SpaceStat creates a graph of one variable against the other and fits a line through it, showing the correlation coefficient. It is a good starting point for exploring a bivariate relationship.

Click the Scatterplot button on the SpaceStat main toolbar. Select SEA from the Geography dropdown control. Make sure that the type of scatterplot is set to Two datasets. Select RBM from the X dataset dropdown control. Select RBF from the Y dataset dropdown control. Check the Point color dataset option and choose uncertrbm from the dropdown. Check the Point size dataset option and choose uncertrbf from the dropdown. Check that your dialog matches the screenshot below and click the OK button to create the scatterplot.

[Figure: Create scatterplot dialog settings]

The scatterplot should appear empty when it is created. This is because no data were recorded for the black lung cancer rates until 1970. If you advance the slider forward past 1970, you should see the data appear. Some things to consider at this point are whether there is a relationship between the two variables, how strong the relationship is, and whether the linear best-fit line matches the shape of the relationship. Bivariate outliers are locations that don't seem to fit the overall relationship between the two variables. These outliers appear far away from the best-fit line in the scatterplot. In the case of male and female black lung cancer, the relationship is very weak. This is due to the small population sizes and the many 0 rates at the SEA level. If you repeat this step for the male and female white populations, you will notice a very different result.

STEP 8: PATTERN AND OUTLIER DETECTION WITH THE LOCAL MORAN
The Moran scatterplot plots the relationship of each observation with the average value of its neighbors. This is useful for examining spatial autocorrelation. High spatial autocorrelation means that locations are highly correlated with their neighbors, while no or low spatial autocorrelation means that the value at a location is unrelated to the values of its neighbors.

Run the Moran method in SpaceStat by selecting Methods->Clustering->Local Moran->Univariate local Moran from the SpaceStat main menu. This will open the Task Manager pane for you to choose the settings for the method. For the dataset, select the SEA geography in the first dropdown. Select RWM in the second dataset dropdown. The default times spanning the whole dataset are fine, as is the default spatial weight set. At this point your settings should look like the following screenshot.

[Figure: Settings for the Univariate local Moran method]

Click the Advanced tab and select No for the Use Simes correction dropdown control. Click the Output tab to review where the results will be stored. Now click the Run button to start the method.

When the method finishes, results are placed in several locations. First, the global measure, Moran's I, is reported in the Log View. The Log View is normally at the bottom of the SpaceStat interface, but you may have to enable it through Window->Log View in the SpaceStat main menu if you have closed it. Moran's I ranges from about -1 to 1 (it can sometimes fall outside this range, depending on the spatial weights), where 1 is perfect positive spatial autocorrelation, 0 is no spatial autocorrelation, and -1 is perfect negative spatial autocorrelation. SpaceStat reports Moran's I for each time period along with a p value indicating the probability of obtaining this result under randomization if there were no spatial autocorrelation. Look at the values reported in the Log View and determine whether there is strong or weak autocorrelation in the RWM values. Is the spatial autocorrelation positive or negative? Does the spatial autocorrelation increase or decrease through time?

[Figure: Table of Moran's I results from the Log View]

The Moran method also creates a Moran scatterplot. The Moran scatterplot can be used to find spatial outliers. Spatial outliers are those locations that have values very different from their neighbors. The x-axis on the Moran scatterplot shows the rate of the subject SEA, while the y-axis shows the average rate of its neighbors. Spatial outliers are those values that tend to be located in the top left or the bottom right of the scatterplot. Remember that you can brush select these values to see where the locations are on the map.
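The two quantities discussed here, the neighbor average plotted on the y-axis and the global Moran's I, can be sketched in plain Python. This is a minimal illustration that assumes row-standardized weights built from a simple neighbor list, with every location having at least one neighbor. The tutorial does not specify SpaceStat's default spatial weight set, so these choices, the function names, and the toy data are assumptions. The sketch does not compute the randomization p value.

```python
def neighbor_means(values, neighbors):
    """Average value of each location's neighbors (Moran scatterplot y-axis)."""
    return [sum(values[j] for j in nbrs) / len(nbrs) for nbrs in neighbors]


def morans_i(values, neighbors):
    """Global Moran's I with row-standardized weights.

    I = (n / S0) * sum_i sum_j w_ij * z_i * z_j / sum_i z_i**2,
    where z_i = x_i - mean(x), w_ij = 1 / (number of neighbors of i) for each
    neighbor j, and S0 is the sum of all weights (equal to n here).
    """
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    cross = sum(
        z[i] * sum(z[j] for j in nbrs) / len(nbrs)
        for i, nbrs in enumerate(neighbors)
    )
    s0 = n  # each row of weights sums to 1
    return (n / s0) * cross / sum(zi * zi for zi in z)


# Toy example: four areas in a row, each adjacent to the next (hypothetical).
line = [[1], [0, 2], [1, 3], [2]]
clustered = [10, 10, 20, 20]    # similar values sit next to each other
alternating = [10, 20, 10, 20]  # each value differs from its neighbors

print(neighbor_means(clustered, line))
print(morans_i(clustered, line), morans_i(alternating, line))
```

In the toy data, the clustered arrangement gives a positive I and the alternating arrangement gives a negative I, matching the interpretation of positive and negative spatial autocorrelation given above.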

[Figure: Moran scatterplot]

Lastly, the Moran method creates a map in which locations are classified by their own value and the values of their neighbors. There are 5 possible classes. The first class is locations that are not significant clusters or outliers. There are two types of significant clusters: High-High and Low-Low. High-High (shown in SpaceStat in red) corresponds to locations that have a high value and are surrounded by other locations with high values. Low-Low (shown in SpaceStat in dark blue) corresponds to locations that have a low value and are surrounded by other locations with low values. There are also two types of significant outliers: High-Low and Low-High. High-Low (shown in SpaceStat in pink) corresponds to locations with values that are significantly higher than the values of their neighbors. Low-High (shown in SpaceStat in light blue) corresponds to locations with values that are significantly lower than the values of their neighbors.

[Figure: Map of Local Moran classifications in 1994]
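The five classes can be expressed as a small decision rule. The Python sketch below assumes that "high" and "low" are judged relative to the overall mean, which is the usual Moran scatterplot convention but is not stated in this tutorial. It also takes the significance flags as given input, since the significance test itself is not shown. The function name and the example data are hypothetical.

```python
def classify_local_moran(values, neighbor_avgs, significant):
    """Label each location as one of the five Local Moran classes.

    values        -- rate at each location
    neighbor_avgs -- average rate of each location's neighbors
    significant   -- True where the local statistic is significant
                     (supplied by the caller; not computed here)
    """
    mean = sum(values) / len(values)  # assumed threshold between high and low
    labels = []
    for value, lag, sig in zip(values, neighbor_avgs, significant):
        if not sig:
            labels.append("Not significant")
            continue
        own = "High" if value > mean else "Low"
        nbr = "High" if lag > mean else "Low"
        labels.append(f"{own}-{nbr}")
    return labels


# Hypothetical rates, neighbor averages, and significance flags.
rates = [30, 28, 5, 6, 29]
lags = [29, 29, 6, 25, 7]
flags = [True, False, True, True, True]
print(classify_local_moran(rates, lags, flags))
```

The first word of each label describes the location itself and the second describes its neighbors, so High-Low and Low-High are the spatial outliers and High-High and Low-Low are the clusters.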

Look for patterns and possible explanations for the clusters and spatial outliers in lung cancer mortality among white males. Also look for overall temporal trends as well as how individual locations change with time. The Local Moran method works best with data that are normally distributed. In some cases it is appropriate to transform the data before running the Local Moran method for this reason.

NEXT STEPS
Exploratory data analysis is a useful tool for beginning to understand patterns in data and for detecting outlying observations. It is usually a first step in a study and can lead to hypothesis testing or modeling. It is common to follow up an exploratory analysis with cluster detection, disparity detection, or regression analysis. The observations and possible explanations that you arrive at during an exploratory exercise should guide you in choosing appropriate questions to ask of the data. Equally important, exploratory analysis may help you decide which patterns or hypotheses are unlikely.