Visualization and descriptive statistics. D.A. Forsyth



Similar documents
STATS8: Introduction to Biostatistics. Data Exploration. Babak Shahbaba Department of Statistics, UCI

Diagrams and Graphs of Statistical Data

AMS 7L LAB #2 Spring, Exploratory Data Analysis

Introduction to Data Visualization

The Big Picture. Describing Data: Categorical and Quantitative Variables Population. Descriptive Statistics. Community Coalitions (n = 175)

Summarizing and Displaying Categorical Data

Descriptive statistics parameters: Measures of centrality

Introduction to Statistics for Psychology. Quantitative Methods for Human Sciences


Exercise 1.12 (Pg )

2013 MBA Jump Start Program. Statistics Module Part 3

Foundation of Quantitative Data Analysis

Exploratory data analysis (Chapter 2) Fall 2011

Northumberland Knowledge

Descriptive statistics Statistical inference statistical inference, statistical induction and inferential statistics

Iris Sample Data Set. Basic Visualization Techniques: Charts, Graphs and Maps. Summary Statistics. Frequency and Mode

MTH 140 Statistics Videos

Introduction to Regression and Data Analysis

Visualization Quick Guide

A Correlation of. to the. South Carolina Data Analysis and Probability Standards

Data Exploration Data Visualization

TIBCO Spotfire Business Author Essentials Quick Reference Guide. Table of contents:

DESCRIPTIVE STATISTICS. The purpose of statistics is to condense raw data to make it easier to answer specific questions; test hypotheses.

Lecture 2: Descriptive Statistics and Exploratory Data Analysis

Using SPSS, Chapter 2: Descriptive Statistics

Data Visualization Handbook

CRLS Mathematics Department Algebra I Curriculum Map/Pacing Guide

Tutorial 3: Graphics and Exploratory Data Analysis in R Jason Pienaar and Tom Miller

Data exploration with Microsoft Excel: analysing more than one variable

430 Statistics and Financial Mathematics for Business

Visualizations. Cyclical data. Comparison. What would you like to show? Composition. Simple share of total. Relative and absolute differences matter

GeoGebra Statistics and Probability

Center: Finding the Median. Median. Spread: Home on the Range. Center: Finding the Median (cont.)

Using Excel for descriptive statistics

Data Visualization Techniques

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Slides by Tan, Steinbach, Kumar adapted by Michael Hahsler

Introduction Course in SPSS - Evening 1

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012

Intro to Statistics 8 Curriculum

Module 2: Introduction to Quantitative Data Analysis

Visualizing Data. Contents. 1 Visualizing Data. Anthony Tanbakuchi Department of Mathematics Pima Community College. Introductory Statistics Lectures

Describing, Exploring, and Comparing Data

Data Visualization. BUS 230: Business and Economic Research and Communication

Simple Predictive Analytics Curtis Seare

R Graphics Cookbook. Chang O'REILLY. Winston. Tokyo. Beijing Cambridge. Farnham Koln Sebastopol

UNIT 1: COLLECTING DATA

Mathematics (Project Maths)

COM CO P 5318 Da t Da a t Explora Explor t a ion and Analysis y Chapte Chapt r e 3

Bar Charts, Histograms, Line Graphs & Pie Charts

4.1 Exploratory Analysis: Once the data is collected and entered, the first question is: "What do the data look like?"

Lecture 1: Review and Exploratory Data Analysis (EDA)

Math 132. Population Growth: the World

business statistics using Excel OXFORD UNIVERSITY PRESS Glyn Davis & Branko Pecar

Bowerman, O'Connell, Aitken Schermer, & Adcock, Business Statistics in Practice, Canadian edition

Part 2: Data Visualization How to communicate complex ideas with simple, efficient and accurate data graphics

Creating an Excel XY (Scatter) Plot

Introduction to Dashboards in Excel Craig W. Abbey Director of Institutional Analysis Academic Planning and Budget University at Buffalo

STAB22 section 1.1. total = 88(200/100) + 85(200/100) + 77(300/100) + 90(200/100) + 80(100/100) = = 837,

Data Visualization Best Practice. Sophie Sparkes Data Analyst

Once saved, if the file was zipped you will need to unzip it. For the files that I will be posting you need to change the preferences.

STAT355 - Probability & Statistics

DESCRIPTIVE STATISTICS AND EXPLORATORY DATA ANALYSIS

1.3 Measuring Center & Spread, The Five Number Summary & Boxplots. Describing Quantitative Data with Numbers

BNG 202 Biomechanics Lab. Descriptive statistics and probability distributions I

Data analysis process

WebFOCUS RStat. RStat. Predict the Future and Make Effective Decisions Today. WebFOCUS RStat

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining

II. DISTRIBUTIONS distribution normal distribution. standard scores

Data Visualization Techniques

Descriptive Statistics

Statistics I for QBIC. Contents and Objectives. Chapters 1 7. Revised: August 2013

Analyzing Experimental Data

Evaluating the results of a car crash study using Statistical Analysis System. Kennesaw State University

Prentice Hall Mathematics Courses 1-3 Common Core Edition 2013

TIPS FOR DOING STATISTICS IN EXCEL

The Dummy s Guide to Data Analysis Using SPSS

Getting started in Excel

Scatter Plots with Error Bars

Data Exploration and Preprocessing. Data Mining and Text Mining (UIC Politecnico di Milano)

IBM SPSS Direct Marketing 23

EXPLORING SPATIAL PATTERNS IN YOUR DATA

Measurement & Data Analysis. On the importance of math & measurement. Steps Involved in Doing Scientific Research. Measurement

1. Go to your programs menu and click on Microsoft Excel.

Below is a very brief tutorial on the basic capabilities of Excel. Refer to the Excel help files for more information.

IBM SPSS Direct Marketing 22

Mathematics (Project Maths)

Common Core Unit Summary Grades 6 to 8

Projects Involving Statistics (& SPSS)

Simple Linear Regression

MATH BOOK OF PROBLEMS SERIES. New from Pearson Custom Publishing!

Chapter 7: Simple linear regression Learning Objectives

Getting started manual

R and Rcmdr : Basic Functions for Managing Data

Exploratory Data Analysis

Data Mining Techniques Chapter 5: The Lure of Statistics: Data Mining Using Familiar Tools

Demographics of Atlanta, Georgia:

Data Preparation and Statistical Displays

Transcription:

Visualization and descriptive statistics D.A. Forsyth

What s going on here? Most important, most creative scientific question Getting answers Make helpful pictures and look at them Compute numbers in support of making pictures Data has types Continuous Discrete Ordinal (can be ordered) Categorical (no natural order, cat vs hat ) Different plots apply

Histograms Categorical data Ick!

Bar Charts Categorical data - counts in category

Histograms Ick! Continuous data

Histograms

Conditional Histograms

Data example Clicks, impressions and ages for NYT website https://github.com/oreillymedia/doing_data_science Question: Look at data - what s going on? Example R code on webpage

Why R? It s free It s easy to get pictures up and going from weirdly formatted datasets Many, many tools most of the code I ll work with is downloaded/copied that s the right strategy work with tools *without* implementing them

Some R setwd('/users/daf/current/courses/bigdata/examples') data1<-read.csv('/users/daf/current/courses/bigdata/doing_data_science-master/dds_datasets/dds_ch2_nyt/nyt1.csv') data1$agecat<-cut(data1$age, c(-inf, 0, 18, 24, 34, 44, 54, 64, 74, 84, Inf)) # This breaks the Age column into categories data1$impcat<-cut(data1$impressions, c(-inf, 0, 1, 2, 3, 4, 5, Inf)) # This breaks the impression column into categories summary(data1)

Age Gender Impressions Clicks Signed_In agecat impcat Min. : 0.00 Min. :0.000 Min. : 0.000 Min. :0.00000 Min. :0.0000 (-Inf,0]:137106 (-Inf,0]: 3066 1st Qu.: 0.00 1st Qu.:0.000 1st Qu.: 3.000 1st Qu.:0.00000 1st Qu.:0.0000 (34,44] : 70860 (0,1] : 15483 Median : 31.00 Median :0.000 Median : 5.000 Median :0.00000 Median :1.0000 (44,54] : 64288 (1,2] : 38433 Mean : 29.48 Mean :0.367 Mean : 5.007 Mean :0.09259 Mean :0.7009 (24,34] : 58174 (2,3] : 64121 3rd Qu.: 48.00 3rd Qu.:1.000 3rd Qu.: 6.000 3rd Qu.:0.00000 3rd Qu.:1.0000 (54,64] : 44738 (3,4] : 80303 Max. :108.00 Max. :1.000 Max. :20.000 Max. :4.00000 Max. :1.0000 (18,24] : 35270 (4,5] : 80477 (Other) : 48005 (5, Inf]:176558

Users by age

Impression histogram, faceted by age

Click histogram, faceted by age

Click/Impression histogram, faceted by age

2D Data

Categorical data Pie charts are deprecated - it s hard to judge area by eye accurately

Mosaic Plots

The UFO data set http://www.infochimps.com/datasets/60000-documented-ufo-sightings-with-text-descriptions-and-metada UFO sighting data date of sighting; date of report; location; description; some free text rather messy data about 15 years of sightings ( 95-08 with some others) broke into 1000 day blocks looked at most common shape descriptors (' disk', ' light', ' circle', ' triangle', ' sphere', ' oval', ' other', ' unknown') great example of categorical data R-code on website not great code, but informative building a map, merging datasets, reading datasets, mosaic plots you should look at this

Conclusion: UFO shapes haven t changed over time

Ordinal data

Ordinal data

Series

Scatter plots Plot a marker at a location where there is a datapoint Simplest case - geographic

Arsenic in well water

UFO sightings by state

UFO s by interval

UFO s by interval

UFO s by interval

UFO s by interval

UFO s by interval

UFO s by interval

Interesting analogy Blackett s reasoning about submarine sightings in WWII can estimate probability of sightings lead to significantly improved sighting rates, aircraft painting and lighting strategies (see Korner, The pleasures of counting or good histories)

NYT data - remarks Many data points lying on top of each other scatter plot can be deceptive jitter the points (move by a small random amount)

Age Gender Impressions Clicks Signed_In agecat impcat Min. : 0.00 Min. :0.000 Min. : 0.000 Min. :0.00000 Min. :0.0000 (-Inf,0]:137106 (-Inf,0]: 3066 1st Qu.: 0.00 1st Qu.:0.000 1st Qu.: 3.000 1st Qu.:0.00000 1st Qu.:0.0000 (34,44] : 70860 (0,1] : 15483 Median : 31.00 Median :0.000 Median : 5.000 Median :0.00000 Median :1.0000 (44,54] : 64288 (1,2] : 38433 Mean : 29.48 Mean :0.367 Mean : 5.007 Mean :0.09259 Mean :0.7009 (24,34] : 58174 (2,3] : 64121 3rd Qu.: 48.00 3rd Qu.:1.000 3rd Qu.: 6.000 3rd Qu.:0.00000 3rd Qu.:1.0000 (54,64] : 44738 (3,4] : 80303 Max. :108.00 Max. :1.000 Max. :20.000 Max. :4.00000 Max. :1.0000 (18,24] : 35270 (4,5] : 80477 (Other) : 48005 (5, Inf]:176558

NYT scatters

Scale is an issue

Outliers can set scale

But scale is really a problem

Lynx pelts

Data example Housing sales in NYC boroughs https://github.com/oreillymedia/doing_data_science Question: Look at real estate sales - what s going on?

Summary Statistics - mean The average The best estimate of the value of a new datapoint in the absence of any other information about it

Summary statistics - Standard deviation Think of this as a scale Average distance from mean Important math properties in notes

Standard deviation = there are not many points many standard deviations away from the mean = there is at least one point at least one standard deviation away from the mean

Standard coordinates

Suppressing scale effects Do scatter plots in standard coordinates for x, y

Lynx, normalized

x, y don t really matter

Positive Correlation

Zero Correlation

Negative correlation

The Correlation Coefficient

Correlation isn t causality and foot size is positively correlated with reading ability, etc.

but can be used to predict

What s going wrong here? NYT normalized

A Mosaic Plot