Graphs. Exploratory data analysis.



Graphs
Exploratory data analysis
Dr. David Lucy, d.lucy@lancaster.ac.uk
Lancaster University

Graphs

Graphics are a very important part of making sense of data:
- They allow the researcher to compare quantities easily and simply by comparing lengths and/or areas.
- Many humans are adapted to view quantities rather than numbers.
- They have immediate impact.
- They suggest ideas for further work.
- For many scientists this is their main form of analysis - some of the world's best science has been done purely by graphs.

Graphs

A graph is a suitable way of representing data if:
- A line or area can represent the quantities in the data in some way.
- Several standard forms can be used.
- Standard forms are not the only forms.
- You can make up your own if you please - R is very good for this.

Standard forms

There are three standard types of graph:
1. histogram - used to examine the distribution of a set of observations - can be used to compare distributions between sets of observations - observations may be discrete (underlying continuous) or continuous,
2. scatterplot - used to look for relationships between different continuous variables,
3. boxplot - sometimes called a box and whiskers plot - used to compare distributions of continuous variables, which is equivalent to looking for relationships between factors and continuous variables.

Histograms

[Figures: histograms in two panels, "Partial" and "Full".]

Histograms

Do not confuse histograms with barcharts:
- Histograms have the area proportional to the quantity of interest: column widths are not necessarily equal, although most are.
- Barcharts have the column height proportional to the quantity of interest.

Exercise 3.2.2

Rescale the full histogram so the bars sum to one.

Number of runs: 71 28 5 2 2 1 (in 109 observations)

Exercise 3.2.2

The frequencies sum to 109, so divide each frequency by 109:

Number of runs   71    28    5     2     2     1     (in 109 observations)
Normalised       0.65  0.26  0.05  0.02  0.02  0.01

This process is sometimes called normalisation by scientists.

Scaled histograms

[Figure: scaled histograms in two panels, "Partial" and "Full".]

Histogram comparison

[Figure: two histograms compared, labelled "Ladybower Reservoir"; x-axis: daily max ozone.]

Histogram comparison

[Figure: "Differences"; x-axis: daily max ozone.]
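The rescaling in Exercise 3.2.2 can be sketched in a few lines of Python. This is only an illustration: the function name `normalise` is made up for this note, and exact fractions are used so that the proportions sum to exactly one before rounding.

```python
from fractions import Fraction

# Frequencies from Exercise 3.2.2 (number of runs in each bar).
counts = [71, 28, 5, 2, 2, 1]


def normalise(frequencies):
    """Divide each frequency by the total so that the results sum to one."""
    total = sum(frequencies)
    return [Fraction(f, total) for f in frequencies]


proportions = normalise(counts)

# Round to two decimal places for display, as on the slide.
rounded = [round(float(p), 2) for p in proportions]

print(sum(counts))       # total number of observations
print(sum(proportions))  # exact sum of the rescaled bars
print(rounded)
```

Note that the rounded values need not sum to exactly one, even though the exact proportions do.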

Histogram problems

[Figure: two histograms on a probability density scale; x-axis: summer daily maxima.]

Kernel density estimates

[Figures: kernel density estimates; one plotted against x, one labelled "Ladybower" with x-axis: ozone (ppb).]

Cumulative distributions

Recall the cumulative distribution function (c.d.f.) of a random variable X:

$$F(x) = P(X \le x)$$

How can we estimate this from a finite number of observations?

Cumulative distributions

Let us assume that:
- our variables $X_1, \ldots, X_n$ are independent and identically distributed (i.i.d.);
- they are replicates of a random variable $X$ which has cumulative distribution function $F$.

We can denote by $x_1, \ldots, x_n$ the observed values of $X_1, \ldots, X_n$.

The empirical cumulative distribution function (c.d.f.) is defined as:

$$F(x) = \frac{1}{n}\,(\text{number of } x_i \le x) = \frac{1}{n}\sum_{i=1}^{n} \Pi(x_i \le x)$$

where

$$\Pi(x_i \le x) = \begin{cases} 1 & \text{if } x_i \le x \\ 0 & \text{if } x_i > x \end{cases}$$

The empirical c.d.f. is a proper distribution function and has the following properties:
- $F(x)$ is a step function with jumps at the data points;
- $F(x) = 1$ if $x \ge \max(x_1, \ldots, x_n)$;
- $F(x) = 0$ if $x < \min(x_1, \ldots, x_n)$.

To construct:
- Take the observed values and order them so that the smallest one comes first.
- Label these ordered values $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$ so that $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$.
- Then the kth ordered point $x_{(k)}$ is the $k/n$th quantile.
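The definition of the empirical c.d.f. translates almost directly into code. The sketch below is one possible reading of it in Python, not part of the original slides: the name `ecdf` is chosen for this note, and the observations are those of Exercise 3.2.5.

```python
from fractions import Fraction


def ecdf(observations):
    """Return a function F with F(x) = (number of x_i <= x) / n."""
    data = list(observations)
    n = len(data)
    if n == 0:
        raise ValueError("need at least one observation")

    def F(x):
        # Sum of the indicator: 1 if x_i <= x, 0 if x_i > x.
        count = sum(1 for x_i in data if x_i <= x)
        return Fraction(count, n)

    return F


F = ecdf([1, 2, 2, 3, 4])  # observations from Exercise 3.2.5

for x in [0, 1, 2.5, 4, 5]:
    print(x, F(x))
```

Returning exact fractions keeps the step heights recognisable as multiples of 1/n; converting with `float` gives the decimal form.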

Exercise 3.2.5

For the observations {1, 2, 2, 3, 4}, find F(x) and sketch the plot.

x              0    0.5  1    1.5  2    2.5  3    3.5  4    4.5  5
n(x_i ≤ x)     0    0    1    1    3    3    4    4    5    5    5
F(x)           0    0    1/5  1/5  3/5  3/5  4/5  4/5  5/5  5/5  5/5
F(x)           0    0    0.2  0.2  0.6  0.6  0.8  0.8  1    1    1

[Figure: plot of F(x) against x.]

Exercise 3.2.6

Construct the c.d.f. for the first 20 points from the summer ozone measurements and sketch it. These are:

32 29 32 32 33 27 34 22 30 35 27 23 28 34 35 45 36 26 23 16

At each sorted data point the c.d.f. jumps by 1/n, which is 1/20 as n = 20; tied values give a correspondingly larger jump.
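For Exercise 3.2.6, the construction by sorting can be sketched as follows. This is an illustrative Python version rather than the course's own solution; `ecdf_steps` is a name invented here, and it returns the height of the c.d.f. just after each distinct data value.

```python
from collections import Counter
from fractions import Fraction

# First 20 summer ozone measurements, as listed in Exercise 3.2.6.
ozone = [32, 29, 32, 32, 33, 27, 34, 22, 30, 35,
         27, 23, 28, 34, 35, 45, 36, 26, 23, 16]


def ecdf_steps(observations):
    """Return (value, F(value)) for each distinct value, in sorted order."""
    n = len(observations)
    counts = Counter(observations)
    steps = []
    cumulative = 0
    for value in sorted(counts):
        # Each observation contributes a jump of 1/n; ties stack up.
        cumulative += counts[value]
        steps.append((value, Fraction(cumulative, n)))
    return steps


steps = ecdf_steps(ozone)
for value, height in steps:
    print(value, float(height))
```

Plotting these pairs as a step function, with F equal to 0 to the left of the smallest value, gives the sketch the exercise asks for.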

Exercise 3.2.6

[Figure: empirical c.d.f. Fn(x) against summer daily maxima.]

Summer ozone

[Figure: empirical c.d.f. Fn(x) against summer daily maxima.]

Scatterplots

- Scatterplots look at the relationship between continuous variables.
- Usually they project two dimensions onto two dimensions.
- There are several ways of representing three dimensions.
- Scatterplots are the mainstay of the physical sciences.

Scatterplots

[Figure: scatterplots in two panels, "Summer" and "Winter"; x-axis: NO2.]

Scatterplots

[Figure: scatterplot; axes: peripheral COHb saturation level and heart COHb saturation level.]

Independence

- Scatterplots can be used to look for dependence between continuous variables.
- They can also be useful to identify situations in which variables appear to be independent.
- If two variables are independent, then the distribution of one variable will look the same regardless of the value of the other variable.
- This is what the plots of ozone versus NO2 above looked like.

Independence

Conditional probabilities were introduced in Math104: if A and B are two events then, as long as P(B) > 0, the conditional probability of A given B is written as $P(A \mid B)$ and calculated from:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

Independence

We can look for some structure in our data, including the dependence of one variable on another, by examining conditional distributions of some subsets of our data. Do this by separating the data by some defined criterion, and plotting the subsets.
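The conditional probability formula, and its link to independence, can be checked on a small example. The sketch below assumes a fair six-sided die, which is not part of the slides and is used purely for illustration; the helper names `prob` and `cond_prob` are likewise invented here.

```python
from fractions import Fraction

# Assumption for illustration: a fair six-sided die.
outcomes = {1, 2, 3, 4, 5, 6}


def prob(event):
    """P(event) when all outcomes are equally likely."""
    return Fraction(len(event & outcomes), len(outcomes))


def cond_prob(a, b):
    """P(A | B) = P(A and B) / P(B), defined only when P(B) > 0."""
    p_b = prob(b)
    if p_b == 0:
        raise ValueError("P(B) must be positive")
    return prob(a & b) / p_b


even = {2, 4, 6}
at_most_four = {1, 2, 3, 4}
at_least_four = {4, 5, 6}

print(prob(even))                      # unconditional probability
print(cond_prob(even, at_most_four))   # same as P(even): independent
print(cond_prob(even, at_least_four))  # differs from P(even): dependent
```

When conditioning leaves the probability unchanged, the two events are independent, which is the discrete analogue of a distribution that looks the same regardless of the value of the other variable.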

Exercise 3.2.8

[Figure: four panels, "Summer Ozone NO2 <= 40", "Winter Ozone NO2 <= 40", "Summer Ozone 40 < NO2 <= 60" and "Winter Ozone 40 < NO2 <= 60".]

Boxplots

The third of the standard forms for graphs:
- Similar to multiple histograms.
- Examine the distribution of a continuous variable for different levels of a discrete variable.
- The discrete variable can be ordered, or nominal.

Boxplots

[Figure: boxplots for "Summer" and "Winter", labelled "Ladybower".]

Next session

Next time we shall:
1. take a look at boxplots,
2. learn about some of the classic plots from history,
3. find out what makes a good graph,
4. look at some non-standard forms of graphs.