Lecture 1: Review and Exploratory Data Analysis (EDA)




Sandy Eckel (seckel@jhsph.edu)
Department of Biostatistics, The Johns Hopkins University, Baltimore, USA
21 April 2008

Course Information I
- Course website: http://www.biostat.jhsph.edu/ seckel/biostat2/
- Office hours: for questions and help. When? I'll announce this tomorrow.
- Homework: three assignments, following up on material from class
- Written exam
  - When: Wednesday 21 May, 10.00-12.00
  - Where: Multimedia classroom (C wing, 2nd floor), Mannerheimintie 172/Kytosuontie 9

Course Information II
- 21 April 2008 to 21 May 2008
- 08.30-12.30 on Monday, Tuesday, Thursday, Friday
  - 08.30-10.15 Lecture
  - 10.15-10.30 Break
  - 10.30-12.30 Informal lecture, class exercise or computer lab
- Activities for the second half of class will vary; also time for questions!

Class goals
- Biostat I
  - Numbers and probability
  - Sampling distributions and inference
  - Statistical models and association / causality
- Biostat II
  - Developing scientific questions
  - Translating questions into regression models
  - Interpreting results of regression
  - Critiquing the literature

Issues and recurring themes
- Populations are complicated... statistical techniques may not capture all of the nuances
- Natural laws will not perfectly predict outcomes
- Signal-to-Noise: comparing a trend to its variability
- Bias-Variance trade-off: unadjusted vs. adjusted estimates
- Population vs. sample

What is Biostatistics?
- Biostatistics is the use of data to describe and make inferences about a scientific problem
- Remember the Bio in Biostatistics!
- Biostatistics has limitations: you can't have it all

Types of Biostatistics
1. Descriptive statistics
   - Exploratory data analysis (EDA): often not in literature
   - Summaries: Table 1 in a paper
   - Goal: to visualize relationships, generate hypotheses
2. Inferential statistics
   - Confirmatory data analysis
   - Methods section of a paper
   - Goal: quantify relationships, test hypotheses

Exploratory Data Analysis (EDA)
- ALWAYS look at your data! If you can't see it, then don't believe it!
- EDA allows us to:
  1. Visualize distributions and relationships
  2. Detect errors
  3. Assess assumptions for confirmatory analysis
- EDA is the first step of data analysis

EDA methods (One-Way)
- Ordering: stem-and-leaf plots
- Grouping: frequency displays, distributions; histograms
- Summaries: summary statistics, standard deviation, box-and-whisker plots

Stem-and-Leaf Plots I
Age in years (10 observations): 25, 26, 29, 32, 35, 36, 38, 44, 49, 51

Age Interval   Observations
20-29          5 6 9
30-39          2 5 6 8
40-49          4 9
50-59          1

Stem-and-Leaf Plots II
- The age interval is the stem
- The observations are the leaves
- Rule of thumb: the number of stems should roughly equal the square root of the number of observations
- Or the stems should be logical categories

Stem-and-Leaf Plots III
Some statistical programs print output like this, where 2* means 20-29:

Age Interval   Observations
2*             5 6 9
3*             2 5 6 8
4*             4 9
5*             1

Stem-and-Leaf Plots IV
Output may also be shown like this, where 3* means 30-34 and 3. means 35-39:

Age Interval   Observations
2.             5 6 9
3*             2
3.             5 6 8
4*             4
4.             9
5*             1
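The stem-and-leaf displays above can be reproduced with a few lines of code. The following is a minimal sketch in Python, not part of the original slides; the function name `stem_and_leaf` and the `stem_width` parameter are assumptions chosen for illustration.

```python
from collections import defaultdict

def stem_and_leaf(values, stem_width=10):
    """Group values into stems (tens digit) and leaves (units digit).

    Hypothetical helper for illustration; assumes non-negative integers.
    """
    stems = defaultdict(list)
    for v in sorted(values):  # remember to order the data first
        stems[v // stem_width].append(v % stem_width)
    return dict(stems)

ages = [25, 26, 29, 32, 35, 36, 38, 44, 49, 51]
plot = stem_and_leaf(ages)

# Print in the "2* means 20-29" style used by some statistical programs.
for stem, leaves in plot.items():
    print(f"{stem}* | {' '.join(str(leaf) for leaf in leaves)}")
```

Because the values are sorted before grouping, the stems come out in increasing order and the leaves within each stem are ordered as well.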

Frequency Distribution Tables
- Show the number of observations for each range of data
- Intervals can be chosen in ways similar to stem-and-leaf displays

Age Interval   Frequency
20-29          3
30-39          4
40-49          2
50-59          1

Cumulative Frequency Distribution Tables
Show the frequency, the relative frequency, and the cumulative frequency of observations

Age Interval   Frequency   Cum. Freq.   Rel. Freq.   Cum. Rel. Freq.
20-29          3           3            0.3          0.3
30-39          4           7            0.4          0.7
40-49          2           9            0.2          0.9
50-59          1           10           0.1          1.0

- This table shows an empirical distribution function obtained from a sample
- The true distribution function is the distribution of the entire population
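As a rough illustration of how the columns of this table relate to one another, the sketch below recomputes them from the ten ages. It is an assumed Python rendering, not taken from the slides, and the variable names are chosen for illustration only.

```python
from itertools import accumulate

ages = [25, 26, 29, 32, 35, 36, 38, 44, 49, 51]
intervals = [(20, 29), (30, 39), (40, 49), (50, 59)]
n = len(ages)

# Frequency: number of observations falling in each interval.
freq = [sum(lo <= a <= hi for a in ages) for lo, hi in intervals]
# Cumulative frequency: running total of the frequencies.
cum_freq = list(accumulate(freq))
# Relative versions: divide by the sample size.
rel_freq = [f / n for f in freq]
cum_rel_freq = [c / n for c in cum_freq]

for (lo, hi), f, cf, rf, crf in zip(intervals, freq, cum_freq, rel_freq, cum_rel_freq):
    print(f"{lo}-{hi}  {f:>2}  {cf:>2}  {rf:.1f}  {crf:.1f}")
```

The last column is the empirical distribution function evaluated at the upper end of each interval, which is why it ends at 1.0.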

Histograms
- Picture of the frequency or relative frequency distribution

[Figure: Histogram of Age; x-axis: Age, y-axis: Frequency]

Note: Graphs are generally better to use in presentations than tables. They allow your audience to visualize a trend quickly.

Summary Statistics
- Percentiles
- Measures of central tendency
- Measures of dispersion or variability

Percentiles
The $r$th percentile, $P_r$, is the value that is greater than or equal to $r$ percent of a sample of $n$ observations or less than or equal to $(100 - r)$ percent of the observations

Percentile   Quartile   Formula
$P_{25}$     $Q_1$      $\frac{n+1}{4}$th observation
$P_{50}$     $Q_2$      $\frac{n+1}{2}$th observation
$P_{75}$     $Q_3$      $\frac{3(n+1)}{4}$th observation

Calculating quartiles I
From the age data, with $n = 10$: 25, 26, 29, 32, 35, 36, 38, 44, 49, 51

$Q_2$ = median = average of 5th and 6th observations = $\frac{35 + 36}{2} = 35.5$

Remember to order your data!

Calculating quartiles II
- $Q_1$ = median of lower half of data = third smallest value = 29
- $Q_3$ = median of upper half of data = third largest value = 44

Note: If $n$ is odd, include the median in the upper and lower half of the data.
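The median-of-halves procedure described above can be sketched as follows. This is an illustrative Python version under the slide's stated convention (for odd $n$, the median is included in both halves); the function name `quartiles` is an assumption, not something the lecture defines.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves rule.

    Hypothetical helper: for odd n the median is included in both halves,
    as in the note on the slide.
    """
    data = sorted(values)  # remember to order your data
    n = len(data)
    half = n // 2
    if n % 2 == 0:
        lower, upper = data[:half], data[half:]
    else:
        lower, upper = data[:half + 1], data[half:]
    return median(lower), median(data), median(upper)

ages = [25, 26, 29, 32, 35, 36, 38, 44, 49, 51]
q1, q2, q3 = quartiles(ages)
print(q1, q2, q3)
```

Other conventions exist, such as interpolating at the $\frac{n+1}{4}$th position, and they can give slightly different quartiles for the same data, so it is worth checking which rule a given program uses.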

Measures of Central Tendency

Measure   Formula
Mean      $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
Median    Middle observation
Mode      Most frequent observation

From the age example the mean is: $\frac{25+26+29+32+35+36+38+44+49+51}{10} = 36.5$

The mode is more helpful for categorical data, e.g. the most frequent age interval is 30-39 and it has 4 observations.
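For readers who want to check these numbers, here is a small Python sketch using the standard library. It is an illustration rather than part of the lecture; grouping ages into ten-year intervals by integer division is an assumption that happens to match the intervals used in the slides.

```python
from collections import Counter
from statistics import mean, median

ages = [25, 26, 29, 32, 35, 36, 38, 44, 49, 51]

age_mean = mean(ages)      # balancing point of the data
age_median = median(ages)  # average of the 5th and 6th ordered values

# Every age occurs once, so the mode of the raw values is not informative.
# Instead, find the most frequent ten-year interval (20-29, 30-39, ...).
interval_counts = Counter((a // 10) * 10 for a in ages)
modal_start, modal_count = interval_counts.most_common(1)[0]

print(age_mean, age_median)
print(f"Modal interval: {modal_start}-{modal_start + 9} with {modal_count} observations")
```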

Measures of spread: Range
- Range = max - min: the difference between the maximum and minimum values
- From age example: Max = 51, Min = 25, so Range = 51 - 25 = 26

Measures of spread: Variance
- Variance = expected value of the squared deviation of the observations from the true mean $\mu$:
  $\sigma^2 = E[(X - \mu)^2]$
- Sample variance = average of the squared deviation of the observations from the sample mean $\bar{x}$:
  $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
- Sample variance from age example = 82.9

Standard deviation
- Standard deviation = square root of the variance:
  $\sigma = \sqrt{E[(X - \mu)^2]}$
- Sample standard deviation = square root of the sample variance:
  $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
- From the age data: $s = \sqrt{82.9} = 9.1$

Note: The units of the variance are years$^2$, while the units of the standard deviation are years.

Interpretation: The standard deviation gives an idea of how much observations differ from the mean
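The sample variance and standard deviation for the age data can be verified with a short computation. The sketch below is an assumed Python illustration, not part of the slides; `sample_variance` is a hypothetical helper that divides by $n - 1$ as in the sample variance formula, and the standard library's `statistics.variance` is used as a cross-check.

```python
import math
from statistics import variance

def sample_variance(xs):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)

ages = [25, 26, 29, 32, 35, 36, 38, 44, 49, 51]

s2 = sample_variance(ages)  # units: years squared
s = math.sqrt(s2)           # units: years

print(round(s2, 1), round(s, 1))
# Cross-check against the standard library implementation.
print(math.isclose(s2, variance(ages)))
```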

Box-and-whisker plots I
Box-and-whisker plots display quartiles. Some terminology:
- Upper Hinge = $Q_3$ = third quartile
- Lower Hinge = $Q_1$ = first quartile
- Interquartile range (IQR) = $Q_3 - Q_1$; contains the middle 50% of data
- Upper Fence = Upper Hinge + 1.5 * (IQR)
- Lower Fence = Lower Hinge - 1.5 * (IQR)
- Outliers: data values beyond the fences
- Whiskers are drawn to the smallest and largest observations within the fences

Box-and-whisker plots II

[Figure: Boxplot of Age; axis: Age in Years]

- IQR = 44 - 29 = 15
- Upper Fence = 44 + 15 * 1.5 = 66.5
- Lower Fence = 29 - 15 * 1.5 = 6.5
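The fence arithmetic above, and the rule for where the whiskers end, can be sketched in Python as follows. This is an illustrative example under the slide's definitions; the function name `fences` and the parameter `k` are assumptions, and the quartiles 29 and 44 are taken directly from the age example.

```python
def fences(q1, q3, k=1.5):
    """Return (lower fence, upper fence) as hinge -/+ k * IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

ages = [25, 26, 29, 32, 35, 36, 38, 44, 49, 51]
lower_fence, upper_fence = fences(q1=29, q3=44)

# Outliers are values beyond the fences.
outliers = [a for a in ages if a < lower_fence or a > upper_fence]
# Whiskers extend to the smallest and largest observations within the fences.
inside = [a for a in ages if lower_fence <= a <= upper_fence]
whiskers = (min(inside), max(inside))

print(lower_fence, upper_fence, outliers, whiskers)
```

In this sample no value falls outside the fences, so the whiskers simply reach the minimum and maximum ages.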

Pairwise EDA
- 2 Categorical Variables
  - Frequency table
- 1 Categorical, 1 Continuous Variable
  - Stratified stem-and-leaf plots
  - Side-by-side box plots
- 2 Continuous Variables
  - Scatterplot

2 Categorical Variables: Frequency Table

Age Interval   Female   Male   Total
20-29          1        2      3
30-39          2        2      4
40-49          1        1      2
50-59          1        0      1
Total          5        5      10

It looks like the men tend to be younger than the women in this example.

1 Categorical and 1 Continuous Variable I: Stratified Stem-and-Leaf Plots

               Female   Male
Age Interval   Obs.     Obs.
20-29          6        5 9
30-39          5 6      2 8
40-49          9        4
50-59          1
Total          5        5      (10 overall)
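The two-way frequency table for age interval by gender can be rebuilt from the individual ages. The following Python sketch is an illustration, not part of the lecture; it assumes the gender of each age as read off the stratified stem-and-leaf plot (females 26, 35, 36, 49, 51; males 25, 29, 32, 38, 44), and the helper name `interval` is hypothetical.

```python
from collections import Counter

# Ages by gender, read from the stratified stem-and-leaf plot.
females = [26, 35, 36, 49, 51]
males = [25, 29, 32, 38, 44]
records = [(a, "Female") for a in females] + [(a, "Male") for a in males]

def interval(age):
    """Label the ten-year interval containing an age, e.g. 26 -> '20-29'."""
    lo = (age // 10) * 10
    return f"{lo}-{lo + 9}"

# Count observations in each (interval, gender) cell.
table = Counter((interval(age), gender) for age, gender in records)

for row in ["20-29", "30-39", "40-49", "50-59"]:
    f, m = table[(row, "Female")], table[(row, "Male")]
    print(f"{row}  {f}  {m}  {f + m}")
```

A `Counter` returns 0 for cells that never occur, which is how the empty 50-59 male cell is handled here.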

1 Categorical and 1 Continuous Variable II: Side-by-Side Box Plots

[Figure: Boxplot of Age by Gender; y-axis: Age in Years, groups: Female, Male]

Allows us to compare the distribution of the continuous variable (age) across values of the categorical variable (gender)

2 Continuous Variables: Scatterplot

[Figure: Age by Height; x-axis: Age in Years, y-axis: Height in Centimeters]

Scatterplots visually display the relationship between two continuous variables

EDA: What to notice
- Shape
- Center
- Spread

Common Distribution Shapes
- Symmetrical and bell shaped
- Positively skewed or skewed to the right
- Negatively skewed or skewed to the left

Other Distribution Shapes
- Bimodal
- Reverse J shaped
- Uniform

Measures of Center
- Mode: peak(s)
- Median: equal areas point
- Mean: balancing point

Skewness I: Positively skewed
- Longer tail in the high values
- Mean > Median > Mode

[Figure: Positively skewed (skewed to the right) distribution with mode, median and mean marked]

Skewness II: Negatively skewed
- Longer tail in the low values
- Mode > Median > Mean

[Figure: Negatively skewed (skewed to the left) distribution with mean, median and mode marked]
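To see these orderings on actual numbers, consider the small Python sketch below. The two samples are hypothetical values made up purely for illustration (they are not from the lecture), and the ordering shown is typical of skewed data rather than a guarantee for every data set.

```python
from statistics import mean, median, mode

# Hypothetical right-skewed sample: most values are small, with a long tail of high values.
right_skewed = [1, 2, 2, 2, 3, 4, 5, 6, 10, 15]
# Mirror image of the same sample, giving a long tail of low values.
left_skewed = [16 - x for x in right_skewed]

for name, data in [("right-skewed", right_skewed), ("left-skewed", left_skewed)]:
    print(name, "mean:", mean(data), "median:", median(data), "mode:", mode(data))
```

The mean is pulled furthest toward the long tail, the mode stays at the peak, and the median sits between them.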

Symmetric
- Right and left sides are mirror images
- Left tail looks like right tail
- Mean = Median = Mode

[Figure: Symmetric distribution]

EDA: What to notice
Outliers
- Values that are far from the bulk of the data
- Outliers can influence the value of some statistical measures

Age example:

Data                     Mean
Original                 36.5
With 80-year-old added   40.5
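The effect of the added 80-year-old can be checked directly. This is a brief Python illustration, not part of the slides; the comparison with the median is an addition meant to show that some measures are influenced by outliers much more than others.

```python
from statistics import mean, median

ages = [25, 26, 29, 32, 35, 36, 38, 44, 49, 51]
with_outlier = ages + [80]  # add one 80-year-old to the sample

# The mean shifts noticeably, while the median barely moves.
print("mean:  ", mean(ages), "->", round(mean(with_outlier), 1))
print("median:", median(ages), "->", median(with_outlier))
```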

Take Home Message
- Look at your data FIRST!
- Happy exploring!