Exploratory Data Analysis. Psychology 3256





Introduction
If you are going to find out anything about a data set, you must first understand the data.
Basically, this means getting a feel for your numbers:
Easier to find mistakes
Easier to guess what actually happened
Easier to find odd values

Introduction
One of the most important and most overlooked parts of statistics is Exploratory Data Analysis, or EDA.
Developed by John Tukey
Allows you to generate hypotheses as well as get a feel for your data
Get an idea of how the experiment went without losing any richness in the data

Hey look, numbers!

x (the value)   f (frequency)
10              1
23              2
25              5
30              2
33              1
35              1

Frequency tables make stuff easy
10(1) + 23(2) + 25(5) + 30(2) + 33(1) + 35(1) = 309

Relative Frequency Histogram
You can use this to make a relative frequency histogram.
Lose no richness in the data
Easy to reconstruct data set
Allows you to spot oddities
[Figure: "Relative Frequency Histogram"; x axis: Score, y axis: Frequency]
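As a rough illustration of the frequency table arithmetic, here is a minimal Python sketch. The variable names (`freq`, `rel_freq`, `data`) are made up for this example and are not from the slides. It recomputes the weighted sum, turns the counts into relative frequencies, and rebuilds the raw data set from the table.

```python
# Frequency table from the slide: value -> frequency
freq = {10: 1, 23: 2, 25: 5, 30: 2, 33: 1, 35: 1}

n = sum(freq.values())                       # number of observations
total = sum(x * f for x, f in freq.items())  # 10(1) + 23(2) + ... + 35(1)
mean = total / n

# Relative frequency: each count divided by the number of observations
rel_freq = {x: f / n for x, f in freq.items()}

# Nothing is lost: the raw data set can be rebuilt from the table
data = [x for x, f in freq.items() for _ in range(f)]

print(n, total, mean)
for x, p in rel_freq.items():
    print(f"{x:>3}  {p:.3f}")
```

Because the table keeps every distinct value, going back and forth between the table and the raw scores loses no information.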

Categorical Data
With categorical data you do not get a histogram, you get a bar graph.
You could do a pie chart too, though I hate them (but I love pie).
Pretty much the same thing, but the x axis really does not have a scale, so to speak.
So say we have a STAT 2126 class with 38 Psych majors, 15 Soc majors, 18 CESD majors and 5 Bio majors.

Like this
[Figure: bar graph and pie chart, both titled "STAT 2126"; bar graph x axis: Major (Psych, Soc, CESD, Biology), y axis: Count]
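For readers who want to play with the class counts, the following is a small sketch (plain Python, text output only; the `counts` dictionary and the `#` bars are just an illustration, not the plotting tool used for the slides). It prints each major's count, its share of the class, and a crude text bar.

```python
# Counts of majors in the hypothetical STAT 2126 class from the slide
counts = {"Psych": 38, "Soc": 15, "CESD": 18, "Biology": 5}

total = sum(counts.values())

# One row per category: the categories have no numeric scale, only labels
for major, c in counts.items():
    share = c / total
    print(f"{major:<8}{c:>3}  {share:6.1%}  {'#' * c}")
```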

Quantitative Variables
So with these, of course, we use a histogram.
We can see:
Central tendency
Spread
Shape

Skewness

Kurtosis
Leptokurtic means peaked
Platykurtic means flat

More on shape
A distribution can be symmetrical or asymmetrical.
It may also be unimodal or bimodal.
It could be uniform.

An example
Number of goals scored per year by Mario Lemieux:
43 48 54 70 85 45 19 44 69 17 69 50 35 6 28 1 7
A histogram is a good start, but you probably need to group the values.

Mario could sorta play
[Figure: histogram "Goals Per Season"; x axis: Goal Totals (labels 10 to 90), y axis: Frequency]
Wait a second, what is with that 90?
Labels are midpoints; the class limits run from 5-14 up to 85-94.
Real limits of the top class are 84.5 to 94.5.
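The grouping behind a histogram like this can be sketched in a few lines of Python. This assumes classes of width 10 labelled by their rounded midpoints (so 85-94 is labelled 90), as the slide describes. The helper name `midpoint_label` is invented for this example, and the season with 1 goal lands in a class labelled 0, which may or may not have been drawn in the original figure.

```python
from collections import Counter

goals = [43, 48, 54, 70, 85, 45, 19, 44, 69, 17, 69, 50, 35, 6, 28, 1, 7]

def midpoint_label(x, width=10):
    """Label of the class containing x: 5-14 -> 10, 15-24 -> 20, ..., 85-94 -> 90."""
    return (x + width // 2) // width * width

bins = Counter(midpoint_label(g) for g in goals)

# Text histogram: one row per class, one star per season
for label in range(0, 100, 10):
    print(f"{label:>2} ({label - 5:>2} to {label + 4:>2}): {'*' * bins[label]}")
```

The single season of 85 goals is what puts a bar at the label 90.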

Careful
You have to make sure the scale makes sense, especially the Y axis.
One of the problems with a histogram with grouped data like this is that you lose some of the richness of the data, which is OK with a big data set, perhaps not here though.

Stem and Leaf Plot

0 | 1 6 7
1 | 7 9
2 | 8
3 | 5
4 | 3 4 5 8
5 | 0 4
6 | 9 9
7 | 0
8 | 5

This one is an ordered stem and leaf.
You interpret this like a histogram.
Easy to spot outliers
Preserves data
Easy to get the middle, or 50th percentile, which is 44 in this case
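An ordered stem and leaf plot is easy to build by hand, and also by machine. The sketch below is one possible way to do it in Python, using the tens digit as the stem and the units digit as the leaf. The names `stems` and `lines` are chosen for this example only.

```python
from collections import defaultdict

goals = [43, 48, 54, 70, 85, 45, 19, 44, 69, 17, 69, 50, 35, 6, 28, 1, 7]

# Sorting first makes the leaves come out in order within each stem
stems = defaultdict(list)
for g in sorted(goals):
    stems[g // 10].append(g % 10)   # stem = tens digit, leaf = units digit

lines = [
    f"{stem} | {' '.join(str(leaf) for leaf in stems[stem])}"
    for stem in range(min(stems), max(stems) + 1)
]
print("\n".join(lines))

# With 17 values, the middle one is the 9th in sorted order
median = sorted(goals)[len(goals) // 2]
print("median:", median)
```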

The Five Number Summary
You can get other stuff from a stem and leaf as well:
Median
First quartile (18 in our case)
Third quartile (61.5 here)
Quartiles are the 25th and 75th percentiles.
So, by count, they sit halfway between the minimum and the median, and between the median and the maximum.

You said there were five numbers...
Yeah, so there is also the minimum, 1
And the maximum, 85
These two, by the way, give you the range.
Now you take those five numbers and make what is called a box and whisker plot, or a boxplot.
Gives you an idea of the shape of the data

And here you go
[Figure: boxplot]
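Here is a minimal sketch of the five number summary for the goal totals. It assumes one common convention: each quartile is the median of the half of the data below or above the overall median, with the median itself left out when n is odd. Under that convention the lower half has 17 and 19 as its two middle values, so the first quartile comes out as 18, and the third quartile is 61.5. Other quartile conventions give slightly different values.

```python
from statistics import median

goals = sorted([43, 48, 54, 70, 85, 45, 19, 44, 69, 17, 69, 50, 35, 6, 28, 1, 7])

n = len(goals)
half = n // 2
lower = goals[:half]                               # values below the median
upper = goals[half + 1:] if n % 2 else goals[half:]  # values above the median

five = (goals[0], median(lower), median(goals), median(upper), goals[-1])
data_range = goals[-1] - goals[0]

print("min, Q1, median, Q3, max:", five)
print("range:", data_range)
```

These five numbers are exactly what a boxplot draws: the box spans Q1 to Q3, the line inside is the median, and the whiskers reach toward the minimum and maximum.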

And it continues
We talked about the central tendency of a distribution.
This is one of the three properties necessary to describe a distribution.
We can also talk about the shape: you know, that kurtosis stuff and all of that.

An Example
Consider:
1 5 9 20 30
11 12 13 14 15
Both have the same mean (13).
They both sum to 65; divide 65 by 5 and you get 13.

The same, but different
1 5 9 20 30
11 12 13 14 15
So, they both have the same mean, and both are roughly symmetrical.
How are they different?
Well, the one on the top is much more spread out.

Spread
Well, how could we measure spreadoutedness?
Well, the range is a start: 1-30 vs 11-15
Seems pretty crude
We could look at the IQD
Still pretty crude
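To make the comparison concrete, this short sketch computes the mean and the range for the two small data sets. The names `top` and `bottom` simply follow their position on the slide.

```python
from statistics import mean

top = [1, 5, 9, 20, 30]
bottom = [11, 12, 13, 14, 15]

for name, data in (("top", top), ("bottom", bottom)):
    spread = max(data) - min(data)   # the range: largest minus smallest value
    print(f"{name:<6} mean = {mean(data)}  range = {spread}")
```

Same mean, very different ranges, but the range only ever looks at two of the values, which is why it is a crude measure.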

We need something better
Something that is kind of like a mean, really
Like the average amount that the data are spread out
Well, why not do that?

Well, here's why not

$$\frac{\sum (x - \bar{x})}{n} = \frac{(1 - 13) + (5 - 13) + (9 - 13) + (20 - 13) + (30 - 13)}{5} = \frac{-12 + (-8) + (-4) + 7 + 17}{5} = \frac{0}{5} = 0$$

Hmm
They will ALWAYS sum to zero.
Makes sense when you think about it.
If the mean is the balancing point, there should be as much mass on one side as the other.
So how do we get rid of negatives?
Absolute value!

The Mean Absolute Deviation

$$\frac{\sum |x - \bar{x}|}{n} = \frac{|1 - 13| + |5 - 13| + |9 - 13| + |20 - 13| + |30 - 13|}{5} = \frac{12 + 8 + 4 + 7 + 17}{5} = \frac{48}{5} = 9.6$$
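The two calculations above can be checked with a few lines of Python. This is only a sketch, with variable names (`deviations`, `sum_dev`, `mad`) picked for the example.

```python
data = [1, 5, 9, 20, 30]

m = sum(data) / len(data)             # mean = 13
deviations = [x - m for x in data]    # -12, -8, -4, 7, 17

# Signed deviations from the mean cancel out
sum_dev = sum(deviations)

# Mean absolute deviation: average distance from the mean, ignoring sign
mad = sum(abs(d) for d in deviations) / len(data)

print(deviations)
print("sum of deviations:", sum_dev)
print("MAD:", mad)
```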

Cool!
Well, sometimes things you think are cool, well, they aren't.
Mullets, for example
Anyway, for our purposes the MAD is just not that useful.
It is, in the type of stats we will do, a dead end.
Too bad, as it has intuitive appeal.

There has to be another way
Well, of course there is, or we would end now.
OK, how else can we get rid of those nasty negatives?
Square the deviations (you know, $(-9)^2 = 81$ for example)

We are getting closer

$$\frac{\sum (x - \bar{x})^2}{n} = \frac{(1 - 13)^2 + (5 - 13)^2 + (9 - 13)^2 + (20 - 13)^2 + (30 - 13)^2}{5} = \frac{(-12)^2 + (-8)^2 + (-4)^2 + 7^2 + 17^2}{5} = \frac{144 + 64 + 16 + 49 + 289}{5} = \frac{562}{5} = 112.4$$

Hmmmm
112.4 seems like a mighty big number.
Well, it is in squared units, not in the original units.
What is the opposite of squaring something?
Square root: $\sqrt{112.4} = 10.6$
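As a quick check of the arithmetic so far, the sketch below squares the deviations, averages them with n on the bottom, and then takes the square root to get back to the original units. The names `var_n` and `sd_n` are just labels for this example.

```python
import math

data = [1, 5, 9, 20, 30]

m = sum(data) / len(data)
squared = [(x - m) ** 2 for x in data]   # 144, 64, 16, 49, 289

var_n = sum(squared) / len(data)   # average squared deviation, n on the bottom
sd_n = math.sqrt(var_n)            # back to the original units

print(squared, sum(squared))
print(var_n, round(sd_n, 2))
```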

There is a little problem here
The formula I have shown you so far has n on the bottom.
Yeah, I know, that just makes sense.
In fact, it is supposed to be n-1.
We want something that will be an unbiased estimator of the same quantity in the population.

Variance and standard deviation
The population parameters, the variance and the standard deviation, have N on the bottom.
The sample statistics used to estimate them have n-1.
If they had n, they would underestimate the population parameters.

Sample statistics

$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} \qquad s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$

So in our case

$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{\frac{(1 - 13)^2 + (5 - 13)^2 + (9 - 13)^2 + (20 - 13)^2 + (30 - 13)^2}{4}} = \sqrt{\frac{(-12)^2 + (-8)^2 + (-4)^2 + 7^2 + 17^2}{4}} = \sqrt{\frac{144 + 64 + 16 + 49 + 289}{4}} = \sqrt{140.5} = 11.85$$

For the Population

$$\sigma^2 = \frac{\sum (X - \mu)^2}{N} \qquad \sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}}$$
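Python's standard `statistics` module happens to offer both versions, which makes the n versus n-1 distinction easy to see. In this sketch `variance` and `stdev` divide by n-1 (the sample statistics), while `pvariance` and `pstdev` divide by N (the population formulas).

```python
from statistics import variance, stdev, pvariance, pstdev

data = [1, 5, 9, 20, 30]

s2 = variance(data)       # sample variance, n - 1 on the bottom
s = stdev(data)           # sample standard deviation
sigma2 = pvariance(data)  # population formula, N on the bottom
sigma = pstdev(data)

print("sample:     ", s2, round(s, 2))
print("population: ", sigma2, round(sigma, 2))
```

The n-1 version is always a little larger, which is what corrects the underestimate.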

How are the variance and sd affected by extreme scores?
1 5 9 20 30
s = 11.85
OK, let's throw in a new number, say 729:
1 5 9 20 30 729
Our new mean is 132.33
Our new variance is 85555.067
Our new standard deviation is 292.50
Well, the mean is affected by extreme scores, so of course so is the sd.
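The effect of the extreme score can be reproduced directly. This is a small sketch using the standard `statistics` module, with `data` and `with_outlier` as made-up names.

```python
from statistics import mean, variance, stdev

data = [1, 5, 9, 20, 30]
with_outlier = data + [729]   # one extreme score added

new_mean = mean(with_outlier)
new_var = variance(with_outlier)   # sample variance (n - 1)
new_sd = stdev(with_outlier)

print("before:", mean(data), variance(data), round(stdev(data), 2))
print("after: ", round(new_mean, 2), round(new_var, 3), round(new_sd, 2))
```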

How can we use this to our advantage?
The coefficient of variation: sd / mean
Katz et al. (1990)
Study: mean 69.6, sd 10.6
No study: mean 46.6, sd 6.8
One could conclude there is more variation with studying.
However, the CVs are .152 and .146 respectively.

A couple of key points
Remember, we want to learn about populations, not samples.
We estimate population parameters with sample statistics.
We want unbiased estimators of parameters.
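The coefficient of variation is just the sd divided by the mean, so the two values on the slide are easy to verify. In this sketch the `groups` dictionary holds the (mean, sd) pairs quoted from the slide, and the key names are only labels.

```python
# (mean, sd) for each condition, as quoted on the slide
groups = {"study": (69.6, 10.6), "no study": (46.6, 6.8)}

# Coefficient of variation: spread relative to the size of the mean
cv = {name: sd / mean for name, (mean, sd) in groups.items()}

for name, value in cv.items():
    print(f"{name:<9} cv = {value:.3f}")
```

Once the spread is scaled by the mean, the two conditions look about equally variable.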

Transformations

$$E(x + k) = \bar{x} + k \qquad \mathrm{var}(x + k) = s_x^2$$

$$E(xk) = \bar{x}k \qquad \mathrm{var}(xk) = k^2 s_x^2$$
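These rules can be checked numerically on the small data set used earlier. The sketch below uses an arbitrary constant k = 3, chosen only for illustration: adding k shifts the mean and leaves the variance alone, while multiplying by k scales the mean by k and the variance by k squared.

```python
from statistics import mean, variance

x = [1, 5, 9, 20, 30]
k = 3   # arbitrary constant, chosen only for this illustration

shifted = [v + k for v in x]   # add a constant to every score
scaled = [v * k for v in x]    # multiply every score by a constant

print("original:", mean(x), variance(x))
print("x + k:   ", mean(shifted), variance(shifted))
print("x * k:   ", mean(scaled), variance(scaled))
```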