Graphs, and measures of central tendency and spread (9/13/2004)


Graphs, and measures of central tendency and spread
9.07, 9/13/2004

Histogram
If the variable is discrete or categorical, the bars don't touch.
If it is continuous, the bars can touch, and should if there are lots of bins.
Sum of bin heights = N.

Alternative: density (or "relative frequency") plot
Each bin height is the fraction of the total samples that fall in that bin.
Sum of bin heights = 1.
Alternative for lots of bins and a continuous variable: draw the histogram as a frequency polygon.
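The difference between a count histogram and a relative frequency plot can be sketched in a few lines. This is only an illustration: the `heights` data, the bin width, and the function names are hypothetical and do not come from the lecture.

```python
from collections import Counter

def bin_counts(data, bin_width):
    """Count values per bin; each bin is labeled by its left edge."""
    counts = Counter((x // bin_width) * bin_width for x in data)
    return dict(sorted(counts.items()))

def relative_frequencies(counts):
    """Rescale counts so that the bin heights sum to 1."""
    total = sum(counts.values())
    return {left_edge: c / total for left_edge, c in counts.items()}

# Hypothetical heights in inches (not from the lecture).
heights = [61, 63, 64, 64, 65, 66, 66, 67, 68, 70, 71, 73]

counts = bin_counts(heights, bin_width=5)
fractions = relative_frequencies(counts)

print(counts)                    # bin heights sum to N = len(heights)
print(sum(fractions.values()))   # bin heights sum to 1 (up to rounding)
```

Only the vertical scale changes between the two plots; the shape of the distribution is the same.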

Don't draw a frequency polygon for categorical variables.

Stem-and-Leaf Plots
A quick way of examining the distribution of data.
Like histograms on their sides, but with more information.

Data: 7 5 5 10 11 10 10 15 15 14 20 25 20 35 35 40

Stem-and-Leaf Plot:
Stem | Leaves (as entered) | Leaves (sorted)
0    | 755                 | 557
1    | 0100554             | 0001455
2    | 050                 | 005
3    | 55                  | 55
4    | 0                   | 0

We can plot two histograms in the same plot to compare them, e.g. the distribution of heights for women (solid) vs. men (dashed).

Summarizing the distribution
You might not want to look at the whole distribution in that last case. You get lots of information about the shape of the distribution, but it's kind of visually noisy. Summarize.
Central tendency intuition: capture the impression that the heights for the men tend to be higher than the heights for the women.
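A stem-and-leaf plot like the one above can be built mechanically. The sketch below assumes two-digit, non-negative integers and reads the slide's data as 7, 5, 5, 10, 11, 10, 10, 15, 15, 14, 20, 25, 20, 35, 35, 40 (the entries for stem 2 are an assumed reading of the transcript). The function name is hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(data):
    """Return the sorted stem-and-leaf rows, using the tens digit as the stem."""
    stems = defaultdict(list)
    for value in sorted(data):
        stems[value // 10].append(value % 10)
    rows = []
    # Include empty stems so that gaps in the data remain visible.
    for stem in range(min(stems), max(stems) + 1):
        leaves = "".join(str(leaf) for leaf in stems.get(stem, []))
        rows.append(f"{stem} | {leaves}")
    return rows

data = [7, 5, 5, 10, 11, 10, 10, 15, 15, 14, 20, 25, 20, 35, 35, 40]
rows = stem_and_leaf(data)
for row in rows:
    print(row)
```

Unlike a histogram, each row still shows the individual values, which is the "more information" the slide refers to.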

Measures of central tendency

Mode
Where's the peak of the histogram?
Can be used for any data, but typically only used for categorical data, where the other measures we'll discuss don't apply.
Data can have one or more modes: unimodal vs. bimodal vs. trimodal. Here a mode is the highest point locally.

Average
Notation: N observations, $x = \{x_1, x_2, x_3, \ldots, x_N\}$

$$\bar{x} = \frac{x_1 + x_2 + x_3 + \cdots + x_N}{N} = \frac{1}{N}\sum_{i=1}^{N} x_i$$

Also known as the mean. But note this is the sample mean. Soon we will talk about the population mean. Don't get confused!
Only makes sense for interval or ratio data, but you will often see it used for rank-order or rating data, too.
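As a small illustration of the two measures so far, the sketch below computes the sample mean and the mode(s) of a made-up data set using Python's standard `statistics` module. The `scores` list is hypothetical. Note that `multimode` returns the values tied for the highest count, which is a narrower notion than the slide's "highest point locally".

```python
from statistics import mean, multimode

# Hypothetical scores (not from the lecture).
scores = [2, 3, 3, 4, 8, 8, 7]

sample_mean = mean(scores)   # (x_1 + ... + x_N) / N
modes = multimode(scores)    # all values tied for the highest count

print(sample_mean)
print(modes)
```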

[Figure: Mean number of students studying Brain and Cognitive Sciences, per university, by year (1940 to 1990).]

Highly skewed data
Negative skew. The average is not at the mode.
Note: < 50% of the data is below average.
Why does the average shift to the right?
[Figure labels: "mode"; "average somewhere around here".]

Positive or Negative Skew?
[Figure: % chance by week beginning (Earlier, Nov 2, Nov 9, Nov 16, Nov 23, Nov 30, Later).]

Average as center of mass
The average gives the point where a histogram (if made of blocks) would balance.

What an outlier does to the mean
An outlier on the right shifts the balance point to the right, and so shifts the average to the right.

Another measure: the median
Median = value with ½ of the data points to the left, ½ to the right.
E.g. 1, 2, 4.7, 6, 8, 9.2, 10: median = 6
E.g. 8, 15.2, 18, 19.2, 21.3, 25: median = (18 + 19.2)/2 = 18.6

The median is robust to outliers
# of hours of TV watched per week: 3, 25, 27, 27, 140
Mean = 44.4! Median = 27.

When to use which measure?
Mode
  Variables are categorical.
  You want a quick and easy measure for ordinal/quantitative data.
  You want to report the most common score.
Median
  Variables are measured at the ordinal level.
  Variables measured at the interval-ratio level have highly skewed distributions or lots of outliers.
  You want to report the central score. The median lies at the exact center of the distribution.

When to use which measure?
Mean
  Quantitative variables (except for highly skewed distributions). Income distributions are highly skewed: be wary of anyone talking about the "average tax cut". They should be using the median.
  You want to report the typical score (except for highly skewed distributions). The mean is the "fulcrum" that exactly balances all of the scores.
  You anticipate additional statistical analysis.

The notion of spread
Plots can look pretty different even when they have the same central tendency. We also need to capture a notion of spread.

Measures of spread
  Range
  Interquartile range
  Standard deviation

Range
Biggest value observed minus smallest value observed.
Pretty sensitive to outliers in the tails of the distribution.
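The sensitivity of the mean and the range to a single outlier, and the robustness of the median, can be checked directly. This sketch reads the TV-hours data as 3, 25, 27, 27, 140, which is an assumed reading of the transcript chosen because it agrees with the stated mean of 44.4. The `summarize` helper is hypothetical.

```python
from statistics import mean, median

def summarize(data):
    """Mean, median, and range (biggest minus smallest) of a data set."""
    return {
        "mean": mean(data),
        "median": median(data),
        "range": max(data) - min(data),
    }

tv_hours = [3, 25, 27, 27, 140]   # 140 is the outlier
without_outlier = tv_hours[:-1]   # same data with the outlier dropped

with_summary = summarize(tv_hours)
without_summary = summarize(without_outlier)

print(with_summary)
print(without_summary)
```

Dropping the one extreme value moves the mean and the range a great deal, while the median barely changes.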

Interquartile range
Divide the data into 4 groups and see how far apart the extreme groups are.
Q1 = median of the lower half of the data; Q3 = median of the upper half.
Q3 - Q1 = IQR.

Box-and-whiskers plot
Draw box ends at Q1 and Q3.
Draw the median through the box.
Plot as points the outliers = any points more than 1.5 IQR from an end of the box.
Plot whiskers out to the farthest points that are not outliers.
[Figure: box-and-whiskers plots of Car Prices ($K) by category: Sub-Compact, Compact, MiniVan, Truck, Luxury Sedan; Economy, Standard, Deluxe, Ultra.]

Standard deviation
Measures spread about the average.
Desired: how far are data points from the average, on average? But

$$\frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x}) = \frac{1}{N}\sum_{i=1}^{N}x_i - \bar{x} = \bar{x} - \bar{x} = 0$$

Clearly, this is not going to be useful.
Try squaring the difference before taking the average:

$$\text{Sample variance} = s^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2 = \frac{1}{N}\sum_{i=1}^{N}x_i^2 - \bar{x}^2$$
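The box-and-whiskers recipe above (quartiles, IQR, the 1.5 IQR outlier rule, whiskers) can be sketched as follows. The prices are hypothetical, not the lecture's car data, and the quartile convention (medians of the lower and upper halves, leaving out the middle value when N is odd) is one common choice and an assumption here. Other conventions give slightly different quartiles.

```python
from statistics import median

def box_plot_summary(data):
    """Quartiles, IQR, outliers, and whisker ends for a box-and-whiskers plot."""
    xs = sorted(data)
    half = len(xs) // 2
    lower, upper = xs[:half], xs[len(xs) - half:]   # halves, middle value excluded if N is odd
    q1, q3 = median(lower), median(upper)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [x for x in xs if low_fence <= x <= high_fence]
    outliers = [x for x in xs if x < low_fence or x > high_fence]
    return {
        "q1": q1,
        "median": median(xs),
        "q3": q3,
        "iqr": iqr,
        "whiskers": (min(inside), max(inside)),   # farthest points that are not outliers
        "outliers": outliers,
    }

# Hypothetical prices in $K (not from the lecture).
prices = [12, 14, 15, 16, 18, 19, 21, 22, 24, 60]
summary = box_plot_summary(prices)
print(summary)
```

The single very expensive car is flagged as an outlier and plotted as a point, so it does not stretch the whisker.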

But, this doesn't quite do what we want
Wrong units: if we were talking about height in inches, spread is now in inches squared.
If we double the distance from all points to the mean, we'd like the spread to double. Variance will go up by a factor of 4.

Sample standard deviation

$$s = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2}$$

This has the right units and the right behavior.
Sometimes you'll see N-1 in place of N. We'll talk about this later.

Example
Data: 3, 5, 7, 7, 13
Average = 7
Deviations: -4, -2, 0, 0, 6
Root mean square deviation: s = sqrt((16+4+0+0+36)/5) = 3.35
Or: s = sqrt(mean(9, 25, 49, 49, 169) - 7²) = sqrt(60.2 - 49) = 3.35

Effect of data transformations on mean and standard deviation
Effect of adding a constant to each datum:
  3, 5, 5, 7, 9      -->  6, 8, 8, 10, 12   (+3)
  mean = 5.8         -->  mean = 8.8        (+3)
  s = 2.04           -->  s = 2.04          (same)
Effect of multiplying by a constant:
  18, 24, 12, 6      -->  1.5, 2, 1, 0.5    (÷12)
  mean = 15          -->  mean = 1.25       (÷12)
  s = 6.71           -->  s = 0.56          (÷12)
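The worked example and the two transformation effects can be reproduced with a short sketch. It uses the divide-by-N definition of s given above. The function names are hypothetical.

```python
from math import sqrt

def mean(xs):
    return sum(xs) / len(xs)

def sd(xs):
    """Root mean square deviation from the mean (divides by N)."""
    m = mean(xs)
    return sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

def sd_shortcut(xs):
    """Same quantity via mean of squares minus square of the mean."""
    return sqrt(mean([x * x for x in xs]) - mean(xs) ** 2)

example = [3, 5, 7, 7, 13]
print(round(sd(example), 2), round(sd_shortcut(example), 2))

# Adding a constant shifts the mean and leaves s unchanged.
a = [3, 5, 5, 7, 9]
shifted = [x + 3 for x in a]
print(mean(a), mean(shifted), round(sd(a), 2), round(sd(shifted), 2))

# Dividing by 12 divides both the mean and s by 12.
b = [18, 24, 12, 6]
scaled = [x / 12 for x in b]
print(mean(b), mean(scaled), round(sd(b), 2), round(sd(scaled), 2))
```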

Effect of transformation on mean and standard deviation
Adding a constant shifts the mean by that amount, and leaves the standard deviation unchanged.
Multiplying by a constant multiplies both the mean and standard deviation by that constant.

The z-score is robust to these changes

$$z_i = \frac{x_i - \bar{x}}{s}$$

How many standard deviations above the mean is score $x_i$?
  x = 18, 24, 12, 6        z = 0.4, 1.3, -0.4, -1.3
  x = 1.5, 2, 1, 0.5       z = 0.4, 1.3, -0.4, -1.3
We'll use this transformation to z a lot in this class.

Example use of z-scores
How do you combine scores from different people?
Mary, Jeff, and Raul see a movie. Their ratings of the movie, on a scale from 1 to 10:
  Mary = 7, Jeff = 9, Raul = 5
Average score = 7?
It's more meaningful to see how those scores compare to how they typically rate movies.

Recent ratings from Mary, Jeff, & Raul:
  Mary: 7, 8, 7, 9, 8, 9      (7 is pretty low)
  Jeff: 8, 9, 9, 10, 8, 10    (9 is average)
  Raul: 1, 2, 2, 4, 5         (5 is pretty good)
z-scores:
  Mary: m = 8, s = 0.8      z(7) = -1.2
  Jeff: m = 9               z(9) = 0
  Raul: m = 2.8, s = 1.5    z(5) = 1.5
mean(z) = 0.1 = # of standard deviations above the mean
It's probably your average movie, nothing outstanding.
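The movie-rating example can be recomputed as a check. This sketch uses the divide-by-N standard deviation and reads Raul's recent ratings as 1, 2, 2, 4, 5, an assumed reading of the transcript that agrees with the stated m = 2.8 and s = 1.5. The helper names are hypothetical.

```python
from math import sqrt

def mean(xs):
    return sum(xs) / len(xs)

def sd(xs):
    m = mean(xs)
    return sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

def z_score(x, reference):
    """How many standard deviations x lies above the mean of `reference`."""
    return (x - mean(reference)) / sd(reference)

recent = {
    "Mary": [7, 8, 7, 9, 8, 9],
    "Jeff": [8, 9, 9, 10, 8, 10],
    "Raul": [1, 2, 2, 4, 5],
}
movie = {"Mary": 7, "Jeff": 9, "Raul": 5}

z = {name: z_score(movie[name], recent[name]) for name in movie}
combined = mean(list(z.values()))

for name, value in z.items():
    print(name, round(value, 1))
print("mean(z) =", round(combined, 1))
```

Each rating is measured against that person's own habits before the three are averaged, so a harsh rater's 5 can count for more than a generous rater's 9.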

Describing Distributions
We could use more parameters to describe the distribution, e.g. skew.
For a normal distribution, only two parameters are necessary to completely specify the distribution:
  Mean (average)
  Standard deviation

Normal Distribution
One mode, symmetric about that mode.

$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x-\mu)^2 / (2\sigma^2)}$$

where $\mu$ is the mean and $\sigma$ is the standard deviation.
[Figure: normal density curve; vertical axis: Probability Density.]

Note: About your book!
MANY of the pictures of normal distributions in your book are not normal!!! They seem to be splines: someone's attempt to draw a normal distribution using a graphics program. Normal distributions really look like the picture on the previous slide.

Normal distribution
If x is distributed according to a normal distribution with mean µ and standard deviation σ, we write:
  x ~ N(µ, σ²) or x ~ N(µ, σ)
Notation:
  sample (Roman): $\bar{x}$, s or m, s
  population (Greek): µ, σ
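The normal density can be written as a short function and checked numerically. This is a minimal sketch: the function name and the crude Riemann-sum area check are illustrative choices, not part of the lecture.

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

# The curve is symmetric about the mean and peaks there.
peak = normal_pdf(0.0)
left, right = normal_pdf(-1.3), normal_pdf(1.3)

# Crude Riemann sum over [-8, 8]: the area under the density is 1.
step = 0.001
area = sum(normal_pdf(-8 + i * step) * step for i in range(16000))

print(round(peak, 4), left == right, round(area, 6))
```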

Some Terminology
Density curve: used to describe a population.
Area under a density curve = 1.
"Density" because the actual probability of any single value is zero. There are an infinite number of events, and the probability of any one of them is 0.

Some things that might be normally distributed
  Heights and weights
  Arrival times to class
  Exam scores
  Gas mileage
  Sizes of anything that grows
  Number of raisins in a box of Raisin Bran
Lots of phenomena have approximately this distribution. Later we'll talk about one reason why.

Mean
Describes the center of a normal distribution.
Because the normal distribution is symmetrical, the median and mean are equal.
Changing the mean just changes the position of the distribution, not the shape.
[Figure: "Changing the mean"; axes: Value vs. Probability Density.]

Standard Deviation
Describes the width (spread) of the distribution.
Can be read directly from the density curve at the inflection point (the point where the curvature changes).
Changing the standard deviation changes the width and height of a normal curve, but not the position.
[Figure: normal curves with different standard deviations; axes: Value vs. Probability Density.]

Using the Standard Deviation
The 68-95-99.7 rule for any normal distribution:
  68% of the population fall within one standard deviation of the mean
  95% of the population fall within two standard deviations of the mean
  99.7% of the population fall within three standard deviations of the mean

+/-1 Standard Deviation
[Figure: 68% of the area lies in a band 2 standard deviations wide, centered on the mean; axes: Value vs. Probability Density.]
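The 68-95-99.7 rule can be checked against the exact normal areas. For a normal distribution, the fraction within k standard deviations of the mean is erf(k/√2), using the standard error function. The function name below is hypothetical.

```python
from math import erf, sqrt

def fraction_within(k):
    """Fraction of a normal population within k standard deviations of the mean."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} s.d.: {fraction_within(k):.4f}")
```

The rule's percentages are rounded versions of these values, and they do not depend on the particular mean or standard deviation.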

+/-2 Standard Deviations
[Figure: 95% of the area lies in a band 4 standard deviations wide, centered on the mean; axes: Value vs. Probability Density.]

+/-3 Standard Deviations
[Figure: 99.7% of the area lies in a band 6 standard deviations wide, centered on the mean; axes: Value vs. Probability Density.]

The normal curve approximation for data
For many types of data, the normal curve is a good approximation to the distribution of the data.
Use this approximation to generalize from the sample to the larger population.
First:
  Estimate the population mean, µ, from the sample mean, $\bar{x}$.
  Estimate the population std. deviation, σ, from the sample standard deviation, s.

Example question you can ask
Mean height for this class is 65.7 inches. Std. dev. = 3.1 inches.
Assume the broader population we're interested in is MIT students in general.
What's the probability that a student will be taller than 68.8 inches?

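Using what we know about normal distributions
Well, 68.8 is one std. dev. above the mean height (65.7 + 3.1 = 68.8).
What is the area under the part of the curve above 68.8?
68% of the area lies within one σ of the mean, so 32% lies outside, and half of that (16%) is in the upper tail.
p(> 68.8") = 0.16

One little detail
s is not a good estimate for σ for small sample sizes. It is biased, i.e. it is consistently too small.
Why: we compute $\bar{x}$ and s at the same time. $\bar{x}$ follows the samples around, so the deviations from $\bar{x}$ tend to be smaller than the deviations from the population mean µ, and s comes out smaller than σ.
Fix: when estimating σ, use

$$s = \sqrt{\frac{\sum_{i=1}^{N}(x_i - \bar{x})^2}{N - 1}}$$

Makes it a little bigger.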
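Both points above can be checked with a short sketch. The first part computes the upper-tail area for the height example with `statistics.NormalDist` (Python 3.8+). The second part uses a tiny hypothetical population and enumerates every possible sample of size 2, drawn with replacement, to show that dividing by N gives a spread estimate that is too small on average. The check is done on variances, where the N-1 correction is exact; the population and the names are assumptions made for illustration.

```python
from itertools import product
from statistics import NormalDist

# Part 1: probability that a student is taller than 68.8 inches.
p_taller = 1 - NormalDist(mu=65.7, sigma=3.1).cdf(68.8)
print(round(p_taller, 2))

# Part 2: bias of the divide-by-N estimate.
def variance(xs, correction):
    """Mean squared deviation from the sample mean, dividing by N - correction."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - correction)

population = [1, 2, 3, 4]                       # hypothetical population
population_variance = variance(population, 0)   # true sigma squared

samples = list(product(population, repeat=2))   # all 16 samples of size 2
avg_divide_by_n = sum(variance(s, 0) for s in samples) / len(samples)
avg_divide_by_n_minus_1 = sum(variance(s, 1) for s in samples) / len(samples)

print(population_variance, avg_divide_by_n, avg_divide_by_n_minus_1)
```

Averaged over all samples, the divide-by-N estimate is half the true variance here, while the N-1 version matches it. The gap shrinks as the sample size grows.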