2: Frequency Distributions

Similar documents
Summarizing and Displaying Categorical Data

3: Summary Statistics

Exploratory data analysis (Chapter 2) Fall 2011

Lecture 1: Review and Exploratory Data Analysis (EDA)

Statistics Chapter 2

The Big Picture. Describing Data: Categorical and Quantitative Variables Population. Descriptive Statistics. Community Coalitions (n = 175)

Data exploration with Microsoft Excel: univariate analysis

a. mean b. interquartile range c. range d. median

The right edge of the box is the third quartile, Q 3, which is the median of the data values above the median. Maximum Median

Chapter 3. The Normal Distribution

Exploratory Data Analysis

Module 2: Introduction to Quantitative Data Analysis

Diagrams and Graphs of Statistical Data

First Midterm Exam (MATH1070 Spring 2012)

Variables. Exploratory Data Analysis

AP * Statistics Review. Descriptive Statistics

BNG 202 Biomechanics Lab. Descriptive statistics and probability distributions I

Descriptive statistics Statistical inference statistical inference, statistical induction and inferential statistics

STAB22 section 1.1. total = 88(200/100) + 85(200/100) + 77(300/100) + 90(200/100) + 80(100/100) = = 837,

Describing, Exploring, and Comparing Data

AMS 7L LAB #2 Spring, Exploratory Data Analysis

Statistics Revision Sheet Question 6 of Paper 2

MBA 611 STATISTICS AND QUANTITATIVE METHODS

Data Exploration Data Visualization

Descriptive Statistics

Week 1. Exploratory Data Analysis


Intro to Statistics 8 Curriculum

Sta 309 (Statistics And Probability for Engineers)

STATS8: Introduction to Biostatistics. Data Exploration. Babak Shahbaba Department of Statistics, UCI

Lecture 2. Summarizing the Sample

Chapter 1: Looking at Data Section 1.1: Displaying Distributions with Graphs

1.3 Measuring Center & Spread, The Five Number Summary & Boxplots. Describing Quantitative Data with Numbers

EXPLORING SPATIAL PATTERNS IN YOUR DATA

Center: Finding the Median. Median. Spread: Home on the Range. Center: Finding the Median (cont.)

Descriptive Statistics

Activity 4 Determining Mean and Median of a Frequency Distribution Table

Practice#1(chapter1,2) Name

Descriptive statistics parameters: Measures of centrality

How To Write A Data Analysis

Density Curve. A density curve is the graph of a continuous probability distribution. It must satisfy the following properties:

A Picture Really Is Worth a Thousand Words

Using SPSS, Chapter 2: Descriptive Statistics

DESCRIPTIVE STATISTICS AND EXPLORATORY DATA ANALYSIS

AP Statistics Solutions to Packet 2

Chapter 1: Exploring Data

2. Filling Data Gaps, Data validation & Descriptive Statistics

Exercise 1.12 (Pg )

Descriptive Statistics. Purpose of descriptive statistics Frequency distributions Measures of central tendency Measures of dispersion

Exploratory Data Analysis. Psychology 3256

MATH 4470/5470 EXPLORATORY DATA ANALYSIS ONLINE COURSE SYLLABUS

Introduction to Environmental Statistics. The Big Picture. Populations and Samples. Sample Data. Examples of sample data

Activity 3.7 Statistical Analysis with Excel

seven Statistical Analysis with Excel chapter OVERVIEW CHAPTER

Exploratory Spatial Data Analysis

4.1 Exploratory Analysis: Once the data is collected and entered, the first question is: "What do the data look like?"

Lesson 4 Measures of Central Tendency

Tutorial 3: Graphics and Exploratory Data Analysis in R Jason Pienaar and Tom Miller

1) Write the following as an algebraic expression using x as the variable: Triple a number subtracted from the number

Descriptive Statistics and Exploratory Data Analysis

Lecture 2: Descriptive Statistics and Exploratory Data Analysis

TEACHER NOTES MATH NSPIRED

MEASURES OF LOCATION AND SPREAD

Classify the data as either discrete or continuous. 2) An athlete runs 100 meters in 10.5 seconds. 2) A) Discrete B) Continuous

Appendix 2.1 Tabular and Graphical Methods Using Excel

Data! Data! Data! I can t make bricks without clay. Sherlock Holmes, The Copper Beeches, 1892.

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Slides by Tan, Steinbach, Kumar adapted by Michael Hahsler

determining relationships among the explanatory variables, and

Iris Sample Data Set. Basic Visualization Techniques: Charts, Graphs and Maps. Summary Statistics. Frequency and Mode

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining

Assignment #03: Time Management with Excel

sample median Sample quartiles sample deciles sample quantiles sample percentiles Exercise 1 five number summary # Create and view a sorted

Geostatistics Exploratory Analysis

Dongfeng Li. Autumn 2010

Big Ideas in Mathematics

HISTOGRAMS, CUMULATIVE FREQUENCY AND BOX PLOTS

Basics of Statistics

CALCULATIONS & STATISTICS

Introduction to Statistics for Psychology. Quantitative Methods for Human Sciences

consider the number of math classes taken by math 150 students. how can we represent the results in one number?

Measures of Central Tendency and Variability: Summarizing your Data for Others

Section 3: Examining Center, Spread, and Shape with Box Plots

MATH 103/GRACEY PRACTICE EXAM/CHAPTERS 2-3. MULTIPLE CHOICE. Choose the one alternative that best completes the statement or answers the question.

Visualizing Data. Contents. 1 Visualizing Data. Anthony Tanbakuchi Department of Mathematics Pima Community College. Introductory Statistics Lectures

Skewness and Kurtosis in Function of Selection of Network Traffic Distribution

Data exploration with Microsoft Excel: analysing more than one variable

Common Tools for Displaying and Communicating Data for Process Improvement

Chapter 7 Section 1 Homework Set A

Section 1.1 Exercises (Solutions)

A Correlation of. to the. South Carolina Data Analysis and Probability Standards

Bernd Klaus, some input from Wolfgang Huber, EMBL

Summary of Formulas and Concepts. Descriptive Statistics (Ch. 1-4)

Chapter 4 Displaying Quantitative Data

SECTION 2-1: OVERVIEW SECTION 2-2: FREQUENCY DISTRIBUTIONS

International College of Economics and Finance Syllabus Probability Theory and Introductory Statistics

COM CO P 5318 Da t Da a t Explora Explor t a ion and Analysis y Chapte Chapt r e 3

Northumberland Knowledge

R and Rcmdr : Basic Functions for Managing Data

Topic 9 ~ Measures of Spread

Data Analysis Tools. Tools for Summarizing Data

Transcription:

2: Frequency Distributions Stem-and-Leaf Plots (Stemplots) The stem-and-leaf plot (stemplot) is an excellent way to begin an analysis. Consider this small data set: 218 426 53 116 309 504 281 270 246 523 To construct a stemplot, start by drawing the stem. Stem-values represent either the first or first-two significant digits of each value. As a rule-of-thumb, we want between 3 and 15 stem-values on the stem. You are forming a number line, so stem-values must be evenly spaced and no stem-value can be skipped. Values for our data range from 53 to 523, so a reasonable first approximation for the stem is: 0 1 2 3 4 5 A stem-multiplier is included to allow the reader to decipher the magnitude of values. The stemmultiplier of on this stem shows that a stem value of 2 represents about two hundred (and not two or twenty, etc.). Values between 200 and 299 will be stored next to the stem bin of 2. We plot the subsequent significant digit of each value, so we plot the tens place, truncating remaining significant digits, if any.* For example, 218 is plotted as: 0 1 2 1 3 4 5 The remaining leaves are plotted in rank order: 0 5 1 1 2 1478 3 0 4 2 5 02 Note the value of 53 in the above plot is shown as 0 5 (0 in the hundreds-place and 5 in the tens-place). * These rules differ from the simplified rules taught in California public school and are instead based on the original intention and intent of John W. Tukey in the groundbreaking book Exploratory Data Analysis (1977). Page 2.1 (freq.docx; last update 2/7/16)

Shape, location, and spread Shape I m going to rotate the plot 90 degrees to display the distribution in a more familiar way. 8 7 4 2 5 1 1 0 2 0 0 1 2 3 4 5 (x100) This is now just a histogram. Note that the batch of numbers forms a distribution with a shape, location, and spread. The shape of a distribution is seen as a skyline silhouette : X X X X X X X X X X 0 1 2 3 4 5 (x100) Describe the shape narratively. While it is difficult to make reliable statements about the shape of a distribution when the data set is small, you can still get a general impression of whether (A) a mound is present, (B) data are symmetrical, and (C) if there are any data that separate from the rest of the distribution (i.e., outliers). The current stemplot is mound-shaped and is [almost] symmetrical. There are no clear outliers. Note: (1) When the data set is this small, keep a soft focus when describing shape. If a shift in just a few data points can change your impression of the shape, avoid overstatements. (2) Identification of outliers depends on context and can be subjective. There is no uniform definition of an outlier. See http://www.tufts.edu/~gdallal/out.htm for further remarks (optional for now). Location We will simply summarize the central location of a distribution by its median. The median is the value that is greater than or equal to half of the values in the data set. It s easier to see the median if we stretchout the data in rank-order: 053 116 218 246 270 281 309 426 504 523 ^ Median The median will have a depth of (n + 1) / 2. For the current data, n = 10 and median has a depth of (10 + 1) /2 = 5.5. Count in from either the top or bottom of the ordered array to the depth of the median. When n is even, the median falls between two values, as it is here. Under these circumstances, interpolate the median as the average of the two adjacent values. In this case, the median = average(270, 281) = 275.5. Page 2.2 (freq.docx; last update 2/7/16)

Spread The spread of the distribution is its dispersion around its center: 8 7 4 2 5 1 1 0 2 0 0 1 2 3 4 5 (x100) <---------> Spread The data above spread from around 50 to 520. Do not be too concerned about the loss of the 3 rd significant digit. Most of the information in is in the first two significant digits and, at this point in time, we are just exploring the data. The technical term for spread is variability. Second stemplot example The next illustration shows how to draw a stemplot for data that might not immediately lend itself to plotting. Consider: 1.47 2.06 2.36 3.43 3.74 3.78 3.94 4.42 Since data spread from about 1.4 to 4.4, we use a stem-multiplier of and draw the stem: 1 2 3 4 We truncate extra digits before plotting, so that 1.47 is truncated to 1.4, 2.06 is truncated to 2.0, and so on. The decimal point should not be displayed because the next digit is automatically in the tenths place. 1 4 2 03 3 4779 4 4 Narrative interpretation: This stemplot is [almost] symmetrical, mound-shaped, and has no apparent outliers. The median has a location of (8+1)/2 = 4.5 and is between 3.4 and 3.7 (that s good enough for now). Data spread from around 1.4 to 4.4. Page 2.3 (freq.docx; last update 2/7/16)

Third stemplot example (split stem-values) The following pollution levels were found in water samples {1.4, 1.7, 1.8, 1.9, 2.2, 2.2, 2.3, 2.4, 2.6, 2.6, 2.7, 2.8, 2.9, 3.0, 3.0, 3.0, 3.1, 3.2, 3.3, 3.4, 3.4, 3.5, 3.6, 3.7, 3.8}. Here is our first cut at the stemplot: 1 4789 2 223466789 3 000123445678 The above stemplot is too squashed to display the distribution s shape. Let s try splitting the stemvalues so that values between 1.0 and 1.4 are listed on the first stem-value of 1 and values between 1.5 and 1.9 are listed on the second stem-value of 1: 1 4 1 789 2 2234 2 66789 3 00012344 3 5678 Narrative interpretation: This distribution has a tail toward its lower values. This left tail is called a left or negative skew. The median has a depth of (25+1) / 2 = 13. Count the leaves from either end of the stemplot and you will see that the median is approximately equal to 2.9. Data spread from 1.4 to 3.8. Fourth stemplot example (quintuple split) Sometimes splitting stem-values in two still leaves an unclear picture of shape. Consider these 9 values: 17 35 40 49 55 70 74 81 84 You may be tempted to start the process as follows: 1 7 2 3 5 4 09 5 5 6 7 04 8 14 0 This is too spread out to draw out the shape, so you try a stem-multiplier with split the stem-values. We ll put values between 0 and 49 on the first 0 and values between 50 and 99 on the second 0: 0 1344 0 57788 Page 2.4 (freq.docx; last update 2/7/16)

Now try a quintuple-split. This means you will divide the range 0 to 99 into five class intervals, each 20 units in width. The first interval will contain values between 0 and 19, the second interval will contain values between 20 and 39, and so on. The stem will store the hundreds place and the leaves represent the tens place. The ones-place will be truncated. Thus: 0 1 0 3 0 445 0 77 0 88 This distribution is relatively symmetrical with no outliers. Frequency Tables Frequency table example #1 A more traditional way to explore a distribution is in tabular form. It s easier to show you how to construct a frequency table than to provide formulas. In addition, it is best not to be mechanical in our approach toward statistics. If you ve already created a stemplot, the hard work is behind you. Here s the stemplot for the first illustrative data set in this chapter: 0 5 1 1 2 1478 3 0 4 2 5 02 The stemplot has already sorted data into class intervals. The class intervals are 100 units in width. The first class interval contains values from 0 to 99, the second class interval contains values of 100 to 199, and so on. Count the number of observations that fall into each class interval. This is the frequency. Also determine the relative frequency of each count. The relative frequency is just the proportion. It doesn t matter if you report the relative frequency as a proportion or percentage, as long as its labeled clearly. Class interval Frequency Relative frequency 0 99 1 10% 100 199 1 10% 200 299 4 40% 300 399 1 10% 400 499 1 10% 500-599 2 20% Total 10 100% An additional concept worth noting is called cumulative frequency. The cumulative frequency is the frequency up to and including the current interval. This can be reported as a count or as a proportion of the total (cumulative relative frequency), as shown in the table on the next page. Page 2.5 (freq.docx; last update 2/7/16)

Cumulative Frequency Cumulative relative frequency < = 99 1 10% <= 199 2 20% <= 299 6 60% < = 399 7 70% <= 499 8 80% < = 599 10 100% Frequency table example #2 (large dataset) A frequency table of AGES (years) from a childhood health survey is: AGE Freq Rel.Freq Cum.Freq. ------+----------- 3 2 0.3% 0.3% 4 9 1.4% 1.7% 5 28 4.3% 6.0% 6 37 5.7% 11.6% 7 54 8.3% 19.9% 8 85 13.0% 32.9% 9 94 14.4% 47.2% 10 81 12.4% 59.6% 11 90 13.8% 73.4% 12 57 8.7% 82.1% 13 43 6.6% 88.7% 14 25 3.8% 92.5% 15 19 2.9% 95.4% 16 13 2.0% 97.4% 17 8 1.2% 98.6% 18 6 0.9% 99.5% 19 3 0.5% 100.0% ------+----------- Total 654 100.0% Narrative description and interpretation. You might be asked to make narrative interpretations based on data in a frequency table. The proper interpretation will depend on context, so it must be reasonable. There are no mechanical rules. For the above data, we see that ages range from 3 to 19. The most common in this data set is 9, representing 14.4% of the kids. The most common value (9, in this instance), is called the mode. Frequency table example #3 (Non-uniform class intervals) It is occasionally useful to create a frequency table with non-uniform class intervals. For example, here are the ages grouped into intervals according to school level. AGE RANGE, YRS (SCHOOL AGE) Freq RelFreq CumFreq ---- 3-4 (PRE) 11 1.7% 1.7% 5-11 (ELEM) 469 71.7% 73.4% 12-13 (MIDDLE) 100 15.3% 88.7% 14-19 (HIGH) 74 11.3% 100.0% ----- Total 654 100.0% Page 2.6 (freq.docx; last update 2/7/16)