Descriptive Statistics




Y520 Robert S Michael

Goal: Learn to calculate indicators and construct graphs that summarize and describe a large quantity of values. Using the textbook readings and other resources listed on the web site, be sure you can define, know when to use, calculate (with SPSS), and interpret the following:

I. Indicators of Central Tendency
   A. Mode
   B. Median
   C. Mean
II. Indicators of Dispersion
   A. Range
   B. Interquartile Range
   C. Variance
   D. Standard Deviation
III. Graphic Presentation and Summarization
   A. Sort raw data
   B. Frequency table
   C. Reduce raw data to categories
   D. Cumulative frequencies & percentiles
   E. Histograms
IV. Exploratory Data Analysis
   A. Box and whisker plot
   B. Stem and leaf display

Displaying the Shape of the Distribution

Goal: Determine how closely the shape of the distribution approximates a Gaussian distribution. Parametric statistical tests, the kind we will study next, assume the data do indeed approximate a Gaussian distribution.

V. Indicators of a Gaussian distribution
   A. Mean = Median = Mode
   B. Skewness:

      \[ b_1 = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^3 \]

      measures the asymmetry of the distribution. A value of zero indicates no skewness is present. The larger the absolute value, the more skewed the distribution. Negative skew indicates the tail of the distribution is to the left, with most of the scores clustering at the higher end of the scale. Positive skew indicates the scores cluster at the low end of the scale and the tail extends to the right.
   C. Kurtosis:

      \[ b_2 = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^4 - 3 \]

      indicates the flatness of the distribution. The cutoffs below refer to the fourth-moment term before 3 is subtracted; for b_2 as defined above, the corresponding cutoff is 0.
      1. Mesokurtic: = 3
      2. Platykurtic: < 3
      3. Leptokurtic: > 3
   D. Graphs
      1. Ogive
      2. Normal Probability Plots
   E. Statistical Tests
      1. Chi Square

VI. Resistant indicators

A. Central Tendency. In certain data sets some observed values lie far away from the clump of the data values. These outliers, or extreme scores, may be due to measurement errors or data recording errors, or they may represent valid data points. Extreme scores unduly influence the mean and standard deviation. Suppose, for example, that the mean annual salary in this class is $59,000. Now, imagine that for some reason Bill Gates decides to join our class. When we include his, say, $10,000,000 annual salary, we are now all millionaires, for the class mean is now $x,xxx,xxx. The mean is no longer descriptive of the average, for the mean is not resistant to extreme scores. Hence, use the median instead. The median is not influenced by the exact value of the largest score (or value) and thus is a more resistant measure of central tendency.

B. Dispersion. The range, clearly, is not resistant to the influence of extreme scores. Because each value in a distribution is included in the calculation of the variance and standard deviation, neither is resistant to extreme values. The interquartile range, because it is based on percentiles, is resistant to extreme scores. The lower quartile is the value such that 25 percent of all values fall below it. The upper quartile is the value such that 25 percent of all values fall above it. The interquartile range is the difference between the upper and lower quartiles. In a large sample that approximates the Gaussian distribution, the interquartile range tends to be 1.34 times the sample standard deviation.

C. Shape of the distribution. Resistant indicators of skewness and kurtosis also exist, such as the Yule-Kendall skewness statistic, defined as:

\[ \gamma_{YK} = \frac{x_{0.25} - 2x_{0.5} + x_{0.75}}{x_{0.75} - x_{0.25}} \]

where x_{0.25}, x_{0.5}, and x_{0.75} are the lower quartile, the median, and the upper quartile. Other resistant indicators based on all the quantiles, such as L-moments, also exist, but these are not included in an introductory discussion.
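The resistance of the median and interquartile range can be checked numerically. The following is a minimal sketch, not part of the original handout: the eight salaries are hypothetical (chosen so the mean is 59, in thousands of dollars), the helper names `iqr` and `yule_kendall` are illustrative, and the quartile convention is whatever Python's `statistics.quantiles` uses by default, which may differ from SPSS.

```python
import statistics

# Hypothetical class salaries in thousands of dollars (illustrative only).
salaries = [52, 55, 57, 59, 60, 61, 63, 65]
# One $10,000,000 salary joins the class.
with_outlier = salaries + [10_000]


def iqr(values):
    """Interquartile range: upper quartile minus lower quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile convention is an assumption
    return q3 - q1


def yule_kendall(values):
    """Yule-Kendall skewness: (x_0.25 - 2*x_0.5 + x_0.75) / (x_0.75 - x_0.25)."""
    q1, q2, q3 = statistics.quantiles(values, n=4)
    return (q1 - 2 * q2 + q3) / (q3 - q1)


for label, data in (("original", salaries), ("with outlier", with_outlier)):
    print(
        label,
        "mean:", round(statistics.mean(data), 2),
        "median:", statistics.median(data),
        "IQR:", iqr(data),
        "Yule-Kendall:", round(yule_kendall(data), 3),
    )
```

In this sketch the mean moves from 59 to more than 1,000 (thousand dollars) when the extreme salary is added, while the median and interquartile range barely change.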

Calculation of Mean and Standard Deviation

Sample of 10 Scores from P102 Exam

Person   Score (x)   (x-m)    (x-m)^2
A        67          -23.1    533.61
B        95            4.9     24.01
C        98            7.9     62.41
D        92            1.9      3.61
E        99            8.9     79.21
F        96            5.9     34.81
G        94            3.9     15.21
H        90           -0.1      0.01
I        95            4.9     24.01
J        75          -15.1    228.01

sum of scores = 901
mean = 90.1
sum of squared deviations = 1,004.90 (about 1,005)
variance = 1,004.90 / 9 = 111.66 (dividing by n = 10 instead gives 100.49)
standard deviation = 10.57
skewness = -1.6557
kurtosis = 1.7794

Note that the mean is the arithmetic average:

\[ M = \frac{\sum x_i}{n} \]

The column labelled (x-m) shows the amount by which each score deviates from the mean. This column will always sum to zero. The total of the column labelled (x-m)^2 is known as the sum of the squared deviations about the mean, or just the sum of squares. The variance is the sum of squares divided by n - 1:

\[ s^2 = \frac{\sum (x - M)^2}{n - 1} \]

and the standard deviation is the square root of the variance:

\[ s = \sqrt{\frac{\sum (x - M)^2}{n - 1}} \]

To illustrate the impact of an extreme score, suppose the instructor realizes that the score of 67 for student A was entered by mistake. In actuality, student A earned a score of 57. Note the changes in the descriptive statistics when this single change is made.
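The arithmetic for the ten P102 scores can be reproduced with a few lines of Python. This is a sketch using only the standard library; the variable names are illustrative, and the variance divides by n - 1 as in the formula above.

```python
import math

# The ten P102 exam scores for persons A through J.
scores = [67, 95, 98, 92, 99, 96, 94, 90, 95, 75]

n = len(scores)
mean = sum(scores) / n

# Deviation of each score from the mean; these always sum to zero.
deviations = [x - mean for x in scores]

# Sum of squares: the total of the squared deviations.
sum_of_squares = sum(d ** 2 for d in deviations)

# Sample variance divides the sum of squares by n - 1.
variance = sum_of_squares / (n - 1)
std_dev = math.sqrt(variance)

print("mean:", round(mean, 2))
print("sum of squares:", round(sum_of_squares, 2))
print("variance:", round(variance, 2))
print("standard deviation:", round(std_dev, 2))
```

The standard deviation obtained this way is about 10.57, which agrees with the Standard Deviation row of the summary comparison given with the extreme-score example.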

Effect of an Extreme Score

Sample of 10 Scores from P102 Exam

Person   Score (x)   (x-m)    (x-m)^2
A        57          -32.1    1030.41
B        95            5.9      34.81
C        98            8.9      79.21
D        92            2.9       8.41
E        99            9.9      98.01
F        96            6.9      47.61
G        94            4.9      24.01
H        90            0.9       0.81
I        95            5.9      34.81
J        75          -14.1     198.81

sum of scores = 891
mean = 89.1
sum of squared deviations = 1,556.90 (about 1,557)
variance = 1,556.90 / 9 = 172.99 (dividing by n = 10 instead gives 155.69)
standard deviation = 13.15
skewness = -2.0341
kurtosis = 3.8474

Note the changes in the descriptive statistics presented below. The mean changes slightly (about one percent), as you would expect due to an extreme score, but the median remains unchanged. This illustrates the meaning of resistant indicator. The standard deviation shows a 24 percent increase. Skewness and kurtosis also show large changes, suggesting the shape of the distribution departs even further from the Gaussian.

Statistic            Original Data   One Extreme Score
Mean                 90.10           89.10
Standard Error        3.34            4.16
Median               94.50           94.50
Mode                 95.00           95.00
Standard Deviation   10.57           13.15
Sample Variance     111.66          172.99
Kurtosis              1.78            3.85
Skewness             -1.66           -2.03
Range                32.00           42.00
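The comparison can be recomputed with Python's `statistics` module. This is a minimal sketch: the helper `summarize` and its dictionary keys are illustrative names, and the standard error is taken as the standard deviation divided by the square root of n.

```python
import math
import statistics

original = [67, 95, 98, 92, 99, 96, 94, 90, 95, 75]
one_extreme = [57, 95, 98, 92, 99, 96, 94, 90, 95, 75]  # student A changed to 57


def summarize(scores):
    """Return a few of the summary statistics from the comparison table."""
    sd = statistics.stdev(scores)  # sample standard deviation (n - 1 divisor)
    return {
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        "mode": statistics.mode(scores),
        "std_dev": sd,
        "std_error": sd / math.sqrt(len(scores)),
        "range": max(scores) - min(scores),
    }


before = summarize(original)
after = summarize(one_extreme)

for key in before:
    print(f"{key:10s} {before[key]:8.2f} {after[key]:8.2f}")

# Relative increase in the standard deviation caused by the single change.
increase = (after["std_dev"] / before["std_dev"] - 1) * 100
print("std_dev increase (%):", round(increase))
```

The median and mode are identical in both columns, while the standard deviation and range respond strongly to the single changed score.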

Here is how skewness is calculated by hand for a different set of data:

Skewness
1. List the raw scores in a column.
2. Subtract the mean from each raw score. These are the deviations from the mean.
3. Raise each of these deviations from the mean to the third power and sum. This is the sum of the cubed deviations.
4. Calculate skewness: the sum of the cubed deviations, divided by the product of the number of cases minus 1 and the standard deviation raised to the third power.

y        (y - M)   (y - M)^3
 8.04     0.54       0.16
 6.95    -0.55      -0.17
 7.58     0.08       0.00
 8.81     1.31       2.24
 8.33     0.83       0.57
 9.96     2.46      14.87
 7.24    -0.26      -0.02
 4.26    -3.24     -34.04
10.84     3.34      37.23
 4.82    -2.68     -19.27
 5.68    -1.82      -6.04

sum of y = 82.51
sum of deviations = 0.00
sum of cubed deviations = -4.46
mean = (sum of y)/n = M = 7.50
st dev = square root of the variance = 2.03
(n-1) x stdev^3 = 83.65
skewness = -4.46 / 83.65 = -0.0533

Calculating Skewness:
1. First, calculate the mean and standard deviation.
2. Subtract the mean from each raw score and cube the result (i.e., raise it to the third power).
3. Sum the cubed deviations.
4. Multiply the number of scores minus 1 by the cubed standard deviation (i.e., the standard deviation raised to the third power).
5. Skewness = step 3 divided by step 4.
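The five steps translate directly into code. The sketch below follows the hand calculation as described (cubed deviations summed, then divided by (n - 1) times the cubed standard deviation); the variable names are illustrative. Because it keeps full precision rather than rounding the standard deviation to 2.03, the final value can differ from the hand result in the fourth decimal place.

```python
import statistics

# The eleven raw scores from the hand calculation.
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

n = len(y)

# Step 1: mean and (sample) standard deviation.
mean = sum(y) / n
std_dev = statistics.stdev(y)

# Steps 2 and 3: cube each deviation from the mean, then sum.
sum_cubed_deviations = sum((value - mean) ** 3 for value in y)

# Step 4: (n - 1) times the cubed standard deviation.
denominator = (n - 1) * std_dev ** 3

# Step 5: skewness is step 3 divided by step 4.
skewness = sum_cubed_deviations / denominator

print("mean:", round(mean, 2))
print("standard deviation:", round(std_dev, 2))
print("sum of cubed deviations:", round(sum_cubed_deviations, 2))
print("skewness:", round(skewness, 4))
```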

Keep in mind that if a distribution is positively skewed, the bulk of the values clump around the lower end of the scale, with a few trailing off at the high end. Conversely, in a negatively skewed distribution, the bulk of the values clump around or near the high end of the scale, with a few values trailing off at the low end. The following table summarizes the descriptive statistics for the P102 sample.

Table 1: Summary Statistics for P102 Exam Data

Statistic             Symbol              Value    Comment
sample size           n                   10       number of cases/individuals
mean                  x-bar               90.1     non-resistant measure of location
standard deviation    s_x                 10.57    non-resistant measure of dispersion
range                 x_max - x_min       32       non-resistant measure of scale
skewness              b_1                 -1.66    non-resistant measure of skewness
kurtosis              b_2                 1.78     non-resistant measure of kurtosis
median                x_0.5               94.5     resistant measure of location
interquartile range   x_0.75 - x_0.25     10.25    resistant measure of dispersion
Yule-Kendall          gamma_YK            -0.19    resistant measure of skewness

[Figure: Gaussian curve, with the horizontal axis marked from 4 sd below the mean to 4 sd above the mean]

The equation for the Gaussian curve is

\[ y = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}} \]

where:
y = The height of the curve at a given value of x.
σ = The standard deviation of the distribution.
π = A constant (pi) of approximately 3.1416.
x = A specific score within the distribution.
e = The base of the Napierian logarithms, approximately 2.71828.
µ = The mean of the distribution.
σ^2 = The variance of the distribution.
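As a quick check on the equation, the height of the curve can be evaluated directly. This is a small sketch; the function name `gaussian_height` is illustrative, and the mean of 90.1 and standard deviation of 10.57 are borrowed from the P102 example purely as sample inputs.

```python
import math


def gaussian_height(x, mu, sigma):
    """Height y of the Gaussian curve at x for mean mu and standard deviation sigma."""
    exponent = -((x - mu) ** 2) / (2 * sigma ** 2)
    return math.exp(exponent) / (sigma * math.sqrt(2 * math.pi))


mu, sigma = 90.1, 10.57  # sample inputs taken from the P102 example

# Heights at the mean and at 1, 2, and 3 standard deviations above it.
for k in range(4):
    x = mu + k * sigma
    print(f"{k} sd above the mean: y = {gaussian_height(x, mu, sigma):.5f}")
```

The curve is highest at the mean and symmetric about it, so the height at k standard deviations below the mean equals the height at k standard deviations above.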

Box Plots

Box plots are useful in visualizing distributions. Consider the following scattergram of per capita income for each of the 50 states (y axis) against charitable deductions (x axis) listed on 1998 itemized tax returns.

[Figure: scattergram with box plots; y axis: Per Capita Income (15,000 to 30,000); x axis: Charitable Giving (0 to 6,000)]

An explanation of the box plot follows. The line or asterisk within the box is the median of the distribution. Fifty percent of the cases fall within the upper and lower hinges (the box boundaries). The upper hinge occurs at the 75th percentile, which is the third quartile and, in a Gaussian distribution, corresponds to a z-score of about 0.67. As discussed earlier, the median occurs at the 50th percentile, which is the second quartile and corresponds to a z-score of zero. The lower hinge occurs at the 25th percentile, which is the first quartile and corresponds to a z-score of about -0.67. The whiskers terminate at the largest and smallest values that are not considered to be outliers. The definitions for outliers and extreme scores may vary depending on the software program. A common definition for an outlier is any value more than 1.5 box-lengths above the upper hinge or below the lower hinge, and for an extreme score, any value more than 3 box-lengths above the upper hinge or below the lower hinge.

In the charitable giving example, one of the states (that shall remain nameless) has a high per capita income (around $27,000) but gives only about $1,000 to charity. Notice that the circle for this pair of data points lies beyond the whisker of the charitable giving box.
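The hinge and box-length rules described above can be sketched in code. This is an illustration only: the data are hypothetical (not the state figures from the plot), the function name `box_plot_summary` and its dictionary keys are made up for the example, and the quartile convention is Python's default, which may not match a given statistics package.

```python
import statistics


def box_plot_summary(values):
    """Hinges, whisker ends, outliers, and extreme scores using the 1.5 and 3 box-length rules."""
    lower_hinge, median, upper_hinge = statistics.quantiles(values, n=4)
    box_length = upper_hinge - lower_hinge

    # Values beyond 3 box-lengths from a hinge are extreme scores.
    extremes = [v for v in values
                if v < lower_hinge - 3 * box_length or v > upper_hinge + 3 * box_length]
    # Values beyond 1.5 box-lengths (but not extreme) are outliers.
    outliers = [v for v in values
                if (v < lower_hinge - 1.5 * box_length or v > upper_hinge + 1.5 * box_length)
                and v not in extremes]
    # Whiskers end at the smallest and largest values that are not outliers.
    inside = [v for v in values if v not in outliers and v not in extremes]

    return {
        "lower_hinge": lower_hinge,
        "median": median,
        "upper_hinge": upper_hinge,
        "whiskers": (min(inside), max(inside)),
        "outliers": outliers,
        "extremes": extremes,
    }


# Hypothetical data with one outlier (30) and one extreme score (40).
summary = box_plot_summary([10, 11, 12, 13, 14, 15, 16, 17, 18, 30, 40])
for name, value in summary.items():
    print(name, value)
```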

Stem and Leaf

Another useful data display is known as the stem and leaf. This is a simple way of displaying the distribution of data without having to use computer graphics. The characteristic that makes the stem and leaf unique is that every value in the data set is displayed. The stem and leaf plot groups the values in a data set according to their all but least significant digits. These are written in ascending or descending order to the left side of a vertical bar and are known as the stem. The leaves are formed by writing the least significant digit to the right of the vertical bar, on the same line as the more significant digits with which it belongs.

The stem and leaf plot below shows the charitable giving for 100 individuals. In this display the stem is the thousands digit and each leaf is the remaining three digits. We can see that the least amount one person gave was $1,082, while the most one person gave was $5,779. Further, we can see that in the $4,000 range, the following exact values were given: $4,018, $4,057, $4,073, $4,095... $4,814. The stem and leaf will vary slightly in appearance depending on the specific software used. Some programs enable you to examine the leaves in detail, by reporting the number of cases, the spread, the value of the lower and upper hinges, etc.

1*** | 082
1*** | 303
1*** |
1*** | 785
1*** | 870,976,985
2*** | 012,040,116
2*** | 212,242,256,296,308
2*** | 448,482,511,511,530,560
2*** | 609,632,686,718,740,785
2*** | 806,829,833,871,885,899,951,963
3*** | 001,010,015,028,030,088,164,170,171,178
3*** | 225,229,237,277,310,358,385,392
3*** | 413,414,439,450,450,502,519,594
3*** | 615,633,638,654,682,738,761
3*** | 813,813,820,834,860,872,897,914,918,955,994
4*** | 018,057,073,095,154,192
4*** | 238,271,342,377,379,387
4*** | 425,426,494,545
4*** |
4*** | 814
5*** | 009
5*** | 273,379
5*** | 501
5*** | 779
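A basic stem and leaf display is simple to build by hand or in code. The sketch below uses the ten P102 exam scores rather than the charitable giving data, takes the tens digit as the stem and the units digit as the leaf, and assumes non-negative whole-number values; the function name `stem_and_leaf` is illustrative.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return the lines of a stem and leaf display (stem = tens, leaf = units)."""
    leaves = defaultdict(list)
    for value in sorted(values):
        leaves[value // 10].append(value % 10)

    lines = []
    # Include stems with no leaves so gaps in the data stay visible.
    for stem in range(min(leaves), max(leaves) + 1):
        row = "".join(str(leaf) for leaf in leaves.get(stem, []))
        lines.append(f"{stem} | {row}".rstrip())
    return lines


scores = [67, 95, 98, 92, 99, 96, 94, 90, 95, 75]
for line in stem_and_leaf(scores):
    print(line)
```

Every score can be read back from the display, and the long row for the 90s shows at a glance that the scores cluster at the high end with a tail toward the low end.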

Histogram

The range of values is divided into a finite set of class intervals known as bins. The number of values in each bin is then counted and divided by the sample size to obtain the relative frequency of occurrence. The frequency is plotted as vertical bars of varying height. Some programs allow the user to set the number of bins that appear. The frequencies can be divided by the bin width to obtain frequency densities that can be compared to probability densities from a theoretical distribution, such as the Gaussian distribution. For example, the Gaussian probability density function is superimposed on the frequency histogram of the charitable giving of 100 individuals.

[Figure: frequency histogram with superimposed Gaussian curve; y axis: Frequency (0 to 24); x axis: Charitable Giving (0 to 6,000)]
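The binning steps can be sketched as follows. The twelve amounts are a small subset of values that appear in the stem and leaf display, chosen only for illustration; the function name `histogram` and the choice of five bins of width 1,000 starting at 1,000 are assumptions, not the settings used for the figure.

```python
def histogram(values, low, width, n_bins):
    """Count values per bin, then derive relative frequencies and frequency densities."""
    counts = [0] * n_bins
    for value in values:
        index = int((value - low) // width)
        if 0 <= index < n_bins:  # values outside the bins are ignored
            counts[index] += 1

    n = len(values)
    relative = [count / n for count in counts]   # count divided by sample size
    density = [freq / width for freq in relative]  # relative frequency divided by bin width
    return counts, relative, density


# A small illustrative subset of the charitable giving amounts.
giving = [1082, 1870, 2212, 2448, 2806, 3001, 3225, 3413, 3813, 4018, 4238, 5779]
counts, relative, density = histogram(giving, low=1000, width=1000, n_bins=5)

for i, count in enumerate(counts):
    start = 1000 + i * 1000
    print(f"{start}-{start + 1000}: count={count}, "
          f"relative={relative[i]:.3f}, density={density[i]:.6f}")
```

Because each density is a relative frequency divided by the bin width, the densities multiplied by the bin width sum to 1, which is what makes them comparable to a probability density curve.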