Statistics 13 Elementary Statistics

Summer Session I 2012
Lecture Notes 2: Methods for Describing Data
(Last update: June 25, 2012)

Describing Qualitative Data

Definition 2.1 A class is one of the categories into which qualitative data can be classified.

Definition 2.2 The class frequency is the number of observations in the data set that fall into a particular class.

Definition 2.3 The class relative frequency is the class frequency divided by the total number of observations in the data set; that is,
$$\text{class relative frequency} = \frac{\text{class frequency}}{\text{total number of observations}}$$

Definition 2.4 The class percentage is the class relative frequency multiplied by 100; that is,
$$\text{class percentage} = (\text{class relative frequency}) \times 100$$

Summary of Graphical Descriptive Methods for Qualitative Data

Bar Graph: The categories (classes) of the qualitative variable are represented by bars, where the height of each bar is either the class frequency, class relative frequency, or class percentage.

Pie Chart: The categories (classes) of the qualitative variable are represented by slices of a pie (circle). The size of each slice is proportional to the class relative frequency.

Pareto Diagram: A bar graph with the categories (classes) of the qualitative variable (i.e., the bars) arranged by height in descending order from left to right.
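
As a rough illustration of Definitions 2.2 to 2.4, the following Python sketch tabulates class frequencies, relative frequencies, and percentages, and lists the classes in descending order of frequency as a Pareto diagram would. The observations and the function name `summarize_classes` are made up for this example and do not come from the notes.

```python
from collections import Counter

# Hypothetical observations of a qualitative variable (not from the notes).
observations = ["A", "B", "A", "C", "A", "B", "A", "C", "B", "A"]

def summarize_classes(data):
    """Return (class, frequency, relative frequency, percentage) rows,
    ordered by descending frequency as in a Pareto diagram."""
    n = len(data)
    rows = []
    # most_common() yields classes from the most to the least frequent.
    for label, freq in Counter(data).most_common():
        rel = freq / n                      # class relative frequency
        rows.append((label, freq, rel, rel * 100))  # percentage = rel * 100
    return rows

rows = summarize_classes(observations)
for label, freq, rel, pct in rows:
    print(f"{label}: frequency={freq}, relative frequency={rel:.2f}, percentage={pct:.1f}%")
```

The relative frequencies always sum to 1, which is why the slices of a pie chart fill the whole circle.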

[Figure: Income of the patients. Examples of pie charts (top) and bar graphs (bottom) for the Control and Treatment groups. Classes: Under $25,000; $25,000 to $50,000; $50,001 to $75,000; $75,001 to $100,000; Above $100,000; Prefer not to answer.]

[Figure: Reasons for arriving late at work (from Wikipedia). Example of a Pareto diagram.]

Describing Quantitative Data

Summary of Graphical Descriptive Methods for Quantitative Data

Dot Plot: The numerical value of each quantitative measurement in the data set is represented by a dot on a horizontal scale. When data values repeat, the dots are placed above one another vertically.

Stem-and-Leaf Display: The numerical value of the quantitative variable is partitioned into a stem and a leaf. The possible stems are listed in order in a column. The leaf for each quantitative measurement in the data set is placed in the corresponding stem row. Leaves for observations with the same stem value are listed in increasing order horizontally.

Histogram: The possible numerical values of the quantitative variable are partitioned into class intervals, each of which has the same width. These intervals form the scale of the horizontal axis. The frequency or relative frequency of observations in each class interval is determined. A vertical bar is placed over each class interval, with the height of the bar equal to either the class frequency or class relative frequency.
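
The histogram construction described above can be sketched in Python as follows. This is only a minimal illustration: the measurements, the starting point, the class width, and the function name `class_interval_frequencies` are all assumptions chosen for the example, and the convention that each interval includes its left endpoint is one of several in common use.

```python
# Hypothetical measurements (made up for illustration, not from the notes).
measurements = [2.1, 3.4, 3.9, 5.0, 5.2, 6.8, 7.7, 8.1, 9.5, 9.9]

def class_interval_frequencies(data, start, width, num_classes):
    """Count observations in equal-width class intervals
    [start + i*width, start + (i+1)*width). The last interval also
    includes its right endpoint so the largest possible value is counted."""
    counts = [0] * num_classes
    upper_limit = start + num_classes * width
    for x in data:
        index = int((x - start) // width)
        if x == upper_limit:
            index = num_classes - 1
        if 0 <= index < num_classes:
            counts[index] += 1
    return counts

# Class intervals 0-2, 2-4, 4-6, 6-8, 8-10.
freqs = class_interval_frequencies(measurements, start=0, width=2, num_classes=5)
rel_freqs = [f / len(measurements) for f in freqs]
print(freqs)
print(rel_freqs)
```

Either list can serve as the bar heights: `freqs` gives a frequency histogram and `rel_freqs` a relative frequency histogram with the same shape.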

Dotplots

Example 1 The outbreak of food poisoning on a sportsday, Thailand 1990.

[Figure: "Age by sex" (dot plots for F and M, ages 0 to 70) and "Distribution of birthdate" (frequency by year, 1930 to 1975).]

Stem-and-Leaf Display

Example 2 The following data show the ages of the 27 residents of Alcan, Alaska. (Source: U.S. Bureau of the Census)

45  1 52 42 10 40 50 40  7
46 19 35  3 11 31  6 41 12
43 37  8 41 48 42 55 30 58

The stem-and-leaf display for the data:

0 | 13678
1 | 0129
2 |
3 | 0157
4 | 0011223568
5 | 0258
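
The display in Example 2 can be reproduced with a short Python sketch. It assumes two-digit data with the tens digit as the stem and the units digit as the leaf; the function name `stem_and_leaf` is chosen here for illustration.

```python
from collections import defaultdict

# Ages of the 27 residents of Alcan, Alaska, as listed in Example 2.
ages = [45, 1, 52, 42, 10, 40, 50, 40, 7, 46, 19, 35, 3, 11,
        31, 6, 41, 12, 43, 37, 8, 41, 48, 42, 55, 30, 58]

def stem_and_leaf(data):
    """Map each stem (tens digit) to its leaves (units digits) in increasing order."""
    leaves = defaultdict(list)
    for value in sorted(data):
        stem, leaf = divmod(value, 10)
        leaves[stem].append(str(leaf))
    lowest, highest = min(leaves), max(leaves)
    # List every possible stem in order, including stems with no leaves.
    return {stem: "".join(leaves[stem]) for stem in range(lowest, highest + 1)}

display = stem_and_leaf(ages)
for stem, row in display.items():
    print(f"{stem} | {row}")
```

Stem 2 is printed with no leaves because no resident is in their twenties; keeping the empty row preserves the shape of the distribution.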

Histograms

Example 3 Using the age data from above.

[Figure: two histograms of age, one with Frequency and one with Relative Frequency on the vertical axis; the horizontal axis shows age from 0 to 60.]

The Meaning of Summation Notation $\sum_{i=1}^{n} x_i$

Sum the measurements of the variable that appears to the right of the summation symbol, beginning with the first measurement and ending with the nth measurement.

Example 4 A data set contains the observations 5, 1, 3, 2, 1. Then we set $x_1 = 5$, $x_2 = 1$, $x_3 = 3$, $x_4 = 2$, $x_5 = 1$. Then

a. $\sum_{i=1}^{5} x_i = x_1 + x_2 + x_3 + x_4 + x_5 = 5 + 1 + 3 + 2 + 1 = 12$

b. $\sum_{i=1}^{5} x_i^2 = x_1^2 + x_2^2 + x_3^2 + x_4^2 + x_5^2 = 5^2 + 1^2 + 3^2 + 2^2 + 1^2 = 40$

c. $\sum_{i=1}^{5} (x_i - 1) = (x_1 - 1) + (x_2 - 1) + (x_3 - 1) + (x_4 - 1) + (x_5 - 1) = (x_1 + x_2 + x_3 + x_4 + x_5) - (1 + 1 + 1 + 1 + 1) = \sum_{i=1}^{5} x_i - 5 = 12 - 5 = 7$

d. $\sum_{i=1}^{5} (x_i - 1)^2 = (x_1 - 1)^2 + (x_2 - 1)^2 + (x_3 - 1)^2 + (x_4 - 1)^2 + (x_5 - 1)^2 = 4^2 + 0^2 + 2^2 + 1^2 + 0^2 = 21$

e. $\left(\sum_{i=1}^{5} x_i\right)^2 = (x_1 + x_2 + x_3 + x_4 + x_5)^2 = (5 + 1 + 3 + 2 + 1)^2 = 12^2 = 144$

Definition 2.5 The mean of a set of quantitative data is the sum of the measurements, divided by the number of measurements contained in the data set.

Formula for a Sample Mean:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

Symbols for the Sample Mean and the Population Mean
$\bar{x}$ = Sample mean
$\mu$ = Population mean
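
The sums in Example 4 and the sample mean of Definition 2.5 can be checked numerically. The Python sketch below uses the five observations from Example 4; the variable names are chosen here for illustration.

```python
x = [5, 1, 3, 2, 1]  # observations from Example 4
n = len(x)

sum_x = sum(x)                                          # part a: sum of x_i
sum_x_squared = sum(xi ** 2 for xi in x)                # part b: sum of x_i squared
sum_x_minus_1 = sum(xi - 1 for xi in x)                 # part c: sum of (x_i - 1)
sum_x_minus_1_squared = sum((xi - 1) ** 2 for xi in x)  # part d: sum of (x_i - 1) squared
square_of_sum = sum(x) ** 2                             # part e: square of the sum
mean = sum_x / n                                        # sample mean

print(sum_x, sum_x_squared, sum_x_minus_1, sum_x_minus_1_squared, square_of_sum, mean)
```

Parts b and e differ because squaring each measurement before summing is not the same as squaring the total.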

Definition 2.6 The median of a quantitative data set is the middle number when the measurements are arranged in ascending (or descending) order.

Calculating a Sample Median M
Arrange the n measurements from the smallest to the largest.
1. If n is odd, M is the middle number.
2. If n is even, M is the mean of the middle two numbers.

Definition 2.7 A data set is said to be skewed if one tail of the distribution has more extreme observations than the other tail.

[Figure: three relative frequency distributions showing the positions of the mean and the median under rightward skewness, symmetry, and leftward skewness.]

Definition 2.8 The mode is the measurement that occurs most frequently in the data set.

Definition 2.9 The range of a quantitative data set is equal to the largest measurement minus the smallest measurement.

Definition 2.10 The sample variance for a sample of n measurements is equal to the sum of the squared distances from the mean, divided by (n - 1). The symbol $s^2$ is used to represent the sample variance.

Formula for a Sample Variance:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

A shortcut formula:
$$s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n - 1}$$
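
As a quick check that the definitional and shortcut variance formulas agree, the following Python sketch applies Definitions 2.6 to 2.10 to the five observations 5, 1, 3, 2, 1 used in Example 4. The variable names are chosen here for illustration; `statistics.median`, `statistics.mode`, and `statistics.variance` are standard-library functions.

```python
import statistics

x = [5, 1, 3, 2, 1]  # the five observations used in Example 4
n = len(x)
mean = sum(x) / n

median = statistics.median(x)   # middle value of the sorted data (n is odd)
mode = statistics.mode(x)       # most frequent measurement
data_range = max(x) - min(x)    # largest minus smallest

# Sample variance from the definition: squared distances from the mean over (n - 1).
variance_definition = sum((xi - mean) ** 2 for xi in x) / (n - 1)

# Sample variance from the shortcut formula.
variance_shortcut = (sum(xi ** 2 for xi in x) - sum(x) ** 2 / n) / (n - 1)

print(median, mode, data_range, variance_definition, variance_shortcut)
print(statistics.variance(x))   # library value, also divides by (n - 1)
```

The shortcut formula needs only the two running totals of $x_i$ and $x_i^2$, so the mean does not have to be computed first.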

Definition 2.11 The sample standard deviation, s, is defined as the positive square root of the sample variance, $s^2$, or, mathematically,
$$s = \sqrt{s^2}$$

Symbols for Variance and Standard Deviation
$s^2$ = Sample variance
$s$ = Sample standard deviation
$\sigma^2$ = Population variance
$\sigma$ = Population standard deviation

Numerical Descriptive Measures
Central Tendency: Mean, Median, Mode
Variation: Range, Variance, Standard Deviation

Two ways to interpret the standard deviation: 1. Chebyshev's Rule and 2. Empirical Rule.

1. Chebyshev's rule applies to any data set, regardless of the shape of the frequency distribution of the data.
   a. It is possible that very few of the measurements will fall within one standard deviation of the mean.
   b. At least 3/4 of the measurements will fall within two standard deviations of the mean.
   c. At least 8/9 of the measurements will fall within three standard deviations of the mean.
   d. Generally, for any number k greater than 1, at least $(1 - 1/k^2)$ of the measurements will fall within k standard deviations of the mean.

2. The empirical rule is a rule of thumb that applies to data sets with frequency distributions that are mound shaped and symmetric, as follows:

[Figure: a mound-shaped, symmetric distribution; relative frequency plotted against population measurements.]

   a. Approximately 68% of the measurements will fall within one standard deviation of the mean.
   b. Approximately 95% of the measurements will fall within two standard deviations of the mean.
   c. Approximately 99.7% (essentially all) of the measurements will fall within three standard deviations of the mean.

Interval                                          Chebyshev's rule                        Empirical rule
$\bar{x} \pm s$   ($\bar{x} \pm \sigma$)          No useful bound (possibly very few)     approx. 68%
$\bar{x} \pm 2s$  ($\bar{x} \pm 2\sigma$)         At least 3/4                            approx. 95%
$\bar{x} \pm 3s$  ($\bar{x} \pm 3\sigma$)         At least 8/9                            approx. 99.7%
$\bar{x} \pm ks$  ($\bar{x} \pm k\sigma$)         At least $(1 - 1/k^2)$                  -

Example 5 Use Chebyshev's Theorem to give a lower bound on the percent of data in the interval $(\bar{x} - 2.5s, \bar{x} + 2.5s)$.

Answer: At least $1 - 1/2.5^2 = 1 - 0.16 = 0.84 = 84\%$ of the measurements will fall within the interval; i.e., the lower bound is 84%.

Definition 2.12 For any set of n measurements (arranged in ascending or descending order), the pth percentile is a number such that p% of the measurements fall below that number and (100 - p)% fall above it.

Definition 2.13 The sample z-score for a measurement x is
$$z = \frac{x - \bar{x}}{s}$$
The population z-score for a measurement x is
$$z = \frac{x - \mu}{\sigma}$$

Interpretation of z-scores for Mound-Shaped Distributions of Data
1. Approximately 68% of the measurements will have a z-score between -1 and 1.
2. Approximately 95% of the measurements will have a z-score between -2 and 2.
3. Approximately 99.7% (almost all) of the measurements will have a z-score between -3 and 3.

Definition 2.14 An observation (or measurement) that is unusually large or small relative to the other values in a data set is called an outlier. Outliers typically are attributable to one of the following causes:
1. The measurement is observed, recorded, or entered into the computer incorrectly.
2. The measurement comes from a different population.

3. The measurement is correct, but represents a rare (chance) event.

Definition 2.15 The lower quartile $Q_L$ is the 25th percentile of a data set. The middle quartile M is the median. The upper quartile $Q_U$ is the 75th percentile.

Definition 2.16 The interquartile range (IQR) is the distance between the lower and upper quartiles:
$$\text{IQR} = Q_U - Q_L$$

Elements of a Box Plot

1. A rectangle (the box) is drawn with the ends (the hinges) drawn at the lower and upper quartiles ($Q_L$ and $Q_U$). The median of the data is shown in the box, usually by a line.

2. The points at distances 1.5(IQR) from each hinge mark the inner fences of the data set. Lines (the whiskers) are drawn from each hinge to the most extreme measurement inside the inner fence. Thus,
   Lower inner fence = $Q_L - 1.5(\text{IQR})$
   Upper inner fence = $Q_U + 1.5(\text{IQR})$

3. A second pair of fences, the outer fences, appears at a distance of 3(IQR) from the hinges. One symbol (e.g., *) is used to represent measurements falling between the inner and outer fences, and another (e.g., 0) is used to represent measurements that lie beyond the outer fences. Thus outer fences are not shown unless one or more measurements lie beyond them. We have
   Lower outer fence = $Q_L - 3(\text{IQR})$
   Upper outer fence = $Q_U + 3(\text{IQR})$

4. Different symbols can be used to represent the median and the extreme data points.

5. Measurements beyond the outer fences are probably outliers.

Graphing Bivariate Relationships

One way to describe the relationship between two quantitative variables, called a bivariate relationship, is to plot the data in a scattergram (or scatterplot).

[Figure: three scattergrams showing (a) a positive relationship, (b) a negative relationship, and (c) no relationship.]
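
The box plot fences defined above can be computed with a short Python sketch, here applied to the ages from Example 2. One assumption to note: the quartiles are taken from the standard-library function `statistics.quantiles` with `method="inclusive"`, which is only one of several quartile conventions, so other textbooks or software may give slightly different values for $Q_L$ and $Q_U$. The variable names are chosen for illustration.

```python
import statistics

# Ages of the 27 residents of Alcan, Alaska, from Example 2.
ages = [45, 1, 52, 42, 10, 40, 50, 40, 7, 46, 19, 35, 3, 11,
        31, 6, 41, 12, 43, 37, 8, 41, 48, 42, 55, 30, 58]

# Assumption: "inclusive" quartile convention; other conventions exist.
q_l, median, q_u = statistics.quantiles(ages, n=4, method="inclusive")
iqr = q_u - q_l

lower_inner, upper_inner = q_l - 1.5 * iqr, q_u + 1.5 * iqr
lower_outer, upper_outer = q_l - 3 * iqr, q_u + 3 * iqr

# Measurements between the inner and outer fences (plotted with one symbol).
between_fences = [a for a in ages
                  if lower_outer <= a < lower_inner or upper_inner < a <= upper_outer]
# Measurements beyond the outer fences (probable outliers, plotted with another symbol).
beyond_outer = [a for a in ages if a < lower_outer or a > upper_outer]

print(q_l, median, q_u, iqr)
print(lower_inner, upper_inner, lower_outer, upper_outer)
print(between_fences, beyond_outer)
```

Under this quartile convention, no age falls outside the inner fences, so the whiskers of the box plot would simply extend to the smallest and largest ages.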