Essential Statistics Chapter 3

Measures of Center
In summarizing data, statisticians often talk about measures of center (i.e., what the data looks like in its center) as well as measures of spread (i.e., how the data spreads out).
When we talk about measures of center, we will use the arithmetic mean and the median, or more simply just mean and median.

Measures of Center - Mean
A list of n (or N) numbers is denoted $x_1, x_2, x_3, \ldots, x_n$.
The sum of those numbers is: $\sum x = x_1 + x_2 + x_3 + \cdots + x_n$
The mean for a sample and for a population is:
$\bar{x} = \frac{\sum x}{n}$ (sample), $\mu = \frac{\sum x}{N}$ (population)
where $\bar{x}$ (x-bar) is the sample mean and $\mu$ (mu) is the population mean.
Note: the mean is not necessarily a member of the data set.

Measures of Center - Median
The median is a number that splits the sorted dataset into two halves.
Procedure for finding the median (symbol $\tilde{x}$, x-tilde):
1. Sort the data, and determine the number of data elements n.
2. If n is odd, the median is element number (n + 1) / 2.
3. If n is even, the median is the mean of the elements numbered n/2 and (n/2) + 1 (e.g., if n = 12, the median is the average of the 6th and 7th elements).
Note: if n is even, the median lies midway between the two center elements, so it is not a value in the dataset unless those two elements are equal.
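The mean and the median procedure can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation; the function names and the `sample` values are made up for the example, and the data is assumed to be non-empty.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by how many there are."""
    return sum(values) / len(values)


def median(values):
    """Median via the sort-and-pick procedure (the text counts positions from 1)."""
    data = sorted(values)
    n = len(data)
    if n % 2 == 1:
        # odd n: element number (n + 1) / 2, which is 0-based index (n - 1) // 2
        return data[(n - 1) // 2]
    # even n: mean of the elements numbered n/2 and n/2 + 1
    return (data[n // 2 - 1] + data[n // 2]) / 2


sample = [7, 3, 9, 4, 12]        # hypothetical data, n = 5 (odd)
print(mean(sample))              # 7.0
print(median(sample))            # 7
print(median(sample + [20]))     # 8.0 (n = 6, average of 7 and 9)
```

In the even case the result, 8.0, is not one of the data values, which matches the note in the procedure.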

Rounding numbers
As a general rule, round a computed statistic to one more decimal place than the data in the original data set.

Comparing mean and median
Values that lie very far away from the majority of the other data values are called outliers.
The mean is more affected by outliers than the median is.
[Figure: histograms of a symmetric, a right-skewed, and a left-skewed distribution]
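A small numeric check can make the effect of an outlier concrete. The sketch below uses Python's standard `statistics` module; the data values are hypothetical and chosen only for illustration.

```python
from statistics import mean, median

data = [41, 42, 44, 45, 46]      # hypothetical values with no outlier
with_outlier = data + [100]      # the same values plus one extreme value

# The mean moves from 43.6 to 53, while the median moves only from 44 to 44.5.
print(mean(data), median(data))
print(mean(with_outlier), median(with_outlier))
```

A single extreme value pulls the mean up by almost ten units here, while the median barely changes.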

Data Set Mode
The mode of a data set is the data value that occurs most often.
When two values tie for occurring most often (i.e., the same number of times), the data set is bimodal.
If more than two values tie for occurring most often, the data set is multimodal.
If no value occurs more than once in a data set, there is no mode.
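The mode rules above can be sketched with `collections.Counter`. The helper name `modes` is hypothetical, and the sketch assumes a non-empty data set.

```python
from collections import Counter


def modes(values):
    """Return every value tied for the highest count; [] if nothing repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []                # no value occurs more than once: no mode
    return sorted(v for v, c in counts.items() if c == top)


print(modes([1, 2, 2, 3]))       # [2]      one mode
print(modes([1, 1, 2, 2, 3]))    # [1, 2]   bimodal
print(modes([1, 2, 3]))          # []       no mode
```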

Mean of Grouped Data
Sometimes we don't have access to the actual data, only to its frequency distribution.
In that case we approximate the mean using class midpoints. A class midpoint is the lower class limit of one class plus the lower class limit of the next consecutive class, divided by 2.

Mean of Grouped Data
Procedure for approximating the mean of grouped data:
1. Compute the midpoint of each class by taking the average of the lower class limit and the lower limit of the next larger class.
2. For each class, multiply the class midpoint by the class frequency.
3. Add the products (midpoint) x (frequency) over all classes.
4. Divide the sum obtained in Step 3 by the sum of the frequencies.
See Example 3.9.
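The four steps can be sketched as follows. The function name, its parameters, and the frequency table are hypothetical (they are not the data of Example 3.9); the lower limit of the class that would follow the last one is passed in separately so that the last midpoint can be computed.

```python
def grouped_mean(lower_limits, next_lower_limit, frequencies):
    """Approximate the mean of grouped data from class midpoints."""
    limits = list(lower_limits) + [next_lower_limit]
    # Step 1: midpoint = average of a class's lower limit and the next class's lower limit
    midpoints = [(limits[i] + limits[i + 1]) / 2 for i in range(len(frequencies))]
    # Steps 2 and 3: multiply each midpoint by its frequency and add the products
    total = sum(m * f for m, f in zip(midpoints, frequencies))
    # Step 4: divide by the sum of the frequencies
    return total / sum(frequencies)


# Hypothetical classes 0-9, 10-19, 20-29, 30-39 with frequencies 2, 5, 2, 1
print(grouped_mean([0, 10, 20, 30], 40, [2, 5, 2, 1]))   # 17.0
```

Here the midpoints are 5, 15, 25, and 35, the products sum to 170, and 170 / 10 = 17.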


Mean of Grouped Data
approximate mean = (sum of midpoint x frequency) / (sum of frequencies) = 6850 / 50 = 137


Measures of Spread (3.2)
Measures of spread describe how the data values spread out in the dataset.
The simplest measure of spread is the range:
range = maximum data value - minimum data value

Measures of Spread - Variance
Variance is a measure of how far, on average, the data values in the dataset are from the mean.
As with the mean, let $x_1, x_2, x_3, \ldots, x_n$ represent the values in a dataset.
The formulas for population and sample variance are as follows:
$\sigma^2 = \frac{\sum (x - \mu)^2}{N}$ (population), $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$ (sample)


Measures of Spread - Std. Deviation
The units of variance are squared units; if the original data was in degrees, the variance is in degrees squared.
To remedy this, we use the standard deviation, which is simply the square root of the variance:
$s = \sqrt{s^2}$ (sample std. dev.), $\sigma = \sqrt{\sigma^2}$ (population std. dev.)
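As a rough sketch, variance and standard deviation can be computed directly from the usual definitions: sum the squared deviations from the mean, then divide by N for a population or by n - 1 for a sample. The function names, the `sample` flag, and the data below are illustrative assumptions.

```python
import math


def variance(values, sample=True):
    """Sum of squared deviations from the mean, divided by n - 1 (sample) or N (population)."""
    n = len(values)
    m = sum(values) / n
    squared_deviations = sum((x - m) ** 2 for x in values)
    return squared_deviations / (n - 1) if sample else squared_deviations / n


def std_dev(values, sample=True):
    """Standard deviation: the square root of the variance."""
    return math.sqrt(variance(values, sample))


data = [2, 4, 4, 4, 5, 5, 7, 9]          # hypothetical data, mean = 5
print(variance(data, sample=False))      # 4.0  (32 / 8)
print(std_dev(data, sample=False))       # 2.0
print(variance(data))                    # sample variance, 32 / 7
```

The standard library's `statistics.pvariance`, `statistics.variance`, `statistics.pstdev`, and `statistics.stdev` compute the same quantities.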

Measures of Spread - Empirical Rule
When a population or sample has a histogram that is approximately bell-shaped, then:
approximately 68% of the data will be within one standard deviation of the mean
approximately 95% of the data will be within two standard deviations of the mean
almost all of the data will be within three standard deviations of the mean

Measures of Spread - Empirical Rule
[Figure: approximately bell-shaped histogram with x-bar - s, x-bar, and x-bar + s marked on the horizontal axis]
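One way to see the Empirical Rule at work is to simulate bell-shaped data and count how much of it falls within one, two, and three standard deviations of the mean. The sketch below is only an illustration: the sample size, the mean of 100, the standard deviation of 15, and the seed are arbitrary assumptions.

```python
import random
import statistics

random.seed(0)                                          # fixed seed so the run is repeatable
data = [random.gauss(100, 15) for _ in range(10_000)]   # simulated bell-shaped sample

x_bar = statistics.mean(data)
s = statistics.stdev(data)


def share_within(k):
    """Fraction of the values lying within k standard deviations of the mean."""
    return sum(abs(x - x_bar) <= k * s for x in data) / len(data)


for k in (1, 2, 3):
    print(k, round(share_within(k), 3))
```

The printed shares should land close to 0.68, 0.95, and nearly 1, though the exact figures depend on the simulated sample.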

Measures of Spread - CV
The coefficient of variation (CV) shows how large the standard deviation is relative to the mean.
CV values are unit-less, so datasets measured in different units can be compared.
The CV formula is simply the standard deviation divided by the mean: $CV = \frac{s}{\bar{x}}$ for a sample, or $\frac{\sigma}{\mu}$ for a population.
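A short sketch of the CV comparison, using the sample standard deviation from Python's `statistics` module. The two datasets and their units are hypothetical.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the sample mean (unit-less)."""
    return statistics.stdev(values) / statistics.mean(values)


heights_cm = [150, 160, 170, 180, 190]   # hypothetical heights in centimeters
weights_kg = [50, 60, 70, 80, 90]        # hypothetical weights in kilograms

cv_heights = coefficient_of_variation(heights_cm)
cv_weights = coefficient_of_variation(weights_kg)
print(round(cv_heights, 3), round(cv_weights, 3))
```

Both datasets have the same standard deviation, but the weights have the smaller mean, so their CV is larger: relative to its mean, the weight data is more spread out.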

Measures of Position - z-scores (3.3)
The z-score of an individual data value indicates how many standard deviations that value is away from the mean.
Given a value x from a population with mean $\mu$ and standard deviation $\sigma$, the z-score for x is:
$z = \frac{x - \mu}{\sigma}$
See Example 3.22.

Measures of Position - z-scores
Empirical Rule and z-scores: when a population has a histogram that is approximately bell-shaped:
Approximately 68% of the data will have z-scores between -1 and 1
Approximately 95% of the data will have z-scores between -2 and 2
All, or almost all, of the data will have z-scores between -3 and 3
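The z-score formula translates directly into code. In this minimal sketch the population mean of 100 and standard deviation of 15 are assumed values chosen for illustration.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean mu."""
    return (x - mu) / sigma


# Hypothetical population: mean 100, standard deviation 15
print(z_score(130, 100, 15))   # 2.0   two standard deviations above the mean
print(z_score(85, 100, 15))    # -1.0  one standard deviation below the mean
```

A negative z-score means the value is below the mean, which is why the Empirical Rule ranges run from a negative bound to a positive one.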


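Measures of Position
Given any data set, the median divides the dataset into two equal parts.
We can also divide a dataset into 4 equal parts. The three values that separate those parts are called quartiles: Q1, Q2 (the median), and Q3.
[Figure: the data set values on a line, first split at the median, then split at Q1, Q2, and Q3]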

Measures of Position
We can also divide a dataset into 100 equal parts, using percentiles.
Given a number p between 1 and 99, the pth percentile separates the lowest p% of the data from the highest (100 - p)%.
[Figure: the data set values on a line with P25, P50, and P75 marked; 25% of the data lies below P25 and 75% above it]

Measures of Position
Computing a data value corresponding to a given percentile:
1. Sort the data in increasing order, and determine n.
2. Compute the location using the formula $L = \frac{p}{100} \cdot n$.
3. If L is not a whole number, round up (take the ceiling) to the next whole number; the pth percentile is the value in that location.
4. If L is a whole number, the pth percentile is the average of the values in locations L and L + 1.
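The location procedure can be sketched as below. The function name is hypothetical, p is assumed to be a whole number between 1 and 99, and the 42 stand-in values are simply 1 through 42, so the value at a location equals the location itself. Integer arithmetic (`divmod`) is used to decide whether L is a whole number, which sidesteps floating-point round-off.

```python
def percentile_value(values, p):
    """Value at the pth percentile, using the location L = (p / 100) * n."""
    data = sorted(values)
    whole, remainder = divmod(p * len(data), 100)
    if remainder:
        # L is not a whole number: round up to location whole + 1 (0-based index whole)
        return data[whole]
    # L is a whole number: average the values in locations L and L + 1
    return (data[whole - 1] + data[whole]) / 2


stand_in = list(range(1, 43))             # hypothetical sorted data, n = 42
print(percentile_value(stand_in, 30))     # 13    L = 12.6, rounded up to location 13
print(percentile_value(stand_in, 50))     # 21.5  L = 21, average of locations 21 and 22
```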

Measures of Position
Example 3.23: compute the 30th percentile of the sorted data (n = 42).
Location: L = (30 / 100) * 42 = 12.6
Since L is not a whole number, round up to the next whole number, 13; the 30th percentile is the value at location 13.


Measures of Position
Computing the percentile corresponding to a given data value:
1. Sort the data in increasing order, and determine n.
2. Let x be the given data value, and compute the percentile:
$p = 100 \cdot \frac{(\text{number of data values less than } x) + 0.5}{n}$
3. If p is not a whole number, round it to the nearest whole number.

Measures of Position
Example 3.24: to what percentile does a rainfall of 1.90 correspond?
Sort the data in ascending order and count how many values are less than 1.90: there are 17, and n = 42.
Percentile: p = ((17 + 0.5) / 42) * 100 = 41.6667
Since p is not a whole number, round 41.7 to 42; the value 1.90 corresponds to the 42nd percentile.
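The percentile-of-a-value formula can be sketched as follows. The function name and the stand-in data are hypothetical: the 42 values 0 through 41 are chosen so that exactly 17 of them lie below x = 17, which reproduces the counts used in the rainfall example without its actual data.

```python
def percentile_of(values, x):
    """Percentile of x: 100 * (count of values below x + 0.5) / n, rounded to a whole number."""
    n = len(values)
    below = sum(1 for v in values if v < x)
    # Note: Python's round() sends an exact .5 to the nearest even integer.
    return round((below + 0.5) / n * 100)


stand_in = list(range(42))            # hypothetical data: 42 values, 17 of them below 17
print(percentile_of(stand_in, 17))    # 42  ((17 + 0.5) / 42) * 100 = 41.67, rounded
```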

Measures of Position
Computing a data value corresponding to a given quartile:
1. Sort the data in increasing order, and determine n.
2. Find the percentile corresponding to the desired quartile, e.g., Q1 = P25, Q2 = P50, Q3 = P75.
3. Compute the location using the formula $L = \frac{p}{100} \cdot n$.
4. If L is not a whole number, round up (take the ceiling) to the next whole number; the quartile is the value in that location.
5. If L is a whole number, the quartile is the average of the values in locations L and L + 1.

Measures of Position
The five-number summary consists of the following 5 positional values: minimum, first quartile (Q1), median, third quartile (Q3), and maximum.

Measures of Position
Find the five-number summary of the following data:
41 42 42 44 44 45 45 46 49 49 51 51 53 56 57 59 59 65 67 71 77 100
min = 41, max = 100
median = 51 (n = 22, so average the 11th and 12th values)
Q1 = P25: L = (25 / 100) * 22 = 5.5, the next higher whole number is 6, and the value in the 6th location is 45
Q3 = P75: L = (75 / 100) * 22 = 16.5, the next higher whole number is 17, and the value in the 17th location is 59
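The five-number summary of these 22 values can be reproduced with a short sketch. The helper names are hypothetical, and the percentile helper follows the location rule L = (p / 100) * n with whole-number percentiles.

```python
def percentile_value(values, p):
    """Value at the pth percentile, using the location L = (p / 100) * n."""
    data = sorted(values)
    whole, remainder = divmod(p * len(data), 100)
    if remainder:
        return data[whole]                      # L rounded up, as a 0-based index
    return (data[whole - 1] + data[whole]) / 2  # average of locations L and L + 1


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max)."""
    return (min(values),
            percentile_value(values, 25),
            percentile_value(values, 50),
            percentile_value(values, 75),
            max(values))


data = [41, 42, 42, 44, 44, 45, 45, 46, 49, 49, 51,
        51, 53, 56, 57, 59, 59, 65, 67, 71, 77, 100]
print(five_number_summary(data))   # (41, 45, 51.0, 59, 100)
```

Other software may report slightly different quartiles, because several conventions exist for interpolating between neighboring values.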

Measures of Position - Outliers
An outlier is a data value much larger or smaller than the other data values in the dataset.
An outlier may be an error, or it may be a correct but unusual value, depending upon the measurement.
The interquartile range (IQR) is a measure of spread used to detect outliers:
IQR = Q3 - Q1
The lower and upper outlier boundaries are computed by:
lower outlier boundary = Q1 - (1.5 x IQR)
upper outlier boundary = Q3 + (1.5 x IQR)

Measures of Position - Outliers
Example 3.30: use the IQR method to determine which values, if any, in table 3.11 are outliers.
From example 2.27, Q1 = 45 and Q3 = 59.
IQR = Q3 - Q1 = 59 - 45 = 14
lower outlier boundary = 45 - (1.5 x 14) = 24
upper outlier boundary = 59 + (1.5 x 14) = 80
So any data values in the table that are less than 24 or greater than 80 are outliers (note there is only a single outlier, i.e., 100).
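The IQR method can be sketched as below. The function name is hypothetical. The sketch uses the 22 values listed in the five-number-summary example, which give the same Q1 = 45 and Q3 = 59; treating them as the data for this example is an assumption.

```python
def iqr_outliers(values, q1, q3):
    """Return the lower boundary, the upper boundary, and the values outside them."""
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers


data = [41, 42, 42, 44, 44, 45, 45, 46, 49, 49, 51,
        51, 53, 56, 57, 59, 59, 65, 67, 71, 77, 100]
print(iqr_outliers(data, q1=45, q3=59))   # (24.0, 80.0, [100])
```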