Chapter 2: Exploring Data with Graphs and Numerical Summaries

Graphical measures: graphs are used to describe the shape of a data set.

Section 1: Types of Variables

In general, a variable can be classified as categorical (qualitative) or quantitative.

A categorical variable measures an attribute; it takes values from a set of categories. Examples are gender, make of automobile, and eye color.

A quantitative variable represents the amount or quantity of something. It can be discrete or continuous.

A discrete quantitative variable takes values that can be counted, such as 0, 1, 2, 3, ... Example: the number of female students at APSU.

A continuous quantitative variable takes values from an interval, such as [0, 5]. Examples: heights of students in a class, air temperature.

Example: Questions 2.3 and 2.4, page 29.
The proportion of a category (or relative frequency of a category) = (frequency of the category) / (total number of observations).

Percentage of a category = proportion × 100%.

A frequency table is a list of the possible values of a variable together with the frequency of each value.

Example: In a survey concerning public education, 400 school administrators were asked to rate the quality of education in the US. Their responses are summarized in the following table.

Rating    Frequency
A         35
B         260
C         93
D         12

Based on these data, answer the following questions:
a) What proportion of the responses gave US schools the rating A?
b) What percentage of the responses gave US schools the rating A?
c) Which rating received the highest percentage of responses? What is this percentage? What can you say about the quality of education in the US based on these data?
d) Construct a relative frequency table for the data.
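The proportion and percentage definitions above can be checked with a short Python sketch. The function name `relative_frequency_table` is illustrative and does not come from the text; the frequencies are the ones in the survey table.

```python
# Frequencies from the survey of 400 school administrators.
frequencies = {"A": 35, "B": 260, "C": 93, "D": 12}


def relative_frequency_table(freq):
    """Return {category: (proportion, percentage)} for a frequency table."""
    total = sum(freq.values())
    return {cat: (count / total, 100 * count / total) for cat, count in freq.items()}


table = relative_frequency_table(frequencies)
for rating, (proportion, percentage) in table.items():
    print(f"{rating}: proportion = {proportion:.4f}, percentage = {percentage:.2f}%")
```

The proportions in a relative frequency table should add up to 1 (and the percentages to 100%), which is a quick way to catch arithmetic slips.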
Section 2: Graphs

Graphs for categorical variables: we use a pie chart or a bar chart to describe categorical data.

Example: For the previous data,
a) Construct a pie chart to describe the data.
b) Construct a bar chart to describe the data.

Graphs for quantitative variables: we use dot plots, stem-and-leaf plots, or histograms.

The distribution of a set of data is a graph, table, or mathematical formula that indicates the different kinds of possible observations and how often they occur. Distributions of quantitative data have shape, and the shape of a distribution can be determined by looking at dot plots, stem-and-leaf plots, or histograms.

Example: Construct a dot plot for the following data: 1, 2, 3, 2, 2, 4, 2, 3, 5, 9, 5, 5, 5, 1

Discussion (shapes of distributions):

[Figure: three dot plots, titled "Dotplot of data 1", "Dotplot of data 2", and "Dotplot of data 3"]
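For the dot plot example above (the data 1, 2, 3, 2, 2, 4, ...), a rough text rendering can be sketched in Python. The helper `dot_plot_rows` is a hypothetical name, and the layout assumes single-digit integer values so that the columns line up.

```python
from collections import Counter

data = [1, 2, 3, 2, 2, 4, 2, 3, 5, 9, 5, 5, 5, 1]


def dot_plot_rows(values):
    """Return the text rows of a dot plot, top row first, one column per integer value.

    Assumes single-digit integers so that each column is one character wide.
    """
    counts = Counter(values)
    axis = range(min(values), max(values) + 1)
    height = max(counts.values())
    rows = []
    for level in range(height, 0, -1):
        # Put a dot in a column only if that value occurs at least `level` times.
        rows.append(" ".join("*" if counts[v] >= level else " " for v in axis))
    rows.append(" ".join(str(v) for v in axis))
    return rows


rows = dot_plot_rows(data)
print("\n".join(rows))
```

Values that never occur (6, 7, 8 here) still get a column, so gaps and isolated points such as 9 remain visible.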
Stem-and-Leaf Plots

To construct a stem-and-leaf display:
- Partition each observation into a stem and a leaf. Usually the stem consists of all the digits except the final one, which is the leaf.
- Order the stems from smallest to largest in a column. Ensure that all stems in the data range are included.
- Record the leaf for each observation in the row corresponding to its stem. The leaves should be ordered.

Example: The following data represent the prices of 19 different brands of walking shoes. Construct a stem-and-leaf plot to display the distribution of the data.

90 70 70 75 70 65 68 60 74 70 95 75 70 68 65 40 65 70

Histogram
- Divide the range of the data into intervals of equal width. For a discrete variable, use the actual values.
- Use the frequency table to construct the histogram.
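The stem-and-leaf steps above can be sketched in Python as follows. This is a minimal illustration that assumes two-digit whole-number prices (stem = tens digit, leaf = ones digit); the function name `stem_and_leaf` is made up for this example, and the data are the 18 prices actually listed above.

```python
from collections import defaultdict

prices = [90, 70, 70, 75, 70, 65, 68, 60, 74, 70, 95, 75, 70, 68, 65, 40, 65, 70]


def stem_and_leaf(values):
    """Map each stem (tens digit) to its ordered list of leaves (ones digits)."""
    leaves = defaultdict(list)
    for v in sorted(values):  # sorting first keeps the leaves ordered
        leaves[v // 10].append(v % 10)
    # Include every stem in the data range, even those with no leaves.
    return {stem: leaves.get(stem, []) for stem in range(min(leaves), max(leaves) + 1)}


plot = stem_and_leaf(prices)
for stem, row in plot.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in row)}")
```

The empty rows for stems 5 and 8 are kept on purpose: they show the gap between 40 and the rest of the prices.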
Basic Shapes: right skewed, symmetric, left skewed.

Example: The following data represent the number of quarts of milk purchased during a particular week. Construct a histogram to describe the distribution of the data.

0 3 5 4 3 2 1 3 1 2 1 1 2 0 1 4 3 2 2 2 2 2 2 3 4

Example: The following data represent the GPAs of 30 Bucknell University freshmen, recorded at the end of the freshman year. Construct a histogram to display the distribution of the data (choose width = 0.3 for each interval). Describe the shape of the distribution of the data.

2.0 3.1 1.9 2.5 1.9 2.3 2.6 3.1 2.5 2.1 2.9 3.0 2.7 2.5 2.4 2.7 2.5 2.4 3.0 3.4 2.6 2.8 2.5 2.7 2.9 2.7 2.8 2.2 2.7 2.1
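One possible way to tabulate the GPA data into intervals of width 0.3 is sketched below. The starting point 1.9 (the smallest GPA) and the left-closed intervals [a, a + 0.3) are assumptions made for this illustration; the text fixes only the width. The helper `histogram_counts` is a hypothetical name.

```python
gpas = [2.0, 3.1, 1.9, 2.5, 1.9, 2.3, 2.6, 3.1, 2.5, 2.1,
        2.9, 3.0, 2.7, 2.5, 2.4, 2.7, 2.5, 2.4, 3.0, 3.4,
        2.6, 2.8, 2.5, 2.7, 2.9, 2.7, 2.8, 2.2, 2.7, 2.1]


def histogram_counts(values, start_tenths, width_tenths):
    """Count values in left-closed intervals [a, a + width).

    Works in whole tenths to avoid floating-point round-off at interval edges.
    """
    tenths = [round(v * 10) for v in values]
    n_bins = (max(tenths) - start_tenths) // width_tenths + 1
    counts = [0] * n_bins
    for t in tenths:
        counts[(t - start_tenths) // width_tenths] += 1
    return counts


# Assumed starting point: 1.9 (19 tenths); interval width 0.3 (3 tenths).
counts = histogram_counts(gpas, start_tenths=19, width_tenths=3)
for i, c in enumerate(counts):
    low = (19 + 3 * i) / 10
    print(f"[{low:.1f}, {low + 0.3:.1f}): {'#' * c}")
```

A different starting point shifts the interval edges and can change the apparent shape, so it is worth stating the choice when describing the histogram.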
Example: Discussion (skewness, mode, unimodal, bimodal, spread, outliers)
Section 3: Measures of Center

Summation Notation

Given numbers \( x_1, x_2, x_3, \ldots, x_n \), we can express their sum \( x_1 + x_2 + x_3 + x_4 + \cdots + x_n \) as

\[ \sum_{i=1}^{n} x_i \]

Example:

Mean, Median, Mode

A measure of center is a one-number description of a distribution or data set. We focus on three.

Mean or average: the sum of the numbers divided by how many numbers there are:

\[ \text{Mean} = \frac{\sum_{i=1}^{n} x_i}{n} \]

Median or 50th percentile: a number that separates the lower 50% and the upper 50% of the numbers.

Mode: the number that occurs most frequently in the set. There can be more than one mode.

Example: In the 2002 Winter Olympics, figure skater Michelle Kwan competed in the ladies' singles short program. She received the following scores for technical merit:

5.8 5.7 5.9 5.7 5.5 5.7 5.7 5.7 5.6

Find the mean, median, and mode. Throw out the score of 5.5 and again find the mean, median, and mode.
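The three measures of center can be checked with Python's standard `statistics` module. This is only a sketch for verifying hand calculations on the scores above; `center_summary` is an illustrative name, not something from the text.

```python
from statistics import mean, median, multimode

scores = [5.8, 5.7, 5.9, 5.7, 5.5, 5.7, 5.7, 5.7, 5.6]


def center_summary(values):
    """Return (mean, median, list of modes) for a list of numbers."""
    return mean(values), median(values), multimode(values)


all_scores = center_summary(scores)
without_low = center_summary([s for s in scores if s != 5.5])

print("all scores: ", all_scores)
print("without 5.5:", without_low)
```

`multimode` returns a list because, as noted above, a data set can have more than one mode.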
Remark (properties): Resistant measures are not sensitive to extreme data values. The median is resistant; the mean is not.

Example: Compare the mean and median of the salaries $13,000, $32,000, $45,000 with the mean and median of the salaries $13,000, $32,000, $250,000.

Population Mean and Sample Mean

The population mean \( \mu \) (pronounced "myoo") of a population of size \( N \) is the average of all values \( x_1, x_2, \ldots, x_N \) in the population:

\[ \text{Population mean: } \mu = \frac{\sum_{i=1}^{N} x_i}{N} \]

The sample mean \( \bar{x} \) (pronounced "x-bar") of a sample of size \( n \) is the average of all values \( x_1, x_2, \ldots, x_n \) in the sample:

\[ \text{Sample mean: } \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]
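The salary comparison in the resistance example above can be run directly. A small sketch using the `statistics` module, with variable names chosen only for this illustration:

```python
from statistics import mean, median

typical = [13_000, 32_000, 45_000]
with_extreme = [13_000, 32_000, 250_000]

for label, salaries in [("typical", typical), ("with extreme value", with_extreme)]:
    print(f"{label}: mean = {mean(salaries):,.2f}, median = {median(salaries):,}")
```

Replacing $45,000 with $250,000 more than triples the mean but leaves the median at $32,000, which is what "resistant" means in practice.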
Example: The ages in years of all seven MATH 4270 students are

26 22 24 21 23 32 24

Find the population mean of the students' ages.

A random sample of size three was taken from the class. The random sample was 21 23 32. Find the sample mean of the three students' ages.

Example:
Section 4: Measures of Spread

Measures of spread summarize how far the data are spread out. We focus on the following measures:
- Standard deviation: used when the mean is the measure of center. It is the most important measure of spread.
- Range: the largest value minus the smallest value.
- Interquartile range: used when the median is the measure of center (discussed in Section 5).

Other Important Sums (leading up to measuring the spread)

Sum of distances from the mean: \( \sum_{i=1}^{n} (x_i - \bar{x}) \)

Sum of squared distances from the mean: \( \sum_{i=1}^{n} (x_i - \bar{x})^2 \)

Example: History exam scores for four students are 91 95 92 76. Fill in the following table.

x_i                  91    95    92    76
x_i - x-bar
(x_i - x-bar)^2

Population and Sample Standard Deviation

The standard deviation measures the variation in a data set by indicating how far, on average, each number is from the mean.

Population standard deviation \( \sigma \):

\[ \sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}} \]

Sample standard deviation \( s \):

\[ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} \]
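The table for the history exam scores, and the two standard deviation formulas, can be worked through in a few lines of Python. This is a sketch for checking hand calculations; `deviation_table` is an illustrative name. Both formulas are applied to the same four scores purely for comparison, since whether the scores count as a population or a sample depends on context.

```python
from math import sqrt

scores = [91, 95, 92, 76]


def deviation_table(values):
    """Return the mean, the deviations x_i - mean, and the squared deviations."""
    m = sum(values) / len(values)
    deviations = [x - m for x in values]
    squared = [d ** 2 for d in deviations]
    return m, deviations, squared


mean_score, deviations, squared = deviation_table(scores)
sum_sq = sum(squared)

# Divide by N for a population, by n - 1 for a sample.
population_sd = sqrt(sum_sq / len(scores))
sample_sd = sqrt(sum_sq / (len(scores) - 1))

print("mean:", mean_score)
print("deviations:", deviations, "sum =", sum(deviations))
print("squared deviations:", squared, "sum =", sum_sq)
print(f"population sd = {population_sd:.4f}, sample sd = {sample_sd:.4f}")
```

The deviations always sum to zero, which is why they are squared before averaging.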
Example: The ages in years for a sample of three MATH 4270 students are 21 23 32. Find the sample standard deviation and the range.

Remarks about Standard Deviation
- The more variation among the data in a sample, the larger the standard deviation.
- Like the mean, the standard deviation is not resistant: its value is affected by extreme data points.

Empirical Rule: For bell-shaped distributions,
- about 68.27% of all possible observations lie within one \( \sigma \) of \( \mu \).
- about 95.45% of all possible observations lie within two \( \sigma \)s of \( \mu \).
- about 99.73% of all possible observations lie within three \( \sigma \)s of \( \mu \).
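Both the ages example and the Empirical Rule percentages above can be checked numerically. The sketch below assumes that "bell-shaped" means an exactly normal distribution, for which the fraction within k standard deviations of the mean is erf(k / sqrt(2)); the function name `normal_within` is illustrative.

```python
from math import erf, sqrt
from statistics import stdev

ages = [21, 23, 32]
sample_sd = stdev(ages)            # sample standard deviation (divides by n - 1)
age_range = max(ages) - min(ages)  # largest value minus smallest value
print(f"sample sd = {sample_sd:.4f}, range = {age_range}")


def normal_within(k):
    """Fraction of an exactly normal distribution within k standard deviations of the mean."""
    return erf(k / sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} sd: {100 * normal_within(k):.2f}%")
```

Real data are only approximately bell-shaped, so observed percentages will usually differ somewhat from these values.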
Section 5: Five-Number Summary, Boxplots, z-scores

The 1st quartile \( Q_1 \) is the 25th percentile; it is the median of the lower half of the data. That is, roughly 25% of the data are lower than \( Q_1 \).

The 2nd quartile \( Q_2 \) is the 50th percentile; it is the median of the data. That is, roughly 50% of the data are lower than \( Q_2 \).

The 3rd quartile \( Q_3 \) is the 75th percentile; it is the median of the upper half of the data. That is, roughly 75% of the data are lower than \( Q_3 \).

Example: Eleven students report their exam scores as

11 12 10 18 19 15 20 15 20 17 8

Find the quartiles.

Example: Sixteen people reportedly watched the following numbers of hours of TV weekly:

8 22 34 16 13 26 19 23 25 31 34 30 31 20 22 41

Find the quartiles.

The interquartile range (IQR) is computed as \( \text{IQR} = Q_3 - Q_1 \) (this is our third measure of spread).

The IQR is not sensitive to extreme values and is therefore a resistant measure of spread. The IQR is used as the measure of spread when the median is the measure of center.

Example: Compute the range and IQR for the data in the previous example.
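The "median of each half" description of the quartiles can be sketched in Python. One detail is not stated above and is assumed here: when the number of observations is odd, the overall median is left out of both halves. Other conventions exist (calculators and software may give slightly different quartiles), and the function name `quartiles` is illustrative.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method.

    Assumption: for an odd number of observations the median itself
    is excluded from both the lower and the upper half.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half + 1:] if n % 2 else data[half:]
    return median(lower), median(data), median(upper)


exam_scores = [11, 12, 10, 18, 19, 15, 20, 15, 20, 17, 8]
tv_hours = [8, 22, 34, 16, 13, 26, 19, 23, 25, 31, 34, 30, 31, 20, 22, 41]

for name, data in [("exam scores", exam_scores), ("TV hours", tv_hours)]:
    q1, q2, q3 = quartiles(data)
    print(f"{name}: Q1 = {q1}, Q2 = {q2}, Q3 = {q3}, "
          f"IQR = {q3 - q1}, range = {max(data) - min(data)}")
```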
The five-number summary of a data set consists of the min, \( Q_1 \), \( Q_2 \), \( Q_3 \), and max.

Example: Compute the five-number summary for the following 100-meter race times (in seconds):

10.69 11.11 11.18 12.44 10.76 10.88 10.64 7.10 8.44 10.47

Outliers

Outlier(s): data value(s) that lie far from most of the data.

Lower limit: \( Q_1 - 1.5 \times \text{IQR} \) (sometimes called the lower inner fence)
Upper limit: \( Q_3 + 1.5 \times \text{IQR} \) (sometimes called the upper inner fence)

Data greater than the upper limit or less than the lower limit are potential outliers.

Examples: A human height of 9 feet. Miles-per-gallon ratings greater than 95.

Example: Use the lower limit and upper limit to identify any potential outliers in the previous example.

Boxplots

To create a boxplot, do the following:
- Determine the five-number summary.
- Compute the lower and upper limits (inner fences).
- Mark and label the quartiles with vertical lines and box them in.
- Indicate all potential outliers with * and label them.
- Mark and label the smallest and largest values occurring within the lower and upper limits with vertical lines, and connect the lines to the box (these are called adjacent values).
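The numerical steps behind a boxplot (five-number summary, inner fences, potential outliers, adjacent values) can be sketched as follows for the race times above. The quartiles use the median-of-halves method, with the overall median left out of both halves when the count is odd; that detail is an assumption, and the function names are illustrative.

```python
from statistics import median

times = [10.69, 11.11, 11.18, 12.44, 10.76, 10.88, 10.64, 7.10, 8.44, 10.47]


def five_number_summary(values):
    """Return (min, Q1, Q2, Q3, max) using the median-of-halves method."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    upper = data[half + 1:] if len(data) % 2 else data[half:]
    return data[0], median(lower), median(data), median(upper), data[-1]


def fences_and_outliers(values):
    """Return (lower limit, upper limit, potential outliers, adjacent values)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = sorted(v for v in values if v < low or v > high)
    inside = [v for v in values if low <= v <= high]
    return low, high, outliers, (min(inside), max(inside))


summary = five_number_summary(times)
low, high, outliers, adjacent = fences_and_outliers(times)
print("five-number summary:", summary)
print(f"limits: {low:.2f} to {high:.2f}")
print("potential outliers:", outliers)
print("adjacent values:", adjacent)
```

The whiskers of the boxplot end at the adjacent values, not at the minimum and maximum, whenever potential outliers are present.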
Example: The following data represent the ages of a group of people. Make a boxplot for the data and identify any outliers.

25 22 26 23 27 26 28 18 25 24 12

Advantages of boxplots:
- Graphically display the shape of the distribution of the data.
- Show the potential outliers in the data.
- Graphically display the spread of the data.

Example:
a) What is the shape of the distribution?
b) Approximate each component of the five-number summary, and interpret them.
z-scores (Standardized Data)

Data can be standardized so that different data sets can be compared, or so that values within the same data set can be compared.

Example: The average height of men is 69 inches with a standard deviation of 2.8 inches. The average height of women is 63.6 inches with a standard deviation of 2.5 inches. Michael Jordan is 78 inches tall. Rebecca Lobo is 76 inches tall. Relatively speaking, who is taller?

Jordan's and Lobo's heights should be standardized relative to those of their genders so that their heights can be compared.

If x is a variable, then

\[ z = \frac{x - \text{mean}}{\text{standard deviation}} \]

is the standardized version, or z-score, of x. Calculate the z-scores of Jordan's and Lobo's heights.

Facts About z-scores
- The mean of the z-scores of a population is always 0.
- The standard deviation of the z-scores of a population is always 1.
- Most z-scores will fall between -3 and 3. If the z-score of x is less than -3 or greater than 3, then x is considered an outlier.
- z-scores never have units.

Example: Body temperatures of healthy human children have mean = 98.60 °F and standard deviation = 0.62 °F. Your child has a temperature of 101 °F. What should you do?
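The z-score formula above is a one-line computation. The sketch below applies it to the heights and the body temperature from the examples; `z_score` is an illustrative function name.

```python
def z_score(x, mean, sd):
    """Standardize x: how many standard deviations x lies from the mean."""
    return (x - mean) / sd


jordan = z_score(78, mean=69, sd=2.8)      # compared with men's heights
lobo = z_score(76, mean=63.6, sd=2.5)      # compared with women's heights
child = z_score(101, mean=98.60, sd=0.62)  # compared with healthy children

print(f"Jordan: z = {jordan:.2f}")
print(f"Lobo:   z = {lobo:.2f}")
print(f"Child's temperature: z = {child:.2f}")
```

Because each z-score divides inches by inches (or degrees by degrees), the units cancel, which is why the two heights can be compared on the same scale.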
Section 6: Misleading Graphs

Examples 2.85 and 2.87

Calculator commands: To use your calculator to evaluate descriptive statistics (mean, standard deviation, quartiles, ...), do the following:
- STAT → Edit → Enter, then enter your data in any column (say L1).
- STAT → CALC → 1-Var Stats → Enter.
- Scroll down to see more output.

Example: Use your calculator to compute the mean, mode, standard deviation, and five-number summary for the following 100-meter race times (in seconds):

10.69 11.11 11.18 12.44 10.76 10.88 10.64 7.10 8.44 10.47