2-7 Exploratory Data Analysis (EDA)


CHAPTER 2 Describing, Exploring, and Comparing Data

This chapter presents the basic tools for describing, exploring, and comparing data, and the focus of this section is the exploration of data. We begin by defining exploratory data analysis, then we introduce outliers, 5-number summaries, and boxplots.

Definition
Exploratory data analysis is the process of using statistical tools (such as graphs, measures of center, and measures of variation) to investigate data sets in order to understand their important characteristics.

Recall that in Section 2-1 we listed five important characteristics of data, and we began with (1) center, (2) variation, and (3) the nature of the distribution. These characteristics can be investigated by calculating the values of the mean and standard deviation, and by constructing a histogram. It is generally important to further investigate the data set to identify any notable features, especially those that could strongly affect results and conclusions. One such feature is the presence of outliers.

Outliers

An outlier is a value that is located very far away from almost all of the other values. Relative to the other data, an outlier is an extreme value. When exploring a data set, outliers should be considered because they may reveal important information, and they may strongly affect the values of the mean and standard deviation and seriously distort a histogram. The following example uses an incorrect entry as an example of an outlier, but not all outliers are errors; some outliers are correct values.

EXAMPLE Cotinine Levels of Smokers
When using computer software or a calculator, it is often easy to make keying errors. Refer to the cotinine levels of smokers listed in Table 2-1 with the Chapter Problem and assume that the first entry of 1 is incorrectly entered as 11111 because you were distracted by a meteorite landing on your porch. The incorrect entry of 11111 is an outlier because it is located very far away from the other values. How does that outlier affect the mean, standard deviation, and histogram?

SOLUTION
When the entry of 1 is replaced by the outlier value of 11111, the mean changes from [...] to 450.2, so the effect of the outlier is very substantial. The incorrect entry of 11111 causes the standard deviation to change from [...] to [...], so the effect of the outlier here is also substantial. Figure 2-1 in Section 2-3 depicts the histogram for the correct values of cotinine levels of smokers in Table 2-1, but the STATDISK display presented here shows the histogram that results from using the same data with the value of 1 replaced by the incorrect value of 11111. Compare this STATDISK histogram to Figure 2-1 and you can easily see that the presence of the outlier dramatically affects the shape of the distribution.

[STATDISK display: histogram of the cotinine levels with the incorrect value of 11111 included]

The preceding example illustrates these important principles:
1. An outlier can have a dramatic effect on the mean.
2. An outlier can have a dramatic effect on the standard deviation.
3. An outlier can have a dramatic effect on the scale of the histogram so that the true nature of the distribution is totally obscured.

An easy procedure for finding outliers is to examine a sorted list of the data. In particular, look at the minimum and maximum sample values and determine whether they are very far away from the other typical values. Some outliers are correct values and some are errors, as in the preceding example. If we are sure that an outlier is an error, we should correct it or delete it. If we include an outlier because we know that it is correct, we might study its effects by constructing graphs and calculating statistics with and without the outlier included.
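The effect of a single keying error on the mean and standard deviation can be tried out with a few lines of Python. This is only a minimal sketch: the ten sample values are hypothetical (they are not the Table 2-1 cotinine data), and the function name `summarize` is introduced just for this illustration. As in the example, one entry of 1 is replaced by 11111.

```python
from statistics import mean, stdev


def summarize(values):
    """Return (mean, sample standard deviation) for a list of numbers."""
    return mean(values), stdev(values)


# Hypothetical sample values, invented for illustration only.
correct = [1, 35, 87, 112, 130, 173, 198, 210, 250, 289]

# Same data, but the first entry of 1 is mistyped as 11111.
with_error = [11111] + correct[1:]

mean_ok, sd_ok = summarize(correct)
mean_bad, sd_bad = summarize(with_error)

print(f"correct data: mean = {mean_ok:.1f}, s = {sd_ok:.1f}")
print(f"with outlier: mean = {mean_bad:.1f}, s = {sd_bad:.1f}")

# Sorting puts the suspicious value at the end of the list,
# where it is easy to spot next to the other large values.
print("three largest values:", sorted(with_error)[-3:])
```

Running the sketch with and without the mistyped value mirrors the advice above: compute the statistics both ways and compare them.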
An Outlier Tip
Outliers are important to consider because, in many cases, one extreme value can have a dramatic effect on statistics and conclusions derived from them. In some cases an outlier is a mistake that should be corrected or deleted. In other cases, an outlier is a valid data value that should be investigated for any important information. Students of the author collected data consisting of restaurant bills and tips, and no notable outliers were found among their sample data. However, one such outlier is the tip of $16,000 that was left for a restaurant bill of $8,[...]. The tip was left by an unidentified London executive to waiter Lenny Lorando at Nello's restaurant in New York City. Lorando said that he had waited on the customer before and "He's always generous, but never anything like that before. I have to tell my sister about him."

Good Advice for Journalists
Columnist Max Frankel wrote in the New York Times that "most schools of journalism give statistics short shrift and some let students graduate without any numbers training at all. How can such reporters write sensibly about trade and welfare and crime, or air fares, health care and nutrition? The media's sloppy use of numbers about the incidence of accidents or disease frightens people and leaves them vulnerable to journalistic hype, political demagoguery, and commercial fraud." He cites several cases, including an example of a full-page article about New York City's deficit with a promise by the mayor of New York City to close a budget gap of $2.7 billion; the entire article never once mentioned the total size of the budget, so the $2.7 billion figure had no context.

Boxplots

In addition to the graphs presented in Section 2-3, a boxplot is another graph that is used often. Boxplots are useful for revealing the center of the data, the spread of the data, the distribution of the data, and the presence of outliers. The construction of a boxplot requires that we first obtain the minimum value, the maximum value, and quartiles, as defined in the 5-number summary.

Definitions
For a set of data, the 5-number summary consists of the minimum value; the first quartile, Q1; the median (or second quartile, Q2); the third quartile, Q3; and the maximum value.
A boxplot (or box-and-whisker diagram) is a graph of a data set that consists of a line extending from the minimum value to the maximum value, and a box with lines drawn at the first quartile, Q1; the median; and the third quartile, Q3. (See Figure 2-16.)

Procedure for Constructing a Boxplot
1. Find the 5-number summary consisting of the minimum value, Q1, the median, Q3, and the maximum value.
2. Construct a scale with values that include the minimum and maximum data values.
3. Construct a box (rectangle) extending from Q1 to Q3, and draw a line in the box at the median value.
4. Draw lines extending outward from the box to the minimum and maximum data values.

Boxplots don't show as much detailed information as histograms or stem-and-leaf plots, so they might not be the best choice when dealing with a single data set. They are often great for comparing two or more data sets. When using two or more boxplots for comparing different data sets, it is important to use the same scale so that correct comparisons can be made.

EXAMPLE Cotinine Levels of Smokers
Refer to the 40 cotinine levels of smokers in Table 2-1 (without the error of 11111 used in place of 1, as in the preceding example).
a. Find the values constituting the 5-number summary.
b. Construct a boxplot.

SOLUTION
a. The 5-number summary consists of the minimum, Q1, median, Q3, and maximum. To find those values, first sort the data (by arranging them in order from lowest to highest). The minimum of 0 and the maximum of 491 are easy to identify from the sorted list. Now proceed to find the quartiles. Using the flowchart of Figure 2-15, we get $Q_1 = P_{25} = 86.5$, which is located by calculating the locator $L = (25/100) \cdot 40 = 10$ and finding the value midway between the 10th value and the 11th value in the sorted list. The median is 170, which is the value midway between the 20th and 21st values. We also find that $Q_3 = 251.5$ by using Figure 2-15 for the 75th percentile. The 5-number summary is therefore 0, 86.5, 170, 251.5, and 491.
b. In Figure 2-16 we graph the boxplot for the data. We use the minimum (0) and the maximum (491) to determine a scale of values, then we plot the values from the 5-number summary as shown.

FIGURE 2-16 Boxplot of Cotinine Levels of Smokers, with the minimum, Q1, median, Q3, and maximum labeled

In Figure 2-17 we show some generic boxplots along with common distribution shapes. It appears that the cotinine levels of smokers have a skewed distribution.

FIGURE 2-17 Boxplots Corresponding to Bell-Shaped, Uniform, and Skewed Distributions

To illustrate the use of boxplots to compare data sets, see the accompanying Minitab display of cholesterol levels for a sample of males and a sample of females, based on the National Health Examination data included in Data Set 1 of Appendix B. Based on the sample data, it appears that males have cholesterol levels that are generally higher than those of females, and the cholesterol levels of males appear to vary more than those of females.
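Returning to the 5-number summary found in the cotinine example, the locator calculation can be sketched in Python. This is a minimal sketch under stated assumptions, not a reproduction of Figure 2-15 (which does not appear in this section). The whole-number case follows the example: when L is a whole number, the percentile is midway between the Lth and (L+1)th sorted values. The other case, rounding L up to the next whole number and taking that value, is an assumption about the flowchart. The function names `percentile` and `five_number_summary` are hypothetical, and the sample values are invented rather than taken from Table 2-1.

```python
def percentile(sorted_values, k):
    """kth percentile using the locator L = (k / 100) * n.

    Assumes 0 < k < 100 and that sorted_values is already sorted.
    """
    n = len(sorted_values)
    # Integer arithmetic avoids floating-point trouble when checking
    # whether L is a whole number.
    whole, remainder = divmod(k * n, 100)
    if remainder == 0:
        # L is whole: midway between the Lth and (L+1)th values
        # (1-based positions, so indexes whole - 1 and whole).
        return (sorted_values[whole - 1] + sorted_values[whole]) / 2
    # Assumed rule: otherwise round L up and take that value
    # (1-based position whole + 1, so index whole).
    return sorted_values[whole]


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    s = sorted(values)
    return (s[0], percentile(s, 25), percentile(s, 50), percentile(s, 75), s[-1])


# Hypothetical data, invented for illustration only.
sample = [20, 2, 13, 4, 10, 5, 8, 7]
summary = five_number_summary(sample)
print("5-number summary:", summary)
```

Other software may use different quartile conventions, so results from this sketch can differ slightly from those produced by a calculator or statistics package.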

Best Colleges
Each year, U.S. News and World Report publishes an issue with a list of "America's Best Colleges and Universities." Sales typically jump 40% for that issue. The list has critics who argue against the criteria and method of collecting data. Common complaints: Too much emphasis is placed on the criteria of a college's wealth, reputation, College Board scores, alumni donations, and the opinions of college presidents; too little emphasis is placed on the satisfaction of students and effective educational practices. The New York Times interviewed Kenneth Auchincloss, who is editor of How to Get Into College (by Kaplan/Newsweek), and he said that "We have never been comfortable trying to quantify in numeric terms the various criteria that go into making a college good or less good, and we don't want to devote the resources to doing an elaborate statistical analysis that frankly we don't think is valid."

EXAMPLE Does It Rain More on Weekends?
Refer to Data Set 11 in Appendix B, which lists rainfall amounts (in inches) in Boston for every day of a recent year. The collection of this data set was inspired by media reports that it rains more on weekends (Saturday and Sunday) than on weekdays. Later in this book we will describe important statistical methods that can be used to formally test that claim, but for now, let's explore the data set to see what can be learned. (Even if we already know how to apply those formal statistical methods, we should first explore the data before proceeding with the formal analysis.)

SOLUTION
Let's begin with an investigation into the key elements of center, variation, distribution, outliers, and characteristics over time (the same "CVDOT" list introduced in Section 2-1). Listed below are measures of center (mean), measures of variation (standard deviation), and the 5-number summary for the rainfall amounts for each day of the week. The accompanying STATDISK display shows boxplots for each of the seven days of the week, starting with Monday at the top. Because the histograms for all seven days are pretty much the same, we show only the histogram for the Monday rainfall amounts.

[Table: Mean, Standard Deviation, Minimum, Q1, Median, Q3, and Maximum of the rainfall amounts for each day, Monday through Sunday]

[STATDISK displays: boxplots for the seven days of the week; histogram of the Monday rainfall amounts]

INTERPRETATION
Examining and comparing the statistics and graphs, we make the following important observations.

Means: The means vary from a low of [...] in. to a high of [...] in. The seven means vary by considerable amounts, and in later chapters of this book we will present methods for determining whether these differences are significant. (Later methods will show that the means do not differ by significant amounts.) If we list the means in order from low to high, we get this sequence of days: Wednesday, Tuesday, Sunday, Thursday, Friday, Monday, Saturday. There does not appear to be a pattern of higher rainfall on weekends (although the highest mean corresponds to Saturday). Also, see the Excel graph of the seven means, with the mean for Monday plotted first. The Excel graph does not support the claim of more rainfall on weekends (although it might be argued that there is more rainfall on Saturdays).

[Excel display: graph of the seven means, with the mean for Monday plotted first]

Variation: The seven standard deviations vary from [...] in. to [...] in., but those values are not dramatically different. There does not appear to be anything highly unusual about the amounts of variation. The minimums, first quartiles, and medians are all 0.00 for each of the seven days. This is explained by the fact that for each day of the week, there are many days with no rain. The abundance of zeros is also seen in the boxplots and histograms, which show that the data have distributions that are heavy toward the low end (skewed right).

Outliers: There are no outliers or unusual values. At the low end, there are many rainfall amounts of zero. At the high end, the sorted list of all 365 rainfall amounts ends with the high values of 0.92, 0.96, 1.28, 1.41, and [...].

Distributions: The distributions of the rainfall amounts are skewed to the right. They are not bell-shaped, as we might have expected. If the use of a particular method of statistics requires normally distributed (bell-shaped) populations, that requirement is not satisfied for the rainfall amounts.

We now have considerable insight into the nature of the Boston rainfall amounts for different days of the week. Based on our exploration, we can conclude that Boston does not experience more rain on weekends than on the other days of the week (although we might argue that there is more rainfall on Saturdays).
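The kind of day-by-day comparison carried out in this example can be sketched in Python. The sketch below is only an illustration under stated assumptions: the (day, rainfall) records are invented and are not the Boston values from Data Set 11, only three days are included to keep it short, and the function name `summarize_by_group` is hypothetical.

```python
from collections import defaultdict
from statistics import mean, median, stdev

# Hypothetical (day, rainfall in inches) records, invented for illustration.
records = [
    ("Mon", 0.00), ("Mon", 0.12), ("Mon", 0.00),
    ("Sat", 0.00), ("Sat", 0.40), ("Sat", 0.05),
    ("Sun", 0.00), ("Sun", 0.00), ("Sun", 0.21),
]


def summarize_by_group(pairs):
    """Group amounts by label and compute basic statistics for each group."""
    groups = defaultdict(list)
    for label, amount in pairs:
        groups[label].append(amount)
    return {
        label: {
            "n": len(amounts),
            "mean": mean(amounts),
            "sd": stdev(amounts),
            "min": min(amounts),
            "median": median(amounts),
            "max": max(amounts),
        }
        for label, amounts in groups.items()
    }


summary = summarize_by_group(records)

# List the days in order of their means, from low to high.
ordered = sorted(summary, key=lambda day: summary[day]["mean"])
for day in ordered:
    stats = summary[day]
    print(f"{day}: mean = {stats['mean']:.2f}, s = {stats['sd']:.2f}, "
          f"median = {stats['median']:.2f}, max = {stats['max']:.2f}")
```

Ordering the groups by their means is the same step used above to list the days from lowest to highest mean rainfall. Such a listing is exploratory only and does not replace the formal tests described in later chapters.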

Critical Thinking

Armed with a list of tools for investigating center, variation, distribution, outliers, and characteristics of data over time, we might be tempted to develop a rote and mindless procedure, but critical thinking is critically important. In addition to using the tools presented in this chapter, we should consider any other relevant factors that might be crucial to the conclusions we form. We might pose questions such as these: Is the sample likely to be representative of the population, or is the sample somehow biased? What is the source of the data, and might the source be someone with an interest that could affect the quality of the data? Suppose, for example, that we want to estimate the mean income of college students. Also suppose that we mail questionnaires to 500 students and receive 20 responses. We could calculate the mean and standard deviation, construct graphs, identify outliers, and so on, but the results will be what statisticians refer to as hogwash. The sample is a voluntary response sample, and it is not likely to be representative of the population of all college students. In addition to the specific statistical tools presented in this chapter, we should also think!

Using Technology

This section introduced outliers, 5-number summaries, and boxplots. To find outliers, sort the data in order from lowest to highest, then examine the highest and lowest values to determine whether they are far away from the other sample values. STATDISK, Minitab, Excel, and the TI-83 Plus calculator can provide values of quartiles, so the 5-number summary is easy to find. STATDISK, Minitab, Excel, and the TI-83 Plus calculator can be used to create boxplots, and we now describe the different procedures. (Caution: Remember that quartile values calculated by Minitab and the TI-83 Plus calculator may differ slightly from those calculated by applying Figure 2-15, so the boxplots may differ slightly as well.)

STATDISK
Choose the main menu item of Data and use the Sample Editor to enter the data, then click on COPY. Now select Data, then Boxplot and click on PASTE, then Evaluate.

Minitab
Enter the data in column C1, then select Graph, then Boxplot. Enter C1 in the first cell under the Y column, then click OK.

Excel
Although Excel is not designed to generate boxplots, they can be generated using the Data Desk XL add-in that is a supplement to this book. First enter the data in column A. Click on DDXL and select Charts and Plots. Under Function Type, select the option of Boxplot. In the dialog box, click on the pencil icon and enter the range of data, such as A1:A40 if you have 40 values listed in column A. Click on OK. The result is a modified boxplot as described in Exercise 13. The values of the 5-number summary are also displayed.

TI-83 Plus
Enter the sample data in list L1. Now select STAT PLOT by pressing the 2nd key followed by the key labeled Y=. Press the ENTER key, then select the option of ON, and select the boxplot type that is positioned in the middle of the second row. The Xlist should indicate L1 and the Freq value should be 1. Now press the ZOOM key and select option 9 for ZoomStat. Press the ENTER key and the boxplot should be displayed. You can use the arrow keys to move right or left so that values can be read from the horizontal scale.

2-7 Basic Skills and Concepts

1. Lottery Refer to Data Set 26 and use only the 40 digits in the first column of the Win 4 results from the New York State Lottery (9, 7, 0, and so on). Find the 5-number summary and construct a boxplot. What characteristic of the boxplot suggests that the digits are selected with a random and fair procedure?

2. Movie Budgets Refer to Data Set 21 in Appendix B for the budget amounts of the 15 movies that are R-rated. Find the 5-number summary and construct a boxplot. Determine whether the sample values are likely to be representative of movies made this year.

3. Cereal Calories Refer to Data Set 16 in Appendix B for the 16 values consisting of the calories per gram of cereal. Find the 5-number summary and construct a boxplot. Determine whether the sample values are likely to be representative of the cereals consumed by the general population.

4. Nicotine in Cigarettes Refer to Data Set 5 for the 29 amounts of nicotine (in mg per cigarette). Find the 5-number summary and construct a boxplot. Are the sample values likely to be representative of cigarettes smoked by an individual consumer?

5. Red M&Ms Refer to Data Set 19 for the 21 weights (in grams) of the red M&M candies. Find the 5-number summary and construct a boxplot. Are the red sample values likely to be representative of M&M candies of all colors?

T 6. Bear Lengths Refer to Data Set 9 for the lengths (in inches) of the 54 bears that were anesthetized and measured. Find the 5-number summary and construct a boxplot. Does the distribution of the lengths appear to be symmetric or does it appear to be skewed?

T 7. Alcohol in Children's Movies Refer to Data Set 7 for the 50 times (in seconds) of scenes showing alcohol use in animated children's movies. Find the 5-number summary and construct a boxplot. Based on the boxplot, does the distribution appear to be symmetric or is it skewed?

T 8. Body Temperatures Refer to Data Set 4 in Appendix B for the 106 body temperatures for 12 A.M. on day 2. Find the 5-number summary and construct a boxplot, then determine whether the sample values support the common belief that the mean body temperature is 98.6°F.

In Exercises 9-12, find 5-number summaries, construct boxplots, and compare the data sets.

9. Academy Awards In "Ages of Oscar-Winning Best Actors and Actresses" (Mathematics Teacher magazine) by Richard Brown and Gretchen Davis, the authors compare the ages of actors and actresses at the time they won Oscars. The results for winners from both categories are listed in the following table. Use boxplots to compare the two data sets.
Actors: [...]
Actresses: [...]

T 10. Regular/Diet Coke Refer to Data Set 17 in Appendix B and use the weights of regular Coke and the weights of diet Coke. Does there appear to be a significant difference? If so, can you provide an explanation?

T 11. Cotinine Levels Refer to Table 2-1 located in the Chapter Problem. We have already found that the 5-number summary for the cotinine levels of smokers is 0, 86.5, 170, 251.5, and 491. Find the 5-number summaries for the other two groups, then construct the three boxplots using the same scale. Are there any apparent differences?

T 12. Clancy, Rowling, Tolstoy Refer to Data Set 14 in Appendix B and use the Flesch reading ease scores for the sample pages from Tom Clancy's The Bear and the Dragon, J. K. Rowling's Harry Potter and the Sorcerer's Stone, and Leo Tolstoy's War and Peace. (Higher scores indicate easier reading.) Does there appear to be a difference in ease of reading? Are the results consistent with your expectations?

2-7 Beyond the Basics

13. The boxplots discussed in this section are often called skeletal (or regular) boxplots. Modified boxplots are constructed as follows:
a. Find the IQR, which denotes the interquartile range defined by $\mathrm{IQR} = Q_3 - Q_1$.
b. Draw the box with the median and quartiles as usual, but when drawing the lines to the right and left of the box, draw the lines only as far as the points corresponding to the largest and smallest values that are within 1.5 × IQR of the box.
c. Mild outliers, plotted as solid dots, are values below Q1 or above Q3 by an amount that is greater than 1.5 × IQR but not greater than 3 × IQR. That is, mild outliers are values x such that
$Q_1 - 3 \times \mathrm{IQR} \le x < Q_1 - 1.5 \times \mathrm{IQR}$ or $Q_3 + 1.5 \times \mathrm{IQR} < x \le Q_3 + 3 \times \mathrm{IQR}$
d. Extreme outliers, plotted as small hollow circles, are values that are either below Q1 by more than 3 × IQR or above Q3 by more than 3 × IQR. That is, extreme outliers are values x such that
$x < Q_1 - 3 \times \mathrm{IQR}$ or $x > Q_3 + 3 \times \mathrm{IQR}$
The accompanying figure is an example of a modified boxplot. Refer to the cotinine levels of smokers in Table 2-1 included with the Chapter Problem. We have found that this data set has a 5-number summary of 0, 86.5, 170, 251.5, and 491. Identify the value of IQR, identify the ranges of values used to identify mild and extreme outliers, then identify any actual mild outliers or extreme outliers.

[Figure: modified boxplot marking Q1, Q2, and Q3, with the regions for mild outliers (between 1.5 × IQR and 3 × IQR beyond the box) and extreme outliers (more than 3 × IQR beyond the box)]
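The rules in parts (a) through (d) of Exercise 13 can be expressed as a short Python function. This is one possible reading of those rules, offered as a sketch: the function name `classify_outliers`, the returned dictionary layout, and the sample values are all hypothetical, and the quartiles are passed in as arguments so that any method of finding them can be used. The data are invented so that the exercise itself is left to the reader.

```python
def classify_outliers(values, q1, q3):
    """Apply the modified-boxplot rules to a list of values."""
    iqr = q3 - q1
    mild_low, mild_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    extreme_low, extreme_high = q1 - 3 * iqr, q3 + 3 * iqr

    mild, extreme = [], []
    for x in values:
        if x < extreme_low or x > extreme_high:
            # More than 3 * IQR beyond the box.
            extreme.append(x)
        elif x < mild_low or x > mild_high:
            # More than 1.5 * IQR but not more than 3 * IQR beyond the box.
            mild.append(x)

    # Whiskers reach only the smallest and largest values that lie
    # within 1.5 * IQR of the box.
    inside = [x for x in values if mild_low <= x <= mild_high]
    whiskers = (min(inside), max(inside))

    return {"iqr": iqr, "mild": mild, "extreme": extreme, "whiskers": whiskers}


# Hypothetical values and quartiles, invented for illustration only.
data = [-30, -8, 0, 12, 15, 18, 33, 36, 50, 51]
result = classify_outliers(data, q1=10, q3=20)
print(result)
```

A value that lies exactly 3 × IQR beyond the box is treated as a mild outlier here, which matches the wording "not greater than 3 IQR" in part (c).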

14. Refer to the accompanying STATDISK display of three boxplots that represent the longevity (in months) of samples of three different car batteries. If you are the manager of a fleet of cars and you must select one of the three brands, which boxplot represents the brand you should choose? Why?

[STATDISK display: three boxplots of car battery longevity]

Review

In this chapter we considered methods for describing, exploring, and comparing data sets. When investigating a data set, these characteristics are generally very important:
1. Center: A representative or average value.
2. Variation: A measure of the amount that the values vary.
3. Distribution: The nature or shape of the distribution of the data (such as bell-shaped, uniform, or skewed).
4. Outliers: Sample values that lie very far away from the vast majority of the other sample values.
5. Time: Changing characteristics of the data over time.

After completing this chapter you should be able to do the following:
• Summarize data by constructing a frequency distribution or relative frequency distribution (Section 2-2)
• Visually display the nature of the distribution by constructing a histogram, dotplot, stem-and-leaf plot, pie chart, or Pareto chart (Section 2-3)
• Calculate measures of center by finding the mean, median, mode, and midrange (Section 2-4)
• Calculate measures of variation by finding the standard deviation, variance, and range (Section 2-5)
• Compare individual values by using z scores, quartiles, or percentiles (Section 2-6)
• Investigate and explore the spread of data, the center of the data, and the range of values by constructing a boxplot (Section 2-7)