AMS 7L LAB #2 Spring, 2009
Exploratory Data Analysis

Name:
Lab Section:

Instructions: The TAs/lab assistants are available to help you if you have any questions about this lab exercise. If you have any questions, please raise your hand and they will get to you as quickly as possible.

At the end of class, you will need to turn in this cover sheet to your lab instructor. If you do not turn it in, you will not get credit for this lab. Be sure to write your name and section above.

The following symbol at the beginning of a question means that after you answer that question you should raise your hand and have a TA or lab assistant review your answers up to that point. Once they have reviewed your work, they will initial in the appropriate space in the table below. The purpose of this check is to be sure you have answered the questions correctly.

Be sure to take the rest of the lab handout with you when you leave. It contains your answers and JMP instructions, which you may find useful for doing homework assignments.

Check-Problem    Lab Instructor's Initials
10
21
32
Objectives:
1. To practice exploratory data analysis techniques
2. To learn to read datafiles into JMP

Getting Started: Log onto your machine using your ITS login. Before starting JMP, you need to download two datafiles: butterfly.jmp and cereal.txt. First, open a web browser (such as Firefox) and go to the course webpage: http://www.soe.ucsc.edu/classes/ams007/spring09/ Then click on the link for Datasets.

Once on the datasets webpage, download butterfly.jmp by clicking on the butterfly link, choosing Save to Disk, and hitting OK. Note that this file is already in JMP format. Next, download cereal.txt by clicking on the cereal link, which will bring up the datafile in your browser (since this is just a text file). Go up to the File menu in the upper left, choose Save Page As, and save the file to your Desktop (or anywhere else you'd prefer) by clicking on Save.

Part I. Butterfly Data

Start JMP. Open the butterfly data by choosing Open Data Table from the JMP Starter window, clicking on butterfly.jmp, and then choosing Open. JMP will open a data window with the butterfly data. Because this data file is already a JMP file, you will see that the column has already been labeled with the title Wing-Length. This file consists of the measurements of wing lengths of 24 butterflies. Take a look at the data values.

Question #1 Wing length (in cm) is a quantitative variable. Is it continuous or discrete?

Most of the exploratory data analysis techniques can be accessed from the Basic menu of the JMP Starter window. Click on Basic and then choose Distribution. If Wing-Length is not already highlighted on the left (under Select Columns), click on it, then click on Y, Columns in the middle, and click on OK. You should get a histogram, but by default it is sideways from the usual form. Click on the red triangle hot spot button just to the left of Wing-Length. The drop-down menu has a lot of options.
Click on Histogram Options and look at the menu that appears to the right. You can see that JMP allows you to display the histogram vertically or horizontally.

Now let's go through the four primary parts of exploratory data analysis: center, variation, shape, and outliers (we don't have an element of time, so we won't worry about changes over time).

Question #2 What is the mean of the dataset?
Question #3 What is the median of the dataset?

Question #4 What is the mode of the dataset? (You may want to go back to the original data table, or you can make a stem and leaf diagram by going to the red hot spot by Wing-Length and choosing Stem and Leaf; to get back to the histogram, uncheck Stem and Leaf from the same menu.)

Question #5 What is the standard deviation of the dataset?

Question #6 Looking at the histogram, is this dataset symmetric or skewed?

Question #7 Looking at the boxplot, is this dataset symmetric or skewed? (The boxplot has a few extra things on it that we won't worry about for now. You should look primarily at the quartiles, and secondarily at the whiskers.)

Question #8 With a symmetric dataset, the mean is usually very close to the median. Is that the case here? Does this agree or disagree with your answers to the two previous questions?

JMP uses certain rules of thumb to decide how wide the bins are in the histogram. Usually JMP will produce a good histogram, but sometimes the picture will change as the size of the bins changes. To see this in action, click on the hand icon in the toolbar near the top (this is called the Grabber tool). Note that the default plot has a lot of bars in it. Now click and hold on the histogram, and slowly drag the cursor directly downwards (i.e., move the mouse toward you) while still holding the mouse button down. You should see the histogram change as the bins are made wider and wider until there are only three bins left (you can go even further, but those plots really don't make much sense). Still holding the mouse button down, drag the cursor upwards to bring back more bins. Be careful not to drag the cursor to the left or right, as that will change the placement of the particular bin that you have clicked on, rather than the full set of bins.
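For anyone who wants to see the bin-width effect outside of JMP, here is a minimal Python sketch. It is only an illustration: the values are hypothetical toy numbers (not the butterfly data), and bin_counts is a made-up helper, not anything JMP provides.

```python
# Hypothetical toy measurements in millimeters; NOT the butterfly wing lengths.
values = [31, 34, 36, 37, 39, 40, 40, 41, 43, 46, 48, 52]


def bin_counts(data, start, width):
    """Count the values in each bin [start + k*width, start + (k+1)*width)."""
    n_bins = (max(data) - start) // width + 1
    counts = [0] * n_bins
    for x in data:
        counts[(x - start) // width] += 1
    return counts


# Same data, two bin widths: the narrow bins show more detail,
# the wide bins smooth it away.
narrow = bin_counts(values, start=30, width=5)
wide = bin_counts(values, start=30, width=10)

for label, counts in (("width 5", narrow), ("width 10", wide)):
    print(label)
    for k, c in enumerate(counts):
        print(f"  bin {k}: " + "*" * c)
```

Both histograms describe the same twelve numbers, so any difference in apparent shape comes from the binning alone.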
Question #9 Does the choice of histogram bin width have an effect on the histogram? After taking into account everything you've done so far, do you think the data are symmetric or skewed?

Question #10 Finally, we should check for potential outliers. Do there appear to be any outliers? (You may want to go back to the data table.)

JMP will automatically flag possible outliers on the boxplot by marking those observations with points separate from the whiskers, so that the whiskers don't go all the way out to those points. In this case, there aren't any points flagged by JMP.

OK, we're done with the butterfly data for today. You can close the distribution analysis window (click on the X in the upper right corner of that window) and you can also close the data window.

Part II. Cereal Data

Next we'll learn how to read in a plain text file. Go back to the JMP Starter window and click on File. Click on the button for Open Data Table. A window will pop up, but you probably won't see the cereal.txt file yet. One of the last items on the bottom left says Files of type:. On the right end of that box (which should currently say Data Files with a bunch of possible file extensions) there is a black triangle. Click on the black triangle and select Text Import Files. Now you should see the cereal.txt file in the large box. Click on that file, then click on the Open button.

JMP will read in the file, and it will try to put all the labels in the right place and even guess the types of the variables. Scroll around and see that there are 77 different cereals in this dataset, and 15 different variables measured for each cereal. The first column is the name of the cereal. This is just a label. The second column is a code for the manufacturer (A = American Home Food Products; G = General Mills; K = Kelloggs; N = Nabisco; P = Post; Q = Quaker Oats; R = Ralston Purina). The type is cold or hot, and you'll see that almost all of the cereals in this sample are cold.
Question #11 Is manufacturer a nominal or ordinal variable? In the box labeled Columns on the left side of the window, right-click on the little image to the left of mfr. Did JMP correctly guess the type of variable?
Question #12 What type of variable is the number of calories per serving (the calories column)? Did JMP correctly guess the type of variable?

Let's do some exploratory data analysis. We'll start with the manufacturer. Go back to the JMP Starter window and click on Basic and then choose Distribution. This time, we have a long list of possible columns, so it is important that we specify which one we want. Click on mfr, then click on Y, Columns, and then on OK. JMP gives you a bar chart. JMP also gives you a count by category below.

In addition, we can make a Pareto chart. Go back to the JMP Starter window and click on Measure, which brings up a different set of options. Choose Pareto Plot, which brings up a new dialog box. Click on the variable mfr, then click on Y, Cause and OK to bring up the Pareto chart. Use information from both the Pareto chart and the histogram and relative frequency table to answer the following.

Question #13 Which manufacturers have the largest representation in this sample?

Question #14 What percent of the cereals in the sample were manufactured by Quaker?

There's not that much we can do with a single categorical variable, but we can also make two-way tables of two categorical variables. Back in the JMP Starter window, go back to Basic, look near the bottom of the options, and click on Contingency. Click on type and then click on Y, Response Category. Next click on mfr and then click on X, Grouping Category, then hit OK. You will get a funky mosaic plot that we're going to skip, and you can make that plot go away by clicking on the grey and blue diamond to the left of Mosaic Plot. Now you should see the Contingency Table. The table has a lot more information than we need right now, so let's hide some of that information. Click on the hot spot to the left of Contingency Table and notice that four items are checked: Count, Total %, Col %, and Row %.
Right now, we really only need the counts, not any of the percentages, so uncheck all three of the percentages so that only the counts remain. Now you should have a much more compact table that just gives the counts by manufacturer and by type (cold vs. hot). The rows show the manufacturers, and the columns show the types. The bottom row is the total by type (summed over manufacturer) and the rightmost column gives the total by manufacturer. This rightmost column should match the previous table we had. In the bottom row, notice that there are only three hot cereals in this sample.

Question #15 Which three manufacturers made the hot cereals in this sample?

You can now close all the contingency table analysis windows. Back in the JMP Starter window, let's look at some of the continuous variables. Still in Basic, choose Distribution. This time, we'll look at three variables. Click on calories, then hold down the Control key on the keyboard (often labeled Ctrl) and click on fiber, and also click on carbo (then let go of the Control key). Then click on Y, Columns and OK. You should get three histograms.
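As a side note on the two-way table of counts from the Contingency step, the same kind of table can be tallied by hand. The Python sketch below is only an illustration under stated assumptions: the (manufacturer, type) pairs are hypothetical and are not rows from cereal.txt.

```python
from collections import Counter

# Hypothetical (manufacturer code, type) pairs; NOT rows from cereal.txt.
rows = [("G", "cold"), ("K", "cold"), ("Q", "hot"), ("G", "cold"),
        ("N", "hot"), ("K", "cold"), ("Q", "cold")]

cell_counts = Counter(rows)                       # count for each (mfr, type) cell
row_totals = Counter(mfr for mfr, _ in rows)      # rightmost column: total by manufacturer
col_totals = Counter(kind for _, kind in rows)    # bottom row: total by type

for mfr in sorted(row_totals):
    cells = {kind: cell_counts[(mfr, kind)] for kind in sorted(col_totals)}
    print(mfr, cells, "total:", row_totals[mfr])
print("totals by type:", dict(col_totals))
```

The row totals here play the role of the rightmost column in the JMP table, which is why that column matches the one-variable count by manufacturer.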
Question #16 Is the distribution of calories symmetric or skewed?

Question #17 What are the quartiles of the distribution of calories?

Since the median and the third quartile are the same, it may look like a line is missing on the boxplot, but it's just that the lines for the median and the third quartile are on top of each other, so you only see one box without a middle line.

Back to the histogram for calories: the default bin width produces a lot of bins (basically each value is in a separate bin). Select the Grabber tool from the toolbar and slowly drag the cursor directly downwards. As the bins get wider, with the first re-drawing of the histogram, you should notice that the data now appear to fall into five separate groups. If you continue to widen the bins, the data will then appear in one continuous span, with some peak in the center. However, the peak will appear to move around as the bin width changes.

Question #18 Do some of the different possible bin widths give different impressions about the data? Recall that important considerations are center, variability, and shape.

Question #19 Now on to the fiber variable. Is the distribution of fiber (in grams per serving) symmetric or skewed?

Question #20 What are the quartiles of the distribution of fiber?

Question #21 Are there any potential outliers in the fiber distribution? Which cereals do they correspond to? (Click on either the bar in the histogram or the point in the boxplot and then find the selected row in the data table. You can highlight multiple points by dragging open a box around all of them.) Do these look like data entry problems or do they look like valid measurements?
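To make the idea of quartiles and flagged points concrete, here is a small Python sketch. It rests on assumptions: the values are hypothetical (not the fiber column), the quartiles use Python's "inclusive" interpolation, and the 1.5 x IQR fences are the common textbook convention. This handout does not state JMP's exact rules, so JMP's numbers may differ slightly.

```python
import statistics

# Hypothetical grams-per-serving values; NOT the fiber column from cereal.txt.
data = [0, 1, 1, 2, 2, 3, 3, 4, 5, 14]

# Quartiles by linear interpolation between data points (an assumed method).
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Common convention (assumed here): flag points more than 1.5 * IQR
# beyond the quartiles as potential outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("quartiles:", q1, q2, q3)
print("fences:", lower_fence, upper_fence)
print("potential outliers:", outliers)
```

A flagged point is only a candidate: whether it is a data entry problem or a valid measurement still has to be judged by looking at the row itself.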
Question #22 Is the distribution of carbohydrates (grams per serving) symmetric or skewed?

Question #23 What is the mean value of carbohydrates?

Question #24 What is the standard deviation of carbohydrates?

Question #25 Are there any potential outliers for carbohydrates? Do they look like valid observations?

What's going on here? You can't have a negative amount of carbohydrates. It turns out that whoever entered the data didn't have a value for that cereal, and so they coded the missing value as -1. Clearly we need to do something about this outlier. Go back to the original data table and find that row. Click on the row number (the column to the left of the cereal name). On the left side of the data window, the bottom box is Rows. Click on the red hot spot and choose Exclude/Unexclude. You should now see a red circle with a line through it on the row for the cereal with the missing carbohydrate data.

Now go back to the JMP Starter window and choose Distribution again. Click on carbo, then Y, Columns and OK. Notice that the observation has been removed from the analysis (but not completely removed from the dataset, so that we can put it back later if we need to; for example, if we go back to analyzing calories, for which it does have a valid value).

Question #26 Now what is the mean? How much was it affected by the outlier?

Question #27 Now what is the standard deviation? How much was it affected by the outlier?

Question #28 With the revised mean and standard deviation, are there any new observations flagged as potential outliers?
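The effect of a miscoded missing value on the mean and standard deviation can also be sketched in a few lines of Python. This is an illustration only: the numbers are hypothetical (not the carbo column), and treating any negative value as the missing-value code is an assumption made for the example.

```python
import statistics

# Hypothetical grams-per-serving values; NOT the carbo column from cereal.txt.
# The last entry is a missing value that was coded as -1.
carbo = [12, 15, 14, 13, 16, -1]

# Analogous to Exclude/Unexclude: leave the original list untouched and
# analyze a filtered copy (assumes any negative value is a missing-value code).
kept = [x for x in carbo if x >= 0]

mean_all, sd_all = statistics.mean(carbo), statistics.stdev(carbo)
mean_kept, sd_kept = statistics.mean(kept), statistics.stdev(kept)

print(f"with the -1:    mean = {mean_all:.2f}, sd = {sd_all:.2f}")
print(f"without the -1: mean = {mean_kept:.2f}, sd = {sd_kept:.2f}")
```

In this toy example a single bad value pulls the mean down and inflates the standard deviation several times over, which is the pattern to look for when comparing your answers before and after excluding the row.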
We're done with the carbohydrates, as well as the calories and fiber for today, although you may want to leave those windows open until after you've been checked off below. Keep the Quaker Oatmeal row excluded for now.

Let's take a look at the shelf variable. This is coded as 1 for cereals sold on the bottom shelf at the supermarket, 2 for those on the middle shelf, and 3 for those on the top shelf.

Question #29 Is shelf continuous, discrete, nominal, or ordinal? Did JMP correctly guess the variable type? (If not, fix it.)

If you've shopped for cereal in a supermarket, you may have noticed that the healthier cereals tend to come in smaller boxes, while the less healthy cereals tend to be manufactured with puffed air and take up more volume (larger boxes). So the shelf with smaller boxes can have more different varieties.

Question #30 Make a bar chart of counts by shelf (from Distribution under Basic). Which shelf does it look like the healthier cereals are on?

Go back to the JMP Starter window and from Basic, choose Oneway. In the dialog box, click on sugars and then click on Y, Response. Click on shelf and then on X, Grouping, then on OK. You should get a plot with a bunch of dots. We'd like boxplots, so click on the hot spot to the left of Oneway Analysis of sugars By shelf and choose Quantiles. Now you should see a boxplot of sugar content for each shelf.

Question #31 Which shelf has the most sugary cereals?

Question #32 Sugary cereals are often marketed to kids. Given the height of kids, does this placement make sense?

Quit JMP and please remember to Log Off (from the Start menu in the lower left of the screen).