AMS 7L LAB #2 Spring, 2009. Exploratory Data Analysis




Name:
Lab Section:

Instructions: The TAs/lab assistants are available to help you if you have any questions about this lab exercise. If you have any questions, please raise your hand and they will get to you as quickly as possible. At the end of class, you will need to turn in this cover sheet to your lab instructor. If you do not turn it in, you will not get credit for this lab. Be sure to write your name and section above.

The following symbol at the beginning of a question means that after you answer that question you should raise your hand and have a TA or lab assistant review your answers up to that point. Once they have reviewed your work they will initial in the appropriate space in the table below. The purpose of this check is to be sure you have answered the questions correctly.

Be sure to take the rest of the lab handout with you when you leave. It contains your answers and JMP instructions which you may find useful for doing homework assignments.

Check-Problem    Lab Instructor's Initials
10
21
32

Objectives:
1. To practice exploratory data analysis techniques
2. To learn to read datafiles into JMP

Getting Started: Log onto your machine using your ITS login. Before starting JMP, you need to download two datafiles: butterfly.jmp and cereal.txt. First, open a web browser (such as Firefox) and go to the course webpage: http://www.soe.ucsc.edu/classes/ams007/spring09/ Then click on the link for Datasets. Once on the datasets webpage, download butterfly.jmp by clicking on the butterfly link, choosing Save to Disk, and hitting OK. Note that this file is already in JMP format. Next download cereal.txt by clicking on the cereal link, which will bring up the datafile in your browser (since this is just a text file). Go up to the File menu in the upper left, choose Save Page As, and save the file to your Desktop (or anywhere else you'd prefer) by clicking on Save.

Part I. Butterfly Data

Start JMP. Open the butterfly data by choosing Open Data Table from the JMP Starter window, clicking on butterfly.jmp and then choosing Open. JMP will open a data window with the butterfly data. Because this data file is already a JMP file, you will see that the column has already been labeled with the title Wing-Length. This file consists of the measurements of wing lengths of 24 butterflies. Take a look at the data values.

Question #1 Wing length (in cm) is a quantitative variable. Is it continuous or discrete?

Most of the exploratory data analysis techniques can be accessed from the Basic menu of the JMP Starter window. Click on Basic and then choose Distribution. If Wing-Length is not already highlighted on the left (under Select Columns), then click on it, then click on Y, Columns in the middle, and click on OK. You should get a histogram, but by default it is sideways from the usual form. Click on the red triangle hot spot button just to the left of Wing-Length. The drop-down menu has a lot of options. Click on Histogram Options and then look in the menu that appears to the right. You can see that JMP allows you to display the histogram vertically or horizontally.

Now let's go through the four primary parts of exploratory data analysis: center, variation, shape, and outliers (we don't have an element of time, so we won't worry about changes over time).

Question #2 What is the mean of the dataset?
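For readers who want to check the center and variation measures outside JMP, the same quantities can be sketched in Python with the standard library. This is only an illustration: the `wing_lengths` values below are made up and are not the contents of butterfly.jmp, and `statistics.stdev` computes the sample standard deviation (dividing by n - 1), which is assumed here to match what a statistics package reports.

```python
import statistics

# Hypothetical wing lengths in cm (NOT the values in butterfly.jmp).
wing_lengths = [3.2, 3.5, 3.5, 3.6, 3.8, 3.9, 4.0, 4.1]

mean = statistics.mean(wing_lengths)        # center: arithmetic average
median = statistics.median(wing_lengths)    # center: middle of the sorted values
modes = statistics.multimode(wing_lengths)  # most frequent value(s); may be more than one
sd = statistics.stdev(wing_lengths)         # variation: sample SD (divides by n - 1)

print(f"mean={mean:.3f} median={median:.3f} modes={modes} sd={sd:.3f}")
```

When the mean and median are close, as in this made-up sample, that is consistent with a roughly symmetric distribution, though it does not prove symmetry on its own.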

Question #3 What is the median of the dataset?

Question #4 What is the mode of the dataset? (You may want to go back to the original data table, or you can make a stem and leaf diagram by going to the red hot spot by Wing-Length and choosing Stem and Leaf; to get back to the histogram, uncheck Stem and Leaf from the same menu.)

Question #5 What is the standard deviation of the dataset?

Question #6 Looking at the histogram, is this dataset symmetric or skewed?

Question #7 Looking at the boxplot, is this dataset symmetric or skewed? (The boxplot has a few extra things on it that we won't worry about for now. You should look primarily at the quartiles, and secondarily at the whiskers.)

Question #8 With a symmetric dataset, the mean is usually very close to the median. Is that the case here? Does this agree with or disagree with your answers to the two previous questions?

JMP uses certain rules-of-thumb to decide how wide the bins are in the histogram. Usually JMP will produce a good histogram. But sometimes the picture will change as the size of the bins changes. To see this in action, click on the hand icon in the toolbar near the top (this is called the Grabber tool). Note that the default plot has a lot of bars in it. Now click and hold on the histogram, and slowly drag the cursor directly downwards (i.e., move the mouse toward you) while still holding the mouse button down. You should see the histogram change as the bins are made wider and wider until there are only three bins left (you can go even further, but those plots really don't make much sense). Still holding the mouse button down, drag the cursor upwards to bring back more bins. Be careful not to drag the cursor to the left or right, as that will change the placement of the particular bin that you have clicked on, rather than the full set of bins.
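The effect of dragging with the Grabber tool can be imitated by counting the same values with several bin widths. The sketch below is a rough Python illustration, not JMP's binning rule: `bin_counts` is a hypothetical helper, the `values` list is invented, and bins are assumed to be half-open intervals starting at `start`.

```python
from collections import Counter

def bin_counts(values, width, start=0):
    """Count values in bins [start + k*width, start + (k+1)*width)."""
    counts = Counter((v - start) // width for v in values)
    # Key each bin by its lower edge, in increasing order.
    return {start + k * width: counts[k] for k in sorted(counts)}

# Hypothetical measurements (not taken from either lab datafile).
values = [50, 70, 90, 90, 100, 100, 100, 110, 110, 110, 110,
          120, 120, 140, 150, 160]

for width in (10, 20, 50):
    print(width, bin_counts(values, width))
```

With a width of 10 nearly every distinct value gets its own bar, while a width of 50 collapses the same data into three bars, so the apparent shape depends on the bin width chosen.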

Question #9 Does the choice of histogram bin width have an effect on the histogram? After taking into account everything you've done so far, do you think the data are symmetric or skewed?

Question #10 Finally, we should check for potential outliers. Do there appear to be any outliers? (You may want to go back to the data table.)

JMP will automatically flag possible outliers on the boxplot by marking those observations with points separate from the whiskers, so that the whiskers don't go all the way out to those points. In this case, there aren't any points flagged by JMP.

OK, we're done with the butterfly data for today. You can close the distribution analysis window (click on the X in the upper right corner of that window) and you can also close the data window.

Part II. Cereal Data

Next we'll learn how to read in a plain text file. Go back to the JMP Starter window and click on File. Click on the button for Open Data Table. A window will pop up, but you probably won't see the cereal.txt file yet. One of the last items on the bottom left says Files of type:. On the right end of that box (which should currently say Data Files with a bunch of possible file extensions) there is a black triangle. Click on the black triangle and select Text Import Files. Now you should see the cereal.txt file in the large box. Click on that file, then click on the Open button. JMP will read in the file, and it will try to put all the labels in the right place and even guess the types of the variables.

Scroll around and see that there are 77 different cereals in this dataset, and 15 different variables measured for each cereal. The first column is the name of the cereal. This is just a label. The second column is a code for the manufacturer (A = American Home Food Products; G = General Mills; K = Kellogg's; N = Nabisco; P = Post; Q = Quaker Oats; R = Ralston Purina). The type is cold or hot, and you'll see that almost all of the cereals in this sample are cold.

Question #11 Is manufacturer a nominal or ordinal variable? In the box labeled Columns on the left side of the window, right-click on the little image to the left of mfr. Did JMP correctly guess the type of variable?

Question #12 What type of variable is the number of calories per serving (the calories column)? Did JMP correctly guess the type of variable?

Let's do some exploratory data analysis. We'll start with the manufacturer. Go back to the JMP Starter window and click on Basic and then choose Distribution. This time, we have a long list of possible columns, so it is important that we specify which one we want. Click on mfr, then click on Y, Columns and then on OK. JMP gives you a bar chart. JMP also gives you a count by category below.

In addition, we can make a Pareto chart. Go back to the JMP Starter window and click on Measure, which brings up a different set of options. Choose Pareto Plot, which brings up a new dialog box. Click on the variable mfr, then click on Y, Cause and OK to bring up the Pareto chart. Use information from both the Pareto chart and the bar chart and relative frequency table to answer the following.

Question #13 Which manufacturers have the largest representation in this sample?

Question #14 What percent of the cereals in the sample were manufactured by Quaker?

There's not that much we can do with a single categorical variable, but we can also make two-way tables of two categorical variables. Back in the JMP Starter window, go back to Basic, look near the bottom of the options and click on Contingency. Click on type and then click on Y, Response Category. Next click on mfr and then click on X, Grouping Category, then hit OK. You will get a funky mosaic plot that we're going to skip, and you can make that plot go away by clicking on the grey and blue diamond to the left of Mosaic Plot. Now you should see the Contingency Table.

The table has a lot more information than we need right now, so let's hide some of that information. Click on the hot spot to the left of Contingency Table and notice that four items are checked: Count, Total %, Col %, and Row %. Right now, we really only need the counts, not any of the percentages, so uncheck all three of the percentages so that only the counts remain. Now you should have a much more compact table that just gives the counts by manufacturer and by type (cold vs. hot). The rows show the manufacturers, and the columns show the types. The bottom row is the total by type (summed over manufacturer) and the rightmost column gives the total by manufacturer. This rightmost column should match the previous table we had. In the bottom row, notice that there are only three hot cereals in this sample.

Question #15 Which three manufacturers made the hot cereals in this sample?

You can now close all the contingency table analysis windows. Back in the JMP Starter window, let's look at some of the continuous variables. Still in Basic, choose Distribution. This time, we'll look at three variables. Click on calories and then hold down the Control key on the keyboard (often labeled Ctrl) and click on fiber, and also click on carbo (then let go of the Control key). Then click on Y, Columns and OK. You should get three histograms.
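The frequency table, the Pareto ordering, and the two-way table of counts described above can be sketched in a few lines of Python. The `rows` list below is invented for illustration and does not reproduce cereal.txt; the manufacturer codes follow the handout, and the type codes "C" (cold) and "H" (hot) are an assumption about how the file labels them.

```python
from collections import Counter

# Hypothetical (manufacturer code, type) pairs; not the real cereal data.
rows = [("K", "C"), ("K", "C"), ("K", "C"), ("G", "C"), ("G", "C"),
        ("Q", "H"), ("Q", "C"), ("N", "H"), ("P", "C"), ("K", "C")]

# One-way frequency table, sorted from most to least common (Pareto order).
mfr_counts = Counter(mfr for mfr, _ in rows)
pareto = mfr_counts.most_common()
percent = {mfr: 100 * n / len(rows) for mfr, n in pareto}

# Two-way table of counts, keyed by (manufacturer, type).
two_way = Counter(rows)
hot_makers = sorted({mfr for mfr, kind in rows if kind == "H"})

for mfr, n in pareto:
    print(f"{mfr}: count={n} percent={percent[mfr]:.1f} "
          f"cold={two_way[(mfr, 'C')]} hot={two_way[(mfr, 'H')]}")
print("manufacturers with hot cereals:", hot_makers)
```

The row totals of the two-way table are the same counts as the one-way table, which mirrors the check the handout suggests for the rightmost column of the contingency table.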

Question #16 Is the distribution of calories symmetric or skewed?

Question #17 What are the quartiles of the distribution of calories?

Since the median and the third quartile are the same, it may look like a line is missing on the boxplot, but it's just that the lines for the median and the third quartile are on top of each other, so you only see one box without a middle line.

Back to the histogram for calories, the default bin width produces a lot of bins (basically each value is in a separate bin). Select the Grabber tool from the toolbar and slowly drag the cursor directly downwards. As the bins get wider, with the first re-drawing of the histogram, you should notice that the data now appear to fall into five separate groups. If you continue to widen the bins, the data will then appear in one continuous span, with some peak in the center. However, the peak will appear to move around as the bin width changes.

Question #18 Do some of the different possible bin widths give different impressions about the data? Recall that important considerations are center, variability, and shape.

Question #19 Now on to the fiber variable. Is the distribution of fiber (in grams per serving) symmetric or skewed?

Question #20 What are the quartiles of the distribution of fiber?

Question #21 Are there any potential outliers in the fiber distribution? Which cereals do they correspond to? (Click on either the bar in the histogram or the point in the boxplot and then find the selected row in the data table. You can highlight multiple points by dragging open a box around all of them.) Do these look like data entry problems or do they look like valid measurements?
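Quartiles and boxplot-style outlier flags can also be computed by hand. The sketch below assumes the common 1.5 x IQR fence rule for flagging points beyond the whiskers; the handout does not state which rule JMP applies, and quartile conventions differ between packages, so results may not match JMP's output exactly. The `fiber` values and the helper `outlier_fences` are hypothetical.

```python
import statistics

def outlier_fences(values):
    """Return (q1, median, q3, lower_fence, upper_fence) using 1.5 * IQR fences."""
    # method="inclusive" treats the data as the whole population of interest;
    # other packages may interpolate quartiles differently.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1, q2, q3, q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Hypothetical fiber values in grams per serving (not from cereal.txt).
fiber = [0, 0, 1, 1, 2, 2, 3, 3, 4, 5, 14]

q1, q2, q3, low, high = outlier_fences(fiber)
flagged = [x for x in fiber if x < low or x > high]
print(f"Q1={q1} median={q2} Q3={q3} fences=({low}, {high}) flagged={flagged}")
```

A flagged value is only a candidate outlier. As the question above suggests, the next step is to look up the row and judge whether it is a data entry problem or a valid measurement.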

Question #22 Is the distribution of carbohydrates (grams per serving) symmetric or skewed?

Question #23 What is the mean value of carbohydrates?

Question #24 What is the standard deviation of carbohydrates?

Question #25 Are there any potential outliers for carbohydrates? Do they look like valid observations?

What's going on here? You can't have a negative amount of carbohydrates. It turns out that whoever entered the data didn't have a value for that cereal, and so they coded the missing value as -1. Clearly we need to do something about this outlier. Go back to the original data table and find that row. Click on the row number (the column to the left of the cereal name). On the left side of the data window, the bottom box is Rows. Click on the red hot spot and choose Exclude/Unexclude. You should now see a red circle with a line through it on the row for the cereal with the missing carbohydrate data.

Now go back to the JMP Starter window and choose Distribution again. Click on carbo, then Y, Columns and OK. Notice that that observation has been removed from the analysis (but not completely removed from the dataset, so that we can put it back later if we need to, for example, if we go back to analyzing calories, for which it does have a valid value).

Question #26 Now what is the mean? How much was it affected by the outlier?

Question #27 Now what is the standard deviation? How much was it affected by the outlier?

Question #28 With the revised mean and standard deviation, are there any new observations flagged as potential outliers?
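The effect of a missing value coded as a number can be seen by computing the mean and standard deviation with and without it. This Python sketch uses made-up `carbo` values and assumes the missing-value code is -1; like Exclude/Unexclude in JMP, it filters the value out of the analysis without deleting it from the original list.

```python
import statistics

MISSING = -1  # assumed sentinel used by whoever entered the data

# Hypothetical carbohydrate values in grams (not from cereal.txt).
carbo = [12, 15, 14, 21, 13, MISSING, 17, 20]

# Exclude the sentinel from the analysis, leaving `carbo` itself untouched.
valid = [x for x in carbo if x != MISSING]

mean_all, sd_all = statistics.mean(carbo), statistics.stdev(carbo)
mean_valid, sd_valid = statistics.mean(valid), statistics.stdev(valid)

print(f"with sentinel:    mean={mean_all:.3f} sd={sd_all:.3f}")
print(f"without sentinel: mean={mean_valid:.3f} sd={sd_valid:.3f}")
```

In this invented sample the single bad value pulls the mean down and roughly doubles the standard deviation, which illustrates why the standard deviation is especially sensitive to outliers.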

We're done with the carbohydrates, as well as the calories and fiber for today, although you may want to leave those windows open until after you've been checked off below. Keep the Quaker Oatmeal row excluded for now.

Let's take a look at the shelf variable. This is coded as 1 for cereals sold on the bottom shelf at the supermarket, 2 for those on the middle shelf, and 3 for those on the top shelf.

Question #29 Is shelf continuous, discrete, nominal, or ordinal? Did JMP correctly guess the variable type? (If not, fix it.)

If you've shopped for cereal in a supermarket, you may have noticed that the healthier cereals tend to come in smaller boxes, while the less healthy cereals tend to be manufactured with puffed air and take up more volume (larger boxes). So the shelf with smaller boxes can have more different varieties.

Question #30 Make a bar chart of counts by shelf (from the Basic Distribution option). Which shelf does it look like the healthier cereals are on?

Go back to the JMP Starter window and from Basic, choose Oneway. In the dialog box, click on sugars and then click on Y, Response. Click on shelf and then on X, Grouping, then on OK. You should get a plot with a bunch of dots. We'd like boxplots, so click on the hot spot to the left of Oneway Analysis of sugars By shelf and choose Quantiles. Now you should see a boxplot of sugar content for each shelf.

Question #31 Which shelf has the most sugary cereals?

Question #32 Sugary cereals are often marketed to kids. Given the height of kids, does this placement make sense?

Quit JMP and please remember to Log Off (from the Start menu in the lower left of the screen).
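The side-by-side boxplots from the Oneway analysis summarize each group by its quantiles. A rough Python equivalent is a five-number summary per group, as sketched below. The `(shelf, sugars)` pairs are invented and say nothing about the real answer to Question #31, `five_number_by_group` is a hypothetical helper, and the quartile convention is an assumption that may differ from JMP's.

```python
import statistics
from collections import defaultdict

def five_number_by_group(pairs):
    """Map each group to (min, Q1, median, Q3, max) of its values."""
    groups = defaultdict(list)
    for group, value in pairs:
        groups[group].append(value)
    summary = {}
    for group in sorted(groups):
        vals = groups[group]
        q1, med, q3 = statistics.quantiles(vals, n=4, method="inclusive")
        summary[group] = (min(vals), q1, med, q3, max(vals))
    return summary

# Hypothetical (shelf, sugars in grams) pairs; not from cereal.txt.
pairs = [(1, 0), (1, 2), (1, 3), (1, 6), (1, 9),
         (2, 5), (2, 9), (2, 11), (2, 12), (2, 13),
         (3, 0), (3, 3), (3, 6), (3, 7), (3, 10)]

summary = five_number_by_group(pairs)
for shelf, stats in summary.items():
    print(shelf, stats)

# Group with the largest median, as one way to compare the boxes.
highest = max(summary, key=lambda g: summary[g][2])
print("largest median in group:", highest)
```

Comparing medians is only one way to read the boxplots; the spread of each box and the whiskers matter too when the groups overlap.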