2-7 Exploratory Data Analysis (EDA)
- Christopher Russell
- 8 years ago
Transcription
CHAPTER 2 Describing, Exploring, and Comparing Data

2-7 Exploratory Data Analysis (EDA)

This chapter presents the basic tools for describing, exploring, and comparing data, and this section focuses on the exploration of data. We begin by defining exploratory data analysis, then we introduce outliers, 5-number summaries, and boxplots.

Definition
Exploratory data analysis is the process of using statistical tools (such as graphs, measures of center, and measures of variation) to investigate data sets in order to understand their important characteristics.

Recall that in Section 2-1 we listed five important characteristics of data, and we began with (1) center, (2) variation, and (3) the nature of the distribution. These characteristics can be investigated by calculating the values of the mean and standard deviation, and by constructing a histogram. It is generally important to investigate the data set further to identify any notable features, especially those that could strongly affect results and conclusions. One such feature is the presence of outliers.

Outliers
An outlier is a value that is located very far away from almost all of the other values. Relative to the other data, an outlier is an extreme value. When exploring a data set, outliers should be considered because they may reveal important information, they may strongly affect the values of the mean and standard deviation, and they may seriously distort a histogram. The following example uses an incorrect entry as an example of an outlier, but not all outliers are errors; some outliers are correct values.

EXAMPLE Cotinine Levels of Smokers
When using computer software or a calculator, it is often easy to make keying errors. Refer to the cotinine levels of smokers listed in Table 2-1 with the
Chapter Problem and assume that the first entry of 1 is incorrectly entered as 11111 because you were distracted by a meteorite landing on your porch. The incorrect entry of 11111 is an outlier because it is located very far away from the other values. How does that outlier affect the mean, standard deviation, and histogram?

SOLUTION
When the entry of 1 is replaced by the outlier value of 11111, the mean changes from […] to 450.2, so the effect of the outlier is very substantial. The incorrect entry of 11111 causes the standard deviation to change from […] to […], so the effect of the outlier here is also substantial. Figure 2-1 in Section 2-3 depicts the histogram for the correct values of cotinine levels of smokers in Table 2-1, but the STATDISK display presented here shows the histogram that results from using the same data with the value of 1 replaced by the incorrect value of 11111. Compare this STATDISK histogram to Figure 2-1 and you can easily see that the presence of the outlier dramatically affects the shape of the distribution.

[STATDISK display: histogram of the cotinine levels with the incorrect value of 11111 included]

The preceding example illustrates these important principles:
1. An outlier can have a dramatic effect on the mean.
2. An outlier can have a dramatic effect on the standard deviation.
3. An outlier can have a dramatic effect on the scale of the histogram so that the true nature of the distribution is totally obscured.

An easy procedure for finding outliers is to examine a sorted list of the data. In particular, look at the minimum and maximum sample values and determine whether they are very far away from the other typical values. Some outliers are correct values and some are errors, as in the preceding example. If we are sure that an outlier is an error, we should correct it or delete it. If we include an outlier because we know that it is correct, we might study its effects by constructing graphs and calculating statistics with and without the outlier included.
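The three principles above can be checked numerically. The following is a minimal Python sketch, not the textbook's procedure: the ten values in `correct` are hypothetical (Table 2-1 is not reproduced in this section), and the helper name `describe` is an assumption made for illustration. Replacing one of n values shifts the mean by (new value minus old value) divided by n, so for the 40 cotinine levels the keying error would shift the mean by (11111 - 1)/40 = 277.75.

```python
from statistics import mean, stdev

# Hypothetical sample values (NOT the cotinine data of Table 2-1).
correct = [1, 35, 112, 130, 167, 173, 198, 210, 250, 289]

# Simulate the keying error: the first entry of 1 is typed as 11111.
with_error = [11111] + correct[1:]


def describe(values):
    """Return (mean, sample standard deviation, minimum, maximum)."""
    return mean(values), stdev(values), min(values), max(values)


for label, data in (("correct", correct), ("with outlier", with_error)):
    m, s, lo, hi = describe(data)
    print(f"{label:>12}: mean={m:.1f}  s={s:.1f}  min={lo}  max={hi}")

# Replacing one of n values shifts the mean by (new - old) / n.
shift = (11111 - 1) / len(correct)
print(f"shift in the mean: {shift:.2f}")
```

Looking at the printed minimum and maximum is the same check as scanning the two ends of a sorted list, which is the outlier-finding procedure described above.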
An Outlier Tip
Outliers are important to consider because, in many cases, one extreme value can have a dramatic effect on statistics and conclusions derived from them. In some cases an outlier is a mistake that should be corrected or deleted. In other cases, an outlier is a valid data value that should be investigated for any important information. Students of the author collected data consisting of restaurant bills and tips, and no notable outliers were found among their sample data. However, one such outlier is the tip of $16,000 that was left for a restaurant bill of $8,[…]. The tip was left by an unidentified London executive to waiter Lenny Lorando at Nello's restaurant in New York City. Lorando said that he had waited on the customer before and "He's always generous, but never anything like that before. I have to tell my sister about him."
Good Advice for Journalists
Columnist Max Frankel wrote in the New York Times that most schools of journalism give statistics short shrift and some let students graduate without any numbers training at all. How can such reporters write sensibly about trade and welfare and crime, or air fares, health care and nutrition? The media's sloppy use of numbers about the incidence of accidents or disease frightens people and leaves them vulnerable to journalistic hype, political demagoguery, and commercial fraud. He cites several cases, including an example of a full-page article about New York City's deficit with a promise by the mayor of New York City to close a budget gap of $2.7 billion; the entire article never once mentioned the total size of the budget, so the $2.7 billion figure had no context.

Boxplots
In addition to the graphs presented in Section 2-3, a boxplot is another graph that is used often. Boxplots are useful for revealing the center of the data, the spread of the data, the distribution of the data, and the presence of outliers. The construction of a boxplot requires that we first obtain the minimum value, the maximum value, and quartiles, as defined in the 5-number summary.

Definitions
For a set of data, the 5-number summary consists of the minimum value; the first quartile, Q1; the median (or second quartile, Q2); the third quartile, Q3; and the maximum value.
A boxplot (or box-and-whisker diagram) is a graph of a data set that consists of a line extending from the minimum value to the maximum value, and a box with lines drawn at the first quartile, Q1; the median; and the third quartile, Q3. (See Figure 2-16.)

Procedure for Constructing a Boxplot
1. Find the 5-number summary consisting of the minimum value, Q1, the median, Q3, and the maximum value.
2. Construct a scale with values that include the minimum and maximum data values.
3. Construct a box (rectangle) extending from Q1 to Q3, and draw a line in the box at the median value.
4. Draw lines extending outward from the box to the minimum and maximum data values.

Boxplots don't show as much detailed information as histograms or stem-and-leaf plots, so they might not be the best choice when dealing with a single data set. They are often great for comparing two or more data sets. When using two or more boxplots for comparing different data sets, it is important to use the same scale so that correct comparisons can be made.

EXAMPLE Cotinine Levels of Smokers
Refer to the 40 cotinine levels of smokers in Table 2-1 (without the error of 11111 used in place of 1, as in the preceding example).
a. Find the values constituting the 5-number summary.
b. Construct a boxplot.

SOLUTION
a. The 5-number summary consists of the minimum, Q1, median, Q3, and maximum. To find those values, first sort the data (by arranging them in order from lowest to highest). The minimum of 0 and the maximum of 491
are easy to identify from the sorted list. Now proceed to find the quartiles. Using the flowchart of Figure 2-15, we get $Q_1 = P_{25} = 86.5$, which is located by calculating the locator $L = (25/100) \cdot 40 = 10$ and finding the value midway between the 10th value and the 11th value in the sorted list. The median is 170, which is the value midway between the 20th and 21st values. We also find that $Q_3 = 251.5$ by using Figure 2-15 for the 75th percentile. The 5-number summary is therefore 0, 86.5, 170, 251.5, and 491.
b. In Figure 2-16 we graph the boxplot for the data. We use the minimum (0) and the maximum (491) to determine a scale of values, then we plot the values from the 5-number summary as shown.

[FIGURE 2-16 Boxplot of Cotinine Levels of Smokers, marking the minimum, Q1, median, Q3, and maximum]

In Figure 2-17 we show some generic boxplots along with common distribution shapes. It appears that the cotinine levels of smokers have a skewed distribution.

[FIGURE 2-17 Boxplots Corresponding to Bell-Shaped, Uniform, and Skewed Distributions]

To illustrate the use of boxplots to compare data sets, see the accompanying Minitab display of cholesterol levels for a sample of males and a sample of females, based on the National Health Examination data included in Data Set 1 of Appendix B. Based on the sample data, it appears that males have cholesterol levels that are generally higher than those of females, and the cholesterol levels of males appear to vary more than those of females.
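The locator calculation in the cotinine solution can be sketched in Python. This is an illustration under stated assumptions rather than a reproduction of Figure 2-15: the whole-number case (take the value midway between the Lth and (L+1)th sorted values) follows the worked example above, while the rule used here for a fractional locator (round L up and take that value) is an assumption, because the flowchart is not shown in this section. The function names and the sample values are hypothetical.

```python
def percentile(sorted_values, k):
    """Return the k-th percentile (0 < k < 100) of an already sorted list.

    The locator is L = (k / 100) * n.
    Whole-number L: value midway between the L-th and (L+1)-th values,
    as in the worked example.
    Fractional L: round L up and take that value (assumed rule).
    """
    n = len(sorted_values)
    whole, remainder = divmod(k * n, 100)  # integer arithmetic avoids float error
    if remainder == 0:
        # Positions are 1-based in the text, so the L-th value is index L - 1.
        return (sorted_values[whole - 1] + sorted_values[whole]) / 2
    return sorted_values[whole]  # index `whole` is the ceil(L)-th value


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    data = sorted(values)
    return (data[0], percentile(data, 25), percentile(data, 50),
            percentile(data, 75), data[-1])


# Hypothetical data, not taken from Table 2-1.
sample = [12, 2, 31, 9, 4, 20, 7, 15]
print(five_number_summary(sample))
```

With 40 values and k = 25 the function computes a locator of 10 and averages the 10th and 11th sorted values, which matches the step described in the solution. Other software may use different quartile conventions, so results can differ slightly.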
Best Colleges
Each year, U.S. News and World Report publishes an issue with a list of America's Best Colleges and Universities. Sales typically jump 40% for that issue. The list has critics who argue against the criteria and method of collecting data. Common complaints: Too much emphasis is placed on the criteria of a college's wealth, reputation, College Board scores, alumni donations, and the opinions of college presidents; too little emphasis is placed on the satisfaction of students and effective educational practices. The New York Times interviewed Kenneth Auchincloss, who is editor of How to Get Into College (by Kaplan/Newsweek), and he said that "We have never been comfortable trying to quantify in numeric terms the various criteria that go into making a college good or less good, and we don't want to devote the resources to doing an elaborate statistical analysis that frankly we don't think is valid."

EXAMPLE Does It Rain More on Weekends?
Refer to Data Set 11 in Appendix B, which lists rainfall amounts (in inches) in Boston for every day of a recent year. The collection of this data set was inspired by media reports that it rains more on weekends (Saturday and Sunday) than on weekdays. Later in this book we will describe important statistical methods that can be used to formally test that claim, but for now, let's explore the data set to see what can be learned. (Even if we already know how to apply those formal statistical methods, we should first explore the data before proceeding with the formal analysis.)

SOLUTION
Let's begin with an investigation into the key elements of center, variation, distribution, outliers, and characteristics over time (the same CVDOT list introduced in Section 2-1). Listed below are measures of center (mean), measures of variation (standard deviation), and the 5-number summary for the rainfall amounts for each day of the week.
The accompanying STATDISK display shows boxplots for each of the seven days of the week, starting with Monday at the top. Because the histograms for all seven days are pretty much the same, we show only the histogram for the Monday rainfall amounts.

[Table: mean, standard deviation, minimum, Q1, median, Q3, and maximum of the rainfall amounts for Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, and Sunday]

[STATDISK displays: boxplots for the seven days of the week; histogram of the Monday rainfall amounts]
INTERPRETATION
Examining and comparing the statistics and graphs, we make the following important observations.

Means: The means vary from a low of […] in. to a high of […] in. The seven means vary by considerable amounts, and in later chapters of this book we will present methods for determining whether these differences are significant. (Later methods will show that the means do not differ by significant amounts.) If we list the means in order from low to high, we get this sequence of days: Wednesday, Tuesday, Sunday, Thursday, Friday, Monday, Saturday. There does not appear to be a pattern of higher rainfall on weekends (although the highest mean corresponds to Saturday). Also, see the Excel graph of the seven means, with the mean for Monday plotted first. The Excel graph does not support the claim of more rainfall on weekends (although it might be argued that there is more rainfall on Saturdays).

[Excel graph of the seven means, with the mean for Monday plotted first]

Variation: The seven standard deviations vary from […] in. to […] in., but those values are not dramatically different. There does not appear to be anything highly unusual about the amounts of variation. The minimums, first quartiles, and medians are all 0.00 for each of the seven days. This is explained by the fact that for each day of the week, there are many days with no rain. The abundance of zeros is also seen in the boxplots and histograms, which show that the data have distributions that are heavy toward the low end (skewed right).

Outliers: There are no outliers or unusual values. At the low end, there are many rainfall amounts of zero. At the high end, the sorted list of all 365 rainfall amounts ends with the high values of 0.92, 0.96, 1.28, 1.41, and […].

Distributions: The distributions of the rainfall amounts are skewed to the right. They are not bell-shaped, as we might have expected.
If the use of a particular method of statistics requires normally distributed (bell-shaped) populations, that requirement is not satisfied for the rainfall amounts. We now have considerable insight into the nature of the Boston rainfall amounts for different days of the week. Based on our exploration, we can conclude that Boston does not experience more rain on weekends than on the other days of the week (although we might argue that there is more rainfall on Saturdays).
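The per-day comparison in the rainfall example can be sketched in Python. This is only an illustration: the records below are hypothetical (Data Set 11 is not reproduced here), only three days are shown, and the helper name `summarize_by_group` is an assumption. The sketch groups the amounts by day, computes a few of the summary measures, and lists the days from lowest to highest mean, mirroring the ordering step in the interpretation above.

```python
from collections import defaultdict
from statistics import mean, median, stdev

# Hypothetical (day, rainfall in inches) records; NOT Data Set 11.
records = [
    ("Mon", 0.00), ("Mon", 0.12), ("Mon", 0.00), ("Mon", 0.43),
    ("Sat", 0.00), ("Sat", 0.00), ("Sat", 0.85), ("Sat", 0.05),
    ("Sun", 0.00), ("Sun", 0.01), ("Sun", 0.00), ("Sun", 0.20),
]


def summarize_by_group(pairs):
    """Group (key, value) pairs by key and summarize each group."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {
        key: {
            "n": len(values),
            "mean": mean(values),
            "stdev": stdev(values),  # sample standard deviation
            "min": min(values),
            "median": median(values),
            "max": max(values),
        }
        for key, values in groups.items()
    }


summary = summarize_by_group(records)

# List the days from lowest mean to highest mean.
ranked = sorted(summary, key=lambda day: summary[day]["mean"])
for day in ranked:
    stats = summary[day]
    print(f"{day}: mean={stats['mean']:.3f}  s={stats['stdev']:.3f}  "
          f"median={stats['median']:.3f}  max={stats['max']:.2f}")
```

A ranking like this is exploratory only. As the example notes, formal methods are needed to decide whether differences among the means are significant.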
Critical Thinking
Armed with a list of tools for investigating center, variation, distribution, outliers, and characteristics of data over time, we might be tempted to develop a rote and mindless procedure, but critical thinking is critically important. In addition to using the tools presented in this chapter, we should consider any other relevant factors that might be crucial to the conclusions we form. We might pose questions such as these: Is the sample likely to be representative of the population, or is the sample somehow biased? What is the source of the data, and might the source be someone with an interest that could affect the quality of the data? Suppose, for example, that we want to estimate the mean income of college students. Also suppose that we mail questionnaires to 500 students and receive 20 responses. We could calculate the mean and standard deviation, construct graphs, identify outliers, and so on, but the results will be what statisticians refer to as hogwash. The sample is a voluntary response sample, and it is not likely to be representative of the population of all college students. In addition to the specific statistical tools presented in this chapter, we should also think!

Using Technology
This section introduced outliers, 5-number summaries, and boxplots. To find outliers, sort the data in order from lowest to highest, then examine the highest and lowest values to determine whether they are far away from the other sample values. STATDISK, Minitab, Excel, and the TI-83 Plus calculator can provide values of quartiles, so the 5-number summary is easy to find. STATDISK, Minitab, Excel, and the TI-83 Plus calculator can be used to create boxplots, and we now describe the different procedures.
(Caution: Remember that quartile values calculated by Minitab and the TI-83 Plus calculator may differ slightly from those calculated by applying Figure 2-15, so the boxplots may differ slightly as well.)

STATDISK
Choose the main menu item of Data and use the Sample Editor to enter the data, then click on COPY. Now select Data, then Boxplot and click on PASTE, then Evaluate.

Minitab
Enter the data in column C1, then select Graph, then Boxplot. Enter C1 in the first cell under the Y column, then click OK.

Excel
Although Excel is not designed to generate boxplots, they can be generated using the Data Desk XL add-in that is a supplement to this book. First enter the data in column A. Click on DDXL and select Charts and Plots. Under Function Type, select the option of Boxplot. In the dialog box, click on the pencil icon and enter the range of data, such as A1:A40 if you have 40 values listed in column A. Click on OK. The result is a modified boxplot as described in Exercise 13. The values of the 5-number summary are also displayed.

TI-83 Plus
Enter the sample data in list L1. Now select STAT PLOT by pressing the 2nd key followed by the key labeled Y=. Press the ENTER key, then select the option of ON, and select the boxplot type that is positioned in the middle of the second row. The Xlist should indicate L1 and the Freq value should be 1. Now press the ZOOM key and select option 9 for ZoomStat. Press the ENTER key and the boxplot should be displayed. You can use the arrow keys to move right or left so that values can be read from the horizontal scale.

2-7 Basic Skills and Concepts

1. Lottery Refer to Data Set 26 and use only the 40 digits in the first column of the Win 4 results from the New York State Lottery (9, 7, 0, and so on). Find the 5-number summary and construct a boxplot. What characteristic of the boxplot suggests that the digits are selected with a random and fair procedure?
2. Movie Budgets Refer to Data Set 21 in Appendix B for the budget amounts of the 15 movies that are R-rated. Find the 5-number summary and construct a boxplot. Determine whether the sample values are likely to be representative of movies made this year.

3. Cereal Calories Refer to Data Set 16 in Appendix B for the 16 values consisting of the calories per gram of cereal. Find the 5-number summary and construct a boxplot. Determine whether the sample values are likely to be representative of the cereals consumed by the general population.

4. Nicotine in Cigarettes Refer to Data Set 5 for the 29 amounts of nicotine (in mg per cigarette). Find the 5-number summary and construct a boxplot. Are the sample values likely to be representative of cigarettes smoked by an individual consumer?

5. Red M&Ms Refer to Data Set 19 for the 21 weights (in grams) of the red M&M candies. Find the 5-number summary and construct a boxplot. Are the red sample values likely to be representative of M&M candies of all colors?

6. Bear Lengths Refer to Data Set 9 for the lengths (in inches) of the 54 bears that were anesthetized and measured. Find the 5-number summary and construct a boxplot. Does the distribution of the lengths appear to be symmetric or does it appear to be skewed?

7. Alcohol in Children's Movies Refer to Data Set 7 for the 50 times (in seconds) of scenes showing alcohol use in animated children's movies. Find the 5-number summary and construct a boxplot. Based on the boxplot, does the distribution appear to be symmetric or is it skewed?

8. Body Temperatures Refer to Data Set 4 in Appendix B for the 106 body temperatures for 12 A.M. on day 2. Find the 5-number summary and construct a boxplot, then determine whether the sample values support the common belief that the mean body temperature is 98.6°F.

In Exercises 9-12, find 5-number summaries, construct boxplots, and compare the data sets.

9. Academy Awards In "Ages of Oscar-Winning Best Actors and Actresses" (Mathematics Teacher magazine) by Richard Brown and Gretchen Davis, the authors compare the ages of actors and actresses at the time they won Oscars. The results for winners from both categories are listed in the following table. Use boxplots to compare the two data sets.
Actors: […]
Actresses: […]

10. Regular/Diet Coke Refer to Data Set 17 in Appendix B and use the weights of regular Coke and the weights of diet Coke. Does there appear to be a significant difference? If so, can you provide an explanation?
11. Cotinine Levels Refer to Table 2-1 located in the Chapter Problem. We have already found that the 5-number summary for the cotinine levels of smokers is 0, 86.5, 170, 251.5, and 491. Find the 5-number summaries for the other two groups, then construct the three boxplots using the same scale. Are there any apparent differences?

12. Clancy, Rowling, Tolstoy Refer to Data Set 14 in Appendix B and use the Flesch reading ease scores for the sample pages from Tom Clancy's The Bear and the Dragon, J. K. Rowling's Harry Potter and the Sorcerer's Stone, and Leo Tolstoy's War and Peace. (Higher scores indicate easier reading.) Does there appear to be a difference in ease of reading? Are the results consistent with your expectations?

2-7 Beyond the Basics

13. The boxplots discussed in this section are often called skeletal (or regular) boxplots. Modified boxplots are constructed as follows:
a. Find the IQR, which denotes the interquartile range defined by $\mathrm{IQR} = Q_3 - Q_1$.
b. Draw the box with the median and quartiles as usual, but when drawing the lines to the right and left of the box, draw the lines only as far as the points corresponding to the largest and smallest values that are within 1.5 × IQR of the box.
c. Mild outliers, plotted as solid dots, are values below Q1 or above Q3 by an amount that is greater than 1.5 × IQR but not greater than 3 × IQR. That is, mild outliers are values x such that
$Q_1 - 3\,\mathrm{IQR} \le x < Q_1 - 1.5\,\mathrm{IQR}$ or $Q_3 + 1.5\,\mathrm{IQR} < x \le Q_3 + 3\,\mathrm{IQR}$
d. Extreme outliers, plotted as small hollow circles, are values that are either below Q1 by more than 3 × IQR or above Q3 by more than 3 × IQR. That is, extreme outliers are values x such that
$x < Q_1 - 3\,\mathrm{IQR}$ or $x > Q_3 + 3\,\mathrm{IQR}$
The accompanying figure is an example of a modified boxplot. Refer to the cotinine levels of smokers in Table 2-1 included with the Chapter Problem. We have found that this data set has a 5-number summary of 0, 86.5, 170, 251.5, and 491.
Identify the value of IQR, identify the ranges of values used to identify mild and extreme outliers, then identify any actual mild outliers or extreme outliers.

[Figure: modified boxplot with Q1, Q2, and Q3 marked on the box; mild outliers lie more than 1.5 × IQR but not more than 3 × IQR beyond the box, and extreme outliers lie more than 3 × IQR beyond the box]
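The outlier rules in parts (c) and (d) of Exercise 13 can be sketched in Python. This is a minimal illustration, not a solution to the exercise: the quartiles and data values below are hypothetical, and the function name `classify_outliers` is an assumption. The sketch measures how far each value lies beyond the box and compares that distance with 1.5 × IQR and 3 × IQR.

```python
def classify_outliers(values, q1, q3):
    """Return (iqr, mild, extreme) following the modified-boxplot rules.

    mild:    more than 1.5 * IQR but not more than 3 * IQR beyond the box
    extreme: more than 3 * IQR beyond the box
    """
    iqr = q3 - q1
    mild, extreme = [], []
    for x in values:
        # Distance beyond the box; zero for values between Q1 and Q3.
        if x < q1:
            distance = q1 - x
        elif x > q3:
            distance = x - q3
        else:
            distance = 0
        if distance > 3 * iqr:
            extreme.append(x)
        elif distance > 1.5 * iqr:
            mild.append(x)
    return iqr, mild, extreme


# Hypothetical quartiles and data (not the cotinine values).
data = [-25, -6, 0, 12, 15, 34, 36, 50, 51]
iqr, mild, extreme = classify_outliers(data, q1=10, q3=20)
print("IQR:", iqr)
print("mild outliers:", mild)
print("extreme outliers:", extreme)
```

In this hypothetical case the value 50 lies exactly 3 × IQR above Q3, so it counts as a mild outlier rather than an extreme one, consistent with the "not greater than 3 × IQR" wording in part (c).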
14. Refer to the accompanying STATDISK display of three boxplots that represent the measured longevity (in months) of samples of three different car batteries. If you are the manager of a fleet of cars and you must select one of the three brands, which boxplot represents the brand you should choose? Why?

[STATDISK display: three boxplots of battery longevity (in months)]

Review
In this chapter we considered methods for describing, exploring, and comparing data sets. When investigating a data set, these characteristics are generally very important:
1. Center: A representative or average value.
2. Variation: A measure of the amount that the values vary.
3. Distribution: The nature or shape of the distribution of the data (such as bell-shaped, uniform, or skewed).
4. Outliers: Sample values that lie very far away from the vast majority of the other sample values.
5. Time: Changing characteristics of the data over time.

After completing this chapter you should be able to do the following:
- Summarize data by constructing a frequency distribution or relative frequency distribution (Section 2-2)
- Visually display the nature of the distribution by constructing a histogram, dotplot, stem-and-leaf plot, pie chart, or Pareto chart (Section 2-3)
- Calculate measures of center by finding the mean, median, mode, and midrange (Section 2-4)
- Calculate measures of variation by finding the standard deviation, variance, and range (Section 2-5)
- Compare individual values by using z scores, quartiles, or percentiles (Section 2-6)
- Investigate and explore the spread of data, the center of the data, and the range of values by constructing a boxplot (Section 2-7)
More informationData Analysis, Statistics, and Probability
Chapter 6 Data Analysis, Statistics, and Probability Content Strand Description Questions in this content strand assessed students skills in collecting, organizing, reading, representing, and interpreting
More informationModule 4: Data Exploration
Module 4: Data Exploration Now that you have your data downloaded from the Streams Project database, the detective work can begin! Before computing any advanced statistics, we will first use descriptive
More informationBNG 202 Biomechanics Lab. Descriptive statistics and probability distributions I
BNG 202 Biomechanics Lab Descriptive statistics and probability distributions I Overview The overall goal of this short course in statistics is to provide an introduction to descriptive and inferential
More informationSTAT355 - Probability & Statistics
STAT355 - Probability & Statistics Instructor: Kofi Placid Adragni Fall 2011 Chap 1 - Overview and Descriptive Statistics 1.1 Populations, Samples, and Processes 1.2 Pictorial and Tabular Methods in Descriptive
More informationSummarizing and Displaying Categorical Data
Summarizing and Displaying Categorical Data Categorical data can be summarized in a frequency distribution which counts the number of cases, or frequency, that fall into each category, or a relative frequency
More informationIntroduction to Environmental Statistics. The Big Picture. Populations and Samples. Sample Data. Examples of sample data
A Few Sources for Data Examples Used Introduction to Environmental Statistics Professor Jessica Utts University of California, Irvine jutts@uci.edu 1. Statistical Methods in Water Resources by D.R. Helsel
More informationChapter 1: Looking at Data Section 1.1: Displaying Distributions with Graphs
Types of Variables Chapter 1: Looking at Data Section 1.1: Displaying Distributions with Graphs Quantitative (numerical)variables: take numerical values for which arithmetic operations make sense (addition/averaging)
More informationCommon Tools for Displaying and Communicating Data for Process Improvement
Common Tools for Displaying and Communicating Data for Process Improvement Packet includes: Tool Use Page # Box and Whisker Plot Check Sheet Control Chart Histogram Pareto Diagram Run Chart Scatter Plot
More informationThe Big Picture. Describing Data: Categorical and Quantitative Variables Population. Descriptive Statistics. Community Coalitions (n = 175)
Describing Data: Categorical and Quantitative Variables Population The Big Picture Sampling Statistical Inference Sample Exploratory Data Analysis Descriptive Statistics In order to make sense of data,
More information4 Other useful features on the course web page. 5 Accessing SAS
1 Using SAS outside of ITCs Statistical Methods and Computing, 22S:30/105 Instructor: Cowles Lab 1 Jan 31, 2014 You can access SAS from off campus by using the ITC Virtual Desktop Go to https://virtualdesktopuiowaedu
More informationIntroduction to Statistics for Psychology. Quantitative Methods for Human Sciences
Introduction to Statistics for Psychology and Quantitative Methods for Human Sciences Jonathan Marchini Course Information There is website devoted to the course at http://www.stats.ox.ac.uk/ marchini/phs.html
More informationChapter 3. The Normal Distribution
Chapter 3. The Normal Distribution Topics covered in this chapter: Z-scores Normal Probabilities Normal Percentiles Z-scores Example 3.6: The standard normal table The Problem: What proportion of observations
More informationHow To Check For Differences In The One Way Anova
MINITAB ASSISTANT WHITE PAPER This paper explains the research conducted by Minitab statisticians to develop the methods and data checks used in the Assistant in Minitab 17 Statistical Software. One-Way
More informationEngineering Problem Solving and Excel. EGN 1006 Introduction to Engineering
Engineering Problem Solving and Excel EGN 1006 Introduction to Engineering Mathematical Solution Procedures Commonly Used in Engineering Analysis Data Analysis Techniques (Statistics) Curve Fitting techniques
More informationconsider the number of math classes taken by math 150 students. how can we represent the results in one number?
ch 3: numerically summarizing data - center, spread, shape 3.1 measure of central tendency or, give me one number that represents all the data consider the number of math classes taken by math 150 students.
More informationMBA 611 STATISTICS AND QUANTITATIVE METHODS
MBA 611 STATISTICS AND QUANTITATIVE METHODS Part I. Review of Basic Statistics (Chapters 1-11) A. Introduction (Chapter 1) Uncertainty: Decisions are often based on incomplete information from uncertain
More informationGrade 8 Classroom Assessments Based on State Standards (CABS)
Grade 8 Classroom Assessments Based on State Standards (CABS) A. Mathematical Processes and E. Statistics and Probability (From the WKCE-CRT Mathematics Assessment Framework, Beginning of Grade 10) A.
More informationIntroduction to Exploratory Data Analysis
Introduction to Exploratory Data Analysis A SpaceStat Software Tutorial Copyright 2013, BioMedware, Inc. (www.biomedware.com). All rights reserved. SpaceStat and BioMedware are trademarks of BioMedware,
More informationCell Phone Impairment?
Cell Phone Impairment? Overview of Lesson This lesson is based upon data collected by researchers at the University of Utah (Strayer and Johnston, 2001). The researchers asked student volunteers (subjects)
More informationHISTOGRAMS, CUMULATIVE FREQUENCY AND BOX PLOTS
Mathematics Revision Guides Histograms, Cumulative Frequency and Box Plots Page 1 of 25 M.K. HOME TUITION Mathematics Revision Guides Level: GCSE Higher Tier HISTOGRAMS, CUMULATIVE FREQUENCY AND BOX PLOTS
More informationHow To Write A Data Analysis
Mathematics Probability and Statistics Curriculum Guide Revised 2010 This page is intentionally left blank. Introduction The Mathematics Curriculum Guide serves as a guide for teachers when planning instruction
More informationUsing Excel for descriptive statistics
FACT SHEET Using Excel for descriptive statistics Introduction Biologists no longer routinely plot graphs by hand or rely on calculators to carry out difficult and tedious statistical calculations. These
More informationSta 309 (Statistics And Probability for Engineers)
Instructor: Prof. Mike Nasab Sta 309 (Statistics And Probability for Engineers) Chapter 2 Organizing and Summarizing Data Raw Data: When data are collected in original form, they are called raw data. The
More informationThursday, November 13: 6.1 Discrete Random Variables
Thursday, November 13: 6.1 Discrete Random Variables Read 347 350 What is a random variable? Give some examples. What is a probability distribution? What is a discrete random variable? Give some examples.
More informationGeoGebra Statistics and Probability
GeoGebra Statistics and Probability Project Maths Development Team 2013 www.projectmaths.ie Page 1 of 24 Index Activity Topic Page 1 Introduction GeoGebra Statistics 3 2 To calculate the Sum, Mean, Count,
More informationStatistics Chapter 2
Statistics Chapter 2 Frequency Tables A frequency table organizes quantitative data. partitions data into classes (intervals). shows how many data values are in each class. Test Score Number of Students
More information2. Here is a small part of a data set that describes the fuel economy (in miles per gallon) of 2006 model motor vehicles.
Math 1530-017 Exam 1 February 19, 2009 Name Student Number E There are five possible responses to each of the following multiple choice questions. There is only on BEST answer. Be sure to read all possible
More informationNew Zealand Crash Statistics Mathematics and Statistics 91582 (3.10) version 1: Use statistical methods to make a formal inference Credits: 4
New Zealand Crash Statistics Mathematics and Statistics 91582 (3.10) version 1: Use statistical methods to make a formal inference Credits: 4 Teacher guidelines Context/setting This activity requires students
More informationDESCRIPTIVE STATISTICS - CHAPTERS 1 & 2 1
DESCRIPTIVE STATISTICS - CHAPTERS 1 & 2 1 OVERVIEW STATISTICS PANIK...THE THEORY AND METHODS OF COLLECTING, ORGANIZING, PRESENTING, ANALYZING, AND INTERPRETING DATA SETS SO AS TO DETERMINE THEIR ESSENTIAL
More informationGestation Period as a function of Lifespan
This document will show a number of tricks that can be done in Minitab to make attractive graphs. We work first with the file X:\SOR\24\M\ANIMALS.MTP. This first picture was obtained through Graph Plot.
More informationGETTING YOUR DATA INTO SPSS
GETTING YOUR DATA INTO SPSS UNIVERSITY OF GUELPH LUCIA COSTANZO lcostanz@uoguelph.ca REVISED SEPTEMBER 2011 CONTENTS Getting your Data into SPSS... 0 SPSS availability... 3 Data for SPSS Sessions... 4
More informationFinal Exam Practice Problem Answers
Final Exam Practice Problem Answers The following data set consists of data gathered from 77 popular breakfast cereals. The variables in the data set are as follows: Brand: The brand name of the cereal
More informationSPSS Workbook 1 Data Entry : Questionnaire Data
TEESSIDE UNIVERSITY SCHOOL OF HEALTH & SOCIAL CARE SPSS Workbook 1 Data Entry : Questionnaire Data Prepared by: Sylvia Storey s.storey@tees.ac.uk SPSS data entry 1 This workbook is designed to introduce
More informationIBM SPSS Statistics for Beginners for Windows
ISS, NEWCASTLE UNIVERSITY IBM SPSS Statistics for Beginners for Windows A Training Manual for Beginners Dr. S. T. Kometa A Training Manual for Beginners Contents 1 Aims and Objectives... 3 1.1 Learning
More informationGetting started in Excel
Getting started in Excel Disclaimer: This guide is not complete. It is rather a chronicle of my attempts to start using Excel for data analysis. As I use a Mac with OS X, these directions may need to be
More informationMEASURES OF VARIATION
NORMAL DISTRIBTIONS MEASURES OF VARIATION In statistics, it is important to measure the spread of data. A simple way to measure spread is to find the range. But statisticians want to know if the data are
More informationCh. 3.1 # 3, 4, 7, 30, 31, 32
Math Elementary Statistics: A Brief Version, 5/e Bluman Ch. 3. # 3, 4,, 30, 3, 3 Find (a) the mean, (b) the median, (c) the mode, and (d) the midrange. 3) High Temperatures The reported high temperatures
More information2. Filling Data Gaps, Data validation & Descriptive Statistics
2. Filling Data Gaps, Data validation & Descriptive Statistics Dr. Prasad Modak Background Data collected from field may suffer from these problems Data may contain gaps ( = no readings during this period)
More informationFirst Midterm Exam (MATH1070 Spring 2012)
First Midterm Exam (MATH1070 Spring 2012) Instructions: This is a one hour exam. You can use a notecard. Calculators are allowed, but other electronics are prohibited. 1. [40pts] Multiple Choice Problems
More informationPractice#1(chapter1,2) Name
Practice#1(chapter1,2) Name Solve the problem. 1) The average age of the students in a statistics class is 22 years. Does this statement describe descriptive or inferential statistics? A) inferential statistics
More informationBox-and-Whisker Plots
Learning Standards HSS-ID.A. HSS-ID.A.3 3 9 23 62 3 COMMON CORE.2 Numbers of First Cousins 0 3 9 3 45 24 8 0 3 3 6 8 32 8 0 5 4 Box-and-Whisker Plots Essential Question How can you use a box-and-whisker
More informationKey Concept. Density Curve
MAT 155 Statistical Analysis Dr. Claude Moore Cape Fear Community College Chapter 6 Normal Probability Distributions 6 1 Review and Preview 6 2 The Standard Normal Distribution 6 3 Applications of Normal
More informationUNIVERSITY OF TORONTO SCARBOROUGH Department of Computer and Mathematical Sciences Midterm Test March 2014
UNIVERSITY OF TORONTO SCARBOROUGH Department of Computer and Mathematical Sciences Midterm Test March 2014 STAB22H3 Statistics I Duration: 1 hour and 45 minutes Last Name: First Name: Student number: Aids
More informationPaper 232-2012. Getting to the Good Part of Data Analysis: Data Access, Manipulation, and Customization Using JMP
Paper 232-2012 Getting to the Good Part of Data Analysis: Data Access, Manipulation, and Customization Using JMP Audrey Ventura, SAS Institute Inc., Cary, NC ABSTRACT Effective data analysis requires easy
More informationDescriptive Statistics and Exploratory Data Analysis
Descriptive Statistics and Exploratory Data Analysis Dean s s Faculty and Resident Development Series UT College of Medicine Chattanooga Probasco Auditorium at Erlanger January 14, 2008 Marc Loizeaux,
More informationCONTENTS. Chapter 1...1. Chapter 2...9. Chapter 3... 29. Chapter 4... 45. Chapter 5... 59. Chapter 6... 73. Chapter 7... 101. Chapter 8...
CONTENTS Chapter 1...1 Chapter...9 Chapter 3... 9 Chapter 4... 45 Chapter 5... 59 Chapter 6... 73 Chapter 7... 101 Chapter 8... 117 Chapter 9... 139 Chapter 10... 159 Chapter 11... 199 Chapter 1... 11
More informationWhy Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012
Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization GENOME 560, Spring 2012 Data are interesting because they help us understand the world Genomics: Massive Amounts
More informationAP Statistics Solutions to Packet 2
AP Statistics Solutions to Packet 2 The Normal Distributions Density Curves and the Normal Distribution Standard Normal Calculations HW #9 1, 2, 4, 6-8 2.1 DENSITY CURVES (a) Sketch a density curve that
More informationMULTIPLE CHOICE. Choose the one alternative that best completes the statement or answers the question.
Exam Name MULTIPLE CHOICE. Choose the one alternative that best completes the statement or answers the question. 1) The government of a town needs to determine if the city's residents will support the
More informationSolutions to Homework 3 Statistics 302 Professor Larget
s to Homework 3 Statistics 302 Professor Larget Textbook Exercises 3.20 Customized Home Pages A random sample of n = 1675 Internet users in the US in January 2010 found that 469 of them have customized
More informationData exploration with Microsoft Excel: univariate analysis
Data exploration with Microsoft Excel: univariate analysis Contents 1 Introduction... 1 2 Exploring a variable s frequency distribution... 2 3 Calculating measures of central tendency... 16 4 Calculating
More informationBox-and-Whisker Plots
Mathematics Box-and-Whisker Plots About this Lesson This is a foundational lesson for box-and-whisker plots (boxplots), a graphical tool used throughout statistics for displaying data. During the lesson,
More information+ Chapter 1 Exploring Data
Chapter 1 Exploring Data Introduction: Data Analysis: Making Sense of Data 1.1 Analyzing Categorical Data 1.2 Displaying Quantitative Data with Graphs 1.3 Describing Quantitative Data with Numbers Introduction
More information3. There are three senior citizens in a room, ages 68, 70, and 72. If a seventy-year-old person enters the room, the
TMTA Statistics Exam 2011 1. Last month, the mean and standard deviation of the paychecks of 10 employees of a small company were $1250 and $150, respectively. This month, each one of the 10 employees
More informationBar Graphs and Dot Plots
CONDENSED L E S S O N 1.1 Bar Graphs and Dot Plots In this lesson you will interpret and create a variety of graphs find some summary values for a data set draw conclusions about a data set based on graphs
More informationBASIC STATISTICAL METHODS FOR GENOMIC DATA ANALYSIS
BASIC STATISTICAL METHODS FOR GENOMIC DATA ANALYSIS SEEMA JAGGI Indian Agricultural Statistics Research Institute Library Avenue, New Delhi-110 012 seema@iasri.res.in Genomics A genome is an organism s
More informationADD-INS: ENHANCING EXCEL
CHAPTER 9 ADD-INS: ENHANCING EXCEL This chapter discusses the following topics: WHAT CAN AN ADD-IN DO? WHY USE AN ADD-IN (AND NOT JUST EXCEL MACROS/PROGRAMS)? ADD INS INSTALLED WITH EXCEL OTHER ADD-INS
More informationWalk the Line Written by: Maryann Huey Drake University Maryann.Huey@drake.edu
Walk the Line Written by: Maryann Huey Drake University Maryann.Huey@drake.edu Overview of Lesson In this activity, students will conduct an investigation to collect data to determine how far students
More informationWhen to use Excel. When NOT to use Excel 9/24/2014
Analyzing Quantitative Assessment Data with Excel October 2, 2014 Jeremy Penn, Ph.D. Director When to use Excel You want to quickly summarize or analyze your assessment data You want to create basic visual
More informationThe Normal Distribution
Chapter 6 The Normal Distribution 6.1 The Normal Distribution 1 6.1.1 Student Learning Objectives By the end of this chapter, the student should be able to: Recognize the normal probability distribution
More informationEXAM #1 (Example) Instructor: Ela Jackiewicz. Relax and good luck!
STP 231 EXAM #1 (Example) Instructor: Ela Jackiewicz Honor Statement: I have neither given nor received information regarding this exam, and I will not do so until all exams have been graded and returned.
More informationDemographics of Atlanta, Georgia:
Demographics of Atlanta, Georgia: A Visual Analysis of the 2000 and 2010 Census Data 36-315 Final Project Rachel Cohen, Kathryn McKeough, Minnar Xie & David Zimmerman Ethnicities of Atlanta Figure 1: From
More informationIBM SPSS Statistics 20 Part 4: Chi-Square and ANOVA
CALIFORNIA STATE UNIVERSITY, LOS ANGELES INFORMATION TECHNOLOGY SERVICES IBM SPSS Statistics 20 Part 4: Chi-Square and ANOVA Summer 2013, Version 2.0 Table of Contents Introduction...2 Downloading the
More information