Chapter 2: Exploring Data with Graphs and Numerical Summaries

Graphical Measures: Graphs are used to describe the shape of a data set.


Section 1: Types of Variables

In general, a variable can be classified as categorical (qualitative) or quantitative.

A categorical variable measures an attribute; it takes values from a set of categories. Examples are gender, make of automobile, and eye color.

A quantitative variable represents the amount or quantity of something. It can be discrete or continuous.

A discrete quantitative variable takes values that are countable, such as 0, 1, 2, 3, ... Example: the number of female students at APSU.

A continuous quantitative variable takes values from an interval, such as [0, 5]. Examples: heights of students in a class, air temperature.

Example: Questions 2.3 and 2.4, page 29.

The proportion of a category (also called the relative frequency of the category) is

$$\text{proportion} = \frac{\text{frequency of the category}}{\text{total number of observations}}$$

The percentage of a category is the proportion × 100%.

A frequency table is a list of the possible values of a variable together with the frequency of each value.

Example: In a survey concerning public education, 400 school administrators were asked to rate the quality of education in the US. Their responses are summarized in the following table.

Rating   Frequency
A        35
B        260
C        93
D        12

Based on these data, answer the following questions:
a) What is the proportion of administrators who gave the rating A?
b) What is the percentage of administrators who gave the rating A?
c) Which rating received the highest percentage? What is this percentage? What can you say about the quality of education in the US based on these data?
d) Construct a relative frequency table for the data.
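As a rough illustration of part d), the relative frequency table could be built with a few lines of Python. This is only a sketch: the notes do not prescribe any software, and the variable names are chosen here for illustration.

```python
# Frequencies from the school-rating survey in the example above.
frequencies = {"A": 35, "B": 260, "C": 93, "D": 12}

total = sum(frequencies.values())  # total number of observations

# proportion = frequency / total; percentage = proportion * 100
proportions = {rating: count / total for rating, count in frequencies.items()}

print(f"{'Rating':<8}{'Frequency':>10}{'Proportion':>12}{'Percent':>10}")
for rating, count in frequencies.items():
    p = proportions[rating]
    print(f"{rating:<8}{count:>10}{p:>12.4f}{p * 100:>9.2f}%")
```

The proportions in a relative frequency table should add up to 1, which is a quick check on the arithmetic.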

Section 2: Graphs

Graphs for categorical variables: We use a pie chart or a bar chart to describe categorical data.

Example: For the previous data:
a) Construct a pie chart to describe the data.
b) Construct a bar chart to describe the data.

Graphs for quantitative variables: We use dot plots, stem-and-leaf plots, or histograms.

The distribution of a set of data is a graph, table, or mathematical formula that indicates the different kinds of possible observations and how often they occur. Distributions of quantitative data have shape, and the shape of a distribution can be determined by looking at dot plots, stem-and-leaf plots, or histograms.

Example: Construct a dot plot for the following data: 1, 2, 3, 2, 2, 4, 2, 3, 5, 9, 5, 5, 5, 1

Discussion (shapes of distributions):

[Figure: three dot plots, titled "Dotplot of data 1", "Dotplot of data 2", and "Dotplot of data 3"]
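For the dot plot example above (the data 1, 2, 3, 2, 2, 4, ...), a text-only version can be sketched in Python. The horizontal layout and the helper name `dot_plot_rows` are assumptions made for this illustration; a hand-drawn dot plot would normally stack the dots vertically above a number line.

```python
from collections import Counter

data = [1, 2, 3, 2, 2, 4, 2, 3, 5, 9, 5, 5, 5, 1]
counts = Counter(data)  # how many times each value occurs


def dot_plot_rows(counts):
    """Return one text row per integer from min to max, one dot per observation."""
    lo, hi = min(counts), max(counts)
    return [f"{value:>2} | " + "." * counts.get(value, 0)
            for value in range(lo, hi + 1)]


for row in dot_plot_rows(counts):
    print(row)
```

Values with no observations (6, 7, and 8 here) still get a row, so the gap before the value 9 stays visible.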

Stem and Leaf

To construct a stem-and-leaf display:
- Partition each observation into a stem and a leaf. Usually the stem consists of all the digits except the final one, which is the leaf.
- Order the stems from smallest to largest in a column. Ensure that all stems in the data range are included.
- Record the leaf for each observation in the row corresponding to its stem. The leaves should be ordered.

Example: The following data represent the prices of 19 different brands of walking shoes. Construct a stem-and-leaf plot to display the distribution of the data.

90 70 70 75 70 65 68 60 74 70 95 75 70 68 65 40 65 70

Histogram
- Divide the range of the data into intervals of equal width. For a discrete variable, use the actual values.
- Use the frequency table to construct the histogram.
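The stem-and-leaf steps above can be sketched in Python as follows. This is an illustrative sketch only: it assumes non-negative whole-number data, uses the 18 prices actually listed in the example, and the function name `stem_and_leaf` is made up for this purpose.

```python
from collections import defaultdict

prices = [90, 70, 70, 75, 70, 65, 68, 60, 74, 70, 95, 75, 70, 68, 65, 40, 65, 70]


def stem_and_leaf(values):
    """Map each stem (all digits but the last) to its ordered leaves (last digit)."""
    leaves = defaultdict(list)
    for v in sorted(values):           # sorting first keeps the leaves ordered
        stem, leaf = divmod(v, 10)
        leaves[stem].append(leaf)
    # Include every stem in the data range, even stems with no leaves.
    return {s: leaves.get(s, []) for s in range(min(leaves), max(leaves) + 1)}


display = stem_and_leaf(prices)
for stem, leaf_list in display.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaf_list)}")
```

Keeping the empty stems 5 and 8 in the display is what lets the reader see the gaps in the price distribution.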

Basic Shapes

[Figure: example distributions labeled right skewed, symmetric, and left skewed]

Example: The following data represent the number of quarts of milk purchased during a particular week. Construct a histogram to describe the distribution of the data.

0 3 5 4 3 2 1 3 1 2 1 1 2 0 1 4 3 2 2 2 2 2 2 3 4

Example: The following data represent the GPAs of 30 Bucknell University freshmen, recorded at the end of the freshman year. Construct a histogram to display the distribution of the data (choose width = 0.3 for each interval). Describe the shape of the distribution of the data.

2.0 3.1 1.9 2.5 1.9 2.3 2.6 3.1 2.5 2.1 2.9 3.0 2.7 2.5 2.4 2.7 2.5 2.4 3.0 3.4 2.6 2.8 2.5 2.7 2.9 2.7 2.8 2.2 2.7 2.1
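Because the milk data are discrete, the histogram can use the actual values as its bars. The following Python sketch builds the frequency table and prints a rough text histogram; the layout and variable names are illustrative assumptions, not part of the notes.

```python
from collections import Counter

quarts = [0, 3, 5, 4, 3, 2, 1, 3, 1, 2, 1, 1, 2, 0, 1, 4,
          3, 2, 2, 2, 2, 2, 2, 3, 4]

frequency = Counter(quarts)  # frequency table: value -> count

# One bar per actual value, since the variable is discrete.
for value in range(min(quarts), max(quarts) + 1):
    print(f"{value} | {'#' * frequency[value]} ({frequency[value]})")
```

The longest bar sits at 2 quarts, with the bars tapering off toward 5, which is the kind of shape the discussion of skewness refers to.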

Example: Discussion (skewness, mode, unimodal, bimodal, spread, outliers)

Section 3: Measures of Center

Summation Notation

Given numbers $x_1, x_2, x_3, \ldots, x_n$, we can express their sum $x_1 + x_2 + x_3 + x_4 + \cdots + x_n$ as

$$\sum_{i=1}^{n} x_i$$

Example:

Mean, Median, Mode

A measure of center is a one-number description of a distribution or data set, and we focus on three.

Mean or average: the sum of the numbers divided by how many numbers there are:

$$\text{Mean} = \frac{\sum_{i=1}^{n} x_i}{n}$$

Median or 50th percentile: a number that separates the lower 50% and the upper 50% of the numbers.

Mode: the number that occurs most frequently in the set. There can be more than one mode.

Example: In the 2002 Winter Olympics, figure skater Michelle Kwan competed in the ladies' singles short program. She received the following scores for technical merit:

5.8 5.7 5.9 5.7 5.5 5.7 5.7 5.7 5.6

Find the mean, median, and mode. Throw out the score of 5.5 and again find the mean, median, and mode.
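One way to check the three measures of center for the skating scores is with Python's standard `statistics` module. This is a sketch for checking hand calculations; the helper name `summarize` is introduced here for illustration, and `multimode` requires Python 3.8 or later.

```python
from statistics import mean, median, multimode

scores = [5.8, 5.7, 5.9, 5.7, 5.5, 5.7, 5.7, 5.7, 5.6]


def summarize(values):
    """Return (mean, median, list of modes) for a list of numbers."""
    return mean(values), median(values), multimode(values)


full = summarize(scores)
without_low = summarize([s for s in scores if s != 5.5])  # drop the 5.5 score

print("all scores:  mean=%.3f median=%.1f modes=%s" % full)
print("without 5.5: mean=%.3f median=%.1f modes=%s" % without_low)
```

Dropping the lowest score moves the mean slightly but leaves the median and mode where they were.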

Remark (properties): Resistant measures are not sensitive to extreme data. The median is resistant; the mean is not.

Example: Compare the mean and median of the salaries

$13,000 $32,000 $45,000

with the mean and median of the salaries

$13,000 $32,000 $250,000

Population Mean and Sample Mean

The population mean $\mu$ (pronounced "myoo") of a population of size $N$ is the average of all values $x_1, x_2, \ldots, x_N$ in the population:

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$

The sample mean $\bar{x}$ (pronounced "x-bar") of a sample of size $n$ is the average of all values $x_1, x_2, \ldots, x_n$ in the sample:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
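The remark about resistance can be seen numerically with the two salary lists from the example above. The sketch below uses Python's `statistics` module; the list names are illustrative.

```python
from statistics import mean, median

typical = [13_000, 32_000, 45_000]
with_extreme = [13_000, 32_000, 250_000]  # same list with one extreme salary

for label, salaries in [("typical", typical), ("with extreme", with_extreme)]:
    print(f"{label:>12}: mean = {mean(salaries):>10.2f}, median = {median(salaries)}")
```

Replacing one salary with an extreme value more than triples the mean, while the median does not change at all.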

Example: The ages in years of all seven MATH 4270 students are

26 22 24 21 23 32 24

Find the population mean of the students' ages.

A random sample of size three was taken from the class. The random sample was 21 23 32. Find the sample mean of the three students' ages.

Example:

Section 4: Measures of Spread

Measures of spread summarize how far the data are spread out. We focus on the following measures:

Standard deviation: used when the mean is the measure of center. It is the most important measure of spread.

Range: the largest value minus the smallest value.

Interquartile range: used when the median is the measure of center (we will discuss it in Section 5).

Other Important Sums (leading up to measuring the spread)

Sum of distances from the mean:

$$\sum_{i=1}^{n} (x_i - \bar{x})$$

Sum of squared distances from the mean:

$$\sum_{i=1}^{n} (x_i - \bar{x})^2$$

Example: History exam scores for four students are 91 95 92 76. Fill in the following table.

x_i                  91    95    92    76
x_i - x̄
(x_i - x̄)^2

Population and Sample Standard Deviation

The standard deviation measures the variation in a data set by indicating roughly how far, on average, each number is from the mean.

Population standard deviation $\sigma$:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$

Sample standard deviation $s$:

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
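The table for the history exam scores and the two standard deviation formulas can be worked through in Python as a check on hand calculation. This is a sketch with illustrative variable names. It computes both the population and the sample version from the same four scores only to show the difference between dividing by N and by n - 1; when the scores are treated as the whole population, their mean plays the role of µ.

```python
from math import sqrt
from statistics import mean, pstdev, stdev

scores = [91, 95, 92, 76]
x_bar = mean(scores)

deviations = [x - x_bar for x in scores]   # x_i - x_bar
squared = [d ** 2 for d in deviations]     # (x_i - x_bar)^2

print("x_i         :", scores)
print("x_i - x_bar :", deviations)
print("squared     :", squared)
print("sum of deviations:", sum(deviations))
print("sum of squares   :", sum(squared))

sigma = sqrt(sum(squared) / len(scores))      # divide by N (population formula)
s = sqrt(sum(squared) / (len(scores) - 1))    # divide by n - 1 (sample formula)
print(f"sigma = {sigma:.4f}, s = {s:.4f}")
```

The deviations always sum to zero, which is why the squared deviations are used to measure spread.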

Example: The ages in years for a sample of three MATH 4270 students are 21 23 32. Find the sample standard deviation and the range.

Remarks about Standard Deviation

The more variation among the data in a sample, the larger the standard deviation.

Like the mean, the standard deviation is not resistant, because its value is affected by extreme data points.

Empirical Rule: For bell-shaped distributions,
- about 68.27% of all possible observations lie within one σ of µ.
- about 95.45% of all possible observations lie within two σ of µ.
- about 99.73% of all possible observations lie within three σ of µ.
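The sketch below checks the example above and the Empirical Rule percentages in Python. It assumes that "bell-shaped" means an exactly normal distribution, which is where the quoted percentages come from; `NormalDist` is part of the standard `statistics` module in Python 3.8 and later.

```python
from statistics import NormalDist, stdev

# Sample standard deviation and range for the three sampled ages.
ages = [21, 23, 32]
s = stdev(ages)                      # divides by n - 1
data_range = max(ages) - min(ages)   # largest minus smallest
print(f"s = {s:.4f}, range = {data_range}")

# Share of a normal distribution within k standard deviations of the mean.
standard_normal = NormalDist(mu=0, sigma=1)
within = {k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)}
for k, share in within.items():
    print(f"within {k} sigma: {share * 100:.2f}%")
```

For real data that are only roughly bell-shaped, these percentages are approximations rather than exact values.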

Section 5: Five-Number Summary, Boxplots, z-scores

The 1st quartile $Q_1$ is the 25th percentile, and it is the median of the lower half of the data. That is, about 25% of the data lie below $Q_1$.

The 2nd quartile $Q_2$ is the 50th percentile, and it is the median of the data. That is, about 50% of the data lie below $Q_2$.

The 3rd quartile $Q_3$ is the 75th percentile, and it is the median of the upper half of the data. That is, about 75% of the data lie below $Q_3$.

Example: Eleven students report their exam scores as:

11 12 10 18 19 15 20 15 20 17 8

Find the quartiles.

Example: Sixteen people reportedly watched the following numbers of hours of TV weekly:

8 22 34 16 13 26 19 23 25 31 34 30 31 20 22 41

Find the quartiles.

The interquartile range (IQR) is computed as $\text{IQR} = Q_3 - Q_1$ (this is our third measure of spread).

The IQR is not sensitive to extreme values and is therefore a resistant measure of spread. The IQR is used as the measure of spread when the median is the measure of center.

Example: Compute the range and IQR for the data in the previous example.
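The quartile definition above ("median of the lower half" and "median of the upper half") can be sketched in Python. One assumption is made loudly here: when the number of observations is odd, the overall median is left out of both halves. The notes do not say how to handle that case, and other conventions (including the default in many software packages) give slightly different quartiles.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves rule.

    Assumption: for an odd number of values, the middle value is excluded
    from both halves. Requires at least two values.
    """
    xs = sorted(values)
    half = len(xs) // 2
    lower = xs[:half]              # values below the median position
    upper = xs[len(xs) - half:]    # values above the median position
    return median(lower), median(xs), median(upper)


exam = [11, 12, 10, 18, 19, 15, 20, 15, 20, 17, 8]
tv = [8, 22, 34, 16, 13, 26, 19, 23, 25, 31, 34, 30, 31, 20, 22, 41]

for name, data in [("exam scores", exam), ("TV hours", tv)]:
    q1, q2, q3 = quartiles(data)
    print(f"{name}: Q1={q1}, Q2={q2}, Q3={q3}, "
          f"IQR={q3 - q1}, range={max(data) - min(data)}")
```

Sorting the data first is the step most often forgotten when finding quartiles by hand.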

The five-number summary of a data set consists of the min, $Q_1$, $Q_2$, $Q_3$, and max.

Example: Compute the five-number summary for the following 100-meter race times (in seconds):

10.69 11.11 11.18 12.44 10.76 10.88 10.64 7.10 8.44 10.47

Outliers

Outlier(s): data value(s) that is (are) far from most of the data.

Lower limit: Q1 - (1.5)(IQR) (sometimes called the inner fence lower limit)

Upper limit: Q3 + (1.5)(IQR) (sometimes called the inner fence upper limit)

Data greater than the upper limit or less than the lower limit are potential outliers.

Examples: Human heights of 9'. Miles-per-gallon rates greater than 95.

Example: Use the lower limit and upper limit to identify any potential outliers in the previous example.

Boxplots

To create a boxplot, do the following:
- Determine the five-number summary.
- Compute the lower and upper limits (inner fences).
- Mark and label the quartiles with vertical lines and box them in.
- Indicate all potential outliers with * and label them.
- Mark and label the smallest and largest values occurring within the lower and upper limits with vertical lines, and connect these lines to the box (these values are called adjacent values).
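The numbers needed for the boxplot steps above can be computed with a short Python sketch. The function names are illustrative, and the quartiles use the median-of-halves rule with the middle value excluded when the count is odd, which is an assumption since the notes do not fix a convention. Drawing the plot itself is left out.

```python
from statistics import median


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    xs = sorted(values)
    half = len(xs) // 2
    q1 = median(xs[:half])
    q3 = median(xs[len(xs) - half:])
    return xs[0], q1, median(xs), q3, xs[-1]


def boxplot_parts(values, k=1.5):
    """Return the fences, potential outliers, and adjacent values."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    lower_limit = q1 - k * iqr
    upper_limit = q3 + k * iqr
    inside = [x for x in values if lower_limit <= x <= upper_limit]
    outliers = sorted(x for x in values if x < lower_limit or x > upper_limit)
    return {
        "limits": (lower_limit, upper_limit),
        "outliers": outliers,
        "adjacent": (min(inside), max(inside)),  # whisker end points
    }


times = [10.69, 11.11, 11.18, 12.44, 10.76, 10.88, 10.64, 7.10, 8.44, 10.47]
summary = five_number_summary(times)
parts = boxplot_parts(times)
print("five-number summary:", summary)
print("limits:", parts["limits"])
print("potential outliers:", parts["outliers"])
print("adjacent values:", parts["adjacent"])
```

The whiskers stop at the adjacent values rather than at the min and max, so flagged points appear as separate marks beyond the whiskers.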

Example: The following data represent the ages of a group of people. Make a boxplot for the data and identify any outliers.

25 22 26 23 27 26 28 18 25 24 12

Advantages of Boxplots:
- Graphically display the shape of the distribution of the data.
- Show the potential outliers in the data.
- Graphically display the spread of the data.

Example:
a) What is the shape of the distribution?
b) Approximate each component of the five-number summary, and interpret them.

z-scores (Standardized Data)

Data can be standardized so that different data sets can be compared, or to compare values within the same data set.

Example: The average height of men is 69 inches with a standard deviation of 2.8 inches. The average height of women is 63.6 inches with a standard deviation of 2.5 inches. Michael Jordan is 78 inches tall. Rebecca Lobo is 76 inches tall. Relatively speaking, who is taller?

Jordan's and Lobo's heights should be standardized relative to those of their genders so their heights can be compared.

If $x$ is a variable, then

$$z = \frac{x - \text{mean}}{\text{standard deviation}}$$

is the standardized version or z-score of $x$. Calculate the z-scores of Jordan's and Lobo's heights.

Facts About z-scores

The mean of the z-scores of a population is always 0.

The standard deviation of the z-scores of a population is always 1.

Most z-scores will fall between -3 and 3.

If the z-score of x is less than -3 or greater than 3, then x is considered an outlier.

z-scores never have units!

Example: Body temperatures of healthy human children have mean = 98.60 °F and standard deviation = 0.62 °F. Your child has a temperature of 101 °F. What should you do?
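The z-score formula above is a one-line calculation, sketched here in Python for the height and temperature examples. The function and variable names are illustrative.

```python
def z_score(x, mean, std_dev):
    """Standardize x: how many standard deviations x lies from the mean."""
    return (x - mean) / std_dev


z_jordan = z_score(78, 69, 2.8)      # compared with men's heights
z_lobo = z_score(76, 63.6, 2.5)      # compared with women's heights
z_child = z_score(101, 98.60, 0.62)  # compared with healthy children's temperatures

print(f"Jordan: z = {z_jordan:.2f}")
print(f"Lobo:   z = {z_lobo:.2f}")
print(f"Child:  z = {z_child:.2f}")
```

Because each height is divided by a standard deviation in the same units, the inches cancel and the two z-scores can be compared directly.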

Section 6: Misleading Graphs

Examples 2.85 and 2.87

Calculator commands: To use your calculator to evaluate descriptive statistics (mean, standard deviation, quartiles, ...), do the following:
- STAT, Edit, Enter
- Enter your data in any column (say L1)
- STAT, CALC, 1-Var Stats, Enter
- Scroll down to see more output.

Example: Use your calculator to compute the mean, mode, standard deviation, and five-number summary for the following 100-meter race times (in seconds):

10.69 11.11 11.18 12.44 10.76 10.88 10.64 7.10 8.44 10.47