Essential Statistics Chapter 3

1 Essential Statistics Chapter 3 By Navidi and Monk Copyright 2016 Mark A. Thomas. All rights reserved.

2 Measures of Center
When summarizing data, statisticians often talk about measures of center (what the data look like at the center) and measures of spread (how the data spread out). For measures of center we will use the arithmetic mean and the median, or more simply just the mean and median.

3 Measures of Center - Mean
A list of n (or N) numbers is denoted $x_1, x_2, x_3, \ldots, x_n$.
The sum of those numbers is $\sum x = x_1 + x_2 + x_3 + \cdots + x_n$.
The mean for a sample and for a population is
$\bar{x} = \frac{\sum x}{n}$ (sample mean, x-bar) and $\mu = \frac{\sum x}{N}$ (population mean, mu).
Note: the mean is not necessarily a member of the data set.
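The mean formula can be sketched in a few lines of Python. This is only an illustration: the function name `mean` and the three data values are made up for the example and do not come from the slides.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by how many there are."""
    if not values:
        raise ValueError("the mean of an empty list is undefined")
    return sum(values) / len(values)


data = [2, 4, 9]   # hypothetical data, not from the slides
print(mean(data))  # 5.0, which is not itself a member of the data set
```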

4 Measures of Center - Median
The median is a number that splits the sorted data set into two equal parts.
Procedure for finding the median (symbol x-tilde):
1. Sort the data and determine the number of data elements, n.
2. If n is odd, the median is element number (n + 1) / 2.
3. If n is even, the median is the mean of the elements numbered n/2 and (n/2) + 1 (e.g. if n = 12, the median is the average of the 6th and 7th elements).
Note: if n is even, the median lies between the two center elements, so it need not be a value in the data set (it is one only when the two center elements are equal).
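The three-step median procedure might be coded as follows. This is a minimal sketch assuming the data fit in a Python list; the function name `median` and the sample values are illustrative only. Python lists are 0-based, so "element number k" in the procedure is index k - 1.

```python
def median(values):
    """Median by the slide procedure: sort, then take the middle element(s)."""
    ordered = sorted(values)          # step 1: sort and count
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty list is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]           # step 2: element number (n + 1) / 2
    # step 3: mean of elements numbered n/2 and (n/2) + 1
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([7, 1, 5]))     # 5
print(median([7, 1, 5, 3]))  # 4.0, a value between 3 and 5 that is not in the data
```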

5 Rounding numbers
As a general rule, round results to one more decimal place than the data in the original data set.

6 Comparing mean and median
Values that lie very far away from the majority of the other data values are called outliers.
The mean is more affected by outliers than the median is.
[Figure: symmetric, skewed-to-the-right, and skewed-to-the-left distributions]

7 Data Set Mode
The mode of a data set is the data value that occurs most often.
When two values tie for most frequent (i.e. occur the same number of times), the data set is bimodal.
If more than two values tie for most frequent, the data set is multimodal.
If no value occurs more than once, there is no mode.
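The mode rules above can be sketched with `collections.Counter` from the Python standard library. The function name `modes` is hypothetical; returning an empty list for "no mode" is a choice made for this sketch, not something the slides prescribe.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s); an empty list means there is no mode."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:
        return []  # no value occurs more than once, so there is no mode
    return sorted(v for v, c in counts.items() if c == top)


print(modes([1, 2, 2, 3]))     # [2]      one mode
print(modes([1, 1, 2, 2, 3]))  # [1, 2]   bimodal
print(modes([1, 2, 3]))        # []       no mode
```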

8 Mean of Grouped Data
Sometimes we don't have access to the actual data, only to its frequency distribution.
In that case we approximate the mean using class midpoints. A class midpoint is the lower class limit of one class plus the lower class limit of the next consecutive class, divided by 2.

9 Mean of Grouped Data
Procedure for approximating the mean of grouped data:
1. Compute the midpoint of each class by averaging its lower class limit and the lower limit of the next larger class.
2. For each class, multiply the class midpoint by the class frequency.
3. Add the products (Midpoint) x (Frequency) over all classes.
4. Divide the sum obtained in Step 3 by the sum of the frequencies.
See example 3.9.

Mean of Grouped Data 10

11 Mean of Grouped Data
mean ≈ 6850 / 50 = 137
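The four-step grouped-mean procedure could be written as below. This is a sketch under stated assumptions: the function name `grouped_mean`, the convention of passing one extra lower limit (the limit of the class that would follow the last one), and the three-class frequency table are all invented for illustration. The table is not the one from example 3.9.

```python
def grouped_mean(lower_limits, frequencies):
    """Approximate the mean from a frequency distribution.

    lower_limits must have one more entry than frequencies: the final entry
    is the lower limit of the class that would follow the last class.
    """
    if len(lower_limits) != len(frequencies) + 1:
        raise ValueError("need exactly one more lower limit than frequencies")
    total = sum(frequencies)
    if total == 0:
        raise ValueError("frequencies sum to zero")
    # step 1: midpoint = (lower limit + next lower limit) / 2
    midpoints = [(lo + nxt) / 2 for lo, nxt in zip(lower_limits, lower_limits[1:])]
    # steps 2 and 3: sum of midpoint x frequency over all classes
    weighted_sum = sum(m * f for m, f in zip(midpoints, frequencies))
    # step 4: divide by the sum of the frequencies
    return weighted_sum / total


# hypothetical classes 0-9, 10-19, 20-29 with frequencies 2, 5, 3
print(grouped_mean([0, 10, 20, 30], [2, 5, 3]))  # 16.0
```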

Summary 12

13 Measures of Spread (3.2)
Measures of spread describe how the data in a data set spread out.
The simplest measure of spread is the range:
range = maximum data value - minimum data value

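14 Measures of Spread - Variance
Variance is a measure of how far, on average, the data values in the data set are from the mean; it averages the squared deviations from the mean.
As with the mean, let $x_1, x_2, x_3, \ldots, x_n$ represent the values in a data set.
The formulas for population and sample variance are as follows:
$\sigma^2 = \frac{\sum (x - \mu)^2}{N}$ (population variance)
$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$ (sample variance)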

Measures of Spread - Variance 15

Measures of Spread - Variance 16
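As a rough illustration of the variance slides, the population and sample variance differ only in the divisor (N versus n - 1). The function names and the eight data values below are assumptions made for this sketch; they are not taken from the slides.

```python
def population_variance(values):
    """Sum of squared deviations from the mean, divided by N."""
    n = len(values)
    if n == 0:
        raise ValueError("need at least one value")
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    xbar = sum(values) / n
    return sum((x - xbar) ** 2 for x in values) / (n - 1)


data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical data; mean 5, squared deviations sum to 32
print(population_variance(data))  # 4.0
print(sample_variance(data))      # 32 / 7, a little larger than the population value
```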

17 Measures of Spread - Std. Deviation
The units of variance are squared units; if the original data were in degrees, the variance is in degrees squared.
To remedy this, we use the standard deviation, which is simply the square root of the variance:
$s = \sqrt{s^2}$ (sample std. dev.)
$\sigma = \sqrt{\sigma^2}$ (population std. dev.)
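A short sketch of the square-root step, using only `math.sqrt` from the standard library. The function names and the data are hypothetical; the data values are chosen so that the population standard deviation comes out to a round number.

```python
import math


def population_std(values):
    """Square root of the population variance (divisor N)."""
    n = len(values)
    if n == 0:
        raise ValueError("need at least one value")
    mu = sum(values) / n
    return math.sqrt(sum((x - mu) ** 2 for x in values) / n)


def sample_std(values):
    """Square root of the sample variance (divisor n - 1)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    xbar = sum(values) / n
    return math.sqrt(sum((x - xbar) ** 2 for x in values) / (n - 1))


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data with population variance 4
print(population_std(data))      # 2.0, in the same units as the data
print(sample_std(data))
```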

18 Measures of Spread - Empirical Rule
When a population or sample has a histogram that is approximately bell-shaped, then:
approximately 68% of the data will be within one standard deviation of the mean;
approximately 95% of the data will be within two standard deviations of the mean;
almost all of the data will be within three standard deviations of the mean.

19 Measures of Spread - Empirical Rule
When a population or sample has a histogram that is approximately bell-shaped, visually:
[Figure: bell-shaped histogram with x-bar - s, x-bar, and x-bar + s marked]
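One way to see the Empirical Rule at work is to simulate bell-shaped data and count how much of it falls within one, two, and three standard deviations of the mean. Everything in this sketch is an assumption made for illustration: the seed, the sample size, the mean of 100 and standard deviation of 15 passed to `random.gauss`, and the helper name `fraction_within`. The printed fractions should land near 68%, 95%, and almost 100%, but they are estimates from simulated data, not exact values.

```python
import random

random.seed(0)  # fixed seed so the simulated sample is reproducible
sample = [random.gauss(100, 15) for _ in range(10_000)]  # simulated bell-shaped data

n = len(sample)
xbar = sum(sample) / n
s = (sum((x - xbar) ** 2 for x in sample) / (n - 1)) ** 0.5  # sample std. dev.


def fraction_within(values, center, sd, k):
    """Fraction of values lying within k standard deviations of the center."""
    return sum(1 for x in values if abs(x - center) <= k * sd) / len(values)


for k in (1, 2, 3):
    print(k, round(fraction_within(sample, xbar, s, k), 3))
```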

20 Measures of Spread - CV
The coefficient of variation (CV) shows how large the standard deviation is relative to the mean.
CV values are unitless, so spreads measured in different units can be compared.
CV = standard deviation / mean
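The CV makes spreads in different units comparable, which a small sketch can show. Assumptions here: the sample standard deviation (divisor n - 1) is used, since the slide does not say which one; the function name and the height and weight values are invented for the example.

```python
import math


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (unitless)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    xbar = sum(values) / n
    if xbar == 0:
        raise ValueError("CV is undefined when the mean is zero")
    s = math.sqrt(sum((x - xbar) ** 2 for x in values) / (n - 1))
    return s / xbar


heights_cm = [150, 160, 170, 180, 190]  # hypothetical data
weights_kg = [50, 60, 70, 80, 90]       # hypothetical data

# Both lists have the same standard deviation, but relative to their means
# the weights vary more than the heights.
print(coefficient_of_variation(heights_cm))
print(coefficient_of_variation(weights_kg))
```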

21 Measures of Position - z-scores (3.3)
The z-score of an individual data value indicates how many standard deviations it is away from the mean.
Given a value x from a population with mean μ and standard deviation σ, the z-score for x is:
$z = \frac{x - \mu}{\sigma}$
See example 3.22.
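The z-score formula translates directly into code. The function name `z_score` and the numbers (mean 100, standard deviation 15) are hypothetical and are not the values from example 3.22.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean mu."""
    if sigma <= 0:
        raise ValueError("the standard deviation must be positive")
    return (x - mu) / sigma


print(z_score(130, mu=100, sigma=15))  # 2.0   two standard deviations above the mean
print(z_score(85, mu=100, sigma=15))   # -1.0  one standard deviation below the mean
```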

22 Measures of Position - z-scores
Empirical Rule and z-scores. When a population has a histogram that is approximately bell-shaped:
approximately 68% of the data will have z-scores between -1 and 1;
approximately 95% of the data will have z-scores between -2 and 2;
all, or almost all, of the data will have z-scores between -3 and 3.

25 Measures of Position
Given any data set, the median divides the data set into how many equal parts? (Two.)
[Figure: data set values split at the median]
We can also divide a data set into 4 equal parts; the dividing values are called quartiles.
[Figure: data set values split at Q1, Q2, and Q3]

26 Measures of Position
We can also divide a data set into 100 equal parts; the dividing values are called percentiles.
Given a number p between 1 and 99, the pth percentile separates the lowest p% of the data from the highest (100 - p)%.
[Figure: data set values with P25, P50, and P75 marked, and 25% / 75% labels]

27 Measures of Position
Computing the data value corresponding to a given percentile:
1. Sort the data in increasing order and determine n.
2. Compute the location L = (p / 100) x n.
3. If L is not a whole number, round up (take the ceiling) to the next whole number; the pth percentile is the value in that location.
4. If L is a whole number, the pth percentile is the average of the values in locations L and L + 1.
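The location rule above might be implemented as in the sketch below. The function name `percentile_value` is hypothetical, and p is assumed to be a whole number so that the "is L a whole number?" test can be done with exact integer arithmetic rather than floating point. The data are the 22 values used on the five-number-summary slide later in this deck.

```python
def percentile_value(values, p):
    """Value at the pth percentile using the location rule L = (p / 100) * n."""
    if not isinstance(p, int) or not 1 <= p <= 99:
        raise ValueError("p must be a whole number between 1 and 99")
    ordered = sorted(values)            # step 1: sort and count
    n = len(ordered)
    if n == 0:
        raise ValueError("need at least one value")
    whole, remainder = divmod(p * n, 100)   # step 2: L = whole + remainder / 100
    if remainder != 0:
        # step 3: L is not whole, so round up to location whole + 1
        # (1-based), which is index `whole` in a 0-based list
        return ordered[whole]
    # step 4: L is whole, so average locations L and L + 1 (1-based)
    return (ordered[whole - 1] + ordered[whole]) / 2


data = [41, 42, 42, 44, 44, 45, 45, 46, 49, 49, 51,
        51, 53, 56, 57, 59, 59, 65, 67, 71, 77, 100]
print(percentile_value(data, 25))  # 45   L = 5.5, so use location 6
print(percentile_value(data, 50))  # 51.0 L = 11, so average locations 11 and 12
print(percentile_value(data, 75))  # 59   L = 16.5, so use location 17
```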

28 Measures of Position
Example 3.23: compute the 30th percentile of the sorted data, which contain n = 42 values.
Location L = (30 / 100) x 42 = 12.6.
Since L is not a whole number, take the next higher whole number, 13; the 30th percentile is the value at location 13.

30 Measures of Position
Computing the percentile corresponding to a given data value:
1. Sort the data in increasing order and determine n.
2. Let x be the given data value and compute p = ((number of data values less than x) + 0.5) / n x 100.
3. If p is not a whole number, round it to the nearest whole number.

31 Measures of Position
Example 3.24: to what percentile does a rainfall of 1.90 correspond?
Sort the data in ascending order and count how many values are less than 1.90; there are 17.
Percentile p = ((17 + 0.5) / 42) x 100 = 41.6667.
Since p is not a whole number, round 41.7 to 42; thus the value 1.90 corresponds to the 42nd percentile.
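The percentile-of-a-value formula can be sketched as follows. The function name `percentile_rank` is hypothetical. Rounding uses `math.floor(p + 0.5)` so that halves round up, because Python's built-in `round` rounds halves to the nearest even number. The rainfall data are not reproduced in this transcript, so the example uses a stand-in list that merely has the same counts as Example 3.24 (42 values, 17 of them below x).

```python
import math


def percentile_rank(values, x):
    """Percentile for x: ((count of values below x) + 0.5) / n * 100, rounded."""
    n = len(values)
    if n == 0:
        raise ValueError("need at least one value")
    below = sum(1 for v in values if v < x)
    p = (below + 0.5) / n * 100
    return math.floor(p + 0.5)  # round to the nearest whole number, halves up


stand_in = list(range(42))            # hypothetical data: the integers 0..41
print(percentile_rank(stand_in, 17))  # 42, since (17 + 0.5) / 42 * 100 = 41.67
```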

32 Measures of Position
Computing the data value corresponding to a given quartile:
1. Sort the data in increasing order and determine n.
2. Find the percentile corresponding to the desired quartile, e.g. Q1 = P25, Q2 = P50, etc.
3. Compute the location L = (p / 100) x n.
4. If L is not a whole number, round up (take the ceiling) to the next whole number; the pth percentile is the value in that location.
5. If L is a whole number, the pth percentile is the average of the values in locations L and L + 1.

33 Measures of Position
The five-number summary consists of the following 5 positional values: the minimum, the first quartile (Q1), the median, the third quartile (Q3), and the maximum.

34 Measures of Position
Find the five-number summary for the following data:
41 42 42 44 44 45 45 46 49 49 51 51 53 56 57 59 59 65 67 71 77 100
min = 41, max = 100, median = 51 (n = 22, so average the 11th and 12th values).
Q1 = P25: location L = (25 / 100) x 22 = 5.5; the next higher whole number is 6, and the value in the 6th location is 45.
Q3 = P75: location L = (75 / 100) x 22 = 16.5; the next higher whole number is 17, and the value in the 17th location is 59.
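The worked example above can be checked with a short script. This is a sketch, not the textbook's code: the helper names `percentile_value` and `five_number_summary` are made up, and p is assumed to be a whole number so the location test can use integer arithmetic.

```python
def percentile_value(ordered, p):
    """pth percentile of already-sorted data using L = (p / 100) * n."""
    whole, remainder = divmod(p * len(ordered), 100)
    if remainder != 0:
        return ordered[whole]                        # round L up, then 0-based index
    return (ordered[whole - 1] + ordered[whole]) / 2  # average locations L and L + 1


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) with Q1 = P25, median = P50, Q3 = P75."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("need at least one value")
    return (ordered[0],
            percentile_value(ordered, 25),
            percentile_value(ordered, 50),
            percentile_value(ordered, 75),
            ordered[-1])


data = [41, 42, 42, 44, 44, 45, 45, 46, 49, 49, 51,
        51, 53, 56, 57, 59, 59, 65, 67, 71, 77, 100]
print(five_number_summary(data))  # (41, 45, 51.0, 59, 100)
```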

35 Measures of Position - Outliers
An outlier is a data value much larger or smaller than the other data values in the data set.
Outliers can be errors, or they can be correct but unusual values, depending on the measurement.
The interquartile range (IQR) is a measure of spread used to detect outliers:
IQR = Q3 - Q1
The lower and upper outlier boundaries are computed by
lower outlier boundary = Q1 - (1.5 x IQR)
upper outlier boundary = Q3 + (1.5 x IQR)

36 Measures of Position - Outliers
Example 3.30: use the IQR method to determine which values, if any, in table 3.11 are outliers.
From example 2.27, Q1 = 45 and Q3 = 59.
IQR = Q3 - Q1 = 59 - 45 = 14
lower outlier boundary = 45 - (1.5 x 14) = 24
upper outlier boundary = 59 + (1.5 x 14) = 80
So any data values in table 3.1 that are < 24 or > 80 are outliers (note there is only a single outlier, i.e. 100).
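The IQR method from this example can be sketched as below. The function name `iqr_outliers` is hypothetical, and the quartiles are passed in rather than computed. The data list is assumed to be the 22 values from the five-number-summary slide, since those values give the same Q1 = 45 and Q3 = 59 and contain the value 100.

```python
def iqr_outliers(values, q1, q3):
    """Return (lower boundary, upper boundary, outliers) by the 1.5 x IQR rule."""
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers


data = [41, 42, 42, 44, 44, 45, 45, 46, 49, 49, 51,
        51, 53, 56, 57, 59, 59, 65, 67, 71, 77, 100]
lower, upper, outliers = iqr_outliers(data, q1=45, q3=59)
print(lower, upper, outliers)  # 24.0 80.0 [100]
```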