Chapter 7 What to do when you have the data


We saw in the previous chapters how to collect data. We will spend the rest of this course looking at how to analyse the data that we have collected.

Stem and Leaf Diagrams
A Stem and Leaf Diagram is a graphical way to display a group of integers in a dataset.

Steps for Constructing a Stem and Leaf Diagram
1. Select one or more of the leading digits to be the Stem values; the remaining digits become the Leaves.
2. List the possible Stem values in a column.
3. Record the Leaf for every observation beside the corresponding Stem value.
4. Indicate on the display what units are used for the Stems and Leaves.

Example
The following are a selection of exam marks:
71 52 52 75 64 60 48 56 67 29
11 53 25 46 58 46 49 62 66 40
19 54 57 54 60 19 59 43 51 40
21 45 46 62 73 59 36 45 55 46
45 32 55 46 51 46 65 49 61 40

A Stem and Leaf Diagram will look like this:

1 | 1 9 9
2 | 1 5 9
3 | 2 6
4 | 0 0 0 3 5 5 5 6 6 6 6 6 6 8 9 9
5 | 1 1 2 2 3 4 4 5 5 6 7 8 9 9
6 | 0 0 1 2 2 4 5 6 7
7 | 1 3 5

STEM UNIT = TENS
LEAF UNIT = ONES
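The construction steps can also be sketched in code. The following is a minimal Python illustration, not part of the original notes; the function name `stem_and_leaf` is a hypothetical helper, and it assumes non-negative integers with tens as Stems and ones as Leaves.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group integers by their tens digit (Stem) and collect the ones digits (Leaves)."""
    groups = defaultdict(list)
    for v in values:
        stem, leaf = divmod(v, 10)  # e.g. 71 -> stem 7, leaf 1
        groups[stem].append(leaf)
    # Sort the stems, and sort the leaves within each stem.
    return [(stem, sorted(groups[stem])) for stem in sorted(groups)]


# The exam marks from the example.
marks = [
    71, 52, 52, 75, 64, 60, 48, 56, 67, 29,
    11, 53, 25, 46, 58, 46, 49, 62, 66, 40,
    19, 54, 57, 54, 60, 19, 59, 43, 51, 40,
    21, 45, 46, 62, 73, 59, 36, 45, 55, 46,
    45, 32, 55, 46, 51, 46, 65, 49, 61, 40,
]

rows = stem_and_leaf(marks)
for stem, leaves in rows:
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
print("STEM UNIT = TENS, LEAF UNIT = ONES")
```

Sorting the leaves within each stem is a presentation choice; the steps only require that every observation's leaf is recorded beside its stem.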

Histogram for Discrete Numerical Data
1. Draw a horizontal X-axis and on it mark the possible values taken by the observations.
2. Draw a vertical Y-axis marked with either relative frequencies or frequencies.
3. Above each possible value on the X-axis draw a rectangle centred on the value with width 1 and height equal to the relative frequency or frequency of that value.

Value   Frequency
30      100
40      150
50      200
60      100

[Figure: histogram of Frequency (Y-axis, 0 to 250) against Value (X-axis, 30 to 60)]
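As a rough sketch of how the frequencies in this table relate to relative frequencies, the Python snippet below converts each count to a proportion of the total and prints a crude text histogram. The variable names and the choice of one `#` per 10 observations are assumptions made for illustration only.

```python
# Frequency table from the slide: value -> frequency.
frequencies = {30: 100, 40: 150, 50: 200, 60: 100}

total = sum(frequencies.values())
# Relative frequency = frequency / total number of observations.
relative = {value: count / total for value, count in frequencies.items()}

for value, count in frequencies.items():
    bar = "#" * (count // 10)  # one '#' per 10 observations (illustrative scale)
    print(f"{value:>3} | {bar} {count} ({relative[value]:.3f})")
```

Either scale gives a histogram of the same shape, since every bar height is divided by the same total.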

The Shape of Histograms
The general shape of a histogram is important. The number of peaks in the histogram determines whether a distribution is classed as Unimodal, Bimodal or Multimodal.
In addition to this classification we can further classify Unimodal distributions according to whether they are symmetric or not. A Unimodal distribution is defined to be Symmetric if there is a vertical line through the middle of the distribution such that the distribution to the left of this line is the mirror image of the distribution to the right of it.

The right part of a Unimodal distribution is called the Upper Tail of the distribution, while the left part is called the Lower Tail. A Unimodal distribution which is not symmetric is called skewed. There are two types of skewness.
Positive Skew: If the Upper Tail of the distribution stretches out more than the Lower Tail then the distribution is said to be positively skewed.
Negative Skew: If the Lower Tail of the distribution stretches out more than the Upper Tail then the distribution is said to be negatively skewed.

Symmetric Distributions

POSITIVELY SKEWED DISTRIBUTION

NEGATIVELY SKEWED DISTRIBUTION

Definitions
Mean: The Mean of a quantitative dataset is the sum of the observations in the dataset divided by the number of observations in the dataset.
Median: The Median (m) of a quantitative dataset is the middle number when the observations are arranged in ascending order.
Mode: The Mode of a dataset is the observation that occurs most frequently in the dataset.

How to calculate these
Dataset: $X_1, X_2, X_3, X_4, X_5, \ldots, X_n$
Mean: $\text{Mean} = \dfrac{X_1 + X_2 + X_3 + \cdots + X_n}{n}$
Median: Arrange the n observations in order from smallest to largest, then:
if n is odd, the median (m) is the middle number;
if n is even, the median is the mean of the middle two numbers.
Mode: If given a dataset, the mode is easily chosen as the value which appears most often.

Example A: Dataset: 5, 3, 8, 5, 6
Mean = 27/5 = 5.4
Mode = 5
Median: 3, 5, 5, 6, 8 so m = 5
Note: 5.4 is not one of the original values in the dataset.

Example B: Dataset: 11, 140, 98, 23, 45, 14, 56, 78, 93, 200, 123, 165
n = 12
Mean = 1046/12 ≈ 87.17
Median: 11, 14, 23, 45, 56, 78, 93, 98, 123, 140, 165, 200 so m = (78 + 93)/2 = 85.5

Example C: Generate a dataset containing 9 numbers using the Day, Month and Year of your birth and that of the people sitting to your left and right, i.e. DD/MM/YY.
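Examples A and B can be checked with a few lines of Python. This is only a sketch using the standard library's `statistics` module; the variable names are chosen here for illustration and are not part of the original notes.

```python
import statistics

example_a = [5, 3, 8, 5, 6]
example_b = [11, 140, 98, 23, 45, 14, 56, 78, 93, 200, 123, 165]

# Example A: odd number of observations, so the median is the middle value.
mean_a = statistics.mean(example_a)
median_a = statistics.median(example_a)
mode_a = statistics.mode(example_a)

# Example B: even number of observations, so the median is the
# mean of the two middle values after sorting.
mean_b = statistics.mean(example_b)
median_b = statistics.median(example_b)

print(f"Example A: mean={mean_a}, median={median_a}, mode={mode_a}")
print(f"Example B: mean={mean_b:.2f}, median={median_b}")
```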

Mean vs Median vs Mode - which measures the centre best?
Choosing which of these three measures to use in practice can sometimes seem like a difficult task. However, if we understand a little about the relative merits of each, we should at least be able to make an informed decision.
If the distribution is symmetric then Mean = Median.
If the distribution is Positively Skewed (to the right) then Median < Mean.
If the distribution is Negatively Skewed (to the left) then Mean < Median.

So the difference between the mean and median can be used to measure the skewness of a dataset.
Note: The presence of outliers affects the mean but has little or no effect on the median. This can be seen from the diagrams and from the following example.
Example: Ten recent statistics graduates who are now working as statisticians are surveyed for their annual salary. The survey produced the following dataset:
€60,000 €20,000 €19,000 €22,000 €21,500 €21,000 €18,000 €16,000 €17,500 €20,000
Mode = €20,000
Median = €20,000
Mean = €23,500

Notice that the distribution is positively skewed: the presence of the one high earner has affected the Mean, causing it to be €1,500 higher than the highest of all the salaries excluding €60,000. For this dataset the Mean is therefore not a good measure of the centre of the dataset. Notice also that the Median would be unaffected if the €60,000 was changed to a value like €23,000, which is more in line with the rest of the data.
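The effect of the single high salary can be illustrated with a short Python sketch. It recomputes the mean and median for the ten salaries, then again with 60,000 replaced by 23,000 as suggested in the text; the variable names are illustrative assumptions.

```python
import statistics

# The ten surveyed salaries.
salaries = [60000, 20000, 19000, 22000, 21500, 21000, 18000, 16000, 17500, 20000]

# Replace the one high earner with a value more in line with the rest.
adjusted = [23000 if s == 60000 else s for s in salaries]

mean_original = statistics.mean(salaries)
median_original = statistics.median(salaries)
mean_adjusted = statistics.mean(adjusted)
median_adjusted = statistics.median(adjusted)

print(f"original: mean={mean_original}, median={median_original}")
print(f"adjusted: mean={mean_adjusted}, median={median_adjusted}")
```

The mean moves by several thousand while the median stays where it was, which is the sense in which the median resists outliers.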

Examples
Would you expect the datasets described below to be symmetric, skewed to the right or skewed to the left?
A. The salaries of people employed by UCD
B. The grades on an easy exam
C. The grades on a difficult exam
D. The amount of time spent by students in a difficult 3 hour exam
E. The amount of time students in this class studied last week
F. The age of cars on a used car lot

Example: The median age of the population in Ireland is now 32 years old. The median age of the Irish population in 1986 was 27. Interpret these values and explain the trend. What implications does this data have for Irish society? What are the consequences for the entertainment industry in Ireland?

Numerical Measures of Variability
When we want to describe a dataset, providing a measure of the centre of that dataset is only part of the story. Consider the following two distributions:

Both of these distributions are symmetric, and Mean A = Mean B, Mode A = Mode B and Median A = Median B. However these two distributions are obviously different: the data in A is quite spread out compared to the data in B. This spread is technically called variability and we will now examine how best to measure it.

Definitions
Range: The Range of a quantitative dataset is equal to the largest value minus the smallest value.
Sample Variance: The Sample Variance, $s^2$, is equal to the sum of the squared distances from the mean divided by n-1.
Standard Deviation: The Sample Standard Deviation, s, is defined as the positive square root of the Sample Variance, $s^2$.

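Sample Variance

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

or equivalently

$$s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \dfrac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n - 1}$$

where $\bar{x}$ is the mean of the observations $x_1, \ldots, x_n$.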
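As a quick check that the definition of the Sample Variance (sum of squared distances from the mean, divided by n-1) agrees with the usual shortcut form based on the sum of squares, the sketch below evaluates both on the dataset 5, 3, 8, 5, 6 from Example A and compares them with Python's `statistics.variance`. The choice of dataset and the variable names are assumptions made for illustration.

```python
import statistics

data = [5, 3, 8, 5, 6]
n = len(data)
mean = sum(data) / n

# Definition: sum of squared distances from the mean, divided by n - 1.
definition = sum((x - mean) ** 2 for x in data) / (n - 1)

# Shortcut: (sum of squares - (sum)^2 / n), divided by n - 1.
shortcut = (sum(x * x for x in data) - sum(data) ** 2 / n) / (n - 1)

# Library value, which also uses the n - 1 divisor.
library = statistics.variance(data)

print(f"definition={definition:.4f}, shortcut={shortcut:.4f}, library={library:.4f}")
print(f"standard deviation={definition ** 0.5:.4f}")
```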

Which is best?
The meaning of the Range is easily seen from its definition. It is a very crude measure of the variability contained in a dataset, as it only uses the largest and smallest values and does not measure the variability of the rest of the dataset.
Example: These two datasets have the same range but do they have the same variability?
Dataset1: 1, 5, 5, 5, 9
Dataset2: 1, 2, 5, 8, 9
No. Dataset2 is clearly more spread out than Dataset1, which has three values clustered at 5.
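The comparison of Dataset1 and Dataset2 can be made concrete with a small Python sketch that computes the Range and the Sample Standard Deviation of each. The helper name `value_range` is hypothetical.

```python
import statistics

dataset1 = [1, 5, 5, 5, 9]
dataset2 = [1, 2, 5, 8, 9]


def value_range(data):
    """Largest value minus smallest value."""
    return max(data) - min(data)


sd1 = statistics.stdev(dataset1)  # sample standard deviation (n - 1 divisor)
sd2 = statistics.stdev(dataset2)

print(f"Dataset1: range={value_range(dataset1)}, s={sd1:.3f}")
print(f"Dataset2: range={value_range(dataset2)}, s={sd2:.3f}")
```

Both ranges are equal, but the standard deviation of Dataset2 is larger, which matches the visual impression that its values are more spread out.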

Example
Once upon a time there were two lecturers, A and B, who each delivered the same course to a different class. When exam time came both classes had the same average mark of 70%. The marks for Lecturer A's class, however, had a Standard Deviation of 25% whereas the Standard Deviation for Lecturer B's class was 5%. Whose class would you rather be in?

Chapter 8 Normal Curves and Relative Standing
We have just seen how datasets can be described by histograms. For large datasets of continuous variables the histograms have so many possible values that it would be impracticable to draw all of the really narrow rectangles necessary. Instead we represent these datasets by curves (distributions). The curve can be thought of as joining the centre points of the tops of all the rectangles in the histogram.

These distributions, which are like generalised relative frequency histograms, can take many different shapes, some symmetrical, some skewed. There is one shape, however, that crops up all through the natural world and that is THE NORMAL DISTRIBUTION, also known as The Gaussian Distribution or The Bell Curve.

The Normal Curve

The Normal Distribution is Symmetric. There are many different Normal curves: some are fat, some are thin. Some are centred at 0, some at 1, some at 5, etc. Each Normal curve can be uniquely identified by two parameters: the Mean and the Standard Deviation. Once you know the Mean and the Standard Deviation for a Normal curve it is possible to draw the curve. Normal curves are centred at the Mean, and the Standard Deviation describes how spread out they are.

[Figure: the Normal curve, with the Mean and the Standard Deviation marked]

The area under a Normal curve to the left of the mean is 0.5. This indicates that the probability that something which is Normally distributed is less than its mean is 0.5. The area under the curve to the left of any point A on the X axis represents the probability that a Normal variable is less than A.

[Figure: for X ~ Normal, Probability(X < A) is the area under the curve to the left of A; the Mean and the point A are marked on the X axis]

There are an infinite number of different Normal curves, one for each possible combination of values of the Mean and the Standard Deviation. However there is a relationship between all Normal curves. Any Normal variable X with Mean $\mu$ and Standard Deviation $\sigma$ can be transformed into a Standard Normal variable Z. Z is Normal with Mean 0 and Standard Deviation 1.

$$Z = \frac{X - \mu}{\sigma}$$

We can use tables to look up areas under the Standard Normal Curve.
Example: Find the Probability that a Normal variable with Mean 3 and Standard Deviation 2 is less than 4.

$$\Pr(X < 4) = \Pr(X - 3 < 4 - 3) = \Pr\left(\frac{X - 3}{2} < \frac{4 - 3}{2}\right) = \Pr(Z < 0.5) = 0.6915$$
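Instead of a printed table, the same area can be computed numerically. The sketch below is one possible way to do this in Python, using the standard identity that expresses the Standard Normal cumulative probability through the error function `math.erf`; the function name `normal_cdf` is a hypothetical helper, not something from the notes.

```python
import math


def normal_cdf(a, mean=0.0, sd=1.0):
    """Probability that a Normal(mean, sd) variable is less than a."""
    z = (a - mean) / sd  # standardise: Z = (X - mean) / sd
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Normal variable with Mean 3 and Standard Deviation 2: Pr(X < 4) = Pr(Z < 0.5).
p = normal_cdf(4, mean=3, sd=2)
print(f"Pr(X < 4) = {p:.4f}")

# The area to the left of the mean is one half.
print(f"Pr(X < 3) = {normal_cdf(3, mean=3, sd=2):.1f}")
```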

Interpreting the Standard Deviation: the Empirical Rule
We have seen that the Variance, and hence the Standard Deviation, of a dataset provides us with a relative measure of the variability contained in the dataset. If we are given two datasets, the one with the larger Standard Deviation will be the dataset which exhibits the greater variability. Is it possible for the Standard Deviation to give more than a relative measure of variability? Can we actually say how spread out the data is?

The answer is yes, and we will see later how to give detailed answers for particular distributions. In the meantime there are two rules which will provide us with a good deal of information about some general datasets. One of these, the Empirical Rule, provides us with some definite statements about the proportion of observations in a specified interval. It only works for Symmetric Bell-Shaped (mound-shaped) distributions. This rule is also an approximation, and more or less data than is indicated by the rule may lie in each interval.

The Empirical Rule
For a Symmetric Bell-Shaped distribution (Normal or close to Normal):
Approximately 68% of the observations are within 1 Standard Deviation of the Mean.
Approximately 95% of the observations are within 2 Standard Deviations of the Mean.
Approximately 99.7% of the observations are within 3 Standard Deviations of the Mean.
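The three percentages in the Empirical Rule can be compared with the values for an exactly Normal distribution. The Python sketch below assumes exact Normality (the rule itself only claims an approximation for bell-shaped data) and uses the standard result that the proportion within k Standard Deviations of the Mean equals erf(k / sqrt(2)). The function name is illustrative.

```python
import math


def proportion_within(k):
    """Proportion of an exactly Normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2.0))


within = {k: proportion_within(k) for k in (1, 2, 3)}
for k, p in within.items():
    print(f"within {k} standard deviation(s): {100 * p:.2f}%")
```

The exact Normal values are slightly different from the rounded 68%, 95% and 99.7%, which is one reason the rule is stated as approximate.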

Example
In Iraq, there are daily shootings. The US Army use M16 rifles. The British Army use SA80 rifles.

Example
Forensic Scientists have conducted tests on many different combinations of weapons. They found that the dataset of observations produced by the SA80 rifles showed a Mean velocity of 936 feet/second and a Standard Deviation of 10 feet/second.

The measurements were taken at a distance of 15 feet from the gun. Iraqi police examined the body of a civilian shot by a soldier. They concluded that he was shot at a distance of 15 feet and that the velocity of the bullet at impact was 1,000 feet/second. Controversy reigned between the British and the US as to who was responsible. Can you use the Empirical Rule to see if the British were responsible?

The distribution of this bullet velocity data should be approximately bell-shaped. This implies that the Empirical Rule should give a good estimate of the percentage of the data within each interval.

k (number of Standard Deviations)   Interval (feet/second)   Empirical approximate percentage
2                                   916, 956                 95%
3                                   906, 966                 99.7%
4                                   896, 976                 ~100%
5                                   886, 986                 ~100%
6                                   876, 996                 ~100%
7                                   866, 1006                ~100%

This table quite clearly demonstrates that the bullet velocity in the shooting, 1000 ft/sec, lies more than 6 Standard Deviations away from the mean: (1000 - 936)/10 = 6.4. The probability is therefore extremely high that the British were not responsible for this shooting. This is especially evident from the column showing percentages from the Empirical Rule. Practically 100% of bullet velocities should be between 896 and 976 ft/sec.
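The distance of the observed velocity from the mean can be worked out directly. The sketch below computes the number of Standard Deviations and, under the added assumption that the velocities are exactly Normal (the notes only say approximately bell-shaped), the chance of a velocity at least this high. Variable names are illustrative.

```python
import math

mean_velocity = 936.0  # feet/second, SA80 test data
sd_velocity = 10.0     # feet/second
observed = 1000.0      # feet/second, velocity at impact

# Number of standard deviations between the observation and the mean.
z = (observed - mean_velocity) / sd_velocity

# Upper-tail probability under an exactly Normal model (an assumption).
upper_tail = 0.5 * math.erfc(z / math.sqrt(2.0))

print(f"z = {z:.1f}")
print(f"P(velocity >= {observed:.0f}) under an exact Normal model: {upper_tail:.2e}")
```

Tail probabilities this far from the mean are very sensitive to the Normality assumption, so the number is best read as "extremely small" rather than as a precise figure.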

Numerical Measures of Relative Standing
While it is useful to know how to measure the centre of a dataset and the variability of a dataset, many times we want to be able to compare one observation with the rest of the observations in the dataset. Is one observation larger than many others? For example, suppose you get 35% on the exam for this course. You will probably feel quite bad about your performance, but what if 90% of the class actually did worse than you? Then you might feel a bit better about your 35%.

So in some cases knowing how one observation compares with others can be more useful than just knowing the value of that observation. We will now look at some different ways of measuring Relative Standing.

Definitions
Percentile: For any dataset the p-th percentile is the observation which is greater in value than p% of all the numbers. Consequently this observation will be smaller than (100 - p)% of the data.
Z-Score: The Z-Score of an observation is the distance between that observation and the mean expressed in units of standard deviations. So:

$$Z = \frac{X - \mu}{\sigma}$$

The numerical value of the Z-score reflects the relative standing of the observation. A large positive Z-score implies that the observation is larger than most of the other observations. A large negative Z-score indicates that the observation is smaller than almost all the other observations. A Z score of zero or close to 0 means that the observation is located close to the mean of the dataset.

Example
A sample of 120 statistics students was chosen and their exam results summarised. The mean and standard deviation were found to be: mean = 53%, st.dev. = 7%.
Britney and Christina are two students in this class. Britney's exam result was 47%. What was her Z-score? If Christina's Z-score is 2, what was her percentage on the exam?
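One way to work through this example is sketched below in Python. It applies the Z-score definition directly for Britney and rearranges it (X = mean + Z × st.dev.) for Christina; the helper names `z_score` and `from_z_score` are hypothetical.

```python
def z_score(x, mean, sd):
    """Distance between x and the mean, in units of standard deviations."""
    return (x - mean) / sd


def from_z_score(z, mean, sd):
    """Recover the observation from its Z-score: x = mean + z * sd."""
    return mean + z * sd


class_mean = 53  # percent
class_sd = 7     # percent

britney_z = z_score(47, class_mean, class_sd)
christina_mark = from_z_score(2, class_mean, class_sd)

print(f"Britney's Z-score: {britney_z:.2f}")
print(f"Christina's mark: {christina_mark}%")
```

A negative Z-score places Britney below the class mean, while a Z-score of 2 places Christina two standard deviations above it.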