How To Calculate The Normal Distribution Of A Distribution




2.0 Lesson Plan

- Answer Questions
- Summary Statistics
- Histograms
- The Normal Distribution
- Using the Standard Normal Table

2. Summary Statistics

Given a collection of data, one needs to find representations of the data that facilitate understanding and insight. Three standard tools for this are:

- Measures of Central Tendency (mean, median, mode)
- Measures of Dispersion (standard deviation, range, IQR)
- Visualizations (histograms, other graphics)

We shall cover this quickly, since most of this is review from elementary school.

2.1 Measures of Central Tendency

The mean is just the average of the data. Suppose one has a sample of n observations with values $X_1, \ldots, X_n$. Then the mean is

$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i = \frac{1}{n}(X_1 + \cdots + X_n).$$

The mode is the value that occurs most frequently in the sample. There can be many modes.

The median is the middle value of the n sorted observations, if n is odd. If there is an even number of observations, we average the two middle values.

Example: Suppose one observes the following data: 1, 0, 2, -2, 1, -2, 5, -1.

- The mean is $\bar{X} = \frac{1}{8}(1 + 0 + 2 - 2 + 1 - 2 + 5 - 1) = 0.5$.
- The modes are 1 and -2.
- The median is the average of 0 and 1, or 0.5.

The mean can be pulled in misleading directions if there are outliers. A single large or small datum will have a large influence on the mean, but not on the median. An outlier is an incorrect or unrepresentative observation that is very different from the others in the sample.
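The example above can be checked with a few lines of Python. This is only a sketch using the standard-library `statistics` module; the extra value 100 appended at the end is a hypothetical outlier, not part of the notes, added to show how differently the mean and the median react.

```python
from statistics import mean, median, multimode

data = [1, 0, 2, -2, 1, -2, 5, -1]

sample_mean = mean(data)        # sum of the values divided by n
sample_median = median(data)    # averages the two middle values when n is even
sample_modes = multimode(data)  # every value tied for the highest count
print(sample_mean, sample_median, sample_modes)

# Hypothetical outlier (not in the notes): one extreme value moves the mean
# a long way but barely moves the median.
with_outlier = data + [100]
print(mean(with_outlier), median(with_outlier))
```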

2.2 Measures of Dispersion

To measure how spread out a sample is, we mostly use the standard deviation (or sd). This is:

$$\mathrm{sd} = \sqrt{\frac{1}{n-1}\left[(X_1 - \bar{X})^2 + \cdots + (X_n - \bar{X})^2\right]} \qquad (1)$$

$$\phantom{\mathrm{sd}} = \sqrt{\frac{1}{n-1}\left(\sum_{i=1}^{n} X_i^2\right) - \frac{n}{n-1}\bar{X}^2}. \qquad (2)$$

The sd is the square root of the average squared deviation of each observation from the mean. The square of the sd is the variance. Formula (1) is better for understanding, but (2) is better for calculation.
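As an illustration, both formulas can be written out directly and compared on the data from the earlier example. This is a minimal sketch; the function names are made up for this illustration.

```python
import math

def sd_definition(xs):
    """Formula (1): root of the summed squared deviations over n - 1."""
    n = len(xs)
    xbar = sum(xs) / n
    return math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))

def sd_shortcut(xs):
    """Formula (2): uses only the sum of squares and the mean."""
    n = len(xs)
    xbar = sum(xs) / n
    return math.sqrt(sum(x * x for x in xs) / (n - 1) - n / (n - 1) * xbar ** 2)

data = [1, 0, 2, -2, 1, -2, 5, -1]
print(sd_definition(data), sd_shortcut(data))
```

The two results should agree up to floating-point rounding. Formula (2) is convenient by hand, but on a computer it can lose precision when the mean is large relative to the spread, so formula (1) is usually the safer choice in code.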

Nota bene: When one observes the whole population rather than just a sample, the formula for the population sd divides by n instead of n - 1.

The majority of observations are usually within 1 sd of the mean. But it can happen that none of the data are less than 1 sd from the mean.

Tchebyshev proved that for all datasets, the proportion of data that lie within $a$ standard deviations of the mean must be at least $1 - \frac{1}{a^2}$. In terms of probability, if you pick an observation X at random,

$$P[\,|X - \text{mean}| < a \cdot \mathrm{sd}\,] \ge 1 - \frac{1}{a^2}.$$

Thus:

- At least 75% of the observations must always be less than 2 standard deviations from the population mean.
- At least 89% of the observations must always be less than 3 standard deviations from the population mean.
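Tchebyshev's bound can be checked numerically. The sketch below treats the data as a whole population (so the sd divides by n); `fraction_within` is a name invented for this illustration. The two-point dataset -1, 1 is a hypothetical example of the case where no observation is less than 1 sd from the mean.

```python
import math

def fraction_within(xs, a):
    """Fraction of xs strictly less than a population sds from the mean."""
    n = len(xs)
    mu = sum(xs) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / n)  # population sd
    return sum(abs(x - mu) < a * sigma for x in xs) / n

data = [1, 0, 2, -2, 1, -2, 5, -1]
for a in (2, 3):
    # observed fraction next to the guaranteed lower bound 1 - 1/a^2
    print(a, fraction_within(data, a), 1 - 1 / a ** 2)

# Hypothetical dataset: both points sit exactly 1 sd from the mean.
print(fraction_within([-1, 1], 1))
```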

The range is the largest observation minus the smallest. As a measure of dispersion, it is strongly influenced by outliers.

The interquartile range is the 75th percentile of the data minus the 25th percentile (the median is the 50th percentile).

The 25th percentile is the number u such that at least 25% of the sample is less than or equal to u and at least 75% of the sample is greater than or equal to u. (The u need not be a sample value; and if there are many numbers u that satisfy this definition, we take the middle value.)

The 75th percentile is the number v such that at least 75% of the sample is less than or equal to v and at least 25% of the sample is greater than or equal to v. (If this is not unique, take the middle value.)

The interquartile range is not strongly influenced by outliers.

Example: Suppose you have the following sample: 1, 0, 2, -2, 1, 5, -2, -1. It helps to order the data first: -2, -2, -1, 0, 1, 1, 2, 5.

The range is 5 - (-2) = 7.

There are eight values, so the 25th percentile is any number between -2 and -1; we take -1.5. Similarly, the 75th percentile is 1.5. The IQR is the difference of these, or 1.5 - (-1.5) = 3.

The standard deviation is

$$\mathrm{sd} = \sqrt{\frac{1}{7}\left[(1 - 0.5)^2 + \cdots + ((-1) - 0.5)^2\right]} = \,?$$

but it is faster to calculate

$$\mathrm{sd} = \sqrt{\frac{1}{7}\left[1^2 + \cdots + (-1)^2\right] - (8/7)(0.5)^2} = 2.32993.$$
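The percentile definition used in these notes can be coded directly. The sketch below is one possible reading of it: find the smallest and largest numbers that satisfy both conditions and return their midpoint. The function name and the integer-percentage argument are choices made for this illustration. Library routines often use other interpolation conventions, so their answers may differ slightly from the ones here.

```python
def percentile(xs, pct):
    """Percentile as defined in the notes; pct is an integer with 0 < pct < 100."""
    s = sorted(xs)
    n = len(s)
    k = -(-pct * n // 100)          # ceil(pct * n / 100): values needed at or below
    m = -(-(100 - pct) * n // 100)  # ceil((100 - pct) * n / 100): values needed at or above
    lowest = s[k - 1]               # smallest number with at least k values <= it
    highest = s[n - m]              # largest number with at least m values >= it
    return (lowest + highest) / 2   # take the middle of the admissible interval

sample = [1, 0, 2, -2, 1, 5, -2, -1]
data_range = max(sample) - min(sample)
q1 = percentile(sample, 25)
q3 = percentile(sample, 75)
iqr = q3 - q1
print(data_range, q1, q3, iqr)
```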

2.3 Properties of $\bar{X}$ and the sd

Suppose we have n observations $X_1, \ldots, X_n$ and we use these to create a new sample $Y_1, \ldots, Y_n$ where $Y_i = aX_i + b$. This often arises when converting units of measurement, such as changing Fahrenheit data into the Centigrade scale: C = 5/9 * F - 17.778.

Then

$$\bar{Y} = a\bar{X} + b, \qquad \mathrm{sd}_Y = |a| \, \mathrm{sd}_X.$$

Can you guess the conversion formulae for the median, mode, range, and interquartile range of the Y values?
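These two rules are easy to verify numerically. In the sketch below the temperature readings are made up, and `transform` is a helper named for this illustration; the constant -160/9 is the exact value that the notes round to -17.778. The last lines use a negative multiplier to show why the sd scales by the absolute value of a.

```python
import math
from statistics import mean, stdev

def transform(xs, a, b):
    """Return the new sample Y_i = a * X_i + b."""
    return [a * x + b for x in xs]

fahrenheit = [32.0, 50.0, 68.0, 86.0, 212.0]  # made-up readings
a, b = 5 / 9, -160 / 9                         # exact form of C = 5/9 * F - 17.778
celsius = transform(fahrenheit, a, b)

# Each pair printed below should match up to rounding.
print(mean(celsius), a * mean(fahrenheit) + b)
print(stdev(celsius), abs(a) * stdev(fahrenheit))

# With a negative multiplier the sd is still positive: it scales by |a|.
flipped = transform(fahrenheit, -2, 1)
print(stdev(flipped), 2 * stdev(fahrenheit))
```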

2.4 The Histogram

The histogram shows where sample values are located and where they concentrate.

The x-axis gives the sample value, and the y-axis is the percent per unit of x. (This is different from a bar chart.) In a histogram, the area of a block represents a percentage.

By convention, the left endpoint of a histogram bar is included in the interval, but not the right.
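To make the "area equals percentage" idea concrete, here is a small sketch that computes bar heights for unequal bin widths. The data and bin edges are hypothetical, and `histogram_heights` is a name invented for this illustration. Each bin includes its left endpoint but not its right, following the convention above.

```python
def histogram_heights(xs, edges):
    """Bar heights in percent per unit of x, for bins [left, right)."""
    n = len(xs)
    heights = []
    for left, right in zip(edges, edges[1:]):
        count = sum(left <= x < right for x in xs)  # left endpoint included
        percent = 100 * count / n                   # area of the block
        heights.append(percent / (right - left))    # height = area / width
    return heights

data = [0, 0, 1, 1, 1, 2, 3, 3, 5, 7]  # hypothetical sample
edges = [0, 1, 2, 4, 8]                # hypothetical bin edges, unequal widths
heights = histogram_heights(data, edges)
areas = [h * (r - l) for h, l, r in zip(heights, edges, edges[1:])]
print(heights, areas, sum(areas))
```

When every observation falls inside the bins, the areas add up to 100 percent, which is what lets one recover a missing bar from the others.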

This incomplete histogram shows the number of parties attended in one week by Duke freshmen. What is the height of the missing bar? What percentage go to 1 or fewer parties?

As the sample size gets large and the bin-width gets small at the appropriate relative rates, the histogram becomes smooth. This limiting smooth curve is called a probability density function.

2.5 The Normal Distribution

Some limiting histograms are famous and have names. The most famous distribution is the normal distribution (a/k/a the Gaussian distribution or the bell-shaped curve).

The distribution was named after Carl Friedrich Gauss, the greatest mathematician in history. He proved the fundamental theorem of algebra four ways, inventing a new branch of mathematics each time. He worked in number theory, co-invented the telegraph, and discovered non-euclidean geometry, but did not publish, fearing controversy.

People believe the normal distribution describes IQ, height, rainfall, measurement error, and many other features. This is only approximately true. But it is a good approximation for features that are the sum of many separate increments.

A normal distribution with mean µ and standard deviation σ has the equation:

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2\sigma^2}(x - \mu)^2\right]$$

for $-\infty < \mu < \infty$ and $\sigma > 0$.

The µ is the mean of the entire population, whereas $\bar{X}$ is used to denote the mean of a sample from the population. Similarly, σ is the standard deviation of the entire population, whereas sd is used to denote the standard deviation of a sample.

The standard normal has µ = 0 and σ = 1.
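The density formula translates almost symbol for symbol into code. The following sketch defines a hypothetical helper `normal_pdf` and compares it with `NormalDist.pdf` from Python's standard library; the values µ = 100 and σ = 15 are arbitrary choices for illustration.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Height of the normal curve with mean mu and sd sigma at the point x."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    coefficient = 1 / (math.sqrt(2 * math.pi) * sigma)
    exponent = -((x - mu) ** 2) / (2 * sigma ** 2)
    return coefficient * math.exp(exponent)

# Peak of the standard normal curve, at x = mu = 0.
print(normal_pdf(0.0))

# Arbitrary illustrative parameters; the two numbers should agree.
print(normal_pdf(110.0, mu=100.0, sigma=15.0), NormalDist(100.0, 15.0).pdf(110.0))
```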

The mean of a normal distribution shows where it is centered. The standard deviation of a normal distribution shows how spread out the normal is.

2.6 The Standard Normal Distribution

Practice reading the standard normal table in the handout. A copy is also posted on the FAQ site. You may bring your table to class and use whatever notes you have on the back of the table during quizzes.

What proportion of a standard normal population has values between -1.5 and 1.5? From the table, the proportion is 1 - 2 × 0.067 = 0.866 (i.e., the total area under the curve is one, and we subtract the upper tail area and the symmetric lower tail area).

Now go the other way. About 80% of the population lies between what two values that are centered at 0? The answer is about -1.28 and 1.28 (the table shows that 10% of the area is above 1.28, and symmetrically, 10% is below -1.28).
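For checking table lookups outside of class, the same two calculations can be sketched with `statistics.NormalDist` from Python's standard library. The helper names are invented for this illustration; the results should agree with the table values above up to rounding.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mu = 0, sigma = 1

def central_area(z):
    """Proportion of a standard normal population between -z and z."""
    upper_tail = 1 - Z.cdf(z)   # area above z
    return 1 - 2 * upper_tail   # remove both symmetric tails

def central_cutoff(proportion):
    """The z such that `proportion` of the population lies between -z and z."""
    tail = (1 - proportion) / 2
    return Z.inv_cdf(1 - tail)

print(round(central_area(1.5), 3))
print(round(central_cutoff(0.80), 2))
```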

Some problems to think about:

- What is the value of z such that 25% of a standard normal population is larger than that value? (Ans: about 0.67)
- What is the value for which about 90% of the population is smaller? (Ans: about 1.28)
- What proportion of the population has a value larger than -1? (Ans: about 0.841)
- What proportion of the population has a value less than -0.3? (Ans: about 0.382)

In this class, use the nearest value in the table. If you interpolate, or use a number from your calculator, it will confuse the graders.
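The four practice answers can be reproduced as follows. This is only a cross-check using Python's standard-library `NormalDist`, with variable names chosen for this illustration; for graded work the nearest table value is still what the notes ask for.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal

z_top_25 = Z.inv_cdf(0.75)         # 25% above means 75% below
z_below_90 = Z.inv_cdf(0.90)       # 90% of the population is smaller
p_above_minus_1 = 1 - Z.cdf(-1.0)  # area to the right of -1
p_below_minus_03 = Z.cdf(-0.3)     # area to the left of -0.3

print(round(z_top_25, 2), round(z_below_90, 2))
print(round(p_above_minus_1, 3), round(p_below_minus_03, 3))
```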

How can you decide if data are a random sample from a normal distribution?

- Inspect the histogram.
- Make a normal probability plot.

To make a normal probability plot, order the observations from smallest to largest; denote the ordered observations by $X_{(1)}, X_{(2)}, \ldots, X_{(n)}$. For observation $X_{(i)}$, find the z-value such that $\frac{i - 0.5}{n} \times 100\%$ of the area under the standard normal curve is to the left. Call this z-value $Y_i$. Then plot $(X_{(i)}, Y_i)$ for all $i = 1, \ldots, n$. If this looks pretty much like a straight line, then the data are approximately normal. (There is a Stata command for this.)
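The recipe above can be sketched in Python to produce the plotting coordinates. This is an illustration rather than the Stata command the notes mention: `normal_plot_points` is a made-up name, and the sample is the small dataset from the earlier examples. The resulting pairs can be passed to any plotting tool.

```python
from statistics import NormalDist

def normal_plot_points(xs):
    """Pairs (X_(i), Y_i) for a normal probability plot."""
    z = NormalDist()  # standard normal
    n = len(xs)
    ordered = sorted(xs)
    # Y_i is the z-value with (i - 0.5)/n of the area to its left.
    return [(x, z.inv_cdf((i - 0.5) / n)) for i, x in enumerate(ordered, start=1)]

sample = [1, 0, 2, -2, 1, 5, -2, -1]
points = normal_plot_points(sample)
for x, y in points:
    print(x, round(y, 3))
```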