Notes on Distributions, Measures of Central Tendency, and Dispersion


Anthropological Sciences 192/292: Data Analysis in the Anthropological Sciences
James Holland Jones & Ian G. Robertson
February 1, 2006

The Sample Mean

Say you have a sample of n observations from some variable X, which we index with the letter i: $x_i$, where i = 1, 2, 3, ..., n.

The sample mean is given by:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

    > r <- rnorm(20, mean=69, sd=12)
    > r
     [1] 69.07067 84.08761 60.37330 70.58928 49.10580 71.16912 80.04877 53.18702
     [9] 81.15977 50.41038 45.53022 76.09200 66.07399 63.40058 74.85083 76.26923
    [17] 58.42128 68.58139 77.53043 58.71911
    > sum(r)/20

    [1] 66.73354
    > mean(r)
    [1] 66.73354

The sample mean is the most common measure of central tendency.
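As a small check on the definition, the sketch below recomputes the mean of the 20 values printed above, using `length(r)` rather than a hard-coded 20. The values are copied by hand from the printed output, so they are rounded to five decimals; the variable names `xbar_by_hand` and `xbar_builtin` are illustrative only.

```r
# The 20 values printed in the notes (rounded to 5 decimals in the printout)
r <- c(69.07067, 84.08761, 60.37330, 70.58928, 49.10580, 71.16912, 80.04877,
       53.18702, 81.15977, 50.41038, 45.53022, 76.09200, 66.07399, 63.40058,
       74.85083, 76.26923, 58.42128, 68.58139, 77.53043, 58.71911)

n <- length(r)              # sample size, taken from the data
xbar_by_hand <- sum(r) / n  # the definition: (1/n) * sum of the x_i
xbar_builtin <- mean(r)     # R's built-in sample mean

print(c(by_hand = xbar_by_hand, builtin = xbar_builtin))
```

Using `length(r)` keeps the calculation correct if the sample size changes.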

Some Properties

    # r1 is the sample of 500 normal deviates discussed next
    # (the call that generated it is not shown in these notes)
    # make a pdf of the figure
    > pdf(file="sample500.pdf")
    > hist(r1, main="")
    > abline(v=mean(r1), lwd=3, col="red")
    > dev.off()
    # don't forget to turn off the pdf device so you'll see all your
    # subsequent plots

[Figure: histogram of r1 (x-axis: r1; y-axis: Frequency), with a vertical red line at the sample mean]

Notice that in this sample of 500 normal deviates, the mean is approximately the same as the most common value.

This is a property of the normal distribution. It arises because the normal distribution is symmetric.

You can assess this symmetry by comparing the mean and the median (remember: the median is the point where 50% of the observations are above and 50% are below).

    > mean(r1)
    [1] 68.83362
    > median(r1)
    [1] 68.8683

Nearly identical!
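The comparison can be repeated on a fresh sample. This is only a sketch: the seed, and the choice of `mean = 69` and `sd = 12` (borrowed from the earlier 20-value example), are assumptions, since the notes do not show how the original sample of 500 was drawn. The name `sym_sample` is hypothetical.

```r
set.seed(1)  # arbitrary seed, so the sketch is reproducible

# Assumed parameters: 500 normal deviates with mean 69 and sd 12
sym_sample <- rnorm(500, mean = 69, sd = 12)

sample_mean   <- mean(sym_sample)
sample_median <- median(sym_sample)

# For a symmetric distribution the two should be close
print(c(mean = sample_mean, median = sample_median,
        difference = sample_mean - sample_median))
```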

Means for Skewed Distributions

    > r2 <- rlnorm(500, meanlog=log(69), sdlog=log(12))
    > max(r2)
    [1] 174837.2
    > mean(r2)
    [1] 1145.734
    > median(r2)
    [1] 68.27205

Yikes!

Note that I used a lognormal distribution, rlnorm(), to generate these variates.

The lognormal distribution (like the gamma or exponential) is a skewed distribution: it has a long tail, so a few very large values pull the mean far above the median.
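The gap between mean and median is not a fluke of this sample. For a lognormal distribution the population median is exp(meanlog) and the population mean is exp(meanlog + sdlog^2/2), which are standard results. The sketch below evaluates both for the parameters used above and then draws a fresh sample; the seed and variable names are assumptions made for illustration.

```r
meanlog <- log(69)
sdlog   <- log(12)

# Population values for the lognormal distribution
theo_median <- exp(meanlog)                # equals 69
theo_mean   <- exp(meanlog + sdlog^2 / 2)  # far larger, because sdlog is big

# A fresh sample (arbitrary seed) shows the same ordering
set.seed(42)
skewed <- rlnorm(500, meanlog = meanlog, sdlog = sdlog)

print(c(theo_median = theo_median, theo_mean = theo_mean,
        sample_median = median(skewed), sample_mean = mean(skewed)))
```

The sample mean of so heavy-tailed a distribution varies a great deal from one sample to the next, so it need not land close to the population value.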

An Exponential Example

    > r3 <- rexp(500, rate=2)
    > mean(r3)
    [1] 0.5219174
    > median(r3)
    [1] 0.3496812
    > pdf(file="sample500exp.pdf")
    > hist(r3, main="")
    > abline(v=mean(r3), lwd=2, col="red")
    > abline(v=median(r3), lwd=2, lty=2, col="red")
    > legend(3, 250, c("mean", "median"), lty=1:2, lwd=2, col="red")
    > dev.off()
    null device
              1

[Figure: histogram of r3 (x-axis: r3; y-axis: Frequency), with the mean marked by a solid red line and the median by a dashed red line]
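The sample values above can be compared with the population values. For an exponential distribution with a given rate, the mean is 1/rate and the median is log(2)/rate. A minimal sketch, with illustrative variable names:

```r
rate <- 2

theo_mean   <- 1 / rate       # population mean of an exponential
theo_median <- log(2) / rate  # population median; same as qexp(0.5, rate)

print(c(theo_mean = theo_mean, theo_median = theo_median,
        qexp_median = qexp(0.5, rate = rate)))
```

As in the histogram, the median sits to the left of the mean because the long right tail pulls the mean upward.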

More Properties of the Mean

Let Y be a linear function of X:

$$y_i = x_i + c$$

where c is a constant. The mean of Y is then

$$\bar{y} = \bar{x} + c$$

Now say that we scale X so that

$$y_i = c x_i$$

The mean of Y is then

$$\bar{y} = c \bar{x}$$

Combine them!

$$y_i = c_1 x_i + c_2$$

$$\bar{y} = c_1 \bar{x} + c_2$$

Say you have a mean temperature of 11.75 °C. What is the mean in °F?

The conversion formula from °C to °F is:

$$y_i = \frac{9}{5} x_i + 32$$

The transformed mean is

$$\bar{y} = \frac{9}{5}(11.75) + 32 = 53.15 \text{ °F}$$
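The same result can be checked numerically. The four Celsius readings below are hypothetical, chosen only so that their mean is 11.75; converting each reading and then averaging gives the same answer as converting the mean directly.

```r
# Hypothetical readings in degrees Celsius; their mean is 11.75
celsius <- c(9.5, 11.0, 12.25, 14.25)

# Convert every observation, then take the mean
fahrenheit <- 9 / 5 * celsius + 32
mean_of_converted <- mean(fahrenheit)

# Convert the mean directly, using ybar = c1 * xbar + c2
converted_mean <- 9 / 5 * mean(celsius) + 32

print(c(mean_of_converted = mean_of_converted, converted_mean = converted_mean))
```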

Sample Variance and Standard Deviation

The most important measure of spread of a distribution is the variance.

The sample variance is given by

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

The sample standard deviation is simply the square root of this:

$$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$

A more useful formula for calculating sample variance is given by

$$s^2 = \frac{1}{n-1} \left( \sum_{i=1}^{n} x_i^2 - n \bar{x}^2 \right)$$

With this formula, you don't need to compute the difference between each observation and the mean and then square it, a process in which it is easy to make errors.
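A quick sketch confirms that the two formulas agree with each other and with R's built-in `var()`. The data vector is hypothetical and the variable names are illustrative.

```r
# Hypothetical data with mean 5
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)
xbar <- mean(x)

# Definition: squared deviations from the mean, divided by n - 1
s2_definition <- sum((x - xbar)^2) / (n - 1)

# Computational formula: sum of squares minus n * xbar^2, divided by n - 1
s2_shortcut <- (sum(x^2) - n * xbar^2) / (n - 1)

print(c(definition = s2_definition, shortcut = s2_shortcut, builtin = var(x)))
```

The shortcut is convenient for hand calculation. In floating-point arithmetic it can lose precision when the mean is large relative to the spread, so for real work `var()` and `sd()` are the safer choice.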

Properties of Sample Variances

Add a constant to X:

$$y_i = x_i + c$$

The variance remains unchanged!

$$s_y^2 = s_x^2$$

Now scale X by some constant multiplier:

$$y_i = c x_i$$

This time the variance changes:

$$s_y^2 = c^2 s_x^2$$
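Both properties are easy to verify numerically. The data and the constant below are hypothetical; the constant is called `k` here only to avoid masking R's `c()` function.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)  # hypothetical data
k <- 3                          # the constant written as c in the notes

var_original <- var(x)
var_shifted  <- var(x + k)  # adding a constant: variance unchanged
var_scaled   <- var(k * x)  # multiplying by a constant: variance times k^2

print(c(original = var_original, shifted = var_shifted,
        scaled = var_scaled, k2_times_original = k^2 * var_original))
```

It follows that the standard deviation scales by the absolute value of the constant, since it is the square root of the variance.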

Coefficient of Variation

A handy way to characterize the relative variability of a distribution or sample is with the coefficient of variation:

$$CV = 100 \, \frac{s}{\bar{x}}$$

The CV remains the same regardless of units, since if the units are changed by a factor c, both the sample mean and sample standard deviation will change by this factor and they will cancel out.
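Base R has no built-in CV function, so the sketch below defines a small helper, `cv()`, which is a hypothetical name. The heights are made-up values used only to show that rescaling the units (centimetres to inches) leaves the CV unchanged.

```r
# Hypothetical helper: coefficient of variation, in percent
cv <- function(x) 100 * sd(x) / mean(x)

heights_cm <- c(150, 160, 165, 170, 180)  # made-up heights in centimetres
heights_in <- heights_cm / 2.54           # the same heights in inches

cv_cm <- cv(heights_cm)
cv_in <- cv(heights_in)

# Adding a constant is not a pure rescaling, so the CV does change
cv_shifted <- cv(heights_cm + 100)

print(c(cv_cm = cv_cm, cv_in = cv_in, cv_shifted = cv_shifted))
```

The invariance holds for changes of scale only. A conversion that also adds a constant, such as °C to °F, shifts the mean without changing the standard deviation by the same factor, so the CV is not preserved.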

The Normal Distribution

This is the ubiquitous distribution in statistics.

The normal distribution is very useful for modeling all sorts of natural phenomena, particularly lots of biometric things (e.g., body size, height). It also forms the basis of most statistical tests that are used.

The normal distribution is completely characterized by two parameters: µ and σ. These are the mean and standard deviation, respectively.

The normal distribution has probability density function

$$f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
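The density formula can be written out directly and compared with R's `dnorm()`. The function name `normal_pdf` and the parameter values (µ = 69, σ = 12, borrowed from the earlier simulation) are assumptions made for illustration.

```r
# The normal density written out from the formula above
normal_pdf <- function(x, mu, sigma) {
  exp(-(x - mu)^2 / (2 * sigma^2)) / sqrt(2 * pi * sigma^2)
}

x <- seq(30, 110, by = 10)  # a few evaluation points around the mean

by_formula <- normal_pdf(x, mu = 69, sigma = 12)
by_dnorm   <- dnorm(x, mean = 69, sd = 12)

# Largest absolute discrepancy between the two; should be essentially zero
max_diff <- max(abs(by_formula - by_dnorm))
print(max_diff)
```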

Standard Normal Distribution

When µ = 0 and σ = 1, we refer to the distribution as the standard normal.

The standard normal distribution has probability density function

$$f_X(x) = \frac{1}{\sqrt{2\pi}} \, e^{-(1/2)x^2}$$

Some Properties:

Approximately 68% of the area under a standard normal density lies between -1 and 1.

Approximately 95% of the area under a standard normal density lies between -2 and 2.

97.5% of the area under the standard normal density lies below the value 1.96; that is, the cumulative distribution function (pnorm()) evaluated at 1.96 is 0.975.

As you practice statistics, you will see this seemingly bizarre number 1.96 come up over and over again. This is where it comes from.

Any time you see a formula (e.g., for a confidence interval or a hypothesis test) that involves multiplying something by 1.96, you are using a normal approximation to something.
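These areas can be read off with `pnorm()`, and the 1.96 itself can be recovered with `qnorm()`, the inverse of the cumulative distribution function. A brief sketch, with illustrative variable names:

```r
# Area under the standard normal density between -1 and 1, and between -2 and 2
within_1 <- pnorm(1) - pnorm(-1)  # approximately 0.68
within_2 <- pnorm(2) - pnorm(-2)  # approximately 0.95

# Area below 1.96, and the quantile that leaves 2.5% in the upper tail
below_196 <- pnorm(1.96)    # approximately 0.975
z_975     <- qnorm(0.975)   # approximately 1.96

print(c(within_1 = within_1, within_2 = within_2,
        below_196 = below_196, z_975 = z_975))
```

Because the distribution is symmetric, 2.5% of the area also lies below -1.96, so the interval from -1.96 to 1.96 contains 95% of the area. That is why 1.96 appears in 95% confidence intervals.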

The Standard Normal Density Function φ(z)

$$\varphi(z) = \frac{1}{\sqrt{2\pi}} \, e^{-z^2/2}$$

[Figure: the standard normal density φ(z) plotted for z from -4 to 4; the vertical axis runs from 0.0 to 0.4]
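A curve like the one in the figure can be drawn with `curve()` and `dnorm()`. This is a sketch of one way to draw it, not necessarily how the original figure was produced; the axis labels and variable names are assumptions.

```r
# Height of the density at its peak, z = 0: 1 / sqrt(2 * pi), about 0.4
peak <- dnorm(0)

# The density is symmetric about zero
left  <- dnorm(-1.5)
right <- dnorm(1.5)

# Draw the standard normal density over the same range as the figure
curve(dnorm(x), from = -4, to = 4, xlab = "z", ylab = expression(phi(z)))

print(c(peak = peak, left = left, right = right))
```

The peak height of roughly 0.4 matches the top of the vertical axis in the figure.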