Notes on Distributions, Measures of Central Tendency, and Dispersion

Anthropological Sciences 192/292
Data Analysis in the Anthropological Sciences
James Holland Jones & Ian G. Robertson
February 1, 2006
The Sample Mean

Say you have a sample of n observations from some variable X, which we index with the letter i: $x_i$, where i = 1, 2, 3, ..., n.

The sample mean is given by:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

    > r <- rnorm(20, mean=69, sd=12)
    > r
     [1] 69.07067 84.08761 60.37330 70.58928 49.10580 71.16912 80.04877 53.18702
     [9] 81.15977 50.41038 45.53022 76.09200 66.07399 63.40058 74.85083 76.26923
    [17] 58.42128 68.58139 77.53043 58.71911
    > sum(r)/20
    [1] 66.73354
    > mean(r)
    [1] 66.73354

The sample mean is the most common measure of central tendency.
Some Properties

    # make a pdf of the figure
    > pdf(file="sample500.pdf")
    > hist(r1, main="")
    > abline(v=mean(r1), lwd=3, col="red")
    > dev.off()
    # don't forget to turn off the pdf device so you'll see all your
    # subsequent plots

[Figure: histogram of r1 (x-axis: r1, y-axis: Frequency) with a vertical red line at the sample mean]
Notice that in this sample of 500 normal deviates, the mean is approximately the same as the most common value. This is a property of the normal distribution. It arises because the normal distribution is symmetric.

You can assess this symmetry by comparing the mean and the median (remember: the median is the point where 50% of the observations are above and 50% are below).

    > mean(r1)
    [1] 68.83362
    > median(r1)
    [1] 68.8683

Nearly identical!
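The slides do not show how r1 was created. The following is a minimal sketch that assumes it was drawn with the same mean and standard deviation as r, but with n = 500. Both the parameters and the seed are assumptions, so the numbers will not match the output above.

```r
# Assumption: r1 is a sample of 500 normal deviates with the same
# mean and sd used for r; the seed is arbitrary, for reproducibility only.
set.seed(1)
r1 <- rnorm(500, mean = 69, sd = 12)

# For a sample from a symmetric distribution the two should be close
mean(r1)
median(r1)

# Gap between mean and median, expressed in standard deviations
abs(mean(r1) - median(r1)) / sd(r1)
```

Expressing the gap in units of the standard deviation gives a rough, scale-free sense of how symmetric the sample is.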
Means for Skewed Distributions

    > r2 <- rlnorm(500,meanlog=log(69), sdlog=log(12))
    > max(r2)
    [1] 174837.2
    > mean(r2)
    [1] 1145.734
    > median(r2)
    [1] 68.27205

Yikes! A handful of very large values pull the mean far above the median.

Note that I used a lognormal distribution, rlnorm(), to generate these variates. The lognormal distribution (like the gamma or exponential) is a skewed distribution: it has a long tail.
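The gap between the mean and the median is what lognormal theory predicts. For a lognormal distribution with parameters meanlog and sdlog, the median is exp(meanlog) and the mean is exp(meanlog + sdlog^2/2). A quick check, using the parameter values from the rlnorm() call (the variable names here are my own):

```r
# Parameters taken from the rlnorm() call in the slides
meanlog <- log(69)
sdlog   <- log(12)

# Theoretical median and mean of the lognormal distribution
theo.median <- exp(meanlog)
theo.mean   <- exp(meanlog + sdlog^2 / 2)

theo.median   # 69
theo.mean     # roughly 1500

# Cross-check the median against the quantile function
qlnorm(0.5, meanlog = meanlog, sdlog = sdlog)
```

The sample values of about 68 for the median and about 1146 for the mean are in line with these, given how variable the sample mean of such a heavy-tailed distribution is.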
An Exponential Example

    > r3 <- rexp(500,rate=2)
    > mean(r3)
    [1] 0.5219174
    > median(r3)
    [1] 0.3496812
    > pdf(file="sample500exp.pdf")
    > hist(r3, main="")
    > abline(v=mean(r3), lwd=2, col="red")
    > abline(v=median(r3), lwd=2, lty=2, col="red")
    > legend(3,250,c("mean", "median"), lty=1:2, lwd=2, col="red")
    > dev.off()
    null device
              1
[Figure: histogram of r3 (x-axis: r3, y-axis: Frequency) with vertical red lines marking the mean (solid) and the median (dashed)]
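For comparison with the sample values, an exponential distribution with a given rate has mean 1/rate and median log(2)/rate. A short sketch (the variable names are my own):

```r
# Rate used in the rexp() call in the slides
rate <- 2

# Theoretical mean and median of the exponential distribution
theo.mean   <- 1 / rate
theo.median <- log(2) / rate

theo.mean     # 0.5
theo.median   # about 0.347

# Cross-check the median against the quantile function
qexp(0.5, rate = rate)
```

The mean exceeds the median, as expected for a distribution with a long right tail.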
More Properties of the Mean

Let Y be a linear function of X:

$$y_i = x_i + c$$

where c is a constant. The mean of Y is then

$$\bar{y} = \bar{x} + c$$

Now say that we scale X so that

$$y_i = c x_i$$

The mean of Y is then

$$\bar{y} = c \bar{x}$$

Combine them!

$$y_i = c_1 x_i + c_2$$

$$\bar{y} = c_1 \bar{x} + c_2$$

Say you have a mean temperature of 11.75 °C. What is the mean in °F? The conversion formula from °C to °F is:

$$y_i = \frac{9}{5} x_i + 32$$

The transformed mean is

$$\bar{y} = \frac{9}{5}(11.75) + 32 = 53.15 \text{ °F}$$
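The same result can be checked numerically. This sketch uses made-up temperatures (hypothetical data, chosen only so that their mean is 11.75 °C) and compares converting every observation with converting the mean directly:

```r
# Hypothetical temperatures in degrees Celsius; their mean is 11.75
temp.c <- c(9.5, 11, 12.5, 14)
mean(temp.c)

# Convert every observation to Fahrenheit, then take the mean
temp.f <- (9/5) * temp.c + 32
mean(temp.f)

# Convert the mean directly
(9/5) * mean(temp.c) + 32
```

Both routes give 53.15, so only the mean is needed to change units.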
Sample Variance and Standard Deviation

The most important measure of spread of a distribution is the variance. The sample variance is given by

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

The sample standard deviation is simply the square root of this:

$$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$

A more useful formula for calculating sample variance is given by

$$s^2 = \frac{1}{n-1} \left( \sum_{i=1}^{n} x_i^2 - n \bar{x}^2 \right)$$

With this formula, you don't need to compute the difference between each observation and the mean and then square it, a process in which it is easy to make errors.
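A minimal sketch comparing the two variance formulas with R's built-in var() and sd(). The data are simulated with an arbitrary seed, and the variable names are my own:

```r
# Simulated sample; the seed and parameters are arbitrary
set.seed(2)
x <- rnorm(20, mean = 69, sd = 12)
n <- length(x)

# Definitional formula: squared deviations from the mean
s2.def <- sum((x - mean(x))^2) / (n - 1)

# Computational formula: sum of squares minus n times the squared mean
s2.comp <- (sum(x^2) - n * mean(x)^2) / (n - 1)

# Both should agree with the built-in functions
all.equal(s2.def, var(x))
all.equal(s2.comp, var(x))
all.equal(sqrt(s2.def), sd(x))
```

Note that var() and sd() use the n - 1 denominator, so they match the sample formulas given here.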
Properties of Sample Variances

Add a constant to X:

$$y_i = x_i + c$$

The variance remains unchanged!

$$s_y^2 = s_x^2$$

Now scale X by some constant multiplier:

$$y_i = c x_i$$

This time the variance changes:

$$s_y^2 = c^2 s_x^2$$
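Both properties are easy to confirm numerically. A sketch with simulated data (the seed, the parameters, and the constant are arbitrary choices):

```r
# Simulated sample; the seed and parameters are arbitrary
set.seed(3)
x <- rnorm(20, mean = 69, sd = 12)

# The constant is named c0 so it does not mask R's c() function
c0 <- 5

var(x)
var(x + c0)      # adding a constant: variance unchanged
var(c0 * x)      # scaling: variance multiplied by c0^2
c0^2 * var(x)
```

The standard deviation, being the square root of the variance, is multiplied by the absolute value of the constant rather than by its square.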
Coefficient of Variation

A handy way to characterize the relative variability of a distribution or sample is with the coefficient of variation:

$$CV = 100 \, \frac{s}{\bar{x}}$$

The CV remains the same regardless of units: if the units are changed by a factor c, both the sample mean and the sample standard deviation change by this factor, and the two factors cancel.
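A short sketch of this invariance, using a small helper function cv() (my own name, not a base R function) and made-up heights. The last line shows that adding a constant is not a change by a factor, so it does alter the CV:

```r
# Helper for the coefficient of variation, in percent (name is my own)
cv <- function(x) 100 * sd(x) / mean(x)

# Hypothetical heights in centimetres
height.cm <- c(152, 160, 165, 171, 178, 183)

# The same heights in inches: a change of units by a constant factor
height.in <- height.cm / 2.54

cv(height.cm)
cv(height.in)        # identical to the CV in centimetres

# An additive shift changes the mean but not the sd, so the CV changes
cv(height.cm + 100)
```

This is why the CV is only meaningful for measurements on a scale with a true zero.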
The Normal Distribution

This is the ubiquitous distribution in statistics. The normal distribution is very useful for modeling all sorts of natural phenomena, particularly many biometric measurements (e.g., body size, height). It also forms the basis of most commonly used statistical tests.

The normal distribution is completely characterized by two parameters, µ and σ, which are the mean and standard deviation respectively.

The normal distribution has probability density function

$$f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
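As a check on the density formula, the sketch below codes it directly and compares it with R's dnorm(). The function name normal.pdf and the parameter values are my own choices for illustration:

```r
# Density of the normal distribution, written out from the formula
normal.pdf <- function(x, mu, sigma) {
  1 / sqrt(2 * pi * sigma^2) * exp(-(x - mu)^2 / (2 * sigma^2))
}

# Illustrative parameter values and evaluation points
mu    <- 69
sigma <- 12
x     <- seq(30, 110, by = 10)

# The hand-coded density should match the built-in one
all.equal(normal.pdf(x, mu, sigma), dnorm(x, mean = mu, sd = sigma))
```

Note that dnorm() is parameterized by the standard deviation, not the variance.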
Standard Normal Distribution

When µ = 0 and σ = 1, we refer to the distribution as the standard normal. The standard normal distribution has probability density function

$$f_X(x) = \frac{1}{\sqrt{2\pi}} \, e^{-(1/2)x^2}$$

Some Properties:

Approximately 68% of the area under a standard normal density lies between -1 and 1.

Approximately 95% of the area under a standard normal density lies between -2 and 2.

97.5% of the area under the standard normal density lies below the value 1.96; that is, the cumulative distribution function (pnorm()) evaluated at 1.96 is 0.975. By symmetry, 2.5% lies below -1.96, so 95% of the area lies between -1.96 and 1.96.

As you practice statistics, you will see this seemingly bizarre number 1.96 come up over and over again. This is where it comes from.

Any time you see a formula (e.g., for a confidence interval or a hypothesis test) that involves multiplying something by 1.96, you are using a normal approximation to something.
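These areas can be read off the standard normal cumulative distribution function with pnorm(), and the 1.96 value recovered with the quantile function qnorm():

```r
# Area under the standard normal density between -1 and 1
pnorm(1) - pnorm(-1)          # about 0.683

# Area between -2 and 2
pnorm(2) - pnorm(-2)          # about 0.954

# Area below 1.96
pnorm(1.96)                   # about 0.975

# The quantile that leaves 2.5% in the upper tail
qnorm(0.975)                  # about 1.96

# Area between -1.96 and 1.96
pnorm(1.96) - pnorm(-1.96)    # about 0.95
```

The area between -2 and 2 is slightly more than 95%, which is why the exact multiplier is 1.96 rather than 2.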
The Standard Normal Density Function

[Figure: the standard normal density $\phi(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}$ plotted against z for z from -4 to 4; the y-axis, labeled φ(z), runs from 0.0 to 0.4]
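A figure like this one can be drawn with a few lines of R. This is a sketch of one way to do it, not necessarily how the original figure was produced; the grid size and line width are arbitrary choices:

```r
# Grid of z values matching the range of the figure
z <- seq(-4, 4, length.out = 401)

# Standard normal density evaluated on the grid
phi <- dnorm(z)

# Line plot of the density, with phi(z) as the y-axis label
plot(z, phi, type = "l", lwd = 2,
     xlab = "z", ylab = expression(phi(z)))

# Peak height, reached at z = 0
max(phi)   # about 0.399
```

The peak height is 1/sqrt(2*pi), the normalizing constant in the density formula.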