GEOS 33000/EVOL 33000
10 January 2006 (modified January 12, 2006)

III. Sampling

0 Some R commands for functions we've covered so far

0.1 rbinom(m,n,p) returns m integers drawn from the binomial distribution with n trials and probability of success p. Each integer returned corresponds to k in our terminology.

0.2 dbinom(k,n,p) returns the probability of exactly k successes in n trials, each with probability of success p.

0.3 pbinom(j,n,p) returns the cumulative probability of j or fewer successes in n trials, each with probability of success p.

0.4 rpois(m,a) returns m integers drawn from the Poisson distribution with parameter a.

0.5 dpois(k,a) returns the probability of exactly k events under the Poisson distribution with parameter a.

0.6 ppois(j,a) returns the cumulative probability of j or fewer events under the Poisson distribution with parameter a.

0.7 rmultinom(m,n,p) returns m vectors of integers drawn from the multinomial distribution with n trials and vector of probabilities p.

0.8 dmultinom(k,n,p) returns the probability of sampling exactly the vector of integers k in n trials with vector of probabilities p.
0.9 rexp(n,a) returns n numbers drawn from the exponential distribution with parameter (rate) a.

0.10 dexp(x,a) returns the density of the exponential distribution with parameter a at X = x.

0.11 pexp(x,a) returns the cumulative probability of the exponential distribution with parameter a at X = x.

0.12 rnorm(n) returns n numbers drawn from the standard normal distribution (with zero mean and unit variance).

0.13 dnorm(x) returns the standard normal density at X = x.

0.14 pnorm(x) returns the standard normal distribution function (cumulative probability) at X = x.

0.15 runif(), dunif(), punif(): these are like rnorm(), dnorm(), pnorm(), but for the uniform distribution on (0,1).

0.16 choose(n,k) returns the binomial coefficient C(n,k), i.e. n!/[k!(n-k)!].

0.17 factorial(j) returns j!.

0.18 lfactorial(j) returns ln(j!).

0.19 gamma(x) returns Γ(x).
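The d-, p-, and r- forms of each family fit together in predictable ways. A minimal sketch checking a few of these identities (the parameter values are arbitrary):

```r
set.seed(1)
j <- 3; n <- 10; p <- 0.25

# Cumulative probability is the sum of point probabilities (discrete case)
stopifnot(all.equal(pbinom(j, n, p), sum(dbinom(0:j, n, p))))

# Same identity for the Poisson
a <- 2.5
stopifnot(all.equal(ppois(j, a), sum(dpois(0:j, a))))

# choose(), factorial(), and gamma() are linked
stopifnot(all.equal(choose(n, j), factorial(n) / (factorial(j) * factorial(n - j))))
stopifnot(all.equal(gamma(n), factorial(n - 1)))

# Random draws roughly recover the theoretical mean (n*p for the binomial)
x <- rbinom(100000, n, p)
stopifnot(abs(mean(x) - n * p) < 0.05)
```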
0.20 mean(x), median(x), var(x), sd(x) return the mean, median, variance, and standard deviation of the vector or array x.

0.21 cov(x,y) returns the covariance between vectors x and y.

1 Overview of Sampling, Error, Bias

2 Error Estimates With Assumed Sampling Distribution

2.1 Standard error: the standard deviation of the distribution of sample statistics that would result from an infinite number of trials of drawing a sample from the underlying probability distribution and calculating the sample statistic.

2.2 In practice we generally do not estimate error by repeated sampling from the underlying distribution (expensive and time-consuming), although there are exceptions.

2.3 Approximations based on the sample distribution (from Sokal and Rohlf):
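One of the standard approximations is SE of the mean ≈ s/√n. A sketch comparing that single-sample formula against the brute-force definition in 2.1; the underlying distribution here is assumed normal with σ = 2, purely for illustration:

```r
set.seed(1)
n <- 50           # sample size
ntrials <- 10000  # repeated samples from the underlying distribution

# Brute-force standard error (the definition in 2.1): SD of many sample means
means <- replicate(ntrials, mean(rnorm(n, mean = 0, sd = 2)))
se.empirical <- sd(means)

# Formula-based estimate from a single sample: s / sqrt(n)
x <- rnorm(n, mean = 0, sd = 2)
se.formula <- sd(x) / sqrt(n)

# Both should sit near the theoretical value sigma/sqrt(n) = 2/sqrt(50)
se.empirical
se.formula
2 / sqrt(50)
```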
2.4 Limitations:

2.4.1 Many approximation formulae make assumptions about the shape of the distribution and the sample size.

2.4.2 We may be interested in a novel statistic or one whose sampling distribution is not well characterized.

3 Bootstrap Error Estimates

3.1 Estimate the standard error by resampling from the single sample we have.

3.2 This approach uses sampling with replacement from the observed sample to simulate drawing fresh samples from the underlying distribution.

3.3 Procedure

3.3.1 Start with the observed sample of size n and the observed sample statistic; call it Z.

3.3.2 Randomly pick a sample of size n, with replacement, from the observed sample.

3.3.3 Calculate the sample statistic of interest on this random sample; call it Zboot.

3.3.4 Repeat many times (generally hundreds to thousands).

3.3.5 Calculate the standard deviation of the Zboot values. This is an estimate of the standard error of the observed sample statistic Z: SD(Zboot) ≈ SE(Z).

3.4 Simple (but not necessarily most useful) example: trimmed mean. Define the p% trimmed mean as the mean of the sample with the p% lowest and p% highest observations discarded. (The idea is to reduce the effect of outliers.) Suppose the data consist of 10 (ordered) observations: 1, 2, 3, 4, 8, 10, 12, 15, 20, 30, and we trim two observations from each end. Let the trimmed mean be denoted Z. Then Z = (3 + 4 + 8 + 10 + 12 + 15)/6 ≈ 8.67.
R code to estimate SE(Z):

    #define function: mean of x after dropping ntrim values from each end
    trim.mean <- function(x, ntrim) {
      xtmp <- x[order(x)]
      n <- length(x)
      return(mean(xtmp[(ntrim+1):(n-ntrim)]))
    }

    data <- c(1,2,3,4,8,10,12,15,20,30)  #specify data
    n <- length(data)
    ntrim <- 2                           #specify number to trim from each side
    Zobs <- trim.mean(data, ntrim)       #get observed value
    nrep <- 10000                        #specify number of bootstrap replicates
    Zboot <- rep(NA, nrep)               #assign memory
    for (i in 1:nrep)                    #get bootstrap replicates
      Zboot[i] <- trim.mean(sample(data, n, replace=TRUE), ntrim)
    SE <- sd(Zboot)                      #calculate bootstrap std. error
    hist(Zboot, breaks=50)               #plot histogram of results

This yields Zobs ≈ 8.67 and SE(Z) ≈ 3.1.

[Figure: histogram of the 10,000 bootstrap values Zboot; x-axis Zboot (roughly 5 to 25), y-axis Frequency.]
3.5 Useful R function: sample(x,n,replace=TRUE) (or replace=FALSE) returns a random sample of size n from the vector x, with or without replacement.

3.6 To sample from an array X so that the variables (columns) stay together:

    nr <- dim(X)[1]                    #get number of rows
    i <- sample(1:nr, n, replace=TRUE) #vector of n row indices sampled from 1:nr
    XSAMP <- X[i,]                     #each sampled row kept intact

4 Parametric bootstrap

4.1 Take the observed sample and estimate the relevant parameter from it.

4.2 Resample from the parametric distribution with parameter equal to the sample estimate (rather than resampling from the observed distribution).

4.3 This approach can also be applied to more complicated situations: for example, simulating a process with parameters estimated from data.

4.3.1 We'll do lots of this later...
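A minimal sketch of steps 4.1-4.2, assuming (purely for illustration) that the data come from an exponential distribution and that the statistic of interest is the estimated rate:

```r
set.seed(2)
x <- rexp(40, rate = 0.5)   # stand-in for an observed sample

# 4.1: estimate the relevant parameter from the observed sample
rate.hat <- 1 / mean(x)     # maximum-likelihood estimate of the rate

# 4.2: resample from the fitted parametric distribution, not from the data
nrep <- 5000
rate.boot <- rep(NA, nrep)
for (i in 1:nrep) {
  xb <- rexp(length(x), rate = rate.hat)
  rate.boot[i] <- 1 / mean(xb)
}
SE.rate <- sd(rate.boot)    # parametric-bootstrap standard error of rate.hat
```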
5 Examples of Finite-sample Bias (sample-size bias)

5.1 Sample variance

5.1.1 Σ(x − x̄)²/n is biased: it is systematically too low, which makes sense since it is based on squared deviations from the sample mean rather than the true mean.

5.1.2 Σ(x − x̄)²/(n − 1) is unbiased.

5.2 Number of taxa

5.2.1 Rarefaction method (from Raup 1975)

Abundance of species i is N_i; N = Σ N_i. Consider a particular species, i. C(N − N_i, n) is the number of ways of drawing a sample of n that contains none of the individuals of species i. C(N, n) is the number of ways of drawing a sample of n from all individuals. Therefore the ratio C(N − N_i, n)/C(N, n) is the probability of not drawing any individuals of species i, and 1 minus this ratio is the probability of drawing at least one individual of species i. So the expected number of species in a sample of size n is just the sum of this probability, calculated for each species in turn:

    E(S_n) = Σ_i [1 − C(N − N_i, n)/C(N, n)]

5.2.2 Caveats

Rarefaction is for interpolation rather than extrapolation.

Collecting curves vs. rarefaction curves.

Apparent leveling off of curves does not imply that nearly everything has been found (only that you're unlikely to find it with modest effort).

Curves are affected by factors other than sample size (sampling method, taxonomic treatment, size of geographic area, etc.).

Crossing of rarefaction curves can make interpretation difficult.
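The expected-species formula in 5.2.1 translates directly into R via choose(); the function name and the abundance vector below are hypothetical, for illustration only:

```r
# Expected number of species in a random subsample of size n (Raup 1975):
# E(S_n) = sum over species of [1 - C(N - N_i, n) / C(N, n)]
rarefy.species <- function(abund, n) {
  N <- sum(abund)
  sum(1 - choose(N - abund, n) / choose(N, n))
}

abund <- c(50, 30, 10, 5, 3, 1, 1)    # made-up abundances for illustration
rarefy.species(abund, sum(abund))     # full sample: all 7 species expected
rarefy.species(abund, 1)              # one individual: exactly 1 species
rarefy.species(abund, 10)             # expected species in a sample of 10
```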
5.2.3 Examples of application of taxonomic rarefaction (Raup 1975; Raup and Schopf 1978)

This example suggests that the increase in observed family diversity in post-Paleozoic echinoids cannot be accounted for by an increase in the number of species sampled.
This example suggests that much of the variation in the number of observed echinoid orders is consistent with differences in the number of sampled species.
5.2.4 Interpretation of taxonomic rarefaction curves is not entirely straightforward. Sampling standardization will be treated in more detail later.
5.3 Range

5.3.1 Example: Range of samples from a normal distribution
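The point of this example can be reproduced with a short simulation (a sketch; the sample sizes are arbitrary): the expected range of a normal sample grows with n, so the observed range is strongly sample-size biased.

```r
set.seed(3)
ntrials <- 5000
sizes <- c(5, 10, 50, 200)

# Mean sample range at each sample size, drawing from a standard normal
mean.range <- sapply(sizes, function(n) {
  mean(replicate(ntrials, diff(range(rnorm(n)))))
})
round(mean.range, 2)   # increases steadily with n, with no leveling off
```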
5.3.2 Example: Test for nonrandomness of sampling with respect to morphology
5.3.3 Correction in the general case via rarefaction (random subsampling at a controlled sample size)

Caveat: Range at a standardized sample size may not convey any information that isn't conveyed by the sample variance.
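A sketch of this subsampling correction (the helper rarefied.range is hypothetical, not from the notes): rarefy every sample to a common size m before comparing ranges.

```r
# Mean range over repeated subsamples of size m, drawn without replacement
rarefied.range <- function(x, m, nrep = 1000) {
  mean(replicate(nrep, diff(range(sample(x, m, replace = FALSE)))))
}

set.seed(4)
big   <- rnorm(200)   # large sample: raw range inflated by sample size alone
small <- rnorm(20)    # small sample from the same distribution
diff(range(big)); diff(range(small))                # raw ranges differ
rarefied.range(big, 20); rarefied.range(small, 20)  # comparable at fixed m
```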
6 Extreme value statistics

6.1 Introduction to problem

6.1.1 Our previous look at standard errors considered the sampling distribution of quantities such as the mean.

6.1.2 We may also be interested in the distribution of extremes: for example, how is the largest of n observations distributed, or the second smallest, etc.?

6.2 Probability of a number of observations exceeding some value, if the distribution is known

6.2.1 Pr(X > x) = 1 − F(x), where F(x) is the cumulative distribution function.

6.2.2 If there are N observations, then the probability that exactly k of them exceed some value x is given by a simple binomial:

    C(N,k) [1 − F(x)]^k [F(x)]^(N−k)

6.2.3 Example: normal with N = 10, x = 0.67, and k = 3: F(0.67) ≈ 0.75, so the probability ≈ C(10,3) × 0.25^3 × 0.75^7 ≈ 0.25.

6.2.4 Future observations

Suppose we have n1 past observations ranked from m = 1 (largest) to m = n1 (smallest), and we take n2 future observations. What is the probability that exactly k of the n2 observations will exceed the m-th value from the first set of n1 observations? Simply find F(x) corresponding to the m-th value and plug it into the previous binomial equation. Clearly this works only if we know the distribution.
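The example in 6.2.3, worked in R:

```r
# Probability that exactly k = 3 of N = 10 standard-normal observations
# exceed x = 0.67
N <- 10; k <- 3; x <- 0.67
F <- pnorm(x)                             # F(0.67), approximately 0.75
p <- choose(N, k) * (1 - F)^k * F^(N - k)
round(p, 2)                               # approximately 0.25

# Equivalently: the number of exceedances is binomial(N, 1 - F(x))
dbinom(k, N, 1 - F)
```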
6.3 Probability of number of observations exceeding some value, even if the distribution is not known

6.3.1 General expression (Gumbel): the probability w(x) that exactly x of n2 future observations exceed the m-th largest of n1 past observations is

    w(x) = m C(n1,m) C(n2,x) / [(n1+n2) C(n1+n2-1, x+m-1)]

Note that no F(x) appears: the result depends only on ranks, not on the shape of the (continuous) underlying distribution.
6.3.2 Intuitive explanation for the insensitivity to distribution: a given number of points should cover a given proportion of the cumulative distribution, regardless of the shape of the distribution (provided that it is continuous).

6.3.3 Example (Table 2.2.1 from Gumbel): Note the symmetry in the table. The probability of x exceedances above the largest value is the same as the probability of x exceedances below the lowest, etc.
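A simulation sketch of this insensitivity (the function exceed.prob is hypothetical): the chance that the past maximum is beaten at least once in n2 future trials is n2/(n1+n2) for any continuous distribution, because only ranks matter.

```r
set.seed(5)
# Proportion of trials in which the max of n2 future draws exceeds the max
# of n1 past draws, for an arbitrary continuous generator rdist
exceed.prob <- function(rdist, n1, n2, ntrials = 20000) {
  mean(replicate(ntrials, max(rdist(n2)) > max(rdist(n1))))
}

n1 <- 8; n2 <- 4
n2 / (n1 + n2)             # theoretical value: 1/3
exceed.prob(rnorm, n1, n2) # normal: close to 1/3
exceed.prob(rexp,  n1, n2) # exponential: also close to 1/3
```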
6.3.4 Application to crinoid evolution (Foote 1994)
6.4 Relationship to theory of records

6.4.1 Let there be n1 past trials and n2 future trials. What is the probability that the record set (m = 1) by the first set of trials will still stand after the second set (i.e., x = 0)? This is w(0). In general,

    w(x) = m C(n1,m) C(n2,x) / [(n1+n2) C(n1+n2-1, x+m-1)],

which, for n1 = n2, m = 1, and x = 0, gives

    w(0) = C(n1,1) C(n1,0) / [(2n1) C(2n1-1, 0)] = n1/(2n1),

which is equal to 1/2.

6.4.2 What is the expected number of exceedances above the past record (m = 1, n2 = n1)?

    E(x) = m n2/(n1 + 1) = n1/(n1 + 1) ≈ 1 for large n1

6.4.3 Thus, for athletic contests, if all trials reflect the same underlying pool of talent, equipment, etc., the waiting time between successive records should progressively double.

6.4.4 Likewise for discoveries of the largest dinosaur, oldest primate, etc. Deviations suggest a change in the rules or nonrandom searching.
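The formulas in 6.4.1-6.4.2 are easy to verify numerically (w below is simply the expression above written as an R function):

```r
# Exceedance distribution: probability that exactly x of n2 future trials
# exceed the m-th largest of n1 past trials
w <- function(x, m, n1, n2) {
  m * choose(n1, m) * choose(n2, x) /
    ((n1 + n2) * choose(n1 + n2 - 1, x + m - 1))
}

n1 <- 10; n2 <- 10; m <- 1
w(0, m, n1, n2)             # past record stands with probability exactly 1/2

probs <- w(0:n2, m, n1, n2)
sum(probs)                  # the probabilities sum to 1
sum((0:n2) * probs)         # E(x) = m*n2/(n1+1) = 10/11
```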