GEOS 33000/EVOL 33000
10 January 2006, modified January 12, 2006

III. Sampling

0 Some R commands for functions we've covered so far

0.1 rbinom(m,n,p) returns m integers drawn from the binomial distribution with n trials and probability of success p. Each integer returned would be k in our terminology.

0.2 dbinom(k,n,p) returns the probability of exactly k successes in n trials, each with probability p.

0.3 pbinom(j,n,p) returns the cumulative probability of j or fewer successes in n trials, each with probability p.

0.4 rpois(m,a) returns m integers drawn from the Poisson distribution with parameter a.

0.5 dpois(k,a) returns the probability of exactly k events under the Poisson distribution with parameter a.

0.6 ppois(j,a) returns the cumulative probability of j or fewer events under the Poisson distribution with parameter a.

0.7 rmultinom(m,n,p) returns m vectors of integers drawn from the multinomial distribution with n trials and vector of probabilities p.

0.8 dmultinom(k,n,p) returns the probability of sampling exactly k (where k is a vector of integers) in n trials with vector of probabilities p.
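A quick way to see how the r-, d-, and p- forms of these discrete distributions fit together is to check them against each other at the console. This snippet is an added illustration, not part of the original notes:

```r
# d- gives the point probability, p- the cumulative probability
dbinom(3, 10, 0.5)              # same as choose(10,3) * 0.5^10
pbinom(3, 10, 0.5)              # same as sum(dbinom(0:3, 10, 0.5))
rbinom(5, 10, 0.5)              # five random values of k

# Poisson analogues, with parameter a = 2
dpois(0, 2)                     # same as exp(-2)
ppois(1, 2)                     # same as dpois(0,2) + dpois(1,2)
```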

0.9 rexp(n,a) returns n numbers drawn from the exponential distribution with parameter (rate) a.

0.10 dexp(x,a) returns the density of the exponential distribution with parameter a at X = x.

0.11 pexp(x,a) returns the cumulative probability of the exponential distribution with parameter a at X = x.

0.12 rnorm(n) returns n numbers drawn from the standard normal distribution (zero mean and unit variance).

0.13 dnorm(x) returns the standard normal density at X = x.

0.14 pnorm(x) returns the standard normal distribution function (cumulative probability) at X = x.

0.15 runif(), dunif(), punif(): like rnorm(), dnorm(), pnorm(), but for the uniform distribution on (0,1).

0.16 choose(n,k) returns the binomial coefficient C(n,k) = n!/[k!(n-k)!].

0.17 factorial(j) returns j!.

0.18 lfactorial(j) returns ln(j!).

0.19 gamma(x) returns Γ(x).
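The continuous-distribution and combinatorial functions above obey the identities used throughout these notes; the following added sketch (not from the notes) verifies a few of them:

```r
choose(5, 2)                    # C(5,2) = 5!/(2! * 3!) = 10
factorial(4)                    # 4! = 24
lfactorial(4)                   # ln(24); avoids overflow for large j
gamma(5)                        # Gamma(x) = (x-1)! for integer x, so 24

# exponential distribution with rate a = 0.5
dexp(1, 0.5)                    # density: 0.5 * exp(-0.5)
pexp(1, 0.5)                    # cumulative probability: 1 - exp(-0.5)
```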

0.20 mean(x), median(x), var(x), sd(x) return the mean, median, variance, and standard deviation of the vector or array x.

0.21 cov(x,y) returns the covariance between vectors x and y.

1 Overview of Sampling, Error, Bias

2 Error Estimates With Assumed Sampling Distribution

2.1 Standard error: the standard deviation of the distribution of sample statistics that would result from an infinite number of trials of drawing a sample from the underlying probability distribution and calculating the sample statistic.

2.2 In practice we generally do not estimate error by repeated sampling from the underlying distribution (expensive and time-consuming), although there are exceptions.

2.3 Approximations based on the sample distribution (from Sokal and Rohlf):
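The definition in 2.1 can be illustrated by brute force: repeatedly draw samples from a known distribution, compute the statistic each time, and take the standard deviation of the results. The snippet below is an added sketch (not part of the notes), using the mean of standard normal samples, for which theory gives SE = 1/sqrt(n):

```r
set.seed(1)                       # for reproducibility
n     <- 25                       # sample size
nrep  <- 10000                    # number of repeated samples
means <- replicate(nrep, mean(rnorm(n)))
sd(means)                         # empirical standard error of the mean
1 / sqrt(n)                       # theoretical value: sigma/sqrt(n) = 0.2
```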


2.4 Limitations:

2.4.1 Many approximation formulae make assumptions about the shape of the distribution and the sample size.

2.4.2 We may be interested in a novel statistic, or in one whose sampling distribution is not well characterized.

3 Bootstrap Error Estimates

3.1 Estimate the standard error by resampling from the single sample we have.

3.2 This approach uses sampling with replacement from the observed sample to simulate drawing new samples from the underlying distribution.

3.3 Procedure

3.3.1 Start with the observed sample of size n and the observed sample statistic; call it Z.

3.3.2 Randomly pick a sample of size n, with replacement, from the observed sample.

3.3.3 Calculate the sample statistic of interest on this random sample; call it Z_boot.

3.3.4 Repeat many times (generally hundreds to thousands).

3.3.5 Calculate the standard deviation of the Z_boot values. This is an estimate of the standard error of the observed sample statistic Z: SD(Z_boot) ≈ SE(Z).

3.4 Simple (but not necessarily most useful) example: trimmed mean

Define the p% trimmed mean as the mean of the sample with the p% lowest and p% highest observations discarded. (The idea is to reduce the effect of outliers.) Suppose the data consist of 10 (ordered) observations: 1, 2, 3, 4, 8, 10, 12, 15, 20, 30. Let the trimmed mean be denoted Z. Then Z = (3 + 4 + 8 + 10 + 12 + 15)/6 = 8.67.

R code to estimate SE(Z):

#define function
trim.mean<-function(x,ntrim){
        xtmp<-sort(x)                            #sort the observations
        n<-length(x)                             #sample size (local, not global)
        return(mean(xtmp[(ntrim+1):(n-ntrim)]))
}
data<-c(1,2,3,4,8,10,12,15,20,30)                #specify data
n<-length(data)
ntrim<-2                                         #specify number to trim from each side
Zobs<-trim.mean(data,ntrim)                      #get observed value
nrep<-10000                                      #specify number of bootstrap replicates
Zboot<-rep(NA,nrep)                              #assign memory
for (i in 1:nrep)                                #get bootstrap replicates
        Zboot[i]<-trim.mean(sample(data,n,replace=TRUE),ntrim)
SE<-sd(Zboot)                                    #calculate bootstrap std. error
hist(Zboot,breaks=50)                            #plot histogram of results

This yields Z_obs = 8.67 and SE(Z) ≈ 3.1.

[Figure: histogram of Zboot; x-axis Zboot, about 5 to 25; y-axis Frequency, 0 to 600.]

3.5 Useful R function: sample(x,n,replace=TRUE) (or replace=FALSE) returns a random sample of size n from the vector x, with or without replacement.

3.6 To sample rows from an array X so that the variables (columns) stay together:

nr<-dim(X)[1]                              #get number of rows
i<-sample(1:nr,n,replace=TRUE)             #vector of n integers sampled on [1,nr]
XSAMP<-X[i,]                               #keep the sampled rows, all columns

4 Parametric bootstrap

4.1 Take the observed sample and estimate the relevant parameter from it.

4.2 Resample from the parametric distribution with its parameter set equal to the sample estimate (rather than resampling from the observed distribution).

4.3 This approach can also be applied to more complicated situations: for example, simulating a process with parameters estimated from data.

4.3.1 We'll do lots of this later...
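A minimal sketch of the parametric bootstrap in 4.1-4.2, using an exponential sample as a stand-in for real data (the data, rate, and replicate count here are all invented for illustration):

```r
set.seed(1)
x    <- rexp(50, rate = 2)        # stand-in for an observed sample
ahat <- 1 / mean(x)               # 4.1: estimate the rate from the sample

nrep  <- 5000
Zboot <- rep(NA, nrep)
for (i in 1:nrep)                 # 4.2: resample from the fitted
  Zboot[i] <- mean(rexp(length(x), rate = ahat))   # distribution itself
SE <- sd(Zboot)                   # parametric-bootstrap SE of the mean
```

For the exponential mean this can be checked against theory: the standard error of the mean is (1/ahat)/sqrt(n) = mean(x)/sqrt(50), and the bootstrap value lands close to it.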

5 Examples of Finite-sample Bias (sample-size bias)

5.1 Sample variance

5.1.1 Σ(x - x̄)²/n is biased: it is systematically too low, which makes sense since it is based on squared deviations from the sample mean rather than the true mean.

5.1.2 Σ(x - x̄)²/(n - 1) is unbiased.

5.2 Number of taxa

5.2.1 Rarefaction method (from Raup 1975)

The abundance of species i is N_i; N = Σ N_i.

Consider a particular species, i. C(N - N_i, n) is the number of ways of drawing a sample of n containing none of the individuals of species i. C(N, n) is the number of ways of drawing any n individuals. Therefore the ratio C(N - N_i, n)/C(N, n) is the probability of not drawing any individuals of species i, and 1 minus this ratio is the probability of drawing at least one individual of species i. The expected number of species is the sum of this probability, calculated for each species in turn:

E(S_n) = Σ_i [1 - C(N - N_i, n)/C(N, n)]

5.2.2 Caveats

- Rarefaction is for interpolation rather than extrapolation.
- Collecting curves vs. rarefaction curves.
- Apparent leveling off of curves does not imply that nearly everything has been found (only that you're unlikely to find it with modest effort).
- Curves are affected by factors other than sample size (sampling method, taxonomic treatment, size of geographic area, etc.).
- Crossing of rarefaction curves can make interpretation difficult.
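The expected-species calculation in 5.2.1 is easy to write directly in R. The function below is an added sketch (the name rarefy and the use of lchoose() for numerical stability are my choices, not from the notes):

```r
# Expected number of species when n individuals are drawn at random
# from a collection with vector of species abundances Ni (Raup 1975)
rarefy <- function(Ni, n) {
  N <- sum(Ni)
  # P(species i absent from the subsample) = C(N - Ni, n) / C(N, n)
  p.absent <- exp(lchoose(N - Ni, n) - lchoose(N, n))
  sum(1 - p.absent)
}

rarefy(c(5, 3, 2), 10)   # draw everything: all 3 species expected
rarefy(c(5, 3, 2), 1)    # one individual: exactly 1 species expected
```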


5.2.3 Examples of application of taxonomic rarefaction (Raup 1975; Raup and Schopf 1978)

This example suggests that the increase in observed family diversity in post-Paleozoic echinoids cannot be accounted for by an increase in the number of species sampled.

This example suggests that much of the variation in the number of observed echinoid orders is consistent with differences in the number of sampled species.

5.2.4 Interpretation of taxonomic rarefaction curves is not entirely straightforward. Sampling standardization will be treated in more detail later.

5.3 Range

5.3.1 Example: range of samples from the normal distribution


5.3.2 Example: test for nonrandomness of sampling with respect to morphology

5.3.3 Correction in the general case via rarefaction (random subsampling at a controlled sample size)

Caveat: the range at a standardized sample size may not convey any information that isn't conveyed by the sample variance.

6 Extreme value statistics

6.1 Introduction to problem

6.1.1 Our previous look at standard errors considered the sampling distribution of quantities such as the mean.

6.1.2 We may also be interested in the distribution of extremes: for example, how is the largest of n observations distributed, or the second smallest, etc.?

6.2 Probability of a number of observations exceeding some value, if the distribution is known

6.2.1 Pr(X > x) = 1 - F(x), where F(x) is the cumulative distribution function.

6.2.2 If there are N observations, then the probability that exactly k of them exceed some value x is given by a simple binomial:

C(N, k) [1 - F(x)]^k F(x)^(N-k)

6.2.3 Example: normal with N = 10, x = 0.67, and k = 3. F(0.67) ≈ 0.75, so the probability is C(10, 3) (0.25)^3 (0.75)^7 ≈ 0.25.

6.2.4 Future observations

Suppose we have n_1 past observations ranked from m = 1 (largest) to m = n_1 (smallest), and we take n_2 future observations. What is the probability that exactly k of the n_2 observations will exceed the m-th value from the first set of n_1 observations? Simply find the F(x) corresponding to the m-th value and plug it into the previous binomial equation. Clearly this works only if we know the distribution.
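The worked example in 6.2.3 takes one line to reproduce; this snippet is an added check, not part of the notes:

```r
# P(exactly 3 of 10 standard normal observations exceed x = 0.67)
p <- 1 - pnorm(0.67)       # upper-tail probability, roughly 0.25
dbinom(3, 10, p)           # binomial probability, roughly 0.25
```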

6.3 Probability of a number of observations exceeding some value, even if the distribution is not known

6.3.1 General expressions:

6.3.2 Intuitive explanation for insensitivity to distribution: a given number of points should cover a given proportion of the cumulative distribution, regardless of the shape of the distribution (provided that it is continuous).

6.3.3 Example (table 2.2.1 from Gumbel): note the symmetry in the table. The probability of x exceedances above the largest value is the same as the probability of x exceedances below the lowest, etc.

6.3.4 Application to crinoid evolution (Foote 1994)


6.4 Relationship to the theory of records

6.4.1 Let there be n_1 past trials and n_2 future trials. What is the probability that the record (m = 1) set by the first set of trials will stand through the second set (i.e., x = 0 exceedances)? This is w(0). In general,

w(x) = m C(n_1, m) C(n_2, x) / [(n_1 + n_2) C(n_1 + n_2 - 1, x + m - 1)],

which, for n_1 = n_2, m = 1, and x = 0, gives

w(0) = C(n_1, 1) C(n_1, 0) / [(2 n_1) C(2 n_1 - 1, 0)] = n_1 / (2 n_1) = 1/2.

6.4.2 What is the expected number of exceedances above the past record?

E(x) = m n_2 / (n_1 + 1), which for m = 1 and n_2 = n_1 gives n_1 / (n_1 + 1) ≈ 1 for large n_1.

6.4.3 Thus, for athletic contests, if all trials reflect the same underlying pool of talent, equipment, etc., the waiting time between successive records should progressively double: to expect one exceedance of the current record, the number of future trials must roughly equal the number of past trials.

6.4.4 Likewise for discoveries of the largest dinosaur, the oldest primate, etc. Deviations suggest a change in the rules or nonrandom searching.
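The exceedance expression and its special cases can be checked numerically. The function below is an added sketch (the name w is mine) implementing the distribution of exceedances used in 6.4:

```r
# Probability that exactly x of n2 future trials exceed the m-th
# largest of n1 past trials, for any continuous distribution (Gumbel)
w <- function(x, m, n1, n2)
  m * choose(n1, m) * choose(n2, x) /
    ((n1 + n2) * choose(n1 + n2 - 1, x + m - 1))

w(0, 1, 50, 50)                  # the record stands: exactly 1/2
sum(w(0:20, 1, 30, 20))          # probabilities sum to 1
sum(0:20 * w(0:20, 1, 30, 20))   # E(x) = m*n2/(n1+1) = 20/31
```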