
Poisson Random Variables (Rees: 6.8-6.14)

Examples: What is the distribution of:
- the number of organisms in the squares of a haemocytometer?
- the number of hits on a web site in one hour?
- the number of goals scored by a football team in a match?
- the number of cracks in a rail track?
- the number of sultanas in a slice of fruit cake?

The Poisson distribution often provides a good model for the number of events occurring in time or space. Space can be linear, area or volume.

In order to decide whether to use the Binomial or the Poisson distribution, consider whether there is a sample size involved (i.e. an upper limit on the number of events). If so, use the Binomial; if not, use the Poisson.

Consider, for example, the number of calls arriving at a telephone exchange. We assume that:
- events which occur in time intervals that do not overlap are independent;
- the underlying rate $\lambda$ ("lambda") at which calls arrive is constant.

Under these conditions, if Y is the random variable denoting the number of calls actually made in one hour, then Y has a Poisson distribution with parameter $\lambda$. We write $Y \sim P(\lambda)$. The probability distribution of Y is

$$\Pr(Y = r) = \frac{e^{-\lambda} \lambda^{r}}{r!}, \qquad r = 0, 1, 2, \ldots$$

where e is the well-known mathematical constant (e = 2.7183...). Most calculators have an $e^x$ button; this is called the exponential function.

The formula for Poisson distribution probabilities simplifies when r = 0 ("no calls"), because $\lambda^0 = 1$ and $0! = 1$. So $\Pr(Y = 0) = e^{-\lambda}$.

Example: The mean number of bacteria in a single cell of a haemocytometer is 2. Let N be the number of bacteria actually observed in a cell. Find:
(i) $\Pr(N = 0)$
(ii) $\Pr(N = 3)$
(iii) $\Pr(N > 2)$

We assume that N has a Poisson distribution with rate parameter 2. Then:

$$\Pr(N = n) = \frac{e^{-\lambda} \lambda^{n}}{n!} = \frac{e^{-2}\, 2^{n}}{n!}$$

(i) $\Pr(N = 0) = \frac{e^{-2}\, 2^{0}}{0!} = e^{-2} = 0.1353$ (the formula simplifies for N = 0).

(ii) $\Pr(N = 3) = \frac{e^{-2}\, 2^{3}}{3!} = e^{-2} \times \frac{8}{6}$. Check that this gives $0.1353 \times \frac{8}{6} = 0.1804$.

(iii) We can calculate $\Pr(N > 2)$ as

$$\Pr(N > 2) = 1 - \Pr(N \le 2) = 1 - \big(p(0) + p(1) + p(2)\big) = 1 - \left(e^{-2} + \frac{e^{-2}\, 2^{1}}{1!} + \frac{e^{-2}\, 2^{2}}{2!}\right)$$
$$= 1 - (0.1353 + 0.2707 + 0.2707) = 1 - 0.6767 = 0.3233$$

We can also use the NCST tables to calculate Poisson probabilities. Table 2 of Lindley and Scott (pp 24-32) gives $\Pr(Y \le r)$ for $\lambda$ from 0 up to 20. Using p25 gives $\Pr(N \le 2) = 0.6767$ immediately. Rees gives a short version of Table 2 in Table C.2 (4th ed.). Note that Rees uses m instead of $\lambda$, while Lindley & Scott use $\mu$.

Example: Telephone calls arrive in an office at a rate of 5 calls per hour. Find the probability that there are:
(i) exactly 2 calls in one hour;
(ii) exactly 1 call in 15 minutes;
(iii) 10 or fewer calls in 2 hours.
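The bacteria example can also be checked numerically. The following is a minimal Python sketch, not part of the original notes; the helper name `poisson_pmf` is hypothetical and simply evaluates the Poisson formula given above.

```python
import math

def poisson_pmf(r: int, lam: float) -> float:
    """Pr(Y = r) for Y ~ P(lam), using e^(-lam) * lam^r / r!."""
    return math.exp(-lam) * lam**r / math.factorial(r)

lam = 2.0  # mean number of bacteria per cell, from the example

p0 = poisson_pmf(0, lam)                                  # (i)  Pr(N = 0)
p3 = poisson_pmf(3, lam)                                  # (ii) Pr(N = 3)
p_gt2 = 1.0 - sum(poisson_pmf(r, lam) for r in range(3))  # (iii) 1 - Pr(N <= 2)

print(round(p0, 4), round(p3, 4), round(p_gt2, 4))
```

Rounded to four decimal places, the three values should agree with the hand calculation (0.1353, 0.1804 and 0.3233).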

We use Table 2 (you should check the first two answers by using the formula!).

(i) $X \sim P(5)$: $\Pr(X = 2) = \Pr(X \le 2) - \Pr(X \le 1) = 0.1247 - 0.0404 = 0.0843$

(ii) $X \sim P(5/4)$: $\Pr(X = 1) = \Pr(X \le 1) - \Pr(X = 0) = 0.6446 - 0.2865 = 0.3581$

(iii) $X \sim P(10)$: $\Pr(X \le 10) = 0.5830$

Note how the value of $\lambda$ adjusts to take account of the time interval; for example, if the number of calls in one hour is $P(5)$ then the number of calls in two hours is $P(10)$.
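As a rough check on the table look-ups, the sketch below (an illustration, not part of the original notes) scales the hourly rate by the length of the interval and sums the Poisson formula directly. The function names `poisson_cdf` and `prob_exactly` are hypothetical. Because table values are rounded before they are subtracted, the direct answer for (i) may differ from the table-based answer in the last decimal place.

```python
import math

def poisson_cdf(r: int, lam: float) -> float:
    """Pr(X <= r) for X ~ P(lam), summed term by term."""
    return sum(math.exp(-lam) * lam**k / math.factorial(k) for k in range(r + 1))

RATE_PER_HOUR = 5.0  # calls per hour, from the example

def prob_exactly(r: int, hours: float) -> float:
    """Pr(exactly r calls in an interval of the given length)."""
    lam = RATE_PER_HOUR * hours  # the rate parameter scales with the interval
    lower = poisson_cdf(r - 1, lam) if r > 0 else 0.0
    return poisson_cdf(r, lam) - lower

p_i = prob_exactly(2, 1.0)                        # exactly 2 calls in one hour
p_ii = prob_exactly(1, 0.25)                      # exactly 1 call in 15 minutes
p_iii = poisson_cdf(10, RATE_PER_HOUR * 2.0)      # 10 or fewer calls in 2 hours

print(p_i, p_ii, p_iii)
```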

Some properties of the Poisson distribution

- The Poisson distribution is always skewed, but the distribution becomes more nearly symmetrical as the rate parameter $\lambda$ increases.
- The expected value of a Poisson distribution is the rate parameter $\lambda$.
- The variance of a Poisson distribution is also the rate parameter $\lambda$.
- Thus the standard deviation is $\sqrt{\lambda}$.

Note that in relative terms, the spread gets less as the rate increases.

Let X be the number of accidents per quarter at an accident black spot. Suppose that $X \sim P(4)$. Then the probability distribution of X looks like:

[Figure: probability distribution of $X \sim P(4)$, plotted for x = 0 to 14.]

Note that any value between 1 and 8 is likely to occur, and more extreme values are possible!

The theoretical mean and standard deviation of a Poisson random variable are (Rees 6.12):

Mean = $\lambda$, S.D. = $\sqrt{\lambda}$.

For $X \sim P(4)$, Mean = $\lambda$ = 4 and S.D. = $\sqrt{\lambda}$ = 2.

RECAP: Binomial and Poisson

These two distributions are both used as models for counts. The Binomial is used when there is a clear upper limit n on the number of events recorded; there is no theoretical upper limit for the Poisson.

Binomial: There are n independent trials, each with probability of success p.

$$\Pr(X = r) = \binom{n}{r} p^{r} (1 - p)^{n - r}$$

The mean is $np$ and the s.d. is $\sqrt{np(1 - p)}$.

Poisson: Events occur independently at rate $\lambda$.

$$\Pr(Y = r) = \frac{e^{-\lambda} \lambda^{r}}{r!}$$

The mean is $\lambda$ and the s.d. is $\sqrt{\lambda}$.
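The mean and standard deviation quoted for $P(4)$ can be checked by summing over the probability distribution. This is only a sketch: the sum is truncated at 80 terms, an assumption that is harmless here because the omitted tail probabilities are negligible for $\lambda = 4$.

```python
import math

def poisson_pmf(r: int, lam: float) -> float:
    """Pr(Y = r) for Y ~ P(lam)."""
    return math.exp(-lam) * lam**r / math.factorial(r)

lam = 4.0             # accidents per quarter, X ~ P(4)
support = range(80)   # truncated support (assumption: the tail beyond 79 is negligible)

mean = sum(r * poisson_pmf(r, lam) for r in support)
variance = sum((r - mean) ** 2 * poisson_pmf(r, lam) for r in support)
sd = math.sqrt(variance)

# Probability of a value between 1 and 8 inclusive
p_1_to_8 = sum(poisson_pmf(r, lam) for r in range(1, 9))

print(mean, variance, sd, p_1_to_8)
```

The mean and variance should both come out at 4 (so the s.d. is 2), and the probability of a value between 1 and 8 is about 0.96, which is why values in that range are described as likely.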

Normal Random Variables (Rees: 7.2-7.5)

Example: Consider relative frequency histograms of heights of classes of students. The sizes of the classes are (a) 50 (b) 250 (c) 1000 (d) 5000. As the class size increases, we can reduce the width of a bar in the histogram. We can imagine that with an infinite class size the histogram could be made smooth. The resulting smooth curve is often very similar to a particular form: the Normal distribution.

Here are some other examples of variables that could have Normal distributions:
- the weight of a bag of cement;
- the time taken to walk to the pub;
- the volume of beer in your glass.

We use the Normal distribution as a model for many naturally occurring variables. For example, we might use it to calculate the proportion of weights or lengths that fall between two limits.

Notation: A Normal distribution is determined by its mean, $\mu$, and its standard deviation, $\sigma$. If X has a Normal distribution with mean $\mu$ and standard deviation $\sigma$ we write $X \sim N(\mu, \sigma^2)$.

Beware!! The second parameter is the variance and not the standard deviation. Thus, $N(5, 9)$ refers to a Normal distribution with mean 5 and standard deviation 3.

First, a special Normal distribution.

Definition: We say that Z has the standard Normal distribution if $Z \sim N(0, 1)$. The mean of Z is 0 and its standard deviation is 1.

Here is what the standard Normal distribution looks like:

[Figure: density of the standard Normal distribution, plotted from -3 to 3.]

Examples of other Normal distributions

N(5, 1): mean 5 and standard deviation 1

N(10, 1): mean 10 and standard deviation 1

N(8, 4): mean 8 and standard deviation 2

[Figures: densities of these Normal distributions, each plotted on an axis from 0 to 15.]

Probability Density Functions

These plots of Normal distributions are examples of probability density functions; the name can be abbreviated to density. These are similar to the probability function for a discrete random variable, but there are some important differences. Instead of lumps of probability at certain values, the probability of getting exactly any particular value is zero! Instead we have to consider the probability of being between two values. The density functions are scaled so that the area under the curve between the values equals the probability. This implies that the total area under the curve equals one.

For some continuous random variables, there is an explicit formula for the probabilities. For others we have to use statistical tables or a computer package.
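To illustrate the idea that probability is area under the density, the sketch below integrates the Normal density numerically with the trapezium rule. It is an illustration only: the density formula $\frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^2/(2\sigma^2)}$ is the standard one and is not written out in these notes, and the function names, step count and the integration range of -10 to 10 are arbitrary choices.

```python
import math

def normal_pdf(x: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def area(a: float, b: float, mu: float = 0.0, sigma: float = 1.0,
         steps: int = 10_000) -> float:
    """Approximate area under the density between a and b (trapezium rule)."""
    h = (b - a) / steps
    total = 0.5 * (normal_pdf(a, mu, sigma) + normal_pdf(b, mu, sigma))
    for i in range(1, steps):
        total += normal_pdf(a + i * h, mu, sigma)
    return total * h

total_area = area(-10.0, 10.0)   # effectively the whole curve
within_one_sd = area(-1.0, 1.0)  # Pr(-1 < Z < 1)
single_point = area(1.0, 1.0)    # an interval of zero width has zero area

print(total_area, within_one_sd, single_point)
```

The total area is 1 to within rounding, and a single point has zero area, matching the statement that the probability of getting exactly any particular value is zero.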

There is no closed-form formula that allows the direct calculation of probabilities for a standard Normal distribution. We have to use a computer or tables such as Table 4 of Lindley & Scott.

NCST tables give $\Pr(Z \le z)$ for values of $z \ge 0$. The symbol $\Phi(z)$ is usually used for this probability. This is called the Cumulative Probability Function of the standard Normal distribution. It is the area under the Probability Density Function of the standard Normal distribution to the left of z.

Example: If $Z \sim N(0, 1)$, use NCST Table 4 to find:

(i) $\Pr(Z \le 1)$
Tabulated value: $\Pr(Z \le 1) = 0.8413$

(ii) $\Pr(Z > 2)$
Tabulated value: $\Pr(Z \le 2) = 0.9772$
So: $\Pr(Z > 2) = 1 - 0.9772 = 0.0228$

(iii) $\Pr(Z < -1)$
By symmetry: $\Pr(Z < -1) = \Pr(Z > +1) = 1 - \Pr(Z \le 1) = 1 - 0.8413 = 0.1587$

Note how we use symmetry to help us here and how a quick picture keeps us on the right track.

When we have a general Normal $N(\mu, \sigma^2)$, we can still use Table 4, provided that everything is put on a standard scale. This works because of the following properties of the Normal distribution.

If X is Normal with mean $\mu$ and s.d. $\sigma$, and if a and b are constants (with b > 0), then:
- $X - a$ is Normal with mean $\mu - a$ and s.d. $\sigma$;
- $X / b$ is Normal with mean $\mu / b$ and s.d. $\sigma / b$.

So:

$$Z = \frac{X - \mu}{\sigma}$$

is Normal with mean 0 and s.d. 1.

Definition: If x is a value from a distribution with a mean of $\mu$ and a standard deviation of $\sigma$, then the standard score (or z-score) of x is

$$z = \frac{x - \mu}{\sigma}$$

If x comes from a Normal distribution, z comes from a Standard Normal Distribution $N(0, 1)$.

In the previous example, the questions referred to values from the standard Normal distribution. How do we find probabilities for $N(\mu, \sigma^2)$? (Rees 7.3 gives some examples.) In each case, the original question is converted to a question about the standard Normal distribution. The tables are then used as before. We may also need to use linear interpolation if the required value is between tabulated values.

Example: The random variable $X \sim N(5, 9)$. Find:
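On a computer, $\Phi(z)$ is usually obtained from the error function. The sketch below assumes the standard identity $\Phi(z) = \frac{1}{2}\left(1 + \operatorname{erf}(z/\sqrt{2})\right)$, which is not stated in these notes, and uses it to reproduce the three table look-ups for the standard Normal distribution; the function name `phi` is hypothetical.

```python
import math

def phi(z: float) -> float:
    """Cumulative probability Pr(Z <= z) for Z ~ N(0, 1), via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

p_le_1 = phi(1.0)          # (i)   Pr(Z <= 1)
p_gt_2 = 1.0 - phi(2.0)    # (ii)  Pr(Z > 2)
p_lt_minus1 = phi(-1.0)    # (iii) Pr(Z < -1), no symmetry trick needed here

print(round(p_le_1, 4), round(p_gt_2, 4), round(p_lt_minus1, 4))
```

Unlike the tables, the function accepts negative z directly, so the symmetry argument is needed only when working by hand.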

(i) $\Pr(X < 8)$
(ii) $\Pr(X < 3)$
(iii) $\Pr(2 \le X \le 11)$

First, recall that the second parameter (9) is the variance, so the standard deviation is $\sqrt{9} = 3$.

(i) $\Pr(X < 8) = \Pr\left(Z < \frac{8 - 5}{3}\right) = \Pr(Z < 1) = 0.8413$

(ii) $\Pr(X < 3) = \Pr\left(Z < \frac{3 - 5}{3}\right) = \Pr(Z < -0.6667) = 1 - \Pr(Z < 0.6667)$
$= 1 - \left(0.7454 + \tfrac{2}{3}(0.7486 - 0.7454)\right) = 1 - 0.7475 = 0.2525$

Note the use of linear interpolation here to get a more accurate answer.

(iii) $\Pr(2 \le X \le 11) = \Pr\left(\frac{2 - 5}{3} \le Z \le \frac{11 - 5}{3}\right) = \Pr(-1 \le Z \le 2)$
$= \Pr(Z \le 2) - (1 - \Pr(Z \le 1)) = 0.9772 - (1 - 0.8413) = 0.8185$

In each case, the numerical values in the original question are converted to z-scores before the NCST tables are used.

Example: IQ tests are constructed so that the mean is 100 and the standard deviation is 15. What percentage of the population will get a score of more than 120?

$$\Pr(IQ > 120) = \Pr\left(Z > \frac{120 - 100}{15}\right) = \Pr(Z > 1.3333) = 1 - \Pr(Z < 1.3333) = 1 - 0.9088 = 0.0912 = 9.12\%$$

Sometimes we need to reverse the above process. NCST Table 5 allows us to do this.
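The standardisation steps above can be checked with Python's standard library. This is a sketch rather than part of the notes; it uses `statistics.NormalDist` (available from Python 3.8), which takes the standard deviation, not the variance, as its second parameter.

```python
from statistics import NormalDist

# X ~ N(5, 9): the variance is 9, so sigma = 3
X = NormalDist(mu=5, sigma=3)

p_i = X.cdf(8)                # Pr(X < 8)
p_ii = X.cdf(3)               # Pr(X < 3)
p_iii = X.cdf(11) - X.cdf(2)  # Pr(2 <= X <= 11)

# IQ scores: mean 100, standard deviation 15
IQ = NormalDist(mu=100, sigma=15)
p_iq = 1 - IQ.cdf(120)        # Pr(IQ > 120)

print(p_i, p_ii, p_iii, p_iq)
```

The results should match the table-based answers to about four decimal places; small differences in the last digit reflect rounding in the tables and the linear interpolation.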

Example: What is the IQ score such that only 1% of the population do better?

From NCST Table 5, we have: $\Pr(Z > 2.3263) = 1\%$.

The standardisation is reversed by multiplying by the standard deviation and then adding in the mean. So the required IQ score is:

$$IQ = 2.3263 \times 15 + 100 = 134.9$$

Suppose the Normal distribution is used to model a continuous variable and that we wish to find the probability of getting a particular value.

Example: In a certain population, male heights are Normally distributed with mean 170 cm and standard deviation 5 cm. If heights are recorded to the nearest cm, then the probability that an individual is 170 cm should be taken as being the probability of being in the interval (169.5, 170.5). The corresponding z-score interval is (-0.1, 0.1).

From tables, $\Pr(Z < 0.1) = 0.5398$. So $\Pr(-0.1 < Z < 0.1) = 2(0.5398 - 0.5) = 0.0796$.

The idea that was used in this example is called a Continuity Correction.

Example: Suppose that female student heights have a Normal distribution with mean $\mu = 163$ cm and standard deviation $\sigma = 6$ cm. Let X be the height of a randomly chosen student. Find the probabilities that:
(i) X is greater than 170 cm;
(ii) X is more than 164 cm and less than 171 cm;
(iii) X is 149 cm or less.

(i) $\Pr(X > 170) = \Pr\left(Z > \frac{170.5 - 163}{6}\right) = \Pr(Z > 1.25) = 1 - \Pr(Z < 1.25) = 1 - 0.8944 = 0.1056$

Note the use of the continuity correction, which assumes that heights are taken to the nearest cm. Heights of 171 or more are included and 170 or less are excluded, so the division is taken to be at 170.5 cm.

(ii) $\Pr(164 < X < 171) = \Pr\left(\frac{164.5 - 163}{6} < Z < \frac{170.5 - 163}{6}\right) = \Pr(0.25 < Z < 1.25) = 0.8944 - 0.5987 = 0.2957$

(iii) $\Pr(X \le 149) = \Pr\left(Z < \frac{149.5 - 163}{6}\right) = \Pr(Z < -2.25) = 1 - \Pr(Z < 2.25) = 1 - 0.9878 = 0.0122$

Example: What is the height such that 10% of female students are taller?

For the standard Normal distribution, NCST Table 5 tells us that 10% of the population are bigger than 1.2816. So the required height is:

$$1.2816 \times 6 + 163 = 170.69 \text{ cm}$$

Models using Discrete and Continuous Distributions

The previous sections have introduced the Binomial, Poisson and Normal distributions. These can all be derived using probability theory from sets of assumptions; we did this for the Binomial distribution. These are all useful as models for real data and allow us to make predictions.

If the process that generated the data matches the assumptions of a distribution, then we can be confident in the predictions. The predictions will also be good if the assumptions are close to reality.

Example: Consider a binary variable. Random sampling with replacement from any population implies that the data will follow a Binomial distribution. Random sampling without replacement from a large population implies that the Binomial distribution will be a good approximation.

A distribution that is chosen empirically can also make useful predictions, for example the proportion of student heights within a range.
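Returning to the Normal calculations earlier in this section, the reverse look-ups (Table 5) and the continuity-corrected probabilities for the female heights example can be sketched with `statistics.NormalDist` from the Python standard library. The variable names are illustrative, and the half-centimetre boundaries assume, as in the notes, that heights are recorded to the nearest cm.

```python
from statistics import NormalDist

Z = NormalDist()  # standard Normal, N(0, 1)

# Reverse look-ups: multiply the z-value by the s.d., then add the mean
iq_cut = 100 + 15 * Z.inv_cdf(0.99)     # IQ exceeded by only 1% of the population
height_cut = 163 + 6 * Z.inv_cdf(0.90)  # height exceeded by 10% of female students

# Female student heights: mean 163 cm, s.d. 6 cm, recorded to the nearest cm
heights = NormalDist(mu=163, sigma=6)
p_gt_170 = 1 - heights.cdf(170.5)                     # (i)   X > 170
p_164_171 = heights.cdf(170.5) - heights.cdf(164.5)   # (ii)  164 < X < 171
p_le_149 = heights.cdf(149.5)                         # (iii) X <= 149

print(iq_cut, height_cut, p_gt_170, p_164_171, p_le_149)
```

Here `inv_cdf(p)` returns the value below which a proportion p of the distribution lies, which is why 0.99 and 0.90 are used for the upper 1% and 10% points.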

Regression

We are often interested in the relationship between two or more variables. This can arise from surveys in which several variables are measured on each unit, or from experiments in which some variables are modified and other variables observed.

Note: Data sets arising from these two types of situation cannot be distinguished in general, but the interpretation is different.

Example: A study for an environmental impact assessment measured the flow rate against the depth at a site on a stream.

Depth (m)    Flow Rate (m/s)
0.30         2.3
0.35         2.4
0.40         2.0
0.45         3.5
0.50         5.7
0.55         6.1
0.60         4.2
0.65         4.7
0.70         6.9
0.75         7.4
0.80         5.7

Note that the choice of which depths to use was made in advance; that is why they are spaced at fixed intervals.

[Figure: scatter plot of Flow Rate (vertical axis, 0.0 to 7.5) against Depth (horizontal axis, 0.00 to 0.75).]

The plot suggests a straight line relationship.

It is often useful to be able to predict the values of one variable from another variable. To do this, we need to formulate a model. The model should allow for the variability that is present. A possible model is:

$$\text{Flow} = \alpha + \beta \times \text{Depth}$$

with variability about the line being independent samples from a Normal distribution with unknown variance $\sigma^2$. More generally:

$$y_i = \alpha + \beta x_i + \epsilon_i$$

where $\epsilon_i \sim N(0, \sigma^2)$ and the $\epsilon_i$ are independent. ($\epsilon$ is the Greek letter epsilon.) This is called a linear regression model.

Notes:
- The model includes two parts: the functional part and the part which models the variability about the function.
- Many of the laws of physics and chemistry started out as empirical observations of this sort. The observed variability was often just measurement error.
- In other sciences, such as biology, there is often variability inherent in the material which is much greater than any errors.

Fitting the Model

There is a general method for fitting models of this type. It is called the Method of Least Squares. This method is optimal for predicting the y variable from the x variable(s) if the model for the variability is as specified above.

To fit the model, we minimise squared deviations in the y direction, i.e. find $\hat\alpha$, $\hat\beta$ to minimise:

$$\sum_i (y_i - \alpha - \beta x_i)^2$$

Notes:
- At school, this model is often given as y = mx + c and the line is fitted by eye.
- Predicting x from y gives different answers.
- If the x values were chosen, different methods are needed if we wish to predict x from y. This is needed in assay systems.

We calculate:

$$S_{xx} = \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}$$
$$S_{xy} = \sum x_i y_i - \frac{\left(\sum x_i\right)\left(\sum y_i\right)}{n}$$
$$S_{yy} = \sum y_i^2 - \frac{\left(\sum y_i\right)^2}{n}$$

Note: $S_{yy} = (n - 1) \times \text{Variance}(y) = \sum (y_i - \bar{y})^2$

Then:

$$\hat\beta = \frac{S_{xy}}{S_{xx}}$$
$$\hat\alpha = \frac{1}{n}\left(\sum y_i - \hat\beta \sum x_i\right) = \bar{y} - \hat\beta \bar{x}$$

Thus the line goes through the point $(\bar{x}, \bar{y})$.

The fitted line $y = \hat\alpha + \hat\beta x$ is said to be the regression of Y on X.

Note: The recommended Casio calculators provide short cut ways of carrying out the calculations.
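The note that predicting x from y gives different answers can be seen in a small numerical sketch. The data below are made up purely for illustration (they are not from the notes), and `least_squares` is a hypothetical helper that applies the $S_{xx}$ and $S_{xy}$ formulas above.

```python
def least_squares(x, y):
    """Return (alpha_hat, beta_hat) for the regression of y on x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    s_xx = sum(v * v for v in x) - sum_x**2 / n
    s_xy = sum(a * b for a, b in zip(x, y)) - sum_x * sum_y / n
    beta = s_xy / s_xx
    alpha = (sum_y - beta * sum_x) / n
    return alpha, beta

# Hypothetical data, chosen only to make the arithmetic easy to follow
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]

alpha_yx, beta_yx = least_squares(x, y)  # regression of Y on X
alpha_xy, beta_xy = least_squares(y, x)  # regression of X on Y

# Slope of the X-on-Y line when drawn on the same (x, y) axes
slope_from_x_on_y = 1.0 / beta_xy

print(beta_yx, slope_from_x_on_y)
```

For these data the Y-on-X line has slope 0.8, while the X-on-Y line, redrawn on the same axes, has slope 1.25. Both lines pass through $(\bar{x}, \bar{y})$, but they coincide only when the points lie exactly on a straight line.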

Regression Summary (so far)

- Data were collected on a variable (y) at given values of another variable (x).
- A scatter plot of the two variables suggested a straight line relationship.
- Variability about the straight line appeared to be roughly constant.
- Least squares was used to estimate $\alpha$ and $\beta$, the model parameters.
- In the example, there was only one y value for each x value, but the method is also applicable when there are many y values (flow rate at different points with the same depth).
- If the model assumptions are correct, the fitted line $y = \hat\alpha + \hat\beta x$ gives the best estimate of y for any given value of x. This value is called the fitted value at x.
- The model can also make predictions of y for values of x where no measurements were taken.

Example: Flow rate data.

n = 11
$\sum x_i = 6.05$, $\bar{x} = 0.55$
$\sum y_i = 50.9$, $\bar{y} = 4.627$
$\sum x_i^2 = 3.6025$, $S_{xx} = 0.275$
$\sum x_i y_i = 30.625$, $S_{xy} = 2.63$
$\sum y_i^2 = 271.59$, $S_{yy} = 36.0618$

$$\hat\beta = \frac{S_{xy}}{S_{xx}} = \frac{2.63}{0.275} = 9.5636$$
$$\hat\alpha = \bar{y} - \hat\beta \bar{x} = -0.6327$$

The fitted model can be used to predict the flow rate for any specified depth, i.e. $\hat{y} = \hat\alpha + \hat\beta x$. However, the negative value of $\hat\alpha$ (a negative predicted flow rate at zero depth) suggests that it could be dangerous to extrapolate from this model!

Digression: The word regression literally means stepping back. The term originated when the results from units that were measured on two occasions were compared. The best (or worst) on the first occasion was rarely the best (or worst) on the second occasion. Comparing the second set of results showed that, on average, units in the first set had regressed towards the mean. Although most uses of regression are not like this, the name has stuck.
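The flow rate calculation above can be reproduced from the raw data. The sketch below applies the $S_{xx}$, $S_{xy}$ and $S_{yy}$ formulas to the eleven depth and flow rate pairs given in the example; the variable names and the choice of 0.50 m as a prediction depth are illustrative.

```python
# Depth (m) and flow rate (m/s) from the stream example
depth = [0.30, 0.35, 0.40, 0.45, 0.50, 0.55, 0.60, 0.65, 0.70, 0.75, 0.80]
flow = [2.3, 2.4, 2.0, 3.5, 5.7, 6.1, 4.2, 4.7, 6.9, 7.4, 5.7]

n = len(depth)
sum_x, sum_y = sum(depth), sum(flow)

# Corrected sums of squares and products
s_xx = sum(x * x for x in depth) - sum_x**2 / n
s_xy = sum(x * y for x, y in zip(depth, flow)) - sum_x * sum_y / n
s_yy = sum(y * y for y in flow) - sum_y**2 / n

beta_hat = s_xy / s_xx
alpha_hat = sum_y / n - beta_hat * sum_x / n  # y-bar minus beta-hat times x-bar

def predict(x: float) -> float:
    """Fitted value of flow rate at depth x."""
    return alpha_hat + beta_hat * x

print(round(s_xx, 4), round(s_xy, 4), round(s_yy, 4))
print(round(beta_hat, 4), round(alpha_hat, 4))
print(round(predict(0.50), 3), round(predict(0.0), 3))
```

The prediction at depth 0 is negative, which is the numerical form of the warning about extrapolating beyond the range of depths that were measured.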