
2.1.5 Gaussian distribution as a limit of the Poisson distribution

A limiting form of the Poisson distribution (and many others; see the Central Limit Theorem below) is the Gaussian distribution. In deriving the Poisson distribution we took the limit of the total number of events N → ∞; we now take the limit that the mean value is very large. Let's write the Poisson distribution as

P_n = \frac{\lambda^n e^{-\lambda}}{n!}. \qquad (45)

Now let x = n = λ(1 + δ), where λ ≫ 1 and δ ≪ 1. Since ⟨n⟩ = λ, this means that we will also be concerned with large values of n, in which case the discrete P_n goes over to a continuous pdf in the variable x. Using Stirling's formula for n!,

n! \to \sqrt{2\pi n}\, e^{-n} n^n \quad \text{as } n \to \infty, \qquad (46)

we find [7]

p(x) = \frac{\lambda^{\lambda(1+\delta)} e^{-\lambda}}{\sqrt{2\pi}\, e^{-\lambda(1+\delta)}\, [\lambda(1+\delta)]^{\lambda(1+\delta)+1/2}} = \frac{e^{\lambda\delta}\,(1+\delta)^{-\lambda(1+\delta)-1/2}}{\sqrt{2\pi\lambda}} = \frac{e^{-\lambda\delta^2/2}}{\sqrt{2\pi\lambda}}. \qquad (48)

Substituting back for x, with δ = (x − λ)/λ, yields

p(x) = \frac{e^{-(x-\lambda)^2/(2\lambda)}}{\sqrt{2\pi\lambda}}. \qquad (49)

This is a Gaussian, or Normal [8], distribution with mean and variance of λ. The Gaussian distribution is the most important distribution in probability, due to its role in the Central Limit Theorem, which loosely says that the sum of a large number of independent quantities tends to have a Gaussian form, independent of the pdf of the individual measurements. The above specific derivation is somewhat cumbersome, and it will actually be more elegant to use the Central Limit Theorem to derive the Gaussian approximation to the Poisson distribution.

Footnote 7 (Maths Notes): The limit of a function like (1+δ)^{-λ(1+δ)-1/2} with λ ≫ 1 and δ ≪ 1 can be found by taking the natural log, then expanding in δ to second order and using λ ≫ 1:

\ln\left[(1+\delta)^{-\lambda(1+\delta)-1/2}\right] = -[\lambda(1+\delta)+1/2]\ln(1+\delta) = -(\lambda + \lambda\delta + 1/2)\,(\delta - \delta^2/2 + O(\delta^3)) \simeq -\lambda\delta - \lambda\delta^2/2 + O(\lambda\delta^3). \qquad (47)

Footnote 8: The name 'Normal' was given to this distribution by the statistician K. Pearson, who almost immediately regretted introducing the name. It is also sometimes called the Bell-curve.

2.1.6 More on the Gaussian

The Gaussian distribution is so important that we collect some properties here. It is normally written as

p(x) = \frac{1}{(2\pi)^{1/2}\sigma}\, e^{-(x-\mu)^2/2\sigma^2}, \qquad (50)

so that µ is the mean and σ the standard deviation. The first statement is obvious: consider ⟨x − µ⟩, which must vanish by symmetry, since it involves the integration of an odd function. To prove the second statement, write

\langle (x-\mu)^2 \rangle = \frac{1}{(2\pi)^{1/2}\sigma}\, \sigma^3 \int_{-\infty}^{\infty} y^2 e^{-y^2/2}\, dy, \qquad (51)

and do the integral by parts. Proving that the distribution is correctly normalized is harder, but there is a clever trick, which is to extend to a two-dimensional Gaussian for two independent (zero-mean) variables x and y:

p(x, y) = \frac{1}{2\pi\sigma^2}\, e^{-(x^2+y^2)/2\sigma^2}. \qquad (52)

The integral over both variables can now be rewritten using polar coordinates:

\int\!\!\int p(x, y)\, dx\, dy = \int p(x, y)\, 2\pi r\, dr = \frac{1}{2\pi\sigma^2} \int 2\pi r\, e^{-r^2/2\sigma^2}\, dr, \qquad (53)

and the final expression clearly integrates to give

P(r > R) = \exp\left(-R^2/2\sigma^2\right), \qquad (54)

so that P(r > 0) = 1 and the distribution is indeed correctly normalized.

Unfortunately, the single Gaussian distribution has no analytic expression for its integral, even though this is often of great interest. As seen in the next section, we often want to know the probability of a Gaussian variable lying above some value, so we need to know the integral of the Gaussian; there are two common notations for this (assuming zero mean):

P(x < X\sigma) = \Phi(X); \qquad (55)

P(x < X\sigma) = \tfrac{1}{2}\left[1 + \mathrm{erf}(X/\sqrt{2})\right], \qquad (56)

where the error function is defined as

\mathrm{erf}(y) \equiv \frac{2}{\sqrt{\pi}} \int_0^y e^{-t^2}\, dt. \qquad (57)

A useful approximation for the integral in the limit of high x is

P(x > X\sigma) \simeq \frac{e^{-X^2/2}}{(2\pi)^{1/2}\, X} \qquad (58)

(which is derived by a Taylor series: e^{-(x+\epsilon)^2/2} \simeq e^{-x^2/2}\, e^{-x\epsilon}, and the linear exponential in ε can be integrated).

2.2 Tails and measures of rareness

So far, we have asked questions about the probability of obtaining a particular experimental outcome, but this is not always a sensible question. Where there are many possible outcomes, the chance of any given one happening will be very small. This is most obvious with a continuous pdf, p(x): the probability of a result in the range x = a to x = a + δx is p(a)δx, so the probability of getting x = a exactly is precisely zero. Thus it only makes sense to ask about the probability of x lying in a given range.

Figure 3: The Gaussian distribution, illustrating the area under various parts of the curve, divided in units of σ. Thus the chance of being within 1σ of the mean is 68%; 95% of results are within 2σ of the mean; 99.7% of results are within 3σ of the mean.

The most common reason for calculating the probability of a given outcome (apart from betting) is so that we can test hypotheses. The lectures will give an in-depth discussion of this issue later, as it can be quite subtle. Nevertheless, it is immediately clear that we want to set the possible outcomes to an experiment in some order of rareness, and we will be suspicious of a hypothesis when an experiment generates an outcome that ought to be very rare if the hypothesis were true.

Informally, we need to define a typical value for x and some means of deciding if x is far from the typical value. For a Gaussian, we would naturally take the mean, µ, as the typical value and (x − µ)/σ as the distance. But how do we make this general? There are two other common measures of location:

The Mode: The value of x where p(x) has a maximum.
The Median: The value of x such that P(>x) = 0.5.

For the Gaussian, both these measures are equal to the mean. In general, the median is the safest choice, since it is easy to devise pathological examples where the mode or even the mean is not well defined. Following on from the median, we can define upper and lower quartiles, which together enclose half the probability, i.e. the values x_1 and x_2 where P(<x_1) = 0.25 and P(>x_2) = 0.25. This suggests a measure of rareness of events, which is to single out events that lie in the tails of the distribution, where either P(<x) ≪ 1 or P(>x) ≪ 1.
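These tail areas are easy to evaluate numerically via the error function, which also lets us check the high-X approximation of equation (58). A minimal sketch using only the Python standard library (the X values are illustrative):

```python
import math

def p_one_tail(x_over_sigma):
    # P(x > X sigma) for a zero-mean Gaussian, via the complementary error function
    return 0.5 * math.erfc(x_over_sigma / math.sqrt(2.0))

def tail_approx(x_over_sigma):
    # the high-X approximation of equation (58): e^{-X^2/2} / (sqrt(2 pi) X)
    return math.exp(-x_over_sigma ** 2 / 2.0) / (math.sqrt(2.0 * math.pi) * x_over_sigma)

for X in (1.0, 2.0, 3.0, 5.0):
    print(f"{X} sigma: 1-tail p = {p_one_tail(X):.3g}, approximation = {tail_approx(X):.3g}")
```

At X = 5 the approximation is high by only a few per cent, consistent with the neglected O(1/X²) corrections.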
This can be done in a 1-tailed or a 2-tailed manner, depending on whether we choose to count only high excursions, or to be impressed by deviations in either direction. The area under the tail of a pdf is called a p value, to emphasise that we have to be careful with its meaning. If we get, say, p = 0.05, this means that there is a probability of 0.05 of getting a value as extreme as this one, or worse, on a given hypothesis. So we need to have a hypothesis in mind to start with; this is called the null hypothesis, and it would typically be something non-committal, such as 'there is no signal'. If we get a p value that is small, this is some evidence against the null hypothesis, in which case we can claim to have detected a signal. Small values of p are described as giving significant evidence against the null hypothesis: for p = 0.05 we would say 'this result is significant at the 5% level'.

For completeness, we should mention the term confidence level, which is complementary to the significance level. If P(<x_1) = 0.05 and P(>x_2) = 0.05, then we would say that, on the null hypothesis, x_1 < x < x_2 at 90% confidence, or x < x_2 at 95% confidence.

As shown later in the course, the p value is not the same as the probability that the null hypothesis is correct, although many people think this is the case. Nevertheless, when p is small, you are on good grounds in disbelieving the null hypothesis. Some of the p values corresponding to particular places in the Gaussian are listed in Table 1. The weakest evidence would be a 2σ result, which happens by chance about 5% of the time. This is hardly definitive, although it may provoke further work. But a 3σ result is much less probable. If you decide to reject the null hypothesis whenever x − µ > 3σ, you will be wrong only about one time in 700. Nevertheless, discovery in some areas of science can insist on stronger evidence, perhaps at the 5σ level (1-sided p = 2.9 × 10^{-7}). This is partly because σ may not itself be known precisely, but also because many different experiments can be performed: if we search for the Higgs Boson in 100 different independent mass ranges, we are bound to find a result that is significant at about the 1% level.

Table 1: Tails of the Gaussian

x/σ    1-tail p    2-tail p
1.0    0.159       0.318
2.0    0.0228      0.0456
3.0    0.0013      0.0026

Obviously, this process is not perfect.
If we make an observation and get x − µ ≪ σ, this actually favours a narrower distribution than the standard one, but broader distributions are easier to detect, because the probability of an extreme deviation falls exponentially.

2.3 The likelihood

Although we have argued that the probability of a continuous variable having exactly some given value is zero, the relative probability of having any two values, p(x = a)/p(x = b), is well defined. This can be extended to the idea of relative probabilities for larger datasets, where n drawings are made from the pdf. We can approach this using the multinomial distribution, where we extend the binomial to something with more than two possible outcomes: e.g. we toss a six-sided dice seven times, and ask what is the probability of getting three ones, two twos, a five and a six? Number the possible results of each trial, and say each occurs n_1 times, n_2 times etc., with probabilities p_1, p_2 etc., out of N trials. Imagine first the case where the trials give a string of n_1 1's, followed by n_2 2's etc.

The probability of this happening is p_1^{n_1} p_2^{n_2} p_3^{n_3} \cdots. If we don't care which trials give these particular numbers, then we need to multiply by the number of ways in which such an outcome could arise. Imagine choosing the n_1 first, which can happen {}^{N}C_{n_1} ways; then n_2 can be chosen {}^{N-n_1}C_{n_2} ways. Multiplying all these factors, we get the simple result

p = \frac{N!}{n_1!\, n_2!\, n_3! \cdots}\; p_1^{n_1} p_2^{n_2} p_3^{n_3} \cdots \qquad (59)

Now consider the approach to the continuum limit, where all the p's are very small, so that the n's are either 0 or 1. Using bins in x of width δx, p_i = p(x_i)δx, so

p = N!\,(\delta x)^N \prod_{i=1}^{N} p(x_i) \equiv N!\,(\delta x)^N L, \qquad (60)

where L is the likelihood of the data. Clearly, when we compute the relative probabilities of two different datasets, this is the same as the likelihood ratio. The likelihood can be used not only to compute the relative probabilities of two different outcomes for the same p(x), but also the relative probabilities of the same outcome for two different pdfs. It is therefore a tool that is intimately involved in comparing hypotheses, as discussed more fully later in the course.

2.4 Example problems

We will now go through a number of examples where simple pdfs are applied to real astronomical problems.

2.4.1 Example: Poisson photon statistics

Typically, a star produces a large number, N ≫ 1, of photons during a period of observation. We only intercept a tiny fraction, p ≪ 1, of the photons which are emitted in all directions by the star, and if we collect those photons for a few minutes or hours we will collect only a tiny fraction of those emitted throughout the life of the star.
So if the star emits N photons in total and we collect a fraction, p, of those, then

\lambda = Np \;\text{(the mean number detected)}, \quad N \to \infty \;\text{(the mean total number emitted)}, \quad p \to 0 \;\text{(the probability of detection is very low)}. \qquad (61)

So if we make many identical observations of the star and plot out the frequency distribution of the numbers of photons collected each time, we expect to see a Poisson distribution (strictly, this is not completely true, as it ignores photon bunching: when the radiation occupation number is high, as in a laser, photons tend to arrive in bursts).
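The Poisson scatter expected from such repeated observations can be checked by simulation. A sketch, assuming an illustrative mean of λ = 100 detected photons per exposure and drawing Poisson samples with Knuth's multiplication method (adequate for moderate λ; the seed and sample count are arbitrary):

```python
import math
import random
import statistics

random.seed(12345)

def poisson_sample(lam, rng):
    # Knuth's method: count uniform draws until their running product drops below e^{-lam}
    threshold = math.exp(-lam)
    k, prod = 0, 1.0
    while True:
        prod *= rng.random()
        if prod <= threshold:
            return k
        k += 1

lam = 100.0  # illustrative mean photon count per identical exposure
counts = [poisson_sample(lam, random) for _ in range(10000)]
mean = statistics.fmean(counts)
frac_err = statistics.pstdev(counts) / mean
print(mean, frac_err)  # scatter/mean should be close to 1/sqrt(lam) = 0.1
```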

Conversely, if we make one observation and detect n photons, we can use the Poisson distribution to derive the probability of getting this result for all the possible values of λ. The simplest case is when there is no source of background photons (as in e.g. gamma-ray astronomy). In that case, seeing even a single photon is enough to tell us that there is a source there, and the only question is how bright it is. Here, the problem is to estimate the mean arrival rate of photons in a given time interval, λ, given the observed n in one interval. Provided n is reasonably large, we can safely take the Gaussian approach and argue that n will be scattered around λ with variance λ, where λ is close to n. Thus the source flux will be estimated from the observed number of photons, and the fractional error on the flux will be

\frac{\sigma_f}{f} = \frac{\sigma_n}{n} = 1/\sqrt{n}. \qquad (62)

When n is small, we have to be more careful, as discussed later in the section on Bayesian statistics.

2.4.2 Example: sky-limited detection

Typically the counts from the direction of an individual source will be much less than from the surrounding sky, and so our attempted flux measurement always includes sky photons as well as the desired photons from the object. For example, we may detect 5500 photons from an aperture centred on the source, and 5000 from the same sized aperture centred on a piece of blank sky. Have we detected a source? Let the counts from the aperture on the source be N_T, and from the same area of background sky N_B. N_T includes some background, so our estimate of the source counts N_S is (the hat means 'estimate of'):

\hat{N}_S = N_T - N_B. \qquad (63)

The question we want to address is: how uncertain is this? The counts are independent and random and so each follow a Poisson distribution, so the variance on N_S is

\sigma_S^2 = \sigma_T^2 + \sigma_B^2 = N_T + N_B. \qquad (64)

Thus in turn \hat{\sigma}_S^2 = N_T + N_B. If the source is much fainter than the sky, N_S ≪ N_T, then N_T ≈ N_B and the variance is approximately 2N_B. Thus the significance of the detection, the signal-to-noise ratio, is

\mathrm{Signal/Noise} = \frac{N_T - N_B}{\sqrt{N_T + N_B}} \simeq \frac{N_T - N_B}{\sqrt{2N_B}}. \qquad (65)

So simply measuring the background and the total counts is sufficient to determine if a detection is made. In the above example, Signal/Noise ≈ 500/√10000 = 5 (strictly, slightly less), what we would call a 5σ detection. Normally 3σ (p ≈ 0.001) gives good evidence for a detection, but only if the position is known in advance. When we make a survey, every pixel in an image is a candidate for a source, although most of them will be blank in reality. Thus the number of trials is very high; to avoid being swamped by false positives, surveys will normally set a threshold around 5σ (for related reasons, this is the traditional threshold used by particle physicists when searching for e.g. the Higgs Boson).
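Equation (65) amounts to a one-line function. A sketch applying it to the worked example of 5500 source-aperture counts against 5000 sky counts:

```python
import math

def snr(n_total, n_background):
    # Poisson variances add, so var(N_S) = var(N_T) + var(N_B) = N_T + N_B
    return (n_total - n_background) / math.sqrt(n_total + n_background)

print(snr(5500, 5000))  # a little under 5: the 'strictly, slightly less' above
```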

2.4.3 Example: the distribution of superluminal velocities in quasars

Some radio sources appear to be expanding faster than the speed of light. This is thought to occur if a radio-emitting component in the quasar jet travels almost directly towards the observer at a speed close to that of light. The effect was predicted by the Astronomer Royal Lord Martin Rees in 1966 (when he was an undergraduate), and first observed in 1971.

Figure 4: Superluminal motion from a quasar nucleus.

Suppose the angle to the line of sight is θ, as shown above, and that a component is ejected along the jet from the nucleus at t = 0. After some time t the ejected component has travelled a distance d_∥ = vt cos θ along the line of sight. But the initial ejection is seen to happen later than t = 0, owing to the light travel time from the nucleus, so the observed duration is Δt = t − d_∥/c = t(1 − (v/c) cos θ). In that time the component appears to have moved a distance d_⊥ = vt sin θ across the line of sight, and hence the apparent transverse velocity of the component is

v_\perp = \frac{d_\perp}{\Delta t} = \frac{v \sin\theta}{1 - (v/c)\cos\theta}. \qquad (66)

Note that although a v/c term appears in this expression, the effect is not a relativistic effect. It is just due to light delay and the viewing geometry. Writing

\beta = v/c, \qquad \gamma = (1-\beta^2)^{-1/2}, \qquad (67)

we find that the apparent transverse speed β_⊥ has a maximum value when

\frac{\partial \beta_\perp}{\partial \theta} = \frac{\beta(\cos\theta - \beta)}{(1 - \beta\cos\theta)^2} = 0, \qquad (68)

i.e. when cos θ = β. Since sin θ = 1/γ there, we find a maximum value of β_⊥ = γβ, where γ can be much greater than unity.

Given a randomly oriented sample of radio sources, what is the expected distribution of β_⊥ if β is fixed? First, note that θ is the angle to the line of sight, and since the orientation is random in three dimensions (i.e. uniform distribution over the area dA = sin θ dθ dφ),

p(\theta) = \sin\theta, \qquad 0 \le \theta \le \pi/2. \qquad (69)

Hence,

p(\beta_\perp) = p(\theta)\left|\frac{d\beta_\perp}{d\theta}\right|^{-1} = \frac{\sin\theta\,(1-\beta\cos\theta)^2}{\beta(\cos\theta - \beta)}, \qquad (70)

where sin θ and cos θ are given by inverting the equation for β_⊥. We have chosen the limits 0 ≤ θ ≤ π/2 because in the standard model blobs are ejected from the nucleus along a jet in two opposite directions, so we should always see one blob which is travelling towards us. The limits in β_⊥ are 0 ≤ β_⊥ ≤ γβ. The expression for p(β_⊥) in terms of β_⊥ alone is rather messy, but simplifies for β → 1:

\beta_\perp = \frac{\sin\theta}{1-\cos\theta}, \qquad (71)

p(\beta_\perp) = \sin\theta\,(1-\cos\theta). \qquad (72)

Squaring both sides of sin θ = β_⊥(1 − cos θ), using sin²θ = (1 − cos θ)(1 + cos θ) and rearranging gives us (1 − cos θ) = 2/(1 + β_⊥²). Substituting this and sin θ = β_⊥(1 − cos θ) into equation (72) finally gives us

p(\beta_\perp) = \frac{4\beta_\perp}{(1+\beta_\perp^2)^2}, \qquad \beta_\perp \ge 1. \qquad (73)

The cumulative probability for β_⊥ is

P(>\beta_\perp) = \frac{2}{1+\beta_\perp^2}, \qquad \beta_\perp \ge 1, \qquad (74)

so the probability of observing a large apparent velocity, say β_⊥ > 5, is P(β_⊥ > 5) ≈ 1/13. In fact, a much larger fraction of powerful radio quasars show superluminal motions, and it now seems likely that the quasar jets cannot be randomly oriented: there must be effects operating which tend to favour the selection of quasar jets pointing towards us, most probably due to an opaque disc shrouding the nucleus. Another physical effect is that jets pointing towards us at speeds close to c have their fluxes boosted by relativistic beaming, which means they would be favoured in a survey selected on the basis of radio flux. This can be avoided by choosing the sample on the basis of some flux which is not beamed, unrelated to the jet.
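The closing prediction, P(β_⊥ > 5) ≈ 1/13 from equation (74), is easy to test with a small Monte Carlo in the β → 1 limit: drawing cos θ uniformly on (0, 1) gives exactly p(θ) = sin θ on [0, π/2]. The seed and sample size are arbitrary:

```python
import math
import random

random.seed(2024)

def apparent_beta(cos_theta, beta=1.0):
    # equation (66) in units of c, in the beta -> 1 limit by default
    sin_theta = math.sqrt(1.0 - cos_theta * cos_theta)
    return beta * sin_theta / (1.0 - beta * cos_theta)

n = 200_000
hits = sum(1 for _ in range(n) if apparent_beta(random.random()) > 5.0)
frac = hits / n
print(frac)  # equation (74) predicts 2/(1 + 25) = 1/13 ~ 0.077
```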

3 Addition of random variables: the Central Limit Theorem

3.1 Probability distribution of summed independent random variables

Let us consider the distribution of the sum of two or more random variables: this will lead us on to the Central Limit Theorem, which is of critical importance in probability theory and hence astrophysics. Let us define a new random variable z = x + y. What is the probability density, p_z(z), of z? The probability of observing a value, z, which is greater than some value z_1 is

P(z \ge z_1) = \int_{z_1}^{\infty} d\tilde{z}\; p_z(\tilde{z}) \qquad (75)
= \int_{-\infty}^{\infty} dx \int_{z_1 - x}^{\infty} dy\; p_{x,y}(x, y) \qquad (76)
= \int_{-\infty}^{\infty} dy \int_{z_1 - y}^{\infty} dx\; p_{x,y}(x, y), \qquad (77)

where the integral limits on the second line can be seen from defining the region in the x-y plane (see Figure 5).

Figure 5: The region of integration of equation (76).

Now, the pdf for z is just (minus) the derivative of this integral probability, p_z(z) = −dP(≥z)/dz, so

p_z(z) = \int_{-\infty}^{\infty} dx\; p_{x,y}(x, z-x). \qquad (78)

Or we could do the 2D integral in the opposite order, which would give

p_z(z) = \int_{-\infty}^{\infty} dy\; p_{x,y}(z-y, y). \qquad (79)

If the distributions of x and y are independent, then we arrive at a particularly important result:

p_z(z) = \int dx\; p_x(x)\, p_y(z-x) \quad \text{or} \quad \int dy\; p_x(z-y)\, p_y(y), \quad \text{i.e.} \quad p_z(z) = (p_x * p_y)(z). \qquad (80)

If we add together two independent random variables, the resulting distribution is a convolution of the two distribution functions. The most powerful way of handling convolutions is to use Fourier transforms (FTs), since the FT of a convolved function p(z) is simply the product of the FTs of the separate functions p(x) and p(y) being convolved, i.e. for the sum of two independent random variables we have

\mathcal{F}(p_z) = \mathcal{F}(p_x)\, \mathcal{F}(p_y). \qquad (81)

3.2 Characteristic functions

In probability theory the Fourier transform of a probability distribution function is known as the characteristic function:

\phi(k) = \int dx\; p(x)\, e^{ikx} \qquad (82)

with reciprocal relation

p(x) = \frac{1}{2\pi} \int dk\; \phi(k)\, e^{-ikx} \qquad (83)

(note that other Fourier conventions may put the factor 2π in a different place). A discrete probability distribution can be thought of as a continuous one with delta-function spikes of weight p_i at locations x_i, so here the characteristic function is

\phi(k) = \sum_i p_i\, e^{ikx_i} \qquad (84)

(note that φ(k) is a continuous function of k). Hence in all cases the characteristic function is simply the expectation value of e^{ikx}:

\phi(k) = \langle e^{ikx} \rangle. \qquad (85)

Part of the power of characteristic functions is the ease with which one can generate all of the moments of the distribution by differentiation:

m_n = (-i)^n \left.\frac{d^n}{dk^n}\, \phi(k)\right|_{k=0}. \qquad (86)

This can be seen if one expands φ(k) in a power series:

\phi(k) = \langle e^{ikx} \rangle = \sum_{n=0}^{\infty} \frac{\langle (ikx)^n \rangle}{n!} = 1 + ik\langle x \rangle - \frac{k^2}{2}\langle x^2 \rangle + \cdots. \qquad (87)

As an example of a characteristic function let us consider the Poisson distribution:

\phi(k) = \sum_{n=0}^{\infty} e^{ikn}\, \frac{\lambda^n e^{-\lambda}}{n!} = e^{-\lambda}\, e^{\lambda e^{ik}}, \qquad (88)

so that the characteristic function for the Poisson distribution is

\phi(k) = e^{\lambda(e^{ik}-1)}. \qquad (89)

The first moments of the Poisson distribution are:

m_0 = \phi(k)\big|_{k=0} = e^{\lambda(e^{ik}-1)}\Big|_{k=0} = 1 \qquad (90)

m_1 = (-i)\,\frac{d}{dk}\phi(k)\Big|_{k=0} = (-i)\, e^{\lambda(e^{ik}-1)}\, \lambda e^{ik} (i)\Big|_{k=0} = \lambda \qquad (91)

m_2 = (-i)^2\,\frac{d^2}{dk^2}\phi(k)\Big|_{k=0} = (-1)\,\frac{d}{dk}\left[e^{\lambda(e^{ik}-1)}\, \lambda e^{ik} (i)\right]_{k=0} = \lambda(\lambda+1), \qquad (92)

which is in total agreement with the results found for the mean and the variance of the Poisson distribution (see Eq. 41 and Eq. 43).

Returning to the convolution equation (80),

p_z(z) = \int dy\; p_x(z-y)\, p_y(y), \qquad (93)

we shall identify the characteristic functions of p_z(z), p_x(x) and p_y(y) as φ_z(k), φ_x(k) and φ_y(k) respectively. The characteristic function of p_z(z) is then

\phi_z(k) = \int dz\; p_z(z)\, e^{ikz}
= \int dz \int dy\; p_x(z-y)\, p_y(y)\, e^{ikz}
= \int dz \int dy\; [p_x(z-y)\, e^{ik(z-y)}]\,[p_y(y)\, e^{iky}]
\;\;(\text{let } x = z-y)\;\; = \int dx\; p_x(x)\, e^{ikx} \int dy\; p_y(y)\, e^{iky}, \qquad (94)

which is an explicit proof of the convolution theorem for the product of Fourier transforms:

\phi_z(k) = \phi_x(k)\, \phi_y(k). \qquad (95)

The power of this approach is that the distribution of the sum of a large number of random variables can be easily derived. This result allows us to turn now to the Central Limit Theorem.
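Both pillars of the last two sections, the convolution rule of equation (80) and the moment formula of equation (86), can be verified numerically. A sketch using two fair dice for the convolution, and central finite differences of the Poisson characteristic function (89) with an illustrative λ = 4 for the moments:

```python
import cmath

# (i) the pmf of the sum of two independent dice is the convolution of their pmfs
die = {k: 1.0 / 6.0 for k in range(1, 7)}
total = {}
for x, px in die.items():
    for y, py in die.items():
        total[x + y] = total.get(x + y, 0.0) + px * py
print(total[7])  # 6/36, the most likely total

# (ii) moments from phi(k) = exp(lam (e^{ik} - 1)) via m_n = (-i)^n phi^(n)(0)
lam = 4.0
def phi(k):
    return cmath.exp(lam * (cmath.exp(1j * k) - 1.0))

h = 1e-4  # finite-difference step for the derivatives at k = 0
m1 = (-1j * (phi(h) - phi(-h)) / (2.0 * h)).real
m2 = (-(phi(h) - 2.0 * phi(0.0) + phi(-h)) / h ** 2).real
print(m1, m2)  # expect m1 = lam and m2 = lam (lam + 1)
```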

3.3 The Central Limit Theorem

The most important, and general, result from probability theory is the Central Limit Theorem. It applies to a wide range of phenomena and explains why the Gaussian distribution appears so often in Nature. In its most general form, the Central Limit Theorem states that the sum of n random values drawn from a probability distribution function of finite variance, σ², tends to be Gaussian distributed about the expectation value for the sum, with variance nσ². There are two important consequences:

1. The mean of a large number of values tends to be normally distributed, regardless of the probability distribution from which the values were drawn. Hence the sampling distribution is known even when the underlying probability distribution is not. It is for this reason that the Gaussian distribution occupies such a central place in statistics. It is particularly important in applications where the underlying distributions are not known, such as astrophysics.

2. Functions such as the Binomial and Poisson distributions arise from multiple drawings of values from some underlying probability distribution, and they all tend to look like the Gaussian distribution in the limit of a large number of drawings. We saw this earlier when we derived the Gaussian distribution from the Poisson distribution.

The first of these consequences means that under certain conditions we can assume an unknown distribution is Gaussian, if it is generated from a large number of events with finite variance. For example, the height of the surface of the sea has a Gaussian distribution, as it is perturbed by the sum of random winds. But it should be borne in mind that the number of influencing factors will be finite, so the Gaussian form will not apply exactly. It is often the case that a pdf will be approximately Gaussian in its core, but with increasing departures from the Gaussian as we move into the tails of rare events.
Thus the probability of a 5σ excursion might be much greater than a naive Gaussian estimate; it is sometimes alleged that neglect of this simple point by the banking sector played a big part in the financial crash of 2008. A simple example of this phenomenon is the distribution of human heights: a Gaussian model for this must fail, since a height has to be positive, whereas a Gaussian extends to −∞.

3.3.1 Derivation of the Central Limit Theorem

Let

X = \frac{1}{\sqrt{n}}(x_1 + x_2 + \cdots + x_n) = \frac{1}{\sqrt{n}} \sum_{j=1}^{n} x_j \qquad (96)

be the sum of n random variables x_j, each drawn from the same arbitrary underlying distribution function, p_x. In general the underlying distributions can all be different for each x_j, but for simplicity we shall consider only one here. The distribution of the X's generated by this summation, p_X(X), will be a convolution of the underlying distributions.

From the properties of characteristic functions we know that a convolution of distribution functions is a multiplication of characteristic functions. The characteristic function of p_x(x) is

\phi_x(k) = \int dx\; p_x(x)\, e^{ikx} = 1 + i\langle x \rangle k - \frac{1}{2}\langle x^2 \rangle k^2 + O(k^3), \qquad (97)

where in the last term we have expanded out e^{ikx}. Since the sum is over x_j/√n, rather than x_j, we scale all the moments ⟨x^p⟩ → ⟨(x/√n)^p⟩. From equation (97), we see this is the same as scaling k → k/√n. Hence the characteristic function of X is

\Phi_X(k) = \prod_{j=1}^{n} \phi_{x_j/\sqrt{n}}(k) = \prod_{j=1}^{n} \phi_{x_j}(k/\sqrt{n}) = \left[\phi_x(k/\sqrt{n})\right]^n. \qquad (98)

If we assume that m_1 = ⟨x⟩ = 0, so that m_2 = ⟨x²⟩ = σ_x² (this doesn't affect our results), then

\Phi_X(k) = \left[1 + i\,\frac{m_1 k}{\sqrt{n}} - \frac{\sigma_x^2 k^2}{2n} + O\!\left(\frac{k^3}{n^{3/2}}\right)\right]^n = \left[1 - \frac{\sigma_x^2 k^2}{2n} + O\!\left(\frac{k^3}{n^{3/2}}\right)\right]^n \to e^{-\sigma_x^2 k^2/2} \qquad (99)

as n → ∞. Note that the higher terms contribute as n^{-3/2} in the expansion of Φ_X(k) and so vanish in the limit of large n, where we have previously seen how to treat the limit of the expression (1 + a/n)^n. It is however important to note that we have made a critical assumption: all the moments of the distribution, however high order, must be finite. If they are not, then the higher-order terms in k will not be negligible, however much we reduce them as a function of n. It is easy to invent distributions for which this will be a problem: a power-law tail to the pdf may allow a finite variance, but sufficiently high moments will diverge, and our proof will fail. In fact, the Central Limit Theorem still holds even in such cases (it is only necessary that the variance be finite), but we are unable to give a simple proof of this here.

The above proof gives the characteristic function for X. We know that the FT of a Gaussian is another Gaussian, but let us show that explicitly:

p_X(X) = \frac{1}{2\pi} \int dk\; \Phi_X(k)\, e^{-ikX} = \frac{1}{2\pi} \int dk\; e^{-(\sigma_x^2 k^2 + 2ikX)/2}
= \frac{e^{-X^2/(2\sigma_x^2)}}{2\pi} \int dk\; e^{-(k\sigma_x + iX/\sigma_x)^2/2} = \frac{e^{-X^2/(2\sigma_x^2)}}{2\pi}\, \frac{\sqrt{2\pi}}{\sigma_x} = \frac{e^{-X^2/(2\sigma_x^2)}}{\sqrt{2\pi}\,\sigma_x}. \qquad (100)

Thus the sum of n random variables, sampled from the same underlying distribution, will tend towards a Gaussian distribution, independently of the initial distribution. The variance of X is evidently the same as that of x: σ_x². The variance of the mean of the x_j is then clearly smaller by a factor n, since the mean is X/√n.
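The practical content of the derivation can be seen in a few lines: means of n draws from a decidedly non-Gaussian pdf (here uniform on (0, 1), with σ_x = √(1/12)) scatter like a Gaussian with standard error σ_x/√n. The values of n, the trial count and the seed are arbitrary:

```python
import random
import statistics

random.seed(7)

n, trials = 48, 4000
# sample means of n uniform(0,1) draws; the predicted standard error
# on the mean is sqrt(1/12) / sqrt(n)
means = [statistics.fmean(random.random() for _ in range(n)) for _ in range(trials)]
mean_of_means = statistics.fmean(means)
spread = statistics.pstdev(means)
predicted = (1.0 / (12.0 * n)) ** 0.5
print(mean_of_means, spread, predicted)
```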

3.3.2 Measurement theory

As a corollary, by comparing equation (96) with the expression for estimating the mean from a sample of n independent variables,

\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad (101)

we see that the estimated mean from a sample has a Gaussian distribution with mean m_1 = ⟨x⟩ and standard error on the mean

\sigma_{\bar{x}} = \frac{\sigma_x}{\sqrt{n}} \qquad (102)

as n → ∞. This has two important consequences.

1. If we estimate the mean from a sample, we will always tend towards the true mean.

2. The uncertainty in our estimate of the mean will decrease as the sample gets bigger.

This is a remarkable result: for a sufficiently large number of drawings from an unknown distribution function with mean ⟨x⟩ and standard deviation σ, we are assured by the Central Limit Theorem that we will get the measurement we want to higher and higher accuracy, and that the estimated mean of the sampled numbers will have a Gaussian distribution, almost regardless of the form of the unknown distribution. The only condition under which this will not occur is if the unknown distribution does not have a finite variance. Hence we see that all our assumptions about measurement rely on the Central Limit Theorem.

3.3.3 How the Central Limit Theorem works

We have seen from the above derivation that the Central Limit Theorem arises because, in making many measurements and averaging them together, we are convolving a probability distribution with itself many times. We have shown that this has the remarkable mathematical property that in the limit of a large number of such convolutions, the result always tends to look Gaussian. In this sense, the Gaussian, or normal, distribution is the 'smoothest' distribution which can be produced by natural processes. We can show this by considering a non-Gaussian distribution, i.e. a top-hat, or square, distribution (see Figure 6). If we convolve this with itself, we get a triangle distribution. Convolving again, we get a slightly smoother distribution. If we keep going we will end up with a Gaussian distribution. This is the Central Limit Theorem and is the reason for its ubiquitous presence in nature.
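The repeated-convolution picture of Figure 6 can be reproduced directly. A sketch that convolves a discrete top-hat with itself five times (so the result is the distribution of a sum of six top-hat variables) and compares it with a Gaussian of matching mean and variance:

```python
import math

def convolve(a, b):
    # discrete convolution: the distribution of the sum of two independent variables
    out = [0.0] * (len(a) + len(b) - 1)
    for i, pa in enumerate(a):
        for j, pb in enumerate(b):
            out[i + j] += pa * pb
    return out

tophat = [1.0 / 11.0] * 11  # a 'top-hat': uniform on 11 points
dist = list(tophat)
for _ in range(5):
    dist = convolve(dist, tophat)

# compare with a Gaussian of the same mean and variance
mean = sum(i * p for i, p in enumerate(dist))
var = sum((i - mean) ** 2 * p for i, p in enumerate(dist))
gauss = [math.exp(-(i - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)
         for i in range(len(dist))]
worst = max(abs(p - g) for p, g in zip(dist, gauss))
print(mean, var, worst)  # the worst pointwise deviation is already small
```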
Figure 6: Repeated convolution of a distribution will eventually yield a Gaussian, provided the variance of the convolved distribution is finite.

3.4 Sampling distributions

Above we showed how the Central Limit Theorem lies at the root of our expectation that more measurements will lead to better results. Our estimate of the mean of n variables is unbiased (i.e. gives the right answer), the uncertainty on the estimated mean decreases as \sigma_x/\sqrt{n}, and the distribution of the estimated, or sampled, mean has a Gaussian distribution. The distribution of the mean determined in this way is known as the sampling distribution of the mean.

How fast the Central Limit Theorem works (i.e. how small n can be before the distribution is no longer Gaussian) depends on the underlying distribution. At one extreme, we can consider the case where the underlying variables are all Gaussian distributed. Then the sampling distribution of the mean will always be a Gaussian, even if n = 1.

But, beware! For some distributions the Central Limit Theorem does not hold. For example, the means of values drawn from a Cauchy (or Lorentz) distribution,

    p(x) = \frac{1}{\pi (1 + x^2)},    (103)

never approach normality. This is because this distribution has infinite variance (try to calculate it and see); in fact the means are themselves distributed like the Cauchy distribution. Is this a rare but pathological example? Unfortunately not. The Cauchy distribution appears in spectral line fitting, where it is known as the Lorentzian profile (its convolution with a Gaussian gives the Voigt profile). Another example: the ratio of two independent zero-mean Gaussian variables has a Cauchy distribution. Hence we should beware: although the Central Limit Theorem and the Gaussian distribution considerably simplify probability and statistics, exceptions do occur, and one should always be wary of them.

3.5 Error propagation

If z is some function of random variables x and y, and we know the variances of x and y, what is the variance of z? Let z = f(x, y). We can propagate errors by expanding f(x, y) to first order around some arbitrary
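The Cauchy failure is easy to demonstrate numerically. In the sketch below (Python/NumPy; the sample sizes are arbitrary illustrative choices), the spread of the sample mean, measured by the interquartile range since the variance is infinite, does not shrink at all as n grows:

```python
import numpy as np

rng = np.random.default_rng(1)

# The mean of n standard Cauchy variates is itself standard Cauchy,
# so averaging never reduces the scatter.  The variance is infinite,
# so we measure spread with the interquartile range (IQR); a unit
# Cauchy has quartiles at -1 and +1, i.e. an IQR of 2.
iqrs = {}
for n in (1, 10, 1000):
    means = rng.standard_cauchy(size=(2000, n)).mean(axis=1)
    q1, q3 = np.percentile(means, [25, 75])
    iqrs[n] = q3 - q1

print(iqrs)   # each IQR stays near 2, independent of n
```

Contrast this with the uniform-variate example of Section 3.3.2, where the scatter of the mean fell as 1/\sqrt{n}.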

values, x_0 and y_0:

    f(x, y) = f(x_0, y_0) + (x - x_0) \left.\frac{\partial f}{\partial x}\right|_{x_0, y_0} + (y - y_0) \left.\frac{\partial f}{\partial y}\right|_{x_0, y_0} + O\!\left((x - x_0)^2, (x - x_0)(y - y_0), (y - y_0)^2\right).    (104)

Let us assume x_0 = y_0 = 0 and \langle x \rangle = \langle y \rangle = 0 for simplicity (the answer will be general). The mean of z is

    \langle z \rangle = f(x_0, y_0)    (105)

and the variance is (assuming x and y are independent)

    \sigma_z^2 = \langle (z - \langle z \rangle)^2 \rangle = \int\!\!\int dx\, dy\, (f - \langle f \rangle)^2\, p(x)\, p(y) = \int\!\!\int dx\, dy\, \left(x^2 f_x^2 + y^2 f_y^2 + 2xy f_x f_y\right) p(x)\, p(y),    (106)

where we have used the notation f_x \equiv \left.\frac{\partial f}{\partial x}\right|_{x=x_0, y=y_0}. Averaging over the random variables, we find for independent variables with zero mean

    \sigma_z^2 = \left(\frac{\partial f}{\partial x}\right)^2 \sigma_x^2 + \left(\frac{\partial f}{\partial y}\right)^2 \sigma_y^2.    (107)

This formula allows us to propagate errors for arbitrary functions. Note again that this is valid for any distribution function, but depends on (1) the underlying variables being independent, (2) the function being differentiable, and (3) the variation from the mean being small enough for the expansion to be valid.

3.5.1 The sample variance

The average of the sample,

    \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i,    (108)

is an estimate of the mean of the underlying distribution. Given that we may not directly know the variance of the summed variables, \sigma_x^2, is there a similar estimate of the variance of \bar{x}? This is particularly important in situations where we need to assess the significance of a result in terms of how far it lies from the expected value, but where we only have a finite sample from which to measure the variance of the distribution. We would expect a good estimate of the population variance to be something like

    S^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2,    (109)

where

    \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i    (110)

is the sample mean of the n values. Let us find the expected value of this sum. First we rearrange the summation:

    S^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n} \sum_i x_i^2 - \frac{2}{n^2} \sum_i \sum_k x_i x_k + \frac{1}{n^2} \sum_i \sum_k x_i x_k = \frac{1}{n} \sum_i x_i^2 - \left(\frac{1}{n} \sum_i x_i\right)^2,    (111)

which is the same result we found in Section 1.4: the variance is just the mean of the square minus the square of the mean. If all the x_i are drawn independently, then

    \left\langle \prod_i f(x_i) \right\rangle = \prod_i \langle f(x_i) \rangle,    (112)

where f(x) is some arbitrary function of x. If i = j then

    \langle x_i x_j \rangle = \langle x^2 \rangle, \quad i = j,    (113)

and when i and j are different,

    \langle x_i x_j \rangle = \langle x \rangle^2, \quad i \neq j.    (114)

The expectation value of our estimator is then

    \langle S^2 \rangle = \frac{1}{n} \sum_i \langle x_i^2 \rangle - \frac{1}{n^2} \left\langle \left(\sum_i x_i\right)^2 \right\rangle
                        = \langle x^2 \rangle - \frac{1}{n} \langle x^2 \rangle - \frac{n(n-1)}{n^2} \langle x \rangle^2
                        = \left(1 - \frac{1}{n}\right) \langle x^2 \rangle - \frac{n-1}{n} \langle x \rangle^2
                        = \frac{n-1}{n} \left(\langle x^2 \rangle - \langle x \rangle^2\right) = \frac{n-1}{n}\, \sigma_x^2.    (115)

The variance is defined as \sigma_x^2 = \langle x^2 \rangle - \langle x \rangle^2, so S^2 will underestimate the variance by the factor (n-1)/n. This is because an extra variance term, \sigma_x^2/n, has appeared due to the extra variance in our estimate of the mean. Since the square of the mean is subtracted from the mean of the square, this extra variance is subtracted from our estimate of the variance, causing the underestimation. To correct for this, we should change our estimate to

    S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2,    (116)

which is an unbiased estimate of \sigma_x^2, independent of the underlying distribution. It is unbiased because its expectation value is always \sigma_x^2, for any n, when the mean is estimated from the sample. Note that if the mean is known, and not estimated from the sample, this extra variance does not appear, in which case equation (109) is an unbiased estimate of the sample variance.
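The bias of equation (109) and the (n-1)/n correction are easy to verify by simulation. The following sketch (Python/NumPy; the Gaussian parent and the sample size are illustrative choices, though the result holds for any distribution with finite variance) averages both estimators over many small samples:

```python
import numpy as np

rng = np.random.default_rng(3)

# Draw many samples of size n from a unit-variance Gaussian and
# average the two variance estimators over the samples.
n, trials = 5, 200_000
x = rng.normal(0.0, 1.0, size=(trials, n))

s2_biased = x.var(axis=1, ddof=0).mean()    # eq. (109): divide by n
s2_unbiased = x.var(axis=1, ddof=1).mean()  # eq. (116): divide by n - 1

print(s2_biased)     # close to (n - 1)/n = 0.8, as eq. (115) predicts
print(s2_unbiased)   # close to 1.0, the true variance
```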

3.5.2 Example: Measuring quasar variation

We want to look for variable quasars. We have two CCD images of one field, taken some time apart, and we want to pick out the quasars which have varied significantly more than the measurement error, which is unknown. In this case

    S^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left(\Delta m_i - \overline{\Delta m}\right)^2    (117)

is the unbiased estimate of the variance of \Delta m. We want to keep \overline{\Delta m} \neq 0 (i.e. we want to measure \overline{\Delta m} from the data) to allow for possible calibration errors. If we were confident that the calibration were correct, we could set \overline{\Delta m} = 0 and return to the definition

    \sigma^2 = \frac{1}{n} \sum_i (\Delta m_i)^2.    (118)

Suppose we find that one of the \Delta m, say \Delta m_i, is very large. Can we assess the significance of this result? One way to estimate its significance is from

    t = \frac{\Delta m_i - \overline{\Delta m}}{S}.    (119)

If the mean is known, this is distributed as a standardised Gaussian (i.e. t has unit variance) if the measurement errors are Gaussian. But if we can only estimate the mean from the data, t is distributed as Student-t. The Student-t distribution looks qualitatively similar to a Gaussian distribution, but it has larger tails, due to the variations in the measured mean and variance. In other words, Student-t is the pdf that arises when estimating the mean of a Gaussian-distributed population when the sample size is small.
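The heavier tails of Student-t are easy to see by simulation. The sketch below (Python/NumPy; it uses the standard form \sqrt{n}(\bar{x} - \mu)/S of the statistic rather than equation (119) itself, and the sample size is an illustrative choice) compares the statistic built with the estimated S against the one built with the true \sigma:

```python
import numpy as np

rng = np.random.default_rng(11)

# For small Gaussian samples, sqrt(n)*(xbar - mu)/S (S estimated from
# the data) follows Student-t with n-1 degrees of freedom, whereas
# sqrt(n)*(xbar - mu)/sigma (true sigma known) is exactly unit Gaussian.
n, trials, mu, sigma = 4, 200_000, 0.0, 1.0
x = rng.normal(mu, sigma, size=(trials, n))

t = np.sqrt(n) * (x.mean(axis=1) - mu) / x.std(axis=1, ddof=1)
z = np.sqrt(n) * (x.mean(axis=1) - mu) / sigma

frac_t = np.mean(np.abs(t) > 2.0)   # ~0.14 for t with 3 dof
frac_z = np.mean(np.abs(z) > 2.0)   # ~0.046 for a unit Gaussian
print(frac_t, frac_z)
```

With the mean and variance estimated from only four points, |t| > 2 occurs about three times as often as the Gaussian tail probability would suggest, which is why significances quoted from small samples must use Student-t rather than Gaussian tail areas.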