Incremental calculation of weighted mean and variance




Tony Finch
fanf2@cam.ac.uk, dot@dotat.at
University of Cambridge Computing Service
February 2009

Abstract

In these notes I explain how to derive formulae for numerically stable calculation of the mean and standard deviation, which are also suitable for incremental on-line calculation. I then generalize these formulae to weighted means and standard deviations. I unpick the difficulties that arise when generalizing further to normalized weights. Finally I show that the exponentially weighted moving average is a special case of the incremental normalized weighted mean formula, and derive a formula for the exponentially weighted moving standard deviation.

1 Simple mean

Straightforward translation of equation 1 into code can suffer from loss of precision because of the difference in magnitude between a sample and the sum of all samples. Equation 4 calculates the mean in a way that is more numerically stable because it avoids accumulating large sums.

\begin{align}
\mu_n &= \frac{1}{n} \sum_{i=1}^{n} x_i \tag{1} \\
&= \frac{1}{n} \left( x_n + \sum_{i=1}^{n-1} x_i \right) \tag{2} \\
&= \frac{1}{n} \left( x_n + (n-1)\mu_{n-1} \right) \tag{3} \\
&= \mu_{n-1} + \frac{1}{n} (x_n - \mu_{n-1}) \tag{4}
\end{align}

This formula also provides us with some useful identities.

\begin{align}
x_n - \mu_{n-1} &= n(\mu_n - \mu_{n-1}) \tag{5} \\
x_n - \mu_n &= n(\mu_n - \mu_{n-1}) - \mu_n + \mu_{n-1} = (n-1)(\mu_n - \mu_{n-1}) \tag{6}
\end{align}

2 Simple variance

The definition of the standard deviation in equation 7 below requires us to already know the mean, which implies two passes over the data. This isn't feasible for online algorithms that need to produce incremental results after each sample becomes available. Equation 12 solves this problem since it allows us to calculate the standard deviation from two running sums.

\begin{align}
\sigma^2 &= \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2 \tag{7} \\
&= \frac{1}{n} \sum_{i=1}^{n} \left( x_i^2 - 2 x_i \mu + \mu^2 \right) \tag{8}
\end{align}

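\begin{align}
\sigma^2 &= \frac{1}{n} \sum_{i=1}^{n} x_i^2 - 2\mu \frac{1}{n} \sum_{i=1}^{n} x_i + \mu^2 \frac{1}{n} \sum_{i=1}^{n} 1 \tag{9} \\
&= \frac{1}{n} \sum_{i=1}^{n} x_i^2 - 2\mu\mu + \mu^2 \frac{n}{n} \tag{10} \\
&= \left( \frac{1}{n} \sum_{i=1}^{n} x_i^2 \right) - \mu^2 \tag{11} \\
&= \left( \frac{1}{n} \sum_{i=1}^{n} x_i^2 \right) - \left( \frac{1}{n} \sum_{i=1}^{n} x_i \right)^2 \tag{12}
\end{align}

3 Incremental variance

Knuth notes [1] that equation 12 is prone to loss of precision because it takes the difference between two large sums of similar size, and suggests equation 24 as an alternative that avoids this problem. However he does not say how it is derived. In the following, equation 20 is derived from the previous step using equation 5.

\begin{align}
\text{Let } S_n &= n \sigma_n^2 \tag{13} \\
&= \sum_{i=1}^{n} (x_i - \mu_n)^2 \tag{14} \\
&= \left( \sum_{i=1}^{n} x_i^2 \right) - n \mu_n^2 \tag{15}
\end{align}

\begin{align}
S_n - S_{n-1} &= \sum_{i=1}^{n} x_i^2 - n\mu_n^2 - \sum_{i=1}^{n-1} x_i^2 + (n-1)\mu_{n-1}^2 \tag{16} \\
&= x_n^2 - n\mu_n^2 + (n-1)\mu_{n-1}^2 \tag{17} \\
&= x_n^2 - \mu_{n-1}^2 + n(\mu_{n-1}^2 - \mu_n^2) \tag{18} \\
&= x_n^2 - \mu_{n-1}^2 + n(\mu_{n-1} - \mu_n)(\mu_{n-1} + \mu_n) \tag{19} \\
&= x_n^2 - \mu_{n-1}^2 + (\mu_{n-1} - x_n)(\mu_{n-1} + \mu_n) \tag{20} \\
&= x_n^2 - \mu_{n-1}^2 + \mu_{n-1}^2 - x_n\mu_n - x_n\mu_{n-1} + \mu_n\mu_{n-1} \tag{21} \\
&= x_n^2 - x_n\mu_n - x_n\mu_{n-1} + \mu_n\mu_{n-1} \tag{22} \\
&= (x_n - \mu_{n-1})(x_n - \mu_n) \tag{23} \\
S_n &= S_{n-1} + (x_n - \mu_{n-1})(x_n - \mu_n) \tag{24} \\
\sigma_n &= \sqrt{S_n / n} \tag{25}
\end{align}

Mathworld [2] has an alternative derivation of a similar formula, which in our notation is as follows.

\begin{align}
S_n &= \sum_{i=1}^{n} (x_i - \mu_n)^2 \tag{26} \\
&= \sum_{i=1}^{n} \left( (x_i - \mu_{n-1}) - (\mu_n - \mu_{n-1}) \right)^2 \tag{27} \\
&= \sum_{i=1}^{n} (x_i - \mu_{n-1})^2 + \sum_{i=1}^{n} (\mu_n - \mu_{n-1})^2 - 2 \sum_{i=1}^{n} (x_i - \mu_{n-1})(\mu_n - \mu_{n-1}) \tag{28}
\end{align}

Simplify the first summation:

\begin{align}
\sum_{i=1}^{n} (x_i - \mu_{n-1})^2 &= (x_n - \mu_{n-1})^2 + \sum_{i=1}^{n-1} (x_i - \mu_{n-1})^2 \tag{29} \\
&= (x_n - \mu_{n-1})^2 + S_{n-1} \tag{30} \\
&= S_{n-1} + n^2 (\mu_n - \mu_{n-1})^2 \tag{31}
\end{align}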
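Before continuing with the alternative derivation, the incremental updates of equations 4 and 24 can be sketched in code. This is a minimal illustration rather than a reference implementation: the function name `running_mean_variance` and its generator interface are assumptions made here, not part of the original notes, and it reports the population variance S_n / n as in equation 25.

```python
import math


def running_mean_variance(samples):
    """Yield (mean, population variance) after each sample.

    Hypothetical helper: the mean follows equation 4 and the running
    sum S_n follows equation 24, so no large sums are accumulated.
    """
    n = 0
    mean = 0.0
    s = 0.0  # S_n = n * sigma_n^2 (equation 13)
    for x in samples:
        n += 1
        delta = x - mean         # x_n - mu_{n-1}
        mean += delta / n        # equation 4
        s += delta * (x - mean)  # equation 24: (x_n - mu_{n-1})(x_n - mu_n)
        yield mean, s / n        # sigma_n^2 = S_n / n


if __name__ == "__main__":
    for mean, variance in running_mean_variance([2, 4, 4, 4, 5, 5, 7, 9]):
        print(mean, math.sqrt(variance))
```

Only three numbers (the count, the mean and S_n) are carried between samples, which is what makes the update usable on-line.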

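Simplify the second summation:

\begin{align}
\sum_{i=1}^{n} (\mu_n - \mu_{n-1})^2 &= n(\mu_n - \mu_{n-1})^2 \tag{32}
\end{align}

Simplify the third summation:

\begin{align}
\sum_{i=1}^{n} (x_i - \mu_{n-1})(\mu_n - \mu_{n-1}) &= (\mu_n - \mu_{n-1}) \sum_{i=1}^{n} (x_i - \mu_{n-1}) \tag{33} \\
&= (\mu_n - \mu_{n-1}) \left( x_n - \mu_{n-1} + \sum_{i=1}^{n-1} (x_i - \mu_{n-1}) \right) \tag{34} \\
&= (\mu_n - \mu_{n-1}) \left( x_n - \mu_{n-1} - (n-1)\mu_{n-1} + \sum_{i=1}^{n-1} x_i \right) \tag{35} \\
&= (\mu_n - \mu_{n-1}) \left( x_n - \mu_{n-1} - (n-1)\mu_{n-1} + (n-1)\mu_{n-1} \right) \tag{36} \\
&= (\mu_n - \mu_{n-1})(x_n - \mu_{n-1}) \tag{37} \\
&= n(\mu_n - \mu_{n-1})^2 \tag{38}
\end{align}

Back to the complete formula:

\begin{align}
S_n &= S_{n-1} + n^2(\mu_n - \mu_{n-1})^2 + n(\mu_n - \mu_{n-1})^2 - 2n(\mu_n - \mu_{n-1})^2 \tag{39} \\
&= S_{n-1} + n^2(\mu_n - \mu_{n-1})^2 - n(\mu_n - \mu_{n-1})^2 \tag{40} \\
&= S_{n-1} + n(n-1)(\mu_n - \mu_{n-1})^2 \tag{41}
\end{align}

We can use equations 6 and 5 to show that this is equivalent to equation 24.

\begin{align}
S_n &= S_{n-1} + n(n-1)(\mu_n - \mu_{n-1})^2 \tag{42} \\
&= S_{n-1} + n(\mu_n - \mu_{n-1})(x_n - \mu_n) \tag{43} \\
&= S_{n-1} + (x_n - \mu_{n-1})(x_n - \mu_n) \tag{44}
\end{align}

4 Weighted mean

The weighted mean is defined as follows.

\begin{align}
\mu &= \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} \tag{45}
\end{align}

It is equivalent to the simple mean when all the weights are equal, since

\begin{align}
\mu &= \frac{\sum_{i=1}^{n} w x_i}{\sum_{i=1}^{n} w} = \frac{w \sum_{i=1}^{n} x_i}{n w} = \frac{1}{n} \sum_{i=1}^{n} x_i \tag{46}
\end{align}

If the samples are all different, then the weights can be thought of as sample frequencies, or they can be used to calculate probabilities where $p_i = w_i / \sum w_i$. The following derivation of the incremental formula (equation 53) follows the same pattern as the derivation of equation 4. For brevity we also define $W_n$ as the sum of the weights.

\begin{align}
W_n &= \sum_{i=1}^{n} w_i \tag{47} \\
\mu_n &= \frac{1}{W_n} \sum_{i=1}^{n} w_i x_i \tag{48} \\
&= \frac{1}{W_n} \left( w_n x_n + \sum_{i=1}^{n-1} w_i x_i \right) \tag{49}
\end{align}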

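\begin{align}
\mu_n &= \frac{1}{W_n} \left( w_n x_n + W_{n-1}\mu_{n-1} \right) \tag{50} \\
&= \frac{1}{W_n} \left( w_n x_n + (W_n - w_n)\mu_{n-1} \right) \tag{51} \\
&= \frac{1}{W_n} \left( W_n\mu_{n-1} + w_n x_n - w_n\mu_{n-1} \right) \tag{52} \\
&= \mu_{n-1} + \frac{w_n}{W_n} (x_n - \mu_{n-1}) \tag{53}
\end{align}

Useful identities derived from this formula are:

\begin{align}
W_n(\mu_n - \mu_{n-1}) &= w_n(x_n - \mu_{n-1}) \tag{54} \\
x_n - \mu_{n-1} &= \frac{W_n}{w_n}(\mu_n - \mu_{n-1}) \tag{55} \\
x_n - \mu_n &= \frac{W_n}{w_n}(\mu_n - \mu_{n-1}) - \mu_n + \mu_{n-1} = \frac{W_n - w_n}{w_n}(\mu_n - \mu_{n-1}) \tag{56} \\
w_n(x_n - \mu_n) &= (W_n - w_n)(\mu_n - \mu_{n-1}) \tag{57}
\end{align}

5 Weighted variance

Similarly, we derive a numerically stable formula for calculating the weighted variance (equation 68) using the same pattern as the derivation of the unweighted version (equation 24).

\begin{align}
\sigma^2 &= \frac{1}{W_n} \sum_{i=1}^{n} w_i (x_i - \mu)^2 = \left( \frac{1}{W_n} \sum_{i=1}^{n} w_i x_i^2 \right) - \mu^2 \tag{58}
\end{align}

\begin{align}
\text{Let } S_n &= W_n \sigma_n^2 \tag{59} \\
&= \sum_{i=1}^{n} w_i x_i^2 - W_n \mu_n^2 \tag{60}
\end{align}

\begin{align}
S_n - S_{n-1} &= \sum_{i=1}^{n} w_i x_i^2 - W_n\mu_n^2 - \sum_{i=1}^{n-1} w_i x_i^2 + W_{n-1}\mu_{n-1}^2 \tag{61} \\
&= w_n x_n^2 - W_n\mu_n^2 + W_{n-1}\mu_{n-1}^2 \tag{62} \\
&= w_n x_n^2 - W_n\mu_n^2 + (W_n - w_n)\mu_{n-1}^2 \tag{63} \\
&= w_n(x_n^2 - \mu_{n-1}^2) + W_n(\mu_{n-1}^2 - \mu_n^2) \tag{64} \\
&= w_n(x_n^2 - \mu_{n-1}^2) + W_n(\mu_{n-1} - \mu_n)(\mu_{n-1} + \mu_n) \tag{65} \\
&= w_n \left( x_n^2 - \mu_{n-1}^2 + (\mu_{n-1} - x_n)(\mu_{n-1} + \mu_n) \right) \tag{66} \\
&= w_n(x_n - \mu_n)(x_n - \mu_{n-1}) \tag{67} \\
S_n &= S_{n-1} + w_n(x_n - \mu_n)(x_n - \mu_{n-1}) \tag{68} \\
\sigma_n &= \sqrt{S_n / W_n} \tag{69}
\end{align}

6 Variable weights

In the previous three sections, I have assumed that weights are constant once assigned. However a common requirement is to normalize the weights, such that

\begin{align}
W = \sum_{i=1}^{n} w_i = 1 \tag{70}
\end{align}

If we are repeatedly adding new data to our working set, then we can't have both constant weights and normalized weights. To allow us to keep weights normalized, we need to allow the weight of each sample to change as the set of samples changes. To indicate this we will give weights two indices, the first identifying the set of samples using the sample count (as we have been doing for $\mu_n$ etc.) and the second being the index of the sample in that set. We will not make any assumptions about the sum of the weights, that is, we will not require them to be normalized. For example,

\begin{align}
W_n = \sum_{i=1}^{n} w_{n,i}, \qquad \mu_n = \frac{1}{W_n} \sum_{i=1}^{n} w_{n,i} x_i \tag{71}
\end{align}

Having done this we need to re-examine some of the logical steps in the previous sections to ensure they are still valid. In equations 49 to 51, we used the fact that in the fixed-weight setting,

\begin{align}
\sum_{i=1}^{n-1} w_i x_i = W_{n-1}\mu_{n-1} = (W_n - w_n)\mu_{n-1} \tag{72}
\end{align}

In the new setting, this equality is fairly obviously no longer true. (For example, if we are keeping weights normalized then $W_n = W_{n-1} = 1$.) Fortunately there is a different middle step which justifies the outer equality of equation 72 when weights vary, so the results of section 4 remain valid.

\begin{align}
\sum_{i=1}^{n-1} w_{n,i} x_i &= (W_n - w_{n,n})\mu_{n-1} \tag{73} \\
\frac{\sum_{i=1}^{n-1} w_{n,i} x_i}{\sum_{i=1}^{n-1} w_{n,i}} &= \mu_{n-1} \tag{74} \\
\frac{\sum_{i=1}^{n-1} w_{n,i} x_i}{\sum_{i=1}^{n-1} w_{n,i}} &= \frac{\sum_{i=1}^{n-1} w_{n-1,i} x_i}{\sum_{i=1}^{n-1} w_{n-1,i}} \tag{75} \\
\sum_{i=1}^{n-1} \frac{w_{n,i}}{\sum_{j=1}^{n-1} w_{n,j}} x_i &= \sum_{i=1}^{n-1} \frac{w_{n-1,i}}{\sum_{j=1}^{n-1} w_{n-1,j}} x_i \tag{76} \\
\frac{w_{n,j}}{\sum_{i=1}^{n-1} w_{n,i}} &= \frac{w_{n-1,j}}{\sum_{i=1}^{n-1} w_{n-1,i}} \quad \text{where } 1 \le j \le n-1 \tag{77}
\end{align}

This says that for the weighted mean formulae to remain valid the new and old weights should be consistent. Equation 75 says that we get the same result when we calculate the mean of the previous working set whether we use the old weights or the new weights. Equation 77 says that when we normalize the weights across the previous set (up to $n-1$) we get the same set of weights whether we start from the old weights or the new ones. This requirement isn't enough by itself to make the weighted variance formulae work, so we will examine it again below.

7 The expectation function

At this point it is worth defining some better notation to reduce the number of summations we need to write. The expectation function is a generalized version of the mean, whose argument is some arbitrary function of each sample.

\begin{align}
E_n(f(x)) &= \frac{1}{W_n} \sum_{i=1}^{n} w_{n,i} f(x_i) \tag{78} \\
E_n(k) &= k \tag{79} \\
E_n(a f(x)) &= a E_n(f(x)) \tag{80} \\
E_n(f(x) + g(x)) &= E_n(f(x)) + E_n(g(x)) \tag{81} \\
\mu_n &= E_n(x) \tag{82} \\
\sigma_n^2 &= E_n\left( (x - \mu_n)^2 \right) \tag{83} \\
&= E_n\left( x^2 + \mu_n^2 - 2\mu_n x \right) \tag{84} \\
&= E_n(x^2) + \mu_n^2 - 2\mu_n E_n(x) \tag{85}
\end{align}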

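\begin{align}
\sigma_n^2 &= E_n(x^2) - \mu_n^2 \tag{86} \\
&= E_n(x^2) - E_n(x)^2 \tag{87}
\end{align}

The incremental formula is derived in the usual way. Equation 92 is particularly useful.

\begin{align}
E_n(f(x)) &= \frac{1}{W_n} \sum_{i=1}^{n} w_{n,i} f(x_i) \tag{88} \\
&= \frac{w_{n,n}}{W_n} f(x_n) + \frac{1}{W_n} \sum_{i=1}^{n-1} w_{n,i} f(x_i) \tag{89} \\
&= \frac{w_{n,n}}{W_n} f(x_n) + \frac{W_n - w_{n,n}}{W_n} \cdot \frac{\sum_{i=1}^{n-1} w_{n,i} f(x_i)}{\sum_{i=1}^{n-1} w_{n,i}} \tag{90} \\
&= \frac{w_{n,n}}{W_n} f(x_n) + \frac{W_n - w_{n,n}}{W_n} \cdot \frac{\sum_{i=1}^{n-1} w_{n-1,i} f(x_i)}{\sum_{i=1}^{n-1} w_{n-1,i}} \tag{91} \\
&= \frac{w_{n,n}}{W_n} f(x_n) + \frac{W_n - w_{n,n}}{W_n} E_{n-1}(f(x)) \tag{92} \\
E_n(f(x)) &= E_{n-1}(f(x)) + \frac{w_{n,n}}{W_n} \left( f(x_n) - E_{n-1}(f(x)) \right) \tag{93}
\end{align}

8 Variable-weight variance

In equations 61 to 63 we made the following assumptions, which are not true when weights can vary.

\begin{align*}
\sum_{i=1}^{n} w_{n,i} x_i^2 - W_n\mu_n^2 - \sum_{i=1}^{n-1} w_{n-1,i} x_i^2 + W_{n-1}\mu_{n-1}^2 &= w_{n,n} x_n^2 - W_n\mu_n^2 + W_{n-1}\mu_{n-1}^2 \\
&= w_{n,n} x_n^2 - W_n\mu_n^2 + (W_n - w_{n,n})\mu_{n-1}^2
\end{align*}

If we try to re-do the short derivation of the incremental standard deviation formula starting from $S_n - S_{n-1}$ then we soon get stuck. Fortunately the longer derivation shows how to make it work.

\begin{align}
S_n &= W_n \sigma_n^2 \tag{94} \\
&= W_n E_n\left( (x - \mu_n)^2 \right) \tag{95} \\
&= W_n E_n\left( ([x - \mu_{n-1}] - [\mu_n - \mu_{n-1}])^2 \right) \tag{96} \\
&= W_n E_n\left( [x - \mu_{n-1}]^2 + [\mu_n - \mu_{n-1}]^2 - 2[x - \mu_{n-1}][\mu_n - \mu_{n-1}] \right) \tag{97} \\
&= W_n E_n\left( [x - \mu_{n-1}]^2 \right) + W_n E_n\left( [\mu_n - \mu_{n-1}]^2 \right) - 2 W_n E_n\left( [x - \mu_{n-1}][\mu_n - \mu_{n-1}] \right) \tag{98}
\end{align}

Simplify the first term:

\begin{align}
W_n E_n\left( [x - \mu_{n-1}]^2 \right) &= w_{n,n}[x_n - \mu_{n-1}]^2 + (W_n - w_{n,n}) E_{n-1}\left( [x - \mu_{n-1}]^2 \right) \tag{99} \\
&= w_{n,n}[x_n - \mu_{n-1}]^2 + (W_n - w_{n,n}) \frac{S_{n-1}}{W_{n-1}} \tag{100} \\
&= \frac{W_n - w_{n,n}}{W_{n-1}} S_{n-1} + \frac{W_n^2}{w_{n,n}} [\mu_n - \mu_{n-1}]^2 \tag{101}
\end{align}

Simplify the second term:

\begin{align}
W_n E_n\left( [\mu_n - \mu_{n-1}]^2 \right) &= W_n [\mu_n - \mu_{n-1}]^2 \tag{102}
\end{align}

Simplify the third term:

\begin{align}
W_n E_n\left( [x - \mu_{n-1}][\mu_n - \mu_{n-1}] \right) &= [\mu_n - \mu_{n-1}] \, W_n E_n\left( [x - \mu_{n-1}] \right) \tag{103}
\end{align}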

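\begin{align}
W_n E_n\left( [x - \mu_{n-1}][\mu_n - \mu_{n-1}] \right) &= [\mu_n - \mu_{n-1}] \left( w_{n,n}[x_n - \mu_{n-1}] + (W_n - w_{n,n}) E_{n-1}([x - \mu_{n-1}]) \right) \tag{104} \\
&= [\mu_n - \mu_{n-1}] \left( w_{n,n}[x_n - \mu_{n-1}] + (W_n - w_{n,n}) [E_{n-1}(x) - E_{n-1}(\mu_{n-1})] \right) \tag{105} \\
&= [\mu_n - \mu_{n-1}] \left( w_{n,n}[x_n - \mu_{n-1}] + (W_n - w_{n,n}) [\mu_{n-1} - \mu_{n-1}] \right) \tag{106} \\
&= [\mu_n - \mu_{n-1}] \, w_{n,n}[x_n - \mu_{n-1}] \tag{107} \\
&= W_n [\mu_n - \mu_{n-1}]^2 \tag{108}
\end{align}

Back to the complete formula:

\begin{align}
S_n &= \frac{W_n - w_{n,n}}{W_{n-1}} S_{n-1} + \frac{W_n^2}{w_{n,n}} [\mu_n - \mu_{n-1}]^2 + W_n[\mu_n - \mu_{n-1}]^2 - 2W_n[\mu_n - \mu_{n-1}]^2 \tag{109} \\
&= \frac{W_n - w_{n,n}}{W_{n-1}} S_{n-1} + \frac{W_n^2}{w_{n,n}} [\mu_n - \mu_{n-1}]^2 - W_n[\mu_n - \mu_{n-1}]^2 \tag{110} \\
&= \frac{W_n - w_{n,n}}{W_{n-1}} S_{n-1} + \frac{W_n}{w_{n,n}} (W_n - w_{n,n}) [\mu_n - \mu_{n-1}]^2 \tag{111} \\
&= \frac{W_n - w_{n,n}}{W_{n-1}} S_{n-1} + (W_n - w_{n,n}) [\mu_n - \mu_{n-1}] (x_n - \mu_{n-1}) \tag{112} \\
S_n &= \frac{W_n - w_{n,n}}{W_{n-1}} S_{n-1} + w_{n,n} (x_n - \mu_n)(x_n - \mu_{n-1}) \tag{113}
\end{align}

This is the same as equation 68, except for the multiplier $\frac{W_n - w_{n,n}}{W_{n-1}}$, which captures the change in weights between the old and new sets.

\begin{align}
\frac{W_n - w_{n,n}}{W_{n-1}} = \frac{\sum_{i=1}^{n-1} w_{n,i}}{\sum_{i=1}^{n-1} w_{n-1,i}} = \frac{w_{n,j}}{w_{n-1,j}} \quad \text{where } 1 \le j \le n-1 \tag{114}
\end{align}

Now that we know the rescaling trick which makes it work, we can write down the short version.

\begin{align}
S_n - \frac{W_n - w_{n,n}}{W_{n-1}} S_{n-1} &= W_n \left( E_n(x^2) - \mu_n^2 \right) - \frac{W_n - w_{n,n}}{W_{n-1}} W_{n-1} \left( E_{n-1}(x^2) - \mu_{n-1}^2 \right) \tag{115} \\
&= W_n E_n(x^2) - W_n\mu_n^2 - (W_n - w_{n,n}) E_{n-1}(x^2) + (W_n - w_{n,n})\mu_{n-1}^2 \tag{116} \\
&= w_{n,n} x_n^2 - W_n\mu_n^2 + (W_n - w_{n,n})\mu_{n-1}^2 \tag{117} \\
&= w_{n,n}(x_n^2 - \mu_{n-1}^2) + W_n(\mu_{n-1}^2 - \mu_n^2) \tag{118} \\
&= w_{n,n}(x_n^2 - \mu_{n-1}^2) + W_n(\mu_{n-1} - \mu_n)(\mu_{n-1} + \mu_n) \tag{119} \\
&= w_{n,n} \left( x_n^2 - \mu_{n-1}^2 + (\mu_{n-1} - x_n)(\mu_{n-1} + \mu_n) \right) \tag{120} \\
&= w_{n,n}(x_n - \mu_n)(x_n - \mu_{n-1}) \tag{121}
\end{align}

9 Exponentially-weighted mean and variance

Starting from equation 53, let's set $w_{n,n}/W_n$ to a constant $0 < \alpha < 1$ and let $a = 1 - \alpha$. This produces the standard formula for the exponentially weighted moving average.

\begin{align}
\mu_n &= \mu_{n-1} + \alpha(x_n - \mu_{n-1}) \tag{122} \\
&= (1 - \alpha)\mu_{n-1} + \alpha x_n \tag{123} \\
&= a\mu_{n-1} + (1 - a) x_n \tag{124}
\end{align}

In the following it's more convenient to use a lower bound of 0 instead of 1, i.e. $0 \le i \le n$. We are going to show that the weights are renormalized each time a datum is added. First, we expand out the inductive definition of the mean.

\begin{align}
\mu_n &= a\mu_{n-1} + (1 - a) x_n \tag{125} \\
&= a^2\mu_{n-2} + a(1 - a) x_{n-1} + (1 - a) x_n \tag{126}
\end{align}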

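\begin{align}
\mu_n &= a^3\mu_{n-3} + a^2(1 - a) x_{n-2} + a(1 - a) x_{n-1} + (1 - a) x_n \tag{127} \\
\mu_n &= a^n x_0 + \sum_{i=1}^{n} a^{n-i}(1 - a) x_i \tag{128}
\end{align}

This allows us to write down the weights directly. Note that $w_{n,n}$ is independent of $n$.

\begin{align}
w_{n,0} &= a^n \tag{129} \\
w_{n,i} &= a^{n-i}(1 - a), \quad \text{for } 1 \le i \le n \tag{130} \\
w_{n,n} &= 1 - a = \alpha \tag{131}
\end{align}

Since $w_{n,n} = \alpha = w_{n,n}/W_n$ we can see that $\forall n.\ W_n = 1$, that is, the weights are always normalized. We can get the same result by summing the geometric series.

\begin{align}
\sum_{i=1}^{n} a^{n-i} &= \sum_{j=0}^{n-1} a^j = \frac{1 - a^n}{1 - a} \tag{132} \\
\sum_{i=1}^{n} w_{n,i} &= \sum_{i=1}^{n} a^{n-i}(1 - a) = 1 - a^n \tag{133} \\
W_n &= w_{n,0} + \sum_{i=1}^{n} w_{n,i} = a^n + (1 - a^n) = 1 \tag{134}
\end{align}

These weights satisfy the consistency requirement because

\begin{align}
\frac{w_{n,j}}{\sum_{i=0}^{n-1} w_{n,i}} = \frac{a w_{n-1,j}}{\sum_{i=0}^{n-1} a w_{n-1,i}} = \frac{w_{n-1,j}}{\sum_{i=0}^{n-1} w_{n-1,i}} \tag{135}
\end{align}

We can use the expectation function to write down the naïve formula for the variance.

\begin{align}
E_n(f(x)) &= E_{n-1}(f(x)) + \frac{w_{n,n}}{W_n} \left( f(x_n) - E_{n-1}(f(x)) \right) \tag{136} \\
&= E_{n-1}(f(x)) + \alpha \left( f(x_n) - E_{n-1}(f(x)) \right) \tag{137} \\
E_n(x^2) &= E_{n-1}(x^2) + \alpha \left( x_n^2 - E_{n-1}(x^2) \right) \tag{138} \\
\sigma_n^2 &= E_n(x^2) - \mu_n^2 \tag{139}
\end{align}

So using the formula from the previous section we can write the incremental version:

\begin{align}
S_n &= \frac{W_n - w_{n,n}}{W_{n-1}} S_{n-1} + w_{n,n}(x_n - \mu_n)(x_n - \mu_{n-1}) \tag{140} \\
S_n &= (1 - \alpha) S_{n-1} + \alpha(x_n - \mu_n)(x_n - \mu_{n-1}) \tag{141} \\
\sigma_n^2 = \frac{S_n}{W_n} = S_n &= a S_{n-1} + (1 - a)(x_n - \mu_n)(x_n - \mu_{n-1}) \tag{142} \\
&= (1 - \alpha) \left( S_{n-1} + \alpha(x_n - \mu_{n-1})^2 \right) \tag{143}
\end{align}

This latter form is slightly more convenient for code:

    diff := x - mean
    incr := alpha * diff
    mean := mean + incr
    variance := (1 - alpha) * (variance + diff * incr)

References

[1] Donald E. Knuth. Seminumerical Algorithms, volume 2 of The Art of Computer Programming, chapter 4.2.2, page 232. Addison-Wesley, Boston, third edition, 1998.

[2] Eric W. Weisstein. Sample variance computation. From Mathworld, a Wolfram web resource, http://mathworld.wolfram.com/SampleVarianceComputation.html.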
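As a closing illustration, the fixed-weight updates of sections 4 and 5 (equations 53 and 68) and the exponentially weighted update of section 9 (equations 122 and 143) can be sketched as two small accumulators. The class names `WeightedStats` and `EwmStats`, their methods, and the choice to seed the exponentially weighted mean with the first sample (so that μ_0 = x_0, as in equation 128) are assumptions made for this sketch, not something the notes prescribe.

```python
import math


class WeightedStats:
    """Hypothetical accumulator for fixed weights (equations 53 and 68)."""

    def __init__(self):
        self.w_sum = 0.0  # W_n, the running sum of weights
        self.mean = 0.0   # mu_n
        self.s = 0.0      # S_n = W_n * sigma_n^2

    def add(self, x, w):
        if w <= 0:
            raise ValueError("weights must be positive")
        self.w_sum += w
        delta = x - self.mean                   # x_n - mu_{n-1}
        self.mean += (w / self.w_sum) * delta   # equation 53
        self.s += w * delta * (x - self.mean)   # equation 68

    @property
    def variance(self):
        return self.s / self.w_sum              # equation 69, squared


class EwmStats:
    """Hypothetical exponentially weighted accumulator (equations 122, 143)."""

    def __init__(self, alpha, first):
        if not 0.0 < alpha < 1.0:
            raise ValueError("alpha must satisfy 0 < alpha < 1")
        self.alpha = alpha
        self.mean = first    # assumption: mu_0 = x_0
        self.variance = 0.0  # S_0 = 0 for a single sample

    def add(self, x):
        diff = x - self.mean
        incr = self.alpha * diff
        self.mean += incr                                                # equation 122
        self.variance = (1 - self.alpha) * (self.variance + diff * incr)  # equation 143


if __name__ == "__main__":
    ws = WeightedStats()
    for x, w in [(1.0, 1.0), (2.0, 2.0), (3.0, 1.0)]:
        ws.add(x, w)
    print(ws.mean, math.sqrt(ws.variance))
```

Both accumulators keep only a constant amount of state per stream; the exponentially weighted one does not even need the sum of weights, because the weights stay normalized.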