The goal: to measure (determine) an unknown quantity $x$ (the value of a RV $X$).

Realisation: $n$ results $y_1, y_2, \ldots, y_j, \ldots, y_n$ (the measured values of $Y_1, Y_2, \ldots, Y_j, \ldots, Y_n$); every result is encumbered with an error $\varepsilon_j$:
$$y_j = x + \varepsilon_j; \qquad j = 1, 2, \ldots, n$$
The fundamental assumption: the errors are normally distributed with expected value equal to zero and with some standard deviation $\sigma$:
$$\varepsilon_j \sim N(0, \sigma); \qquad E(\varepsilon_j) = 0; \qquad E(\varepsilon_j^2) = \sigma^2$$
If so, the probability of obtaining a result in the range $(y_j, y_j + dy_j)$ equals:
$$dp_j \equiv dp\,\bigl(Y_j \in (y_j, y_j + dy_j)\bigr) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{(y_j - x)^2}{2\sigma^2}\right] dy_j$$
Realisation, cntd.

The likelihood function $L$ is then:
$$L = \prod_{j=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{(y_j - x)^2}{2\sigma^2}\right] dy_j$$
and its logarithm $\ell$ is:
$$\ell = -\frac{1}{2\sigma^2} \sum_{j=1}^{n} (y_j - x)^2 + \text{a constant}$$
The demand $\ell = \text{maximum}$ is therefore equivalent to:
$$\sum_{j=1}^{n} (y_j - x)^2 = \text{minimum} \qquad \Longleftrightarrow \qquad \sum_{j=1}^{n} \varepsilon_j^2 = \text{minimum}$$
The sum of the squares of the errors should be as small as possible if the determination of $\hat{x}$ is to be the most plausible.
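The equivalence stated above can also be seen numerically: minimising the sum of squared errors (i.e. maximising $\ell$) over $x$ reproduces the arithmetic mean. A minimal sketch with invented data (the values of `y` and `sigma` are assumptions of this illustration, not part of the lecture):

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical repeated measurements of one quantity x (illustration only)
y = np.array([10.1, 9.8, 10.3, 10.0, 9.9])
sigma = 0.2   # assumed common standard deviation of the errors

def neg_log_likelihood(x):
    # -l(x) up to an additive constant: sum of squared errors / (2 sigma^2)
    return np.sum((y - x) ** 2) / (2.0 * sigma**2)

x_hat = minimize_scalar(neg_log_likelihood).x
print(x_hat, y.mean())   # the ML estimate coincides with the arithmetic mean
```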
Realisation, cntd.

The ML estimator is the arithmetic mean:
$$\hat{x} = \bar{y} = \frac{1}{n} \sum_{j=1}^{n} y_j, \qquad \sigma^2(\hat{x}) = \frac{\sigma^2}{n}$$
or, if the errors connected with individual measurements are different:
$$\hat{x} = \frac{\sum_{j=1}^{n} w_j y_j}{\sum_{j=1}^{n} w_j}, \qquad w_j = \frac{1}{\sigma_j^2}, \qquad \sigma^2(\hat{x}) = \left[\sum_{j=1}^{n} \frac{1}{\sigma_j^2}\right]^{-1}$$
Now, if $\hat{x}$ is the best (ML) estimator of the RV $X$, then the quantities $\hat{\varepsilon}_j = y_j - \hat{x}$ are the best estimators of the quantities $\varepsilon_j$!
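A minimal sketch of the weighted-mean estimator and its standard deviation, for hypothetical measurements $y_j$ with individual uncertainties $\sigma_j$ (values invented for illustration):

```python
import numpy as np

# Hypothetical measurements and their individual standard deviations
y     = np.array([4.95, 5.10, 5.02])
sigma = np.array([0.05, 0.10, 0.04])

w       = 1.0 / sigma**2              # weights w_j = 1 / sigma_j^2
x_hat   = np.sum(w * y) / np.sum(w)   # weighted mean
var_hat = 1.0 / np.sum(w)             # variance of the estimator

print(x_hat, np.sqrt(var_hat))
```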
We may construct a statistic:
$$M = \sum_{j=1}^{n} \left(\frac{\hat{\varepsilon}_j}{\sigma_j}\right)^2 = \sum_{j=1}^{n} \frac{(y_j - \hat{x})^2}{\sigma_j^2} = \sum_{j=1}^{n} (y_j - \hat{x})^2 w_j$$
If the $\varepsilon_j$ have a normal distribution, the RV $M$ should have a $\chi^2$ distribution with $n - 1$ degrees of freedom. This hypothesis may be verified (tested). A positive result of the test supports the data treatment; a negative one calls for an extra analysis of the determination procedure.
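The claim that $M$ follows a $\chi^2$ distribution with $n - 1$ degrees of freedom can be illustrated numerically. A small Monte Carlo sketch (the true value `x_true` and the $\sigma_j$ used here are hypothetical, chosen only for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

x_true = 10.0                              # assumed true value (illustration)
sigma  = np.array([0.5, 0.33, 0.4, 0.4])   # assumed per-measurement std. deviations
w      = 1.0 / sigma**2
n      = len(sigma)

def one_experiment():
    y     = rng.normal(x_true, sigma)       # n measurements with individual errors
    x_hat = np.sum(w * y) / np.sum(w)       # weighted mean
    return np.sum(w * (y - x_hat) ** 2)     # the statistic M

M = np.array([one_experiment() for _ in range(20000)])

print(M.mean())                                   # should be close to n - 1 = 3
print(stats.kstest(M, stats.chi2(df=n - 1).cdf))  # KS test against chi^2(n-1)
```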
An example (from S. Brandt, Data Analysis): measuring the mass of the $K^0$ meson (all data in MeV/$c^2$):
$$m_1 = 498.1,\ \sigma_1 = 0.5; \quad m_2 = 497.44,\ \sigma_2 = 0.33; \quad m_3 = 498.9,\ \sigma_3 = 0.4; \quad m_4 = 497.44,\ \sigma_4 = 0.4$$
$$\hat{x} = \frac{\sum_{j=1}^{4} y_j/\sigma_j^2}{\sum_{j=1}^{4} 1/\sigma_j^2} = 497.9, \qquad \sigma(\hat{x}) = \left[\sum_{j=1}^{4} \frac{1}{\sigma_j^2}\right]^{-1/2} = 0.20$$
$$M = \sum_{j=1}^{4} \frac{(y_j - \hat{x})^2}{\sigma_j^2} = 7.2 < \chi^2_{0.95;3} = 7.82$$
There is no reason for discrediting the above scheme of establishing the value of $x$.
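A quick cross-check of the decision rule (a sketch, not part of the lecture: it takes the quoted value $M = 7.2$ and obtains the $\chi^2$ quantile from scipy instead of a table):

```python
from scipy import stats

M = 7.2                                   # quoted value of the test statistic
critical = stats.chi2.ppf(0.95, df=3)     # chi^2_{0.95;3}, approximately 7.81

# The scheme is not discredited at the 5% level if M stays below the quantile
print(critical, M < critical)
```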
LINEAR REGRESSION

LINEAR REGRESSION is a powerful tool for studying fundamental relationships between two (or more) RVs $Y$ and $X$. The method is based on the method of least squares. Let's discuss the simplest case possible: we have a set of bivariate data, i.e. a set of $(x_i, y_i)$ values, and we presume a linear relationship between the RV $Y$ (dependent variable, response variable) and the explanatory (or regressor, or predictor) variable $X$. Thus we should be able to write:
$$\hat{y}_i = B_0 + B_1 x_i$$
Note: $Y_i$ are RVs and $y_i$ are their measured values; $\hat{y}_i$ are the fitted values, i.e. the values resulting from the above relationship. We assume this relationship to be true, and we are interested in the numerical coefficients of the proposed dependence. We shall find them via an adequate treatment of the measurement data.
LINEAR REGRESSION, cntd.

As for the $x_i$: these are the values of a random variable too, but of a rather different nature. For the sake of simplicity we shall think of $X$ (or the values $x_i$) as a RV that takes on values practically free of any errors (uncertainties). We shall return to this (unrealistic) assumption later on.¹

The errors $\varepsilon_i$ are to be considered as differences between the measured ($y_i$) and the fitted quantities:
$$\varepsilon_i = y_i - \hat{y}_i \equiv y_i - (B_0 + B_1 x_i)$$
As in the former case, we shall try to minimise the sum of the error squares (SSE):
$$Q = \sum_i \varepsilon_i^2;$$
it is not hard to show that this sum may be decomposed into 3 summands:
$$Q = S_{yy}(1 - r^2) + \left(B_1\sqrt{S_{xx}} - r\sqrt{S_{yy}}\right)^2 + n\left(\bar{y} - B_0 - B_1\bar{x}\right)^2.$$

¹ One can imagine a situation where the values of the predictor variable $x_i$ have been carefully prepared prior to measurement, i.e. any errors connected with them are negligible. On the other hand, the $y_i$ values must be measured on-line and their errors should not be disregarded.
LINEAR REGRESSION, cntd.

$$Q = S_{yy}(1 - r^2) + \left(B_1\sqrt{S_{xx}} - r\sqrt{S_{yy}}\right)^2 + n\left(\bar{y} - B_0 - B_1\bar{x}\right)^2.$$
The symbols used are:
$$S_{xx} = \sum_i (x_i - \bar{x})^2 = \sum_i x_i^2 - n\bar{x}^2$$
$$S_{yy} = \sum_i (y_i - \bar{y})^2 = \sum_i y_i^2 - n\bar{y}^2$$
$$S_{xy} = \sum_i (x_i - \bar{x})(y_i - \bar{y}) = \sum_i x_i y_i - n\bar{x}\bar{y}$$
The $\bar{x}$ and $\bar{y}$ are the usual arithmetic means; finally,
$$r = \frac{S_{xy}}{\sqrt{S_{xx}}\,\sqrt{S_{yy}}}$$
is the sample estimator of the correlation coefficient. In order to minimise $Q$ we are free to adjust the values of $B_0$ and $B_1$. It is obvious that $Q$ will be smallest if the following equations are satisfied:
LINEAR REGRESSION, cntd.

$$Q = S_{yy}(1 - r^2) + \left(B_1\sqrt{S_{xx}} - r\sqrt{S_{yy}}\right)^2 + n\left(\bar{y} - B_0 - B_1\bar{x}\right)^2.$$
$$B_1\sqrt{S_{xx}} - r\sqrt{S_{yy}} = 0$$
$$\bar{y} - B_0 - B_1\bar{x} = 0.$$
We shall denote the solutions for the values of the $B_0$ (intercept) and $B_1$ (slope) coefficients which minimise the sum of squares in a special way:
$$\hat{\beta}_1 = r\sqrt{\frac{S_{yy}}{S_{xx}}} = \frac{S_{xy}}{S_{xx}}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}$$
With the relation $y = \hat{\beta}_0 + \hat{\beta}_1 x$ the SSE attains its minimum:
$$Q = S_{yy}(1 - r^2).$$
(N.B. since $Q \geq 0$, this may be used to show that $|r|$ must be $\leq 1$.) For $r > 0$ the slope of the straight line is positive, and for $r < 0$ it is negative.
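A short sketch of these formulas on invented bivariate data (the $x$, $y$ values are assumptions of this illustration), cross-checked against numpy's polynomial fit:

```python
import numpy as np

# Hypothetical bivariate data (illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

S_xx = np.sum((x - x.mean()) ** 2)
S_yy = np.sum((y - y.mean()) ** 2)
S_xy = np.sum((x - x.mean()) * (y - y.mean()))
r    = S_xy / np.sqrt(S_xx * S_yy)           # sample correlation coefficient

beta1_hat = S_xy / S_xx                      # slope
beta0_hat = y.mean() - beta1_hat * x.mean()  # intercept

print(beta0_hat, beta1_hat, r)
print(np.polyfit(x, y, 1))                   # cross-check: [slope, intercept]
```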
LINEAR REGRESSION, cntd.

The quantity $r$ (the sample correlation coefficient) gives us a measure of the adequacy of the assumed model (linear dependence). It can easily be shown that the total sum of squares, $SST = \sum_i (y_i - \bar{y})^2$, can be decomposed into a sum of the regression sum of squares, SSR, and the already introduced error sum of squares, SSE:
$$\sum_i (y_i - \bar{y})^2 = \sum_i (\hat{y}_i - \bar{y})^2 + \sum_i (y_i - \hat{y}_i)^2 \qquad \text{or} \qquad SST = SSR + SSE$$
SST is a quantity that constitutes a measure of the total variability of the true observations; SSR is a measure of the variability of the fitted values, and SSE is a measure of the false ("erroneous") variability. We have:
$$1 = \frac{SSR}{SST} + \frac{SSE}{SST}$$
but:
$$SSR = SST - SSE = SST - SST(1 - r^2) = r^2\, SST.$$
Thus the above unity is a sum of two terms: the first of them is the square of the sample correlation coefficient, $r^2$, and it is sometimes called the coefficient of determination. The closer $r^2$ is to 1, the better the (linear) model.
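The decomposition and the coefficient of determination can be verified on the same invented data as in the previous sketch:

```python
import numpy as np

# Same hypothetical data as in the previous sketch
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

slope, intercept = np.polyfit(x, y, 1)
y_fit = intercept + slope * x             # fitted values y_hat_i

SST = np.sum((y - y.mean()) ** 2)
SSR = np.sum((y_fit - y.mean()) ** 2)
SSE = np.sum((y - y_fit) ** 2)

print(np.isclose(SST, SSR + SSE))         # decomposition SST = SSR + SSE
print(SSR / SST)                          # coefficient of determination r^2
```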
LINEAR REGRESSION, cntd.

Up to now nothing has been said about the random nature of the fitted coefficients $B_0$, $B_1$. We tacitly assumed them to be some real numbers, coefficients in an equation. But in practice we calculate them from formulae that contain values of some RVs. Conclusion: $B_0$, $B_1$ should also be perceived as RVs, in the sense that their determination will be accomplished with some margins of error. The linear relationship should be written in the form:
$$Y_i = B_0 + B_1 X_i + \varepsilon_i, \qquad i = 1, \ldots, n$$
or perhaps²
$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \qquad i = 1, \ldots, n$$
where the $\varepsilon_i$ are errors, i.e. all possible factors other than the $X$ variable that can produce changes in $Y$. These errors are normally distributed with $E(\varepsilon_i) = 0$ and $VAR(\varepsilon_i) = \sigma^2$. From the above relation we have: $E(Y_i) = \beta_0 + \beta_1 x_i$ and $VAR(Y_i) = \sigma^2$ (remember: any errors on $x_i$ are to be neglected). The simple linear regression model has three unknown parameters: $\beta_0$, $\beta_1$ and $\sigma^2$.

² This change of notation reflects the change of our attitude towards the fitted coefficients; we should think about them as RVs.
LINEAR REGRESSION, cntd.

The method of least squares allows us to find the numerical values of the beta coefficients; these are the ML estimators and they should be perceived as the expected values:
$$E(\beta_1) = \hat{\beta}_1 = r\sqrt{\frac{S_{yy}}{S_{xx}}} = \frac{S_{xy}}{S_{xx}}, \qquad E(\beta_0) = \hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}$$
As for the variances we have:
$$VAR(\beta_1) = \frac{\sigma^2}{S_{xx}}, \qquad VAR(\beta_0) = \sigma^2\left(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right)$$
Verification (the manipulations marked $\overset{?}{=}$ should be carefully justified; for $\beta_0$ the verification can be done in a similar manner):
$$E(\beta_1) = E\left\{\frac{S_{xY}}{S_{xx}}\right\} = \frac{1}{S_{xx}}\, E\left\{\sum_i (x_i - \bar{x})(Y_i - \bar{Y})\right\} \overset{?}{=} \frac{1}{S_{xx}}\, E\left\{\sum_i (x_i - \bar{x})\,Y_i\right\}$$
$$= \frac{1}{S_{xx}} \sum_i (x_i - \bar{x})(\beta_0 + \beta_1 x_i) = \frac{1}{S_{xx}}\left\{\beta_0 \sum_i (x_i - \bar{x}) + \beta_1 \sum_i (x_i - \bar{x})\,x_i\right\} \overset{?}{=} \frac{\beta_1 S_{xx}}{S_{xx}} = \beta_1$$
$$VAR(\beta_1) = VAR\left\{\frac{S_{xY}}{S_{xx}}\right\} = VAR\left\{\frac{\sum_i (x_i - \bar{x})(Y_i - \bar{Y})}{S_{xx}}\right\} \overset{?}{=} \frac{1}{S_{xx}^2} \sum_i (x_i - \bar{x})^2\, VAR(Y_i) = \frac{S_{xx}}{S_{xx}^2}\,\sigma^2 = \frac{\sigma^2}{S_{xx}}$$
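A Monte Carlo sketch (with assumed true $\beta_0$, $\beta_1$, $\sigma$ and an invented design $x$, all chosen only for illustration) showing that the sample slope is unbiased and has variance $\sigma^2/S_{xx}$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical design points and true parameters (illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
beta0, beta1, sigma = 1.0, 2.0, 0.5
S_xx = np.sum((x - x.mean()) ** 2)

def fit_slope():
    y = beta0 + beta1 * x + rng.normal(0.0, sigma, size=x.size)
    return np.sum((x - x.mean()) * (y - y.mean())) / S_xx

slopes = np.array([fit_slope() for _ in range(50000)])

print(slopes.mean(), beta1)              # E(beta1_hat) is close to beta1
print(slopes.var(), sigma**2 / S_xx)     # VAR(beta1_hat) is close to sigma^2 / S_xx
```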
Verification, cntd.

The third parameter of the simple linear regression model is $\sigma^2$. It may be shown that the statistic
$$s^2 = \frac{SSE}{n-2} = \frac{\sum_i (y_i - \hat{y}_i)^2}{n-2}$$
is an unbiased estimator of $\sigma^2$. (The $n - 2$ in the denominator reflects the fact that the data are used for determining two coefficients.) The RV $(n-2)s^2/\sigma^2$ has a chi-square distribution with $n - 2$ degrees of freedom. Replacing $\sigma^2$ in the formulae for the variances of the beta coefficients by its sample estimator $s^2$, we conclude that these coefficients can be regarded as two standardised random variables:
$$\frac{\hat{\beta}_1 - \beta_1}{s/\sqrt{S_{xx}}} \qquad \text{and} \qquad \frac{\hat{\beta}_0 - \beta_0}{s\sqrt{\dfrac{1}{n} + \dfrac{\bar{x}^2}{S_{xx}}}}.$$
Verification, cntd.

The $\hat{}$-values are the ML estimators and the denominators are the estimated standard errors of our coefficients. Both standardised variables have a Student's distribution with $n - 2$ degrees of freedom. Their confidence intervals can be determined in the usual way.
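A minimal sketch of such confidence intervals on invented data, using the Student's t quantile with $n - 2$ degrees of freedom (all numerical values below are assumptions of this illustration):

```python
import numpy as np
from scipy import stats

# Hypothetical bivariate data (illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.9, 5.1, 6.8, 9.2, 10.8, 13.1])
n = len(x)

S_xx  = np.sum((x - x.mean()) ** 2)
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / S_xx   # slope estimate
beta0 = y.mean() - beta1 * x.mean()                      # intercept estimate

y_fit = beta0 + beta1 * x
s2    = np.sum((y - y_fit) ** 2) / (n - 2)    # unbiased estimate of sigma^2

se_beta1 = np.sqrt(s2 / S_xx)
se_beta0 = np.sqrt(s2 * (1.0 / n + x.mean() ** 2 / S_xx))

t = stats.t.ppf(0.975, df=n - 2)                    # 95% two-sided t quantile
print(beta1 - t * se_beta1, beta1 + t * se_beta1)   # CI for the slope
print(beta0 - t * se_beta0, beta0 + t * se_beta0)   # CI for the intercept
```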