19 Another Look at Differentiability in Quadratic Mean


David Pollard (Yale University)

ABSTRACT This note revisits the delightfully subtle interconnections between three ideas: differentiability, in an $L^2$ sense, of the square-root of a probability density; local asymptotic normality; and contiguity.

1 A mystery

The traditional regularity conditions for maximum likelihood theory involve existence of two or three derivatives of the density functions, together with domination assumptions to justify differentiation under integral signs. Le Cam (1970) noted that such conditions are unnecessarily stringent. He commented:

"Even if one is not interested in the maximum economy of assumptions one cannot escape practical statistical problems in which apparently slight violations of the assumptions occur. For instance the derivatives fail to exist at one point $x$ which may depend on $\theta$, or the distributions may not be mutually absolutely continuous or a variety of other difficulties may occur. The existing literature is rather unclear about what may happen in these circumstances. Note also that since the conditions are imposed upon probability densities they may be satisfied for one choice of such densities but not for certain other choices."

Probably Le Cam had in mind examples such as the double exponential density, $\tfrac12\exp(-|x-\theta|)$, for which differentiability fails at the point $\theta = x$. He showed that the traditional conditions can be replaced by a simpler assumption of differentiability in quadratic mean (DQM): differentiability in norm of the square root of the density as an element of an $L^2$ space. Much asymptotic theory can be made to work under DQM. In particular, as Le Cam showed, it implies a quadratic approximation property for the log-likelihoods known as local asymptotic normality (LAN).

Le Cam's idea is simple but subtle. When I first encountered the LAN property I wrongly dismissed it as nothing more than a Taylor expansion to quadratic terms of the log-likelihood. Le Cam's DQM result showed otherwise: one appears to get the benefit of the quadratic expansion without paying the twice-differentiability price usually demanded by such a Taylor expansion. How can that happen?

My initial puzzlement was not completely allayed by a study of several careful accounts of LAN, such as those of Le Cam (1970; 1986, Section 17.3), Ibragimov & Has'minskii (1981, page 114), Millar (1983, page 105), Le Cam & Yang (1990, page 101), or Strasser (1985, Chapter 12). None of the proofs left me with the feeling that I really understood why second derivatives are not needed. (No criticism of those authors intended, of course.) Eventually it dawned on me that I had overlooked a vital ingredient in the proofs: the square root of a density is not just an element of an $L^2$ space: it is an element with norm 1. By rearranging some of the standard arguments I hope to convince the gentle reader of this note that the fixed norm is the real reason why an assumption of one-times differentiability (in quadratic mean) can convey the benefits usually associated with two-times differentiability. I claim that the Lemma in the next Section is the key to understanding the role of DQM.

2 A lemma

The concept of differentiability makes sense for maps into an arbitrary normed space $(\mathcal{L}, \|\cdot\|)$. For the purposes of my exposition, it suffices to consider the case where the norm is generated by an inner product, $\langle\cdot,\cdot\rangle$. In fact, $\mathcal{L}$ will be $L^2(\lambda)$, the space of functions square-integrable with respect to some measure $\lambda$, but that simplification will play no role for the moment.

A map $\xi$ from $\mathbb{R}^k$ into $\mathcal{L}$ is said to be differentiable at a point $\theta_0$ with derivative $\Delta$, if $\xi(\theta) = \xi(\theta_0) + (\theta-\theta_0)'\Delta + r(\theta)$ near $\theta_0$, where $\|r(\theta)\| = o(|\theta-\theta_0|)$ as $\theta$ tends to $\theta_0$. The derivative $\Delta$ is linear; it may be identified with a $k$-vector of elements from $\mathcal{L}$. For a differentiable map, the Cauchy-Schwarz inequality implies that $\langle\xi(\theta_0), r(\theta)\rangle = o(|\theta-\theta_0|)$.
It would usually be a blunder to assume naively that the bound must therefore be of order $O(|\theta-\theta_0|^2)$; typically, higher-order differentiability assumptions are needed to derive approximations with smaller errors. However, if $\|\xi(\theta)\|$ is constant (that is, if the function is constrained to take values lying on the surface of a sphere) then the naive assumption turns out to be no blunder. Indeed, in that case, $\langle\xi(\theta_0), r(\theta)\rangle$ can be written as a quadratic in $\theta-\theta_0$ plus an error of order $o(|\theta-\theta_0|^2)$. The sequential form of the assertion is more convenient for my purposes.

(1) Lemma. Let $\{\delta_n\}$ be a sequence of constants tending to zero. Let $\xi_0, \xi_1, \dots$ be elements of norm one for which $\xi_n = \xi_0 + \delta_n W + r_n$, with $W$ a fixed element of $\mathcal{L}$ and $\|r_n\| = o(\delta_n)$. Then $\langle\xi_0, W\rangle = 0$ and $\langle\xi_0, r_n\rangle = -\tfrac12\delta_n^2\|W\|^2 + o(\delta_n^2)$.
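As an illustration only, the conclusion of Lemma 1 can be checked numerically in a finite-dimensional inner product space. The sketch below assumes $\mathcal{L} = \mathbb{R}^3$ with the Euclidean inner product, takes $W$ orthogonal to $\xi_0$ (as the lemma says it must be), and moves along a great circle of the unit sphere so that every $\xi_n$ has norm one. The function names `inner`, `unit_path` and `remainder_summary` are hypothetical helpers, not anything from the sources cited here.

```python
import math

def inner(u, v):
    """Euclidean inner product, standing in for <.,.> on L."""
    return sum(a * b for a, b in zip(u, v))

def unit_path(xi0, w, delta):
    """Unit vector cos(delta*c)*xi0 + sin(delta*c)*w/c, where c = ||w||.

    Assumes ||xi0|| = 1 and <xi0, w> = 0, so the result has norm one.
    """
    c = math.sqrt(inner(w, w))
    return [math.cos(delta * c) * a + math.sin(delta * c) * b / c
            for a, b in zip(xi0, w)]

def remainder_summary(xi0, w, delta):
    """Return (<xi0, r>, ||r||) for the remainder r = xi - xi0 - delta*w."""
    xi = unit_path(xi0, w, delta)
    r = [x - a - delta * b for x, a, b in zip(xi, xi0, w)]
    return inner(xi0, r), math.sqrt(inner(r, r))

xi0 = [1.0, 0.0, 0.0]
w = [0.0, 2.0, 1.0]  # orthogonal to xi0, with ||w||^2 = 5

for delta in (1e-1, 1e-2, 1e-3):
    ip, rnorm = remainder_summary(xi0, w, delta)
    # Lemma 1 predicts ip / delta^2 -> -||w||^2 / 2 = -2.5, while rnorm / delta -> 0.
    print(delta, ip / delta**2, rnorm / delta)
```

The remainder itself is only required to be $o(\delta_n)$, yet its inner product with $\xi_0$ settles at the $\delta_n^2$ scale; the printed ratios should approach $-2.5$ and $0$ respectively.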

Proof. Because both $\xi_n$ and $\xi_0$ have unit length,

$$0 = \|\xi_n\|^2 - \|\xi_0\|^2 = \underbrace{2\delta_n\langle\xi_0, W\rangle}_{\text{order } O(\delta_n)} + \underbrace{2\langle\xi_0, r_n\rangle}_{\text{order } o(\delta_n)} + \underbrace{\delta_n^2\|W\|^2}_{\text{order } O(\delta_n^2)} + \underbrace{2\delta_n\langle W, r_n\rangle + \|r_n\|^2}_{\text{order } o(\delta_n^2)}.$$

On the right-hand side I have indicated the order at which the various contributions tend to zero. (The Cauchy-Schwarz inequality delivers the $o(\delta_n)$ and $o(\delta_n^2)$ terms.) The exact zero on the left-hand side leaves the leading $2\delta_n\langle\xi_0, W\rangle$ unhappily exposed as the only $O(\delta_n)$ term. It must be of smaller order, which can happen only if $\langle\xi_0, W\rangle = 0$, leaving $0 = 2\langle\xi_0, r_n\rangle + \delta_n^2\|W\|^2 + o(\delta_n^2)$, as asserted.

Without the fixed length property, the inner product $\langle\xi_0, r_n\rangle$, which inherits $o(\delta_n)$ behaviour from $\|r_n\|$, might not decrease at the $O(\delta_n^2)$ rate.

3 A theorem

Let $\{P_\theta : \theta\in\Theta\}$ be a family of probability measures on a space $(\mathcal{X}, \mathcal{A})$, indexed by a subset $\Theta$ of $\mathbb{R}^k$. Suppose $P_\theta$ has density $f(x,\theta)$ with respect to a sigma-finite measure $\lambda$. Under the classical regularity conditions (twice continuous differentiability of $\log f(x,\theta)$ with respect to $\theta$, with a dominated second derivative) the likelihood ratio

$$\prod_{i\le n}\frac{f(x_i,\theta)}{f(x_i,\theta_0)}$$

enjoys the LAN property. Write $L_n(t)$ for the likelihood ratio evaluated at $\theta$ equal to $\theta_0 + t/\sqrt{n}$. The property asserts that, if the $\{x_i\}$ are sampled independently from $P_{\theta_0}$, then

$$L_n(t) = \exp\bigl(t'S_n - \tfrac12 t'\Gamma t + o_p(1)\bigr)\qquad\text{for each } t, \tag{2}$$

where $\Gamma$ is a fixed matrix (depending on $\theta_0$) and $S_n$ has a centered asymptotic normal distribution with variance matrix $\Gamma$.

Formally, the LAN approximation results from the usual pointwise Taylor expansion of the log density $g(x,\theta) = \log f(x,\theta)$, following a style of argument familiar to most graduate students. For example, in one dimension,

$$\log L_n(t) = \sum_{i\le n}\bigl(g(x_i,\theta_0 + t/\sqrt{n}) - g(x_i,\theta_0)\bigr) = \frac{t}{\sqrt{n}}\sum_{i\le n} g'(x_i,\theta_0) + \frac{t^2}{2n}\sum_{i\le n} g''(x_i,\theta_0) + \dots,$$

which suggests that $S_n$ be the standardized score function,

$$\frac{1}{\sqrt{n}}\sum_{i\le n} g'(x_i,\theta_0) \rightsquigarrow N\bigl(0, \operatorname{var}_{\theta_0} g'(x,\theta_0)\bigr),$$

and $\Gamma$ should be the information function,

$$-P_{\theta_0}\, g''(x,\theta_0) = \operatorname{var}_{\theta_0} g'(x,\theta_0).$$

The dual representation for $\Gamma$ allows one to eliminate all mention of second derivatives from the statement of the LAN approximation, which hints that two derivatives might not really be needed, as Le Cam (1970) showed.

In general, the family of densities is said to be differentiable in quadratic mean at $\theta_0$ if the square root $\xi(x,\theta) = \sqrt{f(x,\theta)}$ is differentiable in the $L^2(\lambda)$ sense: for some $k$-vector $\Delta(x)$ of functions in $L^2(\lambda)$,

$$\xi(x,\theta) = \xi(x,\theta_0) + (\theta-\theta_0)'\Delta(x) + r(x,\theta), \tag{3}$$

where $\lambda|r(x,\theta)|^2 = o(|\theta-\theta_0|^2)$ as $\theta\to\theta_0$. Let us abbreviate $\xi(x,\theta_0)$ to $\xi_0(x)$ and $\Delta(x)/\xi_0(x)$ to $D(x)$. From (3) one almost gets the LAN property.

(4) Theorem. Assume the DQM property (3). For each fixed $t$ the likelihood ratio has the approximation, under $\{P_{n,\theta_0}\}$,

$$L_n(t) = \exp\bigl(t'S_n - \tfrac12 t'\Gamma t + o_p(1)\bigr),$$

where

$$S_n = \frac{2}{\sqrt{n}}\sum_{i\le n} D(x_i) \rightsquigarrow N(0, I_0)\qquad\text{and}\qquad \Gamma = \tfrac12 I_0 + \tfrac12 I,$$

with $I_0 = 4\lambda(\Delta\Delta'\{\xi_0 > 0\})$ and $I = 4\lambda(\Delta\Delta')$.

Notice the slight difference between $\Gamma$ and the limiting variance matrix for $S_n$. At least formally, $2D(x)$ equals the derivative of $\log f(x,\theta)$: ignoring problems related to division by zero and distinctions between pointwise and $L^2(\lambda)$ differentiability, we have

$$2D(x) = \frac{2}{\sqrt{f(x,\theta_0)}}\,\frac{\partial}{\partial\theta}\sqrt{f(x,\theta_0)} = \frac{\partial}{\partial\theta}\log f(x,\theta_0).$$

Also, $\Gamma$ again corresponds to the information matrix, expressed in its variance form, except for the intrusion of the indicator function $\{\xi_0 > 0\}$. The extra indicator is necessary if we wish to be careful about $0/0$. Its presence is related to the property called contiguity (another of Le Cam's great ideas) as is explained in Section 5.
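To make definition (3) concrete, here is a rough numerical sketch for the double exponential location family mentioned in Section 1, where pointwise differentiability fails at $\theta = x$. It assumes $\theta_0 = 0$ and the candidate derivative $\Delta(x) = \tfrac12\operatorname{sign}(x)\,\xi_0(x)$, which is my own calculation rather than something stated in the text, and it approximates $\lambda$-integrals by a midpoint rule on $[-40, 40]$. The names `xi`, `delta_fn`, `integrate` and `dqm_ratio` are hypothetical.

```python
import math

def xi(x, theta):
    """Square root of the double exponential density (1/2) exp(-|x - theta|)."""
    return math.sqrt(0.5) * math.exp(-0.5 * abs(x - theta))

def delta_fn(x):
    """Assumed L2 derivative at theta0 = 0: (1/2) * sign(x) * xi(x, 0)."""
    sign = (x > 0) - (x < 0)
    return 0.5 * sign * xi(x, 0.0)

def integrate(g, lo=-40.0, hi=40.0, cells=200_000):
    """Midpoint-rule approximation to the Lebesgue integral of g over [lo, hi]."""
    h = (hi - lo) / cells
    return h * sum(g(lo + (i + 0.5) * h) for i in range(cells))

def dqm_ratio(theta):
    """Approximate lambda|r(x, theta)|^2 / theta^2, which DQM requires to vanish."""
    def r_squared(x):
        r = xi(x, theta) - xi(x, 0.0) - theta * delta_fn(x)
        return r * r
    return integrate(r_squared) / theta**2

for theta in (0.1, 0.01):
    print(theta, dqm_ratio(theta))

# With this Delta, I = 4 * lambda(Delta^2) should be close to 1.
print(4.0 * integrate(lambda x: delta_fn(x) ** 2))
```

If the assumed $\Delta$ is right, the printed ratios should shrink as $\theta$ decreases, even though the density has a kink at $x = \theta$.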

At first sight the derivation of Theorem 4 from assumption (3) again appears to be a simple matter of a Taylor expansion to quadratic terms of the log likelihood ratio. Writing $R_n(x) = r(x,\theta_0 + t/\sqrt{n})/\xi_0(x)$, we have

$$\log L_n(t) = \sum_{i\le n} 2\log\frac{\xi(x_i,\theta_0 + t/\sqrt{n})}{\xi(x_i,\theta_0)} = \sum_{i\le n} 2\log\Bigl(1 + \frac{t'}{\sqrt{n}}D(x_i) + R_n(x_i)\Bigr).$$

From the Taylor expansion of $\log(\cdot)$ about 1, the sum of logarithms can be written as a formal series,

$$2\sum_{i\le n}\Bigl(\frac{t'}{\sqrt{n}}D(x_i) + R_n(x_i)\Bigr) - \sum_{i\le n}\Bigl(\frac{t'}{\sqrt{n}}D(x_i) + R_n(x_i)\Bigr)^2 + \dots$$

$$= \frac{2t'}{\sqrt{n}}\sum_{i\le n} D(x_i) + 2\sum_{i\le n} R_n(x_i) - \frac{1}{n}\sum_{i\le n}\bigl(t'D(x_i)\bigr)^2 + \dots \tag{5}$$

The first sum on the right-hand side gives the $t'S_n$ in Theorem 4. The law of large numbers gives convergence of the third term to $-t'P_{\theta_0}(DD')t$. Mere one-times differentiability might not seem enough to dispose of the second sum. Each summand has standard deviation of order $o(1/\sqrt{n})$, by DQM. A sum of $n$ such terms could crudely be bounded via a triangle inequality, leaving a quantity of order $o(\sqrt{n})$, which clearly would not suffice. In fact the sum of the $R_n(x_i)$ does not go away in the limit; as a consequence of Lemma 1, it contributes a fixed quadratic in $t$. That contribution is the surprise behind DQM.

4 A proof

Let me write $P_n$ to denote calculations under the assumption that the observations $x_1,\dots,x_n$ are sampled independently from $P_{\theta_0}$. The ratio $f(x_i,\theta_0 + t/\sqrt{n})/f(x_i,\theta_0)$ is not well defined when $f(x_i,\theta_0) = 0$, but under $P_n$ the problem can be neglected because $P_n\{f(x_i,\theta_0) = 0 \text{ for at least one } i\le n\} = 0$. For other probability measures that are not absolutely continuous with respect to $P_n$, one should be more careful. It pays to be quite explicit about behaviour when $f(x_i,\theta_0) = 0$ for some $i$, by including an explicit indicator function $\{\xi_0 > 0\}$ as a factor in any expressions with a $\xi_0$ in the denominator.

Define $D_i$ to be the random vector $\Delta(x_i)\{\xi_0(x_i) > 0\}/\xi_0(x_i)$, and, for a fixed $t$, define $R_{i,n} = r(x_i,\theta_0 + t/\sqrt{n})\{\xi_0(x_i) > 0\}/\xi_0(x_i)$. Then

$$\frac{\xi(x_i,\theta_0 + t/\sqrt{n})}{\xi_0(x_i)}\{\xi_0(x_i) > 0\} = 1 + \frac{t'}{\sqrt{n}}D_i + R_{i,n}.$$

The random vector $D_i$ has expected value $\lambda(\xi_0\Delta)$, which, by Lemma 1, is zero, even without the traditional regularity assumptions that justify differentiation under an integral sign. It has variance $\tfrac14 I_0$. It follows by a central limit theorem that

$$S_n = \frac{2}{\sqrt{n}}\sum_{i\le n} D_i \rightsquigarrow N(0, I_0).$$

Also, by a (weak) law of large numbers,

$$\frac{1}{n}\sum_{i\le n} D_iD_i' \to P_{\theta_0}(D_1D_1') = \tfrac14 I_0 \qquad\text{in probability.} \tag{6}$$

To establish rigorously the near-LAN assertion of Theorem 4, it is merely a matter of bounding the error terms in (5) and then justifying the treatment of the sum of the $R_n(x_i)$. Three facts are needed.

(7) Lemma. Under $\{P_n\}$, assuming DQM,
(a) $\max_{i\le n}|D_i| = o_p(\sqrt{n})$,
(b) $\max_{i\le n}|R_{i,n}| = o_p(1)$,
(c) $\sum_{i\le n} 2R_{i,n} \to -\tfrac14 t'It$ in probability.

Let me first explain how Theorem 4 follows from Lemma 7. Together the two facts (a) and (b) ensure that with high probability $\log L_n(t)$ does not involve infinite values. For $(t'D_i/\sqrt{n}) + R_{i,n} > -1$ we may then appeal to the Taylor expansion $\log(1+y) = y - \tfrac12 y^2 + \tfrac12\beta(y)$, where $\beta(y) = o(y^2)$ as $y$ tends to zero, to deduce that $\log L_n(t)$ equals

$$\frac{2t'}{\sqrt{n}}\sum_{i\le n} D_i + 2\sum_{i\le n} R_{i,n} - \sum_{i\le n}\Bigl(\frac{t'D_i}{\sqrt{n}} + R_{i,n}\Bigr)^2 + \sum_{i\le n}\beta\Bigl(\frac{t'D_i}{\sqrt{n}} + R_{i,n}\Bigr),$$

which expands to

$$t'S_n + 2\sum_{i\le n} R_{i,n} - \frac{1}{n}\sum_{i\le n}(t'D_i)^2 - \frac{2}{\sqrt{n}}\sum_{i\le n} t'D_i\,R_{i,n} - \sum_{i\le n} R_{i,n}^2 + o_p(1)\sum_{i\le n}\Bigl(\frac{|D_i|^2}{n} + R_{i,n}^2\Bigr).$$

Each of the last three sums is of order $o_p(1)$ because $\sum_{i\le n}|D_i|^2/n = O_p(1)$ and

$$P_n\sum_{i\le n} R_{i,n}^2 = n\lambda\bigl(\xi_0^2\, r(x,\theta_0 + t/\sqrt{n})^2\{\xi_0 > 0\}/\xi_0^2\bigr) \le n\lambda|r(\cdot,\theta_0 + t/\sqrt{n})|^2 = o(1). \tag{8}$$

By virtue of (6) and (c), the expansion simplifies to

$$t'S_n - \tfrac14 t'It - \tfrac14 t'I_0t + o_p(1),$$

as asserted by Theorem 4.
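Fact (c) is the one where the fixed norm does the work, and its expected-value part can be previewed numerically. The sketch below is an illustration under my own assumptions: the double exponential location family from Section 1 with $\theta_0 = 0$, the candidate derivative $\Delta(x) = \tfrac12\operatorname{sign}(x)\,\xi_0(x)$ (so that $I = 1$), and midpoint-rule integration on $[-40, 40]$. It evaluates $2n\,\lambda\bigl(\xi_0\, r(\cdot, t/\sqrt{n})\bigr)$, the expected value of the sum in (c), which should be near $-\tfrac14 t^2 I$ for large $n$. All function names are hypothetical.

```python
import math

def xi(x, theta):
    """Square root of the double exponential density (1/2) exp(-|x - theta|)."""
    return math.sqrt(0.5) * math.exp(-0.5 * abs(x - theta))

def delta_fn(x):
    """Assumed L2 derivative at theta0 = 0: (1/2) * sign(x) * xi(x, 0)."""
    sign = (x > 0) - (x < 0)
    return 0.5 * sign * xi(x, 0.0)

def integrate(g, lo=-40.0, hi=40.0, cells=200_000):
    """Midpoint-rule approximation to the Lebesgue integral of g over [lo, hi]."""
    h = (hi - lo) / cells
    return h * sum(g(lo + (i + 0.5) * h) for i in range(cells))

def expected_remainder_sum(n, t=1.0):
    """Approximate 2 * n * lambda(xi0 * r(., t / sqrt(n))) for theta0 = 0."""
    theta = t / math.sqrt(n)
    def xi0_times_r(x):
        x0 = xi(x, 0.0)
        return x0 * (xi(x, theta) - x0 - theta * delta_fn(x))
    return 2.0 * n * integrate(xi0_times_r)

for n in (100, 10_000):
    # Lemma 1 with delta_n = 1/sqrt(n) and W = t * Delta suggests a limit of -t^2 * I / 4.
    print(n, expected_remainder_sum(n))
```

With $t = 1$ the printed values should approach $-0.25$ as $n$ grows, rather than vanishing as a crude $o(\cdot)$ bound on each summand might suggest.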

Proof of Lemma 7. Assertion (a) follows from the identical distributions:

$$P_n\{\max_{i\le n}|D_i| > \epsilon\sqrt{n}\} \le \sum_{i\le n} P_n\{|D_i| > \epsilon\sqrt{n}\} = nP_{\theta_0}\{|D_1| > \epsilon\sqrt{n}\} \le \epsilon^{-2}\lambda\,|\Delta|^2\{\xi_0 > 0\}\{|\Delta| > \xi_0\epsilon\sqrt{n}\} \to 0$$

by Dominated Convergence. Assertion (b) follows from (8):

$$P_n\{\max_{i\le n}|R_{i,n}| > \epsilon\} \le \epsilon^{-2}P_n\sum_{i\le n} R_{i,n}^2 \to 0.$$

Only Assertion (c) involves any subtlety. The variance of the sum is bounded by $4\sum_{i\le n} P_n R_{i,n}^2$, which tends to zero by (8). The sum of the remainders must lie within $o_p(1)$ of its expected value, which equals

$$2nP_{\theta_0}R_{1,n} = 2n\lambda\bigl(\xi_0\, r(\cdot,\theta_0 + t/\sqrt{n})\bigr),$$

an inner product between two functions in $L^2(\lambda)$. Notice that the $\xi_0$ factor makes the indicator $\{\xi_0 > 0\}$ redundant. It is here that the unit length property becomes important. Specializing Lemma 1 to the case $\delta_n = 1/\sqrt{n}$, with $\xi_n(x) = \xi(x,\theta_0 + t/\sqrt{n})$ and $W = t'\Delta$, we get the approximation to the sum of expected values of the $R_{i,n}$, namely $2n\lambda(\xi_0 r_n) = -\lambda(t'\Delta)^2 + o(1) = -\tfrac14 t'It + o(1)$, from which Assertion (c) follows.

A slight generalization of the LAN assertion is possible. It is not necessary that we consider only parameters of the form $\theta_0 + t/\sqrt{n}$ for a fixed $t$. By arguing almost as above along convergent subsequences of $\{t_n\}$ we could prove an analog of Theorem 4 if $t$ were replaced by a bounded sequence $\{t_n\}$ such that $\theta_0 + t_n/\sqrt{n}\in\Theta$. The extension is significant because (Le Cam 1986, page 584) the slightly stronger result forces a form of differentiability in quadratic mean.

5 Contiguity and disappearance of mass

For notational simplicity, consider only the one-dimensional case with the typical value $t = 1$. Let $\xi_n^2$ be the marginal density, and $Q_n$ be the joint distribution, for $x_1,\dots,x_n$ sampled with parameter value $\theta_0 + 1/\sqrt{n}$. As before, $\xi_0^2$ and $P_n$ correspond to $\theta_0$. The measure $Q_n$ is absolutely continuous with respect to $P_n$ if and only if it puts zero mass in the set $A_n = \{\xi_0(x_i) = 0 \text{ for at least one } i\le n\}$. Writing $\alpha_n$ for $\lambda\xi_n^2\{\xi_0 = 0\}$, we have

$$Q_nA_n = 1 - \prod_{i\le n}\bigl(1 - Q_n\{\xi_0(x_i) = 0\}\bigr) = 1 - (1-\alpha_n)^n.$$

By direct calculation, with $r_n = r(\cdot,\theta_0 + 1/\sqrt{n})$,

$$\alpha_n = \lambda\bigl(r_n + \Delta/\sqrt{n}\bigr)^2\{\xi_0 = 0\} = \lambda\Delta^2\{\xi_0 = 0\}/n + o(1/n).$$

The quantity $\tau = \lambda\Delta^2\{\xi_0 = 0\}$ has the following significance. Under $Q_n$, the number of observations landing in $\{\xi_0 = 0\}$ has approximately a Poisson($\tau$) distribution; and $Q_nA_n \to 1 - e^{-\tau}$. In some asymptotic sense, the measure $Q_n$ becomes more nearly absolutely continuous with respect to $P_n$ if and only if $\tau = 0$. The precise sense is called contiguity: the sequence of measures $\{Q_n\}$ is said to be contiguous with respect to $\{P_n\}$ if $Q_nB_n\to 0$ for each sequence of sets $\{B_n\}$ such that $P_nB_n\to 0$. Because $P_nA_n = 0$ for every $n$, the condition $\tau = 0$ is clearly necessary for contiguity. It is also sufficient.

Contiguity follows from the assertion that $L$, the limit in distribution under $\{P_n\}$ of the likelihood ratios $\{L_n(1)\}$, has expected value one. ("Le Cam's first lemma": see the theorem on page 20 of Le Cam and Yang, 1990.) The argument is simple: If $PL = 1$ then, to each $\epsilon > 0$ there exists a finite constant $C$ such that $PL\{L < C\} > 1-\epsilon$. From the convergence in distribution, $P_nL_n\{L_n < C\} > 1-\epsilon$ eventually. If $P_nB_n\to 0$ then

$$Q_nB_n \le P_nB_nL_n\{L_n < C\} + Q_n\{L_n\ge C\} \le CP_nB_n + 1 - P_nL_n\{L_n < C\} < 2\epsilon$$

eventually.

For the special case of the limiting $\exp(N(\mu,\sigma^2))$ distribution, where $\mu = -\tfrac14 I_0 - \tfrac14 I$ and $\sigma^2 = I_0$, the requirement becomes

$$1 = P\exp\bigl(N(\mu,\sigma^2)\bigr) = \exp\bigl(\mu + \tfrac12\sigma^2\bigr).$$

That is, contiguity obtains when $I_0 = I$ (or equivalently, $\lambda(\Delta^2\{\xi_0 = 0\}) = 0$), in which case, the limiting variance of $S_n$ equals $\Gamma$. This conclusion plays the same role as the traditional dual representation for the information function. As Le Cam & Yang (1990, page 23) commented, "The equality ... is the classical one. One finds it for instance in the standard treatment of maximum likelihood estimation under Cramér's conditions. There it is derived from conditions of differentiability under the integral sign." The fortuitous equality is nothing more than contiguity in disguise.

From the literature one sometimes gets the impression that $\lambda\Delta^2\{\xi_0 = 0\}$ is always zero. It is not.

(9) Example. Let $\lambda$ be Lebesgue measure on the real line. Define

$$f_0(x) = x\{0\le x\le 1\} + (2-x)\{1 < x\le 2\}.$$

For $0\le\theta\le 1$ define densities

$$f(x,\theta) = (1-\theta^2)f_0(x) + \theta^2f_0(x-2).$$

Notice that

$$\lambda\bigl|\sqrt{f(x,\theta)} - \sqrt{f(x,0)} - \theta\sqrt{f(x,1)}\bigr|^2 = \bigl(\sqrt{1-\theta^2} - 1\bigr)^2 = O(\theta^4). \tag{10}$$
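The densities just defined are simple enough to check numerically. The sketch below is an illustration only: it approximates Lebesgue integrals by a midpoint rule on $[-1, 5]$, compares the left-hand side of (10) with $(\sqrt{1-\theta^2} - 1)^2$, and computes the mass that $f(\cdot,\theta)$ places where $f(\cdot,0)$ vanishes. Taking $\theta = 1/\sqrt{n}$ and plugging that mass into the formula $Q_nA_n = 1 - (1-\alpha_n)^n$ from earlier in this section gives the probability that at least one observation lands outside the support of $f(\cdot,0)$. The function names are hypothetical.

```python
import math

def f0(x):
    """Triangular density on [0, 2] from Example 9."""
    if 0.0 <= x <= 1.0:
        return x
    if 1.0 < x <= 2.0:
        return 2.0 - x
    return 0.0

def f(x, theta):
    """Mixture density (1 - theta^2) f0(x) + theta^2 f0(x - 2), for 0 <= theta <= 1."""
    return (1.0 - theta**2) * f0(x) + theta**2 * f0(x - 2.0)

def integrate(g, lo=-1.0, hi=5.0, cells=60_000):
    """Midpoint-rule approximation to the Lebesgue integral of g over [lo, hi]."""
    h = (hi - lo) / cells
    return h * sum(g(lo + (i + 0.5) * h) for i in range(cells))

def dqm_error(theta):
    """Numerical version of the left-hand side of (10)."""
    def squared_remainder(x):
        r = math.sqrt(f(x, theta)) - math.sqrt(f(x, 0.0)) - theta * math.sqrt(f(x, 1.0))
        return r * r
    return integrate(squared_remainder)

def escaped_mass(theta):
    """Mass that f(., theta) puts on the set where f(., 0) is zero."""
    return integrate(lambda x: f(x, theta) if f(x, 0.0) == 0.0 else 0.0)

def prob_some_observation_escapes(n):
    """1 - (1 - alpha_n)^n with alpha_n the escaped mass at theta = 1/sqrt(n)."""
    alpha = escaped_mass(1.0 / math.sqrt(n))
    return 1.0 - (1.0 - alpha) ** n

print(dqm_error(0.1), (math.sqrt(1.0 - 0.1**2) - 1.0) ** 2)
print(escaped_mass(0.1))
for n in (100, 10_000):
    print(n, prob_some_observation_escapes(n), 1.0 - math.exp(-1.0))
```

The escaped mass equals $\theta^2$ here, so the last column of output is the natural comparison value $1 - e^{-1}$: the mass outside the original support does not disappear at the $1/\sqrt{n}$ scale.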

The family of densities is differentiable in quadratic mean at $\theta = 0$ with derivative $\Delta(x) = \sqrt{f(x,1)}$. For this family, $\lambda\Delta^2\{\xi_0 = 0\} = 1$. The near-LAN assertion of Theorem 4 degenerates: $I_0 = 0$ and $I = 4$, giving $L_n(t)\to\exp(-t^2)$ in probability, under $\{P_{n,\theta_0}\}$. Indeed, as Aad van der Vaart has pointed out to me, the limiting experiment (in Le Cam's sense) for the models $\{P_{n,t/\sqrt{n}} : 0\le t\le\sqrt{n}\}$ is not the Gaussian translation model corresponding to the LAN condition. Instead, the limit experiment is $\{Q_t : t\ge 0\}$, with $Q_t$ equal to the Poisson($t^2$) distribution. That is, for each finite set $T$ and each $h$, under $\{P_{n,h/\sqrt{n}}\}$ the random vectors

$$\Bigl(\frac{dP_{n,t/\sqrt{n}}}{dP_{n,h/\sqrt{n}}} : t\in T\Bigr)$$

converge in distribution to

$$\Bigl(\frac{dQ_t}{dQ_h} : t\in T\Bigr),$$

as a random vector under the $Q_h$ distribution.

The counterexample would not work if $\theta$ were allowed to take on negative values; one would need $\Delta(x) = -\sqrt{f(x,1)}$ to get the analog of (10) for negative $\theta$. The failure of contiguity is directly related to the fact that $\theta = 0$ lies on the boundary of the parameter interval.

In general, $\lambda\Delta\Delta'\{\xi_0 = 0\}$ must be zero at all interior points of the parameter space where DQM holds. On the set $\{\xi_0 = 0\}$ we have $0\le\sqrt{n}\,\xi(x,\theta_0 + t/\sqrt{n}) = t'\Delta + \sqrt{n}\,r_n$, where $\lambda|\sqrt{n}\,r_n|^2\to 0$. Along a subsequence, $\sqrt{n}\,r_n\to 0$ almost everywhere, leaving the conclusion that $t'\Delta\ge 0$ almost everywhere on the set $\{\xi_0 = 0\}$. At an interior point, $t$ can range over all directions, which forces $\Delta = 0$ almost everywhere on $\{\xi_0 = 0\}$; that is, at an interior point, $\Delta\{\xi_0 = 0\} = 0$ almost everywhere. More generally, one needs only to be able to approach $\theta_0$ from enough different directions to force $\Delta = 0$ on $\{\xi_0 = 0\}$, as in the concept of a contingent in Le Cam & Yang (1990, Section 6.2). The assumption that $\theta_0$ lies in the interior of the parameter space is not always easy to spot in the literature.

Some authors, such as Le Cam & Yang (1990, page 101), prefer to dispense with the dominating measure $\lambda$, by recasting differentiability in quadratic mean as a property of the densities $dP_\theta/dP_{\theta_0}$, whose square roots correspond to the ratios $\xi(x,\theta)\{\xi_0 > 0\}/\xi_0(x)$.
With that approach, the behaviour of $\Delta$ on the set $\{\xi_0 = 0\}$ must be specified explicitly. The contiguity requirement, that $P_\theta$ puts at worst mass of order $o(|\theta-\theta_0|^2)$ in the set $\{\xi_0 = 0\}$, is then made part of the definition of differentiability in quadratic mean.

References

Ibragimov, I. A. & Has'minskii, R. Z. (1981), Statistical Estimation: Asymptotic Theory, Springer-Verlag, New York.

Le Cam, L. (1970), On the assumptions used to prove asymptotic normality of maximum likelihood estimators, Annals of Mathematical Statistics 41.

Le Cam, L. (1986), Asymptotic Methods in Statistical Decision Theory, Springer-Verlag, New York.

Le Cam, L. & Yang, G. L. (1990), Asymptotics in Statistics: Some Basic Concepts, Springer-Verlag.

Millar, P. W. (1983), The minimax principle in asymptotic statistical theory, Springer Lecture Notes in Mathematics.

Strasser, H. (1985), Mathematical Theory of Statistics: Statistical Experiments and Asymptotic Decision Theory, De Gruyter, Berlin.