Improving predictive inference under covariate shift by weighting the loglikelihood function


 Kelley Lawrence
 1 years ago
 Views:
Transcription
1 Journal of Statistical Planning and Inference 90 (2000) elsevier.com/locate/jspi Improving predictive inference under covariate shift by eighting the loglikelihood function Hidetoshi Shimodaira The Institute of Statistical Mathematics, MinamiAzabu, Minatoku, Tokyo , Japan Received 17 December 1998; received in revised form 21 January; accepted 25 February 2000 Abstract A class of predictive densities is derived by eighting the observed samples in maximizing the loglikelihood function. This approach is eective in cases such as sample surveys or design of experiments, here the observed covariate follos a dierent distribution than that in the hole population. Under misspecication of the parametric model, the optimal choice of the eight function is asymptotically shon to be the ratio of the density function of the covariate in the population to that in the observations. This is the pseudomaximum likelihood estimation of sample surveys. The optimality is dened by the expected Kullback Leibler loss, and the optimal eight is obtained by considering the importance sampling identity. Under correct specication of the model, hoever, the ordinary maximum likelihood estimate (i.e. the uniform eight) is shon to be optimal asymptotically. For moderate sample size, the situation is in beteen the to extreme cases, and the eight function is selected by minimizing a variant of the information criterion derived as an estimate of the expected loss. The method is also applied to a eighted version of the Bayesian predictive density. Numerical examples as ell as MonteCarlo simulations are shon for polynomial regression. A connection ith the robust parametric estimation is discussed. c 2000 Elsevier Science B.V. All rights reserved. MSC: 62B10; 62D05 Keyords: Akaike information criterion; Design of experiments; Importance sampling; Kullback Leibler divergence; Misspecication; Sample surveys; Weighted least squares 1. Introduction Let x be the explanatory variable or the covariate, and y be the response variable. In predictive inference ith the regression analysis, e are interested in estimating the conditional density q(y x) of y given x, using a parametric model. Let p(y x; ) be the model of the conditional density hich is parameterized by =( 1 ;:::; m ) R m. Correspondence address: Department of Statistics, Sequoia Hall, 390 Serra Mall, Stanford University, Stanford, CA , USA. address: (H. Shimodaira) /00/$  see front matter c 2000 Elsevier Science B.V. All rights reserved. PII: S (00)
2 228 H. Shimodaira / Journal of Statistical Planning and Inference 90 (2000) Having observed i.i.d. samples of size n, denoted by (x (n) ;y (n) )=((x t ;y t ): t =1;:::;n), e obtain a predictive density p(y x; ˆ) by giving an estimate ˆ = ˆ(x (n) ;y (n) ). In this paper, e discuss improvement of the maximum likelihood estimate (MLE) under both (i) covariate shift in distribution and (ii) misspecication of the model as explained belo. Let q 1 (x) be the density of x for evaluation of the predictive performance, hile q 0 (x) be the density of x in the observed data. We consider the Kullback Leibler loss function loss i ():= q i (x) q(y x)log p(y x; )dy dx for i=0; 1, and then employ loss 1 ( ˆ) for evaluation of ˆ, rather than the usual loss 0 ( ˆ). The situation q 0 (x) q 1 (x) ill be called covariate shift in distribution, hich is one of the premises of this paper. This situation is not so odd as it might look at rst. In fact, it is seen in various elds as follos. In sample surveys, q 0 (x) is determined by the sampling scheme, hile q 1 (x) is determined by the population. In regression analysis, covariate shift often happens because of the limitation of resources, or the design of experiments. In articial neural netorks literature, active learning is the typical situation here e control q 0 (x) for better prediction. We could say that the distribution of x in future observations is dierent from that of the past observations; x is not necessarily distributed as q 1 (x) in future, but e can give imaginary q 1 (x) to specify the region of x here the prediction accuracy should be controlled. Note that q 0 (x) and=or q 1 (x) are often estimated from data, but e assume they are knon or estimated reasonably in advance. The second premise of this paper is misspecication of the model. Let ˆ0 be the MLE of, and 0 be the asymptotic limit of ˆ 0 as n. Under certain regularity conditions, MLE is consistent and p(y x; 0 )=q(y x) provided that the model is correctly specied. In practice, hoever, p(y x; 0 ) deviates more or less from q(y x). Under both the covariate shift and the misspecication, MLE does not necessarily provide a good inference. We ill sho that MLE is improved by giving a eight function (x) of the covariate in the loglikelihood function L ( x (n) ;y (n) ):= n l (x t ;y t ); (1.1) t=1 here l (x; y )= (x)log p(y x; ). Then the maximum eighted loglikelihood estimate (MWLE), denoted by ˆ, is obtained by maximizing (1.1) over. It ill be seen that the eight function (x)=q 1 (x)=q 0 (x) is the optimal choice for suciently large n in terms of the expected loss ith respect to q 1 (x). We denote MWLE ith this eight function by ˆ 1. A comparison beteen ˆ 0 and ˆ 1 is made in the numerical example of polynomial regression of Section 2, and the asymptotic optimality of ˆ 1 is shon in Section 3. Note that MWLE turns out to be doneighting the observed samples hich are not important in tting the model ith respect to the population. An interpretation of MWLE as one of the robust estimation techniques is given in Section 9.
3 H. Shimodaira / Journal of Statistical Planning and Inference 90 (2000) This type of estimation is not ne in statistics. Actually, ˆ1 is regarded as a generalization of the pseudomaximum likelihood estimation in sample surveys (Skinner et al., 1989, p. 80; Pfeermann et al., 1998); the log likelihood is eighted inversely proportional to q 0 (x), the probability of selecting unit x, hile q 1 (x) is equal probability for all possible values of x. The same idea is also seen in Rao (1991), here eighted maximum likelihood estimation is considered for unequally spaced timeseries data. The local likelihoods or the eighted likelihoods formally similar to (1.1) are found in the literature for semiparametric inference. Hoever, ˆ is estimated using a eight function concentrated locally around each x or (x; y) in the semiparametric approach; thus ˆ in p(y x; ˆ ) ill depend on (x; y) as ell as the data (x (n) ;y (n) ). On the other hand, e restrict our attention to a rather conventional parametric modeling approach here, and ˆ depends only on the data. In spite of the asymptotic optimality of (x)=q 1 (x)=q 0 (x) mentioned above, another choice of the eight function can improve the expected loss for moderate sample size by compromising the bias and the variance of ˆ. We develop a practical method for this improvement in Sections 4 7. The asymptotic expansion of the expected loss is given in Section 4, and a variant of the information criterion is derived as an estimate of the expected loss in Section 5. This ne criterion is used to nd a good (x) as ell as a good form of p(y x; ). The numerical example is revisited in Section 6, and a simulation study is given in Section 7. In Section 8, e sho the Bayesian predictive density is also improved by considering the eight function. Finally, concluding remarks are given in Section 9. All the proofs are deferred to the appendix. 2. Illustrative example in regression Here e consider the normal regression to predict the response y R using a polynomial function of x R. Let the model p(y x; ) be the polynomial regression y = x + + d x d + ; N(0; 2 ); (2.1) here =( 0 ;:::; d ;) and N(a; b) denotes the normal distribution ith mean a and variance b. In the numerical example belo, e assume the true q(y x) is also given by (2.1) ith d =3: y = x + x 3 + ; N(0; 0:3 2 ): (2.2) The density q 0 (x) of the covariate x is x N( 0 ; 2 0); (2.3) here 0 =0:5, 2 0 =0:52. This corresponds to the sampling scheme of x or the design of experiments. A dataset (x (n) ;y (n) ) of size n = 100 is generated from (2.2) and (2.3), and plotted by circles in Fig. 1a. MLE ˆ 0 is obtained by the ordinary least squares (OLS) for the normal regression; e consider a model of the form (2.1) ith d =1, and the regression line tted by OLS is dran in solid line in Fig. 1a.
4 230 H. Shimodaira / Journal of Statistical Planning and Inference 90 (2000) Fig. 1. Fitting of polynomial regression ith degree d = 1. (a) Samples (x t;y t) of size n = 100 are generated from q(y x)q 0 (x) and plotted as circles, here the underlying true curve is indicated by the thin dotted line. The solid line is obtained by OLS, and the dotted line is WLS ith eight q 1 (x)=q 0 (x). (b) Samples of n = 100 are generated from q(y x)q 1 (x), and the regression line is obtained by OLS. On the other hand, MWLE ˆ is obtained by eighted least squares (WLS) ith eights (x t ) for the normal regression. We again consider the model ith d =1, and the regression line tted by WLS ith (x)=q 1 (x)=q 0 (x) is dran in dotted line in Fig. 1a. Here, the density q 1 (x) for imaginary future observations or that for the hole population in sample surveys is specied in advance by x N( 1 ; 2 1); (2.4) here 1 =0:0; 2 1 =0:32. The ratio of q 1 (x) toq 0 (x) is q 1 (x) q 0 (x) = exp( (x 1) 2 =2 2 1 )= ) 1 (x )2 exp( (x 0 ) 2 =2 2 0 )= exp ( ; (2.5) here 2 =( ) 1 =0:38 2, and = 2 ( )= 0:28. The obtained lines in Fig. 1a are very dierent for OLS and WLS. The question is: hich is better than the other? It is knon that OLS is the best linear unbiased estimate and makes small mean squared error of prediction in terms of q(y x)q 0 (x) hich generated the data. On the other hand, WLS ith eight (2.5) makes small prediction error in terms of q(y x)q 1 (x) hich ill generate future observations, and thus WLS is better than OLS here. To conrm this, a dataset of size n = 100 is generated from q(y x)q 1 (x) specied by (2.2) and (2.4). The regression line of d =1 tted by OLS is shon in Fig. 1b, hich is considered to have small prediction error for the future data. The regression line of WLS tted to the past data in Fig. 1a is quite similar to the line of OLS tted to the future data in Fig. 1b. In practice, only the past data is available. The WLS gave almost the equivalent result to the future OLS by using only the past data.
5 H. Shimodaira / Journal of Statistical Planning and Inference 90 (2000) The underlying true curve is the polynomial ith d = 3, and thus the regression line of d = 1 cannot be tted to it nicely over all the region of x. Hoever, the true curve is almost linear in the region of 1 ± 2 1, and the nice t of the WLS in this region is obtained by throing aay the observed samples hich are outside of this region. The eective sample size may be dened in terms of the entropy by n e =exp( n t=1 p t log p t ), here p t =(x t )= n t =1 (x t ). In the WLS above, n e=49:3, hich is about the half of the original sample size n = 100, and then increases the variance of the WLS. This is discussed later in detail. 3. Asymptotic properties of MWLE Let E i ( ) denote the expectation ith respect to q(y x)q i (x) for i =0; 1. Considering L () as the summation of i.i.d. random variables l (x t ;y t ), it follos from the la of large numbers that L ()=n E 0 (l (x; y )) as n gros to innity. Then e have ˆ in probability as n, here is the minimizer of E 0 (l (x; y )) over. Hereafter, e restrict our attention to proper (x) such that E 0 (l (x; y )) exists for all and that the Hessian of E 0 (l (x; y )) is nonsingular at, hich is uniquely determined and interior to. If the above result is applied to (x) =q 1 (x)=q 0 (x), e nd that ˆ1 converges in probability to the minimizer of loss 1 () over, hich e denote 1. Here the key idea is the importance sampling identity: q1 (x) E 0 log p(y x; ) = q 0 (x) q(y x)q 0 (x) q 1(x) log p(y x; )dx dy q 0 (x) = E 1 (log p(y x; )); (3.1) hich implies E 0 (l (x; y )) loss 1 () and = 1 hen (x)=q 1(x)=q 0 (x). Except for the equivalent eight (x) q 1 (x)=q 0 (x), e have 1 under misspecication in general. From the denition of 1, therefore, loss 1() loss 1 (1 ). This immediately implies the asymptotic optimality of the eight (x)=q 1 (x)=q 0 (x), because ˆ and ˆ 1 1 and thus loss 1( ˆ ) loss 1 ( ˆ 1 ) for suciently large n. ˆ 1 has consistency in a sense that it converges to the optimal parameter value. Hoever, ˆ0 is more ecient than ˆ 1 in terms of the asymptotic variance. This ill be important for moderate sample size, here n is large enough for the asymptotic expansions to be alloed, but not enough for the optimality of ˆ 1 to hold. The folloing lemma, hich is used in the subsequent sections, gives the asymptotic distribution of ˆ. The derivation is parallel to that of MLE under misspecication given in White (1982); e replace log p(y x; )q 0 (x) of MLE ith (x)log p(y x; )q 0 (x) of MWLE. Lemma 1. Assume the regularity conditions similar to those of White (1982); i.e.; the model is suciently smooth and the support of p(y x; ) is the same as that of q(y x) for all. Also assume is an interior point of. Then; n 1=2 ( ˆ )
6 232 H. Shimodaira / Journal of Statistical Planning and Inference 90 (2000) is asymptotically normally distributed as N(0;H 1 m m matrices dened by G = E (x; (x; y hich are assumed to be nonsingular. G H 1 ); here G and H are ; H = E 2 l (x; y ) ; (3.2) 4. Expected loss In the previous section, optimal choice of (x) as discussed in terms of the asymptotic bias 1. For moderate sample size, hoever, the variance of ˆ due to the sampling error should be considered. In order to take account of both the bias and the variance, e employ the expected loss E (n) 0 (loss 1( ˆ )) to determine the optimal eight; E (n) 0 ( ) denotes the expectation ith respect to (x(n) ;y (n) ) hich follos n t=1 q(y t x t )q 0 (x t ). Lemma 2. The expected loss is asymptotically expanded as E (n) 0 (loss 1( ˆ )) = loss 1 ()+ 1 K [1] b + 1 [2] tr(k n 2 here the elements of K [1] and K [2] are dened by (K [k] q 1 k log p(y x; ) ) i1 i k = E 0 q 0 i k H 1 G H 1 ) +o(n 1 ); and b is the asymptotic limit of ne (n) 0 ( ˆ ); hich is of order O(1). (4.1) The expression for b is given in the folloing lemma. We use the summation convention A i B i = m i=1 A ib i in the formula. Lemma 3. The elements of b = lim n ne (n) 0 ( ˆ ) are given by b i1 :=H i1i2 H j1j2 (H [2 1] ) i2j 1 j 2 1 [3] 2 (H ) i2j 1k 1 H k1k2 (H [1 1] ) k2 j 2 ; (4.2) here H ij denotes the (i; j) element of H 1 ; and (H k l (x; y ) ) i1 i k = E i ; k l (x; y ) ) i1 i k j 1 j l = E i k (H [k l l (x; y j : l Note that the matrices dened in (3:2) are ritten as G = H [1 1] and H = H [2].
7 H. Shimodaira / Journal of Statistical Planning and Inference 90 (2000) For suciently large n, loss 1 ( ) is the dominant term on the righthand side of (4.1), and the optimal eight is (x)=q 1 (x)=q 0 (x) as seen in Section 3. If n is not large enough compared ith the extent of the misspecication, the O(n 1 ) terms related to the rst and second moments of ˆ cannot be ignored in (4.1), and the optimal eight changes. In an extreme case here the model is correctly specied, e only have to look at the O(n 1 ) terms as shon in the folloing lemma. Lemma 4. Assume there exists such that q(y x)=p(y x; ). Then; = and q(y x) =p(y x; ) for all proper (x). The expected loss E (n) 0 (loss 1( ˆ )) is minimized hen (x) 1 for suciently large n. 5. Information criterion The performance of MWLE for a specied (x) is given by (4.1). Hoever, e cannot calculate the value of the expected loss from it in practice, because q(y x) is unknon. We provide a variant of the information criterion as an estimate of (4.1). Theorem 1. Let the information criterion for MWLE be IC := 2L 1 ( ˆ )+2tr(J H 1 ); (5.1) here L 1 ()= n q 1 (x) t=1 q 0 (x) log p(y t x t ;); q 1 log p(y x; ) J = E (x; y ) q : The matrices J and H may be replaced by their consistent estimates Ĵ = 1 n q 1 (x t log p(y t x t (x t ;y t ) n q 0 (x ; ˆ t=1 ˆ Ĥ = 1 2 l (x t ;y t ) n t=1 : ˆ Then; IC =2n is an estimate of the expected loss unbiased up to O(n 1 ) term: E (n) 0 (IC =2n)=E (n) 0 (loss 1( ˆ ))+o(n 1 ): (5.2) Fortunately, expression (5.1) turns out to be rather simpler than that of (4.1), and e do not have to orry about the calculation of the third derivatives appeared in (4.2). The explicit form of (5.1) for the normal regression is given in (6.1) and (6.2). Given the model p(y x; ) and the data (x (n) ;y (n) ), e choose a eight function (x) hich attains the minimum of IC over a certain class of eights. This is selection of the eight rather than model selection. Searching the optimal eight over all the
8 234 H. Shimodaira / Journal of Statistical Planning and Inference 90 (2000) possible forms of (x) is equivalent to ndimensional optimization problem ith respect to ((x t ): t =1;:::;n). But e do not take this line here, because of the computational cost as ell as a conceptual diculty hich ill be mentioned in Section 9. Rather than the global search, e shall pick a better one from the to extreme cases of (x) 1 and (x)=q 1 (x)=q 0 (x), or consider a class of eights by connecting the to extremes continuously: ( ) q1 (x) (x)= ; [0; 1]; (5.3) q 0 (x) here = 0 corresponds to ˆ 0 and = 1 corresponds to ˆ 1. In the next section, e numerically nd ˆ hich minimizes IC by searching over [0; 1]. Note that (5.3) is proportional to N(; 2 =) in the case of (2.5), and 1=2 is the indo scale parameter. When e have several candidate forms of p(y x; ), the model and the eight are selected simultaneously by minimizing IC. A similar idea of the simultaneous selection is found in Shibata (1989), here an information criterion RIC is derived for the penalized likelihood. A crucial distinction, hoever, is that the eight for is selected in RIC, hereas the eight for x is selected in IC. Another distinction is that the eight is additive to the log likelihood in RIC, hile it is multiplicative in IC. Akaike (1974) gave an information criterion AIC = 2L 0 ( ˆ 0 )+2dim; (5.4) here L 0 () is the loglikelihood function. AIC is intended for MLE, and it is obtained as a special case of IC. When q 1 (x)=q 0 (x) and (x) 1, IC reduces to TIC = 2L 0 ( ˆ 0 )+2tr(G 0 H 1 0 ); here G 0 = G and H 0 = H hen (x) 1. TIC is derived by Takeuchi (1976) as a precise version of AIC, and it is equivalent to the criterion of Linhart and Zucchini (1986). If p(y x; 0 ) is suciently close to q(y x), tr(g 0H 1 0 ) dim and TIC reduces to AIC. 6. Numerical example revisited For the normal linear regression, such as the polynomial regression given in (2.1), components of ˆ are obtained by WLS ith eights (x t ). component of ˆ is then given by ˆ 2 = n t=1 (x t)ˆ 2 t =ĉ, here ĉ = n t=1 (x t) and ˆ t is the residual. Letting ĥ t, t =1;:::;n be the diagonal elements of the hat matrix used in the WLS, the information criterion (5.1) is calculated from L 1 ( ˆ )= 1 2 n t=1 q 1 (x t ) q 0 (x t ) ˆ 2 t ˆ 2 + log(2 ˆ2 ) ; (6.1)
9 H. Shimodaira / Journal of Statistical Planning and Inference 90 (2000) Fig. 2. (a) Curve of IC versus [0; 1] for the model of Section 2 ith d = 2. The eight function (5.3) connecting from (x) 1 (i.e. =0) to (x)=q 1 (x)=q 0 (x) (i.e. =1) as used. Also shon are 2L 1 ( ˆ ) in dotted lines, and 2 tr(j H 1 ) in broken lines. (b) The regression curves for d = 2. The WLS curve ith the optimal ˆ as ell as those for OLS ( = 0) and WLS ( = 1) are dran. Table 1 IC values ith eight (5.3) for =0, =1, and = ˆ. Also shon is ˆ value. Calculated for the polynomial regression example of Section 2 ith d =0;:::;4 d =0 d =1 d =2 d =3 d =4 = = = ˆ ˆ tr(ĵ Ĥ 1 )= n q 1 (x t ) t=1 q 0 (x t ) ( ) 2 ˆ 2 t ˆ 2 ĥt + (x t) ˆ 2 t 2ĉ ˆ 2 1 : (6.2) We apply the above formulas to the data generated from (2.2) and (2.3) in Section 2. Fig. 2a shos the plot of the information criterion and its to components for d =2. By increasing from 0 to 1, the rst term of (5.1) decreases hile the second term increases in general. We numerically nd ˆ so that the to terms balance. For d =2, IC takes the minimum at ˆ =0:56. The regression curves obtained by this method are shon in Fig. 2b. Table 1 shos IC values for d =0;:::;4. For each d, IC is minimized at = ˆ. Then, IC of ˆ is minimized at the model d = 3. By minimizing IC, and d are simultaneously selected. For d = 3, it turns out that ˆ =0:01 0. In fact, the model of d = 3 is correctly specied in this dataset, and it follos from Lemma 4 that ˆ0 is optimal for d 3. Even in such a situation, the appropriate ˆ is selected by minimizing IC.
10 236 H. Shimodaira / Journal of Statistical Planning and Inference 90 (2000) Table 2 Asymptotic convergence of the second term of (4.1). 2n times of loss 1 ( ˆ ) loss 1 ( ) is calculated for the MonteCarlo replicates, and its average is tabulated. For every simulation (d =0; 1; 2 and =0; 1), the values are shoing a good convergence as n n d=0 d =1 d =2 =0 =1 =0 =1 =0 = Table 3 Asymptotic convergence of (5.2). 2n loss 1 ( ˆ )+2L 1 ( ˆ ) is calculated for the MonteCarlo replicates, and its average is tabulated in the left columns. For n 300, this agrees very ell ith the average of 2 tr( Jˆ Hˆ 1 ) tabulated in the right columns. 2 tr(j H 1 ) is shon in the n = ro n d=0 d =1 d =2 =0 =1 =0 =1 =0 = In practical data analysis, it ould be rare to have correctly specied models at hand. Therefore, e exclude d 3 from the above example, and restrict the candidates to d 3. Then, d = 2 is selected, and d = 1 has almost the same IC value, hile d = 0 has signicantly larger IC value. This agrees ith the asymptotic result veried by extensive MonteCarlo simulations in Shimodaira (1997) that (4.1) is minimized hen = 1 and d = 1 over d 0; 1; 2, for suciently large n. 7. Simulation study First e sho simulation results in Tables 1 3 hich conrm the theory of Sections 4 and 5. A large number N of replicates of the dataset of size n are generated from (2.2) and (2.3). We used (2.4) as q 1 (x). Four simulations of n = 50, 100, 300, and 1000 are done ith N =10 5 for n = and N =10 6 for n = For each replicate of the dataset, ˆ is calculated for =0; 1 and d =0; 1; 2. Then, loss 1 ( ˆ ), L 1 ( ˆ ), and tr(ĵ Ĥ 1 ) are calculated, and their averages over the N replicates are obtained for each simulation. The tables sho nice agreement beteen the simulations and the theory. Next, e sho the results of another set of simulations in Table 4 hich conrms the practical performance of the eight selection procedure. We used (2.3) and (2.4) as before, but (2.2) is replaced by y = x + 3 x 3 + for generating replicates of the
11 H. Shimodaira / Journal of Statistical Planning and Inference 90 (2000) Table 4 The expected loss for the selected eight and the selected model. 2n times of loss 1 ( ˆ )+E 1 (log q(y x)) is calculated for the replicates, and its average is tabulated in the columns of =0; 1 for d =0; 1; 2. The average of the loss of the selected eight is shon in the columns of ˆ. The right most columns sho the average of the loss of the selected model 3 d =0 d =1 d =2 ˆd 0 1 ˆ 0 1 ˆ 0 1 ˆ 0 1 ˆ dataset of size n=100. Five simulations of 3 =1:0, 0.5, 0.2, 0.1, and 0.0 are done ith N =10 4. In the table, E 1 ( log q(y x)) is subtracted from the loss to make comparisons easier. The expected loss of MLE ( = 0) is for 3 =1:0, d = 1, hile that of MWLE ( = 1) is 12.3, shoing a great improvement as e have observed in the numerical example. The expected loss of MWLE ith the selected eight ( ˆ) is 11.9, hich is not signicantly dierent from that of = 1. This is a consequence of the large sample size n = 100, here the rst term of (4.1) or (5.1) is dominant. The same observation holds for 3 =1:0 0:0, d = 0, and 3 =1:0; 0:5, d =1; 2. On the other hand, =0 is signicantly better than =1 for 3 =0:0, d=1; 2, here the model is correctly specied. In this case the second term of (4.1) is dominant and = 0 is the optimal choice as mentioned in Lemma 4. The selected eight ˆ performs close to = 0, but ith slightly larger expected loss. This dierence is the price e pay for the eight selection using (5.1) hich is an estimate of (4.1), not the true value of (4.1). For all the cases of 3 =1:0 0.0, d =0; 1; 2, the eight selection procedure gives the expected loss close to the optimal choice. Fixing to either of 0 or 1 can lead to very poor performance. The same observation holds for the model selection as shon in the columns of ˆd. 8. Bayesian inference We have been orking on the predictive density p(y x; ˆ ); (8.1) hich is based on MWLE ˆ. This type of predictive density is occasionally called as an estimative density in the literature. Another possibility is the Bayesian predictive density. Here e consider a eighted version of it, and examine its performance in prediction.
12 238 H. Shimodaira / Journal of Statistical Planning and Inference 90 (2000) Let p() be the prior density of. Given the data (x (n) ;y (n) ), e shall dene the eighted posterior density by p ( x (n) ;y (n) ) p()exp L ( x (n) ;y (n) ): (8.2) Then the predictive density ill be p (y x; x (n) ;y (n) )= p(y x; )p ( x (n) ;y (n) )d: (8.3) In the case of (x) 1, (8.2) reduces to the ordinary posterior density, and (8.3) reduces to the ordinary Bayesian predictive density. The Kullback Leibler loss of (8.3) ith respect to q(y x)q 1 (x) is q 1 (x) q(y x)log p (y x; x (n) ;y (n) )dy dx and thus the expected loss is given by E (n) 0 (E 1( log p (y x; x (n) ;y (n) ))): (8.4) Lemma 5. For suciently large n; (8:4) is asymptotically expanded as E (n) 0 (loss 1( ˆ )) + 1 K [1] a 1 [1 1] tr((k K [2] )H 1 ) +o(n 1 ); (8.5) n 2 here q 1 (x) = E 0 q 0 (x) K [1 log p(y log p(y x; and a = plim n â is the probability limit of â = n ( ˆ )p ( x (n) ;y (n) )d: Furthermore; (8:4) is estimated by an information criterion 2 n q 1 (x t ) t=1 q 0 (x t ) log p (y t x t ;x (n) ;y (n) )+2tr(J H 1 ): (8.6) In fact; the expectation of (8:6); if divided by 2n; is equal to (8:5) up to O(n 1 ) terms. When q 1 (x) =q 0 (x) and (x) 1, (8.6) reduces to the information criterion for the Bayesian predictive density given in Konishi and Kitagaa (1996). Selection of (x) as ell as selection of p() and p(y x; ) becomes possible by minimizing (8.6). Comparing the values of (5.1) and (8.6), e can also choose hich to use from (8.1) and (8.3). The decrease of the expected loss of (8.3) from that of (8.1) is of order O(n 1 )as shon in (8.5), hich can be positive or negative depending on q(y x). For brevity sake, e assume q 1 (x)=q 0 (x) and (x) 1 belo. Then the decrease in the expected loss is =2n +o(n 1 ), here = (tr(g 0 H 1 0 ) dim ): G 0 H 0 0 and 0 under
13 H. Shimodaira / Journal of Statistical Planning and Inference 90 (2000) Fig. 3. The curvature of the model in relation to the location of the true density q. On the parametric model denoted by p( ), e have = 0, and increases as q deviates from p( ). The region of 0isinthe inside direction of the model, and 0 is in the outside direction of the model. ˆq denotes the empirical distribution, and the projection of ˆq to the model is the estimative density ˆp = p(y x; ˆ 0 ). ˆp B denotes the Bayesian predictive density. misspecication in general, hich is utilized in the information matrix tests (White, 1982) for detecting the misspecication. An enlightening interpretation of may be possible by the folloing geometric argument using the terminology of Efron (1978) and Amari (1985). Fig. 3 shos the space of all conditional densities. is determined by the extent of the misspecication multiplied by the embedding mixture curvature of the model (S. Amari, personal communication). Bayesian predictive density ˆp B is a mixture of p(y x; ) around ˆ 0, and thus it is located in the inside of the model because of the curvature; ˆp B deviates from ˆp of order O(n 1 ) as shon in Davison (1986). Therefore, ˆp B has larger expected loss than ˆp if q(y x) is located in the outside of the model (i.e. 0), because ˆp B is located in the opposite side of q. This does not contradict the classical result that the expected loss of ˆp B is asymptotically smaller than that of ˆp for some prior. In Bayesian literature, the case of correct specication (i.e. = 0) is discussed and the dierence of the expected loss is of order O(n 2 ) as seen in Komaki (1996). Note that the quadratic form of is relevant hen the curvature vanishes, and 0 ifall the eigenvalues are nonnegative; see Shimodaira (1997) for details. 9. Concluding remarks Although the ratio q 1 (x)=q 0 (x) has been assumed to be knon, it is often estimated from data in practice. Assuming q 1 (x) is knon, e tried three possibilities in the numerical example of Section 2: (i) q 0 (x) is specied correctly ithout unknon parameters. (ii) Assuming the normality of q 0 (x), the unknon 0 and 0 are estimated. (iii) Nonparametric kernel density estimation is applied to q 0 (x). Then, it turns out that MWLE is robust against the estimation of q 1 (x)=q 0 (x) and the results are almost identical in the three cases. This may be because the form of q 0 (x) is quite simple and the sample size n = 100 is rather large.
14 240 H. Shimodaira / Journal of Statistical Planning and Inference 90 (2000) A parametric approach to take account of estimation of q 1 (x)=q 0 (x) is considered as follos. Let the observed data z t ;t=1;:::;n, follo p 0 (z ), hile future observations ill follo p 1 (z ). Then a possible estimating equation ill be n t=1 (z t log p 1(z t =0; (9.1) here (z ) =(p 1 (z )=p 0 (z )). The solution of (9.1) reduces to the MWLE discussed in this paper by letting z =(x; y); p 0 (z ) =p(y x; )q 0 (x), and p 1 (z ) = p(y x; )q 1 (x). The estimating equation (9.1) is often seen in the literature of the robust parametric estimation; e.g., Green (1984), Hampel et al. (1986), Lindsay (1994), Basu and Lindsay (1994), Field and Smith (1994) and Windham (1995). In this context, the samples hich are not concordant to the model p 1 (z ) ill be regarded as outliers and doneighted to reduce the impact on the parameter estimation. The specication of the eight function is thus the focal point of the argument. Although the covariate shift is a mechanism dierent from the outliers, an interpretation of MWLE in terms of the robust estimation can be given as follos. For simplicity, let z be a discrete random variable and ˆp 0 (z) be the observed relative frequency hich estimates the contaminated distribution p 0 (z) consistently. The eight (p 1 (z )= ˆp 0 (z)) in (9.1) then leads to the minimum disparity estimator obtained by minimizing the poer divergence ( ) 2nI 2 ˆp0 (z) = ˆp ( 1) 0 (z) 1 z p 1 (z ) (Cressie and Read, 1984; Lindsay, 1994). The uniform eight = 0 is sensitive to outliers, and a positive value = 0:5, say, improves the robustness. In the regression analysis, the deviation of p 0 (z) from p 1 (z) is decomposed into to parts: the misspecication of p(y x; ) and the covariate shift. MWLE is obtained by applying the poer eighting scheme to the second part here = 1 is asymptotically optimal. It may be interesting to consider a robust version of MWLE by applying the eight, say (p 1 (y x; )= ˆp 0 (y x)) ith =0:5, to the rst part as ell. A numerical example of simultaneous selection of the eight and the model by the information criterion is shon in Section 6. The information criterion takes account of the selection bias caused by estimation of the parameter, but it does not take account the bias caused by the selection of the eight and the model. It is important to evaluate the expected loss of the predictive density obtained after these selection. The simulation study of Section 7 indicates that the method presented in this paper is eective, and the nal expected loss of the selected eight and=or model is reasonably small. Although e have employed a specic type of oneparameter connection in (5.3), other types of connection may ork similarly. Hoever, increasing the number of connection parameters ill increase the nal expected loss because of the sampling error of (5.1) as an estimator of (4.1). We observed the slight increase of the expected loss of ˆ in Table 4, and this increase can be much larger for multiparameter connections.
15 H. Shimodaira / Journal of Statistical Planning and Inference 90 (2000) We derived a variant of AIC for MWLE under covariate shift. On the other hand, Shimodaira (1994) and Cavanaugh and Shumay (1998) discussed variants of AIC for MLE in the presence of incomplete data. Information criteria have to be tailored for dierent styles of sampling scheme, and the unied approach for them is left as a future ork. A softare for calculating IC and ˆ of normal regression ill be available in S language at shimo/. Acknoledgements I ould like to thank John Copas, Tony Hayter, Motoaki Kaanabe, Shinto Eguchi, and the revieers for helpful comments and suggestions. Appendix A. Proof of Lemma 1. Since is interior to, sois ˆ for suciently large n. Then, ˆ is obtained as a solution of the estimating equation (x t ;y t ) =0: (A.1) ˆ The Taylor expansion of (A.1) leads to n 1 2 l (x t ;y t ) t=1 n 1=2 ( ˆ )= n 1=2 (x t ;y t ) +O p (n 1=2 ): (A.2) It follos from the central limit theorem that the righthand side is asymptotically distributed as N(0;G ), hile the lefthand side converges to H n 1=2 ( ˆ ). Thus e obtained the desired result. Proof of Lemma 2. The Taylor expansion of loss 1 ( ˆ ) around is loss 1 ()+ 1 (K [1] ) i n 1=2 i + 1 [2] (K ) ij i n 2 j +O p (n 3=2 ); (A.3) here = n 1=2 ( ˆ ), and the summation convention A i B i = m i=1 A ib i is used. Considering Lemma 1, the expectation of (A.3) gives (4.1). By taking expectation of (A.2), e observe b = O(1). Proof of Lemma 3. Considering the O p (n 1=2 ) term in (A.2) explicitly, the Taylor expansion of (A.1) gives Ĥ = n 1=2 (x t ;y t + n 1=2 e ; t=1
16 242 H. Shimodaira / Journal of Statistical Planning and Inference 90 (2000) here Noting Ĥ = 1 n Ĥ 1 E 0 H 1 n t=1 = H 2 l (x t ;y t 2 l (x; y ) hich immediately implies (4.2). ; (e ) i = 1 2 (H [3] ) ijk j k +O p (n 1=2 ): H 1 (Ĥ H )H 1 +O p (n 1 ); n 1=2 E (n) 0 ( ) is ritten as H (x; y ) + H 1 E (n) (e )+O(n 1=2 ); Proof of Lemma 4. For any x, the conditional Kullback Leibler loss loss( x)= q(y x)log p(y x; )dy is minimized at if q(y x) =p(y x; ). Then = for any (x), because E 0 (l (x; y )) = q(x)(x)loss( x)dx. Thus loss 1 () in (4.1) is equal for any (x). Considering K [1] = 0, the second term in (4.1) is ritten as n 1 tr(k [2] Q() 1 Q( 2 )Q() 1 ); (A.4) here K [2] Q(a)=E 0 a(x) = Q(q 1 =q 0 ) and Q(a) is dened for any a(x) log p(y x; log p(y It it easy to verify that Q() 1 Q( 2 )Q() 1 Q(1) 1 is nonnegative denite for any (x), and so (A.4) is minimized hen (x) 1. Proof of Theorem 1. The Taylor expansion of log p(y x; ˆ ) around gives L 1 ( ˆ )=L 1 ()+ 1 () n 1=2 + 2 L 1 () j i j +O p (n 1=2 ) and thus E (n) 0 ( L 1( ˆ )=n) is expanded as loss 1 () 1 1 () n E(n) 0 n 1=2 + 1 [2] tr(k H 1 G H 1 )+O(n 3=2 ): 2n (A.5) Considering n 1 = K [1] +O p (n 1=2 ), the second term of (A.5) becomes ( 1 n K [1] b + 1 n E(n) 0 n 1=2 1 1 () i (K [1] ) i H ij ( n j ) +O p (n 1=2 ) : = 1 n K [1] b 1 n H ij (J ) ij +O(n 3=2 ): Combining this ith (A.5) and (4.1) completes the proof.
17 H. Shimodaira / Journal of Statistical Planning and Inference 90 (2000) Proof of Lemma 5. Assuming certain regularity conditions similar to those of Johnson (1970), e have the asymptotic limit of (8.2) is normal ith mean ˆ and covariance matrix Ĥ 1 =n, since log p ( x (n) ;y (n) ) is expanded as 1 2 n1=2 ( ˆ ) Ĥ n 1=2 ( ˆ )+O p (n 1=2 ); here the terms independent of are omitted. Then, (8.3) is asymptotically expanded as p(y x; ˆ )+ 1 x; ˆ â + 1 2n tr 2 p(y x; ) ˆ ) Ĥ 1 +o p (n 1 ): (A.6) Note that Dunsmore (1976) gave the uneighted version of (A.6) hen the model specication is correct, but the term of â as missing as indicated by Komaki (1996). Applying the identity 2 p p log log p to the third term of (A.6), e obtain log p (y x; x (n) ;y (n) ) =log p(y x; ˆ )+ 1 log p(y log p(y x; log p(y x; ) ( a + 1 2n tr ) H log p(y x; +o p (n 1 ): (A.7) Thus (8.5) immediately follos from (A.7). The last statement of the lemma is veried by combining (A.7) ith Theorem 1. References Akaike, H., A ne look at the statistical model identication. IEEE Trans. Automat. Control 19, Amari, S., DierentialGeometrical Methods in Statistics. Lecture Notes in Statistics, Vol. 28. Springer, Berlin. Basu, A., Lindsay, B.G., Minimum disparity estimation for continuous models: eciency, distributions and robustness. Ann. Inst. Statist. Math. 46, Cavanaugh, J.E., Shumay, R.H., An Akaike information criterion for model selection in the presence of incomplete data. J. Statist. Plann. Infererence 67, Cressie, N., Read, T.R.C., Multinomial goodnessoft tests. J. Roy. Statist. Soc. Ser. B 46, Davison, A.C., Approximate predictive likelihood. Biometrika 73, Dunsmore, I.R., Asymptotic prediction analysis. Biometrika 63, Efron, B., The geometry of exponential families. Ann. Statist. 6, Field, C., Smith, B., Robust estimation a eighted maximum likelihood approach. Internat. Statist. Rev. 62, Green, P.J., Iteratively reeighted least squares for maximum likelihood estimation, and some robust and resistant alternatives (ith discussion). J. Roy. Statist. Soc. Ser. B 46, Hampel, F.R., Ronchetti, E.M., Rousseeu, P.J., Stahel, W.A., Robust Statistics: The Approach Based on Inuence Functions. Wiley, Ne York.
18 244 H. Shimodaira / Journal of Statistical Planning and Inference 90 (2000) Johnson, R.A., Asymptotic expansions associated ith posterior distributions. Ann. Math. Statist. 41, Komaki, F., On asymptotic properties of predictive distributions. Biometrika 83, Konishi, S., Kitagaa, G., Generalised information criteria in model selection. Biometrika 83, Lindsay, B.G., Eciency versus robustness: the case for minimum Hellinger distance and related methods. Ann. Statist. 22, Linhart, H., Zucchini, W., Model Selection. Wiley, Ne York. Pfeermann, D., Skinner, C.J., Holmes, D.J., Goldstein, H., Rasbash, J., Weighting for unequal selection probabilities in multilevel models. J. Roy. Statist. Soc. Ser. B 60, Shibata, R., Statistical aspects of model selection. In: Willems, J.C. (Ed.), From Data to Model. Springer, Berlin, pp Shimodaira, H., A ne criterion for selecting models from partially observed data. In: Cheeseman, P., Oldford, R.W. (Eds.), Selecting Models from Data: AI and Statistics IV. Springer, Berlin, pp (Chapter 3). Shimodaira, H., Predictive inference under misspecication and its model selection. Research Memorandum 642, The Institute of Statistical Mathematics, Tokyo, Japan. Skinner, C.J., Holt, D., Smith, T.M.F., Analysis of Complex Surveys. Wiley, Ne York. Takeuchi, K., Distribution of information statistics and criteria for adequacy of models. Math. Sci. (153), (in Japanese). White, H., Maximum likelihood estimation of misspecied models. Econometrica 50, Windham, M.P., Robustifying model tting. J. Roy. Statist. Soc. Ser. B 57,
How to Use Expert Advice
NICOLÒ CESABIANCHI Università di Milano, Milan, Italy YOAV FREUND AT&T Labs, Florham Park, New Jersey DAVID HAUSSLER AND DAVID P. HELMBOLD University of California, Santa Cruz, Santa Cruz, California
More informationSome statistical heresies
The Statistician (1999) 48, Part 1, pp. 1±40 Some statistical heresies J. K. Lindsey Limburgs Universitair Centrum, Diepenbeek, Belgium [Read before The Royal Statistical Society on Wednesday, July 15th,
More informationTHE PROBLEM OF finding localized energy solutions
600 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 45, NO. 3, MARCH 1997 Sparse Signal Reconstruction from Limited Data Using FOCUSS: A Reweighted Minimum Norm Algorithm Irina F. Gorodnitsky, Member, IEEE,
More informationBootstrap Methods and Their Application
Bootstrap Methods and Their Application cac Davison and DV Hinkley Contents Preface i 1 Introduction 1 2 The Basic Bootstraps 11 21 Introduction 11 22 Parametric Simulation 15 23 Nonparametric Simulation
More informationRECENTLY, there has been a great deal of interest in
IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 47, NO. 1, JANUARY 1999 187 An Affine Scaling Methodology for Best Basis Selection Bhaskar D. Rao, Senior Member, IEEE, Kenneth KreutzDelgado, Senior Member,
More informationBootstrap con"dence intervals: when, which, what? A practical guide for medical statisticians
STATISTICS IN MEDICINE Statist. Med. 2000; 19:1141}1164 Bootstrap con"dence intervals: when, which, what? A practical guide for medical statisticians James Carpenter* and John Bithell Medical Statistics
More informationarxiv:1309.4656v1 [math.st] 18 Sep 2013
The Annals of Statistics 2013, Vol. 41, No. 1, 1716 1741 DOI: 10.1214/13AOS1123 c Institute of Mathematical Statistics, 2013 UNIFORMLY MOST POWERFUL BAYESIAN TESTS arxiv:1309.4656v1 [math.st] 18 Sep 2013
More informationOn the Optimality of the Simple Bayesian. given the class, but the question of whether other sucient conditions for its optimality exist has
,, 1{30 () c Kluwer Academic Publishers, Boston. Manufactured in The Netherlands. On the Optimality of the Simple Bayesian Classier under ZeroOne Loss PEDRO DOMINGOS pedrod@ics.uci.edu MICHAEL PAZZANI
More informationFoundations of Data Science 1
Foundations of Data Science John Hopcroft Ravindran Kannan Version /4/204 These notes are a first draft of a book being written by Hopcroft and Kannan and in many places are incomplete. However, the notes
More informationSubspace Pursuit for Compressive Sensing: Closing the Gap Between Performance and Complexity
Subspace Pursuit for Compressive Sensing: Closing the Gap Between Performance and Complexity Wei Dai and Olgica Milenkovic Department of Electrical and Computer Engineering University of Illinois at UrbanaChampaign
More informationFrom value at risk to stress testing: The extreme value approach
Journal of Banking & Finance 24 (2000) 1097±1130 www.elsevier.com/locate/econbase From value at risk to stress testing: The extreme value approach Francßois M. Longin * Department of Finance, Groupe ESSEC,
More informationDecoding by Linear Programming
Decoding by Linear Programming Emmanuel Candes and Terence Tao Applied and Computational Mathematics, Caltech, Pasadena, CA 91125 Department of Mathematics, University of California, Los Angeles, CA 90095
More informationAn Introduction to Variable and Feature Selection
Journal of Machine Learning Research 3 (23) 11571182 Submitted 11/2; Published 3/3 An Introduction to Variable and Feature Selection Isabelle Guyon Clopinet 955 Creston Road Berkeley, CA 9478151, USA
More informationEVALUATION OF GAUSSIAN PROCESSES AND OTHER METHODS FOR NONLINEAR REGRESSION. Carl Edward Rasmussen
EVALUATION OF GAUSSIAN PROCESSES AND OTHER METHODS FOR NONLINEAR REGRESSION Carl Edward Rasmussen A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy, Graduate
More informationMisunderstandings between experimentalists and observationalists about causal inference
J. R. Statist. Soc. A (2008) 171, Part 2, pp. 481 502 Misunderstandings between experimentalists and observationalists about causal inference Kosuke Imai, Princeton University, USA Gary King Harvard University,
More informationSemiparametric estimates and tests of baseindependent equivalence scales
Journal of Econometrics 88 (1999) 1 40 Semiparametric estimates and tests of baseindependent equivalence scales Krishna Pendakur* Department of Economics, Simon Fraser University, Burnaby, BC, Canada
More informationJournal of the Royal Statistical Society. Series B (Methodological), Vol. 39, No. 1. (1977), pp. 138.
Maximum Likelihood from Incomplete Data via the EM Algorithm A. P. Dempster; N. M. Laird; D. B. Rubin Journal of the Royal Statistical Society. Series B (Methodological), Vol. 39, No. 1. (1977), pp. 138.
More informationIEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 4, APRIL 2006 1289. Compressed Sensing. David L. Donoho, Member, IEEE
IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 4, APRIL 2006 1289 Compressed Sensing David L. Donoho, Member, IEEE Abstract Suppose is an unknown vector in (a digital image or signal); we plan to
More informationPanel Unit Root Tests in the Presence of CrossSectional Dependency and Heterogeneity 1
Panel Unit Root Tests in the Presence of CrossSectional Dependency and Heterogeneity 1 Yoosoon Chang and Wonho Song Department of Economics Rice University Abstract An IV approach, using as instruments
More informationIf You re So Smart, Why Aren t You Rich? Belief Selection in Complete and Incomplete Markets
If You re So Smart, Why Aren t You Rich? Belief Selection in Complete and Incomplete Markets Lawrence Blume and David Easley Department of Economics Cornell University July 2002 Today: June 24, 2004 The
More informationChapter 3 OneStep Methods 3. Introduction The explicit and implicit Euler methods for solving the scalar IVP y = f(t y) y() = y : (3..) have an O(h) global error which istoolow to make themofmuch practical
More informationStatistical Inference in TwoStage Online Controlled Experiments with Treatment Selection and Validation
Statistical Inference in TwoStage Online Controlled Experiments with Treatment Selection and Validation Alex Deng Microsoft One Microsoft Way Redmond, WA 98052 alexdeng@microsoft.com Tianxi Li Department
More informationThe Capital Asset Pricing Model: Some Empirical Tests
The Capital Asset Pricing Model: Some Empirical Tests Fischer Black* Deceased Michael C. Jensen Harvard Business School MJensen@hbs.edu and Myron Scholes Stanford University  Graduate School of Business
More informationSteering User Behavior with Badges
Steering User Behavior with Badges Ashton Anderson Daniel Huttenlocher Jon Kleinberg Jure Leskovec Stanford University Cornell University Cornell University Stanford University ashton@cs.stanford.edu {dph,
More informationWHICH SCORING RULE MAXIMIZES CONDORCET EFFICIENCY? 1. Introduction
WHICH SCORING RULE MAXIMIZES CONDORCET EFFICIENCY? DAVIDE P. CERVONE, WILLIAM V. GEHRLEIN, AND WILLIAM S. ZWICKER Abstract. Consider an election in which each of the n voters casts a vote consisting of
More informationWe show that social scientists often do not take full advantage of
Making the Most of Statistical Analyses: Improving Interpretation and Presentation Gary King Michael Tomz Jason Wittenberg Harvard University Harvard University Harvard University Social scientists rarely
More informationOptimization by Direct Search: New Perspectives on Some Classical and Modern Methods
SIAM REVIEW Vol. 45,No. 3,pp. 385 482 c 2003 Society for Industrial and Applied Mathematics Optimization by Direct Search: New Perspectives on Some Classical and Modern Methods Tamara G. Kolda Robert Michael
More informationHow to Deal with The LanguageasFixedEffect Fallacy : Common Misconceptions and Alternative Solutions
Journal of Memory and Language 41, 416 46 (1999) Article ID jmla.1999.650, available online at http://www.idealibrary.com on How to Deal with The LanguageasFixedEffect Fallacy : Common Misconceptions
More informationStatistical Comparisons of Classifiers over Multiple Data Sets
Journal of Machine Learning Research 7 (2006) 1 30 Submitted 8/04; Revised 4/05; Published 1/06 Statistical Comparisons of Classifiers over Multiple Data Sets Janez Demšar Faculty of Computer and Information
More informationON THE DISTRIBUTION OF SPACINGS BETWEEN ZEROS OF THE ZETA FUNCTION. A. M. Odlyzko AT&T Bell Laboratories Murray Hill, New Jersey ABSTRACT
ON THE DISTRIBUTION OF SPACINGS BETWEEN ZEROS OF THE ZETA FUNCTION A. M. Odlyzko AT&T Bell Laboratories Murray Hill, New Jersey ABSTRACT A numerical study of the distribution of spacings between zeros
More information