Kernel density estimation with adaptive varying window size

Pattern Recognition Letters 23 (2002)

Kernel density estimation with adaptive varying window size

Vladimir Katkovnik a, Ilya Shmulevich b,*

a Kwangju Institute of Science and Technology, Kwangju, South Korea
b University of Texas M.D. Anderson Cancer Center, Houston, TX 77030, USA

Received 2 November 2000; received in revised form 2 August 2001

Abstract

A new method of kernel density estimation with a varying adaptive window size is proposed. It is based on the so-called intersection of confidence intervals (ICI) rule. Several examples of the proposed method are given for different types of densities, and the quality of the adaptive density estimate is assessed by means of numerical simulations. © 2002 Elsevier Science B.V. All rights reserved.

Keywords: Non-parametric; Kernel; Density estimation; Parzen; ICI rule

* Corresponding author. E-mail addresses: vlkatkov@hotmail.com (V. Katkovnik), is@ieee.org (I. Shmulevich).

1. Introduction

In pattern recognition, optimal algorithms often require knowledge of the underlying densities of signals and/or noise. As these densities are usually unknown, unrealistic assumptions are frequently made, thus compromising the performance of the algorithms in question. A common approach to this problem is to estimate the density from the data. If a particular form of the density is assumed or known, parametric estimation is used. If nothing is assumed about the shape of the density, non-parametric estimation is employed. Besides being widely used in pattern recognition and classification (Fukunaga, 1990), non-parametric probability density estimation has been applied in image processing (Sindoukas et al., 1997; Wright et al., 1997), communications (Zabin and Wright, 1994), and many other fields. One of the most well-known and popular techniques of non-parametric density estimation is the kernel or Parzen density estimate (Fukunaga, 1990; Parzen, 1962; Cacoullos, 1966).
Given $N$ samples $X_1, \ldots, X_N$ drawn from a population with density function $f(x)$, $x \in \mathbb{R}$, the Parzen density estimate at $x$ is given by

$$\hat f_h(x) = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{h}\, \kappa\!\left(\frac{x - X_i}{h}\right), \qquad (1)$$

where $\kappa(\cdot)$ is a window or kernel function and $h$ is the window width, smoothing parameter, or simply the kernel size. Traditionally, it is assumed that $\int \kappa(u)\,du = 1$ and that $\kappa(\cdot)$ is symmetric, that is, $\kappa(u) = \kappa(-u)$. One popular choice is the Gaussian kernel $\kappa(u) = (1/\sqrt{2\pi}) \exp(-u^2/2)$. The kernel size $h$ is the most important characteristic of the Parzen density estimate (Raudys, 1991; Silverman, 1978). One can compute the ideal or optimal value of $h$ by minimizing the mean-square error

$$\mathrm{MSE}\{\hat f_h(x)\} = E\{[\hat f_h(x) - f(x)]^2\} \qquad (2)$$

between the true and estimated densities, with respect to $h$. The MSE is a function of $x$, and so the optimal kernel size $h$ is also a function of $x$. To minimize the MSE, the best compromise between variance and bias must be found. Using Taylor series approximations of the moments of $\hat f_h(x)$ and noting that

$$\mathrm{MSE}\{\hat f_h(x)\} = [E\{\hat f_h(x)\} - f(x)]^2 + \mathrm{Var}\{\hat f_h(x)\}, \qquad (3)$$

where

$$E\{\hat f_h(x)\} = \int \kappa(u)\, f(x + hu)\,du \qquad (4)$$

and

$$\mathrm{Var}\{\hat f_h(x)\} = \frac{1}{Nh} \int \kappa^2(u)\, f(x + hu)\,du - \frac{1}{N}\, E^2\{\hat f_h(x)\}, \qquad (5)$$

we have, for small $h$ and large $N$ ($h \to 0$, $N \to \infty$, and $Nh \to \infty$),

$$\mathrm{Bias}\{\hat f_h(x)\} = E\{\hat f_h(x)\} - f(x) \approx \begin{cases} \dfrac{h^2}{2}\, f''(x) \displaystyle\int u^2 \kappa(u)\,du, & \text{if } \kappa(u) = \kappa(-u), \\[2mm] h\, f'(x) \displaystyle\int u\, \kappa(u)\,du, & \text{otherwise,} \end{cases} \qquad (6)$$

and

$$\mathrm{Var}\{\hat f_h(x)\} \approx \frac{f(x)}{Nh} \int \kappa^2(u)\,du. \qquad (7)$$

Then the optimal value of the kernel size can be shown to be (Parzen, 1962)

$$h_0(x) = \left( \frac{f(x) \int \kappa^2(u)\,du}{N \left[ f''(x) \int u^2 \kappa(u)\,du \right]^2} \right)^{1/5} \qquad (8)$$

for a symmetric kernel, with $h \to 0$ as $N \to \infty$ and $Nh \to \infty$, assuring asymptotic unbiasedness and consistency of the estimate. Similarly,

$$h_0(x) = \left( \frac{f(x) \int \kappa^2(u)\,du}{2N \left[ f'(x) \int u\, \kappa(u)\,du \right]^2} \right)^{1/3} \qquad (9)$$

for a non-symmetric window. As can be seen from Eqs. (8) and (9), the optimal kernel size depends on the value of the density and on its second or first derivative. These equations, in particular the order with respect to $N$, depend essentially on whether the kernel is symmetric or non-symmetric. It is also possible to obtain an optimal constant kernel size, independent of $x$, by minimizing either the integral mean-square error $\int \mathrm{MSE}\{\hat f_h(x)\}\,dx$ or the expected mean-square error $\int \mathrm{MSE}\{\hat f_h(x)\}\, f(x)\,dx$ (Fukunaga, 1990; Parzen, 1962). Clearly, in practice, one does not have access to the true density function $f(x)$ that is to be estimated.
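For concreteness, the estimate in Eq. (1) can be sketched in a few lines of Python. This is our own illustrative code, not from the paper; the names `parzen_estimate` and `gaussian_kernel` are ours:

```python
import math
import random

def gaussian_kernel(u):
    """Symmetric Gaussian kernel: kappa(u) = exp(-u^2/2) / sqrt(2*pi)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def parzen_estimate(x, samples, h, kernel=gaussian_kernel):
    """Parzen estimate, Eq. (1): (1/(N*h)) * sum_i kappa((x - X_i)/h)."""
    n = len(samples)
    return sum(kernel((x - xi) / h) for xi in samples) / (n * h)

# Example: estimate a standard normal density at x = 0 from 2000 samples.
random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(2000)]
print(parzen_estimate(0.0, data, h=0.3))  # near the true value 1/sqrt(2*pi)
```

Note that the smoothing by the kernel introduces a small downward bias at the mode, exactly the bias term of Eq. (6) with $f''(0) < 0$.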
Thus, a number of heuristic approaches can be taken to finding the window width. For instance, the optimal constant $h$ can be computed for, say, the normal distribution (as a function of $N$) and then used for forming the estimate $\hat f_h(x)$ (Fukunaga, 1990). Since density estimates are often used for classification purposes, another approach is to determine $h$ on the basis of the expected probability of misclassification (Raudys, 1991). Although $h$ is usually taken to be a constant, several important approaches have been proposed to vary it. One is the well-known $k$th nearest neighbor estimate (Loftsgaarden and Quesenberry, 1965). Another is the adaptive kernel estimate proposed in (Breiman et al., 1977). A number of other papers consider the problem of kernel size selection (e.g., Abramson, 1982; Terrell and Scott, 1992; Chiu, 1992; Sheather and Jones, 1991; Hall et al., 1991; Taylor, 1989). In this paper, we propose and develop a special rule (statistic) for choosing a data-driven kernel size that is selected in a point-wise manner for every argument value of the density. Our main motivation for this more complex estimator is to make it adaptive to the unknown and varying smoothness of the density to be estimated. The method is described in Section 2. In Section 3, we assess the quality of the density estimate by comparing it to the known density being estimated, under the mean-square error criterion. It will be shown that estimates based on variable-sized kernels are superior to estimates based on the optimal constant-sized kernel.

2. Adaptive kernel size selection

The method of adaptive kernel size selection is based on the ICI rule, proposed in (Goldenshluger and Nemirovsky, 1997) for adaptive regression smoothing and later developed for signal filtering in (Katkovnik, 1999). One of several attractive properties of the ICI rule is that it is spatially adaptive over a wide range of signal classes, in the sense that its quality is close to what could be achieved if the smoothness of the original signal were known in advance (Goldenshluger and Nemirovsky, 1997). We briefly reintroduce this method here in the context of density estimation. Consider the ratio of the standard deviation $\mathrm{Std}\{\hat f_h(x)\}$ of the estimate to the absolute value of its bias $|\mathrm{Bias}\{\hat f_h(x)\}|$, evaluated at the ideal value $h_0(x)$ given in Eqs. (8) and (9). We get

$$\frac{\mathrm{Std}\{\hat f_{h_0}(x)\}}{|\mathrm{Bias}\{\hat f_{h_0}(x)\}|} = k, \qquad k = \begin{cases} 2, & \text{if } \kappa(u) = \kappa(-u), \\ \sqrt{2}, & \text{otherwise.} \end{cases}$$

It is useful to note that, as the standard deviation and the bias are monotonically increasing and decreasing, respectively, as $h \to 0$, we have

$$\frac{1}{k}\,\mathrm{Std}\{\hat f_h(x)\} \ge |\mathrm{Bias}\{\hat f_h(x)\}|, \quad \text{if } h \le h_0,$$
$$\frac{1}{k}\,\mathrm{Std}\{\hat f_h(x)\} \le |\mathrm{Bias}\{\hat f_h(x)\}|, \quad \text{if } h \ge h_0. \qquad (10)$$

Although it is known that in regions where the density function is convex it is theoretically possible to find bandwidths for which the point-wise bias is equal to zero (Sain and Scott, 1996), we assume the monotonicity of the bias with respect to the bandwidth locally, for the theory and initial motivation. Moreover, these zero-bias bandwidths are typically much larger than the asymptotic data-dependent bandwidth given in (8), while Eq. (10) holds for $h \to 0$.
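Where the constant $k$ comes from can be seen in one line from the bias-variance balance (a quick check of our own, consistent with Eqs. (6)-(8)). For a symmetric kernel, write $\mathrm{Bias}\{\hat f_h(x)\} \approx B h^2$ and $\mathrm{Var}\{\hat f_h(x)\} \approx V/(Nh)$; minimizing the asymptotic MSE gives

$$\frac{\partial}{\partial h}\left( B^2 h^4 + \frac{V}{Nh} \right) = 4 B^2 h^3 - \frac{V}{N h^2} = 0 \quad\Longrightarrow\quad \frac{V}{N h_0} = 4 B^2 h_0^4,$$

that is, $\mathrm{Var}\{\hat f_{h_0}(x)\} = 4\,\mathrm{Bias}^2\{\hat f_{h_0}(x)\}$, so $\mathrm{Std}/|\mathrm{Bias}| = 2$ at $h_0$. For a non-symmetric kernel, $\mathrm{Bias} \approx B h$ gives $V/(N h_0) = 2 B^2 h_0^2$, i.e. $k = \sqrt{2}$.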
For a given kernel size $h$, the estimation error can be represented as

$$|\hat f_h(x) - f(x)| = |\mathrm{Bias}\{\hat f_h(x)\} + \xi_h(x)| \le |\mathrm{Bias}\{\hat f_h(x)\}| + |\xi_h(x)|,$$

where $\xi_h(x)$ is a random variable with zero mean and standard deviation equal to $\mathrm{Std}\{\hat f_h(x)\}$. Thus,

$$|\hat f_h(x) - f(x)| \le |\mathrm{Bias}\{\hat f_h(x)\}| + v_p\, \mathrm{Std}\{\hat f_h(x)\} \qquad (11)$$

holds with an arbitrary probability $p$, for a suitably chosen $v_p$. Using the relationships in (10) together with (11), we get that, for $h \le h_0$,

$$|\hat f_h(x) - f(x)| \le \left( \frac{1}{k} + v_p \right) \mathrm{Std}\{\hat f_h(x)\} = C\, \mathrm{Std}\{\hat f_h(x)\}, \qquad (12)$$

where $C = 1/k + v_p$. Larger values of $C$ correspond to larger values of $p$. The ICI rule essentially tests the hypothesis $h \le h_0$ for various values of $h$ and in this way selects an $h$ close to $h_0$, as follows. Suppose $H = \{h_1 < h_2 < \cdots < h_J\}$ is a finite collection of kernel sizes, starting with a small $h_1$. Using inequality (12), we determine a sequence of confidence intervals

$$D_j = [L_j, U_j], \quad j = 1, \ldots, J,$$
$$L_j = \hat f_{h_j}(x) - C\, \mathrm{Std}\{\hat f_{h_j}(x)\},$$
$$U_j = \hat f_{h_j}(x) + C\, \mathrm{Std}\{\hat f_{h_j}(x)\}, \qquad (13)$$

each corresponding to a kernel size in $H$. The ICI rule can then be stated as follows (Katkovnik, 1999).

ICI rule: Consider the intersection of the intervals $D_j$, $1 \le j \le i$, with increasing $i$, and let $i^+$ be the largest of those $i$ for which the intervals $D_j$, $1 \le j \le i$, have a point in common. This $i^+$ defines the adaptive kernel size $h^+(x) = h_{i^+}$ and, consequently, the density estimate $\hat f_{h^+(x)}(x)$.

It is important to note that the kernel size selection procedure based on the ICI rule requires only knowledge of the density estimate and its variance, for which Eq. (7) can be used. However, this variance, in turn, depends on the unknown density to be estimated. A pilot estimate of the density can be used in (7) instead of $f(x)$. However, it is emphasized that this pilot estimate should be obtained more or less independently of the final estimate $\hat f_{h^+(x)}(x)$. In fact, this is a general rule for using pilot estimates in statistics (see, e.g., Fan and Gijbels, 1996). The kernel density estimate with a constant window size is a good choice for the problem considered. In our simulation experiments, we employ the Sheather-Jones plug-in method (Sheather and Jones, 1991) to estimate $h$ in the pilot density estimate; this method is known to have excellent performance compared to other known methods. $C$ is a design parameter of the algorithm, and the selection of its value is discussed in (Katkovnik and Shmulevich, 2000). The ICI procedure for the varying window density estimate can be implemented by Algorithm 1.

Algorithm 1. Adaptive window width selection

    L <- -infinity; U <- +infinity; i <- 1
    while (L <= U) and (i <= J) do
        L_i <- f_{h_i}(x) - C * Std{f_{h_i}(x)}
        U_i <- f_{h_i}(x) + C * Std{f_{h_i}(x)}
        L <- max[L, L_i]; U <- min[U, U_i]
        i <- i + 1
    end while
    h^+(x) <- h_i

3. Simulation examples

In this section, we illustrate the use of the kernel size selection procedure based on the ICI rule.

3.1. Qualitative simulation

This group of simulations is given in order to demonstrate the ability of the ICI rule to obtain reasonable window sizes. As a first example, consider estimating the piece-wise constant density function shown in Fig. 1a. This example is intended to qualitatively demonstrate the behavior of the adaptive kernel size selection procedure, using the symmetric Gaussian kernel as well as the non-symmetric right and left kernels

$$\kappa_r(u) = \begin{cases} \dfrac{2}{\sqrt{2\pi}} \exp\!\left(-\dfrac{u^2}{2}\right), & u \ge 0, \\ 0, & u < 0, \end{cases} \qquad \kappa_l(u) = \kappa_r(-u). \qquad (14)$$

The allowable kernel sizes ($H$) start with $h_1 = 0.01$ and increase until $h_{300} = 3.0$ with a step of 0.01. Figs. 1b-d show the kernel sizes chosen by the ICI rule, corresponding to the three kinds of kernels ($\kappa(\cdot)$, $\kappa_r(\cdot)$, and $\kappa_l(\cdot)$) used. Especially worthy of notice is the behavior of the non-symmetric kernels in the presence of discontinuities in the density function. For instance, the kernel size of the right kernel $\kappa_r(\cdot)$ is rather high at the point of the first discontinuity and becomes smaller as it approaches the second discontinuity ($x = 0$), after which the situation is similar (Fig. 1c). This behavior corresponds to the common-sense idea that a large window size for the right kernel should be chosen just after the first discontinuity, while at $x = 0^-$ the data available to the right kernel estimator are very few and hence the kernel size is accordingly small. Here, the notation $x^-$ means $\lim_{\epsilon \to 0,\, \epsilon > 0} (x - \epsilon)$. For the left kernel $\kappa_l(\cdot)$, the behavior is the opposite (Fig. 1d). Immediately after the first discontinuity, the left kernel still contains very few observations and consequently has a small size, which increases up until $x = 0$. Similarly, just after this point, the kernel size again becomes quite small, since even small sizes encompass two different density regions. Finally, as shown in Fig. 1b, the size of the symmetric kernel increases towards the middle between the discontinuities. Also, very large kernel sizes, in the form of spikes, can be seen exactly at the points of discontinuity. The reason for this phenomenon is the following. It can be shown, using Eq. (4), that at a point of discontinuity of the density function the expectation of the estimate satisfies

$$\lim_{h \to 0} \int \kappa(u)\, f(x + hu)\,du = \frac{f(x^+) + f(x^-)}{2}.$$

Therefore, the ICI rule behaves correctly and in accordance with this fact. The reason that large kernel sizes are chosen is that the density function on either side of the discontinuity is constant, and larger kernel sizes decrease the variance and, consequently, the MSE. The combined density estimates obtained by fusion of the left and right estimates are considered in (Katkovnik and Shmulevich, 2000).

Fig. 1. Density $f(x)$ to be estimated (a) and adaptive window widths corresponding to the symmetric kernel $\kappa(\cdot)$ (b), right kernel $\kappa_r(\cdot)$ (c), and left kernel $\kappa_l(\cdot)$ (d); $C = 4$.

Another way to evaluate the method of adaptive kernel size selection is to compare it to an ideal varying kernel size. Rather than using Eq. (8), which is an asymptotic result, we compare our method against a more stringent criterion, namely the empirically obtained varying kernel size $h^*(x)$ that minimizes the MSE between the known density $f(x)$ and the estimated density $\hat f_{h(x)}(x)$. In other words,

$$h^*(x) = \arg\min_{h(x)} E\left[ f(x) - \hat f_{h(x)}(x) \right]^2. \qquad (15)$$

As a second example, we consider estimating the density function shown in Fig. 2a. The density is zero outside the interval $[0, 1]$. The allowable kernel sizes ($H$) start with $h_1 = 0.01$ and increase until $h_{100} = 1.0$ with a step of 0.01. An ideal kernel size, from the set of allowable kernel sizes, was found for every $x$ using Eq. (15). This ideal window size is shown in Fig. 2b as a solid line. The dashed line shows the variable kernel size $h^+(x)$ obtained by the ICI rule. As can be seen, their behavior is very similar. As expected, the kernel size is larger in the flat region of the density, compared with the regions of the peaks, where the kernel size becomes smaller.

3.2. Quantitative simulation

For quantitative accuracy analysis, we use a double-peaked mixture density, $f(x) = (1 - a)\,N(0, \sigma_1) + a\,N(m_2, \sigma_2)$, with $m_2 = 7$, $\sigma_1 = 1$, $\sigma_2 = 0.05$, and $a = 1/2$.
This type of mixture model is well suited to demonstrating the advantage of varying-window kernel estimates (Terrell and Scott, 1992). The true density to be estimated is depicted in Fig. 3. The allowable kernel sizes ($H$) are logarithmically spaced between $h_1 = 0.002$ and $h_{40} = 4.0$. The number of observations $N$ is equal to 1000.

Fig. 2. (a) Density $f(x)$ to be estimated; (b) optimal varying window width $h^*(x)$ (solid line), minimizing the MSE between the known density and the estimate, and the varying window width based on the ICI rule, $h^+(x)$ (dashed line); $C = 5$.

Fig. 3. The true density to be estimated.

The following computations were performed. For each constant kernel size in $H$, the point-wise average, over 200 simulation runs, of the Parzen estimate was obtained. That is,

$$\bar f_{h_i}(x) = \frac{1}{200} \sum_{j=1}^{200} \hat f^{(j)}_{h_i}(x), \qquad (16)$$

where $\hat f^{(j)}_{h_i}(x)$ is the Parzen estimate with constant kernel size $h_i$ using the data from the $j$th realization. Similarly, the square root of the average, over the 200 runs, of the point-wise squared error was computed. In other words,

$$e_{h_i}(x) = \sqrt{ \frac{1}{200} \sum_{j=1}^{200} \left[ f(x) - \hat f^{(j)}_{h_i}(x) \right]^2 }. \qquad (17)$$

The best (constant) kernel size, $h^*$, was selected by considering the mean value, over $x$, of the error in (17). In our simulations, the best kernel size was $h^* = 0.024$. We note in passing that the data-driven Sheather-Jones plug-in method (Sheather and Jones, 1991) produces a kernel size roughly five times larger.
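Algorithm 1, combined with the variance approximation of Eq. (7), can be sketched in Python as follows. This is our own illustrative code, not the authors' implementation: the names `ici_select` and `parzen` are hypothetical, the pilot density value is passed in by the caller, and a single Gaussian sample stands in for the paper's test densities.

```python
import math
import random

def gaussian_kernel(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

# For the Gaussian kernel, the roughness integral of kappa^2 is 1/(2*sqrt(pi)).
KAPPA2_INTEGRAL = 1.0 / (2.0 * math.sqrt(math.pi))

def parzen(x, samples, h):
    """Parzen estimate, Eq. (1), with the Gaussian kernel."""
    n = len(samples)
    return sum(gaussian_kernel((x - xi) / h) for xi in samples) / (n * h)

def ici_select(x, samples, sizes, C, f_pilot):
    """ICI rule (Algorithm 1): track the running intersection [L, U] of the
    confidence intervals (13) over increasing kernel sizes, and return the
    last size for which the intersection is still non-empty."""
    n = len(samples)
    L, U = -math.inf, math.inf
    h_plus = sizes[0]
    f_plus = parzen(x, samples, h_plus)
    for h in sizes:
        f_h = parzen(x, samples, h)
        # Std{f_h(x)} per Eq. (7), with the unknown f(x) replaced by a pilot.
        std = math.sqrt(max(f_pilot, 1e-12) * KAPPA2_INTEGRAL / (n * h))
        L = max(L, f_h - C * std)
        U = min(U, f_h + C * std)
        if L > U:                       # intersection became empty: stop
            break
        h_plus, f_plus = h, f_h         # last accepted size and estimate
    return h_plus, f_plus

# Example: standard normal data, kernel sizes H = {0.01, 0.02, ..., 3.0}.
random.seed(1)
data = [random.gauss(0.0, 1.0) for _ in range(1000)]
sizes = [0.01 * j for j in range(1, 301)]
pilot = parzen(0.0, data, 0.3)          # crude constant-width pilot estimate
h_adapt, f_adapt = ici_select(0.0, data, sizes, 2.0, pilot)
print(h_adapt, f_adapt)
```

Only the current intersection bounds are kept, so the scan is linear in $J$; the variables `h_plus`/`f_plus` record the last size whose interval still intersected all previous ones.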

Fig. 4. The results of Monte Carlo simulations with 200 runs: (a) and (b) show the two parts of the density estimated using the ideal constant kernel size as well as the ICI rule. The averages of the estimates as well as their confidence intervals are indicated. The true density is shown in dashed lines.

Further, the ICI rule was used with values of $C$ between 0.5 and 3.0 with a step size of 0.5. In a manner similar to the above, the point-wise average of the estimate equipped with the ICI rule was obtained, and the error, as in (17), was computed. This was done for every value of $C$, and the best $C$ was then chosen in the same way as $h^*$. In our simulations, $C = 0.5$. It should be mentioned that the accuracy of estimation is not overly sensitive to the selection of $C$; there exists a range of values all of which result in similar accuracy (Katkovnik and Shmulevich, 2000). Figs. 4a and b depict the two peaks of the density estimated using the ideal constant kernel size $h^*$ and the ICI rule. The true density is shown as a dashed line. The average of the 200 estimates given in (16) is indicated as "mean const.", and the corresponding average of the estimates using the ICI rule as "mean ICI". Note that in Fig. 4b the average estimated densities nearly coincide. In addition, these figures also show the upper and lower confidence intervals, given by $\bar f_{h_i}(x) \pm e_{h_i}(x)$, for the ideal constant kernel size as well as for the ICI rule; these are labeled "upper/lower const." and "upper/lower ICI", respectively. It can readily be seen that on the wide peak of the density the ICI rule produces smaller confidence intervals and hence less variability of the error in the estimate. We stress that this comparison is against the ideal constant kernel size, which is, of course, unknown. Moreover, adaptive constant-bandwidth methods (e.g., Sheather and Jones, 1991) produce sizes very different from the ideal.
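The averaging in Eqs. (16) and (17) is straightforward to reproduce. The sketch below is our own code, not the paper's: `pointwise_error` is a hypothetical name, and a single standard normal density stands in for the paper's two-peaked mixture.

```python
import math
import random

def gaussian_kernel(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def parzen(x, samples, h):
    """Parzen estimate, Eq. (1), with the Gaussian kernel."""
    n = len(samples)
    return sum(gaussian_kernel((x - xi) / h) for xi in samples) / (n * h)

def pointwise_error(f_true, x, h, n, runs, seed=0):
    """Eqs. (16)-(17): mean of the Parzen estimate over `runs` independent
    realizations, and the square root of the mean squared point-wise error."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(runs):
        data = [rng.gauss(0.0, 1.0) for _ in range(n)]  # N(0,1) stand-in
        estimates.append(parzen(x, data, h))
    mean_est = sum(estimates) / runs
    rms = math.sqrt(sum((f_true(x) - e) ** 2 for e in estimates) / runs)
    return mean_est, rms

normal_pdf = lambda x: math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
mean_est, rms = pointwise_error(normal_pdf, 0.0, h=0.3, n=500, runs=50)
print(mean_est, rms)
```

Sweeping `h` over the grid $H$ and averaging `rms` over a grid of $x$ values reproduces the constant-bandwidth baseline against which the ICI estimates are compared.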
As for the averages of the estimates, the ideal constant kernel size estimate and the estimate based on the ICI rule are comparable, with the ICI rule producing a slightly smoother estimate for the wide peak (Fig. 4a).

4. Conclusions

We have proposed a new method for varying the bandwidth in kernel density estimation. This method is based on the ICI rule and requires only knowledge of the variance of the estimate. In our case, as the true density is unknown, the variance of the estimator is approximated by replacing the true density with a pilot estimate having a data-dependent constant kernel size. It is also possible to implement an iterative technique in which successive estimates are used to compute the variance by formula (5), from which, via the ICI rule, new estimates are formed. Although we have considered this method for one-dimensional densities, there is no conceptual difficulty in extending it to multi-dimensional densities. In that case, as with other techniques, not only the size but also the shape of the kernel is an important parameter. We have shown, by means of numerical simulations, that the proposed method can perform significantly better than any constant-bandwidth method.

Acknowledgements

The authors are grateful for the support and hospitality of the Tampere International Center for Signal Processing in Tampere, Finland, where this work was done.

References

Abramson, I., 1982. On bandwidth variation in kernel estimates: a square root law. Ann. Stat. 10.
Breiman, L., Meisel, W., Purcell, E., 1977. Variable kernel estimates of multivariate densities. Technometrics 19.
Cacoullos, T., 1966. Estimation of a multivariate density. Ann. Inst. Stat. Math. 18.
Chiu, S.-T., 1992. An automatic bandwidth selector for kernel density estimation. Biometrika 79 (4).
Fan, J., Gijbels, I., 1996. Local Polynomial Modelling and its Applications. Chapman and Hall, London.
Fukunaga, K., 1990. Statistical Pattern Recognition, second ed. Academic Press, New York.
Goldenshluger, A., Nemirovsky, A., 1997. On spatial adaptive estimation of nonparametric regression. Math. Meth. Stat. 6 (2).
Hall, P., Sheather, S.J., Jones, M.C., Marron, J.S., 1991. On optimal data-based bandwidth selection in kernel density estimation. Biometrika 78 (2).
Katkovnik, V., 1999. A new method for varying adaptive bandwidth selection. IEEE Trans. Signal Process. 47 (9).
Katkovnik, V., Shmulevich, I., 2000. Kernel density estimation with varying data-driven bandwidth. In: EOS/SPIE Symposium, Image and Signal Processing for Remote Sensing, September 25-29, Barcelona, Spain.
Loftsgaarden, D., Quesenberry, C., 1965. A nonparametric estimate of a multivariate density function. Ann. Math. Stat. 36.
Parzen, E., 1962. On the estimation of a probability density function and the mode. Ann. Math. Stat. 33.
Raudys, S., 1991. On the effectiveness of the Parzen window classifier. Informatica 2 (3).
Sain, S.R., Scott, D.W., 1996. On locally adaptive density estimation. J. Am. Stat. Assoc. 91 (436).
Sheather, S.J., Jones, M.C., 1991. A reliable data-based bandwidth selection method for kernel density estimation. J. R. Stat. Soc. B 53 (3).
Silverman, B.W., 1978. Choosing the window width when estimating a density. Biometrika 65.
Sindoukas, D., Laskaris, N., Fotopoulos, S., 1997. Algorithms for color image edge enhancement using potential functions. IEEE Signal Process. Lett. 4 (9).
Taylor, C.C., 1989. Bootstrap choice of the smoothing parameter in kernel density estimation. Biometrika 76 (4).
Terrell, G., Scott, D., 1992. Variable kernel density estimation. Ann. Stat. 20 (3).
Wright, D., Stander, J., Nicolaides, K., 1997. Non-parametric density estimation and discrimination from images of shapes. J. R. Stat. Soc. Ser. C 46 (3).
Zabin, S., Wright, G., 1994. Nonparametric density estimation and detection in impulsive interference channels. IEEE Trans. Commun. 42 (2-4).


More information

A THEORETICAL COMPARISON OF DATA MASKING TECHNIQUES FOR NUMERICAL MICRODATA

A THEORETICAL COMPARISON OF DATA MASKING TECHNIQUES FOR NUMERICAL MICRODATA A THEORETICAL COMPARISON OF DATA MASKING TECHNIQUES FOR NUMERICAL MICRODATA Krish Muralidhar University of Kentucky Rathindra Sarathy Oklahoma State University Agency Internal User Unmasked Result Subjects

More information

A Comparative Study of the Pickup Method and its Variations Using a Simulated Hotel Reservation Data

A Comparative Study of the Pickup Method and its Variations Using a Simulated Hotel Reservation Data A Comparative Study of the Pickup Method and its Variations Using a Simulated Hotel Reservation Data Athanasius Zakhary, Neamat El Gayar Faculty of Computers and Information Cairo University, Giza, Egypt

More information

Forecasting methods applied to engineering management

Forecasting methods applied to engineering management Forecasting methods applied to engineering management Áron Szász-Gábor Abstract. This paper presents arguments for the usefulness of a simple forecasting application package for sustaining operational

More information

How To Understand The Theory Of Probability

How To Understand The Theory Of Probability Graduate Programs in Statistics Course Titles STAT 100 CALCULUS AND MATR IX ALGEBRA FOR STATISTICS. Differential and integral calculus; infinite series; matrix algebra STAT 195 INTRODUCTION TO MATHEMATICAL

More information

Regularized Logistic Regression for Mind Reading with Parallel Validation

Regularized Logistic Regression for Mind Reading with Parallel Validation Regularized Logistic Regression for Mind Reading with Parallel Validation Heikki Huttunen, Jukka-Pekka Kauppi, Jussi Tohka Tampere University of Technology Department of Signal Processing Tampere, Finland

More information

Decision-making with the AHP: Why is the principal eigenvector necessary

Decision-making with the AHP: Why is the principal eigenvector necessary European Journal of Operational Research 145 (2003) 85 91 Decision Aiding Decision-making with the AHP: Why is the principal eigenvector necessary Thomas L. Saaty * University of Pittsburgh, Pittsburgh,

More information

Optimization of technical trading strategies and the profitability in security markets

Optimization of technical trading strategies and the profitability in security markets Economics Letters 59 (1998) 249 254 Optimization of technical trading strategies and the profitability in security markets Ramazan Gençay 1, * University of Windsor, Department of Economics, 401 Sunset,

More information

Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus

Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus Tihomir Asparouhov and Bengt Muthén Mplus Web Notes: No. 15 Version 8, August 5, 2014 1 Abstract This paper discusses alternatives

More information

Multiple Linear Regression in Data Mining

Multiple Linear Regression in Data Mining Multiple Linear Regression in Data Mining Contents 2.1. A Review of Multiple Linear Regression 2.2. Illustration of the Regression Process 2.3. Subset Selection in Linear Regression 1 2 Chap. 2 Multiple

More information

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012 Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization GENOME 560, Spring 2012 Data are interesting because they help us understand the world Genomics: Massive Amounts

More information

Determining optimal window size for texture feature extraction methods

Determining optimal window size for texture feature extraction methods IX Spanish Symposium on Pattern Recognition and Image Analysis, Castellon, Spain, May 2001, vol.2, 237-242, ISBN: 84-8021-351-5. Determining optimal window size for texture feature extraction methods Domènec

More information

Diagnosis of multi-operational machining processes through variation propagation analysis

Diagnosis of multi-operational machining processes through variation propagation analysis Robotics and Computer Integrated Manufacturing 18 (2002) 233 239 Diagnosis of multi-operational machining processes through variation propagation analysis Qiang Huang, Shiyu Zhou, Jianjun Shi* Department

More information

Gerry Hobbs, Department of Statistics, West Virginia University

Gerry Hobbs, Department of Statistics, West Virginia University Decision Trees as a Predictive Modeling Method Gerry Hobbs, Department of Statistics, West Virginia University Abstract Predictive modeling has become an important area of interest in tasks such as credit

More information

Two Topics in Parametric Integration Applied to Stochastic Simulation in Industrial Engineering

Two Topics in Parametric Integration Applied to Stochastic Simulation in Industrial Engineering Two Topics in Parametric Integration Applied to Stochastic Simulation in Industrial Engineering Department of Industrial Engineering and Management Sciences Northwestern University September 15th, 2014

More information

Probability and Random Variables. Generation of random variables (r.v.)

Probability and Random Variables. Generation of random variables (r.v.) Probability and Random Variables Method for generating random variables with a specified probability distribution function. Gaussian And Markov Processes Characterization of Stationary Random Process Linearly

More information

SENSITIVITY ANALYSIS AND INFERENCE. Lecture 12

SENSITIVITY ANALYSIS AND INFERENCE. Lecture 12 This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike License. Your use of this material constitutes acceptance of that license and the conditions of use of materials on this

More information

MATH. ALGEBRA I HONORS 9 th Grade 12003200 ALGEBRA I HONORS

MATH. ALGEBRA I HONORS 9 th Grade 12003200 ALGEBRA I HONORS * Students who scored a Level 3 or above on the Florida Assessment Test Math Florida Standards (FSA-MAFS) are strongly encouraged to make Advanced Placement and/or dual enrollment courses their first choices

More information

Implementation and validation of an opportunistic stock market timing heuristic: One-day share volume spike as buy signal

Implementation and validation of an opportunistic stock market timing heuristic: One-day share volume spike as buy signal Available online at www.sciencedirect.com Expert Systems with Applications Expert Systems with Applications 35 (2008) 1628 1637 www.elsevier.com/locate/eswa Implementation and validation of an opportunistic

More information

Linear Threshold Units

Linear Threshold Units Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear

More information

Kernel Testing as an Alternative to χ 2 Analysis for Investigating the Distribution of Quantitative Traits

Kernel Testing as an Alternative to χ 2 Analysis for Investigating the Distribution of Quantitative Traits Kernel Testing as an Alternative to χ 2 Analysis for Investigating the Distribution of Quantitative Traits Jeffrey D. Hart, Anna Hale, J. Creighton Miller, Jr. Summary. Chi-square analysis is a popular

More information

Maximum Likelihood Estimation of ADC Parameters from Sine Wave Test Data. László Balogh, Balázs Fodor, Attila Sárhegyi, and István Kollár

Maximum Likelihood Estimation of ADC Parameters from Sine Wave Test Data. László Balogh, Balázs Fodor, Attila Sárhegyi, and István Kollár Maximum Lielihood Estimation of ADC Parameters from Sine Wave Test Data László Balogh, Balázs Fodor, Attila Sárhegyi, and István Kollár Dept. of Measurement and Information Systems Budapest University

More information

Statistics Review PSY379

Statistics Review PSY379 Statistics Review PSY379 Basic concepts Measurement scales Populations vs. samples Continuous vs. discrete variable Independent vs. dependent variable Descriptive vs. inferential stats Common analyses

More information

A logistic approximation to the cumulative normal distribution

A logistic approximation to the cumulative normal distribution A logistic approximation to the cumulative normal distribution Shannon R. Bowling 1 ; Mohammad T. Khasawneh 2 ; Sittichai Kaewkuekool 3 ; Byung Rae Cho 4 1 Old Dominion University (USA); 2 State University

More information

On Density Based Transforms for Uncertain Data Mining

On Density Based Transforms for Uncertain Data Mining On Density Based Transforms for Uncertain Data Mining Charu C. Aggarwal IBM T. J. Watson Research Center 19 Skyline Drive, Hawthorne, NY 10532 charu@us.ibm.com Abstract In spite of the great progress in

More information

Clustering Time Series Based on Forecast Distributions Using Kullback-Leibler Divergence

Clustering Time Series Based on Forecast Distributions Using Kullback-Leibler Divergence Clustering Time Series Based on Forecast Distributions Using Kullback-Leibler Divergence Taiyeong Lee, Yongqiao Xiao, Xiangxiang Meng, David Duling SAS Institute, Inc 100 SAS Campus Dr. Cary, NC 27513,

More information

Nonparametric adaptive age replacement with a one-cycle criterion

Nonparametric adaptive age replacement with a one-cycle criterion Nonparametric adaptive age replacement with a one-cycle criterion P. Coolen-Schrijner, F.P.A. Coolen Department of Mathematical Sciences University of Durham, Durham, DH1 3LE, UK e-mail: Pauline.Schrijner@durham.ac.uk

More information

Environmental Remote Sensing GEOG 2021

Environmental Remote Sensing GEOG 2021 Environmental Remote Sensing GEOG 2021 Lecture 4 Image classification 2 Purpose categorising data data abstraction / simplification data interpretation mapping for land cover mapping use land cover class

More information

TCOM 370 NOTES 99-4 BANDWIDTH, FREQUENCY RESPONSE, AND CAPACITY OF COMMUNICATION LINKS

TCOM 370 NOTES 99-4 BANDWIDTH, FREQUENCY RESPONSE, AND CAPACITY OF COMMUNICATION LINKS TCOM 370 NOTES 99-4 BANDWIDTH, FREQUENCY RESPONSE, AND CAPACITY OF COMMUNICATION LINKS 1. Bandwidth: The bandwidth of a communication link, or in general any system, was loosely defined as the width of

More information

SAS Software to Fit the Generalized Linear Model

SAS Software to Fit the Generalized Linear Model SAS Software to Fit the Generalized Linear Model Gordon Johnston, SAS Institute Inc., Cary, NC Abstract In recent years, the class of generalized linear models has gained popularity as a statistical modeling

More information

Penalized Logistic Regression and Classification of Microarray Data

Penalized Logistic Regression and Classification of Microarray Data Penalized Logistic Regression and Classification of Microarray Data Milan, May 2003 Anestis Antoniadis Laboratoire IMAG-LMC University Joseph Fourier Grenoble, France Penalized Logistic Regression andclassification

More information

Multiple Imputation for Missing Data: A Cautionary Tale

Multiple Imputation for Missing Data: A Cautionary Tale Multiple Imputation for Missing Data: A Cautionary Tale Paul D. Allison University of Pennsylvania Address correspondence to Paul D. Allison, Sociology Department, University of Pennsylvania, 3718 Locust

More information

Introduction to General and Generalized Linear Models

Introduction to General and Generalized Linear Models Introduction to General and Generalized Linear Models General Linear Models - part I Henrik Madsen Poul Thyregod Informatics and Mathematical Modelling Technical University of Denmark DK-2800 Kgs. Lyngby

More information

Why High-Order Polynomials Should Not be Used in Regression Discontinuity Designs

Why High-Order Polynomials Should Not be Used in Regression Discontinuity Designs Why High-Order Polynomials Should Not be Used in Regression Discontinuity Designs Andrew Gelman Guido Imbens 2 Aug 2014 Abstract It is common in regression discontinuity analysis to control for high order

More information

Regression III: Advanced Methods

Regression III: Advanced Methods Lecture 16: Generalized Additive Models Regression III: Advanced Methods Bill Jacoby Michigan State University http://polisci.msu.edu/jacoby/icpsr/regress3 Goals of the Lecture Introduce Additive Models

More information

Maximum likelihood estimation of mean reverting processes

Maximum likelihood estimation of mean reverting processes Maximum likelihood estimation of mean reverting processes José Carlos García Franco Onward, Inc. jcpollo@onwardinc.com Abstract Mean reverting processes are frequently used models in real options. For

More information

Introduction to nonparametric regression: Least squares vs. Nearest neighbors

Introduction to nonparametric regression: Least squares vs. Nearest neighbors Introduction to nonparametric regression: Least squares vs. Nearest neighbors Patrick Breheny October 30 Patrick Breheny STA 621: Nonparametric Statistics 1/16 Introduction For the remainder of the course,

More information

No-Arbitrage Condition of Option Implied Volatility and Bandwidth Selection

No-Arbitrage Condition of Option Implied Volatility and Bandwidth Selection Kamla-Raj 2014 Anthropologist, 17(3): 751-755 (2014) No-Arbitrage Condition of Option Implied Volatility and Bandwidth Selection Milos Kopa 1 and Tomas Tichy 2 1 Institute of Information Theory and Automation

More information

D-optimal plans in observational studies

D-optimal plans in observational studies D-optimal plans in observational studies Constanze Pumplün Stefan Rüping Katharina Morik Claus Weihs October 11, 2005 Abstract This paper investigates the use of Design of Experiments in observational

More information

Computing with Finite and Infinite Networks

Computing with Finite and Infinite Networks Computing with Finite and Infinite Networks Ole Winther Theoretical Physics, Lund University Sölvegatan 14 A, S-223 62 Lund, Sweden winther@nimis.thep.lu.se Abstract Using statistical mechanics results,

More information

Chapter 6: Point Estimation. Fall 2011. - Probability & Statistics

Chapter 6: Point Estimation. Fall 2011. - Probability & Statistics STAT355 Chapter 6: Point Estimation Fall 2011 Chapter Fall 2011 6: Point1 Estimat / 18 Chap 6 - Point Estimation 1 6.1 Some general Concepts of Point Estimation Point Estimate Unbiasedness Principle of

More information

T test as a parametric statistic

T test as a parametric statistic KJA Statistical Round pissn 2005-619 eissn 2005-7563 T test as a parametric statistic Korean Journal of Anesthesiology Department of Anesthesia and Pain Medicine, Pusan National University School of Medicine,

More information

Asymmetry and the Cost of Capital

Asymmetry and the Cost of Capital Asymmetry and the Cost of Capital Javier García Sánchez, IAE Business School Lorenzo Preve, IAE Business School Virginia Sarria Allende, IAE Business School Abstract The expected cost of capital is a crucial

More information

Learning Vector Quantization: generalization ability and dynamics of competing prototypes

Learning Vector Quantization: generalization ability and dynamics of competing prototypes Learning Vector Quantization: generalization ability and dynamics of competing prototypes Aree Witoelar 1, Michael Biehl 1, and Barbara Hammer 2 1 University of Groningen, Mathematics and Computing Science

More information

Statistics in Retail Finance. Chapter 7: Fraud Detection in Retail Credit

Statistics in Retail Finance. Chapter 7: Fraud Detection in Retail Credit Statistics in Retail Finance Chapter 7: Fraud Detection in Retail Credit 1 Overview > Detection of fraud remains an important issue in retail credit. Methods similar to scorecard development may be employed,

More information

Comparative study of the performance of the CuSum and EWMA control charts

Comparative study of the performance of the CuSum and EWMA control charts Computers & Industrial Engineering 46 (2004) 707 724 www.elsevier.com/locate/dsw Comparative study of the performance of the CuSum and EWMA control charts Vera do Carmo C. de Vargas*, Luis Felipe Dias

More information

Publication List. Chen Zehua Department of Statistics & Applied Probability National University of Singapore

Publication List. Chen Zehua Department of Statistics & Applied Probability National University of Singapore Publication List Chen Zehua Department of Statistics & Applied Probability National University of Singapore Publications Journal Papers 1. Y. He and Z. Chen (2014). A sequential procedure for feature selection

More information

Enhancing the SNR of the Fiber Optic Rotation Sensor using the LMS Algorithm

Enhancing the SNR of the Fiber Optic Rotation Sensor using the LMS Algorithm 1 Enhancing the SNR of the Fiber Optic Rotation Sensor using the LMS Algorithm Hani Mehrpouyan, Student Member, IEEE, Department of Electrical and Computer Engineering Queen s University, Kingston, Ontario,

More information

Time Series Analysis

Time Series Analysis JUNE 2012 Time Series Analysis CONTENT A time series is a chronological sequence of observations on a particular variable. Usually the observations are taken at regular intervals (days, months, years),

More information

Midwest Symposium On Circuits And Systems, 2004, v. 2, p. II137-II140. Creative Commons: Attribution 3.0 Hong Kong License

Midwest Symposium On Circuits And Systems, 2004, v. 2, p. II137-II140. Creative Commons: Attribution 3.0 Hong Kong License Title Adaptive window selection and smoothing of Lomb periodogram for time-frequency analysis of time series Author(s) Chan, SC; Zhang, Z Citation Midwest Symposium On Circuits And Systems, 2004, v. 2,

More information

A Study on the Comparison of Electricity Forecasting Models: Korea and China

A Study on the Comparison of Electricity Forecasting Models: Korea and China Communications for Statistical Applications and Methods 2015, Vol. 22, No. 6, 675 683 DOI: http://dx.doi.org/10.5351/csam.2015.22.6.675 Print ISSN 2287-7843 / Online ISSN 2383-4757 A Study on the Comparison

More information

Moving Least Squares Approximation

Moving Least Squares Approximation Chapter 7 Moving Least Squares Approimation An alternative to radial basis function interpolation and approimation is the so-called moving least squares method. As we will see below, in this method the

More information

A Review of Cross Sectional Regression for Financial Data You should already know this material from previous study

A Review of Cross Sectional Regression for Financial Data You should already know this material from previous study A Review of Cross Sectional Regression for Financial Data You should already know this material from previous study But I will offer a review, with a focus on issues which arise in finance 1 TYPES OF FINANCIAL

More information

2 Sample t-test (unequal sample sizes and unequal variances)

2 Sample t-test (unequal sample sizes and unequal variances) Variations of the t-test: Sample tail Sample t-test (unequal sample sizes and unequal variances) Like the last example, below we have ceramic sherd thickness measurements (in cm) of two samples representing

More information

Fuzzy regression model with fuzzy input and output data for manpower forecasting

Fuzzy regression model with fuzzy input and output data for manpower forecasting Fuzzy Sets and Systems 9 (200) 205 23 www.elsevier.com/locate/fss Fuzzy regression model with fuzzy input and output data for manpower forecasting Hong Tau Lee, Sheu Hua Chen Department of Industrial Engineering

More information

**BEGINNING OF EXAMINATION** The annual number of claims for an insured has probability function: , 0 < q < 1.

**BEGINNING OF EXAMINATION** The annual number of claims for an insured has probability function: , 0 < q < 1. **BEGINNING OF EXAMINATION** 1. You are given: (i) The annual number of claims for an insured has probability function: 3 p x q q x x ( ) = ( 1 ) 3 x, x = 0,1,, 3 (ii) The prior density is π ( q) = q,

More information

Jitter Measurements in Serial Data Signals

Jitter Measurements in Serial Data Signals Jitter Measurements in Serial Data Signals Michael Schnecker, Product Manager LeCroy Corporation Introduction The increasing speed of serial data transmission systems places greater importance on measuring

More information

Smoothing and Non-Parametric Regression

Smoothing and Non-Parametric Regression Smoothing and Non-Parametric Regression Germán Rodríguez grodri@princeton.edu Spring, 2001 Objective: to estimate the effects of covariates X on a response y nonparametrically, letting the data suggest

More information

Data Mining and Data Warehousing. Henryk Maciejewski. Data Mining Predictive modelling: regression

Data Mining and Data Warehousing. Henryk Maciejewski. Data Mining Predictive modelling: regression Data Mining and Data Warehousing Henryk Maciejewski Data Mining Predictive modelling: regression Algorithms for Predictive Modelling Contents Regression Classification Auxiliary topics: Estimation of prediction

More information

Linear Models and Conjoint Analysis with Nonlinear Spline Transformations

Linear Models and Conjoint Analysis with Nonlinear Spline Transformations Linear Models and Conjoint Analysis with Nonlinear Spline Transformations Warren F. Kuhfeld Mark Garratt Abstract Many common data analysis models are based on the general linear univariate model, including

More information

THE NUMBER OF GRAPHS AND A RANDOM GRAPH WITH A GIVEN DEGREE SEQUENCE. Alexander Barvinok

THE NUMBER OF GRAPHS AND A RANDOM GRAPH WITH A GIVEN DEGREE SEQUENCE. Alexander Barvinok THE NUMBER OF GRAPHS AND A RANDOM GRAPH WITH A GIVEN DEGREE SEQUENCE Alexer Barvinok Papers are available at http://www.math.lsa.umich.edu/ barvinok/papers.html This is a joint work with J.A. Hartigan

More information

The Kelly criterion for spread bets

The Kelly criterion for spread bets IMA Journal of Applied Mathematics 2007 72,43 51 doi:10.1093/imamat/hxl027 Advance Access publication on December 5, 2006 The Kelly criterion for spread bets S. J. CHAPMAN Oxford Centre for Industrial

More information

STATISTICA Formula Guide: Logistic Regression. Table of Contents

STATISTICA Formula Guide: Logistic Regression. Table of Contents : Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary

More information

Descriptive Statistics

Descriptive Statistics Descriptive Statistics Primer Descriptive statistics Central tendency Variation Relative position Relationships Calculating descriptive statistics Descriptive Statistics Purpose to describe or summarize

More information

IN current film media, the increase in areal density has

IN current film media, the increase in areal density has IEEE TRANSACTIONS ON MAGNETICS, VOL. 44, NO. 1, JANUARY 2008 193 A New Read Channel Model for Patterned Media Storage Seyhan Karakulak, Paul H. Siegel, Fellow, IEEE, Jack K. Wolf, Life Fellow, IEEE, and

More information

The Assumption(s) of Normality

The Assumption(s) of Normality The Assumption(s) of Normality Copyright 2000, 2011, J. Toby Mordkoff This is very complicated, so I ll provide two versions. At a minimum, you should know the short one. It would be great if you knew

More information

Application of discriminant analysis to predict the class of degree for graduating students in a university system

Application of discriminant analysis to predict the class of degree for graduating students in a university system International Journal of Physical Sciences Vol. 4 (), pp. 06-0, January, 009 Available online at http://www.academicjournals.org/ijps ISSN 99-950 009 Academic Journals Full Length Research Paper Application

More information

Two-Sample T-Tests Assuming Equal Variance (Enter Means)

Two-Sample T-Tests Assuming Equal Variance (Enter Means) Chapter 4 Two-Sample T-Tests Assuming Equal Variance (Enter Means) Introduction This procedure provides sample size and power calculations for one- or two-sided two-sample t-tests when the variances of

More information