APPLIED SMOOTHING TECHNIQUES
Part 1: Kernel Density Estimation
Walter Zucchini
October 2003


Contents

1 Density Estimation
  1.1 Introduction
    1.1.1 The probability density function
    1.1.2 Non-parametric estimation of f(x): histograms
  1.2 Kernel density estimation
    1.2.1 Weighting functions
    1.2.2 Kernels
    1.2.3 Densities with bounded support
  1.3 Properties of kernel estimators
    1.3.1 Quantifying the accuracy of kernel estimators
    1.3.2 The bias, variance and mean squared error of the estimator
    1.3.3 Optimal bandwidth
    1.3.4 Optimal kernels
  1.4 Selection of the bandwidth
    1.4.1 Subjective selection
    1.4.2 Selection with reference to some given distribution
    1.4.3 Cross-validation
    1.4.4 Plug-in estimator
    1.4.5 Summary and extensions

Chapter 1: Density Estimation

1.1 Introduction

1.1.1 The probability density function

The probability distribution of a continuous-valued random variable X is conventionally described in terms of its probability density function (pdf), f(x), from which probabilities associated with X can be determined using the relationship

    P(a \le X \le b) = \int_a^b f(x)\,dx.

The objective of many investigations is to estimate f(x) from a sample of observations x_1, x_2, ..., x_n. In what follows we will assume that the observations can be regarded as independent realizations of X.

The parametric approach for estimating f(x) is to assume that f(x) is a member of some parametric family of distributions, e.g. N(\mu, \sigma^2), and then to estimate the parameters of the assumed distribution from the data. For example, fitting a normal distribution leads to the estimator

    \hat f(x) = \frac{1}{\sqrt{2\pi}\,\hat\sigma}\, e^{-(x - \hat\mu)^2 / 2\hat\sigma^2}, \qquad x \in \mathbb{R},

where \hat\mu = \frac{1}{n}\sum_{i=1}^n x_i and \hat\sigma^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \hat\mu)^2.

This approach has advantages as long as the distributional assumption is correct, or at least is not seriously wrong. It is easy to apply and it yields (relatively) stable estimates. The main disadvantage of the parametric approach is its lack of flexibility: each parametric family of distributions imposes restrictions on the shapes that f(x) can have. For example, the density function of the normal distribution is symmetrical and bell-shaped, and is therefore unsuitable for representing skewed or bimodal densities.
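As an R sketch of this parametric approach (the sample x here is simulated purely for illustration):

    # Parametric density estimate: fit a normal distribution to a sample
    x <- rnorm(100, mean = 10, sd = 2)    # illustrative sample
    mu.hat    <- mean(x)                  # estimate of mu
    sigma.hat <- sd(x)                    # estimate of sigma (divisor n - 1)
    grid  <- seq(min(x) - 3 * sigma.hat, max(x) + 3 * sigma.hat, length.out = 200)
    f.hat <- dnorm(grid, mean = mu.hat, sd = sigma.hat)
    plot(grid, f.hat, type = "l", xlab = "x", ylab = "parametric density estimate")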

1.1.2 Non-parametric estimation of f(x): histograms

The idea of the non-parametric approach is to avoid restrictive assumptions about the form of f(x) and to estimate it directly from the data. A well-known non-parametric estimator of the pdf is the histogram. It has the advantage of simplicity, but it also has disadvantages, such as lack of continuity. Secondly, in terms of various mathematical measures of accuracy there exist alternative non-parametric estimators that are superior to histograms.

To construct a histogram one needs to select a left bound, or starting point, x_0, and the bin width, b. The bins are of the form [x_0 + (i-1)b, x_0 + ib), i = 1, 2, ..., m. The estimator of f(x) is then given by

    \hat f(x) = \frac{1}{n} \cdot \frac{\text{number of observations in the same bin as } x}{b}.

More generally one can also use bins of different widths, in which case

    \hat f(x) = \frac{1}{n} \cdot \frac{\text{number of observations in the same bin as } x}{\text{width of the bin containing } x}.

The choice of bins, especially the bin widths, has a substantial effect on the shape and other properties of \hat f(x). This is illustrated in the example that follows.

Example 1
We consider a population of 689 new cars of a certain model. Of interest here is the amount (in DM) paid by the customers for optional extras, such as radio, hubcaps, special upholstery, etc. The top histogram in Figure 1.1 relates to the entire population; the bottom histogram is for a random sample of size 10 from the population. Figure 1.2 shows three histogram estimates of f(x) for the sample, for different bin widths. Note that the estimates are piecewise constant and that they are strongly influenced by the choice of bin width. The bottom right graph is an example of a so-called kernel estimator of f(x). We will be examining such estimators in more detail.
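In R the histogram estimator can be computed with hist(); a minimal sketch, assuming a numeric sample x (for instance the one simulated above) and arbitrary choices of starting point and bin width:

    # Histogram estimate of f(x): bin counts scaled so that the total area is one
    b    <- 2                               # bin width
    x0   <- floor(min(x))                   # left bound (starting point)
    brks <- seq(x0, max(x) + b, by = b)     # bin edges x0, x0 + b, x0 + 2b, ...
    hst  <- hist(x, breaks = brks, plot = FALSE)
    hst$density                             # piecewise-constant values of the estimate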

[Figure 1.1: Histograms of total expenditure for the population of 689 cars and for a random sample of size n = 10.]

[Figure 1.2: Histograms with different bin widths and a kernel estimate of f(x) for the same sample (N = 10, bandwidth = 2.344).]

1.2 Kernel density estimation

1.2.1 Weighting functions

From the definition of the pdf, f(x), of a random variable, X, one has that

    P(x - h < X < x + h) = \int_{x-h}^{x+h} f(t)\,dt \approx 2h f(x),

and hence

    f(x) \approx \frac{1}{2h} P(x - h < X < x + h).   (1.1)

The above probability can be estimated by a relative frequency in the sample, hence

    \hat f(x) = \frac{1}{2hn} \cdot (\text{number of observations in } (x - h, x + h)).   (1.2)

An alternative way to represent \hat f(x) is

    \hat f(x) = \frac{1}{n} \sum_{i=1}^n w(x - x_i, h),   (1.3)

where x_1, x_2, ..., x_n are the observed values and

    w(t, h) = \begin{cases} \frac{1}{2h} & \text{for } |t| < h, \\ 0 & \text{otherwise.} \end{cases}

It is left to the reader as an exercise to show that \hat f(x) defined in (1.3) has the properties of a pdf, that is, \hat f(x) is non-negative for all x and the area between \hat f(x) and the x-axis is equal to one.

One way to think about (1.3) is to imagine that a rectangle (height 1/(2h) and width 2h) is placed over each observed point on the x-axis. The estimate of the pdf at a given point is 1/n times the sum of the heights of all the rectangles that cover the point. Figure 1.3 shows \hat f(x) for such rectangular weighting functions and for different values of h. We note that the estimates of \hat f(x) in Figure 1.3 fluctuate less as the value of h is increased: by increasing h one increases the width of each rectangle and thereby increases the degree of smoothing.

Instead of using rectangles in (1.3) one could use other weighting functions, for example triangles:

    w(t, h) = \begin{cases} \frac{1}{h}\left(1 - \frac{|t|}{h}\right) & \text{for } |t| < h, \\ 0 & \text{otherwise.} \end{cases}

Again it is left to the reader to check that the resulting \hat f(x) is indeed a pdf. Examples of \hat f(x) based on the triangular weighting function and four different values of h are shown in Figure 1.4. Note that here too larger values of h lead to smoother estimates \hat f(x).

Another alternative weighting function is the Gaussian:

    w(t, h) = \frac{1}{\sqrt{2\pi}\,h}\, e^{-t^2 / 2h^2}, \qquad -\infty < t < \infty.

Figure 1.5 shows \hat f(x) based on this weighting function for different values of h. Again the fluctuations in \hat f(x) decrease with increasing h.
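Estimator (1.3) is short enough to implement directly. A sketch with the rectangular weighting function, assuming a numeric sample x (the names w.rect and f.hat are illustrative):

    # Naive kernel-type estimator (1.3)
    w.rect <- function(t, h) ifelse(abs(t) < h, 1 / (2 * h), 0)
    f.hat  <- function(grid, x, h, w = w.rect)
      sapply(grid, function(x0) mean(w(x0 - x, h)))  # average of w(x - x_i, h)

    grid <- seq(min(x) - 3, max(x) + 3, length.out = 400)
    plot(grid, f.hat(grid, x, h = 1), type = "l",
         xlab = "x", ylab = "density estimate")      # larger h gives a smoother curve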

[Figure 1.3: Estimates of f(x) using rectangular weighting functions, for bw = 0.5, 1, 2, 4. The abbreviation bw (short for bandwidth) is used here instead of h.]

[Figure 1.4: Estimates of f(x) based on triangular weighting functions, bw = 0.5, 1, 2, 4.]

[Figure 1.5: Estimates of f(x) based on Gaussian weighting functions, bw = 0.5, 1, 2, 4.]

1.2.2 Kernels

The above weighting functions, w(t, h), are all of the form

    w(t, h) = \frac{1}{h} K\left(\frac{t}{h}\right),   (1.4)

where K is a function of a single variable called the kernel. A kernel is a standardized weighting function, namely the weighting function with h = 1. The kernel determines the shape of the weighting function. The parameter h is called the bandwidth or smoothing constant; it determines the amount of smoothing applied in estimating f(x). Examples of kernels are given in Table 1.1.

    Kernel         K(t)                                           Efficiency (exact and to 4 d.p.)
    ----------------------------------------------------------------------------------------------
    Epanechnikov   (3/4)(1 - t^2/5)/sqrt(5)    for |t| < sqrt(5)   1
    Biweight       (15/16)(1 - t^2)^2          for |t| < 1         (3087/3125)^(1/2) = 0.9939
    Triangular     1 - |t|                     for |t| < 1         (243/250)^(1/2)   = 0.9859
    Gaussian       (1/sqrt(2*pi)) exp(-t^2/2)                      (36*pi/125)^(1/2) = 0.9512
    Rectangular    1/2                         for |t| < 1         (108/125)^(1/2)   = 0.9295
    (each kernel is zero outside the indicated interval)

    Table 1.1: Kernels and their efficiencies (to be discussed in Section 1.3.4).

In general any function having the following properties can be used as a kernel:

    (a) \int K(z)\,dz = 1
    (b) \int z K(z)\,dz = 0
    (c) \int z^2 K(z)\,dz := k_2 < \infty   (1.5)

It follows that any symmetric pdf is a kernel. However, non-pdf kernels can also be used, e.g. kernels for which K(z) < 0 for some values of z. Kernels of the latter type have the disadvantage that \hat f(x) may be negative for some values of x.

Kernel estimation of pdfs is characterized by the kernel, K, which determines the shape of the weighting function, and the bandwidth, h, which determines the width of the weighting function and hence the amount of smoothing. These two components determine the properties of \hat f(x). Considerable research has been carried out (and continues to be carried out) on the question of how one should select K and h in order to optimize the properties of \hat f(x). This issue will be discussed in the sections that follow.
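The moment conditions in (1.5) are easy to verify numerically; a minimal R sketch for the Epanechnikov kernel of Table 1.1 (K.ep is an illustrative name):

    # Check properties (1.5) for the Epanechnikov kernel
    K.ep <- function(t) ifelse(abs(t) < sqrt(5), 0.75 * (1 - t^2 / 5) / sqrt(5), 0)
    integrate(K.ep, -sqrt(5), sqrt(5))$value                       # (a): should be 1
    integrate(function(z) z   * K.ep(z), -sqrt(5), sqrt(5))$value  # (b): should be 0
    integrate(function(z) z^2 * K.ep(z), -sqrt(5), sqrt(5))$value  # (c): k2, here 1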

1.2.3 Densities with bounded support

In many situations the values that a random variable, X, can take on are restricted, for example to the interval [0, \infty), that is, f(x) = 0 for x < 0. We say that the support of f(x) is [0, \infty). Similarly, if X can only take on values in the interval (a, b), then f(x) = 0 for x outside (a, b); the support of f(x) is (a, b). In such situations it is clearly desirable that the estimator \hat f(x) has the same support as f(x). Direct application of kernel smoothing methods does not guarantee this property, and so they need to be modified when f(x) has bounded support.

The simplest method of solving this problem is to use a transformation. The idea is to estimate the pdf of a transformed random variable Y = t(X) which has unbounded support. Suppose that the pdf of Y is given by g(y). Then the relationship between f and g is given by

    f(x) = g(t(x))\, t'(x).   (1.6)

One carries out the following steps:

(a) Transform the observations y_i = t(x_i), i = 1, 2, ..., n.
(b) Apply the kernel method to estimate the pdf g(y).
(c) Estimate f(x) using \hat f(x) = \hat g(t(x))\, t'(x).

A sketch of these steps in R is given after Example 3 below.

Example 2
Suppose that f(x) has support [0, \infty). A simple transformation t : [0, \infty) \to (-\infty, \infty) is the log transformation, i.e. y = t(x) = \log(x). Here t'(x) = \frac{d}{dx}\log(x) = \frac{1}{x} and so

    \hat f(x) = \hat g(\log(x)) \cdot \frac{1}{x}.   (1.7)

The resulting estimator has support [0, \infty). Figure 1.6 provides an illustration of this case for the sample considered in Example 1.

(a) The graph on the top left gives the estimated density \hat f(x) obtained without restrictions on the support. Note that \hat f(x) > 0 for some x < 0.
(b) The graph on the top right shows a modified version of the \hat f(x) obtained in (a), namely

    \hat f_c(x) = \begin{cases} \hat f(x) \,/ \int_0^\infty \hat f(x)\,dx & \text{for } x \ge 0, \\ 0 & \text{for } x < 0. \end{cases}   (1.8)

Here \hat f_c(x) is set equal to zero for x < 0 and \hat f(x) is rescaled so that the area under the estimated density equals one.
(c) The bottom left graph shows a kernel estimator of g(y), that is, the density of Y = \log(X).
(d) The bottom right graph shows the transformed estimator \hat f(x) obtained via \hat g(y).

Example 3
Suppose that the support of f(x) is (a, b). Then a simple transformation t : (a, b) \to (-\infty, \infty) is

    y = t(x) = \log\left(\frac{x - a}{b - x}\right).

Here t'(x) = \frac{1}{x - a} + \frac{1}{b - x} and so

    \hat f(x) = \hat g\left(\log\left(\frac{x - a}{b - x}\right)\right) \left(\frac{1}{x - a} + \frac{1}{b - x}\right).   (1.9)
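Steps (a) to (c) take only a few lines in R. A sketch for the log transformation of Example 2, using a simulated positive-valued sample (rlnorm) in place of the car data:

    # Transformation method for support [0, Inf)
    x      <- rlnorm(100)        # illustrative sample with support [0, Inf)
    g.hat  <- density(log(x))    # steps (a) and (b): kernel estimate of g(y)
    x.grid <- exp(g.hat$x)       # back to the original scale
    f.hat  <- g.hat$y / x.grid   # step (c): f-hat(x) = g-hat(log x) * (1/x)
    plot(x.grid, f.hat, type = "l", xlab = "x", ylab = "density estimate")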

[Figure 1.6: Kernel estimates of a pdf with support [0, \infty): a Gaussian kernel estimate (default bw = 2.34), the same estimate with the cutoff option, a Gaussian kernel estimate of log(values) (default bw = 0.21), and the estimate obtained via the kernel smooth of log(values).]

[Figure 1.7: Kernel estimates of a pdf with support (0, 25): a Gaussian kernel estimate (default bw = 2.34), the same estimate with the cutoff option, a Gaussian kernel estimate of logit(values) (default bw = 0.4), and the estimate obtained via the kernel smooth of logit(values).]

Figure 1.7 provides an illustration of this case with a = 0 and b = 25. The four graphs shown are analogous to those in Example 2, but with

    \hat f_c(x) = \begin{cases} \hat f(x) \,/ \int_a^b \hat f(x)\,dx & \text{for } a < x < b, \\ 0 & \text{otherwise} \end{cases}   (1.10)

for the graph on the top right.

Example 4
As an alternative to the transformation in Example 3 one can use the inverse of some probability distribution function, such as the normal, e.g.

    y = t(x) = \Phi^{-1}\left(\frac{x - a}{b - a}\right),

where \Phi is the distribution function of the standard normal. Here too t : (a, b) \to (-\infty, \infty), and the estimator is

    \hat f(x) = \begin{cases} \dfrac{\hat g\left(\Phi^{-1}\left(\frac{x - a}{b - a}\right)\right)}{(b - a)\,\varphi\left(\Phi^{-1}\left(\frac{x - a}{b - a}\right)\right)} & \text{for } a < x < b, \\ 0 & \text{otherwise,} \end{cases}   (1.11)

where \varphi(x) is the pdf of the standard normal distribution. The application of this "normit" transformation is illustrated in Figure 1.8, in which the four graphs are analogous to those in the previous two examples.

[Figure 1.8: Application of the normit transformation for \hat f(x) with support (0, 25): a Gaussian kernel estimate (default bw = 2.34), the same estimate with the cutoff option, a Gaussian kernel estimate of normit(values) (default bw = 0.25), and the estimate obtained via the kernel smooth of normit(values).]

The above three examples illustrate that the transformation procedure can lead to a considerable change in the appearance of the estimate \hat f(x). By applying kernel smoothing to the transformed values one is, in effect, applying a different kernel at each point in the estimation of f(x).

1.3 Properties of kernel estimators

1.3.1 Quantifying the accuracy of kernel estimators

There are various ways to quantify the accuracy of a density estimator. We will focus here on the mean squared error (MSE) and its two components, namely bias and standard error (or variance). We note that the MSE of \hat f(x) is a function of the argument x:

    MSE(\hat f(x)) = E(\hat f(x) - f(x))^2
                   = (E\hat f(x) - f(x))^2 + E(\hat f(x) - E\hat f(x))^2
                   = Bias^2(\hat f(x)) + Var(\hat f(x)).   (1.12)

A measure of the global accuracy of \hat f(x) is the mean integrated squared error (MISE):

    MISE(\hat f) = E \int (\hat f(x) - f(x))^2\,dx
                 = \int MSE(\hat f(x))\,dx
                 = \int Bias^2(\hat f(x))\,dx + \int Var(\hat f(x))\,dx.   (1.13)

We consider each of these components in turn.

1.3.2 The bias, variance and mean squared error of \hat f(x)

    E(\hat f(x)) = \frac{1}{n} \sum_{i=1}^n E\left[\frac{1}{h} K\left(\frac{x - x_i}{h}\right)\right]
                 = \frac{1}{n} \sum_{i=1}^n \int \frac{1}{h} K\left(\frac{x - t}{h}\right) f(t)\,dt
                 = \int \frac{1}{h} K\left(\frac{x - t}{h}\right) f(t)\,dt.   (1.14)

The transformation z = \frac{x - t}{h}, i.e. t = x - hz, \frac{dz}{dt} = -\frac{1}{h}, yields

    E(\hat f(x)) = \int K(z) f(x - hz)\,dz.

Expanding f(x - hz) in a Taylor series yields

    f(x - hz) = f(x) - hz f'(x) + \frac{1}{2}(hz)^2 f''(x) + o(h^2),

where o(h^2) represents terms that converge to zero faster than h^2 as h approaches zero. Thus

    E(\hat f(x)) = \int K(z) f(x)\,dz - \int K(z)\, hz f'(x)\,dz + \int K(z) \frac{(hz)^2}{2} f''(x)\,dz + o(h^2)
                 = f(x) \int K(z)\,dz - h f'(x) \int z K(z)\,dz + \frac{h^2}{2} f''(x) \int z^2 K(z)\,dz + o(h^2)   (1.15)
                 = f(x) + \frac{h^2}{2} k_2 f''(x) + o(h^2),   (1.16)

since \int K(z)\,dz = 1 and \int z K(z)\,dz = 0. Hence

    Bias(\hat f(x)) \approx \frac{h^2}{2} k_2 f''(x).   (1.17)

The bias thus depends on
- the bandwidth h: Bias(\hat f(x)) \to 0 as h \to 0,
- k_2, the variance of the kernel, and
- f''(x), the curvature of the density at the point x.

The variance of \hat f(x) is given by

    Var(\hat f(x)) = Var\left(\frac{1}{n} \sum_{i=1}^n \frac{1}{h} K\left(\frac{x - x_i}{h}\right)\right)
                   = \frac{1}{n^2 h^2} \sum_{i=1}^n Var\left(K\left(\frac{x - x_i}{h}\right)\right)

because the x_i, i = 1, 2, ..., n, are independently distributed. Now

    Var\left(K\left(\frac{x - x_i}{h}\right)\right) = E\left[K\left(\frac{x - x_i}{h}\right)^2\right] - \left(E\, K\left(\frac{x - x_i}{h}\right)\right)^2
        = \int K\left(\frac{x - t}{h}\right)^2 f(t)\,dt - \left(\int K\left(\frac{x - t}{h}\right) f(t)\,dt\right)^2,

and hence

    Var(\hat f(x)) = \frac{1}{n h^2} \int K\left(\frac{x - t}{h}\right)^2 f(t)\,dt - \frac{1}{n}\left(\frac{1}{h} \int K\left(\frac{x - t}{h}\right) f(t)\,dt\right)^2
                   = \frac{1}{n h^2} \int K\left(\frac{x - t}{h}\right)^2 f(t)\,dt - \frac{1}{n}\left(f(x) + Bias(\hat f(x))\right)^2.

Substituting z = \frac{x - t}{h} one obtains

    Var(\hat f(x)) = \frac{1}{n h} \int K(z)^2 f(x - hz)\,dz - \frac{1}{n}\left(f(x) + o(h^2)\right)^2.

Applying a Taylor approximation yields

    Var(\hat f(x)) = \frac{1}{n h} \int K(z)^2 \left(f(x) - hz f'(x) + o(h)\right)dz - \frac{1}{n}\left(f(x) + o(h^2)\right)^2.

Note that if n becomes large and h becomes small then the above expression becomes approximately

    Var(\hat f(x)) \approx \frac{1}{n h} f(x) \int K^2(z)\,dz.   (1.18)

We note that the variance decreases as h increases. The above approximations for the bias and variance of \hat f(x) lead to

    MSE(\hat f(x)) = Bias^2(\hat f(x)) + Var(\hat f(x)) \approx \frac{1}{4} h^4 k_2^2 f''(x)^2 + \frac{1}{n h} f(x)\, j_2,   (1.19)

where k_2 := \int z^2 K(z)\,dz and j_2 := \int K(z)^2\,dz. Integrating (1.19) with respect to x yields

    MISE(\hat f) \approx \frac{1}{4} h^4 k_2^2 \beta(f) + \frac{1}{n h} j_2,   (1.20)

where \beta(f) := \int f''(x)^2\,dx.

Of central importance is the way in which MISE(\hat f) changes as a function of the bandwidth. For very small values of h the second term in (1.20) becomes large, but as h gets larger the first term in (1.20) increases. There is an optimal value of h which minimizes MISE(\hat f).
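The trade-off in (1.20) can be visualized for a case where \beta(f) is known in closed form: for a standard normal density \beta(f) = 3/(8\sqrt{\pi}), and for the Gaussian kernel k_2 = 1 and j_2 = 1/(2\sqrt{\pi}). A sketch:

    # MISE approximation (1.20) as a function of h (standard normal f, Gaussian K)
    n  <- 100
    k2 <- 1;  j2 <- 1 / (2 * sqrt(pi));  beta.f <- 3 / (8 * sqrt(pi))
    h  <- seq(0.05, 2, length.out = 400)
    amise <- 0.25 * h^4 * k2^2 * beta.f + j2 / (n * h)
    plot(h, amise, type = "l", ylab = "approximate MISE")
    abline(v = h[which.min(amise)], lty = 2)   # close to (4/(3n))^(1/5), cf. (1.24)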

1.3.3 Optimal bandwidth

Expression (1.20) is the measure that we use to quantify the performance of the estimator. We can find the optimal bandwidth by minimizing (1.20) with respect to h. The first derivative is given by

    \frac{d}{dh} MISE(\hat f) = h^3 k_2^2 \beta(f) - \frac{1}{n h^2} j_2.

Setting this equal to zero yields the optimal bandwidth, h_{opt}, for the given pdf and kernel:

    h_{opt} = \left(\frac{1}{n} \cdot \frac{\gamma(K)}{\beta(f)}\right)^{1/5},   (1.21)

where \gamma(K) := j_2 / k_2^2. Substituting (1.21) for h in (1.20) gives the minimal MISE for the given pdf and kernel. After some manipulation this can be shown to be

    MISE_{opt}(\hat f) = \frac{5}{4}\left(\frac{\beta(f)\, j_2^4\, k_2^2}{n^4}\right)^{1/5}.   (1.22)

We note that h_{opt} depends on the sample size, n, and the kernel, K. However, it also depends on the unknown pdf, f, through the functional \beta(f). Thus, as it stands, expression (1.21) is not applicable in practice. However, the plug-in estimator of h_{opt}, to be discussed later, is simply expression (1.21) with \beta(f) replaced by an estimator.

1.3.4 Optimal kernels

The MISE(\hat f) can also be minimized with respect to the kernel used. It can be shown (see, e.g., Wand and Jones, 1995) that the Epanechnikov kernel is optimal in this respect:

    K(z) = \begin{cases} \frac{3}{4}\left(1 - \frac{1}{5} z^2\right)/\sqrt{5} & \text{for } |z| < \sqrt{5}, \\ 0 & \text{otherwise.} \end{cases}

This result together with (1.22) enables one to examine the impact of kernel choice on MISE_{opt}(\hat f). The efficiency of a kernel, K, relative to the optimal Epanechnikov kernel, K_{EP}, is defined as

    Eff(K) = \left(\frac{MISE_{opt}(\hat f) \text{ using } K_{EP}}{MISE_{opt}(\hat f) \text{ using } K}\right)^{5/4}
           = \left(\frac{k_2^2\, j_2^4 \text{ using } K_{EP}}{k_2^2\, j_2^4 \text{ using } K}\right)^{1/4}.   (1.23)

The efficiencies for a number of well-known kernels are given in Table 1.1. It is clear that the selection of the kernel has a rather limited impact on the efficiency. The rectangular kernel, for example, has an efficiency of approximately 93%. This can be interpreted as follows: the MISE_{opt}(\hat f) obtained using an Epanechnikov kernel with n = 93 is approximately equal to the MISE_{opt}(\hat f) obtained using a rectangular kernel with n = 100.
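Expression (1.23) can be checked by computing k_2 and j_2 numerically. The sketch below reproduces the rectangular-kernel entry of Table 1.1 (function names are illustrative):

    # Kernel efficiency (1.23): ((k2^2 j2^4) for K_EP / (k2^2 j2^4) for K)^(1/4)
    kern.const <- function(K, lo, hi) {
      k2 <- integrate(function(z) z^2 * K(z), lo, hi)$value
      j2 <- integrate(function(z) K(z)^2,     lo, hi)$value
      k2^2 * j2^4
    }
    K.ep   <- function(t) 0.75 * (1 - t^2 / 5) / sqrt(5)
    K.rect <- function(t) rep(0.5, length(t))
    (kern.const(K.ep, -sqrt(5), sqrt(5)) / kern.const(K.rect, -1, 1))^(1/4)  # ~0.9295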

[Figure 1.9: The pdf for the car example and a histogram for a sample of size 20.]

1.4 Selection of the bandwidth

Selection of the bandwidth of a kernel estimator is a subject of considerable research. We will outline four popular methods.

1.4.1 Subjective selection

One can experiment by using different bandwidths and simply select one that looks right for the type of data under investigation. Figure 1.9 shows the pdf (for the car example) and a histogram for a random sample of size n = 20. Figure 1.10 shows kernel density estimates (based on a Gaussian kernel) of f(x) using four different bandwidths, together with the density of the population. The latter is usually unknown in practice (otherwise we wouldn't need to estimate it using a sample). Clearly h = 0.5 is too small and h = 3 is too large; appropriate here is a bandwidth greater than 1 but less than 3.
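In R this kind of experimentation amounts to varying the bw argument of density(); a sketch, assuming a numeric sample x:

    # Subjective bandwidth selection: inspect several bandwidths side by side
    op <- par(mfrow = c(2, 2))
    for (h in c(0.5, 1, 2, 3))
      plot(density(x, bw = h, kernel = "gaussian"),
           main = paste("Bandwidth: h =", h))
    par(op)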

[Figure 1.10: The pdf for the car example and kernel density estimates using a Gaussian kernel and bandwidths h = 0.5, 1, 2, 3.]

1.4.2 Selection with reference to some given distribution

Here one selects the bandwidth that would be optimal for a particular pdf. Convenient here is the normal. We note that one is not assuming that f(x) is normal; rather, one is selecting the h which would be optimal if the pdf were normal. In this case it can be shown that

    \beta(f) = \int f''(x)^2\,dx = \frac{3}{8\sqrt{\pi}\,\sigma^5},

and using a Gaussian kernel leads to

    h_{opt} = \left(\frac{4}{3n}\right)^{1/5} \sigma \approx \frac{1.06\,\sigma}{n^{1/5}}.   (1.24)

The normal distribution is not a wiggly distribution; it is unimodal and bell-shaped. It is therefore to be expected that (1.24) will be too large for multimodal distributions. Secondly, to apply (1.24) one has to estimate \sigma. The usual estimator, the sample standard deviation, is not robust: it overestimates \sigma if some outliers (extreme observations) are present, and thereby increases \hat h_{opt} even more. To overcome these problems Silverman (1986) proposed the following estimator:

    \hat h_{opt} = \frac{0.9\,\hat\sigma}{n^{1/5}},   (1.25)

where \hat\sigma = \min(s, R/1.34), with s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar x)^2 and R the interquartile range of the data. The constant 1.34 derives from the fact that for a N(\mu, \sigma^2) random variable the interquartile range is approximately 1.34\,\sigma.

Expression (1.25) is used as the default option in the R function density. It is also used as a starting value in some more sophisticated iterative estimators of the optimal bandwidth. The top right graph in Figure 1.11 shows the density estimated using this method of estimating h_{opt}.
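Rule (1.25) is a one-liner to implement, and it is what R's bw.nrd0() computes (up to safeguards for degenerate samples):

    # Silverman's rule of thumb (1.25)
    h.silverman <- function(x)
      0.9 * min(sd(x), IQR(x) / 1.34) / length(x)^(1/5)
    h.silverman(x)
    bw.nrd0(x)    # R's default bandwidth for density(); should agree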

[Figure 1.11: The cross-validation criterion (top left) and the estimated pdf using three different bandwidth selectors: normal-based (top right, h = 2.9), cross-validation (bottom left, h = 0.78) and plug-in (bottom right, h = 1.41).]

1.4.3 Cross-validation

The technique of cross-validation will be discussed in more detail in the chapter on model selection. At this point we will only outline its application to the problem of estimating optimal bandwidths. By definition,

    MISE(\hat f) = E\int (\hat f(x) - f(x))^2\,dx = E\left[\int \hat f(x)^2\,dx - 2\int \hat f(x) f(x)\,dx + \int f(x)^2\,dx\right].

The third term does not depend on the sample or on the bandwidth. An approximately unbiased estimator of the first two terms is given by

    M_{CV}(\hat f) = \frac{1}{n}\sum_{i=1}^n \int \hat f_{-i}(x)^2\,dx - \frac{2}{n}\sum_{i=1}^n \hat f_{-i}(x_i),   (1.26)

where \hat f_{-i}(x) is the estimated density at the argument x using the original sample apart from observation x_i. One computes M_{CV}(\hat f) for different values of h and estimates the optimal value, h_{opt}, by the h which minimizes M_{CV}(\hat f). The top left graph in Figure 1.11 shows the curve M_{CV}(\hat f) for the sample of car data; the bottom left graph shows the corresponding estimated density. In this example cross-validation has selected a bandwidth that is too small, and the resulting estimate \hat f has not been sufficiently smoothed.
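In practice one rarely codes (1.26) by hand: base R provides bw.ucv(), which minimizes an unbiased (least-squares) cross-validation criterion of this type. A sketch:

    # Cross-validation bandwidth selection
    h.cv <- bw.ucv(x)
    plot(density(x, bw = h.cv),
         main = paste("CV bandwidth =", round(h.cv, 2)))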

1.4.4 Plug-in estimator

The idea developed by Sheather and Jones (1991) is to estimate h from (1.21) by applying a separate smoothing technique to estimate f''(x) and hence \beta(f). For details see, e.g., Wand and Jones (1995), Section 3.6. An R function to carry out the computations is available in the R library sm of Bowman and Azzalini (1997); a sketch using the selector built into base R is given at the end of this chapter. Figure 1.11 shows that for the car example considered here the plug-in estimator yields the most sensible estimate of f(x).

1.4.5 Summary and extensions

The above bandwidth selectors represent only a sample of the many suggestions that have been offered in the recent literature. Some alternatives are described in Wand and Jones (1995), in which the theory is given in more detail. These authors also offer recommendations regarding which estimators should be used; the plug-in estimator outlined above is one of their recommendations.

The above methodology can be extended to bivariate (and multivariate) pdfs. It is also possible to provide approximate confidence bands for the estimated pdfs. These subjects will be covered in the presentations by the participants of the course.
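As promised in Section 1.4.4, here is a sketch of plug-in bandwidth selection using base R, whose stats package implements the Sheather-Jones selector as bw.SJ() (the sample x is assumed numeric):

    # Plug-in bandwidth (Sheather and Jones, 1991) and the resulting estimate
    h.sj <- bw.SJ(x)
    plot(density(x, bw = h.sj),
         main = paste("Plug-in bandwidth =", round(h.sj, 2)))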