First International Workshop on Functional and Operatorial Statistics. Toulouse, June 19-21, 2008

Functional Principal Components Analysis with Survey Data

Hervé CARDOT, Mohamed CHAOUCH, Camelia GOGA & Catherine LABRUÈRE
Institut de Mathématiques de Bourgogne, Université de Bourgogne, 9 Avenue Alain Savary, BP 47870, 21078 DIJON Cedex, FRANCE.
email: {herve.cardot, mohamed.chaouch, camelia.goga, catherine.labruere}@u-bourgogne.fr

Abstract

This work aims at performing Functional Principal Components Analysis (FPCA) with Horvitz-Thompson estimators when the curves are collected by survey sampling techniques. Linearization approaches based on the influence function allow us to derive estimators of the asymptotic variance of the eigenelements of the FPCA. The method is illustrated with simulations, which confirm the good properties of the linearization technique.

1. Introduction

Functional Data Analysis, whose main purpose is to provide tools for describing and modeling sets of curves, is a topic of growing interest in the statistical community. The books by Ramsay and Silverman (2002, 2005) give an interesting description of the available procedures for dealing with functional observations. These functional approaches have proved useful in various domains such as chemometrics, economics, climatology, biology and remote sensing. In a first step, the statistician generally wants to represent a set of random curves as well as possible in a small space, in order to get a description of the functional data that allows interpretation. Functional principal components analysis (FPCA) provides a small-dimension space which captures the main modes of variability of the data (see Ramsay and Silverman, 2002, for more details).
The way the data are collected is seldom taken into account in the literature: one generally supposes that the data are independent realizations of a common functional distribution. There are, however, cases in which this assumption is not fulfilled, for example when the realizations result from a sampling scheme. For instance, Dessertaine (2006) considers the estimation, with time series procedures, of the global demand for electricity at fine time scales from the observation of individual electricity consumption curves. More generally, there are now data (data streams) produced automatically by large numbers of distributed sensors, which generate huge amounts of data that can be seen as functional. The use of sampling techniques to collect them, proposed for instance in Chiky and Hébrail (2007), seems to be a relevant approach in such a framework, allowing a trade-off between storage capacities and accuracy of the data. We propose in this work estimators for functional principal components analysis when the curves are collected with survey sampling strategies. Let us note that Skinner et al. (1986) have studied some properties of multivariate PCA in a survey framework. The functional framework is different since the eigenfunctions, which exhibit the main modes of variability of the data, are themselves functions and can be naturally interpreted as modes of variability varying along time. In this functional framework, we estimate the mean function and the covariance operator using the Horvitz-Thompson estimator. The eigenelements are estimated by diagonalization of the estimated covariance operator. In order to calculate and estimate the variance of the so-constructed estimators, we use the influence function linearization method introduced by Deville (1999).
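As a numerical preview of the scheme just outlined (Horvitz-Thompson estimates of the mean and of the covariance, followed by diagonalization), here is a minimal sketch. It is illustrative only: the names `ht_fpca`, `curves` and `pi` are ours, not the paper's, and the discretized covariance matrix stands in for the covariance operator.

```python
import numpy as np

def ht_fpca(curves, pi, q=2):
    """Horvitz-Thompson estimates of the mean function, the (discretized)
    covariance operator and its first q eigenelements."""
    w = 1.0 / pi                          # design weights 1/pi_k
    n_hat = w.sum()                       # N-hat, estimated population size
    mu_hat = (w[:, None] * curves).sum(axis=0) / n_hat
    centered = curves - mu_hat
    # equivalent to (1/N-hat) sum_s (Y_k x Y_k)/pi_k  -  mu-hat x mu-hat
    gamma_hat = (w[:, None] * centered).T @ centered / n_hat
    eigval, eigvec = np.linalg.eigh(gamma_hat)   # symmetric eigenproblem
    order = np.argsort(eigval)[::-1]             # decreasing eigenvalues
    return mu_hat, eigval[order][:q], eigvec[:, order[:q]]

# toy sample: 50 curves observed at 100 design points, all pi_k = 0.1
rng = np.random.default_rng(0)
sample = rng.standard_normal((50, 100)).cumsum(axis=1)  # Brownian-like paths
mu_hat, lam_hat, v_hat = ht_fpca(sample, pi=np.full(50, 0.1))
```

Equal inclusion probabilities are used here only to keep the toy example short; any positive `pi` vector works the same way.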
This paper is organized as follows: Section 2 presents functional principal components analysis in the setting of finite populations and then defines the Horvitz-Thompson estimator in this functional framework. The generality of the influence function allows us to extend, in Section 3, the estimators proposed by Deville to our functional objects and to get asymptotic variances with the help of perturbation theory (Kato, 1966). Section 4 proposes a simulation study which shows the good behavior of our estimators for various sampling schemes, as well as good approximations to their theoretical variances.

2. FPCA and sampling

2.1 FPCA in a finite population setting

Let us consider a finite population U = {1, ..., k, ..., N} with size N not necessarily known, and a functional variable Y defined for each element k of the population U: Y_k = (Y_k(t))_{t ∈ [0,1]} belongs to the separable Hilbert space L²[0,1] of square integrable functions defined on the closed interval [0,1], equipped with the usual inner product ⟨·,·⟩ and the norm ‖·‖. The mean function µ ∈ L²[0,1] is defined by

    µ(t) = (1/N) Σ_{k∈U} Y_k(t),  t ∈ [0,1],   (1)

and the covariance operator Γ by

    Γ = (1/N) Σ_{k∈U} (Y_k − µ) ⊗ (Y_k − µ),   (2)
where the tensor product of two elements a and b of L²[0,1] is the rank-one operator such that a ⊗ b(u) = ⟨a, u⟩ b for all u in L²[0,1]. The operator Γ is symmetric and non-negative (⟨Γu, u⟩ ≥ 0). Its eigenvalues, sorted in decreasing order, λ_1 ≥ λ_2 ≥ ... ≥ 0, satisfy

    Γ v_j(t) = λ_j v_j(t),  t ∈ [0,1],   (3)

where the eigenfunctions v_j form an orthonormal system in L²[0,1], i.e. ⟨v_j, v_{j'}⟩ = 1 if j = j' and zero otherwise. We can now get an expansion similar to the Karhunen-Loève expansion, or FPCA, which gives the best approximation of the curves of the population in a finite dimension space of dimension q:

    Y_k(t) ≈ µ(t) + Σ_{j=1}^{q} ⟨Y_k − µ, v_j⟩ v_j(t),  t ∈ [0,1].

The eigenfunctions v_j indicate the main modes of variation of the data along time t around the mean µ, and the variance explained by the projection onto each v_j is given by the eigenvalue λ_j = (1/N) Σ_{k∈U} ⟨Y_k − µ, v_j⟩². We aim at estimating the mean function µ and the covariance operator Γ in order to deduce estimators of the eigenelements (λ_j, v_j) when the data are obtained with survey sampling procedures.

2.2 The Horvitz-Thompson estimator

We consider a sample s of n individuals, i.e. a subset s ⊂ U, selected according to a probabilistic procedure p(s), where p is a probability distribution on the set of the 2^N subsets of U. We denote by π_k = Pr(k ∈ s), for all k ∈ U, the first order inclusion probabilities and by π_kl = Pr(k ∈ s and l ∈ s), for all k, l ∈ U with k ≠ l, the second order inclusion probabilities. We suppose that π_k > 0 and π_kl > 0. We suppose also that π_k and π_kl do not depend on t ∈ [0,1]. We propose to estimate the mean function µ and the covariance operator Γ by replacing each total with the corresponding Horvitz-Thompson (HT) estimator (Horvitz and Thompson, 1952). We obtain

    µ̂ = (1/N̂) Σ_{k∈s} Y_k / π_k,   (4)

    Γ̂ = (1/N̂) Σ_{k∈s} (Y_k ⊗ Y_k) / π_k − µ̂ ⊗ µ̂,   (5)

where the size N of the population is estimated by N̂ = Σ_{k∈s} 1/π_k when it is not known. Then estimators of the eigenfunctions {v̂_j, j = 1, ..., q} and eigenvalues {λ̂_j, j = 1, ..., q}
are obtained readily by diagonalization (or spectral analysis) of the estimated covariance operator Γ̂. Let us note that the eigenelements of the covariance operator are not linear functionals of the data.

3. Linearization by influence function

We would like to calculate and estimate the variances of µ̂, v̂_j and λ̂_j. The nonlinearity of these estimators and the functional nature of Y make the variance estimation issue difficult. For this reason, we adapt the influence function linearization technique introduced by Deville (1999) to the functional framework. Let us consider the discrete measure M defined on L²[0,1] by M = Σ_{k∈U} δ_{Y_k}, where δ_{Y_k} is the Dirac function taking value 1 if Y = Y_k and zero otherwise. Let us suppose that each parameter of interest can be written as a functional T of M. For example, N(M) = ∫ dM, µ(M) = ∫ Y dM / ∫ dM and Γ(M) = ∫ (Y − µ(M)) ⊗ (Y − µ(M)) dM / ∫ dM. The eigenelements given by (3) are implicit functionals T of M. The measure M is estimated by the random measure M̂ = Σ_{k∈U} (δ_{Y_k} / π_k) I_k, with I_k = 1_{k∈s}. Then the estimators given by (4) and (5) are obtained by substituting M̂ for M, namely they are written as functionals T of M̂.

3.1 Asymptotic Properties

We give in this section the asymptotic properties of our estimators. To this end, we need the population and sample sizes to tend to infinity. We use the asymptotic framework introduced by Isaki and Fuller (1982). Let us suppose the following assumptions:

(A1) sup_{k∈U} ‖Y_k‖ ≤ C < ∞,
(A2) lim n/N = π ∈ (0,1),
(A3) min_k π_k ≥ λ > 0, min_{k≠l} π_kl ≥ λ* > 0 and lim n max_{k≠l} |π_kl − π_k π_l| < ∞,

where λ and λ* are two positive constants. We also suppose that the functional T giving the parameter of interest is a homogeneous functional of degree α, namely T(rM) = r^α T(M) for all r > 0, with lim N^{−α} T(M) < ∞. For example, µ and Γ are functionals of degree zero with respect to M. Let us note that the eigenelements of Γ are also functionals of degree zero with respect to M.
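The degree-zero property just stated — T(rM) = T(M) for µ and Γ — can be checked numerically on a discretized example. The sketch below is ours (the names `mu_of` and `gamma_of` are illustrative); scaling the measure M by r amounts to multiplying every point mass by r.

```python
import numpy as np

def mu_of(weights, curves):
    # mu(M) = (integral of Y dM) / (integral of dM)
    return (weights[:, None] * curves).sum(axis=0) / weights.sum()

def gamma_of(weights, curves):
    # Gamma(M) = (integral of (Y - mu) x (Y - mu) dM) / (integral of dM)
    c = curves - mu_of(weights, curves)
    return (weights[:, None] * c).T @ c / weights.sum()

rng = np.random.default_rng(1)
Y = rng.standard_normal((20, 30))   # 20 curves on a grid of 30 points
w = np.ones(20)                     # each delta_{Y_k} carries unit mass
r = 7.3                             # arbitrary positive scaling of M
mu_invariant = np.allclose(mu_of(r * w, Y), mu_of(w, Y))
gamma_invariant = np.allclose(gamma_of(r * w, Y), gamma_of(w, Y))
```

Since Γ(rM) = Γ(M), the eigenelements extracted from Γ are invariant under the same scaling, which is the degree-zero claim made above for (λ_j, v_j).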
Let us also introduce the Hilbert-Schmidt norm, denoted by ‖·‖_2, for operators mapping L²[0,1] to L²[0,1]. We show in the next proposition that our estimators are asymptotically design unbiased, lim (E_p(T(M̂)) − T(M)) = 0, and consistent, namely for any fixed ε > 0 we have lim Pr(‖T(M̂) − T(M)‖ > ε) = 0. Here, E_p(·) is the expectation with respect to p(s).

Proposition 1 Under hypotheses (A1), (A2) and (A3),

    E_p ‖µ̂ − µ‖² = O(n⁻¹),   E_p ‖Γ̂ − Γ‖²_2 = O(n⁻¹).
If we suppose that the non-null eigenvalues are distinct, we also have

    E_p (sup_j |λ̂_j − λ_j|)² = O(n⁻¹),   E_p ‖v̂_j − v_j‖² = O(n⁻¹) for each fixed j.

3.2 Variance approximation and estimation

Let us define, when it exists, the influence function of a functional T at a point Y ∈ L²[0,1], say IT(M, Y), by

    IT(M, Y) = lim_{h→0} [T(M + h δ_Y) − T(M)] / h,

where δ_Y is the Dirac function at Y.

Proposition 2 Under assumption (A1), the influence functions of µ and Γ exist and

    Iµ(M, Y_k) = (Y_k − µ) / N,
    IΓ(M, Y_k) = (1/N) ((Y_k − µ) ⊗ (Y_k − µ) − Γ).

If the non-null eigenvalues of Γ are distinct, then

    Iλ_j(M, Y_k) = (1/N) (⟨Y_k − µ, v_j⟩² − λ_j),
    Iv_j(M, Y_k) = (1/N) Σ_{l≠j} (⟨Y_k − µ, v_j⟩ ⟨Y_k − µ, v_l⟩ / (λ_j − λ_l)) v_l.

In order to obtain the asymptotic variance of T(M̂) for T given by (1), (2) and (3), we write the first-order von Mises expansion of our functional at M̂/N̂ near M/N and use the fact that T is of degree zero and IT(M/N, Y_k) = N IT(M, Y_k):

    T(M̂) = T(M) + Σ_{k∈U} IT(M, Y_k) (I_k/π_k − 1) + R_T(M̂/N̂, M/N).

Proposition 3 Suppose the hypotheses (A1), (A2) and (A3) are fulfilled. Consider the functional T giving the parameters of interest defined in (1), (2) and (3), and suppose that the non-null eigenvalues are distinct. Then R_T(M̂/N̂, M/N) = o_p(n^{−1/2}) and the asymptotic variance of T(M̂) is equal to

    V_p [Σ_{k∈s} IT(M, Y_k) I_k/π_k] = Σ_{k∈U} Σ_{l∈U} (π_kl − π_k π_l) IT(M, Y_k) IT(M, Y_l) / (π_k π_l).

One can remark that the asymptotic variance given by the above result is not known. We propose to estimate it by the HT variance estimator with IT(M, Y_k) replaced by its HT estimator. We obtain

    V̂_p(µ̂) = (1/N̂²) Σ_{k∈s} Σ_{l∈s} (Δ_kl / π_kl) (Y_k − µ̂) ⊗ (Y_l − µ̂) / (π_k π_l),
    V̂_p(λ̂_j) = (1/N̂²) Σ_{k∈s} Σ_{l∈s} (Δ_kl / π_kl) (⟨Y_k − µ̂, v̂_j⟩² − λ̂_j)(⟨Y_l − µ̂, v̂_j⟩² − λ̂_j) / (π_k π_l),
    V̂_p(v̂_j) = Σ_{k∈s} Σ_{l∈s} (Δ_kl / π_kl) Îv_j(M, Y_k) ⊗ Îv_j(M, Y_l) / (π_k π_l),
where Δ_kl = π_kl − π_k π_l and Îv_j(M, Y_k) = (1/N̂) Σ_{l≠j} (⟨Y_k − µ̂, v̂_j⟩ ⟨Y_k − µ̂, v̂_l⟩ / (λ̂_j − λ̂_l)) v̂_l. Cardot et al. (2007) show that, under the assumptions (A1)-(A3), these estimators are asymptotically design unbiased and consistent.

4. A Simulation study

In our simulations, all functional variables are discretized at p = 100 equispaced points in the interval [0,1]. We consider a random variable Y distributed as a Brownian motion on [0,1]. We draw N = 10000 realizations of Y and then construct two strata U_1 and U_2 with different variances and with sizes N_1 = 7000 and N_2 = 3000. Our population U is the union of the two strata. We then estimate the eigenelements of the covariance operator for two different sampling designs (Simple Random Sampling Without Replacement (SRSWR) and stratified sampling) and two different sample sizes, n = 100 and n = 1000. To evaluate our estimation procedures we make 500 replications of the previous experiment. The estimation errors for the first eigenvalue and the first eigenvector are evaluated with the loss criteria |λ̂_1 − λ_1| / λ_1 and ‖v̂_1 − v_1‖ / ‖v_1‖, where ‖·‖ is the Euclidean norm. The linear approximation by the influence function gives a reasonable estimation of the variance for small sample sizes and accurate estimations when n gets large enough (n = 1000). We also note that the variance of the estimators obtained with stratified sampling turns out to be smaller than with SRSWR.

References

Cardot, H., Chaouch, M., Goga, C. and Labruère, C. (2007). Functional Principal Components Analysis with Survey Data. Preprint.

Chiky, R. and Hébrail, G. (2007). Generic tool for summarizing distributed data streams. Preprint.

Dauxois, J., Pousse, A. and Romain, Y. (1982). Asymptotic theory for the principal component analysis of a random vector function: some applications to statistical inference. J. Multivariate Anal., 12, 136-154.

Dessertaine, A. (2006). Sondage et séries temporelles : une application pour la prévision de la consommation électrique.
38èmes Journées de Statistique, Clamart, June 2006.

Deville, J.C. (1999). Variance estimation for complex statistics and estimators: linearization and residual techniques. Survey Methodology, 25, 193-203.

Horvitz, D.G. and Thompson, D.J. (1952). A generalization of sampling without replacement from a finite universe. J. Am. Statist. Ass., 47, 663-685.

Isaki, C.T. and Fuller, W.A. (1982). Survey design under the regression superpopulation model. J. Am. Statist. Ass., 77, 89-96.

Kato, T. (1966). Perturbation Theory for Linear Operators. Springer-Verlag, Berlin.

Ramsay, J.O. and Silverman, B.W. (2002). Applied Functional Data Analysis: Methods and Case Studies. Springer-Verlag, New York.

Ramsay, J.O. and Silverman, B.W. (2005). Functional Data Analysis. Springer-Verlag, New York, 2nd ed.

Skinner, C.J., Holmes, D.J. and Smith, T.M.F. (1986). The effect of sample design on principal components analysis. J. Am. Statist. Ass., 81, 789-798.