Functional Principal Components Analysis with Survey Data




First International Workshop on Functional and Operatorial Statistics. Toulouse, June 19-21, 2008

Functional Principal Components Analysis with Survey Data

Hervé CARDOT, Mohamed CHAOUCH, Camelia GOGA & Catherine LABRUÈRE
Institut de Mathématiques de Bourgogne, Université de Bourgogne, 9 Avenue Alain Savary, BP 47870, 21078 DIJON Cedex, FRANCE.
email: {herve.cardot, mohamed.chaouch, camelia.goga, catherine.labruere}@u-bourgogne.fr

Abstract

This work performs Functional Principal Components Analysis (FPCA) by means of Horvitz-Thompson estimators when the curves are collected with survey sampling techniques. Linearization approaches based on the influence function allow us to derive estimators of the asymptotic variance of the eigenelements of the FPCA. The method is illustrated with simulations, which confirm the good properties of the linearization technique.

1. Introduction

Functional Data Analysis, whose main purpose is to provide tools for describing and modeling sets of curves, is a topic of growing interest in the statistical community. The books by Ramsay and Silverman (2002, 2005) give an interesting description of the procedures available for dealing with functional observations. These functional approaches have proved useful in various domains such as chemometrics, economics, climatology, biology and remote sensing. In a first step, the statistician generally wants to represent a set of random curves as well as possible in a small space, in order to obtain a description of the functional data that allows interpretation. Functional principal components analysis (FPCA) provides a small-dimensional space which captures the main modes of variability of the data (see Ramsay and Silverman, 2002, for more details).

The way the data are collected is seldom taken into account in the literature, and one generally supposes that the data are independent realizations of a common functional distribution. However, there are cases in which this assumption is not fulfilled, for example when the realizations result from a sampling scheme. For instance, Dessertaine (2006) considers the estimation, using time series procedures, of the global demand for electricity at fine time scales from observations of individual electricity consumption curves. More generally, there are now data (data streams) produced automatically by large numbers of distributed sensors, generating huge amounts of data that can be seen as functional. The use of sampling techniques to collect them, proposed for instance in Chiky and Hébrail (2007), seems a relevant approach in such a framework, allowing a trade-off between storage capacity and accuracy of the data. In this work we propose estimators for functional principal components analysis when the curves are collected with survey sampling strategies. Let us note that Skinner et al. (1986) studied some properties of multivariate PCA in a survey framework. The functional framework is different since the eigenfunctions, which exhibit the main modes of variability of the data, are themselves functions and can be naturally interpreted as modes of variability varying along time. In this new functional framework, we estimate the mean function and the covariance operator using the Horvitz-Thompson estimator. The eigenelements are estimated by diagonalization of the estimated covariance operator. In order to calculate and estimate the variance of the so-constructed estimators, we use the influence function linearization method introduced by Deville (1999).
This paper is organized as follows. Section 2 presents functional principal components analysis in the setting of finite populations and then defines the Horvitz-Thompson estimator in this functional framework. The generality of the influence function allows us to extend, in Section 3, the estimators proposed by Deville to our functional objects and to obtain asymptotic variances with the help of perturbation theory (Kato, 1966). Section 4 presents a simulation study which shows the good behavior of our estimators for various sampling schemes, as well as good approximations to their theoretical variances.

2. FPCA and sampling

2.1 FPCA in a finite population setting

Let us consider a finite population $U = \{1, \dots, k, \dots, N\}$ of size $N$, not necessarily known, and a functional variable $Y$ defined for each element $k$ of the population $U$: $Y_k = (Y_k(t))_{t \in [0,1]}$ belongs to the separable Hilbert space $L^2[0,1]$ of square integrable functions defined on the closed interval $[0,1]$, equipped with the usual inner product $\langle \cdot, \cdot \rangle$ and norm $\|\cdot\|$. The mean function $\mu \in L^2[0,1]$ is defined by
\[
\mu(t) = \frac{1}{N} \sum_{k \in U} Y_k(t), \quad t \in [0,1], \tag{1}
\]
and the covariance operator $\Gamma$ by
\[
\Gamma = \frac{1}{N} \sum_{k \in U} (Y_k - \mu) \otimes (Y_k - \mu), \tag{2}
\]

where the tensor product of two elements $a$ and $b$ of $L^2[0,1]$ is the rank-one operator such that $a \otimes b(u) = \langle a, u \rangle b$ for all $u$ in $L^2[0,1]$. The operator $\Gamma$ is symmetric and non-negative ($\langle \Gamma u, u \rangle \geq 0$). Its eigenvalues, sorted in decreasing order, $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_N \geq 0$, satisfy
\[
\Gamma v_j(t) = \lambda_j v_j(t), \quad t \in [0,1], \tag{3}
\]
where the eigenfunctions $v_j$ form an orthonormal system in $L^2[0,1]$, i.e. $\langle v_j, v_{j'} \rangle = 1$ if $j = j'$ and zero otherwise. We then obtain an expansion similar to the Karhunen-Loève expansion, or FPCA, which gives the best approximation of the curves of the population in a finite dimension space of dimension $q$:
\[
Y_k(t) \approx \mu(t) + \sum_{j=1}^{q} \langle Y_k - \mu, v_j \rangle \, v_j(t), \quad t \in [0,1].
\]
The eigenfunctions $v_j$ indicate the main modes of variation of the data along time $t$ around the mean $\mu$, and the variance explained by the projection onto each $v_j$ is given by the eigenvalue $\lambda_j = \frac{1}{N} \sum_{k \in U} \langle Y_k - \mu, v_j \rangle^2$. We aim at estimating the mean function $\mu$ and the covariance operator $\Gamma$ in order to deduce estimators of the eigenelements $(\lambda_j, v_j)$ when the data are obtained with survey sampling procedures.

2.2 The Horvitz-Thompson estimator

We consider a sample $s$ of $n$ individuals, i.e. a subset $s \subset U$, selected according to a probabilistic procedure $p(s)$, where $p$ is a probability distribution on the set of the $2^N$ subsets of $U$. We denote by $\pi_k = \Pr(k \in s)$, for all $k \in U$, the first-order inclusion probabilities and by $\pi_{kl} = \Pr(k \in s \ \& \ l \in s)$, for all $k, l \in U$ with $k \neq l$, the second-order inclusion probabilities. We suppose that $\pi_k > 0$ and $\pi_{kl} > 0$, and that $\pi_k$ and $\pi_{kl}$ do not depend on $t \in [0,1]$. We propose to estimate the mean function $\mu$ and the covariance operator $\Gamma$ by replacing each total with the corresponding Horvitz-Thompson (HT) estimator (Horvitz and Thompson, 1952). We obtain
\[
\hat\mu = \frac{1}{\hat N} \sum_{k \in s} \frac{Y_k}{\pi_k}, \tag{4}
\]
\[
\hat\Gamma = \frac{1}{\hat N} \sum_{k \in s} \frac{Y_k \otimes Y_k}{\pi_k} - \hat\mu \otimes \hat\mu, \tag{5}
\]
where the size $N$ of the population is estimated by $\hat N = \sum_{k \in s} 1/\pi_k$ when it is not known. Estimators of the eigenfunctions $\{\hat v_j, \, j = 1, \dots, q\}$ and eigenvalues $\{\hat\lambda_j, \, j = 1, \dots, q\}$

are obtained readily by diagonalization (or spectral analysis) of the estimated covariance operator $\hat\Gamma$. Let us note that the eigenelements of the covariance operator are not linear functionals of the data.

3. Linearization by influence function

We would like to calculate and estimate the variances of $\hat\mu$, $\hat v_j$ and $\hat\lambda_j$. The nonlinearity of these estimators and the functional nature of $Y$ make the variance estimation issue difficult. For this reason, we adapt the influence function linearization technique introduced by Deville (1999) to the functional framework. Let us consider the discrete measure $M$ defined on $L^2[0,1]$ by $M = \sum_{k \in U} \delta_{Y_k}$, where $\delta_{Y_k}$ is the Dirac measure taking value 1 if $Y = Y_k$ and zero otherwise. Let us suppose that each parameter of interest can be written as a functional $T$ of $M$. For example, $N(M) = \int dM$, $\mu(M) = \int Y \, dM / \int dM$ and $\Gamma(M) = \int (Y - \mu(M)) \otimes (Y - \mu(M)) \, dM / \int dM$. The eigenelements given by (3) are implicit functionals $T$ of $M$. The measure $M$ is estimated by the random measure $\hat M = \sum_{k \in U} \delta_{Y_k} I_k / \pi_k$, with $I_k = 1_{\{k \in s\}}$. The estimators given by (4) and (5) are then obtained by substituting $\hat M$ for $M$, namely they are written as functionals $T$ of $\hat M$.

3.1 Asymptotic Properties

We give in this section the asymptotic properties of our estimators. To do so, one needs the population and sample sizes to tend to infinity. We use the asymptotic framework introduced by Isaki and Fuller (1982). Let us suppose the following assumptions:

(A1) $\sup_{k \in U} \|Y_k\| \leq C < \infty$,
(A2) $\lim_{N \to \infty} n/N = \pi \in (0,1)$,
(A3) $\min_{k} \pi_k \geq \lambda > 0$, $\min_{k \neq l} \pi_{kl} \geq \lambda^* > 0$ and $\limsup_{N \to \infty} \, n \max_{k \neq l} |\pi_{kl} - \pi_k \pi_l| < \infty$,

where $\lambda$ and $\lambda^*$ are two positive constants. We also suppose that the functional $T$ giving the parameter of interest is a homogeneous functional of degree $\alpha$, namely $T(rM) = r^{\alpha} T(M)$, and that $\lim_{N \to \infty} N^{-\alpha} T(M) < \infty$. For example, $\mu$ and $\Gamma$ are functionals of degree zero with respect to $M$. Let us note that the eigenelements of $\Gamma$ are also functionals of degree zero with respect to $M$.
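Before stating the asymptotic results, the estimation chain of Sections 2.2-3 (HT totals, substitution into $T$, then diagonalization) can be sketched numerically. The sketch below is illustrative and not from the paper: the function name `ht_fpca` and all sizes are our own choices, curves are discretized on a grid of $p$ points, the grid inner product stands in for the $L^2[0,1]$ one (up to a $1/p$ factor), and the leading eigenelement is extracted by power iteration rather than a full spectral decomposition.

```python
import math
import random

def ht_fpca(curves, sample_idx, pi):
    """HT-based FPCA sketch: `curves` is a list of N discretized curves
    (each a list of p values), `sample_idx` the drawn sample s, and `pi`
    the first-order inclusion probabilities pi_k. Returns the HT mean
    curve and the leading eigenvalue/eigenvector of the HT covariance."""
    p = len(curves[0])
    # N_hat = sum over the sample of the design weights 1/pi_k
    n_hat = sum(1.0 / pi[k] for k in sample_idx)
    # HT estimator of the mean function, as in eq. (4)
    mu = [sum(curves[k][t] / pi[k] for k in sample_idx) / n_hat
          for t in range(p)]
    # HT estimator of the covariance operator, as in eq. (5): a p x p matrix
    gam = [[sum(curves[k][t] * curves[k][u] / pi[k] for k in sample_idx) / n_hat
            - mu[t] * mu[u] for u in range(p)] for t in range(p)]
    # Diagonalization step, here reduced to the leading eigenelement
    v = [1.0] * p
    for _ in range(300):                     # power iteration
        w = [sum(gam[t][u] * v[u] for u in range(p)) for t in range(p)]
        nrm = math.sqrt(sum(x * x for x in w))
        v = [x / nrm for x in w]
    lam = sum(v[t] * sum(gam[t][u] * v[u] for u in range(p)) for t in range(p))
    return mu, lam, v

# Small artificial population of Brownian-type trajectories (toy sizes)
random.seed(1)
N, p, n = 200, 20, 50
curves = []
for _ in range(N):
    y, path = 0.0, []
    for _ in range(p):
        y += random.gauss(0.0, 1.0 / math.sqrt(p))
        path.append(y)
    curves.append(path)

s = random.sample(range(N), n)      # SRSWOR, so pi_k = n/N for every k
mu_hat, lam_hat, v1_hat = ht_fpca(curves, s, [n / N] * N)
```

Taking $s = U$ with $\pi_k = 1$ recovers the finite-population quantities of Section 2.1 exactly, which is a convenient sanity check on an implementation.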
Let us also introduce the Hilbert-Schmidt norm, denoted by $\|\cdot\|_2$, for operators mapping $L^2[0,1]$ to $L^2[0,1]$. We show in the next proposition that our estimators are asymptotically design unbiased, $\lim_{N \to \infty} \big( E_p(T(\hat M)) - T(M) \big) = 0$, and consistent, namely for any fixed $\varepsilon > 0$ we have $\lim_{N \to \infty} P\big( \|T(\hat M) - T(M)\| > \varepsilon \big) = 0$. Here, $E_p(\cdot)$ is the expectation with respect to $p(s)$.

Proposition 1. Under hypotheses (A1), (A2) and (A3),
\[
E_p \|\hat\mu - \mu\|^2 = O(n^{-1}), \qquad E_p \|\hat\Gamma - \Gamma\|_2^2 = O(n^{-1}).
\]

If we suppose that the non-null eigenvalues are distinct, we also have
\[
E_p \Big( \sup_j |\hat\lambda_j - \lambda_j| \Big)^2 = O(n^{-1}), \qquad E_p \|\hat v_j - v_j\|^2 = O(n^{-1}) \quad \text{for each fixed } j.
\]

3.2 Variance approximation and estimation

Let us define, when it exists, the influence function of a functional $T$ at a point $Y \in L^2[0,1]$, denoted $IT(M, Y)$, by
\[
IT(M, Y) = \lim_{h \to 0} \frac{T(M + h\delta_Y) - T(M)}{h},
\]
where $\delta_Y$ is the Dirac measure at $Y$.

Proposition 2. Under assumption (A1), the influence functions of $\mu$ and $\Gamma$ exist and
\[
I\mu(M, Y_k) = \frac{1}{N}(Y_k - \mu), \qquad I\Gamma(M, Y_k) = \frac{1}{N}\big( (Y_k - \mu) \otimes (Y_k - \mu) - \Gamma \big).
\]
If the non-null eigenvalues of $\Gamma$ are distinct, then
\[
I\lambda_j(M, Y_k) = \frac{1}{N}\big( \langle Y_k - \mu, v_j \rangle^2 - \lambda_j \big), \qquad
Iv_j(M, Y_k) = \frac{1}{N} \sum_{l \neq j} \frac{\langle Y_k - \mu, v_j \rangle \langle Y_k - \mu, v_l \rangle}{\lambda_j - \lambda_l} \, v_l.
\]

In order to obtain the asymptotic variance of $T(\hat M)$ for the functionals $T$ given by (1), (2) and (3), we write the first-order von Mises expansion of our functional at $\hat M / \hat N$ near $M/N$ and use the fact that $T$ is of degree zero and that $IT(M/N, Y_k) = N \, IT(M, Y_k)$:
\[
T(\hat M) = T(M) + \sum_{k \in U} IT(M, Y_k) \Big( \frac{I_k}{\pi_k} - 1 \Big) + R_T\Big( \frac{\hat M}{\hat N}, \frac{M}{N} \Big).
\]

Proposition 3. Suppose that hypotheses (A1), (A2) and (A3) are fulfilled and consider the functionals $T$ giving the parameters of interest defined in (1), (2) and (3). We suppose that the non-null eigenvalues are distinct. Then the remainder term satisfies $R_T(\hat M / \hat N, M/N) = o_p(n^{-1/2})$ and the asymptotic variance of $T(\hat M)$ is equal to
\[
V_p\Big[ \sum_{k \in s} IT(M, Y_k) \frac{I_k}{\pi_k} \Big] = \sum_{k \in U} \sum_{l \in U} (\pi_{kl} - \pi_k \pi_l) \, \frac{IT(M, Y_k)}{\pi_k} \, \frac{IT(M, Y_l)}{\pi_l}.
\]

One can remark that the asymptotic variance given by the above result is not known in practice. We propose to estimate it by the HT variance estimator, with $IT(M, Y_k)$ replaced by its HT estimator. We obtain
\[
\hat V_p(\hat\mu) = \frac{1}{\hat N^2} \sum_{k \in s} \sum_{l \in s} \frac{\Delta_{kl}}{\pi_{kl}} \, \frac{Y_k - \hat\mu}{\pi_k} \otimes \frac{Y_l - \hat\mu}{\pi_l},
\]
\[
\hat V_p(\hat\lambda_j) = \frac{1}{\hat N^2} \sum_{k \in s} \sum_{l \in s} \frac{\Delta_{kl}}{\pi_{kl}} \, \frac{\langle Y_k - \hat\mu, \hat v_j \rangle^2 - \hat\lambda_j}{\pi_k} \, \frac{\langle Y_l - \hat\mu, \hat v_j \rangle^2 - \hat\lambda_j}{\pi_l},
\]
\[
\hat V_p(\hat v_j) = \sum_{k \in s} \sum_{l \in s} \frac{\Delta_{kl}}{\pi_{kl}} \, \frac{\widehat{Iv}_j(M, Y_k)}{\pi_k} \otimes \frac{\widehat{Iv}_j(M, Y_l)}{\pi_l},
\]

where $\Delta_{kl} = \pi_{kl} - \pi_k \pi_l$ and $\widehat{Iv}_j(M, Y_k) = \frac{1}{\hat N} \sum_{m \neq j} \frac{\langle Y_k - \hat\mu, \hat v_j \rangle \langle Y_k - \hat\mu, \hat v_m \rangle}{\hat\lambda_j - \hat\lambda_m} \, \hat v_m$. Cardot et al. (2007) show that, under assumptions (A1)-(A3), these variance estimators are asymptotically design unbiased and consistent.

4. A simulation study

In our simulations, all functional variables are discretized at $p = 100$ equispaced points in the interval $[0,1]$. We consider a random variable $Y$ distributed as a Brownian motion on $[0,1]$. We draw $N = 10000$ realizations of $Y$ and then construct two strata $U_1$ and $U_2$ with different variances and with sizes $N_1 = 7000$ and $N_2 = 3000$. Our population $U$ is the union of the two strata. We then estimate the eigenelements of the covariance operator for two different sampling designs (simple random sampling without replacement, SRSWOR, and stratified sampling) and two different sample sizes, $n = 100$ and $n = 1000$. To evaluate our estimation procedures we make 500 replications of the previous experiment. The estimation errors for the first eigenvalue and the first eigenvector are then evaluated with the loss criteria $|\hat\lambda_1 - \lambda_1| / \lambda_1$ and $\|\hat v_1 - v_1\| / \|v_1\|$, where $\|\cdot\|$ is the Euclidean norm. The linear approximation by the influence function gives a reasonable estimate of the variance for small sample sizes and accurate estimates once $n$ gets large enough ($n = 1000$). We also note that the variances of the estimators obtained with stratified sampling turn out to be smaller than those obtained with SRSWOR.

References

Cardot, H., Chaouch, M., Goga, C. and Labruère, C. (2007). Functional Principal Components Analysis with Survey Data. Preprint.

Chiky, R. and Hébrail, G. (2007). Generic tool for summarizing distributed data streams. Preprint.

Dauxois, J., Pousse, A. and Romain, Y. (1982). Asymptotic theory for the principal component analysis of a random vector function: some applications to statistical inference. J. Multivariate Anal., 12, 136-154.

Dessertaine, A. (2006). Sondage et séries temporelles : une application pour la prévision de la consommation électrique [Survey sampling and time series: an application to electricity consumption forecasting].
38èmes Journées de Statistique, Clamart, June 2006.

Deville, J.-C. (1999). Variance estimation for complex statistics and estimators: linearization and residual techniques. Survey Methodology, 25, 193-203.

Horvitz, D.G. and Thompson, D.J. (1952). A generalization of sampling without replacement from a finite universe. J. Am. Statist. Ass., 47, 663-685.

Isaki, C.T. and Fuller, W.A. (1982). Survey design under the regression superpopulation model. J. Am. Statist. Ass., 77, 89-96.

Kato, T. (1966). Perturbation Theory for Linear Operators. Springer-Verlag, Berlin.

Ramsay, J.O. and Silverman, B.W. (2002). Applied Functional Data Analysis: Methods and Case Studies. Springer-Verlag, New York.

Ramsay, J.O. and Silverman, B.W. (2005). Functional Data Analysis. 2nd ed., Springer-Verlag, New York.

Skinner, C.J., Holmes, D.J. and Smith, T.M.F. (1986). The effect of sample design on principal components analysis. J. Am. Statist. Ass., 81, 789-798.
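As an end-to-end numerical companion to Sections 2-4 (none of this code comes from the paper; all names and sizes are our own, kept far below the paper's $N = 10000$ for brevity), the sketch below draws a small two-strata population of Brownian-type curves, selects an SRSWOR sample, computes the HT estimates $\hat\lambda_1$, $\hat v_1$ of Section 2.2 by power iteration, and evaluates the plug-in variance estimator $\hat V_p(\hat\lambda_1)$ of Section 3.2, using the SRSWOR inclusion probabilities $\pi_k = n/N$ and $\pi_{kl} = n(n-1)/(N(N-1))$ and the grid inner product in place of the $L^2[0,1]$ one.

```python
import math
import random

random.seed(2)
N, p, n = 300, 20, 60

def brownian(scale):
    """Discretized Brownian-type trajectory on [0,1] (p grid points)."""
    y, path = 0.0, []
    for _ in range(p):
        y += random.gauss(0.0, scale / math.sqrt(p))
        path.append(y)
    return path

# Two strata with different variances, echoing the paper's design
curves = [brownian(1.0) for _ in range(210)] + [brownian(2.0) for _ in range(90)]

# SRSWOR design: pi_k = n/N and pi_kl = n(n-1)/(N(N-1)) for k != l
pi1, pi2 = n / N, n * (n - 1) / (N * (N - 1))
s = random.sample(range(N), n)

# HT estimates of mu and Gamma, eqs. (4)-(5); equal weights N/n here
Nhat = n / pi1
mu = [sum(curves[k][t] for k in s) / (pi1 * Nhat) for t in range(p)]
gamma = [[sum(curves[k][t] * curves[k][u] for k in s) / (pi1 * Nhat)
          - mu[t] * mu[u] for u in range(p)] for t in range(p)]

# First eigenelement by power iteration (the diagonalization step)
v1 = [1.0] * p
for _ in range(300):
    w = [sum(gamma[t][u] * v1[u] for u in range(p)) for t in range(p)]
    nrm = math.sqrt(sum(x * x for x in w))
    v1 = [x / nrm for x in w]
lam1 = sum(v1[t] * sum(gamma[t][u] * v1[u] for u in range(p)) for t in range(p))

# HT-estimated influence terms of lambda_1 (Proposition 2):
#   u_k = <Y_k - mu_hat, v1_hat>^2 - lambda1_hat
u = {k: (sum((curves[k][t] - mu[t]) * v1[t] for t in range(p))) ** 2 - lam1
     for k in s}

# Plug-in HT variance estimator of Section 3.2 (double sum over the
# sample, with pi_kk = pi_k on the diagonal)
Vhat = 0.0
for k in s:
    for l in s:
        pkl = pi1 if k == l else pi2
        Vhat += (pkl - pi1 * pi1) / pkl * (u[k] / pi1) * (u[l] / pi1)
Vhat /= Nhat ** 2
```

For SRSWOR this double sum collapses algebraically to the familiar $(1 - n/N)\, s_u^2 / n$, with $s_u^2$ the sample variance of the $u_k$, which makes a useful correctness check on the general double-sum form.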