High-Dimensional Statistics




High-Dimensional Statistics
Christophe Giraud (1,2) and Tristan Mary-Huart (3,4)
(1) Université Paris-Sud (2) École Polytechnique (3) AgroParisTech (4) INRA - Le Moulon
M2 MathSV & Maths Aléa
C. Giraud (Paris Sud & Polytechnique) High-dimensional statistics M2 MathSV & Maths Aléa 1 / 25

High-dimensional data

High-dimensional data
Biotech data: thousands of quantities measured on each individual.
Images: medical images, astrophysics, video surveillance, etc. Each image consists of thousands or millions of pixels or voxels.
Marketing: websites and loyalty programs collect large amounts of information on customers' preferences and behaviors. Ex: recommender systems...
Business: exploiting a company's internal and external data is becoming essential.
Crowdsourcing data: data collected online by volunteers. Ex: eBird collects millions of bird observations in North America.

Blessing? We can sense thousands of variables on each individual: potentially, we will be able to scan every variable that may influence the phenomenon under study.
The curse of dimensionality: separating the signal from the noise is in general almost impossible in high-dimensional data, and computations can rapidly exceed the available resources.


A reversal of viewpoint
Classical statistical setting: a small number p of parameters, a large number n of experiments; one studies the asymptotic behavior of estimators as n → ∞ (central-limit-theorem-type results).
Today's data: inflation of the number p of parameters; reduced sample size: n ≈ p or n ≪ p.
⇒ We must think statistics differently! (letting n → ∞ is no longer suitable)

The curse of dimensionality

Curse 1: fluctuations cumulate
Example: linear regression Y = Xβ + ε with cov(ε) = σ² I_n. The least-squares estimator β̂ ∈ argmin_{β ∈ R^p} ‖Y − Xβ‖² has risk
E[‖β̂ − β‖²] = Tr((X^T X)^{−1}) σ².
Illustration: Y_i = Σ_{j=1}^p β_j cos(πji/n) + ε_i = f_β(i/n) + ε_i, for i = 1, ..., n, with ε_1, ..., ε_n i.i.d. with N(0, 1) distribution and the β_j independent with N(0, j^{−4}) distribution.
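The growth of the least-squares risk with p can be checked numerically. The sketch below (mine, not part of the slides) builds the cosine design X[i, j] = cos(πji/n) from the illustration and evaluates the exact risk Tr((X^T X)^{−1}) σ² with σ = 1:

```python
import numpy as np

n = 100
risks = {}
for p in (10, 20, 50, 100):
    # cosine design of the illustration: X[i, j] = cos(pi * j * i / n)
    i = np.arange(1, n + 1)[:, None]
    j = np.arange(1, p + 1)[None, :]
    X = np.cos(np.pi * j * i / n)
    # exact risk of the least-squares estimator (sigma^2 = 1)
    risks[p] = np.trace(np.linalg.inv(X.T @ X))

for p, r in risks.items():
    print(f"p = {p:3d}   E||b_hat - b||^2 = {r:.3f}")
```

The cosine columns are nearly orthogonal with squared norm close to n/2, so the risk grows roughly like 2p/n: the fluctuations of the p estimated coordinates cumulate.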

Curse 1: fluctuations cumulate
Figure: signal and least-squares fit (LS) for p = 10, 20, 50 and 100, plotted over x ∈ [0, 1].

Curse 2: locality is lost
Observations (Y_i, X^(i)) ∈ R × [0, 1]^p for i = 1, ..., n.
Model: Y_i = f(X^(i)) + ε_i with f smooth.
Local averaging: f̂(x) = average of {Y_i : X^(i) close to x}.

Curse 2: locality is lost
Figure: Histograms of the pairwise distances between n = 100 points sampled uniformly in the hypercube [0, 1]^p, for p = 2, 10, 100 and 1000.
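The concentration shown in the histograms can be reproduced without plotting. This Python sketch (mine, not from the slides) samples n = 100 uniform points in [0, 1]^p and reports the mean pairwise distance together with its relative spread, which collapses as p grows:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
spread = {}
for p in (2, 10, 100, 1000):
    X = rng.uniform(size=(n, p))
    # squared pairwise distances via the Gram matrix (memory-friendly)
    sq = (X ** 2).sum(axis=1)
    D2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    d = np.sqrt(np.maximum(D2[np.triu_indices(n, k=1)], 0.0))
    spread[p] = d.std() / d.mean()
    print(f"p = {p:4d}   mean dist = {d.mean():6.2f}   std/mean = {spread[p]:.3f}")
```

In high dimension all points sit at nearly the same distance from one another, so the set {X^(i) close to x} is almost empty and local averaging breaks down.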

Curse 2: locality is lost
The number n of points x_1, ..., x_n required to cover [0, 1]^p by the balls B(x_i, 1) satisfies
n ≥ Γ(p/2 + 1) / π^{p/2} ≈ √(pπ) (p / (2πe))^{p/2}.

p:    20    30       50            100          200
n ≥:  39    45630    5.7 × 10^12   42 × 10^39   larger than the estimated number of particles in the observable universe
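The lower bound and its Stirling approximation are easy to evaluate directly; this small Python check (mine, not in the slides) reproduces the table:

```python
import math

def covering_lower_bound(p):
    # n >= Gamma(p/2 + 1) / pi^(p/2): the inverse of the volume of the
    # unit ball in R^p, since n unit balls must cover [0, 1]^p (volume 1)
    return math.gamma(p / 2 + 1) / math.pi ** (p / 2)

def stirling_approximation(p):
    # sqrt(p * pi) * (p / (2 * pi * e))^(p/2)
    return math.sqrt(p * math.pi) * (p / (2 * math.pi * math.e)) ** (p / 2)

for p in (20, 30, 50, 100, 200):
    print(f"p = {p:3d}   n >= {covering_lower_bound(p):.3g}"
          f"   (Stirling: {stirling_approximation(p):.3g})")
```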

Some other curses
Curse 3: an accumulation of rare events may not be rare (false discoveries, etc.)
Curse 4: algorithmic complexity must remain low

Low-dimensional structures in high-dimensional data
Hopeless?
Low-dimensional structures: high-dimensional data are usually concentrated around low-dimensional structures, reflecting the (relatively) small complexity of the systems producing the data: geometrical structures in an image, the regulation network of a biological system, social structures in marketing data, human technologies with limited complexity, etc.
Dimension reduction: unsupervised (PCA) or estimation-oriented.

Principal Component Analysis
For any data points X^(1), ..., X^(n) ∈ R^p and any dimension d ≤ p, PCA computes the linear span
V_d ∈ argmin_{dim(V) ≤ d} Σ_{i=1}^n ‖X^(i) − Proj_V X^(i)‖²,
where Proj_V is the orthogonal projection matrix onto V.
Figure: V_2 in dimension p = 3.
To do: Exercise 1.6.4.
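This minimization over subspaces has a closed-form solution: V_d is spanned by the top d right singular vectors of the n × p data matrix. A minimal Python sketch (mine, not from the slides; it follows the slide's uncentered formulation, whereas PCA in practice usually centers the data first):

```python
import numpy as np

def pca_project(X, d):
    """Project each row of the n x p matrix X onto the subspace V_d
    minimizing sum_i ||X^(i) - Proj_V X^(i)||^2 over dim(V) <= d."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    V = Vt[:d]              # (d, p): orthonormal basis of V_d
    return X @ V.T @ V      # Proj_{V_d} applied to every row
```

For d = p the projection returns X itself, and points lying exactly on a d-dimensional subspace are reproduced with zero residual.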

PCA in action
MNIST: 1100 scans of each digit; each scan is a 16 × 16 image encoded by a vector in R^256. The original images are displayed in the first row, their projections onto the 10 first principal axes in the second row.

Estimation-oriented dimension reduction
Figure: 55 chemical measurements of 162 strains of E. coli. Left: the data projected on the plane given by a PCA. Right: the data projected on the plane given by an LDA.

Summary
The statistical difficulty: very high-dimensional data, few repetitions.
What helps us: the data come from a vast (more or less stochastic) dynamical system; structures of small effective dimension exist; theoretical models exist.
The path to success: find, from the data, these hidden structures in order to exploit them.

The path to success
Find, from the data, the hidden structures in order to exploit them.

Examples of structures

Regression model
Y_i = f(x^(i)) + ε_i, i = 1, ..., n, with f : R^p → R and E[ε_i] = 0.

Low-dimensional x
Example 1: sparse piecewise-constant regression
This corresponds to the case where f : R → R is piecewise constant with a small number of jumps. This situation appears, for example, in CGH profiling.

Low-dimensional x
Example 2: sparse basis/frame expansion
We estimate f : R → R by expanding it on a basis or frame {ϕ_j}_{j ∈ J}:
f(x) = Σ_{j ∈ J} c_j ϕ_j(x),
with a small number of nonzero c_j. Typical examples of bases are the Fourier basis, splines, wavelets, etc. The simplest example is the piecewise-linear decomposition, where ϕ_j(x) = (x − z_j)_+ with z_1 < z_2 < ... and (x)_+ = max(x, 0).
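As an illustration (mine, not from the slides, with arbitrary knot and coefficient values), the hinge functions ϕ_j(x) = (x − z_j)_+ are easy to evaluate, and a sparse coefficient vector c yields a piecewise-linear f with few kinks:

```python
import numpy as np

def hinge_basis(x, knots):
    # phi_j(x) = (x - z_j)_+ evaluated at every point x, for every knot z_j
    x = np.asarray(x, dtype=float)
    z = np.asarray(knots, dtype=float)
    return np.maximum(x[:, None] - z[None, :], 0.0)

# sparse expansion: 10 knots but only 2 nonzero coefficients (illustrative values)
knots = np.linspace(0.0, 0.9, 10)
c = np.zeros(10)
c[2], c[7] = 1.5, -2.0          # active knots z = 0.2 and z = 0.7
x = np.linspace(0.0, 1.0, 101)
f = hinge_basis(x, knots) @ c   # f(x) = sum_j c_j (x - z_j)_+
```

The resulting f is zero before the first active knot, then piecewise linear with a slope change at each active knot.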

High-dimensional x
Example 3: sparse linear regression
This corresponds to the case where f is linear: f(x) = ⟨β, x⟩, where β ∈ R^p has only a few nonzero coordinates. This model can be too rough to model the data. Example: the relationship between phenotypes and protein abundances. Only a small number of proteins influence a given phenotype, but the relationship between these proteins and the phenotype is unlikely to be linear.
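The slides do not fix an estimator at this point; a standard choice for sparse linear regression is the Lasso. Below is a minimal proximal-gradient (ISTA) sketch in Python, on synthetic data of my own making, illustrating that the ℓ1 penalty recovers a sparse β:

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=500):
    """Minimize (1/2n)||y - X b||^2 + lam * ||b||_1 by proximal gradient (ISTA)."""
    n, p = X.shape
    b = np.zeros(p)
    L = np.linalg.norm(X, 2) ** 2 / n        # Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y) / n
        z = b - grad / L
        b = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-thresholding
    return b
```

With a β having two nonzero coordinates and mild noise, the estimate keeps those two coordinates large and sets (almost) all others to exactly zero, at the price of a small shrinkage bias of order λ.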

High-dimensional x
Example 4: sparse additive model
In the sparse additive model, we expect that f(x) = Σ_k f_k(x_k) with most of the f_k equal to 0. If we expand each function f_k on a frame or basis {ϕ_j}_{j ∈ J_k}, we obtain the decomposition
f(x) = Σ_{k=1}^p Σ_{j ∈ J_k} c_{j,k} ϕ_j(x_k),
where most of the vectors {c_{j,k}}_{j ∈ J_k} are zero. Such a model can be hard to fit from a small sample, since it requires estimating a relatively large number of nonzero c_{j,k}.

High-dimensional x
In some cases, the basis expansion of f_k can itself be sparse.
Example 5: sparse additive piecewise-linear regression
The sparse additive piecewise-linear model is a sparse additive model f(x) = Σ_k f_k(x_k) with sparse piecewise-linear functions f_k. We then have two levels of sparsity:
1. most of the f_k are equal to 0,
2. the nonzero f_k have a sparse expansion in the representation f_k(x_k) = Σ_{j ∈ J_k} c_{j,k} (x_k − z_{j,k})_+.
In other words, the matrix c = [c_{j,k}] of the sparse additive model has only a few nonzero columns, and these nonzero columns are sparse.
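This two-level structure can be encoded as a design matrix with one block of hinge columns per coordinate; fitting then amounts to finding a coefficient matrix c with few nonzero, themselves sparse, columns. A sketch (mine, with hypothetical knots shared across coordinates for simplicity, whereas the slide allows knots z_{j,k} depending on k):

```python
import numpy as np

def additive_hinge_design(X, knots):
    """Stack the features (x_k - z_j)_+ for every coordinate k of the n x p
    data matrix X and every knot z_j: the design matrix of the sparse
    additive piecewise-linear model, of shape (n, p * len(knots))."""
    z = np.asarray(knots, dtype=float)
    blocks = [np.maximum(X[:, k:k + 1] - z[None, :], 0.0)
              for k in range(X.shape[1])]
    return np.hstack(blocks)
```

Block k of the resulting matrix depends only on coordinate x_k, so column-block sparsity of the coefficients corresponds exactly to "most f_k equal to 0".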