Big Data: The curse of dimensionality and variable selection in identification for a high dimensional nonlinear non-parametric system. Er-wei Bai, University of Iowa, Iowa, USA; Queen's University, Belfast, UK
Big Data? Deriving meaning from, and acting upon, data sets that are large, high dimensional, heterogeneous, incomplete, contradictory, noisy, fast varying, ..., with many different forms and formats.
Big Data topics:
Algorithms: computing, data fusion, machine learning, dimension reduction, data streaming, scalable algorithms.
Fundamentals: infrastructure, CPS architecture, cloud/grid/distributed platforms, models, uncertainties, storage, software tools.
Data management: database and web, distributed management, stream management, data integration, decision making.
Data search and mining: data mining/extraction, social networks, data cleaning, visualization, models, many other issues.
Security/privacy: cryptography, threat detection/protection, privacy preserving, social-economical impact.
Applications: science, engineering, healthcare, business, education, transportation, finance, law.
What we are interested in, for identification purposes: y(·) = f(x(·)) + noise. The goal is to estimate the unknown nonlinear non-parametric f(·) when the dimension is high.
The same problem y = f(x) appears across areas:
Pattern recognition: x = data, y = pattern, f = classifier.
Data mining: x = feature, y = label, f = inducer.
Imaging: x = input grey scale, y = output grey scale, f = deformation.
Machine learning: x = input, y = target.
Statistics: x = current/past values, y = trend.
And other areas.
Non-parametric nonlinear system identification is pattern recognition or data mining or ... Given x, we want to estimate y = f(x).
Step 1: Find similar patterns, i.e., the neighborhood ||x − x(k)|| <= h for k_1, ..., k_l, with y(k_i) = f(x(k_i)).
Step 2: y is a convex combination of the y(k_i) with weights K(||x − x(k_i)||/h) / sum_j K(||x − x(k_j)||/h):
y = sum_i [ K(||x − x(k_i)||/h) / sum_j K(||x − x(k_j)||/h) ] y(k_i).
Supervised learning with infinitely many patterns y(k_i).
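The two steps above amount to the classical kernel (Nadaraya-Watson) estimator. A minimal sketch in Python; the Gaussian kernel, the bandwidth h and the test function are illustrative assumptions, not from the slides:

```python
import numpy as np

def kernel_estimate(x, X, Y, h):
    """Kernel estimate of f(x): a convex combination of the outputs Y
    whose inputs X lie near x, weighted by K(||x - X_k|| / h)."""
    d = np.linalg.norm(X - x, axis=1) / h
    w = np.exp(-0.5 * d**2)           # Gaussian kernel K
    return np.sum(w * Y) / np.sum(w)  # weights sum to one: convex combination

# illustrative data: y = f(x) with f(x1, x2) = x1^2 + x2, no noise
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(5000, 2))
Y = X[:, 0]**2 + X[:, 1]
y_hat = kernel_estimate(np.array([0.3, -0.2]), X, Y, h=0.1)
```

Only samples within a few bandwidths of the query point carry meaningful weight, which is exactly why high dimension becomes a problem on the next slides.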
Parametric or basis approaches: polynomial (Volterra), splines, neural networks, RKHS, ...
Local or point-by-point approaches: DWO, local polynomial, kernel, ...; f is estimated point by point. If f(x_0) is of interest, only the data {x : |x − x_0| <= h} is used for some h > 0, and data far away from x_0 are not very useful. (Figure: f near f(x_0), estimated from the window [x_0 − h, x_0 + h] only.)
Curse of dimensionality: empty space. Let x in R^n, x_0 = (0, ..., 0)^T and C = {x : |x_k| <= 0.1}. Randomly sample a point x uniformly in [−1, 1]^n; then Prob{x in C} = 0.1^n. On average, the number of points landing in C is N · 0.1^n. To have one point in C, N >= 10^n, which is about 1 billion when n = 9.
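A two-line check of the arithmetic, under the uniform-sampling setup above:

```python
# probability that a uniform point in [-1, 1]^n falls in
# C = {x : |x_k| <= 0.1 for all k}: each coordinate hits with prob 0.2/2
def prob_in_C(n):
    return 0.1**n

def samples_for_one_hit(n):
    # expected number of samples N to see a single point in C
    return 1.0 / prob_in_C(n)
```

For n = 9, samples_for_one_hit(9) is about 10^9: a billion samples just to populate one small neighborhood.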
The colorful phrase "the curse of dimensionality" was coined by Richard Bellman in 1961. See also: "High Dimensional Data Analysis: the curse and blessing of dimensionality," D. Donoho, 2000; "The Curse of Dimensionality: A Blessing to Personalized Medicine," J. of Clinical Oncology, Vol. 28, 2010.
The diagonal is almost orthogonal to all coordinate axes. Consider a unit cube [−1, 1]^n centered at the origin. The angle between its diagonal v = (±1, ..., ±1)^T and any axis e_i satisfies
cos(θ) = < v/||v||, e_i > = ±1/sqrt(n) → 0.
High dimensional Gaussian. Let p(x) = (2π)^{−n/2} e^{−||x||²/2} and calculate Prob{||x|| >= 2}:

n    | 1       | 2       | 5       | 10      | 20      | 100
Prob | 0.04550 | 0.13534 | 0.54942 | 0.94734 | 0.99995 | 1.0000

For a high dimensional Gaussian, almost all samples are in the tails.
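The table can be reproduced by noting that ||x||² is chi-square with n degrees of freedom, so Prob{||x|| >= 2} is its survival function evaluated at 4. A sketch using scipy (the availability of scipy is an assumption of the environment):

```python
from scipy.stats import chi2

def tail_prob(n):
    """P(||x|| >= 2) for x ~ N(0, I_n): ||x||^2 is chi-square with n dof."""
    return chi2.sf(4.0, df=n)

# reproduces the slide's table: 0.0455 (n=1), 0.1353 (n=2), ..., ~1 (n=100)
```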
Concentration. Let x = (x_1, ..., x_n)^T with the x_k's iid Gaussian of zero mean, and let µ = E||x||_2. Then,
Prob{||x||_2 >= (1 + ɛ)µ} <= e^{−ɛ²n/6},
Prob{||x||_2 <= (1 − ɛ)µ} <= e^{−ɛ²n/4}.
High dimensional iid vectors are distributed close to the surface of a sphere of radius µ.
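A quick simulation of the concentration phenomenon; the dimension and sample counts are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 1000, 500
x = rng.standard_normal((trials, n))   # iid N(0, 1) coordinates
norms = np.linalg.norm(x, axis=1)
mu = np.sqrt(n)                        # E||x||_2 is close to sqrt(n) for large n
rel_dev = np.abs(norms - mu) / mu      # relative deviation from the sphere
```

Every one of the 500 draws lands within a few percent of the sphere of radius sqrt(1000), about 31.6.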
Volumes of cubes and balls.
V_ball = π^{n/2} r^n / Γ(n/2 + 1),  V_cube = (2r)^n,  lim_{n→∞} V_ball / V_cube = 0.
The volume of a cube concentrates more and more in its corners as n increases.
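The vanishing ratio is easy to evaluate numerically:

```python
import math

def ball_to_cube(n, r=1.0):
    """Volume ratio of the radius-r ball to the circumscribing cube [-r, r]^n."""
    v_ball = math.pi ** (n / 2) * r**n / math.gamma(n / 2 + 1)
    v_cube = (2 * r) ** n
    return v_ball / v_cube

# already pi/4 ~ 0.785 at n = 2, and only ~ 0.0025 at n = 10
```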
Extreme dimension reduction is possible (Science, 2006): 804,414 Reuters newswire stories, each article represented as a vector of the counts of the 2000 most frequently used words, and further reduced to R² by a Boltzmann neural network. Text mapped to a 2-dimensional code.
What we do in high dimensional nonlinear non-parametric system identification: underlying models (additive models, block oriented nonlinear systems, ...); dimension asymptotics (the limit distributions, ...); variable selection and dimension reduction; global (nonlinear non-parametric) models, ...
Dimension reduction and variable selection: a well studied topic in the linear setting, e.g., principal component analysis, low rank matrix approximation, factor analysis, independent component analysis, the Fisher discriminant, forward/backward stepwise (stagewise) selection, ...
Penalized (regularized) optimization and its variants:
min ||Y − Xβ̂||² + λ sum_i |β̂_i|^α, 0 <= α < ∞, or min ||Y − Xβ̂||² s.t. sum_i |β̂_i|^α <= t.
Nonlinear (support vector regression): (model fit) + λ (model complexity).
Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B. Zhao, P. and Yu, B. (2006). On model selection consistency of Lasso. Journal of Machine Learning Research (the irrepresentable condition).
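A minimal sketch of the penalized least squares above with α = 1 (the Lasso), solved by iterative soft thresholding; the data, λ and iteration count are illustrative, not from the slides:

```python
import numpy as np

def soft(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Minimize 0.5 ||y - X b||^2 + lam ||b||_1 by proximal gradient."""
    L = np.linalg.norm(X, 2) ** 2   # Lipschitz constant of the gradient
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        b = soft(b - X.T @ (X @ b - y) / L, lam / L)
    return b

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
beta = np.zeros(10)
beta[0], beta[7] = 2.0, -3.0        # sparse truth
y = X @ beta + 0.1 * rng.standard_normal(200)
b_hat = lasso_ista(X, y, lam=5.0)
```

The l1 penalty zeroes the eight irrelevant coefficients while the two active ones are recovered up to a small shrinkage bias.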
Geometric interpretation:
argmin_{β̂} ||Y − Xβ̂||² = argmin_{β̂} (β̂ − β_0)^T X^T X (β̂ − β_0), s.t. sum_i |β̂_i|^α <= t.
(Figure: constraint regions for ridge, Lasso and bridge.) α = 1 (Lasso), α = 2 (ridge), α < 1 (bridge).
Compressive sensing: Y = Xβ with β sparse. Then
min_{β̂} ||β̂||_1 s.t. Y = Xβ̂
yields β̂ = β provided that X satisfies the restricted isometry property.
E. Candes and T. Tao, "Near optimal signal recovery from random projections: Universal encoding strategies?", IEEE Trans. Inform. Theory, Dec. 2006.
In an (FIR) system identification setting,
Y = Xβ + e + w,
where X is the data matrix of inputs with rows (u(k), ..., u(k + n + 1)), k = 1, ..., N, e denotes outliers and w random noise. Let
β̂ = argmin_ξ ||Y − Xξ||_1.
Under some weak assumptions (w iid, and e having at most k <= ρN nonzero elements for some ρ < 1), in probability as N → ∞, ||β̂ − β||_2 → 0.
W. Xu, E.-W. Bai and M. Cho (2014), Automatica, "System Identification in the Presence of Outliers and Random Noises: a Compressed Sensing Approach."
Big difference between the linear and nonlinear settings: local vs global dimension. Consider
y(k) = f(u(k−1), u(k−2), u(k−3), u(k−4)) =
  u(k−1),           if u(k−4) >= 0;
  u(k−4)u(k−2),     if u(k−4) < 0 and u(k−2) >= 0;
  u(k−4)u(k−3),     if u(k−4) < 0 and u(k−2) < 0;
  u(k−1),           otherwise.
The global dimension is four, but in each regime only one or two variables matter. Linear (global) PCA may not work.
Methods that work in a linear setting may not work in a nonlinear setting, e.g., correlation methods. Let x be uniform in [−1, 1].
Linear: y(k) = x(k) gives cov(y, x)/sqrt(cov(y) cov(x)) = 1, so ρ(x, y) = 1.
Nonlinear: y(k) = x(k)² gives cov(y, x) = E(yx) − E(y)E(x) = E(x³) − E(x²)E(x) = ∫_{−1}^{1} a³ (1/2) da − 0 = 0, so ρ(x, y) = 0.
y depends on x nonlinearly, yet the correlation is zero.
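The zero-correlation trap is easy to reproduce numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100_000)
corr_lin = np.corrcoef(x, x)[0, 1]    # y = x:   correlation 1
corr_sq = np.corrcoef(x, x**2)[0, 1]  # y = x^2: correlation ~ 0
```

corr_sq comes out near zero even though y is a deterministic function of x, so linear correlation screening would discard a perfectly relevant variable.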
For a high dimensional nonlinear problem, approximation is the key: the ambient dimension is very high, yet the desired property is embedded in a low dimensional structure. The goal is to design efficient algorithms that reveal the dominant variables and for which one can have some theoretical guarantees.
Manifold embedding (nonlinear PCA): eliminate redundant/dependent variables.
If x = (x_1, x_2) and x_1 = g(x_2), then y = f(x) = f(g(x_2), x_2) = h(x_2).
If x = (x_1, x_2) = (g_1(z), g_2(z)), then y = f(x) = f(g_1(z), g_2(z)) = h(z).
More realistically, y ≈ h(x_2) (resp. h(z)).
One dimensional case: the principal curve. Let x = (x_1, ..., x_n)^T and let f(t) = (f_1(t), ..., f_n(t)) be a curve in R^n. Define s_f(x) to be the value of t corresponding to the point of f(t) that is closest to x. The principal curve satisfies
f(t) = E(x | s_f(x) = t),
a curve that passes through the middle of the distribution of x. Principal surfaces are not easy and are computationally prohibitive!
Local (linear/nonlinear) embedding: find a set of lower dimensional data that resembles the local structure of the original high dimensional data. The key is distance preservation or topology preservation.
Multidimensional scaling (Science, 2000). Given high dimensional x_i's, define d_ij = ||x_i − x_j||. Find lower dimensional z_i's that minimize the pairwise distance mismatch:
min_{z_i} sum_{i<j} (||z_i − z_j|| − d_ij)².
In a linear setting, ||·|| is the Euclidean norm; in a nonlinear setting, ||·|| is usually the distance along some manifold (Isomap).
Euclidean norm embedding (Johnson-Lindenstrauss). Theorem: Let x_1, ..., x_n in R^n, ɛ in (0, 1/2) and d = O(ɛ^{−2} log n). There exists a map Q : R^n → R^d such that with high probability,
(1 − ɛ)||x_i − x_j||² <= ||Q(x_i) − Q(x_j)||² <= (1 + ɛ)||x_i − x_j||².
The space is approximately d dimensional, not n dimensional, if pairwise distance is the interest. What if the data lie on a manifold?
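A scaled random Gaussian matrix realizes such a map Q. A sketch that checks the distortion on random points (all sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_pts, dim, d = 50, 1000, 500   # d on the order of eps^-2 log(n_pts)
eps = 0.3
X = rng.standard_normal((n_pts, dim))
Q = rng.standard_normal((d, dim)) / np.sqrt(d)  # random projection
Z = X @ Q.T

# worst pairwise distortion of squared distances under Q
worst = max(
    abs(np.sum((Z[i] - Z[j])**2) / np.sum((X[i] - X[j])**2) - 1.0)
    for i in range(n_pts) for j in range(i + 1, n_pts)
)
```

All 1225 pairwise squared distances are preserved within the factor 1 ± ɛ after halving the dimension.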
Isomap distance: the Euclidean distance and the distance along the surface can order points differently. (Figure: points A, B, C on a curved surface with Euclidean |AB| < |AC|, yet along the surface |AB| > |AC|.)
Local linear embedding.
Step 1: Find the local structure of the data: min_w sum_i ||x_i − sum_j w_ij x_j||².
Step 2: Find lower dimensional data z that resembles the local structure of x, under some regularization: min_z sum_i ||z_i − sum_j w_ij z_j||².
Estimation is then carried out in the lower dimensional space, y = f(z).
Example: Science, 2000. (Figure.)
Unsupervised: dimension reduction is based on x alone, without considering the output y. (Figure: with y(x_1, x_2) = x_2, PCA keeps the high-variance direction x_1 and yields ŷ(x_1, x_2) = x_1, missing the output-relevant direction.) A remedy is to add a penalty term:
min_{z_i} λ sum_i ||z_i − sum_j w_ij z_j||² + (1 − λ)(output error term).
An active research area. Henrik Ohlsson, Jacob Roll and Lennart Ljung, "Manifold-Constrained Regressors in System Identification," 2008.
Global model. The problem of the curse of dimensionality is the lack of a global model. As a consequence, only data in (x_0 − h, x_0 + h) can be used and the majority of the data is discarded. In a linear case, every data point is used:
min_{k,b} || (y_1, ..., y_N)^T − (x_1 1; ...; x_N 1)(k, b)^T ||².
(Figure: the local window around x_0 vs the global fit y = kx + b.)
Bayesian models, in particular the Gaussian process model, are among the global models. Rasmussen, C.E. and C. Williams (2006), Gaussian Processes for Machine Learning, The MIT Press, Cambridge, MA.
Gaussian process regression model. Consider a scalar nonlinear system y(k) = f(x(k)) + v(k) where v(·) is an iid random sequence of zero mean and finite variance σ_v². In a Gaussian process setting, f(x_0), f(x(1)), ..., f(x(N)) are assumed to follow a joint Gaussian distribution with zero mean (not necessary though) and a covariance matrix Σ_p.
Example: Let y_0 = f(x_0) and Y = (y(1), y(2), ..., y(N)). Since (y_0, Y) follows a joint Gaussian,
(y_0, Y)^T ~ N(0, [c, B; B^T, A]).
Given x_0, what is y_0? The conditional density of y_0 conditioned on Y is also Gaussian,
y_0 | Y ~ N(B A^{−1} Y, c − B A^{−1} B^T).
The minimum variance estimate ŷ_0 = f̂(x_0) of f(x_0) is the conditional mean, f̂(x_0) = B A^{−1} Y.
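A minimal numpy sketch of the conditional-mean formula f̂(x_0) = B A^{−1} Y, with an assumed squared-exponential covariance; the kernel choice, length scale, noise level and data are illustrative, not from the slides:

```python
import numpy as np

def rbf(a, b, ell=1.0):
    """Squared-exponential covariance between 1-D input sets a and b."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ell) ** 2)

def gp_predict(x_tr, y_tr, x_te, noise=1e-2):
    A = rbf(x_tr, x_tr) + noise * np.eye(len(x_tr))   # cov(Y, Y)
    B = rbf(x_te, x_tr)                               # cov(y0, Y)
    mean = B @ np.linalg.solve(A, y_tr)               # B A^{-1} Y
    cov = rbf(x_te, x_te) - B @ np.linalg.solve(A, B.T)
    return mean, cov

x_tr = np.linspace(0.0, 2 * np.pi, 30)
y_tr = np.sin(x_tr)                                   # f to be recovered
mean, cov = gp_predict(x_tr, y_tr, np.array([1.0]))
```

The posterior mean at x_0 = 1 lands close to sin(1), even though f enters only through the training data: a global model using every sample.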
Huntington's disease example: a neurological cognitive test. 60 patients in total; the cognitive test data for 21 patients were missing.
Modified Gaussian model: the first row shows the predicted values and the second row the true but missing data.
predicted: 43.9 43.9 44.6 44.6 45.2 45.2 45.5 45.6 46.0 46.3 46.6 46.6 46.6 46.8 46.8 47.2 47.6 48.0 48.0 48.0 48.2
true:      45   45   45   45   46   46   46   46   46   46   46   47   48   48   48   48   49   49   49   49   49
Bai et al, 2014.
Some discussions: hard vs soft approaches; top down vs bottom up approaches; simplified but reasonable models; ...
Top down approach: consider
y(k) = f(x_1(k), ..., x_n(k)) + noise,
where x_i(·) is irrelevant if and only if ∂f/∂x_i = 0. Local linear estimator:
f(x(k)) = f(x_0) + (x(k) − x_0)^T ∂f/∂x|_{x_0} + h.o.t.,
min_{f(x_0), β̂} sum_{k=1}^{N} { y(k) − [1, (x(k) − x_0)^T] (f(x_0); β̂) }² K_Q(x(k) − x_0),
with β = (β_1, ..., β_d, 0, ..., 0)^T, β_1 > 0, ..., β_d > 0, β_{d+1} = ... = β_n = 0, and the active set A = {j : β_j > 0} = {1, 2, ..., d}.
The goal is both parameter and set convergence.
Parameter and set convergence. Example:
x̂_1(k) = 0.8 + 0.5^k → x_1 = 0.8, x̂_2(k) = 0.5^k → x_2 = 0,
so x̂_j(k) → x_j for j = 1, 2 (parameter convergence). Yet
A = {j : x_j > 0} = {1} while Â_k = {j : x̂_j(k) > 0} = {1, 2},
which does not converge to A: parameter convergence does not imply set convergence.
Top down approach: an adaptive convex penalized optimization,
J(β) = min_β ||Z − Φβ||² + λ sum_{j=1}^{n} w_j |β_j|,
with w_j = 1/|β̂_j| calculated from the local linear estimator. Then, under some technical conditions, in probability as N → ∞,
β̂_j → |∂f/∂x_j(x_0)| if |∂f/∂x_j(x_0)| > 0 (parameter convergence),
Prob{β̂_j = 0} → 1 if ∂f/∂x_j(x_0) = 0 (set convergence).
The approach works if the available data record is long.
E.-W. Bai et al, "Kernel based approaches to local nonlinear non-parametric variable selection," Automatica, 2014; K. Bertin, "Selection of variables and dimension reduction in high-dimensional non-parametric regression," EJS, 2008.
Bottom up approach: forward/backward selection. First proposed by Billings, 1989, "Identification of MIMO non-linear systems using a forward-regression orthogonal estimator." Extremely popular in statistics and in practice. For a long time it was considered ad hoc, but recent research shows otherwise: T. Zhang, "Adaptive Forward-Backward Greedy Algorithm for Learning Sparse Representations," IEEE TIT, 2011; "On the Consistency of Feature Selection using Greedy Least Squares Regression," J. of Machine Learning Res., 2009; Tropp, "Greed is Good: Algorithmic Results for Sparse Approximation," IEEE TIT, 2004; ...
Forward selection. What if the dimension is high and the available data set is limited? To illustrate, consider a linear case
Y = (x_1, x_2, ..., x_n)(a_1, ..., a_n)^T.
First step: i_1 = argmin_{i in {1,...,n}} min_{a_i} ||Y − a_i x_i||².
kth step:
i_k = argmin_{i in {1,...,n} \ {i_1,...,i_{k−1}}} min_{a_1,...,a_{k−1},a_i} ||Y − (x_{i_1}, ..., x_{i_{k−1}}, x_i)(a_1, ..., a_{k−1}, a_i)^T||².
Bottom up approach: backward selection. Start with the chosen (i_1, ..., i_p) and a_1, ..., a_p. Re-examine the first index,
i_1 = argmin_{i in {1,...,n} \ {i_2,...,i_p}} min_{a_i,a_2,...,a_p} ||Y − (x_i, x_{i_2}, ..., x_{i_p})(a_i, a_2, ..., a_p)^T||²,
and repeat for every i_j.
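A sketch of the forward step in the linear case above: at each step, add the regressor that most reduces the least-squares residual (the data are illustrative):

```python
import numpy as np

def forward_select(X, y, p):
    """Greedy forward selection of p columns of X by least squares."""
    chosen = []
    for _ in range(p):
        best_i, best_rss = -1, np.inf
        for i in range(X.shape[1]):
            if i in chosen:
                continue
            Xi = X[:, chosen + [i]]
            a, *_ = np.linalg.lstsq(Xi, y, rcond=None)
            rss = float(np.sum((y - Xi @ a) ** 2))
            if rss < best_rss:
                best_i, best_rss = i, rss
        chosen.append(best_i)
    return chosen

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
y = 3.0 * X[:, 2] - 2.0 * X[:, 7] + 0.1 * rng.standard_normal(200)
picked = forward_select(X, y, 2)
```

A backward sweep would then re-test each picked index against the remaining candidates, as on the slide.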
Forward/backward for nonlinear systems: find the minimum set of unfalsified variables,
f(x_1, ..., x_n) − g(x_{i_1}, ..., x_{i_p}) ≈ 0.
Low dimensional neighborhoods:
One dimensional neighborhood of x_i(k): {x(j) in R^n : (x_i(k) − x_i(j))² <= h}.
p-dimensional neighborhood of (x_{i_1}(k), ..., x_{i_p}(k)): {x(j) in R^n : (x_{i_1}(k) − x_{i_1}(j))² + ... + (x_{i_p}(k) − x_{i_p}(j))² <= h}.
Algorithm: Step 1: determine the bandwidth. Step 2: the number of variables is determined by the modified Box-Pierce test. Step 3: forward selection. Step 4: backward selection. Step 5: terminate.
Example: the actual system lies approximately on an unknown low dimensional manifold: x_2(k) ≈ g(x_1(k)) implies y(k) ≈ f(x_1(k), g(x_1(k))) = f_1(x_1(k)), which is lower dimensional.
y(k) = 10 sin(x_1(k)x_2(k)) + 20(x_3(k) − 0.5)² + 10x_4(k) + 5x_5(k) + x_6(k)x_7(k) + x_7(k)² + 5 cos(x_6(k)x_8(k)) + exp(−|x_8(k)|) + 0.5η(k), k = 1, 2, ..., 500,
where η(k) is i.i.d. Gaussian noise of zero mean and unit variance.
x_3(k), x_5(k) are independent and uniform in [−1, 1], and
x_4(k) = x_3(k) x_5(k) + 0.1 η(k)
x_1(k) = x_3(k)² x_5(k) + 0.1 η(k)
x_2(k) = x_3(k) x_5(k)² + 0.1 η(k)
x_6(k) = x_1(k) x_4(k) + 0.1 η(k)
x_7(k) = x_3(k)³ x_5(k) + 0.1 η(k)
x_8(k) = x_2(k) x_5(k) + 0.1 η(k),
so x_1(·), ..., x_8(·) lie not exactly but approximately on a two dimensional manifold. h = 0.2 was chosen by 5-fold cross validation. Test signal: x_3(k) = 0.9 sin(2πk/20), x_5(k) = 0.9 cos(2πk/20), k = 1, ..., 40.
Results (two figures): left, RSS versus the number of variables chosen (1 to 8); right, the response on the test signal, with goodness of fit Gof = 0.9205 for the 2-dimensional model versus 0.6940 for the 8-dimensional model.
Bai et al, 2013, "On Variable Selection of a Nonlinear Non-parametric System with a Limited Data Set."
Nonlinear systems with short term memory and a low degree of interaction: additive systems,
y(k) = f(x_1(k), ..., x_n(k)) + noise = sum_{i=1}^{n} f_i(x_i(k)) + sum_{j>i} f_ij(x_i(k), x_j(k)) + noise.
An n-dimensional system becomes a number of 1- and 2-dimensional systems. Under some technical conditions, e.g., iid or Galois inputs, the f_i and f_ij can be identified independently.
Bai et al, TAC (2008), Automatica (2009), Automatica (2010), TAC (2010), ...
Why? Consider a household expense tree: the additive structure is the most widely used nonlinear model in practice (and also in the statistics literature).
A simple example:
y(k) = c̄ + f̄_1(x_1(k)) + f̄_2(x_2(k)) + f̄_12(x_1(k), x_2(k)) + v(k) = c + f_1(x_1(k)) + f_2(x_2(k)) + f_12(x_1(k), x_2(k)) + v(k),
where the second, normalized decomposition makes each term identifiable separately and low dimensional:
c = E y(k)
f_1(x_1^0) = E(y(k) | x_1(k) = x_1^0) − c
f_2(x_2^0) = E(y(k) | x_2(k) = x_2^0) − c
f_12(x_1^0, x_2^0) = E(y(k) | x_1(k) = x_1^0, x_2(k) = x_2^0) − f_1(x_1^0) − f_2(x_2^0) − c.
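The conditional-expectation formulas can be checked by simple local averaging on simulated data; the functions, the constant and the bandwidth are illustrative, and for clarity the interaction term f_12 is taken to be zero:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000
x1 = rng.uniform(-1, 1, N)
x2 = rng.uniform(-1, 1, N)
# additive system with no interaction: y = c + f1(x1) + f2(x2),
# where f1(x) = x^2 - 1/3 and f2(x) = sin(pi x) are zero mean,
# so the constant is c = 1 + 1/3
y = 1.0 + x1**2 + np.sin(np.pi * x2)

c_hat = y.mean()                     # estimates c = E y

def f1_hat(x0, h=0.05):
    """E(y | x1 ~ x0) - c: a local-average version of the slide's formula."""
    window = np.abs(x1 - x0) < h
    return y[window].mean() - c_hat
```

f1_hat(0.5) comes out near 0.25 − 1/3, about −0.083, which is the centered component f_1 evaluated at 0.5; only one-dimensional conditioning was needed.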
Quick (boring) derivation: define
g_{1,12}(x_1^0) = E(f̄_12(x_1(k), x_2(k)) | x_1(k) = x_1^0)
g_{2,12}(x_2^0) = E(f̄_12(x_1(k), x_2(k)) | x_2(k) = x_2^0)
c_12 = E(f̄_12(x_1(k), x_2(k)))
c_1 = E(f̄_1(x_1(k)) + g_{1,12}(x_1(k)))
c_2 = E(f̄_2(x_2(k)) + g_{2,12}(x_2(k)))
and set
f_12(x_1(k), x_2(k)) = f̄_12(x_1(k), x_2(k)) − g_{2,12}(x_2(k)) − g_{1,12}(x_1(k)) + c_12
f_1(x_1(k)) = f̄_1(x_1(k)) + g_{1,12}(x_1(k)) − c_1
f_2(x_2(k)) = f̄_2(x_2(k)) + g_{2,12}(x_2(k)) − c_2
c = c̄ + c_1 + c_2 − c_12.
Then E f_1(x_1(k)) = E f_2(x_2(k)) = E f_12(x_1(k), x_2(k)) = 0 and
E f_1(x_1(k)) f_2(x_2(k)) = E f_i(x_i(k)) f_12(x_1(k), x_2(k)) = 0, i = 1, 2.
Convergence and sparsity. Consider
J = ||Y − (sum_i f_i + sum_{j<k} f_jk)||² + λ_1 sum_j (||f_j||² + sum_k ||f_jk||²)^{1/2} + λ_2 sum_{j<k} ||f_jk||.
Then, under some technical conditions, the true f_i, f_jk's can be identified and are sparse. Radchenko and James, J. of ASA, 2012.
Thanks!