Linear regression methods for large n and streaming data

Large n with small or moderate p is a fairly simple problem. The sufficient statistics for β in OLS (and ridge) are

X^T X = Σ_i x_i x_i^T  and  X^T Y = Σ_i x_i y_i.

The concept of sufficiency is key to processing big data. Here the sufficient statistics are simple sums over observations, so they can be computed in batches or updated online in the obvious way.

(2) Linear regression - Part 2 Page 1
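As a sketch (class and variable names are illustrative, not from the notes), the sufficient statistics can be accumulated batch by batch and the normal equations solved once at the end:

```python
import numpy as np

# Streaming OLS sketch: accumulate the sufficient statistics
# XtX = sum_i x_i x_i^T and Xty = sum_i x_i y_i one batch at a time,
# then solve the normal equations. No raw data is retained.
class StreamingOLS:
    def __init__(self, p):
        self.XtX = np.zeros((p, p))
        self.Xty = np.zeros(p)

    def update(self, X_batch, y_batch):
        # Each batch contributes a simple sum to the sufficient statistics.
        self.XtX += X_batch.T @ X_batch
        self.Xty += X_batch.T @ y_batch

    def coef(self, ridge=0.0):
        # ridge > 0 gives the ridge estimate; the same statistics suffice.
        p = self.XtX.shape[0]
        return np.linalg.solve(self.XtX + ridge * np.eye(p), self.Xty)
```

Because X^T X and X^T Y are sums over observations, each batch can be processed and discarded; switching from OLS to ridge only changes the final solve.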
Large n

For truly massive data, subsampling is an option. There are many types of sampling (e.g., simple random, stratified, or systematic sampling). You can also take several subsamples, average the estimates across subsamples for a point estimate, and use quantiles of the estimates across subsamples as bootstrap-style confidence intervals.
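The subsample-and-aggregate idea above can be sketched as follows (function name and defaults are illustrative; simple random sampling is assumed):

```python
import numpy as np

# Fit OLS on several random subsamples, average the estimates for a
# point estimate, and use quantiles across subsamples as
# bootstrap-style confidence intervals.
def subsample_ols(X, y, n_sub=1000, n_rep=50, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    estimates = []
    for _ in range(n_rep):
        idx = rng.choice(n, size=n_sub, replace=False)  # simple random sample
        beta_hat, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
        estimates.append(beta_hat)
    estimates = np.array(estimates)
    point = estimates.mean(axis=0)                       # averaged estimate
    ci = np.quantile(estimates, [0.025, 0.975], axis=0)  # 95% interval
    return point, ci
```

Each subsample fit touches only n_sub rows, so the full data never need to be in memory at once.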
The problem is more interesting when the regression coefficients evolve over time. A dynamic linear model (DLM) is

Y_t | β_t ~ Normal(X_t β_t, σ² I)
β_t | β_{t-1} ~ Normal(ρ β_{t-1}, τ² I).

There are far more general versions of the DLM; this is also called a state-space model. Here the regression relationship β_t varies over time, and there are many examples where this might be a useful model. If ρ = 1 and τ ≈ 0 then the parameters evolve slowly over time. If ρ = 0 then β_t is independent over time.
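A short simulation makes the role of ρ and τ concrete (all parameter values here are illustrative): β_t follows an AR(1) state equation, and ρ = 1 with small τ gives a slowly drifting coefficient.

```python
import numpy as np

# Simulate the DLM: Y_t ~ N(X_t beta_t, sigma^2), with state equation
# beta_t ~ N(rho * beta_{t-1}, tau^2 I). One observation per time point.
def simulate_dlm(T=200, p=2, rho=1.0, tau=0.05, sigma=0.5, seed=1):
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(T, p))
    beta = np.zeros((T, p))
    y = np.zeros(T)
    b = rng.normal(size=p)                        # initial state
    for t in range(T):
        b = rho * b + tau * rng.normal(size=p)    # state evolution
        beta[t] = b
        y[t] = X[t] @ b + sigma * rng.normal()    # observation
    return X, y, beta
```

Setting τ = 0 with ρ = 1 recovers ordinary static regression; increasing τ makes β_t wander faster.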
The simplest approach is a weighted linear regression. For example, the weights might come from a Gaussian pdf centered at the current time t₀,

w_t ∝ exp{ -(t - t₀)² / (2h²) },

or the weights could be a moving window that gives equal weight to the most recent observations and zero weight to the rest. How do we pick the bandwidth h?
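A minimal sketch of this locally weighted fit, assuming observations are ordered in time and using the Gaussian-kernel weights above (the function name and arguments are illustrative):

```python
import numpy as np

# Weighted linear regression at time t0: weight observation t by a
# Gaussian kernel w_t = exp(-(t - t0)^2 / (2 h^2)), where h is the
# bandwidth, then solve weighted least squares via the sqrt-weight trick.
def local_wls(X, y, t0, h):
    n = X.shape[0]
    t = np.arange(n)
    w = np.exp(-0.5 * ((t - t0) / h) ** 2)   # Gaussian weights
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(sw[:, None] * X, sw * y, rcond=None)
    return beta
```

A moving window is the same computation with indicator weights w_t = 1{|t - t₀| ≤ h}; in both cases h controls how local the fit is.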
A more elegant approach is the Kalman filter (KF). The KF can be motivated using a Bayesian approach, so before discussing the KF we will introduce Bayesian linear regression. Bayesian methods can be applied to virtually any statistical problem, but we will focus here on linear models:

Y | β ~ Normal(Xβ, σ² I_n)
β ~ Normal(µ, Σ),

where σ, µ, and Σ are assumed to be known. Bayesians assume that there is truly a fixed value of β. However, we acknowledge that we don't and never will know what it is, so we represent our uncertainty about β by treating it as a random variable with a probability distribution. Before we observe the data, our uncertainty is captured by the prior distribution; above we select the prior β ~ Normal(µ, Σ). A Bayesian analysis combines the data and the prior to give the posterior distribution. Bayes' theorem gives the posterior:

p(β | Y) = f(Y | β) f(β) / f(Y).

That is, posterior ∝ likelihood × prior. The posterior quantifies our uncertainty about β after observing the data, and is what we use to conduct inference and make predictions.
Derivation of the posterior of β for the model Y | β ~ Normal(Xβ, σ² I) with prior β ~ Normal(µ, Σ):
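The derivation is the standard completing-the-square argument; a sketch:

```latex
\begin{align*}
p(\beta \mid Y)
&\propto \exp\!\Big(-\tfrac{1}{2\sigma^2}(Y - X\beta)^\top (Y - X\beta)\Big)
   \exp\!\Big(-\tfrac{1}{2}(\beta - \mu)^\top \Sigma^{-1} (\beta - \mu)\Big) \\
&\propto \exp\!\Big(-\tfrac{1}{2}\,\beta^\top \big(X^\top X/\sigma^2 + \Sigma^{-1}\big)\beta
   + \beta^\top \big(X^\top Y/\sigma^2 + \Sigma^{-1}\mu\big)\Big).
\end{align*}
Completing the square in $\beta$ gives the posterior
\begin{align*}
\beta \mid Y &\sim \text{Normal}(M, V), \\
V &= \big(X^\top X/\sigma^2 + \Sigma^{-1}\big)^{-1}, \qquad
M = V\big(X^\top Y/\sigma^2 + \Sigma^{-1}\mu\big).
\end{align*}
```

As Σ⁻¹ → 0 (a flat prior), M reduces to the OLS estimate; a Normal(0, c²I) prior recovers ridge regression.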
Y_t | β_t ~ Normal(X_t β_t, σ² I)
β_t | β_{t-1} ~ Normal(ρ β_{t-1}, τ² I)

The KF is a sequential application of the Bayesian linear model. At the first time point we apply the usual Bayesian linear model and obtain the posterior β_1 | Y_1 ~ Normal(M_1, V_1). This posterior is used to define the prior for β_2. At the second time point the prior is β_2 | β_1 ~ Normal(ρβ_1, τ² I). We don't know β_1 exactly, but we have its posterior distribution given all the data we have observed. Accounting for our uncertainty in β_1, the prior for β_2 is

β_2 | Y_1 ~ Normal(ρM_1, ρ² V_1 + τ² I).

Applying the Bayesian linear model formulas again, the posterior of β_2 is

β_2 | Y_1, Y_2 ~ Normal(M_2, V_2).
General Kalman filter updating rule: given the posterior β_{t-1} | Y_1, ..., Y_{t-1} ~ Normal(M_{t-1}, V_{t-1}), the prediction step gives the prior for β_t,

β_t | Y_1, ..., Y_{t-1} ~ Normal(ρM_{t-1}, R_t),  where R_t = ρ² V_{t-1} + τ² I,

and the update step applies the Bayesian linear model with this prior to give the posterior β_t | Y_1, ..., Y_t ~ Normal(M_t, V_t), with

V_t = (X_t^T X_t / σ² + R_t^{-1})^{-1},  M_t = V_t (X_t^T Y_t / σ² + R_t^{-1} ρM_{t-1}).
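The predict/update cycle can be sketched directly from these formulas (a minimal implementation in information form; function and variable names are illustrative):

```python
import numpy as np

# Minimal Kalman filter for the DLM
#   Y_t ~ N(X_t beta_t, sigma^2 I),  beta_t ~ N(rho beta_{t-1}, tau^2 I).
# Xs and ys are sequences of per-time design matrices and responses.
def kalman_filter(Xs, ys, rho, tau, sigma, mu0, Sigma0):
    p = len(mu0)
    M = np.asarray(mu0, dtype=float)
    V = np.asarray(Sigma0, dtype=float)
    Ms, Vs = [], []
    for X_t, y_t in zip(Xs, ys):
        # Prediction step: prior for beta_t given data up to t-1
        a = rho * M
        R = rho**2 * V + tau**2 * np.eye(p)
        # Update step: Bayesian linear model with prior Normal(a, R)
        V = np.linalg.inv(X_t.T @ X_t / sigma**2 + np.linalg.inv(R))
        M = V @ (X_t.T @ y_t / sigma**2 + np.linalg.solve(R, a))
        Ms.append(M)
        Vs.append(V)
    return np.array(Ms), np.array(Vs)
```

Each step costs O(p³) regardless of how much data has been seen, which is what makes the KF attractive for streaming data.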
We have looked at only the simplest case. A more general version of the DLM is

Y_t ~ Normal(X_t β_t, Σ_t)
β_t | β_{t-1} ~ Normal(G_t β_{t-1}, Ω_t).

How do we estimate Σ_t, G_t, and Ω_t? So far we have assumed normality everywhere; what do we do for non-normal models, such as

Y_t ~ Poisson[exp(X_t β_t)]
β_t | β_{t-1} ~ Normal(G_t β_{t-1}, Ω_t)?

We have also assumed linear relationships between all variables; how do we handle nonlinearity, as in

Y_t ~ Normal[exp(X_t β_t), σ² I]
β_t | β_{t-1} ~ Normal(G_t β_{t-1}, Ω_t)?

These extensions have been worked out, but they are complicated (extended KF, unscented KF, ensemble KF, etc.).
Bayesian linear models for large p

While we're on the topic of Bayesian linear models, what do Bayesians do for linear (not dynamic) regression with large p? The linear regression model is Y_i ~ Normal(X_i^T β, σ²). The Bayesian model allows us to put priors on the regression coefficients β_1, ..., β_p. If we believe before seeing the data that most of the covariates are unimportant, we can simply specify a prior with mass near zero for the β_j, for example a double-exponential prior (the Bayesian LASSO) or a spike-and-slab prior. This is a very intuitive approach, and it has been shown to be very competitive.
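A minimal illustration of the shrinkage idea, using the conjugate normal prior from earlier rather than the heavier-tailed priors mentioned above (all names and defaults are assumptions for the sketch): a Normal(0, c²) prior on each β_j pulls the posterior mean toward zero, with small c encoding the belief that most covariates are unimportant.

```python
import numpy as np

# Posterior mean under Y | beta ~ N(X beta, sigma^2 I) with independent
# Normal(0, c^2) priors on each coefficient (i.e. mu = 0, Sigma = c^2 I):
#   V = (X^T X / sigma^2 + I / c^2)^{-1},  M = V X^T y / sigma^2.
def posterior_mean(X, y, sigma=1.0, c=0.1):
    p = X.shape[1]
    V = np.linalg.inv(X.T @ X / sigma**2 + np.eye(p) / c**2)
    return V @ (X.T @ y) / sigma**2
```

Small c shrinks all coefficients hard toward zero (equivalent to heavy ridge regularization), while large c approaches OLS; the Bayesian LASSO and spike-and-slab priors refine this by shrinking small coefficients more aggressively than large ones.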