Non-linear/non-parametric regression

Muriel Riley
7 years ago

Thus far we have assumed that the mean response is a linear combination of the covariates. When the linear model fits, it is simple and interpretable. However, it is often a gross over-simplification of reality. Misspecification leads to questionable statistical inference and poor prediction.

In this section we will explore more sophisticated methods that avoid assuming a linear relationship between the predictors and the response. Often the objective of the analysis is prediction rather than inference.

Outline:
1. Polynomial regression
2. Local linear regression
3. Nearest neighbor analysis
4. Gaussian process regression
5. Splines/Additive models
6. Regression trees
7. Neural networks

(4) Nonlinear regression - Part 1

Examples: dose-response modeling, climate research, precision medicine, engineering, and the homes data.

Diagnostics

Before fitting a fancy model, we should fit the linear model to verify that it is insufficient. Say $\hat\beta_{OLS}$ is the ordinary least squares estimate and $r_i = Y_i - X_i^T\hat\beta_{OLS}$ is the residual for observation $i$. The added variable plot plots the residuals against each covariate. If all of these plots look like random scatter, then the linear model is probably sufficient.

Polynomial regression

If a linear model doesn't fit, the next simplest method is polynomial regression. The full second-order model includes every linear term, every squared term, and every pairwise interaction, so it has $p + p + p(p-1)/2$ terms. There are obviously functions that are not quadratic, and adding higher-order terms is costly even with moderate $p$.
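To illustrate the term count, the second-order expansion of one covariate vector can be sketched as below. The helper names `second_order_terms` and `n_second_order_terms` are hypothetical and not part of the lecture, and the intercept is not counted.

```python
from itertools import combinations


def second_order_terms(x):
    """Return the linear, squared, and pairwise-interaction terms of x."""
    linear = list(x)                                        # p linear terms
    squares = [v * v for v in x]                            # p squared terms
    interactions = [a * b for a, b in combinations(x, 2)]   # p(p-1)/2 products
    return linear + squares + interactions


def n_second_order_terms(p):
    """Number of non-intercept terms: p + p + p(p - 1)/2."""
    return p + p + p * (p - 1) // 2


x = [1.0, 2.0, 3.0]
terms = second_order_terms(x)
print(terms)
print(len(terms), n_second_order_terms(len(x)))
```

With p = 20 covariates the same count is already 230 terms, which is why going beyond second order quickly becomes expensive.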

Non-linear/non-parametric regression

The non-linear regression model is $Y_i = f(X_i) + \epsilon_i$, where $f(X_i)$ is the mean of $Y_i$ and $\epsilon_i \sim N(0, \sigma^2)$ are iid errors. The function $f$ is a curve if $p = 1$ and a (response) surface if $p = 2$. The objective is to estimate $f$ at the data points and to predict $f(X_0)$ for a new data point $X_0$. We make no assumptions about the response function $f$ other than that it is a continuous function from the covariates $X_i$ to $\mathbb{R}$. In this sense we are performing nonparametric regression.

Shorthand: $f(X_i) = f_i$.

Nearest neighbor analysis

Define the distance as $d_{ij} = \|X_i - X_j\| = \sqrt{\sum_{l=1}^{p} (X_{il} - X_{jl})^2}$. The simplest approach is to take the predicted value for an observation with covariates $X_0$ as the average of $Y_i$ over the $k$ points with the smallest $d_{i0}$.

How to pick $k$? No observations are neighbors in high dimensions.
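The nearest neighbor prediction described above can be sketched in a few lines. This is a minimal illustration using Euclidean distance; the function names and the toy data are made up for the example.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two covariate vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def knn_predict(X, Y, x0, k):
    """Average the responses of the k training points closest to x0."""
    order = sorted(range(len(X)), key=lambda i: euclidean(X[i], x0))
    nearest = order[:k]
    return sum(Y[i] for i in nearest) / k


# Toy data (hypothetical): one covariate, five observations.
X = [[0.0], [1.0], [2.0], [3.0], [10.0]]
Y = [0.0, 1.0, 4.0, 9.0, 100.0]
print(knn_predict(X, Y, [1.2], k=2))
```

With k = 1 the prediction copies the closest response, and with k = n it is the overall mean, so k controls how smooth the fitted function is.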

Kernel smoothing

Another approach is to take a weighted average of all observations, with more weight going to nearby observations. Define the kernel function $k_{ij} = K(X_i, X_j) = K(X_j, X_i) > 0$ as a measure of similarity between observations $i$ and $j$. There are many types of kernel functions. Usually the kernels for a given point $X_0$ are scaled to sum to one, so that $\sum_{i=1}^{n} k_{i0} = 1$. The kernel smoothing estimate of $f(X_0)$ is then the weighted average $\hat f(X_0) = \sum_{i=1}^{n} k_{i0} Y_i$.

[Plot in 1D]
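A minimal sketch of kernel smoothing follows. It assumes a Gaussian kernel with bandwidth `h`; that choice is an assumption made for illustration and is not taken from the slide.

```python
import math


def gaussian_kernel(a, b, h):
    """Similarity of points a and b; Gaussian kernel with bandwidth h (assumed)."""
    d2 = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-0.5 * d2 / h ** 2)


def kernel_smooth(X, Y, x0, h):
    """Weighted average of all responses, with kernel weights scaled to sum to one."""
    k = [gaussian_kernel(xi, x0, h) for xi in X]
    total = sum(k)
    weights = [ki / total for ki in k]
    return sum(wi * yi for wi, yi in zip(weights, Y))


# Toy data (hypothetical): the point midway between two observations
# gives them equal weight.
print(kernel_smooth([[0.0], [2.0]], [0.0, 2.0], [1.0], h=1.0))
```

A small bandwidth makes the estimate follow the nearest observations closely, while a large bandwidth pulls it toward the overall mean.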

Local linear regression

Maybe the linear model doesn't fit globally, but in most cases functions are locally linear. Local linear regression is weighted linear regression with weights given by $k_{i0}$, so that only observations local to $X_0$ contribute to the prediction. The estimated slope and the prediction at $X_0$ come from this weighted least squares fit.

How to pick the bandwidth? Should all the covariates contribute equally to the distance metric?

This is implemented in R's loess function.
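For a single covariate, the weighted least squares fit around $X_0$ has a closed form, sketched below. Gaussian kernel weights with bandwidth `h` are assumed for illustration. This is a hand-rolled example and not a reimplementation of R's loess.

```python
import math


def local_linear(x, y, x0, h):
    """Weighted simple linear regression centered at x0.

    Returns (slope, prediction at x0). Weights come from a Gaussian
    kernel with bandwidth h, which is an assumption of this sketch.
    """
    w = [math.exp(-0.5 * ((xi - x0) / h) ** 2) for xi in x]
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw   # weighted mean of x
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw   # weighted mean of y
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return slope, ybar + slope * (x0 - xbar)


# Toy data (hypothetical): an exactly linear response is recovered for any h.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [2.0 * xi + 1.0 for xi in x]
print(local_linear(x, y, x0=1.5, h=1.0))
```

Unlike the plain kernel average, the local line reproduces a linear trend exactly, which reduces bias near the edges of the data.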

Gaussian process (GP) regression

All of the methods above are linear predictors, i.e., the predicted value is a linear combination of the observations. These methods are all fast, flexible, and interpretable. However, they are not optimal for prediction. Gaussian process regression is computationally more challenging, but has optimal prediction properties similar to the Gauss-Markov property in linear regression. In this sense, it serves as the gold standard for non-linear regression methods.

Free book! Rasmussen and Williams, Gaussian Processes for Machine Learning.

The Gaussian process regression model is $Y_i = f(X_i) + \epsilon_i$, where $f(X_i)$ is the mean of $Y_i$ and $\epsilon_i \sim N(0, \sigma^2)$ are iid errors. In GP regression $f$ is modeled as a Gaussian process.

Definition of a GP

The random function $f$ is a Gaussian process if and only if, for any $n$ locations $X_1, ..., X_n$, the vector $(f(X_1), ..., f(X_n))$ follows a multivariate normal distribution. A Gaussian distribution is completely specified by its mean and variance. Analogously, a Gaussian process is completely specified by its mean function $\mu(X) = E[f(X)]$ and its covariance function $\mathrm{Cov}[f(X), f(X')]$.

Common mean and covariance functions

We could use a constant mean function $\mu(X) = \beta_0$. Another possibility is a linear mean $\mu(X) = X^T\beta$. A common covariance function is the powered exponential. If $(f_1, ..., f_n) \sim N(\mu, \Sigma)$, then, because the errors are iid $N(0, \sigma^2)$ and independent of $f$, $(Y_1, ..., Y_n) \sim N(\mu, \Sigma + \sigma^2 I_n)$.
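As a sketch, the covariance matrix of the responses can be built as below. The parameterization $\mathrm{Cov}(f_i, f_j) = \tau^2 \exp\{-(d_{ij}/\phi)^\kappa\}$ with $0 < \kappa \le 2$ is one common form of the powered exponential and is assumed here. The parameter names `tau2`, `phi`, and `kappa` are illustrative and not taken from the slide.

```python
import math


def powered_exponential(X, tau2, phi, kappa):
    """Covariance matrix of (f_1, ..., f_n) under an assumed parameterization:
    tau2 * exp(-(d_ij / phi) ** kappa), with d_ij the Euclidean distance."""
    n = len(X)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(X[i], X[j])))
            K[i][j] = tau2 * math.exp(-((d / phi) ** kappa))
    return K


def add_nugget(K, sigma2):
    """Covariance of (Y_1, ..., Y_n): add the error variance sigma2 to the diagonal."""
    n = len(K)
    return [[K[i][j] + (sigma2 if i == j else 0.0) for j in range(n)]
            for i in range(n)]


X = [[0.0], [1.0], [3.0]]
K = powered_exponential(X, tau2=2.0, phi=1.0, kappa=2.0)
print(add_nugget(K, sigma2=0.5))
```

In this form, `kappa = 1` gives the exponential covariance and `kappa = 2` the squared exponential (Gaussian) covariance.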

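The GP likelihood

The likelihood is the density of $Y = (Y_1, ..., Y_n)$. Say the mean is $E(Y) = X\beta$ and the covariance is $\Sigma(\theta)$, where $\theta$ collects the covariance parameters. Then the likelihood is the multivariate normal density
$L(\beta, \theta) = (2\pi)^{-n/2} |\Sigma(\theta)|^{-1/2} \exp\{-\frac{1}{2}(Y - X\beta)^T \Sigma(\theta)^{-1} (Y - X\beta)\}$.
The log likelihood is
$\ell(\beta, \theta) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\log|\Sigma(\theta)| - \frac{1}{2}(Y - X\beta)^T \Sigma(\theta)^{-1} (Y - X\beta)$.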
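The multivariate normal log likelihood of $Y$ with mean vector $X\beta$ and covariance $\Sigma(\theta)$ can be evaluated without forming an explicit inverse. The sketch below uses a Cholesky factor $\Sigma = LL^T$, so that $\log|\Sigma| = 2\sum_i \log L_{ii}$ and the quadratic form equals $\|L^{-1}(Y - X\beta)\|^2$. It is written in plain Python for small $n$, and the function names are illustrative.

```python
import math


def cholesky(S):
    """Lower-triangular L with L L^T = S, for symmetric positive-definite S."""
    n = len(S)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = S[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def forward_solve(L, b):
    """Solve L z = b for lower-triangular L."""
    z = []
    for i in range(len(b)):
        z.append((b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
    return z


def gp_log_likelihood(Y, mean, S):
    """Log density of Y ~ N(mean, S), where S is the covariance of Y."""
    n = len(Y)
    L = cholesky(S)
    resid = [y - m for y, m in zip(Y, mean)]
    z = forward_solve(L, resid)
    log_det = 2.0 * sum(math.log(L[i][i]) for i in range(n))
    quad = sum(zi * zi for zi in z)
    return -0.5 * n * math.log(2.0 * math.pi) - 0.5 * log_det - 0.5 * quad


print(gp_log_likelihood([1.0, 0.0], [0.0, 0.0], [[2.0, 1.0], [1.0, 2.0]]))
```

The factorization costs on the order of $n^3$ operations, which is the bottleneck when $n$ is large.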

Coordinate descent (CD)

One way to estimate $\beta$ and $\theta$ is CD. In CD, we iterate between updating $\beta$ with $\theta$ held fixed and updating $\theta$ with $\beta$ held fixed. The update of $\beta$ is available in closed form given $\theta$. The covariance parameters can be updated using Newton-Raphson because the gradient and Hessian are computable.
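For reference, the closed-form update of $\beta$ follows from the multivariate normal log likelihood with mean $X\beta$ and covariance $\Sigma(\theta)$. Setting its gradient with respect to $\beta$ to zero gives the generalized least squares estimate. The derivation below is a sketch in standard notation, assuming $X^T \Sigma(\theta)^{-1} X$ is invertible.

```latex
\begin{align*}
\ell(\beta, \theta) &= -\tfrac{n}{2}\log(2\pi) - \tfrac{1}{2}\log|\Sigma(\theta)|
  - \tfrac{1}{2}(Y - X\beta)^T \Sigma(\theta)^{-1} (Y - X\beta), \\
\frac{\partial \ell}{\partial \beta} &= X^T \Sigma(\theta)^{-1} (Y - X\beta) = 0
  \quad\Longrightarrow\quad
  \hat\beta(\theta) = \left( X^T \Sigma(\theta)^{-1} X \right)^{-1} X^T \Sigma(\theta)^{-1} Y .
\end{align*}
```

When $\Sigma(\theta) = \sigma^2 I_n$, this reduces to the ordinary least squares estimate.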

Methods for large n

Inverting the covariance matrix is painfully slow when $n$ is large. There are many approaches to avoid this. One approach is to replace the log likelihood with a sum over independent blocks. Here we partition the observations into $B$ blocks. We then approximate the log likelihood as the sum of the block log likelihoods. This is parallelizable and avoids large matrix inversions.
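In symbols, the block approximation can be written as follows. The notation is assumed for illustration: $Y_b$, $X_b$, and $\Sigma_b(\theta)$ denote the responses, covariates, and covariance matrix of the $n_b$ observations in block $b$.

```latex
\begin{align*}
\ell(\beta, \theta) &\approx \sum_{b=1}^{B} \ell_b(\beta, \theta), \\
\ell_b(\beta, \theta) &= -\tfrac{n_b}{2}\log(2\pi) - \tfrac{1}{2}\log|\Sigma_b(\theta)|
  - \tfrac{1}{2}(Y_b - X_b\beta)^T \Sigma_b(\theta)^{-1} (Y_b - X_b\beta).
\end{align*}
```

Each term only requires factorizing an $n_b \times n_b$ matrix, so the cost drops from order $n^3$ to order $\sum_b n_b^3$, and the $B$ terms can be computed in parallel.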

Predictions

In GP regression we fit the model to estimate the covariance function. For prediction we then have the covariance between the data points and the prediction point. This gives the best linear unbiased prediction (BLUP). This method of prediction is called Kriging.
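A minimal sketch of the Kriging prediction follows, using the standard form $\hat f(X_0) = \mu_0 + \Sigma_0^T \Sigma^{-1}(Y - \mu)$, where $\Sigma_0$ holds the covariances between $f(X_0)$ and the observations. The sketch assumes a known constant mean `beta0` and an exponential covariance with parameters `tau2` and `phi`. In practice these would be replaced by their estimates, and all names here are illustrative.

```python
import math


def cov(a, b, tau2, phi):
    """Exponential covariance tau2 * exp(-d / phi); an assumed choice."""
    d = math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    return tau2 * math.exp(-d / phi)


def solve_spd(S, b):
    """Solve S x = b for symmetric positive-definite S via Cholesky."""
    n = len(b)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = S[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    z = [0.0] * n
    for i in range(n):                       # forward substitution: L z = b
        z[i] = (b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):             # back substitution: L^T x = z
        x[i] = (z[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def krige(X, Y, x0, beta0, tau2, phi, sigma2):
    """Predict f(x0) as beta0 + c0^T (K + sigma2 I)^{-1} (Y - beta0)."""
    n = len(X)
    S = [[cov(X[i], X[j], tau2, phi) + (sigma2 if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    c0 = [cov(x0, X[i], tau2, phi) for i in range(n)]
    alpha = solve_spd(S, [y - beta0 for y in Y])
    return beta0 + sum(c * a for c, a in zip(c0, alpha))


# Toy data (hypothetical).
X = [[0.0], [1.0], [2.5]]
Y = [1.0, 3.0, 2.0]
print(krige(X, Y, [1.7], beta0=2.0, tau2=1.0, phi=1.0, sigma2=0.1))
```

Two sanity checks follow from the formula. With a negligible error variance the prediction interpolates the observed responses. Far from all data the covariances vanish and the prediction returns to the mean.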

Local classification and local likelihoods

November 18

k-nearest neighbors

The idea of local regression can be extended to classification as well. The simplest way of doing so is called nearest neighbor