Model-based boosting in R


Model-based boosting in R - Introduction to Gradient Boosting. Matthias Schmid, Institut für Medizininformatik, Biometrie und Epidemiologie (IMBE), Friedrich-Alexander-Universität Erlangen-Nürnberg. Statistical Computing 2011

Aims and scope Why boosting? References 2/35

Aims and scope Aims and scope Consider a sample containing the values of a response variable Y and the values of some predictor variables X = (X_1, ..., X_p). Aim: Find the optimal function f(X) to predict Y. f(X) should have a nice structure, for example, f(X) = β_0 + β_1 X_1 + ... + β_p X_p (GLM) or f(X) = β_0 + f_1(X_1) + ... + f_p(X_p) (GAM). f should be interpretable. 3/35

Aims and scope Example 1 - Birth weight data Prediction of birth weight by means of ultrasound measures (Schild et al. 2008) Outcome: birth weight (BW) in g Predictor variables: abdominal volume (volabdo), biparietal diameter (BPD), head circumference (HC), other predictors (measured one week before delivery) Data from n = 150 children with birth weight ≤ 1600 g ⇒ Find f to predict BW 4/35

Aims and scope Birth weight data (2) Idea: Use 3D ultrasound measurements in addition to conventional 2D ultrasound measurements [images omitted; sources: www.yourultrasound.com, www.fetalultrasoundutah.com] ⇒ Improve established formulas for weight prediction 5/35

Aims and scope Example 2 - Breast cancer data Data collected by the Netherlands Cancer Institute (van de Vijver et al. 2002) 295 female patients younger than 53 years Outcome: time to death after surgery (in years) Predictor variables: microarray data (4919 genes) + 9 clinical variables (age, tumor diameter, ...) Select a small set of marker genes ("sparse model") Use clinical variables and marker genes to predict survival 6/35

Why boosting? Classical modeling approaches Classical approach to obtain predictions from the birth weight data and the breast cancer data: Fit additive regression models (Gaussian regression, Cox regression) using maximum likelihood (ML) estimation. Example: Additive Gaussian model with smooth effects (represented by P-splines) for the birth weight data, f(X) = β_0 + f_1(X_1) + ... + f_p(X_p) 7/35
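To make the classical approach concrete, here is a minimal sketch of fitting such an additive Gaussian model with P-spline smooth effects in R. The mgcv package is one common choice; the data frame name bw_data is an assumption, and the predictor names are taken from the birth weight slides above.

library(mgcv)  # additive models with P-spline smooths, fitted by penalized likelihood

# hypothetical data frame 'bw_data' with outcome BW and the ultrasound predictors
classical_fit <- gam(BW ~ s(volabdo, bs = "ps") + s(BPD, bs = "ps") + s(HC, bs = "ps"),
                     family = gaussian(), data = bw_data)
summary(classical_fit)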

Why boosting? Problems with ML estimation Predictor variables are highly correlated ⇒ Variable selection is of interest because of multicollinearity ("Do we really need 9 highly correlated predictor variables?") In case of the breast cancer data: Maximum (partial) likelihood estimates for Cox regression do not exist (there are 4928 predictor variables but only 295 observations, p > n) ⇒ Variable selection is needed because of extreme multicollinearity We want to have a sparse (interpretable) model including the relevant predictor variables only Conventional methods for variable selection (univariate, forward, backward, etc.) are known to be unstable and/or require the model to be fitted multiple times. 8/35

Why boosting? Boosting - General properties Gradient boosting ("boosting" for short) is a fitting method to minimize general types of risk functions w.r.t. a prediction function f. Examples of risk functions: squared error loss in Gaussian regression, negative log likelihood loss. Boosting generally results in an additive prediction function, i.e., f(X) = β_0 + f_1(X_1) + ... + f_p(X_p) ⇒ The prediction function is interpretable. If run until convergence, boosting can be regarded as an alternative to conventional fitting methods (Fisher scoring, backfitting) for generalized additive models. 9/35

Why boosting? Why boosting? In contrast to conventional fitting methods,... ... boosting is applicable to many different risk functions (absolute loss, quantile regression) ... boosting can be used to carry out variable selection during the fitting process ⇒ No separation of model fitting and variable selection ... boosting is applicable even if p > n ... boosting addresses multicollinearity problems (by shrinking effect estimates towards zero) ... boosting optimizes prediction accuracy (w.r.t. the risk function) 10/35
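As a concrete illustration of the first point, mboost lets one swap the risk function via its family argument while the fitting algorithm stays the same. A minimal sketch, assuming a data frame dat with response y (both names hypothetical):

library(mboost)

fit_ls <- glmboost(y ~ ., data = dat, family = Gaussian())           # squared error loss
fit_l1 <- glmboost(y ~ ., data = dat, family = Laplace())            # absolute loss (median regression)
fit_q9 <- glmboost(y ~ ., data = dat, family = QuantReg(tau = 0.9))  # 90% quantile regression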

Gradient boosting - estimation problem Consider a one-dimensional response variable Y and a p-dimensional set of predictors X = (X_1, ..., X_p). Aim: Estimation of
f* := argmin_{f(·)} E[ρ(Y, f(X))],
where ρ is a loss function that is assumed to be differentiable with respect to a prediction function f(X). Examples of loss functions: ρ(Y, f(X)) = (Y - f(X))² ⇒ squared error loss in Gaussian regression; the negative log likelihood function of a statistical model. 11/35

Gradient boosting - estimation problem (2) In practice, we usually have a set of realizations X = (X_1, ..., X_n), Y = (Y_1, ..., Y_n) of X and Y, respectively. ⇒ Minimization of the empirical risk
R = (1/n) Σ_{i=1}^{n} ρ(Y_i, f(X_i))
with respect to f. Example: R = (1/n) Σ_{i=1}^{n} (Y_i - f(X_i))² corresponds to minimizing the expected squared error loss. 12/35
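For the squared error loss, the empirical risk can be written directly in R. A minimal sketch, assuming a numeric response vector y, a predictor vector x and a prediction function f_hat (all hypothetical):

emp_risk <- function(y, x, f_hat) {
  mean((y - f_hat(x))^2)   # (1/n) * sum of squared residuals
}

# e.g., the risk of the constant prediction f(x) = mean(y):
# emp_risk(y, x, function(x) rep(mean(y), length(x)))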

Naive functional gradient descent (FGD) Idea: use gradient descent methods to minimize R = R(f_(1), ..., f_(n)) w.r.t. f_(1) = f(X_1), ..., f_(n) = f(X_n). Start with offset values f̂_(1)^[0], ..., f̂_(n)^[0]. In iteration m:
(f̂_(1)^[m], ..., f̂_(n)^[m]) = (f̂_(1)^[m-1], ..., f̂_(n)^[m-1]) - ν · ( ∂R/∂f_(1) (f̂_(1)^[m-1]), ..., ∂R/∂f_(n) (f̂_(n)^[m-1]) ),
where ν is a step length factor ⇒ principle of steepest descent. 13/35

Naive functional gradient descent (2) (Very) simple example: n = 2, Y_1 = Y_2 = 0, ρ = squared error loss ⇒ R = (1/2) · (f_(1)² + f_(2)²) [Figure: contour plot of z = f_1² + f_2² over f_1 and f_2.] 14/35
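For this two-observation toy problem, naive functional gradient descent can be written in a few lines of R. A minimal sketch, assuming squared error loss with the factor 1/2, a step length nu = 0.1 and arbitrary starting values:

y  <- c(0, 0)            # Y_1 = Y_2 = 0
f  <- c(3, -2)           # starting values for f_(1), f_(2)
nu <- 0.1
for (m in 1:100) {
  grad <- f - y          # gradient of R = (1/2) * sum((y - f)^2)
  f    <- f - nu * grad  # steepest descent update
}
f                        # approaches (0, 0), i.e., the observed responses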

Naive functional gradient descent (3) Increase m until the algorithm converges to some values f̂_(1)^[m_stop], ..., f̂_(n)^[m_stop]. Problem with naive gradient descent: No predictor variables involved ⇒ Structural relationship between f̂_(1)^[m_stop], ..., f̂_(n)^[m_stop] is ignored (f̂_(1)^[m] → Y_1, ..., f̂_(n)^[m] → Y_n) ⇒ Predictions only for observed values Y_1, ..., Y_n 15/35

Gradient Boosting Solution: Estimate the negative gradient in each iteration. Estimation is performed by some base-learning procedure regressing the negative gradient on the predictor variables. ⇒ The base-learning procedure ensures that f̂_(1)^[m_stop], ..., f̂_(n)^[m_stop] are predictions from a statistical model depending on the predictor variables. To do this, we specify a set of regression models ("base-learners") with the negative gradient as the dependent variable. In many applications, the set of base-learners will consist of p simple regression models (one base-learner for each of the p predictor variables, "component-wise gradient boosting"). 16/35

Gradient Boosting (2) Functional gradient descent (FGD) boosting algorithm: 1. Initialize the n-dimensional vector f̂^[0] with some offset values (e.g., use a vector of zeroes). Set m = 0 and specify the set of base-learners. Denote the number of base-learners by P. 2. Increase m by 1. Compute the negative gradient -∂ρ(Y, f)/∂f and evaluate it at f̂^[m-1](X_i), i = 1, ..., n. This yields the negative gradient vector
U^[m-1] = (U_i^[m-1])_{i=1,...,n} := ( -∂ρ(Y, f)/∂f evaluated at Y = Y_i, f = f̂^[m-1](X_i) )_{i=1,...,n}.
17/35

Gradient Boosting (3) 3. Estimate the negative gradient vector U^[m-1] by using the base-learners (i.e., the P regression estimators) specified in Step 1. This yields P vectors, where each vector is an estimate of the negative gradient vector U^[m-1]. Select the base-learner that fits U^[m-1] best (⇒ minimum SSE). Set Û^[m-1] equal to the fitted values from the corresponding best model. 18/35

Gradient Boosting (4) 4. Update f̂^[m] = f̂^[m-1] + ν · Û^[m-1], where 0 < ν ≤ 1 is a real-valued step length factor. 5. Iterate Steps 2-4 until m = m_stop. The step length factor ν could be chosen adaptively. Usually, an adaptive strategy does not improve the estimates of f but will only lead to an increase in running time ⇒ choose ν small (e.g., ν = 0.1) but fixed. 19/35

Schematic overview of Step 3 in iteration m Component-wise gradient boosting with one base-learner for each predictor variable: U^[m-1] is regressed separately on each of X_1, X_2, ..., X_j, ..., X_p; the best-fitting base-learner is selected, and its fitted values define Û^[m-1]. 20/35
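The complete component-wise algorithm is short enough to prototype directly. A minimal sketch with simple linear base-learners and squared error loss; the simulated data and the settings nu = 0.1 and mstop = 100 are assumptions for illustration.

set.seed(1)
n <- 100; p <- 5
X <- matrix(rnorm(n * p), n, p)
y <- 2 * X[, 1] - X[, 3] + rnorm(n)

nu    <- 0.1
mstop <- 100
f_hat <- rep(mean(y), n)            # Step 1: offset
beta  <- rep(0, p)                  # accumulated coefficient estimates
for (m in 1:mstop) {
  u <- y - f_hat                    # Step 2: negative gradient (residuals for squared error loss)
  fits   <- lapply(1:p, function(j) lm(u ~ X[, j]))  # Step 3: one base-learner per predictor
  sse    <- sapply(fits, function(fm) sum(residuals(fm)^2))
  j_best <- which.min(sse)          # best-fitting base-learner (minimum SSE)
  f_hat  <- f_hat + nu * fitted(fits[[j_best]])      # Step 4: update the prediction function
  beta[j_best] <- beta[j_best] + nu * coef(fits[[j_best]])[2]
}                                   # Step 5: stop at m = mstop
round(beta, 2)                      # informative predictors get nonzero (shrunken) estimates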

Simple example In case of Gaussian regression, gradient boosting is equivalent to iteratively re-fitting the residuals of the model. [Figures for m = 0, 1, 2, 3, 10, 20, 30, 150: left panels show y against x for data simulated from y = (0.5 - 0.9 · e^(-50 x²)) · x + 0.02 · ε; right panels show the residuals of the boosting fit after m iterations.] 21/35 - 28/35
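The simulated example can be reproduced with mboost. A minimal sketch, assuming a P-spline base-learner (bbs) for the single predictor x; the sample size and noise level are chosen for illustration.

library(mboost)
set.seed(2)
n <- 200
x <- runif(n, -0.2, 0.2)
y <- (0.5 - 0.9 * exp(-50 * x^2)) * x + 0.02 * rnorm(n)

fit <- gamboost(y ~ bbs(x), control = boost_control(mstop = 150, nu = 0.1))

# residuals after m iterations (subsetting with [ sets the model to iteration m)
res_after <- function(m) y - as.numeric(predict(fit[m]))
str(lapply(c(1, 10, 150), res_after))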

Properties of gradient boosting It is clear from Step 4 that the predictions of Y_1, ..., Y_n in iteration m_stop take the form of an additive function:
f̂^[m_stop] = f̂^[0] + ν · Û^[0] + ... + ν · Û^[m_stop - 1]
The structure of the prediction function depends on the choice of the base-learners. For example, linear base-learners result in linear prediction functions; smooth base-learners result in additive prediction functions with smooth components. ⇒ f̂^[m_stop] has a meaningful interpretation 29/35

Gradient boosting with early stopping Gradient boosting has a built-in mechanism for base-learner selection in each iteration. This mechanism will carry out variable selection. ⇒ Gradient boosting is applicable even if p > n. In case p > n, it is usually desirable to select a small number of informative predictor variables ("sparse solution"). If m → ∞, the algorithm will select non-informative predictor variables. ⇒ Overfitting can be avoided if the algorithm is stopped early, i.e., if m_stop is considered as a tuning parameter of the algorithm 30/35

Illustration of variable selection and early stopping Very simple example: 3 predictor variables X_1, X_2, X_3, 3 linear base-learners with coefficient estimates β̂_j^[m], j = 1, 2, 3. Assume that m_stop = 5. Assume that X_1 was selected in the first, second and fifth iteration. Assume that X_3 was selected in the third and fourth iteration.
f̂^[m_stop] = f̂^[0] + ν·Û^[0] + ν·Û^[1] + ν·Û^[2] + ν·Û^[3] + ν·Û^[4]
= β̂^[0] + ν·β̂_1^[0]·X_1 + ν·β̂_1^[1]·X_1 + ν·β̂_3^[2]·X_3 + ν·β̂_3^[3]·X_3 + ν·β̂_1^[4]·X_1
= β̂^[0] + ν·(β̂_1^[0] + β̂_1^[1] + β̂_1^[4])·X_1 + ν·(β̂_3^[2] + β̂_3^[3])·X_3
= β̂_0 + β̂_1·X_1 + β̂_3·X_3
⇒ Linear prediction function; X_2 is not included in the model (variable selection) 31/35
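The same mechanism can be observed with mboost's component-wise linear model. A minimal sketch on simulated data (an assumption for illustration) in which X2 carries no signal:

library(mboost)
set.seed(3)
n  <- 100
X1 <- rnorm(n); X2 <- rnorm(n); X3 <- rnorm(n)
dat <- data.frame(y = 1.5 * X1 - 2 * X3 + rnorm(n), X1, X2, X3)

fit <- glmboost(y ~ X1 + X2 + X3, data = dat,
                control = boost_control(mstop = 5, nu = 0.1))
coef(fit)       # aggregated, shrunken estimates; X2 is typically not selected
selected(fit)   # base-learner chosen in each of the 5 iterations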

How should the stopping iteration be chosen? Use cross-validation techniques to determine m_stop. [Figure: cross-validated predictive risk plotted against the number of boosting iterations, with the cross-validated m_stop marked.] The stopping iteration is chosen such that it maximizes prediction accuracy. 32/35
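With mboost, this tuning step is available via cvrisk(). A minimal sketch, continuing the glmboost fit from the previous sketch (by default cvrisk() uses 25 bootstrap samples):

fit <- fit[100]        # allow up to 100 boosting iterations
cvm <- cvrisk(fit)     # cross-validated predictive risk per iteration
plot(cvm)              # risk vs. number of boosting iterations
mstop(cvm)             # iteration with minimal cross-validated risk
fit <- fit[mstop(cvm)] # set the model to the chosen stopping iteration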

Shrinkage Early stopping will not only result in sparse solutions but will also lead to shrunken effect estimates (⇒ only a small fraction of Û is added to the estimates in each iteration). Shrinkage leads to a downward bias (in absolute value) but to a smaller variance of the effect estimates (similar to Lasso or Ridge regression). ⇒ Multicollinearity problems are addressed. 33/35

Further aspects There are many types of boosting methods, e.g., tree-based boosting (AdaBoost, Freund & Schapire 1997) and likelihood-based boosting (Tutz & Binder 2006). Here we consider gradient boosting: a flexible method to fit many types of statistical models in high- and low-dimensional settings, with regularization of estimates via variable selection and shrinkage. Implemented in the R package mboost (Hothorn et al. 2010, 2011). 34/35
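For a survival application in the spirit of the breast cancer example, a hedged sketch with mboost's CoxPH() family might look as follows; the data frame cancer and the variable names time and status are assumptions.

library(mboost)
library(survival)

fit_cox <- glmboost(Surv(time, status) ~ ., data = cancer, family = CoxPH(),
                    control = boost_control(mstop = 500, nu = 0.1))
cvm     <- cvrisk(fit_cox)    # tune the stopping iteration by resampling
fit_cox <- fit_cox[mstop(cvm)]
coef(fit_cox)                 # sparse model: only the selected markers appear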

References References
Freund, Y. and R. Schapire (1997): A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55, 119-139.
Hothorn, T., P. Bühlmann, T. Kneib, M. Schmid and B. Hofner (2010): Model-based boosting 2.0. Journal of Machine Learning Research 11, 2109-2113.
Hothorn, T., P. Bühlmann, T. Kneib, M. Schmid and B. Hofner (2011): mboost: Model-Based Boosting. R package version 2.1-0. https://r-forge.r-project.org/projects/mboost/
Schild, R. L., M. Maringa, J. Siemer, B. Meurer, N. Hart, T. W. Goecke, M. Schmid, T. Hothorn and M. E. Hansmann (2008): Weight estimation by three-dimensional ultrasound imaging in the small fetus. Ultrasound in Obstetrics and Gynecology 32, 168-175.
Tutz, G. and H. Binder (2006): Generalized additive modelling with implicit variable selection by likelihood based boosting. Biometrics 62, 961-971.
van de Vijver, M. J., Y. D. He, L. J. van 't Veer, H. Dai, A. A. M. Hart, D. W. Voskuil et al. (2002): A gene-expression signature as a predictor of survival in breast cancer. New England Journal of Medicine 347, 1999-2009.
35/35