# Model-based boosting in R

## Transcription

### Slide 1: Model-based boosting in R

Introduction to Gradient Boosting

Matthias Schmid, Institut für Medizininformatik, Biometrie und Epidemiologie (IMBE), Friedrich-Alexander-Universität Erlangen-Nürnberg

Statistical Computing 2011

### Slide 2

- Aims and scope
- Why boosting?
- References

### Slide 3: Aims and scope

- Consider a sample containing the values of a response variable $Y$ and the values of some predictor variables $X = (X_1, \ldots, X_p)$.
- Aim: find the optimal function $f^*(X)$ to predict $Y$.
- $f^*(X)$ should have a nice structure, for example
  $f^*(X) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$ (GLM) or
  $f^*(X) = \beta_0 + f_1(X_1) + \cdots + f_p(X_p)$ (GAM).
- $f^*$ should be interpretable.

### Slide 4: Example 1 - Birth weight data

- Prediction of birth weight by means of ultrasound measures (Schild et al. 2008).
- Outcome: birth weight (BW) in g.
- Predictor variables (measured one week before delivery):
  - abdominal volume (volabdo)
  - biparietal diameter (BPD)
  - head circumference (HC)
  - other predictors
- Data from $n = 150$ children with birth weight $\le 1600$ g.
- Find $f^*$ to predict BW.

### Slide 5: Birth weight data (2)

- Idea: use 3D ultrasound measurements (left) in addition to conventional 2D ultrasound measurements (right).
- Goal: improve established formulas for weight prediction.

[Figure: 3D ultrasound image (left) and conventional 2D ultrasound image (right).]

### Slide 6: Example 2 - Breast cancer data

- Data collected by the Netherlands Cancer Institute (van de Vijver et al. 2002).
- 295 female patients younger than 53 years.
- Outcome: time to death after surgery (in years).
- Predictor variables: microarray data (4919 genes) + 9 clinical variables (age, tumor diameter, ...).
- Select a small set of marker genes ("sparse model").
- Use clinical variables and marker genes to predict survival.

### Slide 7: Classical modeling approaches

- Classical approach to obtain predictions from the birth weight data and the breast cancer data: fit additive regression models (Gaussian regression, Cox regression) using maximum likelihood (ML) estimation.
- Example: additive Gaussian model with smooth effects (represented by P-splines) for the birth weight data:
  $f^*(X) = \beta_0 + f_1(X_1) + \cdots + f_p(X_p)$

### Slide 8: Problems with ML estimation

- Predictor variables are highly correlated.
- Variable selection is of interest because of multicollinearity ("Do we really need 9 highly correlated predictor variables?").
- In case of the breast cancer data: maximum (partial) likelihood estimates for Cox regression do not exist (there are 4928 predictor variables but only 295 observations, $p \gg n$).
- Variable selection is needed because of extreme multicollinearity.
- We want to have a sparse (interpretable) model including the relevant predictor variables only.
- Conventional methods for variable selection (univariate, forward, backward, etc.) are known to be unstable and/or require the model to be fitted multiple times.

### Slide 9: Boosting - General properties

- Gradient boosting (boosting for short) is a fitting method to minimize general types of risk functions w.r.t. a prediction function $f$.
- Examples of risk functions: squared error loss in Gaussian regression, negative log likelihood loss.
- Boosting generally results in an additive prediction function, i.e.,
  $f^*(X) = \beta_0 + f_1(X_1) + \cdots + f_p(X_p)$
- The prediction function is interpretable.
- If run until convergence, boosting can be regarded as an alternative to conventional fitting methods (Fisher scoring, backfitting) for generalized additive models.

### Slide 10: Why boosting?

In contrast to conventional fitting methods,

- boosting is applicable to many different risk functions (absolute loss, quantile regression).
- boosting can be used to carry out variable selection during the fitting process, so there is no separation of model fitting and variable selection.
- boosting is applicable even if $p \gg n$.
- boosting addresses multicollinearity problems (by shrinking effect estimates towards zero).
- boosting optimizes prediction accuracy (w.r.t. the risk function).

### Slide 11: Gradient boosting - estimation problem

- Consider a one-dimensional response variable $Y$ and a $p$-dimensional set of predictors $X = (X_1, \ldots, X_p)$.
- Aim: estimation of

$$
f^* := \operatorname*{argmin}_{f(\cdot)} \; \mathbb{E}\left[\rho(Y, f(X))\right],
$$

  where $\rho$ is a loss function that is assumed to be differentiable with respect to a prediction function $f(X)$.
- Examples of loss functions:
  - $\rho = (Y - f(X))^2$: squared error loss in Gaussian regression
  - negative log likelihood function of a statistical model
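To make the differentiability requirement concrete, here is a minimal Python sketch that evaluates the squared error loss from this slide together with its negative derivative with respect to the prediction, and compares that derivative with a finite-difference approximation. The function names are illustrative only and are not taken from the slides or from any package.

```python
def squared_error_loss(y, f):
    """Squared error loss rho(y, f) = (y - f)^2."""
    return (y - f) ** 2


def neg_gradient_squared_error(y, f):
    """Negative derivative of (y - f)^2 with respect to f, i.e. 2 * (y - f)."""
    return 2.0 * (y - f)


def numerical_neg_gradient(loss, y, f, eps=1e-6):
    """Central finite-difference approximation of -d loss / d f."""
    return -(loss(y, f + eps) - loss(y, f - eps)) / (2.0 * eps)


# Compare the analytic and the numerical negative gradient at one point.
analytic = neg_gradient_squared_error(3.0, 1.0)
numeric = numerical_neg_gradient(squared_error_loss, 3.0, 1.0)
```

Any other differentiable loss could be passed to `numerical_neg_gradient` in the same way to check a hand-derived gradient.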

### Slide 12: Gradient boosting - estimation problem (2)

- In practice, we usually have a set of realizations $X = (X_1, \ldots, X_n)$, $Y = (Y_1, \ldots, Y_n)$ of $X$ and $Y$, respectively.
- Minimization of the empirical risk

$$
R = \frac{1}{n} \sum_{i=1}^{n} \rho(Y_i, f(X_i))
$$

  with respect to $f$.
- Example: minimizing

$$
R = \frac{1}{n} \sum_{i=1}^{n} (Y_i - f(X_i))^2
$$

  corresponds to minimizing the expected squared error loss.
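The empirical risk is simply the average loss over the observed pairs. A short Python sketch under that reading follows; the helper name `empirical_risk` and the toy numbers are assumptions made for illustration.

```python
def empirical_risk(loss, y, f_values):
    """Average loss over n observations: (1/n) * sum_i loss(y_i, f_i)."""
    n = len(y)
    return sum(loss(yi, fi) for yi, fi in zip(y, f_values)) / n


def squared_error(yi, fi):
    """Squared error loss for a single observation."""
    return (yi - fi) ** 2


# Toy data: three observed responses and a constant prediction of 2.0.
y_obs = [1.0, 2.0, 3.0]
f_hat = [2.0, 2.0, 2.0]
risk = empirical_risk(squared_error, y_obs, f_hat)
```

Passing a different loss function changes the risk being minimized without changing the rest of the procedure.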

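### Slide 13: Naive functional gradient descent (FGD)

- Idea: use gradient descent methods to minimize $R = R(f_{(1)}, \ldots, f_{(n)})$ w.r.t. $f_{(1)} = f(X_1), \ldots, f_{(n)} = f(X_n)$.
- Start with offset values $\hat{f}^{[0]}_{(1)}, \ldots, \hat{f}^{[0]}_{(n)}$.
- In iteration $m$:

$$
\begin{pmatrix} \hat{f}^{[m]}_{(1)} \\ \vdots \\ \hat{f}^{[m]}_{(n)} \end{pmatrix}
=
\begin{pmatrix} \hat{f}^{[m-1]}_{(1)} \\ \vdots \\ \hat{f}^{[m-1]}_{(n)} \end{pmatrix}
+ \nu
\begin{pmatrix}
-\frac{\partial R}{\partial f_{(1)}}\left(\hat{f}^{[m-1]}_{(1)}\right) \\ \vdots \\
-\frac{\partial R}{\partial f_{(n)}}\left(\hat{f}^{[m-1]}_{(n)}\right)
\end{pmatrix},
$$

  where $\nu$ is a step length factor.
- Principle of steepest descent.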

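### Slide 14: Naive functional gradient descent (2)

(Very) simple example: $n = 2$, $Y_1 = Y_2 = 0$, $\rho$ = squared error loss, so that

$$
R = \frac{1}{2}\left(f_{(1)}^2 + f_{(2)}^2\right).
$$

[Figure: surface plot of $z = f_1^2 + f_2^2$ over the $(f_1, f_2)$ plane.]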

### Slide 15: Naive functional gradient descent (3)

- Increase $m$ until the algorithm converges to some values $\hat{f}^{[m_{\text{stop}}]}_{(1)}, \ldots, \hat{f}^{[m_{\text{stop}}]}_{(n)}$.
- Problems with naive gradient descent:
  - No predictor variables are involved.
  - The structural relationship between $\hat{f}^{[m_{\text{stop}}]}_{(1)}, \ldots, \hat{f}^{[m_{\text{stop}}]}_{(n)}$ is ignored ($\hat{f}^{[m]}_{(1)} \to Y_1, \ldots, \hat{f}^{[m]}_{(n)} \to Y_n$).
  - Predictions are available only for the observed values $Y_1, \ldots, Y_n$.
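The naive descent and its main weakness can be sketched in a few lines of Python. The sketch assumes the squared error risk $R = \frac{1}{n}\sum_i (Y_i - f_{(i)})^2$ and reuses the toy setting $n = 2$, $Y_1 = Y_2 = 0$; the starting values and the function name `naive_fgd` are made up for illustration.

```python
def naive_fgd(y, f_start, nu=0.1, n_iter=100):
    """Steepest descent on R = (1/n) * sum_i (y_i - f_i)^2 over the fitted values."""
    n = len(y)
    f = list(f_start)
    for _ in range(n_iter):
        # Negative partial derivative of R with respect to each f_i.
        neg_grad = [2.0 * (yi - fi) / n for yi, fi in zip(y, f)]
        # Move a fraction nu along the direction of steepest descent.
        f = [fi + nu * gi for fi, gi in zip(f, neg_grad)]
    return f


# Toy setting from the slides: n = 2 and Y_1 = Y_2 = 0.
f_one_step = naive_fgd([0.0, 0.0], [1.0, -2.0], n_iter=1)
f_final = naive_fgd([0.0, 0.0], [1.0, -2.0])
```

Because no predictor enters the update, the fitted values simply drift towards the observed responses, which is the problem the base-learners are meant to solve.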

### Slide 16: Gradient Boosting

- Solution: estimate the negative gradient in each iteration.
- Estimation is performed by some base-learning procedure regressing the negative gradient on the predictor variables.
- The base-learning procedure ensures that $\hat{f}^{[m_{\text{stop}}]}_{(1)}, \ldots, \hat{f}^{[m_{\text{stop}}]}_{(n)}$ are predictions from a statistical model depending on the predictor variables.
- To do this, we specify a set of regression models ("base-learners") with the negative gradient as the dependent variable.
- In many applications, the set of base-learners will consist of $p$ simple regression models (one base-learner for each of the $p$ predictor variables, "component-wise gradient boosting").

### Slide 17: Gradient Boosting (2)

Functional gradient descent (FGD) boosting algorithm:

1. Initialize the $n$-dimensional vector $\hat{f}^{[0]}$ with some offset values (e.g., use a vector of zeroes). Set $m = 0$ and specify the set of base-learners. Denote the number of base-learners by $P$.
2. Increase $m$ by 1. Compute the negative gradient $-\frac{\partial}{\partial f}\rho(Y, f)$ and evaluate it at $\hat{f}^{[m-1]}(X_i)$, $i = 1, \ldots, n$. This yields the negative gradient vector

$$
U^{[m-1]} = \left(U^{[m-1]}_i\right)_{i=1,\ldots,n}
:= \left( -\frac{\partial}{\partial f}\rho(Y, f) \,\Big|_{\,Y = Y_i,\; f = \hat{f}^{[m-1]}(X_i)} \right)_{i=1,\ldots,n}.
$$

### Slide 18: Gradient Boosting (3)

3. Estimate the negative gradient $U^{[m-1]}$ by using the base-learners (i.e., the $P$ regression estimators) specified in Step 1. This yields $P$ vectors, where each vector is an estimate of the negative gradient vector $U^{[m-1]}$. Select the base-learner that fits $U^{[m-1]}$ best (minimum SSE). Set $\hat{U}^{[m-1]}$ equal to the fitted values from the corresponding best model.

### Slide 19: Gradient Boosting (4)

4. Update $\hat{f}^{[m]} = \hat{f}^{[m-1]} + \nu\, \hat{U}^{[m-1]}$, where $0 < \nu \le 1$ is a real-valued step length factor.
5. Iterate Steps 2-4 until $m = m_{\text{stop}}$.

The step length factor $\nu$ could be chosen adaptively. Usually, an adaptive strategy does not improve the estimates of $f^*$ but will only lead to an increase in running time. It is therefore common to choose $\nu$ small ($\nu = 0.1$) but fixed.
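Steps 1 to 5 can be sketched in plain Python for the special case of squared error loss with one linear base-learner per predictor. This is a minimal illustration under stated assumptions, not the implementation used by any package: the loss is scaled by 1/2 so that the negative gradient equals the residuals, the predictor columns are assumed to be centered so that no intercept is needed in the base-learners, the offset is the mean of the response, and the function name and toy data are hypothetical.

```python
def componentwise_l2_boost(columns, y, mstop=100, nu=0.1):
    """Component-wise gradient boosting for squared error loss.

    columns: list of p predictor columns, each a list of n centered values.
    Returns the offset and the aggregated coefficient of every predictor.
    """
    n = len(y)
    offset = sum(y) / n
    f = [offset] * n                      # Step 1: offset values
    coef = [0.0] * len(columns)
    for _ in range(mstop):                # Step 5: iterate until m = mstop
        # Step 2: negative gradient of 0.5 * (y - f)^2, i.e. the residuals
        u = [yi - fi for yi, fi in zip(y, f)]
        # Step 3: fit each linear base-learner to u, keep the one with min. SSE
        best_j, best_beta, best_sse = 0, 0.0, float("inf")
        for j, x in enumerate(columns):
            beta = sum(xi * ui for xi, ui in zip(x, u)) / sum(xi * xi for xi in x)
            sse = sum((ui - beta * xi) ** 2 for xi, ui in zip(x, u))
            if sse < best_sse:
                best_j, best_beta, best_sse = j, beta, sse
        # Step 4: add a fraction nu of the selected fit
        coef[best_j] += nu * best_beta
        f = [fi + nu * best_beta * xi for fi, xi in zip(f, columns[best_j])]
    return offset, coef


# Toy data: the response depends on x1 only; x2 is a centered noise column.
x1 = [-2.0, -1.0, 0.0, 1.0, 2.0]
x2 = [1.0, -2.0, 2.0, -2.0, 1.0]
y = [3.0 + 2.0 * v for v in x1]
offset, coef = componentwise_l2_boost([x1, x2], y)
```

In this toy run only the first base-learner is ever selected, so the coefficient of `x2` stays at zero while the coefficient of `x1` approaches the true slope from below.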

### Slide 20: Schematic overview of Step 3 in iteration m

Component-wise gradient boosting with one base-learner for each predictor variable:

[Diagram: the negative gradient $U^{[m-1]}$ is regressed separately on each predictor $X_1, X_2, \ldots, X_j, \ldots, X_p$; the best-fitting base-learner yields $\hat{U}^{[m-1]}$.]

### Slides 21-28: Simple example

In case of Gaussian regression, gradient boosting is equivalent to iteratively re-fitting the residuals of the model.

[Figure sequence: simulated data from a model of the form $y = (\ldots e^{\ldots 50 x^2})\, x \ldots \varepsilon$, whose constants and operators are missing in the transcription. Left panel: $y$ versus $x$ with the current fit; right panel: residuals versus $x$. Frames are shown for $m = 0, 1, 2, 3, 10, 20, 30$ and one further iteration whose number is missing in the transcription.]
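The residual re-fitting idea can be tried out with a small Python sketch. Since the constants of the simulated model on the slides are not recoverable, the sketch substitutes its own noise-free toy function $y = x^3$, and it uses a regression stump (a single-split, piecewise-constant fit) as the base-learner instead of the smooth base-learner shown in the figures. Both choices and all names are assumptions made for illustration.

```python
def fit_stump(x, r):
    """Least-squares fit of r on x by a single-split, piecewise-constant function."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    best = None
    for k in range(1, len(x)):
        if x[order[k - 1]] == x[order[k]]:
            continue                      # no split between identical x values
        left = [r[i] for i in order[:k]]
        right = [r[i] for i in order[k:]]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
        if best is None or sse < best[0]:
            best = (sse, (x[order[k - 1]] + x[order[k]]) / 2.0, ml, mr)
    _, split, ml, mr = best
    return lambda v: ml if v <= split else mr


def boost_residuals(x, y, mstop=100, nu=0.1):
    """Repeatedly fit a stump to the current residuals and add a fraction nu of it."""
    n = len(y)
    f = [sum(y) / n] * n
    mse = [sum((yi - fi) ** 2 for yi, fi in zip(y, f)) / n]
    for _ in range(mstop):
        residuals = [yi - fi for yi, fi in zip(y, f)]
        stump = fit_stump(x, residuals)
        f = [fi + nu * stump(xi) for fi, xi in zip(f, x)]
        mse.append(sum((yi - fi) ** 2 for yi, fi in zip(y, f)) / n)
    return f, mse


# Toy data (an assumption, not the slide's model): y = x^3 on a grid in [-1, 1].
x = [i / 10 - 1 for i in range(21)]
y = [xi ** 3 for xi in x]
fitted, mse_path = boost_residuals(x, y)
```

The list `mse_path` records the training mean squared error after each iteration; it never increases, because each update adds a fraction of a least-squares fit to the current residuals.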

### Slide 29: Properties of gradient boosting

- It is clear from Step 4 that the predictions of $Y_1, \ldots, Y_n$ in iteration $m_{\text{stop}}$ take the form of an additive function:

$$
\hat{f}^{[m_{\text{stop}}]} = \hat{f}^{[0]} + \nu\, \hat{U}^{[0]} + \cdots + \nu\, \hat{U}^{[m_{\text{stop}}-1]}
$$

- The structure of the prediction function depends on the choice of the base-learners.
  - For example, linear base-learners result in linear prediction functions.
  - Smooth base-learners result in additive prediction functions with smooth components.
- Consequently, $\hat{f}^{[m_{\text{stop}}]}$ has a meaningful interpretation.

### Slide 30: Gradient boosting with early stopping

- Gradient boosting has a built-in mechanism for base-learner selection in each iteration. This mechanism will carry out variable selection.
- Gradient boosting is applicable even if $p > n$.
- In case $p > n$, it is usually desirable to select a small number of informative predictor variables ("sparse solution").
- If $m \to \infty$, the algorithm will select non-informative predictor variables.
- Overfitting can be avoided if the algorithm is stopped early, i.e., if $m_{\text{stop}}$ is considered as a tuning parameter of the algorithm.

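### Slide 31: Illustration of variable selection and early stopping

Very simple example: 3 predictor variables $X_1, X_2, X_3$ and 3 linear base-learners with coefficient estimates $\hat{\beta}^{[m]}_j$, $j = 1, 2, 3$.

- Assume that $m_{\text{stop}} = 5$.
- Assume that $X_1$ was selected in the first, second and fifth iteration.
- Assume that $X_3$ was selected in the third and fourth iteration.

$$
\begin{aligned}
\hat{f}^{[m_{\text{stop}}]}
&= \hat{f}^{[0]} + \nu\,\hat{U}^{[0]} + \nu\,\hat{U}^{[1]} + \nu\,\hat{U}^{[2]} + \nu\,\hat{U}^{[3]} + \nu\,\hat{U}^{[4]} \\
&= \hat{\beta}^{[0]} + \nu\,\hat{\beta}^{[0]}_1 X_1 + \nu\,\hat{\beta}^{[1]}_1 X_1 + \nu\,\hat{\beta}^{[2]}_3 X_3 + \nu\,\hat{\beta}^{[3]}_3 X_3 + \nu\,\hat{\beta}^{[4]}_1 X_1 \\
&= \hat{\beta}^{[0]} + \nu\left(\hat{\beta}^{[0]}_1 + \hat{\beta}^{[1]}_1 + \hat{\beta}^{[4]}_1\right) X_1 + \nu\left(\hat{\beta}^{[2]}_3 + \hat{\beta}^{[3]}_3\right) X_3 \\
&= \hat{\beta}^{[0]} + \hat{\beta}_1 X_1 + \hat{\beta}_3 X_3
\end{aligned}
$$

- The result is a linear prediction function.
- $X_2$ is not included in the model (variable selection).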
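The bookkeeping behind this example is a running sum per predictor. The Python sketch below aggregates a selection path into final coefficients; the path mirrors the selection order assumed on the slide (first predictor in iterations 1, 2 and 5, third predictor in iterations 3 and 4), while the per-iteration estimates are made-up numbers and `aggregate_coefficients` is a hypothetical helper.

```python
def aggregate_coefficients(path, p, nu):
    """Sum nu * beta over all iterations in which each predictor was selected.

    path: list of (predictor index, base-learner coefficient), one per iteration.
    """
    coef = [0.0] * p
    for j, beta in path:
        coef[j] += nu * beta
    return coef


# Zero-based indices: 0 stands for X_1, 1 for X_2, 2 for X_3.
# The coefficient values are invented for illustration.
selection_path = [(0, 2.0), (0, 1.8), (2, 1.0), (2, 0.9), (0, 1.62)]
final_coef = aggregate_coefficients(selection_path, p=3, nu=0.1)
```

A predictor that is never selected keeps a coefficient of exactly zero, which is how early stopping yields a sparse model.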

### Slide 32: How should the stopping iteration be chosen?

- Use cross-validation techniques to determine $m_{\text{stop}}$.
- The stopping iteration is chosen such that it maximizes prediction accuracy.

[Figure: cross-validated predictive risk versus the number of boosting iterations, with the cross-validated $m_{\text{stop}}$ marked.]
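One way to choose the stopping iteration is k-fold cross-validation of the predictive risk along the boosting path. The Python sketch below is a rough illustration under several assumptions: squared error loss, one simple linear base-learner (with intercept) per predictor, folds assigned by index modulo k, and simulated data in which only the first and third predictor carry signal. All function names are hypothetical.

```python
import random


def fit_line(x, u):
    """Least-squares intercept and slope for regressing u on one predictor x."""
    n = len(x)
    mx, mu = sum(x) / n, sum(u) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (ui - mu) for xi, ui in zip(x, u)) / sxx if sxx > 0 else 0.0
    return mu - slope * mx, slope


def boost_path(X_tr, y_tr, X_te, y_te, max_iter, nu):
    """Component-wise L2 boosting; returns the test risk after each iteration."""
    p = len(X_tr[0])
    offset = sum(y_tr) / len(y_tr)
    f_tr = [offset] * len(y_tr)
    f_te = [offset] * len(y_te)
    risks = []
    for _ in range(max_iter):
        u = [yi - fi for yi, fi in zip(y_tr, f_tr)]   # residuals
        best = None
        for j in range(p):
            xj = [row[j] for row in X_tr]
            a, b = fit_line(xj, u)
            sse = sum((ui - a - b * xi) ** 2 for xi, ui in zip(xj, u))
            if best is None or sse < best[0]:
                best = (sse, j, a, b)
        _, j, a, b = best
        f_tr = [fi + nu * (a + b * row[j]) for fi, row in zip(f_tr, X_tr)]
        f_te = [fi + nu * (a + b * row[j]) for fi, row in zip(f_te, X_te)]
        risks.append(sum((yi - fi) ** 2 for yi, fi in zip(y_te, f_te)) / len(y_te))
    return risks


def cv_mstop(X, y, k=5, max_iter=100, nu=0.1):
    """Average the held-out risk over k folds and return the minimizing iteration."""
    n = len(y)
    total = [0.0] * max_iter
    for fold in range(k):
        test = set(range(fold, n, k))
        train = [i for i in range(n) if i not in test]
        risks = boost_path([X[i] for i in train], [y[i] for i in train],
                           [X[i] for i in sorted(test)], [y[i] for i in sorted(test)],
                           max_iter, nu)
        total = [t + r / k for t, r in zip(total, risks)]
    best_m = min(range(max_iter), key=lambda m: total[m]) + 1
    return best_m, total


# Simulated data (an assumption): 60 observations, 5 predictors, 2 informative.
rng = random.Random(1)
X = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(60)]
y = [2.0 * row[0] - row[2] + rng.gauss(0, 0.5) for row in X]
best_m, cv_risk = cv_mstop(X, y)
```

The returned `cv_risk` list corresponds to the curve in the figure, and `best_m` is the iteration at which that curve attains its minimum.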

### Slide 33: Shrinkage

- Early stopping will not only result in sparse solutions but will also lead to shrunken effect estimates (only a small fraction of $\hat{U}$ is added to the estimates in each iteration).
- Shrinkage leads to a downward bias (in absolute value) but to a smaller variance of the effect estimates (similar to Lasso or Ridge regression).
- In this way, multicollinearity problems are addressed.
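The shrinkage effect can be made explicit in the simplest possible case. The following derivation is a sketch under assumptions that are not on the slides: a single centered predictor $x$, the squared error loss scaled by $1/2$ (so that the negative gradient equals the residuals), the offset $\hat{f}^{[0]} = \bar{Y}$, and a linear base-learner without intercept. Write $\hat{\beta}_{\text{LS}} = \sum_i x_i (Y_i - \bar{Y}) / \sum_i x_i^2$ for the least-squares slope and $\hat{\beta}^{[m]}$ for the aggregated boosting coefficient after $m$ iterations, with $\hat{\beta}^{[0]} = 0$.

$$
\begin{aligned}
U^{[m-1]}_i &= Y_i - \bar{Y} - \hat{\beta}^{[m-1]} x_i, \\
\frac{\sum_i x_i U^{[m-1]}_i}{\sum_i x_i^2} &= \hat{\beta}_{\text{LS}} - \hat{\beta}^{[m-1]}, \\
\hat{\beta}^{[m]} &= \hat{\beta}^{[m-1]} + \nu\left(\hat{\beta}_{\text{LS}} - \hat{\beta}^{[m-1]}\right), \\
\hat{\beta}_{\text{LS}} - \hat{\beta}^{[m]} &= (1 - \nu)\left(\hat{\beta}_{\text{LS}} - \hat{\beta}^{[m-1]}\right), \\
\hat{\beta}^{[m]} &= \hat{\beta}_{\text{LS}}\left(1 - (1 - \nu)^m\right).
\end{aligned}
$$

For $0 < \nu < 1$ and any finite $m$, the factor $1 - (1 - \nu)^m$ lies strictly between 0 and 1, so the boosting estimate is shrunken towards zero relative to the least-squares estimate, and it approaches $\hat{\beta}_{\text{LS}}$ only as $m \to \infty$.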

### Slide 34: Further aspects

- There are many types of boosting methods, e.g.,
  - tree-based boosting (AdaBoost, Freund & Schapire 1997)
  - likelihood-based boosting (Tutz & Binder 2006)
- Here we consider gradient boosting:
  - Flexible method to fit many types of statistical models in high- and low-dimensional settings
  - Regularization of estimates via variable selection and shrinkage
  - Implemented in the R package mboost (Hothorn et al. 2010, 2011)

### Slide 35: References

- Freund, Y. and R. Schapire (1997): A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55.
- Hothorn, T., P. Bühlmann, T. Kneib, M. Schmid and B. Hofner (2010): Model-based boosting 2.0. Journal of Machine Learning Research 11.
- Hothorn, T., P. Bühlmann, T. Kneib, M. Schmid and B. Hofner (2011): mboost: Model-Based Boosting. R package.
- Schild, R. L., M. Maringa, J. Siemer, B. Meurer, N. Hart, T. W. Goecke, M. Schmid, T. Hothorn and M. E. Hansmann (2008): Weight estimation by three-dimensional ultrasound imaging in the small fetus. Ultrasound in Obstetrics and Gynecology 32.
- Tutz, G. and H. Binder (2006): Generalized additive modelling with implicit variable selection by likelihood based boosting. Biometrics 62.
- van de Vijver, M. J., Y. D. He, L. J. van 't Veer, H. Dai, A. A. M. Hart, D. W. Voskuil et al. (2002): A gene-expression signature as a predictor of survival in breast cancer. New England Journal of Medicine 347.

### Linear Model Selection and Regularization

Linear Model Selection and Regularization Recall the linear model Y = β 0 + β 1 X 1 + + β p X p + ɛ. In the lectures that follow, we consider some approaches for extending the linear model framework. In

### Statistical Machine Learning

Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes

### Chapter 11 Boosting. Xiaogang Su Department of Statistics University of Central Florida - 1 -

Chapter 11 Boosting Xiaogang Su Department of Statistics University of Central Florida - 1 - Perturb and Combine (P&C) Methods have been devised to take advantage of the instability of trees to create

### BOOSTING ALGORITHMS: REGULARIZATION, PREDICTION AND MODEL FITTING

Submitted to Statistical Science BOOSTING ALGORITHMS: REGULARIZATION, PREDICTION AND MODEL FITTING By Peter Bühlmann and Torsten Hothorn ETH Zürich and Universität Erlangen-Nürnberg We present a statistical

### Modern regression 1: Ridge regression

Modern regression 1: Ridge regression Ryan Tibshirani Data Mining: 36-462/36-662 March 19 2013 Optional reading: ISL 6.2.1, ESL 3.4.1 1 Reminder: shortcomings of linear regression Last time we talked about:

### Statistical Applications in Genetics and Molecular Biology

Statistical Applications in Genetics and Molecular Biology Volume 11, Issue 5 2012 Article 3 A PAUC-based Estimation Technique for Disease Classification and Biomarker Selection Matthias Schmid, Friedrich-Alexander-University

### Big Data Big Knowledge?

EBPI Epidemiology, Biostatistics and Prevention Institute Big Data Big Knowledge? Torsten Hothorn 2015-03-06 The end of theory The End of Theory: The Data Deluge Makes the Scientific Method Obsolete (Chris

### Introduction to Machine Learning

Introduction to Machine Learning Prof. Alexander Ihler Prof. Max Welling icamp Tutorial July 22 What is machine learning? The ability of a machine to improve its performance based on previous results:

### Simple Linear Regression Chapter 11

Simple Linear Regression Chapter 11 Rationale Frequently decision-making situations require modeling of relationships among business variables. For instance, the amount of sale of a product may be related

### Model selection in R featuring the lasso. Chris Franck LISA Short Course March 26, 2013

Model selection in R featuring the lasso Chris Franck LISA Short Course March 26, 2013 Goals Overview of LISA Classic data example: prostate data (Stamey et. al) Brief review of regression and model selection.

### A brush-up on residuals

A brush-up on residuals Going back to a single level model... y i = β 0 + β 1 x 1i + ε i The residual for each observation is the difference between the value of y predicted by the equation and the actual

### Classification and Regression Trees

Classification and Regression Trees Bob Stine Dept of Statistics, School University of Pennsylvania Trees Familiar metaphor Biology Decision tree Medical diagnosis Org chart Properties Recursive, partitioning

### Lasso on Categorical Data

Lasso on Categorical Data Yunjin Choi, Rina Park, Michael Seo December 14, 2012 1 Introduction In social science studies, the variables of interest are often categorical, such as race, gender, and nationality.

### Logistic Regression. Vibhav Gogate The University of Texas at Dallas. Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld.

Logistic Regression Vibhav Gogate The University of Texas at Dallas Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld. Generative vs. Discriminative Classifiers Want to Learn: h:x Y X features

### Machine Learning Big Data using Map Reduce

Machine Learning Big Data using Map Reduce By Michael Bowles, PhD Where Does Big Data Come From? -Web data (web logs, click histories) -e-commerce applications (purchase histories) -Retail purchase histories

### Location matters. 3 techniques to incorporate geo-spatial effects in one's predictive model

Location matters. 3 techniques to incorporate geo-spatial effects in one's predictive model Xavier Conort xavier.conort@gear-analytics.com Motivation Location matters! Observed value at one location is

### Modern regression 2: The lasso

Modern regression 2: The lasso Ryan Tibshirani Data Mining: 36-462/36-662 March 21 2013 Optional reading: ISL 6.2.2, ESL 3.4.2, 3.4.3 1 Reminder: ridge regression and variable selection Recall our setup:

### The International Journal of Biostatistics

The International Journal of Biostatistics Volume 7, Issue 1 2011 Article 13 HingeBoost: ROC-Based Boost for Classification and Variable Selection Zhu Wang, Connecticut Children's Medical Center and University

### Lecture 3: Linear methods for classification

Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,

### Mixed models: design of experiments. V. Fedorov August, 2011

Mixed models: design of experiments V. Fedorov August, 2011 1 Selected references (estimation) Johnson L. (1977) Stochastic parameter regression: an annotated bibliography, International Statistical Review,

### Model-Based Recursive Partitioning for Detecting Interaction Effects in Subgroups

Model-Based Recursive Partitioning for Detecting Interaction Effects in Subgroups Achim Zeileis, Torsten Hothorn, Kurt Hornik http://eeecon.uibk.ac.at/~zeileis/ Overview Motivation: Trees, leaves, and

### Actuarial. Modeling Seminar Part 2. Matthew Morton FSA, MAAA Ben Williams

Actuarial Data Analytics / Predictive Modeling Seminar Part 2 Matthew Morton FSA, MAAA Ben Williams Agenda Introduction Overview of Seminar Traditional Experience Study Traditional vs. Predictive Modeling

### Generalized Linear Models. Today: definition of GLM, maximum likelihood estimation. Involves choice of a link function (systematic component)

Generalized Linear Models Last time: definition of exponential family, derivation of mean and variance (memorize) Today: definition of GLM, maximum likelihood estimation Include predictors x i through

### Generalized Boosted Models: A guide to the gbm package

Generalized Boosted Models: A guide to the gbm package Greg Ridgeway August 3, 2007 Boosting takes on various forms th different programs using different loss functions, different base models, and different

### Generalized Linear Model Theory

Appendix B Generalized Linear Model Theory We describe the generalized linear model as formulated by Nelder and Wedderburn (1972), and discuss estimation of the parameters and tests of hypotheses. B.1

### Regression Analysis: Basic Concepts

The simple linear model Regression Analysis: Basic Concepts Allin Cottrell Represents the dependent variable, y i, as a linear function of one independent variable, x i, subject to a random disturbance

### Joint models for classification and comparison of mortality in different countries.

Joint models for classification and comparison of mortality in different countries. Viani D. Biatat 1 and Iain D. Currie 1 1 Department of Actuarial Mathematics and Statistics, and the Maxwell Institute

### COMPUTATIONAL INTELLIGENCE (INTRODUCTION TO MACHINE LEARNING) SS16. Lecture 2: Linear Regression Gradient Descent Non-linear basis functions

COMPUTATIONAL INTELLIGENCE (INTRODUCTION TO MACHINE LEARNING) SS16 Lecture 2: Linear Regression Gradient Descent Non-linear basis functions LINEAR REGRESSION MOTIVATION Why Linear Regression? Regression

### BayesX - Software for Bayesian Inference in Structured Additive Regression

BayesX - Software for Bayesian Inference in Structured Additive Regression Thomas Kneib Faculty of Mathematics and Economics, University of Ulm Department of Statistics, Ludwig-Maximilians-University Munich

### ELEC-E8104 Stochastics models and estimation, Lecture 3b: Linear Estimation in Static Systems

Stochastics models and estimation, Lecture 3b: Linear Estimation in Static Systems Minimum Mean Square Error (MMSE) MMSE estimation of Gaussian random vectors Linear MMSE estimator for arbitrarily distributed

### GI01/M055 Supervised Learning Proximal Methods

GI01/M055 Supervised Learning Proximal Methods Massimiliano Pontil (based on notes by Luca Baldassarre) (UCL) Proximal Methods 1 / 20 Today s Plan Problem setting Convex analysis concepts Proximal operators

### Regression Estimation - Least Squares and Maximum Likelihood. Dr. Frank Wood

Regression Estimation - Least Squares and Maximum Likelihood Dr. Frank Wood Least Squares Max(min)imization Function to minimize w.r.t. b 0, b 1 Q = n (Y i (b 0 + b 1 X i )) 2 i=1 Minimize this by maximizing

### CS 688 Pattern Recognition Lecture 4. Linear Models for Classification

CS 688 Pattern Recognition Lecture 4 Linear Models for Classification Probabilistic generative models Probabilistic discriminative models 1 Generative Approach ( x ) p C k p( C k ) Ck p ( ) ( x Ck ) p(

### Building risk prediction models - with a focus on Genome-Wide Association Studies. Charles Kooperberg

Building risk prediction models - with a focus on Genome-Wide Association Studies Risk prediction models Based on data: (D i, X i1,..., X ip ) i = 1,..., n we like to fit a model P(D = 1 X 1,..., X p )

### EBPI Epidemiology, Biostatistics and Prevention Institute. Big Data Science. Torsten Hothorn 2014-03-31

EBPI Epidemiology, Biostatistics and Prevention Institute Big Data Science Torsten Hothorn 2014-03-31 The end of theory The End of Theory: The Data Deluge Makes the Scientific Method Obsolete (Chris Anderson,

Lecture 16: Generalized Additive Models Regression III: Advanced Methods Bill Jacoby Michigan State University http://polisci.msu.edu/jacoby/icpsr/regress3 Goals of the Lecture Introduce Additive Models

### Model Combination. 24 Novembre 2009

Model Combination 24 Novembre 2009 Datamining 1 2009-2010 Plan 1 Principles of model combination 2 Resampling methods Bagging Random Forests Boosting 3 Hybrid methods Stacking Generic algorithm for mulistrategy

### Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training

Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training Charles Elkan elkan@cs.ucsd.edu January 10, 2014 1 Principle of maximum likelihood Consider a family of probability distributions

### Inverse covariance matrix regularization for minimum variance portfolio. Auberson Maxime

Inverse covariance matrix regularization for minimum variance portfolio Auberson Maxime June 2015 Abstract In the mean-variance optimization framework, the inverse of the covariance matrix (or the precision

### Regression Estimation - Least Squares and Maximum Likelihood. Dr. Frank Wood

Regression Estimation - Least Squares and Maximum Likelihood Dr. Frank Wood Least Squares Max(min)imization 1. Function to minimize w.r.t. β 0, β 1 Q = n (Y i (β 0 + β 1 X i )) 2 i=1 2. Minimize this by

### New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Introduction

Introduction New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Predictive analytics encompasses the body of statistical knowledge supporting the analysis of massive data sets.

: Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary

### Overview. Longitudinal Data Variation and Correlation Different Approaches. Linear Mixed Models Generalized Linear Mixed Models

Overview 1 Introduction Longitudinal Data Variation and Correlation Different Approaches 2 Mixed Models Linear Mixed Models Generalized Linear Mixed Models 3 Marginal Models Linear Models Generalized Linear

### Probabilistic Linear Classification: Logistic Regression. Piyush Rai IIT Kanpur

Probabilistic Linear Classification: Logistic Regression Piyush Rai IIT Kanpur Probabilistic Machine Learning (CS772A) Jan 18, 2016 Probabilistic Machine Learning (CS772A) Probabilistic Linear Classification:

### Predicting daily incoming solar energy from weather data

Predicting daily incoming solar energy from weather data ROMAIN JUBAN, PATRICK QUACH Stanford University - CS229 Machine Learning December 12, 2013 Being able to accurately predict the solar power hitting

### 12/31/2016. PSY 512: Advanced Statistics for Psychological and Behavioral Research 2

PSY 512: Advanced Statistics for Psychological and Behavioral Research 2 Understand when to use multiple Understand the multiple equation and what the coefficients represent Understand different methods

### Chapter 13 Introduction to Nonlinear Regression( 非 線 性 迴 歸 )

Chapter 13 Introduction to Nonlinear Regression( 非 線 性 迴 歸 ) and Neural Networks( 類 神 經 網 路 ) 許 湘 伶 Applied Linear Regression Models (Kutner, Nachtsheim, Neter, Li) hsuhl (NUK) LR Chap 10 1 / 35 13 Examples

### Regularization and Variable Selection via the Elastic Net

ElasticNet Hui Zou, Stanford University 1 Regularization and Variable Selection via the Elastic Net Hui Zou and Trevor Hastie Department of Statistics Stanford University ElasticNet Hui Zou, Stanford University

### Smoothing and Non-Parametric Regression

Smoothing and Non-Parametric Regression Germán Rodríguez grodri@princeton.edu Spring, 2001 Objective: to estimate the effects of covariates X on a response y nonparametrically, letting the data suggest

### AdaBoost. Jiri Matas and Jan Šochman. Centre for Machine Perception Czech Technical University, Prague http://cmp.felk.cvut.cz

AdaBoost Jiri Matas and Jan Šochman Centre for Machine Perception Czech Technical University, Prague http://cmp.felk.cvut.cz Presentation Outline: AdaBoost algorithm Why is of interest? How it works? Why

### Supervised Learning (Big Data Analytics)

Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used

### Regression Modeling Strategies

Frank E. Harrell, Jr. Regression Modeling Strategies With Applications to Linear Models, Logistic Regression, and Survival Analysis With 141 Figures Springer Contents Preface Typographical Conventions

### Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model

Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model 1 September 004 A. Introduction and assumptions The classical normal linear regression model can be written

### Statistics in Geophysics: Linear Regression II

Statistics in Geophysics: Linear Regression II Steffen Unkel Department of Statistics Ludwig-Maximilians-University Munich, Germany Winter Term 2013/14 1/28 Model definition Suppose we have the following

### Penalized Logistic Regression and Classification of Microarray Data

Penalized Logistic Regression and Classification of Microarray Data Milan, May 2003 Anestis Antoniadis Laboratoire IMAG-LMC University Joseph Fourier Grenoble, France Penalized Logistic Regression andclassification

### Logistic regression: Diagnostics

Logistic regression: Diagnostics April 12 Introduction Building blocks After a model has been fit, it is wise to check the model to see how well it fits the data In linear regression, these diagnostics

### Predictive Gene Signature Selection for Adjuvant Chemotherapy in Non-Small Cell Lung Cancer Patients

Predictive Gene Signature Selection for Adjuvant Chemotherapy in Non-Small Cell Lung Cancer Patients by Li Liu A practicum report submitted to the Department of Public Health Sciences in conformity with

### Least Squares Estimation

Least Squares Estimation SARA A VAN DE GEER Volume 2, pp 1041 1045 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S Everitt & David

### Econometric Methods for Panel Data

Based on the books by Baltagi: Econometric Analysis of Panel Data and by Hsiao: Analysis of Panel Data Robert M. Kunst robert.kunst@univie.ac.at University of Vienna and Institute for Advanced Studies

### HT2015: SC4 Statistical Data Mining and Machine Learning

HT2015: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Bayesian Nonparametrics Parametric vs Nonparametric

### ORF 523 Lecture 8 Spring 2016, Princeton University Instructor: A.A. Ahmadi Scribe: G. Hall Tuesday, March 8, 2016

ORF 523 Lecture 8 Spring 2016, Princeton University Instructor: A.A. Ahmadi Scribe: G. Hall Tuesday, March 8, 2016 When in doubt on the accuracy of these notes, please cross check with the instructor s

### Ridge Regression. Patrick Breheny. September 1. Ridge regression Selection of λ Ridge regression in R/SAS

Ridge Regression Patrick Breheny September 1 Patrick Breheny BST 764: Applied Statistical Modeling 1/22 Ridge regression: Definition Definition and solution Properties As mentioned in the previous lecture,

### Degrees of Freedom and Model Search

Degrees of Freedom and Model Search Ryan J. Tibshirani Abstract Degrees of freedom is a fundamental concept in statistical modeling, as it provides a quantitative description of the amount of fitting performed

### Neural Networks. CAP5610 Machine Learning Instructor: Guo-Jun Qi

Neural Networks CAP5610 Machine Learning Instructor: Guo-Jun Qi Recap: linear classifier Logistic regression Maximizing the posterior distribution of class Y conditional on the input vector X Support vector

### COMP 551 Applied Machine Learning Lecture 6: Performance evaluation. Model assessment and selection.

COMP 551 Applied Machine Learning Lecture 6: Performance evaluation. Model assessment and selection. Instructor: (jpineau@cs.mcgill.ca) Class web page: www.cs.mcgill.ca/~jpineau/comp551 Unless otherwise

### Paper Let the Data Speak: New Regression Diagnostics Based on Cumulative Residuals

Paper 255-28 Let the Data Speak: New Regression Diagnostics Based on Cumulative Residuals Gordon Johnston and Ying So SAS Institute Inc. Cary, North Carolina, USA Abstract Residuals have long been used

### Cov(x, y) V ar(x)v ar(y)

Simple linear regression Systematic components: β 0 + β 1 x i Stochastic component : error term ε Y i = β 0 + β 1 x i + ε i ; i = 1,..., n E(Y X) = β 0 + β 1 x the central parameter is the slope parameter

Statistics Graduate Courses STAT 7002--Topics in Statistics-Biological/Physical/Mathematics (cr.arr.).organized study of selected topics. Subjects and earnable credit may vary from semester to semester.

Vector Derivatives, Gradients, and Generalized Gradient Descent Algorithms ECE 275A Statistical Parameter Estimation Ken Kreutz-Delgado ECE Department, UC San Diego November 1, 2013 Ken Kreutz-Delgado

### L10: Probability, statistics, and estimation theory

L10: Probability, statistics, and estimation theory Review of probability theory Bayes theorem Statistics and the Normal distribution Least Squares Error estimation Maximum Likelihood estimation Bayesian

### Multivariate Logistic Regression

Multivariate Logistic Regression As in univariate logistic regression, let π(x) represent the probability of an event that depends on p covariates or independent variables. Then, using an inv.logit formulation

### Local classification and local likelihoods

Local classification and local likelihoods November 18 k-nearest neighbors The idea of local regression can be extended to classification as well The simplest way of doing so is called nearest neighbor

### Penalized regression: Introduction

Penalized regression: Introduction Patrick Breheny August 30 Patrick Breheny BST 764: Applied Statistical Modeling 1/19 Maximum likelihood Much of 20th-century statistics dealt with maximum likelihood

### A Learning Algorithm For Neural Network Ensembles

A Learning Algorithm For Neural Network Ensembles H. D. Navone, P. M. Granitto, P. F. Verdes and H. A. Ceccatto Instituto de Física Rosario (CONICET-UNR) Blvd. 27 de Febrero 210 Bis, 2000 Rosario. República

### Class #6: Non-linear classification. ML4Bio 2012 February 17th, 2012 Quaid Morris

Class #6: Non-linear classification ML4Bio 2012 February 17th, 2012 Quaid Morris 1 Module #: Title of Module 2 Review Overview Linear separability Non-linear classification Linear Support Vector Machines

### Statistics II Final Exam - January Use the University stationery to give your answers to the following questions.

Statistics II Final Exam - January 2012 Use the University stationery to give your answers to the following questions. Do not forget to write down your name and class group on each page. Indicate clearly

### The fastclime Package for Linear Programming and Large-Scale Precision Matrix Estimation in R

Journal of Machine Learning Research 15 (2014) 489-493 Submitted 3/13; Revised 8/13; Published 2/14 The fastclime Package for Linear Programming and Large-Scale Precision Matrix Estimation in R Haotian

### Introduction to Logistic Regression

OpenStax-CNX module: m42090 1 Introduction to Logistic Regression Dan Calderon This work is produced by OpenStax-CNX and licensed under the Creative Commons Attribution License 3.0 Abstract Gives introduction

### GLM Residuals and Diagnostics

GLM Residuals and Diagnostics Patrick Breheny March 26 Patrick Breheny BST 760: Advanced Regression 1/24 Introduction Residuals The hat matrix After a model has been fit, it is wise to check the model to see how well

### Estimating the Inverse Covariance Matrix of Independent Multivariate Normally Distributed Random Variables

Estimating the Inverse Covariance Matrix of Independent Multivariate Normally Distributed Random Variables Dominique Brunet, Hanne Kekkonen, Vitor Nunes, Iryna Sivak Fields-MITACS Thematic Program on Inverse

### Linear Classification. Volker Tresp Summer 2015

Linear Classification Volker Tresp Summer 2015 1 Classification Classification is the central task of pattern recognition Sensors supply information about an object: to which class does the object belong

### Models of binary outcomes with 3-level data: A comparison of some options within SAS. CAPS Methods Core Seminar April 19, 2013.

Models of binary outcomes with 3-level data: A comparison of some options within SAS CAPS Methods Core Seminar April 19, 2013 Steve Gregorich SEGregorich 1 April 19, 2013 Designs I. Cluster Randomized

### Weighted least squares

Weighted least squares Patrick Breheny February 7 Patrick Breheny BST 760: Advanced Regression 1/17 Introduction Known weights As a precursor to fitting generalized linear models, let's first deal with

### Robust paradigm applied to parameter reduction in actuarial triangle models ICASQF 2016

1/26 Robust paradigm applied to parameter reduction in actuarial triangle models ICASQF 2016 Gary Venter University of New South Wales 2/26 PRELIMINARIES Let (Ω, F, P) be a probability space = sample space

### Lecture 6: Logistic Regression

Lecture 6: Logistic Regression CS 194-10, Fall 2011 Laurent El Ghaoui EECS Department UC Berkeley September 13, 2011 Outline Classification task Data: X = [x_1, ..., x_m]: an n × m matrix of data points in R^n. y { 1,

### Estimation of Fetal Weight: Mean Value from Multiple Formulas

Estimation of Fetal Weight: Mean Value from Multiple Formulas Michael G. Pinette, MD, Yuqun Pan, MD, Sheila G. Pinette, RPA-C, Jacquelyn Blackstone, DO, John Garrett, Angelina Cartin Mean fetal weight

### Probabilistic Models for Big Data. Alex Davies and Roger Frigola University of Cambridge 13th February 2014

Probabilistic Models for Big Data Alex Davies and Roger Frigola University of Cambridge 13th February 2014 The State of Big Data Why probabilistic models for Big Data? 1. If you don't have to worry about

### Logistic Regression for Spam Filtering

Logistic Regression for Spam Filtering Nikhila Arkalgud February 14, 28 Abstract The goal of the spam filtering problem is to identify an email as spam or not spam. One of the classic techniques used

### Machine Learning and Data Mining. Regression Problem. (adapted from) Prof. Alexander Ihler

Machine Learning and Data Mining Regression Problem (adapted from) Prof. Alexander Ihler Overview Regression Problem Definition and define parameters ϴ. Prediction using ϴ as parameters Measure the error

### BOOSTED REGRESSION TREES: A MODERN WAY TO ENHANCE ACTUARIAL MODELLING

BOOSTED REGRESSION TREES: A MODERN WAY TO ENHANCE ACTUARIAL MODELLING Xavier Conort xavier.conort@gear-analytics.com Session Number: TBR14 Insurance has always been a data business The industry has successfully

Universidade de Vigo Model Building in Non Proportional Hazard Regression Mar Rodríguez-Girondo, Thomas Kneib, Carmen Cadarso-Suárez and Emad Abu-Assi Report 12/09 Discussion Papers in Statistics and Operation

### 7. Autocorrelation (Violation of Assumption #B3)

7. Autocorrelation (Violation of Assumption #B3) Assumption #B3: The error term u_i is not autocorrelated, i.e. Cov(u_i, u_j) = 0 for all i = 1, ..., N and j = 1, ..., N where i ≠ j Where do we typically

### GLM, insurance pricing & big data: paying attention to convergence issues.

GLM, insurance pricing & big data: paying attention to convergence issues. Michaël NOACK - michael.noack@addactis.com Senior consultant & Manager of ADDACTIS Pricing Copyright 2014 ADDACTIS Worldwide.

### Problem of Missing Data

VASA Mission of VA Statisticians Association (VASA) Promote & disseminate statistical methodological research relevant to VA studies; Facilitate communication & collaboration among VA-affiliated statisticians;

Modelling and added value Course: Statistical Evaluation of Diagnostic and Predictive Models Thomas Alexander Gerds (University of Copenhagen) Summer School, Barcelona, June 30, 2015 1 / 53 Multiple regression

### Using SPSS for Multiple Regression. UDP 520 Lab 7 Lin Lin December 4th, 2007

Using SPSS for Multiple Regression UDP 520 Lab 7 Lin Lin December 4th, 2007 Step 1 Define Research Question What factors are associated with BMI? Predict BMI. Step 2 Conceptualizing Problem (Theory) Individual

### Multiple Linear Regression in Data Mining

Multiple Linear Regression in Data Mining Contents 2.1. A Review of Multiple Linear Regression 2.2. Illustration of the Regression Process 2.3. Subset Selection in Linear Regression 1 2 Chap. 2 Multiple

### Improving Generalization

Improving Generalization Introduction to Neural Networks : Lecture 10 John A. Bullinaria, 2004 1. Improving Generalization 2. Training, Validation and Testing Data Sets 3. Cross-Validation 4. Weight Restriction