GI01/M055 Supervised Learning — Proximal Methods
Massimiliano Pontil (based on notes by Luca Baldassarre)
(UCL) Proximal Methods 1 / 20
Today's Plan
- Problem setting
- Convex analysis concepts
- Proximal operators
- $O(1/T)$ algorithm
- $O(1/T^2)$ algorithm
- Empirical comparison
Problem setting

We are interested in the following optimization problem:
$$\min_{w \in \mathbb{R}^d} F(w) := f(w) + r(w).$$
We assume that $r$ is convex and $f$ is convex and differentiable.

Examples:
- SQUARE LOSS: $f(w) = \frac{1}{2}\|Aw - y\|^2$
- LASSO: $r(w) = \lambda \|w\|_1$
- TRACE NORM: $r(w) = \lambda \|w\|_*$
Convex Analysis I

We assume that $f$ has a Lipschitz continuous gradient:
$$\|\nabla f(w) - \nabla f(v)\| \le L \|w - v\|.$$

Lemma. The above assumption is equivalent to
$$f(w) \le f(v) + \langle \nabla f(v), w - v \rangle + \frac{L}{2}\|w - v\|^2.$$
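As a quick numerical sanity check (a sketch in NumPy with randomly generated data; the matrix sizes are arbitrary), the quadratic upper bound of the Lemma can be verified for the square loss, whose gradient has Lipschitz constant $L = \|A\|^2$, the largest eigenvalue of $A^\top A$:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 100))
y = rng.standard_normal(40)

def f(w):
    """Smooth part: the square loss."""
    return 0.5 * np.sum((A @ w - y) ** 2)

def grad_f(w):
    """Gradient of the square loss."""
    return A.T @ (A @ w - y)

# Lipschitz constant of the gradient: spectral norm of A, squared
L = np.linalg.norm(A, ord=2) ** 2

# Check f(w) <= f(v) + <grad f(v), w - v> + L/2 ||w - v||^2 at random points
w = rng.standard_normal(100)
v = rng.standard_normal(100)
upper = f(v) + grad_f(v) @ (w - v) + 0.5 * L * np.sum((w - v) ** 2)
```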
Convex Analysis II

Define the linear approximation of $F$ at $v$, w.r.t. $f$:
$$F(w; v) := f(v) + \langle \nabla f(v), w - v \rangle + r(w).$$

Lemma (Sandwich).
$$F(w) - \frac{L}{2}\|w - v\|^2 \le F(w; v) \le F(w).$$

Proof. The left inequality follows from Lemma 1; the right inequality follows from the convexity of $f$: $f(w) \ge f(v) + \langle \nabla f(v), w - v \rangle$.

Equivalent version:
$$F(w) \le F(w; v) + \frac{L}{2}\|w - v\|^2 \le F(w) + \frac{L}{2}\|w - v\|^2.$$
Sandwich example
$$F(w) = \frac{1}{2}(aw - 1)^2 + \frac{1}{2}|w|$$
$$F(w; v) = \frac{1}{2}(av - 1)^2 + a(av - 1)(w - v) + \frac{1}{2}|w|$$
[Figure: one-dimensional plot of $F$ together with the upper bound $F(\cdot; v) + \frac{L}{2}|\cdot - v|^2$.]
Subdifferential

If $f : \mathbb{R}^d \to \mathbb{R}$ is convex, its subdifferential at $w$ is defined as
$$\partial f(w) = \{u : f(v) \ge f(w) + \langle u, v - w \rangle, \ \forall v \in \mathbb{R}^d\}.$$
- $\partial f$ is a set-valued function
- the elements of $\partial f(w)$ are called the subgradients of $f$ at $w$
- intuition: $u \in \partial f(w)$ if the affine function $f(w) + \langle u, v - w \rangle$ is a global underestimator of $f$

Theorem.
$$\hat{w} \in \operatorname*{argmin}_{w \in \mathbb{R}^d} f(w) \iff 0 \in \partial f(\hat{w})$$
Proximal operator

The proximal operator of a convex function $r : \mathbb{R}^d \to \mathbb{R}$ is defined as
$$\operatorname{prox}_r(v) = \operatorname*{argmin}_{w \in \mathbb{R}^d} \frac{1}{2}\|w - v\|^2 + r(w).$$
Convexity of $r$ ensures that the minimizer always exists and is unique (the objective is strongly convex).

Examples:
- LASSO: $r(w) = \lambda\|w\|_1$, $\operatorname{prox}_r(v) = H_\lambda(v)$, the component-wise soft-thresholding operator $(H_\lambda(v))_i = \operatorname{sign}(v_i)(|v_i| - \lambda)_+$
- $\ell_2$ NORM: $r(w) = \lambda\|w\|_2$, $\operatorname{prox}_r(v) = \frac{v}{\|v\|_2}(\|v\|_2 - \lambda)_+$
- GROUP LASSO: $r(w) = \lambda\sum_{l=1}^K \|w_{J_l}\|_2$, $(\operatorname{prox}_r(v))_{J_l} = \frac{v_{J_l}}{\|v_{J_l}\|_2}(\|v_{J_l}\|_2 - \lambda)_+$
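The three proximal operators above can be sketched in NumPy as follows (the function names are our own; `groups` is assumed to be a list of disjoint index arrays, one per group $J_l$):

```python
import numpy as np

def prox_l1(v, lam):
    """Soft-thresholding: prox of lam * ||w||_1, applied component-wise."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def prox_l2(v, lam):
    """Block soft-thresholding: prox of lam * ||w||_2."""
    nrm = np.linalg.norm(v)
    if nrm == 0.0:
        return v
    return (v / nrm) * max(nrm - lam, 0.0)

def prox_group_lasso(v, lam, groups):
    """Prox of lam * sum_l ||w_{J_l}||_2 for disjoint index groups J_l."""
    out = np.empty_like(v)
    for J in groups:
        out[J] = prox_l2(v[J], lam)
    return out
```

For instance, `prox_l1(np.array([3.0, -0.5, 1.2]), 1.0)` shrinks each component toward zero by $\lambda = 1$ and zeroes out the component whose magnitude is below $\lambda$.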
O(1/T) algorithm

Algorithm:
  $w_0 \leftarrow 0$
  for $t = 0, 1, \ldots, T$ do
    $w_{t+1} = \operatorname*{argmin}_w \frac{L}{2}\|w - w_t\|^2 + F(w; w_t)$
  end for

Recall $F(w; w_t) := f(w_t) + \langle \nabla f(w_t), w - w_t \rangle + r(w)$. The term $f(w_t)$ does not depend on $w$ and can be discarded. By completing the square, we also obtain
$$\frac{L}{2}\|w - w_t\|^2 + \langle \nabla f(w_t), w - w_t \rangle = \frac{L}{2}\left\|w - w_t + \frac{1}{L}\nabla f(w_t)\right\|^2 - \frac{1}{2L}\|\nabla f(w_t)\|^2,$$
where the extra term $\frac{1}{2L}\|\nabla f(w_t)\|^2$ does not depend on $w$.
O(1/T) algorithm (cont'd)

Algorithm 1:
  $w_0 \leftarrow 0$
  for $t = 0, 1, \ldots, T$ do
    $w_{t+1} = \operatorname{prox}_{r/L}\left(w_t - \frac{1}{L}\nabla f(w_t)\right)$
  end for

Remark. If $r = 0$, we recover Gradient Descent: $w_{t+1} = w_t - \frac{1}{L}\nabla f(w_t)$.

Theorem (Convergence rate). Let $w_* \in \operatorname*{argmin}_w F(w)$. Then, at iteration $T$, Algorithm 1 yields a solution $w_T$ that satisfies
$$F(w_T) - F(w_*) \le \frac{L\|w_* - w_0\|^2}{2T}. \qquad (1)$$
O(1/T) algorithm - Examples

LASSO: $f(w) = \frac{1}{2}\|Aw - y\|^2$, $r(w) = \lambda\|w\|_1$:
$$w_{t+1} = H_{\lambda/L}\left(w_t - \frac{1}{L}A^\top(Aw_t - y)\right)$$

GROUP LASSO: $f(w) = \frac{1}{2}\|Aw - y\|^2$, $r(w) = \lambda\sum_{l=1}^K \|w_{J_l}\|_2$:
$$v = w_t - \frac{1}{L}A^\top(Aw_t - y), \qquad (w_{t+1})_{J_l} = \frac{v_{J_l}}{\|v_{J_l}\|_2}\left(\|v_{J_l}\|_2 - \frac{\lambda}{L}\right)_+$$
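Putting the pieces together, the $O(1/T)$ iteration for the lasso can be sketched as follows (a minimal implementation; the step size uses the exact Lipschitz constant $L = \|A\|^2$, and the function name `ista` is our own):

```python
import numpy as np

def ista(A, y, lam, T=500):
    """O(1/T) proximal gradient method for the lasso:
    min_w 0.5 * ||A w - y||^2 + lam * ||w||_1."""
    L = np.linalg.norm(A, ord=2) ** 2      # Lipschitz constant of the gradient
    w = np.zeros(A.shape[1])
    for _ in range(T):
        g = A.T @ (A @ w - y)              # gradient of the smooth part
        v = w - g / L                      # gradient step
        # prox step: soft-thresholding with threshold lam / L
        w = np.sign(v) * np.maximum(np.abs(v) - lam / L, 0.0)
    return w
```

With $A = I$ the iteration converges in one step to the soft-thresholding of $y$, which is a convenient correctness check.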
O(1/T) algorithm - convergence rate proof

Sandwich: $F(w) - \frac{L}{2}\|w - v\|^2 \le F(w; v) \le F(w)$.

Lemma (3-point property). If $\hat{w} = \operatorname*{argmin}_{w \in \mathbb{R}^d} \frac{1}{2}\|w - w_0\|^2 + \varphi(w)$, then, for any $w \in \mathbb{R}^d$,
$$\varphi(\hat{w}) + \frac{1}{2}\|\hat{w} - w_0\|^2 \le \varphi(w) + \frac{1}{2}\|w - w_0\|^2 - \frac{1}{2}\|w - \hat{w}\|^2.$$

Proof of the convergence rate:
$$F(w_{t+1}) \le F(w_{t+1}; w_t) + \frac{L}{2}\|w_{t+1} - w_t\|^2 \quad \text{(Sandwich, left)}$$
$$\le F(w_*; w_t) + \frac{L}{2}\|w_* - w_t\|^2 - \frac{L}{2}\|w_* - w_{t+1}\|^2 \quad \text{(3-point with } w = w_*\text{)}$$
$$\le F(w_*) + \frac{L}{2}\|w_* - w_t\|^2 - \frac{L}{2}\|w_* - w_{t+1}\|^2 \quad \text{(Sandwich, right)}$$
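The 3-point property can be checked numerically for $\varphi(w) = \lambda\|w\|_1$, whose minimizer $\hat{w}$ is available in closed form via soft-thresholding (a sketch with arbitrary random data):

```python
import numpy as np

rng = np.random.default_rng(1)
lam = 0.7
w0 = rng.standard_normal(5)

def phi(w):
    """phi(w) = lam * ||w||_1."""
    return lam * np.sum(np.abs(w))

# minimizer of 0.5 ||w - w0||^2 + phi(w) is the soft-thresholding of w0
w_hat = np.sign(w0) * np.maximum(np.abs(w0) - lam, 0.0)

# evaluate both sides of the 3-point inequality at an arbitrary point w
w = rng.standard_normal(5)
lhs = phi(w_hat) + 0.5 * np.sum((w_hat - w0) ** 2)
rhs = phi(w) + 0.5 * np.sum((w - w0) ** 2) - 0.5 * np.sum((w - w_hat) ** 2)
```

The inequality is a consequence of the 1-strong convexity of $w \mapsto \frac{1}{2}\|w - w_0\|^2 + \varphi(w)$.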
O(1/T) algorithm - convergence rate proof cont'd

Let us now define $\varepsilon_t := F(w_t) - F(w_*)$, so that
$$\varepsilon_{t+1} \le \frac{L}{2}\|w_* - w_t\|^2 - \frac{L}{2}\|w_* - w_{t+1}\|^2.$$

Lemma. The sequence $\varepsilon_t$, for $t = 0, \ldots, T$, is monotone non-increasing:
$$F(w_{t+1}) \le F(w_{t+1}; w_t) + \frac{L}{2}\|w_{t+1} - w_t\|^2 \quad \text{(Sandwich, left)}$$
$$\le F(w_t; w_t) + \frac{L}{2}\|w_t - w_t\|^2 = F(w_t) \quad \text{(def. of } w_{t+1}\text{)}$$

Since $\varepsilon_t$ is monotone non-increasing,
$$T\varepsilon_T \le \sum_{t=0}^{T-1} \varepsilon_{t+1} \le \frac{L}{2}\|w_* - w_0\|^2 - \frac{L}{2}\|w_* - w_T\|^2 \le \frac{L}{2}\|w_* - w_0\|^2,$$
hence
$$\varepsilon_T = F(w_T) - F(w_*) \le \frac{L\|w_* - w_0\|^2}{2T}.$$
O(1/T^2) algorithm

We want to accelerate Algorithm 1 by introducing some factors tending to zero. We define $w_{t+1}$ by taking the linear approximation at an auxiliary point $v_t$:
$$w_{t+1} := \operatorname*{argmin}_w F(w; v_t) + \frac{L}{2}\|w - v_t\|^2.$$
We perform the same analysis as above, letting $v_\star$ be a reference vector that will be chosen later:
$$F(w_{t+1}) \le F(w_{t+1}; v_t) + \frac{L}{2}\|w_{t+1} - v_t\|^2 \quad \text{(Sandwich, left)}$$
$$\le F(v_\star; v_t) + \frac{L}{2}\|v_\star - v_t\|^2 - \frac{L}{2}\|v_\star - w_{t+1}\|^2 \quad \text{(3-point with } w = v_\star\text{)}$$
$$\le F(v_\star) + \frac{L}{2}\|v_\star - v_t\|^2 - \frac{L}{2}\|v_\star - w_{t+1}\|^2 \quad \text{(Sandwich, right)}$$
By introducing yet another sequence $\{u_t\}$, we would like to obtain
$$F(w_{t+1}) \le F(v_\star) + \frac{L\theta_t^2}{2}\|w_* - u_t\|^2 - \frac{L\theta_t^2}{2}\|w_* - u_{t+1}\|^2. \qquad \text{(WANT)}$$
O(1/T^2) algorithm - cont'd

We have shown
$$F(w_{t+1}) \le F(v_\star) + \frac{L}{2}\|v_\star - v_t\|^2 - \frac{L}{2}\|v_\star - w_{t+1}\|^2$$
and want
$$F(w_{t+1}) \le F(v_\star) + \frac{L\theta_t^2}{2}\|w_* - u_t\|^2 - \frac{L\theta_t^2}{2}\|w_* - u_{t+1}\|^2. \qquad \text{(WANT)}$$
In order for (WANT) to hold, we need
$$v_\star - v_t = \theta_t(w_* - u_t), \qquad v_\star - w_{t+1} = \theta_t(w_* - u_{t+1}).$$
To satisfy the second relation we can choose
$$v_\star = \alpha_t + \theta_t w_*, \qquad u_{t+1} = \frac{w_{t+1} - \alpha_t}{\theta_t}.$$
In order to exploit the convexity of $F$, we can choose $\alpha_t = (1 - \theta_t)w_t$, so that $v_\star$ becomes a convex combination of $w_*$ and the previous point $w_t$.
O(1/T^2) algorithm - cont'd

In summary, we have
$$v_\star = (1 - \theta_t)w_t + \theta_t w_*, \qquad u_{t+1} = \frac{w_{t+1} - (1 - \theta_t)w_t}{\theta_t}, \qquad v_t = (1 - \theta_t)w_t + \theta_t u_t.$$

Accelerated Algorithm:
  $w_0, u_0 \leftarrow 0$
  for $t = 0, 1, \ldots, T$ do
    $v_t \leftarrow (1 - \theta_t)w_t + \theta_t u_t$
    $w_{t+1} \leftarrow \operatorname*{argmin}_w F(w; v_t) + \frac{L}{2}\|w - v_t\|^2 = \operatorname{prox}_{r/L}\left(v_t - \frac{1}{L}\nabla f(v_t)\right)$
    $u_{t+1} \leftarrow \frac{w_{t+1} - (1 - \theta_t)w_t}{\theta_t}$
  end for
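The accelerated algorithm can be sketched for the lasso as follows (a minimal implementation of the $w$/$u$/$v$/$\theta$ updates above; the $\theta_t$ recursion from the next slide is solved in closed form at each step, and the function name `accelerated_prox` is our own):

```python
import numpy as np

def accelerated_prox(A, y, lam, T=500):
    """O(1/T^2) accelerated proximal gradient method for the lasso."""
    L = np.linalg.norm(A, ord=2) ** 2          # Lipschitz constant of the gradient
    d = A.shape[1]
    w, u = np.zeros(d), np.zeros(d)
    theta = 1.0                                # theta_0 = 1
    for _ in range(T):
        v = (1 - theta) * w + theta * u        # v_t
        g = A.T @ (A @ v - y)
        z = v - g / L
        # prox step: soft-thresholding with threshold lam / L
        w_next = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
        u = (w_next - (1 - theta) * w) / theta # u_{t+1}
        w = w_next
        # next theta: positive root of x^2 + theta^2 x - theta^2 = 0,
        # which enforces (1 - theta_{t+1}) / theta_{t+1}^2 = 1 / theta_t^2
        theta = theta * (np.sqrt(theta ** 2 + 4) - theta) / 2
    return w
```

As with the basic method, $A = I$ gives the closed-form soft-thresholding solution, which serves as a correctness check.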
O(1/T^2) algorithm - convergence rate

Theorem. Let $w_* \in \operatorname*{argmin}_w F(w)$. Then, at iteration $T$, Algorithm 2 yields $w_T$ that satisfies
$$F(w_T) - F(w_*) \le \frac{L}{2}\theta_T^2\|w_*\|^2. \qquad (2)$$
Let us consider the sequence $\{\theta_t\}$:
$$\theta_0 = 1, \qquad \frac{1 - \theta_{t+1}}{\theta_{t+1}^2} = \frac{1}{\theta_t^2}. \qquad \text{(Theta-Def)}$$
The sequence $\{\theta_t\}$ satisfies
$$\theta_t \le \frac{2}{t + 2}, \qquad (3)$$
proved by induction using the arithmetic-geometric mean inequality on (Theta-Def). We then have the convergence rate
$$F(w_{T+1}) - F(w_*) \le \frac{L}{2}\theta_T^2\|w_*\|^2 \le \frac{2L}{T^2}\|w_*\|^2.$$
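A short script can generate the sequence from (Theta-Def) and confirm the bound (3) numerically; the closed-form update is the positive root of $\theta_{t+1}^2 + \theta_t^2\,\theta_{t+1} - \theta_t^2 = 0$ (the function name is our own):

```python
import math

def theta_sequence(T):
    """Return theta_0, ..., theta_T from theta_0 = 1 and
    (1 - theta_{t+1}) / theta_{t+1}^2 = 1 / theta_t^2."""
    thetas = [1.0]
    for _ in range(T):
        th = thetas[-1]
        # positive root of x^2 + th^2 * x - th^2 = 0
        thetas.append(th * (math.sqrt(th ** 2 + 4) - th) / 2)
    return thetas
```

For instance, $\theta_1 = (\sqrt{5} - 1)/2 \approx 0.618 \le 2/3$, and every subsequent term stays below $2/(t+2)$.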
O(1/T^2) algorithm - convergence rate proof

Combining (WANT) with the convexity bound $F(v_\star) \le (1 - \theta_t)F(w_t) + \theta_t F(w_*)$ gives
$$F(w_{t+1}) \le (1 - \theta_t)F(w_t) + \theta_t F(w_*) + \frac{L\theta_t^2}{2}\|w_* - u_t\|^2 - \frac{L\theta_t^2}{2}\|w_* - u_{t+1}\|^2.$$
Define $\varepsilon_t := F(w_t) - F(w_*)$ and $\Phi_t := \frac{L}{2}\|w_* - u_t\|^2$. Then
$$\varepsilon_{t+1} \le (1 - \theta_t)\varepsilon_t + \theta_t^2(\Phi_t - \Phi_{t+1}),$$
i.e.
$$\frac{1}{\theta_t^2}\varepsilon_{t+1} \le \frac{1 - \theta_t}{\theta_t^2}\varepsilon_t + \Phi_t - \Phi_{t+1}.$$
Using (Theta-Def), $\frac{1}{\theta_t^2} = \frac{1 - \theta_{t+1}}{\theta_{t+1}^2}$, so the bound telescopes. Summing over the iterations gives
$$\frac{1 - \theta_{T+1}}{\theta_{T+1}^2}\varepsilon_{T+1} = \frac{1}{\theta_T^2}\varepsilon_{T+1} \le \Phi_0 - \Phi_{T+1} + \frac{1 - \theta_0}{\theta_0^2}\varepsilon_0 \le \Phi_0 = \frac{L}{2}\|w_* - u_0\|^2,$$
since $\theta_0 = 1$ kills the $\varepsilon_0$ term and $\Phi_{T+1} \ge 0$.
Simple numerical comparison

Solve LASSO with $d = 100$ variables:
- regression vector $w_*$ has 20 nonzero components with random $\pm 1$ entries
- $n = 40$ examples, $x_{ij} \sim N(0, 1)$ and $y = Xw_* + \varepsilon$, $\varepsilon_i \sim N(0, 0.01)$

[Figure: objective value (0 to 600) versus iteration (0 to 500) for the $O(1/T)$ and $O(1/T^2)$ algorithms.]
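The experiment can be reproduced with a sketch like the following (the regularization parameter $\lambda = 1$ is a hypothetical choice, as the slides do not state it; noise variance $0.01$ corresponds to standard deviation $0.1$):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, k = 100, 40, 20

# sparse regression vector: 20 random +/-1 entries
w_true = np.zeros(d)
idx = rng.choice(d, size=k, replace=False)
w_true[idx] = rng.choice([-1.0, 1.0], size=k)

X = rng.standard_normal((n, d))
y = X @ w_true + 0.1 * rng.standard_normal(n)    # N(0, 0.01) noise

lam = 1.0                                        # hypothetical choice
L = np.linalg.norm(X, ord=2) ** 2                # Lipschitz constant of the gradient

def obj(w):
    return 0.5 * np.sum((X @ w - y) ** 2) + lam * np.sum(np.abs(w))

def prox_step(v):
    return np.sign(v) * np.maximum(np.abs(v) - lam / L, 0.0)

w_b = np.zeros(d)                                # O(1/T) iterate
w_a, u, theta = np.zeros(d), np.zeros(d), 1.0    # O(1/T^2) state
for _ in range(500):
    # basic proximal gradient step
    w_b = prox_step(w_b - X.T @ (X @ w_b - y) / L)
    # accelerated step
    v = (1 - theta) * w_a + theta * u
    w_next = prox_step(v - X.T @ (X @ v - y) / L)
    u = (w_next - (1 - theta) * w_a) / theta
    w_a = w_next
    theta = theta * (np.sqrt(theta ** 2 + 4) - theta) / 2
```

Tracking `obj(w_b)` and `obj(w_a)` per iteration reproduces the two curves in the figure.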
Summary
- Proximal algorithms are fast first-order methods.
- The accelerated algorithm has an $O(1/T^2)$ convergence rate.
- All you need is:
  - the gradient of the smooth part
  - the proximal operator of the non-smooth part