On Convergence Rate of Concave-Convex Procedure



Ian En-Hsu Yen (r00922017@csie.ntu.edu.tw)
Po-Wei Wang (b97058@csie.ntu.edu.tw)
Nanyun Peng, Johns Hopkins University, Baltimore, MD 21218 (npeng1@jhu.edu)
Shou-De Lin (sdlin@csie.ntu.edu.tw)

Abstract

The Concave-Convex Procedure (CCCP) has been widely used to solve the nonconvex d.c. (difference of convex functions) programs that occur in learning problems such as sparse support vector machines (SVM), transductive SVM, and sparse principal component analysis (PCA). Although the global convergence behavior of CCCP has been well studied, its convergence rate is still an open problem. Most d.c. programs in machine learning involve constraints or a nonsmooth objective function, which prohibits convergence analysis via a differentiable map. In this paper, we approach the problem in a different manner by connecting CCCP with the more general block coordinate descent method. We show that the recent convergence result [1] for coordinate gradient descent on nonconvex, nonsmooth problems also applies to exact alternating minimization. This implies that the convergence rate of CCCP is at least linear when the nonsmooth part of the d.c. program is piecewise-linear and the smooth part is strictly convex quadratic. Many d.c. programs in the SVM literature fall into this case.

1 Introduction

The Concave-Convex Procedure is a popular method for optimizing d.c. (difference of convex functions) programs of the form

    \min_{x}\ u(x) - v(x)
    \mathrm{s.t.}\ f_i(x) \le 0,\ i = 1,\dots,p,\quad g_j(x) = 0,\ j = 1,\dots,q          (1)

where u(x), v(x), and f_i(x) are convex functions and g_j(x) are affine functions defined on R^n. Suppose v(x) is (piecewise) differentiable. The Concave-Convex Procedure iteratively solves a sequence of convex programs obtained by linearizing the concave part:

    x^{t+1} \in \arg\min_{x}\ u(x) - v'(x^t)^T x
    \mathrm{s.t.}\ f_i(x) \le 0,\ i = 1,\dots,p,\quad g_j(x) = 0,\ j = 1,\dots,q          (2)

This procedure was originally proposed by Yuille et al. [2] to deal with unconstrained d.c. programs with smooth objective functions.
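To make iteration (2) concrete, the following is a minimal sketch of CCCP on a toy unconstrained, smooth d.c. program; the particular u, v, and all names below are illustrative assumptions, not part of the paper. For this choice of u the convex subproblem has a closed-form solution.

```python
import numpy as np

# Toy unconstrained d.c. program (an illustrative assumption, not from the paper):
#   u(x) = 0.5 * ||x - a||^2          (strictly convex quadratic)
#   v(x) = c * sqrt(1 + ||x||^2)      (smooth and convex)
a = np.array([2.0, -1.0])
c = 0.5

def grad_v(x):
    return c * x / np.sqrt(1.0 + x.dot(x))

def dc_objective(x):
    return 0.5 * np.sum((x - a) ** 2) - c * np.sqrt(1.0 + x.dot(x))

def cccp(x0, max_iters=100, tol=1e-10):
    """CCCP iteration (2): linearize the concave part -v(x) at x_t and
    minimize the resulting convex upper bound.  For this particular u the
    subproblem min_x u(x) - grad_v(x_t)^T x has the closed form
    x_{t+1} = a + grad_v(x_t)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        x_new = a + grad_v(x)          # exact solution of the convex subproblem
        if np.linalg.norm(x_new - x) < tol:
            break
        x = x_new
    return x

x_star = cccp(np.zeros(2))
print(x_star, dc_objective(x_star))
```

Because the linearization of the concave part majorizes the objective, each iterate decreases u(x) - v(x); Section 2 develops exactly this majorization-minimization view.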

Nevertheless, many d.c. programs in machine learning come with constraints or nonsmooth functions. For example, [4] and [5] propose using a concave function to approximate the l_0-loss for sparse PCA and SVM feature selection, respectively, which results in d.c. programs with convex constraints. In [11] and [12], R. Collobert et al. formulate transductive SVM and ramp-loss SVM as d.c. programs with nonsmooth u(x), v(x). Other examples, such as [9] and [13], use (1) with piecewise-linear u(x), v(x) to handle a nonconvex tighter bound [9] and hidden variables [13] in structural SVM.

Though CCCP is extensively used in machine learning, its convergence behavior has not been fully understood. Yuille et al. give an analysis of CCCP's global convergence in the original paper [2], which, however, is not complete [14]. The global convergence of CCCP is proved in [10] and [14] via different approaches. However, as [14] points out, the convergence rate of CCCP is still an open problem. [6] and [7] have analyzed the local convergence behavior of the Majorization-Minimization algorithm, of which CCCP is a special case, by treating (2) as a differentiable map x^{t+1} = M(x^t); but that analysis only applies to the unconstrained, differentiable version of the d.c. program, so it cannot be used for the examples mentioned above.

In this paper, we approach the problem in a different way by connecting CCCP with the more general block coordinate descent method. We show that the recent convergence result [1] for coordinate gradient descent on nonconvex, nonsmooth problems also applies to block coordinate descent. Since CCCP, and the more general Majorization-Minimization algorithm, are special cases of block coordinate descent, this connection provides a simple way to prove the convergence rate of CCCP. In particular, we show that the sequence {x^t} produced by (2) converges at least linearly to a stationary point of (1) if the nonsmooth parts of u(x) and v(x) are convex piecewise-linear and the smooth part of u(x) is strictly convex quadratic. Many d.c. programs in the SVM literature fall into this case, such as those for ramp-loss SVM [12], transductive SVM [11], and structural SVM with hidden variables [13] or a nonconvex tighter bound [9].

2 Majorization-Minimization as Block Coordinate Descent

In this section, we show that CCCP is a special case of the Majorization-Minimization algorithm and thus can be viewed as block coordinate descent on a surrogate function. We then introduce an alternative formulation of algorithm (2), for the case where v(x) is piecewise-linear, that fits the convergence analysis better.

The Majorization-Minimization (MM) principle is as follows. Suppose our goal is to minimize f(x) over Ω ⊆ R^n. We first construct a majorization function g(x, y) over Ω × Ω such that

    f(x) \le g(x, y),\quad \forall x, y \in \Omega
    f(x) = g(x, x),\quad \forall x \in \Omega          (3)

The MM algorithm therefore minimizes an upper bound of f(x) at each iteration,

    x^{t+1} \in \arg\min_{x \in \Omega}\ g(x, x^t)          (4)

until a fixed point of (4) is reached. Since the minimum of g(x^t, y) over y occurs at y = x^t, (4) can be seen as block coordinate descent on g(x, y) with two blocks, x and y.

CCCP is a special case of (4) with f(x) = u(x) - v(x) and g(x, y) = u(x) - v'(y)^T (x - y) - v(y). Condition (3) holds because, by convexity of v(x), its first-order Taylor approximation is less than or equal to v(x), with equality when y = x. Thus we have

    f(x) = u(x) - v(x) \le u(x) - v'(y)^T (x - y) - v(y) = g(x, y),\quad \forall x, y \in \Omega
    f(x) = g(x, x),\quad \forall x \in \Omega          (5)

However, when v(x) is piecewise-linear, the term -v'(y)^T (x - y) - v(y) is discontinuous and nonconvex in y. For convergence analysis, we instead express v(x) as \max_{i=1,\dots,m} (a_i^T x + b_i), with v'(x) = a_k piecewise, where k = \arg\max_i (a_i^T x + b_i). Thus we can express the d.c.
program (1) in the alternative form

    \min_{x \in \mathbb{R}^n,\ d \in \mathbb{R}^m}\ u(x) - \sum_{i=1}^m d_i (a_i^T x + b_i)
    \mathrm{s.t.}\ \sum_{i=1}^m d_i = 1,\quad d_i \ge 0,\ i = 1,\dots,m
                   f_i(x) \le 0,\ i = 1,\dots,p,\quad g_j(x) = 0,\ j = 1,\dots,q          (6)
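For intuition, the alternating minimization of (6) can be sketched on a one-dimensional toy instance as follows; the numbers and function names are illustrative assumptions. The d-step puts uniform weight on the maximizing pieces of v(x), and the x-step then solves the same linearized convex subproblem as iteration (2), as explained in the next paragraph and in Footnote 1.

```python
import numpy as np

# Toy one-dimensional instance of formulation (6); the numbers are illustrative:
#   u(x) = 0.5 * (x - r)^2                 (strictly convex quadratic)
#   v(x) = max_i (a[i] * x + b[i])         (convex piecewise-linear)
r = 1.0
a = np.array([-1.0, 0.0, 2.0])
b = np.array([0.0, 0.5, -1.0])

def alternating_minimization(x0, max_iters=50, tol=1e-12):
    x = x0
    for _ in range(max_iters):
        # d-step: uniform weight on the active (maximizing) pieces of v at x.
        vals = a * x + b
        active = np.flatnonzero(vals >= vals.max() - 1e-12)
        d = np.zeros_like(a)
        d[active] = 1.0 / len(active)
        # x-step: min_x 0.5*(x - r)^2 - sum_i d_i*(a_i*x + b_i)  =>  x = r + d.dot(a),
        # i.e. the same linearized subproblem as CCCP iteration (2).
        x_new = r + d.dot(a)
        if abs(x_new - x) < tol:
            break
        x = x_new
    return x

x_star = alternating_minimization(0.0)
print(x_star, 0.5 * (x_star - r) ** 2 - np.max(a * x_star + b))
```

Starting from x = 0, this toy run settles at x = 3, a stationary point of u(x) - v(x) for the data above.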

Alternately minimizing (6) between x and d yields the same CCCP algorithm (2).¹ This formulation replaces the discontinuous, nonconvex term -v'(y)^T (x - y) - v(y) with the quadratic (bilinear) term -\sum_{i=1}^m d_i (a_i^T x + b_i), so (6) fits the form of the nonsmooth problem studied in [1], namely minimizing a function F(x) = f(x) + c P(x) with c > 0, f(x) smooth, and P(x) convex and nonsmooth. In the next section, we give a convergence result for CCCP by showing that the alternating minimization of (6) yields the same sequence {x^t, d^t} as a special case of the Coordinate Gradient Descent method proposed in [1].

Footnote 1: Let S = \arg\max_i (a_i^T x + b_i). When minimizing (6) over d, an optimal solution simply sets d_k = 1/|S| for k ∈ S and d_j = 0 for j ∉ S. When minimizing over x, the problem becomes x^{t+1} = \arg\min_x (u(x) - \bar{a}_S^T x - \bar{b}_S) = \arg\min_x (u(x) - v'(x^t)^T x) subject to the constraints in (1), where \bar{a}_S, \bar{b}_S are the averages of a_k, b_k over k ∈ S, and v'(x^t) = \bar{a}_S is a subgradient of v(x) at x^t.

3 Convergence Theorem

Lemma 3.1. Consider the problem

    \min_{x, y}\ F(x, y) = f(x, y) + c P(x, y)          (7)

where f(x, y) is smooth and P(x, y) is nonsmooth, convex, lower semi-continuous, and separable in x and y. The sequence {x^t, y^t} produced by the alternating minimization

    x^{t+1} = \arg\min_{x}\ F(x, y^t)          (8)
    y^{t+1} = \arg\min_{y}\ F(x^{t+1}, y)      (9)

(a) converges to a stationary point of (7) if, in (8) and (9) or their equivalent problems,² the objective functions have smooth parts f(x), f(y) that are strictly convex quadratic;

(b) does so at a linear rate if, in addition to the assumption in (a), f(x, y) is quadratic (possibly nonconvex) and P(x, y) is polyhedral.

Footnote 2: By equivalent we mean that the two problems have the same optimal solution.

Proof. Let the smooth parts of the objective functions in (8) and (9), or of their equivalent problems, be f(x) and f(y). We can define a special case of the Coordinate Gradient Descent (CGD) method proposed in [1] that yields the same sequence {x^t, y^t} as the alternating minimization: at each iteration k of the CGD algorithm, we use the Hessian block H^k_{xx} = ∇² f(x) ≻ 0_n for block J = x and H^k_{yy} = ∇² f(y) ≻ 0_m for block J = y, with exact line search (which satisfies the Armijo rule).³ By Theorem 1 of [1], the sequence {x^t, y^t} converges to a stationary point of (7). If we further assume that f(x, y) is quadratic and P(x, y) is polyhedral, then by Theorems 2 and 4 of [1] the convergence rate of {x^t, y^t} is at least linear.

Footnote 3: Note that in [1] the Hessian matrix H^k affects the CGD algorithm only through its block H^k_{JJ}, so we can always obtain a positive-definite H^k whenever H^k_{JJ} is positive definite by setting, outside H^k_{JJ}, the diagonal elements to 1 and the off-diagonal elements to 0.

Lemma 3.1 provides a basis for the convergence of (6), and thus of the CCCP algorithm (2). However, when minimizing (6) over d, the objective is linear rather than strictly convex quadratic as required by Lemma 3.1. The next lemma shows that this problem nevertheless has an equivalent strictly convex quadratic problem with the same solution.

Lemma 3.2. There exist ϵ_0 > 0 and d* such that the problem

    \arg\min_{d \in \mathbb{R}^m}\ -c^T d + \frac{\epsilon}{2}\|d\|^2
    \mathrm{s.t.}\ \sum_{i=1}^m d_i = 1,\quad d_i \ge 0,\ i = 1,\dots,m          (10)

has the same optimal solution d* for all 0 ≤ ϵ < ϵ_0.
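Before the proof, Lemma 3.2 is easy to check numerically: for a small enough ϵ the strictly convex problem (10) returns the same d* that the linear objective (ϵ = 0) selects on the maximizing set S. The sketch below uses hypothetical data and scipy's SLSQP solver; the data and the tolerance are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical data for problem (10); c is chosen with a tie so that |S| = 2.
c = np.array([3.0, 1.0, 3.0, -2.0])
m = len(c)

# d* from Lemma 3.2: uniform weight on S = argmax_i c_i, zero elsewhere.
S = np.flatnonzero(c >= c.max() - 1e-12)
d_star = np.zeros(m)
d_star[S] = 1.0 / len(S)

# Threshold eps_0 = min_{j not in S} (c_max - c_j) * |S|; pick eps below it.
not_S = np.setdiff1d(np.arange(m), S)
eps_0 = (c.max() - c[not_S]).min() * len(S)
eps = 0.5 * eps_0

# Solve the strictly convex quadratic problem (10) over the simplex.
res = minimize(
    fun=lambda d: -c.dot(d) + 0.5 * eps * d.dot(d),
    x0=np.full(m, 1.0 / m),
    method="SLSQP",
    bounds=[(0.0, None)] * m,
    constraints=[{"type": "eq", "fun": lambda d: d.sum() - 1.0}],
)
print(res.x, d_star)
print(np.allclose(res.x, d_star, atol=1e-4))  # expected: True (up to solver tolerance)
```

The choice eps = 0.5 * ϵ_0 in the sketch sits safely below the threshold ϵ_0 = \min_{j∉S}(c_max - c_j)|S| established in the proof that follows.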

Proof. Let S = \arg\max_i c_i and c_max = \max_i c_i. For ϵ = 0, we can obtain an optimal solution d* by setting d*_k = 1/|S| for k ∈ S and d*_j = 0 for j ∉ S. Let α and β be the Lagrange multipliers of the affine and non-negativity constraints in (10), respectively. By the KKT conditions, the solution d*, α*, β* must satisfy -c + α e - β = 0 with β*_{k∈S} = 0, α* = c_max, and β*_{j∉S} = c_max - c_j, where e = [1, ..., 1]^T ∈ R^m. When ϵ > 0, the KKT conditions for (10) differ only in the stationarity equation ϵ d - c + α e - β = 0, which can be satisfied by setting β_{k∈S} = 0, α = α* - ϵ/|S|, and β_{j∉S} = β*_{j∉S} - ϵ/|S|. If ϵ/|S| < β*_j = c_max - c_j for all j ∉ S, then d*, α, β still satisfy the KKT conditions. In other words, d* is still optimal whenever ϵ < ϵ_0, where ϵ_0 = \min_{j∉S} (c_max - c_j) |S| > 0.

When minimizing (6) over d, the problem is of the form (10) with ϵ = 0 and c^t_i = a_i^T x^t + b_i, i = 1,...,m. Lemma 3.2 says that we can always find an equivalent strictly convex quadratic problem by choosing a positive ϵ^t < \min_{j∉S} (c^t_max - c^t_j) |S|. Now we are ready to state the main theorem.

Theorem 3.3. The Concave-Convex Procedure (2) converges to a stationary point of (1) at an at least linear rate if the nonsmooth parts of u(x) and v(x) are convex piecewise-linear, the smooth part of u(x) is strictly convex quadratic, and the domain formed by f_i(x) ≤ 0, g_j(x) = 0 is polyhedral.

Proof. Write u(x) - v(x) = f_u(x) + P_u(x) - v(x), where f_u(x) is strictly convex quadratic and P_u(x), v(x) are convex piecewise-linear. We can formulate CCCP as an alternating minimization of (6) between x and d. The constraints f_i(x) ≤ 0 and g_j(x) = 0 form a polyhedral domain, so we encode them as a polyhedral, lower semi-continuous indicator function P_dom(x), with P_dom(x) = 0 if f_i(x) ≤ 0, g_j(x) = 0 and P_dom(x) = ∞ otherwise. In the same way, the domain constraints on d are encoded as a polyhedral, lower semi-continuous function P_dom(d). Then (6) can be expressed as \min_{x,d} F(x, d) = f(x, d) + P(x, d), where f(x, d) = f_u(x) - \sum_{i=1}^m d_i (a_i^T x + b_i) is quadratic and P(x, d) = P_u(x) + P_dom(x) + P_dom(d) is polyhedral, lower semi-continuous, and separable in x and d. When alternately minimizing F(x, d), the smooth part of the subproblem \min_x F(x, d^t) is strictly convex quadratic, and the subproblem \min_d F(x^{t+1}, d) has, by Lemma 3.2, an equivalent strictly convex quadratic problem. Hence the alternating minimization of F(x, d), that is, the CCCP algorithm (2), converges at least linearly to a stationary point of (1) by Lemma 3.1.

To our knowledge, this is the first result on the convergence rate of CCCP for nonsmooth problems. In particular, it shows that the CCCP algorithms used in [9], [11], [12], and [13] for structural SVM with a tighter bound, transductive SVM, ramp-loss SVM, and structural SVM with hidden variables have an at least linear convergence rate.

Acknowledgments

This work was supported by the National Science Council and Intel Corporation under Grants NSC 100-2911-I-002-001 and 101R7501.

References

[1] P. Tseng and S. Yun. A coordinate gradient descent method for nonsmooth separable minimization. Mathematical Programming, 2009.

[2] A. L. Yuille and A. Rangarajan. The concave-convex procedure. Neural Computation, 15:915–936, 2003.

[3] L. Wang, X. Shen, and W. Pan. On transductive support vector machines. In J. Verducci, X. Shen, and J. Lafferty, editors, Prediction and Discovery. American Mathematical Society, 2007.

[4] B. K. Sriperumbudur, D. A. Torres, and G. R. G. Lanckriet. Sparse eigen methods by d.c. programming. In Proceedings of the 24th Annual International Conference on Machine Learning, 2007.
[5] J. Neumann, C. Schnörr, and G. Steidl. Combined SVM-based feature selection and classification. Machine Learning, 61:129–150, 2005.

[6] X.-L. Meng. Discussion on "Optimization transfer using surrogate objective functions". Journal of Computational and Graphical Statistics, 9(1):35–43, 2000.

[7] R. Salakhutdinov, S. Roweis, and Z. Ghahramani. On the convergence of bound optimization algorithms. In Proceedings of the 19th Conference on Uncertainty in Artificial Intelligence, pages 509–516, 2003.

[8] D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 556–562. MIT Press, Cambridge, 2001.

[9] C. B. Do, Q. V. Le, C. H. Teo, O. Chapelle, and A. J. Smola. Tighter bounds for structured estimation. In Advances in Neural Information Processing Systems 21, 2009.

[10] T. Pham Dinh and L. T. Hoai An. Convex analysis approach to d.c. programming: Theory, algorithms and applications. Acta Mathematica Vietnamica, 22(1):289–355, 1997.

[11] R. Collobert, F. Sinz, J. Weston, and L. Bottou. Large scale transductive SVMs. Journal of Machine Learning Research, 7:1687–1712, 2006.

[12] R. Collobert, J. Weston, and L. Bottou. Trading convexity for scalability. In Proceedings of the 23rd International Conference on Machine Learning (ICML), Pittsburgh, USA, 2006.

[13] C.-N. J. Yu and T. Joachims. Learning structural SVMs with latent variables. In International Conference on Machine Learning (ICML), 2009.

[14] B. Sriperumbudur and G. Lanckriet. On the convergence of the concave-convex procedure. In Advances in Neural Information Processing Systems, 2009.