Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh



Similar documents
Adaptive Online Gradient Descent

Big Data Optimization: Randomized lock-free methods for minimizing partially separable convex functions

Lecture 4: BK inequality 27th August and 6th September, 2007

2.3 Convex Constrained Optimization Problems

1. (First passage/hitting times/gambler s ruin problem:) Suppose that X has a discrete state space and let i be a fixed state. Let

17. Inner product spaces Definition Let V be a real vector space. An inner product on V is a function

Linear Threshold Units

Probabilistic Models for Big Data. Alex Davies and Roger Frigola University of Cambridge 13th February 2014

Probability Generating Functions

Inner Product Spaces

t := maxγ ν subject to ν {0,1,2,...} and f(x c +γ ν d) f(x c )+cγ ν f (x c ;d).

Mathematical finance and linear programming (optimization)

GenOpt (R) Generic Optimization Program User Manual Version 3.0.0β1

Statistical Machine Learning

CITY UNIVERSITY LONDON. BEng Degree in Computer Systems Engineering Part II BSc Degree in Computer Systems Engineering Part III PART 2 EXAMINATION

1 Norms and Vector Spaces

Proximal mapping via network optimization

No: Bilkent University. Monotonic Extension. Farhad Husseinov. Discussion Papers. Department of Economics

Høgskolen i Narvik Sivilingeniørutdanningen STE6237 ELEMENTMETODER. Oppgaver

Numerical methods for American options

BANACH AND HILBERT SPACE REVIEW

Mathematics Course 111: Algebra I Part IV: Vector Spaces

Introduction to Scheduling Theory

Probability Theory. Florian Herzog. A random variable is neither random nor variable. Gian-Carlo Rota, M.I.T..

Randomized Coordinate Descent Methods for Big Data Optimization

DATA ANALYSIS II. Matrix Algorithms

THE DYING FIBONACCI TREE. 1. Introduction. Consider a tree with two types of nodes, say A and B, and the following properties:

Distributed Coordinate Descent Method for Learning with Big Data

Continued Fractions and the Euclidean Algorithm

Numerical Analysis Lecture Notes

Stochastic Inventory Control

(Quasi-)Newton methods

Integer Factorization using the Quadratic Sieve

minimal polyonomial Example

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 5 9/17/2008 RANDOM VARIABLES

Tail inequalities for order statistics of log-concave vectors and applications

THE NUMBER OF GRAPHS AND A RANDOM GRAPH WITH A GIVEN DEGREE SEQUENCE. Alexander Barvinok

Continuity of the Perron Root

Approximation Algorithms

Pacific Journal of Mathematics

ALMOST COMMON PRIORS 1. INTRODUCTION

ON SEQUENTIAL CONTINUITY OF COMPOSITION MAPPING. 0. Introduction

SF2940: Probability theory Lecture 8: Multivariate Normal Distribution

SMOOTHING APPROXIMATIONS FOR TWO CLASSES OF CONVEX EIGENVALUE OPTIMIZATION PROBLEMS YU QI. (B.Sc.(Hons.), BUAA)

Ideal Class Group and Units

Master s Theory Exam Spring 2006

Markov random fields and Gibbs measures

MATH 304 Linear Algebra Lecture 20: Inner product spaces. Orthogonal sets.

COMBINATORIAL PROPERTIES OF THE HIGMAN-SIMS GRAPH. 1. Introduction

Two-Stage Stochastic Linear Programs

Nonlinear Optimization: Algorithms 3: Interior-point methods

THREE DIMENSIONAL GEOMETRY

Vector and Matrix Norms

Chapter 4 Lecture Notes

Chapter Load Balancing. Approximation Algorithms. Load Balancing. Load Balancing on 2 Machines. Load Balancing: Greedy Scheduling

Walrasian Demand. u(x) where B(p, w) = {x R n + : p x w}.

INDISTINGUISHABILITY OF ABSOLUTELY CONTINUOUS AND SINGULAR DISTRIBUTIONS

An improved on-line algorithm for scheduling on two unrestrictive parallel batch processing machines

Several Views of Support Vector Machines

Let H and J be as in the above lemma. The result of the lemma shows that the integral

Machine Learning and Pattern Recognition Logistic Regression

Date: April 12, Contents

a 11 x 1 + a 12 x a 1n x n = b 1 a 21 x 1 + a 22 x a 2n x n = b 2.

Two Topics in Parametric Integration Applied to Stochastic Simulation in Industrial Engineering

Cyber-Security Analysis of State Estimators in Power Systems

MATHEMATICAL METHODS OF STATISTICS

Notes on Symmetric Matrices

Imprecise probabilities, bets and functional analytic methods in Łukasiewicz logic.

RANDOM INTERVAL HOMEOMORPHISMS. MICHA L MISIUREWICZ Indiana University Purdue University Indianapolis

We shall turn our attention to solving linear systems of equations. Ax = b

Some probability and statistics

1 Approximating Set Cover

Separation Properties for Locally Convex Cones

Geometrical Characterization of RN-operators between Locally Convex Vector Spaces

Level Set Framework, Signed Distance Function, and Various Tools

Linear Algebra I. Ronald van Luijk, 2012

Lectures on Stochastic Processes. William G. Faris

Math 4310 Handout - Quotient Vector Spaces

An Additive Neumann-Neumann Method for Mortar Finite Element for 4th Order Problems

Summer course on Convex Optimization. Fifth Lecture Interior-Point Methods (1) Michel Baes, K.U.Leuven Bharath Rangarajan, U.

Associativity condition for some alternative algebras of degree three

SF2940: Probability theory Lecture 8: Multivariate Normal Distribution

Error estimates for nearly degenerate finite elements

Similarity and Diagonalization. Similar Matrices

The Multiplicative Weights Update method

NMR Measurement of T1-T2 Spectra with Partial Measurements using Compressive Sensing

Some stability results of parameter identification in a jump diffusion model

Influences in low-degree polynomials

Message-passing sequential detection of multiple change points in networks

GI01/M055 Supervised Learning Proximal Methods

1 Portfolio mean and variance

Follow the Perturbed Leader

Discussion on the paper Hypotheses testing by convex optimization by A. Goldenschluger, A. Juditsky and A. Nemirovski.

A Bootstrap Metropolis-Hastings Algorithm for Bayesian Analysis of Big Data

Lecture notes: single-agent dynamics 1

1 Introduction. 2 Prediction with Expert Advice. Online Learning Lecture 09

Perron vector Optimization applied to search engines

Modeling and Performance Evaluation of Computer Systems Security Operation 1

The Steepest Descent Algorithm for Unconstrained Optimization and a Bisection Line-search Method

THE FUNDAMENTAL THEOREM OF ALGEBRA VIA PROPER MAPS

Transcription:

Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh Peter Richtárik Week 3 Randomized Coordinate Descent With Arbitrary Sampling January 27, 2016 1 / 30 The Problem We first consider the following problem: minimize f (x) (1) subject to x = (x 1,..., x n ) R n We will assume that f is: smooth (will be made precise later) strongly convex (will be made precise later) So, this is unconstrained minimization of a smooth convex function. 2 / 30

Randomized Coordinate Descent with Arbitrary Sampling NSync Algorithm (R. and Takáč 2014, [4]) Input: initial point x 0 R n subset probabilities {p S } for each S [n] = {1, 2,..., n} stepsize parameters v 1,..., v n > 0 for k = 0, 1, 2,... do a) Select a random set of coordinates S k [n] following the law P(S k = S) = p S, S [n] b) Update (possibly in parallel) selected coordinates: x k+1 = x k i S k 1 v i (e T i f (x k ))e i end for (e i is the ith unit coordinate vector) Remark: The NSync algorithm was introduced in 2013. The first coordinate descent algorithm using arbitrary sampling. 3 / 30 Two More Ways of Writing the Update Step 1. Coordinate-by-coordinate: x k+1 i = { xi k, i / S k, xi k 1 v i ( f (x k )) i, i S k. 2. Via projection to a subset of blocks: If for h R n and S [n] we write h S = h i e i, (2) i S then x k+1 = x k + h Sk for h = (Diag(v)) 1 f (x k ). (3) Depending on context, we shall interchangeably denote the ith partial derivative of f at x by i f (x) = e T i f (x) = ( f (x)) i 4 / 30

Samplings Definition 1 (Sampling) By the name sampling we refer to a set valued random mapping with values being subsets of [n] = {1, 2,..., n}. For sampling Ŝ we ine the probability vector p = (p 1,..., p n ) T by We say that Ŝ is proper, if p i > 0 for all i. p i = P(i Ŝ) (4) A sampling Ŝ is uniquely characterized by the probability mass function p S = P(Ŝ = S), S [n]; (5) that is, by assigning probabilities to all subsets of [n]. Later on it will be useful to also consider the probability matrix P = P(Ŝ) = (p ij) given by p ij = P(i Ŝ, j Ŝ) = S:{i,j} S p S. (6) 5 / 30 Samplings: A Basic Identity Lemma 2 ([3]) For any sampling Ŝ we have n p i = E[ Ŝ ]. (7) Proof. n (4)+(5) p i = n p S = p S = p S S = E[ Ŝ ]. S [n]:i S S [n] i:i S S [n] 6 / 30

Sampling Zoo - Part I Why consider different samplings? 1. Basic Considerations. It is important that each block i has a positive probability of being chosen, otherwise NSync will not be able to update some blocks and hence will not converge to optimum. For technical/sanity reasons, we ine: Proper sampling. pi = P(i Ŝ) > 0 for all i [n] Nil sampling: P( Ŝ = ) = 1 Vacuous sampling: P( Ŝ = ) > 0 2. Parallelism. Choice of sampling affects the level of parallelism: E[ Ŝ ] is the average number of updates performed in parallel in one iteration; and is hence closely related to the number of iterations. serial sampling: picks one block: P( Ŝ = 1) = 1 We call this sampling serial although nothing prevents us from computing the actual update to the block, and/or to apply he update in parallel. 7 / 30 Sampling Zoo - Part II fully parallel sampling: always picks all blocks: P(Ŝ = {1, 2,..., n}) = 1 3. Processor reliability. Sampling may be induced/informed by the computing environment: Reliable/dedicated processors. If one has reliable processors, it is sensible to choose sampling Ŝ such that P( Ŝ = τ) = 1 for some τ related to the number of processors. Unreliable processors. If processors given a computing task are busy or unreliable, they return answer later or not at all - it is then sensible to ignore such updates and move on. This then means that Ŝ varies from iteration to iteration. 4. Distributed computing. In a distributed computing environment it is sensible: to allow each compute node as much autonomy as possible so as to minimize communication cost, to make sure all nodes are busy at all times 8 / 30

Sampling Zoo - Part III This suggests a strategy where the set of blocks is partitioned, with each node owning a partition, and independently picking a chunky subset of blocks at each iteration it will update, ideally from local information. 5. Uniformity. It may or may not make sense to update some blocks more often than others: uniform samplings: P(i Ŝ) = P(j Ŝ) for all i, j [n] doubly uniform (DU): These are samplings characterized by: S = S P(Ŝ = S ) = P(Ŝ = S ) for all S, S [n] τ-nice: DU sampling with the additional property that P( Ŝ = τ) = 1 distributed τ-nice: will ine later independent sampling: union of independent uniform serial samplings nonuniform samplings 9 / 30 Sampling Zoo - Part IV 6. Complexity of generating a sampling. Some samplings are computationally more efficient to generate than others: the potential benefits of a sampling may be completely ruined by the difficulty to generate sets according to the sampling s distribution. a τ-nice sampling can be well approximated by an independent sampling, which is easy to generate... in general, many samplings will be hard to generate 10 / 30

Assumption: Strong convexity Assumption 1 (Strong convexity) Function f is differentiable and λ-strongly convex (with λ > 0) with respect to the standard Euclidean norm h = ( n ) 1/2 hi 2. That is, we assume that for all x, h R n, f (x + h) f (x) + f (x), h + λ 2 h 2. (8) 11 / 30 Assumption: Expected Separable Overapproximation Assumption 2 (ESO) Assume Ŝ is proper and that for some vector of positive weights v = (v 1,..., v n ) and all x, h R n, E[f (x + hŝ)] f (x) + f (x), h p + 1 2 h 2 p v. (9) Note that the ESO parameters v, p depend on both f and Ŝ. For simplicity, we will often instead of (9) use the compact notation Notation used above: h S = i S (f, Ŝ) ESO(v). h i e i R n (projection of h R n onto coordinates i S) g, h p p v = n p i g i h i R (weighted inner product) = (p 1 v 1,..., p n v n ) R n (Hadamard product) 12 / 30

Assumption: Expected Separable Overapproximation Here the ESO inequality again, now without the simplifying notation: E f x + h i e i f (x) + i Ŝ } {{ } complicated n p i i f (x)h i + } {{ } linear in h 1 n p i v i hi 2 2 } {{ } quadratic and separable in h 13 / 30 Complexity of NSync Theorem 3 (R. and Takáč 2013, [4]) Let x be a minimizer of f. Let Assumptions 1 and 2 be satisfied for a proper sampling Ŝ (that is, (f, Ŝ) ESO(v)). Choose starting point x 0 R n, error tolerance 0 < ɛ < f (x 0 ) f (x ) and confidence level 0 < ρ < 1. If {x k } are the random iterates generated by NSync, where the random sets S k are iid following the distribution of Ŝ, then K Ω ( f (x 0 λ log ) f (x ) ) P(f (x K ) f (x ) ɛ) 1 ρ, (10) ɛρ where Ω = max,...,n v i p i n v i E[ Ŝ ]. (11) 14 / 30

What does this mean? Linear convergence. NSync converges linearly (i.e., logarithmic dependence on ɛ) High confidence is not a problem. ρ appears inside the logarithm, so it easy to achieve high confidence (by running the method longer; there is no need to restart) Focus on the leading term. The leading term is Ω; and we have a closed-form expression for it in terms of parameters v1,..., v n (which depend on f and Ŝ) parameters p1,..., p n (which depend on Ŝ) Parallelization speedup. The lower bound suggests that if it was the case that the parameters v i did not grow with increasing E[ Ŝ ], then we could potentially be getting linear speedup in τ (average number of updates per iteration). So we shall study the dependence of vi on τ (this will depend on f and Ŝ) As we shall see, speedup is often guaranteed for sparse or well-conditioned problems. τ = Question: How to design sampling Ŝ so that Ω is minimized? 15 / 30 Analysis of the Algorithm (Proof of Theorem 3) 16 / 30

Tool: Markov s Inequality Theorem 4 (Markov s Inequality) Let X be a nonnegative random variable. Then for any ɛ > 0, P(X ɛ) E[X ]. ɛ Proof. Let 1 X ɛ be the random variable which is equal to 1 if X ɛ and 0 otherwise. Then 1 X ɛ X ɛ. By taking expectations of all terms, we obtain [ ] X P(X ɛ) = E [1 X ɛ ] E ɛ = E[X ]. ɛ 17 / 30 Tool: Tower Property of Expectations (Motivation) Example 5 Consider discrete random variables X and Y: X has 2 outcomes: x 1 and x 2 Y has 3 outcomes: y 1, y 2 and y 3 Their joint probability mass function if given in this table: y 1 y 2 y 3 x 1 0.05 0.20 0.03 0.28 x 2 0.25 0.30 0.17 0.72 0.30 0.50 0.20 1 Obviously, E[X] = 0.28x 1 + 0.72x 2. But we can also write: E[X] = (0.05x 1 + 0.25x 2 ) + (0.20x 1 + 0.30x 2 ) + (0.03x 1 + 0.17x 2 ) = 0.30 }{{} P(Y=y 1 ) 0.30 x 1 + 0.25 0.30 x 2) +0.50( 0.20 0.50 } {{ } x 1 + 0.30 0.50 x 2) + 0.20( 0.03 0.20 x 1 + 0.17 0.20 x 2) E[X Y=y 1 ] ( 0.05 = E[E[X Y]]. 18 / 30

Tower Property Lemma 6 (Tower Property / Iterated Expectation) For any random variables X and Y, we have E[X ] = E[E[X Y ]]. Proof. We shall only prove this for discrete random variables; the proof is more technical in the continuous case. E[X ] = x xp(x = x) = x x y P(X = x & Y = y) = y = y xp(x = x & Y = y) x P(X = x & Y = y) P(Y = y)x P(Y = y) x = y P(Y = y) xp(x = x Y = y) x } {{ } E[X Y =y] = E[E[X Y ]]. 19 / 30 Proof of Theorem 3 - Part I If we let µ = λ/ω, then f (x + h) (8) f (x) + f (x), h + λ 2 h 2 f (x) + f (x), h + µ 2 h 2 v p 1. (12) Indeed, one can easily verify that µ is ined to be the largest number for which λ h 2 µ h 2 v p 1 holds for all h. Hence, f is µ-strongly convex with respect to the norm v p 1. Let x be a minimizer of f, i.e., an optimal solution of (1). Minimizing both sides of (12) in h, we get f (x ) f (x) (12) min h R n f (x), h + µ 2 h 2 v p 1 = 1 2µ f (x) 2 p v 1. (13) 20 / 30

Proof of Theorem 3 - Part II Let h k = v 1 f (x k ). Then in view of (3), we have x k+1 = x k + h k S k. Utilizing Assumption 2, we get E[f (x k+1 ) x k ] = E [ f (x k + h k S k ) x k] (9) f (x k ) + f (x k ), h k p + 1 2 hk 2 p v = f (x k ) 1 2 f (x k ) 2 p v 1 (13) f (x k ) µ(f (x k ) f (x )). Taking expectations in the last inequality, using the tower property, and subtracting f (x ) from both sides of the inequality, we get E[f (x k+1 ) f (x )] (1 µ)e[f (x k ) f (x )]. Unrolling the recurrence, we get E[f (x k ) f (x )] (1 µ) k (f (x 0 ) f (x )). (14) 21 / 30 Proof of Theorem 3 - Part III Using Markov s inequality, (14) and the inition of K, we get P(f (x K ) f (x ) ɛ) E[f (x K ) f (x )]/ɛ (14) (1 µ) K (f (x 0 ) f (x ))/ɛ (10) ρ. Finally, let us now establish the lower bound on Ω. Letting { } n = p R n : p 0, p i = E[ Ŝ ], we have Ω (11) = max i v i p i (7) min max p i v i p i = 1 E[ Ŝ ] n v i, where the last equality follows since optimal p i is proportional to v i. 22 / 30

How to compute the ESO stepsize parameters v 1,..., v n? 23 / 30 C 1 (A) Functions By inition, the ESO parameters v depend on both f and also highlighted by the notation we use: Ŝ. This is (f, Ŝ) ESO(v). Hence, in order to compute these parameters, we need to specify f and Ŝ. The following inition describes a wide class of functions, often appearing in computational practice, for which we will be able to do this. Definition 7 (C 1 (A) functions) Let A R m n. By C 1 (A) we denote the set of continuously differentiable functions f : R n R satisfying the following inequality for all x, h R n : f (x + h) f (x) + ( f (x)) T h + 1 2 ht A T Ah. (15) Example 8 (Least Squares) The function f (x) = 1 2 Ax b 2 satisfies (15) with an equality. Hence, f C 1 (A). 24 / 30

Are There More C 1 (A) Functions? Functions we wish to minimize in machine learning often have a finite sum structure: f (x) = j φ j(m j x), (16) where M j R m n are data matrices and φ j : R m R are loss functions. The next result says that under certain assumptions on the loss functions, f C 1 (A) for some A, and describes what this A is. Theorem 9 ([5]) Assume that for each j, function φ j is γ j -smooth: φ j (s) φ j (s ) γ j s s, for all s, s R m. Then f C 1 (A), where A satisfies A T A = J j=1 γ jm T j M j. Remark: The above theorem also says what A T A is. This is important, as will be clear from the next theorem. 25 / 30 ESO for C 1 (A) Functions and an Arbitrary Sampling Theorem 10 ([5]) Assume f C 1 (A) for some real matrix A R m n, let Ŝ be an arbitrary sampling and let P = P(Ŝ) be its probability matrix. If v = (v 1,..., v n ) satisfies P A T A Diag(p v), then (f, Ŝ) ESO(v). 26 / 30

Sampling Identity for a Quadratic In in order to prove Theorem 10, we will need the following lemma. Lemma 11 ([5]) Let G be any real n n matrix and Ŝ an arbitrary sampling. Then for any h R n we have E [ h Ṱ Gh ] ( ) S Ŝ = h T P(Ŝ) G h, (17) where denotes the Hadamard (elementwise) product of matrices, and P(Ŝ) is the probability matrix of Ŝ. 27 / 30 Proof of Lemma 11 Proof. Let 1 ij be the indicator random variable of the event i Ŝ & j Ŝ: 1 ij = { 1 if i Ŝ & j Ŝ, 0 otherwise. Note that E[1 ij ] = P(i Ŝ & j Ŝ) = p ij. We now have E [ h Ṱ Gh ] (2) S Ŝ = E n n G ij h i h j = E 1 ij G ij h i h j i Ŝ j Ŝ j=1 n n n n = E[1 ij ]G ij h i h j = p ij G ij h i h j j=1 = h T ( P(Ŝ) G ) h. j=1 28 / 30

Proof of Theorem 10 Having established Lemma 11, we are now ready prove Theorem 10. Proof. Fixing any x, h R n, we have E [ f (x + hŝ) ] (15) E [ ] f (x) + f (x), hŝ + 1 2 hṱ S AT AhŜ ] = f (x) + f (x), h p + 1 2 E [ h Ṱ S AT AhŜ (Lemma 11) = f (x) + f (x), h p + 1 2 ht ( P A T A ) h f (x) + f (x), h p + 1 2 ht Diag(p v)h. } {{ } = h 2 p v 29 / 30 References I [1] Yurii Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341-362, 2012 [2] Peter Richtárik and Martin Takáč. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 144(2):1 38, 2014 [3] Peter Richtárik and Martin Takáč. Parallel coordinate descent methods for big data optimization. Mathematical Programming, 1 52, 2015 [4] Peter Richtárik and Martin Takáč. On optimal probabilities in stochastic coordinate descent methods. Optimization Letters, 2015 [5] Zheng Qu and Peter Richtárik. Coordinate descent with arbitrary sampling II: expected separable overapproximation, arxiv:1412.8063, 2014 30 / 30