Big Data Optimization: Randomized lock-free methods for minimizing partially separable convex functions
1 Big Data Optimization: Randomized lock-free methods for minimizing partially separable convex functions
Peter Richtárik, School of Mathematics, The University of Edinburgh
Joint work with Martin Takáč (Edinburgh)
Les Houches, January 11
2 Lock-Free (Asynchronous) Updates
Between the time when x is read by any given processor and the moment the update computed from that read is applied to x by it, other processors apply their own updates.
[Figure: timeline from the viewpoint of a single processor: read the current x (here x_3), compute the update, write into the then-current iterate (x_6 = x_5 + update(x_3)); meanwhile the other processors advance x through x_2, x_4, x_5, x_6, x_7.]
3 Generic Parallel Lock-Free Algorithm
In general:
    x_{j+1} = x_j + update(x_{r(j)})
where r(j) = index of the iterate current at reading time, and j = index of the iterate current at writing time.
Assumption: j − r(j) ≤ τ, where τ + 1 ≈ # processors.
4 The Problem and Its Structure
    minimize_{x ∈ R^V}  f(x) ≡ Σ_{e ∈ E} f_e(x)        (OPT)
Set of vertices/coordinates: V (x = (x_v, v ∈ V), dim x = |V|)
Set of edges: E ⊆ 2^V
Set of blocks: B (a collection of sets forming a partition of V)
Assumption: f_e depends on x_v, v ∈ e, only.
Example (convex f : R^5 → R):
    f(x) = 7(x_1 + x_3)^2 + 5(x_2 − x_3 + x_4)^2 + (x_4 − x_5)^2,
with the three terms being f_{e_1}(x), f_{e_2}(x), f_{e_3}(x);
V = {1, 2, 3, 4, 5}, |V| = 5, e_1 = {1, 3}, e_2 = {2, 3, 4}, e_3 = {4, 5}.
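As a concrete illustration, the example above can be written in code as a list of (edge, f_e) pairs; the representation below is only a sketch, not notation from the talk:

```python
import numpy as np

# Partially separable example from the slide: f(x) = sum_e f_e(x),
# where each f_e touches only the coordinates in its edge e (0-indexed here).
edges = [
    ({0, 2}, lambda x: 7 * (x[0] + x[2]) ** 2),           # e1 = {1, 3}
    ({1, 2, 3}, lambda x: 5 * (x[1] - x[2] + x[3]) ** 2),  # e2 = {2, 3, 4}
    ({3, 4}, lambda x: (x[3] - x[4]) ** 2),                # e3 = {4, 5}
]

def f(x):
    """Evaluate f(x) = sum over edges of f_e(x)."""
    return sum(fe(x) for _, fe in edges)

x = np.array([1.0, 2.0, -1.0, 0.5, 0.0])
print(f(x))  # each f_e only reads the coordinates in its edge
```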
5 Applications
- structured stochastic optimization (via Sample Average Approximation)
- learning: sparse least-squares, sparse SVMs, matrix completion, graph cuts (see Niu-Recht-Ré-Wright (2011))
- truss topology design
- optimal statistical designs
6 PART 1: LOCK-FREE HYBRID SGD/RCD METHODS
Based on: P. R. and M. Takáč, Lock-free randomized first order methods, manuscript.
7 Problem-Specific Constants
For each quantity we track its average (bar) and maximum (star) over the relevant index set:
- Edge-Vertex Degree (# vertices incident with an edge; relevant if B = V): ω_e = |e| = |{v ∈ V : v ∈ e}|, with average ω̄ and maximum ω*
- Edge-Block Degree (# blocks incident with an edge; relevant if |B| > 1): σ_e = |{b ∈ B : b ∩ e ≠ ∅}|, with average σ̄ and maximum σ*
- Vertex-Edge Degree (# edges incident with a vertex; not needed!): δ_v = |{e ∈ E : v ∈ e}|, with average δ̄ and maximum δ*
- Edge-Edge Degree (# edges incident with an edge; relevant if |E| > 1): ρ_e = |{e' ∈ E : e ∩ e' ≠ ∅}|, with average ρ̄ and maximum ρ*
Remarks:
- Our results depend on σ̄ (average Edge-Block degree) and ρ̄ (average Edge-Edge degree).
- The first and second rows are identical if B = V (blocks correspond to vertices/coordinates).
8-9 Example
A = [A_1^T; A_2^T; A_3^T; A_4^T] ∈ R^{4×3},   f(x) = (1/2)‖Ax‖_2^2 = (1/2) Σ_{i=1}^{4} (A_i^T x)^2,   |E| = 4, |V| = 3.
Computation of ω̄ and ρ̄ (each row A_i^T defines an edge e_i given by its nonzero pattern, each column a vertex v_j):
    ω_e = |e|,   ρ_e = |{e' ∈ E : e ∩ e' ≠ ∅}|,   δ_v = |{e ∈ E : v ∈ e}|
    ω̄ = 1.5,   ρ̄ = 3
[The per-edge values ω_{e_i}, ρ_{e_i} and per-vertex values δ_{v_j} were tabulated on the slide.]
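A small sketch of how these degree statistics could be computed from the edge (row-support) sets. The actual entries of A are not shown in the transcription, so the sparsity pattern below is a hypothetical one, chosen only so that it reproduces the averages ω̄ = 1.5 and ρ̄ = 3 reported on the slide:

```python
def degree_stats(edges, vertices):
    """Compute omega_e = |e|, rho_e (# edges intersecting e, incl. itself), delta_v, and the averages."""
    omega = [len(e) for e in edges]
    rho = [sum(1 for e2 in edges if e & e2) for e in edges]
    delta = {v: sum(1 for e in edges if v in e) for v in vertices}
    m = len(edges)
    return omega, rho, delta, sum(omega) / m, sum(rho) / m

# Hypothetical 4x3 sparsity pattern (rows = edges, columns = vertices 0, 1, 2).
edges = [frozenset({0, 1}), frozenset({1, 2}), frozenset({0}), frozenset({1})]
omega, rho, delta, omega_bar, rho_bar = degree_stats(edges, vertices=range(3))
print(omega_bar, rho_bar)  # 1.5, 3.0 -- matching the averages on the slide
```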
10 Algorithm
Iteration j + 1 looks as follows:
    x_{j+1} = x_j − γ|E|σ_e ∇_b f_e(x_{r(j)})
Viewpoint of the processor performing this iteration:
- Pick edge e ∈ E, uniformly at random
- Pick block b intersecting edge e, uniformly at random
- Read current x (it is enough to read x_v for v ∈ e)
- Compute ∇_b f_e(x)
- Apply update: x ← x − α ∇_b f_e(x), with α = γ|E|σ_e and γ > 0
- Do not wait (no synchronization!) and start again!
It is easy to show that E[|E|σ_e ∇_b f_e(x)] = ∇f(x).
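Below is a minimal serial sketch of this sampling and update rule; in the actual lock-free method each of the τ + 1 processors would run this loop concurrently on a shared x without synchronization. The data structures (edges as index sets paired with gradient callables) and the toy instance at the bottom are illustrative assumptions, not taken from the talk:

```python
import random
import numpy as np

def grad_block_fe(x, e_vars, grad_fe, block):
    """Gradient of f_e restricted to the coordinates in `block` (zero elsewhere)."""
    g = np.zeros_like(x)
    full = grad_fe(x)                 # grad_fe only depends on x_v, v in e_vars
    for v in block & e_vars:
        g[v] = full[v]
    return g

def hybrid_sgd_rcd_step(x, edges, blocks, gamma):
    """One iteration: sample an edge, then a block intersecting it, then update."""
    e_vars, grad_fe = random.choice(edges)          # edge e, uniform over E
    hitting = [b for b in blocks if b & e_vars]     # blocks intersecting e (sigma_e of them)
    b = random.choice(hitting)                      # block b, uniform over the sigma_e blocks
    alpha = gamma * len(edges) * len(hitting)       # alpha = gamma * |E| * sigma_e
    x -= alpha * grad_block_fe(x, e_vars, grad_fe, b)
    return x

# Toy instance: f(x) = (x0 + x2)^2 + (x1 - x2)^2 with blocks {0,1} and {2}.
edges = [
    (frozenset({0, 2}), lambda x: np.array([2 * (x[0] + x[2]), 0.0, 2 * (x[0] + x[2])])),
    (frozenset({1, 2}), lambda x: np.array([0.0, 2 * (x[1] - x[2]), -2 * (x[1] - x[2])])),
]
blocks = [frozenset({0, 1}), frozenset({2})]
x = np.ones(3)
for _ in range(1000):
    x = hybrid_sgd_rcd_step(x, edges, blocks, gamma=0.05)
print(x)
```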
11-12 Main Result
Setup:
- c = strong convexity parameter of f
- L = Lipschitz constant of ∇f
- ‖∇f(x)‖_2 ≤ M for all x visited by the method
- starting point x_0 ∈ R^V; accuracy 0 < ε < (L/2)‖x_0 − x*‖_2^2
- constant stepsize γ := cε / ((σ̄ + 2τρ̄/|E|) L^2 M^2)
Result: Under the above assumptions, for
    k ≥ (σ̄ + 2τρ̄/|E|) · (LM^2/(c^2 ε)) · log(L‖x_0 − x*‖_2^2 / ε) − 1,
we have min_{0 ≤ j ≤ k} E[f(x_j) − f*] ≤ ε.
13 Special Cases
General result:
    k ≥ Λ · (LM^2/(c^2 ε)) · log(2L‖x_0 − x*‖^2 / ε) − 1,   where Λ = σ̄ + 2τρ̄/|E|
(the factor multiplying Λ is common to all special cases below).
Special cases (lock-free parallel versions of known methods):
- |E| = 1: Randomized Block Coordinate Descent, Λ = |B| + 2τ
- |B| = 1: Incremental Gradient Descent (Hogwild! as implemented), Λ = 1 + 2τρ̄/|E|
- B = V: RAINCODE: RAndomized INcremental COordinate DEscent (Hogwild! as analyzed), Λ = ω̄ + 2τρ̄/|E|
- |E| = |B| = 1: Gradient Descent, Λ = 1 + 2τ
14 Analysis via a New Recurrence
Let a_j = (1/2) E[‖x_j − x*‖^2].
Nemirovski-Juditsky-Lan-Shapiro:
    a_{j+1} ≤ (1 − 2cγ_j) a_j + γ_j^2 M^2
Niu-Recht-Ré-Wright (Hogwild!):
    a_{j+1} ≤ (1 − cγ) a_j + γ^2 (2cω* M τ (δ*)^{1/2}) a_j^{1/2} + γ^2 M^2 Q,
    where Q = ω* + 2τρ*/|E| + 4ω*ρ*τ/|E| + 2τ^2 (ω*)^2 (δ*)^{1/2}
R.-Takáč:
    a_{j+1} ≤ (1 − 2cγ) a_j + γ^2 (σ̄ + 2τρ̄/|E|) M^2
15-16 Parallelization Speedup Factor
    PSF = (Λ of serial version) / ((Λ of parallel version)/τ) = σ̄ / ((σ̄ + 2τρ̄/|E|)/τ) = 1 / (1/τ + 2ρ̄/(σ̄|E|))
Three modes:
1. Brute force (many processors; τ very large): PSF ≈ σ̄|E|/(2ρ̄)
2. Favorable structure (ρ̄/(σ̄|E|) ≪ 1/τ; fixed τ): PSF ≈ τ
3. Special τ (τ = |E|/ρ̄): PSF = (|E|/ρ̄) · σ̄/(σ̄ + 2)
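The formula is easy to evaluate; the numbers plugged in below are illustrative, not from the talk:

```python
def psf(sigma_bar, rho_bar, num_edges, tau):
    """Parallelization speedup factor PSF = 1 / (1/tau + 2*rho_bar/(sigma_bar*|E|))."""
    return 1.0 / (1.0 / tau + 2.0 * rho_bar / (sigma_bar * num_edges))

# Illustrative numbers: 500k edges, sigma_bar = 20, rho_bar = 1000, tau = 16 processors.
print(psf(sigma_bar=20, rho_bar=1000, num_edges=500_000, tau=16))  # close to tau (favorable structure)
```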
17-19 Improvements vs Hogwild!
If B = V (blocks = coordinates), then our method coincides with Hogwild! (as analyzed in Niu et al.), up to the stepsize choice:
    x_{j+1} = x_j − γ|E|ω_e ∇_v f_e(x_{r(j)})
Niu-Recht-Ré-Wright (Hogwild!, 2011):   Λ = 4ω* + 24τρ*/|E| + 24τ^2 ω* (δ*)^{1/2}
R.-Takáč:                               Λ = ω̄ + 2τρ̄/|E|
Advantages of our approach:
- Dependence on averages and not maxima! (ω̄ ≤ ω*, ρ̄ ≤ ρ*)
- Better constants (4 → 1, 24 → 2)
- The third large term is not present (no dependence on τ^2 and δ*)
- Introduction of blocks (⇒ we also cover block coordinate descent, gradient descent, SGD)
- Simpler analysis
20 Modified Algorithm: Global Reads and Local Writes
Partition the vertices (coordinates) into τ + 1 blocks, V = b_1 ∪ b_2 ∪ ... ∪ b_{τ+1}, and assign block b_i to processor i, i = 1, 2, ..., τ + 1.
Processor i will (asynchronously) do:
- Pick edge e ∈ {e ∈ E : e ∩ b_i ≠ ∅}, uniformly at random (an edge intersecting the block owned by processor i)
- Update: x_{j+1} = x_j − α ∇_{b_i} f_e(x_{r(j)})
Pros and cons:
+ good if global reads and local writes are cheap but global writes are expensive (NUMA = Non-Uniform Memory Access)
− we do not have an analysis
Idea proposed by Ben Recht.
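A rough sketch of the ownership idea, assuming Python threads and a toy instance (both hypothetical, not from the talk): every worker reads the shared x globally but writes only the coordinates in the block it owns, so no writes to foreign coordinates are ever issued.

```python
import random
import threading
import numpy as np

def worker(x, owned, incident_edges, alpha, num_steps):
    """Each worker reads the shared x globally but writes only its own coordinates."""
    for _ in range(num_steps):
        e_vars, grad_fe = random.choice(incident_edges)   # edge intersecting the owned block
        g = grad_fe(x)                                    # global (possibly stale) read of x
        for v in owned & e_vars:                          # local write: only owned coordinates
            x[v] -= alpha * g[v]

# Hypothetical setup: coordinates {0,1,2,3} split between two workers.
edges = [
    (frozenset({0, 2}), lambda x: np.array([2 * (x[0] - x[2]), 0.0, -2 * (x[0] - x[2]), 0.0])),
    (frozenset({1, 3}), lambda x: np.array([0.0, 2 * (x[1] + x[3]), 0.0, 2 * (x[1] + x[3])])),
    (frozenset({2, 3}), lambda x: np.array([0.0, 0.0, 2 * x[2], 2 * x[3]])),
]
x = np.ones(4)
blocks = [frozenset({0, 1}), frozenset({2, 3})]
threads = [
    threading.Thread(target=worker,
                     args=(x, b, [e for e in edges if e[0] & b], 0.05, 1000))
    for b in blocks
]
for t in threads: t.start()
for t in threads: t.join()
print(x)
```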
21 Experiment 1: rcv1
size = 1.2 GB, features: |V| = 47,236, training examples: |E| = 677,399, testing: 20,…
[Figure: train error vs. epoch for 1, 4 and 16 CPUs, each with synchronous and asynchronous updates.]
22 Experiment 2
Artificial problem instance:
    minimize f(x) = (1/2)‖Ax‖^2 = Σ_{i=1}^{m} (1/2)(A_i^T x)^2,   A ∈ R^{m×n}; m = |E| = 500,000; n = |V| = 50,000
Three methods:
- Synchronous, all = parallel synchronous method with |B| = 1
- Asynchronous, all = parallel asynchronous method with |B| = 1
- Asynchronous, block = parallel asynchronous method with |B| = τ (no need for atomic operations ⇒ additional speedup)
We measure the elapsed time needed to perform 20m iterations (20 epochs).
23 Uniform instance: |e| = 10 for all edges
[Figure: elapsed time of the three methods on the uniform instance.]
24 PART 2: PARALLEL BLOCK COORDINATE DESCENT
Based on: P. R. and M. Takáč, Parallel coordinate descent methods for big data optimization, arXiv:1212.0873.
25-27 Overview
A rich family of synchronous parallel block coordinate descent methods.
Theory and algorithms work for convex composite functions with a block-separable regularizer:
    minimize F(x) ≡ f(x) + λΨ(x),   where f(x) = Σ_{e ∈ E} f_e(x) and Ψ(x) = Σ_{b ∈ B} Ψ_b(x)
- The decomposition f = Σ_{e ∈ E} f_e does not need to be known!
- f: convex or strongly convex (complexity results for both)
- All parameters needed to run the method according to the theory are easy to compute: the block Lipschitz constants L_1, ..., L_{|B|} and ω.
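A schematic single iteration of such a method, assuming coordinates as blocks and an ℓ1 regularizer; beta stands for the sampling-dependent factor from the theory (e.g. beta = 1 + (ω − 1)(τ − 1)/(|B| − 1) for τ-nice sampling), and all names below are illustrative rather than the authors' implementation:

```python
import numpy as np

def prox_l1(z, t):
    """Prox of t*|.| (soft-thresholding); the block-separable regularizer here is the l1 norm."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def pcdm_iteration(x, grad_f, L_blocks, lam, beta, tau, rng):
    """One synchronous step: sample tau coordinates, update each via a prox step with constant beta*L_j."""
    S = rng.choice(x.size, size=tau, replace=False)   # tau-nice sampling of coordinates
    g = grad_f(x)                                     # in practice each worker computes only its own partial derivative
    for j in S:                                       # conceptually done in parallel, applied synchronously
        step = beta * L_blocks[j]
        x[j] = prox_l1(x[j] - g[j] / step, lam / step)
    return x

# Tiny usage: F(x) = 0.5*||x - c||^2 + lam*||x||_1, coordinatewise Lipschitz constants = 1.
c = np.array([3.0, -0.5, 0.0, 2.0, -4.0])
x = np.zeros(5)
rng = np.random.default_rng(0)
for _ in range(200):
    x = pcdm_iteration(x, lambda z: z - c, np.ones(5), lam=1.0, beta=1.0, tau=2, rng=rng)
print(x)  # approaches prox_l1(c, 1) = [2, 0, 0, 1, -3]
```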
28 ACDC: Lock-Free Parallel Coordinate Descent
C++ code. Can solve a LASSO problem with |V| = 10^9, |E| = …, ω = 35, on a machine with τ = 24 processors, to ε = … accuracy, in 2 hours, starting with an initial gap of ….
29 Complexity Results
First complexity analysis of parallel coordinate descent: bounds on k that guarantee P(F(x_k) − F* ≤ ε) ≥ 1 − p.
Convex functions:
    k ≥ (2β/α) · (‖x_0 − x*‖_L^2 / ε) · log((F(x_0) − F*)/(εp))
Strongly convex functions (with parameters µ_f and µ_Ψ):
    k ≥ ((β + µ_Ψ)/(α(µ_f + µ_Ψ))) · log((F(x_0) − F*)/(εp))
Leading constants matter!
30 Parallelization Speedup Factors
Closed-form formulas for parallelization speedup factors (PSFs):
- PSFs are functions of ω, τ and |B|, and depend on the sampling.
- Example 1: fully parallel sampling (all blocks are updated, i.e., τ = |B|): PSF = |B|/ω.
- Example 2: τ-nice sampling (all subsets of τ blocks are chosen with the same probability): PSF = τ / (1 + (ω − 1)(τ − 1)/(|B| − 1)).
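A quick check of these formulas in code; the |B| = 10^9 and ω = 35 figures come from the billion-variable instance on the next slide, and τ = 16 from the experiments reported later:

```python
def psf_fully_parallel(num_blocks, omega):
    """PSF for fully parallel sampling (tau = |B|)."""
    return num_blocks / omega

def psf_tau_nice(tau, omega, num_blocks):
    """PSF for tau-nice sampling: tau / (1 + (omega - 1)(tau - 1)/(|B| - 1))."""
    return tau / (1.0 + (omega - 1) * (tau - 1) / (num_blocks - 1))

print(psf_tau_nice(tau=16, omega=35, num_blocks=10**9))  # essentially 16: near-linear speedup
```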
31 A Problem with a Billion Variables
LASSO problem: F(x) = (1/2)‖Ax − b‖^2 + λ‖x‖_1
The instance:
- A has |E| = m = … rows and |V| = n = 10^9 columns (= # of variables)
- exactly 20 nonzeros in each column
- on average 10 and at most 35 nonzeros in each row (ω = 35)
- the optimal solution x* has 10^5 nonzeros
- λ = 1
Solver: asynchronous parallel coordinate descent method with independent nice sampling and τ = 1, 8, 16 cores.
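For concreteness, here is a generic serial randomized coordinate descent sketch for the LASSO objective with the usual soft-thresholding step; this is not the ACDC code, and the tiny dense instance at the bottom is purely illustrative (the instance above is sparse and far too large for dense storage):

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator, the prox of t*|.|."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_rcd(A, b, lam, num_iters, seed=0):
    """Serial randomized coordinate descent for F(x) = 0.5*||Ax - b||^2 + lam*||x||_1."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    x = np.zeros(n)
    r = A @ x - b                          # maintained residual Ax - b
    col_norms = (A ** 2).sum(axis=0)       # coordinate Lipschitz constants L_j = ||A_j||^2
    for _ in range(num_iters):
        j = rng.integers(n)
        g = A[:, j] @ r                    # partial derivative of the smooth part
        x_new = soft_threshold(x[j] - g / col_norms[j], lam / col_norms[j])
        r += A[:, j] * (x_new - x[j])      # update the residual incrementally
        x[j] = x_new
    return x

# Tiny dense example.
A = np.random.default_rng(1).standard_normal((50, 20))
b = A @ np.r_[np.ones(3), np.zeros(17)] + 0.01 * np.random.default_rng(2).standard_normal(50)
print(np.round(lasso_rcd(A, b, lam=1.0, num_iters=5000), 2))
```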
32 # Coordinate Updates / n
[Figure]
33 # Iterations / n
[Figure]
34 Wall Time
[Figure]
35 Billion Variables: 1 Core
[Table: progress of the method on the billion-variable LASSO instance on a single core, reporting k/n, F(x_k) − F*, ‖x_k‖_0 and time in hours; the numerical entries did not survive the transcription.]
36 Billion Variables: 1, 8 and 16 Cores
[Table: F(x_k) − F* versus the number of coordinate updates (kτ)/n and the elapsed time for 1, 8 and 16 cores; the numerical entries did not survive the transcription.]
37 References
- P. R. and M. Takáč, Lock-free randomized first order methods, arXiv:1301.xxxx.
- P. R. and M. Takáč, Parallel coordinate descent methods for big data optimization, arXiv:1212.0873, 2012.
- P. R. and M. Takáč, Iteration complexity of block coordinate descent methods for minimizing a composite function, Mathematical Programming, Series A.
- P. R. and M. Takáč, Efficient serial and parallel coordinate descent methods for huge-scale truss topology design, Operations Research Proceedings.
- F. Niu, B. Recht, C. Ré, and S. Wright, Hogwild!: A lock-free approach to parallelizing stochastic gradient descent, NIPS 2011.
- A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro, Robust stochastic approximation approach to stochastic programming, SIAM J. Opt., 19(4):1574–1609, 2009.
- M. Zinkevich, M. Weimer, A. Smola, and L. Li, Parallelized stochastic gradient descent, NIPS 2010.
- Yu. Nesterov, Efficiency of coordinate descent methods on huge-scale optimization problems, SIAM J. Opt., 22(2):341–362, 2012.