An Asynchronous Oneshot Method with Load Balancing
Torsten Bosse, Argonne National Laboratory, IL
Design Optimization Problem

  min f(u,y) subject to c(u,y) = 0, (u,y) ∈ R^{m+n}

for sufficiently smooth functions f : R^{m+n} → R and c : R^{m+n} → R^n.

Assumption: a slowly convergent fixed-point solver G : R^{m+n} → R^n for c,
  y_{k+1} = G(u, y_k)   [c(u,y) = 0 ⇔ y = G(u,y)],
with a global contraction rate ‖G_y(u,y)‖ ≤ ρ < 1, e.g. ρ = 0.999.

Lagrange function with adjoint variable ȳ ∈ R^n:
  L(u,y,ȳ) = f(u,y) + ȳ^T [G(u,y) − y]

Find (u*, y*, ȳ*) satisfying the first-order necessary (KKT) conditions:
  0 = L_ȳ(u*,y*,ȳ*) = G(u*,y*) − y*                        (primal)
  0 = L_y(u*,y*,ȳ*) = f_y(u*,y*) + [G_y(u*,y*) − I]^T ȳ*   (adjoint)
  0 = L_u(u*,y*,ȳ*) = f_u(u*,y*) + G_u(u*,y*)^T ȳ*         (design)
Fixed-Point Solver for the State Equation

Assume: the optimal control u* is known.
Goal: find the state y* satisfying c(u*, y*) = 0.
Solution: use the slowly convergent state fixed-point solver G : U × Y → Y, i.e., start with an initial approximation y_0 ∈ Y and iterate
  y_{k+1} = G(u*, y_k) for k = 0, 1, 2, ...
such that y* = lim_{k→∞} y_k solves y* = G(u*, y*) ⇔ c(u*, y*) = 0, due to the global contraction rate ‖G_y(u,y)‖ ≤ ρ < 1 and the Banach fixed-point theorem.

Update pattern: (state) (state) (state) ...

Interpretation: the fixed-point iteration is some solver for a PDE/numerical model that describes a physical process represented by the equality c(u,y) = 0.
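Under the stated assumptions, the Banach iteration can be sketched in a few lines of Python; the map G below is a hypothetical scalar contraction with rate 0.5 (a stand-in for the actual state solver), used only to illustrate the scheme:

```python
import numpy as np

def solve_state(G, u, y0, tol=1e-8, max_iter=100_000):
    """Iterate y_{k+1} = G(u, y_k) until the update is below tol.
    Converges linearly by the Banach fixed-point theorem if ||G_y|| <= rho < 1."""
    y = y0
    for _ in range(max_iter):
        y_new = G(u, y)
        if np.linalg.norm(y_new - y) < tol:
            return y_new
        y = y_new
    return y

# Hypothetical contractive map: G(u, y) = 0.5*y + u, fixed point y* = 2u.
G = lambda u, y: 0.5 * y + u
y_star = solve_state(G, np.array([1.0]), np.array([0.0]))
# y_star -> [2.0]
```

A contraction rate of ρ = 0.999, as on the previous slide, would need roughly 1000× more iterations per digit of accuracy than this ρ = 0.5 toy.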
Fixed-Point Solver for the State + Adjoint Equation

Assume: the optimal control u* is known.
Goal: find the state y* satisfying c(u*, y*) = 0 and the adjoint solution ȳ* with
  0 = f_y(u*, y*) + [G_y(u*, y*) − I]^T ȳ*
BUT we are only allowed to use Jacobian-vector and vector-Jacobian products - evaluation/inversion/factorization of G_y might be too costly!

Idea: for an initial approximation ȳ_0 and given y, the adjoint iteration
  ȳ_{k+1} = f_y(u, y) + G_y(u, y)^T ȳ_k
is also a fixed-point iteration, due to ‖G_y(u,y)‖ ≤ ρ < 1.

Updating scheme according to Christianson:
  (state) (state) ... (adjoint) (adjoint) ...   (B.C.)
Fixed-Point Solver for the State + Adjoint Equation (Griewank)

Problem: the B.C. updating is sequential, i.e., first converge the primal iteration
  y_{k+1} = G(u, y_k) for k = 0, 1, 2, ...
to a solution y* = G(u, y*) (or an approximation), before iterating
  ȳ_{k+1} = f_y(u, y*) + G_y(u, y*)^T ȳ_k for k = 0, 1, 2, ...
from an initial ȳ_0 until ȳ* = lim_{k→∞} ȳ_k with L_y(u, y*, ȳ*) = 0.

Alternative approach (Griewank): update (y_k, ȳ_k) in parallel, i.e.,
  y_{k+1} = G(u, y_k)
  ȳ_{k+1} = f_y(u, y_k) + G_y(u, y_k)^T ȳ_k
for k = 0, 1, 2, ..., or, in short, use the piggyback method:
  (state, adjoint) (state, adjoint) ...   (A.G.)
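A minimal piggyback sketch, assuming the same hypothetical scalar contraction G(u, y) = 0.5·y + u together with f(u, y) = ½‖y‖² (so f_y = y); the simultaneous tuple assignment evaluates both right-hand sides with the old y_k, which is exactly the A.G. scheme:

```python
import numpy as np

def piggyback(G, G_y, f_y, u, y0, ybar0, n_iter=100):
    """A.G. piggyback: y_{k+1} = G(u, y_k) and
    ybar_{k+1} = f_y(u, y_k) + G_y(u, y_k)^T ybar_k in lockstep."""
    y, ybar = y0, ybar0
    for _ in range(n_iter):
        # both right-hand sides see the old y_k (simultaneous update)
        y, ybar = G(u, y), f_y(u, y) + G_y(u, y).T @ ybar
    return y, ybar

u = np.array([1.0])
G   = lambda u, y: 0.5 * y + u        # hypothetical contraction, y* = 2u
G_y = lambda u, y: np.array([[0.5]])  # constant state Jacobian
f_y = lambda u, y: y                  # from f(u, y) = 0.5*||y||^2
y, ybar = piggyback(G, G_y, f_y, u, np.zeros(1), np.zeros(1))
# adjoint fixed point: ybar* = f_y(u, y*) + G_y^T ybar*  =>  ybar* = 2*y* = [4.0]
```

Because the adjoint map inherits the contraction rate ρ from G_y, both sequences converge at the same linear rate, only delayed by the lag effect discussed next.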
Observations

Observation I: Changes of y_{k+1} = G(u, y_k) in the primal update cause a lag effect in the adjoint iteration ȳ_{k+1} = f_y(u, y_k) + G_y(u, y_k)^T ȳ_k, i.e., ‖L_y(u, y_k, ȳ_k)‖ might grow while y_k is not sufficiently converged.
Observation II: Multiple primal updates before an adjoint update reduce the lag effect.
Observation III: The primal precision/number of forward iterations is unknown for B.C.

Question: What are the consequences, and can we do better?

[Figure: residual vs. time for the pure iteration and the B.C. and A.G. adjoints]

Motivation: primal and adjoint updates usually have different execution times.
Blurred Piggyback - Best of Both Worlds

Idea I: Run primal and adjoint updates in an asynchronous parallel fashion, i.e.,
  (state) (state) (state) (state) (state) ...
  (adjoint) (adjoint) (adjoint) (adjoint) (adjoint) ...
Idea II: Always use the latest available information y for the adjoint update.

Quasi-code:
1: Input: u = t_u, y = t_y, ȳ = t_ȳ  (t_u, t_y, t_ȳ shared)
2: while (y, ȳ) not converged do
3:   Process 0:
4:     Compute state t_y = G(t_u, y)
5:     Update state y = t_y
6:   Process 1:
7:     Compute adjoint t_ȳ = f_y(t_u, t_y) + G_y(t_u, t_y)^T ȳ
8:     Update adjoint ȳ = t_ȳ
9: end while
10: Output: y, ȳ

Convergence: based on ‖G_y(u,y)‖ ≤ ρ < 1 and smoothness of f, G.
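The quasi-code can be mimicked with two Python threads that share the latest iterates; the scalar model G(u, y) = 0.5·y + u, f(u, y) = ½y² is again a hypothetical stand-in, and the final synchronous flush plays the role of the convergence monitor:

```python
import threading

# Hypothetical scalar model: G(u, y) = 0.5*y + u, f(u, y) = 0.5*y**2,
# so f_y = y and G_y = 0.5.
u = 1.0
shared = {"y": 0.0, "ybar": 0.0}   # latest published iterates t_y, t_ybar
lock = threading.Lock()

def state_process(n):
    # Process 0: compute t_y = G(u, y), then publish y = t_y
    for _ in range(n):
        with lock:
            y = shared["y"]
        t_y = 0.5 * y + u
        with lock:
            shared["y"] = t_y

def adjoint_process(n):
    # Process 1: always read the *latest* (possibly blurred) state
    for _ in range(n):
        with lock:
            y, ybar = shared["y"], shared["ybar"]
        t_ybar = y + 0.5 * ybar    # f_y(u, y) + G_y(u, y)^T * ybar
        with lock:
            shared["ybar"] = t_ybar

threads = [threading.Thread(target=state_process, args=(200,)),
           threading.Thread(target=adjoint_process, args=(200,))]
for t in threads: t.start()
for t in threads: t.join()

# A monitor would iterate until both residuals are small; for a deterministic
# finish we flush the adjoint with the converged state (y* = 2, so ybar* = 4).
for _ in range(60):
    shared["ybar"] = shared["y"] + 0.5 * shared["ybar"]
```

The interleaving of the two loops is left to the scheduler, which is exactly the "blurred" aspect: the adjoint may see a stale y, but the shared contraction rate still drives both residuals to zero.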
Numerical Results for the Laplace Problem

Objective: tracking-type functional f(u,y) = ½‖y − y_d‖² + (µ/2)‖u‖², with y_d a triangle profile.
Constraint: Laplace equation −∆y = u, y|_∂Ω = 0, over Ω = [0,1].
Slow contraction: Jacobi iteration y_{k+1} = D^{-1}(u − R y_k) for A = D + R, D diagonal.

Measurements of the runtime seem to be inconsistent with the convergence history:

  Run time [s]:  PURE  1.35978293e-02
                 A.G.  1.5283478e-01
                 B.C.  2.7537666e-02
                 B.P.  1.7725375e-02

[Figure: residual vs. iteration (up to 60000) for the pure iteration and the B.C., B.P., and A.G. adjoints]
Load Balancing

Assumption I: The functions f, G and their derivatives can be implemented as parallel programs and only scale up to a certain degree.
Assumption II: A finite amount of computational resources λ ∈ R_+ is available.
Idea: Use load balancing λ = λ_1 + λ_2 and speed up the overall process by an optimal distribution (λ_1, λ_2) of the primal/adjoint updates on the available resources.
Goal: Minimize the computational time T(λ_1, λ_2) required to find (y*, ȳ*).

[Figure: static load balancing for three different tuples (λ_1, λ_2)]
Dynamic Load Balancing

Automatic setup based on information about f, G and their derivatives, the residual, scalability, extra cost for data transfer, communication, experience, ... → develop a self-adjusting method that efficiently distributes the computational resources during the course of the optimization.

Example: consider the following dynamic load balancing:
  Phase I: mainly reduce the primal error (B.C. style)
  Phase II: reduce primal and adjoint error (A.G./B.P. style)
  Phase III: mainly reduce the adjoint error (B.C. style)

[Figure: residual histories for the three phases]
Summary I

The state and adjoint equations can be solved by three different fixed-point approaches:
  Christianson: (state) (state) ... (adjoint) (adjoint) ...                              (B.C.)
  Griewank:     (state, adjoint) (state, adjoint) ...                                    (A.G.)
  Blurred:      (state) (state) (state) ... asynchronously with (adjoint) (adjoint) ...  (B.P.)

The (new) Blurred Piggyback method (asynchronous + blurred state information for the adjoint) seems to be favorable (at least for this example, this setting, ...).
A.G. and B.C. can be recovered as special cases of B.P. by load balancing.
(Dynamic) load balancing to reduce execution time - that's the important thing!
Oneshot Methods

Goal: solve the optimization problem min_{(u,y) ∈ R^{m+n}} f(u,y) s.t. y = G(u,y).
Updates:
  u_+ = u − αB^{-1} L_u(u, y, ȳ)
  y_+ = G(u, y)
  ȳ_+ = ȳ + L_y(u, y, ȳ)

Sequencing: combine the three update steps into an optimization method. Choose the order of state, adjoint, and design updates, their multiplicity s, and the parameters B and α to ensure (optimal) contraction of the combined method towards stationary points.

Scenarios:
  (design, state, adjoint) (design, state, adjoint) ...   (Jacobi)
  (design) (state) (adjoint) (design) ...                 (Seidel)
  (design) (state, adjoint) (design) ...                  (Mixed)
  (design) (state)^s (adjoint)^s (design) ...             (Multi-Step)
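The three updates can be combined in the Jacobi scenario on a hypothetical scalar model (B = I and a small fixed α for stability; not the Laplace example from the experiments):

```python
# One-shot Jacobi sketch on a hypothetical scalar model:
#   G(u, y) = 0.5*y + u,  f(u, y) = 0.5*(y - 1)**2 + 0.5*u**2
# hence f_u = u, f_y = y - 1, G_u = 1, G_y = 0.5, and
#   L_u = f_u + G_u*ybar = u + ybar
#   L_y = f_y + (G_y - 1)*ybar = (y - 1) - 0.5*ybar
def oneshot_jacobi(alpha=0.05, n_iter=500):
    u = y = ybar = 0.0
    for _ in range(n_iter):
        t_u    = u - alpha * (u + ybar)           # design:  u+ = u - alpha*B^{-1}*L_u
        t_y    = 0.5 * y + u                      # state:   y+ = G(u, y)
        t_ybar = ybar + ((y - 1.0) - 0.5 * ybar)  # adjoint: ybar+ = ybar + L_y
        u, y, ybar = t_u, t_y, t_ybar             # simultaneous (Jacobi) update
    return u, y, ybar

u, y, ybar = oneshot_jacobi()
# stationary point of this model: u* = 0.4, y* = 2*u* = 0.8, ybar* = 2*(y* - 1) = -0.4
```

For this model a larger step such as α = 0.2 already makes the combined Jacobi iteration diverge, which is why the slide emphasizes choosing α and B for contraction of the coupled scheme, not of each update in isolation.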
Most Wanted Oneshot Methods

Jacobi method by Griewank/Gauger/Slawig/Kaland:
  (design, state, adjoint) (design, state, adjoint) ...
with the doubly augmented Lagrangian preconditioner
  B = L_uu + α G_u^T G_u + β L_uy L_yu.

Seidel (multi-step) method by B./Lehmann/Griewank:
  (design) (state)^s (adjoint)^s (design) ...
with the projected Lagrangian preconditioner (or modifications)
  B = [I  Z^T] [L_uu  L_uy; L_yu  L_yy] [I; Z],   Z = (I − G_y)^{-1} G_u.

Generalization: both/(almost^a) all considered one-shot methods are special cases of the following general one-shot method and can be derived by suitable load balancing schemes.
  ^a To the best of my knowledge, one can even skip the "almost".
Algorithm 1: General Oneshot Method
1: Input: u, y, ȳ, α, B
2: while (u, y, ȳ) not converged do
3:   Process 0:
4:     Monitor processes, determine allocation of resources
5:   Process 1:
6:     Update α and B
7:     Compute design t_u = u − αB^{-1} L_u(u, y, ȳ)
8:   Process 2:
9:     Compute state t_y = G(u, y)
10:  Process 3:
11:    Compute adjoint t_ȳ = L_y(u, y, ȳ) + ȳ
12:  Process 4:
13:    Update design u = t_u
14:  Process 5:
15:    Update state y = t_y
16:  Process 6:
17:    Update adjoint ȳ = t_ȳ
18: end while
19: Output: u, y, ȳ, α, B
Load Balancing

Resources need to be distributed over the 0+1+1+1+3 processes.

Jacobi method and corresponding load balancing:
  (design, state, adjoint) (design, state, adjoint) ...
  [Diagram: work of the STATE/ADJOINT/DESIGN processes over the run time]

Multi-Seidel method and corresponding load balancing:
  (design) (state)^s (adjoint)^s (design) ...
  [Diagram: work of the STATE/ADJOINT/DESIGN processes over the run time]

Problem: both methods (Jacobi, Multi-Seidel) are not optimal: idle time vs. cross-communication.
Blurred Oneshot Method

Idea: run all three processes (design, state, adjoint) in parallel but in an asynchronous fashion, and allow blurred updates analogous to the blurred piggyback approach.

[Diagram: work of the STATE/ADJOINT/DESIGN processes over the run time]

WARNING: we have a coupling: u depends on y, ȳ and vice versa.
Solution: use dynamic load balancing to enforce convergence. The load balancing should be mostly stable for efficiency reasons but may still vary.
Intuition: in the worst case, one can slow down the control updates by reducing the computational resources for this process and crawl along/to the primal and adjoint feasible manifold M.
Blurred Oneshot Algorithm
1: Input: u = t_u, y = t_y, ȳ = t_ȳ, α, B  (t_u, t_y, t_ȳ shared)
2: while (u, y, ȳ) not converged do
3:   Process 0:
4:     Monitor processes, determine allocation of resources
5:   Process 1:
6:     Update α and B
7:     Compute design t_u = u − αB^{-1} L_u(u, t_y, t_ȳ)
8:     Update design u = t_u
9:   Process 2:
10:    Compute state t_y = G(t_u, y)
11:    Update state y = t_y
12:  Process 3:
13:    Compute adjoint t_ȳ = f_y(t_u, t_y) + G_y(t_u, t_y)^T ȳ
14:    Update adjoint ȳ = t_ȳ
15: end while
16: Output: u, y, ȳ, α, B
Numerical Results for the Laplace Design Optimization Problem

Setup: Blurred Oneshot with an L-BFGS approximation of the projected Hessian, no load balancing (not yet), α by quasi-backtracking.

[Figure: control, state, and adjoint residuals of a typical optimization run, dropping from 1 to about 1e-15 over 20000 iterations]
Time-Dependent Problems - Gauger/Günther/Wang/(Griewank)

Problem:
  min_{u,y} (1/T) ∫_0^T f(u, y(τ)) dτ
  such that ẏ(τ) + c(y(τ), u) = 0 for τ ∈ [0,T], y(0) = y_0.

State: y : [0,T] → Y is a function that varies over the time interval [0,T] ⊂ R.
Time discretization: 0 = τ_0 < τ_1 < ... < τ_N = T with approximations y_i ≈ y(τ_i) ∈ R^n.
Assume I: the time integration of the constraint is a one-step method^a, so the state y_i at time τ_i is implicitly given by the state y_{i-1} at the previous time τ_{i-1}, i.e.,
  R(u, y_i, y_{i-1}) = 0,  i = 1, ..., N.
Assume II: there exists a contractive fixed-point solver G : U × Y × Y → Y,
  y_i^{k+1} = G(u, y_i^k, y_{i-1}),
that computes the solutions lim_{k→∞} y_i^k of the integration scheme R, with contraction rate ‖G_{y_i}(u, y_i, y_{i-1})‖ ≤ ρ_G < 1.
  ^a e.g. implicit Euler, although higher-order methods are also possible
Extended Fixed-Point Iteration

Extended mapping G : U × Y → Y with Y = Y^N = Y × ... × Y, defined by
  G(u, y) = (G_1(u,y), G_2(u,y), ..., G_N(u,y))
          = (G(u, y_1, y_0), G(u, y_2, G(u, y_1, y_0)), ..., G(u, y_N, G(u, y_{N-1}, G(... G(u, y_1, y_0) ...))))

Contraction is inherited by the extended mapping G, i.e.,
  ‖G_y(u, y)‖ ≤ ρ_G < 1.

Extended problem:
  min_{u,y} f(u, y) such that y = G(u, y), with y_0 = y(0),
where the objective functional is approximated by the quadrature
  f(u, y) = Σ_{i=1}^N w_i f(u, y_i) ≈ (1/T) ∫_0^T f(u, y(τ)) dτ
with quadrature weights w_i.
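The nested definition of G can be sketched as a sweep that feeds each time step the freshly computed previous step. The inner solver below is a hypothetical damped fixed-point step for implicit Euler on ẏ = u − y (step size H, damping 0.5), not the authors' code:

```python
# Hypothetical inner solver: damped fixed-point step for implicit Euler on
# y' = u - y, i.e. R(u, y_i, y_prev) = (1 + H)*y_i - y_prev - H*u = 0.
H = 0.1  # time step

def G(u, y_i, y_prev):
    # contraction rate |1 - 0.5*(1 + H)| = 0.45 < 1 in y_i
    return y_i - 0.5 * ((1.0 + H) * y_i - y_prev - H * u)

def extended_G(u, y, y_init):
    """One application of the extended mapping G(u, y): component i is
    G(u, y_i, <freshly updated step i-1>), matching the nested definition."""
    out, prev = [], y_init
    for y_i in y:
        prev = G(u, y_i, prev)
        out.append(prev)
    return out

# Iterate y <- G(u, y); contraction is inherited from the inner solver.
y = [0.0, 0.0, 0.0]
for _ in range(200):
    y = extended_G(1.0, y, 0.0)
# each y_i converges to the implicit-Euler value y_i = (y_{i-1} + H*u)/(1 + H)
```

Note that the components of one sweep are sequentially coupled only through the previous step's tentative value, which is what allows the per-time-step parallel (and blurred) updates on the next slide.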
Comments

Interpretation: the extended fixed-point mapping can be understood as a parallel update of the discrete state approximations for all time steps.

[Diagram: state updates per time step; lag effect and different execution times of G across processes]

Blurred updates with load balancing for the different time steps are strongly recommended!
Blurred Oneshot Method for Time-Dependent Problems
1: Input: u = t_u, y = (t_y1, ..., t_yN), ȳ_i = t_ȳi, α, B
2: while (u, y_1, ..., y_N, ȳ_1, ..., ȳ_N) not converged do
3:   Process 0: Monitor processes, determine allocation of resources
4:   Process 1:
5:     Update α and B
6:     Compute design t_u = u − αB^{-1} L_u(u, t_y1, ..., t_yN, t_ȳ1, ..., t_ȳN)
7:     Update design u = t_u
8:   Process 2:
9:     Compute state t_y1 = G(t_u, y_1, y_0)
10:    Update state y_1 = t_y1
11:  Process 3:
12:    Compute state t_y2 = G(t_u, y_2, t_y1)
13:    Update state y_2 = t_y2
14:  ...
15:  Process N+1:
16:    Compute state t_yN = G(t_u, y_N, t_y(N−1))
17:    Update state y_N = t_yN
18:  Process N+2:
19:    Compute adjoint t_ȳ1 = f_y(t_u, t_y1) + [G_y(t_u, t_y1, ..., t_yN)^T ȳ]_1
20:    Update adjoint ȳ_1 = t_ȳ1
21:  ...
22:  Process 2N+1:
23:    Compute adjoint t_ȳN = f_y(t_u, t_yN) + [G_y(t_u, t_y1, ..., t_yN)^T ȳ]_N
24:    Update adjoint ȳ_N = t_ȳN
25: end while
Observations

Load balance I: for the extended primal iteration, e.g., a dynamic allocation of computational resources per process.
Load balance II: similar for the extended adjoint iteration.
Load balance III: between primal and adjoint updates still possible.
Load balance IV: between (primal, adjoint) and design also possible.

GOAL: Allocating the resources to all 2N+2 processes needs to be harmonized like an orchestra to achieve maximal efficiency.
Summary

Blurred Piggyback/Oneshot using asynchronous and inconsistent updates
General Piggyback/Oneshot method for design optimization
Extension to time-dependent problems
Can be improved by (optimal) load balancing: for use in HPC, convergence, efficiency

Outlook - TODO: It's still a loooooong way to go...

Thank you for your attention!
Challenge

Shared memory in MPI 3.0 could be a great challenge for AD in the future.
Any suggestions? (Except saying: "Don't use it!")