STORM: Stochastic Optimization Using Random Models Katya Scheinberg Lehigh University. (Joint work with R. Chen and M. Menickelly)

1 STORM: Stochastic Optimization Using Random Models Katya Scheinberg Lehigh University (Joint work with R. Chen and M. Menickelly)

2 Outline Stochastic optimization problem: black-box and gradient-based. Existing methods vs. this work. Algorithm, assumptions and analysis. Computational results.

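3 Black-box stochastic optimization Unconstrained optimization problem: $\min_{x \in \mathbb{R}^n} f(x)$. Function $f \in C^1$ or $C^2$ and bounded from below. $f(x)$ cannot be computed; instead, $f(x, \varepsilon)$ can be computed, where $\varepsilon$ is a random variable. If $E_\varepsilon[f(x,\varepsilon)] = f(x)$, then the noise is unbiased. If $E_\varepsilon[f(x,\varepsilon)] = h(x) \neq f(x)$, then the noise is biased.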
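The difference between unbiased and biased noise can be seen numerically. The following is a minimal sketch, assuming a hypothetical quadratic objective and two illustrative noise models that are not taken from the slides: additive zero-mean Gaussian noise (unbiased) and an evaluation that occasionally "fails" and returns a fixed garbage value (biased).

```python
import random

def f(x):
    """Hypothetical true objective: a simple quadratic."""
    return sum(xi * xi for xi in x)

def noisy_unbiased(x, rng, sigma=0.1):
    """f(x, eps) = f(x) + eps with E[eps] = 0, so E[f(x, eps)] = f(x)."""
    return f(x) + rng.gauss(0.0, sigma)

def noisy_biased(x, rng, p_fail=0.1, garbage=100.0):
    """With probability p_fail the evaluation fails and returns a fixed
    garbage value, so E[f(x, eps)] = (1 - p_fail) * f(x) + p_fail * garbage."""
    return garbage if rng.random() < p_fail else f(x)

def sample_mean(oracle, x, rng, n):
    """Average n independent noisy evaluations at x."""
    return sum(oracle(x, rng) for _ in range(n)) / n

rng = random.Random(0)
x = [1.0, 2.0]                      # f(x) = 5
mean_unbiased = sample_mean(noisy_unbiased, x, rng, 20000)
mean_biased = sample_mean(noisy_biased, x, rng, 20000)
print(mean_unbiased, mean_biased)
```

Averaging more samples drives the first mean toward $f(x)$, while the second converges to a different function $h(x)$, which is why sample averaging alone cannot remove biased noise.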

4 Noisy computable function [Figure: plots with axes tx, ty, az, angle]

5 Sampling the black box function Sample points. How to choose and use the sample points and the function values defines different methods. See the book by Conn, S. and Vicente, 2009.

6 Model based trust region methods Powell, Conn, S., Toint, Vicente, Wild, etc.

9 Model based trust region methods Exploit curvature, take flexible, efficient steps, use second-order models.

10 Gradient-based stochastic optimization Unconstrained optimization problem: $\min_{x \in \mathbb{R}^n} f(x)$. Function $f \in C^1$ or $C^2$ and bounded from below. $f(x)$ or $\nabla f(x)$ cannot be computed; instead, $g(x, \varepsilon)$ can be computed, where $\varepsilon$ is a random variable. If $E_\varepsilon[g(x,\varepsilon)] = \nabla f(x)$, then the noise is unbiased. If $E_\varepsilon[g(x,\varepsilon)] = h(x) \neq \nabla f(x)$, then the noise is biased.

11 What methods exist for stochastic optimization? Stochastic gradient; sample average (simulation optimization); sample path optimization; methods based on random models (ours).

12 Stochastic gradient descent (Robbins-Monro '51, Polyak-Yuditski '92, Spall '00, Shalev-Shwartz '11, Ghadimi, Lan '13) Assume an unbiased estimator of the gradient can be computed. The method then is: $x_{k+1} = x_k - \alpha_k \, g(x_k, \varepsilon_k)$. Many variants exist, but in most, tuning the step-size sequence $\alpha_k$ is required and convergence is slow.
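The stochastic gradient iteration can be sketched in a few lines. This is an illustrative sketch only, assuming a hypothetical objective $f(x) = \tfrac{1}{2}\|x\|^2$, Gaussian gradient noise, and the classical diminishing step size $1/(k+1)$; none of these choices come from the slides.

```python
import random

def grad_est(x, rng, sigma=0.5):
    """Unbiased gradient estimate for the hypothetical f(x) = 0.5 * ||x||^2:
    the true gradient x plus zero-mean Gaussian noise."""
    return [xi + rng.gauss(0.0, sigma) for xi in x]

def sgd(grad_est, x0, step, n_iters, rng):
    """Plain stochastic gradient descent: x <- x - step(k) * g(x, eps)."""
    x = list(x0)
    for k in range(n_iters):
        g = grad_est(x, rng)
        a = step(k)
        x = [xi - a * gi for xi, gi in zip(x, g)]
    return x

rng = random.Random(1)
x_final = sgd(grad_est, [5.0, -3.0], lambda k: 1.0 / (k + 1), 5000, rng)
print(x_final)
```

Each iteration costs one gradient sample, but the result depends heavily on the step-size schedule, which is the tuning burden the slide refers to.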

13 Sample averaging (Shapiro, Homem-De-Mello, Pasupathy, Ghosh, Glynn, etc.) Assume an unbiased estimator of the gradient can be computed. Compute a sample average gradient at $x_k$, given sample size $S_k$: $g_{S_k}(x_k) = \frac{1}{S_k} \sum_{i=1}^{S_k} g(x_k, \varepsilon_i)$. Tends to work well in practice and many variants exist, but strong assumptions are needed in theory.
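A sample average gradient is simply the mean of independent gradient estimates at the same point. The sketch below assumes a hypothetical objective $f(x) = \|x\|^2$ with Gaussian gradient noise and compares the error for a small and a large sample size; the function names are illustrative.

```python
import math
import random

def grad_est(x, rng, sigma=1.0):
    """Unbiased gradient estimate for the hypothetical f(x) = ||x||^2."""
    return [2.0 * xi + rng.gauss(0.0, sigma) for xi in x]

def sample_average_gradient(x, sample_size, rng):
    """Average sample_size independent gradient estimates at x."""
    total = [0.0] * len(x)
    for _ in range(sample_size):
        g = grad_est(x, rng)
        total = [t + gi for t, gi in zip(total, g)]
    return [t / sample_size for t in total]

def error_norm(g, x):
    """Euclidean distance between g and the true gradient 2x."""
    return math.sqrt(sum((gi - 2.0 * xi) ** 2 for gi, xi in zip(g, x)))

rng = random.Random(2)
x = [1.0, -1.0]
err_small = error_norm(sample_average_gradient(x, 10, rng), x)
err_large = error_norm(sample_average_gradient(x, 10000, rng), x)
print(err_small, err_large)
```

The standard deviation of the error shrinks like $1/\sqrt{S_k}$, so accuracy is bought with a growing number of samples per iteration.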

14 Stochastic gradient Accurate in expectation. Accuracy does not improve. Iterations remain inexpensive! Does not apply to standard frameworks. Convergence rates are usually lower. [Diagram: iteration k of the main algorithm uses an approximate computation that is accurate in expectation]

15 Sample average approximation Information gets more and more accurate as needed. This is usually achieved by resampling the gradient and function. Information is assumed to be accurate in expectation with bounded variance. Under sufficient sampling and unbiased noise assumptions, preserves the convergence rates. [Diagram: iterations k, k+1, k+2, k+3 of the main algorithm use approximate computations that progressively reduce the variance of the error]

16 Random inexact first (and second) order models Information gets more and more exact as needed. But this only has to hold with some probability. No assumption on distribution, or expectation. Applies to standard deterministic frameworks. Preserves the convergence rates. [Diagram: iterations k, k+1, k+2, k+3 of the main algorithm use approximate computations that are progressively more accurate, with failures]

17 Biased and unbiased noise examples. Noisy function samples. Unbiased noise. Biased noise. Processor failures, biased gradient estimates.

18 Our algorithm and convergence analysis

19 Deterministic trust region framework Powell, Conn, S., Toint, Vicente, Wild, Morales.

20 Randomized trust region framework Refreshing models at each iteration allows the occasional use of really bad models. (Bandeira, S., Vicente, '12; Cartis, S., '14)

21 Stochastic trust region framework Refreshing function estimates at each iteration allows the occasional use of really bad function values. (Chen, Menickelly, S. 2015)
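A heavily simplified sketch of a stochastic trust-region loop in this spirit is given below. It is not the authors' implementation. The objective, the Gaussian noise, the central-difference linear model, the sample-size rule, and all parameter names and values (`gamma`, `eta1`, `eta2`, `delta_max`) are assumptions made for illustration. The point it shows is that both the model and the two function estimates are refreshed at every iteration, so one bad estimate only affects one step.

```python
import math
import random

def true_f(x):
    """Hypothetical smooth objective, used here only to define the oracle."""
    return sum(xi * xi for xi in x)

def noisy_f(x, rng, sigma=0.1):
    """Black-box oracle: the true value plus zero-mean Gaussian noise."""
    return true_f(x) + rng.gauss(0.0, sigma)

def estimate(x, n, rng):
    """Function estimate: average of n fresh noisy evaluations."""
    return sum(noisy_f(x, rng) for _ in range(n)) / n

def model_gradient(x, delta, n, rng):
    """Gradient of a linear model built from central differences with
    spacing delta, using averaged noisy function values."""
    g = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += delta
        xm[i] -= delta
        g.append((estimate(xp, n, rng) - estimate(xm, n, rng)) / (2.0 * delta))
    return g

def stochastic_tr(x0, delta0=1.0, delta_max=4.0, gamma=2.0,
                  eta1=0.1, eta2=1.0, n_iters=40, seed=0):
    rng = random.Random(seed)
    x, delta = list(x0), delta0
    for _ in range(n_iters):
        # Assumed heuristic: more samples as the radius shrinks, capped for cost.
        n = min(1000, max(10, int(1.0 / delta ** 4)))
        g = model_gradient(x, delta, n, rng)
        g_norm = math.sqrt(sum(gi * gi for gi in g))
        if g_norm == 0.0:
            delta /= gamma
            continue
        # Step to the trust-region boundary along the negative model gradient.
        s = [-delta * gi / g_norm for gi in g]
        x_trial = [xi + si for xi, si in zip(x, s)]
        # Fresh function estimates at both points on every iteration.
        f0 = estimate(x, n, rng)
        fs = estimate(x_trial, n, rng)
        rho = (f0 - fs) / (delta * g_norm)  # estimated vs. predicted decrease
        if rho >= eta1 and g_norm >= eta2 * delta:
            x, delta = x_trial, min(gamma * delta, delta_max)  # successful step
        else:
            delta /= gamma                                     # unsuccessful step
    return x, delta

x_final, delta_final = stochastic_tr([3.0, -2.0])
print(x_final, delta_final)
```

The acceptance test couples the step to the radius: a step is accepted only when the estimated decrease is a sufficient fraction of the predicted one and the model gradient is not small relative to the radius, so the radius shrinks whenever the information is too inaccurate to trust.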

22 What can happen at each step

23 What can happen at each step

24 What do we need from random models and random function estimates? We need likely Taylor-like behavior of first-order models. We need likely accuracy from the function estimates. Model and estimate accuracy depend on $\delta_k$, the TR radius. The corresponding probabilities are constant throughout the algorithm.
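One standard way to make these two requirements precise comes from the derivative-free trust-region literature. In the sketch below, $\delta_k$ denotes the trust-region radius, $m_k$ the model, $s_k$ the step, and $f_k^0$, $f_k^s$ the function estimates at $x_k$ and $x_k + s_k$; the constants $\kappa_{eg}$, $\kappa_{ef}$, $\varepsilon_F$ and all of this notation are assumptions, not recovered from the slide.

```latex
% Taylor-like (fully linear) model on the trust region B(x_k, \delta_k):
\| \nabla f(y) - \nabla m_k(y) \| \le \kappa_{eg}\, \delta_k ,
\qquad
| f(y) - m_k(y) | \le \kappa_{ef}\, \delta_k^2
\qquad \text{for all } y \in B(x_k, \delta_k).

% Accurate function estimates at the current and trial points:
| f_k^0 - f(x_k) | \le \varepsilon_F\, \delta_k^2 ,
\qquad
| f_k^s - f(x_k + s_k) | \le \varepsilon_F\, \delta_k^2 .
```

Each pair of bounds is only required to hold with a fixed probability, conditioned on the history of the algorithm, rather than at every iteration.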

25 Key ideas in establishing convergence Construct a stochastic process $\Phi_k$. Prove: $\Phi_k$ decreases in expectation and is bounded from below. Hence: $\Phi_k$ is a supermartingale and $\delta_k \to 0$.
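The shape of this argument can be sketched as follows. The specific form of the process, a weighted combination of the objective value and the squared trust-region radius $\delta_k$ with weight $\nu$, and the constant $\sigma$ are assumptions for illustration; the formula on the slide did not survive extraction.

```latex
% Assumed form of the process, with \nu \in (0,1):
\Phi_k = \nu f(x_k) + (1 - \nu)\, \delta_k^2 .

% Expected decrease, conditioned on the history \mathcal{F}_k, for some \sigma > 0:
E\left[ \Phi_{k+1} - \Phi_k \mid \mathcal{F}_k \right] \le -\sigma\, \delta_k^2 .

% f is bounded from below, so \Phi_k \ge \Phi_{\min}. Taking total
% expectations and telescoping:
\sigma \sum_{k=0}^{\infty} E\left[ \delta_k^2 \right] \le \Phi_0 - \Phi_{\min} < \infty
\quad \Longrightarrow \quad
\sum_{k=0}^{\infty} \delta_k^2 < \infty \ \text{ almost surely}
\quad \Longrightarrow \quad
\delta_k \to 0 \ \text{ almost surely}.
```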

26 Key steps of analysis There exists a constant C (dependent on algorithmic parameters and Lipschitz constants): guaranteed constant decrease in $f(x_k)$ once $\delta_k$ decreases below some threshold.

27 Key steps of analysis There exists a constant C (dependent on algorithmic parameters and Lipschitz constants). Different behavior depending on $\delta_k$ being larger or smaller than C. $f(x_k)$ increases w. prob. 1- when k C

28-36 Illustration [Figure sequence: the levels $-f(x^*)$ and $-f(x_k)$, the trust-region radius and the threshold C, with steps annotated by the probabilities with which they occur]

37 Main convergence result Theorem: There exists a constant $p_0$, dependent on f and algorithmic constants, such that if ... with probability 1. (Chen, Menickelly, S. 2014) Specifically, ... where L is the Lipschitz constant of f.

38 Computational results

39 Biased and unbiased noise examples, again. Noisy function samples. Unbiased noise. Biased noise. Processor failures, biased gradient estimates.

40 Comparison of STORM with sample averaging TR method. The relative noise case

41 Comparison of STORM with sample averaging TR method. The computation failures case

43 Comparison of STORM with sample averaging TR method. The processor failures case

45 Conclusion We propose a general framework for stochastic inexact first (and second) order methods. No assumption on distribution, or expectation. Models are sufficiently accurate with constant probability. Applies to standard deterministic frameworks. Applies to cases of biased noise. Works well in practice. Can view this as a demonstration of robustness of a standard framework.

46 Future work Convergence rates analysis. Sampling rate analysis for randomly sampled models. Use of learning guarantees and Rademacher complexity of model classes. Extend to convex optimization. More examples of models that fit the framework.

47 Thank you!
