STORM: Stochastic Optimization Using Random Models. Katya Scheinberg, Lehigh University (joint work with R. Chen and M. Menickelly)




Outline: stochastic optimization problem (black-box and gradient-based); existing methods vs. this work; algorithm, assumptions and analysis; computational results.

Black-box stochastic optimization. Unconstrained optimization problem: minimize f(x), where f ∈ C^1 or C^2 and f is bounded from below. f(x) cannot be computed; instead we observe f(x, ε), where ε is a random variable. If E_ε[f(x, ε)] = f(x), the noise is unbiased; if E_ε[f(x, ε)] = h(x) ≠ f(x), the noise is biased.
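To make the unbiased/biased distinction concrete, here is a minimal Python sketch of a noisy black-box oracle; the quadratic test function and the noise magnitudes are illustrative assumptions, not part of the talk.

```python
import numpy as np

def f_true(x):
    # Hypothetical smooth test objective (not from the talk): a simple quadratic.
    return 0.5 * np.dot(x, x)

def f_noisy_unbiased(x, rng, sigma=0.1):
    # Unbiased noise: E_eps[f_noisy(x, eps)] = f(x).
    return f_true(x) + sigma * rng.standard_normal()

def f_noisy_biased(x, rng, sigma=0.1, bias=0.05):
    # Biased noise: E_eps[f_noisy(x, eps)] = f(x) + bias != f(x).
    return f_true(x) + bias + sigma * rng.standard_normal()

rng = np.random.default_rng(0)
x = np.array([1.0, -2.0])
print(f_noisy_unbiased(x, rng), f_noisy_biased(x, rng))
```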

Noisy computable function: surface plots of a noisy objective over the variable pairs (tx, ty) and (az, angle); figure not reproduced.

Sampling the black-box function. How the sample points and the function values are chosen and used defines the different methods; see the book by Conn, S., and Vicente, 2009.

Model-based trust-region methods (Powell, Conn, S., Toint, Vicente, Wild, etc.).

Model-based trust-region methods exploit curvature, take flexible and efficient steps, and use second-order models.

Gradient-based stochastic optimization. Unconstrained optimization problem: minimize f(x), where f ∈ C^1 or C^2 and f is bounded from below. f(x) and ∇f(x) cannot be computed; instead we observe g(x, ε), where ε is a random variable. If E_ε[g(x, ε)] = ∇f(x), the noise is unbiased; if E_ε[g(x, ε)] = h(x) ≠ ∇f(x), the noise is biased.

What methods exist for stochastic optimization? Stochastic gradient; sample averaging (simulation optimization); sample path optimization; methods based on random models (ours).

Stochastic gradient descent (Robbins-Monro 1951, Polyak-Juditsky 1992, Spall 2000, Shalev-Shwartz 2011, Ghadimi and Lan 2013). Assume an unbiased estimator of the gradient can be computed; the method is then x_{k+1} = x_k - α_k g(x_k, ε_k). Many variants exist, but most require tuning the step-size sequence {α_k}, and convergence is slow.
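A minimal sketch of the Robbins-Monro update described above, assuming an unbiased gradient estimator g(x, ε); the test objective, the estimator, and the diminishing step-size rule α_k = a/(k+1) are illustrative choices, not specified in the slides.

```python
import numpy as np

def sgd(grad_est, x0, n_iters=1000, a=0.5, rng=None):
    """Robbins-Monro stochastic gradient: x_{k+1} = x_k - alpha_k * g(x_k, eps_k)."""
    rng = rng or np.random.default_rng(0)
    x = np.array(x0, dtype=float)
    for k in range(n_iters):
        alpha_k = a / (k + 1)          # diminishing step sizes; must be tuned in practice
        x -= alpha_k * grad_est(x, rng)
    return x

# Illustrative unbiased gradient estimator for f(x) = 0.5*||x||^2: true gradient plus noise.
g = lambda x, rng: x + 0.1 * rng.standard_normal(x.shape)
print(sgd(g, x0=[1.0, -2.0]))
```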

Sample averaging (Shapiro, Homem-De-Mello, Pasupathy, Ghosh, Glynn, etc.). Assume an unbiased estimator of the gradient can be computed, and compute a sample-average gradient at x_k over a sample of size S_k. These methods tend to work well in practice and many variants exist, but strong assumptions are needed in theory.
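A sketch of the sample-average gradient step, with the sample size S_k grown across iterations to reduce the variance of the averaged estimator; the doubling rule for S_k, the cap, and the fixed step size are assumptions made for illustration only.

```python
import numpy as np

def sample_average_gradient(grad_est, x, sample_size, rng):
    """Average sample_size unbiased gradient estimates at x to reduce variance."""
    return np.mean([grad_est(x, rng) for _ in range(sample_size)], axis=0)

def sa_descent(grad_est, x0, n_iters=20, step=0.2, s0=4, rng=None):
    rng = rng or np.random.default_rng(0)
    x = np.array(x0, dtype=float)
    for k in range(n_iters):
        s_k = min(s0 * 2**k, 4096)     # illustrative growing sample size S_k
        x -= step * sample_average_gradient(grad_est, x, s_k, rng)
    return x

g = lambda x, rng: x + 0.5 * rng.standard_normal(x.shape)   # noisy gradient of 0.5*||x||^2
print(sa_descent(g, x0=[1.0, -2.0]))
```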

Stochastic gradient: the estimate is accurate in expectation, but its accuracy does not improve over the iterations; iterations remain inexpensive; the approach does not apply to standard deterministic frameworks; convergence rates are usually lower. (Diagram: at each iteration k, the main algorithm uses an approximate gradient computation that is accurate in expectation.)

Sample average approximation: the information gets more and more accurate as needed, usually achieved by resampling the gradient and the function; the information is assumed to be accurate in expectation with bounded variance; under sufficient sampling and unbiased-noise assumptions, the convergence rates are preserved. (Diagram: across iterations k, k+1, k+2, k+3 the approximate computation progressively reduces the variance of the error.)

Random inexact first (and second) order models: the information gets more and more exact as needed, but this only has to hold with some probability; no assumption on the distribution or the expectation; applies to standard deterministic frameworks; preserves the convergence rates. (Diagram: across iterations k, k+1, k+2, k+3 the approximate computation becomes progressively more accurate, with occasional failures.)

Biased and unbiased noise examples: noisy function samples (unbiased noise and biased noise); processor failures and biased gradient estimates.

Our algorithm and convergence analysis

Deterministic trust-region framework (Powell, Conn, S., Toint, Vicente, Wild, Morales).

Randomized trust-region framework: refreshing the models at each iteration allows the occasional use of really bad models (Bandeira, S., Vicente, 2012; Cartis, S., 2014).

Stochastic trust-region framework: refreshing the function estimates at each iteration allows the occasional use of really bad function values (Chen, Menickelly, S., 2015).
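The following is a simplified sketch of a STORM-style stochastic trust-region iteration, not the exact algorithm from the paper: a random model is rebuilt every iteration, the step is accepted or rejected using noisy function estimates that are also refreshed every iteration, and the trust-region radius is updated accordingly. The model construction (a linear model from a gradient estimate), the Cauchy-type step, the acceptance threshold eta, and the radius factor gamma are illustrative assumptions.

```python
import numpy as np

def storm_like_tr(f_noisy, grad_est, x0, delta0=1.0, eta=0.1, gamma=2.0,
                  n_iters=100, rng=None):
    """Simplified stochastic trust-region loop with models rebuilt each iteration."""
    rng = rng or np.random.default_rng(0)
    x, delta = np.array(x0, dtype=float), delta0
    for _ in range(n_iters):
        g = grad_est(x, rng)                           # random model gradient at x_k
        s = -delta * g / (np.linalg.norm(g) + 1e-12)   # Cauchy-type step inside the TR
        pred = -(g @ s)                                # predicted decrease of the linear model
        # Noisy estimates of f at x_k and x_k + s_k, refreshed each iteration.
        f0, f1 = f_noisy(x, rng), f_noisy(x + s, rng)
        rho = (f0 - f1) / max(pred, 1e-12)             # estimated vs. predicted decrease
        if rho >= eta:
            x, delta = x + s, gamma * delta            # successful step: accept, enlarge radius
        else:
            delta /= gamma                             # unsuccessful step: reject, shrink radius
    return x

# Illustrative noisy oracle and gradient estimator for f(x) = 0.5*||x||^2.
f = lambda x, rng: 0.5 * x @ x + 0.01 * rng.standard_normal()
g = lambda x, rng: x + 0.1 * rng.standard_normal(x.shape)
print(storm_like_tr(f, g, x0=[1.0, -2.0]))
```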

What can happen at each step (illustrations of the possible outcomes of a trust-region step; figures not reproduced).

What do we need from the random models and the random function estimates? We need likely Taylor-like behavior from the first-order models and likely accuracy from the function estimates; the required model and estimate accuracy depends on δ_k, the trust-region radius; the probabilities of model accuracy and of estimate accuracy are constant throughout the algorithm.
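The "Taylor-like behavior" requirement is commonly formalized as a fully-linear condition that holds with a fixed probability. The LaTeX restatement below uses generic constants κ_ef, κ_eg, ε_F and probabilities α, β as placeholders; the slide does not display the exact inequalities, so treat the constants and symbols as assumptions.

```latex
% Models m_k are "probabilistically fully linear":
\[
\text{with probability at least } \alpha:\quad
|f(x_k+s)-m_k(x_k+s)| \le \kappa_{ef}\,\delta_k^2,\qquad
\|\nabla f(x_k+s)-\nabla m_k(x_k+s)\| \le \kappa_{eg}\,\delta_k
\quad \text{for all } \|s\|\le\delta_k .
\]
% Function estimates f_k^0 \approx f(x_k) and f_k^s \approx f(x_k+s_k) are
% "probabilistically accurate":
\[
\text{with probability at least } \beta:\quad
|f_k^0 - f(x_k)| \le \epsilon_F\,\delta_k^2,\qquad
|f_k^s - f(x_k+s_k)| \le \epsilon_F\,\delta_k^2 .
\]
% Both probabilities \alpha and \beta are constant across iterations.
```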

Key ideas in establishing convergence: construct a suitable stochastic process Φ_k from the function values and the trust-region radii; prove that Φ_k is bounded from below and decreases in expectation, hence Φ_k is a supermartingale and δ_k → 0.
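The slide omits the definition of the process. In STORM-style analyses, a weighted combination of the function value and the squared trust-region radius is a natural choice, so the sketch below should be read as an assumed form of the construction rather than a quote from the paper.

```latex
\[
\Phi_k \;=\; \nu\,\bigl(f(x_k) - f^{*}\bigr) \;+\; (1-\nu)\,\delta_k^{2},
\qquad \nu \in (0,1),
\]
% where f^{*} is a lower bound on f. One shows that, for a suitable \nu and
% sufficiently large accuracy probabilities,
%   E[\Phi_{k+1} - \Phi_k \mid \mathcal{F}_k] \le -\,\theta\,\delta_k^{2} < 0,
% so \Phi_k is a supermartingale bounded from below, and \delta_k \to 0 almost surely.
```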

Key steps of analysis: there exists a constant C (depending on the algorithmic parameters and the Lipschitz constants) such that a constant decrease in f(x_k) is guaranteed once δ_k decreases below some threshold.

Key steps of analysis (continued): with probability depending on the model and estimate accuracy, there exists a constant C (depending on the algorithmic parameters and the Lipschitz constants) such that the behavior differs depending on whether δ_k is larger or smaller than C; with the complementary probability, f(x_k) can increase when δ_k is comparable to C.

Illustration (sequence of slides): progress of f(x_k) toward f(x*) as the trust-region radius δ_k moves above and below the threshold C, with steps succeeding with probabilities of the form (1 - ·) and (1 - ·)(1 - ·) and occasionally failing otherwise; figures not reproduced.

Main convergence result. Theorem (Chen, Menickelly, S., 2014): there exists a constant p_0 > 0, dependent on f and on the algorithmic constants, such that if the models and the function estimates are sufficiently accurate with probability at least p_0, then the gradient of f at the iterates converges to zero with probability 1. Specifically, p_0 depends on L, the Lipschitz constant of f.

Computational results

Biased and unbiased noise examples, again: noisy function samples (unbiased noise and biased noise); processor failures and biased gradient estimates.

Comparison of STORM with a sample-averaging TR method: the relative-noise case.

Comparison of STORM with a sample-averaging TR method: the computation-failures case.

Comparison of STORM with a sample-averaging TR method: the processor-failures case.

Conclusion: we propose a general framework for stochastic inexact first (and second) order methods; no assumption on the distribution or the expectation; models are sufficiently accurate with constant probability; applies to standard deterministic frameworks; applies to cases of biased noise; works well in practice. One can view this as a demonstration of the robustness of a standard framework.

Future work: convergence rate analysis; sampling-rate analysis for randomly sampled models; use of learning guarantees and the Rademacher complexity of model classes; extension to convex optimization; more examples of models that fit the framework.

Thank you!