arxiv:1309.5977v1 [stat.ml] 23 Sep 2013




EFFICIENT SAMPLING FROM TIME-VARYING LOG-CONCAVE DISTRIBUTIONS

By Hariharan Narayanan and Alexander Rakhlin

University of Washington and University of Pennsylvania

We propose a computationally efficient random walk on a convex body which rapidly mixes and closely tracks a time-varying log-concave distribution. We develop general theoretical guarantees on the required number of steps; this number can be calculated on the fly according to the distance from and the shape of the next distribution. We then illustrate the technique on several examples. Within the context of exponential families, the proposed method produces samples from a posterior distribution which is updated as data arrive in a streaming fashion. The sampling technique can be used to track time-varying truncated distributions, as well as to obtain samples from a changing mixture model, fitted in a streaming fashion to data. In the setting of linear optimization, the proposed method has oracle complexity with best known dependence on the dimension for certain geometries. In the context of online learning and repeated games, the algorithm is an efficient method for implementing no-regret mixture forecasting strategies. Remarkably, in some of these examples, only one step of the random walk is needed to track the next distribution.

1. Introduction. Let $K$ be a compact convex subset of $\mathbb{R}^d$ with non-empty interior. Let $\mu_0, \ldots, \mu_t, \ldots$ be a sequence of probability measures with support on $K$. Suppose each probability distribution $\mu_t$ has a density

(1.1) $\frac{d\mu_t(x)}{dx} = \frac{e^{-s_t(x)}}{Z_t}, \qquad Z_t = \int_{x \in K} e^{-s_t(x)}\,dx$

with respect to the Lebesgue measure, where each $s_t(x)$ is a convex function on $K$. This paper proposes a Markov Chain Monte Carlo method for sequentially sampling from these distributions. The method comes with strong mixing time guarantees, and is shown to be applicable to a variety of problems. Observe that, by definition, the distributions $\mu_t$ are log-concave, and thus our work falls within the emerging body of literature on sampling from log-concave distributions.
The problem of sampling from distributions arises in many areas of statistics, most notably in Bayesian inference [37]. In particular, Sequential Monte Carlo methods [13] aim to sample from time-varying distributions. The need for such methods arises, for instance, in the case of online arrival of data: it is desirable to be able to update the posterior distribution at a low computational cost. If the distributions are changing slowly with time, sequential methods can re-use samples from the previous distribution and perform certain re-weighting to track the next distribution, thus saving computational resources. These ideas are exploited in particle filtering methods (see [8, 13] and references therein). Beyond Bayesian inference, other applications of sampling from distributions include simulated annealing, global optimization, and regret minimization.

Supported in part by NSF under grant CAREER DMS-0954737.
AMS 2000 subject classifications: Primary 60K35, 60K35; secondary 60K35.
Keywords and phrases: sample, LaTeX 2ε.

The main critique of MCMC methods is, in many situations, the lack of mixing time analysis. In practice, the number of steps of the chain required to obtain an honest sample from a distribution is mostly calculated based on heuristics. There is a growing body of literature that presents exceptions to these heuristic approaches. Coupling methods, spectral gap methods, as well as the more recent study of positive Ricci curvature, yield geometric decrease of the distance to the desired stationary distribution, a property known as geometric ergodicity. The most well-understood cases in this context are those with a finite or countable state space (see [8, 1]). In contrast, we are interested in a random walk on a non-discrete set.

This paper is focused on a particular circle of problems defined via log-concave distributions. These distributions constitute an important subset of the set of unimodal distributions, a fact that has been recognized within Statistics (see e.g. [41]). We are not the first to study mixing times for such distributions: this line of work started with the breakthrough paper of [14], followed by a series of improvements [16, 3, 5, 6]. However, the recent advances in [1] on sampling from convex bodies give an edge to obtaining stronger guarantees. In particular, we show that we can provably track a changing distribution with a small number of steps (or even only one step) of a random walk, provided that the distribution changes slowly enough. Such a result seems out of reach with other random walk methods due to the lack of scale-free bounds on conductance. Interestingly, the idea of tracking a changing distribution with only one step parallels the technique of following a central path in the theory of interior point methods for optimization.

We assume that we can compute a self-concordant barrier (see Section 5 and Appendix 8) for the set $K$, a requirement that is satisfied in many cases of interest. For instance, the self-concordant barrier can be readily computed in closed form if $K$ is defined via linear and quadratic constraints.
While the availability of the barrier is a stronger assumption than, for instance, access to a separation oracle for $K$, the barrier gives a better handle on the geometry of the space and yields fast mixing of the Markov chain.

In Section 5, we illustrate the method within several diverse application domains. As one of the examples, we consider the problem of updating the posterior with respect to a conjugate prior in an exponential family, where the parameter is taking values in a space of a fixed dimensionality given by the sufficient statistics. The constraints then constitute a prior knowledge about the possible location of the parameter. As another example, we consider sampling from a time-varying truncated distribution, as well as the extension to sampling from mixture models fitted to streaming data. We employ the sampling technique to the classical problem of linear optimization via simulated annealing. The final example concerns the problem of regret minimization where the log-concave distribution arises naturally from the exponential weighting scheme.

The paper is organized as follows. In the next section we study the geometry of the set $K$ induced by a self-concordant barrier and prove a key isoperimetric inequality in the corresponding Riemannian metric. The Markov chain for a given log-concave distribution is defined in Section 3. Conditions on the size of a step are introduced in Section 3.1, and a lower bound on the conductance of the chain is proved in Section 3.2. Section 4 contains main results about tracking time-varying distributions given appropriate measures of change between time steps. Section 5 is devoted to applications. Finally, Sections 6 and 7 contain all the remaining proofs.

2. Geometry Induced by the Self-Concordant Barrier. The Markov chain studied in this paper uses as a proposal a Gaussian distribution with a covariance that approximates well the local geometry of the set $K$ at the current point.
This local geometry plays a crucial role in the theory of interior point methods for optimization, yet for our purposes a handle on the local geometry yields a good lower bound on conductance of the Markov chain. Further intriguing similarities between optimization and sampling will be pointed out throughout the paper. We refer to [30] for an introduction to the theory of interior point methods, a subject centered around the notion of a self-concordant barrier.

Once we have defined a self-concordant barrier for $K$, the local geometry is defined through the Hessian of the barrier at the current point. To be more precise, for any function $F$ on the interior $\mathrm{int}(K)$ having continuous derivatives of order $k$, for vectors $h_1, \ldots, h_k \in \mathbb{R}^d$ and $x \in \mathrm{int}(K)$, for $k \ge 1$, we recursively define

$$D^k F(x)[h_1, \ldots, h_k] := \lim_{\epsilon \to 0} \frac{D^{k-1}F(x + \epsilon h_k)[h_1, \ldots, h_{k-1}] - D^{k-1}F(x)[h_1, \ldots, h_{k-1}]}{\epsilon},$$

where $D^0 F(x) := F(x)$. Let $F$ be a self-concordant barrier of $K$ with a parameter $\nu$ (see Appendix 8 for the definition and Section 5 for examples). The barrier induces a Riemannian metric whose metric tensor is the Hessian of $F$ [3]. In other words, the metric tensor on the tangent space at $x$ assigns to a vector $v$ the length $\|v\|_x := \sqrt{D^2F(x)[v, v]}$, and to a pair of vectors $v, w$, the inner product $\langle v, w\rangle_x := D^2F(x)[v, w]$. The unit ball in $\|\cdot\|_x$ around a point $x$ is called the Dikin ellipsoid [30]. For $x, y \in K$, let $\rho(x, y)$ be the Riemannian distance

$$\rho(x, y) = \inf_{\Gamma} \int_z \|d\Gamma\|_z,$$

where the infimum is taken over all rectifiable paths $\Gamma$ from $x$ to $y$. Let $M$ be the metric space whose point set is $K$ and metric is $\rho$, and define $\rho(S_1, S_2) = \inf_{x \in S_1,\, y \in S_2} \rho(x, y)$.

The first main ingredient of the analysis is an isoperimetric inequality.

Theorem 1. Let $S_1$ and $S_2$ be measurable subsets of $K$ and $\mu$ a probability measure supported on $K$ that possesses a density whose logarithm is concave. Then it holds that

$$\mu((K \setminus S_1) \setminus S_2) \ge \frac{1}{2(1 + 3\nu)}\, \rho(S_1, S_2)\, \mu(S_1)\, \mu(S_2).$$

The theorem ensures that two subsets well-separated in $\rho$ distance must have a large mass between them. A lower bound on conductance of our Markov chain will follow from this isoperimetric inequality. We remark that convexity of the set $K$ is crucial for the above property. A classical example of a non-convex shape with a bottleneck is a dumbbell.
For this body, the above statement clearly fails, and a local random walk on such a body gets trapped in either of the two parts for a long time.

3. The Markov Chain. Let $\mathcal{B}$ be the Borel σ-field on $K$. Given an initial probability measure on $K$, a Markov chain is specified by a collection of one-step transition probabilities $\{P(x, B),\ x \in K,\ B \in \mathcal{B}\}$ such that $x \mapsto P(x, B)$ is a measurable map for any $B \in \mathcal{B}$ and $P_x(\cdot) := P(x, \cdot)$ is a probability measure on $K$ for any $x \in K$. For $x \in \mathrm{int}(K)$, let $G_x^r$ denote the unique Gaussian probability density function on $\mathbb{R}^d$ such that

$$G_x^r(y) \propto \exp\left(-\frac{d\,\|x - y\|_x^2}{r^2} + V(x)\right), \qquad V(x) := \frac{1}{2} \ln\det D^2F(x),$$

and $r$ is a parameter that is chosen according to a condition specified below. The covariance of this distribution is determined by the Hessian of $F$ at the point $x$ (it is proportional to the inverse Hessian), and thus the contour lines are scaled Dikin ellipsoids.

The Markov chain considered in this paper is based on the Dikin Walk introduced by Kannan and Narayanan [1]. Adapted to sampling from log-concave distributions in this paper, the Markov chain is parametrized by a convex function $s$ and a step size $r$. Rather than writing out the unwieldy explicit form of the transition kernel $P_x$, we can give it implicitly as the following random walk:

With probability 1/2, set $w := x$.
With probability 1/2, sample $z$ from $G_x^r$ and:
  If $z \notin K$, let $w := x$.
  If $z \in K$, let $w := z$ with probability $\min\left(1, \frac{G_z^r(x)\exp(s(x))}{G_x^r(z)\exp(s(z))}\right)$, and $w := x$ otherwise.

The Markov chain is lazy, as it stays at the current point with probability at least 1/2. This ensures uniqueness of the stationary distribution [4]. Furthermore, a simple calculation shows that the detailed balance conditions are satisfied with respect to a stationary distribution $\mu$ whose density (with respect to the Lebesgue measure) is proportional to $\exp(-s(x))$. Indeed, to see that $\mu(x)P_x(dz) = \mu(z)P_z(dx)$, it suffices to observe that

$$\exp(-s(x))\, G_x^r(z) \min\left(1, \frac{G_z^r(x)\exp(s(x))}{G_x^r(z)\exp(s(z))}\right) = \exp(-s(z))\, G_z^r(x) \min\left(1, \frac{G_x^r(z)\exp(s(z))}{G_z^r(x)\exp(s(x))}\right).$$

Therefore the Markov chain is reversible and has the desired stationary measure $\mu$.

The value of $r$ has a specific meaning: most of the $y$'s sampled from $G_x^r$ are within a thin Dikin shell of radius proportional to $(\mathbb{E}\|x - y\|_x^2)^{1/2} = r/\sqrt{2}$ by measure-concentration arguments. We will therefore refer to $r$ as the effective step size.

An important and non-trivial result from the theory of interior point methods is that the unit Dikin ellipsoid is contained in the set $K$ and gives a good approximation to the local geometry of the set (see Figure 1 below). Thanks to this fact, the sampling procedure has in general better mixing properties than the Ball Walk [4, 38].

3.1. Step Size Conditions.
The analysis of the Markov chain requires the step size $r$ to be not too large, so that substantially different transition probability functions occur only for points that are far apart. The precise upper bounds on $r$ depend on the convex function $s(x)$ and can be calculated on the fly when we move to the setting of a time-varying function. We give four conditions:

Sufficient Condition 1 (Linear Functions). If $s$ is linear, we may set $r = 1/d$.

Sufficient Condition 2 (Lipschitz Functions). For a function $s$ that is $L$-Lipschitz with respect to the Euclidean norm, we may set the step size $r = \min\{1/d,\ 1/L\}$.

Sufficient Condition 3 (Smooth Functions). Suppose $s$ has Lipschitz-continuous gradients: there exists $\sigma > 0$ such that $\|\nabla s(x) - \nabla s(y)\| \le \sigma \|x - y\|$. We may then set the step size to be $r = \min\{1/d,\ 1/\sqrt{\sigma}\}$.
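The lazy random walk of Section 3 can be sketched in code as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes $K$ is a polytope $\{x : Ax \le b\}$ equipped with the logarithmic barrier, and the names `barrier_hessian`, `log_proposal`, `dikin_step` and `run_chain` are hypothetical. The proposal is drawn as a Gaussian with mean $x$ and covariance $(r^2/(2d))\,(D^2F(x))^{-1}$, which is what a density proportional to $\exp(-d\|x-y\|_x^2/r^2)$ amounts to.

```python
import numpy as np

def barrier_hessian(A, b, x):
    """Hessian of the assumed log-barrier F(x) = -sum_j log(b_j - <a_j, x>)."""
    slack = b - A @ x
    return (A / slack[:, None] ** 2).T @ A

def log_proposal(A, b, x, y, r):
    """log G_x^r(y), dropping the additive constant that is the same for every x."""
    H = barrier_hessian(A, b, x)
    diff = y - x
    _, logdet = np.linalg.slogdet(H)
    return -x.size * (diff @ H @ diff) / r**2 + 0.5 * logdet

def dikin_step(A, b, s, x, r, rng):
    """One lazy step of the walk targeting a density proportional to exp(-s)."""
    if rng.random() < 0.5:                      # laziness: stay put with prob. 1/2
        return x
    d = x.size
    L = np.linalg.cholesky(barrier_hessian(A, b, x))   # H = L L^T
    g = rng.standard_normal(d)
    # z ~ N(x, (r^2 / (2d)) H^{-1}), matching exp(-d ||x - z||_x^2 / r^2)
    z = x + (r / np.sqrt(2.0 * d)) * np.linalg.solve(L.T, g)
    if np.any(A @ z >= b):                      # proposal left K: stay at x
        return x
    # log of G_z(x) exp(s(x)) / (G_x(z) exp(s(z)))
    log_ratio = (log_proposal(A, b, z, x, r) + s(x)
                 - log_proposal(A, b, x, z, r) - s(z))
    if np.log(rng.random()) < min(0.0, log_ratio):
        return z
    return x

def run_chain(A, b, s, x0, r, n_steps, seed=0):
    """Run n_steps of the walk from x0 and return the visited points."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    path = np.empty((n_steps, x.size))
    for i in range(n_steps):
        x = dikin_step(A, b, s, x, r, rng)
        path[i] = x
    return path
```

As a usage example, for the box $[-1, 1]^2$ one may take `A = np.vstack([np.eye(2), -np.eye(2)])` and `b = np.ones(4)`; for the 5-Lipschitz linear function `s = lambda x: 5.0 * x[0]`, Condition 2 suggests `r = min(1/2, 1/5)`.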

These three conditions can be shown to follow from a more general sufficient step size condition that is based on local information:

Sufficient Condition 4 (General Condition). Fix constants $C, C' > 0$. Given the convex function $s(x)$, the step size $r \le \min\{1/d,\ \tilde{r}\}$ is a valid choice if there exists a linear function $\langle g, x\rangle$ such that

$$\tilde{r} \le \sup\left\{ r : \forall z, w \in K \text{ with } \|z - w\|_z \le C r,\ \ |s(z) - s(w) - \langle g, z - w\rangle| < C' \right\}.$$

The condition says that for two points, with one being inside the $O(r)$-Dikin ellipsoid around the other point, the function is within a constant of being linear. It follows from the last condition that, for instance, if $s(x) = \langle b, x\rangle + a(x)$ is a sum of a linear and a non-linear Lipschitz part, the step size is only affected by the Lipschitz constant of the non-linear part.

It is simple to verify that the step size in Condition 2 satisfies Condition 4. Indeed, for any $w$ such that $\|z - w\|_z \le C r$, we have $\|z - w\| \le C r R$ (where $R$ is the radius of the largest ball contained in $K$). Take $g_z$ and $g_w$ to be any subgradients of $s$ at $z$ and $w$, respectively. We then have

$$|s(z) - s(w) - \langle g_w, z - w\rangle| \le |\langle g_z - g_w, z - w\rangle| \le 2L\|z - w\|.$$

Notice that for Condition 3, the above calculation becomes $|\langle g_z - g_w, z - w\rangle| \le \sigma\|z - w\|^2$.

In the remainder of this paper, $C$ will denote a universal constant that may change from line to line. The exact value of the final constant in Lemma 4 below can be traced in the proofs; we omit this calculation for the sake of brevity.

3.2. Conductance of the Markov Chain. In order to show rapid mixing of the proposed Markov chain, we prove a lower bound on its conductance

$$\Phi := \inf_{\mu(S_1) \le 1/2} \frac{\int_{S_1} P_x(K \setminus S_1)\, d\mu(x)}{\mu(S_1)},$$

where $P_x$ is the one-step transition function defined earlier. Once such a lower bound is established, the following general result on the reduction of distance between distributions will imply exponentially fast convergence.

Theorem 2 (Lovász-Simonovits [4]). Let $\gamma_0$ be the initial distribution for a lazy reversible ergodic Markov chain whose conductance is $\Phi$ and stationary measure is $\gamma$. For every bounded $f$, let $\|f\|_\gamma := \sqrt{\int_K f(x)^2\, d\gamma(x)}$. For any fixed $f$, let $Ef$ be the map that takes $x$ to $\int_K f(y)\, dP_x(y)$.
Then if $\int_K f(x)\, d\gamma(x) = 0$, it holds that

$$\|E^k f\|_\gamma \le \left(1 - \frac{\Phi^2}{2}\right)^k \|f\|_\gamma.$$

To prove a lower bound on conductance $\Phi$, we first relate the Riemannian metric $\rho$ to the proposed Markov chain. Intuitively, the following result says that for close-by points, their transition distributions cannot be far apart in the total variation distance $d_{TV}$.

Lemma 3. If $x, y \in K$ and $\rho(x, y) \le \frac{r}{C\sqrt{d}}$ for some constant $C$, then

$$d_{TV}(P_x, P_y) \le 1 - \frac{1}{C'}$$

for some constant $C'$.

Lemma 3 together with the isoperimetric inequality of Theorem 1 gives a lower bound on conductance of the Markov chain.

Lemma 4. Let $\mu$ be a log-concave distribution with support on $K$ whose density with respect to the Lebesgue measure is proportional to $\exp\{-s(x)\}$, and suppose an appropriate step size condition (Section 3.1) for the Markov chain is satisfied. Then there exists a constant $C > 0$ such that the conductance of the above Markov chain is bounded below as

$$\Phi \ge \frac{r}{C\nu\sqrt{d}}.$$

We remark that the step size $r$ enters the lower bound on $\Phi$. While we would like the steps to be large, the conditions outlined earlier dictate a limitation on how large $r$ can be. In particular, we always have $r \le 1/d$. The step size needs to be even smaller for functions $s$ for which a linear approximation is poor.

4. Tracking the Distributions. Having specified the Markov chain and the step size, we now turn to the problem of tracking a sequence of distributions $\mu_1, \ldots, \mu_t, \ldots$. For each $t \ge 1$, define a Markov chain with parameters $r_t$ and $s_t$, and let its transition kernel be denoted by $P_t(x, B)$ for $x \in K$ and $B \in \mathcal{B}$. Let $\Phi_t$ denote the conductance of this chain. The $t$-th chain will be run for $\tau_t$ steps starting from the end of the chain at time $t - 1$. Formally, let the $i$-th step of the $t$-th chain be denoted by the random variable $X_{t,i}$. Define $\tau_0 = 0$ and let $\sigma_{0,0}$ be the initial distribution of $X_{0,0}$. Then $X_{t,i}$ has distribution

$$\sigma_{0,0}\, P_1^{\tau_1} \cdots P_{t-1}^{\tau_{t-1}} P_t^{i},$$

and we have made the identification $X_{s,\tau_s} = X_{s+1,0}$, gluing the successive chains together. Let the distribution of $X_{t,i}$ be denoted by $\sigma_{t,i}$. By the definition of the chain, $\sigma_{t,i}$ is a distribution with bounded density, supported on $K$.

Fig 1. Steps of the Dikin Walk. The next point is sampled from a Gaussian distribution with a shape (contours depicted with dashed lines) corresponding to Dikin ellipsoids. These ellipsoids approximate well the local geometry.

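4.1. Measuring the Change. Let $\|\cdot\|_t$ denote the $L_2$ norm with respect to the measure $\mu_t$, defined as $\|f\|_t = \left(\int_K f^2\, d\mu_t\right)^{1/2}$ for a measurable function $f : K \to \mathbb{R}$. Further, let $\|\cdot\|_K$ denote the supremum norm $\|f\|_K = \sup_{x \in K} |f(x)|$ and let

(4.1) $\beta_{t+1} = \max\left\{ \|d\mu_t/d\mu_{t+1}\|_K,\ \|d\mu_{t+1}/d\mu_t\|_K \right\}.$

This ratio provides an upper bound on the point-wise change of the density function. A straightforward way to upper bound $\beta_{t+1}$ is by writing

$$\sup_{x \in K} \frac{e^{-s_t(x)} \int_K e^{-s_{t+1}(x)}\, dx}{e^{-s_{t+1}(x)} \int_K e^{-s_t(x)}\, dx} \le \sup_{x \in K} e^{2|s_t(x) - s_{t+1}(x)|}$$

and, hence,

(4.2) $\log \beta_{t+1} \le 2\,\|s_t(x) - s_{t+1}(x)\|_K.$

Another way to measure the change in successive distributions is with respect to the $L_2$ norm:

(4.3) $\alpha_{t+1} = \|d\mu_t/d\mu_{t+1}\|_{t+1}.$

In contrast to the point-wise change, the ratio $\alpha_{t+1}$ is more difficult to calculate. In this respect, the following result, which follows from the proof of [5, 20], will be useful:

Lemma 5. Let $s_t$ be a convex function and $s_{t+1} = (1 - \delta)^{-1} s_t$. Let $\mu_t$ and $\mu_{t+1}$ be defined as in (1.1). Then

$$\alpha_{t+1} \le \left(1 + \frac{\delta^2}{1 - 2\delta}\right)^{d/2}.$$

In particular, if $\delta \le d^{-1/2} \le 1/3$, then $\alpha_{t+1} \le 5$.

We remark that the ratio between $\mu_t$ and $\mu_{t+1}$ measured in the supremum norm may be exponentially large, while the $L_2$ change is small. As in [5, 20], this fact will be crucial in this paper when we study simulated annealing.

4.2. Tracking the Distributions: Main Results. Denote the error in approximating the stationary distribution at the end of the $t$-th chain by

(4.4) $\xi_t := \left\| \frac{d\sigma_{t,\tau_t}}{d\mu_t} - 1 \right\|_t$

and let

$$\varpi_t := \frac{r_t^2}{C d \nu^2}.$$

Theorem 6. The errors $\xi_t$ satisfy the recurrence

(4.5) $\xi_t \le (1 - \varpi_t)^{\tau_t} \left( \beta_t^{3/2}\, \xi_{t-1} + \beta_t(\beta_t - 1) \right)$

for any $t \ge 1$.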

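Proof of Theorem 6. We iteratively apply Theorem 2 with $f = \frac{d\sigma_{t,j}}{d\mu_t} - 1$ and the stationary distribution $\gamma = \mu_t$, and observe that $E$ takes $\frac{d\sigma_{t,j}}{d\mu_t} - 1$ to $\frac{d\sigma_{t,j+1}}{d\mu_t} - 1$. Then from Lemma 4, with $\varpi_t = r_t^2/(Cd\nu^2)$, for $t \ge 1$ and $i \ge 1$,

$$\left\| \frac{d\sigma_{t,i}}{d\mu_t} - 1 \right\|_t \le (1 - \varpi_t)^{i} \left\| \frac{d\sigma_{t,0}}{d\mu_t} - 1 \right\|_t.$$

Using the first part of Lemma 13 (see Section 6),

$$\left\| \frac{d\sigma_{t,0}}{d\mu_t} - 1 \right\|_t \le \beta_t^{3/2} \left\| \frac{d\sigma_{t,0}}{d\mu_{t-1}} - 1 \right\|_{t-1} + \beta_t(\beta_t - 1),$$

concluding the proof.

An alternative recurrence, using the second part of Lemma 13, is

$$\xi_t \le (1 - \varpi_t)^{\tau_t} \left( \sqrt{\beta_t}\, \xi_{t-1} + \sqrt{\beta_t - 1} \right),$$

which is better for large $\beta_t$ but worse for $\beta_t \approx 1$.

We would like to adaptively choose $\tau_t$ to make the right-hand side of (4.5) small. While the value of the error $\xi_{t-1}$ at the previous round is not available for this purpose, let us maintain an upper bound $u_{t-1}$ on this error. Thus, we may write $\tau_t$ as a function $\tau_t(u_{t-1}, s_t, r_t, \beta_t)$. Suppose at round $t = 0$ we ensure that $\xi_0 \le u_0$. Then, recursively, we may compute $u_t$ as the upper bound in (4.5):

(4.6) $u_t := (1 - \varpi_t)^{\tau_t} \left( \beta_t^{3/2}\, u_{t-1} + \beta_t(\beta_t - 1) \right).$

Then, given the initial condition, we have $\xi_t \le u_t$ for all $t \ge 0$.

Let us consider some consequences of Theorem 6. In particular, we are interested in situations when we can track the distributions with only one step of the random walk.

Corollary 7. Let $\tau_t = 1$ for all $t \ge 1$ and suppose $\xi_0 \le u_0 = \beta_0(\beta_0 - 1)/\varpi_0$ with $\varpi_0 = 1$. Assume that $\beta_t$ is non-decreasing and $\varpi_t$ is non-increasing in $t$, and suppose

(4.7) $\beta_t^{3/2} \le 1 + \varpi_t$

for all $t \ge 1$. Then we have

$$\xi_t \le u_t = \frac{\beta_t(\beta_t - 1)}{\varpi_t}$$

for all $t \ge 0$. In particular, (4.7) is satisfied whenever $\beta_t - 1 \le 0.4\,\varpi_t$.

The proof of the above corollary follows from the more general result:

Corollary 8. Fix a sequence $\epsilon_0, \ldots, \epsilon_t, \ldots$ of positive target accuracies and assume $\xi_0 \le \epsilon_0$. It is then enough to set

(4.8) $\tau_t = \frac{1}{\varpi_t} \log\left( \frac{\beta_t^{3/2}\, \epsilon_{t-1}}{\epsilon_t} + \frac{\beta_t(\beta_t - 1)}{\epsilon_t} \right)$

in order to ensure $\xi_t \le \epsilon_t$ for each $t \ge 0$.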
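The step count of Corollary 8 and the error recurrence of Theorem 6 can be evaluated numerically along the following lines. This is a hedged sketch: the function names are hypothetical, `rate` stands for the contraction quantity $r_t^2/(Cd\nu^2)$ appearing in Theorem 6, and the default `C=1.0` is only a placeholder, since the paper does not state the value of the universal constant.

```python
import math

def contraction_rate(r, d, nu, C=1.0):
    """The quantity r^2 / (C d nu^2); C=1.0 is a placeholder, not the paper's constant."""
    return r**2 / (C * d * nu**2)

def steps_for_accuracy(beta, eps_prev, eps_next, rate):
    """Number of steps in the spirit of (4.8), rounded up to an integer."""
    x = (beta**1.5 * eps_prev + beta * (beta - 1.0)) / eps_next
    return max(0, math.ceil(math.log(x) / rate))

def error_bound(beta, u_prev, tau, rate):
    """Right-hand side of the recurrence (4.6) for the upper bound u_t."""
    return (1.0 - rate) ** tau * (beta**1.5 * u_prev + beta * (beta - 1.0))
```

Because $(1 - x)^\tau \le e^{-x\tau}$ for $0 \le x < 1$, running `steps_for_accuracy(...)` steps makes `error_bound(...)` no larger than `eps_next` whenever the previous bound is at most `eps_prev`.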

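Proof. Immediate by writing $u_t = (1 - \varpi_t)^{\tau_t} \left( \beta_t^{3/2}\, \epsilon_{t-1} + \beta_t(\beta_t - 1) \right) \le \epsilon_t$, where $\varpi_t = r_t^2/(Cd\nu^2)$, solving for $\tau_t$, and using the approximation $\log(1/(1 - \varpi_t)) \approx \log(1 + \varpi_t)$.

We now consider the case when one has control on the $L_2$ norm $\alpha_t$ of the change between successive distributions. First, observe that closeness of the distributions in the $\|\cdot\|_t$ norm implies closeness in total variation distance, as

(4.9) $\int_K |d\sigma_{t,i} - d\mu_t| = \int_K \left| \frac{d\sigma_{t,i}}{d\mu_t} - 1 \right| d\mu_t \le \left\| \frac{d\sigma_{t,i}}{d\mu_t} - 1 \right\|_t.$

Proposition 9. Fix a sequence $\epsilon_0, \ldots, \epsilon_t, \ldots$ of positive target accuracies and assume $d_{TV}(\sigma_{0,0}, \mu_0) \le \epsilon_0$. Suppose we set

(4.10) $\tau_t = \frac{1}{\varpi_t} \log\left( \frac{\alpha_t}{\epsilon_t} \right).$

Then the total variation distance between $\sigma_{t,\tau_t}$ and $\mu_t$ is bounded as

(4.11) $d_{TV}(\sigma_{t,\tau_t}, \mu_t) \le \sum_{s=0}^{t} \epsilon_s$

for each $t \ge 0$.

Proof. For any $t \ge 1$, let us write

(4.12) $\sigma_{t,\tau_t} = \mu_t + \gamma_t$

with a signed measure $\gamma_t = \sigma_{t,\tau_t} - \mu_t$. By way of induction, suppose (4.11) holds for time $t$. Consider the operator $E_{t+1}$ corresponding to the random walk of the $(t+1)$-st chain. The operator acts on a function $f$ by taking $f$ to $\int_K f(y)\, dP_{t+1}(x, y)$. Then applying Theorem 2 to the function $d\mu_t/d\mu_{t+1} - 1$, we have

$$\left\| E_{t+1}^{\tau_{t+1}} \left( \frac{d\mu_t}{d\mu_{t+1}} - 1 \right) \right\|_{t+1} \le \epsilon_{t+1}$$

by the choice of $\tau_{t+1}$ and the definition of $\alpha_{t+1}$. That is, upon the action of $E_{t+1}^{\tau_{t+1}}$, $\mu_t$ is mapped to $\mu_{t+1}$ within an error of at most $\epsilon_{t+1}$ in the $L_2$ sense (and, hence, in the total variation sense). Since the operator $E_{t+1}$ is non-expanding in the $L_1$ sense, the total variation of $\gamma_t$ does not increase under the action of $E_{t+1}^{\tau_{t+1}}$. In view of the inductive hypothesis for step $t$, we conclude

$$d_{TV}(\sigma_{t+1,\tau_{t+1}}, \mu_{t+1}) \le \sum_{s=0}^{t} \epsilon_s + \epsilon_{t+1},$$

as desired.

5. Applications. Before diving into the applications of the random walk, let us give several examples of sets $K$ for which the self-concordant barrier $F$ and its Hessian can be easily calculated. In the following examples, assume that $K$ has non-empty interior.

Example 10. Suppose $K$ is given by $m$ linear constraints of the form $\langle a_j, x\rangle \le b_j$, $j = 1, \ldots, m$. Then $F(x) = -\sum_{j=1}^{m} \log(b_j - \langle a_j, x\rangle)$ is a self-concordant barrier with parameter $\nu = m$. The Hessian is easily computable:

$$D^2F(x) = \sum_{j=1}^{m} \frac{a_j a_j^T}{(b_j - \langle a_j, x\rangle)^2}.$$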
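The barrier of Example 10 and its Hessian translate directly into a few lines of code. The sketch below is illustrative only; the function names are hypothetical and the polytope is assumed to be given as a matrix `A` (rows $a_j$) and a vector `b`. The last helper checks membership in the open Dikin ellipsoid $\{y : \|y - x\|_x < \text{radius}\}$ defined in Section 2.

```python
import numpy as np

def log_barrier(A, b, x):
    """F(x) = -sum_j log(b_j - <a_j, x>) for x in the interior of {x : Ax <= b}."""
    slack = b - A @ x
    if np.any(slack <= 0):
        raise ValueError("x must lie in the interior of K")
    return -np.sum(np.log(slack))

def log_barrier_hessian(A, b, x):
    """D^2 F(x) = sum_j a_j a_j^T / (b_j - <a_j, x>)^2."""
    slack = b - A @ x
    return (A / slack[:, None] ** 2).T @ A

def in_dikin_ellipsoid(A, b, x, y, radius=1.0):
    """True if the local norm ||y - x||_x is below the given radius."""
    v = y - x
    return bool(v @ log_barrier_hessian(A, b, x) @ v < radius**2)
```

For this barrier, every point of the open unit Dikin ellipsoid satisfies all constraints strictly, since $\|v\|_x < 1$ forces $|\langle a_j, v\rangle| < b_j - \langle a_j, x\rangle$ for each $j$.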

Example 11. Let $K = \{x \in \mathbb{R}^d : f_j(x) \le 0,\ j = 1, \ldots, m\}$ where each $f_j$ is a convex quadratic form. Then $F(x) = -\sum_{j=1}^{m} \log(-f_j(x))$ is a self-concordant barrier with parameter $m$. As an example, the function $-\log(1 - \|x\|^2)$ is a self-concordant barrier for the unit Euclidean ball $\{x : \|x\|^2 - 1 \le 0\}$, with parameter $\nu = 1$, and the Hessian is given by

$$D^2F(x) = \frac{2}{1 - \|x\|^2}\, I + \frac{4}{(1 - \|x\|^2)^2}\, x x^T.$$

Importantly, there always exists a self-concordant barrier with $\nu = O(d)$; yet, for some convex sets (such as the sphere) the parameter can even be constant. Self-concordant barriers can be combined: if $F_j$ is $\nu_j$-self-concordant for $K_j$, $j = 1, \ldots, m$, then $\sum_j F_j$ is $\sum_j \nu_j$-self-concordant for the intersection $\cap_i K_i$, given that it has nonempty interior. Thus, closed forms for the Hessian of the barrier, required for defining $G_x^r$ in our Markov chain, can be calculated for many sets $K$ of interest. We refer to [30, 31] for further powerful methods for constructing the barriers.

5.1. Sampling from Posterior in Exponential Families. Suppose data $y_1, y_2, \ldots \in \mathcal{Y}$ are distributed i.i.d. according to a member of an exponential family with natural parameter $x$:

$$p(y \mid x) = \exp\{\langle x, T(y)\rangle - A(x)\}\, h(y),$$

where $A(x) = \log \int_{\mathcal{Y}} h(y) \exp\{\langle x, T(y)\rangle\}\, dy$ is a convex function and $T : \mathcal{Y} \to \mathbb{R}^d$ is a sufficient statistic. Suppose $x \in K$; that is, we have some knowledge about the support of the parameter. We have in mind the situation where data arrive one at a time and we are interested in sampling from the associated posterior distributions. The likelihood function after seeing $y_1, \ldots, y_t$ is

$$\ell(x) \propto \exp\left\{ \left\langle x, \sum_{i=1}^{t} T(y_i) \right\rangle - t A(x) \right\}$$

and, together with a conjugate prior $\pi_{\kappa_1, \kappa_2}(x) \propto \exp\{\langle x, \kappa_1\rangle - \kappa_2 A(x)\}$ for some $(\kappa_1, \kappa_2) \in \mathbb{R}^{d+1}$, we obtain the posterior distribution at time $t$

$$p_t(x \mid y) \propto \exp\left\{ \left\langle x, \kappa_1 + \sum_{i=1}^{t} T(y_i) \right\rangle - (t + \kappa_2) A(x) \right\}.$$

We apply the sampling technique to this scenario by defining

$$s_0(x) = -\langle x, \kappa_1\rangle + \kappa_2 A(x), \qquad s_t(x) = -\left\langle x, \kappa_1 + \sum_{i=1}^{t} T(y_i) \right\rangle + (t + \kappa_2) A(x).$$

It remains to calculate the number of steps required to track the distributions as additional data arrive one-by-one. Let $L$ be the Lipschitz constant of $A(x)$ over $K$ with respect to the Euclidean norm, and let us assume $L$ to be finite.
Then Condition 4 is satisfied with $r_t = \min\left\{ \frac{1}{(t + \kappa_2)L},\ \frac{1}{d} \right\}$. Furthermore, we may set

$$\beta_t = \sup_{x \in K} \exp\left\{ 2\,|\langle x, T(y_t)\rangle - A(x)| \right\},$$

a quantity that depends on the observed data. Importantly, we do not need to provide an a priori data-independent bound of this type, which might not be finite.

Suppose we would like to maintain a constant level $\epsilon > 0$ of accuracy at each step. Corollary 8 guarantees this accuracy if each chain is run for

$$\tau_t = \frac{1}{\varpi_t} \log\left( \beta_t^{3/2} + \frac{\beta_t(\beta_t - 1)}{\epsilon} \right) = O\left( d\nu^2 \max\{(t + \kappa_2)^2 L^2,\ d^2\} + \log(1/\epsilon) \right)$$

steps, where $\varpi_t = r_t^2/(Cd\nu^2)$. One of the features of this bound is a relatively benign dependence on the dimension $d$, especially if the geometry of the set $K$ allows the parameter $\nu = O(1)$, as in the case of a sphere. On the negative side, the number of steps needed after seeing $t$ data points is proportional to $t^2$. Such an adverse dependence, however, is to be expected as the posterior distribution becomes concentrated very quickly.

We now demonstrate that stronger results can be achieved under additional assumptions via Condition 3. Suppose that $A$ is smooth: there exists $H \succeq 0$ such that

$$A(w) \le A(x) + \langle \nabla A(x), w - x\rangle + (w - x)^T H (w - x)$$

for any $w, x \in K$. This is a natural assumption, as the second derivative of the log normalization function $A$ corresponds to the variance of the random variable with the given parameter; furthermore, $A$ is differentiable. Let $\lambda_{\max}$ be the largest eigenvalue of $H$. Then the condition yields $r_t = \frac{C}{\sqrt{(t + \kappa_2)\lambda_{\max}}}$. To obtain $\epsilon$-accuracy, it suffices to set

$$\tau_t = O\left( d\nu^2 \max\{(t + \kappa_2)\lambda_{\max},\ d^2\} + \log(1/\epsilon) \right),$$

which has only linear dependence on the size of the data seen so far.

We remark that each step of the random walk requires evaluation of the log-partition function $A(x)$. If this function is not available in closed form, we may approximate the value $A(x)$ for each query $x$. In order to do this, we may run an additional sampling procedure with $s(x) = -\langle x, T(y)\rangle$. Alternatively, we may appeal to known methods for this problem, such as Hit-and-Run [38].

5.2. Sampling from Drifting Truncated Distributions. In the previous example, we employed the Markov chain to sample a parameter from a log-concave posterior. We now turn to the question of sampling from a log-concave distribution restricted to a convex set. This problem has a long history (see e.g. [11, 17]), and it is recognized that sampling from truncated distributions is difficult even for nice forms such as the Normal distribution.
One successful approach to this problem is the Gibbs sampling method [36, 10], yet the rate of convergence is not generally available. The MCMC method of this paper yields a provably fast algorithm for such situations. Furthermore, we can track a drifting distribution over $K$ with a small number of steps.

For illustration purposes, we study a simple example of a truncated Normal distribution; the same techniques, however, apply more generally. To simplify calculations, suppose the distributions $\mu_t$ are defined to be $N(c_t, I)$ over a convex compact set $K \subset \mathbb{R}^d$ and suppose the mean $c_t$ is drifting within a Euclidean ball of radius $R$. With the definition in (1.1) we have $s_t(x) = \frac{1}{2}\|x - c_t\|^2$. Define the drift $\delta_t = \|c_t - c_{t-1}\|$. In view of (4.2),

$$\log \beta_t \le \sup_{x \in K} \|c_t - c_{t-1}\| \cdot \|2x - c_t - c_{t-1}\| \le C_{R,K}\, \delta_t,$$

where $C_{R,K}$ depends on the radius $R$ and the radius of a smallest Euclidean ball enclosing $K$. In the same manner, the Lipschitz constant of $s_t(x)$ over $K$ can be upper bounded by $L_{R,K}$ that depends solely on the two radii. We may thus set the step size to be $r = \min\left\{ \frac{1}{d},\ \frac{1}{L_{R,K}} \right\}$. If we aim for a fixed target accuracy $\epsilon$ for all $t$, by Corollary 8, it is enough to make

(5.1) $\tau_t = \frac{1}{\varpi_t} \log\left( \beta_t^{3/2} + \frac{\beta_t(\beta_t - 1)}{\epsilon} \right)$

steps, where $\varpi_t = r^2/(Cd\nu^2)$. In the case that the drift $\delta_t$ is small enough, only one step is sufficient. To quantify the regime when this happens, observe that $\beta_t \le \exp\{C_{R,K}\, \delta_t\} \le 1 + C\delta_t$, and it is then enough to require

$$\delta_t = O\left( \frac{\min\{1/d^2,\ 1/L_{R,K}^2\}}{d\nu^2} \right) = O\left( \frac{r^2}{d\nu^2} \right)$$

in view of (4.7). It is quite remarkable that the one-step random walk can track the changing distribution up to the accuracy $O\left( \frac{\delta_t\, d\nu^2}{r^2} \right)$, proportional to the size of the drift. Of course, better accuracy can be achieved by performing more steps, as per Corollary 8.

Another related application is to modeling with mixtures of log-concave distributions. Such models have been successful in clustering [7, 41], with a mixture of normal distributions being a classical example [15]. A mixture of parametric log-concave distributions can be written as $\sum_{i=1}^{k} \alpha_i \pi_i(\theta_i; x)$; here $\alpha_i$ are positive mixing weights summing to one, and $\pi_i$ are distributions on $K$ parametrized by $\theta_i$. A classical method for fitting models to data is the EM algorithm. Given that the parameters $\{\theta_i\}_{i=1}^{k}$ and the mixing weights $\{\alpha_i\}_{i=1}^{k}$ have been estimated from data, one may require random samples from this model for integration or other purposes. Given our procedure for sampling from a single log-concave distribution, one may simply pick the mixture component according to the weights $\alpha_i$ and then sample from that component. The situation becomes interesting in the case of online arrival of data, when we need to re-compute the EM solution in light of additional data. By the arguments of [35, 6], the solution to clustering problems (the analysis was performed for square loss) is stable in the following sense: addition of $o(\sqrt{n})$ new data to a sample of size $n$ is unlikely to drastically move the solution (the argument is based on uniqueness of the maximum of an empirical process).
This in turn implies that the parameters $\{\theta_i\}$ are unlikely to change by a large amount, and we may thus use the method of sampling from a drifting distribution described earlier. We also remark that the method can be easily parallelized since the Markov chains for the $k$ components do not interact.

5.3. Simulated Annealing for Convex Optimization. Let $f(x)$ be a proper convex 1-Lipschitz function. The aim of convex optimization is to find $\bar{x}$ with the property $f(\bar{x}) - \min_{x \in K} f(x) \le \epsilon$ for a given target accuracy $\epsilon > 0$. We consider the special case of a linear function $f(x) = \langle l, x \rangle$, known as Linear Optimization. Complexity of an optimization procedure is often measured in terms of oracle calls, that is, queries about the unknown function. A query about the function value is known as zeroth-order information, while a query about a subgradient at a point is known as first-order information. In the case that the oracle answer is given without noise, it is known that the complexity scales as $O(\mathrm{poly}(d, \log(1/\epsilon)))$. The state-of-the-art result here is the method of [20, 25], which attains the $d^{4.5}$ dependence on the dimension. We now apply our machinery to obtain an $O(\nu^2 d^{3.5} \log(1/\epsilon))$ method. In particular, this yields an improved $d^{3.5}$ dependence on the dimension for the case when $K$ has a favorable geometry: there exists a self-concordant barrier with a parameter $\nu = O(1)$.

We use the annealing scheme of [20]. To this end, we set $s_t = (1 - d^{-1/2})^{-t} f$ and observe that the assumption of Lemma 5 is satisfied with $\delta = d^{-1/2}$. Since the functions are linear, we may set the step size $r = 1/d$ for all $t$. Hence, $\alpha \le 5$ whenever $d > 8$ (and a different constant can be obtained for smaller $d$ from the proof). By Proposition 9 with a constant accuracy $\epsilon_t = \epsilon(\sqrt{d}\log(d/\epsilon))^{-1}$, by making
$$\tau_t = C d^3 \nu^2 \log\left(\frac{5\sqrt{d}\,\log(d/\epsilon)}{\epsilon}\right) \tag{5.2}$$
steps for $t = 1, \ldots, k$, we guarantee
$$d_{TV}(\sigma_{k,\tau_k}, \mu_k) \le k\epsilon\left(\sqrt{d}\log(d/\epsilon)\right)^{-1}. \tag{5.3}$$
According to [20, Lemma 4.1], if $X$ is chosen from a distribution with density proportional to $\exp\{-T^{-1}\langle l, x\rangle\}$, with $\|l\| = 1$ and some temperature $T > 0$, then
$$E(\langle l, X \rangle) - \min_{x \in K}\langle l, x \rangle \le dT.$$
Hence, we take the desired temperature to be $T = \epsilon/d$, and the number of chains that permits the annealing schedule to reach this temperature can be calculated as $k = \sqrt{d}\log(d/\epsilon)$. In view of (5.3), the final output of the procedure is an $\epsilon$-accurate solution to the optimization problem. The complexity of the method is then $O(d^{3.5}\nu^2 \log^2(d/\epsilon))$.

This result can be extended to Lipschitz convex functions beyond linear optimization. However, the step size condition for convex Lipschitz functions requires the steps to be $O(1/\epsilon)$ towards the end of the annealing schedule. This in turn implies only a suboptimal $\tilde{O}(d\nu^2/\epsilon^2)$ complexity. It is an open question whether the Dikin Walk can handle such annealing schedules in a more graceful manner.

5.4. Sequential Prediction. Another application of the proposed sampling technique is to the problem of sequential prediction with convex cost functions. Within this setting, the learner (or, the Statistician) is tasked with making a series of predictions while observing a sequence of outcomes on which we place no distributional assumptions. The goal of the learner is to incur cost comparable to that of a fixed strategy chosen in hindsight after observing the data. Initially studied by Hannan [18], Blackwell [4], and Cover [9], the problem of achieving low regret for all sequences has received much attention in the last two decades, and we refer the reader to [7] for a comprehensive treatment.
As we show in this section, a strategy that exponentially down-weights the decisions with large costs is a good regret-minimization strategy, and this exponential form is amenable to the sampling technique of this paper whenever the costs are convex.

More specifically, let $K \subset \mathbb{R}^d$ be a convex compact set of decisions of the learner. Let $l_1, \ldots, l_T$ be a sequence of unknown cost functions $l_t : K \to \mathbb{R}$. On round $t$, the learner chooses a distribution (or, a mixed strategy) $\mu_{t-1}$ supported on $K$ and plays a decision $Y_t \sim \mu_{t-1}$. (The index $t-1$ on $\mu_{t-1}$ reflects the fact that $Y_t$ is chosen without the knowledge of $l_t$.) Nature then reveals the next cost function $l_t$. For example, in the well-studied problem of sequential probability assignment, the Statistician predicts the probability $x_t \in [0,1] = K$ of the next outcome $y_t \in \{0,1\}$ and incurs the cost $l_t(x_t) = |x_t - y_t|$ with respect to the actual outcome $y_t$. A randomized strategy $Y_t$ then incurs a cost $l_t(Y_t)$. The goal of the learner is to minimize expected regret
$$\mathrm{Reg}_T(U) \triangleq E\left[\sum_{t=1}^T l_t(Y_t) - \sum_{t=1}^T l_t(U)\right]$$
with respect to all randomized strategies defined by $p_U \in \mathcal{P}$, for some collection of distributions $\mathcal{P}$. A procedure that guarantees sublinear growth of regret with respect to any distribution $p_U \in \mathcal{P}$ and for any sequence of cost functions $l_1, \ldots, l_T$ will be called consistent with respect to $\mathcal{P}$.

Define the cumulative cost functions $L_t(x) = \sum_{s=1}^t l_s(x)$, and let $\eta > 0$ be a parameter called the learning rate. Fix $R(x)$ to be some convex function that defines the prior, let
$$s_t(x) = \eta L_t(x) + R(x), \qquad s_0(x) = R(x), \tag{5.4}$$
and define the probability distributions $\mu_t$ as in (1.1). It turns out that this choice of $\mu_t$ is indeed a good regret-minimization strategy, as we show next. The method is similar to the Mixture Forecaster used in the prediction context [42, 40, 2, 19], and for a discrete set of decisions it is known as the celebrated Exponential Weights Algorithm [39, 22]. Let $D(p\|q)$ stand for the Kullback-Leibler (KL) divergence between distributions $p$ and $q$.

Lemma 12. For each $t \ge 1$, let $Y_t$ be a random variable with distribution $\mu_{t-1}$ as defined in (1.1). The expected regret with respect to $U$ with distribution $p_U$ is
$$\mathrm{Reg}_T(U) = \eta^{-1}\left(D(p_U\|\mu_0) - D(p_U\|\mu_T)\right) + \eta^{-1}\sum_{t=1}^T D(\mu_{t-1}\|\mu_t).$$
Specializing to the case $l_t : K \to [0,1]$ for all $t$,
$$\mathrm{Reg}_T(U) \le \eta^{-1} D(p_U\|\mu_0) + T\eta/8.$$

Before proceeding, let us make a few remarks. First, if the KL divergence between the comparator distribution $p_U$ and the prior $\mu_0$ is bounded for all $p_U \in \mathcal{P}$, the second statement of the lemma yields consistency and, even stronger, an $O(\sqrt{T})$ rate of regret growth (by choosing $\eta$ appropriately). To bound the divergence between a continuous initial $\mu_0$ and a point distribution at some $x \in K$, the analysis can be carried out in two stages: comparison to a small-covariance Gaussian centered at $x$, followed by an observation that the loss of the small-covariance Gaussian strategy is not very different from the loss of the deterministic strategy $x$. This analysis can be found in [7, p. 36] and gives a near-optimal $O(\sqrt{T \log T})$ regret bound. We defer the easy proof of Lemma 12 to Section 6.
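For a finite decision set with a uniform prior ($R \equiv 0$), the strategy $\mu_t \propto \exp\{-\eta L_t\}$ is Exponential Weights, and against a single fixed decision the bound above reads $\eta^{-1}\log N + T\eta/8$, since a point mass has divergence $\log N$ from the uniform prior on $N$ decisions. The following Python sketch is an illustration only and is not part of the original analysis; the number of decisions `N`, the horizon `T`, the random cost sequence and the function name are arbitrary assumptions.

```python
import numpy as np

def exp_weights_regret(costs, eta):
    """Expected regret of exponential weights against the best fixed decision.

    costs: array of shape (T, N) with entries in [0, 1];
           row t holds the cost function l_t evaluated on N decisions.
    """
    T, N = costs.shape
    cum = np.zeros(N)                # cumulative costs L_{t-1}
    expected_cost = 0.0
    for t in range(T):
        # mu_{t-1} is proportional to exp(-eta * L_{t-1}); shift for stability
        w = np.exp(-eta * (cum - cum.min()))
        mu = w / w.sum()
        expected_cost += mu @ costs[t]   # E l_t(Y_t) with Y_t ~ mu_{t-1}
        cum += costs[t]
    return expected_cost - cum.min()

rng = np.random.default_rng(0)
T, N = 2000, 50
costs = rng.uniform(0.0, 1.0, size=(T, N))
eta = np.sqrt(8.0 * np.log(N) / T)       # balances the two terms of the bound
regret = exp_weights_regret(costs, eta)
bound = np.log(N) / eta + T * eta / 8.0
print(regret, bound)
```

With this choice of `eta` the bound equals $\sqrt{T \log(N)/2}$, which is the $O(\sqrt{T})$ rate mentioned in the remarks.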
Having exhibited a good prediction strategy, a natural question is whether there exists a computationally efficient algorithm for producing a random draw from a distribution close to the desired mixed strategy $\mu_{t-1}$. To this end, we use the sampling method proposed in this paper. As a concrete example, consider linear functions $l_1, \ldots, l_T$ and let $R \equiv 0$. For simplicity assume boundedness $l_t : K \to [0,1]$. In this case, we may choose $\eta = O(1/\sqrt{T})$. Then $\beta_t \le \exp\{\eta \|l_t\|_K\} \le 1 + C\eta$ for large enough $T$. Further, we set $r = 1/d$ according to Condition 1, and the requirement (4.7) is seen to be satisfied for large enough $T$. With these choices of the parameters, the sequence of distributions $\mu_1, \ldots, \mu_T$ can be tracked with only one step of a random walk per iteration. The quality of this approximation is $O(\eta d^3 \nu^2)$ at each step. Therefore, regret of the proposed random walk method is within $O(T\eta d^3\nu^2)$ from the ideal procedure of Lemma 12, as can be seen by writing
$$\left|E\, l_t(Y_t) - E\, l_t(X_{t-1,1})\right| \le \int_{x \in K} l_t(x)\, \left|d\sigma_{t-1,1}(x) - d\mu_{t-1}(x)\right| \le C\eta d^3 \nu^2.$$
By choosing $\eta = \frac{1}{d^{3/2}\nu\sqrt{T}}$,
$$\mathrm{Reg}_T(U) \le C d^{3/2} \nu\, D(p_U\|\mu_0) \sqrt{T}. \tag{5.5}$$
A similar result holds for nonzero $R$, under the assumption that the $L_2$ distance between $d\mu_0(x) \propto \exp\{-R(x)\}dx$ and the uniform distribution on $K$ is bounded.

We now discuss interesting parallels between the proposed randomized method and the known deterministic optimization-based regret minimization methods. First, the statement of Lemma 12 bears striking similarity to upper bounds on regret in terms of Bregman divergences for the Follow the Regularized Leader and Mirror Descent methods [34, 3], [7, Theorem 11.1]. Yet, the randomized method operates in the (infinite-dimensional) space of distributions while the deterministic methods work directly with the set $K$. Second, deterministic methods of online convex optimization face the difficulty of projections back to the set $K$. This issue does not arise when dealing with distributions, but instead translates into the difficulty of sampling. We find these parallels between sampling and optimization intriguing. Third, a single step of the proposed random walk requires sampling from a Gaussian distribution with covariance given by the Hessian of the self-concordant barrier. This step can be implemented efficiently whenever the Hessian can be computed. The computation time exactly matches [1, Algorithm 2]: it is the same as the time spent inverting a Hessian matrix, which is $O(d^3)$ or less. Finally, as already mentioned, the idea of following a time-varying distribution is inspired by the method of following the central path in the theory of interior point methods [31, 30]. Similarly to the fast convergence of the chain under the lower bound on conductance, one has fast quadratic local convergence of interior point methods. One may therefore make parallels between conductance and local curvature. A further investigation of these connections is needed, especially in view of the recent developments on positive Ricci curvature of Markov chains [33].

6. Proofs.

Lemma 13. For any $t$ and $i \ge 0$, it holds that
$$\left\|\frac{d\sigma_{t,i}}{d\mu_t} - 1\right\|_t \le \beta_t^{3/2}\left\|\frac{d\sigma_{t,i}}{d\mu_{t-1}} - 1\right\|_{t-1} + \sqrt{\beta_t}\,(\beta_t - 1)$$
and, alternatively,
$$\left\|\frac{d\sigma_{t,i}}{d\mu_t} - 1\right\|_t \le \beta_t^{1/2}\left\|\frac{d\sigma_{t,i}}{d\mu_{t-1}} - 1\right\|_{t-1} + \sqrt{\beta_t - 1}.$$

Proof.
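Let us use the shorthand $\sigma = \sigma_{t+1,i}$ and $\beta = \beta_{t+1}$. Using (4.1) and the triangle inequality, we may write
$$\left\|\frac{d\sigma}{d\mu_{t+1}} - 1\right\|_{t+1} \le \sqrt{\beta}\left\|\frac{d\sigma}{d\mu_{t+1}} - 1\right\|_{t} \le \sqrt{\beta}\left(\left\|\frac{d\sigma}{d\mu_{t+1}} - \frac{d\sigma}{d\mu_t}\right\|_t + \left\|\frac{d\sigma}{d\mu_t} - 1\right\|_t\right).$$
For any function $f : K \to \mathbb{R}$, let $f^+(x) = \max(0, f(x))$ and $f^-(x) = \min(0, f(x))$. In view of (4.1),
$$\left|\frac{d\sigma}{d\mu_{t+1}} - \frac{d\sigma}{d\mu_t}\right| = \left(\frac{d\sigma}{d\mu_{t+1}} - \frac{d\sigma}{d\mu_t}\right)^+ - \left(\frac{d\sigma}{d\mu_{t+1}} - \frac{d\sigma}{d\mu_t}\right)^- \le \left[(\beta - 1)\,\mathbf{1}\{d\mu_{t+1} < d\mu_t\} + \left(1 - \frac{1}{\beta}\right)\mathbf{1}\{d\mu_{t+1} \ge d\mu_t\}\right]\frac{d\sigma}{d\mu_t} \le (\beta - 1)\frac{d\sigma}{d\mu_t}.$$
Therefore,
$$\left\|\frac{d\sigma}{d\mu_{t+1}} - \frac{d\sigma}{d\mu_t}\right\|_t \le (\beta - 1)\left\|\frac{d\sigma}{d\mu_t}\right\|_t \le (\beta - 1)\left(1 + \left\|\frac{d\sigma}{d\mu_t} - 1\right\|_t\right).$$
The first statement follows by rearranging the terms.

Alternatively, we can obtain an inequality that is slightly weaker for $\beta - 1 \to 0$ and stronger for large $\beta$ by simply writing
$$\left\|\frac{d\sigma}{d\mu_{t+1}} - 1\right\|_{t+1}^2 = \int_K \left(\frac{d\sigma}{d\mu_{t+1}} - 1\right)^2 d\mu_{t+1} = \int_K \left(\frac{d\sigma}{d\mu_{t+1}}\right)^2 d\mu_{t+1} - 1 = \int_K \left(\frac{d\sigma}{d\mu_t}\right)^2 \frac{d\mu_t}{d\mu_{t+1}}\, d\mu_t - 1.$$
Using $\beta$ as an upper bound on the one-sided change $d\mu_t/d\mu_{t+1}$ over $K$ leads to
$$\left\|\frac{d\sigma}{d\mu_{t+1}} - 1\right\|_{t+1}^2 \le \beta\int_K \left(\frac{d\sigma}{d\mu_t}\right)^2 d\mu_t - 1 = \beta\left\|\frac{d\sigma}{d\mu_t} - 1\right\|_t^2 + \beta - 1,$$
and subadditivity of the square root function concludes the proof.

Proof of Theorem 1. Given interior points $x, y$ in $\mathrm{int}(K)$, suppose $p, q$ are the ends of the chord in $K$ containing $x, y$, and $p, x, y, q$ lie in that order. Denote the cross ratio by
$$\sigma(x, y) = \frac{|x - y|\,|p - q|}{|p - x|\,|q - y|},$$
and for two sets $S_1$ and $S_2$ let
$$\sigma(S_1, S_2) = \inf_{x \in S_1,\, y \in S_2} \sigma(x, y).$$
A result due to Lovász and Vempala [26] states the following. If $S_1$ and $S_2$ are measurable subsets of $K$ and $\mu$ is a probability measure supported on $K$ that possesses a density whose logarithm is concave, then
$$\mu((K \setminus S_1) \setminus S_2) \ge \sigma(S_1, S_2)\,\mu(S_1)\,\mu(S_2).$$
This is a non-trivial isoperimetric inequality which says that for any partition of the convex set $K$ into $S_1$, $S_2$ and $S_3$, the volume of $S_3$ is large relative to that of $S_1$ and $S_2$ whenever $S_1$ and $S_2$ are separated. Given this isoperimetric result, to prove the theorem it only remains to show that the $\sigma$-distance can be lower bounded (up to a multiplicative constant) by the Riemannian metric $\rho$. The proof of this fact goes through the Hilbert (projective) metric, which is defined by
$$d_H(x, y) \triangleq \ln(1 + \sigma(x, y)).$$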
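As a sanity check on this definition, take $K$ to be an interval $(p, q)$ on the real line. A direct computation gives $1 + \sigma(x, y) = \frac{(y - p)(q - x)}{(x - p)(q - y)}$ for $p < x \le y < q$, the classical cross ratio, so $d_H$ is additive along a segment. The short Python sketch below checks this numerically; the endpoints, the sample points and the function names are illustrative assumptions and do not come from the paper.

```python
import math

def cross_ratio(x, y, p, q):
    """sigma(x, y) for collinear points ordered p < x <= y < q."""
    return abs(x - y) * abs(p - q) / (abs(p - x) * abs(q - y))

def hilbert_metric(x, y, p, q):
    """d_H(x, y) = ln(1 + sigma(x, y)) on the chord with ends p and q."""
    x, y = min(x, y), max(x, y)      # enforce the ordering p, x, y, q
    return math.log1p(cross_ratio(x, y, p, q))

p, q = -1.0, 2.0                     # ends of the chord (here K is an interval)
x, z, y = -0.5, 0.3, 1.7             # z lies on the segment between x and y

lhs = hilbert_metric(x, z, p, q) + hilbert_metric(z, y, p, q)
rhs = hilbert_metric(x, y, p, q)
closed_form = math.log((y - p) * (q - x) / ((x - p) * (q - y)))
print(lhs, rhs, closed_form)         # the three values agree up to rounding
```

Additivity along segments is what lets a comparison between metrics be reduced to an infinitesimal statement.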

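Further, for $x \in K$ and a vector $v$, let
$$|v|_x \triangleq \sup_{x \pm \alpha v \in K} \alpha.$$
The following two relations between the introduced notions hold. The first one (see Nesterov and Nemirovskii [31, Theorem 2.3.2 (iii)]) is
$$|h|_x \le \|h\|_x \le (1 + 3\nu)\,|h|_x \tag{6.1}$$
for all $h \in \mathbb{R}^d$ and $x \in \mathrm{int}(K)$, where $\nu$ is the self-concordance parameter of $F$. The second relation (see Nesterov and Todd [32, Lemma 3.1]) states that
$$\|x - y\|_x - \|x - y\|_x^2 \le \rho(x, y) \le -\ln(1 - \|x - y\|_x) \tag{6.2}$$
whenever $\|x - y\|_x < 1$. For any $z$ on the segment $xy$ an easy computation shows that $d_H(x, z) + d_H(z, y) = d_H(x, y)$. Therefore it suffices to prove the result infinitesimally. From (6.2),
$$\lim_{y \to x} \frac{\rho(x, y)}{\|x - y\|_x} = 1,$$
and a direct computation shows that
$$\lim_{y \to x} \frac{d_H(x, y)}{|x - y|_x} = \lim_{y \to x} \frac{\sigma(x, y)}{|x - y|_x} \ge 1.$$
Hence, in view of (6.1), the Hilbert metric and the Riemannian metric satisfy
$$\rho(x, y) \le (1 + 3\nu)\, d_H(x, y).$$
Using $\ln(1 + x) \le x$ concludes the proof.

Proof of Lemma 4. The argument roughly follows the standard path, which is explained, for instance, in [38]. Let $S_1$ be a measurable subset of $K$ such that $\mu(S_1) \le \frac{1}{2}$ and let $S_2 = K \setminus S_1$ be its complement. Fix a $C > 1$ and let $S_1' = S_1 \cap \{x : P_x(S_2) \le 1/C\}$ and $S_2' = S_2 \cap \{y : P_y(S_1) \le 1/C\}$. That is, points in the set $S_1'$ are unlikely to transition to the set $S_2$, and points in $S_2'$ are analogously unlikely to reach $S_1$ in one step. By the reversibility of the chain, which is easily checked,
$$\int_{S_1} P_x(S_2)\, d\mu(x) = \int_{S_2} P_y(S_1)\, d\mu(y).$$
For any $x \in S_1'$ and $y \in S_2'$,
$$d_{TV}(P_x, P_y) = 1 - \int_K \min\left(\frac{dP_x}{d\mu}(w), \frac{dP_y}{d\mu}(w)\right) d\mu(w) \ge 1 - \frac{1}{C}.$$
That is, the transition probabilities for a pair in $S_1'$ and $S_2'$ must be dissimilar. But Lemma 3 implies that if $\rho(x, y) \le \frac{r}{C\sqrt{d}}$, then $d_{TV}(P_x, P_y) \le 1 - \frac{1}{C}$. Therefore
$$\rho(S_1', S_2') \ge \frac{r}{C\sqrt{d}}.$$
We conclude that the sets $S_1'$ and $S_2'$ must be well-separated. Therefore, the isoperimetric result of Theorem 1 implies that
$$\mu((K \setminus S_1') \setminus S_2') \ge \frac{\rho(S_1', S_2')}{1 + 3\nu}\min(\mu(S_1'), \mu(S_2')) \ge \frac{r}{C\nu\sqrt{d}}\min(\mu(S_1'), \mu(S_2')).$$
First suppose $\mu(S_1') \ge (1 - \frac{1}{C})\mu(S_1)$ and $\mu(S_2') \ge (1 - \frac{1}{C})\mu(S_2)$. Then,
$$\int_{S_1} P_x(S_2)\, d\mu(x) = \frac{1}{2}\int_{S_1} P_x(S_2)\, d\mu(x) + \frac{1}{2}\int_{S_2} P_x(S_1)\, d\mu(x) \ge \frac{1}{2C}\,\mu((K \setminus S_1') \setminus S_2') \ge \frac{r}{2C^2\nu\sqrt{d}}\min(\mu(S_1'), \mu(S_2')) \ge \frac{1 - 1/C}{2C^2}\cdot\frac{r}{\nu\sqrt{d}}\min(\mu(S_1), \mu(S_2)),$$
proving the result. Otherwise, without loss of generality, suppose $\mu(S_1') \le (1 - \frac{1}{C})\mu(S_1)$. Then
$$\int_{S_1} P_x(S_2)\, d\mu(x) = \frac{1}{2}\int_{S_1} P_x(S_2)\, d\mu(x) + \frac{1}{2}\int_{S_2} P_x(S_1)\, d\mu(x) \ge \frac{1}{2}\int_{S_1 \setminus S_1'} P_x(S_2)\, d\mu(x) \ge \frac{\mu(S_1)}{2C^2},$$
concluding the proof.

Proof of Lemma 5. The proof closely follows that in [20]. By definition,
$$\left\|\frac{d\mu_t}{d\mu_{t+1}}\right\|_{t+1}^2 = \int_K \left(\frac{d\mu_t}{d\mu_{t+1}}\right)^2 d\mu_{t+1} = \int_K \frac{d\mu_t}{d\mu_{t+1}}\, d\mu_t = \int_K \frac{\exp\{-s_t\}}{Z_t}\cdot\frac{Z_{t+1}}{\exp\{-s_{t+1}\}}\, d\mu_t.$$
Writing out the normalization terms,
$$\left\|\frac{d\mu_t}{d\mu_{t+1}}\right\|_{t+1}^2 = \frac{\int_K \exp\{-s_{t+1}\}\,\int_K \exp\{s_{t+1} - 2s_t\}}{\left(\int_K \exp\{-s_t\}\right)^2} = \frac{Y(1)\,Y(-1 + 2(1 - \delta))}{Y(1 - \delta)\,Y(1 - \delta)},$$
where $Y(a) = \int_K \exp\{-a s_{t+1}\}$. As shown in [20, Lemma 3.1], the function $a^d\, Y(a)$ is log-concave in $a$, and thus
$$\frac{Y(a)\,Y(b)}{Y\left(\frac{a+b}{2}\right)^2} \le \left(\frac{\left(\frac{a+b}{2}\right)^2}{ab}\right)^d.$$
Applying this inequality with $a = 1$ and $b = -1 + 2(1 - \delta)$,
$$\left\|\frac{d\mu_t}{d\mu_{t+1}}\right\|_{t+1}^2 \le \left(1 + \frac{\delta^2}{1 - 2\delta}\right)^d.$$
In particular, if $\delta \le d^{-1/2} \le 1/3$ (that is, $d > 8$), we obtain an upper bound of $\exp\left\{\frac{\sqrt{d}}{\sqrt{d} - 2}\right\}$.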
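To see how the final bound of the proof of Lemma 5 behaves numerically, take $\delta = d^{-1/2}$ in $(1 + \delta^2/(1 - 2\delta))^d$. Since $(1 + u)^d \le e^{du}$, this is at most $\exp\{\sqrt{d}/(\sqrt{d} - 2)\}$, which is largest at $d = 9$, where it equals $e^3$. The Python sketch below checks this; it is an illustration under the stated choice of $\delta$, and the function names and the range of dimensions are assumptions.

```python
import math

def lemma5_bound(d):
    """(1 + delta^2 / (1 - 2*delta))^d with delta = d^(-1/2); needs d > 4."""
    delta = d ** -0.5
    return (1.0 + delta ** 2 / (1.0 - 2.0 * delta)) ** d

def exp_bound(d):
    """exp(sqrt(d) / (sqrt(d) - 2)), obtained from (1 + u)^d <= exp(d * u)."""
    return math.exp(math.sqrt(d) / (math.sqrt(d) - 2.0))

dims = range(9, 2001)
worst = max(lemma5_bound(d) for d in dims)        # largest value over d >= 9
print(worst, math.sqrt(worst), math.exp(3.0))
```

The square root of the largest value stays below 5 on this range, which is in line with the constant quoted for $d > 8$ in Section 5.3, assuming that constant refers to the norm rather than its square.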

Proof of Lemma 12. Let $q_t = \exp\{-s_t\}$ denote the unnormalized density of $\mu_t$ and $Z_t$ its normalization constant. Observe that $D(\mu_{t-1}\|\mu_t)$ can be written as
$$\int_K d\mu_{t-1} \log\frac{q_{t-1} Z_t}{q_t Z_{t-1}} = \log\frac{Z_t}{Z_{t-1}} + \int_K \eta\, l_t(x)\, d\mu_{t-1}(x) = \log\frac{Z_t}{Z_{t-1}} + \eta\, E\, l_t(Y_t). \tag{6.3}$$
Rearranging, canceling the telescoping terms, and using the fact that $Z_0 = 1$,
$$\eta\, E\sum_{t=1}^T l_t(Y_t) = \sum_{t=1}^T D(\mu_{t-1}\|\mu_t) - \log Z_T.$$
Let $U$ be a random variable with a probability distribution $p_U$. Then
$$-\sum_{t=1}^T E\, l_t(U) = \eta^{-1}\int_K -\eta L_T(u)\, dp_U(u) = \eta^{-1}\int_K dp_U(u)\log\frac{q_T(u)}{q_0(u)}.$$
Combining,
$$E\left[\sum_{t=1}^T l_t(Y_t) - \sum_{t=1}^T l_t(U)\right] = \eta^{-1}\int_K dp_U(u)\log\frac{q_T(u)/Z_T}{q_0(u)} + \eta^{-1}\sum_{t=1}^T D(\mu_{t-1}\|\mu_t) = \eta^{-1}\left(D(p_U\|\mu_0) - D(p_U\|\mu_T)\right) + \eta^{-1}\sum_{t=1}^T D(\mu_{t-1}\|\mu_t).$$
Now, from Eq. (6.3), the KL divergence can also be written as
$$D(\mu_{t-1}\|\mu_t) = \log\frac{\int_K e^{-\eta l_t(x)} q_{t-1}(x)\, dx}{\int_K q_{t-1}(x)\, dx} + \eta\, E\, l_t(Y_t) = \log E\, e^{-\eta(l_t(Y_t) - E\, l_t(Y_t))}.$$
By representing the divergence in this form, one can obtain upper bounds via known methods, such as log-Sobolev inequalities (e.g. [5]). In the simplest case of bounded loss, it is easy to show that $D(\mu_{t-1}\|\mu_t) \le O(\eta^2)$, and the particular constant $1/8$ can be obtained by, for instance, applying Lemma A.1 in [7]. This proves the second part of the lemma.

7. Smooth Variation of the Transition Kernel. In this section, we study the transition $x \to y$. For this purpose, it is enough to assume that $x$ is the origin and that the Dikin ellipsoid at $x$ is a unit Euclidean ball. This can be achieved by an affine transformation, leading to no loss of generality since the resulting statement about measures on $K$ is invariant with respect to affine transformations. Hence, in what follows, for the particular $x$ we have $\langle\cdot,\cdot\rangle_x = \langle\cdot,\cdot\rangle$ and $\|\cdot\|_x = \|\cdot\|$. Since $x$ is the origin, we have $E\|z\|^2 = E\|z - x\|_x^2 = r^2$ for $z$ sampled from $G_x^r$. Further, without loss of generality, we may also assume $s(x) = 0$.

Proof of Lemma 3. In view of the first inequality in Eq. (6.2),
$$\|x - y\|_x - \|x - y\|_x^2 \le \rho(x, y) \le \frac{r}{C\sqrt{d}}.$$
Without loss of generality, assume $\frac{r}{C\sqrt{d}} \le \frac{1}{8}$. First, we claim that $\|x - y\|_x$ must be small. For the sake of contradiction, suppose $\|x - y\|_x > 1/2$ and consider a point $y'$ with $\|x - y'\|_x = 1/2$ lying on the geodesic path between $x$ and $y$ with respect to the Riemannian metric. Clearly, $\rho(x, y') \le \frac{r}{C\sqrt{d}} \le \frac{1}{8}$, yet by Eq. (6.2) we have $\frac{1}{4} \le \rho(x, y')$, contradicting our assumption. Hence, $\|x - y\|_x \le 1/2$, and, therefore,
$$\|x - y\|_x \le \frac{2r}{C\sqrt{d}}.$$
It remains to show that if $x, y \in K$ and $\|x - y\|_x \le \frac{r}{C\sqrt{d}}$, then
$$d_{TV}(P_x, P_y) \le 1 - \frac{1}{C}.$$
By definition, we have that
$$1 - d_{TV}(P_x, P_y) = E_z\left[\min\left\{1,\ \frac{G_y^r(z)}{G_x^r(z)},\ \frac{G_z^r(x)\exp(s(x))}{G_x^r(z)\exp(s(z))},\ \frac{G_z^r(y)\exp(s(y))}{G_x^r(z)\exp(s(z))}\right\}\right],$$
where the expectation is taken over a random point $z$ having density $G_x^r$. Thus, it suffices to prove that for some $C > 1$
$$P\left[\min\left\{\frac{G_y^r(z)}{G_x^r(z)},\ \frac{G_z^r(x)\exp(s(x))}{G_x^r(z)\exp(s(z))},\ \frac{G_z^r(y)\exp(s(y))}{G_x^r(z)\exp(s(z))}\right\} > \frac{1}{C}\right] \ge \frac{1}{C}.$$
By our assumption, $x$ is the origin and $D^2F(x) = I$, the latter implying that $V(x) = 0$. Thus,
$$\frac{G_y^r(z)}{G_x^r(z)} = \exp\left\{-\frac{d\|y - z\|_y^2}{2r^2} + V(y) + \frac{d\|z\|^2}{2r^2}\right\},$$
$$\frac{G_z^r(x)\exp(s(x))}{G_x^r(z)\exp(s(z))} = \exp\left\{-\frac{d\|z\|_z^2}{2r^2} + V(z) + \frac{d\|z\|^2}{2r^2} + (s(x) - s(z))\right\},$$
and
$$\frac{G_z^r(y)\exp(s(y))}{G_x^r(z)\exp(s(z))} = \exp\left\{-\frac{d\|y - z\|_z^2}{2r^2} + V(z) + \frac{d\|z\|^2}{2r^2} + (s(y) - s(z))\right\}.$$
Thus, it remains to prove that there exists a constant $C$ such that
$$P\Big[\max\Big\{d\|y - z\|_y^2 - 2r^2 V(y),\ d\|z\|_z^2 + 2r^2(s(z) - s(x)) - 2r^2 V(z),\ d\|z - y\|_z^2 + 2r^2(s(z) - s(y)) - 2r^2 V(z)\Big\} < d\|z\|^2 + r^2 C\Big] \ge \frac{1}{C}.$$
This fact is shown in technical Lemmas 15 and 16 below. In proving the technical lemmas, we will use the fact that $\|x - y\|_x \le \frac{r}{C\sqrt{d}}$, as shown above, and that $\|x - z\|_x$ (for $z$ sampled from $G_x^r$) is likely to be bounded above by a multiple of $r$ by straightforward concentration arguments.

Lemma 14. There exists a constant $C > 0$ such that
$$P\left[\max(V(y), V(z)) < C\right] > 0.9.$$

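Proof. Fix a constant $c$. First, notice that over a Euclidean ball of radius $c/d$ around the origin, the Hessians $D^2F(u)$ are lower-bounded by a factor of $(1 - c/d)^2$ from the Hessian at the origin (the identity) by (8.2). Hence, the determinant function can decrease from 1 by at most a constant factor. Thus $V(u) < C$ for some constant $C$ for any $u$ with $\|x - u\|_x \le c/d$. Now recall that $y$ is deterministically within the $1/d$ ball, while $z$ is in the ball of radius $c/d$ with high probability.

Lemma 15. Under step size Condition 4, for any $t$,
$$P\left[\max\{s(z) - s(x),\ s(z) - s(y)\} < C\right] > 0.3.$$

Proof. Since with large enough probability $\|x - y\|_x < Cr$ and $\|x - z\|_x < Cr$, we also have $\|z - y\|_x < Cr$. Then, by (8.2), the norms at $z$ and $x$ are within a multiplicative constant, and thus the pairs $(z, x)$ and $(z, y)$ are subject to the step size choice specified in the condition. That is, there exists a $g$ such that
$$s(z) - s(x) = s(z) - s(x) - \langle g, z - x\rangle + \langle g, z - x\rangle \le C + \langle g, z - x\rangle$$
and similarly
$$s(z) - s(y) = s(z) - s(y) - \langle g, z - y\rangle + \langle g, z - y\rangle \le C + \langle g, z - y\rangle.$$
Then, assuming (without loss of generality) $x = 0$,
$$P\left[\max\{\langle g, z - x\rangle, \langle g, z - y\rangle\} < 0\right] = P\left[\langle g, z\rangle \le \min\{0, \langle g, y\rangle\}\right].$$
Observe that $\langle g, z\rangle$ is a Gaussian random variable whose standard deviation is larger than $\|g\|\,\|y\|$. Therefore,
$$P\left[\langle g, z\rangle \le \min\{0, \langle g, y\rangle\}\right] \ge \mathrm{erfc}\left(1/\sqrt{2}\right) > 0.3,$$
where $\mathrm{erfc}(x) \triangleq \frac{2}{\sqrt{\pi}}\int_x^\infty e^{-t^2}\, dt$ is the usual complementary error function.

The following probabilistic upper bound completes the proof.

Lemma 16. There exists a constant $C > 0$ such that
$$P\left[\max\left\{\|y - z\|_y^2,\ \|z\|_z^2,\ \|z - y\|_z^2\right\} - \|z\|^2 < \frac{Cr^2}{d}\right] > 0.9.$$

Proof of Lemma 16. Since $\|y\| < \frac{Cr}{\sqrt{d}}$, both $\|y\|_y^2$ and $\|y\|_z^2$ are less than $\frac{Cr^2}{d}$. So it suffices to show that
$$P\left[\max\left\{\|z\|_y^2 - \|z\|^2,\ \|z\|_z^2 - \|z\|^2,\ |\langle y, z\rangle_y|,\ |\langle y, z\rangle_z|\right\} < \frac{Cr^2}{d}\right] > 0.9.$$
We proceed to do so by proving probabilistic upper bounds on each of the terms

(a) $\|z\|_y^2 - \|z\|^2$, (b) $\|z\|_z^2 - \|z\|^2$, (c) $\langle y, z\rangle_y$, and (d) $\langle y, z\rangle_z$

separately, and finally applying the union bound.

We first prove an upper bound on (a) and (b). Note that $r \le \frac{1}{d}$ and thus $r^3 \le \frac{r^2}{d}$. It suffices to observe that by (8.2)
$$\|z\|_z^2 - \|z\|^2 \le \left(\left(\frac{1}{1 - \|z\|}\right)^2 - 1\right)\|z\|^2 \le 8\|z\|^3$$
whenever $\|z\| < 1/2$. Similarly, for $\|y\| < 1/2$,
$$\|z\|_y^2 - \|z\|^2 \le \left(\left(\frac{1}{1 - \|y\|}\right)^2 - 1\right)\|z\|^2 \le 8\|z\|^3.$$
There exists a constant $C$ such that the quantity $\|z\|^3$ is bounded by $Cr^3$ with probability at least 0.99.

We now turn to bounding (c) and (d). Let $[0, u]$ denote the line segment between the origin and $u$. By the mean-value theorem,
$$\langle y, z\rangle_y = \langle y, z\rangle + (\langle y, z\rangle_y - \langle y, z\rangle) \le \langle y, z\rangle + \sup_{y' \in [0, y]} D^3F(y')[y, y, z]$$
and
$$\langle y, z\rangle_z = \langle y, z\rangle + (\langle y, z\rangle_z - \langle y, z\rangle) \le \langle y, z\rangle + \sup_{z' \in [0, z]} D^3F(z')[y, z, z].$$
Observe that $|\langle y, z\rangle| \le \frac{C\|y\|\,\|z\|}{\sqrt{d}}$ with probability at least 0.99 by a measure-concentration argument. Indeed, most of the vectors $z$ are almost perpendicular to the given vector $y$. Now, using (8.1),
$$\sup_{y' \in [0, y]} D^3F(y')[y, y, z] \le \sup_{y' \in [0, y]} 2\|y\|_{y'}^2\|z\|_{y'} \le \frac{Cr^2}{d}$$
and
$$\sup_{z' \in [0, z]} D^3F(z')[y, z, z] \le \sup_{z' \in [0, z]} 2\|y\|_{z'}\|z\|_{z'}^2 \le \frac{Cr^3}{\sqrt{d}} \le \frac{Cr^2}{d}$$
with probability at least 0.99. Therefore, there exists a constant $C > 0$ such that
$$P\left[\langle y, z\rangle_y < \frac{Cr^2}{d}\right] > 0.98,$$
and the same statement holds for $\langle y, z\rangle_z$. We also have that
$$P\left[\frac{C\|y\|\,\|z\|}{\sqrt{d}} + \sup_{z' \in [0, z]} 2\|y\|_{z'}\|z\|_{z'}^2 \le \frac{Cr^2}{d}\right] > 0.99.$$
Therefore,
$$P\left[\langle y, z\rangle_z < \frac{Cr^2}{d}\right] > 0.98.$$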
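The concentration facts used in this section are easy to examine numerically. The Python sketch below draws Gaussian proposals for the box $K = [-1, 1]^d$ with the logarithmic barrier $F(x) = -\sum_i [\log(1 - x_i) + \log(1 + x_i)]$, assuming the proposal at $x$ has covariance $(r^2/d)(D^2F(x))^{-1}$. That scaling is chosen here because it is consistent with the normalization $E\|z - x\|_x^2 = r^2$ used in this section; the scaling, the choice of barrier, the starting point and all function names are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def barrier_hessian_diag(x):
    """Diagonal of D^2 F(x) for F(x) = -sum(log(1 - x_i) + log(1 + x_i))."""
    return 1.0 / (1.0 - x) ** 2 + 1.0 / (1.0 + x) ** 2

def dikin_proposals(x, r, n, rng):
    """Draw n points z ~ N(x, (r^2 / d) * (D^2 F(x))^{-1}) (assumed scaling)."""
    d = x.shape[0]
    h = barrier_hessian_diag(x)
    g = rng.standard_normal((n, d))
    return x + (r / np.sqrt(d)) * g / np.sqrt(h)

rng = np.random.default_rng(0)
d = 5
x = np.array([0.0, 0.5, -0.3, 0.8, -0.9])   # an interior point of the box
r = 1.0 / d                                  # step size of order 1/d
z = dikin_proposals(x, r, 20000, rng)

# squared local norm ||z - x||_x^2 = (z - x)^T D^2 F(x) (z - x)
local_sq = ((z - x) ** 2 * barrier_hessian_diag(x)).sum(axis=1)
mean_sq = local_sq.mean()
inside = np.all(np.abs(z) < 1.0, axis=1).mean()   # fraction of proposals in K
print(mean_sq / r ** 2, inside)
```

Because the Hessian of this barrier is diagonal, a proposal costs $O(d)$ here; for a general barrier the same step needs a factorization of $D^2F(x)$.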

8. Self-concordant barriers. Let $K$ be a convex subset of $\mathbb{R}^d$ that is not contained in any $(d-1)$-dimensional affine subspace and let $\mathrm{int}(K)$ denote its interior. Following Nesterov and Nemirovskii, we call a real-valued function $F : \mathrm{int}(K) \to \mathbb{R}$ a regular self-concordant barrier if it satisfies the conditions stated below. For convenience, if $x \notin \mathrm{int}(K)$, we define $F(x) = \infty$.

1. (Convex, Smooth) $F$ is a convex thrice continuously differentiable function on $\mathrm{int}(K)$.
2. (Barrier) For every sequence of points $\{x_i\} \in \mathrm{int}(K)$ converging to a point $x \notin \mathrm{int}(K)$, $\lim_{i \to \infty} F(x_i) = \infty$.
3. (Differential Inequalities) For all $h \in \mathbb{R}^d$ and all $x \in \mathrm{int}(K)$, the following inequalities hold.
(a) $D^2F(x)[h, h]$ is 2-Lipschitz continuous with respect to the local norm, which is equivalent to $D^3F(x)[h, h, h] \le 2\left(D^2F(x)[h, h]\right)^{3/2}$.
(b) $F(x)$ is $\nu$-Lipschitz continuous with respect to the local norm defined by $F$: $|DF(x)[h]|^2 \le \nu\, D^2F(x)[h, h]$.

We call the smallest positive integer $\nu$ for which this holds the self-concordance parameter of the barrier.

The following results can be found, for instance, in [31, 30, 29]. First,
$$|D^3F(x)[h_1, h_2, h_3]| \le 2\|h_1\|_x\|h_2\|_x\|h_3\|_x. \tag{8.1}$$
Second, if $\delta = \|h\|_x < 1$, then
$$(1 - \delta)^2 D^2F(x) \preceq D^2F(x + h) \preceq (1 - \delta)^{-2} D^2F(x). \tag{8.2}$$

References.

[1] J. Abernethy, E. Hazan, and A. Rakhlin. Competing in the dark: An efficient algorithm for bandit linear optimization. In Proceedings of The Twenty First Annual Conference on Learning Theory, 2008.
[2] K. S. Azoury and M. K. Warmuth. Relative loss bounds for on-line density estimation with the exponential family of distributions. Machine Learning, 43(3):211–246, June 2001.
[3] A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Oper. Res. Lett., 31(3):167–175, 2003.
[4] D. Blackwell. An analog of the minimax theorem for vector payoffs. Pac. J. Math., 6:1–8, 1956.
[5] S. Boucheron, G. Lugosi, and P. Massart. Concentration inequalities using the entropy method. Annals of Probability, 31:1583–1614, 2003.
[6] A. Caponnetto and A. Rakhlin. Stability properties of empirical risk minimization over Donsker classes. Journal of Machine Learning Research, 6:2565–2583, 2006.
[7] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
[8] N. Chopin. A sequential particle filter method for static models. Biometrika, 89(3):539–552, 2002.
[9] T. Cover. Behaviour of sequential predictors of binary sequences. In Proc. 4th Prague Conf. Inform. Theory, Statistical Decision Functions, Random Processes, 1965.
[10] P. Damien and S. Walker. Sampling truncated normal, beta, and gamma densities. Journal of Computational and Graphical Statistics, 10(2), 2001.
[11] L. Devroye. Non-uniform random variate generation. Springer Verlag, 1986.
[12] P. Diaconis. The Markov chain Monte Carlo revolution. Bulletin of the American Mathematical Society, 46(2):179–205, 2009.
[13] A. Doucet, N. De Freitas, N. Gordon, et al. Sequential Monte Carlo methods in practice, volume 1. Springer New York, 2001.
[14] M. Dyer, A. Frieze, and R. Kannan. A random polynomial-time algorithm for approximating the volume of convex bodies. Journal of the ACM (JACM), 38(1):1–17, 1991.
[15] C. Fraley and A. Raftery. Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association, 97(458):611–631, 2002.
[16] A. Frieze, R. Kannan, and N. Polson. Sampling from log-concave distributions. The Annals of Applied Probability, pages 812–837, 1994.
[17] W. Gilks and P. Wild. Adaptive rejection sampling for Gibbs sampling. Applied Statistics, pages 337–348, 1992.
[18] J. Hannan. Approximation to Bayes risk in repeated play. Contributions to the Theory of Games, 3:97–139, 1957.
[19] S. Kakade and A. Ng. Online bounds for Bayesian algorithms. In Proceedings of Neural Information Processing Systems (NIPS 17), 2005.
[20] A. T. Kalai and S. Vempala. Simulated annealing for convex optimization. Mathematics of Operations Research, 31(2):253–266, 2006.
[21] R. Kannan and H. Narayanan. Random walks on polytopes and an affine interior point method for linear programming. Mathematics of Operations Research, 37(1):1–20, 2012.
[22] N. Littlestone and M. K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212–261, 1994.
[23] L. Lovász. Hit-and-run mixes fast. Mathematical Programming, 86(3):443–461, 1999.
[24] L. Lovász and M. Simonovits. Random walks in a convex body and an improved volume algorithm. Random Structures and Algorithms, 4(4):359–412, 1993.
[25] L. Lovász and S. Vempala. Simulated annealing in convex bodies and an $O^*(n^4)$ volume algorithm. J. Comput. Syst. Sci., 72(2):392–417, 2006.
[26] L. Lovász and S. Vempala. The geometry of logconcave functions and sampling algorithms. Random Struct. Algorithms, 30(3):307–358, 2007.
[27] G. J. McLachlan and D. Peel. Finite mixture models. 2000.
[28] S. Meyn and R. L. Tweedie. Markov chains and stochastic stability. Cambridge University Press, 2009.
[29] A. S. Nemirovski and M. J. Todd. Interior-point methods for optimization. Acta Numerica, pages 191–234, 2008.
[30] A. S. Nemirovskii. Interior point polynomial time methods in convex programming, 2004.
[31] Y. E. Nesterov and A. S. Nemirovskii. Interior Point Polynomial Algorithms in Convex Programming. SIAM, Philadelphia, 1994.
[32] Y. E. Nesterov and M. J. Todd. On the Riemannian geometry defined by self-concordant barriers and interior-point methods. Foundations of Computational Mathematics, 2(4):333–361, 2008.
[33] Y. Ollivier. Ricci curvature of Markov chains on metric spaces. Journal of Functional Analysis, 256(3):810–864, 2009.
[34] A. Rakhlin. Lecture notes on online learning, 2008. http://stat.wharton.upenn.edu/~rakhlin/papers/online_learning.pdf.
[35] A. Rakhlin and A. Caponnetto. Stability of K-means clustering. In Advances in Neural Information Processing Systems 19, pages 1121–1128. MIT Press, 2006.
[36] C. P. Robert. Simulation of truncated normal variables. Statistics and Computing, 5(2):121–125, 1995.
[37] C. P. Robert and G. Casella. Monte Carlo statistical methods, volume 319. 2004.
[38] S. Vempala. Geometric random walks: A survey. In Combinatorial and computational geometry. Math. Sci. Res. Inst. Publ, 52:577–616, 2005.
[39] V. Vovk. Aggregating strategies. In Proceedings of the Third Annual Workshop on Computational Learning Theory, pages 372–383. Morgan Kaufmann, 1990.
[40] V. Vovk. Competitive on-line statistics. International Statistical Review, 69:213–248, 2001.
[41] G. Walther. Inference and modeling with log-concave distributions. Statistical Science, pages 319–327, 2009.
[42] K. Yamanishi. Minimax relative loss analysis for sequential prediction algorithms using parametric hypotheses. In COLT '98, pages 32–43, New York, NY, USA, 1998. ACM.

Department of Statistics and Department of Mathematics, University of Washington
E-mail: harin@uw.edu

Department of Statistics, The Wharton School, University of Pennsylvania
E-mail: rakhlin@wharton.upenn.edu