# Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh

Save this PDF as:

Size: px
Start display at page:

Download "Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh"

## Transcription

1 Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh Peter Richtárik Week 3 Randomized Coordinate Descent With Arbitrary Sampling January 27, / 30 The Problem We first consider the following problem: minimize f (x) (1) subject to x = (x 1,..., x n ) R n We will assume that f is: smooth (will be made precise later) strongly convex (will be made precise later) So, this is unconstrained minimization of a smooth convex function. 2 / 30

2 Randomized Coordinate Descent with Arbitrary Sampling NSync Algorithm (R. and Takáč 2014, [4]) Input: initial point x 0 R n subset probabilities {p S } for each S [n] = {1, 2,..., n} stepsize parameters v 1,..., v n > 0 for k = 0, 1, 2,... do a) Select a random set of coordinates S k [n] following the law P(S k = S) = p S, S [n] b) Update (possibly in parallel) selected coordinates: x k+1 = x k i S k 1 v i (e T i f (x k ))e i end for (e i is the ith unit coordinate vector) Remark: The NSync algorithm was introduced in The first coordinate descent algorithm using arbitrary sampling. 3 / 30 Two More Ways of Writing the Update Step 1. Coordinate-by-coordinate: x k+1 i = { xi k, i / S k, xi k 1 v i ( f (x k )) i, i S k. 2. Via projection to a subset of blocks: If for h R n and S [n] we write h S = h i e i, (2) i S then x k+1 = x k + h Sk for h = (Diag(v)) 1 f (x k ). (3) Depending on context, we shall interchangeably denote the ith partial derivative of f at x by i f (x) = e T i f (x) = ( f (x)) i 4 / 30

3 Samplings Definition 1 (Sampling) By the name sampling we refer to a set valued random mapping with values being subsets of [n] = {1, 2,..., n}. For sampling Ŝ we ine the probability vector p = (p 1,..., p n ) T by We say that Ŝ is proper, if p i > 0 for all i. p i = P(i Ŝ) (4) A sampling Ŝ is uniquely characterized by the probability mass function p S = P(Ŝ = S), S [n]; (5) that is, by assigning probabilities to all subsets of [n]. Later on it will be useful to also consider the probability matrix P = P(Ŝ) = (p ij) given by p ij = P(i Ŝ, j Ŝ) = S:{i,j} S p S. (6) 5 / 30 Samplings: A Basic Identity Lemma 2 ([3]) For any sampling Ŝ we have n p i = E[ Ŝ ]. (7) Proof. n (4)+(5) p i = n p S = p S = p S S = E[ Ŝ ]. S [n]:i S S [n] i:i S S [n] 6 / 30

4 Sampling Zoo - Part I Why consider different samplings? 1. Basic Considerations. It is important that each block i has a positive probability of being chosen, otherwise NSync will not be able to update some blocks and hence will not converge to optimum. For technical/sanity reasons, we ine: Proper sampling. pi = P(i Ŝ) > 0 for all i [n] Nil sampling: P( Ŝ = ) = 1 Vacuous sampling: P( Ŝ = ) > 0 2. Parallelism. Choice of sampling affects the level of parallelism: E[ Ŝ ] is the average number of updates performed in parallel in one iteration; and is hence closely related to the number of iterations. serial sampling: picks one block: P( Ŝ = 1) = 1 We call this sampling serial although nothing prevents us from computing the actual update to the block, and/or to apply he update in parallel. 7 / 30 Sampling Zoo - Part II fully parallel sampling: always picks all blocks: P(Ŝ = {1, 2,..., n}) = 1 3. Processor reliability. Sampling may be induced/informed by the computing environment: Reliable/dedicated processors. If one has reliable processors, it is sensible to choose sampling Ŝ such that P( Ŝ = τ) = 1 for some τ related to the number of processors. Unreliable processors. If processors given a computing task are busy or unreliable, they return answer later or not at all - it is then sensible to ignore such updates and move on. This then means that Ŝ varies from iteration to iteration. 4. Distributed computing. In a distributed computing environment it is sensible: to allow each compute node as much autonomy as possible so as to minimize communication cost, to make sure all nodes are busy at all times 8 / 30

5 Sampling Zoo - Part III This suggests a strategy where the set of blocks is partitioned, with each node owning a partition, and independently picking a chunky subset of blocks at each iteration it will update, ideally from local information. 5. Uniformity. It may or may not make sense to update some blocks more often than others: uniform samplings: P(i Ŝ) = P(j Ŝ) for all i, j [n] doubly uniform (DU): These are samplings characterized by: S = S P(Ŝ = S ) = P(Ŝ = S ) for all S, S [n] τ-nice: DU sampling with the additional property that P( Ŝ = τ) = 1 distributed τ-nice: will ine later independent sampling: union of independent uniform serial samplings nonuniform samplings 9 / 30 Sampling Zoo - Part IV 6. Complexity of generating a sampling. Some samplings are computationally more efficient to generate than others: the potential benefits of a sampling may be completely ruined by the difficulty to generate sets according to the sampling s distribution. a τ-nice sampling can be well approximated by an independent sampling, which is easy to generate... in general, many samplings will be hard to generate 10 / 30

6 Assumption: Strong convexity Assumption 1 (Strong convexity) Function f is differentiable and λ-strongly convex (with λ > 0) with respect to the standard Euclidean norm h = ( n ) 1/2 hi 2. That is, we assume that for all x, h R n, f (x + h) f (x) + f (x), h + λ 2 h 2. (8) 11 / 30 Assumption: Expected Separable Overapproximation Assumption 2 (ESO) Assume Ŝ is proper and that for some vector of positive weights v = (v 1,..., v n ) and all x, h R n, E[f (x + hŝ)] f (x) + f (x), h p h 2 p v. (9) Note that the ESO parameters v, p depend on both f and Ŝ. For simplicity, we will often instead of (9) use the compact notation Notation used above: h S = i S (f, Ŝ) ESO(v). h i e i R n (projection of h R n onto coordinates i S) g, h p p v = n p i g i h i R (weighted inner product) = (p 1 v 1,..., p n v n ) R n (Hadamard product) 12 / 30

7 Assumption: Expected Separable Overapproximation Here the ESO inequality again, now without the simplifying notation: E f x + h i e i f (x) + i Ŝ } {{ } complicated n p i i f (x)h i + } {{ } linear in h 1 n p i v i hi 2 2 } {{ } quadratic and separable in h 13 / 30 Complexity of NSync Theorem 3 (R. and Takáč 2013, [4]) Let x be a minimizer of f. Let Assumptions 1 and 2 be satisfied for a proper sampling Ŝ (that is, (f, Ŝ) ESO(v)). Choose starting point x 0 R n, error tolerance 0 < ɛ < f (x 0 ) f (x ) and confidence level 0 < ρ < 1. If {x k } are the random iterates generated by NSync, where the random sets S k are iid following the distribution of Ŝ, then K Ω ( f (x 0 λ log ) f (x ) ) P(f (x K ) f (x ) ɛ) 1 ρ, (10) ɛρ where Ω = max,...,n v i p i n v i E[ Ŝ ]. (11) 14 / 30

8 What does this mean? Linear convergence. NSync converges linearly (i.e., logarithmic dependence on ɛ) High confidence is not a problem. ρ appears inside the logarithm, so it easy to achieve high confidence (by running the method longer; there is no need to restart) Focus on the leading term. The leading term is Ω; and we have a closed-form expression for it in terms of parameters v1,..., v n (which depend on f and Ŝ) parameters p1,..., p n (which depend on Ŝ) Parallelization speedup. The lower bound suggests that if it was the case that the parameters v i did not grow with increasing E[ Ŝ ], then we could potentially be getting linear speedup in τ (average number of updates per iteration). So we shall study the dependence of vi on τ (this will depend on f and Ŝ) As we shall see, speedup is often guaranteed for sparse or well-conditioned problems. τ = Question: How to design sampling Ŝ so that Ω is minimized? 15 / 30 Analysis of the Algorithm (Proof of Theorem 3) 16 / 30

9 Tool: Markov s Inequality Theorem 4 (Markov s Inequality) Let X be a nonnegative random variable. Then for any ɛ > 0, P(X ɛ) E[X ]. ɛ Proof. Let 1 X ɛ be the random variable which is equal to 1 if X ɛ and 0 otherwise. Then 1 X ɛ X ɛ. By taking expectations of all terms, we obtain [ ] X P(X ɛ) = E [1 X ɛ ] E ɛ = E[X ]. ɛ 17 / 30 Tool: Tower Property of Expectations (Motivation) Example 5 Consider discrete random variables X and Y: X has 2 outcomes: x 1 and x 2 Y has 3 outcomes: y 1, y 2 and y 3 Their joint probability mass function if given in this table: y 1 y 2 y 3 x x Obviously, E[X] = 0.28x x 2. But we can also write: E[X] = (0.05x x 2 ) + (0.20x x 2 ) + (0.03x x 2 ) = 0.30 }{{} P(Y=y 1 ) 0.30 x x 2) +0.50( } {{ } x x 2) ( x x 2) E[X Y=y 1 ] ( 0.05 = E[E[X Y]]. 18 / 30

10 Tower Property Lemma 6 (Tower Property / Iterated Expectation) For any random variables X and Y, we have E[X ] = E[E[X Y ]]. Proof. We shall only prove this for discrete random variables; the proof is more technical in the continuous case. E[X ] = x xp(x = x) = x x y P(X = x & Y = y) = y = y xp(x = x & Y = y) x P(X = x & Y = y) P(Y = y)x P(Y = y) x = y P(Y = y) xp(x = x Y = y) x } {{ } E[X Y =y] = E[E[X Y ]]. 19 / 30 Proof of Theorem 3 - Part I If we let µ = λ/ω, then f (x + h) (8) f (x) + f (x), h + λ 2 h 2 f (x) + f (x), h + µ 2 h 2 v p 1. (12) Indeed, one can easily verify that µ is ined to be the largest number for which λ h 2 µ h 2 v p 1 holds for all h. Hence, f is µ-strongly convex with respect to the norm v p 1. Let x be a minimizer of f, i.e., an optimal solution of (1). Minimizing both sides of (12) in h, we get f (x ) f (x) (12) min h R n f (x), h + µ 2 h 2 v p 1 = 1 2µ f (x) 2 p v 1. (13) 20 / 30

11 Proof of Theorem 3 - Part II Let h k = v 1 f (x k ). Then in view of (3), we have x k+1 = x k + h k S k. Utilizing Assumption 2, we get E[f (x k+1 ) x k ] = E [ f (x k + h k S k ) x k] (9) f (x k ) + f (x k ), h k p hk 2 p v = f (x k ) 1 2 f (x k ) 2 p v 1 (13) f (x k ) µ(f (x k ) f (x )). Taking expectations in the last inequality, using the tower property, and subtracting f (x ) from both sides of the inequality, we get E[f (x k+1 ) f (x )] (1 µ)e[f (x k ) f (x )]. Unrolling the recurrence, we get E[f (x k ) f (x )] (1 µ) k (f (x 0 ) f (x )). (14) 21 / 30 Proof of Theorem 3 - Part III Using Markov s inequality, (14) and the inition of K, we get P(f (x K ) f (x ) ɛ) E[f (x K ) f (x )]/ɛ (14) (1 µ) K (f (x 0 ) f (x ))/ɛ (10) ρ. Finally, let us now establish the lower bound on Ω. Letting { } n = p R n : p 0, p i = E[ Ŝ ], we have Ω (11) = max i v i p i (7) min max p i v i p i = 1 E[ Ŝ ] n v i, where the last equality follows since optimal p i is proportional to v i. 22 / 30

12 How to compute the ESO stepsize parameters v 1,..., v n? 23 / 30 C 1 (A) Functions By inition, the ESO parameters v depend on both f and also highlighted by the notation we use: Ŝ. This is (f, Ŝ) ESO(v). Hence, in order to compute these parameters, we need to specify f and Ŝ. The following inition describes a wide class of functions, often appearing in computational practice, for which we will be able to do this. Definition 7 (C 1 (A) functions) Let A R m n. By C 1 (A) we denote the set of continuously differentiable functions f : R n R satisfying the following inequality for all x, h R n : f (x + h) f (x) + ( f (x)) T h ht A T Ah. (15) Example 8 (Least Squares) The function f (x) = 1 2 Ax b 2 satisfies (15) with an equality. Hence, f C 1 (A). 24 / 30

13 Are There More C 1 (A) Functions? Functions we wish to minimize in machine learning often have a finite sum structure: f (x) = j φ j(m j x), (16) where M j R m n are data matrices and φ j : R m R are loss functions. The next result says that under certain assumptions on the loss functions, f C 1 (A) for some A, and describes what this A is. Theorem 9 ([5]) Assume that for each j, function φ j is γ j -smooth: φ j (s) φ j (s ) γ j s s, for all s, s R m. Then f C 1 (A), where A satisfies A T A = J j=1 γ jm T j M j. Remark: The above theorem also says what A T A is. This is important, as will be clear from the next theorem. 25 / 30 ESO for C 1 (A) Functions and an Arbitrary Sampling Theorem 10 ([5]) Assume f C 1 (A) for some real matrix A R m n, let Ŝ be an arbitrary sampling and let P = P(Ŝ) be its probability matrix. If v = (v 1,..., v n ) satisfies P A T A Diag(p v), then (f, Ŝ) ESO(v). 26 / 30

14 Sampling Identity for a Quadratic In in order to prove Theorem 10, we will need the following lemma. Lemma 11 ([5]) Let G be any real n n matrix and Ŝ an arbitrary sampling. Then for any h R n we have E [ h Ṱ Gh ] ( ) S Ŝ = h T P(Ŝ) G h, (17) where denotes the Hadamard (elementwise) product of matrices, and P(Ŝ) is the probability matrix of Ŝ. 27 / 30 Proof of Lemma 11 Proof. Let 1 ij be the indicator random variable of the event i Ŝ & j Ŝ: 1 ij = { 1 if i Ŝ & j Ŝ, 0 otherwise. Note that E[1 ij ] = P(i Ŝ & j Ŝ) = p ij. We now have E [ h Ṱ Gh ] (2) S Ŝ = E n n G ij h i h j = E 1 ij G ij h i h j i Ŝ j Ŝ j=1 n n n n = E[1 ij ]G ij h i h j = p ij G ij h i h j j=1 = h T ( P(Ŝ) G ) h. j=1 28 / 30

15 Proof of Theorem 10 Having established Lemma 11, we are now ready prove Theorem 10. Proof. Fixing any x, h R n, we have E [ f (x + hŝ) ] (15) E [ ] f (x) + f (x), hŝ hṱ S AT AhŜ ] = f (x) + f (x), h p E [ h Ṱ S AT AhŜ (Lemma 11) = f (x) + f (x), h p ht ( P A T A ) h f (x) + f (x), h p ht Diag(p v)h. } {{ } = h 2 p v 29 / 30 References I [1] Yurii Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2): , 2012 [2] Peter Richtárik and Martin Takáč. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 144(2):1 38, 2014 [3] Peter Richtárik and Martin Takáč. Parallel coordinate descent methods for big data optimization. Mathematical Programming, 1 52, 2015 [4] Peter Richtárik and Martin Takáč. On optimal probabilities in stochastic coordinate descent methods. Optimization Letters, 2015 [5] Zheng Qu and Peter Richtárik. Coordinate descent with arbitrary sampling II: expected separable overapproximation, arxiv: , / 30

Adaptive Online Gradient Descent Peter L Bartlett Division of Computer Science Department of Statistics UC Berkeley Berkeley, CA 94709 bartlett@csberkeleyedu Elad Hazan IBM Almaden Research Center 650

### Big Data Optimization: Randomized lock-free methods for minimizing partially separable convex functions

Big Data Optimization: Randomized lock-free methods for minimizing partially separable convex functions Peter Richtárik School of Mathematics The University of Edinburgh Joint work with Martin Takáč (Edinburgh)

### Lecture 4: BK inequality 27th August and 6th September, 2007

CSL866: Percolation and Random Graphs IIT Delhi Amitabha Bagchi Scribe: Arindam Pal Lecture 4: BK inequality 27th August and 6th September, 2007 4. Preliminaries The FKG inequality allows us to lower bound

### 2.3 Convex Constrained Optimization Problems

42 CHAPTER 2. FUNDAMENTAL CONCEPTS IN CONVEX OPTIMIZATION Theorem 15 Let f : R n R and h : R R. Consider g(x) = h(f(x)) for all x R n. The function g is convex if either of the following two conditions

### 1. (First passage/hitting times/gambler s ruin problem:) Suppose that X has a discrete state space and let i be a fixed state. Let

Copyright c 2009 by Karl Sigman 1 Stopping Times 1.1 Stopping Times: Definition Given a stochastic process X = {X n : n 0}, a random time τ is a discrete random variable on the same probability space as

### 17. Inner product spaces Definition 17.1. Let V be a real vector space. An inner product on V is a function

17. Inner product spaces Definition 17.1. Let V be a real vector space. An inner product on V is a function, : V V R, which is symmetric, that is u, v = v, u. bilinear, that is linear (in both factors):

### THEORY OF SIMPLEX METHOD

Chapter THEORY OF SIMPLEX METHOD Mathematical Programming Problems A mathematical programming problem is an optimization problem of finding the values of the unknown variables x, x,, x n that maximize

### Probabilistic Models for Big Data. Alex Davies and Roger Frigola University of Cambridge 13th February 2014

Probabilistic Models for Big Data Alex Davies and Roger Frigola University of Cambridge 13th February 2014 The State of Big Data Why probabilistic models for Big Data? 1. If you don t have to worry about

### Probability Generating Functions

page 39 Chapter 3 Probability Generating Functions 3 Preamble: Generating Functions Generating functions are widely used in mathematics, and play an important role in probability theory Consider a sequence

### 5. Convergence of sequences of random variables

5. Convergence of sequences of random variables Throughout this chapter we assume that {X, X 2,...} is a sequence of r.v. and X is a r.v., and all of them are defined on the same probability space (Ω,

### Linear Threshold Units

Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear

### Inner Product Spaces

Math 571 Inner Product Spaces 1. Preliminaries An inner product space is a vector space V along with a function, called an inner product which associates each pair of vectors u, v with a scalar u, v, and

### Introduction to Convex Optimization for Machine Learning

Introduction to Convex Optimization for Machine Learning John Duchi University of California, Berkeley Practical Machine Learning, Fall 2009 Duchi (UC Berkeley) Convex Optimization for Machine Learning

### t := maxγ ν subject to ν {0,1,2,...} and f(x c +γ ν d) f(x c )+cγ ν f (x c ;d).

1. Line Search Methods Let f : R n R be given and suppose that x c is our current best estimate of a solution to P min x R nf(x). A standard method for improving the estimate x c is to choose a direction

### Mathematical finance and linear programming (optimization)

Mathematical finance and linear programming (optimization) Geir Dahl September 15, 2009 1 Introduction The purpose of this short note is to explain how linear programming (LP) (=linear optimization) may

### v 1. v n R n we have for each 1 j n that v j v n max 1 j n v j. i=1

1. Limits and Continuity It is often the case that a non-linear function of n-variables x = (x 1,..., x n ) is not really defined on all of R n. For instance f(x 1, x 2 ) = x 1x 2 is not defined when x

### Introduction to Flocking {Stochastic Matrices}

Supelec EECI Graduate School in Control Introduction to Flocking {Stochastic Matrices} A. S. Morse Yale University Gif sur - Yvette May 21, 2012 CRAIG REYNOLDS - 1987 BOIDS The Lion King CRAIG REYNOLDS

### Statistical Machine Learning

Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes

### CITY UNIVERSITY LONDON. BEng Degree in Computer Systems Engineering Part II BSc Degree in Computer Systems Engineering Part III PART 2 EXAMINATION

No: CITY UNIVERSITY LONDON BEng Degree in Computer Systems Engineering Part II BSc Degree in Computer Systems Engineering Part III PART 2 EXAMINATION ENGINEERING MATHEMATICS 2 (resit) EX2005 Date: August

### 1 Limiting distribution for a Markov chain

Copyright c 2009 by Karl Sigman Limiting distribution for a Markov chain In these Lecture Notes, we shall study the limiting behavior of Markov chains as time n In particular, under suitable easy-to-check

### 1 Norms and Vector Spaces

008.10.07.01 1 Norms and Vector Spaces Suppose we have a complex vector space V. A norm is a function f : V R which satisfies (i) f(x) 0 for all x V (ii) f(x + y) f(x) + f(y) for all x,y V (iii) f(λx)

### Introduction to Scheduling Theory

Introduction to Scheduling Theory Arnaud Legrand Laboratoire Informatique et Distribution IMAG CNRS, France arnaud.legrand@imag.fr November 8, 2004 1/ 26 Outline 1 Task graphs from outer space 2 Scheduling

### Mathematics Course 111: Algebra I Part IV: Vector Spaces

Mathematics Course 111: Algebra I Part IV: Vector Spaces D. R. Wilkins Academic Year 1996-7 9 Vector Spaces A vector space over some field K is an algebraic structure consisting of a set V on which are

### Høgskolen i Narvik Sivilingeniørutdanningen STE6237 ELEMENTMETODER. Oppgaver

Høgskolen i Narvik Sivilingeniørutdanningen STE637 ELEMENTMETODER Oppgaver Klasse: 4.ID, 4.IT Ekstern Professor: Gregory A. Chechkin e-mail: chechkin@mech.math.msu.su Narvik 6 PART I Task. Consider two-point

### BANACH AND HILBERT SPACE REVIEW

BANACH AND HILBET SPACE EVIEW CHISTOPHE HEIL These notes will briefly review some basic concepts related to the theory of Banach and Hilbert spaces. We are not trying to give a complete development, but

### DATA ANALYSIS II. Matrix Algorithms

DATA ANALYSIS II Matrix Algorithms Similarity Matrix Given a dataset D = {x i }, i=1,..,n consisting of n points in R d, let A denote the n n symmetric similarity matrix between the points, given as where

### THE DYING FIBONACCI TREE. 1. Introduction. Consider a tree with two types of nodes, say A and B, and the following properties:

THE DYING FIBONACCI TREE BERNHARD GITTENBERGER 1. Introduction Consider a tree with two types of nodes, say A and B, and the following properties: 1. Let the root be of type A.. Each node of type A produces

### Proximal mapping via network optimization

L. Vandenberghe EE236C (Spring 23-4) Proximal mapping via network optimization minimum cut and maximum flow problems parametric minimum cut problem application to proximal mapping Introduction this lecture:

### Numerical Analysis Lecture Notes

Numerical Analysis Lecture Notes Peter J. Olver 5. Inner Products and Norms The norm of a vector is a measure of its size. Besides the familiar Euclidean norm based on the dot product, there are a number

### Numerical methods for American options

Lecture 9 Numerical methods for American options Lecture Notes by Andrzej Palczewski Computational Finance p. 1 American options The holder of an American option has the right to exercise it at any moment

### minimal polyonomial Example

Minimal Polynomials Definition Let α be an element in GF(p e ). We call the monic polynomial of smallest degree which has coefficients in GF(p) and α as a root, the minimal polyonomial of α. Example: We

### (Quasi-)Newton methods

(Quasi-)Newton methods 1 Introduction 1.1 Newton method Newton method is a method to find the zeros of a differentiable non-linear function g, x such that g(x) = 0, where g : R n R n. Given a starting

### Continued Fractions and the Euclidean Algorithm

Continued Fractions and the Euclidean Algorithm Lecture notes prepared for MATH 326, Spring 997 Department of Mathematics and Statistics University at Albany William F Hammond Table of Contents Introduction

### Probability Theory. Florian Herzog. A random variable is neither random nor variable. Gian-Carlo Rota, M.I.T..

Probability Theory A random variable is neither random nor variable. Gian-Carlo Rota, M.I.T.. Florian Herzog 2013 Probability space Probability space A probability space W is a unique triple W = {Ω, F,

### No: 10 04. Bilkent University. Monotonic Extension. Farhad Husseinov. Discussion Papers. Department of Economics

No: 10 04 Bilkent University Monotonic Extension Farhad Husseinov Discussion Papers Department of Economics The Discussion Papers of the Department of Economics are intended to make the initial results

### ALMOST COMMON PRIORS 1. INTRODUCTION

ALMOST COMMON PRIORS ZIV HELLMAN ABSTRACT. What happens when priors are not common? We introduce a measure for how far a type space is from having a common prior, which we term prior distance. If a type

### MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 5 9/17/2008 RANDOM VARIABLES

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 5 9/17/2008 RANDOM VARIABLES Contents 1. Random variables and measurable functions 2. Cumulative distribution functions 3. Discrete

### GenOpt (R) Generic Optimization Program User Manual Version 3.0.0β1

(R) User Manual Environmental Energy Technologies Division Berkeley, CA 94720 http://simulationresearch.lbl.gov Michael Wetter MWetter@lbl.gov February 20, 2009 Notice: This work was supported by the U.S.

### Stochastic Inventory Control

Chapter 3 Stochastic Inventory Control 1 In this chapter, we consider in much greater details certain dynamic inventory control problems of the type already encountered in section 1.3. In addition to the

### Tail inequalities for order statistics of log-concave vectors and applications

Tail inequalities for order statistics of log-concave vectors and applications Rafał Latała Based in part on a joint work with R.Adamczak, A.E.Litvak, A.Pajor and N.Tomczak-Jaegermann Banff, May 2011 Basic

### Probability and Statistics

CHAPTER 2: RANDOM VARIABLES AND ASSOCIATED FUNCTIONS 2b - 0 Probability and Statistics Kristel Van Steen, PhD 2 Montefiore Institute - Systems and Modeling GIGA - Bioinformatics ULg kristel.vansteen@ulg.ac.be

### Randomized Coordinate Descent Methods for Big Data Optimization

Randomized Coordinate Descent Methods for Big Data Optimization Doctoral thesis Martin Takáč Supervisor: Dr. Peter Richtárik Doctor of Philosophy The University of Edinburgh 2014 Randomized Coordinate

### Integer Factorization using the Quadratic Sieve

Integer Factorization using the Quadratic Sieve Chad Seibert* Division of Science and Mathematics University of Minnesota, Morris Morris, MN 56567 seib0060@morris.umn.edu March 16, 2011 Abstract We give

### Ideal Class Group and Units

Chapter 4 Ideal Class Group and Units We are now interested in understanding two aspects of ring of integers of number fields: how principal they are (that is, what is the proportion of principal ideals

### Continuity of the Perron Root

Linear and Multilinear Algebra http://dx.doi.org/10.1080/03081087.2014.934233 ArXiv: 1407.7564 (http://arxiv.org/abs/1407.7564) Continuity of the Perron Root Carl D. Meyer Department of Mathematics, North

### MATHEMATICAL BACKGROUND

Chapter 1 MATHEMATICAL BACKGROUND This chapter discusses the mathematics that is necessary for the development of the theory of linear programming. We are particularly interested in the solutions of a

### Markov random fields and Gibbs measures

Chapter Markov random fields and Gibbs measures 1. Conditional independence Suppose X i is a random element of (X i, B i ), for i = 1, 2, 3, with all X i defined on the same probability space (.F, P).

### THE NUMBER OF GRAPHS AND A RANDOM GRAPH WITH A GIVEN DEGREE SEQUENCE. Alexander Barvinok

THE NUMBER OF GRAPHS AND A RANDOM GRAPH WITH A GIVEN DEGREE SEQUENCE Alexer Barvinok Papers are available at http://www.math.lsa.umich.edu/ barvinok/papers.html This is a joint work with J.A. Hartigan

### ON SEQUENTIAL CONTINUITY OF COMPOSITION MAPPING. 0. Introduction

ON SEQUENTIAL CONTINUITY OF COMPOSITION MAPPING Abstract. In [1] there was proved a theorem concerning the continuity of the composition mapping, and there was announced a theorem on sequential continuity

### Approximation Algorithms

Approximation Algorithms or: How I Learned to Stop Worrying and Deal with NP-Completeness Ong Jit Sheng, Jonathan (A0073924B) March, 2012 Overview Key Results (I) General techniques: Greedy algorithms

### Two-Stage Stochastic Linear Programs

Two-Stage Stochastic Linear Programs Operations Research Anthony Papavasiliou 1 / 27 Two-Stage Stochastic Linear Programs 1 Short Reviews Probability Spaces and Random Variables Convex Analysis 2 Deterministic

### THREE DIMENSIONAL GEOMETRY

Chapter 8 THREE DIMENSIONAL GEOMETRY 8.1 Introduction In this chapter we present a vector algebra approach to three dimensional geometry. The aim is to present standard properties of lines and planes,

### Pacific Journal of Mathematics

Pacific Journal of Mathematics GLOBAL EXISTENCE AND DECREASING PROPERTY OF BOUNDARY VALUES OF SOLUTIONS TO PARABOLIC EQUATIONS WITH NONLOCAL BOUNDARY CONDITIONS Sangwon Seo Volume 193 No. 1 March 2000

### Chapter 4 Lecture Notes

Chapter 4 Lecture Notes Random Variables October 27, 2015 1 Section 4.1 Random Variables A random variable is typically a real-valued function defined on the sample space of some experiment. For instance,

### SF2940: Probability theory Lecture 8: Multivariate Normal Distribution

SF2940: Probability theory Lecture 8: Multivariate Normal Distribution Timo Koski 24.09.2015 Timo Koski Matematisk statistik 24.09.2015 1 / 1 Learning outcomes Random vectors, mean vector, covariance matrix,

### SMOOTHING APPROXIMATIONS FOR TWO CLASSES OF CONVEX EIGENVALUE OPTIMIZATION PROBLEMS YU QI. (B.Sc.(Hons.), BUAA)

SMOOTHING APPROXIMATIONS FOR TWO CLASSES OF CONVEX EIGENVALUE OPTIMIZATION PROBLEMS YU QI (B.Sc.(Hons.), BUAA) A THESIS SUBMITTED FOR THE DEGREE OF MASTER OF SCIENCE DEPARTMENT OF MATHEMATICS NATIONAL

### COMBINATORIAL PROPERTIES OF THE HIGMAN-SIMS GRAPH. 1. Introduction

COMBINATORIAL PROPERTIES OF THE HIGMAN-SIMS GRAPH ZACHARY ABEL 1. Introduction In this survey we discuss properties of the Higman-Sims graph, which has 100 vertices, 1100 edges, and is 22 regular. In fact

### Distributed Coordinate Descent Method for Learning with Big Data

Peter Richtárik Martin Takáč University of Edinburgh, King s Buildings, EH9 3JZ Edinburgh, United Kingdom PETER.RICHTARIK@ED.AC.UK MARTIN.TAKI@GMAIL.COM Abstract In this paper we develop and analyze Hydra:

### MATH 304 Linear Algebra Lecture 20: Inner product spaces. Orthogonal sets.

MATH 304 Linear Algebra Lecture 20: Inner product spaces. Orthogonal sets. Norm The notion of norm generalizes the notion of length of a vector in R n. Definition. Let V be a vector space. A function α

### CHAPTER III - MARKOV CHAINS

CHAPTER III - MARKOV CHAINS JOSEPH G. CONLON 1. General Theory of Markov Chains We have already discussed the standard random walk on the integers Z. A Markov Chain can be viewed as a generalization of

Approximation Algorithms Chapter Approximation Algorithms Q. Suppose I need to solve an NP-hard problem. What should I do? A. Theory says you're unlikely to find a poly-time algorithm. Must sacrifice one

### Markov Chains, part I

Markov Chains, part I December 8, 2010 1 Introduction A Markov Chain is a sequence of random variables X 0, X 1,, where each X i S, such that P(X i+1 = s i+1 X i = s i, X i 1 = s i 1,, X 0 = s 0 ) = P(X

### Vector and Matrix Norms

Chapter 1 Vector and Matrix Norms 11 Vector Spaces Let F be a field (such as the real numbers, R, or complex numbers, C) with elements called scalars A Vector Space, V, over the field F is a non-empty

### INDISTINGUISHABILITY OF ABSOLUTELY CONTINUOUS AND SINGULAR DISTRIBUTIONS

INDISTINGUISHABILITY OF ABSOLUTELY CONTINUOUS AND SINGULAR DISTRIBUTIONS STEVEN P. LALLEY AND ANDREW NOBEL Abstract. It is shown that there are no consistent decision rules for the hypothesis testing problem

### 1 Definition of Conditional Expectation

1 DEFINITION OF CONDITIONAL EXPECTATION 4 1 Definition of Conditional Expectation 1.1 General definition Recall the definition of conditional probability associated with Bayes Rule P(A B) P(A B) P(B) For

### Master s Theory Exam Spring 2006

Spring 2006 This exam contains 7 questions. You should attempt them all. Each question is divided into parts to help lead you through the material. You should attempt to complete as much of each problem

### An improved on-line algorithm for scheduling on two unrestrictive parallel batch processing machines

This is the Pre-Published Version. An improved on-line algorithm for scheduling on two unrestrictive parallel batch processing machines Q.Q. Nong, T.C.E. Cheng, C.T. Ng Department of Mathematics, Ocean

### 15 Markov Chains: Limiting Probabilities

MARKOV CHAINS: LIMITING PROBABILITIES 67 Markov Chains: Limiting Probabilities Example Assume that the transition matrix is given by 7 2 P = 6 Recall that the n-step transition probabilities are given

### RANDOM INTERVAL HOMEOMORPHISMS. MICHA L MISIUREWICZ Indiana University Purdue University Indianapolis

RANDOM INTERVAL HOMEOMORPHISMS MICHA L MISIUREWICZ Indiana University Purdue University Indianapolis This is a joint work with Lluís Alsedà Motivation: A talk by Yulij Ilyashenko. Two interval maps, applied

### Advanced Topics in Machine Learning (Part II)

Advanced Topics in Machine Learning (Part II) 3. Convexity and Optimisation February 6, 2009 Andreas Argyriou 1 Today s Plan Convex sets and functions Types of convex programs Algorithms Convex learning

### Several Views of Support Vector Machines

Several Views of Support Vector Machines Ryan M. Rifkin Honda Research Institute USA, Inc. Human Intention Understanding Group 2007 Tikhonov Regularization We are considering algorithms of the form min

### Shiqian Ma, SEEM 5121, Dept. of SEEM, CUHK 1. Chapter 2. Convex Optimization

Shiqian Ma, SEEM 5121, Dept. of SEEM, CUHK 1 Chapter 2 Convex Optimization Shiqian Ma, SEEM 5121, Dept. of SEEM, CUHK 2 2.1. Convex Optimization General optimization problem: min f 0 (x) s.t., f i (x)

### Nonlinear Optimization: Algorithms 3: Interior-point methods

Nonlinear Optimization: Algorithms 3: Interior-point methods INSEAD, Spring 2006 Jean-Philippe Vert Ecole des Mines de Paris Jean-Philippe.Vert@mines.org Nonlinear optimization c 2006 Jean-Philippe Vert,

### Geometrical Characterization of RN-operators between Locally Convex Vector Spaces

Geometrical Characterization of RN-operators between Locally Convex Vector Spaces OLEG REINOV St. Petersburg State University Dept. of Mathematics and Mechanics Universitetskii pr. 28, 198504 St, Petersburg

### LINEAR SYSTEMS. Consider the following example of a linear system:

LINEAR SYSTEMS Consider the following example of a linear system: Its unique solution is x +2x 2 +3x 3 = 5 x + x 3 = 3 3x + x 2 +3x 3 = 3 x =, x 2 =0, x 3 = 2 In general we want to solve n equations in

### Optimization of the HOTS score of a website s pages

Optimization of the HOTS score of a website s pages Olivier Fercoq and Stéphane Gaubert INRIA Saclay and CMAP Ecole Polytechnique June 21st, 2012 Toy example with 21 pages 3 5 6 2 4 20 1 12 9 11 15 16

### Let H and J be as in the above lemma. The result of the lemma shows that the integral

Let and be as in the above lemma. The result of the lemma shows that the integral ( f(x, y)dy) dx is well defined; we denote it by f(x, y)dydx. By symmetry, also the integral ( f(x, y)dx) dy is well defined;

### 1 Approximating Set Cover

CS 05: Algorithms (Grad) Feb 2-24, 2005 Approximating Set Cover. Definition An Instance (X, F ) of the set-covering problem consists of a finite set X and a family F of subset of X, such that every elemennt

### MATH 56A SPRING 2008 STOCHASTIC PROCESSES 31

MATH 56A SPRING 2008 STOCHASTIC PROCESSES 3.3. Invariant probability distribution. Definition.4. A probability distribution is a function π : S [0, ] from the set of states S to the closed unit interval

### Lecture 11. Shuanglin Shao. October 2nd and 7th, 2013

Lecture 11 Shuanglin Shao October 2nd and 7th, 2013 Matrix determinants: addition. Determinants: multiplication. Adjoint of a matrix. Cramer s rule to solve a linear system. Recall that from the previous

### Walrasian Demand. u(x) where B(p, w) = {x R n + : p x w}.

Walrasian Demand Econ 2100 Fall 2015 Lecture 5, September 16 Outline 1 Walrasian Demand 2 Properties of Walrasian Demand 3 An Optimization Recipe 4 First and Second Order Conditions Definition Walrasian

### S = {1, 2,..., n}. P (1, 1) P (1, 2)... P (1, n) P (2, 1) P (2, 2)... P (2, n) P = . P (n, 1) P (n, 2)... P (n, n)

last revised: 26 January 2009 1 Markov Chains A Markov chain process is a simple type of stochastic process with many social science applications. We ll start with an abstract description before moving

### Eigenvalues and Markov Chains

Eigenvalues and Markov Chains Will Perkins April 15, 2013 The Metropolis Algorithm Say we want to sample from a different distribution, not necessarily uniform. Can we change the transition rates in such

### The Power Method for Eigenvalues and Eigenvectors

Numerical Analysis Massoud Malek The Power Method for Eigenvalues and Eigenvectors The spectrum of a square matrix A, denoted by σ(a) is the set of all eigenvalues of A. The spectral radius of A, denoted

### a 11 x 1 + a 12 x 2 + + a 1n x n = b 1 a 21 x 1 + a 22 x 2 + + a 2n x n = b 2.

Chapter 1 LINEAR EQUATIONS 1.1 Introduction to linear equations A linear equation in n unknowns x 1, x,, x n is an equation of the form a 1 x 1 + a x + + a n x n = b, where a 1, a,..., a n, b are given

### NMR Measurement of T1-T2 Spectra with Partial Measurements using Compressive Sensing

NMR Measurement of T1-T2 Spectra with Partial Measurements using Compressive Sensing Alex Cloninger Norbert Wiener Center Department of Mathematics University of Maryland, College Park http://www.norbertwiener.umd.edu

### Lectures on Stochastic Processes. William G. Faris

Lectures on Stochastic Processes William G. Faris November 8, 2001 2 Contents 1 Random walk 7 1.1 Symmetric simple random walk................... 7 1.2 Simple random walk......................... 9 1.3

### Math 4310 Handout - Quotient Vector Spaces

Math 4310 Handout - Quotient Vector Spaces Dan Collins The textbook defines a subspace of a vector space in Chapter 4, but it avoids ever discussing the notion of a quotient space. This is understandable

### MATHEMATICAL METHODS OF STATISTICS

MATHEMATICAL METHODS OF STATISTICS By HARALD CRAMER TROFESSOK IN THE UNIVERSITY OF STOCKHOLM Princeton PRINCETON UNIVERSITY PRESS 1946 TABLE OF CONTENTS. First Part. MATHEMATICAL INTRODUCTION. CHAPTERS

### Machine Learning and Pattern Recognition Logistic Regression

Machine Learning and Pattern Recognition Logistic Regression Course Lecturer:Amos J Storkey Institute for Adaptive and Neural Computation School of Informatics University of Edinburgh Crichton Street,

### Two Topics in Parametric Integration Applied to Stochastic Simulation in Industrial Engineering

Two Topics in Parametric Integration Applied to Stochastic Simulation in Industrial Engineering Department of Industrial Engineering and Management Sciences Northwestern University September 15th, 2014

### Error estimates for nearly degenerate finite elements

Error estimates for nearly degenerate finite elements Pierre Jamet In RAIRO: Analyse Numérique, Vol 10, No 3, March 1976, p. 43 61 Abstract We study a property which is satisfied by most commonly used

### Linear Algebra I. Ronald van Luijk, 2012

Linear Algebra I Ronald van Luijk, 2012 With many parts from Linear Algebra I by Michael Stoll, 2007 Contents 1. Vector spaces 3 1.1. Examples 3 1.2. Fields 4 1.3. The field of complex numbers. 6 1.4.

### Influences in low-degree polynomials

Influences in low-degree polynomials Artūrs Bačkurs December 12, 2012 1 Introduction In 3] it is conjectured that every bounded real polynomial has a highly influential variable The conjecture is known

### Separation Properties for Locally Convex Cones

Journal of Convex Analysis Volume 9 (2002), No. 1, 301 307 Separation Properties for Locally Convex Cones Walter Roth Department of Mathematics, Universiti Brunei Darussalam, Gadong BE1410, Brunei Darussalam

### Similarity and Diagonalization. Similar Matrices

MATH022 Linear Algebra Brief lecture notes 48 Similarity and Diagonalization Similar Matrices Let A and B be n n matrices. We say that A is similar to B if there is an invertible n n matrix P such that

### An Additive Neumann-Neumann Method for Mortar Finite Element for 4th Order Problems

An Additive eumann-eumann Method for Mortar Finite Element for 4th Order Problems Leszek Marcinkowski Department of Mathematics, University of Warsaw, Banacha 2, 02-097 Warszawa, Poland, Leszek.Marcinkowski@mimuw.edu.pl

CS787: Advanced Algorithms Topic: Online Learning Presenters: David He, Chris Hopman 17.3.1 Follow the Perturbed Leader 17.3.1.1 Prediction Problem Recall the prediction problem that we discussed in class.