On Doubly Stochastic Graph Optimization

Abstract

In this paper we introduce an approximate optimization framework for solving graph problems involving doubly stochastic matrices. This is achieved by using a low-dimensional formulation of the matrices, and the approximate solution is obtained by a simple subgradient method. We also describe one problem that can be solved using our method.

1 Introduction

Graph optimization is a class of problems in which edge weights or transition probabilities are assigned to a given graph so as to minimize a given criterion, usually subject to connectivity and other constraints. An example is the fastest mixing Markov chain problem [1], where the objective is to assign transition probabilities that minimize the mixing rate of a Markov random walk on a given graph. The mixing rate problem has been shown to arise in a class of gossip algorithm problems [10], where the objective is to find an averaging algorithm, or equivalently a transition matrix, such that the averaging time over the graph is minimized. Other related problems that involve optimization over spectral functions of doubly stochastic matrices include:

1. Minimizing effective resistance on a graph [2], where the idea is to choose a random walk on the graph that minimizes the average commute time between all nodes.

2. Finding the best doubly stochastic approximation to a given affinity matrix [5], which arises in the context of spectral clustering in machine learning.

Here we consider the fastest mixing Markov chain problem and present an efficient approximate solution based on using a smaller subset of the space of large doubly stochastic transition matrices. This involves a computationally expensive pre-processing step, but one that needs to be executed only once for a given problem. In the next section we describe the notation and the underlying results used to reduce the dimensionality of the problem.

2 Problem Formulation

2.1 Fastest Mixing Markov Chain

Consider a symmetric Markov chain on a graph $G$ with transition matrix $P \in \mathbb{R}^{n \times n}$. The stationary distribution in this case is the uniform distribution $\pi = (1/n)e^T$, and the mixing rate measures how fast an initial distribution converges to it. This rate of convergence is governed by the second largest eigenvalue modulus (SLEM) of $P$,

\[
\mu(P) = \max(\lambda_2(P),\, -\lambda_n(P)),
\]

where $\lambda_2(P)$ and $\lambda_n(P)$ are the second largest and the smallest eigenvalues of $P$; the smaller the SLEM, the faster the Markov chain converges. The problem of finding the fastest mixing Markov chain was described in [1]. It can be written as

\[
\begin{array}{ll}
\min & \mu(P) \\
\text{s.t.} & Pe = e, \quad P^T = P, \quad P \geq 0, \\
 & P_{ij} = 0 \ \text{if}\ \{i,j\} \notin E.
\end{array} \tag{1}
\]

The problem can be expressed as an SDP and solved using standard techniques; for very large graphs, with 100,000 or more edges, a subgradient method is presented in [1].
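To make the SDP route concrete: for symmetric stochastic $P$, the SLEM equals the spectral norm of $P - (1/n)ee^T$, which is convex in $P$ (a standard reformulation from [1]). Below is a minimal CVXPY sketch of problem (1) under that reformulation; the function name, the mask construction, and the choice of the SCS solver are our own illustrative assumptions, not from the paper.

```python
import numpy as np
import cvxpy as cp

def fastest_mixing_sdp(adj):
    """Sketch of problem (1): minimize the SLEM over symmetric stochastic P
    supported on the edges of G (self-loops always allowed on the diagonal).

    adj: symmetric 0/1 adjacency matrix of the graph G. Illustrative only.
    """
    n = adj.shape[0]
    P = cp.Variable((n, n), symmetric=True)
    # For P e = e, P = P^T, P >= 0: mu(P) = ||P - (1/n) e e^T||_2 (spectral norm).
    objective = cp.Minimize(cp.sigma_max(P - np.ones((n, n)) / n))
    # Forbidden entries: off-diagonal positions with no corresponding edge.
    non_edges = ((adj == 0) & ~np.eye(n, dtype=bool)).astype(float)
    constraints = [
        P @ np.ones(n) == np.ones(n),      # row sums (columns follow by symmetry)
        P >= 0,
        cp.multiply(non_edges, P) == 0,    # zero out non-edge transitions
    ]
    prob = cp.Problem(objective, constraints)
    prob.solve(solver=cp.SCS)              # solver choice is our assumption
    return P.value, prob.value
```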

2.2 BN Decomposition

Let $G = (V, E)$ be an undirected graph with $n$ vertices and $m$ edges. There is a Markov chain associated with the graph such that the weight of each edge is the transition probability $p_{ij}$ from the $i$th to the $j$th node. These transition probabilities are described by an $n \times n$ matrix $P$ such that

\[
P \geq 0, \quad Pe = e, \quad P = P^T, \quad P_{ij} = 0 \ \text{if}\ \{i,j\} \notin E, \tag{2}
\]

where $e$ is the vector of all 1's and the inequality indicates elementwise nonnegativity of the $P_{ij}$. The entries are zero only if there is no corresponding edge in the given graph. We can write the symmetric stochastic matrix $P$ as a convex combination of permutation matrices using the following result due to Birkhoff [6].

Theorem. An $n \times n$ matrix $P$ is doubly stochastic if and only if there are $M$ $n \times n$ permutation matrices $P_1, \ldots, P_M$ and positive scalars $\theta_1, \ldots, \theta_M$ such that

\[
P = \sum_{i=1}^{M} \theta_i P_i \quad \text{and} \quad \sum_{i=1}^{M} \theta_i = 1. \tag{3}
\]

Some bounds on $M$ exist in the literature [6], but as we shall see they become irrelevant for our purposes. Given a matrix $P$, an algorithm (Table 1) based on a proof given by Dulmage and Halperin is described in [7]. It involves bipartite graph matching, and to each such matching corresponds a permutation matrix. These permutation matrices define a basis for a certain subset of the space of $n \times n$ doubly stochastic matrices.

Table 1: BN decomposition algorithm for a doubly stochastic matrix.

Input: an $n \times n$ doubly stochastic matrix $A$.
1. for $l = 1 : n^2 + n - 2$
2.   Using bipartite matching, find a permutation $\pi_l$ of the vertices $1, \ldots, n$ such that each $A_{i,\pi_l(i)}$ is positive.
3.   $\theta_l = \min_i A_{i,\pi_l(i)}$
4.   $A = A - \theta_l P_{\pi_l}$, where $P_{\pi_l}$ is the permutation matrix corresponding to $\pi_l$.
5.   Exit if all entries of $A$ are zero.
Output: $(\theta_i, P_{\pi_i})$ such that $A = \sum_{i=1}^{M} \theta_i P_{\pi_i}$.

2.2.1 Identifying Basis Subset

If we have a reasonable choice of a Markov chain that mixes fast, e.g. the Metropolis-Hastings chain, we can use its BN decomposition to select a permutation basis. Having identified such a permutation basis, we can then hope to solve for the fastest mixing chain by optimizing over the $\theta$'s, keeping the basis matrices fixed.
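As an illustration of the loop in Table 1, here is a minimal Python sketch. Each matching is found with SciPy's linear_sum_assignment by penalizing (near-)zero entries of $A$, so a zero-cost assignment is exactly a permutation supported on the positive entries, which Birkhoff's theorem guarantees to exist. The function name birkhoff_decompose and the tolerance tol are our own, not from [7].

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def birkhoff_decompose(A, tol=1e-9):
    """BN decomposition sketch: write doubly stochastic A as sum_i theta_i P_i.

    tol is an illustrative numerical tolerance, not a value from the paper.
    """
    A = A.copy()
    thetas, perms = [], []
    while A.max() > tol:
        # Cost 1 on (near-)zero entries, 0 on positive ones: a zero-cost
        # assignment is a permutation supported on the positive entries of A.
        cost = (A <= tol).astype(float)
        rows, cols = linear_sum_assignment(cost)
        assert cost[rows, cols].sum() == 0, "A is not (numerically) doubly stochastic"
        theta = A[rows, cols].min()          # step 3 of Table 1
        P = np.zeros_like(A)
        P[rows, cols] = 1.0
        A -= theta * P                       # step 4 of Table 1
        thetas.append(theta)
        perms.append(P)
    return np.array(thetas), perms
```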

However, the number of basis matrices, or equivalently the problem size, could be very large. Can we then identify a smaller subset of basis matrices and search over those?

Consider the initial transition matrix $P^m$, the input to the decomposition algorithm used to obtain a permutation basis for the space of DS matrices to search over. The BN procedure returns a vector $\theta = (\theta_1, \ldots, \theta_M)$ and $M$ permutation matrices $(P_1, \ldots, P_M)$. Intuitively, since higher values of $\theta_i$ contribute more probability weight to a particular edge, it makes sense to ignore very small $\theta_i$'s altogether, and hence the corresponding $P_i$'s. This is reasonable if we assume that, since our initial choice $P^m$ is a good one, any improvement we hope to find over it is structurally similar to $P^m$. We then use the following procedure (Table 2) to filter out insignificant $P_i$'s.

Table 2: Select Basis
1. Choose $k \ll M$.
2. Compute $r_i = \|P^m - \theta_i P_i\|_F$ for all $i$.
3. Return the $P_i$'s corresponding to the $k$ smallest $r_i$'s.

In the experiments section we show results for different values of $k$ with $M$ fixed. Once we have the $P_i$'s, we have in essence fixed a subset of the Birkhoff polytope for our optimization procedure to search over. The parameter space for the search is then $\theta = (\theta_1, \ldots, \theta_k) \in \mathbb{R}^k$ such that $\theta$ lies in the probability simplex and $P(\theta) = \sum_{i=1}^{k} \theta_i P_i$. The SLEM is then a nonlinear function of $\theta$ and will be written $\mu(\theta)$. Note that since $P(\theta)$ must be symmetric, we take $P(\theta) = (P(\theta) + P(\theta)^T)/2 = \sum_{i=1}^{k} \theta_i (P_i + P_i^T)/2$; hence our basis matrices are $(P_i + P_i^T)/2$ for each $i$, which maintains the symmetry constraint. In the ensuing sections we present experimental evidence that the space can be truncated considerably while still obtaining reasonable results.

2.3 Basis Subset Optimization for the Fastest Mixing Chain

The most commonly used heuristic for fast mixing is the Metropolis-Hastings random walk. To obtain a Markov chain with the uniform stationary distribution, the following transition matrix is constructed [1]:

\[
P^m_{ij} =
\begin{cases}
\min(1/d_i,\, 1/d_j) & \text{if } i \neq j \text{ and } \{i,j\} \in E, \\
\sum_{k : \{i,k\} \in E} \max(0,\ 1/d_i - 1/d_k) & \text{if } i = j, \\
0 & \text{otherwise.}
\end{cases} \tag{4}
\]

Using the basis subset identification procedure of the previous section, we BN decompose the matrix $P^m$ to obtain the subset space to search over. We then use a subgradient method to solve the fastest mixing problem on this smaller subset of DS matrices, constrained by the fixed chosen permutation basis.
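The filter in Table 2 above reduces to a few lines of Python given the output of the decomposition. A sketch under our own naming (select_basis, thetas, perms), with the symmetrization of Section 2.2.1 applied to the kept matrices:

```python
import numpy as np

def select_basis(Pm, thetas, perms, k):
    """Table 2 sketch: keep the k permutation matrices whose weighted term
    theta_i * P_i is closest to the seed matrix Pm in Frobenius norm.

    Names and signature are illustrative, not from the paper.
    """
    r = [np.linalg.norm(Pm - th * P, "fro") for th, P in zip(thetas, perms)]
    keep = np.argsort(r)[:k]                       # k smallest residuals
    # Symmetrize to maintain P(theta) = P(theta)^T, as in Section 2.2.1.
    return [(perms[i] + perms[i].T) / 2 for i in keep]
```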
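Equation (4) likewise translates directly to code. A minimal sketch (function name ours) for a 0/1 adjacency matrix without self-loops; note that each row sums to one, since min(a, b) + max(0, a - b) = a:

```python
import numpy as np

def metropolis_hastings_matrix(adj):
    """Build the Metropolis-Hastings transition matrix P^m of equation (4).

    adj: symmetric 0/1 adjacency matrix with zero diagonal. Name is ours.
    """
    n = adj.shape[0]
    d = adj.sum(axis=1)                            # vertex degrees
    P = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and adj[i, j]:
                P[i, j] = min(1.0 / d[i], 1.0 / d[j])
        # Self-loop probability collects the leftover mass over neighbors of i.
        P[i, i] = sum(max(0.0, 1.0 / d[i] - 1.0 / d[k])
                      for k in range(n) if adj[i, k])
    return P
```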

[Figure 1 about here: plot of SLEM (vertical axis) versus basis dimension (horizontal axis) for a graph with 60 nodes and 1692 edges; curves/levels shown: Metropolis-Hastings, Maximum Degree, Polytope Subset, Optimal.]

Figure 1: SLEM for a fixed graph with varying basis dimension for our method. The horizontal axis is the number of basis elements used, i.e. the number of variables being optimized. The vertical line marks the number of edges, i.e. the number of variables being optimized in a direct optimization approach. The horizontal lines show the corresponding SLEM values for the Metropolis-Hastings chain and the optimal chain. The SLEM comes close to the optimal as we increase the basis dimension.

2.3.1 Subgradient Method

In this parameter space the optimization problem becomes

\[
\begin{array}{ll}
\min & \mu(\theta) \\
\text{s.t.} & \sum_{i=1}^{k} \theta_i = 1, \quad \theta_i \geq 0.
\end{array} \tag{5}
\]

The SLEM is in general a non-differentiable function of the entries of the matrix, and therefore also of the parameter $\theta$, so we use subgradients to solve the optimization problem. It can be shown that the subgradient $g$ in the inequality below depends on the eigenvector corresponding to the second largest eigenvalue in magnitude,

\[
\mu(\tilde{\theta}) \geq \mu(\theta) + v(\theta)^T \big(P(\tilde{\theta}) - P(\theta)\big) v(\theta) = \mu(\theta) + \delta\theta^T g, \tag{6}
\]

and is given by $g = \big(v(\theta)^T P_1 v(\theta), \ldots, v(\theta)^T P_k v(\theta)\big)$, where $(P_1, \ldots, P_k)$ is the permutation basis, $v(\theta)$ is the eigenvector corresponding to the second largest eigenvalue in magnitude, and $\delta\theta = \tilde{\theta} - \theta$. Thus an eigenvector computation is required at each iteration. The projected subgradient method then proceeds as usual on this considerably smaller $k$-dimensional space; it involves a projection step onto the probability simplex, for which we use the efficient algorithm described in [8]. The overall algorithm can be given as:

1. Compute the Metropolis-Hastings matrix $P^m$ for the given graph $G$.
2. Compute the BN decomposition of $P^m$ to obtain a permutation basis such that $P^m = \sum_{i=1}^{M} \theta^m_i P_i$, and obtain the truncated basis by discarding the permutation matrices corresponding to the $(M - k)$ $\theta^m_i$'s filtered out by the truncation procedure.
3. Solve the optimization problem (5) using the subgradient method above for the parameters $\theta_1, \ldots, \theta_k$. (A sketch of this loop is given below.)
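Putting steps 1-3 together, the following is a compact sketch of the projected subgradient loop for problem (5). The simplex projection follows the sort-based algorithm of [8]; the step-size schedule, iteration count, and the sign flip on the subgradient when the SLEM comes from the smallest eigenvalue (a detail the text above leaves implicit) are our own assumptions.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (see [8])."""
    u = np.sort(v)[::-1]                        # sort descending
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, len(v) + 1) > 0)[0][-1]
    tau = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + tau, 0.0)

def slem_and_vector(P):
    """SLEM of a symmetric P, the matching eigenvector, and a subgradient sign."""
    w, V = np.linalg.eigh(P)                    # eigenvalues in ascending order
    if -w[0] > w[-2]:                           # SLEM is -lambda_n
        return -w[0], V[:, 0], -1.0             # sign flip: assumed, see lead-in
    return w[-2], V[:, -2], 1.0                 # SLEM is lambda_2

def subgradient_solve(basis, theta0, iters=500):
    """Projected subgradient method for problem (5) over a fixed basis.

    basis: symmetrized basis matrices (P_i + P_i^T)/2; theta0: feasible start.
    Step size 0.1/sqrt(t) and iters=500 are illustrative choices.
    """
    theta = theta0.copy()
    best_theta, best_mu = theta, np.inf
    for t in range(1, iters + 1):
        P = sum(th * B for th, B in zip(theta, basis))
        mu, v, sign = slem_and_vector(P)
        if mu < best_mu:                        # subgradient steps need not descend
            best_theta, best_mu = theta, mu
        g = sign * np.array([v @ B @ v for B in basis])   # equation (6)
        theta = project_simplex(theta - (0.1 / np.sqrt(t)) * g)
    return best_theta, best_mu
```

A typical call would chain the pieces: decompose the Metropolis-Hastings matrix, select a basis of size k, and run the loop from the truncated, renormalized theta vector.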

[Figure 2 about here: two plots. (a) SLEM (vertical axis) versus number of edges (horizontal axis) for graphs with n = 60 nodes; curves: Metropolis-Hastings, Optimal, Polytope Subset (10%), Polytope Subset (20%), Polytope Subset (40%). (b) Basis dimension (vertical axis) versus edges (horizontal axis) at the 10%, 20%, and 40% levels, together with the #vars = edges line.]

Figure 2: a) SLEM for graphs with a varying number of edges. The percentage indicates the proportion of the total basis matrices used. Even with 10% of the total variables, performance is much better than that of the Metropolis-Hastings chain. b) The basis dimension (problem size, vertical axis) for each of the graphs (edges, horizontal axis) in the previous plot, at each percentage level. The number of variables is considerably smaller than the number of edges in the graph at the 10% level. The top line shows the number of variables equal to the number of edges, as in the direct optimization approach.

2.4 Simulation

We first present results on a small graph with 60 nodes, generated uniformly as described in the fastest mixing paper [1]. Figure 1 shows a graph with n = 60 nodes and the corresponding SLEMs as we increase the basis dimension, or equivalently the number of permutation basis matrices used. The middle line indicates the number of edges in the graph. It can be seen that even with a limited basis dimension we do considerably better than the Metropolis-Hastings chain and stay relatively close to the optimal. Thus the basis dimension can also be used as a knob to control the accuracy versus efficiency tradeoff.

Figure 2(a) shows the SLEM values on graphs with n = 60 nodes and a varying number of edges. We can see that even with 10% of the permutation basis we stay relatively close to the optimal value and do significantly better than the Metropolis-Hastings chain. In Figure 2(b), the top line shows the gain we obtain if we were to solve the optimization problem with the number of variables equal to the number of edges, as opposed to a fraction of the total basis dimension.

3 Conclusion

In this paper we presented an efficient subgradient algorithm based on the Birkhoff-von Neumann decomposition to approximate the fastest mixing Markov chain on a graph. For future work, we hope to identify other problems involving doubly stochastic matrices that can be solved within our framework. We also intend to present distributed versions of the algorithm, as the subgradient method is very amenable to parallel computation. On the theoretical front, the optimality loss incurred in restricting the search space to a fixed permutation basis, or equivalently the quality of our approximation, remains to be understood.

4 References

[1] Boyd, S., Diaconis, P. & Xiao, L., Fastest Mixing Markov Chain on a Graph. SIAM Review, Vol. 46, No. 4, pp. 667-689, 2004.
[2] Ghosh, A., Boyd, S. & Saberi, A., Minimizing Effective Resistance of a Graph. SIAM Review, problems and techniques section, 50(1):37-66, February 2008.
[3] Ghosh, A. & Boyd, S., Upper Bounds on Algebraic Connectivity Using Convex Optimization. Linear Algebra and its Applications, 418:693-707, October 2006.
[4] Xiao, L. & Boyd, S., Fast Linear Iterations for Distributed Averaging. Systems and Control Letters, 53:65-78, 2004.
[5] Zass, R. & Shashua, A., Doubly Stochastic Normalization for Spectral Clustering. Proceedings of NIPS, 2006.
[6] Minc, H., Nonnegative Matrices. Wiley-Interscience Series in Discrete Mathematics and Optimization, 1988.
[7] Marshall, A. W. & Olkin, I., Inequalities: Theory of Majorization and Its Applications. Academic Press, 1979.
[8] Duchi, J., Shalev-Shwartz, S., Singer, Y. & Chandra, T., Efficient Projections onto the l1-Ball for Learning in High Dimensions. Proceedings of ICML, 2008.
[9] Horn, R. A. & Johnson, C. R., Matrix Analysis. Cambridge University Press, 1985.
[10] Boyd, S., Ghosh, A., Prabhakar, B. & Shah, D., Randomized Gossip Algorithms. IEEE Transactions on Information Theory, Vol. 52, No. 6, June 2006.