Nimble Algorithms for Cloud Computing
Ravi Kannan, Santosh Vempala and David Woodruff




Cloud computing
- Data is distributed arbitrarily across many servers.
- Parallel algorithms: time. Streaming algorithms: sublinear space.
- Cloud complexity: time, space and communication [Cormode-Muthukrishnan-Yi 2008].
- Nimble algorithm: polynomial time/space (as usual) and sublinear (ideally polylogarithmic) communication between servers.

Cloud vs. streaming
- Streaming algorithms make small sketches of the data.
- Nimble algorithms must communicate small sketches.
- Are they equivalent? Simple observation: communication in the cloud = O(memory in streaming) [Daume-Phillips-Saha-Venkatasubramanian12].
- Is cloud computing more powerful?

Basic problems on large data sets
- Frequency moments
- Counting copies of subgraphs (homomorphisms)
- Low-rank approximation
- Clustering
- Matchings
- Flows
- Linear programs

Streaming Lower Bounds
Frequency moments: Given a vector of frequencies f = (f_1, f_2, …, f_n) presented as a sequence of increments, estimate F_k = Σ_i f_i^k to relative error ε (illustrated in code below). [Alon-Matias-Szegedy99, Indyk-Woodruff05]: Θ(n^{1−2/k}) space (for k = 1, 2, polylogarithmic space suffices via random projection).
Counting homomorphisms: Estimate #triangles, #C_4, #K_{r,r} in a large graph G. There are Ω(n^2) space lower bounds in streaming.
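As a concrete reading of the definition of F_k, here is a minimal Python illustration. It is deliberately not a streaming algorithm: it materializes f in full, which is exactly what the lower bounds forbid in sublinear space. The increment list is made-up example data.

```python
from collections import defaultdict

def frequency_moment(increments, k):
    """Materialize f from (index, delta) increments, then return F_k = sum_i f_i^k.
    This is the exact linear-space definition, not a sublinear-space sketch."""
    f = defaultdict(int)
    for i, delta in increments:
        f[i] += delta
    return sum(v ** k for v in f.values())

# f = (3, 1, 2) built from a stream of increments.
increments = [(0, 2), (1, 1), (0, 1), (2, 2)]
print(frequency_moment(increments, 2))  # 3^2 + 1^2 + 2^2 = 14
print(frequency_moment(increments, 3))  # 27 + 1 + 8 = 36
```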

Streaming Lower Bounds
Low-rank approximation: Given an n × d matrix A, find Â of rank k s.t. ‖A − Â‖_F ≤ (1+ε)‖A − A_k‖_F.
[Clarkson-Woodruff09]: Any streaming algorithm needs Ω((n+d)k log(nd)) space.

Frequency moments in the cloud
- Lower bound via multi-player set disjointness.
- t players have sets S_1, S_2, …, S_t, subsets of [n].
- Problem: determine whether the sets are pairwise disjoint or share one common element.
- Thm: Communication needed = Ω(n/(t log t)) bits.

Frequency moments in the cloud
Thm. The communication needed to determine disjointness of t sets is Ω(n/(t log t)) bits.
Consider s sets that are either (i) completely disjoint or (ii) share exactly one common element (each set resides on one server). The k-th frequency moment is then either n or n − 1 + s^k. Setting s^k = n + 1 makes the two values n and 2n, so any factor-2 approximation of the k-th moment distinguishes the two cases. Therefore the communication needed is Ω(s^{k−1}) (up to the log factor).

Frequency moments in the cloud
Thm. [Kannan-V.-Woodruff13] Estimating the k-th frequency moment on s servers takes O(s^k/ε^2) words of communication, with O(b + log n) bits per word.
- Lower bound is s^{k−1}.
- Previous bound: s^{k−1} (log n / ε)^{O(k)} [Woodruff-Zhang12].
- Streaming space complexity is n^{1−2/k}.
Main idea of the algorithm: sample elements within a server according to higher moments.

Warm-up: 2 servers, third moment
Goal: estimate Σ_i (u_i + v_i)^3. The pure terms Σ_i u_i^3 and Σ_i v_i^3 are local; the cross terms like Σ_i u_i^2 v_i are the nontrivial part.
1. Estimate Σ_i u_i^3 (locally on server 1).
2. Sample j with probability p_j = u_j^3 / Σ_i u_i^3 and announce j.
3. The second server computes X = u_j^2 v_j / p_j.
4. Average over many samples. E(X) = Σ_i u_i^2 v_i.
Var(X) ≤ Σ_{i: v_i > 0} (u_i^2 v_i)^2 / p_i = (Σ_i u_i^3)(Σ_i u_i v_i^2) ≤ (Σ_i u_i^3 + v_i^3)^2.
So O(1/ε^2) samples suffice for relative error ε (simulated below).
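A minimal Python (numpy) simulation of this warm-up, with made-up nonnegative vectors u and v standing in for the two servers' data. In the real protocol only the sampled index (and its value) crosses the network; here everything is in one process for checking.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy nonnegative frequency vectors held by server 1 (u) and server 2 (v).
u = rng.integers(0, 10, size=1000).astype(float)
v = rng.integers(0, 10, size=1000).astype(float)

def estimate_cross_term(u, v, num_samples, rng):
    """Estimate sum_i u_i^2 v_i. Server 1 samples j with p_j = u_j^3 / sum_i u_i^3
    and announces (j, u_j); server 2 returns X = u_j^2 v_j / p_j.
    E(X) is the cross term, so the average concentrates around it."""
    p = u ** 3 / np.sum(u ** 3)
    js = rng.choice(len(u), size=num_samples, p=p)
    X = u[js] ** 2 * v[js] / p[js]
    return X.mean()

exact = np.sum(u ** 2 * v)
approx = estimate_cross_term(u, v, num_samples=10_000, rng=rng)
print(exact, approx)  # approx lands within a few percent of exact
```

The other cross term Σ_i u_i v_i^2 is estimated symmetrically from server 2's side.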

Many servers, k-th moment
Expand F_k = Σ_i (Σ_j f_{ij})^k into monomials Π_j f_{ij}^{r_j} with Σ_j r_j = k; each monomial is estimated as in the warm-up. Each server j:
- Samples i with probability p_i = f_{ij}^k / Σ_t f_{tj}^k, i.e., according to its local k-th moment.
- Every server j' sends f_{ij'} if (j' < j and f_{ij'} < f_{ij}) or (j' > j and f_{ij'} ≤ f_{ij}).
- Server j computes X_i = (Π_{j'=1}^s f_{ij'}^{r_{j'}}) / p_i.
Lemma. E(X) = Σ_i Π_j f_{ij}^{r_j} and Var(X) ≤ (Σ_i Σ_j f_{ij}^k)^2.
The theorem follows since there are fewer than s^k monomials in total (verified numerically below).
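To see where the "fewer than s^k terms" count comes from, here is a small Python check on toy numbers: expanding (Σ_j f_j)^k over the monomials r with Σ_j r_j = k recovers the moment exactly, and the number of monomials is C(k+s−1, k) ≤ s^k.

```python
from math import comb, factorial, prod

def multinomial(k, r):
    """Multinomial coefficient k! / (r_1! ... r_s!)."""
    out = factorial(k)
    for ri in r:
        out //= factorial(ri)
    return out

def compositions(k, s):
    """All tuples r = (r_1, ..., r_s) of nonnegative integers summing to k."""
    if s == 1:
        yield (k,)
        return
    for first in range(k + 1):
        for rest in compositions(k - first, s - 1):
            yield (first,) + rest

# Per-server frequencies of one item i on s = 3 servers (toy values), k = 4.
f, k = [2.0, 3.0, 5.0], 4
terms = list(compositions(k, len(f)))
expanded = sum(multinomial(k, r) * prod(fi ** ri for fi, ri in zip(f, r))
               for r in terms)
print(expanded, sum(f) ** k)                             # 10000.0 10000.0
print(len(terms), comb(k + len(f) - 1, k), len(f) ** k)  # 15 15 81
```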

Counting homomorphisms
- How many copies of a graph H are in a large graph G?
- E.g., H = triangle, 4-cycle, complete bipartite graph, etc.
- Linear lower bounds for counting 4-cycles and triangles.
- We assume an (arbitrary) partition of the vertices among the servers.

Counting homomorphisms
- To count paths of length 2 in a graph with degrees d_1, d_2, …, d_n, we need t(K_{1,2}, G) = Σ_{i=1}^n (d_i choose 2) = (F_2 − F_1)/2, where F_k is the k-th moment of the degree sequence. This is a polynomial in frequency moments! (Checked in the sketch below.)
- #stars: t(K_{1,r}, G) = Σ_{i=1}^n (d_i choose r).
- #C_4's: let d_ij be the number of common neighbors of i and j. Then t(C_4, G) = Σ_{i<j} (d_ij choose 2).
- #K_{a,b}: let d_S be the number of common neighbors of a set of vertices S. Then t(K_{a,b}, G) = Σ_{S ⊆ V, |S| = a} (d_S choose b).
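A quick pure-Python check of the first identity on a made-up graph, computing the path count both from degree moments and by brute force:

```python
from itertools import combinations

# Toy graph as adjacency sets.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (2, 3), (3, 4)]
n = 5
adj = {i: set() for i in range(n)}
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)
deg = [len(adj[i]) for i in range(n)]

# Paths of length 2: sum_i (d_i choose 2) = (F_2 - F_1) / 2, where F_k is the
# k-th frequency moment of the degree sequence.
F1, F2 = sum(deg), sum(d * d for d in deg)
via_moments = (F2 - F1) // 2

# Brute force: unordered pairs of neighbors around each center vertex.
brute = sum(1 for c in range(n) for _ in combinations(adj[c], 2))
print(via_moments, brute)  # 10 10
```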

Low-rank approximation
Given an n × d matrix A partitioned arbitrarily as A = A_1 + A_2 + … + A_s among s servers, find Â of rank k s.t. ‖A − Â‖_F ≤ (1+ε)·OPT.
To avoid linear communication, we leave a matrix Â_t on each server t, s.t. Â = Â_1 + Â_2 + … + Â_s has rank k. How do we compute these matrices?

Low-rank approximation in the cloud
Thm. [KVW13] Low-rank approximation of an n × d matrix A partitioned arbitrarily among s servers takes O(skd) communication.

Warm-up: row partition
- The full matrix A is n × d with n >> d.
- Each server j has a subset of rows A_j.
- Each server computes A_j^T A_j and sends it to server 1.
- Server 1 computes B = Σ_{j=1}^s A_j^T A_j and announces V, the top k eigenvectors of B.
- Now each server j can compute A_j V V^T.
- Total communication = O(s d^2). (Simulated below.)
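A numpy simulation of this protocol on made-up data; only the d × d Gram matrices and V would cross the network. Since B = A^T A, its top-k eigenvectors are the top-k right singular vectors of A, so the result matches the optimal rank-k error.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k, s = 1200, 30, 5, 4

# Toy data: rank-k signal plus noise, rows split among s servers.
A = rng.standard_normal((n, k)) @ rng.standard_normal((k, d)) \
    + 0.01 * rng.standard_normal((n, d))
row_blocks = np.array_split(A, s)

# Each server sends its d x d Gram matrix (O(d^2) words per server).
B = sum(Aj.T @ Aj for Aj in row_blocks)

# Server 1 announces V: the top-k eigenvectors of B.
_, eigvecs = np.linalg.eigh(B)          # eigenvalues in ascending order
V = eigvecs[:, -k:]

# Each server projects its own rows; stacked, they form A V V^T.
A_hat = np.vstack([Aj @ V @ V.T for Aj in row_blocks])

U, S, Vt = np.linalg.svd(A, full_matrices=False)
opt = np.linalg.norm(A - (U[:, :k] * S[:k]) @ Vt[:k])
print(np.linalg.norm(A - A_hat), opt)   # equal up to numerical error
```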

Low-rank approximation: arbitrary partition
- To extend this to arbitrary partitions, we use a limited-independence random projection.
- Subspace embedding: a matrix P of size O(d/ε^2) × n s.t. for any x ∈ R^d, ‖PAx‖ = (1±ε)‖Ax‖.
- Agree on the projection P via a shared random seed.
- Each server computes P A_t and sends it to server 1.
- Server 1 computes PA = Σ_t P A_t and its top k right singular vectors V.
- Project the rows of A onto V.
- Total communication = O(s d^2 / ε^2). (Sketched below.)
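A numpy sketch of the same idea, with a dense i.i.d. Gaussian projection standing in for the limited-independence embedding of the slides (toy data; ε and the constant in the sketch size are chosen loosely, so this is an illustration rather than the stated protocol):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k, s, eps = 2000, 40, 5, 4, 0.5

# Arbitrary additive partition A = A_1 + ... + A_s (toy data).
A_parts = [rng.standard_normal((n, d)) / s for _ in range(s)]
A = sum(A_parts)

# Shared-seed projection; a Gaussian map is used here in place of a
# limited-independence subspace embedding.
m = int(d / eps ** 2)                    # O(d / eps^2) rows
P = rng.standard_normal((m, n)) / np.sqrt(m)

# Each server sends its m x d sketch; server 1 sums them: PA = sum_t P A_t.
PA = sum(P @ At for At in A_parts)

# Top-k right singular vectors of the small sketch.
_, _, Vt = np.linalg.svd(PA, full_matrices=False)
V = Vt[:k].T

# Projecting onto span(V) is again distributable: A V V^T = sum_t A_t V V^T.
A_hat = A @ V @ V.T

U, S, Vt_full = np.linalg.svd(A, full_matrices=False)
opt = np.linalg.norm(A - (U[:, :k] * S[:k]) @ Vt_full[:k])
print(np.linalg.norm(A - A_hat), opt)    # within roughly (1 + eps) of opt
```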

Low-rank approximation: arbitrary partition
Thm. ‖A − AVV^T‖_F ≤ (1+O(ε))·OPT.
Pf. Extend V to an orthonormal basis v_1, v_2, …, v_d of R^d. Then
‖A − AVV^T‖_F^2 = Σ_{i=k+1}^d ‖A v_i‖^2 ≤ (1+ε)^2 Σ_{i=k+1}^d ‖PA v_i‖^2.
And, with u_1, u_2, …, u_d the right singular vectors of A (since V captures the largest share of ‖PA‖_F^2 among all k-dimensional subspaces),
Σ_{i=k+1}^d ‖PA v_i‖^2 ≤ Σ_{i=k+1}^d ‖PA u_i‖^2 ≤ (1+ε)^2 Σ_{i=k+1}^d ‖A u_i‖^2 = (1+O(ε))·OPT^2.

Low-rank approximation in the cloud
To improve to O(skd), we use a subspace embedding up front and observe that O(k)-wise independence suffices for the random projection matrix.
- Agree on an O(k/ε) × n matrix S and an O(k/ε^2) × n matrix P.
- Each server computes S A_t and sends it to server 1.
- Server 1 computes SA = Σ_t S A_t and an orthonormal basis U^T for its row space.
- Apply the previous algorithm to AU.

K-means clustering
- Find a set of k centers c_1, c_2, …, c_k that minimizes Σ_{i∈S} min_{j=1..k} ‖A_i − c_j‖^2.
- A near-optimal (i.e., (1+ε)-approximate) solution could induce a very different clustering!
- So we cannot simply project up front to reduce dimension and approximately preserve distances.

K-means clustering
- Kannan-Kumar condition: every pair of cluster centers is f(k) standard deviations apart.
- "Variance" here is the maximum, over 1-d projections, of the average squared distance of a point to its center (e.g., for Gaussian mixtures, the maximum directional variance).
- Thm. [Kannan-Kumar10]. Under this condition, projection to the top k principal components, followed by the k-means iteration started at an approximately optimal set of centers, finds a nearly correct clustering.
- It finds centers close to the optimal ones, so the induced clustering agrees with the optimal one on most points.

K-means clustering in the cloud
- Points (rows) are partitioned among the servers.
- Use low-rank approximation to project to the SVD space.
- How do we find a good starting set of centers? We need a constant-factor approximation.
- Thm [Chen]. There exists a small subset (a "core") s.t. the weighted k-means value of this set is within a constant factor of the k-means value of the full set of points, for any set of centers (illustrated below).
- Chen's algorithm can be made nimble.
Thm. K-means clustering in the cloud achieves the Kannan-Kumar guarantee with O(d^2 + k^4) communication on s = O(1) servers.
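Chen's construction itself is more involved than a slide transcript can carry; the Python sketch below only illustrates the statement being used, with a uniform sample weighted by n/m as a stand-in for the core (this is not Chen's coreset and carries no worst-case guarantee):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, k = 20_000, 10, 4

# Toy mixture: k well-separated Gaussian clusters.
centers_true = 10 * rng.standard_normal((k, d))
points = centers_true[rng.integers(k, size=n)] + rng.standard_normal((n, d))

def kmeans_cost(X, centers, weights=None):
    """(Weighted) k-means cost: sum_i w_i * min_j ||X_i - c_j||^2."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).min(axis=1)
    return d2 @ (np.ones(len(X)) if weights is None else weights)

# Stand-in "core": uniform sample of m points, each weighted n/m.
m = 500
core = points[rng.choice(n, size=m, replace=False)]
w = np.full(m, n / m)

for trial in range(3):  # arbitrary candidate centers
    C = 10 * rng.standard_normal((k, d))
    print(kmeans_cost(points, C), kmeans_cost(core, C, w))  # close ratios
```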

Cloud computing: What problems have nimble algorithms?
- Approximate flows/matchings?
- Linear programs?
- Which graph properties/parameters can be checked/estimated in the cloud (e.g., planarity, expansion, small diameter)?
- Other optimization/clustering/learning problems [Balcan-Blum-Fine-Mansour12, Daume-Phillips-Saha-Venkatasubramanian12]?
- Random partition of the data?
- Connection to property testing?