CIS 700: algorithms for Big Data



Similar documents
Analyzing Graph Structure via Linear Measurements

Nimble Algorithms for Cloud Computing. Ravi Kannan, Santosh Vempala and David Woodruff

arxiv: v1 [cs.dc] 7 Dec 2014

Sketch As a Tool for Numerical Linear Algebra

B669 Sublinear Algorithms for Big Data

Social Media Mining. Graph Essentials

CSE 326, Data Structures. Sample Final Exam. Problem Max Points Score 1 14 (2x7) 2 18 (3x6) Total 92.

5.1 Bipartite Matching

! Solve problem to optimality. ! Solve problem in poly-time. ! Solve arbitrary instances of the problem. #-approximation algorithm.

Part 2: Community Detection

IE 680 Special Topics in Production Systems: Networks, Routing and Logistics*

DATA ANALYSIS II. Matrix Algorithms

Distributed Computing over Communication Networks: Maximal Independent Set

Solutions to Homework 6

Every tree contains a large induced subgraph with all degrees odd

Applied Algorithm Design Lecture 5

ONLINE DEGREE-BOUNDED STEINER NETWORK DESIGN. Sina Dehghani Saeed Seddighin Ali Shafahi Fall 2015

8.1 Min Degree Spanning Tree

The Goldberg Rao Algorithm for the Maximum Flow Problem

CSC2420 Fall 2012: Algorithm Design, Analysis and Theory

Graph Theory and Complex Networks: An Introduction. Chapter 06: Network analysis. Contents. Introduction. Maarten van Steen. Version: April 28, 2014

MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research

Algorithm Design and Analysis

Simplified External memory Algorithms for Planar DAGs. July 2004

Approximation Algorithms

Graph Theory and Complex Networks: An Introduction. Chapter 06: Network analysis

Graph theoretic techniques in the analysis of uniquely localizable sensor networks

Section Continued

Distance Degree Sequences for Network Analysis

Distributed Computing over Communication Networks: Topology. (with an excursion to P2P)

Exponential time algorithms for graph coloring

On the k-path cover problem for cacti

Linear Programming I

Scheduling Shop Scheduling. Tim Nieberg

GRAPH THEORY LECTURE 4: TREES

Lecture 15 An Arithmetic Circuit Lowerbound and Flows in Graphs

6.852: Distributed Algorithms Fall, Class 2

Transportation Polytopes: a Twenty year Update

1 Definitions. Supplementary Material for: Digraphs. Concept graphs

SCAN: A Structural Clustering Algorithm for Networks

A Fast Algorithm For Finding Hamilton Cycles

Sublinear Algorithms for Big Data. Part 4: Random Topics

Network Algorithms for Homeland Security

Connectivity and cuts

OPTIMAL DESIGN OF DISTRIBUTED SENSOR NETWORKS FOR FIELD RECONSTRUCTION

Asking Hard Graph Questions. Paul Burkhardt. February 3, 2014

ON INDUCED SUBGRAPHS WITH ALL DEGREES ODD. 1. Introduction

Algorithmic Techniques for Big Data Analysis. Barna Saha AT&T Lab-Research

Approximated Distributed Minimum Vertex Cover Algorithms for Bounded Degree Graphs

! Solve problem to optimality. ! Solve problem in poly-time. ! Solve arbitrary instances of the problem. !-approximation algorithm.

A Note on Maximum Independent Sets in Rectangle Intersection Graphs

V. Adamchik 1. Graph Theory. Victor Adamchik. Fall of 2005

Lecture 3. Linear Programming. 3B1B Optimization Michaelmas 2015 A. Zisserman. Extreme solutions. Simplex method. Interior point method

USING SPECTRAL RADIUS RATIO FOR NODE DEGREE TO ANALYZE THE EVOLUTION OF SCALE- FREE NETWORKS AND SMALL-WORLD NETWORKS

Analysis of Algorithms, I

Bargaining Solutions in a Social Network

Institut für Informatik Lehrstuhl Theoretische Informatik I / Komplexitätstheorie. An Iterative Compression Algorithm for Vertex Cover

Private Approximation of Clustering and Vertex Cover

Well-Separated Pair Decomposition for the Unit-disk Graph Metric and its Applications

Discuss the size of the instance for the minimum spanning tree problem.

Chapter Load Balancing. Approximation Algorithms. Load Balancing. Load Balancing on 2 Machines. Load Balancing: Greedy Scheduling

5 INTEGER LINEAR PROGRAMMING (ILP) E. Amaldi Fondamenti di R.O. Politecnico di Milano 1

Data Mining Cluster Analysis: Advanced Concepts and Algorithms. Lecture Notes for Chapter 9. Introduction to Data Mining

Data Mining. Cluster Analysis: Advanced Concepts and Algorithms

The Union-Find Problem Kruskal s algorithm for finding an MST presented us with a problem in data-structure design. As we looked at each edge,

Large induced subgraphs with all degrees odd

Small Maximal Independent Sets and Faster Exact Graph Coloring

Steiner Tree Approximation via IRR. Randomized Rounding

On Integer Additive Set-Indexers of Graphs

11. APPROXIMATION ALGORITHMS

COMBINATORIAL PROPERTIES OF THE HIGMAN-SIMS GRAPH. 1. Introduction

Total colorings of planar graphs with small maximum degree

Course on Social Network Analysis Graphs and Networks

CS711008Z Algorithm Design and Analysis

Outline. NP-completeness. When is a problem easy? When is a problem hard? Today. Euler Circuits

SHARP BOUNDS FOR THE SUM OF THE SQUARES OF THE DEGREES OF A GRAPH

Approximating the Minimum Chain Completion problem

Design of LDPC codes

Random graphs with a given degree sequence

Max Flow, Min Cut, and Matchings (Solution)

Secure Network Coding via Filtered Secret Sharing

Definition Given a graph G on n vertices, we define the following quantities:

Lecture 4 Online and streaming algorithms for clustering

136 CHAPTER 4. INDUCTION, GRAPHS AND TREES

Lecture 3: Finding integer solutions to systems of linear equations

Lecture 1: Course overview, circuits, and formulas

Minimum Spanning Trees

Minimum cost maximum flow, Minimum cost circulation, Cost/Capacity scaling

Decentralized Utility-based Sensor Network Design

Error Compensation in Leaf Power Problems

CSC2420 Spring 2015: Lecture 3

An Approximation Algorithm for Bounded Degree Deletion

USE OF EIGENVALUES AND EIGENVECTORS TO ANALYZE BIPARTIVITY OF NETWORK GRAPHS

Transcription:

CIS 700: algorithms for Big Data Lecture 6: Graph Sketching Slides at http://grigory.us/big-data-class.html Grigory Yaroslavtsev http://grigory.us

Sketching Graphs? We know how to sketch vectors: v Mv How about sketching graphs? G V, E A G (adjacency matrix): A G MA G Sketch columns of A G n = V, m = E O(poly(log n)) sketch per vertex / O(n) total Check connectivity Check bipartiteness As always, space rather than dimension. Why?

Graph Streams Semi-streaming model: [Muthukrishnan 05; Feigenbaum, Kannan, McGregor, Suri, Zhang 05] Graph defined by the stream of edges e 1,, e m Space O n, edges processed in order Connectivity is easy on O(n) space for insertion-only Dynamic graphs: Stream of insertion/deletion updates + e i1, e i2,, e it (assume sequence is correct) Resulting graph has edge e i if it wasn t deleted after the last insertion Linear sketching dynamic graphs: MA G e = MA G MA e

Distributed Computing Linear sketches for distributed processing S servers with o(m) memory: Send m/s edges (E 1,, E s ) to each server Compute sketches ME 1,, ME s locally Send sketches to a central server Compute MA G = ME i s i M has to have a small representation (same issue as in streaming)

Connectivity Thm. Connectivity is sketchable in O(n) space Framework: Take existing connectivity algorithm (Boruvka) Sketch A G MA G Run Boruvka on MA G Important that the sketch is homomorphic w.r.t the algorithm

Part 1: Parallel Connectivity (Boruvka) Repeat until no edges left: For each vertex, select any incident edge Contract selected edges Lemma: process converges in O(log n) steps

Part 2: Graph Representation For a vertex i let a i be a vector in R n 2 Non-zero entries for edges i, j a i i, j = 1 if j > i a i i, j = 1 if j < i Example: a 1 = 1, 1, 1, 1, 0,, 0 a 2 = 1, 0, 0, 0, 0, 0, 1, 0, 1,, 0 1,2, 1,3, 1,4, 1,5, 1,6, 1,7, 2,3, 2.4. 2,5, 6 7 4 1 3 5 2 Lem: For any S V supp i S a i = E(S, V S)

Part 3: L 0 -Sampling There is a distribution over M R d m with d = O(log 2 m) such w.p. 9/10 that a R m : Ca e supp(a) [Cormode, Muthukrishnan, Rozenbaum 05; Jowhari, Saglam, Tardos 11] Constant probability suffices still O log n Boruvka iterations

Final Algorithm Construct log n l 0 -samplers for each a i Run Boruvka on sketches: Use C 1 a j to get an edge incident on a node j For i = 2 to t: To get incident edge on a component S V use: C i a j = C i a j j S e supp a j j S = E(S, V S) j S

K-Connectivity Graph is k-connected is every cut has size k Thm: There is a O(nk log 3 n)-size linear sketch for k-connectivity Generalization: There is an O(n log 5 n /ε 2 )- size linear sketch which allows to approximate all cuts in a graph up to error (1 ± ε)

K-connectivity Algorithm Algorithm for k-connectivity: Let F 1 be a spanning forest of G(V, E) For i = 2,, k Let F i be a spanning forest of G(V, E F 1 F i 1 ) Lem: G(V, F 1 + + F k ) is k-connected iff G(V,E) is. Trivial k i=1 Consider a cut in G(V, F i i : this cut didn t grow in step i there is a cut in G(V, E) of size < k contradiction ) of size < k

K-connectivity Algorithm Construct k independent linear sketches *M 1 A G, M 2 A G, M k A G + for connectivity Run k-connectivity algorithm on sketches: Use M 1 A G to get a spanning forest F 1 of G Use M 2 A G M 2 A F1 = M 2 (A G F1 ) to find F 2 Use M 3 A G M 3 A F1 M 3 A F2 = M 3 (A G F1 F 2 ) to find F 3

Bipartiteness Reduction: Given G define G where vertices v (v 1, v 2 ); edges u, v (u 1, v 2 ) & (u 2, v 1 ) Lem: # connected components doubles iff the graph is bipartite. Thm: O(n log 3 n)-size linear sketch for k- connectivity (sketch G (implicitly).)

Minimum Spanning Tree If n i = # connected components in a subgraph induced by edges of weight 1 + ε i : w MST n 1 + ε r + λ i n i 1 + ε w MST i=0 r 1 where λ i = ( 1 + ε i+1 1 + ε i cc(g) = #connected components of G Round weights up to the nearest power of 1 + ε G i subgraph with edges of weight 1 + ε i Edges taken by the Kruskal s algorithm: n cc(g 0 ) edges of weight 1 cc G 0 cc(g 1 ) edges of weight (1 + ε) cc G i 1 cc G i edges of weight 1 + ε i

Minimum Spanning Tree Let r = log 1+ε W where W = max edge weight Overall weight: n cc G 0 + r 1 1 + ε i (cc G i 1 cc(g i )) r 1 = n 1 + ε r + ( 1 + ε i+1 1 + ε i ) cc(g i ) 0 Thm: 1 + ε -approx. MST weight can be computed with O n linear sketch for W = poly(n)

MST: Single Linkage Clustering [Zahn 71] Clustering via MST (Single-linkage): k clusters: remove k 1 longest edges from MST Maximizes minimum intercluster distance [Kleinberg, Tardos]

Cut Sparsification Two problems: Approximating Min-Cut in the graph (up to 1 ± ε) Preserving all cuts in the graph (up to 1 ± ε) General cut sparsification framework: Sample each edge e with probability p e Assign sampled edges weights 1/p e Expected weight of each cut is preserved, but too many cuts can t take union bound

Cut Sparsification For an edge e let λ e = weight of the minimum cut that contains e λ = size of the Min-Cut in G Thm [Fung et al.]: If G is an undirected weighted graph the if p e min C log2 n λ e ε 2, 1 then the cut sparsification alg. Preserves weights of all cuts up to (1 ± ε) Thm [Karger]: p e min Min-Cut up to (1 ± ε) C log n λ ε 2, 1 preserves

Minimum Cut Algorithm: For i = 0,1,, 2 log n : Let G i be the subgraph of G where each edge is sampled with probability 1/2 i Let H i = F 1,, F k where k = O 1 ε 2 log n and F i are forests constructed by the k-connectivity alg. Return 2 j λ(h j ) where j = min *i λ H i < k+ Space: O n log4 n ε 2, works for dynamic graph streams

Minimum Cut: Analysis Key property: If G i has k edges across a cut then H i contains all such edges i = log max 1, λε 2 6 log n i i 6 log n p e min, 1 min cut in G λε 2 i is approximating min-cut in G up to (1 ± ε) i = i : By Chernoff bound # edges in G i that crosses min-cut in G is O 1 ε2 log n k w.h.p.

Cut Sparsification Algorithm: For i = 0,1,, 2 log n : Let G i be the subgraph of G where each edge is sampled with probability 1/2 i Let H i = F 1,, F k where k = O 1 ε 2 log2 n and F i are forests constructed by the k-connectivity alg. For each edge e let j e = min i: λ e H i < k. If e H je then add e to the sparsifier with weight 2 j e n log5 n Space: O, works for dynamic graph streams ε 2 Analysis similar to the Min-Cut using [Fung et al.]