CIS 700: algorithms for Big Data Lecture 6: Graph Sketching Slides at http://grigory.us/big-data-class.html Grigory Yaroslavtsev http://grigory.us
Sketching Graphs? We know how to sketch vectors: v Mv How about sketching graphs? G V, E A G (adjacency matrix): A G MA G Sketch columns of A G n = V, m = E O(poly(log n)) sketch per vertex / O(n) total Check connectivity Check bipartiteness As always, space rather than dimension. Why?
Graph Streams Semi-streaming model: [Muthukrishnan 05; Feigenbaum, Kannan, McGregor, Suri, Zhang 05] Graph defined by the stream of edges e 1,, e m Space O n, edges processed in order Connectivity is easy on O(n) space for insertion-only Dynamic graphs: Stream of insertion/deletion updates + e i1, e i2,, e it (assume sequence is correct) Resulting graph has edge e i if it wasn t deleted after the last insertion Linear sketching dynamic graphs: MA G e = MA G MA e
Distributed Computing Linear sketches for distributed processing S servers with o(m) memory: Send m/s edges (E 1,, E s ) to each server Compute sketches ME 1,, ME s locally Send sketches to a central server Compute MA G = ME i s i M has to have a small representation (same issue as in streaming)
Connectivity Thm. Connectivity is sketchable in O(n) space Framework: Take existing connectivity algorithm (Boruvka) Sketch A G MA G Run Boruvka on MA G Important that the sketch is homomorphic w.r.t the algorithm
Part 1: Parallel Connectivity (Boruvka) Repeat until no edges left: For each vertex, select any incident edge Contract selected edges Lemma: process converges in O(log n) steps
Part 2: Graph Representation For a vertex i let a i be a vector in R n 2 Non-zero entries for edges i, j a i i, j = 1 if j > i a i i, j = 1 if j < i Example: a 1 = 1, 1, 1, 1, 0,, 0 a 2 = 1, 0, 0, 0, 0, 0, 1, 0, 1,, 0 1,2, 1,3, 1,4, 1,5, 1,6, 1,7, 2,3, 2.4. 2,5, 6 7 4 1 3 5 2 Lem: For any S V supp i S a i = E(S, V S)
Part 3: L 0 -Sampling There is a distribution over M R d m with d = O(log 2 m) such w.p. 9/10 that a R m : Ca e supp(a) [Cormode, Muthukrishnan, Rozenbaum 05; Jowhari, Saglam, Tardos 11] Constant probability suffices still O log n Boruvka iterations
Final Algorithm Construct log n l 0 -samplers for each a i Run Boruvka on sketches: Use C 1 a j to get an edge incident on a node j For i = 2 to t: To get incident edge on a component S V use: C i a j = C i a j j S e supp a j j S = E(S, V S) j S
K-Connectivity Graph is k-connected is every cut has size k Thm: There is a O(nk log 3 n)-size linear sketch for k-connectivity Generalization: There is an O(n log 5 n /ε 2 )- size linear sketch which allows to approximate all cuts in a graph up to error (1 ± ε)
K-connectivity Algorithm Algorithm for k-connectivity: Let F 1 be a spanning forest of G(V, E) For i = 2,, k Let F i be a spanning forest of G(V, E F 1 F i 1 ) Lem: G(V, F 1 + + F k ) is k-connected iff G(V,E) is. Trivial k i=1 Consider a cut in G(V, F i i : this cut didn t grow in step i there is a cut in G(V, E) of size < k contradiction ) of size < k
K-connectivity Algorithm Construct k independent linear sketches *M 1 A G, M 2 A G, M k A G + for connectivity Run k-connectivity algorithm on sketches: Use M 1 A G to get a spanning forest F 1 of G Use M 2 A G M 2 A F1 = M 2 (A G F1 ) to find F 2 Use M 3 A G M 3 A F1 M 3 A F2 = M 3 (A G F1 F 2 ) to find F 3
Bipartiteness Reduction: Given G define G where vertices v (v 1, v 2 ); edges u, v (u 1, v 2 ) & (u 2, v 1 ) Lem: # connected components doubles iff the graph is bipartite. Thm: O(n log 3 n)-size linear sketch for k- connectivity (sketch G (implicitly).)
Minimum Spanning Tree If n i = # connected components in a subgraph induced by edges of weight 1 + ε i : w MST n 1 + ε r + λ i n i 1 + ε w MST i=0 r 1 where λ i = ( 1 + ε i+1 1 + ε i cc(g) = #connected components of G Round weights up to the nearest power of 1 + ε G i subgraph with edges of weight 1 + ε i Edges taken by the Kruskal s algorithm: n cc(g 0 ) edges of weight 1 cc G 0 cc(g 1 ) edges of weight (1 + ε) cc G i 1 cc G i edges of weight 1 + ε i
Minimum Spanning Tree Let r = log 1+ε W where W = max edge weight Overall weight: n cc G 0 + r 1 1 + ε i (cc G i 1 cc(g i )) r 1 = n 1 + ε r + ( 1 + ε i+1 1 + ε i ) cc(g i ) 0 Thm: 1 + ε -approx. MST weight can be computed with O n linear sketch for W = poly(n)
MST: Single Linkage Clustering [Zahn 71] Clustering via MST (Single-linkage): k clusters: remove k 1 longest edges from MST Maximizes minimum intercluster distance [Kleinberg, Tardos]
Cut Sparsification Two problems: Approximating Min-Cut in the graph (up to 1 ± ε) Preserving all cuts in the graph (up to 1 ± ε) General cut sparsification framework: Sample each edge e with probability p e Assign sampled edges weights 1/p e Expected weight of each cut is preserved, but too many cuts can t take union bound
Cut Sparsification For an edge e let λ e = weight of the minimum cut that contains e λ = size of the Min-Cut in G Thm [Fung et al.]: If G is an undirected weighted graph the if p e min C log2 n λ e ε 2, 1 then the cut sparsification alg. Preserves weights of all cuts up to (1 ± ε) Thm [Karger]: p e min Min-Cut up to (1 ± ε) C log n λ ε 2, 1 preserves
Minimum Cut Algorithm: For i = 0,1,, 2 log n : Let G i be the subgraph of G where each edge is sampled with probability 1/2 i Let H i = F 1,, F k where k = O 1 ε 2 log n and F i are forests constructed by the k-connectivity alg. Return 2 j λ(h j ) where j = min *i λ H i < k+ Space: O n log4 n ε 2, works for dynamic graph streams
Minimum Cut: Analysis Key property: If G i has k edges across a cut then H i contains all such edges i = log max 1, λε 2 6 log n i i 6 log n p e min, 1 min cut in G λε 2 i is approximating min-cut in G up to (1 ± ε) i = i : By Chernoff bound # edges in G i that crosses min-cut in G is O 1 ε2 log n k w.h.p.
Cut Sparsification Algorithm: For i = 0,1,, 2 log n : Let G i be the subgraph of G where each edge is sampled with probability 1/2 i Let H i = F 1,, F k where k = O 1 ε 2 log2 n and F i are forests constructed by the k-connectivity alg. For each edge e let j e = min i: λ e H i < k. If e H je then add e to the sparsifier with weight 2 j e n log5 n Space: O, works for dynamic graph streams ε 2 Analysis similar to the Min-Cut using [Fung et al.]