Communication Cost in Big Data Processing

Transcription

1 Communication Cost in Big Data Processing Dan Suciu University of Washington Joint work with Paul Beame and Paris Koutris and the Database Group at the UW 1

2 Queries on Big Data Big Data processing on distributed clusters General programs: MapReduce, Spark SQL: PigLatin,, BigQuery, Shark, Myria Traditional Data Processing Main cost = disk I/O Big Data Processing Main cost = communication 2

3 This Talk How much communication is needed to solve a problem on p servers? Problem : In this talk = Full Conjunctive Query CQ ( SQL) Future work = extend to linear algebra, ML, etc. Some techniques discussed in this talk are implemented in Myria (Bill Howe s talk) 3

4 Models of Communication Combined communication/computation cost: PRAM à MRC [Karloff 2010] BSP [Valiant] à LogP Model [Culler] Explicit communication cost [Hong&Kung 81] red/blue pebble games, for I/O communication complexity à [Ballard 2012] extension to parallel algorithms MapReduce model [DasSarma 2013] MPC [Beame 2013,Beame 2014] this talk 4

5 Outline The MPC Model Communication Cost for Triangles Communication Cost for CQ s Discussion 5

6 Massively Parallel Communication Extends BSP [Valiant] Input data = size M bits Number of servers = p Model (MPC) Input (size=m) O(M/p) O(M/p) Server Server p 6

7 Massively Parallel Communication Extends BSP [Valiant] Input data = size M bits Number of servers = p Model (MPC) Input (size=m) O(M/p) O(M/p) Server Server p Round 1 One round = Compute & communicate Server Server p 7

8 Massively Parallel Communication Extends BSP [Valiant] Input data = size M bits Number of servers = p Model (MPC) Input (size=m) O(M/p) O(M/p) Server Server p Round 1 One round = Compute & communicate Server Server p Algorithm = Several rounds Server Round 2 Server p Round

9 Massively Parallel Communication Extends BSP [Valiant] Input data = size M bits Number of servers = p Model (MPC) Input (size=m) O(M/p) O(M/p) Server Server p L L Round 1 One round = Compute & communicate Server Server p Algorithm = Several rounds L Server L Round 2 Server p Max communication load per server = L L L Round

10 Massively Parallel Communication Extends BSP [Valiant] Input data = size M bits Number of servers = p Model (MPC) Input (size=m) O(M/p) O(M/p) Server Server p L L Round 1 One round = Compute & communicate Server Server p Algorithm = Several rounds L Server L Round 2 Server p Max communication load per server = L L L Round 3 Ideal: L = M/p.... This talk: L = M/p * p ε, where ε in [0,1] is called space exponent Degenerate: L = M (send everything to one server) 10

11 Example: Q(x,y,z) = R(x,y) S(y,z) size(r)+size(s)=m Input: R, S uniformly partitioned on p servers Server 1 M/p.... Server p M/p 11

12 Example: Q(x,y,z) = R(x,y) S(y,z) size(r)+size(s)=m Input: R, S uniformly partitioned on p servers M/p Server 1 O(M/p).... M/p Server p Round 1 O(M/p) Server Server p Round 1: each server Sends record R(x,y) to server h(y) mod p Sends record S(y,z) to server h(y) mod p 12

13 Example: Q(x,y,z) = R(x,y) S(y,z) size(r)+size(s)=m Input: R, S uniformly partitioned on p servers M/p Server 1 O(M/p).... M/p Server p Round 1 O(M/p) Server Server p Round 1: each server R(x,y) S(y,z) Sends record R(x,y) to server h(y) mod p Sends record S(y,z) to server h(y) mod p R(x,y) S(y,z) Output: Each server computes and outputs the local join R(x,y) S(y,z) 13

14 Example: Q(x,y,z) = R(x,y) S(y,z) size(r)+size(s)=m Input: R, S uniformly partitioned on p servers M/p Server 1 O(M/p).... M/p Server p Round 1 O(M/p) Server Server p Round 1: each server R(x,y) S(y,z) Sends record R(x,y) to server h(y) mod p Sends record S(y,z) to server h(y) mod p R(x,y) S(y,z) Output: Each server computes and outputs the local join R(x,y) S(y,z) Assuming no Skew: y, degree R (y), degree S (y) M/p Space exponent = 0 Rounds = 1 Load L = O(M/p) = O(M/p * p 0 )

15 Example: Q(x,y,z) = R(x,y) S(y,z) size(r)+size(s)=m Input: R, S uniformly partitioned on p servers SQL is Server 1 O(M/p) Server 1 M/p Round 1: each server R(x,y) S(y,z) Sends record R(x,y) to server h(y) mod p Sends record S(y,z) to server h(y) mod p embarrassingly parallel! M/p Server p Round 1 O(M/p) Server p R(x,y) S(y,z) Output: Each server computes and outputs the local join R(x,y) S(y,z) Assuming no Skew: y, degree R (y), degree S (y) M/p Space exponent = 0 Rounds = 1 Load L = O(M/p) = O(M/p * p 0 )

16 Questions Fix a query Q and a number of servers p What is the communication load L needed to compute Q in one round? This talk based on [Beame 2013, Beame 2014] Fix a maximum load L (space exponent ε) How many rounds are needed for Q? Preliminary results in [Beame 2013] 16

18 The Triangle Query T Z X Input: three tables R(X, Y), S(Y, Z), T(Z, X) size(r)+size(s)+size(t)=m Output: compute R S X Fred Jack Fred Carol Fred Y Z Jack Fred Alice Fred Y Jack Jim Carol Alice Fred Jim Jim Carol Alice Jim Alice Alice Jim Jim Alice Q(x,y,z) = R(x,y) S(y,z) T(z,x) 18

19 Q(X,Y,Z) = R(X,Y) S(Y,Z) T(Z,X) size(r)+size(s)+size(t)=m Triangles in Two Rounds Round 1: compute temp(x,y,z) = R(X,Y) S(Y,Z) Round 2: compute Q(X,Y,Z) = temp(x,y,z) T(Z,X) Load L depends on the size of temp. Can be much larger than M/p

20 Q(X,Y,Z) = R(X,Y) S(Y,Z) T(Z,X) size(r)+size(s)+size(t)=m Triangles in One Round [Ganguli 92, Afrati&Ullman 10, Suri 11, Beame 13] Factorize p = p ⅓ p ⅓ p ⅓ Each server identified by (i,j,k) Place the servers in a cube: Server (i,j,k) (i,j,k) j k i P ⅓ 20

21 Q(X,Y,Z) = R(X,Y) S(Y,Z) T(Z,X) size(r)+size(s)+size(t)=m Triangles in One Round R S X Fred Jack Fred Carol T Z Fred Y Z Jack Fred Alice Fred Y Jack Jim Carol Alice Fred Jim Jim Carol Alice Jim Jim Jack Alice X Alice Jim Jim Alice Round 1: Send R(x,y) to all servers (h(x),h(y),*) Send S(y,z) to all servers (*, h(y), h(z)) Send T(z,x) to all servers (h(x), *, h(z)) Output: compute locally R(x,y) S(y,z) T(z,x) j = h(jim) Jim Jack Fred (i,j,k) Jim Jack Fred Jim Fred Jim Fred Jim Jim Fred Jack Jim Jack Fred Jim Jim k Jack Jim Jim Jack Jim i = h(fred) P ⅓ 21

22 Q(X,Y,Z) = R(X,Y) S(Y,Z) T(Z,X) size(r)+size(s)+size(t)=m Upper Bound is L algo = O(M/p ⅔ ) M = size(r)+size(s)+size(t) (in bits) Theorem Assume data has no skew: all degree O(M/p ⅓ ). Then the algorithm computes Q in one round, with communication load L algo = O(M/p ⅔ ) Compare to two rounds: No more large intermediate result Skew-free up to degrees = O(M/p ⅓ ) (from O(M/p)) BUT: space exponent increased to ε = ⅓ (from ε = 1) 22

23 Q(X,Y,Z) = R(X,Y) S(Y,Z) T(Z,X) size(r)+size(s)+size(t)=m Lower Bound is L lower = M / p ⅔ M = size(r)+size(s)+size(t) (in bits) # without this assumption need to add a multiplicative factor 3 * Beame, Koutris, Suciu, Communication Steps for Parallel Query Processing, PODS

24 Q(X,Y,Z) = R(X,Y) S(Y,Z) T(Z,X) size(r)+size(s)+size(t)=m Lower Bound is L lower = M / p ⅔ M = size(r)+size(s)+size(t) (in bits) Assume that the three input relations R, S, T are stored on disjoint servers # Theorem* Any one-round deterministic algorithm A for Q requires L lower bits of communication per server # without this assumption need to add a multiplicative factor 3 * Beame, Koutris, Suciu, Communication Steps for Parallel Query Processing, PODS

25 Q(X,Y,Z) = R(X,Y) S(Y,Z) T(Z,X) size(r)+size(s)+size(t)=m Lower Bound is L lower = M / p ⅔ M = size(r)+size(s)+size(t) (in bits) Assume that the three input relations R, S, T are stored on disjoint servers # Theorem* Any one-round deterministic algorithm A for Q requires L lower bits of communication per server We prove a stronger claim, about any algorithm communicating < L lower bits Let L be the algorithm s max communication load per server We prove: if L < L lower then, over random permutations R, S, T, the algorithm returns at most a fraction (L / L lower ) 3/2 of the answers to Q # without this assumption need to add a multiplicative factor 3 * Beame, Koutris, Suciu, Communication Steps for Parallel Query Processing, PODS

26 Q(X,Y,Z) = R(X,Y) S(Y,Z) T(Z,X) size(r)+size(s)+size(t)=m Lower Bound is L lower = M / p ⅔ Discussion: An algorithm with space exponent ε < ⅓, has load L = M/p 1-ε and reports only (L / L lower ) 3/2 = 1/p (1-3ε)/2 fraction of all answers. Fewer, as p increases! Lower bound holds only for random inputs: cannot hold for fixed input R, S, T since the algorithm may use a short bit encoding to signal this input to all servers By Yao s lemma, the result also holds for randomized algorithms, some fixed input 26

27 Q(X,Y,Z) = R(X,Y) S(Y,Z) T(Z,X) size(r)+size(s)+size(t)=m Lower Bound is L lower = M / p ⅔ [Balard 2012] consider Strassen s matrix multiplication algorithm and prove n he following theorem for the CDAG model. If each server has M apple c 2 p 2/lg7 emory, then the communication load per server: Thus, if M = IO(n, p, M) = n2 then IO(n, p) = p 2/lg7 lg7 n M p M p n 2 p 2/lg7! M here same as n 2 here Stronger than MPC: no restriction on rounds Weaker than MPC Applies only to an algorithm, not to a problem CDAG model of computation v.s. bit-model 27

28 Q(X,Y,Z) = R(X,Y) S(Y,Z) T(Z,X) size(r)+size(s)+size(t)=m Lower Bound is L lower = M / p ⅔ Proof 28

29 Q(X,Y,Z) = R(X,Y) S(Y,Z) T(Z,X) size(r)+size(s)+size(t)=m Lower Bound is L lower = M / p ⅔ Proof Input: size = M = 3 log n! R, S, T random permutations on [n]: Pr [(i,j) R] = 1/n, same for S, T E [#answers to Q] = Σ i,j,k Pr [(i,j,k) R S T] = n 3 * (1/n) 3 = 1 29

30 Q(X,Y,Z) = R(X,Y) S(Y,Z) T(Z,X) size(r)+size(s)+size(t)=m Lower Bound is L lower = M / p ⅔ Proof Input: size = M = 3 log n! R, S, T random permutations on [n]: Pr [(i,j) R] = 1/n, same for S, T E [#answers to Q] = Σ i,j,k Pr [(i,j,k) R S T] = n 3 * (1/n) 3 = 1 One server v [p]: receives L(v) = L R (v) + L S (v) + L T (v) bits Denote a ij = Pr [(i,j) R and v knows it] Then (1) a ij 1/n (obvious) (2) Σ i,j a ij L R (v) / log n (formal proof: entropy) Denote similarly b jk,c ki for S and T 30

31 Q(X,Y,Z) = R(X,Y) S(Y,Z) T(Z,X) size(r)+size(s)+size(t)=m Lower Bound is L lower = M / p ⅔ Proof Input: size = M = 3 log n! R, S, T random permutations on [n]: Pr [(i,j) R] = 1/n, same for S, T E [#answers to Q] = Σ i,j,k Pr [(i,j,k) R S T] = n 3 * (1/n) 3 = 1 One server v [p]: receives L(v) = L R (v) + L S (v) + L T (v) bits Denote a ij = Pr [(i,j) R and v knows it] Then (1) a ij 1/n (obvious) (2) Σ i,j a ij L R (v) / log n (formal proof: entropy) Denote similarly b jk,c ki for S and T E [#answers to Q reported by server v] = Σ i,j,k a ij b jk c ki [(Σ i,j a ij2 ) * (Σ j,k b jk2 ) * (Σ k,i c ki2 )] ½ [Friedgut] (1/n) 3/2 * [ L R (v) * L S (v) * L T (v) / log 3 n ] ½ by (1), (2) (1/n) 3/2 * [ L(v) / 3* log n ] 3/2 geometric/arithmetic = [ L(v) / M ] 3/2 expression for M above 31

32 Q(X,Y,Z) = R(X,Y) S(Y,Z) T(Z,X) size(r)+size(s)+size(t)=m Lower Bound is L lower = M / p ⅔ Proof Input: size = M = 3 log n! R, S, T random permutations on [n]: Pr [(i,j) R] = 1/n, same for S, T E [#answers to Q] = Σ i,j,k Pr [(i,j,k) R S T] = n 3 * (1/n) 3 = 1 One server v [p]: receives L(v) = L R (v) + L S (v) + L T (v) bits Denote a ij = Pr [(i,j) R and v knows it] Then (1) a ij 1/n (obvious) (2) Σ i,j a ij L R (v) / log n (formal proof: entropy) Denote similarly b jk,c ki for S and T E [#answers to Q reported by server v] = Σ i,j,k a ij b jk c ki [(Σ i,j a ij2 ) * (Σ j,k b jk2 ) * (Σ k,i c ki2 )] ½ [Friedgut] (1/n) 3/2 * [ L R (v) * L S (v) * L T (v) / log 3 n ] ½ by (1), (2) (1/n) 3/2 * [ L(v) / 3* log n ] 3/2 geometric/arithmetic = [ L(v) / M ] 3/2 expression for M above E [#answers to Q reported by all servers] Σ v [ L(v) / M ] 3/2 = p * [ L / M ] 3/2 = [ L / L lower ] 3/2 32

34 General Formula Consider an arbitrary conjunctive query (CQ): Q(x 1,...,x k )=S 1 (x 1 ) on...on S`(x`) size(s 1 )=M 1, size(s 2 )=M 2,...,size(S`) =M` We will: Give a lower bound L lower formula Give an algorithm with load L algo Prove that L lower = L algo 34

35 Q(x,y) = S 1 (x) S 2 (y) size(s 1 )=M 1, size(s 2 ) = M 2 Cartesian Product

36 Q(x,y) = S 1 (x) S 2 (y) size(s 1 )=M 1, size(s 2 ) = M 2 Cartesian Product Algorithm: factorize p = p 1 p 2 Send S 1 (x) to all servers (h(x),*) Send S 2 (y) to all servers (*, h(y)) Load = M 1 / p 1 + M 2 / p 2 ; Minimized when M 1 / p 1 = M 2 / p 2 = (M 1 M 2 /p) ½ L algo =2 S 1 (x) à M1 M 2 p S2 (y) à 1 2

37 Q(x,y) = S 1 (x) S 2 (y) size(s 1 )=M 1, size(s 2 ) = M 2 Cartesian Product Algorithm: factorize p = p 1 p 2 Send S 1 (x) to all servers (h(x),*) Send S 2 (y) to all servers (*, h(y)) Load = M 1 / p 1 + M 2 / p 2 ; Minimized when M 1 / p 1 = M 2 / p 2 = (M 1 M 2 /p) ½ Lower bound: Let L(v) = L 1 (v) + L 2 (v) = load at server v The p servers must report all answers to Q: L algo =2 S 1 (x) à M1 M 2 p S2 (y) à 1 2 M 1 M 2 Σ v L 1 (v) * L 2 (v) ½ Σ v (L(v)) 2 ½ p max v (L(v)) 2 L lower =2 M1 M 2 p

38 Q(x 1,...,x k )=S 1 (x 1 ) on...on S`(x`) size(s 1 )=M 1, size(s 2 )=M 2,...,size(S`) =M` A Simple Observation efinition An edge packing for Q is a set of atoms with no common variables: U ={S j1 (x j1 ), S j2 (x j2 ),...,S j U (x j U )} 8i, k : x ji \ x jk = ; Assume that the three input relations S 1, S 2, are stored on disjoint servers # Claim Any one-round algorithm for Q must also compute the cartesian product: Q 0 (x j1,...,x j U )=S j1 (x j1 ) on S j2 (x j2 ) on...on S j U (x j U ) Proof Because the algorithm doesn t know the other S j s. L lower = U Mj1 M j2 M j U p 1 U Proof similar to cartesian product on previous slide # otherwise, add another factor U

39 Q(x 1,...,x k )=S 1 (x 1 ) on...on S`(x`) size(s 1 )=M 1, size(s 2 )=M 2,...,size(S`) =M` The Lower Bound for CQ efinition A fractional edge packing are real numbers u 1,u 2,...,u` s.t.: 8j 2 [`] : u j 0 8i 2 [k] : Theorem* Any one-round deterministic algorithm A for Q requires O(L lower ) bits of communication per server: X j2[`]:x i 2vars(S j (x j )) u j apple 1 L lower = M1 u1 M 2 u2 M`u` p 1 u 1 +u 2 + +u` Note that we ignore constant factors (like U on the previous slide) *Beame, Koutris, Suciu, Skew in Parallel Data Processing, PODS

40 Q(X,Y,Z) = R(X,Y) S(Y,Z) T(Z,X) size(r) = M 1, size(s) = M 2, size(t) = M 3 Triangle Query Revisited ½ z ½ x ½ y Packing u 1, u 2, u 3 L = (M u1 1 * M u2 2 * M u3 3 / p) 1/(u1+u2+u3) ½, ½, ½ (M 1 M 2 M 3 ) ⅓ / p ⅔ 1, 0, 0 M 1 / p 0, 1, 0 M 2 / p 0, 0, 1 M 3 / p L lower = the largest of these four values When M 1 =M 2 =M 3 =M, then the largest value is the first: M / p ⅔ 40

41 Q(x 1,...,x k )=S 1 (x 1 ) on...on S`(x`) size(s 1 )=M 1, size(s 2 )=M 2,...,size(S`) =M` An Algorithm for CQ [Afrati&Ullman 2010] Factorize p = p 1 p 2 p k ; a server is (v 1,, v k ) [p 1 ] [p k ] Send S 1 (x 11, x 12, ) to servers with coordinates h(x 11 ), h(x 12 ), Send S 2 (x 21, x 22, ) to servers with coordinates h(x 21 ), h(x 22 ), Compute the query locally at each server Load is O(L algo ), where: L algo = max j2[`] M j Q i2[k]:x i 2vars(S j (x j )) p i [A&U] use sum instead of max. With max we can compute the optimal p 1, p 2,, p k

42 Q(x 1,...,x k )=S 1 (x 1 ) on...on S`(x`) size(s 1 )=M 1, size(s 2 )=M 2,...,size(S`) =M` L lower = L algo L algo = max j M j Q i:x i 2vars(S j ) p i L lower = Qj M j u j p P 1 j u j Apply log in base p. Denote µ j = log p M j, e i = log p p i

43 Q(x 1,...,x k )=S 1 (x 1 ) on...on S`(x`) size(s 1 )=M 1, size(s 2 )=M 2,...,size(S`) =M` L lower = L algo L algo = max j M j Q i:x i 2vars(S j ) p i L lower = Qj M j u j p P 1 j u j 8j : + Apply log in base p. Denote µ j = log p M j, e i = log p p i X j:x i 2vars(S j ) minimize X e i apple 1 i e i µ j

44 Q(x 1,...,x k )=S 1 (x 1 ) on...on S`(x`) size(s 1 )=M 1, size(s 2 )=M 2,...,size(S`) =M` L lower = L algo L algo = max j M j Q i:x i 2vars(S j ) p i L lower = Qj M j u j p P 1 j u j X 8j : + Apply log in base p. Denote µ j = log p M j, e i = log p p i minimize X e i apple 1 i e i µ X j maximize ( f j µ j f 0 ) j:x i 2vars(S j ) j X X f j apple 1 Primal/Dual u 8i : f j apple f j = f j / f 0 0 u 0 = 1 / f 0 j:x i 2vars(S j ) at optimality: u 0 = Σ j u j j

46 Discussion Communication costs for multiple rounds Lower/upper bounds in [Beame 13] connect the number of rounds to log of the query diameter Weaker communication model: algorithm sends only tuples, not arbitrary bits Communication cost for skewed data Lower/upper bounds in [Beame 14] have a gap of poly log p Communication costs beyond full CQ s Open! 46