Filtering: A Method for Solving Graph Problems in MapReduce

Transcription

1 Filtering: A Method for Solving Graph Problems in MapReduce Benjamin Moseley Silvio Lattanzi Siddharth Suri Sergei Vassilvitski UIUC Google Yahoo! Yahoo!

2 Overview Introduction to MapReduce model Our settings Our results Open questions

3 Introduction to the MapReduce model

4 XXL Data Huge amount of data Main problem is to analyze information quickly

5 XXL Data Huge amount of data Main problem is to analyze information quickly New tools Suitable efficient algorithms

6 MapReduce MapReduce is the platform of choice for processing massive data INPUT MAP PHASE REDUCE PHASE

7 MapReduce MapReduce is the platform of choice for processing massive data Data are represented as tuple INPUT < key, value > MAP PHASE REDUCE PHASE

8 MapReduce MapReduce is the platform of choice for processing massive data Data are represented as tuple INPUT < key, value > Mapper decides how data is distributed MAP PHASE REDUCE PHASE

9 MapReduce MapReduce is the platform of choice for processing massive data Data are represented as tuple INPUT < key, value > Mapper decides how data is distributed Reducer performs non-trivial computation locally MAP PHASE REDUCE PHASE

10 How can we model MapReduce? PRAM No limit on the number of processors Memory is uniformly accessible from any processor No limit on the memory available

11 How can we model MapReduce? STREAMING Just one processor There is a limited amount of memory No parallelization

12 MapReduce model [Karloff, Suri and Vassilvitskii] N is the input size and > 0 is some fixed constant

13 MapReduce model [Karloff, Suri and Vassilvitskii] N > 0 is the input size and is some fixed constant Less than N 1 machines

14 MapReduce model [Karloff, Suri and Vassilvitskii] N > 0 is the input size and is some fixed constant Less than Less than N 1 N 1 machines memory on each machine

15 MapReduce model [Karloff, Suri and Vassilvitskii] N > 0 is the input size and is some fixed constant Less than Less than N 1 N 1 machines memory on each machine MRC i : problem that can be solved in O(log i N) rounds

16 Combining map and reduce phase Mapper and Reducer work only on a subgraph

17 Combining map and reduce phase Mapper and Reducer work only on a subgraph Keep time in each round polynomial

18 Combining map and reduce phase Mapper and Reducer work only on a subgraph Keep time in each round polynomial Time is constrained by the number of rounds

19 Algorithmic challenges No machine can see the entire input

20 Algorithmic challenges No machine can see the entire input No communication between machines during each phase

21 Algorithmic challenges No machine can see the entire input No communication between machines during each phase Total memory is N

22 MapReduce vs MUD algorithms In MUD framework each reducer operates on a stream of data. In MUD, each reducer is restricted to only using polylogarithmic space.

23 Our settings

24 Our settings We study the Karloff, Suri and Vassilvitskii model

25 Our settings We study the Karloff, Suri and Vassilvitskii model We focus on class MRC 0

26 Our settings m = n 1+c We assume to work with dense graph, for some constant c>0

27 Our settings m = n 1+c We assume to work with dense graph, for some constant c>0 Empirical evidences that social networks are dense graphs [Leskovec, Kleinberg and Faloutsos]

28 Dense graph motivation [Leskovec, Kleinberg and Faloutsos] They study 9 different social networks

29 Dense graph motivation [Leskovec, Kleinberg and Faloutsos] They study 9 different social networks They show that several graphs from different domains have n 1+c edges

30 Dense graph motivation [Leskovec, Kleinberg and Faloutsos] They study 9 different social networks They show that several graphs from different domains have n 1+c edges Lowest value of c founded.08 and four graphs have c>.5

31 Our results

32 Results Constant rounds algorithms for Maximal matching Minimum cut 8-approx for maximum weighted matching -approx for vertex cover 3/-approx for edge cover

33 Notation G =(V,E) : Input graph n : number of nodes m : number of edges η : memory available on each machine N : input size

34 Filtering Part of the input is dropped or filtered on the first stage in parallel

35 Filtering Part of the input is dropped or filtered on the first stage in parallel Next some computation is performed on the filtered input

36 Filtering Part of the input is dropped or filtered on the first stage in parallel Next some computation is performed on the filtered input Finally some patchwork is done to ensure a proper solution

37 Filtering

38 Filtering FILTER

39 Filtering FILTER COMPUTE SOLUTION

40 Filtering FILTER COMPUTE SOLUTION PATCHWORK ON REST OF THE GRAPH

41 Filtering FILTER COMPUTE SOLUTION PATCHWORK ON REST OF THE GRAPH

42 Warm-up: Compute the minimum spanning tree m = n 1+c

43 Warm-up: Compute the minimum spanning tree m = n 1+c PARTITION EDGES n c/

44 Warm-up: Compute the minimum spanning tree m = n 1+c PARTITION EDGES n c/ n 1+c/ edges

45 Warm-up: Compute the minimum spanning tree n 1+c/ edges

46 Warm-up: Compute the minimum spanning tree n 1+c/ edges COMPUTE MST n c/

47 Warm-up: Compute the minimum spanning tree n 1+c/ edges COMPUTE MST n c/ SECOND ROUND n c/ n 1+c/ edges

48 Show that the algorithm is in MRC 0 The algorithm is correct

49 Show that the algorithm is in MRC 0 The algorithm is correct No edge in the final solution is discarded in partial solution

50 Show that the algorithm is in MRC 0 The algorithm is correct No edge in the final solution is discarded in partial solution The algorithm runs in two rounds

51 Show that the algorithm is in MRC 0 The algorithm is correct No edge in the final solution is discarded in partial solution The algorithm runs in two rounds No more than O n c machines are used

52 Show that the algorithm is in MRC 0 The algorithm is correct No edge in the final solution is discarded in partial solution The algorithm runs in two rounds No more than O n c machines are used No machine uses memory greater than O(n 1+c/ )

53 Maximal matching Algorithmic difficulty is that each machine can only see edges assigned to the machine

54 Maximal matching Algorithmic difficulty is that each machine can only see edges assigned to the machine Is a partitioning based algorithm feasible?

55 Partitioning algorithm

56 Partitioning algorithm PARTITIONING

57 Partitioning algorithm PARTITIONING COMBINE MATCHINGS

58 Partitioning algorithm PARTITIONING COMBINE MATCHINGS It is impossible to build a maximal matching using only red edges!!

59 Algorithmic insight Consider any subset of the edges E

60 Algorithmic insight Consider any subset of the edges E Let M G[E ] be a maximal matching on

61 Algorithmic insight Consider any subset of the edges E Let M G[E ] be a maximal matching on The unmatched vertices form a independent set

62 Algorithmic insight We pick each edge with probability p

63 Algorithmic insight We pick each edge with probability p Find a matching on a sample and then find a matching on unmatched vertices

64 Algorithmic insight We pick each edge with probability p Find a matching on a sample and then find a matching on unmatched vertices For dense portions of the graph, many edges should be sampled

65 Algorithmic insight We pick each edge with probability p Find a matching on a sample and then find a matching on unmatched vertices For dense portions of the graph, many edges should be sampled Sparse portions of the graph are small and can be placed on a single machine

66 Algorithm Sample each edge independently with probability 10 log n p = n c/

67 Algorithm Sample each edge independently with probability 10 log n p = n c/ Find a matching on the sample

68 Algorithm Sample each edge independently with probability 10 log n p = n c/ Find a matching on the sample Consider the induced subgraph on unmatched vertices

69 Algorithm Sample each edge independently with probability 10 log n p = n c/ Find a matching on the sample Consider the induced subgraph on unmatched vertices Find a matching on this graph and output the union

70 Algorithm

71 Algorithm SAMPLE

72 Algorithm SAMPLE FIRST MATCHING

73 Algorithm SAMPLE FIRST MATCHING MATCHING ON UNMATCHED NODES

74 Algorithm SAMPLE FIRST MATCHING MATCHING ON UNMATCHED NODES UNION

75 Correctness Consider the last step of the algorithm

76 Correctness Consider the last step of the algorithm All unmatched vertices are placed on a machine

77 Correctness Consider the last step of the algorithm All unmatched vertices are placed on a machine All unmatched vertices or are matched at the last step or have only matched neighbors

78 Bounding the rounds Three rounds: Sampling the edges

79 Bounding the rounds Three rounds: Sampling the edges Find a matching on a single machine for the sampled edges

80 Bounding the rounds Three rounds: Sampling the edges Find a matching on a single machine for the sampled edges Find a matching for the unmatched vertices

81 Bounding the memory Each edge was sampled with probability p = 10 log n n c/

82 Bounding the memory Each edge was sampled with probability p = 10 log n n c/ Using Chernoff the sampled graph has size with high probability Õ(n 1+c/ )

83 Bounding the memory Each edge was sampled with probability p = 10 log n n c/ Using Chernoff the sampled graph has size with high probability Õ(n 1+c/ ) Can we bound the size of the induced subgraph on the unmatched vertices?

84 Bounding the memory For a fixed induced subgraph with at least edges the probability an edge is not sampled is: 1 10 log n n c/ n 1+c/ exp( 10n log n) n 1+c/

85 Bounding the memory For a fixed induced subgraph with at least edges the probability an edge is not sampled is: 1 10 log n n c/ n 1+c/ exp( 10n log n) n 1+c/ Union bounding over all n induced subgraphs shows that at least one edge is sampled from every dense subgraph with probability 1 exp( 10 log n) 1 1 n 10

86 Using less memory Can we run the algorithm with less then memory? Õ(n 1+c/ )

87 Using less memory Can we run the algorithm with less then memory? Õ(n 1+c/ ) We can amplify our sampling technique!

88 Using less memory Sample as many edges as possible that fits in memory

89 Using less memory Sample as many edges as possible that fits in memory Find a matching

90 Using less memory Sample as many edges as possible that fits in memory Find a matching If the edges between unmatched vertices fit onto a single machine, find a matching on those vertices

91 Using less memory Sample as many edges as possible that fits in memory Find a matching If the edges between unmatched vertices fit onto a single machine, find a matching on those vertices Otherwise, recurse on the unmatched nodes

92 Using less memory Sample as many edges as possible that fits in memory Find a matching If the edges between unmatched vertices fit onto a single machine, find a matching on those vertices Otherwise, recurse on the unmatched nodes n 1+ With memory each iteration a factor of edges are removed resulting in rounds O(c/) n

93 Parallel computation power Maximal matching algorithm does not use parallelization

94 Parallel computation power Maximal matching algorithm does not use parallelization We use a single machine in every step

95 Parallel computation power Maximal matching algorithm does not use parallelization We use a single machine in every step When parallelization is used?

96 Maximum weighted matching

97 Maximum weighted matching SPLIT THE GRAPH

98 Maximum weighted matching

99 Maximum weighted matching MATCHINGS 7 4 3

100 Maximum weighted matching 7 4 3

101 Maximum weighted matching SCAN THE EDGE SEQUENTIALLY 7 4 3

102 Bounding the rounds and memory Four rounds: Splitting the graph and running the maximal matching: 3 rounds and memory Õ(n 1+c/ ) Compute the final solution: 1 round and memory O(n log n)

103 Approximation guarantee (sketch) An edge in the solution can block at most edges in each subgraph of smaller weight

104 Approximation guarantee (sketch) An edge in the solution can block at most edges in each subgraph of smaller weight We loose a factor or because we do not consider the weight

105 Other algorithms Based on the same intuition: -approx for vertex cover 3/-approx for edge cover

106 Minimum cut Partition does not work, because we loose structural informations

107 Minimum cut Partition does not work, because we loose structural informations Sampling does not seem to work either

108 Minimum cut Partition does not work, because we loose structural informations Sampling does not seem to work either We can use the first steps of Karger algorithm as a filtering technique

109 Minimum cut Partition does not work, because we loose structural informations Sampling does not seem to work either We can use the first steps of Karger algorithm as a filtering technique The random choices made in the early rounds succeed with high probability, whereas the later rounds have a much lower probability of success

110 Minimum cut 1

111 Minimum cut RANDOM WEIGHT n δ

112 Minimum cut RANDOM WEIGHT n δ COMPRESS EDGES WITH SMALL WEIGTH

113 Minimum cut RANDOM WEIGHT n δ COMPRESS EDGES WITH SMALL WEIGTH From Karger at least one graph will contain a min cut w.h.p.

114 Minimum cut

115 Minimum cut COMPUTE MINIMUM CUT We will find the minimum cut w.h.p.

116 Empirical result (matching) %#!" '()(**+*",*-.)/01" %!!" 30)+(/4-""!"#$%&'(%)#"*&+,' $#!" $!!" #!"!"!&!!$"!&!!#"!&!$"!&!%"!&!#"!&$" $" -.%/0&'134.4)0)*5'(/,'

117 Open problems

118 Open problems 1 Maximum matching Shortest path Dynamic programming

119 Open problems Algorithms for sparse graph Does connected components require more than rounds? Lower bounds

120 Thank you!