General-purpose Distributed Computing using a High-level Language. Michael Isard


1 Dryad and DryadLINQ: General-purpose Distributed Computing using a High-level Language. Michael Isard, Microsoft Research Silicon Valley

2 Distributed Data-Parallel Computing. Workloads beyond standard SQL, HPC: data mining, graph analysis, ...; complex, long-lived application software. Cloud (shared clusters): transparent scaling, resource virtualization. Commodity hardware: fault tolerance with good performance.

3 Talk overview. Part I: high-level language: LINQ; computational model: DAG; execution layer: Dryad+Quincy. Part II: Dryad systems issues; comparison with MapReduce; DryadLINQ demo.

4 LINQ: Microsoft's Language INtegrated Query. Operators to manipulate datasets in .NET: dataset is a first-class abstraction; Select, Join, GroupBy, Aggregate, etc.; set-at-a-time, instead of looping over object-at-a-time. Integrated into .NET programming languages: programs can call operators; operators can invoke arbitrary .NET functions. Data model: data elements are strongly typed .NET objects; much more expressive than SQL tables. Extensible: add new operators and implementations.

5 Aggregating partial sums
    class PartialSum {
        public int sum;
        public int count;
    }

    static double MergeSums(PartialSum[] sums) {
        int totalSum = 0, totalCount = 0;
        for (int i = 0; i < sums.Length; ++i) {
            totalSum += sums[i].sum;
            totalCount += sums[i].count;
        }
        return (double) totalSum / (double) totalCount;
    }

6 Aggregating partial sums
    class PartialSum {
        public int sum;
        public int count;
    }

    static double MergeSums(PartialSum[] sums) {
        return (double) sums.Select(x => x.sum).Sum()
             / (double) sums.Select(x => x.count).Sum();
    }

7 Convenient syntax
    var words = tableOfLines.SelectMany(l => l.Split(' ')).GroupBy(w => w);

8 Convenient syntax
    var words = tableOfLines.SelectMany(l => l.Split(' ')).GroupBy(w => w);

    // The same query written without type inference or lambdas:
    IQueryable<IGrouping<string, string>> words =
        tableOfLines.SelectMany(MySplitFunction).GroupBy(MyStringIdentity);

    IEnumerable<string> MySplitFunction(string line) { return line.Split(' '); }
    string MyStringIdentity(string word) { return word; }

9 K-means algorithm

10 K-means algorithm

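13 K-means helper functions
    class Vector { }

    Vector Mean(IEnumerable<Vector> set) {
        Vector sum = set.Aggregate((x, y) => x + y);
        return sum / set.Count();
    }

    Vector NearestNeighbor(Vector vect, IEnumerable<Vector> set) {
        // Order by distance and take the first element, so that the result
        // is the nearest vector itself rather than the smallest distance.
        return set.OrderBy(e => (e - vect).L2Norm()).First();
    }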

14 K-means algorithm
    IEnumerable<Vector> KMeansStep(IEnumerable<Vector> vectors,
                                   IEnumerable<Vector> centers) {
        // Assign every vector to its nearest center, then average each cluster.
        var clusters = vectors.GroupBy(
            vector => NearestNeighbor(vector, centers).vectorid);
        return clusters.Select(cluster => Mean(cluster));
    }

    IEnumerable<Vector> KMeans(IEnumerable<Vector> vectors,
                               IEnumerable<Vector> centers) {
        for (int i = 0; i < iterations; i++)
            centers = KMeansStep(vectors, centers);
        return centers;
    }
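The K-means slides leave the Vector class empty, so the following is only a minimal local sketch of the same GroupBy/Select structure using LINQ to Objects. It assumes points are plain double[] arrays and groups by the index of the nearest center; the names KMeansSketch, Add, Distance2 and NearestCenter are hypothetical and not part of DryadLINQ. As in the slide, a center that attracts no points simply disappears from the next iteration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class KMeansSketch
{
    // Hypothetical stand-ins for the elided Vector class: element-wise sum
    // and squared Euclidean distance on double[] points.
    static double[] Add(double[] a, double[] b) => a.Zip(b, (x, y) => x + y).ToArray();
    static double Distance2(double[] a, double[] b) => a.Zip(b, (x, y) => (x - y) * (x - y)).Sum();

    public static double[] Mean(IEnumerable<double[]> set)
    {
        var points = set.ToList();
        var sum = points.Aggregate((a, b) => Add(a, b));
        return sum.Select(x => x / points.Count).ToArray();
    }

    // Index of the center closest to v (lowest index wins ties).
    public static int NearestCenter(double[] v, IList<double[]> centers) =>
        Enumerable.Range(0, centers.Count).OrderBy(i => Distance2(v, centers[i])).First();

    // One iteration: group points by nearest center, then average each group.
    public static List<double[]> KMeansStep(IEnumerable<double[]> vectors, IList<double[]> centers) =>
        vectors.GroupBy(v => NearestCenter(v, centers))
               .OrderBy(g => g.Key)
               .Select(g => Mean(g))
               .ToList();

    public static List<double[]> KMeans(IEnumerable<double[]> vectors, IList<double[]> centers, int iterations)
    {
        var current = centers.ToList();
        for (int i = 0; i < iterations; i++)
            current = KMeansStep(vectors, current);
        return current;
    }

    public static void Main()
    {
        var data = new[] { new[] { 0.0, 0.0 }, new[] { 0.0, 2.0 }, new[] { 10.0, 10.0 }, new[] { 10.0, 12.0 } };
        var start = new List<double[]> { new[] { 1.0, 1.0 }, new[] { 9.0, 9.0 } };
        foreach (var center in KMeans(data, start, 3))
            Console.WriteLine(string.Join(", ", center));
    }
}
```

The fixed iteration count mirrors the loop on the slide; a convergence test would be the natural replacement.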

15 Data mining, machine learning, ...: decision tree training; SVD; power iteration (PageRank); image feature extraction/indexing/clustering; network trace analysis; light field simulation.

17 Computational model: DAG. Distributed processing: partition computation across cores/cluster; minimize communication overhead. Directed acyclic graph: an edge is a finite sequence of data items; a vertex is a computation over its input edge sequences.

18 DAG abstraction. Explicit dataflow: exposes dependencies within computation. Absence of cycles: allows re-execution for fault tolerance; simplifies scheduling (no deadlock); cycles can often be replaced by unrolling; unsuitable for fine-grain inner loops. Very popular: databases, functional languages, ...
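The claim that acyclicity simplifies scheduling can be illustrated with a small sketch: in a DAG, a vertex can run as soon as all of its inputs are complete, and some vertex is always ready until the job finishes. The code below is a generic topological ordering (Kahn's algorithm), not Dryad's scheduler; DagSketch and ExecutionOrder are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DagSketch
{
    // An edge (From, To) means that vertex To consumes the output of vertex From.
    public static List<string> ExecutionOrder(
        IEnumerable<string> vertices, IEnumerable<(string From, string To)> edges)
    {
        var edgeList = edges.ToList();
        // Number of inputs each vertex is still waiting for.
        var pending = vertices.ToDictionary(v => v, v => edgeList.Count(edge => edge.To == v));
        var ready = new Queue<string>(pending.Where(p => p.Value == 0).Select(p => p.Key));
        var order = new List<string>();

        while (ready.Count > 0)
        {
            var vertex = ready.Dequeue();
            order.Add(vertex);
            // Finishing a vertex may make its consumers runnable.
            foreach (var outgoing in edgeList.Where(edge => edge.From == vertex))
                if (--pending[outgoing.To] == 0)
                    ready.Enqueue(outgoing.To);
        }

        // With a cycle, some vertices wait on each other forever and never become ready.
        if (order.Count != pending.Count)
            throw new InvalidOperationException("graph contains a cycle");
        return order;
    }

    public static void Main()
    {
        var order = ExecutionOrder(
            new[] { "SelectMany", "GroupBy", "Select" },
            new[] { ("SelectMany", "GroupBy"), ("GroupBy", "Select") });
        Console.WriteLine(string.Join(" -> ", order));
    }
}
```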

19 Map. Independent transformation of dataset: for each x in S, output x' = f(x). E.g. simple grep for word w: output line x only if x contains w.

20 Map [Diagram: input dataset S flows through f to produce the output dataset.]

21 Map [Diagram: input partitions S1, S2, S3 each flow through their own copy of f, producing one output partition each.]
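The grep example from the Map slides can be sketched locally with LINQ to Objects. The partitioned variant shows why Map parallelizes trivially: the same function runs on every partition and no data moves between partitions. MapSketch, Grep and GrepPartitions are hypothetical names used only for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class MapSketch
{
    // f is applied to each line independently: keep x only if it contains w.
    public static IEnumerable<string> Grep(IEnumerable<string> lines, string w) =>
        lines.Where(x => x.Contains(w));

    // The same f runs on every partition; each output partition depends
    // only on its own input partition.
    public static List<List<string>> GrepPartitions(
        IEnumerable<IEnumerable<string>> partitions, string w) =>
        partitions.Select(p => Grep(p, w).ToList()).ToList();

    public static void Main()
    {
        var partitions = new[]
        {
            new[] { "the cat sat", "a dog ran" },
            new[] { "cat nap", "bird song" },
        };
        foreach (var partition in GrepPartitions(partitions, "cat"))
            Console.WriteLine(string.Join(" | ", partition));
    }
}
```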

22 Reduce. Grouping plus aggregation: 1) group x in S according to key selector k(x); 2) for each group g, output r(g). E.g. simple word count: group by k(x) = x; for each group g, output key (word) and count of g.

23 Reduce [Diagram: S -> G -> r -> output, where G groups by key and r reduces each group.]

24 Reduce [Diagram: S -> G -> r -> output.]

25 Reduce [Diagram: input partitions S1, S2, S3 all feed a single G -> r pipeline with one output.]

26 Reduce [Diagram: each input partition S1, S2, S3 passes through D, then G -> r, producing three output partitions.] D is distribute, e.g. by hash or range.

27 Reduce [Diagram: each input partition S1, S2, S3 passes through G -> ir -> D, then G -> r, producing the output partitions.] ir is initial reduce, e.g. compute a partial sum.
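The staged Reduce in these diagrams (G -> ir -> D -> G -> r) can be sketched for word count as follows. This is a single-process illustration, not DryadLINQ's generated plan: each input partition is reduced to partial counts, the partial counts are distributed by a hash of the key, and each output partition sums the partial counts it receives. ReduceSketch and its method names are hypothetical, and Bucket uses a deliberately simple, deterministic hash as an assumption of this sketch.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ReduceSketch
{
    // G + ir: group one input partition by word and emit partial counts.
    public static List<KeyValuePair<string, int>> InitialReduce(IEnumerable<string> words) =>
        words.GroupBy(w => w)
             .Select(g => new KeyValuePair<string, int>(g.Key, g.Count()))
             .ToList();

    // D: choose the output partition for a key (toy hash: sum of character codes).
    public static int Bucket(string key, int partitions) => key.Sum(c => (int)c) % partitions;

    // G + r: group the partial counts by word and add them up.
    public static Dictionary<string, int> FinalReduce(IEnumerable<KeyValuePair<string, int>> partials) =>
        partials.GroupBy(p => p.Key).ToDictionary(g => g.Key, g => g.Sum(p => p.Value));

    public static List<Dictionary<string, int>> WordCount(IEnumerable<IEnumerable<string>> inputs, int partitions)
    {
        var partials = inputs.SelectMany(input => InitialReduce(input)).ToList();
        return Enumerable.Range(0, partitions)
            .Select(b => FinalReduce(partials.Where(p => Bucket(p.Key, partitions) == b)))
            .ToList();
    }

    public static void Main()
    {
        var inputs = new[] { new[] { "a", "b", "a" }, new[] { "b", "c" } };
        var outputs = WordCount(inputs, 2);
        for (int i = 0; i < outputs.Count; i++)
            Console.WriteLine(i + ": " + string.Join(", ", outputs[i].Select(kv => kv.Key + "=" + kv.Value)));
    }
}
```

Because every occurrence of a word lands in the same bucket, each output partition can finish its counts without consulting the others.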

28 K-means [Diagram: starting from centers C0, three iterations of ac -> cc produce C1, C2 and C3; the points P are an input to each ac vertex.]

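29 K-means [Diagram: the same three iterations with the points partitioned into P1, P2, P3: three ac vertices per iteration feed one cc vertex, producing C1, C2 and C3 from C0.]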

30 PageRank [Diagram: three unrolled iterations over three partitions. Each iteration takes the N partitions of the previous iteration and the edge partitions E1, E2, E3 through ae -> D -> cc vertices to produce the next N partitions.]

31 Distributed Word Count. Count word frequency in a set of documents:
    var words = docs.SelectMany(doc => doc.words);
    var groups = words.GroupBy(word => word);
    var counts = groups.Select(g => new WordCount(g.Key, g.Count()));
[Diagram: IN -> SM (doc => doc.words) -> GB (word => word) -> S (g => new ...) -> OUT]

32 Execution Plan for Word Count [Diagram legend: SM = SelectMany; Q = sort; GB = groupby; C = count; D = distribute; MS = mergesort; Sum = sum. In stage (1), SM, Q, GB and C are pipelined and followed by D; the next stage pipelines MS, GB and Sum.]

33 Execution Plan for Word Count [Diagram: the same plan expanded at run time: (1) four parallel SM -> Q -> GB -> C -> D pipelines, (2) three parallel MS -> GB -> Sum pipelines.]

35 Dryad. General-purpose execution engine: batch processing on immutable datasets; well tested on large clusters. Automatically handles: fault tolerance; distribution of code and intermediate data; scheduling of work to resources.

36 Fault tolerance. Buffer data in (some) edges. Re-execute on failure using buffered data. Speculatively re-execute for stragglers.

37 Rewrite graph at runtime. Loop unrolling with convergence tests. Adapt partitioning scheme at run time: choose #partitions based on runtime data volume; Broadcast Join vs. Hash Join, etc. Adaptive aggregation and distribution trees: based on data skew and network topology. Load balancing: data/processing skew (cf. work stealing).

38 Dryad System Architecture Scheduler R

39 Dryad System Architecture Scheduler R R

40 Quincy: DAG scheduler. Data locality and fairness (SLAs). SOSP 2009.

41 Production system. Dryad: well tested, scalable: daily use supporting Bing for over 3 years; clusters with >10k computers. Applicable to large number of computations: 250-computer cluster at MSR SVC, Mar to Nov 09; 15k jobs (tens of millions of processes executed); hundreds of distinct programs; network trace analysis, privacy-preserving inference, light-transport simulation, decision tree training, deep belief network training, image feature extraction, ...

42 Conclusion. DryadLINQ supports many computations: easy to use, flexible. DAG-structured jobs scale to large clusters: transient failures common, disk failures daily. Publicly available for download.

44 Dryad inputs and outputs. Partitioned data set: records do not cross partition boundaries; data on compute machines: NTFS, SQLServer, ...; optional semantics: hash partition, range partition, sorted, etc. Loading external data: partitioning automatic; file system chooses sensible partition sizes, or known partitioning from user.

45 Partitioning driven by data [Diagram: six input partitions S1 to S6, each followed by its own G vertex.]

46 Partitioning driven by data [Diagram: the six input partitions S1 to S6 feed G -> ir -> D vertices, whose outputs are regrouped by G -> r vertices into two output partitions.]

47 Push vs Pull. Databases typically pull using iterator model: avoids buffering; can prevent unnecessary computation. But DAG must be fully materialized: complicates rewriting; prevents resource virtualization in shared cluster. [Diagram: S1, S2 -> D -> G -> r -> two output partitions.]

48 Channel abstraction [Diagram: S1, S2, S3 -> G -> ir -> D -> G -> r -> output partitions.]

49 Push vs Pull. Channel types define connected component: shared memory or TCP must be gang scheduled. Pull within gang, push between gangs.

50 Fault tolerance. Buffer data in (some) edges. Re-execute on failure using buffered data. Speculatively re-execute for stragglers. Push model makes this very simple.
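Re-execution from buffered data can be sketched in a few lines. The assumption here is that a vertex is a deterministic function of its input, so running it again on the same buffered input is safe. ReexecuteSketch and RunVertex are hypothetical names, and a real system would re-run the vertex on a different machine rather than in a local loop.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ReexecuteSketch
{
    // Runs a vertex on its buffered input, retrying on failure.
    // The final failed attempt lets the exception propagate to the caller.
    public static TOut RunVertex<TIn, TOut>(
        IReadOnlyList<TIn> bufferedInput,
        Func<IReadOnlyList<TIn>, TOut> vertex,
        int maxAttempts)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return vertex(bufferedInput);
            }
            catch (Exception) when (attempt < maxAttempts)
            {
                // Transient failure: the input is still buffered, so try again.
            }
        }
    }

    public static void Main()
    {
        int calls = 0;
        int total = RunVertex(new[] { 1, 2, 3 }, xs =>
        {
            // Simulated transient failure on the first attempt.
            if (++calls == 1) throw new InvalidOperationException("machine lost");
            return xs.Sum();
        }, 3);
        Console.WriteLine(total + " after " + calls + " attempts");
    }
}
```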

51 DryadLINQ internals. Distributed execution plan: static optimizations (pipelining, eager aggregation, etc.); dynamic optimizations (data-dependent partitioning, dynamic aggregation, etc.). Automatic code generation: vertex code that runs on vertices; channel serialization code; callback code for runtime optimizations; automatically distributed to cluster machines. Separate LINQ query from its local context: distribute referenced objects to cluster machines; distribute application DLLs to cluster machines.

52 Decomposable functions. Roughly, a function H is decomposable if it can be expressed as a composition of two functions IR and C such that: IR is commutative; C is commutative and associative. Some decomposable functions: Sum: IR = Sum, C = Sum; Count: IR = Count, C = Sum; OrderBy.Take: IR = OrderBy.Take, C = SelectMany.OrderBy.Take.
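Two of the listed decompositions can be checked locally. The sketch below reads the definition as: applying IR to each partition and then C to the collected results gives the same answer as applying H to the whole dataset. That reading, and the names DecomposableSketch, CountDecomposed and TopKDecomposed, are assumptions of this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DecomposableSketch
{
    // H = Count over the whole dataset.
    public static int CountWhole(IEnumerable<int> s) => s.Count();

    // IR = Count on each partition, C = Sum of the partial counts.
    public static int CountDecomposed(IEnumerable<IEnumerable<int>> partitions) =>
        partitions.Select(p => p.Count()).Sum();

    // H = OrderBy.Take(k) over the whole dataset.
    public static List<int> TopKWhole(IEnumerable<int> s, int k) =>
        s.OrderBy(x => x).Take(k).ToList();

    // IR = OrderBy.Take(k) on each partition, C = SelectMany.OrderBy.Take(k).
    public static List<int> TopKDecomposed(IEnumerable<IEnumerable<int>> partitions, int k) =>
        partitions.Select(p => p.OrderBy(x => x).Take(k))
                  .SelectMany(p => p)
                  .OrderBy(x => x)
                  .Take(k)
                  .ToList();

    public static void Main()
    {
        var partitions = new[] { new[] { 5, 1, 9 }, new[] { 4, 8 }, new[] { 2 } };
        var whole = partitions.SelectMany(p => p).ToList();
        Console.WriteLine(CountWhole(whole) == CountDecomposed(partitions));
        Console.WriteLine(TopKWhole(whole, 2).SequenceEqual(TopKDecomposed(partitions, 2)));
    }
}
```

The benefit of the decomposed form is that IR shrinks each partition before any data crosses the network: at most k items per partition in the OrderBy.Take case.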

53 Two key questions. How do we decompose a function? Two interfaces: iterator and accumulator; choice of interface can have significant impact on performance. How do we deal with user-defined functions? Try to infer automatically; provide a good annotation mechanism.

54 Iterator Interface in DryadLINQ
    [Decomposable("InitialReduce", "Combine")]
    public static IntPair SumAndCount(IEnumerable<int> g) {
        return new IntPair(g.Sum(), g.Count());
    }

    public static IntPair InitialReduce(IEnumerable<int> g) {
        return new IntPair(g.Sum(), g.Count());
    }

    public static IntPair Combine(IEnumerable<IntPair> g) {
        return new IntPair(g.Select(x => x.first).Sum(),
                           g.Select(x => x.second).Sum());
    }
[Diagram: three upstream M -> G1 -> IR -> D chains feed MG -> G2 -> C vertices, which feed MG -> G3 -> F -> X.]

55 Accumulator Interface in DryadLINQ
    [Decomposable("Initialize", "Iterate", "Merge")]
    public static IntPair SumAndCount(IEnumerable<int> g) {
        return new IntPair(g.Sum(), g.Count());
    }

    public static IntPair Initialize() {
        return new IntPair(0, 0);
    }

    public static IntPair Iterate(IntPair x, int r) {
        x.first += r;
        x.second += 1;
        return x;
    }

    public static IntPair Merge(IntPair x, IntPair o) {
        x.first += o.first;
        x.second += o.second;
        return x;
    }
[Diagram: three upstream M -> G1 -> IR -> D chains feed MG -> G2 -> C vertices, which feed MG -> G3 -> F -> X.]

56 Iterator PartialSort. G1+IR and G2+C: keep only a fixed number of chunks in memory; chunks are processed in parallel: sorted, grouped, reduced by IR or C, and emitted. G3+F: read the entire input into memory, perform a parallel sort, and apply F to each group. Observations: G1+IR can always be pipelined with upstream; G3+F can often be pipelined with downstream; G1+IR may have poor data reduction; PartialSort is the closest to MapReduce.

57 Accumulator FullHash. G1+IR, G2+C, and G3+F: build an in-memory parallel hash table, one accumulator object per key; each input record is accumulated into its accumulator object, and then discarded; output the hash table when all records are processed. Observations: optimal data reduction for G1+IR; memory usage proportional to the number of unique keys, not records, so by default we enable upstream and downstream pipelining; used by DB2 and Oracle.
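The FullHash strategy can be sketched with an ordinary Dictionary holding one accumulator per key. This is a sequential illustration built on the Initialize/Iterate/Merge shape of the accumulator interface; the parallel hash table and the DryadLINQ runtime are not modeled, and FullHashSketch, Accumulate and Combine are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class IntPair
{
    public int First;   // running sum
    public int Second;  // running count
}

public static class FullHashSketch
{
    public static IntPair Initialize() => new IntPair();
    public static IntPair Iterate(IntPair x, int r) { x.First += r; x.Second += 1; return x; }
    public static IntPair Merge(IntPair x, IntPair o) { x.First += o.First; x.Second += o.Second; return x; }

    // G1+IR: fold each record into the accumulator for its key, then drop the record.
    public static Dictionary<TKey, IntPair> Accumulate<TKey>(IEnumerable<KeyValuePair<TKey, int>> records)
    {
        var table = new Dictionary<TKey, IntPair>();
        foreach (var record in records)
        {
            if (!table.TryGetValue(record.Key, out var acc))
                table[record.Key] = acc = Initialize();
            Iterate(acc, record.Value);
        }
        return table;
    }

    // G2+C: merge the per-partition tables, again with one accumulator per key.
    public static Dictionary<TKey, IntPair> Combine<TKey>(IEnumerable<Dictionary<TKey, IntPair>> tables)
    {
        var merged = new Dictionary<TKey, IntPair>();
        foreach (var table in tables)
            foreach (var entry in table)
            {
                if (!merged.TryGetValue(entry.Key, out var acc))
                    merged[entry.Key] = acc = Initialize();
                Merge(acc, entry.Value);
            }
        return merged;
    }

    public static void Main()
    {
        var p1 = new[] { new KeyValuePair<string, int>("a", 3), new KeyValuePair<string, int>("b", 4), new KeyValuePair<string, int>("a", 5) };
        var p2 = new[] { new KeyValuePair<string, int>("a", 2) };
        var result = Combine(new[] { Accumulate(p1), Accumulate(p2) });
        foreach (var entry in result)
            Console.WriteLine(entry.Key + ": sum=" + entry.Value.First + ", count=" + entry.Value.Second);
    }
}
```

Memory use follows the number of distinct keys because each record is folded into an existing accumulator and then discarded.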

59 MapReduce (Hadoop). MapReduce restricts: topology of DAG; semantics of function in compute vertex. Sequence of instances for non-trivial tasks. [Diagram: S1, S2, S3 -> f -> G -> ir -> D -> G -> r -> output partitions.]

60 MapReduce language complexity. Simple to describe MapReduce model. Can be hard to map algorithm to framework: cf. k-means: combine C+P, broadcast C, iterate, ... HIVE, PigLatin etc. mitigate programming issues.

61 MapReduce system complexity. Simple to describe MapReduce system. Implementation not uniform: different fault tolerance for mappers, reducers; add more special cases for performance; Hadoop introducing TCP channels, pipelines, ... Dryad has same state machine everywhere.

62 DryadLINQ demo

63 Conclusions. High-level language is good: for ease of use, maintainability, expressiveness. Computational abstraction is important: suitable target for compiler, not developer; common patterns should be efficient; optimization should be easy. LINQ is a pretty good language abstraction. DAG is a very good computational model.