INFO5011. Cloud Computing Semester 2, 2011 Lecture 11, Cloud Scheduling

1 INFO5011 Cloud Computing, Semester 2, 2011
Lecture 11, Cloud Scheduling

COMMONWEALTH OF AUSTRALIA Copyright Regulations 1969 WARNING
This material has been reproduced and communicated to you by or on behalf of the University of Sydney pursuant to Part VB of the Copyright Act 1968 (the Act). The material in this communication may be subject to copyright under the Act. Any further reproduction or communication of this material by you may be the subject of copyright protection under the Act. Do not remove this notice.

The presentation is based on:
- Quincy: Fair Scheduling for Distributed Computing Clusters. Michael Isard, Vijayan Prabhakaran, Jon Currey, Udi Wieder, Kunal Talwar, and Andrew Goldberg. SOSP '09
- Improving MapReduce Performance in Heterogeneous Environments. Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz, and Ion Stoica. OSDI 2008
Some content and diagrams are from the papers and the authors' presentations.

2 Outline
Motivation
- Default FIFO scheduling in Hadoop and its problems
Quincy scheduling for distributed computing clusters
- Proposed by Microsoft Research
- Built on a Dryad cluster
- Homogeneous assumption
- Focuses on determining which node runs which job
Scheduling in heterogeneous environments
- Proposed by the U.C. Berkeley RAD Lab
- Heterogeneous assumption
- Focuses on optimizing speculative execution

3 Motivation
Big clusters are used for jobs of varying sizes and durations
- Data from a production search cluster used at Microsoft
- Google: 395 s average time for a MapReduce job

4 The Hadoop scheduling
- The master knows beforehand the number of mappers and the number of reducers for each job.
- It also knows the number of available task slots on each worker node.
- It is responsible for assigning tasks to nodes with vacant slots.
- The default and simplest scheduler uses a FIFO queue.
Diagram from the CACM version of the original MapReduce paper

5 The problem with the simple FIFO scheduler
[Figure label: maximum number of tasks]
Problem: subsequent small jobs wait for large jobs to finish!
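
The waiting effect can be illustrated with a rough sketch. It assumes a simplified model that is not Hadoop's actual scheduler: every task takes the same time, the cluster has a fixed number of slots, and a job starts only after the previous job has finished. The function name and the job sizes are made up for illustration.

```python
import math


def fifo_finish_times(task_counts, slots, task_time=1.0):
    """Finish time of each job under strict FIFO (simplified model).

    Assumption: every task takes `task_time`, and each job runs in
    ceil(tasks / slots) waves after all earlier jobs have finished.
    """
    finish_times = []
    clock = 0.0
    for tasks in task_counts:
        waves = math.ceil(tasks / slots)
        clock += waves * task_time
        finish_times.append(clock)
    return finish_times


# A large job (100 tasks) and a small job (2 tasks) on 10 slots.
large_first = fifo_finish_times([100, 2], slots=10)
small_first = fifo_finish_times([2, 100], slots=10)
print("large job queued first:", large_first)
print("small job queued first:", small_first)
```

In this model the small job needs one wave on its own, but when queued behind the large job it finishes only after all of the large job's waves.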

6 Hadoop running a single job
One job: hadoop jar wordcount.jar info5011wordcounter.wordcount assn_input/n07.txt countout
Hadoop job_ _1213 on hadoop = 7 blocks

7 Hadoop map task list for job_ _1213

8 Hadoop running 2 jobs concurrently
hadoop jar wordcount.jar info5011wordcounter.wordcount assn_input/n00p2.txt countoutn00
hadoop jar wordcount.jar info5011wordcounter.wordcount user/zhouy/assn_input/n07.txt countoutn07
Similar job to the first one, except with more reducers

9 Job summaries
[Screenshot annotations: first job finishes; second job starts]

10 Hadoop FIFO queue status
The second, smaller job waits in the queue

11 Hadoop map task list for job_ _1214
[Screenshot annotations: gpu1 gpu1 dm0 gpu0 gpu0 dm0 dm1 dm2 dm2]

12 mapper 0 and mapper 9

14 Fair scheduling
- Job X takes t seconds when it runs exclusively on a cluster.
- X should take no more than Jt seconds when the cluster has J concurrent jobs.
- Formally, for N computers and J jobs, each job should get at least N/J computers.

15 Other scheduling approaches: data locality constraints
HPC jobs
- Fetch data from a SAN, so there is no need to co-locate data and computation.
Data-intensive workloads (MapReduce/Hadoop/Dryad)
- Have storage attached to computers.
- Scheduling tasks near data improves performance.
[Jobtracker log screenshot]

16 Fairness vs. data locality
The requirements of fairness and locality often conflict
- A strategy that achieves optimal data locality will typically delay a job until its ideal resources are available
- Fairness benefits from allocating the best available resources to a job as soon as possible after they are requested
An important feature of data-intensive workloads (MapReduce/Hadoop/Dryad)
- While running, tasks are independent of each other, so killing one task will not impact another
- In contrast, MPI jobs are made of sets of stateful processes tightly coupled by communicating with each other across the network

17 Cluster architecture
Diagram from the authors' original presentation

18 Baseline algorithm (Greedy)
- Equivalent to FIFO scheduling in Hadoop
Simple greedy fairness (GF)
- Based on Hadoop's Fair Scheduler
Queue-based scheduling
- Job j gets a baseline allocation $A^*_j = \min(M/K, N_j)$, where M is the number of computers, K is the number of concurrent jobs, and $N_j$ is the total number of running and waiting tasks for job j.
- If $\sum_j A^*_j < M$, the remaining slots are divided equally among jobs that have additional ready workers, so that the final allocation $A_j$ satisfies $\sum_j A_j = \min(M, \sum_j N_j)$.
- The scheduler blocks job j whenever it is running $A_j$ tasks or more. It only assigns tasks of an unblocked job to a newly available computer.
Fairness with preemption (GFP)
- When a job is running more than $A_j$ tasks, the scheduler kills its over-quota tasks, starting with the most recently scheduled tasks.
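
The quota computation can be sketched as follows. This is an illustration under stated assumptions, not the schedulers' actual code: M/K is rounded down, and leftover slots are handed out one at a time, round-robin, to jobs that still have unserved tasks. The function names are hypothetical.

```python
def greedy_fair_allocation(num_computers, tasks_per_job):
    """Return the final allocation A_j for each job.

    tasks_per_job[j] is N_j, the running plus waiting tasks of job j.
    Assumption: M/K is rounded down and leftover slots are distributed
    round-robin, one slot at a time.
    """
    num_jobs = len(tasks_per_job)
    if num_jobs == 0:
        return []
    base = num_computers // num_jobs
    # Baseline allocation A*_j = min(M/K, N_j).
    alloc = [min(base, n) for n in tasks_per_job]
    remaining = num_computers - sum(alloc)
    # Divide the remaining slots among jobs that still have ready tasks.
    while remaining > 0:
        hungry = [j for j, n in enumerate(tasks_per_job) if alloc[j] < n]
        if not hungry:
            break
        for j in hungry:
            if remaining == 0:
                break
            alloc[j] += 1
            remaining -= 1
    return alloc


def is_blocked(running_tasks, allocation):
    """A job is blocked when it is running A_j tasks or more."""
    return running_tasks >= allocation


alloc = greedy_fair_allocation(12, [2, 10, 20])
print(alloc)
```

With 12 computers and jobs of 2, 10 and 20 tasks, the baseline is 4 slots per job. The small job only needs 2, so its unused share is split between the other two jobs.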

19 Simple Greedy Fairness
Diagram from the authors' original presentation

20 Sticky slots problem
Consider a steady state in which each job occupies exactly its allocated quota of computers.
- Whenever a task from job j completes on computer m, another task from j is assigned to m again
- m sticks to j indefinitely, whether or not j has any waiting tasks that have good data locality when run on m
Simple solution
- Do not unblock job j as soon as a task finishes
- Wait until the number of j's running tasks falls below $A_j - M_H$, where $M_H$ is a hysteresis margin, or until $\Delta_H$ seconds have passed
- In many cases, this delay is sufficient to allow another job's worker with better locality to steal computer m
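
The hysteresis rule can be written as a small predicate. This is a minimal sketch of the rule as stated on the slide. The parameter names are illustrative and are not taken from the Quincy implementation.

```python
def should_unblock(running_tasks, allocation, hysteresis_margin,
                   seconds_blocked, hysteresis_seconds):
    """Decide whether a blocked job may be offered computers again.

    Unblock when the running tasks fall below A_j - M_H, or when
    Delta_H seconds have passed since the job was blocked.
    """
    below_margin = running_tasks < allocation - hysteresis_margin
    timed_out = seconds_blocked >= hysteresis_seconds
    return below_margin or timed_out


# Job with quota A_j = 10, margin M_H = 2, timeout Delta_H = 5 seconds.
just_finished_one = should_unblock(9, 10, 2, 1.0, 5.0)
dropped_below_margin = should_unblock(7, 10, 2, 1.0, 5.0)
waited_long_enough = should_unblock(9, 10, 2, 5.0, 5.0)
print(just_finished_one, dropped_below_margin, waited_long_enough)
```

In the first case the job stays blocked after a single task finishes, which gives other jobs a chance to claim the freed computer.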

21 Sticky slots illustrated (i)
Diagrams and example in these slides are from the authors' original presentation

22 Sticky slots illustrated (ii)

23 Sticky slots illustrated (iii)

24 Sticky slots illustrated (iv)

25 Sticky slots illustrated (v)

26 Sticky slots illustrated (vi)

27 Sticky slots illustrated (vii)

28 Quincy: flow-based scheduler
Main idea
- Matching = scheduling
- Each task is either scheduled on a computer c or unscheduled
- A cost can be assigned to any matching
- Fairness constrains the number of tasks that are scheduled
- The goal is to minimize the matching cost while obeying the fairness constraints
- This is a min-cost network flow problem
- Instead of making local (greedy) decisions, solve it globally

29 Graph construction (i)
Start with a directed graph representation of the cluster architecture.
[Diagram labels: cluster aggregator, rack aggregator, individual computers, sink node]

30 Graph construction (ii)
[Diagram labels: job 1 with 6 tasks and a root task; unscheduled node for job 1]
- Each task receives one unit of flow as its supply.
- Each task has an edge to $U_j$, the unscheduled node of its job j.
- There is a single edge from $U_j$ to the sink.
- Edges from tasks to $U_j$ carry a high cost.

31 Graph construction (iii)
- Add edges from tasks (T) to computers (C) if computer C has some data for task T.
- The cost is a function of the amount of data that would be transferred across the rack and core switches.

32 Graph construction (iv)
- Add edges from tasks (T) to racks (R) if R has some data for task T.
- The cost on the edge is set to the worst-case cost that would result if the task were run on the least favorable computer in R.

33 Graph construction (v)
- Add edges from all tasks (T) to the cluster aggregator (X).
- The cost on the edge is set to the worst-case cost of running the task on any computer in the cluster.

34 Graph construction (vi)
- A zero-cost edge from the root task to its computer avoids preempting the root task.
[Diagram label: constrains how many tasks can run on each computer]
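
The graph construction can be tried out on a toy instance. The sketch below is a generic textbook min-cost flow solver (successive shortest paths), not the solver used by Quincy. The graph is heavily simplified: one job, three tasks, two computers, no rack aggregators, and a source node that stands in for the per-task supply. All node names, costs, and capacities are invented for illustration.

```python
from collections import deque


class MinCostFlow:
    """Successive shortest paths; a textbook solver, not Quincy's."""

    def __init__(self):
        self.graph = {}   # node -> indices into self.edges
        self.edges = []   # [target, residual capacity, cost]

    def add_edge(self, u, v, capacity, cost):
        index = len(self.edges)
        self.graph.setdefault(u, []).append(index)
        self.edges.append([v, capacity, cost])
        # The reverse edge sits at index ^ 1 and lets flow be undone.
        self.graph.setdefault(v, []).append(index + 1)
        self.edges.append([u, 0, -cost])
        return index

    def flow_on(self, index):
        # Residual capacity of the reverse edge equals the flow pushed.
        return self.edges[index ^ 1][1]

    def solve(self, source, sink):
        total_flow = total_cost = 0
        while True:
            # Bellman-Ford with a queue finds the cheapest residual path.
            dist, parent = {source: 0}, {}
            queue, queued = deque([source]), {source}
            while queue:
                u = queue.popleft()
                queued.discard(u)
                for index in self.graph[u]:
                    v, cap, cost = self.edges[index]
                    if cap > 0 and dist[u] + cost < dist.get(v, float("inf")):
                        dist[v], parent[v] = dist[u] + cost, index
                        if v not in queued:
                            queue.append(v)
                            queued.add(v)
            if sink not in dist:
                return total_flow, total_cost
            # Walk back from the sink, then push the bottleneck amount.
            path, v = [], sink
            while v != source:
                path.append(parent[v])
                v = self.edges[parent[v] ^ 1][0]
            push = min(self.edges[i][1] for i in path)
            for i in path:
                self.edges[i][1] -= push
                self.edges[i ^ 1][1] += push
            total_flow += push
            total_cost += push * dist[sink]


net = MinCostFlow()
# Hypothetical data placement: task -> (computer holding its data, cost).
preferred = {"t1": ("c1", 1), "t2": ("c1", 1), "t3": ("c2", 2)}
unscheduled_edges = {}
for task, (computer, cost) in preferred.items():
    net.add_edge("source", task, 1, 0)      # one unit of flow per task
    net.add_edge(task, computer, 1, cost)   # computer with local data
    net.add_edge(task, "X", 1, 5)           # cluster aggregator, worst case
    unscheduled_edges[task] = net.add_edge(task, "U", 1, 20)  # high cost
for computer in ("c1", "c2"):
    net.add_edge("X", computer, 1, 0)
    net.add_edge(computer, "sink", 1, 0)    # one task per computer
net.add_edge("U", "sink", 1, 0)             # at most one unscheduled task

flow, cost = net.solve("source", "sink")
unscheduled = [t for t, e in unscheduled_edges.items() if net.flow_on(e) > 0]
print(flow, cost, unscheduled)
```

Two tasks compete for computer c1, so the cheapest solution runs one of them there, runs t3 next to its data on c2, and leaves the other task unscheduled. A greedy scheduler makes such choices one computer at a time, while the flow formulation weighs them together.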

35 A feasible matching
[Diagram label: unscheduled job]

36 Final graph
- Fairness constraint: setting it to 4 means at least 2 tasks from job 1 need to go through computers
- Fairness constraint: setting it to 2 means at least 2 tasks from job 2 need to go through computers

37 Some experiment results
Workload:
- Typical Dryad jobs (Sort, Join, PageRank, WordCount, Prime)
- In total, 30 jobs with a mix of CPU-, disk-, and network-intensive tasks
- Prime is used as a worst-case job that hogs the cluster if started first
- 240 computers in the cluster. 8 racks, computers per rack
- More than one metric is used for evaluation

38 Results (i)

39 Results (ii)

40 Results (iii)

41 Results (iv)

42 Solver overhead
Discussion point
- The observed average overhead in this 240-machine cluster is 7.64 ms, with a maximum of 57.59 ms
- The simulated average overhead for 2500 computers running 100 concurrent jobs is a little over a second per solution
- This seems acceptable, but the min-cost flow is recomputed from scratch each time a change occurs
Applicable in other scheduling environments?
- The easy mapping of the scheduling problem to a min-cost flow is due to
  - Tasks being relatively independent of each other, with no correlation constraints
  - A one-dimensional capacity setting

44 Speculative execution
Hadoop's straggler handling mechanism
- A node that is available but performing poorly is called a straggler
- MapReduce has a built-in mechanism to run a speculative copy of the straggler's task on another machine to finish the computation faster
This paper tries to improve the performance of speculative execution by
- Defining a new scheduling metric
- Choosing the right machines to run speculative tasks
- Capping the amount of speculative execution

45 Progress score
Hadoop monitors task progress using a progress score to select speculative tasks
- A map task's progress score is the fraction of input data read
- A reduce task's execution is divided into three phases (copy, sort, reduce), each of which accounts for 1/3 of the score. In each phase, the score is the fraction of data processed
When a task's progress score is less than the average for its category minus 0.2, and the task has run for at least one minute, it is marked as a straggler
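
The scoring and the straggler rule can be sketched in a few lines. This is an illustration of the rule as described on the slide, not Hadoop's source code. The function names and the sample task data are made up.

```python
def map_progress(bytes_read, input_bytes):
    """Progress score of a map task: fraction of input data read."""
    return bytes_read / input_bytes


def reduce_progress(phase_index, fraction_done):
    """Progress score of a reduce task.

    phase_index is 0 for copy, 1 for sort, 2 for reduce. Each phase
    accounts for 1/3 of the score.
    """
    return (phase_index + fraction_done) / 3.0


def find_stragglers(tasks, threshold=0.2, min_runtime=60.0):
    """tasks: list of (task_id, progress_score, runtime_seconds),
    all from the same category (all maps or all reduces)."""
    if not tasks:
        return []
    average = sum(score for _, score, _ in tasks) / len(tasks)
    return [task_id for task_id, score, runtime in tasks
            if score < average - threshold and runtime >= min_runtime]


maps = [("m0", 0.9, 120.0), ("m1", 0.8, 120.0),
        ("m2", 0.1, 120.0), ("m3", 0.2, 30.0)]
stragglers = find_stragglers(maps)
print(stragglers)
print(reduce_progress(1, 0.5))  # halfway through the sort phase
```

Here the average score is 0.5, so the cutoff is 0.3. Task m2 is below it and old enough to be marked, while m3 is below it but has run for less than a minute.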

46 Hadoop's assumptions
1. Nodes can perform work at exactly the same rate
2. Tasks progress at a constant rate throughout time
3. There is no cost to launching a speculative task on an idle node
4. The three phases of execution take approximately the same time
5. Tasks with a low progress score are stragglers
6. Maps and reduces require roughly the same amount of work

47 Breaking down the assumptions
The first 2 assumptions concern homogeneity. However
- In a non-virtualized data center, there may be multiple generations of hardware
- In a virtualized data center, multiple VMs are co-located on the same physical host
Diagrams from Ang Li, Xiaowei Yang, Srikanth Kandula, and Ming Zhang. CloudCmp: Comparing Public Cloud Providers. In Proceedings of the 10th Annual Conference on Internet Measurement (IMC '10)

48 Heterogeneity is also observed by other researchers
Jorg Schad, Jens Dittrich, and Jorge-Arnulfo Quiane-Ruiz. Runtime Measurements in the Cloud: Observing, Analyzing, and Reducing Variance. In Proceedings of the 36th International Conference on Very Large Data Bases (VLDB '10), September 13-17, 2010, Singapore. Page

49 Other assumptions
Assumption 3, that speculative tasks cost nothing, breaks down when resources are shared
- The network is a bottleneck, and speculative tasks may compete for disk I/O
Assumption 4, that a task's progress score is approximately equal to its percent completion, does not hold, especially for reduce tasks
- The copy phase usually accounts for more than 1/3 of the task execution time
Assumption 5, that progress score is a good proxy for progress rate because tasks begin at roughly the same time, can also be wrong
- The number of mappers depends on the number of blocks, which might be much larger than the number of available slots. The mappers tend to run in waves (see slide 11).

50 Longest Approximate Time to End
Design principle of the LATE scheduler
- Always speculatively execute the task that we think will finish farthest into the future
Different methods can be used to estimate the time left
The paper proposes a simple heuristic based on progress rate
- ProgressRate = ProgressScore / T, where T is the amount of time the task has been running
- Estimated time to completion = (1 - ProgressScore) / ProgressRate
- It assumes that tasks make progress at a roughly constant rate (there are exceptions to this assumption)
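
The heuristic translates directly into code. This is a minimal sketch of the two formulas above. The function names and the sample tasks are illustrative, and running times are assumed to be greater than zero.

```python
def progress_rate(progress_score, seconds_running):
    """ProgressRate = ProgressScore / T (assumes T > 0)."""
    return progress_score / seconds_running


def estimated_time_left(progress_score, seconds_running):
    """Estimated time to completion = (1 - ProgressScore) / ProgressRate."""
    rate = progress_rate(progress_score, seconds_running)
    if rate == 0:
        return float("inf")  # no progress yet, so no finite estimate
    return (1.0 - progress_score) / rate


def pick_task_to_speculate(tasks):
    """tasks: dict of task_id -> (progress_score, seconds_running).

    Returns the task expected to finish farthest into the future.
    """
    return max(tasks, key=lambda t: estimated_time_left(*tasks[t]))


running = {"a": (0.5, 100.0), "b": (0.2, 20.0), "c": (0.9, 300.0)}
chosen = pick_task_to_speculate(running)
print(chosen, estimated_time_left(*running[chosen]))
```

Task b has the lowest progress score, but it started recently and is progressing quickly. Task a has roughly 100 seconds left against about 80 for b, so a is the one selected.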

51 LATE parameters
- SlowNodeThreshold: used to select fast nodes on which to launch speculative tasks
  - 25th percentile of node progress
- SpeculativeCap: used to control the number of speculative tasks that can be running at once
  - 10% of available task slots
- SlowTaskThreshold: used to select tasks for speculative copies
  - 25th percentile of task progress
- Currently it does not consider data locality
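
The way the three parameters combine can be sketched as a single decision function that runs when a node asks for work. This is an illustration under assumptions, not the paper's code: percentiles use a nearest-rank rule, a node or task at or below the threshold value counts as slow, and all names and sample numbers are hypothetical.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile; a simple choice made for this sketch."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def choose_speculative_task(node_progress, requesting_node, tasks,
                            running_speculative, total_slots,
                            speculative_cap=0.10, slow_node_pct=25,
                            slow_task_pct=25):
    """Return the task to copy onto requesting_node, or None.

    node_progress: dict of node -> total progress made by that node.
    tasks: dict of task_id -> (progress_rate, estimated_time_left)
    for running tasks that have no speculative copy yet.
    """
    # SpeculativeCap: limit the speculative tasks running at once.
    if running_speculative >= speculative_cap * total_slots:
        return None
    # SlowNodeThreshold: do not launch copies on slow nodes.
    slow_node = percentile(list(node_progress.values()), slow_node_pct)
    if node_progress[requesting_node] <= slow_node:
        return None
    # SlowTaskThreshold: only consider tasks that progress slowly.
    if not tasks:
        return None
    slow_rate = percentile([rate for rate, _ in tasks.values()],
                           slow_task_pct)
    candidates = [t for t, (rate, _) in tasks.items() if rate <= slow_rate]
    # Among the slow tasks, pick the longest estimated time to end.
    return max(candidates, key=lambda t: tasks[t][1])


nodes = {"n1": 10.0, "n2": 40.0, "n3": 50.0, "n4": 60.0}
running = {"t1": (0.001, 500.0), "t2": (0.01, 50.0),
           "t3": (0.02, 20.0), "t4": (0.03, 10.0)}
on_fast_node = choose_speculative_task(nodes, "n2", running, 0, 20)
on_slow_node = choose_speculative_task(nodes, "n1", running, 0, 20)
cap_reached = choose_speculative_task(nodes, "n2", running, 2, 20)
print(on_fast_node, on_slow_node, cap_reached)
```

The request from the slowest node is ignored, and so is any request once 10% of the 20 slots already run speculative copies.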

52 Estimating finishing time
The current ProgressRate computation assumes constant progress, which might not be true
- If a task's execution slows down in a later phase, the ProgressRate might suggest the wrong task to speculate
- If it speeds up, this won't affect the final prediction
Mapper tasks progress at a constant rate most of the time
Reducer tasks are typically slowest in their first phase and speed up in later phases

53 Evaluation
Environment
- Amazon EC2 ( nodes)
- Small local testbed (9 nodes)
Measuring heterogeneity on EC2

54 Heterogeneity setup
Scheduling experiments
- Assign a varying number of VMs to each physical node
- Create stragglers by running CPU- and I/O-intensive processes on the same VM

55 EC2 Sort with Heterogeneity
- Each host sorted 128 MB, with a total of 30 GB of data
- Each job has 486 map tasks and 437 reduce tasks

56 EC2 Sort with Stragglers
- Each node sorted 256 MB, with a total of 25 GB of data
- Stragglers were created with 4 CPU-intensive (800 KB array sort) and 4 disk-intensive (dd tasks) processes

57 Sensitivity Analysis

58 Conclusion
Advantages
- Considers heterogeneity that appears in real-life systems
- LATE speculatively executes, on fast nodes, the tasks that hurt the response time the most
- LATE caps speculative tasks to avoid overloading resources
Limitations
- Does not consider data locality
- The finishing-time estimate may be wrong when a task slows down