Programming and Scheduling Model for Supporting Heterogeneous Architectures in Linux

Transcription

1 Programming and Scheduling Model for Supporting Heterogeneous Architectures in Linux. Third Workshop on Computer Architecture and Operating System co-design, Paris. Tobias Beisel, Tobias Wiersema, Christian Plessl, and André Brinkmann. Paderborn Center for Parallel Computing (PC²), University of Paderborn.

2 Motivation: How are heterogeneous systems currently used? The application's main thread calls its tasks one after another: ... CPU_task_A(); GPU_task_B(); GPU_task_C(); FPGA_task_D(); CPU_task_E(); ... [Figure: layers Application, Operating System, and Hardware; the GPU tasks, the hardware task, and the CPU task map to GPU1, GPU2, the FPGA, and the CPU.]

3 Motivation: How are heterogeneous systems currently used? The application's main thread runs its tasks in parallel: ... parallel do { CPU_task_A(); GPU_task_B(); GPU_task_C(); FPGA_task_D(); CPU_task_E(); } ... [Figure: layers Application, Operating System, and Hardware; the GPU tasks, the hardware task, and the CPU task map to GPU1, GPU2, the FPGA, and the CPU.]

4 Motivation: How are heterogeneous systems currently used? Application A, main thread: ... parallel do { CPU_task_A(); GPU_task_B(); GPU_task_C(); FPGA_task_D(); CPU_task_E(); } ... Application B, main thread: ... parallel do { CPU_task_F(); GPU_task_G(); GPU_task_H(); FPGA_task_I(); CPU_task_J(); } ... [Figure: the tasks of both applications (GPU tasks, hardware task, CPU tasks) compete for GPU1, GPU2, the FPGA, and the CPU below the operating system.]

5 Motivation: Resource management needed. Application A, main thread: ... parallel do { CPUorGPU_task_A(); GPU_task_B(); GPU_task_C(); FPGAorGPU_task_D(); CPU_task_E(); } ... Application B, main thread: ... parallel do { CPU_task_F(); CPUorGPU_task_G(); GPUorFPGA_task_H(); FPGA_task_I(); CPU_task_J(); } ... [Figure: the tasks of both applications (GPU tasks, hardware task, CPU tasks) compete for GPU1, GPU2, the FPGA, and the CPU below the operating system.]

6 Motivation: Concurrent use of accelerators may be beneficial. [Figure: three schedules (a), (b), and (c) of tasks A, B, C, and D on the GPU and the CPU, plotted over time steps.] A, B, C: tasks that can run only on the GPU. D: task executable on the CPU, or on the GPU with a 5x speedup. All tasks have the same priority; scheduling overheads are neglected.

7 Challenges. Common properties of accelerators: no preemption, interrupts, or migration support; full internal state not accessible; explicit non-uniform communication of data needed; dedicated APIs, ISAs, and binaries. Handling several accelerators requires custom solutions: the decision space is much broader than in traditional scheduling; dynamic run-time decisions still need to be very fast; acquisition and choice of decision parameters is important. Resource management in heterogeneous systems is difficult.

8 Agenda: Basic Concepts and Ideas; Scheduling and Programming Model; Task Management; Experimental Evaluation; Related Work; Conclusions and Future Work.

9 General Approach. Scheduler objectives: no architectures excluded by design; usable for future operating systems (OS); easy to use by applications. Approach: one-node heterogeneous systems; scheduler component in kernel space; OS adaptation instead of OS rewrite. Task management by cooperative multitasking, with delegate threads as scheduling entities. [Figure: the application's delegate thread is submitted to the OS scheduler to be scheduled, is allocated hardware, copies data and starts the hardware-specific thread on the accelerator, and receives the results.]

10 Cooperative Multitasking. Tasks voluntarily offer to release the hardware at checkpoints. Checkpoints allow migration between different ISAs. This overcomes the lack of preemption and migration and allows time-sharing. Goal: combine cooperative multitasking with checkpointing and time-sharing as a heterogeneous extension to the Completely Fair Scheduler (CFS). [Figure: conventional task state diagram with states Running, Ready, and Blocked and transitions preempt, run, wait, and signal.]

11 Cooperative Multitasking (continued). With cooperative multitasking, the preempt transition of the state diagram becomes a voluntary release. [Figure: task state diagram with states Running, Ready, and Blocked and transitions release, run, wait, and signal.]
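The state diagram on this slide can be written down as a small transition function. This is only a sketch: the names TaskState, Event, and next_state are hypothetical and do not come from the slides or the kernel extension. The only difference to a preemptive scheduler is that the Running to Ready transition is triggered by the task itself (release) rather than by the scheduler (preempt).

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical names; the slides only show the diagram.
enum class TaskState { Ready, Running, Blocked };
enum class Event { Run, Release, Wait, Signal };

// Returns the successor state for the transitions drawn in the diagram
// and throws for any transition the diagram does not contain.
inline TaskState next_state(TaskState state, Event event) {
    switch (state) {
    case TaskState::Ready:
        if (event == Event::Run) return TaskState::Running;
        break;
    case TaskState::Running:
        // Release is voluntary: the task offers it at a checkpoint.
        if (event == Event::Release) return TaskState::Ready;
        if (event == Event::Wait) return TaskState::Blocked;
        break;
    case TaskState::Blocked:
        if (event == Event::Signal) return TaskState::Ready;
        break;
    }
    throw std::logic_error("transition not present in the state diagram");
}
```

A task that is Blocked cannot be run directly; it first has to be signalled and return to Ready, which the function enforces by throwing.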

13 Scheduling Model. [Figure: in the application, the main thread spawns delegate threads. Each delegate thread submits its thread information to the scheduler component, where the scheduling policy enqueues it in the queue of a hardware unit (GPU1, GPU2, FPGA, CPU). Once dequeued, a delegate thread executes its GPU thread or hardware thread on the accelerator through the driver. Delegate threads on the CPU are dequeued by the CF scheduler.]

14 Scheduler Extension. The scheduler API provides three system calls. cu_allocate(): blocking call that enqueues a task in the most affine hardware queue. cu_rerequest(): request at a checkpoint to keep using the allocated hardware; the scheduler decides based on the scheduling policy. cu_free(): frees the hardware resources and calls the load balancer if the hardware runs idle. A control API allows managing the accelerators in the system. CFS data structures were adapted for heterogeneous use.

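15 Programming Model. The application spawns threads and uses system calls to acquire, keep, and release an accelerator. The application provides: thread meta information; checkpoint(s); and function pointers for each supported hardware type for accelerator initialization (app_init), computation to the next checkpoint (app_main), and a free function that releases the accelerator and writes a checkpoint (app_free). [Figure: delegate thread workflow. Request a resource with cu_allocate(meta_info); copy data and code with app_init(); start the computation with app_main(). When a checkpoint is reached and the task is done, copy back the results with app_free(), free the resources with cu_free(), and finish with app_exit(). If the task is not done, ask for reuse with cu_rerequest(): if granted, continue with app_main(); otherwise write the checkpoint with app_free(), free the resources with cu_free(), and request a resource again.]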
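The workflow on this slide can be sketched in user space as follows. The real cu_allocate(), cu_rerequest(), and cu_free() are system calls of the modified kernel and their signatures are not shown in the slides, so SchedulerStub, MetaInfo, WorkerFunctions, and run_delegate are hypothetical stand-ins that only mimic the control flow.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for the meta information passed to cu_allocate().
struct MetaInfo { int memory_to_copy_mb; };

// Hypothetical user-space stub that records calls instead of entering the kernel.
struct SchedulerStub {
    int rerequests_to_grant;            // how often cu_rerequest() says "keep the device"
    std::vector<std::string> trace;     // order of the simulated system calls

    void cu_allocate(const MetaInfo&) { trace.push_back("cu_allocate"); }
    bool cu_rerequest() {
        trace.push_back("cu_rerequest");
        if (rerequests_to_grant > 0) { --rerequests_to_grant; return true; }
        return false;
    }
    void cu_free() { trace.push_back("cu_free"); }
};

// The three functions an application provides per hardware type.
struct WorkerFunctions {
    std::function<void()> app_init;   // copy data and code to the device
    std::function<bool()> app_main;   // compute to the next checkpoint; true when done
    std::function<void()> app_free;   // copy back results or write a checkpoint
};

// Runs one task until it reports completion, following the flowchart.
inline void run_delegate(SchedulerStub& sched, const MetaInfo& mi,
                         const WorkerFunctions& fn) {
    bool done = false;
    while (!done) {
        sched.cu_allocate(mi);             // blocks in the real system
        fn.app_init();
        bool keep_device = true;
        while (keep_device) {
            done = fn.app_main();          // returns at a checkpoint
            if (done) break;
            keep_device = sched.cu_rerequest();
        }
        fn.app_free();
        sched.cu_free();
    }
}
```

If the scheduler denies a re-request, the task writes its checkpoint, frees the device, and queues again, possibly for a different device type; that is the point at which migration can happen.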

16 Delegate Thread Implementation. Example for meta information and the implementation multiplexer. [Figure 2. Typical workflow of a delegate thread; the visible final steps are: reduce results with pthread_join(), free resources and copy results back with cu_free(), delete the worker with shutdown().]

    // Report the amount of data to copy and the affinity of this worker
    // towards each compute unit type.
    void Worker_example::workerMetaInfo(struct meta_info *mi) {
        mi->memory_to_copy = 0;  // in MB
        mi->type_affinity[cu_type_cuda] = 2;
        mi->type_affinity[cu_type_cpu] = 1;
    }

    // Select the init/main/free functions for the requested compute unit type.
    void* Worker_example::getImplementationFor(int type, functions *af) {
        switch (type) {
        case CU_TYPE_CPU:
            af->init = &Worker_example::cpu_init;
            af->main = &Worker_example::cpu_main;
            af->free = &Worker_example::cpu_free;
            af->initialized = true;
            break;
        case CU_TYPE_CUDA: ...  // similar
        default:
            af->initialized = false;
        }
    }

Listing 1. Example implementation for mandatory worker functions.

17 Checkpointing. A checkpoint is a preferably small set of data structures that unambiguously defines the state of execution, is stored back to the host's main memory when a task is preempted, and is readable and translatable to all supported ISAs. Properties: currently user defined; size is application dependent and known at definition time; size influences the scheduling granularity.

    #include <string>

    // State from which the MD5 search can be resumed.
    typedef struct md5_resources {
        std::string hash_to_search;
        unsigned long long currentwordnumber;
        bool found_solution;
    } md5_resources_t;

Listing 2. Example checkpoint for MD5 cracking.
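To illustrate what "unambiguously defines the state of execution" means in practice, here is a minimal sketch of a worker that computes from one checkpoint to the next. It is not the MD5 cracker from the slides: SearchCheckpoint, toy_hash, and run_to_next_checkpoint are hypothetical, and toy_hash is a trivial stand-in for MD5 so that the example needs no external library.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical checkpoint modelled on Listing 2; a numeric target replaces
// the hash string to keep the sketch dependency-free.
struct SearchCheckpoint {
    std::uint64_t target;               // stand-in for hash_to_search
    std::uint64_t current_word_number;  // next candidate to test
    bool found_solution;
};

// Stand-in for hashing a candidate word; a real worker would compute MD5 here.
inline std::uint64_t toy_hash(std::uint64_t word) {
    return (word * 2654435761ULL) % 1000003ULL;
}

// Tests at most words_per_slice candidates starting at the checkpoint and
// then returns, so the device can be released or re-requested.
// Returns true once the solution has been found.
inline bool run_to_next_checkpoint(SearchCheckpoint& cp,
                                   std::uint64_t words_per_slice) {
    for (std::uint64_t i = 0; i < words_per_slice && !cp.found_solution; ++i) {
        if (toy_hash(cp.current_word_number) == cp.target) {
            cp.found_solution = true;
        } else {
            ++cp.current_word_number;
        }
    }
    return cp.found_solution;
}
```

Because all progress lives in the checkpoint, a search split into many short slices ends at the same word as one uninterrupted run, and each slice could in principle run on a different device. The slice length corresponds to the checkpoint distance that the slides relate to scheduling granularity.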

19 Thread Scheduling Process and Parameters. Initial enqueueing: based on the static affinity of a task towards an accelerator; load balancing additionally considers load (current queue length); queue size limitation for accelerators. Internal queue scheduling. Fairness: based on virtual runtime and defined consistently with the CFS; the fairness term is the time a task needs to run to be treated fairly. Priorities: inherited from the delegate thread; the same priority adjustments of virtual runtime as in CFS are used on accelerator queues. Actual execution time slot: granularity = 2 * t_copy_checkpoint + fairness + base_granularity. The granularity depends on the checkpoint distance.
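The time slot formula can be evaluated directly. The sketch below assumes all three terms are durations in the same unit; SlotParameters and slot_granularity are hypothetical names, and reading the factor 2 as one checkpoint copy to the device plus one copy back is an assumption, since the slide does not spell it out.

```cpp
#include <cassert>
#include <chrono>

using Micros = std::chrono::microseconds;

// Hypothetical container for the three terms of the slide's formula.
struct SlotParameters {
    Micros copy_checkpoint;   // t_copy_checkpoint: time for one checkpoint copy
    Micros fairness;          // time the task needs to run to be treated fairly
    Micros base_granularity;  // tunable base time slice
};

// granularity = 2 * t_copy_checkpoint + fairness + base_granularity
inline Micros slot_granularity(const SlotParameters& p) {
    return 2 * p.copy_checkpoint + p.fairness + p.base_granularity;
}
```

For example, with a 500 µs checkpoint copy, a 1000 µs fairness term, and a 4000 µs base granularity, the slot is 2 * 500 + 1000 + 4000 = 6000 µs. Larger checkpoints therefore lengthen every slot, which is why checkpoint size influences the scheduling granularity.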

20 Load Balancer. Executed when any device runs idle. Two-step concept for task migration: (1) search all run queues for the most affine waiting tasks; (2) if step 1 fails, check the running tasks and leave a migration offer (flag). Inter-CPU balancing is done by the original CFS. [Figure: GPU queue with running and queued tasks and their affinities. First pass (migrate): Task 0 GPU: 5 CPU: 1; Task 1 GPU: 2 CPU: 1; Task 4 GPU: 6 CPU: 1; Task 2 GPU: 10 CPU: 1; Task 5 GPU: 5 CPU: 1. Second pass (offer migration): Task 0 GPU: 5 CPU: 1; Task 1 GPU: 10 CPU: 0; Task 4 GPU: 6 CPU: 0; Task 2 GPU: 10 CPU: 0; Task 5 GPU: 2 CPU: 0.]
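The two-step concept can be sketched as two passes over the tasks held by the other devices. This is an illustration under stated assumptions, not the kernel code: Task, BalanceResult, most_affine, and balance_idle_device are hypothetical names, an affinity of 0 is taken to mean "cannot run on this device", and "most affine" is read as the highest affinity value for the idle device.

```cpp
#include <cassert>
#include <vector>

enum Device { DEV_CPU = 0, DEV_GPU = 1, DEV_COUNT = 2 };

// Hypothetical task record; the affinity values mirror those in the figure.
struct Task {
    int id;
    int affinity[DEV_COUNT];  // assumption: 0 means "cannot run on this device"
    bool running;             // false: waiting in a run queue
    bool migration_offered;   // flag a running task checks at its next checkpoint
};

struct BalanceResult {
    Task* migrate_now;  // waiting task to move to the idle device, or nullptr
    Task* offered;      // running task that received an offer, or nullptr
};

// Highest non-zero affinity for `idle` among tasks whose running flag matches.
inline Task* most_affine(std::vector<Task>& tasks, Device idle, bool running) {
    Task* best = nullptr;
    for (Task& t : tasks) {
        if (t.running != running || t.affinity[idle] <= 0) continue;
        if (best == nullptr || t.affinity[idle] > best->affinity[idle]) best = &t;
    }
    return best;
}

// Called when `idle` has run out of work.
inline BalanceResult balance_idle_device(std::vector<Task>& tasks, Device idle) {
    // Pass 1: a waiting task can be migrated immediately.
    if (Task* waiting = most_affine(tasks, idle, false)) {
        return BalanceResult{waiting, nullptr};
    }
    // Pass 2: running tasks cannot be preempted, so only leave an offer.
    Task* candidate = most_affine(tasks, idle, true);
    if (candidate != nullptr) candidate->migration_offered = true;
    return BalanceResult{nullptr, candidate};
}
```

The second pass only sets a flag because accelerators cannot be preempted; the running task sees the offer at its next checkpoint and can then release the device voluntarily.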

22 Experimental Setup. Test applications: MD5 cracking; prime factorization; several instances of both applications running. Hardware setup: 2 * Intel Quad Core XEON E MHz, hyperthreading enabled; 12 GB DDR; NVIDIA Tesla 2075. Software setup: Ubuntu Linux LTS, modified kernel; NVIDIA CUDA 4.0; GCC.

23 Impact of Time-sharing on GPU. Comparing different base_granularities with FCFS. [Figure: average turnaround time running 25 MD5 and 50 prime factorization instances on the GPU.] [Figure: mean of total runtime for 30 runs with 25 MD5 and 50 prime factorization threads on the GPU.] Average turnaround times decrease with shorter base_granularities (40%); long-running tasks do not block short-running tasks. Makespan overheads increase with shorter base_granularities (13%); the overheads can be hidden by heterogeneous use of the available hardware.

24 Heterogeneity Hides Overheads. Makespan reduction of up to 22%. [Figure: average runtimes of different counts of GPU-affine prime factorization instances using a GPU and a combination of both GPU and CPUs.] The effect will increase significantly with more applications providing comparable affinities to several architectures, with different hardware having the highest affinity, and when using more architectures.

26 Related Work. StarPU (Augonnet et al., 2009): user-space runtime system for heterogeneous systems including data management, scheduling policies, and profiling models; no preemption and migration support; extensive programming model. ReconOS (Lübbers et al., 2010): OS extension to manage partially reconfigurable system resources; aims at dynamically reconfigurable hardware. Jimenez et al. (2009): predictive runtime code scheduling for heterogeneous architectures; prediction based on performance history and estimated waiting times. Harmony (Diamos and Yalamanchili, 2008): execution model and runtime for heterogeneous many-core systems; programming model with meta-data and runtime kernel division.

28 Conclusions and Future Directions. Cooperative multitasking with time-sharing is possible: it reduces average turnaround times and increases interactivity and overall performance. A CFS extension was implemented for conceptual validation. The programming model adds checkpoints and meta information to applications. Next steps: compare to a user-space scheduler; extend scheduling policies and improve load balancing; port further and more extensive example applications; approach automatic checkpoint identification. Code is available as open source.

29 Thank you for your attention! Tobias Beisel. This work was carried out in the context of the ENHANCE project.