Scheduling Support for Heterogeneous Hardware Accelerators under Linux
Tobias Wiersema, University of Paderborn
Paderborn, December 2010
Tobias Wiersema, Linux scheduler extension for accelerators
Introduction: Motivation, Basics
Concept, Implementation
Time sharing, Migration, Efficiency
Accelerator-based heterogeneous systems
  Accelerators: general-purpose GPUs, FPGAs, ClearSpeed boards, ...
  Accelerated systems increasingly important
  Off-the-shelf components
  Hybrid processor/accelerator designs (Cell BE, Virtex II Pro, Excalibur)
  Supercomputing (TSUBAME, Tianhe-1A)
  New programming models (data-parallel execution)
  Cost-efficient and power-efficient designs
  Problem: Development of accelerated software
[Figure: Current situation. Layered diagram with the levels user space, kernel space, and hardware; components shown: accelerated applications, runtime library, driver, scheduler, accelerator, CPUs.]
Operating system integration
  Integrate into OS kernel
    Hardware abstraction
    Master-slave relationship
    Communication
    Synchronization
  Scheduling: Global decisions
Time sharing
  Historic predecessor: Multiprogramming
    Enhance processor utilization
  Time sharing
    Take fair turns on processor
    Shared usage of sought-after resources
      Tasks
      Users
  Today mostly relies on preemption
Completely Fair Scheduler
  Current Linux scheduler: CFS
  Previous: O(n), O(1)
  Fairness
  O(log n): Red-black tree
  Virtual time
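The "virtual time" idea above can be illustrated with a minimal user-space sketch. It is not kernel code: the names `demo_entity`, `demo_pick_next`, `demo_account`, and `DEMO_BASE_WEIGHT` are hypothetical, and a linear scan stands in for the red-black tree that gives CFS its O(log n) behaviour. The sketch only shows the selection rule (run the entity with the smallest virtual runtime) and the weighted accounting.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference weight: an entity with this weight is charged
 * virtual time at the same rate as real time. */
#define DEMO_BASE_WEIGHT 1024u

/* Hypothetical, simplified stand-in for a schedulable entity. */
struct demo_entity {
    uint64_t vruntime; /* virtual runtime consumed so far, in ns */
    unsigned weight;   /* larger weight: virtual time advances more slowly */
};

/* Return the index of the entity with the smallest vruntime, or -1 if the
 * queue is empty. Linear scan for brevity; CFS keeps entities in a
 * red-black tree ordered by vruntime instead. */
int demo_pick_next(const struct demo_entity *rq, size_t n)
{
    if (n == 0)
        return -1;
    size_t best = 0;
    for (size_t i = 1; i < n; i++)
        if (rq[i].vruntime < rq[best].vruntime)
            best = i;
    return (int)best;
}

/* Charge delta_ns of real run time, scaled inversely by the weight. */
void demo_account(struct demo_entity *e, uint64_t delta_ns)
{
    assert(e->weight > 0);
    e->vruntime += delta_ns * DEMO_BASE_WEIGHT / e->weight;
}
```

Because the entity that ran most recently moves to the right in virtual time, repeatedly picking the minimum yields fair turns without fixed time slices.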
Concept, Implementation
[Figure: Scheduling possibilities. Layered diagram with the levels user space, kernel space, and hardware; components shown: accelerated applications, runtime library, driver, scheduler, accelerator, CPUs.]
Possible approaches
  Interrupt path: Application -> accelerator
    1. Application -> runtime library
    2. Runtime library -> driver
  Pro and contra
    + Transparent scheduling
    + Unchanged applications
    - Accelerator specific
    - No time sharing
    - No task migration
[Figure: Approach. Layered diagram with the levels user space, kernel space, and hardware; components shown: accelerated applications, runtime library, driver, scheduler extension, scheduler, accelerator, CPUs.]
Kernel extension and programming model with
  Cooperative multitasking
  Checkpointing
Pro and contra
  - No transparent scheduling
  - No unchanged applications
  + Not accelerator specific
  + Time sharing without preemption
  + Task migration
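A minimal sketch of what cooperative multitasking with checkpointing could look like from the application side. Everything here is an assumption for illustration: the slides do not show the actual interface, so `demo_checkpoint`, `demo_rerequest_fn`, `demo_grant_n_times`, and `demo_run_until_denied` are hypothetical names. The task works in chunks, records its progress after each chunk, and then asks whether it may keep its computing unit; if the answer is no, it stops at a point from which it can resume on any unit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical checkpoint: all state needed to resume the task elsewhere. */
struct demo_checkpoint {
    size_t next_item;   /* first work item not yet processed */
    size_t total_items; /* size of the whole job */
};

/* Hypothetical re-request hook: true means the task may keep its unit. */
typedef bool (*demo_rerequest_fn)(void *ctx);

/* Demonstration policy only: grant *ctx more re-requests, then deny. */
bool demo_grant_n_times(void *ctx)
{
    int *remaining = ctx;
    if (*remaining <= 0)
        return false;
    (*remaining)--;
    return true;
}

/* Process the job chunk by chunk. Returns true when the job is finished,
 * false when a re-request was denied and the task has to give up its unit.
 * In both cases *cp describes exactly how far the job got. */
bool demo_run_until_denied(struct demo_checkpoint *cp, size_t chunk,
                           demo_rerequest_fn rerequest, void *ctx)
{
    assert(chunk > 0);
    while (cp->next_item < cp->total_items) {
        size_t left = cp->total_items - cp->next_item;
        size_t n = left < chunk ? left : chunk;
        /* A real task would run n items on its assigned unit here. */
        cp->next_item += n; /* checkpoint: progress is now durable */
        if (cp->next_item < cp->total_items && !rerequest(ctx))
            return false;   /* cooperative yield at a safe point */
    }
    return true;
}
```

Since the task only stops between chunks, no preemption of the accelerator is needed, and the checkpoint is what makes a later migration to a different unit possible.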
[Figure: Scheduling granularity. Two timelines, a) and b), of the time a task spends on the accelerator, each marking the granularity interval and the phases allocation, copy, execution, re-request, and free; timeline a) contains two re-requests, timeline b) one.]
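One possible reading of the granularity figure, sketched as a decision function. This is an assumption, not the documented policy: it treats the granularity as the time a task may hold the accelerator before a re-request can be denied in favour of waiting tasks, with an unbounded granularity behaving like first-come-first-served. The names `demo_rerequest_granted`, `demo_seconds_to_ns`, and `DEMO_GRANULARITY_FCFS` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical marker for "never deny", i.e. first-come-first-served. */
#define DEMO_GRANULARITY_FCFS UINT64_MAX

/* Convert a granularity setting given in seconds to nanoseconds. */
uint64_t demo_seconds_to_ns(uint64_t sec)
{
    assert(sec <= UINT64_MAX / 1000000000ULL); /* guard against overflow */
    return sec * 1000000000ULL;
}

/* Decide whether a task that has held its unit for held_ns may keep it.
 * Assumed rule: always grant if nobody is waiting; otherwise grant only
 * while the task is still inside its granularity interval. */
bool demo_rerequest_granted(uint64_t held_ns, uint64_t granularity_ns,
                            unsigned waiting_tasks)
{
    if (waiting_tasks == 0)
        return true;
    return held_ns < granularity_ns;
}
```

Under this reading, a small granularity means frequent task switches and a large one means fewer switches but longer waits for queued tasks.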
Challenges / Design decisions
  Schedulable entity: Thread
  Ghost threads
  Static affinity model
  Problem: Affinity inversion
Integration into Linux kernel
  Once
    struct computing_units
  One per computing unit
    struct computing_unit_info
      struct cfs_rq rq
      struct hardware_properties hp
  One per task
    struct task_struct
      struct sched_entity se
      struct sched_entity hwse
  Referenced structures
    struct hardware_properties
    struct sched_entity
      struct meta_info mi
    struct cfs_rq
      list tasks
      rb_tree tasks_timeline
      struct meta_info mi
Time sharing, Migration, Efficiency
[Figure: Time sharing of an accelerator. 10 tasks on one GPU; x-axis: seconds (0 to 350); y-axis: searched strings in billions (0 to 20).]
[Figure: Heterogeneous task migration. 15 tasks on either CPU or GPU; x-axis: seconds (0 to 900); y-axis: searched strings in billions (0 to 20).]
[Figure: Task switching overhead. 75 tasks on one GPU; x-axis: granularity setting (0 sec, 1 sec, 2 sec, 4 sec, fcfs); y-axis: seconds (0 to 300); series: execution time, average turnaround time.]
[Figure: Load balancer. x-axis: concurrent tasks (100, 1000, 5000, 10000, shown for two groups); y-axis: share in percent (0% to 100%); series: load balancing, MD5 computation, PF computation.]
Conclusions of the kernel extension
  Kernel-controlled time sharing
    Accelerator sharing
    Fairness
  Heterogeneous task migration
    Speedup
    Load balancing
  Exploit heterogeneity
  Enhance accelerator utilization
Future work
  Enhance load balancer
  Combine with enhanced runtime libraries
  Automate programming model
  Enhance ghost threads
    Communication
    Synchronization
Thank you for your attention! Do you have questions?
Completely Fair Scheduler
  Current Linux scheduler: CFS
  Previous: O(n), O(1)
  Fairness
  O(log n): Red-black tree
  Virtual clock
  No time slices
[Figure: Notation of time sharing task timings. Two schedules, a) and b), of the tasks A1, A2, A3, and B1, annotated with latency, execution time, and turnaround time.]
[Figure: Task migration with true time sharing. Panels a) and b); x-axis: seconds (0 to 1000); y-axis: searched strings in billions (0 to 20).]
[Figure: Task migration with QL5 and no fixed set of tasks. Panel a); x-axis: seconds (0 to 900); y-axis: searched strings in billions (0 to 20).]
[Figure: Runqueue of an accelerator. Shows running and waiting tasks, a runqueue with ghost threads, and the execution units of the accelerator.]
[Figure: Affinity inversion example. Four steps (Step 1 to Step 4) showing tasks A, B, and C on two units; row contents as extracted: Unit 1: A A A C C; Unit 2: B B B.]
Overview of data structures
  Once
    struct computing_units
      list list_of_cus[type]
      current_id
      access_mutex
  One per computing unit
    struct computing_unit_info
      id
      type
      struct cfs_rq rq
      struct hardware_properties hp
  One per task
    struct task_struct
      struct sched_entity se
      struct sched_entity hwse
  Referenced structures (fields listed in the order they appear on the slide)
    struct hardware_properties
      concurrent_kernels
      bandwidth
    struct cfs_rq
      load
      min_vruntime
      list tasks
      rb_tree tasks_timeline
      count
      maxcount
      access_mutex
    struct sched_entity
      load
      semaphore_up
      need_migrate
      task_granularity_nsec
      current_affinity
      offerer
      offered_affinity
      struct meta_info mi
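The top of this hierarchy can be sketched in plain C. This is a user-space illustration under several assumptions, not the kernel patch: the slide gives field names but no types, so all types here are guesses; `demo_cu_type`, `demo_cfs_rq`, `demo_add_cu`, and `demo_count_cus` are hypothetical; the `next` pointer stands in for kernel list linkage; `current_id` is assumed to be the next identifier to hand out; and `access_mutex` is left out because the sketch is single-threaded.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical unit types; the slides mention CPUs, GPUs, and FPGAs. */
enum demo_cu_type { DEMO_CU_CPU, DEMO_CU_GPU, DEMO_CU_FPGA, DEMO_CU_TYPES };

struct hardware_properties {
    unsigned concurrent_kernels;
    unsigned long bandwidth;        /* unit not stated on the slide */
};

/* Heavily reduced stand-in for the per-unit run queue. */
struct demo_cfs_rq {
    unsigned long load;
    unsigned long long min_vruntime;
};

/* One per computing unit. */
struct computing_unit_info {
    int id;
    enum demo_cu_type type;
    struct demo_cfs_rq rq;
    struct hardware_properties hp;
    struct computing_unit_info *next; /* stand-in for list linkage */
};

/* Exists once: one list of units per type. */
struct computing_units {
    struct computing_unit_info *list_of_cus[DEMO_CU_TYPES];
    int current_id;                 /* assumed: next id to assign */
    /* access_mutex omitted in this single-threaded sketch */
};

/* Register a unit under the given type and return its new id. */
int demo_add_cu(struct computing_units *cus, struct computing_unit_info *cu,
                enum demo_cu_type type)
{
    assert(cus != NULL && cu != NULL && type < DEMO_CU_TYPES);
    cu->id = cus->current_id++;
    cu->type = type;
    cu->next = cus->list_of_cus[type];
    cus->list_of_cus[type] = cu;
    return cu->id;
}

/* Count the registered units of one type. */
size_t demo_count_cus(const struct computing_units *cus, enum demo_cu_type type)
{
    size_t n = 0;
    for (const struct computing_unit_info *cu = cus->list_of_cus[type];
         cu != NULL; cu = cu->next)
        n++;
    return n;
}
```

Giving each computing unit its own run queue is what lets the scheduler treat accelerators like additional places to run tasks, while the per-task `hwse` entity (not modelled here) would be the element queued on them.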
[Figure: Task states if using an accelerator cooperatively. Regions: non-accelerated computation, accelerated computation. States: no computing unit assigned, computing unit assignment valid, computing unit assignment invalid. Transition labels: allocate, successful re-request, finished computation or denied re-request, free.]
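The state diagram can be approximated by a small transition function. The connectivity is an assumption reconstructed from the labels on the slide (allocate leads to a valid assignment, a successful re-request keeps it valid, a denied re-request or finished computation invalidates it, and free returns to the unassigned state); the names `demo_state`, `demo_event`, and `demo_step` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum demo_state {
    DEMO_NO_UNIT,            /* no computing unit assigned */
    DEMO_ASSIGNMENT_VALID,   /* computing unit assignment valid */
    DEMO_ASSIGNMENT_INVALID  /* computing unit assignment invalid */
};

enum demo_event {
    DEMO_EV_ALLOCATE,
    DEMO_EV_REREQUEST_OK,
    DEMO_EV_REREQUEST_DENIED,
    DEMO_EV_FINISHED,
    DEMO_EV_FREE
};

/* Apply one event. Returns true and updates *state if the transition is
 * part of the assumed diagram; returns false and leaves *state unchanged
 * otherwise. */
bool demo_step(enum demo_state *state, enum demo_event ev)
{
    assert(state != NULL);
    switch (*state) {
    case DEMO_NO_UNIT:
        if (ev == DEMO_EV_ALLOCATE) {
            *state = DEMO_ASSIGNMENT_VALID;
            return true;
        }
        break;
    case DEMO_ASSIGNMENT_VALID:
        if (ev == DEMO_EV_REREQUEST_OK)
            return true; /* task keeps its unit */
        if (ev == DEMO_EV_REREQUEST_DENIED || ev == DEMO_EV_FINISHED) {
            *state = DEMO_ASSIGNMENT_INVALID;
            return true;
        }
        break;
    case DEMO_ASSIGNMENT_INVALID:
        if (ev == DEMO_EV_FREE) {
            *state = DEMO_NO_UNIT;
            return true;
        }
        break;
    }
    return false;
}
```

A task whose re-request was denied would pass through free and allocate again, possibly ending up on a different computing unit.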
[Figure: Turnaround times with infinite granularity. 75 tasks (25 MD5, 50 PF), started at the same time; x-axis: tasks (1 to 75); y-axis: seconds (0 to 250).]
[Figure: Turnaround times with four-second granularity. 75 tasks (25 MD5, 50 PF), started at the same time; x-axis: tasks (1 to 75); y-axis: seconds (0 to 250).]