Scheduling Support for Heterogeneous Hardware Accelerators under Linux


Tobias Wiersema, University of Paderborn
Paderborn, December 2010
Short title: Linux scheduler extension for accelerators

Outline
- Introduction: Motivation, Basics
- Concept, Implementation
- Time sharing, Migration, Efficiency

Accelerator-based heterogeneous systems
- Accelerators: general-purpose GPUs, FPGAs, ClearSpeed boards, ...
- Accelerated systems are increasingly important
  - Off-the-shelf components
  - Hybrid processor/accelerator designs (Cell BE, Virtex II Pro, Excalibur)
  - Supercomputing (TSUBAME, Tianhe-1A)
  - New programming models (data-parallel execution)
  - Cost-efficient and power-efficient designs
- Problem: development of accelerated software

[Figure: Current situation. Layered diagram spanning user space, kernel space, and hardware, with the components accelerated applications, runtime library, driver, scheduler, accelerator, and CPUs.]

Operating system integration
- Integrate into OS kernel
- Hardware abstraction
- Master-slave relationship
- Communication
- Synchronization
- Scheduling: global decisions

Time sharing
- Historic predecessor: multiprogramming
  - Enhance processor utilization
- Time sharing
  - Take fair turns on processor
  - Shared usage of sought-after resources
    - Tasks
    - Users
- Today mostly relies on preemption

Completely Fair Scheduler
- Current Linux scheduler: CFS
- Previous: O(n), O(1)
- Fairness
- O(log n): red-black tree
- Virtual time
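The CFS idea on this slide (fairness through virtual time, with an O(log n) ordered structure) can be illustrated with a small Python sketch. This is a toy model, not kernel code: a binary heap stands in for the red-black tree, and the names `Task`, `RunQueue`, and `run` are invented for illustration. The only kernel fact assumed is that a default-priority task has weight 1024 and that virtual runtime advances by the real runtime scaled by 1024 / weight.

```python
import heapq
from dataclasses import dataclass, field

NICE_0_WEIGHT = 1024  # weight of a default-priority task


@dataclass(order=True)
class Task:
    vruntime: float                                   # virtual runtime in ns (sort key)
    name: str = field(compare=False)
    weight: int = field(default=NICE_0_WEIGHT, compare=False)


class RunQueue:
    """Toy run queue; a binary heap stands in for the red-black tree."""

    def __init__(self):
        self._heap = []

    def enqueue(self, task):
        heapq.heappush(self._heap, task)              # O(log n) insert

    def pick_next(self):
        return heapq.heappop(self._heap)              # task with the smallest vruntime

    def run(self, slice_ns):
        """Run the task with the smallest vruntime for slice_ns, then requeue it."""
        task = self.pick_next()
        # Virtual time advances more slowly for heavier tasks,
        # so they are picked more often.
        task.vruntime += slice_ns * NICE_0_WEIGHT / task.weight
        self.enqueue(task)
        return task.name


rq = RunQueue()
rq.enqueue(Task(0.0, "A"))
rq.enqueue(Task(0.0, "B", weight=2048))               # twice the default weight
order = [rq.run(1_000_000) for _ in range(6)]
print(order)
```

In this sketch, task B has twice the weight of task A and therefore receives twice as many turns over the six slices.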

[Figure: Scheduling possibilities. Layered diagram spanning user space, kernel space, and hardware, with the components accelerated applications, runtime library, driver, scheduler, accelerator, and CPUs.]

Possible approaches
- Interrupt path: Application → accelerator
  1. Application → runtime library
  2. Runtime library → driver
Pro and contra
- Pro: transparent scheduling; unchanged applications
- Contra: accelerator specific; no time sharing; no task migration

[Figure: Approach. Layered diagram spanning user space, kernel space, and hardware, with the components accelerated applications, runtime library, driver, scheduler extension, scheduler, accelerator, and CPUs.]

Kernel extension and programming model with
- Cooperative multitasking
- Checkpointing
Pro and contra
- Contra: no transparent scheduling; no unchanged applications
- Pro: not accelerator specific; time sharing without preemption; task migration
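How cooperative multitasking and checkpointing combine to give time sharing without preemption and task migration can be sketched in Python as follows. Everything here is a hypothetical illustration, not the interface of the kernel extension: `ToyScheduler`, `allocate`, `rerequest`, `Checkpoint`, and `run_task` are invented names, and the scheduler's decisions are simply scripted. The task works in chunks, saves a checkpoint after each chunk, and then asks whether it may keep its computing unit; if the request is denied, it asks for a new unit and resumes from the checkpoint.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    next_item: int      # first work item that has not been processed yet
    partial_sum: int    # intermediate result, independent of the computing unit


class ToyScheduler:
    """Hypothetical stand-in for the scheduler's decisions (scripted here)."""

    def __init__(self, assignments, grants):
        self._assignments = iter(assignments)   # units handed out on allocate()
        self._grants = iter(grants)             # answers to re-requests

    def allocate(self):
        return next(self._assignments)

    def rerequest(self):
        return next(self._grants, True)         # True: the task may keep its unit


def run_task(items, scheduler, chunk):
    cp = Checkpoint(next_item=0, partial_sum=0)
    trace = []
    unit = scheduler.allocate()
    while cp.next_item < len(items):
        end = min(cp.next_item + chunk, len(items))
        # Process one chunk on the current unit, then checkpoint the progress.
        cp = Checkpoint(end, cp.partial_sum + sum(items[cp.next_item:end]))
        trace.append((unit, cp.next_item))
        # Cooperative yield point: the task itself asks to continue.
        if cp.next_item < len(items) and not scheduler.rerequest():
            unit = scheduler.allocate()         # denied: give up the unit, resume elsewhere
    return cp.partial_sum, trace


sched = ToyScheduler(assignments=["gpu", "cpu"], grants=[True, False])
total, trace = run_task(list(range(10)), sched, chunk=3)
print(total, trace)
```

In this run the task starts on the unit labeled "gpu", is denied at its second re-request, and finishes on "cpu" from its last checkpoint with the same result as an uninterrupted run.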

[Figure: Scheduling granularity. Two timelines, a) and b), of the time a task spends on the accelerator, with the phases allocation, copy, execution, re-request, and free, and the granularity marked on each.]

Challenges / Design decisions
- Schedulable entity: thread
- Ghost threads
- Static affinity model
- Problem: affinity inversion

Integration into Linux kernel
- Once
  - struct computing_units
- One per computing unit
  - struct computing_unit_info: struct cfs_rq rq, struct hardware_properties hp
  - struct hardware_properties
  - struct cfs_rq: list tasks, rb_tree tasks_timeline
- One per task
  - struct task_struct: struct sched_entity se, struct sched_entity hwse
  - struct sched_entity: struct meta_info mi

[Figure: Time sharing of an accelerator. 10 tasks on one GPU; y-axis: searched strings in billions (0 to 20); x-axis: seconds (0 to 350).]

[Figure: Heterogeneous task migration. 15 tasks on either CPU or GPU; y-axis: searched strings in billions (0 to 20); x-axis: seconds (0 to 900).]

[Figure: Task switching overhead. 75 tasks on one GPU; execution time and average turnaround time in seconds (0 to 300) for the granularity settings 0 sec, 1 sec, 2 sec, 4 sec, and FCFS.]
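The two quantities on this slide, execution time and average turnaround time, can be explored with a toy round-robin model. The sketch below is an assumption-laden illustration and does not reproduce the measured values: the function `simulate`, the fixed `switch_cost`, and the workload numbers are all invented. Turnaround time is taken here as the time from the common start until a task finishes.

```python
from collections import deque


def simulate(work, granularity, switch_cost):
    """Share one computing unit among tasks that all start at time 0.

    work: per-task execution times in seconds (invented numbers).
    granularity: seconds a task may run before it yields; None means FCFS.
    switch_cost: assumed fixed cost in seconds for changing the running task.
    Returns (total execution time, average turnaround time).
    """
    remaining = list(work)
    finish = [0.0] * len(work)
    queue = deque(range(len(work)))
    clock = 0.0
    last = None
    while queue:
        i = queue.popleft()
        if last is not None and last != i:
            clock += switch_cost                  # pay for the task switch
        run = remaining[i] if granularity is None else min(granularity, remaining[i])
        clock += run
        remaining[i] -= run
        last = i
        if remaining[i] > 0:
            queue.append(i)                       # not finished: back to the end
        else:
            finish[i] = clock                     # turnaround time of task i
    return clock, sum(finish) / len(finish)


work = [4.0, 1.0, 1.0]
fcfs_total, fcfs_avg = simulate(work, granularity=None, switch_cost=0.1)
rr_total, rr_avg = simulate(work, granularity=1.0, switch_cost=0.1)
print(fcfs_total, fcfs_avg)
print(rr_total, rr_avg)
```

For this invented workload, the one-second granularity lowers the average turnaround time because the short tasks no longer wait behind the long one, while the extra task switch makes the total execution time slightly longer than under FCFS.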

[Figure: Load balancer. Shares in percent (0% to 100%) of load balancing, MD5 computation, and PF computation; x-axis: concurrent tasks (100, 1000, 5000, 10000, shown in two groups).]

Conclusions of the kernel extension
- Kernel-controlled time sharing
  - Accelerator sharing
  - Fairness
- Heterogeneous task migration
  - Speedup
  - Load balancing
- Exploit heterogeneity
- Enhance accelerator utilization

Future work
- Enhance load balancer
- Combine with enhanced runtime libraries
- Automate programming model
- Enhance ghost threads
  - Communication
  - Synchronization

Thank you for your attention! Do you have questions?

Completely Fair Scheduler
- Current Linux scheduler: CFS
- Previous: O(n), O(1)
- Fairness
- O(log n): red-black tree
- Virtual clock
- No time slices

[Figure: Notation of time sharing task timings. Two schedules, a) and b), of the task segments A1, A2, A3, and B1, annotated with latency, execution time, and turnaround time.]

[Figure: Task migration with true time sharing. Panels a) and b); y-axis: searched strings in billions (0 to 20); x-axis: seconds (0 to 1000).]

[Figure: Task migration with QL5 and no fixed set of tasks. Panel a); y-axis: searched strings in billions (0 to 20); x-axis: seconds (0 to 900).]

[Figure: Runqueue of an accelerator. Diagram labels: running, waiting, runqueue with ghost threads, execution units, accelerator.]

[Figure: Affinity inversion example. Assignment of tasks A, B, and C to Unit 1 and Unit 2 over Steps 1 to 4.]

Overview of data structures
- Once
  - struct computing_units: list list_of_cus[type], current_id, access_mutex
- One per computing unit
  - struct computing_unit_info: id, type, struct cfs_rq rq, struct hardware_properties hp
  - struct hardware_properties: concurrent_kernels, bandwidth
  - struct cfs_rq: load, min_vruntime, list tasks, rb_tree tasks_timeline, count, maxcount, access_mutex
- One per task
  - struct task_struct: struct sched_entity se, struct sched_entity hwse
  - struct sched_entity: load, semaphore_up, need_migrate, task_granularity_nsec, current_affinity, offerer, offered_affinity, struct meta_info mi

[Figure: Task states if using an accelerator cooperatively. States: non-accelerated computation (no computing unit assigned), accelerated computation (computing unit assignment valid), and computing unit assignment invalid. Transitions: allocate, successful re-request, finished computation or denied re-request, free.]
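One possible reading of this state diagram is sketched below as a small transition table in Python. The state and transition labels come from the slide, but which transition connects which pair of states is an assumption, since the flattened text does not preserve the arrows; `State`, `TRANSITIONS`, `step`, and `replay` are illustrative names only.

```python
from enum import Enum


class State(Enum):
    NO_UNIT = "no computing unit assigned"       # non-accelerated computation
    VALID = "computing unit assignment valid"    # accelerated computation
    INVALID = "computing unit assignment invalid"


# Assumed edges; the slide text lists the labels but not the arrows.
TRANSITIONS = {
    (State.NO_UNIT, "allocate"): State.VALID,
    (State.VALID, "successful re-request"): State.VALID,
    (State.VALID, "denied re-request"): State.INVALID,
    (State.VALID, "finished computation"): State.INVALID,
    (State.INVALID, "free"): State.NO_UNIT,
}


def step(state, event):
    """Return the next state, or raise ValueError if the event is not allowed."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} not allowed in state {state.name}") from None


def replay(events, state=State.NO_UNIT):
    """Apply a sequence of events, starting without a computing unit."""
    for event in events:
        state = step(state, event)
    return state


final = replay(["allocate", "successful re-request", "denied re-request", "free"])
print(final.value)
```

Under these assumed edges, a task that is denied a re-request passes through the invalid state, frees the unit, and ends up without a computing unit again, from where it can allocate a new one.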

[Figure: Turnaround times with infinite granularity. 75 tasks (25 MD5, 50 PF), started at the same time; y-axis: seconds (0 to 250); x-axis: tasks (1 to 75).]

[Figure: Turnaround times with four second granularity. 75 tasks (25 MD5, 50 PF), started at the same time; y-axis: seconds (0 to 250); x-axis: tasks (1 to 75).]