Programming and Scheduling Model for Supporting Heterogeneous Architectures in Linux

Size: px
Start display at page:

Download "Programming and Scheduling Model for Supporting Heterogeneous Architectures in Linux"

Transcription

1 Programming and Scheduling Model for Supporting Heterogeneous Architectures in Linux Third Workshop on Computer Architecture and Operating System co-design Paris, Tobias Beisel, Tobias Wiersema, Christian Plessl, and André Brinkmann Paderborn Center for Parallel Computing (PC²), University of Paderborn 2 ), - 4 * ) 4 ) /

2 Motivation How are heterogeneous systems currently used? Main thread Application... CPU_task_A(); GPU_task_B(); GPU_task_C(); FPGA_task_D(); CPU_task_E();... Operating System GPU task GPU task Hardware task CPU task GPU1 GPU2 FPGA Hardware CPU 2

3 Motivation How are heterogeneous systems currently used? Main thread Application... parallel do { CPU_task_A(); GPU_task_B(); GPU_task_C(); FPGA_task_D(); CPU_task_E(); }... Operating System GPU task GPU task Hardware task CPU task GPU1 GPU2 FPGA Hardware CPU 3

4 Motivation How are heterogeneous systems currently used? Application A Main thread... parallel do { CPU_task_A(); GPU_task_B(); GPU_task_C(); FPGA_task_D(); CPU_task_E(); }... Application B Main thread... parallel do { CPU_task_F(); GPU_task_G(); GPU_task_H(); FPGA_task_I(); CPU_task_J(); }... Operating System GPU task GPU task Hardware task CPU task CPU task GPU1 GPU2 FPGA Hardware CPU 4

5 Resource Management needed Motivation Application A Main thread... parallel do { CPUorGPU_task_A(); GPU_task_B(); GPU_task_C(); FPGAorGPU_task_D(); CPU_task_E(); }... Application B Main thread... parallel do { CPU_task_F(); CPUorGPU_task_G(); GPUorFPGA_task_H(); FPGA_task_I(); CPU_task_J(); }... Operating System GPU task GPU task Hardware task CPU task CPU task GPU1 GPU2 FPGA Hardware CPU 5

6 Motivation Concurrent use of accelerators may be beneficial A,B,C D GPU A B C (a) CPU D (b) GPU CPU A B C D (c) GPU CPU A B C A D B C D timestep A, B, C: Tasks only capable to run on the GPU, D: Task executable on CPU or on GPU with 5x Speedup All tasks have the same priority, scheduling overheads are neglected 6

7 Challenges Common properties of accelerators No preemption, interrupts, or migration support Full internal state not accessible Explicit non-uniform communication of data needed Use dedicated APIs, ISAs and binaries Handling several accelerators requires custom solutions Decision space is much broader than in traditional scheduling Dynamic run-time decisions still need to be very fast Decision parameter acquisition and choice is important Resource management in heterogeneous systems is difficult 7

8 Agenda Basic Concepts and Ideas Scheduling and Programming Model Task Management Experimental Evaluation Related Work Conclusions and Future Work 8

9 General Approach Scheduler objectives No architectures excluded by design Useable for future operating systems (OS) Easy to use by applications Approach One-node heterogeneous systems Scheduler component in kernel space OS adaptation instead of OS rewrite OS Scheduler Task Management by cooperative multitasking Delegate threads as scheduling entities submitted to be scheduled allocated hardware copy data & start Application Delegate Thread Hardware Specific Thread results Accelerator 9

10 Cooperative Multitasking Cooperative multitasking Tasks offer voluntarily release of hardware at checkpoints Checkpoints allow migration between different ISAs Overcomes the lack of preemption and migration Allows to use time-sharing Goal: Combine cooperative multitasking with checkpointing and time-sharing as a heterogeneous extension to the Completely Fair Scheduler (CFS) Running preempt run Ready wait signal Blocked 10

11 Cooperative Multitasking Cooperative multitasking Tasks offer voluntarily release of hardware at checkpoints Checkpoints allow migration between different ISAs Overcomes the lack of preemption and migration Allows to use time-sharing Goal: Combine cooperative multitasking with checkpointing and time-sharing as a heterogeneous extension to the Completely Fair Scheduler (CFS) Running release run Ready wait signal Blocked 11

12 Agenda Basic Concepts and Ideas Scheduling and Programming Model Task Management Experimental Evaluation Related Work Conclusions and Future Work 12

13 Scheduling Model Application Scheduler component Main thread spawns Driver Delegate thread Delegate thread Delegate thread executes submits Thread information submitted to Scheduling policy Accelerator Architectures dequeue enqueue thread GPU1 GPU2 FPGA CPU Delegate thread Delegate thread Hardware unit queue GPU thread GPU thread Hardware thread Delegate thread dequeue CF scheduler 13

14 Scheduler Extension Scheduler API provides three system calls cu_allocate() Blocking call to enqueue tasks to most affine hardware queue cu_re-request() Request to further use the allocated hardware at checkpoint Scheduler decides based on scheduling policy cu_free() Free hardware resources Call load balancer if hardware runs idle Control API allows to manage accelerators in the system CFS data structures were adapted for heterogeneous use 14

15 Programming Model Application spawns threads Application uses system calls to acquire, keep and release an accelerator Application provides Thread meta information Checkpoint(s) Function pointers for each supported hardware for: Accelerator initialization (app_init) Computation to the next checkpoint (app_main) Free function to release accelerator and write a checkpoint (app_free) Checkpoint reached Yes Request Resource cu_allocate(meta_info) Copy Data & Code app_init() Start Computation app_main() Done? No Yes Reuse? cu_rerequest() No Copy Back Results/Checkpoint app_free() Free Resources cu_free() app_exit() 15

16 ich are Delegate Thread Implementation Back cu_free() Reduce Results pthread_join() Free Resources & Copy Results s using to and e there ciently ves all remove its and ing the ionally, lication ory for plifies Example for Delete Worker shutdown() Meta information Figure 2. Typical workflow of a delegate thread. Implementation multiplexer void Worker_example::workerMetaInfo(struct meta_info *mi){ mi->memory_to_copy=0; // in MB mi->type_affinity[cu_type_cuda]=2; mi->type_affinity[cu_type_cpu]=1; } void* Worker_example::getImplementationFor(int type, functions *af) switch(type) { case CU_TYPE_CPU: af->init=&worker_example::cpu_init; af->main=&worker_example::cpu_main; af->free=&worker_example::cpu_free; af->initialized=true; break; case CU_TYPE_CUDA:... //similar default: af->initialized=false; } Listing 1. Example implementation for mandatory worker functions. 16

17 Checkpointing A checkpoint is a preferably small set of data structures that Unambiguously defines the state of execution Is stored back to the host s main memory when a task is preempted Is readable and translatable to all supported ISAs Properties Currently user defined Size is application dependent and known at definition time Size influences the scheduling granularity typedef struct md5_resources { std::string hash_to_search; unsigned long long currentwordnumber; bool found_solution; } md5_resources_t; Listing 2. Example checkpoint for MD5 cracking. 17 T as be alp dif hig

18 Agenda Basic Concepts and Ideas Scheduling and Programming Model Task Management Experimental Evaluation Related Work Conclusions and Future Work 18

19 Thread Scheduling Process and Parameters Initial enqueueing Static affinity of a task towards an accelerator Load balancing additionally considers load (current queue length) Queue size limitation for accelerators Internal queue scheduling Fairness Based on virtual runtime and defined consistently with the CFS!fairness: time a tasks needs to run to be treated fair Priorities Are inherited from the delegate thread Using the same priority adjustments of virtual runtime as CFS on accelerator queues Actual execution time slot granularity = 2 * t copy_checkpoint +!fairness + base _ granularity granularity depends on the checkpoint distance 19

20 Load Balancer Executed when any device runs idle 2 step concept for task migration 1. Search all run-queues for most affine waiting tasks 2. If 1. fails: check running tasks and leave a migration offer (flag) migrate First pass GPU Task 0 GPU: 5 CPU: 1 Task 1 GPU: 2 CPU: 1 Task 4 GPU: 6 CPU: 1 Task 2 GPU:10 CPU: 1 Task 5 GPU: 5 CPU: 1 offer migration Second pass GPU Task 0 GPU: 5 CPU: 1 Task 1 GPU:10 CPU: 0 Task 4 GPU: 6 CPU: 0 Task 2 GPU:10 CPU: 0 Task 5 GPU: 2 CPU: 0 running queued Inter CPU balancing by original CFS 20

21 Agenda Basic Concepts and Ideas Scheduling and Programming Model Task Management Experimental Evaluation Related Work Conclusions and Future Work 21

22 Experimental Setup Test applications MD5 cracking Prime factorization Running several instances of both applications Hardware setup 2 * Intel Quad Core XEON E MHz, hyperthreading enabled 12 GB DDR NVIDIA Tesla 2075 Software setup Ubuntu Linux LTS, modified kernel NVIDIA Cuda 4.0 GCC

23 Impact of Time-sharing on GPU!"#$%&%'() Comparing different base_granularities with FCFS!"#$%%$& %)) 3.1 "#$,-."!+)!*)!')!%)!)) +) *) ') %) +$, )!"#$ %"#$ &"#$ '"#$ ($(" Avg. turnaround time running 25 MD5 and 50 prime factorization instances on the GPU &01 &+$, -+$,.+$, /,/+ Mean of total runtime for 30 runs with 25 MD5 and 50 prime factorization threads on the GPU Average turnaround times decrease with shorter base_granularities (40%) Long running tasks do not block short running tasks Makespan overheads increase with shorter base_granularities (13%) Overheads can be hidden by heterogeneous use of available hardware 23

24 Makespan reduction of up to 22% 765)(;$*(5*:$3459:,-+,++ &-+ & Heterogeneity Hides Overheads!"#$%%$& &+,+ -+ & $5)*)87$"9: Average runtimes of different counts of GPU affine prime factorization instances using a GPU and a combination of both GPU and CPUs./0*1*2/0 2/0 Effect will increase significantly with more applications providing Comparable affinities to several architectures Different hardware with highest affinity Using more architectures 24

25 Agenda Basic Concepts and Ideas Scheduling and Programming Model Task Management Experimental Evaluation Related Work Conclusions and Future Work 25

26 Related Work StarPU (Augonnet et al., 2009) User Space runtime system for heterogeneous systems including data management, scheduling policies and profiling models No preemption and migration support, extensive programming model ReconOS (Lübbers et al., 2010) OS extension to manage partial reconfigurable system resources Aims on dynamically reconfigurable hardware Jimenez et al. (2009) Predictive runtime code scheduling for heterogeneous architectures Prediction based on performance history, estimated waiting times Harmony (Diamos and Yalamanchili, 2008) Execution model and runtime for heterogeneous many core systems Programming model with meta-data, runtime kernel division 26

27 Agenda Basic Concepts and Ideas Scheduling and Programming Model Task Management Experimental Evaluation Related Work Conclusions and Future Work 27

28 Conclusions and Future Directions Cooperative multitasking with time-sharing is possible Reduces average turnaround times Increases interactivity and overall performance CFS extension implemented for conceptual validation Programming model Adds checkpoints and meta information to applications Next steps Compare to user space scheduler Extend scheduling policies, improve load-balancing Port further and more extensive example applications Approach automatic checkpoint identification Code available as open source 28

29 Thank you for your attention! Tobias Beisel This is work in context of the ENHANCE project. 29

Scheduling Support for Heterogeneous Hardware Accelerators under Linux

Scheduling Support for Heterogeneous Hardware Accelerators under Linux Scheduling Support for Heterogeneous Hardware Accelerators under Linux Tobias Wiersema University of Paderborn Paderborn, December 2010 1 / 24 Tobias Wiersema Linux scheduler extension for accelerators

More information

Multiprocessor Scheduling and Scheduling in Linux Kernel 2.6

Multiprocessor Scheduling and Scheduling in Linux Kernel 2.6 Multiprocessor Scheduling and Scheduling in Linux Kernel 2.6 Winter Term 2008 / 2009 Jun.-Prof. Dr. André Brinkmann Andre.Brinkmann@uni-paderborn.de Universität Paderborn PC² Agenda Multiprocessor and

More information

Multi-core Programming System Overview

Multi-core Programming System Overview Multi-core Programming System Overview Based on slides from Intel Software College and Multi-Core Programming increasing performance through software multi-threading by Shameem Akhter and Jason Roberts,

More information

A general-purpose virtualization service for HPC on cloud computing: an application to GPUs

A general-purpose virtualization service for HPC on cloud computing: an application to GPUs A general-purpose virtualization service for HPC on cloud computing: an application to GPUs R.Montella, G.Coviello, G.Giunta* G. Laccetti #, F. Isaila, J. Garcia Blas *Department of Applied Science University

More information

GPU File System Encryption Kartik Kulkarni and Eugene Linkov

GPU File System Encryption Kartik Kulkarni and Eugene Linkov GPU File System Encryption Kartik Kulkarni and Eugene Linkov 5/10/2012 SUMMARY. We implemented a file system that encrypts and decrypts files. The implementation uses the AES algorithm computed through

More information

10.04.2008. Thomas Fahrig Senior Developer Hypervisor Team. Hypervisor Architecture Terminology Goals Basics Details

10.04.2008. Thomas Fahrig Senior Developer Hypervisor Team. Hypervisor Architecture Terminology Goals Basics Details Thomas Fahrig Senior Developer Hypervisor Team Hypervisor Architecture Terminology Goals Basics Details Scheduling Interval External Interrupt Handling Reserves, Weights and Caps Context Switch Waiting

More information

MODULE 3 VIRTUALIZED DATA CENTER COMPUTE

MODULE 3 VIRTUALIZED DATA CENTER COMPUTE MODULE 3 VIRTUALIZED DATA CENTER COMPUTE Module 3: Virtualized Data Center Compute Upon completion of this module, you should be able to: Describe compute virtualization Discuss the compute virtualization

More information

Next Generation Operating Systems

Next Generation Operating Systems Next Generation Operating Systems Zeljko Susnjar, Cisco CTG June 2015 The end of CPU scaling Future computing challenges Power efficiency Performance == parallelism Cisco Confidential 2 Paradox of the

More information

Stream Processing on GPUs Using Distributed Multimedia Middleware

Stream Processing on GPUs Using Distributed Multimedia Middleware Stream Processing on GPUs Using Distributed Multimedia Middleware Michael Repplinger 1,2, and Philipp Slusallek 1,2 1 Computer Graphics Lab, Saarland University, Saarbrücken, Germany 2 German Research

More information

CHAPTER 1 INTRODUCTION

CHAPTER 1 INTRODUCTION 1 CHAPTER 1 INTRODUCTION 1.1 MOTIVATION OF RESEARCH Multicore processors have two or more execution cores (processors) implemented on a single chip having their own set of execution and architectural recourses.

More information

The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System

The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System Qingyu Meng, Alan Humphrey, Martin Berzins Thanks to: John Schmidt and J. Davison de St. Germain, SCI Institute Justin Luitjens

More information

Embedded Systems: map to FPGA, GPU, CPU?

Embedded Systems: map to FPGA, GPU, CPU? Embedded Systems: map to FPGA, GPU, CPU? Jos van Eijndhoven jos@vectorfabrics.com Bits&Chips Embedded systems Nov 7, 2013 # of transistors Moore s law versus Amdahl s law Computational Capacity Hardware

More information

Deciding which process to run. (Deciding which thread to run) Deciding how long the chosen process can run

Deciding which process to run. (Deciding which thread to run) Deciding how long the chosen process can run SFWR ENG 3BB4 Software Design 3 Concurrent System Design 2 SFWR ENG 3BB4 Software Design 3 Concurrent System Design 11.8 10 CPU Scheduling Chapter 11 CPU Scheduling Policies Deciding which process to run

More information

Page 1 of 5. IS 335: Information Technology in Business Lecture Outline Operating Systems

Page 1 of 5. IS 335: Information Technology in Business Lecture Outline Operating Systems Lecture Outline Operating Systems Objectives Describe the functions and layers of an operating system List the resources allocated by the operating system and describe the allocation process Explain how

More information

Project No. 2: Process Scheduling in Linux Submission due: April 28, 2014, 11:59pm

Project No. 2: Process Scheduling in Linux Submission due: April 28, 2014, 11:59pm Project No. 2: Process Scheduling in Linux Submission due: April 28, 2014, 11:59pm PURPOSE Getting familiar with the Linux kernel source code. Understanding process scheduling and how different parameters

More information

Operatin g Systems: Internals and Design Principle s. Chapter 10 Multiprocessor and Real-Time Scheduling Seventh Edition By William Stallings

Operatin g Systems: Internals and Design Principle s. Chapter 10 Multiprocessor and Real-Time Scheduling Seventh Edition By William Stallings Operatin g Systems: Internals and Design Principle s Chapter 10 Multiprocessor and Real-Time Scheduling Seventh Edition By William Stallings Operating Systems: Internals and Design Principles Bear in mind,

More information

Scheduling. Scheduling. Scheduling levels. Decision to switch the running process can take place under the following circumstances:

Scheduling. Scheduling. Scheduling levels. Decision to switch the running process can take place under the following circumstances: Scheduling Scheduling Scheduling levels Long-term scheduling. Selects which jobs shall be allowed to enter the system. Only used in batch systems. Medium-term scheduling. Performs swapin-swapout operations

More information

CPU Scheduling Outline

CPU Scheduling Outline CPU Scheduling Outline What is scheduling in the OS? What are common scheduling criteria? How to evaluate scheduling algorithms? What are common scheduling algorithms? How is thread scheduling different

More information

Multi-Threading Performance on Commodity Multi-Core Processors

Multi-Threading Performance on Commodity Multi-Core Processors Multi-Threading Performance on Commodity Multi-Core Processors Jie Chen and William Watson III Scientific Computing Group Jefferson Lab 12000 Jefferson Ave. Newport News, VA 23606 Organization Introduction

More information

Networking Virtualization Using FPGAs

Networking Virtualization Using FPGAs Networking Virtualization Using FPGAs Russell Tessier, Deepak Unnikrishnan, Dong Yin, and Lixin Gao Reconfigurable Computing Group Department of Electrical and Computer Engineering University of Massachusetts,

More information

NVIDIA CUDA Software and GPU Parallel Computing Architecture. David B. Kirk, Chief Scientist

NVIDIA CUDA Software and GPU Parallel Computing Architecture. David B. Kirk, Chief Scientist NVIDIA CUDA Software and GPU Parallel Computing Architecture David B. Kirk, Chief Scientist Outline Applications of GPU Computing CUDA Programming Model Overview Programming in CUDA The Basics How to Get

More information

Experiences on using GPU accelerators for data analysis in ROOT/RooFit

Experiences on using GPU accelerators for data analysis in ROOT/RooFit Experiences on using GPU accelerators for data analysis in ROOT/RooFit Sverre Jarp, Alfio Lazzaro, Julien Leduc, Yngve Sneen Lindal, Andrzej Nowak European Organization for Nuclear Research (CERN), Geneva,

More information

CORRIGENDUM TO TENDER FOR HIGH PERFORMANCE SERVER

CORRIGENDUM TO TENDER FOR HIGH PERFORMANCE SERVER CORRIGENDUM TO TENDER FOR HIGH PERFORMANCE SERVER Tender Notice No. 3/2014-15 dated 29.12.2014 (IIT/CE/ENQ/COM/HPC/2014-15/569) Tender Submission Deadline Last date for submission of sealed bids is extended

More information

Advanced topics: reentrant function

Advanced topics: reentrant function COSC 6374 Parallel Computation Advanced Topics in Shared Memory Programming Edgar Gabriel Fall 205 Advanced topics: reentrant function Functions executed in a multi-threaded environment need to be re-rentrant

More information

OpenACC 2.0 and the PGI Accelerator Compilers

OpenACC 2.0 and the PGI Accelerator Compilers OpenACC 2.0 and the PGI Accelerator Compilers Michael Wolfe The Portland Group michael.wolfe@pgroup.com This presentation discusses the additions made to the OpenACC API in Version 2.0. I will also present

More information

Multi-core and Linux* Kernel

Multi-core and Linux* Kernel Multi-core and Linux* Kernel Suresh Siddha Intel Open Source Technology Center Abstract Semiconductor technological advances in the recent years have led to the inclusion of multiple CPU execution cores

More information

ultra fast SOM using CUDA

ultra fast SOM using CUDA ultra fast SOM using CUDA SOM (Self-Organizing Map) is one of the most popular artificial neural network algorithms in the unsupervised learning category. Sijo Mathew Preetha Joy Sibi Rajendra Manoj A

More information

Linux scheduler history. We will be talking about the O(1) scheduler

Linux scheduler history. We will be talking about the O(1) scheduler CPU Scheduling Linux scheduler history We will be talking about the O(1) scheduler SMP Support in 2.4 and 2.6 versions 2.4 Kernel 2.6 Kernel CPU1 CPU2 CPU3 CPU1 CPU2 CPU3 Linux Scheduling 3 scheduling

More information

CPU Scheduling. CSC 256/456 - Operating Systems Fall 2014. TA: Mohammad Hedayati

CPU Scheduling. CSC 256/456 - Operating Systems Fall 2014. TA: Mohammad Hedayati CPU Scheduling CSC 256/456 - Operating Systems Fall 2014 TA: Mohammad Hedayati Agenda Scheduling Policy Criteria Scheduling Policy Options (on Uniprocessor) Multiprocessor scheduling considerations CPU

More information

Full and Para Virtualization

Full and Para Virtualization Full and Para Virtualization Dr. Sanjay P. Ahuja, Ph.D. 2010-14 FIS Distinguished Professor of Computer Science School of Computing, UNF x86 Hardware Virtualization The x86 architecture offers four levels

More information

CFD Implementation with In-Socket FPGA Accelerators

CFD Implementation with In-Socket FPGA Accelerators CFD Implementation with In-Socket FPGA Accelerators Ivan Gonzalez UAM Team at DOVRES FuSim-E Programme Symposium: CFD on Future Architectures C 2 A 2 S 2 E DLR Braunschweig 14 th -15 th October 2009 Outline

More information

Long-term monitoring of apparent latency in PREEMPT RT Linux real-time systems

Long-term monitoring of apparent latency in PREEMPT RT Linux real-time systems Long-term monitoring of apparent latency in PREEMPT RT Linux real-time systems Carsten Emde Open Source Automation Development Lab (OSADL) eg Aichhalder Str. 39, 78713 Schramberg, Germany C.Emde@osadl.org

More information

Reconfigurable Architecture Requirements for Co-Designed Virtual Machines

Reconfigurable Architecture Requirements for Co-Designed Virtual Machines Reconfigurable Architecture Requirements for Co-Designed Virtual Machines Kenneth B. Kent University of New Brunswick Faculty of Computer Science Fredericton, New Brunswick, Canada ken@unb.ca Micaela Serra

More information

Mitigating Starvation of Linux CPU-bound Processes in the Presence of Network I/O

Mitigating Starvation of Linux CPU-bound Processes in the Presence of Network I/O Mitigating Starvation of Linux CPU-bound Processes in the Presence of Network I/O 1 K. Salah 1 Computer Engineering Department Khalifa University of Science Technology and Research (KUSTAR) Sharjah, UAE

More information

Enabling Legacy Applications on Heterogeneous Platforms

Enabling Legacy Applications on Heterogeneous Platforms Enabling Legacy Applications on Heterogeneous Platforms Michela Becchi, Srihari Cadambi and Srimat Chakradhar NEC Laboratories America, Inc. {mbecchi, cadambi, chak}@nec-labs.com Abstract In this paper

More information

EECE 276 Embedded Systems

EECE 276 Embedded Systems EECE 276 Embedded Systems Embedded SW Architectures Round-robin Function-queue scheduling EECE 276 Embedded Systems Embedded Software Architectures 1 Software Architecture How to do things how to arrange

More information

The High Performance Internet of Things: using GVirtuS for gluing cloud computing and ubiquitous connected devices

The High Performance Internet of Things: using GVirtuS for gluing cloud computing and ubiquitous connected devices WS on Models, Algorithms and Methodologies for Hierarchical Parallelism in new HPC Systems The High Performance Internet of Things: using GVirtuS for gluing cloud computing and ubiquitous connected devices

More information

Case Study on Productivity and Performance of GPGPUs

Case Study on Productivity and Performance of GPGPUs Case Study on Productivity and Performance of GPGPUs Sandra Wienke wienke@rz.rwth-aachen.de ZKI Arbeitskreis Supercomputing April 2012 Rechen- und Kommunikationszentrum (RZ) RWTH GPU-Cluster 56 Nvidia

More information

Linux Process Scheduling. sched.c. schedule() scheduler_tick() hooks. try_to_wake_up() ... CFS CPU 0 CPU 1 CPU 2 CPU 3

Linux Process Scheduling. sched.c. schedule() scheduler_tick() hooks. try_to_wake_up() ... CFS CPU 0 CPU 1 CPU 2 CPU 3 Linux Process Scheduling sched.c schedule() scheduler_tick() try_to_wake_up() hooks RT CPU 0 CPU 1 CFS CPU 2 CPU 3 Linux Process Scheduling 1. Task Classification 2. Scheduler Skeleton 3. Completely Fair

More information

Chapter 5 Cloud Resource Virtualization

Chapter 5 Cloud Resource Virtualization Chapter 5 Cloud Resource Virtualization Contents Virtualization. Layering and virtualization. Virtual machine monitor. Virtual machine. Performance and security isolation. Architectural support for virtualization.

More information

Inside the Erlang VM

Inside the Erlang VM Rev A Inside the Erlang VM with focus on SMP Prepared by Kenneth Lundin, Ericsson AB Presentation held at Erlang User Conference, Stockholm, November 13, 2008 1 Introduction The history of support for

More information

Overview on Modern Accelerators and Programming Paradigms Ivan Giro7o igiro7o@ictp.it

Overview on Modern Accelerators and Programming Paradigms Ivan Giro7o igiro7o@ictp.it Overview on Modern Accelerators and Programming Paradigms Ivan Giro7o igiro7o@ictp.it Informa(on & Communica(on Technology Sec(on (ICTS) Interna(onal Centre for Theore(cal Physics (ICTP) Mul(ple Socket

More information

System Requirements Table of contents

System Requirements Table of contents Table of contents 1 Introduction... 2 2 Knoa Agent... 2 2.1 System Requirements...2 2.2 Environment Requirements...4 3 Knoa Server Architecture...4 3.1 Knoa Server Components... 4 3.2 Server Hardware Setup...5

More information

SYSTEM ecos Embedded Configurable Operating System

SYSTEM ecos Embedded Configurable Operating System BELONGS TO THE CYGNUS SOLUTIONS founded about 1989 initiative connected with an idea of free software ( commercial support for the free software ). Recently merged with RedHat. CYGNUS was also the original

More information

An Implementation Of Multiprocessor Linux

An Implementation Of Multiprocessor Linux An Implementation Of Multiprocessor Linux This document describes the implementation of a simple SMP Linux kernel extension and how to use this to develop SMP Linux kernels for architectures other than

More information

FPGA-based Multithreading for In-Memory Hash Joins

FPGA-based Multithreading for In-Memory Hash Joins FPGA-based Multithreading for In-Memory Hash Joins Robert J. Halstead, Ildar Absalyamov, Walid A. Najjar, Vassilis J. Tsotras University of California, Riverside Outline Background What are FPGAs Multithreaded

More information

ò Paper reading assigned for next Thursday ò Lab 2 due next Friday ò What is cooperative multitasking? ò What is preemptive multitasking?

ò Paper reading assigned for next Thursday ò Lab 2 due next Friday ò What is cooperative multitasking? ò What is preemptive multitasking? Housekeeping Paper reading assigned for next Thursday Scheduling Lab 2 due next Friday Don Porter CSE 506 Lecture goals Undergrad review Understand low-level building blocks of a scheduler Understand competing

More information

Comp 204: Computer Systems and Their Implementation. Lecture 12: Scheduling Algorithms cont d

Comp 204: Computer Systems and Their Implementation. Lecture 12: Scheduling Algorithms cont d Comp 204: Computer Systems and Their Implementation Lecture 12: Scheduling Algorithms cont d 1 Today Scheduling continued Multilevel queues Examples Thread scheduling 2 Question A starvation-free job-scheduling

More information

Programming models for heterogeneous computing. Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga

Programming models for heterogeneous computing. Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga Programming models for heterogeneous computing Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga Talk outline [30 slides] 1. Introduction [5 slides] 2.

More information

HP ProLiant SL270s Gen8 Server. Evaluation Report

HP ProLiant SL270s Gen8 Server. Evaluation Report HP ProLiant SL270s Gen8 Server Evaluation Report Thomas Schoenemeyer, Hussein Harake and Daniel Peter Swiss National Supercomputing Centre (CSCS), Lugano Institute of Geophysics, ETH Zürich schoenemeyer@cscs.ch

More information

Effective Computing with SMP Linux

Effective Computing with SMP Linux Effective Computing with SMP Linux Multi-processor systems were once a feature of high-end servers and mainframes, but today, even desktops for personal use have multiple processors. Linux is a popular

More information

Version 3.7 Technical Whitepaper

Version 3.7 Technical Whitepaper Version 3.7 Technical Whitepaper Virtual Iron 2007-1- Last modified: June 11, 2007 Table of Contents Introduction... 3 What is Virtualization?... 4 Native Virtualization A New Approach... 5 Virtual Iron

More information

Chapter 5 Process Scheduling

Chapter 5 Process Scheduling Chapter 5 Process Scheduling CPU Scheduling Objective: Basic Scheduling Concepts CPU Scheduling Algorithms Why Multiprogramming? Maximize CPU/Resources Utilization (Based on Some Criteria) CPU Scheduling

More information

HPC Cluster Decisions and ANSYS Configuration Best Practices. Diana Collier Lead Systems Support Specialist Houston UGM May 2014

HPC Cluster Decisions and ANSYS Configuration Best Practices. Diana Collier Lead Systems Support Specialist Houston UGM May 2014 HPC Cluster Decisions and ANSYS Configuration Best Practices Diana Collier Lead Systems Support Specialist Houston UGM May 2014 1 Agenda Introduction Lead Systems Support Specialist Cluster Decisions Job

More information

MCA Standards For Closely Distributed Multicore

MCA Standards For Closely Distributed Multicore MCA Standards For Closely Distributed Multicore Sven Brehmer Multicore Association, cofounder, board member, and MCAPI WG Chair CEO of PolyCore Software 2 Embedded Systems Spans the computing industry

More information

Real-Time Scheduling 1 / 39

Real-Time Scheduling 1 / 39 Real-Time Scheduling 1 / 39 Multiple Real-Time Processes A runs every 30 msec; each time it needs 10 msec of CPU time B runs 25 times/sec for 15 msec C runs 20 times/sec for 5 msec For our equation, A

More information

High Performance Matrix Inversion with Several GPUs

High Performance Matrix Inversion with Several GPUs High Performance Matrix Inversion on a Multi-core Platform with Several GPUs Pablo Ezzatti 1, Enrique S. Quintana-Ortí 2 and Alfredo Remón 2 1 Centro de Cálculo-Instituto de Computación, Univ. de la República

More information

Cornell University Center for Advanced Computing

Cornell University Center for Advanced Computing Cornell University Center for Advanced Computing David A. Lifka - lifka@cac.cornell.edu Director - Cornell University Center for Advanced Computing (CAC) Director Research Computing - Weill Cornell Medical

More information

Virtual Private Systems for FreeBSD

Virtual Private Systems for FreeBSD Virtual Private Systems for FreeBSD Klaus P. Ohrhallinger 06. June 2010 Abstract Virtual Private Systems for FreeBSD (VPS) is a novel virtualization implementation which is based on the operating system

More information

Accelerating variant calling

Accelerating variant calling Accelerating variant calling Mauricio Carneiro GSA Broad Institute Intel Genomic Sequencing Pipeline Workshop Mount Sinai 12/10/2013 This is the work of many Genome sequencing and analysis team Mark DePristo

More information

Resource Scheduling Best Practice in Hybrid Clusters

Resource Scheduling Best Practice in Hybrid Clusters Available online at www.prace-ri.eu Partnership for Advanced Computing in Europe Resource Scheduling Best Practice in Hybrid Clusters C. Cavazzoni a, A. Federico b, D. Galetti a, G. Morelli b, A. Pieretti

More information

Operating Systems OBJECTIVES 7.1 DEFINITION. Chapter 7. Note:

Operating Systems OBJECTIVES 7.1 DEFINITION. Chapter 7. Note: Chapter 7 OBJECTIVES Operating Systems Define the purpose and functions of an operating system. Understand the components of an operating system. Understand the concept of virtual memory. Understand the

More information

CS 377: Operating Systems. Outline. A review of what you ve learned, and how it applies to a real operating system. Lecture 25 - Linux Case Study

CS 377: Operating Systems. Outline. A review of what you ve learned, and how it applies to a real operating system. Lecture 25 - Linux Case Study CS 377: Operating Systems Lecture 25 - Linux Case Study Guest Lecturer: Tim Wood Outline Linux History Design Principles System Overview Process Scheduling Memory Management File Systems A review of what

More information

Optimizing a 3D-FWT code in a cluster of CPUs+GPUs

Optimizing a 3D-FWT code in a cluster of CPUs+GPUs Optimizing a 3D-FWT code in a cluster of CPUs+GPUs Gregorio Bernabé Javier Cuenca Domingo Giménez Universidad de Murcia Scientific Computing and Parallel Programming Group XXIX Simposium Nacional de la

More information

High Performance or Cycle Accuracy?

High Performance or Cycle Accuracy? CHIP DESIGN High Performance or Cycle Accuracy? You can have both! Bill Neifert, Carbon Design Systems Rob Kaye, ARM ATC-100 AGENDA Modelling 101 & Programmer s View (PV) Models Cycle Accurate Models Bringing

More information

Enabling Preemptive Multiprogramming on GPUs

Enabling Preemptive Multiprogramming on GPUs Enabling Preemptive Multiprogramming on GPUs Ivan Tanasic 1,2, Isaac Gelado 3, Javier Cabezas 1,2, Alex Ramirez 1,2, Nacho Navarro 1,2, Mateo Valero 1,2 1 Barcelona Supercomputing Center 2 Universitat

More information

CPU Scheduling. Basic Concepts. Basic Concepts (2) Basic Concepts Scheduling Criteria Scheduling Algorithms Batch systems Interactive systems

CPU Scheduling. Basic Concepts. Basic Concepts (2) Basic Concepts Scheduling Criteria Scheduling Algorithms Batch systems Interactive systems Basic Concepts Scheduling Criteria Scheduling Algorithms Batch systems Interactive systems Based on original slides by Silberschatz, Galvin and Gagne 1 Basic Concepts CPU I/O Burst Cycle Process execution

More information

Last Class: OS and Computer Architecture. Last Class: OS and Computer Architecture

Last Class: OS and Computer Architecture. Last Class: OS and Computer Architecture Last Class: OS and Computer Architecture System bus Network card CPU, memory, I/O devices, network card, system bus Lecture 3, page 1 Last Class: OS and Computer Architecture OS Service Protection Interrupts

More information

Operating System Tutorial

Operating System Tutorial Operating System Tutorial OPERATING SYSTEM TUTORIAL Simply Easy Learning by tutorialspoint.com tutorialspoint.com i ABOUT THE TUTORIAL Operating System Tutorial An operating system (OS) is a collection

More information

Xeon+FPGA Platform for the Data Center

Xeon+FPGA Platform for the Data Center Xeon+FPGA Platform for the Data Center ISCA/CARL 2015 PK Gupta, Director of Cloud Platform Technology, DCG/CPG Overview Data Center and Workloads Xeon+FPGA Accelerator Platform Applications and Eco-system

More information

W4118 Operating Systems. Instructor: Junfeng Yang

W4118 Operating Systems. Instructor: Junfeng Yang W4118 Operating Systems Instructor: Junfeng Yang Outline Introduction to scheduling Scheduling algorithms 1 Direction within course Until now: interrupts, processes, threads, synchronization Mostly mechanisms

More information

Operating System: Scheduling

Operating System: Scheduling Process Management Operating System: Scheduling OS maintains a data structure for each process called Process Control Block (PCB) Information associated with each PCB: Process state: e.g. ready, or waiting

More information

Implementation of Stereo Matching Using High Level Compiler for Parallel Computing Acceleration

Implementation of Stereo Matching Using High Level Compiler for Parallel Computing Acceleration Implementation of Stereo Matching Using High Level Compiler for Parallel Computing Acceleration Jinglin Zhang, Jean François Nezan, Jean-Gabriel Cousin, Erwan Raffin To cite this version: Jinglin Zhang,

More information

Cloud Computing. Up until now

Cloud Computing. Up until now Cloud Computing Lecture 11 Virtualization 2011-2012 Up until now Introduction. Definition of Cloud Computing Grid Computing Content Distribution Networks Map Reduce Cycle-Sharing 1 Process Virtual Machines

More information

Carlos Villavieja, Nacho Navarro {cvillavi,nacho}@ac.upc.edu. Arati Baliga, Liviu Iftode {aratib,liviu}@cs.rutgers.edu

Carlos Villavieja, Nacho Navarro {cvillavi,nacho}@ac.upc.edu. Arati Baliga, Liviu Iftode {aratib,liviu}@cs.rutgers.edu Continuous Monitoring using MultiCores Carlos Villavieja, Nacho Navarro {cvillavi,nacho}@ac.upc.edu Arati Baliga, Liviu Iftode {aratib,liviu}@cs.rutgers.edu Motivation Intrusion detection Intruder gets

More information

Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging

Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging In some markets and scenarios where competitive advantage is all about speed, speed is measured in micro- and even nano-seconds.

More information

Auto-Tunning of Data Communication on Heterogeneous Systems

Auto-Tunning of Data Communication on Heterogeneous Systems 1 Auto-Tunning of Data Communication on Heterogeneous Systems Marc Jordà 1, Ivan Tanasic 1, Javier Cabezas 1, Lluís Vilanova 1, Isaac Gelado 1, and Nacho Navarro 1, 2 1 Barcelona Supercomputing Center

More information

Introduction to GPU hardware and to CUDA

Introduction to GPU hardware and to CUDA Introduction to GPU hardware and to CUDA Philip Blakely Laboratory for Scientific Computing, University of Cambridge Philip Blakely (LSC) GPU introduction 1 / 37 Course outline Introduction to GPU hardware

More information

Processes and Non-Preemptive Scheduling. Otto J. Anshus

Processes and Non-Preemptive Scheduling. Otto J. Anshus Processes and Non-Preemptive Scheduling Otto J. Anshus 1 Concurrency and Process Challenge: Physical reality is Concurrent Smart to do concurrent software instead of sequential? At least we want to have

More information

Completely Fair Scheduler and its tuning 1

Completely Fair Scheduler and its tuning 1 Completely Fair Scheduler and its tuning 1 Jacek Kobus and Rafał Szklarski 1 Introduction The introduction of a new, the so called completely fair scheduler (CFS) to the Linux kernel 2.6.23 (October 2007)

More information

Virtualization for Cloud Computing

Virtualization for Cloud Computing Virtualization for Cloud Computing Dr. Sanjay P. Ahuja, Ph.D. 2010-14 FIS Distinguished Professor of Computer Science School of Computing, UNF CLOUD COMPUTING On demand provision of computational resources

More information

Linux Scheduler. Linux Scheduler

Linux Scheduler. Linux Scheduler or or Affinity Basic Interactive es 1 / 40 Reality... or or Affinity Basic Interactive es The Linux scheduler tries to be very efficient To do that, it uses some complex data structures Some of what it

More information

1. Computer System Structure and Components

1. Computer System Structure and Components 1 Computer System Structure and Components Computer System Layers Various Computer Programs OS System Calls (eg, fork, execv, write, etc) KERNEL/Behavior or CPU Device Drivers Device Controllers Devices

More information

GPU System Architecture. Alan Gray EPCC The University of Edinburgh

GPU System Architecture. Alan Gray EPCC The University of Edinburgh GPU System Architecture EPCC The University of Edinburgh Outline Why do we want/need accelerators such as GPUs? GPU-CPU comparison Architectural reasons for GPU performance advantages GPU accelerated systems

More information

Performance Evaluation of VMXNET3 Virtual Network Device VMware vsphere 4 build 164009

Performance Evaluation of VMXNET3 Virtual Network Device VMware vsphere 4 build 164009 Performance Study Performance Evaluation of VMXNET3 Virtual Network Device VMware vsphere 4 build 164009 Introduction With more and more mission critical networking intensive workloads being virtualized

More information

Dynamic Task-Scheduling and Resource Management for GPU Accelerators in Medical Imaging

Dynamic Task-Scheduling and Resource Management for GPU Accelerators in Medical Imaging This is the author s version of the work. The definitive work was published in Proceedings of the 25th International Conference on Architecture of Computing Systems (ARCS), Munich, Germany, February 28

More information

NVIDIA CUDA GETTING STARTED GUIDE FOR MAC OS X

NVIDIA CUDA GETTING STARTED GUIDE FOR MAC OS X NVIDIA CUDA GETTING STARTED GUIDE FOR MAC OS X DU-05348-001_v6.5 August 2014 Installation and Verification on Mac OS X TABLE OF CONTENTS Chapter 1. Introduction...1 1.1. System Requirements... 1 1.2. About

More information

Parallel Image Processing with CUDA A case study with the Canny Edge Detection Filter

Parallel Image Processing with CUDA A case study with the Canny Edge Detection Filter Parallel Image Processing with CUDA A case study with the Canny Edge Detection Filter Daniel Weingaertner Informatics Department Federal University of Paraná - Brazil Hochschule Regensburg 02.05.2011 Daniel

More information

Linux Scheduler Analysis and Tuning for Parallel Processing on the Raspberry PI Platform. Ed Spetka Mike Kohler

Linux Scheduler Analysis and Tuning for Parallel Processing on the Raspberry PI Platform. Ed Spetka Mike Kohler Linux Scheduler Analysis and Tuning for Parallel Processing on the Raspberry PI Platform Ed Spetka Mike Kohler Outline Abstract Hardware Overview Completely Fair Scheduler Design Theory Breakdown of the

More information

Load-Balancing for a Real-Time System Based on Asymmetric Multi-Processing

Load-Balancing for a Real-Time System Based on Asymmetric Multi-Processing LIFL Report # 2004-06 Load-Balancing for a Real-Time System Based on Asymmetric Multi-Processing Éric PIEL Eric.Piel@lifl.fr Philippe MARQUET Philippe.Marquet@lifl.fr Julien SOULA Julien.Soula@lifl.fr

More information

CPU SCHEDULING (CONT D) NESTED SCHEDULING FUNCTIONS

CPU SCHEDULING (CONT D) NESTED SCHEDULING FUNCTIONS CPU SCHEDULING CPU SCHEDULING (CONT D) Aims to assign processes to be executed by the CPU in a way that meets system objectives such as response time, throughput, and processor efficiency Broken down into

More information

POSIX. RTOSes Part I. POSIX Versions. POSIX Versions (2)

POSIX. RTOSes Part I. POSIX Versions. POSIX Versions (2) RTOSes Part I Christopher Kenna September 24, 2010 POSIX Portable Operating System for UnIX Application portability at source-code level POSIX Family formally known as IEEE 1003 Originally 17 separate

More information

12. Introduction to Virtual Machines

12. Introduction to Virtual Machines 12. Introduction to Virtual Machines 12. Introduction to Virtual Machines Modern Applications Challenges of Virtual Machine Monitors Historical Perspective Classification 332 / 352 12. Introduction to

More information

ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU

ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU Computer Science 14 (2) 2013 http://dx.doi.org/10.7494/csci.2013.14.2.243 Marcin Pietroń Pawe l Russek Kazimierz Wiatr ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU Abstract This paper presents

More information

IBM Software Group. Lotus Domino 6.5 Server Enablement

IBM Software Group. Lotus Domino 6.5 Server Enablement IBM Software Group Lotus Domino 6.5 Server Enablement Agenda Delivery Strategy Themes Domino 6.5 Server Domino 6.0 SmartUpgrade Questions IBM Lotus Notes/Domino Delivery Strategy 6.0.x MRs every 4 months

More information

GeoImaging Accelerator Pansharp Test Results

GeoImaging Accelerator Pansharp Test Results GeoImaging Accelerator Pansharp Test Results Executive Summary After demonstrating the exceptional performance improvement in the orthorectification module (approximately fourteen-fold see GXL Ortho Performance

More information

Peter Ruissen Marju Jalloh

Peter Ruissen Marju Jalloh Peter Ruissen Marju Jalloh Agenda concepts >> To research the possibilities for High Availability (HA) failover mechanisms using the XEN virtualization technology and the requirements necessary for implementation

More information

EECS 750: Advanced Operating Systems. 01/28 /2015 Heechul Yun

EECS 750: Advanced Operating Systems. 01/28 /2015 Heechul Yun EECS 750: Advanced Operating Systems 01/28 /2015 Heechul Yun 1 Recap: Completely Fair Scheduler(CFS) Each task maintains its virtual time V i = E i 1 w i, where E is executed time, w is a weight Pick the

More information

Performance Characteristics of VMFS and RDM VMware ESX Server 3.0.1

Performance Characteristics of VMFS and RDM VMware ESX Server 3.0.1 Performance Study Performance Characteristics of and RDM VMware ESX Server 3.0.1 VMware ESX Server offers three choices for managing disk access in a virtual machine VMware Virtual Machine File System

More information

Road Map. Scheduling. Types of Scheduling. Scheduling. CPU Scheduling. Job Scheduling. Dickinson College Computer Science 354 Spring 2010.

Road Map. Scheduling. Types of Scheduling. Scheduling. CPU Scheduling. Job Scheduling. Dickinson College Computer Science 354 Spring 2010. Road Map Scheduling Dickinson College Computer Science 354 Spring 2010 Past: What an OS is, why we have them, what they do. Base hardware and support for operating systems Process Management Threads Present:

More information