Synchronization
Todd C. Mowry
CS 740
November 24, 1998

Topics
- Locks
- Barriers

Types of Synchronization
- Mutual exclusion
  - locks
- Event synchronization
  - global or group-based (barriers)
  - point-to-point
    - tightly coupled with data (full-empty bits, futures)
    - separate from data (flags); see the sketch below
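
As a small illustration of the last category, here is a minimal sketch of point-to-point event synchronization through a flag separate from the data, written in C11 with <stdatomic.h> (the data value and the producer/consumer function names are illustrative, not from the lecture):

    #include <stdatomic.h>

    int data;                         /* the value being communicated */
    atomic_int flag = 0;              /* 0 = not ready, 1 = ready     */

    void producer(void) {
        data = 42;                                              /* write the data */
        atomic_store_explicit(&flag, 1, memory_order_release);  /* then signal    */
    }

    void consumer(void) {
        while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
            ;                         /* busy-wait on the flag */
        /* data is now safe to read */
    }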

Blocking vs. Busy-Waiting
Busy-waiting is preferable when:
- scheduling overhead is larger than the expected wait time
- processor resources are not needed for other tasks
- schedule-based blocking is inappropriate (e.g., in the OS kernel)
But typical busy-waiting produces lots of network traffic and hot-spots.

Reducing Busy-Waiting Traffic
The trend was toward increasing hardware support:
- sophisticated atomic operations
- combining networks
- multistage networks with special sync variables in the switches
- special-purpose cache hardware to maintain queues of waiters
But (Mellor-Crummey and Scott): appropriate software synchronization algorithms can eliminate most busy-waiting traffic.

Software Synchronization
Hardware support required:
- simple fetch-and-op support
- ability to cache shared variables
Implications:
- efficient software synchronization can be designed for large machines
- special-purpose hardware has only small additional benefits
  - small constant factor for locks
  - best case a factor of log P for barriers

Mutual Exclusion
We discuss four sets of primitives:
- test-and-set lock
- ticket lock
- array-based queueing lock
- list-based queueing lock

Test and Set Lock
Cacheable vs. non-cacheable.
Cacheable:
- good if there is locality on lock access: reduces both latency and lock duration
  - particularly if the lock is acquired in exclusive state
- not good for high-contention locks
Non-cacheable (read-modify-write at main memory):
- easier to implement
- high latency
- better for high-contention locks
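
A minimal test&set spinlock sketch in C11 (the ts_acquire/ts_release names are illustrative; atomic_flag_test_and_set is the standard library's test&set):

    #include <stdatomic.h>

    atomic_flag lock = ATOMIC_FLAG_INIT;

    void ts_acquire(void) {
        /* test&set returns the old value; spin until it says "was free".
           Every failed attempt is a write, so this naive loop generates
           coherence/network traffic under contention. */
        while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
            ;
    }

    void ts_release(void) {
        atomic_flag_clear_explicit(&lock, memory_order_release);
    }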

Test and Test and Set

    A: while (lock != free)
           ;                        /* spin, reading the lock in the cache */
       if (test&set(lock) == free) {
           critical section;
       } else
           goto A;

(+) spinning happens in the cache
(-) can still generate a lot of traffic when many processors go to do the test&set

Test and Set with Backoff
Upon failure, delay for a while before retrying:
- either constant delay or exponential backoff
Tradeoffs:
(+) much less network traffic
(-) exponential backoff can cause starvation for high-contention locks
    - new requestors back off for shorter times
But exponential backoff is found to work best in practice.
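
A sketch of test&set with exponential backoff in the same style (the delay unit and the cap of 1024 iterations are illustrative choices, not from the slides):

    #include <stdatomic.h>

    atomic_flag lock = ATOMIC_FLAG_INIT;

    void ts_backoff_acquire(void) {
        unsigned delay = 1;                      /* initial backoff, in loop iterations */
        while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire)) {
            for (volatile unsigned i = 0; i < delay; i++)
                ;                                /* spin quietly for a while */
            if (delay < 1024)
                delay *= 2;                      /* exponential growth, capped */
        }
    }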

Test and Set with Update
Test and Set sends updates to the processors that cache the lock.
Tradeoffs:
(+) good for bus-based machines
(-) still lots of traffic on distributed networks
The main problem with test&set-based schemes is that a lock release causes all waiters to try to acquire the lock, each issuing a test&set.

Ticket Lock (fetch&incr based)
The ticket lock has just one processor attempt the lock when a release happens.
Two counters:
- next_ticket (number of requestors)
- now_serving (number of releases that have happened)
Algorithm:
- First do a fetch&incr on next_ticket (not a test&set).
- When a release happens, poll the value of now_serving:
  - if it equals my_ticket, then I win.
- Use delay while polling; but how much?
  - exponential is bad
  - tough to find a delay that makes network traffic constant for a single lock
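
A ticket lock sketch in C11 following the two counters above (the poll loop is left plain; a real implementation would insert the delay discussed on this slide):

    #include <stdatomic.h>

    typedef struct {
        atomic_uint next_ticket;      /* number of requestors so far */
        atomic_uint now_serving;      /* number of releases so far   */
    } ticket_lock_t;

    void ticket_acquire(ticket_lock_t *l) {
        /* fetch&incr hands out my ticket (not a test&set) */
        unsigned my_ticket = atomic_fetch_add(&l->next_ticket, 1);
        while (atomic_load_explicit(&l->now_serving, memory_order_acquire) != my_ticket)
            ;                         /* poll now_serving; insert delay here */
    }

    void ticket_release(ticket_lock_t *l) {
        /* only the lock holder writes now_serving, so a plain increment suffices */
        atomic_store_explicit(&l->now_serving,
            atomic_load_explicit(&l->now_serving, memory_order_relaxed) + 1,
            memory_order_release);
    }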

Ticket Lock Tradeoffs
(+) guaranteed FIFO order; no starvation possible
(+) latency can be low if fetch&incr is cacheable
(+) traffic can be quite low
(-) but traffic is not guaranteed to be O(1) per lock acquire

Array-Based Queueing Locks
Every process spins on a unique location, rather than on a single now_serving counter:
- fetch&incr gives a process the address on which to spin
Tradeoffs:
(+) guarantees FIFO order (like the ticket lock)
(+) O(1) traffic with coherent caches (unlike the ticket lock)
(-) requires space per lock proportional to P
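
A sketch of an array-based queueing lock along the lines of Anderson's lock (MAXPROCS is an assumed upper bound on processors; in practice each slot would be padded to its own cache line, which is omitted here):

    #include <stdatomic.h>

    #define MAXPROCS 64

    typedef struct {
        atomic_int  has_lock[MAXPROCS];    /* one spin location per waiter */
        atomic_uint next_slot;
    } array_lock_t;

    void array_lock_init(array_lock_t *l) {
        for (int i = 0; i < MAXPROCS; i++)
            atomic_store(&l->has_lock[i], 0);
        atomic_store(&l->has_lock[0], 1);  /* slot 0 starts out holding the lock */
        atomic_store(&l->next_slot, 0);
    }

    unsigned array_acquire(array_lock_t *l) {
        /* fetch&incr gives me the address (slot) to spin on */
        unsigned my_slot = atomic_fetch_add(&l->next_slot, 1) % MAXPROCS;
        while (!atomic_load_explicit(&l->has_lock[my_slot], memory_order_acquire))
            ;
        return my_slot;                    /* caller passes this back to release */
    }

    void array_release(array_lock_t *l, unsigned my_slot) {
        atomic_store(&l->has_lock[my_slot], 0);
        atomic_store_explicit(&l->has_lock[(my_slot + 1) % MAXPROCS], 1,
                              memory_order_release);
    }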

List-Based Queueing Locks (MCS)
All the other good things, plus O(1) traffic even without coherent caches (spin locally):
- uses compare&swap to build linked lists in software
- locally-allocated flag per list node to spin on
- can work with fetch&store alone, but then loses the FIFO guarantee
Tradeoffs:
(+) less storage than array-based locks
(+) O(1) traffic even without coherent caches
(-) compare&swap is not easy to implement
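
A sketch of the MCS list-based lock in C11; compare&swap appears in the release path to handle the case where no successor has linked in yet (allocation and placement of the per-processor qnode are left to the caller):

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    typedef struct mcs_node {
        struct mcs_node *_Atomic next;
        atomic_bool locked;              /* locally-allocated flag to spin on */
    } mcs_node_t;

    typedef struct {
        mcs_node_t *_Atomic tail;        /* last node in the queue, or NULL */
    } mcs_lock_t;

    void mcs_acquire(mcs_lock_t *l, mcs_node_t *me) {
        atomic_store(&me->next, NULL);
        /* fetch&store: swap myself in as the new tail of the queue */
        mcs_node_t *pred = atomic_exchange(&l->tail, me);
        if (pred != NULL) {              /* queue was non-empty: wait my turn */
            atomic_store(&me->locked, true);
            atomic_store(&pred->next, me);       /* link in behind predecessor */
            while (atomic_load_explicit(&me->locked, memory_order_acquire))
                ;                        /* spin on my own local flag */
        }
    }

    void mcs_release(mcs_lock_t *l, mcs_node_t *me) {
        mcs_node_t *succ = atomic_load(&me->next);
        if (succ == NULL) {
            mcs_node_t *expected = me;
            /* compare&swap: if I am still the tail, the queue becomes empty */
            if (atomic_compare_exchange_strong(&l->tail, &expected, NULL))
                return;
            while ((succ = atomic_load(&me->next)) == NULL)
                ;                        /* a successor is mid-way through linking in */
        }
        atomic_store_explicit(&succ->locked, false, memory_order_release);
    }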

Lock Performance
(See attached sheets.)

Implementing Fetch&Op Primitives
One possibility for implementation in caches: LL/SC.
- Load Linked (LL) loads the lock and sets a bit.
- When the atomic operation is finished, Store Conditional (SC) succeeds only if the bit was not reset in the interim.
- Fits within the load/store (aka RISC) architecture paradigm:
  - i.e., no instruction performs more than one memory operation
  - good for pipelining and clock rates
Good for bus-based machines: the SC result is known on the bus.
More complex for directory-based machines; either:
- wait for the SC to go to the directory and get ownership (long latency), or
- have LL load in exclusive mode, so the SC succeeds immediately if the line is still in exclusive mode.
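
LL/SC is not exposed directly in portable C, but a compare&swap retry loop has the same shape; the sketch below builds a fetch&incr primitive that way (on an LL/SC machine the initial load would be the LL and the compare-exchange the SC):

    #include <stdatomic.h>

    unsigned fetch_and_incr(atomic_uint *addr) {
        unsigned old = atomic_load_explicit(addr, memory_order_relaxed);  /* "LL" */
        /* retry if someone else wrote the location in the interim ("SC" failed);
           atomic_compare_exchange_weak reloads old for us on failure */
        while (!atomic_compare_exchange_weak(addr, &old, old + 1))
            ;
        return old;
    }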

Bottom Line for Locks
Lots of options.
- Using simple hardware primitives, we can build very successful algorithms in software.
- LL/SC works well if there is locality of synch accesses.
- Otherwise, in-memory fetch&ops are good for high contention.

Lock Recommendations
Criteria:
- scalability
- one-processor latency
- space requirements
- fairness
- atomic operation requirements

Scalability
- MCS locks are the most scalable in general.
- Array-based queueing locks are also good with coherent caches.
- Test&set and ticket locks scale well with backoff, but add network traffic.

Single-Processor Latency
- Test&set and ticket locks are best.
- MCS and array-based queueing locks are also good when good fetch&op primitives are available.
- Array-based locks are bad in terms of space if too many locks are needed:
  - in the OS kernel, for example
  - when there are more processes than processors

Fairness
- The ticket lock, array-based lock, and MCS all guarantee FIFO order.
  - MCS needs compare&swap.
- Fairness can waste CPU resources in case of preemption:
  - can be fixed by coscheduling.

Primitives Required
- Test&Set: test&set
- Ticket: fetch&incr
- MCS: fetch&store, compare&swap
- Queue-based: fetch&store

Lock Recommendations
- For scalability and minimizing network traffic: MCS
  - one-proc latency is good if efficient fetch&op is provided
- If one-proc latency is critical, or fetch&op is not available: Ticket with proportional backoff
- If preemption is possible while spinning, or if neither fetch&store nor fetch&incr is provided: Test&Set with exponential backoff

Barriers
We will discuss five barriers:
- centralized
- software combining tree
- dissemination barrier
- tournament barrier
- MCS tree-based barrier

Centralized Barrier
Basic idea:
- notify a single shared counter when you arrive
- poll that shared location until all have arrived
A simple implementation requires polling/spinning twice:
- first to ensure that all procs have left the previous barrier
- second to ensure that all procs have arrived at the current barrier
Solution to get one spin: sense reversal.
Tradeoffs:
(+) simple
(-) high traffic, unless the machine is cache-coherent with broadcast
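
A sense-reversal centralized barrier sketch in C11 (P is an assumed compile-time processor count; each thread keeps its own local_sense and passes it in):

    #include <stdatomic.h>

    #define P 8                           /* number of processors (assumed) */

    atomic_int count = P;                 /* how many have yet to arrive    */
    atomic_int sense = 0;                 /* flips once per barrier episode */

    void barrier(int *local_sense) {
        *local_sense = !*local_sense;     /* sense reversal: one spin per episode */
        if (atomic_fetch_sub(&count, 1) == 1) {       /* I am the last to arrive */
            atomic_store(&count, P);                  /* reset for the next barrier */
            atomic_store_explicit(&sense, *local_sense, memory_order_release);
        } else {
            while (atomic_load_explicit(&sense, memory_order_acquire) != *local_sense)
                ;                                     /* poll the shared location */
        }
    }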

Software Combining Tree Barrier
Represent the shared barrier variable as a tree of variables:
- each node is in a separate memory module
- each leaf is a group of processors, one of which plays representative
Writes go into one tree for barrier arrival; reads come from another tree to allow procs to continue.
Sense reversal distinguishes consecutive barriers.
Disadvantage: except on cache-coherent machines, the assignment of spin flags to processors is hard, so processors may spin remotely.

Dissemination Barrier
- log P rounds of synchronization
- In round k, proc i synchronizes with proc (i + 2^k) mod P
Advantage:
- can statically allocate flags to avoid remote spinning
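
A dissemination barrier sketch in C11 for an assumed P of 8 (each processor calls dissemination_barrier with its own id). The flags are statically placed so each processor spins only on its own locations; note that the naive flag reset shown here is for illustration only, and repeated use needs the parity/sense machinery of Mellor-Crummey and Scott:

    #include <stdatomic.h>

    #define P      8                      /* number of processors (assumed) */
    #define ROUNDS 3                      /* ceil(log2(P))                  */

    /* flags[k][i] is set when proc i has been notified in round k */
    atomic_int flags[ROUNDS][P];

    void dissemination_barrier(int i) {
        for (int k = 0; k < ROUNDS; k++) {
            int partner = (i + (1 << k)) % P;       /* proc (i + 2^k) mod P */
            atomic_store_explicit(&flags[k][partner], 1, memory_order_release);
            while (atomic_load_explicit(&flags[k][i], memory_order_acquire) == 0)
                ;                                   /* spin on my own flag  */
            atomic_store(&flags[k][i], 0);          /* naive reset for reuse */
        }
    }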

Tournament Barrier
- Binary combining tree.
- The representative processor at a node is statically chosen: no fetch&op needed.
- In round k, proc i (with i mod 2^(k+1) = 2^k) sets a flag for proc j = i - 2^k:
  - i then drops out of the tournament and j proceeds in the next round
  - i waits for a global flag signalling completion of the barrier to be set
    - could use a combining wakeup tree instead
Disadvantage: without coherent caches and broadcast, suffers from either:
- traffic due to the single flag, or
- the same problem as combining trees (for the wakeup tree)

MCS Software Barrier
Modifies the tournament barrier to allow static allocation in the wakeup tree, and to use sense reversal.
Every processor is a node in two P-node trees:
- has pointers to its parent, building a fanin-4 arrival tree
- has pointers to its children, building a fanout-2 wakeup tree
Advantages:
- spins on local flag variables
- requires O(P) space for P processors
- achieves the theoretical minimum number of network transactions: 2P - 2
- O(log P) network transactions on the critical path

Barrier Performance
(See attached sheets.)

Barrier Recommendations
Criteria:
- length of critical path
- number of network transactions
- space requirements
- atomic operation requirements

Space Requirements
- Centralized: constant
- MCS, combining tree: O(P)
- Dissemination, tournament: O(P log P)

Network Transactions
- Centralized, combining tree: O(P) if broadcast and coherent caches; unbounded otherwise
- Dissemination: O(P log P)
- Tournament, MCS: O(P)

Critical Path Length
- If independent parallel network paths are available: all are O(log P), except centralized, which is O(P).
- Otherwise (e.g., on a shared bus): linear factors dominate.

Primitives Needed
- Centralized and combining tree: atomic increment, atomic decrement
- Others: atomic read, atomic write

Barrier Recommendations
Without broadcast on distributed memory:
- Dissemination
  - MCS is also good; only its critical path length is about 1.5X longer
  - MCS has somewhat better network load and space requirements
With cache coherence and broadcast (e.g., a bus):
- MCS with flag wakeup
  - but centralized is best for modest numbers of processors
Big advantage of the centralized barrier: it adapts to a changing number of processors across barrier calls.