Implementing a Work-Stealing Task Scheduler on the ARM11 MPCore
- How to keep it balanced -
Wagner, Jahanpanah, Kujawski
NEC Laboratories Europe, IT Research Division

Content
1. Introduction and Motivation
2. Parallel Concepts and Related Background
3. TPI - a Task Processing Interface
4. Landscape Rendering
5. TPI - Code Examples
6. Results and Outlook

1 Herb Sutter: THE FREE LUNCH IS OVER (30th March 05, Dr. Dobb's)

1 Multicores are coming
- The MPCore test chip has been available since May 2004
- Slow acceptance (because of no free lunch?)
- Multicore revolution:
  - Sequential software does not exploit the new architecture
  - Automatic parallelization does (still) not work well
  - Rewrite code for parallel execution:
    1. Thread-based static parallelization
    2. Fine-granular task-based parallelization
    3. Language change (from C/C++ to a parallel language)
    4. MPI / OpenMP
- Code needs to get touched anyway: change it in a way that scales with any
  number of cores (if the problem allows that)

1 Divergent HW
- Multicore architecture changes: N general-purpose CPUs + X specialized cores
  - Cell: PPC + SPUs
  - x86: CPU + GPU, AMD's Fusion concept
  - Sun T2
- Shared memory vs. distributed memory
  - SM is wonderful from an algorithmic point of view, but coherence and
    access bandwidth become difficult when increasing N
  - DM is easier to maintain but puts another burden on the user
- Is there a programming concept which fits most architectures?

2 Concepts
- Static parallelization: divide the problem's domain into subdomains
  matching the number of cores
- But:
  - The number of available cores might change at runtime
  - High-priority apps might interrupt us
  - The time required to compute each partition might differ
  - What is the proper dimension along which to divide the problem?
  - Is the needed amount of time known in advance?
- Dynamic adjustment of partition sizes is needed for good load balancing!
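The static partitioning described above can be sketched as follows; `chunk_bounds` is a hypothetical helper, not part of the talk's code. It splits a range evenly once, up front, which is exactly why it cannot react to cores disappearing or chunks taking unequal time:

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical sketch: divide the index range [0, n) statically into
// `parts` contiguous chunks, as static parallelization would. Chunk sizes
// differ by at most one element -- but nothing adapts if one chunk turns
// out to be more expensive than the others at runtime.
std::pair<std::size_t, std::size_t> chunk_bounds(std::size_t n,
                                                 std::size_t parts,
                                                 std::size_t idx) {
    std::size_t base = n / parts, rem = n % parts;
    std::size_t begin = idx * base + (idx < rem ? idx : rem);
    std::size_t end   = begin + base + (idx < rem ? 1 : 0);
    return {begin, end};
}
```

With n = 10 and 4 cores this yields chunks of 3, 3, 2, 2 elements: already unequal, and the imbalance grows once per-element costs differ.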

2 Concepts
- Solution: generate many more threads than processors are available
  - Each thread should do less work
  - The OS scheduler will ensure proper balancing without the need for
    runtime partitioning adjustments
- But: who will implement the complicated synchronization?
  - Imagine a thread count > 100 for fine granularity of the problem
  - Scheduling overhead (in the OS) grows
  - Synchronization costs get higher
  - Lock-based synchronization suffers from swapped-out threads
    (if they hold the lock)

2 Concepts
- Schedule atomic tasks instead of threads, using a userspace scheduler
  - Don't have the OS do things you can do more efficiently yourself,
    thanks to deeper knowledge of the problem
- A task is an elementary unit of work, not meant to get interrupted just
  to serve another task of the same problem
  - Tasks don't compete against each other (like threads do):
    very simple scheduling
  - A task should be created only when the problem needs it
  - No waiting on lock-based conditions (blocking)
- Restricted ways of communication between tasks: work, don't chat!
  - Only provide tree-like dependencies in the data flow
  - Try to handle other dependencies implicitly, just in time

2 Background
- The idea of a user-space task scheduler
  - Task queues have been considered at least since '81 (Burton, Sleep):
    one big queue serves all nodes
  - Node-local queues since '91: lowers the synchronization overhead
  - Randomized work stealing, '94 (Blumofe, Leiserson): if a node runs out
    of tasks, pick a random neighbor's next one
  - Non-blocking work stealing, '98 (Blumofe, Papadopoulos): each queue
    runs fully independently of the neighboring queues' state
  - Migration-avoiding work stealing, '00 (Acar, Blelloch, Blumofe):
    scheduling optimized to exploit cache locality
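The randomized work-stealing discipline from the Blumofe/Leiserson line of work can be sketched as follows; all names here are illustrative, and a real scheduler would of course use lock-free deques and run this loop concurrently. A worker pops its own newest task first (cache-friendly LIFO) and, when its queue is empty, steals the *oldest* task from a victim chosen starting at a random offset:

```cpp
#include <cstddef>
#include <deque>
#include <optional>
#include <random>
#include <vector>

using Task = int;

struct Worker {
    std::deque<Task> q;   // owner pushes/pops at the back, thieves take the front
};

// Sketch only: single-threaded model of the stealing rule.
std::optional<Task> next_task(std::vector<Worker>& ws, std::size_t self,
                              std::mt19937& rng) {
    if (!ws[self].q.empty()) {                 // local work first (LIFO)
        Task t = ws[self].q.back();
        ws[self].q.pop_back();
        return t;
    }
    std::uniform_int_distribution<std::size_t> pick(0, ws.size() - 1);
    std::size_t start = pick(rng);             // random victim to start with
    for (std::size_t i = 0; i < ws.size(); ++i) {
        std::size_t victim = (start + i) % ws.size();
        if (victim != self && !ws[victim].q.empty()) {
            Task t = ws[victim].q.front();     // steal the victim's oldest task
            ws[victim].q.pop_front();
            return t;
        }
    }
    return std::nullopt;                       // ran dry: go idle
}
```

Stealing from the front deliberately takes the task the owner is least likely to touch soon, which is also what the later migration-avoiding variants refine for cache locality.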

2 Background
- Implementing a custom scheduler: ingredients from theory
  - Lock-free queues
  - Worker thread(s) per local queue
- Embedded-system restrictions (the engineer's compromise)
  - Cooperativity and power consumption: we are not the only application
    (busy-waiting vs. sleeping in the idle case)
  - We don't have N = 1000s of cores (yet): O(log N) vs. O(N), and how big
    is the constant c on current architectures?
  - Maintain the option to switch to the complicated O(log N) scheme when
    the time is right

2 Background
- Locks are not optimal if
  - Frequent concurrent access to data structures is needed
  - More threads than processors need to get synchronized: if the lock
    holder is swapped out, all others have to wait
  - Locks are the building blocks of deadlocks
- Lock-free techniques circumvent that, but have
  - Slightly higher costs in the ideal execution time
    (a simple test-and-set vs. a complicated program flow)
  - Exploding numbers of possible program states
    (a problem for state-based certification schemes?)
  - Theoretical occurrence of livelocks in certain designs
  - Complicated, dedicated algorithms for every basic data structure

2 Background
- Use published and tested standard algorithms for lock-free
  - LIFOs
  - FIFOs
  - Linked lists, skip lists, dictionaries
  - Memory management
- Build everything on top of lock-free atomic primitives
  - TestAndSet
  - AtomicAdd / AtomicSub
  - CompareAndSwap / LL-SC
- Still complicated to get it always right
- Universal wait-free constructs [Herlihy '91]
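As an illustration of building on CompareAndSwap, here is the textbook Treiber-style lock-free LIFO, written with C++11 `std::atomic` rather than the raw primitives of the slides. It is a sketch of the published construction, not the talk's implementation; note that this naive `pop` is exposed to the ABA problem unless nodes are never reused (or tagged pointers / hazard pointers are added):

```cpp
#include <atomic>

struct Node {
    int value;
    Node* next;
};

// Minimal Treiber stack: every update is a CAS on the head pointer.
struct LockFreeStack {
    std::atomic<Node*> head{nullptr};

    void push(Node* n) {
        n->next = head.load(std::memory_order_relaxed);
        while (!head.compare_exchange_weak(n->next, n,
                                           std::memory_order_release,
                                           std::memory_order_relaxed)) {
            // CAS failed: n->next now holds the current head; retry.
        }
    }

    Node* pop() {
        Node* old = head.load(std::memory_order_acquire);
        while (old && !head.compare_exchange_weak(old, old->next,
                                                  std::memory_order_acquire,
                                                  std::memory_order_relaxed)) {
            // CAS failed: old was refreshed to the current head; retry.
        }
        return old;   // nullptr if the stack was empty
    }
};
```

The retry loops are exactly the "complicated program flow" the previous slide contrasts with a simple test-and-set.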

2 Background
- On ARMv6, use the ldrex / strex combination for implementing atomic
  primitives (here: test-and-set):

      retrytns:
      asm {
          ldrex   oldval, [addr]
          strex   tmp, one, [addr]
          teq     tmp, #0
          bne     retrytns
      }

  - ABA-safe; livelock-free as of the r0p2 revision
- Beware of architecture-specific behaviours
  - Read/write reordering might happen (store buffers)
  - Use proper acquire/release semantics for signaling atomic operations,
    using CP15's Data Memory Barrier command on ARMv6:

      asm { mcr p15, 0, 0, c7, c10, 5 }

2 Background
- How do the caches behave?
  - The SCU takes care of L1 syncing on MPCore
  - Other architectures may need explicit syncs
- Have an eye on the compiler
  - Instruction scheduling barriers:
    - __schedule_barrier() for armcc
    - asm volatile ("" : : : "memory") for gcc
  - volatile-declared variables (hopefully) prevent the compiler from doing
    aggressive optimizations; important for loop counters
  - Check the asm output when switching to a different tool chain
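The gcc-style compiler barrier above can be wrapped in a small helper, sketched here with an illustrative `publish` example (the names are mine, not the talk's). The barrier emits no instructions; it only forbids the compiler from reordering or caching memory accesses across it. A hardware barrier (e.g. the DMB of the previous slide) is still needed for cross-core ordering:

```cpp
// Compiler-only barrier: no code is emitted, but gcc/clang must assume
// all memory may have changed, so loads/stores are not moved across it.
inline void compiler_barrier() {
    asm volatile("" : : : "memory");
}

int data = 0;
int flag = 0;

// Illustrative publish pattern: keep the store to `data` ordered (in the
// compiler's output) before the store to `flag`.
void publish(int v) {
    data = v;
    compiler_barrier();
    flag = 1;
}
```

Without the barrier the compiler is free to sink or merge the `data` store past the `flag` store, since both are ordinary variables.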

2 Background
- Re-using Hans Boehm's atomic_ops library
  - Supports commonly used atomic operations
  - Part of the famous Boehm-Demers-Weiser GC for C/C++
  - Already comes with x86, x86-64, Sparc, IA64, ...
  - Works with icc, gcc, msvc on Linux and Windows; Open Source
  - Optional: lock-based emulation for error checking
- Extended it with paths for ARMv6 using gcc or armcc
  - The patch is currently being reviewed for integration

2 Background
- Relative timings for atomic increment vs. local increment (= 1.0)
  [Bar chart, y-axis 0-14: Simple Inc, LL/SC Inc, LL/SC Inc w. Barrier]
- Expect a factor of ~5 in weakly concurrent scenarios when the increment
  is combined with a full barrier (2x DMB)
- Expect atomic operations to become a serialization point in highly
  concurrent scenarios

2 Background
- Testing on different multi-core platforms reveals bugs faster
  - The ARM11 MPCore on the EB is not the fastest development
    environment (yet)
    - NEC: 700-900 MHz ARM11 MPCore ready before the end of 2007
    - The EB board provides r0p0 silicon running at 175 MHz
  - We implemented everything on dual/quad-core x86 machines first

3 TPI - Task Processing Interface
- Built to handle tasks via a small and neat interface
  - Register tasks: tasks can have a finalizer assigned; the finalizer gets
    enqueued when all children are finished
  - Enqueue tasks: either from an external (non-worker) thread or from
    within TPI
  - Wait for tasks: blocking wait from external threads for enqueued tasks
  - Abort tasks: asynchronously cancel tasks (and their children)
- Browser-based inspection interface
  - Simulate a variable number of CPUs at runtime of the application
  - Graphical overview of the queues' contents
- Cross-platform (x86, x86-64, ARM; Linux, Windows)

3 TPI Internals
- A local FIFO per worker thread stores pending tasks
  - New tasks are stored in the creator's queue
  - External tasks are put into a separate queue
    ("external" = spawned from non-worker threads)
- Work stealing uses a random permutation when checking queues
  - Prevents convoys when several threads are stealing work
- Worker threads fall asleep if stealing failed
  - No further stealing attempts after one circular round trip over
    all queues
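The random-permutation victim order can be sketched as below; the name `steal_order` is illustrative, not TPI's API. If every idle worker scanned queues 0, 1, 2, ... in the same order, thieves would pile up on the same victim (a convoy); a fresh permutation per round trip spreads them out:

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

// Sketch: build the victim order for one stealing round trip. Each idle
// worker visits every *other* worker's queue exactly once, in a freshly
// shuffled order; after one full trip without success it goes to sleep.
std::vector<std::size_t> steal_order(std::size_t workers, std::size_t self,
                                     std::mt19937& rng) {
    std::vector<std::size_t> order;
    for (std::size_t i = 0; i < workers; ++i)
        if (i != self) order.push_back(i);
    std::shuffle(order.begin(), order.end(), rng);
    return order;
}
```

Because each worker draws its own permutation, two thieves that start stealing at the same moment almost never probe the same victims in the same sequence.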

3 TPI Internals
- Pending tasks are identified via handles, allowing
  - Asynchronous aborts
  - Blocking waits for completion
  - Reference counting
  - Changing the finalizer from within a running task
- Each task only knows its parent
  - Master tasks have no parent
- Each task counts the number of pending children
  - Easy way to trigger a finalizer call as soon as the last child has
    finished execution
- Finalizers may get executed
  - Instantaneously
  - As a child task again
  - As an independent sibling task
- Finalizers act like a barrier waiting for all children
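The parent/child bookkeeping can be sketched with an atomic child counter; `TaskNode` and its members are illustrative names, and TPI's real handles of course carry more state. Whichever child's decrement takes the counter to zero fires the finalizer, which is what makes the finalizer behave like a barrier:

```cpp
#include <atomic>
#include <functional>

// Minimal sketch of per-task child counting: the last child to finish
// (the fetch_sub that observes 1) triggers the finalizer exactly once.
struct TaskNode {
    std::atomic<int> pending_children{0};
    std::function<void()> finalizer;

    void spawn_child() {
        pending_children.fetch_add(1, std::memory_order_relaxed);
    }

    void child_finished() {
        // acq_rel: the finisher's writes become visible to the finalizer.
        if (pending_children.fetch_sub(1, std::memory_order_acq_rel) == 1)
            finalizer();   // last child has finished: barrier released
    }
};
```

No lock is needed: the atomic decrement alone decides which child runs the finalizer, even when several children finish concurrently.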

4 Raytracing a Landscape using TPI
- IBM's demo on Cell: what can be achieved on MPCore?
  - 3.2 GHz vs. 175 MHz (on the ARM EB)
  - CPU design 2:8 vs. 4:0
  - The current EB has a very slow memory connection
  - MPCore has no floating-point SIMD (VMX/SSE)
  - HD 1080p vs. QVGA
- Landscape given by height map and texture
- 6-DOF camera
- Partitioning scheme: fixed in screen space

4 Raytracing a Landscape using TPI
- 1st: optimize the inner loops
  - VFP => fixed-point integer
  - Lay out source data for optimal cache usage: Morton-order pattern for
    heightmap and texture data

        0 1 | 4 5
        2 3 | 6 7
        ----+----
        8 9 | ...
        A B |

  - Use hierarchical collision tests (quick skip): 16 x 16 blocks
- Totally memory bound
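The Morton (Z-order) index behind that layout is computed by interleaving the bits of x and y, so that 2D-neighboring texels land close together in memory; that locality is what makes the 16x16 block tests cache-friendly. A standard bit-spreading implementation (not the talk's code) looks like this:

```cpp
#include <cstdint>

// Morton / Z-order index for a 2D coordinate: interleave the bits of x
// (even bit positions) and y (odd bit positions).
uint32_t morton2d(uint16_t x, uint16_t y) {
    auto spread = [](uint32_t v) {
        // Spread the low 16 bits of v into the even bit positions.
        v = (v | (v << 8)) & 0x00FF00FFu;
        v = (v | (v << 4)) & 0x0F0F0F0Fu;
        v = (v | (v << 2)) & 0x33333333u;
        v = (v | (v << 1)) & 0x55555555u;
        return v;
    };
    return spread(x) | (spread(y) << 1);
}
```

Walking the heightmap in this order reproduces the recursive Z pattern of the diagram above: indices 0-3 fill one 2x2 block before the curve moves on to the next.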

4 Raytracing a Landscape using TPI
- 2nd: parallelize everything possible via the task scheme
  - Double buffering: the X-Window update only uses one core; without DB we
    would have to block there, leaving 3 cores idle
  - Rendering into partitions:

        Partitions:    1     4    13    26    52   104   208
        FPS:          0.88  2.70  3.00  3.30  3.40  3.5   3.5
        Idle time %:  70.0  13.0  12.5   4.0   2.3   1.8   1.4

- TPI overhead with 208 tasks: < 0.2 % (measured via oprofile)

5 TPI Code Examples
- Program flow diagram at task level:
  [Diagram: Input + Camera -> DB Swap -> Rendering into partitions ->
   Color Conv. -> X11 Present; the finalizer enqueues the frame task again]
- O(N) vs. O(log N): sequential task creation is not a bottleneck here for
  ~200 tasks, and it is a little simpler than hierarchical [log(N)] creation

5 TPI Code Examples

    // MAIN thread
    TPI *tpi = TPI::Create();
    frameTaskID   = tpi->Register(FrameTask, TASKID_SELF);
    traceTaskID   = tpi->Register(TraceTask, TASKID_NONE);
    presentTaskID = tpi->Register(GraphPresentTask);

    tpi->ExtEnqueue(frameTaskID);

    while (!input.m_shouldQuit)
        input.Update(0, true);

    gMustQuit = true;
    tpi->ExtWaitAllFinished();
    tpi->Destroy();

5 TPI Code Examples

    TaskParam* FrameTask(TaskParam* parameter)
    {
        GraphSwap();
        TPI::EnqueueChild(presentTaskID);

        timer.Advance();
        camera->DoMovement(timer.GetDelta());

        for (int t = 0; t < TracePartition::PartitionCount; t++)
            TPI::EnqueueChild(traceTaskID, partitions + t);

        if (gMustQuit)
            TPI::GetCurrentTaskDesc()->SetFinalizer(TASKID_NONE);

        return parameter;
    }

6 Results
- Encountered problems
  - Only a 16-bit frame buffer: color conversion must be done on the CPU,
    but we have not parallelized it
  - gcc often produces faster code than armcc
  - Kernel 2.6.22 + glibc 2.5 + r0p0 MPCore: mutual exclusion via pthread
    mutexes does not work reliably
    - Why is still unclear
    - r1p0 seems to work reliably in the FPGA implementation
- Lessons learned
  - Write testing code even for the basic things
  - Give it time to crash: multi-core races don't occur frequently!

6 Results
- Higher-level C++ abstractions on top of the basic scheduler functions
  - parallel_for_each<>
  - split_and_gather<>
  - parallel_while<>
- In-house test cases using C++ templates
  - Parallel matrix multiplication (parallel_for_each)
  - Fibonacci series (split_and_gather)
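The Fibonacci test case can be sketched in the split-and-gather shape using `std::async` as a stand-in, since the transcript does not show TPI's actual template interface: the problem is split into two child computations and their results are gathered in what plays the role of the finalizer. `std::launch::deferred` keeps this sketch single-threaded; a real scheduler would run the children as stealable tasks:

```cpp
#include <future>

// Split-and-gather sketch of fib(n): split into fib(n-1) and fib(n-2),
// gather by adding the results. Illustrative only -- not TPI's
// split_and_gather<> template.
long fib(int n) {
    if (n < 2) return n;                                 // serial cutoff
    auto left = std::async(std::launch::deferred, fib, n - 1);  // "split"
    long right = fib(n - 2);
    return left.get() + right;                           // "gather"
}
```

Naive parallel Fibonacci is a classic scheduler stress test precisely because it creates an enormous tree of tiny tasks, exposing spawn and gather overhead.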

6 Outlook
- Existing work-stealing alternatives (providing higher-level abstractions)
  - Cilk-5 language: http://supertech.lcs.mit.edu/cilk
  - Hood scheduler (Blumofe 1998)
  - Intel TBB (Open Source now): www.intel.com/software/products/tbb
  - Factory scheduler: http://people.cs.vt.edu/~scschnei/factory
  - Mercury MCF for CELL: http://www.mc.com/
- OS functionality (involves kernel overhead)
  - Linux: tasklets, work queues
  - Windows: work queues

6 Outlook
- What about a lowest common standard?
  - A lower-level task abstraction only needs a couple of functions:
    - Spawning
    - Aborting
    - Simple signaling (e.g. the finalizer scheme)
    - Simple barriers
  - If possible, don't bind it to a specific language
    - Wouldn't a FORTRAN interface be nice for some people?
  - Build custom higher-level algorithms on top of the standard
    (e.g. C++ template constructs)
  - Serve distributed systems (clusters)?
  - Serve heterogeneous systems?

Closing the circle
- Herb Sutter: O(N) parallelization is the key to re-enabling the free
  lunch (3rd August 07, Dr. Dobb's)
- But we don't want a different dress code for every party: a common
  standard would be preferable.

Q & A
wagner@it.neclab.eu

Literature:
- Larry Rudolph: A Simple Load Balancing Scheme for Task Allocation in
  Parallel Machines (1991)
- Robert D. Blumofe, Charles E. Leiserson: Scheduling Multithreaded
  Computations by Work Stealing (1994)
- Robert D. Blumofe, Dionisios Papadopoulos: The Performance of Work
  Stealing in Multiprogrammed Environments (1998)
- Umut A. Acar, Guy E. Blelloch, Robert D. Blumofe: The Data Locality of
  Work Stealing (2000)