Software and the Concurrency Revolution



Software and the Concurrency Revolution

A: The world's fastest supercomputer, with up to 4 processors, 128 MB RAM, and 942 MFLOPS (peak).
Q: What is a 1984 Cray X-MP? (Or a fractional 2005-vintage Xbox.)

2007, Page 1

Truths, Consequences, Futures

The Last Slide, First

- Although driven by hardware changes, the concurrency revolution is primarily a software revolution.
- Parallel hardware is not "more of the same." It is a fundamentally different mainstream hardware architecture, even if the instruction set looks the same. (But using the same ISA does give us a great compatibility/migration story.)
- Software requires the most changes to regain the "free lunch."
  - The concurrency sea change impacts the entire software stack: tools, languages, libraries, runtimes, operating systems. No programming language can ignore it and remain relevant. (Side benefit: responsiveness, the other reason to want async code.)
  - Software is also the gating factor. We will now do for concurrency what we did for OO and GUIs.
- Beyond the gate, hardware concurrency is coming in greater quantity, and sooner, than most people yet believe.

Each Year We Get Faster -> More Processors

[Chart: Intel CPU Trends, from the 386 through the Pentium to Montecito. Sources: Intel, Wikipedia, K. Olukotun.]

- Moore's Law continues. Historically: boost single-stream performance via more complex chips. Now: deliver more cores per chip (+ GPU, NIC, SoC).
- Montecito: 1.7B transistors, at 4.5M transistors per P55C, is roughly 100 P55Cs plus 16 MB of L3 cache.
- The free lunch is over for today's sequential apps, and for many concurrent apps. We need killer apps with lots of latent parallelism.
- A generational advance > OO is necessary to get above the "threads + locks" programming model.

Educational State of the Union: May 2007

- We have achieved general awareness. Most people know that the free lunch is over for sequential applications, and that the future is multicore and manycore.
- But few people really yet believe the magnitude, speed, and gating factor of the change:
  - Magnitude: comparable to the GUI revolution plus moving to a new hardware platform, simultaneously. Enabling manycore affects the entire software stack, from tools to languages to frameworks/libraries to runtimes.
  - Speed: (Recall: Intel could build 100-Pentium chips today if they wanted to.) 100-way hardware concurrency could be available in commodity desktops as soon as the year
  - Gating factor: when manycore-exploiting software is available.

Truths, Consequences, Futures

The Issue Is (Mostly) On the Client

"Solved": server apps (e.g., DB servers, web services).
- Lots of independent requests: one thread per request is easy.
- Typical to execute many copies of the same code.
- Shared data usually via structured databases: automatic, implicit concurrency control via transactions.
- With some care, the concurrency problem is already "solved" here.

Not solved: typical client apps (i.e., not Photoshop).
- Somehow employ many threads per user request.
- Highly atypical to execute many copies of the same code.
- Shared data in unstructured shared memory: error-prone explicit locking. Where are the transactions?

A Tale of Six Software Waves

Each was born during 1958-73, bided its time until the 1990s/2000s, then took 5+ years to build a mature tool/language/framework/runtime ecosystem.

[Timeline: the GUIs, Objects, GC, Generics, Net, and Concurrency waves maturing across 1990-2012.]

Comparative Impacts Across the Stack

[Table: rows are layers of the stack (application programming model; libraries and frameworks; languages and compilers; runtimes and OS; tools for design, measurement, and test); columns are the waves (GUIs, Objects, GC, Generics, Web, Concurrency); each cell rates extent of impact and/or enabling importance. Legend: Some = could drive one major product release; Major = could drive more than one major product release; New way of thinking / essential enabler = multiple major releases. The graphical per-cell ratings did not survive transcription.]

O(1), O(K), or O(N) Concurrency?

1. Sequential apps. The free lunch is over (if CPU-bound): flat or merely incremental performance improvements, and potentially poor responsiveness.
2. Explicitly threaded apps. A hardwired number of threads that prefer K CPUs (for a given input workload). Can penalize < K CPUs; doesn't scale > K CPUs.
3. Scalable concurrent apps. Workload decomposed into a "sea" of heterogeneous chores. Lots of latent concurrency we can map down to N cores.

An "OO" for Concurrency

[Diagram: a layered progression, by analogy with sequential programming's climb from Fortran/C/asm up to OO; today's concurrency layers are semaphores and threads + locks, with the question of what sits above them.]

Three Pillars of the Dawn: A Framework for Evaluating Concurrency

Pillar 1: Asynchronous Agents
- Summary: tasks that run independently and communicate via messages.
- Examples: GUIs, background printing, disk/net access.
- Key metric: responsiveness.
- Requirements: isolation, messaging.
- Today's abstractions: threads, message queues.
- Possible new abstractions: active objects, futures.

Pillar 2: Concurrent Collections
- Summary: operations on groups of things; exploit parallelism in data and algorithm structures.
- Examples: trees, quicksort, compilation.
- Key metrics: throughput, manycore scalability.
- Requirements: low overhead.
- Today's abstractions: thread pools, OpenMP.
- Possible new abstractions: chores, futures, parallel STL, PLINQ.

Pillar 3: Mutable Shared State
- Summary: avoid races by synchronizing mutable objects in shared memory.
- Examples: locked data (99%), lock-free libraries (written by wizards).
- Key metrics: race-free, deadlock-free.
- Requirements: composability.
- Today's abstractions: locks.
- Possible new abstractions: transactional memory, declarative support for locks.

The Trouble With Locks

How may I harm thee? Let me count the ways:

    {
        Lock lock1( mutTable1 );
        Lock lock2( mutTable2 );
        table1.erase( x );
        table2.insert( x );
    }

Locks are the best we have, and known to be inadequate:
- Which lock? The connection between a lock and the data it protects exists primarily in the mind of a programmer.
- Deadlock? If the mutexes aren't acquired in the same order.
- Simplicity or scalability? Coarse-grained locks are simpler to use correctly, but easily become bottlenecks.
- Lost wake-ups? Blocking typically uses condition variables, and it's easy to forget to signal the correct ones. (Example coming up.)
- Not composable. In today's world, this is a deadly sin.

Enter Transactional Memory

A transactional programming model:

    atomic {
        table1.erase( x );
        table2.insert( x );
    }

Idea: version memory like a database. Automatic concurrency control, with rollback and retry for competing transactions.

Benefits:
- No need to remember which lock to take.
- No deadlock: no need to remember a sync-order discipline.
- Both fine-grained and scalable, without sacrificing correctness.
- No wake-up calls, as atomic blocks are automatically re-run.
- Best of all: composable. Can once again build a big (correct) program out of smaller (correct) programs.

Drawbacks: still active research. And the elephant in the room is I/O (more later).

Enter Transactional Memory (2)

What if we want to block if x isn't in table1 yet? Today, we'd typically use a wake-up: wait on a condition variable, and have every thread inserting into table1 remember to signal the right cv. Tedious, error-prone, etc. Instead:

    atomic {
        if( table1.find( x ) == table1.end() )
            retry;
        table1.erase( x );
        table2.insert( x );
    }

retry restarts the current atomic block from the beginning. Blocking is entirely modular: the call to retry might be deeply buried inside the implementation of debit, or of credit, or both. Can use the transaction log to defer re-running until at least one memory location read by the transaction has changed.

A 4-Step Program For the Lock Addiction

Goal: greatly reduce locks. (Alas, not eliminate them.)

1. Enable transactional programming: transactional memory is our best hope. Composable atomic { } blocks. Naturally enables speculative execution. (The elephant: allowing I/O. The Achilles' heel: some resources are not transactable.)
2. Abstractions to reduce "shared": messages, futures, private data (e.g., active objects).
3. Techniques to reduce "mutable": immutable objects, internally versioned objects.
4. Some locks will remain. Let the programmer declare: which shared objects are protected by which locks, and lock hierarchies (caveat: also not composable).

An "OO" for Concurrency

[Diagram: the layered progression again, now filled in. Atop threads + locks and semaphores sit atomic { } blocks, locks that declare intent, active objects, chores, and futures, and parallel algorithms: the concurrency analogue of the climb from Fortran/C/asm to OO.]

Truths, Consequences, Futures

A Good Question: How Many Cores Are You Coding For?

[Chart: projected out-of-order (OoO) cores per chip, 2006-2013: roughly 4, then 8, 16, and 32.]

Future Many-Core Systems

- Complex cores (OoO: out-of-order execution, prediction, pipelining, speculation) vs. simple cores (InO: in-order).
- Design options: all large cores, all small cores, or mixed large & small cores (for legacy code & Amdahl's Law). "The truth is somewhere in here."

A Better Question: How Many Hardware Threads Are You Coding For?

[Chart: projected counts per chip, 2006-2013, for OoO cores (4 up to 32), InO cores, and InO hardware threads (up to the 256-512 range).]

The First Slide, Last

- Although driven by hardware changes, the concurrency revolution is primarily a software revolution.
- Parallel hardware is not "more of the same." It is a fundamentally different mainstream hardware architecture, even if the instruction set looks the same. (But using the same ISA does give us a great compatibility/migration story.)
- Software requires the most changes to regain the "free lunch."
  - The concurrency sea change impacts the entire software stack: tools, languages, libraries, runtimes, operating systems. No programming language can ignore it and remain relevant. (Side benefit: responsiveness, the other reason to want async code.)
  - Software is also the gating factor. We will now do for concurrency what we did for OO and GUIs.
- Beyond the gate, hardware concurrency is coming in greater quantity, and sooner, than most people yet believe.

For More Information

- "The Free Lunch Is Over" (Dr. Dobb's Journal, March 2005). http://www.gotw.ca/publications/concurrency-ddj.htm. The article that first used the terms "the free lunch is over" and "concurrency revolution" to describe the sea change.
- "Software and the Concurrency Revolution" (with Jim Larus; ACM Queue, September 2005). http://acmqueue.com/modules.php?name=content&pa=showpage&pid=332. Why locks, functional languages, and other silver bullets aren't the answer, and observations on what we need for a great leap forward in languages and also in tools.