Multi-core Systems What can we buy today?

Similar documents
Parallel Programming Survey

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

Creative Commons Attribution-Share 3.0 United States License

Graphics Cards and Graphics Processing Units. Ben Johnstone Russ Martin November 15, 2011

Parallel Algorithm Engineering

Multi-Threading Performance on Commodity Multi-Core Processors

OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC

A Survey on ARM Cortex A Processors. Wei Wang Tanima Dey

Multicore Processor and GPU. Jia Rao Assistant Professor in CS

Multi-core architectures. Jernej Barbic , Spring 2007 May 3, 2007

Improving System Scalability of OpenMP Applications Using Large Page Support

Programming Techniques for Supercomputers: Multicore processors. There is no way back Modern multi-/manycore chips Basic Compute Node Architecture

LS DYNA Performance Benchmarks and Profiling. January 2009

A Study on the Scalability of Hybrid LS-DYNA on Multicore Architectures

Enabling Technologies for Distributed Computing

12 rdzeni w procesorze - nowa generacja platform serwerowych AMD Opteron Maj Krzysztof Łuka krzysztof.luka@amd.com

Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) mzahran@cs.nyu.edu

Thread level parallelism

Introduction to Microprocessors

GPUs for Scientific Computing

ECLIPSE Performance Benchmarks and Profiling. January 2009

OPENSPARC T1 OVERVIEW

Trends in High-Performance Computing for Power Grid Applications

This Unit: Putting It All Together. CIS 501 Computer Architecture. Sources. What is Computer Architecture?

Enabling Technologies for Distributed and Cloud Computing

x64 Servers: Do you want 64 or 32 bit apps with that server?

Thread Level Parallelism II: Multithreading

Family 10h AMD Phenom II Processor Product Data Sheet

Generations of the computer. processors.

OpenSPARC T1 Processor

AMD PhenomII. Architecture for Multimedia System Prof. Cristina Silvano. Group Member: Nazanin Vahabi Kosar Tayebani

AMD Opteron Quad-Core

Putting it all together: Intel Nehalem.

Memory Architecture and Management in a NoC Platform

Multicore Processors A Necessity By Bryan Schauer

Introduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1

Supercomputing Status und Trends (Conference Report) Peter Wegner

SPARC64 VIIIfx: CPU for the K computer

Measuring Cache and Memory Latency and CPU to Memory Bandwidth

SOC architecture and design

Embedded Parallel Computing

Muse Server Sizing. 18 June Document Version Muse

Overview of HPC Resources at Vanderbilt

Introduction to GPU Architecture

Next Generation GPU Architecture Code-named Fermi

GPU System Architecture. Alan Gray EPCC The University of Edinburgh

Performance Impacts of Non-blocking Caches in Out-of-order Processors

HPC with Multicore and GPUs

Introduction to GPGPU. Tiziano Diamanti

COMPUTER ORGANIZATION ARCHITECTURES FOR EMBEDDED COMPUTING

Understanding Hardware Transactional Memory

Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging

NIAGARA: A 32-WAY MULTITHREADED SPARC PROCESSOR

Quad-Core Intel Xeon Processor

AMD Opteron Processor Pricing

Scheduling Task Parallelism" on Multi-Socket Multicore Systems"

Embedded Systems: map to FPGA, GPU, CPU?

22S:295 Seminar in Applied Statistics High Performance Computing in Statistics

Lecture 1. Course Introduction

SUN ORACLE EXADATA STORAGE SERVER

Industry First X86-based Single Board Computer JaguarBoard Released

UltraSPARC T1: A 32-threaded CMP for Servers. James Laudon Distinguished Engineer Sun Microsystems james.laudon@sun.com

Scaling in a Hypervisor Environment

A quick tutorial on Intel's Xeon Phi Coprocessor

Choosing a Computer for Running SLX, P3D, and P5

Design Cycle for Microprocessors

COMP 422, Lecture 3: Physical Organization & Communication Costs in Parallel Machines (Sections 2.4 & 2.5 of textbook)

ECLIPSE Best Practices Performance, Productivity, Efficiency. March 2009

Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi

On-Chip Interconnection Networks Low-Power Interconnect

<Insert Picture Here> T4: A Highly Threaded Server-on-a-Chip with Native Support for Heterogeneous Computing

Why Computers Are Getting Slower (and what we can do about it) Rik van Riel Sr. Software Engineer, Red Hat

Seeking Opportunities for Hardware Acceleration in Big Data Analytics

Virtuoso and Database Scalability

Scalability evaluation of barrier algorithms for OpenMP

Symmetric Multiprocessing

FPGA-based Multithreading for In-Memory Hash Joins

Application Performance Analysis of the Cortex-A9 MPCore

Evoluzione dell Infrastruttura di Calcolo e Data Analytics per la ricerca

Parallel Computing 37 (2011) Contents lists available at ScienceDirect. Parallel Computing. journal homepage:

Intel Pentium 4 Processor on 90nm Technology

Why the Network Matters

Advanced Micro Devices - AMD

A Practical Approach to Compiling Multipleiversion and Embracing Failure

Performance Evaluation of 2D-Mesh, Ring, and Crossbar Interconnects for Chip Multi- Processors. NoCArc 09

XTM Web 2.0 Enterprise Architecture Hardware Implementation Guidelines. A.Zydroń 18 April Page 1 of 12

AMD and SAP. Linux Alliance Overview. Key Customer Benefits

Oracle Database Reliability, Performance and scalability on Intel Xeon platforms Mitch Shults, Intel Corporation October 2011

A Quantum Leap in Enterprise Computing

Performance Characteristics of Large SMP Machines

Intel Itanium Quad-Core Architecture for the Enterprise. Lambert Schaelicke Eric DeLano

Making Multicore Work and Measuring its Benefits. Markus Levy, president EEMBC and Multicore Association

Chip Multithreading: Opportunities and Challenges

Chapter 2 Parallel Computer Architecture

Experiences With Mobile Processors for Energy Efficient HPC

Energy-Efficient, High-Performance Heterogeneous Core Design

Legal Notices Introduction... 3

Multi-core and Linux* Kernel

ST810 Advanced Computing

Overview. CPU Manufacturers. Current Intel and AMD Offerings

Transcription:

Multi-core Systems What can we buy today? Ian Watson & Mikel Lujan Advanced Processor Technologies Group COMP60012 Future Multi-core Computing 1 A Bit of History AMD Opteron introduced in 2003 Hypertransport On-chip memory controller In 2006 Sun Niagara I: 8 cores/32 thread & 4 on-chip memory controllers Intel & AMD: dual core IBM: dual core & Cell Azul: 24 cores In 2007 Sun Niagara II: 8 cores/64 threads Intel quad-core vs. AMD quad-core (Barcelona) Azul: 48 cores In 2009 Sun Victoria Falls: 4 Niagara II+ (256 threads) - Server (web, DBs) Intel Nehalem (core i7 & Xeon) vs. AMD quad-core (Shanghai) six-core (Istanbul) - Desktop/workstations/Server ARM: Cortex A9 4 cores - low-power embedded systems (e.g. mobile phones) Nvidia: Tesla (HPC), GTX 2 1

A bit of history (2) 3 Is cooling a problem? 4 2

Problems: heat & memory wall Processor utilization (15%-25%) 5 Chip Multi-Threading Towards Niagara 6 3

Niagara SPARC architecture 8 cores (1GHz - 1.4GHz) 4 hardware threads per core Banked L2 cache 4 on-chip memory controllers 23GB/s off-chip bandwidth One FPU Crossbar 7 Niagara Floorplan 378 mm 2 die 90nm process ~300M transistors Single core 11 mm 2 Multi-Threading 20% extra area 8 4

Niagara Core Pipeline 9 Niagara Memory System Instruction cache 16kB, 4-way set associative, 32B line size Data cache 8kB, 4-way set associative, 16B line size Write-through cache, write-around on miss L2 cache 3 MB, 12-way set associative, 64B line size Write-through 4-way banked by line Pseudo-random eviction policy Coherency Data cache lines have 2 state: valid or invalid Data cache is write through -> no modified state Crossbar enforces ordering according to memory consistency Caches kept coherent by tracking lines in directories in L2 Latencies Load-to-use: 3 cycles L2 latency: 22 cycles Memory latency: ~90 ns 10 5

Thread Selection Policy 11 Towards Niagara 2 Goal: Double throughput versus Niagara 1 staying within the same power envelop 12 6

Modifications in Niagara 2 Double number of threads from 4 to 8 Choose 2 threads out of 8 to execute each cycle Double execution units from 1 to 2 and add FPU Double set associativity of L1 instruction cache to 8- way Double size of fully associative DTLB from 64 to 128 entries Double L2 banks from 4 to 8 Hardware tablewalk for ITLB and DTLB misses One extra stage in the pipeline (select 2 threads out of 8) 13 Niagara 2 Floorplan (65nm &342mm 2 ) 14 7

Niagara 2 New Elements 8 threads per core Two integer pipes per core One fully-pipelined FPU per core 60+GB/s off-chip 4MB L2$ (8 banks) 16 way Security co-processor per core On-chip 15 Sun Microsystems Open Source CMP http://www.opensparc.net OpenSPARC T1 and T2 Verilog available (RTL) Book available Webcast of talks More information on support for Virtualization, RAS 16 8

A Bit of History AMD Opteron introduced in 2003 Hypertransport On-chip memory controller In 2006 Sun Niagara I: 8 cores/32 thread & 4 on-chip memory controllers Intel & AMD: dual core IBM: dual core & Cell Azul: 24 cores In 2007 Sun Niagara II: 8 cores/64 threads Intel quad-core vs. AMD quad-core (Barcelona) Azul: 48 cores In 2009 Sun Victoria Falls: 4 Niagara II+ (256 threads) - Server (web, DBs) Intel Nehalem (core i7 & Xeon) vs. AMD quad-core (Shanghai) six-core (Istanbul) - Desktop/workstations/Server ARM: Cortex A9 4 cores - low-power embedded systems (e.g. mobile phones) Nvidia: Tesla (HPC), GTX 17 AMD Barcelona (65nm, <3GHz) 18 9

AMD Barcelona L1: 2-way set associative, LRU replacement, block size 64 bytes, split I & D, write-back & write-allocate, 3 cycles latency L2: idem. except 9 cycles latency L3: idem. except evict block shared by fewest core and 34 cycles latency 19 AMD Shanghai (45nm, <2.6GHz) L3: 48-way set associative and 29 cycles latency 20 10

Intel Tick Tock 21 Core i7 (Nehalem, 2.6GHz -3.2GHz) 2 Simultaneous Multi-Threading per core 22 11

Nehalem 23 Nehalem Caches L1: split D$ & I$, 32KB each, 4-way I$ & 8-way set associative, approx. LRU, block size 64 bytes, write-back & write-allocate L2: 8-way set associative, idem. L3: 16-way set associative, idem 24 12

NUMA 25 Other things not discussed Sun, AMD & Intel Throttling to reduce power Support for VMM New instructions 26 13

A Bit of History AMD Opteron introduced in 2003 Hypertransport On-chip memory controller In 2006 Sun Niagara I: 8 cores/32 thread & 4 on-chip memory controllers Intel & AMD: dual core IBM: dual core & Cell Azul: 24 cores In 2007 Sun Niagara II: 8 cores/64 threads Intel quad-core vs. AMD quad-core (Barcelona) Azul: 48 cores In 2009 Sun Victoria Falls: 4 Niagara II+ (256 threads) - Server (web, DBs) Intel Nehalem (core i7 & Xeon) vs. AMD quad-core (Shanghai) six-core (Istanbul) - Desktop/workstations/Server ARM: Cortex A9 4 cores - low-power embedded systems (e.g. mobile phones) Nvidia: Tesla (HPC), GTX 27 Mobile phone Texas Instrument OMAP 44x MPSoC 28 14

ARM Cortex A9 MPCore 29 Atomic Instructions Compare-And-Swap Niagara, Intel & AMD Load Linked & Store Conditional ARM & IBM Power 30 15

2010 New Systems 31 2010 New Systems AMD 12-core Opteron Sun Microsystems - Niagara 3 (or RainbowFalls) 16-cores (8 or 16 threads per core?) Nvidia Fermi 512 Cuda cores ARM Cortex A9 at 2GHz 32 16

Looking at the future 33 Papers for Next Week Simultaneous multithreading: maximizing onchip parallelism. ISCA 1995. http://doi.acm.org/10.1145/223982.224449 The SGI Origin: a ccnuma highly scalable server. ISCA 1997. http://doi.acm.org/10.1145/264107.264206 Java Tutorial: Concurrency http://java.sun.com/docs/books/tutorial/essential/c oncurrency/ Remember to email the slides before 14:00 13 th November 2009 34 17