INF5063: Programming heterogeneous multi-core processors Introduction

INF5063: Programming heterogeneous multi-core processors Introduction 28/8-2009

Overview
- Course topic and scope
- Background for using heterogeneous multi-core processors for parallel processing
- Examples of heterogeneous architectures

INF5063: The Course

People
- Håvard Espeland (TA), email: haavares @ ifi
- Håkon Kvale Stensland (TA), email: haakonks @ ifi
- Carsten Griwodz, email: griff @ ifi
- Pål Halvorsen, email: paalh @ ifi

Time and place
- Lectures: Fridays 13.15-15.00, 3B
- NB! The web page states that there are group exercises on Thursdays 10.15-12.00, 3B. There will NOT be any weekly exercises; this slot is reserved for work on your mandatory assignments (we will NOT be there).

About INF5063: Topic & Scope
Content: the course gives
- an overview of heterogeneous multi-core processors in general and three variants in particular (architectures and use)
- an introduction to working with heterogeneous multi-core processors:
  - Intel IXP2400 network processor card
  - NVIDIA's G80 family of GPUs and the CUDA programming framework
  - the Cell Broadband Engine Architecture
- some ideas of how to use/program heterogeneous multi-core processors

About INF5063: Topic & Scope
Tasks: the important part of the course is the lab assignments, where you program each of the three examples of heterogeneous multi-core processors.
3 graded home exams (counting 33% each). For each one you deliver code, give a demonstration, and explain your design and code to the class:
1. On the Intel IXP: multi-link proxy. Download and run wwpingbump, then extend it to be a multi-link proxy.
2. On the Cell processor: video encoding. Download Motion JPEG compression software and improve its performance by using the Cell processor's SPEs.
3. On the NVIDIA graphics cards: video encoding. Same goal as above, but exploit the parallelism of the G80 architecture.

Available Resources
- Resources will be placed at http://www.ifi.uio.no/~griff/inf5062
- Login: inf5062
- Password: ixp
- Manuals, papers, code examples

Background and Motivation: Moore's Law

Motivation: Intel View
- Soon >billion transistors integrated
- Clock frequency can still increase
- Future applications will demand TIPS (tera-instructions per second)
- Power? Heat?

Motivation
- Future applications will demand TIPS
- Think platform beyond a single processor
- Exploit concurrency at multiple levels
- Power will be the limiter due to complexity and leakage
- Distribute workload on multiple cores

Background and Motivation: Symmetric multi-processing

Symmetric Multi-Core Processors Phenom X4

Symmetric Multi-Core Processors UltraSparc

Intel Multi-Core Processors
- Symmetric multi-processors allow multi-threaded applications to achieve higher performance with less die area and power consumption than single-core processors

Symmetric Multi-Core Processors
- Good
  - Growing computational power
- Problematic
  - Growing die sizes
  - Some cores used much more than others
  - Individual cores frequently unused
  - Many core parts frequently unused
- Why not spread the load better?
  - Functions exist only once per core
  - Parallel programming is hard
=> Asymmetric multi-core processors

Asymmetric Multi-Core Processors
- Asymmetric multi-processors consume power and provide increased computational power only on demand
[Figure labels: highly parallel, moderately parallel, sequential]

Multi-Core Processors
- Operating systems scale only to a limited number of threads
- Where does the increase in core numbers stop?
  - How many applications need more than 64 threads?
  - Performance? The issue is software limits
- Application-specific engines?
  - Intel Academic Forum 2006: YES
  - But: programming model?
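The question of how many applications can use more than 64 threads can be illustrated with Amdahl's law. The law is not named on the slide; it is brought in here as a common way to reason about the limit, and the parallel fractions below are made-up values chosen for illustration only.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Upper bound on speedup under Amdahl's law.

    parallel_fraction: share of single-core run time that can be parallelized (0..1).
    cores: number of identical cores working on the parallel part.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Illustrative fractions only; they are not measurements from the course.
    for fraction in (0.50, 0.90, 0.99):
        row = ", ".join(
            f"{cores} cores: {amdahl_speedup(fraction, cores):.1f}x"
            for cores in (4, 64, 1024)
        )
        print(f"parallel fraction {fraction:.2f} -> {row}")
```

Under this model a program that is 90% parallel gains less than 9x on 64 cores and can never exceed 10x, which is one way to read the slide's remark that the issue is software limits.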

Motivation
- Future applications will demand TIPS
- Think platform beyond a single processor
- Exploit concurrency at multiple levels
- Power will be the limiter due to complexity and leakage
- Distribute workload on multiple cores
  + simple processors are easier to program
  + consume less energy
=> heterogeneous multi-core processors

Background and Motivation: Heterogeneous multi-processors

Co-Processors
- The original IBM PC included a socket for an Intel 8087 floating-point co-processor (FPU)
  - 50-fold speed-up of floating-point operations
- Intel kept the co-processor up to the i486
  - the 486DX contained an optimized i487 block
  - still a separate pipeline (pipeline flush when starting and ending use)
  - communication over an internal bus
- The Commodore Amiga was one of the earlier machines that used multiple processors
  - Motorola 680x0 main processor
  - Blitter (block image transferrer: moving data, fill operations, line drawing, performing boolean operations)
  - Copper (co-processor: changes the address for video RAM on the fly)
  - And finally: IBM PowerPC, relegating the 680x0 to a co-processor job

Review of General Data Path on Conventional Computer Hardware Architectures
[Figure: data paths for sending, receiving, and forwarding hosts. Layers shown: application (user space); transport (TCP/UDP), network (IP), and link layers of the communication system (kernel space).]

IXA: Internet Exchange Architecture
- IXA: a broad term describing the Intel network architecture (HW & SW, control and data plane)
- IXP: Internet Exchange Processor, a processor that implements IXA
  - IXP1200 is the first IXP chip (4 versions)
  - IXP2xxx has now replaced the first version

IXA: Internet Exchange Architecture
IXP1200 basic features:
- 1 embedded 232 MHz StrongARM
- 6 packet µengines at 232 MHz
- onboard memory
- 4 x 100 Mbps Ethernet ports
- multiple, independent buses
- low-speed serial interface
- interfaces for external memory and I/O buses

IXA: Internet Exchange Architecture
IXP2400 basic features:
- 1 embedded 600 MHz XScale
- 8 packet µengines at 600 MHz
- onboard memory
- 3 x 1 Gbps Ethernet ports
- multiple, independent buses
- low-speed serial interface
- interfaces for external memory and I/O buses

IXP1200 Architecture
- RISC processor:
  - StrongARM running Linux
  - control, higher-layer protocols and exceptions
  - 232 MHz
- Access units:
  - coordinate access to external units
- Scratchpad:
  - on-chip memory
  - used for IPC and synchronization
- Microengines:
  - low-level devices with a limited set of instructions
  - transfers between memory devices
  - packet processing
  - 232 MHz

IXP1200
[Figure: IXP1200 block diagram. Components shown: embedded RISC CPU (StrongARM); microengines 1-6; SCRATCH memory; SRAM access with SRAM bus (SRAM, FLASH, memory-mapped I/O); SDRAM access with DRAM bus (DRAM); PCI access with PCI bus; IX access with IX bus; multiple independent internal buses.]

IXP2400 Architecture
Changes relative to the IXP1200:
- Coprocessors:
  - hash unit
  - 4 timers
  - general-purpose I/O pins
  - external JTAG connections (in-circuit tests)
  - several bulk ciphers (IXP2850 only)
  - checksum (IXP2850 only)
- RISC processor:
  - StrongARM replaced by XScale
  - 232 MHz increased to 600 MHz
- Slowport:
  - shared interface to external units
  - used for FlashROM during bootstrap
- Media Switch Fabric (MSF):
  - forms the fast path for transfers
  - interconnect for several IXP2xxx
- Microengines:
  - 6 increased to 8
  - 232 MHz increased to 600 MHz
- Receive/transmit buses:
  - shared bus replaced by separate receive and transmit buses
[Figure: IXP2400 block diagram. Components shown: embedded RISC CPU (XScale); microengines 1-8; SCRATCH memory; SRAM access with SRAM buses; SDRAM access with DRAM bus; PCI access with PCI bus; slowport access (FLASH, coprocessor); MSF access; multiple independent internal buses.]

Graphics Processing Units (GPUs)
- GPU: a dedicated graphics rendering device
- First GPUs, from 2D to 3D:
  - 80s: for early 2D operations; Amiga and Atari used a blitter, and the Amiga also had the copper
  - 90s: 3D hardware for game consoles like PS and N64; 3dfx Voodoo 3D add-on cards for PCs
[Figure label: bus connector & memory hub]

Graphics Processing Units (GPUs)
- New powerful GPUs, e.g., NVIDIA GeForce GTX 280:
  - 240 cores at 1476 MHz
  - 1 GB memory
  - memory bandwidth: 159 GB/sec
  - PCI Express 2.0
- Other manufacturers offer similar products
[Figure label: bus connector & memory hub]

General Purpose Computing on GPU
- The
  - high arithmetic precision
  - extreme parallel nature
  - optimized, special-purpose instructions
  - available resources
  of the GPU allow general, non-graphics-related operations to be performed on the GPU
- Generic computing workload is off-loaded from the CPU to the GPU
- More generically: heterogeneous multi-core processing

NVIDIA GT200
NVIDIA GeForce GTX 285:
- 1.4 billion transistors
- 240 shaders
- 512-bit memory bus (GDDR3)
- 159 GB/sec memory bandwidth
- 933 Gflops
- PCI Express 2.0
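The bandwidth and Gflops figures on this slide can be related to the other numbers with two simple peak-rate formulas. This is a rough sketch, not a vendor formula: the effective memory transfer rate (2484 million transfers per second) and the 3 floating-point operations per shader per cycle are assumptions that do not appear on the slides, and the function names are hypothetical.

```python
def memory_bandwidth_gb_s(bus_width_bits: int, transfers_per_second: float) -> float:
    """Peak memory bandwidth in GB/s: bytes per transfer times transfers per second."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * transfers_per_second / 1e9


def peak_gflops(shaders: int, shader_clock_hz: float, flops_per_cycle: int) -> float:
    """Peak single-precision rate in Gflops, assuming every shader is busy every cycle."""
    return shaders * shader_clock_hz * flops_per_cycle / 1e9


if __name__ == "__main__":
    # 512-bit bus from the slide; 2484e6 transfers/s is an assumed GDDR3 rate.
    print(f"bandwidth: {memory_bandwidth_gb_s(512, 2484e6):.1f} GB/s")
    # 240 shaders from the slide; 3 flops per cycle is an assumption.
    print(f"peak at 1296 MHz: {peak_gflops(240, 1296e6, 3):.1f} Gflops")
    print(f"peak at 1476 MHz: {peak_gflops(240, 1476e6, 3):.1f} Gflops")
```

Under these assumptions the 159 GB/sec figure follows from the 512-bit bus, and 933 Gflops corresponds to a 1296 MHz shader clock; the 1476 MHz clock quoted on the previous slide would give roughly 1063 Gflops with the same model.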

NVIDIA GT200
Stream Multiprocessors (SMs): the fundamental thread block unit
- 8 stream processors (SPs) (scalar ALUs for threads)
- 2 super function units (SFUs) (cos, sin, log, ...)
- 8 32 KB local register files (RFs)
- 16 KB level 1 cache
- 64 KB shared memory
- 256 KB global level 2 cache
Number of stream multiprocessors:
- 1: Quadro NVS 130M
- 16: GeForce 8800 GTX
- 30: GeForce GTX 285
- 4x30: Tesla S1070
[Figure: Streaming Multiprocessor (SM) block diagram. Components shown: instruction fetch, instruction L1 cache, thread/instruction dispatch, shared memory, SP 0-7 with RF 0-7, two SFUs, constant L1 cache, texture load, load from memory, store to memory.]
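The SM counts on this slide tie back to the shader counts quoted elsewhere: with 8 SPs per SM, 30 SMs give the 240 shaders of the GeForce GTX 285. A small sketch of that arithmetic follows; the function name is hypothetical, and the 8-SPs-per-SM figure is taken from the slide and applied only to the three devices used below.

```python
SPS_PER_SM = 8  # stream processors per SM, as stated on the slide


def total_stream_processors(sm_count: int, gpus: int = 1, sps_per_sm: int = SPS_PER_SM) -> int:
    """Total scalar stream processors across all SMs of all GPUs in a device."""
    if sm_count < 1 or gpus < 1 or sps_per_sm < 1:
        raise ValueError("all counts must be at least 1")
    return gpus * sm_count * sps_per_sm


if __name__ == "__main__":
    devices = {
        "GeForce 8800 GTX": (16, 1),
        "GeForce GTX 285": (30, 1),
        "Tesla S1070": (30, 4),  # listed on the slide as 4x30 SMs
    }
    for name, (sm_count, gpus) in devices.items():
        print(f"{name}: {total_stream_processors(sm_count, gpus)} SPs")
```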

Memory Bandwidth for CPU and GPU Marketed as GPGPUs

STI (Sony, Toshiba, IBM) Cell
- Motivation for the Cell
  - Cheap processor
  - Energy efficient
  - For games and media processing
  - Short time-to-market
- Conclusion
  - Use a multi-core chip
  - Design around an existing, power-efficient design
  - Add simple cores specific for game and media processing requirements

STI (Sony, Toshiba, IBM) Cell
- Cell is a 9-core processor combining a light-weight general-purpose processor with multiple co-processors into a coordinated whole
- Power Processing Element (PPE)
  - conventional Power processor
  - not supposed to perform all operations itself; acts like a controller
  - runs conventional OSes
  - 16 KB instruction/data level 1 cache
  - 512 KB level 2 cache

STI (Sony, Toshiba, IBM) Cell
- Synergistic Processing Elements (SPEs)
  - specialized co-processors for specific types of code, i.e., very high performance vector processors
  - local stores
  - can do general-purpose operations
  - the PPE can start, stop, interrupt and schedule processes running on an SPE
- Element Interconnect Bus (EIB)
  - internal communication bus
  - connects the on-chip system elements:
    - PPE & SPEs
    - the memory controller (MIC)
    - two off-chip I/O interfaces
  - 25.6 GBps each way

STI (Sony, Toshiba, IBM) Cell
- Memory controller
  - Rambus XDRAM interface to Rambus XDR memory
  - dual channels at 12.8 GBps each, 25.6 GBps in total
- I/O controller
  - Rambus FlexIO interface, which can be clocked independently
  - dual configurable channels
  - maximum ~76.8 GBps
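The memory-controller figure is the sum of the two XDR channels. As a minimal sketch of what that peak rate means in practice, the snippet below also estimates how long a transfer would take at the aggregate rate; the 1 GB transfer size is an arbitrary illustrative value, the function names are hypothetical, and real transfers would not reach the peak rate.

```python
def aggregate_bandwidth_gbps(channels: int, per_channel_gbps: float) -> float:
    """Peak bandwidth in GBps when all channels transfer in parallel."""
    if channels < 1 or per_channel_gbps <= 0:
        raise ValueError("channels and per-channel bandwidth must be positive")
    return channels * per_channel_gbps


def transfer_time_ms(size_gb: float, bandwidth_gbps: float) -> float:
    """Best-case time in milliseconds to move size_gb at the given peak bandwidth."""
    if bandwidth_gbps <= 0:
        raise ValueError("bandwidth must be positive")
    return size_gb / bandwidth_gbps * 1000.0


if __name__ == "__main__":
    xdr = aggregate_bandwidth_gbps(channels=2, per_channel_gbps=12.8)  # values from the slide
    print(f"XDR memory bandwidth: {xdr:.1f} GBps")
    # 1 GB is an arbitrary example size, not a figure from the slides.
    print(f"best-case time for 1 GB: {transfer_time_ms(1.0, xdr):.1f} ms")
```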

STI (Sony, Toshiba, IBM) Cell
- Cell has in essence traded running everything at moderate speed for the ability to run certain types of code at high speed
- Used for example in:
  - Sony PlayStation 3:
    - 3.2 GHz clock
    - 7 SPEs for general operations
    - 1 SPE for security for the OS
  - Toshiba home cinema:
    - decoding of 48 HDTV MPEG streams
    - dozens of thumbnail videos simultaneously on screen
  - IBM blade centers:
    - 3.2 GHz clock
    - Linux 2.6.11

The End: Summary
- Heterogeneous multi-core processors are already everywhere
- Challenge: programming
  - Need to know the capabilities of the system
  - Different abilities in different cores
  - Memory bandwidth
  - Memory sharing efficiency
  - Need new methods to program the different components
- Next time: how to start programming the Intel IXP