Architecture of Hitachi SR-8000
University of Stuttgart, High-Performance Computing-Center Stuttgart (HLRS), www.hlrs.de
Most of the slides are from Hitachi.

The Problem: modern computers are data movers
- Floating-point performance increases by about 60% per year (Moore's law)
- Memory access is the bottleneck: the bandwidth is too small
- Latencies are decreasing slowly in absolute time, but increasing when counted in clocks

Latency from the processor to various locations:

  Location        Time         Clocks    Future development
  clock rate      1-4 nsec               decreasing
  register                     1
  L1 cache                     1-2
  L2 cache                     4-8
  L3 cache                     8-20
  memory          30-200 nsec  30-200    slowly decreasing
  remote memory                500-700   slowly decreasing
  shmem           ~3 µsec                slowly decreasing
  MPI             6-20 µsec              slowly decreasing
  disk            ~8 msec
  tape            ~30 sec

Handling of Latencies (Behandlung von Latenzzeiten)
- Hiding latency is necessary
- Different possibilities:
  - Prefetching and preloading from own and remote memory
  - Hardware multithreading
  - Decreasing the number of transactions by pipelining
  - Processor in memory (IRAM)

Prefetch example:

  do n = 1, nmax
     ! prefetch b(n+100), c(n+100)   ! compiler generated
     a(n) = b(n) + c(n)
  enddo

Handling of latencies (continued)
- The compiler can hide latency by prefetching variables, but only in a semantically clear situation
- Modern processors support prefetching
- Explicit loops allow prefetching, because future addresses can be calculated directly
- Branches are dangerous for prefetching: if the wrong data are prefetched, they waste bandwidth; avoid unnecessary branches
- Calls break compiler optimization: avoid frequent calls

[Photo: a 16-processor SR8000]

System Architecture SR8000
[Diagram: an I/O node (ION) and processing nodes (PRN), each containing CPUs, system control, and main memory, are connected by a cross-bar inter-node network. The I/O node attaches via PCI and network control to Ether/ATM/HPI, a RAID disk, and a service processor with console.]

Hitachi SR-8000 node
[Diagram: processors 0 to 7, each with address/data paths, floating-point registers, and cache, connect through a fully non-blocking switch to SDRAM memory organized in 512 banks; an NIA connects the node to the X-bar network and I/O.]

SR-8000: Memory Hierarchy
- Floating-point registers (128+32); 32 B/cycle from the L1 cache
- L1 cache (128 KB, 4-way); 16 B/cycle to and from memory
- Store buffer (16 entries)
- Switch
- Memory (2 to 16 GB, 512 banks)

Address Translation
- A virtual address consists of a virtual page number and a page offset
- The page table in main memory maps virtual page numbers to physical frames
- Recently used entries of the page table are cached in the TLB

Large TLB
- The large TLB covers the whole address space with 256 entries
- Page size: 16 MB to 128 MB
[Diagram: virtual address (virtual page number, page offset); large page table in main memory]

Topology
- SR-8000: 3D torus
- 8 x 8 x 8 (= 512) processors is the largest size

Parameters of the debis-sfr Hitachi SR-8000
(see Simmendinger: http://thales.rz.st.dlr.de/cgi-bin/fom?file=16)

  Peak performance                   128 Gflop/s
  Processors                         16 x 8 = 128
  Frequency                          250 MHz
  Memory                             16 x 8 = 128 GB
  Disk                               440 GB
  Inter-node bandwidth (hardware)    1 GB/s
  Inter-node bandwidth (MPI/RDMA)    950 MB/s
  Inter-node bandwidth (MPI)         770 MB/s
  Intra-node memory bandwidth        32 GB/s per node
  L1 cache                           4-way associative
  L1 cache size                      8 x 128 KB per node
  L1 cache line size                 128 B
  Bandwidth register - L1 cache      32 B/cycle
  Bandwidth L1 cache - memory        16 B/cycle
  Bandwidth register - memory        8 B/cycle

Hitachi SR8000
- Nodes with 8+1 processors
- The processors have rotating registers (pseudo-vector registers)
- 8 processors cooperate for data-parallel code or for shared-memory parallelism

CPU Architecture
- 16 bytes/cycle memory bandwidth
- 128 KB L1 cache
- Pre-fetch: 16 open transactions
- Pre-load bypassing the cache
- 160 FP registers
- 2 FP pipelines, 4 flops/cycle
[Diagram: main memory - switch - pre-fetch into the cache, pre-load directly into the floating-point registers - arithmetic units]

Slide Window Registers
- Physical registers: a sliding part (0 to 127) and a global part (128 to 159)
- A window into the sliding part is addressed relative to a base; fixed window sizes of 4, 8, 16, or 32 registers (16 illustrated); fixed + sliding = 128
- The lower logical registers (0 to 15) are usable by all instructions; the extended registers only by extended instructions
- See also the register stack in IA-64

Instruction Set Extensions
- Load and store with extended registers
- Floating-point arithmetic with extended registers
- Slide window control
- Pre-fetch and pre-load
- Thread start-up and finish
- Predicate instructions

Programming Models

  Hardware         Programming model                                  Example
  Single CPU       Pseudo-vector processing                           Vector application
                   Independent processing on each IP                  Compilation, parallel make
                   Message passing                                    MPP application
  Single node      DO-loop distribution with COMPAS                   Vector application
                   Parallel processing of independent blocks of code
                   Message passing                                    MPP application
  Multiple nodes   COMPAS and message passing                         Vector parallel application

SR8000 Parallel Programming Models
- Pseudo-Vector Processing: PVP
- COoperative MicroProcessors in single Address Space: COMPAS
- Message Passing: MPI

Pre-fetch and Pre-load
- Pre-fetch: load a cache line from memory to the cache
- Pre-load: load one word from memory to a register
- 16 streams
[Diagram: main memory - switch - pre-fetch into the cache, pre-load directly into the floating-point registers - arithmetic units]

Software Pipelining
[Diagram: without software pipelining, iterations 1, 2, 3 run strictly one after another; with infinite resources they could overlap completely; a recurrence (a value computed in one iteration and used in the next) and finite resources limit the overlap, giving an initiation interval between iteration starts.]
- Resources: registers, floating-point units, instruction issue, memory bandwidth, etc.

Pre-fetch
[Diagram: across iterations, a pre-fetch (PF) is issued early; once its latency has passed, an LD brings the data into a register and the data are used. One PF serves several iterations.]
- Pre-fetch 128 bytes (one cache line) to the cache
- Followed by LD to a register

Pre-load
[Diagram: in every iteration a pre-load (PL) is issued; after its latency, the data are used directly from the register.]
- Pre-load 8 bytes (one word) to a register
- A separate LD is not required

Pseudo-vector Processing
[Diagram: a vector processor executes A(:) = A(:) + N as VLD, VADD, VST. With pseudo-vector processing, pre-fetches (PF) are issued ahead of the loop body; after the initial latency, the LD + ST pairs proceed back to back, as in a vector pipeline.]

Effect of PVP
[Plot: dot product S = A(1:N)*B(1:N); Mflops (0 to 900) versus loop length N (1 to 100000), with PVP off and PVP on.]

COMPAS
[Diagram: nodes with shared main memory connected by a multi-dimensional crossbar network. Within a node, COMPAS start and end instructions split a process into threads on the instruction processors (IPs), each executing pre-fetch, load, arithmetic, store, and branch.]
- COMPAS: Co-operative Micro-Processors in single Address Space

Hardware Support
[Diagram: software view: the IPs wait for start-up, a Start Parallel instruction launches the loop parts on all IPs, and an End Parallel instruction joins them back into the scalar part. Hardware view: a barrier synchronization mechanism connects the instruction processors (IP), the storage controller (SC), and main storage (MS).]

Effect of COMPAS
[Plot: dot product S = A(1:N)*B(1:N); Mflops (0 to 5000) versus loop length N (1 to 100000), with COMPAS off and COMPAS on.]

Remote DMA
[Diagram: normal transfer: data are copied from the program into a send buffer in the OS, pass through protocol processing, context switches, and interrupt handling, and are copied again out of a receive buffer on the destination node. Remote DMA transfer: data move over the crossbar network directly between the programs' memories.]
- No buffering in the kernel
- No OS system call

Inter-node
[Diagram: nodes connected by the cross-bar inter-node network.]
- One process per node; RDMA transfer is possible

Intra-node
[Diagram: nodes with shared memory, connected by the cross-bar inter-node network.]
- One process per IP; RDMA transfer is not possible

SR8000: Parallelism on all levels
- Instruction level (PVP)
- Multi-thread (COMPAS)
- Message passing (MPI)

Key Features of SR8000
- High-performance RISC CPU with PVP
- High-performance node with COMPAS
- High sustained memory bandwidth
- High scalability with the fast network
- Low energy and space requirements