Chapter Four: OPENSPARC T1 OVERVIEW
Denis Sheahan, Distinguished Engineer, Niagara Architecture Group, Sun Microsystems
This material is made available under a Creative Commons Attribution-Share 3.0 United States License 46
UltraSPARC T1 Processor
SPARC V9 (Level 1) implementation
Up to eight 4-threaded cores (32 simultaneous threads)
All cores connected through a high-bandwidth (134.4 GB/s) crossbar switch
High-bandwidth, 12-way associative 3MB Level-2 cache on chip
4 DDR2 channels (23 GB/s)
Power: < 80W
~300M transistors, 378 sq. mm die
[Block diagram: eight cores (C1-C8, "1 of 8 Cores" highlighted) and the FPU connect through the crossbar (Xbar) to four L2$ banks, each backed by a DDR-2 SDRAM channel, plus a system interface buffer switch core bus] 47
Faster Can Be Cooler
[Thermal map comparison, not to scale: a single-core processor shows hot spots near the top of a 58C-107C scale, while the CMT processor's eight cores (C1-C8) spread heat more evenly and run cooler] 48
CMT: On-chip = High Bandwidth
[Diagram: a 32-thread traditional SMP system built from many processors, memories, I/O bridges, and switch ASICs, versus a 32-thread OpenSPARC T1 processor with cores, shared L2, memory controllers, and I/O on one motherboard with no switch ASICs]
Direct crossbar interconnect:
-- lower cost
-- better RAS
-- lower BTUs
-- lower and uniform latency
-- greater and uniform bandwidth... 49
CMT Benefits
> Performance
> Cost
> Fewer servers
> Less floor space
> Reduced power consumption
> Less air conditioning
> Lower administration and maintenance
> Reliability 50
CMT Pays Off with CoolThreads(TM) Technology
Sun Fire T2000, Sun Fire T1000
> Up to 5x the performance
> As low as 1/5 the energy
> As small as 1/4 the size
*See disclosures 51
UltraSPARC-T1: Choices & Benefits
Simple core (6-stage, only 11mm2 in 90nm), 1 FPU
> maximum # of cores/threads on die
> pipeline built from scratch, useful for multiple generations
> modular, flexible design... scalable (up and down)
Caches, DRAM channels shared across cores
> better area utilization
Shared L2 cache
> cost of coherence misses decreases by an order of magnitude
> enables highly efficient multi-threaded software
On-die memory controllers
> reduce miss latency
Crossbar switch
> good for b/w, latency, functional verification
For reference: in 90nm technology, the chip included 8 cores and 32 threads, and dissipated only 70 W 52
UltraSPARC-T1 Processor Core
Four threads per core
Single issue, 6-stage pipeline
16KB I-Cache, 8KB D-Cache
> Unique resources per thread
> Registers
> Portions of I-fetch datapath
> Store and Miss buffers
> Resources shared by 4 threads
> Caches, TLBs, Execution Units
> Pipeline registers and DP
[Core floorplan: IFU, EXU, MUL, MMU, TRAP, and LSU units]
Core Area = 11mm2 in 90nm
MT adds ~20% area to core 53
UltraSPARC T1 Processor Core Pipeline
Six stages: 1 Fetch, 2 Thread Select, 3 Decode, 4 Execute, 5 Memory, 6 Writeback
[Pipeline diagram: ICache/ITLB and instruction buffers (x4) feed thread-select muxes; decode feeds the execute units (ALU, Mul, Shft, Div) and the crypto accelerator; DCache/DTLB, store buffers (x4), and the crossbar interface sit in the memory stage; thread select logic considers instruction type, misses, traps & interrupts, and resource conflicts; register file and PC logic are x4]
...blue units are replicated per thread on core 54
Thread Selection Policy
Every cycle, switch among available (ready to run) threads
> priority given to the least-recently-executed thread
Thread becomes not-ready-to-run due to:
> Long latency operations like load, branch, mul, or div
> Pipeline stall such as cache miss, trap, or resource conflict
Loads are speculated as cache hits, and the thread is switched in with lower priority (see the sketch below) 55
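The selection policy above can be pictured as a small software model. This is a minimal sketch, assuming a per-thread "last run" timestamp; names such as NUM_THREADS, ready[], and last_run[] are illustrative and not taken from the T1 RTL.

```c
/* Minimal sketch (not the RTL): pick the least-recently-executed
 * ready thread each cycle, as the slide describes. */
#include <stdbool.h>
#include <stdint.h>

#define NUM_THREADS 4

/* last_run[t] holds the cycle in which thread t last issued. */
static uint64_t last_run[NUM_THREADS];

int select_thread(const bool ready[NUM_THREADS], uint64_t now)
{
    int pick = -1;
    for (int t = 0; t < NUM_THREADS; t++) {
        if (!ready[t])
            continue;
        /* Prefer the thread that has waited longest since it last ran. */
        if (pick < 0 || last_run[t] < last_run[pick])
            pick = t;
    }
    if (pick >= 0)
        last_run[pick] = now;   /* it becomes the most recently executed */
    return pick;                /* -1 means no thread is ready this cycle */
}
```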
OpenSPARC T1 Power
Power Efficient Architecture
> Single issue, in-order six-stage pipeline
> Minimal speculation, predication or branch prediction
> Thermal monitoring for power throttling
> 3 external power throttle pins, controlled by thermal diodes
> Stall cycles injected, affecting all threads (a simple sketch follows below)
> Memory throttling
Design Implementation
> Fully static design
> Fine-granularity clock gating: limited clock issue on stall, FGU; limited L2 cache & memory clock gating
> Wire classes optimized for power x delay
[Pie chart, "T1 Power Components": SPARC cores, leakage, IOs, global clock, interconnect, misc units, floating point unit, L2Buf, L2Tag, L2Data, and crossbar] 56
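To make the stall-injection idea concrete, here is an illustrative duty-cycle model. The mapping from a throttle level to stalled cycles is an assumption made for clarity only; the slide does not specify the T1 throttling ratios.

```c
/* Illustrative sketch only: how stall-cycle injection can throttle power.
 * The mapping from the 3 external throttle pins to a stall duty cycle is
 * an assumption here, not the documented T1 behaviour. */
#include <stdbool.h>
#include <stdint.h>

/* throttle_level: 0 = no throttling ... 7 = most aggressive.
 * Returns true if this cycle should be a forced stall for all threads. */
bool inject_stall(uint8_t throttle_level, uint64_t cycle)
{
    if (throttle_level == 0)
        return false;
    /* Stall 'throttle_level' cycles out of every 8 (simple duty cycle). */
    return (cycle % 8) < throttle_level;
}
```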
OpenSPARC T1 Microarchitecture OpenSPARC October 2008
Thread Selection: All Threads Ready
[Pipeline diagram, instructions vs. cycles: with all four threads ready (Thread 0: sub, add; Thread 1: sub, ld; Thread 2: ld, br; Thread 3: add, sub), the pipeline issues one instruction from a different thread each cycle in round-robin order, keeping stages F, S, D, E, M, W filled from alternating threads] 58
Thread Selection: Two Threads Ready
[Pipeline diagram, instructions vs. cycles: with only Thread 0 (ld, add) and Thread 1 (sub, ld, br) ready, issue switches between the two threads as each becomes ready]
Thread 0 is speculatively switched in before cache hit information is available, in time for the 'load' to bypass data to the 'add' 59
Instruction Fetch/Switch/Decode Unit (IFU)
I-cache
> 16KB data, 4 ways, 32B line size
> Single-ported Instruction Tag
> Dual-ported (1R/1W) Valid-bit array to hold cache line state of valid/invalid
> Invalidate operations access the Valid-bit array, not the Instruction Tags
> Pseudo-random replacement
Fully Associative Instruction TLB
> 64 entries, page sizes: 8k, 64k, 4M, 256M
> Pseudo LRU replacement
> Multiple hits in the TLB prevented by doing auto-demap on fill (see the sketch below) 60
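The auto-demap-on-fill rule can be sketched as follows. This is a software illustration under simplified assumptions (a flat entry array and a byte-range overlap test); tlb_fill and the entry layout are hypothetical, not the T1 ITLB format.

```c
/* Minimal sketch, not the T1 RTL: auto-demap on TLB fill. Before a new
 * translation is written, any existing entry that would also match the
 * incoming virtual address (a multiple hit) is invalidated. */
#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES 64

typedef struct {
    bool     valid;
    uint64_t vpn_base;   /* first byte of the mapped virtual range */
    uint64_t span;       /* page size in bytes: 8K, 64K, 4M or 256M */
} tlb_entry_t;

static tlb_entry_t tlb[TLB_ENTRIES];

static bool overlaps(const tlb_entry_t *e, uint64_t base, uint64_t span)
{
    return e->valid &&
           base < e->vpn_base + e->span &&
           e->vpn_base < base + span;
}

void tlb_fill(int slot, uint64_t vpn_base, uint64_t span)
{
    /* Auto-demap: drop every entry the new mapping would collide with,
     * so a later lookup can never hit more than one entry. */
    for (int i = 0; i < TLB_ENTRIES; i++)
        if (overlaps(&tlb[i], vpn_base, span))
            tlb[i].valid = false;

    tlb[slot].valid    = true;
    tlb[slot].vpn_base = vpn_base;
    tlb[slot].span     = span;
}
```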
IFU Functions (Cont'd)
2 instructions fetched each cycle, though only one is issued per clock. Reduces I$ activity and allows opportunistic line fill.
1 outstanding miss per thread, and 4 per core. Duplicate misses do not send a request to L2.
PCs and NPCs for all live instructions in the machine are maintained in the IFU 61
Windowed Integer Register File
5KB 3R/2W/1T structure
> 640 64-bit regs with ECC: (16 regs x 8 windows + 8 global regs x 4 sets) x 4 threads
Only the 32 registers of the current window are visible to a thread
Window changing happens in the background under thread switch; other threads continue to access the IRF
Compact design with 6T cells for the architectural set & multi-ported cells for the working set. Single cycle R/W access (an indexing sketch follows below) 62
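One way to see how 640 physical registers can back the windowed file is the indexing sketch below. The window-overlap convention (outs shared with the next window's ins) and the function irf_index are simplifying assumptions for illustration, not the T1 implementation.

```c
/* Illustrative sketch: 8 windows x 16 unique registers (locals + one
 * shared in/out bank) + 4 global sets x 8 regs, replicated per thread,
 * gives 160 x 4 = 640 registers. */
#include <stdint.h>

#define NWINDOWS 8
#define NTHREADS 4
#define NGL_SETS 4

/* Architectural register r (0..31) in window cwp for thread t
 * -> flat index into a 640-entry file. */
int irf_index(int t, int cwp, int gl_set, int r)
{
    const int per_thread = 16 * NWINDOWS + 8 * NGL_SETS;  /* 160 */
    int base = t * per_thread;

    if (r < 8)                      /* %g0-%g7: one of 4 global sets */
        return base + 16 * NWINDOWS + gl_set * 8 + r;
    if (r < 16)                     /* %o0-%o7: shared with an adjacent window's ins */
        return base + ((cwp + 1) % NWINDOWS) * 16 + (r - 8);
    if (r < 24)                     /* %l0-%l7: private to this window */
        return base + cwp * 16 + 8 + (r - 16);
    return base + cwp * 16 + (r - 24);   /* %i0-%i7: this window's in/out bank */
}
```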
Execution Units
Single ALU and Shifter. The ALU is reused for branch address and virtual address calculation
Integer Multiplier
> 5-clock latency, throughput of 1/2 per cycle for area saving
> Contains an accumulate function for modular arithmetic
> 1 integer mul allowed outstanding per core
> Multiplier shared between the core pipe and the Modular Arithmetic unit on a round-robin basis
Simple non-restoring divider, with one divide outstanding per core
A thread issuing a MUL/DIV will roll back and switch out if another thread is occupying the mul/div unit 63
Load Store Unit (LSU)
D-Cache complex
> 8KB data, 4 ways, 16B line size
> Single-ported Data Tag
> Dual-ported (1r/1w) Valid-bit array to hold cache line state of valid/invalid
> Invalidates access the Vbit array but not the Data Tag
> Pseudo-random replacement
> Loads are allocating, stores are non-allocating
DTLB: common macro to ITLB (64-entry fully associative)
8-entry store buffer per thread, unified into a single 32-entry array, with RAW bypassing (see the sketch below) 64
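A load that finds its address in the store buffer can take its data from there instead of the D-cache (RAW bypass). The sketch below assumes a simple per-thread associative search over doubleword-aligned entries; sb_entry_t and load_raw_bypass are illustrative names, not the T1 structures.

```c
/* Minimal sketch: check the thread's store buffer (youngest first) for a
 * matching address and, on a hit, bypass the buffered store data. */
#include <stdbool.h>
#include <stdint.h>

#define SB_ENTRIES 8   /* per-thread view of the unified 32-entry array */

typedef struct {
    bool     valid;
    uint64_t addr;     /* doubleword-aligned address */
    uint64_t data;
} sb_entry_t;

/* Returns true if the load was satisfied by RAW bypass from the store
 * buffer; otherwise the caller reads the D-cache / sends a miss to L2. */
bool load_raw_bypass(const sb_entry_t sb[SB_ENTRIES], int youngest,
                     uint64_t load_addr, uint64_t *data_out)
{
    for (int i = 0; i < SB_ENTRIES; i++) {
        /* Walk from the youngest store back to the oldest. */
        int idx = (youngest - i + SB_ENTRIES) % SB_ENTRIES;
        if (sb[idx].valid && sb[idx].addr == load_addr) {
            *data_out = sb[idx].data;
            return true;
        }
    }
    return false;
}
```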
LSU (Cont'd)
Single load per thread outstanding. Duplicate requests for the same line are not sent to L2
Crossbar interface
> LSU prioritizes requests to the crossbar for FPops, streaming ops, I and D misses, stores, interrupts, etc.
> Request priority: imiss > ldmiss > stores, {fpu, strm, interrupt}
> Packet assembly for PCX
Handles returns from the crossbar and maintains order for cache updates and invalidates 65
Other Functions
Support for 6 trap levels. Traps cause a pipeline flush and a thread switch until the trap PC is available
Support for up to 64 pending interrupts per thread
Floating Point
> FP registers and decode located within the core
> On detecting an Fpop:
> The thread switches out
> The Fpop is further decoded and the FRF is read
> The Fpop with operands is packetized and shipped over the crossbar to the FPU
> Computation is done in the FPU and the result returned via the crossbar
> Writeback is completed to the FRF (floating-point register file) and the thread restarts 66
Crossbar
Each requestor queues up to 2 packets per destination
3-stage pipeline: Request, Arbitrate and Transmit
Centralized arbitration, with the oldest requestor getting priority (see the sketch below)
Core-to-cache bus optimized for address + doubleword store
Cache-to-core bus optimized for 16B line fill; a 32B I$ line fill is delivered in 2 back-to-back clocks 67
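Oldest-requestor-first arbitration for a single destination can be modeled as below. The requestor count and the age-tracking scheme are assumptions made for clarity, not the actual arbiter design.

```c
/* Illustrative sketch of oldest-requestor-first arbitration for one
 * crossbar destination. */
#include <stdbool.h>
#include <stdint.h>

#define NUM_SOURCES 9   /* e.g. 8 cores + the I/O bridge (assumed) */

typedef struct {
    bool     requesting;   /* source has a packet queued for this target */
    uint64_t req_cycle;    /* cycle at which the request was enqueued */
} xbar_req_t;

/* Grant the requestor whose pending packet is oldest. */
int arbitrate(const xbar_req_t req[NUM_SOURCES])
{
    int grant = -1;
    for (int s = 0; s < NUM_SOURCES; s++) {
        if (!req[s].requesting)
            continue;
        if (grant < 0 || req[s].req_cycle < req[grant].req_cycle)
            grant = s;
    }
    return grant;   /* -1 if no requests this cycle */
}
```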
L2 Cache
3MB, 4-way banked, 12-way set associative, writeback
64B line size, interleaved between banks at 64B granularity (see the bank-select sketch below)
Pipeline latency: 8 clocks for a load, 9 clocks for an I-miss, with the critical chunk returned first
16 outstanding misses per bank -> 64 total
Coherence maintained by shadowing the L1 tags in an L2 directory structure. L2 is the point of global visibility
DMA from IO is serialized with respect to traffic from the cores in L2 68
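With a 64B line and 64B interleave across the four banks, the bank is selected by the low-order line-address bits. The sketch below assumes the simplest such mapping; the exact address bits used by the real design may differ.

```c
/* Minimal sketch: 64B lines interleaved across 4 L2 banks, so
 * consecutive cache lines map to consecutive banks. */
#include <stdint.h>

#define L2_LINE_BYTES 64
#define L2_BANKS      4

static inline unsigned l2_bank(uint64_t paddr)
{
    /* Drop the 6 offset bits of the 64B line, then take 2 bank bits. */
    return (unsigned)((paddr / L2_LINE_BYTES) % L2_BANKS);
}

/* Example: lines at 0x0, 0x40, 0x80, 0xC0 land in banks 0, 1, 2, 3. */
```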
L2 Cache Directory
Directory shadows the L1 tags
> The L1 set index and L2 bank interleaving are such that 1/4 of the L1 entries come from each L2 bank
> On an L1 miss, the L1 replacement way and set index identify the physical location of the tag, which will be updated with the miss address
> On a store, the directory is CAM'ed
Directory entries are collated by set, so only 64 entries need to be CAM'ed. The scheme is quite power efficient
Invalidate operations are a pointer to the physical location in the L1, eliminating the need for a tag lookup in L1 (see the sketch below) 69
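The directory lookup on a store can be pictured as a CAM over one 64-entry group, where each hit is a pointer (core, L1 set, L1 way) back into the L1. The group count, entry fields, and function names in this sketch are assumptions, not the RTL.

```c
/* Minimal sketch, not the RTL: directory entries are grouped ("collated")
 * by set so a store only CAMs the 64 entries of one group; a match yields
 * a pointer the L1 can use to invalidate its copy without a tag lookup. */
#include <stdbool.h>
#include <stdint.h>

#define DIR_SETS        256   /* illustrative number of groups per bank */
#define DIR_SET_ENTRIES 64    /* entries CAM'ed per store, as on the slide */

typedef struct {
    bool     valid;
    uint64_t line_addr;              /* physical address of the L1-resident line */
    uint8_t  core, l1_set, l1_way;   /* pointer back into that core's L1 */
} dir_entry_t;

static dir_entry_t dir[DIR_SETS][DIR_SET_ENTRIES];

/* CAM one group on a store; report every L1 copy that must be invalidated. */
int store_cam(uint64_t line_addr, int set,
              const dir_entry_t **hits, int max_hits)
{
    int n = 0;
    for (int i = 0; i < DIR_SET_ENTRIES && n < max_hits; i++) {
        const dir_entry_t *e = &dir[set][i];
        if (e->valid && e->line_addr == line_addr)
            hits[n++] = e;   /* (core, l1_set, l1_way) is the invalidate pointer */
    }
    return n;
}
```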
Coherence/Ordering
Loads update the directory & fill the L1 on return
Stores are non-allocating in L1
> Two flavors of stores: TSO, RMO
> One TSO store outstanding to L2 per thread to preserve store ordering; no such limitation on RMO stores (see the sketch below)
> No tag check done at store buffer insert
> Stores check the directory and determine the L1 hit. The directory sends a store ack/inv to the core
> Store update happens to the D$ on store ack
Crossbar orders responses across cache banks 70
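The one-outstanding-TSO-store rule can be expressed as a small issue gate. The per-thread state tracking below is an illustrative assumption, not the actual store-buffer control logic.

```c
/* Illustrative sketch: a thread may have at most one TSO store in flight
 * to L2 (preserving program store order); RMO stores are not gated. */
#include <stdbool.h>

typedef enum { ST_TSO, ST_RMO } store_kind_t;

typedef struct {
    bool tso_store_outstanding;   /* set when a TSO store is sent, cleared on ack */
} thread_state_t;

bool can_issue_store(const thread_state_t *th, store_kind_t kind)
{
    if (kind == ST_RMO)
        return true;                    /* no ordering limit on RMO stores */
    return !th->tso_store_outstanding;  /* only one TSO store in flight */
}
```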
On-Chip Memory Controller
4 independent DDR2 DRAM channels
Supports memory sizes of up to 128GB
25GB/s peak bandwidth (see the back-of-envelope check below)
Schedules across 8 reads + 8 writes
Can be programmed to 2-channel mode in a reduced configuration (with a correspondingly reduced number of cores)
128+16b interface, chip-kill support, nibble error correction, byte error detection
Designed to work from 125-200MHz 71
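As a back-of-envelope check of the 25GB/s figure (an assumed derivation, not stated on the slide): four 128-bit channels at a 200MHz DDR clock transfer 4 x 16B x 400M/s = 25.6 GB/s.

```c
/* Assumed derivation of the peak bandwidth figure, not from the slide:
 * 4 channels x 128-bit data x 200MHz DDR (400 MT/s) ~= 25.6 GB/s. */
#include <stdio.h>

int main(void)
{
    const double channels      = 4;
    const double bytes_per_xfr = 128.0 / 8.0;   /* 128-bit data bus */
    const double xfrs_per_sec  = 2 * 200e6;     /* DDR at a 200MHz clock */

    printf("peak = %.1f GB/s\n",
           channels * bytes_per_xfr * xfrs_per_sec / 1e9);   /* 25.6 GB/s */
    return 0;
}
```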
Virtualization
Hypervisor layer virtualizes the CPU
Multiple OS instances
Better RAS: failures in one domain do not affect other domains
Improved OS portability to newer hardware
[Diagram: OS instance 1 and OS instance 2 run on top of the Hypervisor, which runs on the UltraSPARC T1 hardware] 72
Virtualization on UltraSPARC T1
Implementation on UltraSPARC-T1
> Hypervisor uses Physical Addresses
> Supervisor* sees 'Real Addresses', a PA abstraction
> VA is translated to RA, and then to PA. The Niagara (T1) MMU and TLB provide h/w support (see the sketch below)
> Up to 8 partitions can be supported. A 3-bit partition ID is part of the TLB translation checks
> Additional trap level added for hypervisor use
* supervisor = privileged-mode software = operating system (for example, Solaris, Linux, *BSD, ...) 73
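A TLB match that includes the partition ID might look like the sketch below, so translations installed by one partition can never hit for another. The entry fields, the single page size, and the 'real' flag handling are simplifying assumptions, not the T1 TLB format.

```c
/* Minimal sketch of a TLB lookup that checks a 3-bit partition ID. */
#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES 64
#define PAGE_SHIFT  13        /* assume 8K pages for this sketch */

typedef struct {
    bool     valid;
    uint8_t  pid;             /* 3-bit partition ID */
    bool     real;            /* entry translates RA (hypervisor-managed) */
    uint64_t vpn;             /* virtual (or real) page number */
    uint64_t ppn;             /* physical page number */
} tlb_ent_t;

static tlb_ent_t dtlb[TLB_ENTRIES];

/* Translate addr for partition pid; returns true and fills *pa on a hit. */
bool tlb_translate(uint64_t addr, uint8_t pid, bool real, uint64_t *pa)
{
    uint64_t vpn = addr >> PAGE_SHIFT;
    for (int i = 0; i < TLB_ENTRIES; i++) {
        const tlb_ent_t *e = &dtlb[i];
        if (e->valid && e->pid == (pid & 0x7) &&
            e->real == real && e->vpn == vpn) {
            *pa = (e->ppn << PAGE_SHIFT) | (addr & ((1ULL << PAGE_SHIFT) - 1));
            return true;
        }
    }
    return false;   /* TLB miss: trap to the TLB miss handler / hypervisor */
}
```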
OpenSPARC Slide-Cast in 12 Chapters
Presented by OpenSPARC designers, developers, and programmers to guide users as they develop their own OpenSPARC designs and to assist professors as they teach the next generation
This material is made available under a Creative Commons Attribution-Share 3.0 United States License 74