UltraSPARC T1: A 32-threaded CMP for Servers
James Laudon, Distinguished Engineer, Sun Microsystems
james.laudon@sun.com
Outline
Server design issues
> Application demands
> System requirements
Building a better server-oriented CMP
> Maximizing thread count
> Keeping the threads fed
> Keeping the threads cool
UltraSPARC T1 (Niagara)
> Micro-architecture
> Performance
> Power
Attributes of Commercial Workloads

Attribute                       Web99        jBOB (JBB)    TPC-C    SAP 2T   SAP 3T DB   TPC-H
Application category            Web server   Server Java   OLTP     ERP      ERP         DSS
Instruction-level parallelism   low          low           low      med      low         high
Thread-level parallelism        high         high          high     high     high        high
Instruction/data working set    large        large         large    med      large       large
Data sharing                    low          med           high     med      high        med

Adapted from "A Performance Methodology for Commercial Servers," S. R. Kunkel et al., IBM J. Res. Develop., vol. 44, no. 6, Nov. 2000.
Commercial Server Workloads
Examples: SPECweb2005, SPECjAppServer2004, SPECjbb2005, SAP SD, TPC-C, TPC-E, TPC-H
High degree of thread-level parallelism (TLP)
Large working sets with poor locality, leading to high cache miss rates
Low instruction-level parallelism (ILP) due to high cache miss rates, load-load dependencies, and difficult-to-predict branches
Performance is bottlenecked by stalls on memory accesses
Superscalar and superpipelined designs will not help much
ILP Processor on Server Application
(Figure: execution timelines for a scalar processor and an ILP-optimized processor, each showing a single thread alternating compute (C) and memory (M) phases; the ILP processor shortens the compute phases, but the memory-latency phases remain.)
ILP reduces the compute time and overlaps computation with L2 cache hits, but memory stall time dominates overall performance.
Attacking the Memory Bottleneck
Exploit the TLP-rich nature of server applications
Replace each large, superscalar processor with multiple simpler, threaded processors
> Increases core count (C)
> Increases threads per core (T)
> Greatly increases total thread count (C*T)
Threads share a large, high-bandwidth L2 cache and memory system
Overlap the memory stalls of one thread with the computation of other threads
TLP Processor on Server Application
(Figure: execution timelines for eight cores, Core0-Core7, each running four threads; every thread alternates compute and memory-latency phases, and each thread's memory stalls are overlapped with the other threads' computation.)
TLP focuses on overlapping memory references to improve throughput; it needs sufficient memory bandwidth.
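To make the picture concrete, here is a minimal back-of-the-envelope model of how interleaving threads on one pipeline hides memory stalls. It is an illustration only, not from the talk, and the compute and memory-latency values are invented placeholders.

#include <stdio.h>

/* Toy throughput model (illustrative only): each thread alternates
 * t_compute cycles of work with t_mem cycles of memory stall.  With T
 * threads interleaved on one pipeline, the pipeline is busy whenever at
 * least one thread has work, so utilization is roughly
 * min(1, T * t_compute / (t_compute + t_mem)). */
int main(void) {
    const double t_compute = 20.0;   /* placeholder: cycles of work per burst */
    const double t_mem     = 100.0;  /* placeholder: cycles per memory stall  */
    for (int threads = 1; threads <= 8; threads++) {
        double util = threads * t_compute / (t_compute + t_mem);
        if (util > 1.0) util = 1.0;
        printf("threads=%d  pipeline utilization=%.2f\n", threads, util);
    }
    /* Threads needed to fully hide the stall: 1 + t_mem/t_compute = 6 here. */
    return 0;
}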
Server System Requirements
Very large power demands
> Often run at high utilization and/or with large amounts of memory
> Deployed in dense rack-mounted datacenters
Power density affects both datacenter construction and ongoing costs
Current servers consume far more power than state-of-the-art datacenters can provide
> 500 W per 1U box is possible
> That is over 20 kW/rack; most datacenters provide about 5 kW/rack
> Blades make this even worse
Server System Requirements
Processor power is a significant portion of the total
> Database: 1/3 processor, 1/3 memory, 1/3 disk
> Web serving: 2/3 processor, 1/3 memory
Performance/watt has been flat between processor generations
Acquisition cost of server hardware is declining
> Moore's Law: more performance at the same cost, or the same performance at lower cost
Total cost of ownership (TCO) will be dominated by power within five years: the Power Wall
Performance/Watt Trends
(Chart.) Source: L. Barroso, "The Price of Performance," ACM Queue, vol. 3, no. 7, 2005.
Impact of Flat Perf/Watt on TCO
(Chart.) Source: L. Barroso, "The Price of Performance," ACM Queue, vol. 3, no. 7, 2005.
Implications of the Power Wall
With TCO dominated by power usage, the metric that matters is performance/watt
Performance/watt has been mostly flat across several generations of ILP-focused designs
> It should have been improving as a result of voltage scaling (power of roughly f*C*V^2 dynamic plus I_leak*V leakage)
> Increases in C, I_leak, and f have offset the decreases in V
TLP-focused processors reduce f and per-thread C, and can greatly improve performance/watt for server workloads
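For reference, the standard CMOS power model behind this argument (the formulation below, including the activity factor alpha, is textbook notation rather than text from the deck):

\[
P \;\approx\; \alpha\, C\, V^{2} f \;+\; I_{\mathrm{leak}}\, V
\]

Because dynamic power scales with V^2 f and a lower clock target permits a lower supply voltage, trading single-thread frequency for more threads can raise throughput per watt even when peak single-thread performance drops.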
Outline
Server design issues
> Application demands
> System requirements
Building a better server-oriented CMP
> Maximizing thread count
> Keeping the threads fed
> Keeping the threads cool
UltraSPARC T1 (Niagara)
> Micro-architecture
> Performance
> Power
Building a TLP-focused processor
Maximizing the total number of threads
> Simple cores
> Sharing at many levels
Keeping the threads fed
> Bandwidth!
> Increased associativity
Keeping the threads cool
> Performance/watt as a design goal
> Reasonable frequency
> Mechanisms for controlling the power envelope
Maximizing the thread count
A tradeoff exists between a large number of simple cores and a small number of complex cores
> Complex cores focus on ILP for higher single-thread performance
> ILP is scarce in commercial workloads
> Simple cores can deliver more TLP
Need to trade off the area devoted to processor cores, L2 and L3 caches, and system-on-chip components
Balance performance and power in all subsystems: processor, caches, memory, and I/O
Maximizing CMP Throughput with Mediocre* Cores
J. Davis, J. Laudon, K. Olukotun, PACT '05
Examined several UltraSPARC II, III, IV, and T1 designs, accounting for differing technologies
Constructed an area model based on this exploration
Assumed a fixed-area large die (400 mm2) and accounted for pads, pins, and routing overhead
Looked at performance for a broad swath of scalar and in-order superscalar processor core designs
* Mediocre: adj. ordinary; of moderate quality, value, ability, or performance
CMP Design Space
Legend: I$ = instruction cache, D$ = data cache, IDP = integer datapath (pipeline)
(Diagram: a superscalar processor core has one superscalar pipeline with one or more threads per pipeline, plus its own I$ and D$; a scalar processor core has one or more IDPs with one or more threads per pipeline, plus its own I$ and D$; either kind of core connects through a crossbar to a shared L2 cache and DRAM channels.)
Large simulation space: ~13k runs per benchmark per technology (pruned)
Fixed die size: the number of cores in the CMP depends on the core size
Scalar vs. Superscalar Core Area
(Chart: relative core area versus threads per core (1-16) for scalar cores with 1-4 IDPs and for 2-way and 4-way superscalar cores; annotations mark relative area ratios of 1.36x, 1.54x, 1.75x, and 1.84x.)
Trading complexity, cores and caches
(Figure: floorplans showing how many cores of each complexity fit on the fixed-area die with different cache allocations: 4, 7, 5-7, 7-9, 12-14, and 14-17 cores.)
Source: J. Davis, J. Laudon, K. Olukotun, "Maximizing CMP Throughput with Mediocre Cores," PACT '05.
The Scalar CMP Design Space
(Chart: aggregate IPC versus total cores for the scalar CMP designs; clusters are labeled "Low Thread Count, Small L1/L2," "Medium Thread Count, Large L1/L2," and "High Thread Count, Small L1/L2 (Mediocre Cores)," with the high-thread-count, small-cache designs at roughly 15-25 cores reaching the highest aggregate IPC.)
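A toy version of the fixed-die tradeoff the study explores: bigger cores deliver more per-core IPC, but fewer of them fit, so aggregate IPC can peak with many simple cores. This sketch is not the paper's area model; every constant in it is an invented placeholder.

#include <stdio.h>

/* Toy fixed-die-area model (illustrative only; the area and IPC numbers
 * are placeholders, not values from the PACT '05 study). */
struct core_type {
    const char *name;
    double area_mm2;      /* area per core, including L1 caches          */
    double ipc_per_core;  /* assumed sustained IPC on a server workload  */
};

int main(void) {
    const double die_mm2    = 400.0;   /* fixed die budget from the study     */
    const double shared_mm2 = 160.0;   /* placeholder: L2, pads, I/O, routing */
    struct core_type types[] = {
        { "aggressive superscalar",     40.0, 0.9 },
        { "2-way in-order, 2 threads",  12.0, 0.6 },
        { "scalar, 4 threads",           6.0, 0.5 },
    };
    for (int i = 0; i < 3; i++) {
        int cores = (int)((die_mm2 - shared_mm2) / types[i].area_mm2);
        printf("%-28s cores=%2d  aggregate IPC=%.1f\n",
               types[i].name, cores, cores * types[i].ipc_per_core);
    }
    return 0;
}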
Limitations of Simple Cores
Lower SPEC CPU2000 ratio performance
> Not representative of most single-threaded code
> Abstraction increases the frequency of branching and indirection
> Most applications wait on network, disk, or memory; they rarely wait on execution units
Large number of threads per chip
> 32 for UltraSPARC T1, 100+ threads soon
> Is software ready for this many threads?
> Many commercial applications scale well
> Workload consolidation
Simple core comparison
UltraSPARC T1: 379 mm2
Pentium Extreme Edition: 206 mm2
Comparison Disclaimers
Different design teams and design environments
Both chips fabricated in 90 nm, by TI and Intel respectively
UltraSPARC T1: designed from the ground up as a CMP
Pentium Extreme Edition: two cores bolted together
An apples-to-watermelons comparison, but still interesting
Pentium EE vs. US-T1 Bandwidth Comparison

Feature                   Pentium Extreme Edition            UltraSPARC T1
Clock speed               3.2 GHz                            1.2 GHz
Pipeline depth            31 stages                          6 stages
Power                     130 W (@ 1.3 V)                    72 W (@ 1.3 V)
Die size                  206 mm2                            379 mm2
Transistor count          230 million                        279 million
Number of cores           2                                  8
Number of threads         4                                  32
L1 caches                 12 Kuop instruction / 16 KB data   16 KB instruction / 8 KB data
Load-to-use latency       1.1 ns                             2.5 ns
L2 cache                  two copies of 1 MB, 8-way          3 MB, 12-way
L2 unloaded latency       7.5 ns                             19 ns
L2 bandwidth              ~180 GB/s                          76.8 GB/s
Memory unloaded latency   80 ns                              90 ns
Memory bandwidth          6.4 GB/s                           25.6 GB/s
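One way to read the memory-bandwidth row is to normalize it to the core clock (arithmetic derived from the table above, not a figure from the deck):

\[
\frac{25.6\ \mathrm{GB/s}}{1.2\ \mathrm{GHz}} \approx 21\ \mathrm{bytes/cycle\ (UltraSPARC\ T1)}
\qquad
\frac{6.4\ \mathrm{GB/s}}{3.2\ \mathrm{GHz}} = 2\ \mathrm{bytes/cycle\ (Pentium\ EE)}
\]

Per core clock, the T1 can move roughly ten times as much memory data, which is what allows its 32 threads to stay fed.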
Sharing Saves Area & Ups Utilization
Hardware threads within a processor core share:
> Pipeline and execution units
> L1 caches, TLBs, and load/store port
Processor cores within a CMP share:
> L2 and L3 caches
> Memory and I/O ports
Sharing increases utilization
> Multiple threads fill the pipeline and overlap memory stalls with computation
> Multiple cores increase the load on the L2 and L3 caches and on memory
Sharing to save area
(Figure: UltraSPARC T1 core floorplan showing the IFU, EXU, LSU, MUL, MMU, and TRAP units.)
UltraSPARC T1: four threads per core
Multithreading increases:
> Register file
> Trap unit
> Instruction buffers and fetch resources
> Store queues and miss buffers
About a 20% area increase in the core, excluding the cryptography unit
Sharing to increase utilization
(Figure: CPI breakdown for an UltraSPARC T1 database application, comparing a single thread, one thread of four, and all four threads; categories include pipeline busy, instruction cache miss, data cache miss, L2 miss, store buffer full, pipeline latency, miscellaneous, and cycles with threads A-D active.)
The application was run with both 8 and 32 threads
With 32 threads, pipeline and memory contention slow each thread by 34%
However, the increased utilization leads to roughly a 3x speedup with four threads per core
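The 3x figure follows directly from the quoted 34% per-thread slowdown (simple arithmetic consistent with the slide's numbers):

\[
\text{per-core speedup} \approx \frac{4\ \text{threads}}{1.34} \approx 3.0
\]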
Keeping the threads fed
Dedicated resources for thread memory requests
> Private store buffers and miss buffers
Large, banked, and highly-associative L2 cache
> Multiple banks for sufficient bandwidth
> Increased size and associativity to hold the working sets of multiple threads
Direct connection to high-bandwidth memory
> Fallout from a shared L2 will be larger than from a private L2
> But the increase in L2 miss rate will be much smaller than the increase in the number of threads
Keeping the threads cool
Sharing of resources increases unit utilization and thus increases power
Cores must be power efficient
> Minimal speculation (high-payoff only)
> Moderate pipeline depth and frequency
Extensive mechanisms for power management
> Voltage and frequency control
> Clock gating and unit shutdown
> Leakage power control
> Minimizing cache and memory power
Outline
Server design issues
> Application demands
> System requirements
Building a better server-oriented CMP
> Maximizing thread count
> Keeping the threads fed
> Keeping the threads cool
UltraSPARC T1 (Niagara)
> Micro-architecture
> Performance
> Power
UltraSPARC T1 Overview
TLP-focused CMP for servers
> 32 threads to hide memory and pipeline stalls
Extensive sharing
> Four threads share each processor core
> Eight processor cores share a single L2 cache
High-bandwidth cache and memory subsystem
> Banked and highly-associative L2 cache
> Direct connection to DDR2 memory
Performance/watt as a design metric
UltraSPARC T1 Block Diagram
(Diagram: eight SPARC cores, Core 0-7, connect through a crossbar to four L2 cache banks, Bank 0-3, and a shared floating-point unit; each L2 bank has a DRAM controller driving a DDR2 channel, Channels 0-3, each 144 bits at 400 MT/s; the crossbar also connects to the JBUS system interface (200 MHz), the SSI ROM interface (50 MHz), the control register interface, and the JTAG clock and test unit.)
UltraSPARC T1 Micrograph
(Die photo: eight SPARC cores along the top and bottom edges, four L2 data banks on the left and right, with the L2 tags, L2 buffers, crossbar, FPU, clock and test unit, I/O bridge, JBUS, and the DDR2/DRAM controllers in between.)
Features:
> 8 64-bit multithreaded SPARC cores
> Shared 3 MB, 12-way, 64 B line, writeback L2 cache
> 16 KB, 4-way, 32 B line instruction cache per core
> 8 KB, 4-way, 16 B line, writethrough data cache per core
> 4 144-bit DDR2 channels
> 3.2 GB/s JBUS I/O
Technology:
> TI 90 nm CMOS process, 9LM Cu interconnect
> 63 W @ 1.2 GHz / 1.2 V
> Die size: 379 mm2, 279M transistors
> Flip-chip ceramic LGA package
UltraSPARC T1 Floorplanning
Modular design for step and repeat
The main issue is that all cores want to be close to all the L2 cache banks
> Crossbar and L2 tags located in the center
> Processor cores on the top and bottom
> L2 data on the left and right
> Memory controllers and SOC units fill in the holes
Maximizing Thread Count on US-T1
Power-efficient, simple cores
> Six-stage pipeline, almost no speculation
> 1.2 GHz operation
> Four threads per core
>> Shared: pipeline, L1 caches, TLB, L2 interface
>> Dedicated: register file and other architectural state, instruction buffers, 8-entry store buffers
> Pipeline switches between available threads every cycle (interleaved/vertical multithreading)
> Cryptography acceleration unit per core
UltraSPARC T1 Pipeline
(Diagram: six stages: Fetch, Thread Select, Decode, Execute, Memory, Writeback. The fetch stage holds the ICache and ITLB feeding a thread-select mux; per-thread resources are replicated four times: register files, instruction buffers, store buffers, and PC logic. The thread-select logic picks a thread each cycle based on instruction type, misses, traps and interrupts, and resource conflicts. The execute stage contains the ALU, multiplier, shifter, and divider; the memory stage contains the DCache, DTLB, and store buffers and connects to the crossbar interface.)
Thread Selection: All Threads Ready
(Pipeline diagram: with all four threads ready, the select stage issues a different thread each cycle in round-robin order: t0's load, t1's subtract, t2's load, t3's add, then t0's add; each instruction flows through the F, S, D, E, M, W stages on successive cycles.)
Thread Selection: Two Threads Ready
(Pipeline diagram: with only threads t0 and t1 ready, the select stage alternates between them: t0's load, t1's subtract, t1's load, t0's add.)
Thread 0 is speculatively switched back in before the cache-hit information for its load is available, just in time for the load to bypass data to the dependent add.
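A software sketch of the select behavior shown above: each cycle, issue from one ready thread, rotating among the threads that are not stalled. This is an illustration of the policy, not the actual T1 select logic or its exact priority function.

#include <stdbool.h>
#include <stdio.h>

/* Illustrative thread-select model: rotate among ready threads, where
 * "not ready" stands in for a cache miss, long-latency op, trap, or
 * resource conflict that removes a thread from selection. */
#define NTHREADS 4

static int last_pick = NTHREADS - 1;   /* rotate starting after the last issue */

/* Returns the thread to issue this cycle, or -1 if none is ready. */
int select_thread(const bool ready[NTHREADS]) {
    for (int i = 1; i <= NTHREADS; i++) {
        int t = (last_pick + i) % NTHREADS;
        if (ready[t]) {
            last_pick = t;
            return t;
        }
    }
    return -1;   /* all threads stalled: the pipeline issues a bubble */
}

int main(void) {
    bool ready[NTHREADS] = { true, true, false, false };  /* t2, t3 stalled */
    for (int cycle = 0; cycle < 6; cycle++)
        printf("cycle %d: issue thread %d\n", cycle, select_thread(ready));
    return 0;
}

With only t0 and t1 ready, the sketch alternates 0, 1, 0, 1, matching the two-threads-ready diagram.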
Feeding the UltraSPARC T1 Threads
Shared L2 cache
> 3 MB, writeback, 12-way associative, 64 B lines
> 4 banks, interleaved on cache line boundaries
> Handles multiple outstanding misses per bank
> MESI coherence
L2 cache orders all requests
> Maintains a directory over the L1 caches and enforces inclusion
Direct connection to memory
> Four 144-bit wide (128 data + 16 ECC) DDR2 interfaces
> Supports up to 128 GB of memory
> 25.6 GB/s memory bandwidth
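A sketch of how a physical address could map to an L2 bank and set when four banks are interleaved at 64 B line boundaries. The geometry matches the numbers above, but the exact bit positions are assumptions for illustration, not documented T1 behavior.

#include <stdint.h>
#include <stdio.h>

/* Illustrative L2 address decode for a 3 MB, 12-way, 64 B line cache split
 * into 4 line-interleaved banks.  Bit positions are assumed, chosen only to
 * be consistent with the stated geometry. */
enum {
    LINE_BYTES    = 64,                /* 6 offset bits                        */
    BANKS         = 4,                 /* 2 bank-select bits above the offset  */
    WAYS          = 12,
    CACHE_BYTES   = 3 * 1024 * 1024,
    SETS_PER_BANK = CACHE_BYTES / (LINE_BYTES * WAYS * BANKS),   /* = 1024    */
};

int main(void) {
    uint64_t addr = 0x12345678ULL;                        /* example address   */
    unsigned offset = addr & (LINE_BYTES - 1);                    /* addr[5:0] */
    unsigned bank   = (addr / LINE_BYTES) % BANKS;                /* addr[7:6] */
    unsigned set    = (addr / (LINE_BYTES * BANKS)) % SETS_PER_BANK;
    printf("addr=0x%llx -> bank %u, set %u, offset %u (%u sets per bank)\n",
           (unsigned long long)addr, bank, set, offset, (unsigned)SETS_PER_BANK);
    return 0;
}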
Keeping the US-T1 Threads Cool
Power-efficient cores
> 1.2 GHz, 6-stage, single-issue pipeline
Features to keep peak power close to average
> Ability to suspend issue from any thread
> Limit on the number of outstanding memory requests
Extensive clock gating
> Coarse-grained (unit shutdown, partial activation)
> Fine-grained (selective gating within datapaths)
Static design for most of the chip
63 W typical power at 1.2 V and 1.2 GHz
UltraSPARC T1 Power Breakdown
63 W @ 1.2 GHz / 1.2 V; less than 2 W per thread
(Pie chart: SPARC cores 26%, leakage 25%, interconnect/routing 16%, L2 cache 12%, I/Os 11%, crossbar 4%, with the remainder in the global clock, floating point, and miscellaneous units.)
Power-saving techniques:
> Fully static design
> Fine-granularity clock gating for datapaths (30% of flops disabled)
> Lower 1.5 P/N width ratio for library cells
> Interconnect wire classes optimized for power x delay
> SRAM activation control
Advantages of CoolThreads(TM)
(Thermal map: junction temperatures of roughly 59-66 °C across the die, contrasted with a 107 °C hot spot.)
No need for exotic cooling technologies
Improved reliability from lower and more uniform junction temperatures
Improved performance/reliability tradeoff in the design
UltraSPARC T1 System (T1000)
(System photo.)
UltraSPARC T1 System (T2000)
(System photo.)
T2000 Power Breakdown
(Pie chart of Sun Fire T2000 power: UltraSPARC T1, 16 GB memory, I/O, disks, service processor, fans, AC/DC conversion.)
271 W running SPECjbb2000
Power breakdown:
> 25% processor
> 22% memory
> 22% I/O
> 4% disk
> 1% service processor
> 10% fans
> 15% AC/DC conversion
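Converting the percentages to approximate watts against the 271 W total (arithmetic on the slide's numbers, not additional measurements):

\[
0.25 \times 271\,\mathrm{W} \approx 68\,\mathrm{W}\ \text{(processor)},\qquad
0.22 \times 271\,\mathrm{W} \approx 60\,\mathrm{W}\ \text{(memory)},\qquad
0.15 \times 271\,\mathrm{W} \approx 41\,\mathrm{W}\ \text{(AC/DC conversion)}
\]

The roughly 68 W processor share is consistent with the 63-72 W chip power quoted on earlier slides.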
UltraSPARC T1 Performance

Sun Fire T2000 (CPU: UltraSPARC T1, 1 socket, 2U):
               SPECweb2005   SPECjbb2005
Performance    14,001        63,378 BOPS
Power          330 W         298 W
Perf/Watt      42.4          212.7

               E10K (1997)         T2000 (2005)
CPUs           32 x UltraSPARC II  1 x UltraSPARC T1
Volume         77.4 ft3            0.85 ft3
Weight         2,000 lbs           37 lbs
Power          13,456 W            ~300 W
Heat           52,000 BTU/hr       1,364 BTU/hr
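The Perf/Watt rows are simply the benchmark score divided by the measured system power (a check of the table's arithmetic):

\[
\frac{14001}{330\,\mathrm{W}} \approx 42.4,\qquad
\frac{63378\ \mathrm{BOPS}}{298\,\mathrm{W}} \approx 212.7
\]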
Future Trends
Improved thread performance
> Deeper pipelines
> More high-payoff speculation
Increased number of threads per core
More of the system components will move on-chip
Continued focus on delivering high performance/watt and performance/watt/volume (SWaP)
Conclusions
Server TCO will soon be dominated by power
Server CMPs need to be designed from the ground up to improve performance/watt
> Simple multithreaded cores => more threads => more performance
> Lower frequency and less speculation => lower power
> Must provide enough bandwidth to keep the threads fed
UltraSPARC T1 employs these principles to deliver outstanding performance and performance/watt on a broad range of commercial workloads
Legal Disclosures
SPECweb2005: Sun Fire T2000 (8 cores, 1 chip), 14,001 SPECweb2005
> SPEC and SPECweb are registered trademarks of the Standard Performance Evaluation Corporation
> Sun Fire T2000 results submitted to SPEC December 6, 2005
> Sun Fire T2000 server power consumption taken from measurements made during the benchmark run
SPECjbb2005: Sun Fire T2000 Server (1 chip, 8 cores, 1-way), 63,378 bops
> SPEC and SPECjbb are registered trademarks of the Standard Performance Evaluation Corporation
> Sun Fire T2000 results submitted to SPEC December 6, 2005
> Sun Fire T2000 server power consumption taken from measurements made during the benchmark run