UltraSPARC T1: A 32-threaded CMP for Servers. James Laudon Distinguished Engineer Sun Microsystems [email protected]
2 Outline
Server design issues
> Application demands
> System requirements
Building a better server-oriented CMP
> Maximizing thread count
> Keeping the threads fed
> Keeping the threads cool
UltraSPARC T1 (Niagara)
> Micro-architecture
> Performance
> Power
3 Attributes of Commercial Workloads

Attribute                     | Web99      | jBOB (JBB)  | TPC-C | SAP 2T | SAP 3T DB | TPC-H
Application category          | Web server | Server Java | OLTP  | ERP    | ERP       | DSS
Instruction-level parallelism | low        | low         | low   | med    | low       | high
Thread-level parallelism      | high       | high        | high  | high   | high      | high
Instruction/data working set  | large      | large       | large | med    | large     | large
Data sharing                  | low        | med         | high  | med    | high      | med

Adapted from "A performance methodology for commercial servers", S. R. Kunkel et al., IBM J. Res. Develop., vol. 44, no. 6, Nov. 2000
4 Commercial Server Workloads
SPECweb05, SPECjAppServer04, SPECjbb05, SAP SD, TPC-C, TPC-E, TPC-H
High degree of thread-level parallelism (TLP)
Large working sets with poor locality, leading to high cache miss rates
Low instruction-level parallelism (ILP) due to high cache miss rates, load-load dependencies, and difficult-to-predict branches
Performance is bottlenecked by stalls on memory accesses
> Superscalar and superpipelining will not help much
5 ILP Processor on Server Application
[Figure: execution timeline of a single thread, alternating compute (C) and memory-stall (M) phases, on a scalar processor vs. a processor optimized for ILP; the ILP processor saves only a small slice of time.]
ILP reduces the compute time and overlaps computation with L2 cache hits, but memory stall time dominates overall performance
6 Attacking the Memory Bottleneck
Exploit the TLP-rich nature of server applications
Replace each large, superscalar processor with multiple simpler, threaded processors
> Increases core count (C)
> Increases threads per core (T)
> Greatly increases total thread count (C*T)
Threads share a large, high-bandwidth L2 cache and memory system
Overlap the memory stalls of one thread with the computation of other threads (see the sketch below)
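To make the overlap argument concrete, here is a minimal analytic sketch (mine, not the deck's; the 25/75 compute/stall split is illustrative) of how pipeline utilization grows with threads when each thread alternates compute and memory-stall phases:

```python
# Minimal latency-hiding model: each thread computes for t_c cycles,
# then stalls t_m cycles on memory. A single in-order pipeline can
# overlap one thread's stall with another thread's compute, so
# utilization saturates once enough threads are available.

def pipeline_utilization(threads, t_c, t_m):
    """Fraction of cycles the pipeline issues instructions."""
    return min(1.0, threads * t_c / (t_c + t_m))

base = pipeline_utilization(1, t_c=25, t_m=75)
for t in (1, 2, 4, 8):
    u = pipeline_utilization(t, t_c=25, t_m=75)
    print(f"{t} threads: utilization {u:.0%}, speedup {u/base:.1f}x")
# With t_c/(t_c+t_m) = 1/4, four threads fully hide memory latency.
```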
7 TLP Processor on Server Application
[Figure: timelines for eight cores with four threads each; each thread's memory stalls overlap with the compute phases of the other threads on its core, keeping every core busy.]
TLP focuses on overlapping memory references to improve throughput; needs sufficient memory bandwidth
8 Server System Requirements
Very large power demands
> Often run at high utilization and/or with large amounts of memory
> Deployed in dense rack-mounted datacenters
Power density affects both datacenter construction and ongoing costs
Current servers consume far more power than state-of-the-art datacenters can provide
> 500 W per 1U box possible
> Over 20 kW/rack; most datacenters at 5 kW/rack
> Blades make this even worse...
9 Server System Requirements
Processor power is a significant portion of the total
> Database: 1/3 processor, 1/3 memory, 1/3 disk
> Web serving: 2/3 processor, 1/3 memory
Perf/watt has been flat between processor generations
Acquisition cost of server hardware is declining
> Moore's Law: more performance at the same cost, or the same performance at lower cost
Total cost of ownership (TCO) will be dominated by power within five years: the Power Wall
10 Performance/Watt Trends
[Chart] Source: L. Barroso, "The Price of Performance", ACM Queue, vol. 3, no. 7
11 Impact of Flat Perf/Watt on TCO
[Chart] Source: L. Barroso, "The Price of Performance", ACM Queue, vol. 3, no. 7
12 Implications of the Power Wall
With TCO dominated by power usage, the metric that matters is performance/watt
Performance/watt has been mostly flat for several generations of ILP-focused designs
> Should have been improving as a result of voltage scaling (P ≈ f·C·V² + I_L·V)
> f, C, and I_L increases have offset voltage decreases
TLP-focused processors reduce f and per-thread C (per processor) and can greatly improve performance/watt for server workloads
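Spelling out the slide's formula (standard CMOS power relations; the derivation step is mine, not the deck's):

$$P \;\approx\; \underbrace{a\,C\,V^{2}\,f}_{\text{dynamic}} \;+\; \underbrace{I_{L}\,V}_{\text{leakage}}, \qquad \frac{\text{Perf}}{\text{Watt}} \;\propto\; \frac{f}{a\,C\,V^{2}\,f + I_{L}\,V}$$

With leakage small, performance/watt reduces to roughly $1/(a\,C\,V^{2})$: once $V$ stops dropping each generation, raising $f$ or adding capacitance for wider cores leaves performance/watt flat, which is exactly the power wall the slide describes.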
13 Outline
Server design issues
> Application demands
> System requirements
Building a better server-oriented CMP
> Maximizing thread count
> Keeping the threads fed
> Keeping the threads cool
UltraSPARC T1 (Niagara)
> Micro-architecture
> Performance
> Power
14 Building a TLP-focused Processor
Maximizing the total number of threads
> Simple cores
> Sharing at many levels
Keeping the threads fed
> Bandwidth!
> Increased associativity
Keeping the threads cool
> Performance/watt as a design goal
> Reasonable frequency
> Mechanisms for controlling the power envelope
15 Maximizing the Thread Count
A tradeoff exists between a large number of simple cores and a small number of complex cores
> Complex cores focus on ILP for higher single-thread performance
> ILP is scarce in commercial workloads
> Simple cores can deliver more TLP
Need to trade off area devoted to processor cores, L2 and L3 caches, and system-on-a-chip
Balance performance and power in all subsystems: processor, caches, memory, and I/O
16 Maximizing CMP Throughput with Mediocre¹ Cores
J. Davis, J. Laudon, K. Olukotun, PACT '05 paper
Examined several UltraSPARC II, III, IV, and T1 designs, accounting for differing technologies
Constructed an area model based on this exploration
Assumed a fixed-area large die (400 mm²) and accounted for pads, pins, and routing overhead
Looked at performance for a broad swath of scalar and in-order superscalar processor core designs (a simple version of the fixed-area approach is sketched below)
¹ Mediocre: adj. ordinary; of moderate quality, value, ability, or performance
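As an illustration of the fixed-area methodology (my sketch, not the paper's actual area model; all constants are made up):

```python
# Fixed-die-area budgeting sketch: after reserving pads/pins/routing
# and a shared L2, the remaining area determines how many cores fit.

DIE_MM2 = 400.0          # fixed large die, as in the PACT '05 study
OVERHEAD_MM2 = 60.0      # pads, pins, routing (illustrative)
L2_MM2_PER_MB = 20.0     # cache density (illustrative)

def cores_that_fit(core_mm2: float, l2_mb: float) -> int:
    """Number of cores fitting alongside an l2_mb-MB shared L2."""
    remaining = DIE_MM2 - OVERHEAD_MM2 - l2_mb * L2_MM2_PER_MB
    return max(0, int(remaining // core_mm2))

# A bigger superscalar core trades core count for single-thread speed.
print(cores_that_fit(core_mm2=12.0, l2_mb=3.0))  # many small cores: 23
print(cores_that_fit(core_mm2=48.0, l2_mb=3.0))  # few large cores: 5
```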
17 CMP Design Space
I$: instruction cache; D$: data cache; IDP: integer datapath (pipeline)
Scalar processor core: one or more IDPs, with one or more threads per pipeline
Superscalar processor core: one superscalar pipeline with one or more threads per pipeline
Cores connect through a crossbar to a shared L2 cache and DRAM channels
Large simulation space: 13k runs/benchmark/technology (pruned)
Fixed die size: the number of cores in the CMP depends on the core size
18 Scalar vs. Superscalar Core Area
[Chart: relative core area vs. threads per core for scalar cores with 1, 2, and 4 IDPs, and for superscalar (SS) and 4-way superscalar (4-SS) cores.]
19 Trading Complexity, Cores, and Caches
[Chart] Source: J. Davis, J. Laudon, K. Olukotun, "Maximizing CMP Throughput with Mediocre Cores", PACT '05
20 The Scalar CMP Design Space
[Chart: aggregate IPC vs. total cores; design points fall into regions labeled "high thread count, small L1/L2 (mediocre cores)", "medium thread count, large L1/L2", and "low thread count, small L1/L2".]
21 Limitations of Simple Cores
Lower SPEC CPU2000 ratio performance
> Not representative of most single-thread code
> Abstraction increases frequency of branching and indirection
> Most applications wait on network, disk, memory; rarely execution units
Large number of threads per chip
> 32 for UltraSPARC T1, 100+ threads soon
> Is software ready for this many threads?
> Many commercial applications scale well
> Workload consolidation
22 Simple Core Comparison
[Die photos] UltraSPARC T1: 379 mm². Pentium Extreme Edition: 206 mm².
23 Comparison Disclaimers
Different design teams and design environments
Chips fabricated in 90 nm by TI and Intel
UltraSPARC T1: designed from the ground up as a CMP
Pentium Extreme Edition: two cores bolted together
Apples-to-watermelons comparison, but still interesting
24 Pentium EE vs. US-T1 Comparison

Feature                 | Pentium Extreme Edition          | UltraSPARC T1
Clock speed             | 3.2 GHz                          | 1.2 GHz
Pipeline depth          | 31 stages                        | 6 stages
Power                   | 130 W (@ 1.3 V)                  | 72 W (@ 1.3 V)
Die size                | 206 mm²                          | 379 mm²
Transistor count        | 230 million                      | 279 million
Number of cores         | 2                                | 8
Number of threads       | 4                                | 32
L1 caches               | 12 kuop instruction / 16 KB data | 16 KB instruction / 8 KB data
Load-to-use latency     | 1.1 ns                           | 2.5 ns
L2 cache                | two copies of 1 MB, 8-way assoc. | 3 MB, 12-way assoc.
L2 unloaded latency     | 7.5 ns                           | 19 ns
L2 bandwidth            | ~180 GB/s                        | 76.8 GB/s
Memory unloaded latency | 80 ns                            | 90 ns
Memory bandwidth        | 6.4 GB/s                         | 25.6 GB/s
25 Sharing Saves Area & Ups Utilization
Hardware threads within a processor core share:
> Pipeline and execution units
> L1 caches, TLBs, and load/store port
Processor cores within a CMP share:
> L2 and L3 caches
> Memory and I/O ports
Increases utilization
> Multiple threads fill the pipeline and overlap memory stalls with computation
> Multiple cores increase the load on L2 and L3 caches and memory
26 Sharing to Save Area
[Core floorplan: IFU, EXU, MUL, MMU, TRAP, and LSU units.]
UltraSPARC T1: four threads per core
Multithreading increases:
> Register file
> Trap unit
> Instruction buffers and fetch resources
> Store queues and miss buffers
20% area increase in the core, excluding the cryptography unit
27 Sharing to Increase Utilization
[Chart: UltraSPARC T1 CPI breakdown on a database application for a single thread vs. one thread of four vs. four threads; components include instruction cache miss, data cache miss, L2 miss, store buffer full, pipeline latency, pipeline busy, threads A-D active, and miscellaneous.]
Application run with both 8 and 32 threads
With 32 threads, pipeline and memory contention slow each thread by 34%
However, increased utilization leads to a ~3x speedup with four threads per core (see the arithmetic below)
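The 34%-slowdown-but-3x-speedup claim is simply throughput arithmetic; a one-line check using the slide's numbers:

```python
# Per-thread slowdown from contention vs. aggregate throughput gain.
threads_per_core = 4
per_thread_slowdown = 1.34   # each thread runs 34% slower when sharing

speedup = threads_per_core / per_thread_slowdown
print(f"core throughput speedup: {speedup:.1f}x")  # ~3.0x
```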
28 Keeping the Threads Fed
Dedicated resources for thread memory requests
> Private store buffers and miss buffers
Large, banked, and highly-associative L2 cache
> Multiple banks for sufficient bandwidth (a bank-selection sketch follows this list)
> Increased size and associativity to hold the working sets of multiple threads
Direct connection to high-bandwidth memory
> Interference in a shared L2 will be larger than in a private L2
> But the increase in L2 miss rate is much smaller than the increase in the number of threads
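To show what line-interleaved banking looks like (a minimal sketch; the bit positions are an assumption consistent with the 64-byte lines and four banks described later in the deck):

```python
# L2 bank selection by cache-line interleaving: with 64B lines, the low
# 6 address bits are the line offset, and the next 2 bits pick one of
# 4 banks, so consecutive lines stream to different banks.

LINE_BYTES = 64
NUM_BANKS = 4

def l2_bank(phys_addr: int) -> int:
    """Bank index for a physical address (line-interleaved)."""
    return (phys_addr // LINE_BYTES) % NUM_BANKS

# Four consecutive cache lines hit four different banks:
for a in range(0, 4 * LINE_BYTES, LINE_BYTES):
    print(hex(a), "-> bank", l2_bank(a))
```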
29 Keeping the Threads Cool
Sharing of resources increases unit utilization and thus increases power
Cores must be power efficient
> Minimal speculation, high-payoff only
> Moderate pipeline depth and frequency
Extensive mechanisms for power management
> Voltage and frequency control
> Clock gating and unit shutdown
> Leakage power control
> Minimizing cache and memory power
30 Outline
Server design issues
> Application demands
> System requirements
Building a better server-oriented CMP
> Maximizing thread count
> Keeping the threads fed
> Keeping the threads cool
UltraSPARC T1 (Niagara)
> Micro-architecture
> Performance
> Power
31 UltraSPARC T1 Overview
TLP-focused CMP for servers
> 32 threads to hide memory and pipeline stalls
Extensive sharing
> Four threads share each processor core
> Eight processor cores share a single L2 cache
High-bandwidth cache and memory subsystem
> Banked and highly-associative L2 cache
> Direct connection to DDR2 memory
Performance/watt as a design metric
32 UltraSPARC T1 Block Diagram
[Block diagram: eight cores (Core 0-7) connect through a crossbar to four L2 banks (Bank 0-3) and a shared floating point unit; each L2 bank has a DRAM-controller channel (DDR2, 144 bits @ 400 MT/s); also on chip: JBUS system interface (200 MHz), SSI ROM interface (50 MHz), control register interface, JTAG, and the clock & test unit.]
33 UltraSPARC T1 Micrograph
[Die photo: SPARC cores 0-7 along the top and bottom edges; L2 tags, crossbar, FPU, and clock & test in the center; L2 data banks 0-3 and the four DDR2 interfaces on the left and right; DRAM controllers, I/O bridge, and JBUS fill the gaps.]
Features:
> 8 64-bit multithreaded SPARC cores
> Shared 3 MB, 12-way, 64B-line writeback L2 cache
> 16 KB, 4-way, 32B-line ICache per core
> 8 KB, 4-way, 16B-line writethrough DCache per core
> Four 144-bit DDR2 channels
> 3.2 GB/s JBUS I/O
Technology:
> TI 90 nm CMOS process, 9LM Cu interconnect
> 1.2 GHz @ 1.2 V
> Die size: 379 mm², 279M transistors
> Flip-chip ceramic LGA
34 UltraSPARC T1 Floorplanning
Modular design for step and repeat
The main issue is that all cores want to be close to all the L2 cache banks
> Crossbar and L2 tags located in the center
> Processor cores on the top and bottom
> L2 data on the left and right
> Memory controllers and SOC logic fill in the holes
35 Maximizing Thread Count on US-T1
Power-efficient, simple cores
> Six-stage pipeline, almost no speculation
> 1.2 GHz operation
> Four threads per core
>> Shared: pipeline, L1 caches, TLB, L2 interface
>> Dedicated: register and other architectural state, instruction buffers, 8-entry store buffers
> Pipeline switches between available threads every cycle (interleaved/vertical multithreading)
> Cryptography acceleration unit per core
36 UltraSPARC T1 Pipeline
[Pipeline diagram: six stages: Fetch, Thread Select, Decode, Execute, Memory, Writeback. Replicated per thread (x4): register file, instruction buffers, store buffers, and PC logic. Shared: ICache/ITLB, DCache/DTLB, ALU/MUL/SHFT/DIV, and the crossbar interface. The thread-select muxes pick a thread each cycle based on instruction type, misses, traps & interrupts, and resource conflicts.]
37 Thread Selection: All Threads Ready
[Pipeline example: with all four threads ready, a different thread issues each cycle (t0 ld, t1 sub, t2 ld, t3 add, then back to t0 add), keeping the pipeline full.]
38 Thread Selection: Two Threads Ready
[Pipeline example: with only threads 0 and 1 ready, issue alternates between them every cycle.]
Thread 0 is speculatively switched in before cache hit information is available, in time for the load to bypass data to the add
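A minimal cycle-by-cycle model of this select policy (my sketch, assuming least-recently-issued selection among ready threads, which is how the deck's round-robin examples behave):

```python
# Interleaved (vertical) multithreading: each cycle, issue from the
# least-recently-issued thread that is ready; threads blocked on a
# long-latency event (e.g., a cache miss) drop out of the ready set.

from collections import deque

def run(ready, cycles):
    """ready: dict thread-id -> bool; returns the issue schedule."""
    order = deque(sorted(ready))            # least-recently-issued first
    schedule = []
    for _ in range(cycles):
        pick = next((t for t in order if ready[t]), None)
        schedule.append(pick)               # None models a bubble
        if pick is not None:
            order.remove(pick)
            order.append(pick)              # now most-recently issued
    return schedule

print(run({0: True, 1: True, 2: True, 3: True}, 8))   # round-robin 0..3
print(run({0: True, 1: True, 2: False, 3: False}, 6)) # alternate 0,1
```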
39 Feeding the UltraSPARC T1 Threads
Shared L2 cache
> 3 MB, writeback, 12-way associative, 64B lines
> 4 banks, interleaved on cache line boundaries
> Handles multiple outstanding misses per bank
> MESI coherence
L2 cache orders all requests
> Maintains a directory and inclusion of the L1 caches
Direct connection to memory (bandwidth arithmetic below)
> Four 144-bit wide (128 data + 16 ECC) DDR2 interfaces
> Supports up to 128 GB of memory
> 25.6 GB/s memory bandwidth
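The 25.6 GB/s figure follows directly from the channel parameters on this slide; a quick check:

```python
# Peak memory bandwidth from the deck's channel parameters:
# 4 DDR2 channels, each 128 data bits (16 bytes) wide at 400 MT/s.
channels = 4
data_bytes_per_transfer = 128 // 8   # ECC bits carry no payload
transfers_per_sec = 400e6            # 400 MT/s

bw = channels * data_bytes_per_transfer * transfers_per_sec
print(f"{bw / 1e9:.1f} GB/s")        # 25.6 GB/s
```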
40 Keeping the US-T1 Threads Cool
Power-efficient cores
> 1.2 GHz, 6-stage, single-issue pipeline
Features to keep peak power close to average
> Ability to suspend issue from any thread
> Limit on the number of outstanding memory requests (sketched below)
Extensive clock gating
> Coarse-grained (unit shutdown, partial activation)
> Fine-grained (selective gating within datapaths)
Static design for most of the chip
63 watts typical power at 1.2 V and 1.2 GHz
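To illustrate the outstanding-request cap (purely illustrative; the T1's actual mechanism and limit are not specified at this level in the deck), a credit-based throttle looks like:

```python
# Credit-based cap on outstanding memory requests: a request consumes
# a credit when issued and returns it on completion, bounding the
# number of simultaneously active (power-hungry) memory operations.

class MemoryThrottle:
    def __init__(self, max_outstanding: int):
        self.credits = max_outstanding

    def try_issue(self) -> bool:
        if self.credits == 0:
            return False        # stall the requesting thread
        self.credits -= 1
        return True

    def complete(self):
        self.credits += 1

throttle = MemoryThrottle(max_outstanding=4)  # illustrative limit
issued = [throttle.try_issue() for _ in range(6)]
print(issued)  # [True, True, True, True, False, False]
```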
41 UltraSPARC T1 Power Breakdown
1.2 GHz / 1.2 V, < 2 watts / thread
[Pie chart: SPARC cores (26%), leakage (25%), interconnect routing (16%), L2 cache (12%), I/Os (11%), crossbar (4%), plus global clock, FPU, and miscellaneous units.]
Fully static design
Fine-granularity clock gating for datapaths (30% of flops disabled)
Lower (1.5) P/N width ratio for library cells
Interconnect wire classes optimized for power × delay
SRAM activation control
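Combining these percentages with the 63 W typical figure from the previous slide gives rough component wattages (my arithmetic; the category shares are as read from the chart above):

```python
# Rough component power from the breakdown percentages and the
# 63 W typical figure quoted earlier in the deck.
TYPICAL_W = 63.0
breakdown = {            # fractions as listed on the slide
    "SPARC cores": 0.26, "leakage": 0.25, "interconnect": 0.16,
    "L2 cache": 0.12, "I/Os": 0.11, "crossbar": 0.04,
}
for unit, frac in breakdown.items():
    print(f"{unit:12s} ~{TYPICAL_W * frac:4.1f} W")
print(f"per thread   ~{TYPICAL_W / 32:4.1f} W")  # < 2 W / thread
```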
42 Advantages of CoolThreads™
[Thermal map of the die: junction temperatures of 59-66°C across the cores, against a 107°C scale maximum.]
No need for exotic cooling technologies
Improved reliability from lower and more uniform junction temperatures
Improved performance/reliability tradeoff in design
43 UltraSPARC T1 System (T1000)
[Photo]
44 UltraSPARC T1 System (T2000)
[Photo]
45 T2000 Power Breakdown
[Pie chart: Sun Fire T2000 power, measured running SPECjbb2005, split among the UltraSPARC T1, 16 GB of memory, I/O, disks, service processor, fans, and AC/DC conversion.]
Power breakdown:
> 25% processor
> 22% memory
> 22% I/O
> 4% disk
> 1% service processor
> 10% fans
> 15% AC/DC conversion
46 UltraSPARC T1 Performance
Sun Fire T2000: 1 UltraSPARC T1 socket, 2U height
> SPECweb2005: 42.4 perf/watt (performance and power as measured; see legal disclosures)
> SPECjbb2005: 63,378 BOPS at 298 W
Footprint comparison: an E10K (UltraSPARC II-based) draws 13,456 W (52,000 BTU/hr); a T2000 (UltraSPARC T1-based) weighs 37 lbs and draws ~300 W (1,364 BTU/hr)
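The SPECjbb2005 numbers above give the perf/watt directly (simple division, using the slide's figures):

```python
# Perf/watt from the published SPECjbb2005 result and measured power.
bops, watts = 63_378, 298
print(f"{bops / watts:.0f} BOPS/W")   # ~213 BOPS per watt
```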
47 Future Trends
Improved thread performance
> Deeper pipelines
> More high-payoff speculation
Increased number of threads per core
More of the system components will move on-chip
Continued focus on delivering high performance/watt and performance/watt/volume (SWaP)
48 Conclusions
Server TCO will soon be dominated by power
Server CMPs need to be designed from the ground up to improve performance/watt
> Simple MT cores => more threads => more performance
> Lower frequency and less speculation => less power
> Must provide enough bandwidth to keep the threads fed
UltraSPARC T1 employs these principles to deliver outstanding performance and performance/watt on a broad range of commercial workloads
49 Legal Disclosures
SPECweb2005: Sun Fire T2000 (8 cores, 1 chip). SPEC and SPECweb are registered trademarks of the Standard Performance Evaluation Corporation. Sun Fire T2000 results submitted to SPEC December 6th, 2005. Sun Fire T2000 server power consumption taken from measurements made during the benchmark run.
SPECjbb2005: Sun Fire T2000 Server (1 chip, 8 cores, 1-way): 63,378 bops. SPEC and SPECjbb are registered trademarks of the Standard Performance Evaluation Corporation. Sun Fire T2000 results submitted to SPEC December 6th, 2005. Sun Fire T2000 server power consumption taken from measurements made during the benchmark run.