Thread Level Parallelism II: Multithreading
1 Thread Level Parallelism II: Multithreading
- Readings: H&P, Chapter 3.5
- Paper: Niagara: A 32-Way Multithreaded SPARC Processor
2 This Unit: Multithreading (MT)
- Why multithreading (MT)? Utilization vs. performance
- Three implementations: coarse-grained MT, fine-grained MT, simultaneous MT (SMT)
- Example: the Sun UltraSPARC T1
- Research topics
- (Figure: system layer stack — application, OS, compiler, firmware, CPU/I$O/memory, digital circuits, gates & transistors)
3 Performance and Utilization
- Performance (IPC) is important; utilization (actual IPC / peak IPC) is important too
- Even moderate superscalars (e.g., 4-way) are not fully utilized; average sustained IPC is often below 50% of peak
  - Mis-predicted branches
  - Cache misses, especially in the L2
  - Data dependences
- Multithreading (MT): improve utilization by multiplexing multiple threads on a single CPU
  - One thread cannot fully utilize the CPU? Maybe 2, 4 (or 100) can
- Generalizes the multi-programmed OS: I/O waits become CPU stalls, context switches become thread switches, and the software stack becomes zero software
4 Superscalar Under-Utilization
- Time evolution of the issue slots of a 4-issue processor
- (Figure: after a cache miss, all issue slots of the superscalar go idle)
5 Simple Multithreading
- Time evolution of the issue slots of a 4-issue processor
- On a cache miss, fill the idle slots with instructions from another thread
- Where does it find a thread? Same problem as multi-core: the same shared-memory abstraction
6 Latency vs. Throughput
- MT trades (single-thread) latency for throughput
  - Sharing the processor degrades the latency of individual threads
  + But improves the aggregate latency of both threads
  + Improves utilization
- Example
  - Thread A: individual latency = 10s, latency with thread B = 15s
  - Thread B: individual latency = 20s, latency with thread A = 25s
  - Sequential latency (first A then B, or vice versa): 30s
  - Parallel latency (A and B simultaneously): 25s
  - MT slows each thread by 5s, but improves total latency by 5s
- Different workloads have different parallelism
  - SPECfp has lots of ILP (can use an 8-wide machine)
  - Server workloads have TLP (can use multiple threads)
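The example above can be checked with a toy calculation (a sketch using only the slide's numbers, not a model of any real machine):

```python
# Toy arithmetic for the latency-vs-throughput trade, using the slide's numbers.
def sequential_latency(a, b):
    # Run one thread to completion, then the other: total is the sum.
    return a + b

def parallel_latency(a_shared, b_shared):
    # Both run at once on the shared core; the pair finishes when the slower one does.
    return max(a_shared, b_shared)

# A alone takes 10s, B alone 20s; sharing slows them to 15s and 25s.
print(sequential_latency(10, 20))  # 30
print(parallel_latency(15, 25))    # 25: each thread is 5s slower, the pair 5s faster
```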
7 MT Implementations: Similarities
- How do multiple threads share a single processor? Different sharing mechanisms for different kinds of structures, depending on what kind of state a structure stores
- No state (ALUs): dynamically shared
- Persistent hard state, a.k.a. the context (PC, registers): replicated
- Persistent soft state (caches, branch predictor): dynamically partitioned (as on a multi-programmed uniprocessor)
  - TLBs need thread IDs; caches and predictor tables don't (physical vs. virtual addressing)
  - Exception: ordered soft state (BHR, RAS) is replicated
- Transient state (pipeline latches, ROB, reservation stations): partitioned somehow
8 MT Implementations: Differences
- Main question: the thread scheduling policy — when to switch from one thread to another?
- Related question: pipeline partitioning — how exactly do threads share the pipeline itself?
- The choice depends on
  - What kind of latencies (specifically, what lengths) you want to tolerate
  - How much single-thread performance you are willing to sacrifice
- Three designs
  - Coarse-grain multithreading (CGMT)
  - Fine-grain multithreading (FGMT)
  - Simultaneous multithreading (SMT)
9 The Standard Multithreading Picture
- Time evolution of issue slots; color = thread
- (Figure: issue-slot diagrams for superscalar, CGMT, FGMT, and SMT)
10 Coarse-Grain Multithreading (CGMT)
+ Sacrifices very little performance of any one thread
- Tolerates only long latencies (e.g., L2 misses)
- Thread scheduling policy
  - Designate a preferred thread (e.g., thread A)
  - Switch to thread B on a thread A L2 miss
  - Switch back to A when A's L2 miss returns
- Pipeline partitioning: none; flush on switch
  - Can't tolerate latencies shorter than twice the pipeline depth
  - Needs a short in-order pipeline for good performance
- Example: IBM Northstar/Pulsar (RS/6000)
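The switch-on-miss policy above can be sketched as a toy simulation (event stream and miss latency are made-up illustration values; flush-on-switch cycles are ignored for simplicity):

```python
# Toy CGMT scheduler: stay on the preferred thread A until it takes an L2 miss,
# run thread B while the miss is outstanding, switch back when the miss returns.
def cgmt_schedule(events, miss_latency=3):
    """events: per-cycle flags; True means thread A misses in the L2 that cycle.
    Returns which thread occupies the pipeline each cycle."""
    schedule, return_cycle = [], -1
    for cycle, a_misses in enumerate(events):
        if cycle < return_cycle:
            schedule.append("B")      # A's miss still outstanding: B runs
        else:
            schedule.append("A")      # preferred thread runs
            if a_misses:
                return_cycle = cycle + 1 + miss_latency
    return schedule

print(cgmt_schedule([False, True, False, False, False, False]))
# ['A', 'A', 'B', 'B', 'B', 'A']
```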
11 CGMT
- (Figure: baseline pipeline — I$, branch predictor, register file, D$ — vs. a CGMT pipeline with a thread scheduler, replicated register files, and a switch triggered by "L2 miss?")
12 Fine-Grain Multithreading (FGMT)
- Sacrifices significant single-thread performance
+ Tolerates latencies of all lengths (e.g., L2 misses, mispredicted branches, etc.)
- Thread scheduling policy: switch threads every cycle (round-robin), L2 miss or no
- Pipeline partitioning: dynamic, no flushing
  - Pipeline length doesn't matter so much
- Needs a lot of threads
  - Extreme example: Denelcor HEP (1981-85); with so many threads (100+), it didn't even need caches; failed commercially
- Not popular today: many threads means many register files
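The every-cycle round-robin policy can be sketched in a few lines (thread and cycle counts are arbitrary illustration values):

```python
from itertools import cycle

# Toy FGMT issue trace: switch threads every cycle, round-robin,
# regardless of misses or stalls.
def fgmt_schedule(n_threads, n_cycles):
    rr = cycle(range(n_threads))            # endless 0, 1, ..., n-1, 0, 1, ...
    return [next(rr) for _ in range(n_cycles)]

print(fgmt_schedule(4, 8))  # [0, 1, 2, 3, 0, 1, 2, 3]
```

Note how no thread ever gets two consecutive cycles, which is exactly why single-thread performance suffers.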
13 Fine-Grain Multithreading
- FGMT: multiple threads in the pipeline at once
- (Figure: FGMT pipeline with a thread scheduler and (many) more replicated register files)
14 Vertical and Horizontal Under-Utilization
- FGMT and CGMT reduce vertical under-utilization: loss of all slots in an issue cycle
- They do not help with horizontal under-utilization: loss of some slots in an issue cycle (in a superscalar processor)
- (Figure: issue-slot diagrams for CGMT, FGMT, and SMT)
15 SMT
16 Simultaneous Multithreading (SMT)
- What can issue insns from multiple threads in one cycle? The same thing that issues insns from multiple parts of the same program: out-of-order execution
- How are the different PCs managed?
- Simultaneous multithreading (SMT) = OOO + FGMT, a.k.a. hyper-threading
- Observation: once insns are renamed, the scheduler doesn't care which thread they came from (well, for non-loads at least)
- Some examples
  - IBM Power5: 4-way issue, 2 threads
  - Intel Pentium 4: 3-way issue, 2 threads
  - Intel Nehalem: 4-way issue, 2 threads
  - Alpha 21464: 8-way issue, 4 threads (canceled)
- Notice a pattern? Issue width (N) is roughly 2 x #threads (T)
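The key SMT property — one cycle's issue slots filled from any thread's ready instructions — can be sketched as a toy single-cycle issue step (ready counts and the oldest-thread-first tie-break are made-up illustration choices, not any real machine's policy):

```python
# Toy SMT issue: each cycle, fill up to `width` issue slots from the ready
# instructions of any thread; the scheduler is thread-blind after rename.
def smt_issue(ready, width=4):
    """ready: {thread_id: number of ready insns}. Returns insns issued per
    thread this cycle, lowest thread id first as an arbitrary tie-break."""
    issued, slots = {}, width
    for tid in sorted(ready):
        take = min(ready[tid], slots)   # grab as many slots as remain
        issued[tid] = take
        slots -= take
        if slots == 0:
            break
    return issued

print(smt_issue({0: 2, 1: 3}, width=4))  # {0: 2, 1: 2}: both threads share one cycle
```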
17 Simultaneous Multithreading (SMT)
- SMT: replicate the map table, share a (larger) physical register file
- (Figure: baseline OOO pipeline with one map table and regfile vs. an SMT pipeline with a thread scheduler and per-thread map tables)
18 SMT Resource Partitioning
- The physical regfile and insn buffer entries are shared at a fine grain
  - They are physically unordered, so fine-grain sharing is possible
- How are physically ordered structures (ROB/LSQ) shared?
  - Fine-grain sharing would entangle retire (and flush); allowing threads to commit independently is important
- (Figure: SMT pipeline with thread scheduler and per-thread map tables)
19 Static & Dynamic Resource Partitioning
- Static partitioning: T equal-sized contiguous partitions
  ± No starvation, but sub-optimal utilization (fragmentation)
- Dynamic partitioning: P > T partitions, with available partitions assigned on a need basis
  ± Better utilization, but possible starvation
  - ICOUNT: a fetch policy that prefers the thread with the fewest in-flight insns
  - Couple both with larger ROBs/LSQs
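The ICOUNT heuristic mentioned above is simple enough to sketch directly (the in-flight counts are illustration values):

```python
# Toy ICOUNT fetch policy: each cycle, fetch from the thread with the fewest
# instructions currently in flight, which starves no one and favors threads
# that are draining the pipeline quickly.
def icount_pick(in_flight):
    """in_flight: {thread_id: insns currently in the pipeline}."""
    return min(in_flight, key=in_flight.get)

print(icount_pick({0: 12, 1: 5, 2: 9}))  # 1: thread 1 has the fewest in flight
```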
20 Multithreading Issues
- Shared soft state (caches, branch predictors, TLBs, etc.)
  - Key example: cache interference, a general concern for all MT variants
  - Can the working sets of multiple threads fit in the caches?
  - Shared-memory SPMD threads help here
    + The same insns share the I$
    + Shared data means less D$ contention
  - MT is good for workloads with shared insns/data
  - To keep miss rates low, SMT might need a larger L2 (which is OK); out-of-order execution tolerates L1 misses
- Large physical register file (and map table)
  - #physical registers = (#threads x #arch regs) + #in-flight insns
  - #map table entries = #threads x #arch regs
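The sizing formulas above can be turned into a quick calculation (the parameter values — 2 threads, 32 architectural registers, 64 in-flight insns — are made-up examples, not any specific design):

```python
# Register-file sizing from the slide's formulas.
def phys_regs(n_threads, n_arch_regs, n_in_flight):
    # One committed copy of the architectural state per thread,
    # plus one rename register per in-flight instruction.
    return n_threads * n_arch_regs + n_in_flight

def map_table_entries(n_threads, n_arch_regs):
    # One rename mapping per architectural register, per thread.
    return n_threads * n_arch_regs

print(phys_regs(2, 32, 64))      # 128
print(map_table_entries(2, 32))  # 64
```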
21 Notes About Sharing Soft State
- Caches are shared naturally
  - Physically-tagged: address translation distinguishes different threads
  - ...but TLBs need explicit thread IDs to be shared
  - Virtually-tagged: entries of different threads are indistinguishable
  - Thread IDs are only a few bits: enough to identify the on-chip contexts
- Thread IDs make sense in the BTB (branch target buffer)
  - BTB entries are already large; a few extra bits per entry won't matter
  - Using another thread's target prediction means an automatic mis-prediction
- ...but not in a BHT (branch history table)
  - BHT entries are small; a few extra bits per entry is a huge overhead
  - Using another thread's direction prediction is not automatically a mis-prediction
- Ordered soft state should be replicated
  - Examples: branch history register (BHR), return address stack (RAS)
  - Otherwise it becomes meaningless; fortunately, it is typically small
22 The real cost of SMT
23 Not always a good thing
24 Multithreading vs. Multicore
- If you wanted to run multiple threads, would you build
  - A multicore: multiple separate pipelines?
  - A multithreaded processor: a single, larger pipeline?
- Both will get you throughput on multiple threads
  - A multicore core will be simpler, with a possibly faster clock
  - SMT will get you better performance (IPC) on a single thread
  - SMT is basically an ILP engine that converts TLP to ILP; multicore is mainly a TLP (thread-level parallelism) engine
- Or do both: Sun's Niagara (UltraSPARC T1), a chip multiprocessor
  - 8 processors, each with 4 threads (non-SMT threading)
  - 1 GHz clock, in-order, short pipeline (6 stages or so)
  - Designed for power-efficient throughput computing
25 Multithreading Summary
- Latency vs. throughput
- Partitioning different processor resources
- Three multithreading variants
  - Coarse-grain: no single-thread degradation, but tolerates long latencies only
  - Fine-grain: the other end of the trade-off
  - Simultaneous: fine-grain with out-of-order execution
- Multithreading vs. chip multiprocessing
26 EXAMPLE: THE SUN ULTRASPARC T1 MULTIPROCESSOR
27 The Sun UltraSPARC T1
28 Floor plan
29 Enclosure (SunFire T2000)
30 Power breakdown
31 The Sun UltraSPARC T1 pipeline
32 Cost
33 Performance
34 Performance (normalized to Pentium D)
35 Sun UltraSPARC T2
- Second iteration of Niagara
  - 8 pipelined cores, 8-way SMT: 64 hardware threads per chip
  - 4MB L2, 8 banks
  - Fully pipelined FPU (1 per core)
  - Dual 10 GbE and PCIe integrated
  - Security processor per core: DES, 3DES, AES, etc.
  - 65nm, 1.4 GHz, <95W (chip), >60 GB/s memory bandwidth (4 FB-DIMM controllers)
- Sun UltraSPARC T3 (a.k.a. Niagara Falls)
  - 16 cores, 16-way SMT, 4-chip servers
36 RESEARCH
37 Research: Speculative Multithreading
- Speculative multithreading: use multiple threads/processors for single-thread performance
  - Speculatively parallelize sequential loops that might not be parallel
  - Processing elements (PEs) are arranged in a logical ring
  - The compiler or hardware assigns iterations to consecutive PEs
  - Hardware tracks the logical order to detect mis-parallelization
- Techniques exist for doing this on non-loop code too: detect reconvergence points (function calls, conditional code)
- Effectively chains the ROBs of different processors into one big ROB
  - A global commit head travels from one PE to the next
  - Mis-parallelization flushes one PE, but not all PEs
- Also known as split-window or Multiscalar
- Not commercially available yet; eventually will be? It is the biggest idea from academia not yet adopted
38 Research: Multithreading for Reliability
- Can multithreading help with reliability?
  - Design bugs / manufacturing defects? No
  - Gradual defects, e.g., thermal wear? No
  - Transient errors? Yes
- Staggered redundant multithreading (SRT)
  - Run two copies of the program at a slight stagger
  - Compare results; on a difference, flush both copies and restart
  - Significant performance overhead
39 Research: QoS, CMP & Server Consolidation
- Definition: server consolidation is an approach to the efficient usage of computer server resources in order to reduce the total number of servers or server locations that an organization requires
- (Figure: several servers — www server, database servers, middleware servers — to be consolidated)
40 Motivation: CMP & Server Consolidation
- CMP-based systems are pervasive and an ideal hardware platform (in terms of cost) on which to deploy server consolidation
- (Figure: the same servers consolidated onto a 64-core CMP with per-core L1s and a shared L2 cache)
41 Virtualization
- Virtualization is the most suitable tool to achieve server consolidation
  - Hides all the nasty details from the user
  - Eases system administration
- Performance? Most virtualization layers provide a huge set of parameters to regulate access to shared resources
  - Disk
  - Memory
  - Network
  - Time-sharing of the CPU
42 Performance Isolation & CMP
- CMP introduces a new dimension in shared resources: a large portion of the memory hierarchy will be shared
- Two major potential problems
  - On-chip cache capacity and/or bandwidth
  - Off-chip bandwidth
- Questions that arise
  - Is this too fine-grained to be managed effectively at the software level?
  - Does the CMP have to be aware of the virtualization layer?
  - Does the hardware have to provide mechanisms to apply virtualization policies?
43 Is It a Problem?
- Let's analyze the problem with state-of-the-art infrastructure
- Consolidated servers
  - A significant workload to be consolidated (SPECweb2005) with Apache Tomcat
  - A misbehaving workload that stresses the CMP (on-chip capacity & off-chip bandwidth)
- Software stack
  - Virtualization layer (Xen & Solaris Zones) with resource controls
  - Realistic operating systems (GNU/Linux Debian 5, Solaris 10)
44 Hardware Setup

                   Sun Fire T1000                            HP ProLiant DL160 G5
  Processor model  Sun UltraSPARC T1                         Intel Xeon X5472
  Threads, cores,  32, 8, 1, 1                               8, 8, 4, 2
  dies, chips
  L1I              Private, 16KB, 4-way set-associative,     Private, 32KB, 8-way set-associative,
                   32-byte lines, 1 per core                 64-byte lines, 1 per core
  L1D              Private, 8KB, 4-way set-associative,      Private, 32KB, 8-way set-associative,
                   16-byte lines, write-through, 1 per core  64-byte lines, 1 per core
  L2               3MB shared per chip, 12-way               6MB shared per die, 24-way
                   set-associative, 64-byte lines            set-associative, 64-byte lines
  Memory           8GB, 533MHz DDR2                          16GB, 667MHz FB-DIMM
  Hard disk        1 SATA                                    2 SATA, RAID 0
  NIC              Dual Gigabit Ethernet (bonded)            Dual Gigabit Ethernet (bonded)
45 Misbehaving Application on T1
- (Figure: normalized off-chip bandwidth available to the primary VPS, and normalized performance, for primary VPS / secondary VPS configurations)
46 Performance Isolation on T1
- (Figure: normalized performance for the Bank, E-Commerce, and Support benchmarks and their average, under idle, harm.on-chip, and harm.off-chip conditions)
47 Performance Isolation on Xeon
- (Figure: normalized performance for Bank, E-Commerce, Support, and their average, under three mappings: die (1 core per VPS), chip (2 cores per VPS), and chip interleaved (2 cores per VPS))
48 What To Do?
- Current virtualization + hardware is not enough to provide the expected level of performance
- Could a provider guarantee an SLA?
  - It could potentially have to reduce the degree of server consolidation
  - It could potentially lose its customers
- With the advent of many-core CMPs, the problem will only get worse
- We need some sort of QoS provision
  - Hardware-level mechanisms to control access to common resources
  - Software could specify the policies for those mechanisms
49 Acknowledgments
- Slides developed by Amir Roth of the University of Pennsylvania with sources that included University of Wisconsin slides by Mark Hill, Guri Sohi, Jim Smith, and David Wood
- Slides enhanced by Milo Martin and Mark Hill with sources that included Profs. Asanovic, Falsafi, Hoe, Lipasti, Shen, Smith, Sohi, Vijaykumar, and Wood
- Slides re-adapted by V. Puente of the University of Cantabria
enabling Ultra-High Bandwidth Scalable SSDs with HLnand
www.hlnand.com enabling Ultra-High Bandwidth Scalable SSDs with HLnand May 2013 2 Enabling Ultra-High Bandwidth Scalable SSDs with HLNAND INTRODUCTION Solid State Drives (SSDs) are available in a wide
Intel Itanium Quad-Core Architecture for the Enterprise. Lambert Schaelicke Eric DeLano
Intel Itanium Quad-Core Architecture for the Enterprise Lambert Schaelicke Eric DeLano Agenda Introduction Intel Itanium Roadmap Intel Itanium Processor 9300 Series Overview Key Features Pipeline Overview
Seradex White Paper. Focus on these points for optimizing the performance of a Seradex ERP SQL database:
Seradex White Paper A Discussion of Issues in the Manufacturing OrderStream Microsoft SQL Server High Performance for Your Business Executive Summary Microsoft SQL Server is the leading database product
SQL Server Business Intelligence on HP ProLiant DL785 Server
SQL Server Business Intelligence on HP ProLiant DL785 Server By Ajay Goyal www.scalabilityexperts.com Mike Fitzner Hewlett Packard www.hp.com Recommendations presented in this document should be thoroughly
VTrak 15200 SATA RAID Storage System
Page 1 15-Drive Supports over 5 TB of reliable, low-cost, high performance storage 15200 Product Highlights First to deliver a full HW iscsi solution with SATA drives - Lower CPU utilization - Higher data
OBJECTIVE ANALYSIS WHITE PAPER MATCH FLASH. TO THE PROCESSOR Why Multithreading Requires Parallelized Flash ATCHING
OBJECTIVE ANALYSIS WHITE PAPER MATCH ATCHING FLASH TO THE PROCESSOR Why Multithreading Requires Parallelized Flash T he computing community is at an important juncture: flash memory is now generally accepted
Parallel Algorithm Engineering
Parallel Algorithm Engineering Kenneth S. Bøgh PhD Fellow Based on slides by Darius Sidlauskas Outline Background Current multicore architectures UMA vs NUMA The openmp framework Examples Software crisis
Fusionstor NAS Enterprise Server and Microsoft Windows Storage Server 2003 competitive performance comparison
Fusionstor NAS Enterprise Server and Microsoft Windows Storage Server 2003 competitive performance comparison This white paper compares two important NAS operating systems and examines their performance.
FPGA-based Multithreading for In-Memory Hash Joins
FPGA-based Multithreading for In-Memory Hash Joins Robert J. Halstead, Ildar Absalyamov, Walid A. Najjar, Vassilis J. Tsotras University of California, Riverside Outline Background What are FPGAs Multithreaded
Improving Economics of Blades with VMware
Improving Economics of Blades with VMware Executive Summary Today IT efficiency is critical for competitive viability. However, IT organizations face many challenges, including, growing capacity while
This Unit: Virtual Memory. CIS 501 Computer Architecture. Readings. A Computer System: Hardware
This Unit: Virtual CIS 501 Computer Architecture Unit 5: Virtual App App App System software Mem CPU I/O The operating system (OS) A super-application Hardware support for an OS Virtual memory Page tables
Cloud Storage. Parallels. Performance Benchmark Results. White Paper. www.parallels.com
Parallels Cloud Storage White Paper Performance Benchmark Results www.parallels.com Table of Contents Executive Summary... 3 Architecture Overview... 3 Key Features... 4 No Special Hardware Requirements...
Computer Systems Structure Main Memory Organization
Computer Systems Structure Main Memory Organization Peripherals Computer Central Processing Unit Main Memory Computer Systems Interconnection Communication lines Input Output Ward 1 Ward 2 Storage/Memory
Leading Virtualization Performance and Energy Efficiency in a Multi-processor Server
Leading Virtualization Performance and Energy Efficiency in a Multi-processor Server Product Brief Intel Xeon processor 7400 series Fewer servers. More performance. With the architecture that s specifically
Multi-core Programming System Overview
Multi-core Programming System Overview Based on slides from Intel Software College and Multi-Core Programming increasing performance through software multi-threading by Shameem Akhter and Jason Roberts,
Multi-Threading Performance on Commodity Multi-Core Processors
Multi-Threading Performance on Commodity Multi-Core Processors Jie Chen and William Watson III Scientific Computing Group Jefferson Lab 12000 Jefferson Ave. Newport News, VA 23606 Organization Introduction
The Shortcut Guide to Balancing Storage Costs and Performance with Hybrid Storage
The Shortcut Guide to Balancing Storage Costs and Performance with Hybrid Storage sponsored by Dan Sullivan Chapter 1: Advantages of Hybrid Storage... 1 Overview of Flash Deployment in Hybrid Storage Systems...
SAP HANA - Main Memory Technology: A Challenge for Development of Business Applications. Jürgen Primsch, SAP AG July 2011
SAP HANA - Main Memory Technology: A Challenge for Development of Business Applications Jürgen Primsch, SAP AG July 2011 Why In-Memory? Information at the Speed of Thought Imagine access to business data,
Sockets vs. RDMA Interface over 10-Gigabit Networks: An In-depth Analysis of the Memory Traffic Bottleneck
Sockets vs. RDMA Interface over 1-Gigabit Networks: An In-depth Analysis of the Memory Traffic Bottleneck Pavan Balaji Hemal V. Shah D. K. Panda Network Based Computing Lab Computer Science and Engineering
Intel Pentium 4 Processor on 90nm Technology
Intel Pentium 4 Processor on 90nm Technology Ronak Singhal August 24, 2004 Hot Chips 16 1 1 Agenda Netburst Microarchitecture Review Microarchitecture Features Hyper-Threading Technology SSE3 Intel Extended
Control 2004, University of Bath, UK, September 2004
Control, University of Bath, UK, September ID- IMPACT OF DEPENDENCY AND LOAD BALANCING IN MULTITHREADING REAL-TIME CONTROL ALGORITHMS M A Hossain and M O Tokhi Department of Computing, The University of
Multi-Core Programming
Multi-Core Programming Increasing Performance through Software Multi-threading Shameem Akhter Jason Roberts Intel PRESS Copyright 2006 Intel Corporation. All rights reserved. ISBN 0-9764832-4-6 No part
PARALLELS CLOUD SERVER
PARALLELS CLOUD SERVER Performance and Scalability 1 Table of Contents Executive Summary... Error! Bookmark not defined. LAMP Stack Performance Evaluation... Error! Bookmark not defined. Background...
High Performance Processor Architecture. André Seznec IRISA/INRIA ALF project-team
High Performance Processor Architecture André Seznec IRISA/INRIA ALF project-team 1 2 Moore s «Law» Nb of transistors on a micro processor chip doubles every 18 months 1972: 2000 transistors (Intel 4004)
A Quantum Leap in Enterprise Computing
A Quantum Leap in Enterprise Computing Unprecedented Reliability and Scalability in a Multi-Processor Server Product Brief Intel Xeon Processor 7500 Series Whether you ve got data-demanding applications,
Dell Virtualization Solution for Microsoft SQL Server 2012 using PowerEdge R820
Dell Virtualization Solution for Microsoft SQL Server 2012 using PowerEdge R820 This white paper discusses the SQL server workload consolidation capabilities of Dell PowerEdge R820 using Virtualization.
Adaptec: Snap Server NAS Performance Study
March 2006 www.veritest.com [email protected] Adaptec: Snap Server NAS Performance Study Test report prepared under contract from Adaptec, Inc. Executive summary Adaptec commissioned VeriTest, a service
How To Compare Two Servers For A Test On A Poweredge R710 And Poweredge G5P (Poweredge) (Power Edge) (Dell) Poweredge Poweredge And Powerpowerpoweredge (Powerpower) G5I (
TEST REPORT MARCH 2009 Server management solution comparison on Dell PowerEdge R710 and HP Executive summary Dell Inc. (Dell) commissioned Principled Technologies (PT) to compare server management solutions
Computer Systems Structure Input/Output
Computer Systems Structure Input/Output Peripherals Computer Central Processing Unit Main Memory Computer Systems Interconnection Communication lines Input Output Ward 1 Ward 2 Examples of I/O Devices
7 Real Benefits of a Virtual Infrastructure
7 Real Benefits of a Virtual Infrastructure Dell September 2007 Even the best run IT shops face challenges. Many IT organizations find themselves with under-utilized servers and storage, yet they need
Gigabit Ethernet Design
Gigabit Ethernet Design Laura Jeanne Knapp Network Consultant 1-919-254-8801 [email protected] www.lauraknapp.com Tom Hadley Network Consultant 1-919-301-3052 [email protected] HSEdes_ 010 ed and
OpenSPARC T1 Processor
OpenSPARC T1 Processor The OpenSPARC T1 processor is the first chip multiprocessor that fully implements the Sun Throughput Computing Initiative. Each of the eight SPARC processor cores has full hardware
