This Unit: Multithreading (MT) CIS 501 Computer Architecture. Performance And Utilization. Readings
|
|
|
- Hortense Gray
- 10 years ago
- Views:
Transcription
1 This Unit: Multithreading (MT) CIS 501 Computer Architecture Unit 10: Hardware Multithreading Application OS Compiler Firmware CU I/O Memory Digital Circuits Gates & Transistors Why multithreading (MT)? Utilization vs. performance Three implementations Coarse-grained MT Fine-grained MT Simultaneous MT (SMT) Slides originally developed by Amir Roth with contributions by Milo Martin at University of ennsylvania with sources that included University of Wisconsin slides by Mark Hill, Guri Sohi, Jim Smith, and David Wood. CIS 501 (Martin): Multithreading 1 CIS 501 (Martin): Multithreading 2 Readings Textbook (MA:FSTCM) Section 8.1 aper Tullsen et al., Exploiting Choice CIS 501 (Martin): Multithreading 3 erformance And Utilization Even moderate superscalar (e.g., 4-way) not fully utilized Average sustained IC: 1.5 2! < 50% utilization. Utilization is (actual IC / peak IC) Why so low? Many dead cycles, due too: Mis-predicted branches Cache misses, especially misses to off-chip memory Data dependences Some workloads worse than others, for example, databases ig data, lots of instructions, hard-to-predict branches Have resource idle is wasteful How can we better utilize the core? Multi-threading (MT) Improve utilization by multiplexing multiple threads on single core If one thread cannot fully utilize core? Maybe 2 or 4 can CIS 501 (Martin): Multithreading 4
2 Superscalar Under-utilization Time evolution of issue slot 4-issue processor Simple Multithreading Time evolution of issue slot 4-issue processor cache miss time cache miss Fill in with instructions from another thread Superscalar CIS 501 (Martin): Multithreading 5 Superscalar Multithreading Where does it find a thread? Same problem as multi-core Same shared-memory abstraction CIS 501 (Martin): Multithreading 6 Latency vs Throughput MT trades (single-thread) latency for throughput Sharing processor degrades latency of individual threads + ut improves aggregate latency of both threads + Improves utilization Example Thread A: individual latency=10s, latency with thread =15s Thread : individual latency=20s, latency with thread A=25s Sequential latency (first A then or vice versa): 30s arallel latency (A and simultaneously): 25s MT slows each thread by 5s + ut improves total latency by 5s Different workloads have different parallelism SpecF has lots of IL (can use an 8-wide machine) Server workloads have TL (can use multiple threads) CIS 501 (Martin): Multithreading 7 MT Implementations: Similarities How do multiple threads share a single processor? Different sharing mechanisms for different kinds of structures Depend on what kind of state structure stores No state: ALUs Dynamically shared ersistent hard state (aka context ): C, registers Replicated ersistent soft state: caches, branch predictor Dynamically shared (like on a multi-programmed uni-processor) TLs need thread ids, caches/bpred tables don t Exception: ordered soft state (HR, RAS) is replicated Transient state: pipeline latches, RO, Instruction Queue artitioned somehow CIS 501 (Martin): Multithreading 8
3 MT Implementations: Differences Key question: thread scheduling policy When to switch from one thread to another? Related question: pipeline partitioning How exactly do threads share the pipeline itself? Choice depends on What kind of latencies (specifically, length) you want to tolerate How much single thread performance you are willing to sacrifice The Standard Multithreading icture Time evolution of issue slots Color = thread time Three designs Coarse-grain multithreading (CGMT) Fine-grain multithreading (FGMT) Simultaneous multithreading (SMT) Superscalar CGMT FGMT SMT CIS 501 (Martin): Multithreading 9 CIS 501 (Martin): Multithreading 10 Coarse-Grain Multithreading (CGMT) CGMT Coarse-Grain Multi-Threading (CGMT) + Sacrifices very little single thread performance (of one thread) Tolerates only long latencies (e.g., L2 misses) riority thread scheduling policy Designate a preferred thread (e.g., thread A) Switch to thread on thread A cache miss Switch back to A when A cache miss returns Can also give threads equal priority ipeline partitioning None, flush the pipe on thread switch Can t tolerate short latencies (<2x pipeline depth) est with short in-order pipelines Example: IM Northstar/ulsar CGMT CGMT CIS 501 (Martin): Multithreading 11 L2 miss? CIS 501 (Martin): Multithreading 12
4 Fine-Grain Multithreading (FGMT) (Extremely) Fine-Grain Multithreading (FGMT) Key idea: have so many threads that never stall on cache misses If miss latency is N cycles, have N threads! Thread scheduling policy Switch threads every cycle (round-robin) Need a lot of threads Many threads! many register files Sacrifices significant single thread performance + Tolerates latencies (e.g., cache misses) + No bypassing needed! ipeline partitioning Dynamic, no flushing Extreme example: Denelcor HE, Tera MTA So many threads (100+), it didn t even need caches Failed commercially, but good for certain niche apps FGMT CIS 501 (Martin): Multithreading 13 Fine-Grain Multithreading (FGMT) (Adaptive) Fine-Grain Multithreading (FGMT) Instead of switching each cycle, just pick from ready threads Thread scheduling policy Each cycle, select one thread to execute from + Fewer threads needed + Tolerates latencies (e.g., cache misses, branch mispred) + Sacrifices little single-thread performance Requires modified pipeline Must handle multiple threads in-flight ipeline partitioning Dynamic, no flushing FGMT CIS 501 (Martin): Multithreading 14 Fine-Grain Multithreading FGMT Multiple threads in pipeline at once (Many) more threads Vertical and Horizontal Under-Utilization FGMT and CGMT reduce vertical under-utilization Loss of all slots in an issue cycle Do not help with horizontal under-utilization Loss of some slots in an issue cycle (in a superscalar processor) time CIS 501 (Martin): Multithreading 15 CGMT FGMT SMT CIS 501 (Martin): Multithreading 16
5 Simultaneous Multithreading (SMT) What can issue insns from multiple threads in one cycle? Same thing that issues insns from multiple parts of same program out-of-order execution Simultaneous multithreading (SMT): OOO + FGMT Aka hyper-threading Observation: once insns are renamed to physical registers... Scheduler need to know which thread they came from Some examples IM ower5: 4-way issue, 2 threads Intel entium4: 3-way issue, 2 threads Intel Core i7: 4-way issue, 2 threads Alpha 21464: 8-way issue, 4 threads (canceled) Notice a pattern? #threads (T) * 2 = #issue width (N) CIS 501 (Martin): Multithreading 17 Simultaneous Multithreading (SMT) SMT Replicate map table, share (larger) physical register file map table map tables CIS 501 (Martin): Multithreading 18 SMT Resource artitioning hysical and insn buffer entries shared at fine-grain hysically unordered and so fine-grain sharing is possible How are ordered structures (RO/LSQ) shared? Fine-grain sharing (below) would entangle commit (and squash) Allowing threads to commit independently is important Thus, typically replicated or statically partitioned RO is cheap (just a FIFO, so it can be big) LSQ is more of an issue map tables Static & Dynamic Resource artitioning Static partitioning (below) T equal-sized contiguous partitions ± No starvation, sub-optimal utilization (fragmentation) Dynamic partitioning > T partitions, available partitions assigned on need basis ± etter utilization, possible starvation ICOUNT: fetch policy prefers thread with fewest in-flight insns Couple both with larger ROs/LSQs CIS 501 (Martin): Multithreading 19 CIS 501 (Martin): Multithreading 20
6 Multithreading Issues Shared soft state (caches, branch predictors, TLs, etc.) Key example: cache interference General concern for all multithreading variants Can the working sets of multiple threads fit in the caches? Shared memory threads help here (versus multiple programs) + Same insns! share instruction cache + Shared data! share data cache To keep miss rates low SMT might need more associative & larger caches (which is OK) Out-of-order tolerates L1 misses Large physical register file (and map table) physical registers = (#threads * #arch-regs) + #in-flight insns map table entries = (#threads * #arch-regs) CIS 501 (Martin): Multithreading 21 Notes About Sharing Soft State Caches can be shared naturally hysically-tagged: address translation distinguishes different threads but TLs need explicit thread IDs to be shared Virtually-tagged: entries of different threads indistinguishable Thread IDs are only a few bits: enough to identify on-chip contexts Thread IDs make sense on T (branch target buffer) T entries are already large, a few extra bits / entry won t matter Different thread s target prediction! automatic mis-prediction but not on a HT (branch history table) HT entries are small, a few extra bits / entry is huge overhead Different thread s direction prediction! mis-prediction not automatic Ordered soft-state should be replicated Examples: ranch History Register (HR), Return Address Stack (RAS) Otherwise it becomes meaningless Fortunately, it is typically small CIS 501 (Martin): Multithreading 22 Multithreading vs. Multicore If you wanted to run multiple threads would you build a A multicore: multiple separate pipelines? A multithreaded processor: a single larger pipeline? oth will get you throughput on multiple threads Multicore core will be simpler, possibly faster clock SMT will get you better performance (IC) on a single thread SMT is basically an IL engine that converts TL to IL Multicore is mainly a TL (thread-level parallelism) engine Do both Sun s Niagara (UltraSARC T1) 8 processors, each with 4-threads (non-smt threading) 1Ghz clock, in-order, short pipeline (6 stages or so) Designed for power-efficient throughput computing CIS 501 (Martin): Multithreading 23 Research: Speculative Multithreading Speculative multithreading Use multiple threads/processors for single-thread performance Speculatively parallelize sequential loops, that might not be parallel rocessing elements (called E) arranged in logical ring Compiler or hardware assigns iterations to consecutive Es Hardware tracks logical order to detect mis-parallelization Techniques for doing this on non-loop code too Detect reconvergence points (function calls, conditional code) Effectively chains ROs of different processors into one big RO Global commit head travels from one E to the next Mis-parallelization flushes one Es, but not all Es Also known as split-window or Multiscalar Not commercially available yet ut it is the biggest idea from academia not yet adopted CIS 501 (Martin): Multithreading 24
7 Multithreading Summary Latency vs. throughput artitioning different processor resources Three multithreading variants Coarse-grain: no single-thread degradation, but long latencies only Fine-grain: other end of the trade-off Simultaneous: fine-grain with out-of-order Multithreading vs. chip multiprocessing CIS 501 (Martin): Multithreading 25
Thread Level Parallelism II: Multithreading
Thread Level Parallelism II: Multithreading Readings: H&P: Chapter 3.5 Paper: NIAGARA: A 32-WAY MULTITHREADED Thread Level Parallelism II: Multithreading 1 This Unit: Multithreading (MT) Application OS
Multithreading Lin Gao cs9244 report, 2006
Multithreading Lin Gao cs9244 report, 2006 2 Contents 1 Introduction 5 2 Multithreading Technology 7 2.1 Fine-grained multithreading (FGMT)............. 8 2.2 Coarse-grained multithreading (CGMT)............
Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.
Lecture 11: Multi-Core and GPU Multi-core computers Multithreading GPUs General Purpose GPUs Zebo Peng, IDA, LiTH 1 Multi-Core System Integration of multiple processor cores on a single chip. To provide
Thread level parallelism
Thread level parallelism ILP is used in straight line code or loops Cache miss (off-chip cache and main memory) is unlikely to be hidden using ILP. Thread level parallelism is used instead. Thread: process
Energy-Efficient, High-Performance Heterogeneous Core Design
Energy-Efficient, High-Performance Heterogeneous Core Design Raj Parihar Core Design Session, MICRO - 2012 Advanced Computer Architecture Lab, UofR, Rochester April 18, 2013 Raj Parihar Energy-Efficient,
Operating System Impact on SMT Architecture
Operating System Impact on SMT Architecture The work published in An Analysis of Operating System Behavior on a Simultaneous Multithreaded Architecture, Josh Redstone et al., in Proceedings of the 9th
Putting it all together: Intel Nehalem. http://www.realworldtech.com/page.cfm?articleid=rwt040208182719
Putting it all together: Intel Nehalem http://www.realworldtech.com/page.cfm?articleid=rwt040208182719 Intel Nehalem Review entire term by looking at most recent microprocessor from Intel Nehalem is code
This Unit: Putting It All Together. CIS 501 Computer Architecture. Sources. What is Computer Architecture?
This Unit: Putting It All Together CIS 501 Computer Architecture Unit 11: Putting It All Together: Anatomy of the XBox 360 Game Console Slides originally developed by Amir Roth with contributions by Milo
Solution: start more than one instruction in the same clock cycle CPI < 1 (or IPC > 1, Instructions per Cycle) Two approaches:
Multiple-Issue Processors Pipelining can achieve CPI close to 1 Mechanisms for handling hazards Static or dynamic scheduling Static or dynamic branch handling Increase in transistor counts (Moore s Law):
More on Pipelining and Pipelines in Real Machines CS 333 Fall 2006 Main Ideas Data Hazards RAW WAR WAW More pipeline stall reduction techniques Branch prediction» static» dynamic bimodal branch prediction
Static Scheduling. option #1: dynamic scheduling (by the hardware) option #2: static scheduling (by the compiler) ECE 252 / CPS 220 Lecture Notes
basic pipeline: single, in-order issue first extension: multiple issue (superscalar) second extension: scheduling instructions for more ILP option #1: dynamic scheduling (by the hardware) option #2: static
VLIW Processors. VLIW Processors
1 VLIW Processors VLIW ( very long instruction word ) processors instructions are scheduled by the compiler a fixed number of operations are formatted as one big instruction (called a bundle) usually LIW
OC By Arsene Fansi T. POLIMI 2008 1
IBM POWER 6 MICROPROCESSOR OC By Arsene Fansi T. POLIMI 2008 1 WHAT S IBM POWER 6 MICROPOCESSOR The IBM POWER6 microprocessor powers the new IBM i-series* and p-series* systems. It s based on IBM POWER5
Technical Report. Complexity-effective superscalar embedded processors using instruction-level distributed processing. Ian Caulfield.
Technical Report UCAM-CL-TR-707 ISSN 1476-2986 Number 707 Computer Laboratory Complexity-effective superscalar embedded processors using instruction-level distributed processing Ian Caulfield December
Multi-core architectures. Jernej Barbic 15-213, Spring 2007 May 3, 2007
Multi-core architectures Jernej Barbic 15-213, Spring 2007 May 3, 2007 1 Single-core computer 2 Single-core CPU chip the single core 3 Multi-core architectures This lecture is about a new trend in computer
<Insert Picture Here> T4: A Highly Threaded Server-on-a-Chip with Native Support for Heterogeneous Computing
T4: A Highly Threaded Server-on-a-Chip with Native Support for Heterogeneous Computing Robert Golla Senior Hardware Architect Paul Jordan Senior Principal Hardware Engineer Oracle
INSTRUCTION LEVEL PARALLELISM PART VII: REORDER BUFFER
Course on: Advanced Computer Architectures INSTRUCTION LEVEL PARALLELISM PART VII: REORDER BUFFER Prof. Cristina Silvano Politecnico di Milano [email protected] Prof. Silvano, Politecnico di Milano
Introduction to Cloud Computing
Introduction to Cloud Computing Parallel Processing I 15 319, spring 2010 7 th Lecture, Feb 2 nd Majd F. Sakr Lecture Motivation Concurrency and why? Different flavors of parallel computing Get the basic
EE482: Advanced Computer Organization Lecture #11 Processor Architecture Stanford University Wednesday, 31 May 2000. ILP Execution
EE482: Advanced Computer Organization Lecture #11 Processor Architecture Stanford University Wednesday, 31 May 2000 Lecture #11: Wednesday, 3 May 2000 Lecturer: Ben Serebrin Scribe: Dean Liu ILP Execution
Making Multicore Work and Measuring its Benefits. Markus Levy, president EEMBC and Multicore Association
Making Multicore Work and Measuring its Benefits Markus Levy, president EEMBC and Multicore Association Agenda Why Multicore? Standards and issues in the multicore community What is Multicore Association?
NIAGARA: A 32-WAY MULTITHREADED SPARC PROCESSOR
NIAGARA: A 32-WAY MULTITHREADED SPARC PROCESSOR THE NIAGARA PROCESSOR IMPLEMENTS A THREAD-RICH ARCHITECTURE DESIGNED TO PROVIDE A HIGH-PERFORMANCE SOLUTION FOR COMMERCIAL SERVER APPLICATIONS. THE HARDWARE
Parallel Computing 37 (2011) 26 41. Contents lists available at ScienceDirect. Parallel Computing. journal homepage: www.elsevier.
Parallel Computing 37 (2011) 26 41 Contents lists available at ScienceDirect Parallel Computing journal homepage: www.elsevier.com/locate/parco Architectural support for thread communications in multi-core
Chapter 2 Parallel Computer Architecture
Chapter 2 Parallel Computer Architecture The possibility for a parallel execution of computations strongly depends on the architecture of the execution platform. This chapter gives an overview of the general
CPU Scheduling Outline
CPU Scheduling Outline What is scheduling in the OS? What are common scheduling criteria? How to evaluate scheduling algorithms? What are common scheduling algorithms? How is thread scheduling different
Software and the Concurrency Revolution
Software and the Concurrency Revolution A: The world s fastest supercomputer, with up to 4 processors, 128MB RAM, 942 MFLOPS (peak). 2 Q: What is a 1984 Cray X-MP? (Or a fractional 2005 vintage Xbox )
Parallel Programming Survey
Christian Terboven 02.09.2014 / Aachen, Germany Stand: 26.08.2014 Version 2.3 IT Center der RWTH Aachen University Agenda Overview: Processor Microarchitecture Shared-Memory
CS 159 Two Lecture Introduction. Parallel Processing: A Hardware Solution & A Software Challenge
CS 159 Two Lecture Introduction Parallel Processing: A Hardware Solution & A Software Challenge We re on the Road to Parallel Processing Outline Hardware Solution (Day 1) Software Challenge (Day 2) Opportunities
Control 2004, University of Bath, UK, September 2004
Control, University of Bath, UK, September ID- IMPACT OF DEPENDENCY AND LOAD BALANCING IN MULTITHREADING REAL-TIME CONTROL ALGORITHMS M A Hossain and M O Tokhi Department of Computing, The University of
Driving force. What future software needs. Potential research topics
Improving Software Robustness and Efficiency Driving force Processor core clock speed reach practical limit ~4GHz (power issue) Percentage of sustainable # of active transistors decrease; Increase in #
Low Power AMD Athlon 64 and AMD Opteron Processors
Low Power AMD Athlon 64 and AMD Opteron Processors Hot Chips 2004 Presenter: Marius Evers Block Diagram of AMD Athlon 64 and AMD Opteron Based on AMD s 8 th generation architecture AMD Athlon 64 and AMD
Lecture: Pipelining Extensions. Topics: control hazards, multi-cycle instructions, pipelining equations
Lecture: Pipelining Extensions Topics: control hazards, multi-cycle instructions, pipelining equations 1 Problem 6 Show the instruction occupying each stage in each cycle (with bypassing) if I1 is R1+R2
WAR: Write After Read
WAR: Write After Read write-after-read (WAR) = artificial (name) dependence add R1, R2, R3 sub R2, R4, R1 or R1, R6, R3 problem: add could use wrong value for R2 can t happen in vanilla pipeline (reads
Pipelining Review and Its Limitations
Pipelining Review and Its Limitations Yuri Baida [email protected] [email protected] October 16, 2010 Moscow Institute of Physics and Technology Agenda Review Instruction set architecture Basic
Enterprise Applications
Enterprise Applications Chi Ho Yue Sorav Bansal Shivnath Babu Amin Firoozshahian EE392C Emerging Applications Study Spring 2003 Functionality Online Transaction Processing (OLTP) Users/apps interacting
Course Development of Programming for General-Purpose Multicore Processors
Course Development of Programming for General-Purpose Multicore Processors Wei Zhang Department of Electrical and Computer Engineering Virginia Commonwealth University Richmond, VA 23284 [email protected]
Symmetric Multiprocessing
Multicore Computing A multi-core processor is a processing system composed of two or more independent cores. One can describe it as an integrated circuit to which two or more individual processors (called
Q. Consider a dynamic instruction execution (an execution trace, in other words) that consists of repeats of code in this pattern:
Pipelining HW Q. Can a MIPS SW instruction executing in a simple 5-stage pipelined implementation have a data dependency hazard of any type resulting in a nop bubble? If so, show an example; if not, prove
Multi-Core Programming
Multi-Core Programming Increasing Performance through Software Multi-threading Shameem Akhter Jason Roberts Intel PRESS Copyright 2006 Intel Corporation. All rights reserved. ISBN 0-9764832-4-6 No part
Multi-core Programming System Overview
Multi-core Programming System Overview Based on slides from Intel Software College and Multi-Core Programming increasing performance through software multi-threading by Shameem Akhter and Jason Roberts,
Multicore Programming with LabVIEW Technical Resource Guide
Multicore Programming with LabVIEW Technical Resource Guide 2 INTRODUCTORY TOPICS UNDERSTANDING PARALLEL HARDWARE: MULTIPROCESSORS, HYPERTHREADING, DUAL- CORE, MULTICORE AND FPGAS... 5 DIFFERENCES BETWEEN
Multi-core and Linux* Kernel
Multi-core and Linux* Kernel Suresh Siddha Intel Open Source Technology Center Abstract Semiconductor technological advances in the recent years have led to the inclusion of multiple CPU execution cores
IBM CELL CELL INTRODUCTION. Project made by: Origgi Alessandro matr. 682197 Teruzzi Roberto matr. 682552 IBM CELL. Politecnico di Milano Como Campus
Project made by: Origgi Alessandro matr. 682197 Teruzzi Roberto matr. 682552 CELL INTRODUCTION 2 1 CELL SYNERGY Cell is not a collection of different processors, but a synergistic whole Operation paradigms,
Computer Organization and Components
Computer Organization and Components IS5, fall 25 Lecture : Pipelined Processors ssociate Professor, KTH Royal Institute of Technology ssistant Research ngineer, University of California, Berkeley Slides
Scheduling. Scheduling. Scheduling levels. Decision to switch the running process can take place under the following circumstances:
Scheduling Scheduling Scheduling levels Long-term scheduling. Selects which jobs shall be allowed to enter the system. Only used in batch systems. Medium-term scheduling. Performs swapin-swapout operations
FPGA-based Multithreading for In-Memory Hash Joins
FPGA-based Multithreading for In-Memory Hash Joins Robert J. Halstead, Ildar Absalyamov, Walid A. Najjar, Vassilis J. Tsotras University of California, Riverside Outline Background What are FPGAs Multithreaded
High Performance Processor Architecture. André Seznec IRISA/INRIA ALF project-team
High Performance Processor Architecture André Seznec IRISA/INRIA ALF project-team 1 2 Moore s «Law» Nb of transistors on a micro processor chip doubles every 18 months 1972: 2000 transistors (Intel 4004)
Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) [email protected] http://www.mzahran.com
CSCI-GA.3033-012 Graphics Processing Units (GPUs): Architecture and Programming Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) [email protected] http://www.mzahran.com Modern GPU
HyperThreading Support in VMware ESX Server 2.1
HyperThreading Support in VMware ESX Server 2.1 Summary VMware ESX Server 2.1 now fully supports Intel s new Hyper-Threading Technology (HT). This paper explains the changes that an administrator can expect
Performance Impacts of Non-blocking Caches in Out-of-order Processors
Performance Impacts of Non-blocking Caches in Out-of-order Processors Sheng Li; Ke Chen; Jay B. Brockman; Norman P. Jouppi HP Laboratories HPL-2011-65 Keyword(s): Non-blocking cache; MSHR; Out-of-order
POWER8 Performance Analysis
POWER8 Performance Analysis Satish Kumar Sadasivam Senior Performance Engineer, Master Inventor IBM Systems and Technology Labs [email protected] #OpenPOWERSummit Join the conversation at #OpenPOWERSummit
How To Understand The Design Of A Microprocessor
Computer Architecture R. Poss 1 What is computer architecture? 2 Your ideas and expectations What is part of computer architecture, what is not? Who are computer architects, what is their job? What is
Enabling Technologies for Distributed and Cloud Computing
Enabling Technologies for Distributed and Cloud Computing Dr. Sanjay P. Ahuja, Ph.D. 2010-14 FIS Distinguished Professor of Computer Science School of Computing, UNF Multi-core CPUs and Multithreading
2
1 2 3 4 5 For Description of these Features see http://download.intel.com/products/processor/corei7/prod_brief.pdf The following Features Greatly affect Performance Monitoring The New Performance Monitoring
Enabling Technologies for Distributed Computing
Enabling Technologies for Distributed Computing Dr. Sanjay P. Ahuja, Ph.D. Fidelity National Financial Distinguished Professor of CIS School of Computing, UNF Multi-core CPUs and Multithreading Technologies
Web Server Architectures
Web Server Architectures CS 4244: Internet Programming Dr. Eli Tilevich Based on Flash: An Efficient and Portable Web Server, Vivek S. Pai, Peter Druschel, and Willy Zwaenepoel, 1999 Annual Usenix Technical
2. Background. 2.1. Network Interface Processing
An Efficient Programmable 10 Gigabit Ethernet Network Interface Card Paul Willmann Hyong-youb Kim Scott Rixner Rice University Houston, TX {willmann,hykim,rixner}@rice.edu Vijay S. Pai Purdue University
Software Pipelining. for (i=1, i<100, i++) { x := A[i]; x := x+1; A[i] := x
Software Pipelining for (i=1, i
18-742 Lecture 4. Parallel Programming II. Homework & Reading. Page 1. Projects handout On Friday Form teams, groups of two
age 1 18-742 Lecture 4 arallel rogramming II Spring 2005 rof. Babak Falsafi http://www.ece.cmu.edu/~ece742 write X Memory send X Memory read X Memory Slides developed in part by rofs. Adve, Falsafi, Hill,
Operatin g Systems: Internals and Design Principle s. Chapter 10 Multiprocessor and Real-Time Scheduling Seventh Edition By William Stallings
Operatin g Systems: Internals and Design Principle s Chapter 10 Multiprocessor and Real-Time Scheduling Seventh Edition By William Stallings Operating Systems: Internals and Design Principles Bear in mind,
Chapter 2: OS Overview
Chapter 2: OS Overview CmSc 335 Operating Systems 1. Operating system objectives and functions Operating systems control and support the usage of computer systems. a. usage users of a computer system:
NVIDIA Tegra 4 Family CPU Architecture
Whitepaper NVIDIA Tegra 4 Family CPU Architecture 4-PLUS-1 Quad core 1 Table of Contents... 1 Introduction... 3 NVIDIA Tegra 4 Family of Mobile Processors... 3 Benchmarking CPU Performance... 4 Tegra 4
Exploring the Design of the Cortex-A15 Processor ARM s next generation mobile applications processor. Travis Lanier Senior Product Manager
Exploring the Design of the Cortex-A15 Processor ARM s next generation mobile applications processor Travis Lanier Senior Product Manager 1 Cortex-A15: Next Generation Leadership Cortex-A class multi-processor
Overview. CISC Developments. RISC Designs. CISC Designs. VAX: Addressing Modes. Digital VAX
Overview CISC Developments Over Twenty Years Classic CISC design: Digital VAX VAXÕs RISC successor: PRISM/Alpha IntelÕs ubiquitous 80x86 architecture Ð 8086 through the Pentium Pro (P6) RJS 2/3/97 Philosophy
Computer Organization and Architecture. Characteristics of Memory Systems. Chapter 4 Cache Memory. Location CPU Registers and control unit memory
Computer Organization and Architecture Chapter 4 Cache Memory Characteristics of Memory Systems Note: Appendix 4A will not be covered in class, but the material is interesting reading and may be used in
Unit 4: Performance & Benchmarking. Performance Metrics. This Unit. CIS 501: Computer Architecture. Performance: Latency vs.
This Unit CIS 501: Computer Architecture Unit 4: Performance & Benchmarking Metrics Latency and throughput Speedup Averaging CPU Performance Performance Pitfalls Slides'developed'by'Milo'Mar0n'&'Amir'Roth'at'the'University'of'Pennsylvania'
Week 1 out-of-class notes, discussions and sample problems
Week 1 out-of-class notes, discussions and sample problems Although we will primarily concentrate on RISC processors as found in some desktop/laptop computers, here we take a look at the varying types
Computer Systems Structure Main Memory Organization
Computer Systems Structure Main Memory Organization Peripherals Computer Central Processing Unit Main Memory Computer Systems Interconnection Communication lines Input Output Ward 1 Ward 2 Storage/Memory
SOC architecture and design
SOC architecture and design system-on-chip (SOC) processors: become components in a system SOC covers many topics processor: pipelined, superscalar, VLIW, array, vector storage: cache, embedded and external
Performance evaluation
Performance evaluation Arquitecturas Avanzadas de Computadores - 2547021 Departamento de Ingeniería Electrónica y de Telecomunicaciones Facultad de Ingeniería 2015-1 Bibliography and evaluation Bibliography
Digital Design for Low Power Systems
Digital Design for Low Power Systems Shekhar Borkar Intel Corp. Outline Low Power Outlook & Challenges Circuit solutions for leakage avoidance, control, & tolerance Microarchitecture for Low Power System
Optimizing Code for Accelerators: The Long Road to High Performance
Optimizing Code for Accelerators: The Long Road to High Performance Hans Vandierendonck Mons GPU Day November 9 th, 2010 The Age of Accelerators 2 Accelerators in Real Life 3 Latency (ps/inst) Why Accelerators?
PROBLEMS #20,R0,R1 #$3A,R2,R4
506 CHAPTER 8 PIPELINING (Corrisponde al cap. 11 - Introduzione al pipelining) PROBLEMS 8.1 Consider the following sequence of instructions Mul And #20,R0,R1 #3,R2,R3 #$3A,R2,R4 R0,R2,R5 In all instructions,
Multi-Threading Performance on Commodity Multi-Core Processors
Multi-Threading Performance on Commodity Multi-Core Processors Jie Chen and William Watson III Scientific Computing Group Jefferson Lab 12000 Jefferson Ave. Newport News, VA 23606 Organization Introduction
Processing Multiple Buffers in Parallel to Increase Performance on Intel Architecture Processors
White Paper Vinodh Gopal Jim Guilford Wajdi Feghali Erdinc Ozturk Gil Wolrich Martin Dixon IA Architects Intel Corporation Processing Multiple Buffers in Parallel to Increase Performance on Intel Architecture
This Unit: Virtual Memory. CIS 501 Computer Architecture. Readings. A Computer System: Hardware
This Unit: Virtual CIS 501 Computer Architecture Unit 5: Virtual App App App System software Mem CPU I/O The operating system (OS) A super-application Hardware support for an OS Virtual memory Page tables
CPU Scheduling. Core Definitions
CPU Scheduling General rule keep the CPU busy; an idle CPU is a wasted CPU Major source of CPU idleness: I/O (or waiting for it) Many programs have a characteristic CPU I/O burst cycle alternating phases
Introduction to Microprocessors
Introduction to Microprocessors Yuri Baida [email protected] [email protected] October 2, 2010 Moscow Institute of Physics and Technology Agenda Background and History What is a microprocessor?
MPI and Hybrid Programming Models. William Gropp www.cs.illinois.edu/~wgropp
MPI and Hybrid Programming Models William Gropp www.cs.illinois.edu/~wgropp 2 What is a Hybrid Model? Combination of several parallel programming models in the same program May be mixed in the same source
Tools Page 1 of 13 ON PROGRAM TRANSLATION. A priori, we have two translation mechanisms available:
Tools Page 1 of 13 ON PROGRAM TRANSLATION A priori, we have two translation mechanisms available: Interpretation Compilation On interpretation: Statements are translated one at a time and executed immediately.
PART IV Performance oriented design, Performance testing, Performance tuning & Performance solutions. Outline. Performance oriented design
PART IV Performance oriented design, Performance testing, Performance tuning & Performance solutions Slide 1 Outline Principles for performance oriented design Performance testing Performance tuning General
Scheduling Task Parallelism" on Multi-Socket Multicore Systems"
Scheduling Task Parallelism" on Multi-Socket Multicore Systems" Stephen Olivier, UNC Chapel Hill Allan Porterfield, RENCI Kyle Wheeler, Sandia National Labs Jan Prins, UNC Chapel Hill Outline" Introduction
CISC, RISC, and DSP Microprocessors
CISC, RISC, and DSP Microprocessors Douglas L. Jones ECE 497 Spring 2000 4/6/00 CISC, RISC, and DSP D.L. Jones 1 Outline Microprocessors circa 1984 RISC vs. CISC Microprocessors circa 1999 Perspective:
Computer Architecture TDTS10
why parallelism? Performance gain from increasing clock frequency is no longer an option. Outline Computer Architecture TDTS10 Superscalar Processors Very Long Instruction Word Processors Parallel computers
Multiprocessor Scheduling and Scheduling in Linux Kernel 2.6
Multiprocessor Scheduling and Scheduling in Linux Kernel 2.6 Winter Term 2008 / 2009 Jun.-Prof. Dr. André Brinkmann [email protected] Universität Paderborn PC² Agenda Multiprocessor and
Chapter 5 Process Scheduling
Chapter 5 Process Scheduling CPU Scheduling Objective: Basic Scheduling Concepts CPU Scheduling Algorithms Why Multiprogramming? Maximize CPU/Resources Utilization (Based on Some Criteria) CPU Scheduling
Intel Itanium Quad-Core Architecture for the Enterprise. Lambert Schaelicke Eric DeLano
Intel Itanium Quad-Core Architecture for the Enterprise Lambert Schaelicke Eric DeLano Agenda Introduction Intel Itanium Roadmap Intel Itanium Processor 9300 Series Overview Key Features Pipeline Overview
LSN 2 Computer Processors
LSN 2 Computer Processors Department of Engineering Technology LSN 2 Computer Processors Microprocessors Design Instruction set Processor organization Processor performance Bandwidth Clock speed LSN 2
! Metrics! Latency and throughput. ! Reporting performance! Benchmarking and averaging. ! CPU performance equation & performance trends
This Unit CIS 501 Computer Architecture! Metrics! Latency and throughput! Reporting performance! Benchmarking and averaging Unit 2: Performance! CPU performance equation & performance trends CIS 501 (Martin/Roth):
