Static Scheduling
ECE 252 / CPS 220 Lecture Notes

basic pipeline: single, in-order issue
first extension: multiple issue (superscalar)
second extension: scheduling instructions for more ILP
- option #1: dynamic scheduling (by the hardware)
- option #2: static scheduling (by the compiler)

Readings
- H+P: parts of chapter 2
- recent research paper: EPIC/IA-64

VLIW: Very Long Instruction Word
the problem with superscalar implementations: complexity!
- wide fetch + branch prediction (can partially fix with a trace cache)
- N^2 bypass (can partially fix with clustering)
- N^2 dependence cross-check (stall + bypass logic)
one alternative: VLIW (very long instruction word)
- single-issue pipe, but the unit of issue is an N-instruction group (a VLIW)
- instructions in a VLIW are guaranteed (by the compiler) to be independent
  + the processor does not have to dependence-check within a VLIW
- a VLIW travels down the pipe as a unit (a "bundle")
- often slotted (i.e., 1st must be ALU, 2nd must be load, etc.)
  [ instruction 1 | instruction 2 | instruction 3 ]

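To make "slotted" concrete, here is a minimal C sketch of a fixed 3-slot bundle; the field widths and slot assignment are illustrative only, not any real machine's encoding:

    #include <stdint.h>

    /* Illustrative 3-slot VLIW bundle: each slot's type is fixed by
     * position, and the compiler guarantees the three operations are
     * mutually independent, so issue logic needs no cross-checks.   */
    typedef struct {
        uint32_t alu_op;   /* slot 1: ALU operation        */
        uint32_t mem_op;   /* slot 2: load/store operation */
        uint32_t fp_op;    /* slot 3: FP operation         */
    } vliw_bundle;         /* fetched and issued as a unit */
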
VLIW History
- started with microcode ("horizontal microcode")
- academic projects: ELI-512 [Fisher, 85], Illinois IMPACT [Hwu, 91]
- commercial machines:
  - MultiFlow [Colwell+Fisher, 85]: failed
  - Cydrome [Rau, 85]: failed
  - EPIC (IA-64, Itanium) [Colwell, Fisher+Rau, 97]: failing/failed
  - Transmeta [Ditzel, 99] (translates x86 to VLIW): failed
  - many embedded controllers (TI, Motorola) are VLIW: success

Pure VLIW
pure VLIW: no hardware dependence-checks at all, not even between VLIW groups
- the compiler is responsible for scheduling the entire pipeline, including stall cycles
- possible if you know the structure of the pipeline and the latencies exactly
problem 1: pipeline structure and latencies vary across implementations
- recompile for each new implementation (or risk missing a stall)?
- Transmeta solved this problem by recompiling on the fly
problem 2: latencies are NOT fixed even within one implementation
- don't use caches? (forget it)
- schedule assuming a cache miss? (then there is no point to having caches)
not many VLIW purists left

A VLIW Compromise
compromise: EPIC (Explicitly Parallel Instruction Computing)
- less rigid than VLIW (not really VLIW at all)
- variable-width instruction words, implemented as bundles with dependence bits
  + makes code compatible with machines of different widths
- assumes inter-bundle stall logic is provided by the hardware
  + makes code compatible with different pipeline depths and operation latencies
  + enables stalls on cache misses (actually, out-of-order execution too)
+ exploits any information on parallelism the compiler can give
+ compatible with multiple implementations of the same architecture
  - e.g., IA-64: Itanium, Itanium2

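For concreteness: an IA-64 bundle is 128 bits, holding three 41-bit operation slots plus a 5-bit template that tells the hardware which unit each slot needs and where the "stops" (dependence-group boundaries) fall. A rough C sketch of pulling those fields out, simplified rather than a bit-exact decoder:

    #include <stdint.h>

    /* IA-64 bundle: 128 bits = 5-bit template + three 41-bit slots.
     * The template encodes slot types and stop positions; the
     * hardware stalls between dependent instruction groups.       */
    typedef struct { uint64_t lo, hi; } ia64_bundle;

    unsigned bundle_template(ia64_bundle b) {
        return (unsigned)(b.lo & 0x1f);              /* bits 0..4  */
    }
    uint64_t bundle_slot0(ia64_bundle b) {
        return (b.lo >> 5) & ((1ULL << 41) - 1);     /* bits 5..45 */
    }
    uint64_t bundle_slot1(ia64_bundle b) {           /* bits 46..86 */
        return ((b.lo >> 46) | (b.hi << 18)) & ((1ULL << 41) - 1);
    }
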
ILP and Scheduling
there is no point to having an N-wide pipeline if, on average, many fewer than N independent instructions execute per cycle
performance is important, but utilization (actual performance / peak performance) matters too

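The arithmetic the next two slides use, as a tiny C program (the instruction and cycle counts are the ones derived there; the slides round 46.7% up to 47% and 23.3% up to 24%):

    #include <stdio.h>

    /* IPC = instructions / cycles; utilization = IPC / pipe width */
    int main(void) {
        double insns = 7.0, cycles = 15.0;   /* one SAXPY iteration */
        double ipc = insns / cycles;         /* ~0.47               */
        printf("1-wide: IPC=%.2f util=%.1f%%\n", ipc, 100.0 * ipc / 1.0);
        printf("2-wide: IPC=%.2f util=%.1f%%\n", ipc, 100.0 * ipc / 2.0);
        return 0;
    }
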
Code Example: SAXPY
SAXPY (single-precision A*X+Y)
- linear algebra routine (used in solving systems of equations)
- part of the famous Livermore Loops kernel (an early benchmark)

for (I = 0; I < N; I++)
    Z[I] = A*X[I] + Y[I];

loop: ldf  f0, X(r1)     // assume A in f2
      mulf f4, f0, f2    // X, Y, Z are constant addresses
      ldf  f6, Y(r1)     // assume I in r1
      addf f8, f4, f6    // assume N*4 in r2
      stf  f8, Z(r1)
      add  r1, r1, #4
      blt  r1, r2, loop

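For reference, the same kernel as a complete C function (a direct transcription of the loop above):

    /* SAXPY: Z[I] = A*X[I] + Y[I] for I = 0..N-1 */
    void saxpy(int N, float A, const float *X, const float *Y, float *Z) {
        for (int I = 0; I < N; I++)
            Z[I] = A * X[I] + Y[I];
    }
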
Default SAXPY Performance
scalar (1-wide), pipelined processor (for illustration)
- 5-cycle FP multiply, 2-cycle FP add, both fully pipelined
- full bypassing, all branches predicted taken

ldf  f0, X(r1)     F D X M W
mulf f4, f0, f2    F D d* E* E* E* E* E* W
ldf  f6, Y(r1)     F p* D X M W
addf f8, f4, f6    F D d* d* d* E+ E+ W
stf  f8, Z(r1)     F p* p* p* D X M W
add  r1, r1, #4    F D X M W
blt  r1, r2, loop  F D X M W

(d* = data-hazard stall, p* = stall propagated from a stalled older instruction)
single iteration (7 instructions) latency: 15 cycles
performance: 7 instructions / 15 cycles, IPC = 0.47
utilization: 0.47 actual IPC / 1 peak IPC = 47%

Performance and Utilization
2-wide pipeline (but still in-order)
same configuration, just two instructions at a time

ldf  f0, X(r1)     F D X M W
mulf f4, f0, f2    F D d* d* E* E* E* E* E* W
ldf  f6, Y(r1)     F D p* X M W
addf f8, f4, f6    F p* p* D d* d* d* d* E+ E+ W
stf  f8, Z(r1)     F p* D p* p* p* p* X d* M W
add  r1, r1, #4    F p* p* p* p* D p* X M W
blt  r1, r2, loop  F p* p* p* p* D p* d* X M W

performance: still 15 cycles, not any better (why?)
utilization: 0.47 actual IPC / 2 peak IPC = 24%!
notice: more hazard stalls (why?)
notice: each stall is more expensive (why?)

Scheduling and Issue
instruction scheduling: deciding on the instruction execution order
- an important tool for improving utilization and performance
- related to instruction issue (when instructions begin execution)
pure VLIW: static scheduling with static issue
in-order superscalar, EPIC: static scheduling with dynamic issue
- well, not completely dynamic... an in-order pipeline relies on the compiler to schedule well

Instruction Scheduling
idea: place independent instructions between slow operations and their uses
- otherwise the pipeline sits idle waiting for RAW hazards to resolve
- we have already seen dynamic pipeline scheduling do this
to do this, we need independent instructions
scheduling scope: the code region we are scheduling
- the bigger the better (more independent instructions to choose from)
- once the scope is defined, the schedule is pretty obvious
- the trick is making a large scope (schedule across branches???)
compiler scheduling techniques (more about these later)
- loop unrolling (for loops)
- software pipelining (also for loops)
- trace scheduling (for general control flow)

Scheduling: Compiler or Hardware?
compiler
+ large scheduling scope (full program), large lookahead
+ enables simple hardware with a fast clock
- low branch prediction accuracy (profiling? see next slide!)
- no information on run-time latencies like cache misses (profiling?)
- a pain to speculate and recover from mis-speculation (hardware support?)
hardware
+ better branch prediction accuracy
+ dynamic information on latencies (cache misses) and dependences
+ easy to speculate and recover from mis-speculation
- finite on-chip instruction buffering limits the scheduling scope
- more complicated hardware (more power? tougher to verify?)
- slower clock

Aside: Profiling
profile: (statistical) information about program tendencies
- run the program once with a test input and see how it behaves
- hope that other inputs lead to similar behavior
- the compiler can use this information for scheduling
profiling can be a useful technique
- but it must be used carefully, or it can harm performance

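A minimal C sketch of what branch profiling boils down to, with hypothetical names (real compilers insert instrumentation like this automatically during a training build):

    #include <stdio.h>

    #define NBRANCHES 128                 /* hypothetical branch count */
    static long taken[NBRANCHES], seen[NBRANCHES];

    /* wrapper the training build places around each branch condition */
    int profile_branch(int id, int outcome) {
        seen[id]++;
        if (outcome) taken[id]++;
        return outcome;
    }

    /* after the training run: dump counts for the scheduling compiler */
    void dump_profile(void) {
        for (int i = 0; i < NBRANCHES; i++)
            if (seen[i])
                printf("branch %d: taken %ld of %ld\n", i, taken[i], seen[i]);
    }

    int main(void) {
        for (int i = 0; i < 10; i++)
            (void)profile_branch(0, i % 3 != 0);   /* simulated training run */
        dump_profile();                            /* branch 0: taken 6 of 10 */
        return 0;
    }
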
Loop Unrolling SAXPY
we want to separate dependent operations from one another
- but there is not enough flexibility within a single iteration of the loop
- the longest chain of operations is 9 cycles:
  load result (1 cycle), forwarded to the multiply (5 cycles), forwarded to the add (2 cycles), forwarded to the store (1 cycle)
- can't hide 9 cycles of latency using 7 instructions
- how about 9 cycles of latency twice in 14 instructions?
loop unrolling: schedule two loop iterations together

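At the source level, the transformation looks like this (a sketch: it assumes N is even; a real compiler emits cleanup code for a leftover iteration):

    /* SAXPY unrolled by two: twice the independent work per iteration */
    void saxpy_unrolled2(int N, float A, const float *X, const float *Y,
                         float *Z) {
        for (int I = 0; I < N; I += 2) {
            Z[I]   = A * X[I]   + Y[I];    /* copy 1                   */
            Z[I+1] = A * X[I+1] + Y[I+1];  /* copy 2: independent      */
        }
    }
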
Unrolling Part 1: Fuse Iterations
combine two (in general, N) iterations of the loop
- fuse the loop control (induction increment + backward branch)
- adjust implicit uses of internal induction variables (r1 in the example)

before (two iterations):
ldf  f2? no: ldf f0, x(r1)
(corrected below)

ldf  f0, x(r1)
mulf f4, f0, f2
ldf  f6, y(r1)
addf f8, f4, f6
stf  f8, z(r1)
add  r1, r1, #4
ldf  f0, x(r1)
mulf f4, f0, f2
ldf  f6, y(r1)
addf f8, f4, f6
stf  f8, z(r1)
add  r1, r1, #4

after fusing:
ldf  f0, x(r1)
mulf f4, f0, f2
ldf  f6, y(r1)
addf f8, f4, f6
stf  f8, z(r1)
ldf  f0, x+4(r1)
mulf f4, f0, f2
ldf  f6, y+4(r1)
addf f8, f4, f6
stf  f8, z+4(r1)
add  r1, r1, #8

Unrolling Part 2: Pipeline Schedule
pipeline-schedule the fused body to reduce RAW stalls
- we have seen this already (as done dynamically by hardware)

before: the fused code from part 1

after scheduling (note: the registers are still reused):
ldf  f0, x(r1)
ldf  f0, x+4(r1)
mulf f4, f0, f2
mulf f4, f0, f2
ldf  f6, y(r1)
ldf  f6, y+4(r1)
addf f8, f4, f6
addf f8, f4, f6
stf  f8, z(r1)
stf  f8, z+4(r1)
add  r1, r1, #8

Unrolling Part 3: Rename Registers
pipeline scheduling created WAR hazards
- so we rename registers to remove them (similar to what hardware does)

before: the scheduled code from part 2 (registers f0, f4, f6, f8 reused)

after renaming (the second copy uses f10, f14, f16, f18):
ldf  f0, x(r1)
ldf  f10, x+4(r1)
mulf f4, f0, f2
mulf f14, f10, f2
ldf  f6, y(r1)
ldf  f16, y+4(r1)
addf f8, f4, f6
addf f18, f16, f14
stf  f8, z(r1)
stf  f18, z+4(r1)
add  r1, r1, #8

Unrolled SAXPY Performance

ldf  f0, x(r1)      F D X M W
ldf  f10, x+4(r1)   F D X M W
mulf f4, f0, f2     F D E* E* E* E* E* W
mulf f14, f10, f2   F D E* E* E* E* E* W
ldf  f6, y(r1)      F D X M W
ldf  f16, y+4(r1)   F D X M s* s* W
addf f8, f4, f6     F D d* E+ E+ s* W
addf f18, f16, f14  F p* D E+ p* E+ W
stf  f8, z(r1)      F D X M W
stf  f18, z+4(r1)   F D X M W
add  r1, r1, #8     F D X M W
blt  r1, r2, loop   F D X M W

(s* = structural stall on the register-file write port)
2 iterations (12 instructions) in 17 cycles (fewer stalls)
before unrolling, it took 15 cycles for 1 iteration (30 cycles for 2)!

Shortcomings of Loop Unrolling
- code growth
- poor scheduling along the seams of the unrolled copies
- doesn't handle inter-iteration dependences (recurrences)

for (I = 1; I < N; I++)
    X[I] = A*X[I-1];   // each iteration depends on the prior one

two iterations:
ldf  f2, x-4(r1)
mulf f4, f2, f0
stf  f4, x(r1)
add  r1, r1, #4
ldf  f2, x-4(r1)
mulf f4, f2, f0
stf  f4, x(r1)
add  r1, r1, #4

unrolled and renamed:
ldf  f12, x-4(r1)
mulf f14, f12, f0
stf  f14, x(r1)
ldf  f2, x(r1)      // loads the value the stf above just stored
mulf f4, f2, f0
stf  f4, x+4(r1)
add  r1, r1, #8

still 1 dependence chain: can't schedule

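The same point at the source level (a sketch: assumes an even trip count, cleanup iteration omitted). Each statement still needs the previous one's result, so unrolling exposes no parallelism:

    /* Unrolled recurrence: copy 2 reads what copy 1 just wrote, so
     * the multiplies still execute serially, one full latency apart */
    void scale_chain(int N, float A, float *X) {
        for (int I = 1; I + 1 < N; I += 2) {
            X[I]   = A * X[I-1];   /* copy 1                        */
            X[I+1] = A * X[I];     /* copy 2 depends on copy 1      */
        }
    }
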