EECS 583 Class 11 Instruction Scheduling Software Pipelining Intro
|
|
|
- Myra Richard
- 10 years ago
- Views:
Transcription
1 EECS 58 Class Instruction Scheduling Software Pipelining Intro University of Michigan October 8, 04
2 Announcements & Reading Material Reminder: HW Class project proposals» Signup sheet available next Weds in class» Think about partners/topic! Today and next Wednesday s class» Iterative Modulo Scheduling: An Algorithm for Software Pipelining Loops, B. Rau, MICRO-7, 994, pp Monday is Fall break - -
3 Sample Project Ideas Memory system» Data cache prefetching» Instruction cache prefetching» Data layout for improved cache behavior» Memory contention (ReQoS [ASPLOS ], compiling for niceness [CGO ]) Reliability/HW variability» AVF profiling, vulnerability analysis» Selective code duplication for soft error protection» Low-cost fault recovery» Runtime system for variability - -
4 Sample Project Ideas (Dynamic Compiler) Traditional Dynamic Compiler» Dynamo [PLDI 00], DynamoRio, Pin [PLDI 05]» Run-time program analysis for security» Run-time classic optimization (if they don t have it already!) Light-weight Dynamic Compiler» Protean Code [Micro 4], Scenario-based optimization [CGO 09]» Runtime auto-parallelization» Runtime instruction scheduling» Runtime data layout optimization Denver/Transmeta style Architecture» Binary translation coupled with hardware (Block chop [ISCA ]» Example: Deactivate/power gating parts of processor (FUs, registers, cache) - -
5 Sample Project Ideas (Approximate Computing)» Trading accuracy for performance» Example: loop perforations» Other compilation techniques. SAGE [MICRO ], Paraprox [ASPLOS 4]» Problems: what to approximate, estimate errors, error bound, maximize performance gain, etc - 4 -
6 More Project Ideas Compile for GPU/GPU-like architecture» Auto-optimization for different HW configurations» Reduce branch divergence, etc Applying machine learning to compilation problems» Identify hot paths.» Identify good optimization combinations» auto-parallelism [PDLI 09] Compiler Analysis» Accelerator design» Parallelism opportunities - 5 -
7 And Yet a Few More Program analysis for security» Efficient taint tracking Profiling tools» Distributed control flow/energy profiling (profiling + aggregation) Misc» Program distillation - create a subset program with equivalent memory/branch behavior Remember, don t be constrained by my suggestions, you can pick other topics! - 6 -
8 From Last Time Class Problem machine model latencies add: mpy: load: sync store: sync. Draw dependence graph. Label edges with type and latencies. r = load(r). r = r +. store (r8, r) 4. r = load(r) 5. r4 = r * r 6. r5 = r5 + r4 7. r = r store (r, r5) - 7 -
9 Dependence Graph Properties - Estart Estart = earliest start time, (as soon as possible - ASAP)» Schedule length with infinite resources (dependence height)» Estart = 0 if node has no predecessors» Estart = MAX(Estart(pred) + latency) for each predecessor node» Example
10 Lstart Lstart = latest start time, ALAP» Latest time a node can be scheduled s.t. sched length not increased beyond infinite resource schedule length» Lstart = Estart if node has no successors» Lstart = MIN(Lstart(succ) - latency) for each successor node» Example
11 Slack Slack = measure of the scheduling freedom» Slack = Lstart Estart for each node» Larger slack means more mobility» Example
12 Critical Path Critical operations = Operations with slack = 0» No mobility, cannot be delayed without extending the schedule length of the block» Critical path = sequence of critical operations from node with no predecessors to exit node, can be multiple crit paths
13 Class Problem Node Estart Lstart Slack Critical path(s) = - -
14 Operation Priority Priority Need a mechanism to decide which ops to schedule first (when you have multiple choices) Common priority functions» Height Distance from exit node Ÿ Give priority to amount of work left to do» Slackness inversely proportional to slack Ÿ Give priority to ops on the critical path» Register use priority to nodes with more source operands and fewer destination operands Ÿ Reduces number of live registers» Uncover high priority to nodes with many children Ÿ Frees up more nodes» Original order when all else fails - -
15 Height-Based Priority Height-based is the most common» priority(op) = MaxLstart Lstart(op) + 0, 8,, 4 5 0, 0 6 4, 4 6, 6 9 4, 7 7, 7 8, , 5 op priority
16 List Scheduling (aka Cycle Scheduler) Build dependence graph, calculate priority Add all ops to UNSCHEDULED set time = - while (UNSCHEDULED is not empty)» time++» READY = UNSCHEDULED ops whose incoming dependences have been satisfied» Sort READY using priority function» For each op in READY (highest to lowest priority) Ÿ op can be scheduled at current time? (are the resources free?) Yes, schedule it, op.issue_time = time Mark resources busy in RU_map relative to issue time Remove op from UNSCHEDULED/READY sets No, continue - 5 -
17 Cycle Scheduling Example 0, op priority , 0 m, m 4, 5m 4, , 7 0 8, 8 6 4, 4 6, 6 7m 0, 5 RU_map time ALU MEM Schedule time Ready Placed
18 List Scheduling (Operation Scheduler) Build dependence graph, calculate priority Add all ops to UNSCHEDULED set while (UNSCHEDULED not empty)» op = operation in UNSCHEDULED with highest priority» For time = estart to some deadline Ÿ Op can be scheduled at current time? (are resources free?) Yes, schedule it, op.issue_time = time Mark resources busy in RU_map relative to issue time Remove op from UNSCHEDULED No, continue» Deadline reached w/o scheduling op? (could not be scheduled) Yes, unplace all conflicting ops at op.estart, add them to UNSCHEDULED Schedule op at estart Mark resources busy in RU_map relative to issue time Remove op from UNSCHEDULED - 7 -
19 Classroom Problem Operation Scheduling Machine: issue, memory port, ALU Memory port = cycles, pipelined ALU = cycle RU_map Schedule,5 5 0,, m,4 6 5,5 8 6,6 0 m 0,0 4m, 7 4,4 9m 0,4 time ALU MEM time Ready Placed Calculate height-based priorities. Schedule using Operation scheduler - 8 -
20 Generalize Beyond a Basic Block Superblock» Single entry» Multiple exits (side exits)» No side entries Schedule just like a BB» Priority calculations needs change» Dealing with control deps - 9 -
21 Change Focus to Scheduling Loops Most of program execution time is spent in loops Problem: How do we achieve compact schedules for loops for (j=0; j<00; j++) b[j] = a[j] * 6 Loop: r = _a r = _b r9 = r * 4 : r = load(r) : r4 = r * 6 : store (r, r4) 4: r = r + 4 5: r = r + 4 6: p = cmpp (r < r9) 7: brct p Loop - 0 -
22 Basic Approach List Schedule the Loop Body time Iteration n Schedule each iteration resources: 4 issue, alu, mem, br latencies: add=, mpy=, ld =, st =, br = : r = load(r) : r4 = r * 6 : store (r, r4) 4: r = r + 4 5: r = r + 4 6: p = cmpp (r < r9) 7: brct p Loop time ops 0, , 5, 7 Total time = 6 * n - -
23 Unroll Then Schedule Larger Body time Iteration,,4 5,6 n-,n Schedule each iteration resources: 4 issue, alu, mem, br latencies: add=, cmpp =, mpy=, ld =, st =, br = : r = load(r) : r4 = r * 6 : store (r, r4) 4: r = r + 4 5: r = r + 4 6: p = cmpp (r < r9) 7: brct p Loop time ops 0, 4, 6, 4, 6 4-5, 5, 7 6,5,7 Total time = 7 * n/ - -
24 Problems With Unrolling Code bloat» Typical unroll is 4-6x» Use profile statistics to only unroll important loops» But still, code grows fast Barrier after across unrolled bodies» I.e., for unroll, can only overlap iterations and, and 4, Does this mean unrolling is bad?» No, in some settings its very useful Ÿ Low trip count Ÿ Lots of branches in the loop body» But, in other settings, there is room for improvement - -
25 Overlap Iterations Using Pipelining time Iteration n n With hardware pipelining, while one instruction is in fetch, another is in decode, another in execute. Same thing here, multiple iterations are processed simultaneously, with each instruction in a separate stage. iteration still takes the same time, but time to complete n iterations is reduced! - 4 -
26 A Software Pipeline time A B A C B A Prologue - fill the pipe Loop body with 4 ops A B C D D C B A D C B A D C B A D C B D C D Kernel steady state Epilogue - drain the pipe Steady state: 4 iterations executed simultaneously, operation from each iteration. Every cycle, an iteration starts and finishes when the pipe is full
27 Creating Software Pipelines Lots of software pipelining techniques out there Modulo scheduling» Most widely adopted» Practical to implement, yields good results Conceptual strategy» Unroll the loop completely» Then, schedule the code completely with constraints Ÿ All iteration bodies have identical schedules Ÿ Each iteration is scheduled to start some fixed number of cycles later than the previous iteration» Initiation Interval (II) = fixed delay between the start of successive iterations» Given the constraints, the unrolled schedule is repetitive (kernel) except the portion at the beginning (prologue) and end (epilogue) Ÿ Kernel can be re-rolled to yield a new loop - 6 -
28 Creating Software Pipelines () Create a schedule for iteration of the loop such that when the same schedule is repeated at intervals of II cycles» No intra-iteration dependence is violated» No inter-iteration dependence is violated» No resource conflict arises between operation in same or distinct iterations We will start out assuming Itanium-style hardware support, then remove it later» Rotating registers» Predicates» Software pipeline loop branch - 7 -
29 Terminology time Iter II Iter Iter Initiation Interval (II) = fixed delay between the start of successive iterations Each iteration can be divided into stages consisting of II cycles each Number of stages in iteration is termed the stage count (SC) Takes SC- cycles to fill/drain the pipe - 8 -
30 Resource Usage Legality Need to guarantee that» No resource is used at points in time that are separated by an interval which is a multiple of II» I.E., within a single iteration, the same resource is never used more than x at the same time modulo II» Known as modulo constraint, where the name modulo scheduling comes from» Modulo reservation table solves this problem Ÿ To schedule an op at time T needing resource R The entry for R at T mod II must be free Ÿ Mark busy at T mod II if schedule II = 0 alu alu mem bus0 bus br - 9 -
31 Dependences in a Loop Need worry about kinds» Intra-iteration» Inter-iteration Delay» Minimum time interval between the start of operations» Operation read/write times Distance» Number of iterations separating the operations involved» Distance of 0 means intraiteration Recurrence manifests itself as a circuit in the dependence graph <,> <,0> 4 <,> <,0> <,> Edges annotated with tuple <delay, distance> - 0 -
32 Dynamic Single Assignment (DSA) Form Impossible to overlap iterations because each iteration writes to the same register. So, we ll have to remove the anti and output dependences. Virtual rotating registers * Each register is an infinite push down array (Expanded virtual reg or EVR) * Write to top element, but can reference any element * Remap operation slides everything down à r[n] changes to r[n+] A program is in DSA form if the same virtual register (EVR element) is never assigned to more than x on any dynamic execution path : r = load(r) : r4 = r * 6 : store (r, r4) 4: r = r + 4 5: r = r + 4 6: p = cmpp (r < r9) 7: brct p Loop DSA conversion - - : r[-] = load(r[0]) : r4[-] = r[-] * 6 : store (r[0], r4[-]) 4: r[-] = r[0] + 4 5: r[-] = r[0] + 4 6: p[-] = cmpp (r[-] < r9) remap r, r, r, r4, p 7: brct p[-] Loop
33 Physical Realization of EVRs EVR may contain an unlimited number values» But, only a finite contiguous set of elements of an EVR are ever live at any point in time» These must be given physical registers Conventional register file» Remaps are essentially copies, so each EVR is realized by a set of physical registers and copies are inserted Rotating registers» Direct support for EVRs» No copies needed» File rotated after each loop iteration is completed - -
34 Loop Dependence Example, : r[-] = load(r[0]) : r4[-] = r[-] * 6 : store (r[0], r4[-]) 4: r[-] = r[0] + 4 5: r[-] = r[0] + 4 6: p[-] = cmpp (r[-] < r9) remap r, r, r, r4, p 7: brct p[-] Loop 0,0,0,0 4 5,0 0,0,,, In DSA form, there are no inter-iteration anti or output dependences! 6,0 7 <delay, distance> - -
35 Homework Problem Latencies: ld =, st =, add =, cmpp =, br = : r[-] = load(r[0]) : r[-] = r[] r[] : store (r[-], r[0]) 4: r[-] = r[0] + 4 5: p[-] = cmpp (r[-] < 00) remap r, r, r 6: brct p[-] Loop Draw the dependence graph showing both intra and inter iteration dependences - 4 -
36 To be continued
Software Pipelining - Modulo Scheduling
EECS 583 Class 12 Software Pipelining - Modulo Scheduling University of Michigan October 15, 2014 Announcements + Reading Material HW 2 Due this Thursday Today s class reading» Iterative Modulo Scheduling:
AN IMPLEMENTATION OF SWING MODULO SCHEDULING WITH EXTENSIONS FOR SUPERBLOCKS TANYA M. LATTNER
AN IMPLEMENTATION OF SWING MODULO SCHEDULING WITH EXTENSIONS FOR SUPERBLOCKS BY TANYA M. LATTNER B.S., University of Portland, 2000 THESIS Submitted in partial fulfillment of the requirements for the degree
Module: Software Instruction Scheduling Part I
Module: Software Instruction Scheduling Part I Sudhakar Yalamanchili, Georgia Institute of Technology Reading for this Module Loop Unrolling and Instruction Scheduling Section 2.2 Dependence Analysis Section
Administration. Instruction scheduling. Modern processors. Examples. Simplified architecture model. CS 412 Introduction to Compilers
CS 4 Introduction to Compilers ndrew Myers Cornell University dministration Prelim tomorrow evening No class Wednesday P due in days Optional reading: Muchnick 7 Lecture : Instruction scheduling pr 0 Modern
VLIW Processors. VLIW Processors
1 VLIW Processors VLIW ( very long instruction word ) processors instructions are scheduled by the compiler a fixed number of operations are formatted as one big instruction (called a bundle) usually LIW
PROBLEMS #20,R0,R1 #$3A,R2,R4
506 CHAPTER 8 PIPELINING (Corrisponde al cap. 11 - Introduzione al pipelining) PROBLEMS 8.1 Consider the following sequence of instructions Mul And #20,R0,R1 #3,R2,R3 #$3A,R2,R4 R0,R2,R5 In all instructions,
More on Pipelining and Pipelines in Real Machines CS 333 Fall 2006 Main Ideas Data Hazards RAW WAR WAW More pipeline stall reduction techniques Branch prediction» static» dynamic bimodal branch prediction
Architectures and Platforms
Hardware/Software Codesign Arch&Platf. - 1 Architectures and Platforms 1. Architecture Selection: The Basic Trade-Offs 2. General Purpose vs. Application-Specific Processors 3. Processor Specialisation
INSTRUCTION LEVEL PARALLELISM PART VII: REORDER BUFFER
Course on: Advanced Computer Architectures INSTRUCTION LEVEL PARALLELISM PART VII: REORDER BUFFER Prof. Cristina Silvano Politecnico di Milano [email protected] Prof. Silvano, Politecnico di Milano
Solution: start more than one instruction in the same clock cycle CPI < 1 (or IPC > 1, Instructions per Cycle) Two approaches:
Multiple-Issue Processors Pipelining can achieve CPI close to 1 Mechanisms for handling hazards Static or dynamic scheduling Static or dynamic branch handling Increase in transistor counts (Moore s Law):
Software Pipelining by Modulo Scheduling. Philip Sweany University of North Texas
Software Pipelining by Modulo Scheduling Philip Sweany University of North Texas Overview Opportunities for Loop Optimization Software Pipelining Modulo Scheduling Resource and Dependence Constraints Scheduling
Goals of the Unit. spm - 2014 adolfo villafiorita - introduction to software project management
Project Scheduling Goals of the Unit Making the WBS into a schedule Understanding dependencies between activities Learning the Critical Path technique Learning how to level resources!2 Initiate Plan Execute
PROBLEMS. which was discussed in Section 1.6.3.
22 CHAPTER 1 BASIC STRUCTURE OF COMPUTERS (Corrisponde al cap. 1 - Introduzione al calcolatore) PROBLEMS 1.1 List the steps needed to execute the machine instruction LOCA,R0 in terms of transfers between
CPU Organization and Assembly Language
COS 140 Foundations of Computer Science School of Computing and Information Science University of Maine October 2, 2015 Outline 1 2 3 4 5 6 7 8 Homework and announcements Reading: Chapter 12 Homework:
GameTime: A Toolkit for Timing Analysis of Software
GameTime: A Toolkit for Timing Analysis of Software Sanjit A. Seshia and Jonathan Kotker EECS Department, UC Berkeley {sseshia,jamhoot}@eecs.berkeley.edu Abstract. Timing analysis is a key step in the
Q. Consider a dynamic instruction execution (an execution trace, in other words) that consists of repeats of code in this pattern:
Pipelining HW Q. Can a MIPS SW instruction executing in a simple 5-stage pipelined implementation have a data dependency hazard of any type resulting in a nop bubble? If so, show an example; if not, prove
Static Scheduling. option #1: dynamic scheduling (by the hardware) option #2: static scheduling (by the compiler) ECE 252 / CPS 220 Lecture Notes
basic pipeline: single, in-order issue first extension: multiple issue (superscalar) second extension: scheduling instructions for more ILP option #1: dynamic scheduling (by the hardware) option #2: static
Instruction Set Design
Instruction Set Design Instruction Set Architecture: to what purpose? ISA provides the level of abstraction between the software and the hardware One of the most important abstraction in CS It s narrow,
UNIVERSITY OF CALIFORNIA, DAVIS Department of Electrical and Computer Engineering. EEC180B Lab 7: MISP Processor Design Spring 1995
UNIVERSITY OF CALIFORNIA, DAVIS Department of Electrical and Computer Engineering EEC180B Lab 7: MISP Processor Design Spring 1995 Objective: In this lab, you will complete the design of the MISP processor,
Pipeline Hazards. Structure hazard Data hazard. ComputerArchitecture_PipelineHazard1
Pipeline Hazards Structure hazard Data hazard Pipeline hazard: the major hurdle A hazard is a condition that prevents an instruction in the pipe from executing its next scheduled pipe stage Taxonomy of
(Refer Slide Time: 00:01:16 min)
Digital Computer Organization Prof. P. K. Biswas Department of Electronic & Electrical Communication Engineering Indian Institute of Technology, Kharagpur Lecture No. # 04 CPU Design: Tirning & Control
Solutions. Solution 4.1. 4.1.1 The values of the signals are as follows:
4 Solutions Solution 4.1 4.1.1 The values of the signals are as follows: RegWrite MemRead ALUMux MemWrite ALUOp RegMux Branch a. 1 0 0 (Reg) 0 Add 1 (ALU) 0 b. 1 1 1 (Imm) 0 Add 1 (Mem) 0 ALUMux is the
(Refer Slide Time: 02:39)
Computer Architecture Prof. Anshul Kumar Department of Computer Science and Engineering, Indian Institute of Technology, Delhi Lecture - 1 Introduction Welcome to this course on computer architecture.
Pipelining Review and Its Limitations
Pipelining Review and Its Limitations Yuri Baida [email protected] [email protected] October 16, 2010 Moscow Institute of Physics and Technology Agenda Review Instruction set architecture Basic
A Comparison Of Shared Memory Parallel Programming Models. Jace A Mogill David Haglin
A Comparison Of Shared Memory Parallel Programming Models Jace A Mogill David Haglin 1 Parallel Programming Gap Not many innovations... Memory semantics unchanged for over 50 years 2010 Multi-Core x86
picojava TM : A Hardware Implementation of the Java Virtual Machine
picojava TM : A Hardware Implementation of the Java Virtual Machine Marc Tremblay and Michael O Connor Sun Microelectronics Slide 1 The Java picojava Synergy Java s origins lie in improving the consumer
Scalability and Classifications
Scalability and Classifications 1 Types of Parallel Computers MIMD and SIMD classifications shared and distributed memory multicomputers distributed shared memory computers 2 Network Topologies static
System Interconnect Architectures. Goals and Analysis. Network Properties and Routing. Terminology - 2. Terminology - 1
System Interconnect Architectures CSCI 8150 Advanced Computer Architecture Hwang, Chapter 2 Program and Network Properties 2.4 System Interconnect Architectures Direct networks for static connections Indirect
IA-64 Application Developer s Architecture Guide
IA-64 Application Developer s Architecture Guide The IA-64 architecture was designed to overcome the performance limitations of today s architectures and provide maximum headroom for the future. To achieve
Annotation to the assignments and the solution sheet. Note the following points
Computer rchitecture 2 / dvanced Computer rchitecture Seite: 1 nnotation to the assignments and the solution sheet This is a multiple choice examination, that means: Solution approaches are not assessed
Operating Systems. Virtual Memory
Operating Systems Virtual Memory Virtual Memory Topics. Memory Hierarchy. Why Virtual Memory. Virtual Memory Issues. Virtual Memory Solutions. Locality of Reference. Virtual Memory with Segmentation. Page
FLIX: Fast Relief for Performance-Hungry Embedded Applications
FLIX: Fast Relief for Performance-Hungry Embedded Applications Tensilica Inc. February 25 25 Tensilica, Inc. 25 Tensilica, Inc. ii Contents FLIX: Fast Relief for Performance-Hungry Embedded Applications...
CS521 CSE IITG 11/23/2012
CS521 CSE TG 11/23/2012 A Sahu 1 Degree of overlap Serial, Overlapped, d, Super pipelined/superscalar Depth Shallow, Deep Structure Linear, Non linear Scheduling of operations Static, Dynamic A Sahu slide
Graph Database Proof of Concept Report
Objectivity, Inc. Graph Database Proof of Concept Report Managing The Internet of Things Table of Contents Executive Summary 3 Background 3 Proof of Concept 4 Dataset 4 Process 4 Query Catalog 4 Environment
Computer Architecture Lecture 2: Instruction Set Principles (Appendix A) Chih Wei Liu 劉 志 尉 National Chiao Tung University [email protected].
Computer Architecture Lecture 2: Instruction Set Principles (Appendix A) Chih Wei Liu 劉 志 尉 National Chiao Tung University [email protected] Review Computers in mid 50 s Hardware was expensive
18-742 Lecture 4. Parallel Programming II. Homework & Reading. Page 1. Projects handout On Friday Form teams, groups of two
age 1 18-742 Lecture 4 arallel rogramming II Spring 2005 rof. Babak Falsafi http://www.ece.cmu.edu/~ece742 write X Memory send X Memory read X Memory Slides developed in part by rofs. Adve, Falsafi, Hill,
Project Time Management
Project Time Management Plan Schedule Management is the process of establishing the policies, procedures, and documentation for planning, developing, managing, executing, and controlling the project schedule.
Synchronization. Todd C. Mowry CS 740 November 24, 1998. Topics. Locks Barriers
Synchronization Todd C. Mowry CS 740 November 24, 1998 Topics Locks Barriers Types of Synchronization Mutual Exclusion Locks Event Synchronization Global or group-based (barriers) Point-to-point tightly
Chapter 4 Multi-Stage Interconnection Networks The general concept of the multi-stage interconnection network, together with its routing properties, have been used in the preceding chapter to describe
Instruction scheduling
Instruction ordering Instruction scheduling Advanced Compiler Construction Michel Schinz 2015 05 21 When a compiler emits the instructions corresponding to a program, it imposes a total order on them.
Operating System Overview. Otto J. Anshus
Operating System Overview Otto J. Anshus A Typical Computer CPU... CPU Memory Chipset I/O bus ROM Keyboard Network A Typical Computer System CPU. CPU Memory Application(s) Operating System ROM OS Apps
Interconnection Networks. Interconnection Networks. Interconnection networks are used everywhere!
Interconnection Networks Interconnection Networks Interconnection networks are used everywhere! Supercomputers connecting the processors Routers connecting the ports can consider a router as a parallel
Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel
Parallel Databases Increase performance by performing operations in parallel Parallel Architectures Shared memory Shared disk Shared nothing closely coupled loosely coupled Parallelism Terminology Speedup:
Introduction to Data-flow analysis
Introduction to Data-flow analysis Last Time LULESH intro Typed, 3-address code Basic blocks and control flow graphs LLVM Pass architecture Data dependencies, DU chains, and SSA Today CFG and SSA example
Computer Architecture TDTS10
why parallelism? Performance gain from increasing clock frequency is no longer an option. Outline Computer Architecture TDTS10 Superscalar Processors Very Long Instruction Word Processors Parallel computers
COSC 6374 Parallel Computation. Parallel I/O (I) I/O basics. Concept of a clusters
COSC 6374 Parallel Computation Parallel I/O (I) I/O basics Spring 2008 Concept of a clusters Processor 1 local disks Compute node message passing network administrative network Memory Processor 2 Network
Register File, Finite State Machines & Hardware Control Language
Register File, Finite State Machines & Hardware Control Language Avin R. Lebeck Some slides based on those developed by Gershon Kedem, and by Randy Bryant and ave O Hallaron Compsci 04 Administrivia Homework
AN IMPLEMENTATION OF SWING MODULO SCHEDULING WITH EXTENSIONS FOR SUPERBLOCKS TANYA M. LATTNER. B.S., University of Portland, 2000 THESIS
AN IMPLEMENTATION OF SWING MODULO SCHEDULING WITH EXTENSIONS FOR SUPERBLOCKS BY TANYA M. LATTNER B.S., University of Portland, 2000 THESIS Submitted in partial fulfillment of the requirements for the degree
Instruction Set Architecture. or How to talk to computers if you aren t in Star Trek
Instruction Set Architecture or How to talk to computers if you aren t in Star Trek The Instruction Set Architecture Application Compiler Instr. Set Proc. Operating System I/O system Instruction Set Architecture
Aperiodic Task Scheduling
Aperiodic Task Scheduling Jian-Jia Chen (slides are based on Peter Marwedel) TU Dortmund, Informatik 12 Germany Springer, 2010 2014 年 11 月 19 日 These slides use Microsoft clip arts. Microsoft copyright
VDI Solutions - Advantages of Virtual Desktop Infrastructure
VDI s Fatal Flaw V3 Solves the Latency Bottleneck A V3 Systems White Paper Table of Contents Executive Summary... 2 Section 1: Traditional VDI vs. V3 Systems VDI... 3 1a) Components of a Traditional VDI
50 Computer Science MI-SG-FLD050-02
50 Computer Science MI-SG-FLD050-02 TABLE OF CONTENTS PART 1: General Information About the MTTC Program and Test Preparation OVERVIEW OF THE TESTING PROGRAM... 1-1 Contact Information Test Development
CCPM: TOC Based Project Management Technique
CCPM: TOC Based Project Management Technique Prof. P.M. Chawan, Ganesh P. Gaikwad, Prashant S. Gosavi M. Tech, Computer Engineering, VJTI, Mumbai. Abstract In this paper, we are presenting the drawbacks
Software Pipelining. for (i=1, i<100, i++) { x := A[i]; x := x+1; A[i] := x
Software Pipelining for (i=1, i
UNIT 2 CLASSIFICATION OF PARALLEL COMPUTERS
UNIT 2 CLASSIFICATION OF PARALLEL COMPUTERS Structure Page Nos. 2.0 Introduction 27 2.1 Objectives 27 2.2 Types of Classification 28 2.3 Flynn s Classification 28 2.3.1 Instruction Cycle 2.3.2 Instruction
Scheduling Shop Scheduling. Tim Nieberg
Scheduling Shop Scheduling Tim Nieberg Shop models: General Introduction Remark: Consider non preemptive problems with regular objectives Notation Shop Problems: m machines, n jobs 1,..., n operations
WAR: Write After Read
WAR: Write After Read write-after-read (WAR) = artificial (name) dependence add R1, R2, R3 sub R2, R4, R1 or R1, R6, R3 problem: add could use wrong value for R2 can t happen in vanilla pipeline (reads
CSE 141L Computer Architecture Lab Fall 2003. Lecture 2
CSE 141L Computer Architecture Lab Fall 2003 Lecture 2 Pramod V. Argade CSE141L: Computer Architecture Lab Instructor: TA: Readers: Pramod V. Argade ([email protected]) Office Hour: Tue./Thu. 9:30-10:30
Architecture of Hitachi SR-8000
Architecture of Hitachi SR-8000 University of Stuttgart High-Performance Computing-Center Stuttgart (HLRS) www.hlrs.de Slide 1 Most of the slides from Hitachi Slide 2 the problem modern computer are data
Introduction to Cloud Computing
Introduction to Cloud Computing Parallel Processing I 15 319, spring 2010 7 th Lecture, Feb 2 nd Majd F. Sakr Lecture Motivation Concurrency and why? Different flavors of parallel computing Get the basic
Chapter-6. Developing a Project Plan
MGMT 4135 Project Management Chapter-6 Where we are now Developing the Project Network Why? Graphical representation of the tasks to be completed Logical sequence of activities Determines dependencies
CUDA Optimization with NVIDIA Tools. Julien Demouth, NVIDIA
CUDA Optimization with NVIDIA Tools Julien Demouth, NVIDIA What Will You Learn? An iterative method to optimize your GPU code A way to conduct that method with Nvidia Tools 2 What Does the Application
Parallel and Distributed Computing Programming Assignment 1
Parallel and Distributed Computing Programming Assignment 1 Due Monday, February 7 For programming assignment 1, you should write two C programs. One should provide an estimate of the performance of ping-pong
CPU Organisation and Operation
CPU Organisation and Operation The Fetch-Execute Cycle The operation of the CPU 1 is usually described in terms of the Fetch-Execute cycle. 2 Fetch-Execute Cycle Fetch the Instruction Increment the Program
1. Comments on reviews a. Need to avoid just summarizing web page asks you for:
1. Comments on reviews a. Need to avoid just summarizing web page asks you for: i. A one or two sentence summary of the paper ii. A description of the problem they were trying to solve iii. A summary of
Parallel Programming at the Exascale Era: A Case Study on Parallelizing Matrix Assembly For Unstructured Meshes
Parallel Programming at the Exascale Era: A Case Study on Parallelizing Matrix Assembly For Unstructured Meshes Eric Petit, Loïc Thebault, Quang V. Dinh May 2014 EXA2CT Consortium 2 WPs Organization Proto-Applications
Processor Architectures
ECPE 170 Jeff Shafer University of the Pacific Processor Architectures 2 Schedule Exam 3 Tuesday, December 6 th Caches Virtual Memory Input / Output OperaKng Systems Compilers & Assemblers Processor Architecture
CS352H: Computer Systems Architecture
CS352H: Computer Systems Architecture Topic 9: MIPS Pipeline - Hazards October 1, 2009 University of Texas at Austin CS352H - Computer Systems Architecture Fall 2009 Don Fussell Data Hazards in ALU Instructions
COMPUTER HARDWARE. Input- Output and Communication Memory Systems
COMPUTER HARDWARE Input- Output and Communication Memory Systems Computer I/O I/O devices commonly found in Computer systems Keyboards Displays Printers Magnetic Drives Compact disk read only memory (CD-ROM)
Addressing The problem. When & Where do we encounter Data? The concept of addressing data' in computations. The implications for our machine design(s)
Addressing The problem Objectives:- When & Where do we encounter Data? The concept of addressing data' in computations The implications for our machine design(s) Introducing the stack-machine concept Slide
Real Time Scheduling Basic Concepts. Radek Pelánek
Real Time Scheduling Basic Concepts Radek Pelánek Basic Elements Model of RT System abstraction focus only on timing constraints idealization (e.g., zero switching time) Basic Elements Basic Notions task
Lecture Outline Overview of real-time scheduling algorithms Outline relative strengths, weaknesses
Overview of Real-Time Scheduling Embedded Real-Time Software Lecture 3 Lecture Outline Overview of real-time scheduling algorithms Clock-driven Weighted round-robin Priority-driven Dynamic vs. static Deadline
The Union-Find Problem Kruskal s algorithm for finding an MST presented us with a problem in data-structure design. As we looked at each edge,
The Union-Find Problem Kruskal s algorithm for finding an MST presented us with a problem in data-structure design. As we looked at each edge, cheapest first, we had to determine whether its two endpoints
what operations can it perform? how does it perform them? on what kind of data? where are instructions and data stored?
Inside the CPU how does the CPU work? what operations can it perform? how does it perform them? on what kind of data? where are instructions and data stored? some short, boring programs to illustrate the
Computer organization
Computer organization Computer design an application of digital logic design procedures Computer = processing unit + memory system Processing unit = control + datapath Control = finite state machine inputs
Optimizing Parallel Reduction in CUDA. Mark Harris NVIDIA Developer Technology
Optimizing Parallel Reduction in CUDA Mark Harris NVIDIA Developer Technology Parallel Reduction Common and important data parallel primitive Easy to implement in CUDA Harder to get it right Serves as
BEAGLEBONE BLACK ARCHITECTURE MADELEINE DAIGNEAU MICHELLE ADVENA
BEAGLEBONE BLACK ARCHITECTURE MADELEINE DAIGNEAU MICHELLE ADVENA AGENDA INTRO TO BEAGLEBONE BLACK HARDWARE & SPECS CORTEX-A8 ARMV7 PROCESSOR PROS & CONS VS RASPBERRY PI WHEN TO USE BEAGLEBONE BLACK Single
EE282 Computer Architecture and Organization Midterm Exam February 13, 2001. (Total Time = 120 minutes, Total Points = 100)
EE282 Computer Architecture and Organization Midterm Exam February 13, 2001 (Total Time = 120 minutes, Total Points = 100) Name: (please print) Wolfe - Solution In recognition of and in the spirit of the
Operating Systems Concepts: Chapter 7: Scheduling Strategies
Operating Systems Concepts: Chapter 7: Scheduling Strategies Olav Beckmann Huxley 449 http://www.doc.ic.ac.uk/~ob3 Acknowledgements: There are lots. See end of Chapter 1. Home Page for the course: http://www.doc.ic.ac.uk/~ob3/teaching/operatingsystemsconcepts/
Collaborative Scheduling using the CPM Method
MnDOT Project Management Office Presents: Collaborative Scheduling using the CPM Method Presenter: Jonathan McNatty, PSP Senior Schedule Consultant DRMcNatty & Associates, Inc. Housekeeping Items Lines
2) Write in detail the issues in the design of code generator.
COMPUTER SCIENCE AND ENGINEERING VI SEM CSE Principles of Compiler Design Unit-IV Question and answers UNIT IV CODE GENERATION 9 Issues in the design of code generator The target machine Runtime Storage
Parallel Computing. Benson Muite. [email protected] http://math.ut.ee/ benson. https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage
Parallel Computing Benson Muite [email protected] http://math.ut.ee/ benson https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage 3 November 2014 Hadoop, Review Hadoop Hadoop History Hadoop Framework
Theorem (informal statement): There are no extendible methods in David Chalmers s sense unless P = NP.
Theorem (informal statement): There are no extendible methods in David Chalmers s sense unless P = NP. Explication: In his paper, The Singularity: A philosophical analysis, David Chalmers defines an extendible
Dynamic programming formulation
1.24 Lecture 14 Dynamic programming: Job scheduling Dynamic programming formulation To formulate a problem as a dynamic program: Sort by a criterion that will allow infeasible combinations to be eli minated
Reminder: Complexity (1) Parallel Complexity Theory. Reminder: Complexity (2) Complexity-new
Reminder: Complexity (1) Parallel Complexity Theory Lecture 6 Number of steps or memory units required to compute some result In terms of input size Using a single processor O(1) says that regardless of
GPU Computing with CUDA Lecture 4 - Optimizations. Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile
GPU Computing with CUDA Lecture 4 - Optimizations Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile 1 Outline of lecture Recap of Lecture 3 Control flow Coalescing Latency hiding
Hardware Acceleration for Just-In-Time Compilation on Heterogeneous Embedded Systems
Hardware Acceleration for Just-In-Time Compilation on Heterogeneous Embedded Systems A. Carbon, Y. Lhuillier, H.-P. Charles CEA LIST DACLE division Embedded Computing Embedded Software Laboratories France
CLASS: Combined Logic Architecture Soft Error Sensitivity Analysis
CLASS: Combined Logic Architecture Soft Error Sensitivity Analysis Mojtaba Ebrahimi, Liang Chen, Hossein Asadi, and Mehdi B. Tahoori INSTITUTE OF COMPUTER ENGINEERING (ITEC) CHAIR FOR DEPENDABLE NANO COMPUTING
