
Software Pipelining

An alternative method of reorganizing loops to increase instruction-level parallelism.

for (i=1; i<100; i++) {
    x := A[i];
    x := x+1;
    A[i] := x;
}

Writing out the operations of successive iterations, staggered so that each column belongs to one iteration:

load A[1]
incr A[1]    load A[2]
store A[1]   incr A[2]    load A[3]
store A[2]   incr A[3]    load A[4]
store A[3]   incr A[4]    ...
store A[4]   ...
...
store A[100]

Now group these instructions horizontally, as they were shown in the rectangular boxes on the slide. After the first three instructions (the prologue), the loop can be restructured as:

for (i=1; i<98; i++) {
    store A[i];     /* these three operations can be  */
    incr A[i+1];    /* executed in parallel: no data  */
    load A[i+2];    /* dependency exists among them   */
}

This is followed by the last three instructions (the epilogue).

More examples of compiler support for ILP: problems with loop-carried dependence.

Example 1. The following type of computation is found in many recurrence relations:

for (i=2; i <= 100; i=i+1) {
    y(i) = y(i-1) + y(i);
}

Now unroll it and notice the loop-carried dependences:

y(2) = y(1) + y(2)
y(3) = y(2) + y(3)
y(4) = y(3) + y(4)

Each statement reads the result of the previous one, so ILP suffers and loop unrolling does not help.

Example 2. Consider the following program:

for (i=6; i <= 100; i=i+1) {
    y(i) = y(i-5) + y(i);
}

Unrolled:

y(6) = y(1) + y(6)
y(7) = y(2) + y(7)
y(8) = y(3) + y(8)

The dependence distance is 5, so consecutive iterations are independent of one another: ILP improves, and loop unrolling helps.

Pentium Micro-architecture (Pentium Pro, Pentium 3, Pentium 4)

- Instruction length is 1-17 bytes, which is awkward for pipelining.
- All of them decode IA-32 instructions into micro-operations (MIPS-like instructions), since this makes pipelining easier.
- Complex instructions requiring many cycles are executed by standard micro-programmed control.
- Up to four micro-operations are scheduled per clock.
- Speculative pipeline.
- Multiple functional units (7 in the Pentium 4 vs. 5 in the Pentium Pro).
- Pipeline depth = 20 in the Pentium 4 (vs. 10 in the Pentium Pro).
- The Pentium 4 uses a trace cache (it has its own 512-entry BTB).
- Better branch prediction (a 4K-entry buffer in the Pentium 4 vs. 512 entries in the Pentium Pro).

Trace cache

A trace cache is a mechanism for increasing the instruction fetch bandwidth by storing traces of instructions that have already been fetched. The mechanism was first proposed by Eric Rotenberg, Steve Bennett, and Jim Smith in their 1996 paper "Trace Cache: a Low Latency Approach to High Bandwidth Instruction Fetching."

Instructions are added to trace caches in groups representing either individual basic blocks or dynamic instruction traces. A basic block consists of a group of non-branch instructions ending with a branch. A dynamic trace ("trace path") contains only instructions whose results are actually used, and eliminates instructions following taken branches (since they are not executed); a dynamic trace can be a concatenation of multiple basic blocks.

[Block diagram of the Pentium 4 micro-architecture: a 4096-entry branch prediction buffer drives instruction prefetch and decode; decoded micro-operations flow through the trace cache into a micro-operation queue (up to 126 mops in flight); dispatch and register renaming feed the register file and separate integer/FP and memory operation queues, with complex instructions handled on a separate path; the functional units comprise two double-speed (2X) integer units, an FP unit that also handles MMX & SSE instructions, and load/store units connected to the data cache; a commit unit retires completed operations.]

Pentium 4 pipeline schedule (20 cycles in total):

  Trace cache & BTB access              5 cycles   -> micro-operation queue
  Re-order buffer & register renaming   4 cycles   -> functional unit queues
  Scheduling                            5 cycles
  Register file access                  2 cycles
  Execution                             1 cycle
  Commit (re-order buffer)              3 cycles

Problems with IA-32

1. Two-address instruction set architecture.
2. Irregular register set and a small number of registers.
3. Variable-length instruction sizes and variable-length opcodes make pipelining challenging.
4. The micro-architecture breaks the instructions into MIPS-like micro-operations, but this complicates the design and wastes silicon.
5. The large number of pipeline stages (up to 20 in the Pentium 4) increases the branch penalty unless branch prediction is accurate.

The IA-64 model

- Developed with HP, using the model of PA-RISC.
- 128 general registers, each 64 bits long.
- 64 predicate registers.
- Instructions are packed into 128-bit bundles: three 41-bit instructions plus a 5-bit template field (41 + 41 + 41 + 5 = 128).
- Ordinarily the three instructions in a bundle can be scheduled in parallel.
- Bundles can be chained, i.e. some instructions from different bundles can be scheduled in parallel.

Instruction format (41 bits):

  Opcode (14) | predicate (6) | R1 (7) | R2 (7) | R3 (7)

- All instructions are predicated.
- Templates link bundles consisting of instructions that can be scheduled in parallel.
- EPIC model, similar to VLIW.
- Supports speculative execution.
- One or more bundles can be scheduled every cycle.

Register rotation: why 128 registers?

- Called procedures use a different set of physical registers; the stacked architectural registers (r32-r127) are essentially renamed.
- The overlapped sections of the register file simplify parameter passing.
- Multiple iterations of a loop use multiple rotating registers, which emulates loop unrolling.

Compare operations

Compare operations play a key role in predication. For example:

if (a == b) {
    c++;
} else {
    d++;
}

compiles to the branch-free predicated sequence:

      cmp.eq p1, p2 = ra, rb
(p1)  add rc = rc, 1
(p2)  add rd = rd, 1