VLIW Processors. VLIW Processors



Similar documents
Software Pipelining. for (i=1, i<100, i++) { x := A[i]; x := x+1; A[i] := x


Static Scheduling. option #1: dynamic scheduling (by the hardware) option #2: static scheduling (by the compiler) ECE 252 / CPS 220 Lecture Notes

Solution: start more than one instruction in the same clock cycle CPI < 1 (or IPC > 1, Instructions per Cycle) Two approaches:

IA-64 Application Developer s Architecture Guide

Module: Software Instruction Scheduling Part I

Instruction Set Design

Overview. CISC Developments. RISC Designs. CISC Designs. VAX: Addressing Modes. Digital VAX

CPU Performance Equation

WAR: Write After Read

INSTRUCTION LEVEL PARALLELISM PART VII: REORDER BUFFER

PROBLEMS #20,R0,R1 #$3A,R2,R4

EE482: Advanced Computer Organization Lecture #11 Processor Architecture Stanford University Wednesday, 31 May ILP Execution

Computer Architecture TDTS10

Administration. Instruction scheduling. Modern processors. Examples. Simplified architecture model. CS 412 Introduction to Compilers

Design Cycle for Microprocessors

picojava TM : A Hardware Implementation of the Java Virtual Machine

Very Large Instruction Word Architectures (VLIW Processors and Trace Scheduling)

Instruction Set Architecture. or How to talk to computers if you aren t in Star Trek

High Performance Processor Architecture. André Seznec IRISA/INRIA ALF project-team

Q. Consider a dynamic instruction execution (an execution trace, in other words) that consists of repeats of code in this pattern:

Wiggins/Redstone: An On-line Program Specializer

Data Dependences. A data dependence occurs whenever one instruction needs a value produced by another.

Instruction Set Architecture (ISA)

CS352H: Computer Systems Architecture

Intel IA-64 Architecture Software Developer s Manual

Operating Systems. Virtual Memory

CS:APP Chapter 4 Computer Architecture. Wrap-Up. William J. Taffe Plymouth State University. using the slides of

Instruction Scheduling. Software Pipelining - 2

The AVR Microcontroller and C Compiler Co-Design Dr. Gaute Myklebust ATMEL Corporation ATMEL Development Center, Trondheim, Norway

BEAGLEBONE BLACK ARCHITECTURE MADELEINE DAIGNEAU MICHELLE ADVENA

Thread level parallelism

UNIVERSITY OF CALIFORNIA, DAVIS Department of Electrical and Computer Engineering. EEC180B Lab 7: MISP Processor Design Spring 1995

Week 1 out-of-class notes, discussions and sample problems

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

İSTANBUL AYDIN UNIVERSITY

Computer Architecture Lecture 2: Instruction Set Principles (Appendix A) Chih Wei Liu 劉 志 尉 National Chiao Tung University

Computer organization

Giving credit where credit is due

POWER8 Performance Analysis

Computer Architecture Lecture 3: ISA Tradeoffs. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 1/18/2013

AN IMPLEMENTATION OF SWING MODULO SCHEDULING WITH EXTENSIONS FOR SUPERBLOCKS TANYA M. LATTNER

CPU Organization and Assembly Language

Computer Organization and Components

A Lab Course on Computer Architecture

Intel 8086 architecture

CHAPTER 7: The CPU and Memory

Pipelining Review and Its Limitations

The Microarchitecture of Superscalar Processors

Instruction Set Architecture

FLIX: Fast Relief for Performance-Hungry Embedded Applications

Instruction Set Architecture (ISA) Design. Classification Categories

Technical Report. Complexity-effective superscalar embedded processors using instruction-level distributed processing. Ian Caulfield.

This Unit: Multithreading (MT) CIS 501 Computer Architecture. Performance And Utilization. Readings

OC By Arsene Fansi T. POLIMI

ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM

Software Pipelining. Y.N. Srikant. NPTEL Course on Compiler Design. Department of Computer Science Indian Institute of Science Bangalore

EE282 Computer Architecture and Organization Midterm Exam February 13, (Total Time = 120 minutes, Total Points = 100)

Multi-core architectures. Jernej Barbic , Spring 2007 May 3, 2007

Chapter 4 Lecture 5 The Microarchitecture Level Integer JAVA Virtual Machine

Introduction to Cloud Computing

Central Processing Unit (CPU)

Computer Organization and Architecture

Execution Cycle. Pipelining. IF and ID Stages. Simple MIPS Instruction Formats

SOC architecture and design

Software Pipelining - Modulo Scheduling

EECS 583 Class 11 Instruction Scheduling Software Pipelining Intro

Advanced Computer Architecture-CS501. Computer Systems Design and Architecture 2.1, 2.2, 3.2

Pipeline Hazards. Structure hazard Data hazard. ComputerArchitecture_PipelineHazard1

EE361: Digital Computer Organization Course Syllabus

ARM Microprocessor and ARM-Based Microcontrollers

Fli;' HEWLETT. Software Pipelining and Superblock Scheduling: Compilation Techniques for VLIW Machines

what operations can it perform? how does it perform them? on what kind of data? where are instructions and data stored?

An Introduction to the ARM 7 Architecture

Intel Application Software Development Tool Suite 2.2 for Intel Atom processor. In-Depth

Exploring the Design of the Cortex-A15 Processor ARM s next generation mobile applications processor. Travis Lanier Senior Product Manager

An Overview of Stack Architecture and the PSC 1000 Microprocessor

Computer Architectures

Architecture of Hitachi SR-8000

Property of ISA vs. Uarch?

CISC, RISC, and DSP Microprocessors

Intel Itanium Quad-Core Architecture for the Enterprise. Lambert Schaelicke Eric DeLano

Processor Architectures

Addressing The problem. When & Where do we encounter Data? The concept of addressing data' in computations. The implications for our machine design(s)

CPU Organisation and Operation

PART IV Performance oriented design, Performance testing, Performance tuning & Performance solutions. Outline. Performance oriented design

Quiz for Chapter 1 Computer Abstractions and Technology 3.10

Levels of Programming Languages. Gerald Penn CSC 324

612 CHAPTER 11 PROCESSOR FAMILIES (Corrisponde al cap Famiglie di processori) PROBLEMS

CIS570 Modern Programming Language Implementation. Office hours: TDB 605 Levine

Unit 4: Performance & Benchmarking. Performance Metrics. This Unit. CIS 501: Computer Architecture. Performance: Latency vs.

An Event-Driven Multithreaded Dynamic Optimization Framework

This Unit: Putting It All Together. CIS 501 Computer Architecture. Sources. What is Computer Architecture?

Architectures and Platforms

Software Pipelining by Modulo Scheduling. Philip Sweany University of North Texas

Checkpoint Processing and Recovery: Towards Scalable Large Instruction Window Processors

CSC 2405: Computer Systems II

Stack machines The MIPS assembly language A simple source language Stack-machine implementation of the simple language Readings:

Lecture: Pipelining Extensions. Topics: control hazards, multi-cycle instructions, pipelining equations

PROBLEMS (Cap. 4 - Istruzioni macchina)

Transcription:

1 VLIW Processors VLIW ( very long instruction word ) processors instructions are scheduled by the compiler a fixed number of operations are formatted as one big instruction (called a bundle) usually LIW (3 operations) today change in the instruction set architecture, i.e., 1 program counter points to 1 bundle (not 1 operation) want operations in a bundle to issue in parallel fixed format so could decode operations in parallel enough FUs for types of operations that can issue in parallel pipelined FUs Autumn 2006 CSE P548 - VLIW 1 VLIW Processors Roots of modern VLIW machines Multiflow & Cydra 5 (8 to 16 operations) in the 1980 s Today s VLIW machines Itanium (3 operations) Transmeta Crusoe (4 operations) embedded processors Autumn 2006 CSE P548 - VLIW 2

2 VLIW Processors Goal of the VLIW design: reduce hardware complexity less design time shorter cycle time better performance reduced power consumption How VLIW designs reduce hardware complexity (in theory) less multiple-issue hardware no dependence checking for instructions within a bundle can be fewer paths between instruction issue slots & FUs simpler instruction dispatch no out-of-order execution, no instruction grouping no structural hazard checking logic Compiler figures all this out. Autumn 2006 CSE P548 - VLIW 3 VLIW Processors Compiler support to increase ILP compiler creates each VLIW word greater need for good code scheduling than with in-order issue superscalars instruction doesn t issue if 1 operation can t Autumn 2006 CSE P548 - VLIW 4

3 VLIW Processors More compiler support to increase ILP detects hazards & hides latencies structural hazards no 2 operations to the same functional unit no 2 operations to the same memory bank data hazards no data hazards among instructions in a bundle control hazards predicated execution static branch prediction hiding latencies data prefetching hoisting loads above stores Autumn 2006 CSE P548 - VLIW 5 Compiler Support for Increasing ILP Compiler optimizations that increase ILP loop unrolling aggressive inlining: function becomes part of the caller code software pipelining: schedules instructions from different iterations together trace scheduling & superblocks: schedule beyond basic block boundaries Autumn 2006 CSE P548 - VLIW 6

4 Compiler Support for Increasing ILP Software pipelining a simple example ld R0,0(R1) add R4,R0,R2 st R4,0(R1) decrement index termination test conditional branch Autumn 2006 CSE P548 - VLIW 7 Compiler Support for Increasing ILP Software pipelining schedules instructions from different iterations together Iteration n-2 Iteration n-1 Iteration n ld R0,0(R1) add R4,R0,R2 ld st R4,0(R1) add ld decrement index termination test conditional branch st add st Autumn 2006 CSE P548 - VLIW 8

5 Compiler Support for Increasing ILP Software pipelining handling memory accesses st R4, 16(R1) add R4, R0, R2 ld R0, 0(R1) stores into mem[i] adds to mem[i-1] loads from mem[i-2] performance advantages: increasing ILP performance disadvantages: still executing loop control instructions Autumn 2006 CSE P548 - VLIW 9 Compiler Support for Increasing ILP Global scheduling (trace scheduling & superblocks) schedule beyond basic block boundaries A[i] = A[i] + B[i] A[i] = 0? B[i] =.. other code C[i] =.. select a trace compact instructions within it compensate for crossing basic block boundaries Autumn 2006 CSE P548 - VLIW 10

6 Compiler Support for Increasing ILP Compiler optimizations that increase ILP unroll the trace trace scheduling: trace entrances & exits at each iteration more difficulty & more compensation code than superblocks: trace exits at each iteration one trace entrance & multiple exits (each iteration) more difficulty & more compensation code than advantages depend on path frequencies, empty instruction slots, whether moved instruction is the beginning of a critical path, amount of compensation code on non-trace path Autumn 2006 CSE P548 - VLIW 11 Explicitly Parallel Instruction Computing, aka VLIW IA-64 architecture, Itanium implementation Bundle of instructions 128 bit bundles 3 instructions/bundle 2 bundles can be issued at once if issue one, get another Autumn 2006 CSE P548 - VLIW 12

7 Registers 128 integer & FP registers implications for architecture? 128 additional registers for loop unrolling & similar optimizations implications for hardware? miscellaneous other registers implications for performance? + + - - Autumn 2006 CSE P548 - VLIW 13 Full predicated execution supported by 64 one-bit predicate registers instructions can set 2 at once (comparison result & complement) example cmp.eq r1, r2, p1, p2 (p1) sub 59, r10, r11 (p2) add r5, r6, r7 Autumn 2006 CSE P548 - VLIW 14

8 Full predicated execution implications for architecture? implications for the hardware? implications for exploiting ILP? Autumn 2006 CSE P548 - VLIW 15 Template in each bundle that indicates: type of operation for each instruction instruction order in bundle examples (2 of 24) M: load & manipulate the address (e.g., increment an index) I: integer ALU op F: FP op B: transfer of control other, e.g., stop (see below) restrictions on which instructions can be in which slots schedule code for functional unit availability (i.e., template types) & latencies Autumn 2006 CSE P548 - VLIW 16

9 Template, cont d. a stop bit that delineates the instructions that can execute in parallel all instructions before a stop have no data dependences implications for hardware: simpler issue logic, no instruction slotting, no out-of-order issue potentially fewer paths between issue slots & functional units potentially no structural hazard checks hardware not have to determine intra-bundle data dependences Autumn 2006 CSE P548 - VLIW 17 Branch support full predicated execution hierarchy of branch prediction structures in different pipeline stages 4-target BTB for repeatedly executed taken branches an instruction puts a specific target in it (i.e., the BTB is exposed to the architecture) larger back-up BTB correlated branch prediction for hard-to-predict branches instruction hint that branches that are statically easy-topredict should not be placed in it private history registers, 4 history bits, shared PHTs separate structure for multi-way branches branch prediction instruction for target forecasting branch prediction instruction for storing a prediction Autumn 2006 CSE P548 - VLIW 18

10 ISA & microarchitecture seem complicated (some features of an out-oforder processors) not all instructions in a bundle need stall if one stalls (a scoreboard keeps track of produced values that will be source operands for stalled instructions) branch prediction hierarchy dynamically sized register stack, aka register windows special hardware for register window overflow detection special instructions for saving & restoring the register stack register remapping to support rotating registers on the register stack which aid in software pipelining array address post-increment & loop control Autumn 2006 CSE P548 - VLIW 19 More complication don t want to store speculative values to memory special instructions check integer register poison bits to detect whether value is speculative (for nonspeculative code or exceptions) OS can override the ban on storing (e.g., for a context switch) different mechanism for speculative floating point values backwards compatibility x86 (IA-32) PA-RISC compatible memory model (segments) Autumn 2006 CSE P548 - VLIW 20

11 Trimedia TM32 Designed for the embedded market Classic VLIW no hazard detection in hardware nops guarantee that dependences are followed instructions decompressed on fetching Autumn 2006 CSE P548 - VLIW 21 Superscalars vs. VLIW Superscalar has more complex hardware for instruction scheduling instruction slotting or out-of-order hardware more paths or more complicated paths between instruction issue structure & functional units dependence checking logic between parallel instructions functional unit hazard checking possible consequences: slower cycle times more chip real estate more power consumption Autumn 2006 CSE P548 - VLIW 22

12 Superscalars vs. VLIW VLIW has more functional units if supports full predication paths between instruction issue structure & more functional units possible consequences: slower cycle times more chip real estate more power consumption Autumn 2006 CSE P548 - VLIW 23 Superscalars vs. VLIW VLIW has larger code size estimates of IA-64 code of up to 2X - 4X over x86 128b holds 4 (not 3) instructions on a RISC superscalar sometimes nops if don t have an instruction of the correct type branch targets must be at the beginning of a bundle predicated execution to avoid branches extra, special instructions check for exceptions check for improper load hoisting (memory aliases) allocate register windows for local variables branch prediction consequences: Autumn 2006 CSE P548 - VLIW 24

13 Superscalars vs. VLIW VLIW requires a more complex compiler consequence: more design effort or poor quality code if good optimizations aren t implemented Superscalars can more efficiently execute pipeline-dependent code consequence: don t have to recompile if change the implementation What else? Autumn 2006 CSE P548 - VLIW 25