Instruction Set Design




Instruction Set Design

Instruction Set Architecture: to what purpose? The ISA provides the level of abstraction between the software and the hardware. It is one of the most important abstractions in CS. It's narrow, well-defined, and mostly static (compare writing a Windows emulator [almost impossible] to writing an ISA emulator [a few thousand lines of code]). The layers, from top to bottom: Application / Compiler / Operating System / Instruction Set Architecture / Micro-code, I/O system interface / Machine Organization / Circuit Design

What do we want in an ISA?
- Compact, simple
- Scalable (64-bit), spare opcodes
- Amenable to hardware implementation; able to express parallelism
- Turing complete
- Makes the common case fast
- Easy to verify; cost effective
- Easy to compile for: consistent/regular/orthogonal, regular instruction format
- Good OS support: protection, VM, interrupts
- Easy for the programmers

Crafting an ISA. Designing an ISA is both an art and a science. Some things we want out of our ISA: completeness, orthogonality, regularity and simplicity, compactness, ease of programming, ease of implementation. ISA design involves dealing in a tight resource: instruction bits! This will go down on your permanent record: ISAs live forever (almost), so be careful what you put in there.

Basic Questions
- Operations: how many? what kinds?
- Operands: how many? location? types? how to specify?
- Instruction format: how does the computer know what 0001 0100 1101 1111 means? what size? how many formats?
For example, in y = x + b: y is the destination operand, + is the operation, and x and b are the source operands.

Operand Location. We can classify machines into 3 types: accumulator, stack, and registers. There are two types of register machines:
- register-memory: most operands can be registers or memory
- load-store: most operations (e.g., arithmetic) are only between registers; explicit load and store instructions move data between registers and memory

How Many Operands?
- Accumulator (1 address): add A — acc = acc + mem[A]
- Stack (0 address): add — tos = tos + next
- Register-memory (2 address): add Ra, B — Ra = Ra + EA(B); (3 address): add Ra, Rb, C — Ra = Rb + EA(C)
- Load/store (3 address): add Ra, Rb, Rc — Ra = Rb + Rc; load Ra, Rb — Ra = mem[Rb]; store Ra, Rb — mem[Rb] = Ra
A load/store architecture has instructions that do either ALU operations or access memory, but never both.
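To make the load/store discipline concrete, here is a minimal Python sketch (all names hypothetical): ALU operations touch only registers, and memory is reached solely through explicit load and store.

```python
# Minimal model of a load/store machine: ALU ops never touch memory.
mem = {0: 7, 4: 5, 8: 0}                  # data memory, keyed by address
regs = {f"R{i}": 0 for i in range(8)}     # register file

def load(rd, ra):          # rd <- mem[ra]
    regs[rd] = mem[regs[ra]]

def store(rs, ra):         # mem[ra] <- rs
    mem[regs[ra]] = regs[rs]

def add(rd, ra, rb):       # rd <- ra + rb  (register-only operands)
    regs[rd] = regs[ra] + regs[rb]

# Compute mem[8] = mem[0] + mem[4]
regs["R1"] = 0; regs["R2"] = 4; regs["R3"] = 8
load("R4", "R1")
load("R5", "R2")
add("R6", "R4", "R5")
store("R6", "R3")
print(mem[8])              # -> 12
```

Note that a register-memory machine could fold the two loads into the add itself; the load/store style pays two extra instructions in exchange for simpler, more uniform hardware.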

Functionality: calculate A = X * Y - B * C. The operands sit in memory at offsets from SP: X at 0(SP), Y at 4(SP), B at 8(SP), C at 12(SP), and A at 16(SP); 20(SP) is a scratch slot.

stack:
Push 8(SP)
Push 12(SP)
Mult
Push 0(SP)
Push 4(SP)
Mult
Sub
Store 16(SP)
Pop

accumulator:
Load 8(SP)
Mult 12(SP)
Store 20(SP)
Load 4(SP)
Mult 0(SP)
Sub 20(SP)
Store 16(SP)

register-memory:
Mult R1, 0(SP), 4(SP)
Mult R2, 8(SP), 12(SP)
Sub 16(SP), R1, R2

load-store:
Load R1, 0(SP)
Load R2, 4(SP)
Load R3, 8(SP)
Load R4, 12(SP)
Mult R5, R1, R2
Mult R6, R3, R4
Sub R7, R5, R6
St 16(SP), R7
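The stack sequence can be checked with a tiny Python interpreter (a sketch; Sub is assumed here to compute top-minus-next, which matches the order the slide pushes the products, and Store is assumed to write the top of stack without popping):

```python
# Tiny stack-machine interpreter for the slide's sequence (names hypothetical).
# Layout: X@0(SP), Y@4(SP), B@8(SP), C@12(SP), A@16(SP); SP = 0 here.
X, Y, B, C = 2, 3, 4, 5
mem = {0: X, 4: Y, 8: B, 12: C, 16: 0}
stack = []

def push(addr): stack.append(mem[addr])
def mult():     stack.append(stack.pop() * stack.pop())
def sub():      # assumed: top minus next, to yield X*Y - B*C
    top = stack.pop(); nxt = stack.pop(); stack.append(top - nxt)
def store(addr): mem[addr] = stack[-1]   # write top of stack, leave it there
def pop():       stack.pop()

push(8); push(12); mult()    # B * C
push(0); push(4);  mult()    # X * Y
sub()                        # X*Y - B*C
store(16); pop()             # A = result
print(mem[16])               # -> 2*3 - 4*5 = -14
```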

Trade-offs
- Stack: short instructions; lots of instructions; simple hardware; little exposed architecture
- Accumulator: see stack
- Register-memory: expressive instructions; few instructions; instructions are complex and diverse; lots of exposed architecture
- Load-store: simple; higher instruction count; lots of exposed architecture

Memory Considerations. Effective address: the memory address specified by the addressing mode. How complex should the addressing modes be? What are the trade-offs? How widely applicable are they? How much do they impact the complexity of the machine? How many extra bits do they require to encode?

Instruction Operands
Non-memory:
- Register direct: Add R4, R3
- Immediate: Add R4, #3
Memory:
- Displacement: Add R4, 100(R1)
- Register indirect: Add R4, (R1)
- Indexed: Add R3, (R1 + R2)
- Direct: Add R1, (1001)
- Memory indirect: Add R1, @(R3)
- Autoincrement: Add R1, (R2)+
- Autodecrement: Add R1, -(R2)
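The memory modes differ only in how the effective address is computed. A short Python sketch of those computations (register and memory contents here are made up for illustration):

```python
# Effective-address computation for the memory addressing modes.
regs = {"R1": 100, "R2": 8, "R3": 200}
mem = {200: 300}

def ea_displacement(disp, rn): return disp + regs[rn]      # disp(Rn)
def ea_indirect(rn):           return regs[rn]             # (Rn)
def ea_indexed(rn, rm):        return regs[rn] + regs[rm]  # (Rn + Rm)
def ea_direct(addr):           return addr                 # (addr)
def ea_mem_indirect(rn):       return mem[regs[rn]]        # @(Rn): memory holds the address

def ea_autoincrement(rn, size=4):   # (Rn)+ : use Rn, then bump it
    ea = regs[rn]; regs[rn] += size; return ea

def ea_autodecrement(rn, size=4):   # -(Rn) : bump Rn down, then use it
    regs[rn] -= size; return regs[rn]

print(ea_displacement(100, "R1"))   # -> 200
print(ea_mem_indirect("R3"))        # -> 300 (one extra memory access)
print(ea_autoincrement("R2"))       # -> 8, and R2 becomes 12
```

The sketch makes the cost differences visible: memory indirect needs an extra memory access, and the auto modes need a register write port, which is exactly the kind of machine complexity the questions above are probing.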

Addressing Mode Utilization Conclusion?

Which Operations?
- Arithmetic: add, subtract, multiply, divide
- Logical: and, or, shift left, shift right
- Data transfer: load word, store word
- Control flow: branch (PC-relative: a displacement added to the program counter gives the target address)
Does it make sense to have more complex instructions, e.g., square root, multiply-add, matrix multiply, cross product? (The 3% criterion.)

Branch Decisions How is the destination of a branch specified? (how many bits?) How is the condition of the branch specified? What about indirect jumps?

Types of branches (control flow):
- conditional branch: beq r1, r2, label
- jump: jmp label
- procedure call: call label
- procedure return: return

Branch Conditions
- Condition codes: processor status bits are set as a side effect of executed instructions, or explicitly by a compare and/or test instruction. Ex: sub r1, r2, r3 / bz label
- Condition register: Ex: cmp r1, r2, r3 / bgt r1, label
- Compare and branch: Ex: bgt r1, r2, label
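The first two styles can be sketched in a few lines of Python (names hypothetical): with condition codes, every ALU op writes the status flags as a side effect; with a condition register, only an explicit compare produces the value the branch tests.

```python
# Condition-code style: ALU ops set the Z (zero) flag as a side effect.
flags = {"Z": False}

def sub_cc(a, b):
    result = a - b
    flags["Z"] = (result == 0)   # side effect on processor status bits
    return result

def bz():                        # branch-if-zero reads the flag; no operands
    return flags["Z"]

# Condition-register style: an explicit compare writes a register.
def cmp_gt(a, b):
    return a > b                 # result lands in a condition register

def bgt(cond_reg):               # the branch tests that register
    return cond_reg

r2, r3 = 5, 5
r1 = sub_cc(r2, r3)   # sets Z = True as a side effect
print(bz())           # -> True: branch taken

c = cmp_gt(7, 3)
print(bgt(c))         # -> True
```

The side-effect flag is what makes condition codes awkward for out-of-order and parallel implementations: every ALU op implicitly writes the same shared resource.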

Displacement Size Conclusions?

Encoding of Instruction Set

Compiler/ISA Interaction
- The compiler is the primary customer of the ISA; features the compiler doesn't use are wasted.
- Register allocation is a huge contributor to performance.
- The compiler-writer's job is made easier when the ISA has regularity, primitives (not solutions), and simple trade-offs.
- The compiler wants simplicity over power.

System/Compiler/ISA Issues
- Parameter passing
- Accessing data: stack, global
- ABI ("Application Binary Interface")
- I/O, interrupts, virtual memory, ...

Our Desired ISA
- Load-store register architecture
- Addressing modes: immediate (8-16 bits), displacement (12-16 bits), register indirect
- Support a reasonable number of operations
- Don't use condition codes (or support multiple sets of them, a la PPC)
- Fixed instruction encoding/length, for performance
- Regularity (several general-purpose registers)

MIPS64 instruction set architecture
- 32 64-bit general-purpose registers; R0 is always equal to zero
- 32 floating-point registers
- Data types: 8-, 16-, 32-, and 64-bit integers; 32- and 64-bit floating-point numbers
- Immediate and displacement addressing modes (register indirect is a subset of displacement)
- 32-bit fixed-length instruction encoding

MIPS Instruction Format

MIPS instructions. Read on your own and become comfortable speaking MIPS.
- LD R1, 1000(R2) — R1 gets memory[R2 + 1000]
- DADDU R1, R2, R3 — R1 gets R2 + R3
- DADDI R1, R2, #53 — R1 gets R2 + 53
- JALR R2 — RA gets PC + 4; jump to R2
- JR R3 — jump to R3
- BEQZ R5, label — if R5 == 0, jump to label (label must be within displacement range)
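A toy Python interpreter for a subset of these instructions can make the semantics concrete. This is a sketch, not real MIPS: memory is a dict, and jump/branch targets are instruction indices rather than encoded addresses. The program sums 3 + 2 + 1 with a countdown loop.

```python
# Toy interpreter for a subset of the MIPS instructions listed above.
regs = {f"R{i}": 0 for i in range(32)}
mem = {1000: 3}

program = [
    ("LD",    "R2", 1000, "R0"),   # 0: R2 = mem[R0 + 1000] = 3 (counter)
    ("DADDI", "R3", "R0", 0),      # 1: R3 = 0 (running sum)
    ("DADDI", "R4", "R0", 3),      # 2: R4 = 3 = index of the loop top
    ("DADDU", "R3", "R3", "R2"),   # 3: loop: R3 += R2
    ("DADDI", "R2", "R2", -1),     # 4: R2 -= 1
    ("BEQZ",  "R2", 7),            # 5: exit the loop when R2 == 0
    ("JR",    "R4"),               # 6: jump back to the loop top
]                                  # falling off the end halts

pc = 0
while pc < len(program):
    op, *args = program[pc]
    pc += 1
    if op == "LD":
        rd, disp, rs = args
        regs[rd] = mem[regs[rs] + disp]
    elif op == "DADDU":
        rd, rs, rt = args
        regs[rd] = regs[rs] + regs[rt]
    elif op == "DADDI":
        rd, rs, imm = args
        regs[rd] = regs[rs] + imm
    elif op == "BEQZ":
        rs, target = args
        if regs[rs] == 0:
            pc = target
    elif op == "JR":
        (rs,) = args
        pc = regs[rs]
    regs["R0"] = 0                 # R0 is hardwired to zero

print(regs["R3"])                  # -> 6  (3 + 2 + 1)
```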

Very Long Instruction Words. Each instruction word contains multiple operations. The semantics of the ISA say that they happen in parallel, and the compiler can (and must) respect this constraint.

VLIW Example. Starting state: $s1 = 1, $s2 = 1, $s3 = 4.
RISC code:
add $s2, $s1, $s3
sub $s5, $s2, $s3
Here the sub sees $s2 = 5, the result of the add.
VLIW instruction word:
<add $s2, $s1, $s3; sub $s5, $s2, $s3>
Here the sub sees $s2 = 1: both operations read the register values from before the word executed.
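The two register-read semantics can be modeled directly in Python (a sketch with hypothetical names): sequential execution lets later ops see earlier results, while a VLIW word snapshots the register file before any op in the word writes it.

```python
# Sequential (RISC) vs. parallel (VLIW-word) semantics for the same ops.
def run_sequential(regs, ops):
    for dst, src1, src2, fn in ops:      # each op sees earlier results
        regs[dst] = fn(regs[src1], regs[src2])
    return regs

def run_vliw_word(regs, ops):
    old = dict(regs)                     # all ops read pre-word values
    for dst, src1, src2, fn in ops:
        regs[dst] = fn(old[src1], old[src2])
    return regs

ops = [
    ("s2", "s1", "s3", lambda a, b: a + b),   # add $s2, $s1, $s3
    ("s5", "s2", "s3", lambda a, b: a - b),   # sub $s5, $s2, $s3
]
init = {"s1": 1, "s2": 1, "s3": 4, "s5": 0}

print(run_sequential(dict(init), ops)["s5"])  # -> 1   (sub saw s2 = 5)
print(run_vliw_word(dict(init), ops)["s5"])   # -> -3  (sub saw s2 = 1)
```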

VLIW's History. VLIW has been around for a long time. It's the simplest way to get ILP, because the burden of avoiding hazards lies completely with the compiler, and when hardware was expensive this seemed like a good idea. However, the compiler problem is extremely hard, and there end up being lots of no-ops in the long instruction words. As a result, VLIW machines have either:
1. met with limited commercial success as general-purpose machines (many companies), or
2. become very complicated in new and interesting ways (for instance, by providing special registers and instructions to eliminate branches), or
3. both 1 and 2 -- see the Itanium from Intel.

Compiling for VLIW. A VLIW compiler must identify instructions that can execute in parallel and will execute under the same conditions. The easy place to look is within a basic block (a region of code with no branches or branch targets), but basic blocks are too small (3-10 instructions on average).

Trace Scheduling. Profile to find the hot path through the code. Treat the hot path (or "trace") as a single basic block for scheduling. Add fix-up code for the cases when execution doesn't follow the hot path.


Trace Scheduling is Hard
- Building sufficiently long traces is difficult. Loops: unroll them. Function calls: inline them, if possible; virtual functions, etc. make this hard.
- Getting the fix-up code right is challenging.
- The hot path might not be so hot.

VLIW Today. VLIW is alive and well in digital signal processors. They are simple and low power, which is important in embedded systems. DSPs run the same, very regular loops all the time, and VLIW machines are very good at that; it is worthwhile to hand-code these loops in assembly.

Beyond VLIW
- In a RISC ISA, dependence information is implicit in the use of register names. In VLIW, some non-dependence information is explicit.
- You can add additional explicit information about dependences to the ISA. Dependence information is necessary if the microarchitecture is going to exploit parallelism, and in some cases it is easier to provide that information explicitly -- searching for it in hardware is expensive.
- But not all dependence information is available to the compiler: branch outcomes affect dependences, and memory dependences are, in general, unknowable.
We'll see richer ISAs later in the quarter.

Key Points
- Modern ISAs typically sacrifice power and flexibility for regularity and simplicity, trading off code density for greater microarchitectural flexibility.
- Instruction bits are extremely limited, particularly in a fixed-length instruction format.
- Registers are critical to performance; we want lots of them, with few usage restrictions attached.
- The displacement addressing mode handles the vast majority of memory-reference needs.