CS 412 Introduction to Compilers: Instruction Scheduling



CS 412 Introduction to Compilers
Andrew Myers, Cornell University
Lecture: Instruction scheduling

Administration
- Prelim tomorrow evening
- No class Wednesday
- Programming assignment due in a few days
- Optional reading: Muchnick 17

Modern processors
- A modern superscalar architecture executes N instructions every cycle.
- Pentium: N = 2 (U-pipe and V-pipe). Pentium II and later: N = 3, with dynamic translation of each instruction into µops.
- Reality check: achieved throughput is typically well below the issue width; processor resources are mostly wasted even with good ordering.

Instruction scheduling
- Instruction order has a significant performance impact on some modern architectures.
- Avoiding stalls requires understanding the processor architecture(s) (see the Intel Arch. Software Developer's Manual).
- Instruction scheduling: reordering low-level instructions to improve processor utilization.
- The processor spends a lot of time waiting:
  memory stalls
  branch stalls
  expensive arithmetic operations
  resource conflicts

Simplified architecture model
- Assume a MIPS-like architecture for illustration, with five pipeline stages (Pentium II: 9+9):
  F: instruction fetch -- read instruction from memory, decode
  R: read values from registers
  A: ALU operation
  M: memory load or store
  W: write back result to registers

Examples
  mov eax, ebx            R: read ebx    W: store into eax
  add eax, 0              F: extract immediate 0    R: read eax    A: add operands    W: store into eax
  mov ecx, [edx + disp]   R: read edx    A: compute address    M: read from cache    W: store into ecx
  push [edx + disp]       ?
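To make the stage model concrete, here is a small Python sketch (my own illustration, not part of the lecture; the stage assignments and the disp placeholder are assumptions) that records which of the F R A M W stages each example instruction does useful work in, and which stages it merely passes through.

    # Which of the simplified F/R/A/M/W stages each example instruction uses;
    # the assignments below mirror the examples above and are illustrative.
    STAGES = ["F", "R", "A", "M", "W"]

    stage_use = {
        "mov eax, ebx":         {"F", "R", "W"},            # read ebx, write eax
        "add eax, 0":           {"F", "R", "A", "W"},       # read eax, add, write eax
        "mov ecx, [edx+disp]":  {"F", "R", "A", "M", "W"},  # address calc, cache read, write ecx
    }

    for instr, used in stage_use.items():
        in_order = [s for s in STAGES if s in used]
        idle = [s for s in STAGES if s not in used]
        print(f"{instr:20s}  uses {in_order}  idle {idle}")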

Non-pipelined execution
(Figure: each instruction, e.g. add eax, 0 then mov ecx, [edx + disp], occupies its F R A M W stages one after another -- several cycles per instruction.)

Pipelined execution
(Figure: F R A M W stages of successive instructions overlapped in time.)
- A new instruction is begun every cycle.
- Most pipeline stages are busy every cycle.

Superscalar execution
(Figure: two pipelines, U and V, running side by side.)
- Two copies of most execution units.
- Many instructions are executed in parallel.

Memory is slow!
- Main memory: ~100 cycles
- Second-level cache: tens of cycles
- First-level cache: a few cycles
- Loads should be started as early as possible to avoid waiting.

Memory stalls
  (load into eax)      <- memory value available here (if in cache)
  add ebx, eax         <- ...but needed here!
- The processor must stall even on a cache hit.
- Solutions:
  Option 1 (original Alpha, 486, Pentium): the processor stalls on use of the result until it is available. The compiler should reorder instructions if possible, moving independent work between the load and the add ebx, eax that consumes it (see the sketch below).
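The payoff from starting loads early can be illustrated with a toy Python model. This is an assumption-laden sketch, not the lecture's machine: single issue, an assumed 2-cycle load-use latency, and unit latency for everything else. It counts cycles for a block before and after moving an independent instruction between a load and its use.

    # Toy in-order issue model: one instruction issues per cycle; a loaded value
    # becomes usable LOAD_DELAY cycles after the load issues.
    # Instruction format: (destination, set of source registers, is_load)

    LOAD_DELAY = 2   # assumed load-use latency, in cycles

    def schedule_cycles(instrs):
        ready = {}                                  # register -> cycle its value is ready
        cycle = 0
        for dest, srcs, is_load in instrs:
            start = max([cycle] + [ready.get(r, 0) for r in srcs])   # wait for sources
            cycle = start + 1                       # next instruction issues a cycle later
            ready[dest] = start + (LOAD_DELAY if is_load else 1)
        return cycle

    block = [
        ("eax", set(),   True),    # mov eax, [mem]   (load)
        ("ebx", {"eax"}, False),   # add ebx, eax     -- stalls waiting for eax
        ("ecx", {"edx"}, False),   # add ecx, edx     -- independent work
    ]
    reordered = [block[0], block[2], block[1]]

    print("original :", schedule_cycles(block), "cycles")      # 4
    print("reordered:", schedule_cycles(reordered), "cycles")  # 3

With these assumed latencies the reordered block saves one cycle: the independent add issues while the loaded value is still in flight, so the dependent add no longer stalls.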

No interlocks
- Option 2 (MIPS R2000): the memory result is not available until two instructions later; the compiler must insert some instruction in between -- a nop if nothing useful can be moved there.

Out-of-order execution
- Out-of-order processors (PowerPC, later Alpha and MIPS designs, Intel P6) can execute instructions further ahead rather than stall.
- The processor issues instructions farther ahead in the stream into a reorder buffer, from which viable instructions are selected on each cycle.
    mov eax, [ecx + disp]
    mov ebx, eax
    add ecx, 4            ; can execute while the load is still in flight

Branch stalls
- Branch and indirect jump instructions: the next instruction to execute is not known until the target address is known.
- The processor may stall for many cycles!
    cmp eax, ebx
    jz L                ?
    beq r1, r2, L       ?

Option 1: stall
- The 80486 stalls branches until the pipeline is empty!
- Early Alpha processors start the initial pipeline stages on the predicted branch target and stall until the target address is known (plus a further stall on a branch mispredict).
(Figure: F R A M W pipeline diagram for a beq followed by a mov and a ld, showing the target fetch overlapping the stall.)

Dealing with stalls
- Alpha: predicts backward branches taken (loops) and forward branches not taken (else clauses); also has a branch prediction cache.
- The compiler should avoid branches and indirect jumps:
  unroll loops!
  use conditional move instructions (Alpha, Pentium Pro) or predicated instructions (Itanium); these can be inserted as a low-level optimization on assembly code (a peephole sketch of this rewrite appears after the next slide):
    cmp ecx, imm            cmp ecx, imm
    jz skip          =>     cmovnz eax, ebx
    mov eax, ebx
  skip:

MIPS: branch delay slot
- The instruction after a branch is always executed: the branch delay slot.
    beq r1, r2, L
    mov r3, r4       ; delay slot, executed whether or not the branch is taken
  L: <target>
- Options for the compiler (a filling sketch follows below):
  always put a nop after the branch
  move an earlier, independent instruction after the branch
  move in the destination instruction if harmless
- Problem: the branch delay slot hasn't scaled with deeper and wider pipelines.
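Here is a toy Python sketch of the second delay-slot option above, moving an earlier independent instruction after the branch. It works on made-up (destination, sources) register tuples rather than real MIPS instructions, and it ignores memory operations, exceptions, and other side effects.

    # Toy branch-delay-slot filler: move the latest instruction before the branch
    # that can legally execute after it into the slot; otherwise use a nop.
    # Each instruction is a (destination_register, set_of_source_registers) tuple.

    def fill_delay_slot(body, branch_sources):
        """body: instructions before the branch; branch_sources: registers the branch reads."""
        for i in range(len(body) - 1, -1, -1):
            dest, srcs = body[i]
            later_uses = {s for _, ss in body[i + 1:] for s in ss}
            later_defs = {d for d, _ in body[i + 1:]}
            # Safe to move past the branch if the branch does not read its result,
            # no later instruction reads or redefines its result, and no later
            # instruction redefines one of its sources.
            if (dest not in branch_sources and dest not in later_uses
                    and dest not in later_defs and not (srcs & later_defs)):
                slot = body.pop(i)
                return body, slot
        return body, ("nop", set())

    body = [("r3", {"r5"}), ("r6", {"r3"})]       # r3 := f(r5); r6 := g(r3)
    print(fill_delay_slot(body, {"r1", "r2"}))    # branch reads r1, r2 (e.g. beq r1, r2, L)
    # -> ([('r3', {'r5'})], ('r6', {'r3'}))  : the last instruction fills the slot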

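Returning to the conditional-move rewrite shown under Dealing with stalls: a minimal peephole pass in Python. The pattern, mnemonics, and operands are illustrative assumptions (it recognizes only a jz around a single mov), not a general if-conversion algorithm.

    # Toy peephole pass: rewrite   jz L / mov dst, src / L:   into a cmovnz,
    # since the mov executes only on the fall-through (ZF = 0) path.

    def if_convert(lines):
        out, i = [], 0
        while i < len(lines):
            if (i + 2 < len(lines)
                    and lines[i].startswith("jz ")
                    and lines[i + 1].startswith("mov ")
                    and lines[i + 2].endswith(":")
                    and lines[i + 2].rstrip(":") == lines[i].split()[1]):
                out.append("cmovnz " + lines[i + 1][len("mov "):])
                out.append(lines[i + 2])   # keep the label; other code may jump to it
                i += 3
            else:
                out.append(lines[i])
                i += 1
        return out

    code = ["cmp ecx, 0", "jz skip", "mov eax, ebx", "skip:"]
    print(if_convert(code))   # ['cmp ecx, 0', 'cmovnz eax, ebx', 'skip:']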
Other stalls
- Some instructions take much longer to complete: multiply, divide.
- Typical superscalar processors are 4-way, but have fewer than 4 copies of some functional units (e.g. the R10000 and the Pentium have limited numbers of integer and floating-point units).
- Issuing too many ALU operations at once makes some pipelines stall.
- Interleave other kinds of operations so that more pipelines are busy.

Real architectures
- Deeper pipelines, superscalar:
  MIPS R4000: 8 stages; R10000: 4-way superscalar
  Alpha: 2- or 4-way superscalar
  Pentium P6 (Pro, II, III): 9-stage pipeline executing µops, 3-way, fed by a 9-stage dynamic translation pipeline
  Pentium 4: 20-stage branch (mis)prediction pipeline, many instructions and memory operations outstanding, deep speculative execution, on-chip second-level cache

Instruction scheduling
- Goal: reorder instructions so that all pipelines are as full as possible.
- Instructions are reordered against some particular machine architecture and the scheduling rules embedded in its hardware.
- May need to compromise so that code works well on a variety of architectures (e.g. Pentium vs. Pentium II).

Scheduling constraints
- Instruction scheduling is a low-level optimization: it is performed on assembly code.
- The reordered code must have the same effect as the original.
- Constraints to be considered:
  data dependences
  control dependences: only reorder within a basic block
  resource constraints

Data dependences
- If two instructions access the same register or memory location, they may be dependent.
  True dependence (write, then read): a write to eax followed by  add ebx, eax
  Anti-dependence (read, then write): add ebx, eax  followed by a write to eax
  Output dependence (write, then write): mul ebx  and  mov edx, ecx  -- both update edx

Dependence graph
  1. mov ecx, [ebp+8]
  2. add ecx, eax
  3. mov [ebp + 4], eax
  4. mov edx, [ecx + 4]
  5. add eax, edx
  6. mov [ebp + 4], ebx
(Figure: the dependence DAG over these instructions, with true, anti-, and output ("O") edges; a construction sketch follows below.)
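The dependence classification above can be turned into a small DAG builder. This Python sketch is my own illustration: the def/use sets are written out by hand, and all memory operations are treated as touching one conservative location, MEM (no alias analysis, so memory edges are over-approximated). It also does not track intermediate redefinitions, so it may emit redundant edges, which only adds safe ordering constraints.

    # Build a dependence DAG for a basic block.
    # Each instruction: (text, defs, uses) over registers, plus "MEM" for memory.

    instrs = [
        ("1. mov ecx, [ebp+8]",  {"ecx"}, {"ebp", "MEM"}),
        ("2. add ecx, eax",      {"ecx"}, {"ecx", "eax"}),
        ("3. mov [ebp+4], eax",  {"MEM"}, {"ebp", "eax"}),
        ("4. mov edx, [ecx+4]",  {"edx"}, {"ecx", "MEM"}),
        ("5. add eax, edx",      {"eax"}, {"eax", "edx"}),
        ("6. mov [ebp+4], ebx",  {"MEM"}, {"ebp", "ebx"}),
    ]

    def dependence_edges(instrs):
        edges = []
        for i, (_, d1, u1) in enumerate(instrs):
            for j in range(i + 1, len(instrs)):
                _, d2, u2 = instrs[j]
                if d1 & u2: edges.append((i, j, "true"))    # write then read
                if u1 & d2: edges.append((i, j, "anti"))    # read then write
                if d1 & d2: edges.append((i, j, "output"))  # write then write
        return edges

    for i, j, kind in dependence_edges(instrs):
        print(f"{instrs[i][0]}  ->  {instrs[j][0]}   ({kind})")

Every edge points forward in the original order, so any reordering that keeps all edges pointing forward (a topological order of this DAG) respects the dependences.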

Dependence graph, continued
- Memory dependences are tricky: two memory addresses may be aliases for each other -- we need alias analysis.
    mov [edx + disp], eax      dependence?
    mov ebx, [ecx - 8]
    mov [ecx + 8], edx         dependence?
- If one instruction depends on another, their order cannot be reversed; this constrains scheduling.
- Any valid ordering must make all dependence edges go forward in the code: a topological sort of the dependence DAG.

List scheduling
- Initialize the ready list R with all instructions that do not depend on any unexecuted instruction.
- Loop until R is empty:
  pick the best node in R and append it to the reordered instructions
  update the ready list with the now-ready successors of that node
- Works for simple and superscalar processors.
- Problem 1: determining the best node in R is NP-complete! Must use a heuristic.
- Problem 2: doesn't consider all possible executions.

Greedy heuristic
- If an instruction's predecessors won't be sufficiently complete yet, scheduling it creates a stall.
- Choose the instruction that can be scheduled as soon as possible, based on the start times of its predecessors: simulate the processor.
- How to break ties:
  pick the node with the longest path to a leaf of the DAG
  pick a node that can go to a non-busy pipeline
  pick the node with many dependent successors
(A compact list-scheduling sketch appears at the end of these notes.)

Scheduling with the F R A M W model
(Figure: a list-scheduling simulation of the six-instruction example above -- 1. mov ecx, [ebp+8]  2. add ecx, eax  3. mov [ebp+4], eax  4. mov edx, [ecx+4]  5. add eax, edx  6. mov [ebp+4], ebx -- showing how the ready list evolves cycle by cycle.)
- Result: the stalls after the loads (instructions 1 and 4) are eliminated, and the memory operations move earlier.

Register allocation conflict
- Problem: reuse of the same register creates anti-dependencies that restrict scheduling.
- Register allocation before scheduling prevents good scheduling.
- Scheduling before register allocation: spills destroy the schedule.
- Solution: schedule abstract assembly, allocate registers, then schedule again!

Summary
- Instruction scheduling is very important for non-fancy (in-order) processors.
- It improves performance even on processors with out-of-order execution (whose dynamic reordering must be more conservative).
- List scheduling provides a simple heuristic for instruction scheduling.
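Finally, a compact sketch of list scheduling with the greedy heuristic: simulate issue times, pick the ready node that can start soonest, and break ties by the longest latency-weighted path to a leaf. The single-issue model, the latencies (2 cycles for the loads, 1 otherwise), and the dependence edges chosen for the six-instruction example are my own simplifications.

    # List scheduling over a dependence DAG with a greedy earliest-start heuristic.

    def list_schedule(latency, edges):
        n = len(latency)
        preds = {i: set() for i in range(n)}
        succs = {i: set() for i in range(n)}
        for p, s in edges:
            preds[s].add(p)
            succs[p].add(s)

        height = {}                     # longest latency-weighted path to a leaf
        def h(i):
            if i not in height:
                height[i] = latency[i] + max((h(s) for s in succs[i]), default=0)
            return height[i]

        done_at = {}                    # node -> cycle its result is available
        order, issue = [], 0            # issue: next free single-issue slot
        ready = {i for i in range(n) if not preds[i]}
        while ready:
            est = {i: max([issue] + [done_at[p] for p in preds[i]]) for i in ready}
            best = min(ready, key=lambda i: (est[i], -h(i)))   # soonest start, then tallest
            done_at[best] = est[best] + latency[best]
            order.append(best)
            issue = est[best] + 1
            ready.remove(best)
            ready |= {s for s in succs[best] if all(p in done_at for p in preds[s])}
        return order

    lat  = [2, 1, 1, 2, 1, 1]                                # loads (0 and 3) take 2 cycles
    deps = [(0, 1), (1, 3), (3, 4), (1, 4), (2, 4), (2, 5)]  # edges among instructions 1..6 (0-based)
    print(list_schedule(lat, deps))                          # [0, 2, 1, 3, 5, 4]

On this input the schedule is 1, 3, 2, 4, 6, 5 in the slides' numbering: each store is hoisted between a load and its use, which is the kind of result described above (the stalls after instructions 1 and 4 disappear and the memory operations move earlier).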