CS521 CSE IITG 11/23/2012



A Sahu

Degree of overlap: serial, overlapped, pipelined, superpipelined/superscalar
Depth: shallow, deep
Structure: linear, non-linear
Scheduling of operations: static, dynamic

[Figure: serial, overlapped and pipelined execution; shallow and deep pipelines; linear and non-linear pipelines. Non-linear example sequence: A, B, C, B, C, A, C, A]

Static
- same sequence of stages for all instructions
- all actions in order
- if one instruction stalls, all subsequent instructions are delayed

Dynamic
- above conditions are relaxed
- higher throughput is achieved
- type 1: beginnings (decode) and endings (put away) in order
- type 2: only beginnings in order
- type 3: no order restrictions except dependencies
- type 1 extended: beginnings in order; references that affect memory state are in order (note that a memory reference may lead to a page fault)

Type                          CPI
Serial                        5 to 6
Overlapped                    3
Pipelined (static)            1.5 to 2
Pipelined (dynamic)           1.2 to 1.5
Multiple instruction issue    < 1.0

Data dependencies => Data hazards
- RAW (read after write)
- WAR (write after read)
- WAW (write after write)

Resource conflicts => Structural hazards
- use of same resource in different stages

Procedural dependencies => Control hazards
- conditional and unconditional branches, calls/returns

[Figure: a previous and a current instruction, each reading and writing; the current instruction's read (R) waits for the previous instruction's write (W), delay = 3]

Remedies for data hazards
- Data forwarding (hardware approach)
- Instruction reordering (software approach)

Data forwarding path P1
  I:   add $t1,...
  I+1: add $s1,$t1,..

Data forwarding path P2
  I:   lw $t1,...
  I+1: add $s1,$t1,..

[Figure: pipeline diagrams for I and I+1 showing paths P1 and P2]
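The three data hazard types can be told apart by comparing the register sets of an earlier and a later instruction. The following is a minimal sketch, not part of the slides; the `Instr` record and the `data_hazards` helper are hypothetical names introduced for illustration, and the read/write sets are filled in by hand rather than decoded from the instruction text.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    """Hypothetical record: instruction text plus the registers it writes and reads."""
    text: str
    writes: frozenset
    reads: frozenset


def data_hazards(previous: Instr, current: Instr) -> set:
    """Return the hazard kinds that exist between an earlier and a later instruction."""
    kinds = set()
    if previous.writes & current.reads:
        kinds.add("RAW")  # current reads what previous writes
    if previous.reads & current.writes:
        kinds.add("WAR")  # current overwrites what previous still reads
    if previous.writes & current.writes:
        kinds.add("WAW")  # both write the same register
    return kinds


prev = Instr("add $t1, $t2, $s1", frozenset({"$t1"}), frozenset({"$t2", "$s1"}))
curr = Instr("add $s1, $t1, $s5", frozenset({"$s1"}), frozenset({"$t1", "$s5"}))
print(sorted(data_hazards(prev, curr)))  # prints ['RAW', 'WAR']
```

In a simple in-order pipeline only the RAW case forces a stall or a forwarding path; WAR and WAW matter once instructions can complete out of order.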

Data forwarding path P3
  I:   add $t1,...

Data forwarding path P4
  I:   lw $t1,...

[Figure: pipeline diagrams for I and I+1 showing paths P3 and P4, followed by a combined diagram of all data forwarding paths P1 to P4]

Data forwarding path list
- P1: from ALU out (EX/DM) to ALU in1/2
- P2: from DM/ALU out (DM/WB) to ALU in1/2
- P3/P4: from DM/ALU out (DM/WB) to DM in

P1 = ALU to ALU, P2 = M to ALU, P3 = ALU to M, P4 = M to M

   1      move $t0, $zero
   2      addi $t2, $zero, 100
   3  L:  lw   $t2, 0($7)
   4      add  $t1, $t2, $s1
   5      add  $a, $t1, $s5
   6      sw   $a, 32($s3)
   7      add  $6, $3, $a
   8      addi $t0, $t0, 1
   9      lw   $7, 0($8)
  10      sw   $7, 8($0)
  11      add  $s9, $s9, 1
  12      beq  $t0, $t2, L
  13      hlt

[Figure: dependence graph over instructions 2 to 9 of the listing; edges are labelled with the forwarding path used (P1 to P4), plus one WAW dependence]

Reference: Patterson, D.A., and Hennessy, J.L., Computer Organization and Design: The Hardware/Software Interface, third edition, Chapters 6.4/6.5.
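The choice among the four forwarding paths can be sketched as a lookup on where the value is produced and where it is consumed. This sketch reads the path summary as P1 = ALU to ALU, P2 = memory to ALU, P3 = ALU to memory and P4 = memory to memory, which is an interpretation of the slide. The function name `forwarding_path` is hypothetical, and the sketch assumes that a store consumes the forwarded value as its store data (not as its address base) and that only `lw` produces its result in the memory stage.

```python
LOADS = {"lw"}   # result becomes available at the data memory (M) output
STORES = {"sw"}  # assumed to consume the forwarded value at the data memory input

PATHS = {
    ("ALU", "ALU"): "P1",
    ("M", "ALU"): "P2",
    ("ALU", "M"): "P3",
    ("M", "M"): "P4",
}


def forwarding_path(producer_op: str, consumer_op: str) -> str:
    """Pick a forwarding path from the opcodes of a producer and its dependent consumer."""
    source = "M" if producer_op in LOADS else "ALU"
    dest = "M" if consumer_op in STORES else "ALU"
    return PATHS[(source, dest)]


# Dependent pairs taken from the listing above (instruction numbers in comments).
print(forwarding_path("lw", "add"))   # 3 -> 4
print(forwarding_path("add", "add"))  # 4 -> 5
print(forwarding_path("add", "sw"))   # 5 -> 6
print(forwarding_path("lw", "sw"))    # 9 -> 10
```

Under these assumptions the four pairs map to P2, P1, P3 and P4 respectively. A real forwarding unit compares register numbers held in the pipeline registers instead of opcodes; the opcode lookup is only meant to show how the four cases differ.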

Structural hazards: caused by resource conflicts
- Use of a hardware resource in more than one cycle (for example, stage sequence A B A C)
- Non-linear pipelines: different sequences of resource usage by different instructions
- Non-pipelined multi-cycle resources (for example, F D X X)

Reservation table for X (required resources of the instruction in each cycle)
- Cycles: 1 to 8
- Stage A: 3 marks
- Stage B: 2 marks
- Stage C: 3 marks

[Figure: multi-functional reservation table for functions X and Y over stages A, B, C and cycles 1 to 8]

[Figure: three occupancy tables over cycles 1 to 11 for stages A, B and C when successive instructions are initiated at a fixed interval. Each entry lists the instructions using the stage in that cycle; an entry with more than one instruction is a collision. "1-3" means 1, 2, 3.]
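How a reservation table determines which initiation intervals collide can be sketched as follows. The cycle positions used here (A in cycles 1, 6, 8; B in 2, 4; C in 3, 5, 7) are an assumption chosen to match the mark counts per stage; they are not stated in the text above. The helper names are hypothetical.

```python
def forbidden_latencies(table: dict) -> set:
    """Latencies at which two initiations would need the same stage in the same cycle."""
    forbidden = set()
    for cycles in table.values():
        for first in cycles:
            for second in cycles:
                if second > first:
                    forbidden.add(second - first)
    return forbidden


def collision_vector(forbidden: set, width: int) -> str:
    """Bit string with latency `width` on the left and latency 1 on the right; 1 = collision."""
    return "".join("1" if lat in forbidden else "0" for lat in range(width, 0, -1))


# Assumed reservation table for X: stage -> cycles in which the stage is used.
table_x = {"A": [1, 6, 8], "B": [2, 4], "C": [3, 5, 7]}

forbidden = forbidden_latencies(table_x)
vector = collision_vector(forbidden, max(forbidden))
print(sorted(forbidden))  # prints [2, 4, 5, 7]
print(vector)             # prints 1011010
```

With this table, latencies 1, 3 and 6 are collision-free, and so is any latency of 8 or more, because the previous instruction has then left the pipeline.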

No collision for initiation intervals 1, 3, 6 and 8
- 1, 8, 1, 8, ...  cycle (1, 8), avg = 4.5
- 3, 3, 3, 3, ...  cycle (3), avg = 3
- 6, 6, 6, 6, ...  cycle (6), avg = 6
Minimum Average Latency (MAL)?

Collision vector for X: 1 0 1 1 0 1 0
(the i-th bit from the right refers to latency i; 1: collision, 0: no collision)

[Figure: state diagram with states 1011010 (initial), 1011011 and 1111111; edges are labelled with the latencies 1, 3, 6 and 8+]

Latency cycles: (1, 8), (1, 8, 6, 8), (3), (6), (3, 8), (3, 6, 3)
Simple latency cycles (no figure repeats): (1, 8), (3), (6), (3, 8), (6, 8)
Greedy latency cycles: (1, 8), (3), from different starting states

Bounds on MAL
- MAL ≥ max no. of check marks in any row
- MAL ≤ avg latency of any greedy cycle
- avg latency of any greedy cycle ≤ no. of 1's in initial collision vector + 1

Consider a greedy cycle \((k_1, k_2, \ldots, k_n)\) and let \(p\) = no. of 1's in the initial collision vector.

\[
\begin{aligned}
k_1 &\le p + 1 \\
k_2 &\le 2p - k_1 + 2 &&\Rightarrow\; k_1 + k_2 \le 2p + 2 \\
k_3 &\le 3p - k_1 - k_2 + 3 \\
&\;\;\vdots \\
k_n &\le np - k_1 - k_2 - \cdots - k_{n-1} + n &&\Rightarrow\; k_1 + k_2 + \cdots + k_n \le np + n
\end{aligned}
\]

Dividing by \(n\), the average latency of the greedy cycle is at most \(p + 1\), so \(\mathrm{MAL} \le p + 1\).

Reference: Kai Hwang, "Advanced Computer Architecture: Parallelism, Scalability, Programmability", Chapter 6.
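The state transitions and greedy cycles can be reproduced with a few bit operations. This is a sketch under stated assumptions: latencies 2, 4, 5 and 7 are taken as forbidden (the complement of the collision-free intervals 1, 3, 6 and 8+ listed above), a state is stored as an integer whose bit i-1 is set when latency i would collide, and the function names are hypothetical.

```python
WIDTH = 7                  # longest latency that can still collide
INITIAL = 0b1011010        # forbidden latencies 2, 4, 5, 7 (bit i-1 <-> latency i)


def smallest_permissible(state: int) -> int:
    """Greedy choice: the smallest latency whose bit is 0, or WIDTH + 1 if none is."""
    for latency in range(1, WIDTH + 1):
        if not (state >> (latency - 1)) & 1:
            return latency
    return WIDTH + 1


def next_state(state: int, latency: int) -> int:
    """Shift out the elapsed cycles, then OR in the constraints of the new initiation."""
    return (state >> latency) | INITIAL


def greedy_cycle(start: int) -> list:
    """Follow greedy choices from `start` until a state repeats; return the cycle's latencies."""
    first_seen = {}
    latencies = []
    state = start
    while state not in first_seen:
        first_seen[state] = len(latencies)
        latency = smallest_permissible(state)
        latencies.append(latency)
        state = next_state(state, latency)
    return latencies[first_seen[state]:]


for start in (0b1011010, 0b1011011):
    cycle = greedy_cycle(start)
    print(format(start, "07b"), cycle, sum(cycle) / len(cycle))

p = bin(INITIAL).count("1")  # number of 1's in the initial collision vector
print("upper bound p + 1 =", p + 1)
```

Starting from 1011010 the greedy cycle is (1, 8) with average 4.5, and starting from 1011011 it is (3) with average 3. Both averages stay below the bound p + 1 = 5, and the smaller one equals the lower bound of 3 check marks in a row, so MAL = 3 for this example.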

[Figure: pipeline timing of a branch, the next inline instruction and the target instruction, showing condition evaluation (cond eval) and target address generation (target addr gen), with delay = 2 and delay = 5]

- the order of cond eval and target addr gen may be different
- cond eval may be done in the previous instruction

Improving Branch Performance
- Branch Elimination: replace branch with other instructions
- Branch Speed Up: reduce time for computing CC and TF
- Branch Prediction: guess the outcome and proceed, undo if necessary
- Branch Target Capture: make use of history