Course on Advanced Computer Architectures


Course on Advanced Computer Architectures
Politecnico di Milano, September 3rd, 2015
Prof. C. Silvano
SOLUTION

Surname:
Name:
POLIMI ID Number:
Signature:

Exercise / Question    Points
EX1A                   2
EX1B                   2
EX1C                   2
EX2                    5
EX3                    5
Subtotal               16
Q4                     5
Q5                     6
Q6                     5
TOTAL                  32

EXERCISE 1A PIPELINE BASIC (2 points)

The following program has been compiled into MIPS assembly code, assuming that registers $t6 and $t7 have been initialized with the values 0 and 4N respectively. The symbols VECTA and VECTB are 16-bit constants. The processor clock frequency is 1 GHz. Consider the loop executed by a 5-stage pipelined MIPS processor WITHOUT any optimization in the pipeline (PLEASE do not consider any inter-iteration dependencies).

1. Identify the RAW (Read After Write) hazards in the pipeline scheme and indicate the hazard type (data hazard or control hazard) in the last column.
2. Indicate in the first column the number of stalls to be inserted before each instruction (or between the IF and ID stages of each instruction) needed to resolve the hazards.

Stalls  Instruction             C1  C2  C3  C4  C5  C6  C7  C8  C9  C10 C11 C12 Hazard type
3       FOR: lw $t1,vecta($t6)  IF  ID  EX  ME  WB                              CNTR
             lw $t2,vectb($t6)      IF  ID  EX  ME  WB
3            add $t5,$t2,$t1            IF  ID  EX  ME  WB                      (RAW $t1) RAW $t2
3            lw $t3,0($t5)                  IF  ID  EX  ME  WB                  RAW $t5
3            or $t3,$t5,$t3                     IF  ID  EX  ME  WB              (RAW $t5) RAW $t3
3            sw $t3,0($t5)                          IF  ID  EX  ME  WB          (RAW $t5) RAW $t3
             addi $t6,$t6,4                             IF  ID  EX  ME  WB
3            blt $t6,$t7,FOR                                IF  ID  EX  ME  WB  RAW $t6

Note: blt $t6, $t7, FOR   # branch on less than

Express the formula, then calculate the following metrics:

Instruction Count per iteration (IC): IC = 8
Number of stalls per iteration: N_stall = 18
Asymptotic CPI (N cycles): CPI_AS = (IC + N_stall) / IC = (8 + 18) / 8 = 3.25
Asymptotic Throughput (expressed in MIPS) (N cycles): MIPS_AS = f_CLOCK / (CPI_AS * 10^6) = (1 * 10^9) / (3.25 * 10^6) = 10^3 / 3.25 = 307.7
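The hazard column of Exercise 1A can be cross-checked mechanically. The following is a minimal sketch, not part of the exam: the `LOOP` list is a hand transcription of the eight instructions, and the helper `raw_dependences` together with its `window` parameter are hypothetical names introduced here. The window of 3 is an assumption that mirrors the 3-stall distance used in the table for a pipeline without optimizations; inter-iteration dependencies are ignored, as the exercise requests.

```python
# Each entry: (text, destination register or None, source registers).
# Transcribed by hand from the loop of Exercise 1A (an assumption of this sketch).
LOOP = [
    ("lw $t1,vecta($t6)", "$t1", ["$t6"]),
    ("lw $t2,vectb($t6)", "$t2", ["$t6"]),
    ("add $t5,$t2,$t1",   "$t5", ["$t2", "$t1"]),
    ("lw $t3,0($t5)",     "$t3", ["$t5"]),
    ("or $t3,$t5,$t3",    "$t3", ["$t5", "$t3"]),
    ("sw $t3,0($t5)",     None,  ["$t3", "$t5"]),
    ("addi $t6,$t6,4",    "$t6", ["$t6"]),
    ("blt $t6,$t7,FOR",   None,  ["$t6", "$t7"]),
]


def raw_dependences(program, window=3):
    """Return (producer_index, consumer_index, register) triples.

    A triple is reported when the consumer reads a register whose most
    recent writer is at most `window` instructions earlier in the loop body.
    """
    found = []
    for j, (_, _, sources) in enumerate(program):
        for reg in sources:
            # Walk backwards to the most recent writer of `reg`.
            for i in range(j - 1, max(j - window, 0) - 1, -1):
                if program[i][1] == reg:
                    found.append((i, j, reg))
                    break
    return found


for producer, consumer, reg in raw_dependences(LOOP):
    print(f"RAW {reg}: {LOOP[producer][0]}  ->  {LOOP[consumer][0]}")
```

The sketch only lists register dependences; it does not decide which of them actually cost stalls (the parenthesized entries in the table are dependences already covered by another stall).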

EXERCISE 1B PIPELINE OPTIMIZATIONS with BRANCH PREDICTION (2 points)

Assume the following optimisations in the pipeline (PLEASE do not consider any inter-iteration dependencies):

- In the Register File it is possible to read and write at the same address in the same clock cycle;
- Forwarding ONLY FOR the EXE-EXE path
- Computation of PC and TARGET ADDRESS for branch & jump instructions anticipated in the ID stage
- Static branch prediction ALWAYS TAKEN with Branch Target Buffer

1. Identify the RAW (Read After Write) hazards in the pipeline scheme and indicate the hazard type (data hazard or control hazard) in the penultimate (one before last) column.
2. Indicate in the first column the number of stalls to be inserted before each instruction (or between the IF and ID stages of each instruction) needed to resolve the hazards.
3. Draw the pipeline scheme by inserting the stalls to resolve the given hazards and adding an ARROW to indicate the forwarding paths used.
4. Express the formula, then calculate the metrics listed below the table.

Stalls  Instruction             C1  C2  C3  C4  C5  C6  C7  C8  C9  C10 C11 C12 Hazard type          Forwarding path
        FOR: lw $t1,vecta($t6)  IF  ID  EX  ME  WB                              (CNTR solved by BP)
             lw $t2,vectb($t6)      IF  ID  EX  ME  WB
3            add $t5,$t2,$t1            IF  ID  EX  ME  WB                      (RAW $t1) RAW $t2
             lw $t3,0($t5)                  IF  ID  EX  ME  WB                  (RAW $t5)            EX-EX $t5
3            or $t3,$t5,$t3                     IF  ID  EX  ME  WB              (RAW $t5) RAW $t3
             sw $t3,0($t5)                          IF  ID  EX  ME  WB          (RAW $t3)            EX-EX $t3
             addi $t6,$t6,4                             IF  ID  EX  ME  WB
3            blt $t6,$t7,FOR                                IF  ID  EX  ME  WB  RAW $t6

Note: blt $t6, $t7, FOR   # branch on less than

Instruction Count per iteration (IC): IC = 8
Number of stalls per iteration: N_stall = 9
Asymptotic CPI (N cycles): CPI_AS = (IC + N_stall) / IC = (8 + 9) / 8 = 2.125
Asymptotic Throughput (expressed in MIPS) (N cycles): MIPS_AS = f_CLOCK / (CPI_AS * 10^6) = 10^9 / (2.125 * 10^6) = 10^3 / 2.125 = 470.6
Speedup with respect to EXERCISE 1A: Speedup = CPI_AS1A / CPI_AS1B = 3.25 / 2.125 = 1.53
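The metric formulas of Exercises 1A and 1B can be evaluated directly from the stall counts in the two tables (18 and 9). This is a small sketch for checking the arithmetic; the function names `asymptotic_cpi` and `mips` are illustrative and not taken from the exam.

```python
def asymptotic_cpi(ic, stalls):
    """CPI_AS = (IC + N_stall) / IC for one loop iteration."""
    return (ic + stalls) / ic


def mips(clock_hz, cpi):
    """Throughput in millions of instructions per second."""
    return clock_hz / (cpi * 1e6)


CLOCK_HZ = 1e9   # 1 GHz, as stated in Exercise 1A
IC = 8           # instructions per loop iteration

cpi_1a = asymptotic_cpi(IC, 18)   # stall count from the Exercise 1A table
cpi_1b = asymptotic_cpi(IC, 9)    # stall count from the Exercise 1B table
speedup = cpi_1a / cpi_1b         # same IC and clock, so the CPI ratio is the speedup

print(f"1A: CPI = {cpi_1a}, MIPS = {mips(CLOCK_HZ, cpi_1a):.1f}")
print(f"1B: CPI = {cpi_1b}, MIPS = {mips(CLOCK_HZ, cpi_1b):.1f}")
print(f"Speedup 1B vs 1A = {speedup:.2f}")
```

Because the instruction count and clock frequency are identical in both cases, the ratio of the two CPI values equals the ratio of execution times.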

EXERCISE 1C PIPELINE OPTIMIZATIONS with BRANCH PREDICTION (2 points)

Assume the following optimisations in the pipeline (PLEASE do not consider any inter-iteration dependencies):

- In the Register File it is possible to read and write at the same address in the same clock cycle;
- Forwarding
- Computation of PC and TARGET ADDRESS for branch & jump instructions anticipated in the ID stage
- Static branch prediction ALWAYS TAKEN with Branch Target Buffer

1. Identify the RAW (Read After Write) hazards in the pipeline scheme and indicate the hazard type (data hazard or control hazard) in the penultimate (one before last) column.
2. Indicate in the first column the number of stalls to be inserted before each instruction (or between the IF and ID stages of each instruction) needed to resolve the hazards.
3. Draw the pipeline scheme by inserting the stalls to resolve the given hazards and adding an ARROW to indicate the forwarding paths used.

Stalls  Instruction             C1  C2  C3  C4  C5  C6  C7  C8  C9  C10 C11 C12 C13 C14 C15 Hazard type          Forwarding path
        FOR: lw $t1,vecta($t6)  IF  ID  EX  ME  WB                                          (CNTR solved by BP)
             lw $t2,vectb($t6)      IF  ID  EX  ME  WB
1            add $t5,$t2,$t1            IF  S   ID  EX  ME  WB                              (RAW $t1) RAW $t2    MEM-ID $t1, MEM-EX $t2
             lw $t3,0($t5)                  S   IF  ID  EX  ME  WB                          (RAW $t5)            EX-EX $t5
1            or $t3,$t5,$t3                         IF  S   ID  EX  ME  WB                  (RAW $t5) RAW $t3    MEM-ID $t5, MEM-EX $t3
             sw $t3,0($t5)                              S   IF  ID  EX  ME  WB              (RAW $t3)            EX-EX $t3
             addi $t6,$t6,4                                     IF  ID  EX  ME  WB
1            blt $t6,$t7,FOR                                        IF  S   ID  EX  ME  WB  RAW $t6              EX-ID $t6

Note: blt $t6, $t7, FOR   # branch on less than

Express the formula, then calculate the following metrics:

Instruction Count per iteration (IC): IC = 8
Number of stalls per iteration: N_stall = 3
Asymptotic CPI (N cycles): CPI_AS = (IC + N_stall) / IC = (8 + 3) / 8 = 1.375
Asymptotic Throughput (expressed in MIPS) (N cycles): MIPS_AS = f_CLOCK / (CPI_AS * 10^6) = 10^9 / (1.375 * 10^6) = 10^3 / 1.375 = 727.3
Speedup with respect to EXERCISE 1A: Speedup = CPI_AS1A / CPI_AS1C = 3.25 / 1.375 = 2.36
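The pipeline scheme of Exercise 1C follows from the stall counts in the first column. The sketch below rebuilds the cycle-by-cycle layout under a simplifying assumption: each stall is drawn between IF and ID of the stalled instruction, and the next instruction is simply fetched later (the exam scheme marks that delayed fetch with an S instead). The names `timeline` and `render` are hypothetical helpers, not part of the exam.

```python
STAGES = ["IF", "ID", "EX", "ME", "WB"]


def timeline(stalls_before):
    """Return one dict per instruction mapping 1-based cycle -> stage label.

    stalls_before[i] is the number of stall cycles inserted between the
    IF and ID stages of instruction i.
    """
    rows = []
    fetch = 1
    for stalls in stalls_before:
        row = {fetch: "IF"}
        cycle = fetch + 1
        for _ in range(stalls):
            row[cycle] = "S"
            cycle += 1
        for stage in STAGES[1:]:
            row[cycle] = stage
            cycle += 1
        rows.append(row)
        # The next instruction enters IF once this one has left it.
        fetch = fetch + 1 + stalls
    return rows


def render(rows):
    """Format the rows as fixed-width text, one line per instruction."""
    width = max(max(row) for row in rows)
    return [
        "".join(f"{row.get(c, ''):<4}" for c in range(1, width + 1)).rstrip()
        for row in rows
    ]


# Stall counts from the first column of the Exercise 1C table.
rows_1c = timeline([0, 0, 1, 0, 1, 0, 0, 1])
for line in render(rows_1c):
    print(line)
```

With these stall counts the last instruction completes WB in cycle 15, which matches the C1 to C15 span of the table.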

EXERCISE 2 PIPELINE OPTIMIZATIONS for VLIW (5 points)

Consider the same program executed on a 2-issue MIPS VLIW (Very Long Instruction Word) architecture with static scheduling and the same optimisations as EX1C, including Static Branch Prediction ALWAYS TAKEN with Branch Target Buffer. Each issue consists of 1 ALU/BRANCH slot and 1 LOAD/STORE slot.

Complete the pipeline scheme by RESCHEDULING the program, inserting the NOPs needed to resolve the given hazards, and adding an ARROW to indicate the forwarding paths used:

Bundle  Slot  Instruction             C1  C2  C3  C4  C5  C6  C7  C8  C9  C10 C11 C12 C13 C14 C15 C16 Forwarding path
1       A/B   nop                     IF  ID  EX  ME  WB
1       L/S   FOR: lw $t1,vecta($t6)  IF  ID  EX  ME  WB
2       A/B   nop                         IF  ID  EX  ME  WB
2       L/S   lw $t2,vectb($t6)           IF  ID  EX  ME  WB
3       A/B   nop                             IF  ID  EX  ME  WB
3       L/S   nop                             IF  ID  EX  ME  WB
4       A/B   add $t5,$t2,$t1                     IF  ID  EX  ME  WB                                  MEM-EX $t2
4       L/S   lw $t3,0($t5)                       IF  ID  EX  ME  WB                                  EX-MEM $t5
5       A/B   addi $t6,$t6,4                          IF  ID  EX  ME  WB
5       L/S   nop                                     IF  ID  EX  ME  WB
6       A/B   or $t3,$t5,$t3                              IF  ID  EX  ME  WB                          MEM-EX $t5, MEM-EX $t3
6       L/S   nop                                         IF  ID  EX  ME  WB
7       A/B   blt $t6,$t7,FOR                                 IF  ID  EX  ME  WB                      EX-ID $t6
7       L/S   sw $t3,0($t5)                                   IF  ID  EX  ME  WB                      EX-EX $t3

Bundles 8 to 11 of the answer template are left empty.

Asymptotic CPI (N cycles): CPI_AS = (IC + # nops) / (2 * IC) = (8 + 6) / 16 = 0.875 (CPI_ideal = 0.5).
This can also be calculated as: CPI_AS = (# VLIW cycles) / IC = 7 / 8 = 0.875 (CPI_ideal = 0.5).
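The two CPI expressions of Exercise 2 can be checked against the schedule itself. In the sketch below, `SCHEDULE` is a hand transcription of the seven bundles (A/B slot first, L/S slot second); the variable names are illustrative only.

```python
# One tuple per VLIW bundle: (ALU/BRANCH slot, LOAD/STORE slot).
# Transcribed by hand from the Exercise 2 schedule (an assumption of this sketch).
SCHEDULE = [
    ("nop",             "lw $t1,vecta($t6)"),
    ("nop",             "lw $t2,vectb($t6)"),
    ("nop",             "nop"),
    ("add $t5,$t2,$t1", "lw $t3,0($t5)"),
    ("addi $t6,$t6,4",  "nop"),
    ("or $t3,$t5,$t3",  "nop"),
    ("blt $t6,$t7,FOR", "sw $t3,0($t5)"),
]

bundles = len(SCHEDULE)
slots = 2 * bundles
nops = sum(op == "nop" for bundle in SCHEDULE for op in bundle)
ic = slots - nops                        # useful instructions per iteration

cpi_by_slots = (ic + nops) / (2 * ic)    # (IC + # nops) / (2 * IC)
cpi_by_bundles = bundles / ic            # (# VLIW cycles) / IC

print(f"bundles={bundles}, nops={nops}, IC={ic}")
print(f"CPI via slots   = {cpi_by_slots}")
print(f"CPI via bundles = {cpi_by_bundles}")
```

Both expressions agree because IC + # nops is exactly twice the number of bundles on a 2-issue machine.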

EXERCISE 3 - TOMASULO (5 points)

Consider the program in the table executed on a CPU with dynamic scheduling based on the TOMASULO algorithm with:

- 2 RESERVATION STATIONS (RS1, RS2) + 2 LOAD/STORE units (LDU1, LDU2) with latency 4
- 2 RESERVATION STATIONS (RS3, RS4) + 2 ALU/BR FUs (ALU1, ALU2) with latency 2
- Check STRUCTURAL hazards for RS in the ISSUE phase
- Check RAW hazards and STRUCTURAL hazards for FUs in the START EXECUTE phase
- WRITE RESULT in RESERVATION STATIONS and RF
- We assume 1 CDB for RF
- Static Branch Prediction ALWAYS TAKEN with Branch Target Buffer

1. Complete the TOMASULO TABLE assuming all cache HITS and considering ONE ITERATION:

INSTRUCTION             Prediction  ISSUE  START  WRITE   Hazard type               RSi  UNIT
                        T/NT               EXEC   RESULT
FOR: lw $t1,vecta($t6)              1      2      6       (CNTR solved by BP)       RS1  LDU1
     lw $t2,vectb($t6)              2      3      7                                 RS2  LDU2
     add $t5,$t2,$t1                3      8      10      (RAW $t1) RAW $t2         RS3  ALU1
     lw $t3,0($t5)                  7      11     15      STRUCT RS1, RAW $t5       RS1  LDU1
     or $t3,$t5,$t3                 8      16     18      (RAW $t5) RAW $t3         RS4  ALU2
     sw $t3,0($t5)                  9      19     23      RAW $t3                   RS2  LDU2
     addi $t6,$t6,4                 11     12     14      STRUCT RS3                RS3  ALU1
     blt $t6,$t7,FOR    T           15     16     18*     STRUCT RS3, (RAW $t6)     RS3  ALU2

(*) blt does not write on the CDB.

2. Express the formula, then calculate the following metrics:

CPI = (# clock cycles) / IC = 23 / 8 = 2.875
IPC = 1 / CPI = 1 / 2.875 = 0.35
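The Tomasulo table can be sanity-checked with a few lines of code. The sketch below assumes that WRITE RESULT occurs exactly `latency` cycles after START EXEC (4 for the load/store units, 2 for the ALU/BR units), which is how the numbers in the table appear to be derived; `TABLE`, `latency_mismatches` and the `uses_cdb` flag are names introduced here for illustration.

```python
LATENCY = {"LDU": 4, "ALU": 2}   # unit latencies stated in Exercise 3

# (instruction, issue, start_exec, write_result, unit, uses_cdb)
# Transcribed by hand from the Exercise 3 table; blt does not write on the CDB.
TABLE = [
    ("lw $t1,vecta($t6)", 1,  2,  6,  "LDU1", True),
    ("lw $t2,vectb($t6)", 2,  3,  7,  "LDU2", True),
    ("add $t5,$t2,$t1",   3,  8,  10, "ALU1", True),
    ("lw $t3,0($t5)",     7,  11, 15, "LDU1", True),
    ("or $t3,$t5,$t3",    8,  16, 18, "ALU2", True),
    ("sw $t3,0($t5)",     9,  19, 23, "LDU2", True),
    ("addi $t6,$t6,4",    11, 12, 14, "ALU1", True),
    ("blt $t6,$t7,FOR",   15, 16, 18, "ALU2", False),
]


def latency_mismatches(table):
    """Return the rows whose write cycle is not start + unit latency."""
    return [
        row for row in table
        if row[3] != row[2] + LATENCY[row[4][:3]]
    ]


issues = [row[1] for row in TABLE]
cdb_writes = [row[3] for row in TABLE if row[5]]
total_cycles = max(row[3] for row in TABLE)
cpi = total_cycles / len(TABLE)
ipc = 1 / cpi

print("latency mismatches:", latency_mismatches(TABLE))
print("in-order issue:", issues == sorted(issues))
print("one CDB write per cycle:", len(cdb_writes) == len(set(cdb_writes)))
print(f"CPI = {cpi}, IPC = {ipc:.2f}")
```

The check covers only timing consistency (latency, in-order issue, a single CDB); it does not re-derive the issue or start cycles from the hazards.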

Course on Advanced Computer Architectures, Prof. C. Silvano, EXAM 03/09/2015
Please write in CAPITAL LETTERS AND BLACK/BLUE COLORS!

QUESTION 4: BRANCH PREDICTION (5 points)

Describe the main STATIC BRANCH PREDICTION techniques.

QUESTION 5: MULTIPROCESSORS (6 points)

Explain the main problem of Cache Coherency in Multiprocessors.
Explain the Cache Coherency Protocol used for Single-Bus Multiprocessors.

QUESTION 6: VLIW PROCESSORS (5 points)

Describe the main concepts of static scheduling for VLIW (Very Long Instruction Word) processors.
Draw the structure of a 4-issue VLIW processor architecture with a 4-stage pipeline (IF-ID // EXE // ME // WB).