Course on Advanced Computer Architectures

Politecnico di Milano, September 3rd, 2015
Prof. C. Silvano
SOLUTION

Surname:
Name:
POLIMI ID Number:
Signature:

EX1A      (2 points)
EX1B      (2 points)
EX1C      (2 points)
EX2       (5 points)
EX3       (5 points)
Subtotal  (16 points)
Q4        (5 points)
Q5        (6 points)
Q6        (5 points)
TOTAL     (32 points)

EXERCISE 1A: PIPELINE BASIC (2 points)

The following program has been compiled into MIPS assembly code, assuming that registers $t6 and $t7 have been initialized with the values 0 and 4N respectively. The symbols VECTA and VECTB are 16-bit constants. The processor clock frequency is 1 GHz.

Consider the loop executed by a 5-stage pipelined MIPS processor WITHOUT any optimization in the pipeline (please do not consider any inter-iteration dependencies).

1. Identify the RAW (Read After Write) hazards in the pipeline scheme and give the hazard type (Data Hazard or Control Hazard) in the last column.
2. Give in the first column the number of stalls to be inserted before each instruction (or between the IF and ID stages of each instruction) to solve the hazards.

Num. Stalls  INSTRUCTION                 Stages          Hazard Type
3            FOR: lw   $t1,vecta($t6)    IF ID EX ME WB  CNTR
-                 lw   $t2,vectb($t6)    IF ID EX ME WB  -
3                 add  $t5,$t2,$t1       IF ID EX ME WB  (RAW $t1) RAW $t2
3                 lw   $t3,0($t5)        IF ID EX ME WB  RAW $t5
3                 or   $t3,$t5,$t3       IF ID EX ME WB  (RAW $t5) RAW $t3
3                 sw   $t3,0($t5)        IF ID EX ME WB  (RAW $t5) RAW $t3
-                 addi $t6,$t6,4         IF ID EX ME WB  -
3                 blt  $t6,$t7,FOR       IF ID EX ME WB  RAW $t6

Notes: CNTR = control hazard. blt $t6,$t7,FOR  # branch on less than

Express the formula, then calculate the following metrics:

Instruction Count per iteration: IC = 8
Number of stalls per iteration: N_stall = 18
Asymptotic CPI (N cycles): CPI_AS = (IC + N_stall) / IC = (8 + 18) / 8 = 3.25
Asymptotic throughput in MIPS (N cycles): MIPS_AS = f_CLOCK / (CPI_AS * 10^6) = 10^9 / (3.25 * 10^6) = 10^3 / 3.25 = 307.7
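The two metric formulas of this exercise can be checked numerically. What follows is a minimal Python sketch, not part of the original solution; the helper names asymptotic_cpi and mips_throughput are hypothetical and introduced here only for illustration.

```python
def asymptotic_cpi(ic, stalls):
    """Asymptotic CPI of a loop body: (IC + stalls) / IC per iteration."""
    return (ic + stalls) / ic


def mips_throughput(clock_hz, cpi):
    """Throughput in millions of instructions per second."""
    return clock_hz / (cpi * 1e6)


# Values taken from Exercise 1A: 8 instructions, 18 stalls, 1 GHz clock.
cpi_1a = asymptotic_cpi(ic=8, stalls=18)
mips_1a = mips_throughput(clock_hz=1e9, cpi=cpi_1a)
print(cpi_1a, round(mips_1a, 1))  # 3.25 307.7
```

The same two helpers apply to the later exercises by changing only the stall count.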

EXERCISE 1B: PIPELINE OPTIMIZATIONS with BRANCH PREDICTION (2 points)

Assume the following optimisations in the pipeline (please do not consider any inter-iteration dependencies):
- In the Register File it is possible to read and write the same address in the same clock cycle;
- Forwarding ONLY FOR the EXE-EXE path;
- Computation of PC and TARGET ADDRESS for branch and jump instructions anticipated in the ID stage;
- Static branch prediction ALWAYS TAKEN with Branch Target Buffer.

1. Identify the RAW (Read After Write) hazards in the pipeline scheme and give the hazard type (Data Hazard or Control Hazard) in the penultimate column.
2. Give in the first column the number of stalls to be inserted before each instruction (or between the IF and ID stages of each instruction) to solve the hazards.
3. Draw the pipeline scheme by inserting the stalls that solve the given hazards and adding an ARROW to indicate the forwarding paths used.

Num. Stalls  INSTRUCTION                 Stages          Hazard Type           Forwarding Path
-            FOR: lw   $t1,vecta($t6)    IF ID EX ME WB  (CNTR solved by BP)   -
-                 lw   $t2,vectb($t6)    IF ID EX ME WB  -                     -
3                 add  $t5,$t2,$t1       IF ID EX ME WB  (RAW $t1) RAW $t2     -
-                 lw   $t3,0($t5)        IF ID EX ME WB  (RAW $t5)             EX-EX $t5
3                 or   $t3,$t5,$t3       IF ID EX ME WB  (RAW $t5) RAW $t3     -
-                 sw   $t3,0($t5)        IF ID EX ME WB  (RAW $t3)             EX-EX $t3
-                 addi $t6,$t6,4         IF ID EX ME WB  -                     -
3                 blt  $t6,$t7,FOR       IF ID EX ME WB  RAW $t6               -

Note: blt $t6,$t7,FOR  # branch on less than

4. Express the formula, then calculate the following metrics:

Instruction Count per iteration: IC = 8
Number of stalls per iteration: N_stall = 9
Asymptotic CPI (N cycles): CPI_AS = (IC + N_stall) / IC = (8 + 9) / 8 = 2.125
Asymptotic throughput in MIPS (N cycles): MIPS_AS = f_CLOCK / (CPI_AS * 10^6) = 10^9 / (2.125 * 10^6) = 10^3 / 2.125 = 470.6 (about 471)

Speedup with respect to EXERCISE 1A: Speedup = CPI_AS1A / CPI_AS1B = 3.25 / 2.125 = 1.53
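The stall totals and the speedup can be cross-checked by summing the per-instruction stall counts listed in the first column of the two tables. The sketch below is illustrative only; the list names stalls_1a and stalls_1b and the helper cpi are assumptions made here, with the stall values copied from the tables of Exercises 1A and 1B in program order.

```python
# Stalls inserted before each of the 8 instructions, in program order:
# lw, lw, add, lw, or, sw, addi, blt.
stalls_1a = [3, 0, 3, 3, 3, 3, 0, 3]  # no optimizations
stalls_1b = [0, 0, 3, 0, 3, 0, 0, 3]  # EX-EX forwarding + branch prediction


def cpi(stalls):
    """Asymptotic CPI: (IC + total stalls) / IC."""
    ic = len(stalls)
    return (ic + sum(stalls)) / ic


speedup = cpi(stalls_1a) / cpi(stalls_1b)
print(sum(stalls_1b), cpi(stalls_1b), round(speedup, 2))  # 9 2.125 1.53
```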

EXERCISE 1C: PIPELINE OPTIMIZATIONS with BRANCH PREDICTION (2 points)

Assume the following optimisations in the pipeline (please do not consider any inter-iteration dependencies):
- In the Register File it is possible to read and write the same address in the same clock cycle;
- Forwarding;
- Computation of PC and TARGET ADDRESS for branch and jump instructions anticipated in the ID stage;
- Static branch prediction ALWAYS TAKEN with Branch Target Buffer.

1. Identify the RAW (Read After Write) hazards in the pipeline scheme and give the hazard type (Data Hazard or Control Hazard) in the penultimate column.
2. Give in the first column the number of stalls to be inserted before each instruction (or between the IF and ID stages of each instruction) to solve the hazards.
3. Draw the pipeline scheme by inserting the stalls that solve the given hazards and adding an ARROW to indicate the forwarding paths used.

In the Stages column, S marks a stall cycle.

Num. Stalls  INSTRUCTION                 Stages             Hazard Type           Forwarding Path
-            FOR: lw   $t1,vecta($t6)    IF ID EX ME WB     (CNTR solved by BP)   -
-                 lw   $t2,vectb($t6)    IF ID EX ME WB     -                     -
1                 add  $t5,$t2,$t1       IF S ID EX ME WB   (RAW $t1) RAW $t2     MEM-ID $t1, MEM-EX $t2
-                 lw   $t3,0($t5)        S IF ID EX ME WB   (RAW $t5)             EX-EX $t5
1                 or   $t3,$t5,$t3       IF S ID EX ME WB   (RAW $t5) RAW $t3     MEM-ID $t5, MEM-EX $t3
-                 sw   $t3,0($t5)        S IF ID EX ME WB   (RAW $t3)             EX-EX $t3
-                 addi $t6,$t6,4         IF ID EX ME WB     -                     -
1                 blt  $t6,$t7,FOR       IF S ID EX ME WB   RAW $t6               EX-ID $t6

Note: blt $t6,$t7,FOR  # branch on less than

4. Express the formula, then calculate the following metrics:

Instruction Count per iteration: IC = 8
Number of stalls per iteration: N_stall = 3
Asymptotic CPI (N cycles): CPI_AS = (IC + N_stall) / IC = (8 + 3) / 8 = 1.375
Asymptotic throughput in MIPS (N cycles): MIPS_AS = f_CLOCK / (CPI_AS * 10^6) = 10^9 / (1.375 * 10^6) = 10^3 / 1.375 = 727.3

Speedup with respect to EXERCISE 1A: Speedup = CPI_AS1A / CPI_AS1C = 3.25 / 1.375 = 2.36
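The stall pattern of this exercise (IF, one stall, then ID for the stalled instruction, with the next fetch delayed by the same amount) can be turned into cycle numbers mechanically. The sketch below assumes a simple in-order model in which each instruction is fetched in the cycle its predecessor is decoded and stalls are inserted between IF and ID; the function name schedule is hypothetical and not from the exam.

```python
def schedule(stalls):
    """Return (IF, ID, EX, ME, WB) cycle numbers for each instruction.

    Assumed model: instruction i is fetched in the cycle in which
    instruction i-1 is decoded, and stalls[i] stall cycles are inserted
    between its IF and ID stages.
    """
    rows = []
    prev_id = 1  # chosen so that the first instruction has IF in cycle 1
    for s in stalls:
        fetch = prev_id
        decode = prev_id + 1 + s
        rows.append((fetch, decode, decode + 1, decode + 2, decode + 3))
        prev_id = decode
    return rows


# Stalls of Exercise 1C in program order: lw, lw, add, lw, or, sw, addi, blt.
rows_1c = schedule([0, 0, 1, 0, 1, 0, 0, 1])
for row in rows_1c:
    print(row)
```

Under this model the add is decoded in cycle 5 and the final blt reaches WB in cycle 15, which matches the C1 to C15 width of the pipeline scheme.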

EXERCISE 2: PIPELINE OPTIMIZATIONS for VLIW (5 points)

Consider the same program executed on a 2-issue MIPS VLIW (Very Long Instruction Word) architecture with static scheduling and the same optimisations as EX1C, including Static Branch Prediction ALWAYS TAKEN with Branch Target Buffer. Each long instruction has two slots: 1 ALU/BRANCH (A/B) and 1 LOAD/STORE (L/S).

Complete the pipeline scheme by RESCHEDULING the program, inserting the NOPs needed to solve the given hazards, and adding an ARROW to indicate the forwarding paths used.

VLIW instr.  Slot  INSTRUCTION                 Stages          Forwarding Path
1            A/B   nop                         IF ID EX ME WB  -
1            L/S   FOR: lw   $t1,vecta($t6)    IF ID EX ME WB  -
2            A/B   nop                         IF ID EX ME WB  -
2            L/S   lw   $t2,vectb($t6)         IF ID EX ME WB  -
3            A/B   nop                         IF ID EX ME WB  -
3            L/S   nop                         IF ID EX ME WB  -
4            A/B   add  $t5,$t2,$t1            IF ID EX ME WB  MEM-EX $t2
4            L/S   lw   $t3,0($t5)             IF ID EX ME WB  EX-MEM $t5
5            A/B   addi $t6,$t6,4              IF ID EX ME WB  -
5            L/S   nop                         IF ID EX ME WB  -
6            A/B   or   $t3,$t5,$t3            IF ID EX ME WB  MEM-EX $t5, MEM-EX $t3
6            L/S   nop                         IF ID EX ME WB  -
7            A/B   blt  $t6,$t7,FOR            IF ID EX ME WB  EX-ID $t6
7            L/S   sw   $t3,0($t5)             IF ID EX ME WB  EX-EX $t3

Rows 8 to 11 of the scheme are left empty.

Asymptotic CPI (N cycles): CPI_AS = (IC + N_nops) / (2 * IC) = (8 + 6) / 16 = 0.875 (CPI_ideal = 0.5).
This can also be calculated as: CPI_AS = (number of VLIW cycles) / IC = 7 / 8 = 0.875 (CPI_ideal = 0.5).
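Both CPI formulas of this exercise can be derived from the schedule itself by counting the filled and empty slots. The sketch below is an illustration under stated assumptions: the bundles list is a hypothetical encoding of the seven long instructions from the table, each as an (A/B slot, L/S slot) pair, with None standing for a nop.

```python
NOP = None  # marker for an empty slot

# (ALU/BRANCH slot, LOAD/STORE slot) for each of the 7 long instructions.
bundles = [
    (NOP, "lw $t1,vecta($t6)"),
    (NOP, "lw $t2,vectb($t6)"),
    (NOP, NOP),
    ("add $t5,$t2,$t1", "lw $t3,0($t5)"),
    ("addi $t6,$t6,4", NOP),
    ("or $t3,$t5,$t3", NOP),
    ("blt $t6,$t7,FOR", "sw $t3,0($t5)"),
]

slots = [op for bundle in bundles for op in bundle]
ic = sum(op is not NOP for op in slots)   # useful instructions
nops = sum(op is NOP for op in slots)     # empty slots

cpi_from_slots = (ic + nops) / (2 * ic)   # (IC + nops) / (2 * IC)
cpi_from_cycles = len(bundles) / ic       # VLIW cycles / IC
print(ic, nops, cpi_from_slots, cpi_from_cycles)  # 8 6 0.875 0.875
```

The two formulas agree because a 2-issue machine always has exactly 2 slots per cycle, so IC + nops equals twice the number of VLIW cycles.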

EXERCISE 3: TOMASULO (5 points)

Consider the program in the table executed on a CPU with dynamic scheduling based on the TOMASULO algorithm with:
- 2 RESERVATION STATIONS (RS1, RS2) + 2 LOAD/STORE units (LDU1, LDU2) with latency 4
- 2 RESERVATION STATIONS (RS3, RS4) + 2 ALU/BR FUs (ALU1, ALU2) with latency 2
- Check STRUCTURAL hazards for RS in the ISSUE phase
- Check RAW hazards and STRUCTURAL hazards for FUs in the START EXECUTE phase
- WRITE RESULT in RESERVATION STATIONS and RF
- We assume 1 CDB for RF
- Static Branch Prediction ALWAYS TAKEN with Branch Target Buffer

1. Complete the TOMASULO TABLE assuming all cache HITS and considering ONE ITERATION:

INSTRUCTION                 Prediction  ISSUE  START  WRITE   Hazards Type             RSi  UNIT
                            T/NT               EXEC   RESULT
FOR: lw   $t1,vecta($t6)    -           1      2      6       (CNTR solved by BP)      RS1  LDU1
     lw   $t2,vectb($t6)    -           2      3      7       -                        RS2  LDU2
     add  $t5,$t2,$t1       -           3      8      10      (RAW $t1) RAW $t2        RS3  ALU1
     lw   $t3,0($t5)        -           7      11     15      STRUCT RS1, RAW $t5      RS1  LDU1
     or   $t3,$t5,$t3       -           8      16     18      (RAW $t5) RAW $t3        RS4  ALU2
     sw   $t3,0($t5)        -           9      19     23      RAW $t3                  RS2  LDU2
     addi $t6,$t6,4         -           11     12     14      STRUCT RS3               RS3  ALU1
     blt  $t6,$t7,FOR       T           15     16     18*     STRUCT RS3, (RAW $t6)    RS3  ALU2

(*) blt does not write on the CDB.

2. Express the formula, then calculate the following metrics:

CPI = (number of clock cycles) / IC = 23 / 8 = 2.875
IPC = 1 / CPI = 1 / 2.875 = 0.35
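The timing table can be sanity-checked against the stated latencies before computing CPI and IPC. The following is a minimal sketch, not a Tomasulo simulator: it only verifies that each row satisfies WRITE RESULT = START EXEC + latency and that instructions issue in program order. The row encoding and the names LATENCY and rows are assumptions introduced here; the numbers are copied from the table.

```python
LATENCY = {"LDU": 4, "ALU": 2}  # cycles, from the exercise statement

# (instruction, unit class, issue, start exec, write result)
rows = [
    ("lw $t1",   "LDU", 1,  2,  6),
    ("lw $t2",   "LDU", 2,  3,  7),
    ("add $t5",  "ALU", 3,  8,  10),
    ("lw $t3",   "LDU", 7,  11, 15),
    ("or $t3",   "ALU", 8,  16, 18),
    ("sw $t3",   "LDU", 9,  19, 23),
    ("addi $t6", "ALU", 11, 12, 14),
    ("blt",      "ALU", 15, 16, 18),
]

prev_issue = 0
for name, unit, issue, start, write in rows:
    assert issue > prev_issue, name                 # in-order issue
    assert start > issue, name                      # execute after issue
    assert write - start == LATENCY[unit], name     # unit latency respected
    prev_issue = issue

total_cycles = max(write for *_, write in rows)
cpi = total_cycles / len(rows)
ipc = 1 / cpi
print(total_cycles, cpi, round(ipc, 2))  # 23 2.875 0.35
```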

Course on Advanced Computer Architectures, Prof. C. Silvano, EXAM 03/09/2015
Please write in CAPITAL LETTERS AND BLACK/BLUE COLORS!!!

QUESTION 4: BRANCH PREDICTION (5 points)

Describe the main STATIC BRANCH PREDICTION techniques.

QUESTION 5: MULTIPROCESSORS (6 points)

Explain the main problem of Cache Coherency in Multiprocessors.

Explain the Cache Coherency Protocol used for Single-Bus Multiprocessors.

QUESTION 6: VLIW PROCESSORS (5 points)

Describe the main concepts of static scheduling for VLIW (Very Long Instruction Word) processors.

Draw the structure of a 4-issue VLIW processor architecture with a 4-stage pipeline (IF-ID // EXE // ME // WB).