Chapter 4 Lecture 5 The Microarchitecture Level Integer JAVA Virtual Machine

This is a limited version of a hardware implementation that executes the JAVA programming language.

Reference: Structured Computer Organization, 5th ed., Prentice Hall, Upper Saddle River, NJ, 2006. ISBN 0-13-148521-0.

Review: IJVM Programmer Model and Organization

Method Area
    External Program Memory containing the IJVM program to be executed. The internal PC is a pointer into the Method Area to the next instruction to be executed. The Method Area is accessed using the PC, with 8-bit instructions and parameters loaded into the MBR.

Constant Pool
    External Data Memory containing compiled constants (e.g., values, strings, pointers) required by the methods to execute programs.

Local Variable Frame
    External Data Memory allocated for the storage of variables for the method being executed. When a method is invoked, a local variable frame of predefined size is established (allocated) for all local variables used by the method.

Operand Stack
    External Data Memory used for the storage of temporary operands that are placed on the stack. The maximum size of the Operand Stack is known when a method is invoked.

Notes on the IJVM theory of operation:
    Data memory is accessed using an indexed offset from known internal register pointers (i.e., LV for local variables, CPP for the constant pool).
    The local variable frame includes both the local variables and the operand stack. Since the maximum size of the stack is known, the maximum size of the local variable frame is also known.
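The indexed addressing described in the notes above can be sketched in a few lines of Python. This is only a toy model: the class name DataMemory, its field names, and the dictionary used as memory are assumptions made for illustration, not part of the IJVM definition.

```python
from dataclasses import dataclass, field


@dataclass
class DataMemory:
    """Toy word-addressed data memory with the three IJVM pointer registers."""
    words: dict = field(default_factory=dict)  # address -> 32-bit word (assumed model)
    lv: int = 0    # base of the current local variable frame
    cpp: int = 0   # base of the constant pool
    sp: int = 0    # address of the current top of stack

    def local(self, index: int) -> int:
        # Local variables are reached as an offset from LV.
        return self.words[self.lv + index]

    def constant(self, index: int) -> int:
        # Constants are reached as an offset from CPP.
        return self.words[self.cpp + index]

    def push(self, value: int) -> None:
        # The operand stack grows upward from SP.
        self.sp += 1
        self.words[self.sp] = value
```

For example, with LV = 0x200 and a value stored at address 0x205, local(5) returns that value, and push() stores it one word above the old SP.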

IJVM Microinstruction 1 (MIC-1) Architecture Language: iload, wide iload, and ldc_w

These instructions use common microinstructions for pushing data onto the stack.

iload
Label           Operations                                Comments
main1           PC = PC + 1; fetch; goto (MBR)            MBR holds opcode; get next byte; dispatch
iload1          H = LV                                    MBR contains index; copy LV to H
iload2          MAR = MBRU + H; rd                        MAR = address of local variable to push
iload3          MAR = SP = SP + 1                         SP points to new top of stack; prepare write
iload4          PC = PC + 1; fetch; wr                    Inc PC; get next opcode; write top of stack
iload5          TOS = MDR; goto main1                     Update TOS

wide iload
Label           Operations                                Comments
main1           PC = PC + 1; fetch; goto (MBR)            MBR holds opcode; get next byte; dispatch
wide1           PC = PC + 1; fetch; goto (MBR OR 0x100)   Multiway branch with high bit set
wide_iload1     PC = PC + 1; fetch                        MBR contains 1st index byte; fetch 2nd
wide_iload2     H = MBRU << 8                             H = 1st index byte shifted left 8 bits
wide_iload3     H = MBRU OR H                             H = 16-bit index of local variable
wide_iload4     MAR = LV + H; rd; goto iload3             MAR = address of local variable to push

ldc_w (always carries a 16-bit index, so main1 dispatches to ldc_w1 directly, without wide1)
Label           Operations                                Comments
main1           PC = PC + 1; fetch; goto (MBR)            MBR holds opcode; get next byte; dispatch
ldc_w1          PC = PC + 1; fetch                        MBR contains 1st index byte; fetch 2nd
ldc_w2          H = MBRU << 8                             H = 1st index byte << 8
ldc_w3          H = MBRU OR H                             H = 16-bit index into constant pool
ldc_w4          MAR = H + CPP; rd; goto iload3            MAR = address of constant in pool

Shared tail (reached by all three sequences)
Label           Operations                                Comments
iload3          MAR = SP = SP + 1                         SP points to new top of stack; prepare write
iload4          PC = PC + 1; fetch; wr                    Inc PC; get next opcode; write top of stack
iload5          TOS = MDR; goto main1                     Update TOS

ILOAD (0x15)

Pushes a variable from the Local Variable Memory Space onto the stack.

JAVA: ILOAD varnum
MAL:  main1 and iload1 (0x15) to iload5

Label           Operations                                Comments
main1           PC = PC + 1; fetch; goto (MBR)            MBR holds opcode; get next byte; dispatch
iload1          H = LV                                    MBR contains index; copy LV to H
iload2          MAR = MBRU + H; rd                        MAR = address of local variable to push
iload3          MAR = SP = SP + 1                         SP points to new top of stack; prepare write
iload4          PC = PC + 1; fetch; wr                    Inc PC; get next opcode; write top of stack
iload5          TOS = MDR; goto main1                     Update TOS

Similar instructions: ISTORE
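The ILOAD sequence can be followed register by register with a small Python sketch. It is a simplified model under stated assumptions: memory reads and fetches complete immediately (the real Mic-1 delivers them a cycle later), and the function name run_iload and the dictionary-based registers and memories are hypothetical.

```python
def run_iload(regs, program, data):
    """Apply main1 and iload1..iload5 to regs (toy model, no cycle timing)."""
    # main1: dispatch on the opcode already in MBR; PC = PC + 1; fetch index byte
    assert regs["MBR"] == 0x15, "expects the ILOAD opcode in MBR"
    regs["PC"] += 1
    regs["MBR"] = program[regs["PC"]]
    # iload1: H = LV
    regs["H"] = regs["LV"]
    # iload2: MAR = MBRU + H; rd
    regs["MAR"] = (regs["MBR"] & 0xFF) + regs["H"]
    regs["MDR"] = data[regs["MAR"]]
    # iload3: MAR = SP = SP + 1
    regs["SP"] += 1
    regs["MAR"] = regs["SP"]
    # iload4: PC = PC + 1; fetch; wr
    regs["PC"] += 1
    regs["MBR"] = program[regs["PC"]]
    data[regs["MAR"]] = regs["MDR"]
    # iload5: TOS = MDR
    regs["TOS"] = regs["MDR"]
    return regs


# Example: ILOAD 2 followed by some next opcode (0x60), with local 2 holding 42.
example_program = {0: 0x15, 1: 2, 2: 0x60}
example_data = {0x102: 42}
example_regs = run_iload(
    {"PC": 0, "MBR": 0x15, "LV": 0x100, "SP": 0x110},
    example_program,
    example_data,
)
```

After the call, TOS and the new top-of-stack word both hold the local variable, and MBR already holds the next opcode for main1 to dispatch.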

Executing the JAVA ILOAD Instruction in Mic-1

Cycle  MPC      Microinstruction                  Next MPC
0      main1    PC = PC+1; fetch; goto (MBR)      goto (iload1)
1      iload1   H = LV                            iload2
2      iload2   MAR = MBRU + H; rd                iload3
3      iload3   MAR = SP = SP+1                   iload4
4      iload4   PC = PC+1; fetch; wr              iload5
5      iload5   TOS = MDR; goto main1             main1
6      main1    PC = PC+1; fetch; goto (MBR)      goto (MBR)

Register values traced in the original table (the per-cycle column alignment was lost):
PC:          PC, then PC+1 with fetch, then PC+2 with fetch
MBR:         iload1 (opcode), then VARNM, then INST (next opcode)
MAR, rd/wr:  LV+VARNM with rd, then SP+1 with wr
MDR:         PARAM
SP:          SP, then SP+1
TOS (@SP):   PARAM
H:           LV
LV, CPP:     no entries

WIDE ILOAD (0xC4 0x15)

Pushes a variable from the Local Variable Memory Space onto the stack, using a 16-bit index.

JAVA: WIDE ILOAD varnum1 varnum2
MAL:  main1, wide1, wide_iload1 (0x115) to wide_iload4, and iload3 to iload5

Note: wide requires that the instruction path from the instruction memory to the MPC use a gated latch instead of super-synchronous operation. The MBR register, with its output to the B-bus, is still a super-synchronous multiplexed clocked register. See the Chapter 4-3 notes for more detail.

Label           Operations                                Comments
main1           PC = PC + 1; fetch; goto (MBR)            MBR holds opcode; get next byte; dispatch
wide1           PC = PC + 1; fetch; goto (MBR OR 0x100)   Multiway branch with high bit set
wide_iload1     PC = PC + 1; fetch                        MBR contains 1st index byte; fetch 2nd
wide_iload2     H = MBRU << 8                             H = 1st index byte shifted left 8 bits
wide_iload3     H = MBRU OR H                             H = 16-bit index of local variable
wide_iload4     MAR = LV + H; rd; goto iload3             MAR = address of local variable to push
iload3          MAR = SP = SP + 1                         SP points to new top of stack; prepare write
iload4          PC = PC + 1; fetch; wr                    Inc PC; get next opcode; write top of stack
iload5          TOS = MDR; goto main1                     Update TOS

Similar instructions: WIDE ISTORE. LDC_W index1 index2 builds its 16-bit index the same way but does not take the WIDE prefix.

Reference: Structured Computer Organization, 4th ed., Prentice Hall, Upper Saddle River, NJ, 1999. ISBN 0-13-095990-1.
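Two pieces of arithmetic carry the WIDE sequence: the dispatch address formed by wide1 and the 16-bit index assembled in wide_iload2 and wide_iload3. A minimal Python sketch follows; the function names are hypothetical.

```python
def wide_dispatch_address(opcode: int) -> int:
    # wide1: goto (MBR OR 0x100) sets the high bit of the 9-bit microaddress.
    return (opcode & 0xFF) | 0x100


def wide_index(first_byte: int, second_byte: int) -> int:
    # wide_iload2: H = MBRU << 8, then wide_iload3: H = MBRU OR H.
    h = (first_byte & 0xFF) << 8
    return h | (second_byte & 0xFF)


def wide_iload_address(lv: int, first_byte: int, second_byte: int) -> int:
    # wide_iload4: MAR = LV + H
    return lv + wide_index(first_byte, second_byte)
```

With the ILOAD opcode 0x15, the dispatch address comes out as 0x115, which matches the microaddress given for wide_iload1 above.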

Executing the JAVA WIDE ILOAD Instruction in Mic-1

Cycle  MPC           Microinstruction                          Next MPC
0      main1         PC = PC+1; fetch; goto (MBR)              goto (wide1)
1      wide1         PC = PC+1; fetch; goto (MBR OR 0x100)     (0x100 OR JAMC)
2      wide_iload1   PC = PC+1; fetch                          wide_iload2
3      wide_iload2   H = MBRU << 8                             wide_iload3
4      wide_iload3   H = MBRU OR H                             wide_iload4
5      wide_iload4   MAR = LV + H; rd; goto iload3             iload3
6      iload3        MAR = SP = SP+1                           iload4

Register values traced in the original table (the per-cycle column alignment was lost):
PC:          incremented, with a fetch, in cycles 0, 1, and 2
MBR:         wide1, then iload1, then VAR1, then VAR2
H:           VAR1 << 8, then (VAR1 << 8) OR VAR2
MAR, rd/wr:  LV+H with rd
SP:          SP
MDR, LV, CPP, TOS (@SP): no entries

iload1 is forwarded to the MPC by using fetch as a gated path. MBR represents a clocked path to the B-bus.

Invoking Methods and Returning: processes involved in subroutine/object calls

For the IJVM, this is how a known subroutine would be executed. (This implementation is technically not object oriented.)
For JAVA, additional software would be used to discover the method (dynamically locate it), acquire the appropriate address (dynamically link to the object), and then execute the method.

JAVA Inst: INVOKEVIRTUAL disp
    disp is a 16-bit offset into the constant pool where the subroutine's address is stored.

Method (Subroutine) Area Structure: 4 bytes of special data followed by instructions
    The special data describe:
        How many parameters were passed to the method (Num_parms)
        How many variables are needed by the method (LV_size)
    Setting up the local variable frame uses:
        How many parameters were passed to the method (SP - #parms locates the new LV)
        How many variables are needed by the method

[Figure: Program Memory Space required for a method to be invoked.
    Caller: INVOKEVIRTUAL, disp ms-byte, disp ls-byte, Next INST (the post-method PC).
    Method, starting at Method PC: Num_parms ms-byte, Num_parms ls-byte, LV_size ms-byte, LV_size ls-byte, 1st INST, ..., IRETURN.]
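Reading the 4 bytes of special data at the start of a method can be sketched as below. The function name and the use of a Python bytes object for program memory are assumptions; the most-significant-byte-first order follows the microcode, which shifts the first byte left by 8 before ORing in the second.

```python
def read_method_header(program: bytes, method_addr: int):
    """Return (num_parms, lv_size, first_inst_addr) for the method at method_addr."""
    # Bytes 0-1: number of parameters, most significant byte first.
    num_parms = int.from_bytes(program[method_addr:method_addr + 2], "big")
    # Bytes 2-3: size of the local variable area, most significant byte first.
    lv_size = int.from_bytes(program[method_addr + 2:method_addr + 4], "big")
    # The first instruction of the method follows the 4 header bytes.
    return num_parms, lv_size, method_addr + 4
```

For a method placed at byte address 8 whose header is 00 03 00 02, this returns 3 parameters, 2 local variables, and a first instruction at address 12.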

How a subroutine (or method) is called: the IJVM definition

(1) Push an object reference (pointer) to the object to be called (consistent with JVMs).
(2) Push the method parameters onto the stack.
(3) Invoke the method:
    Use the CPP to locate the new method's address.
    Use the special data (4 bytes) at the method's address to define the storage needed.
    Create a new local variable frame.
    Store the values required to return from the method (PC, LV).

[Figure: Necessary local variable frame and stack changes in Data Memory, (a) before and (b) after invoking a method.]

IJVM invokevirtual microinstructions

The process reads the address of the method.
The old PC is temporarily saved in the OPC register.
The new PC is installed and the first 16-bit value will be fetched.

Label           Operations                                Comments
main1           PC = PC + 1; fetch; goto (MBR)            MBR holds opcode; get next byte; dispatch
invokevirtual1  PC = PC + 1; fetch                        MBR = index byte 1; inc. PC, get 2nd byte
invokevirtual2  H = MBRU << 8                             Shift and save first byte in H
invokevirtual3  H = MBRU OR H                             H = offset of method pointer from CPP
invokevirtual4  MAR = CPP + H; rd                         Get pointer to method from CPP area
invokevirtual5  OPC = PC + 1                              Save Return PC in OPC temporarily
invokevirtual6  PC = MDR; fetch                           PC points to new method; get param count

Use the first 16-bit value to determine where the original SP was (OBJREF storage - 1).
Prepare to overwrite OBJREF and determine the new value for LV.
Fetch the size of the local variable frame.

Label           Operations                                Comments
invokevirtual7  PC = PC + 1; fetch                        Fetch 2nd byte of parameter count
invokevirtual8  H = MBRU << 8                             Shift and save first byte in H
invokevirtual9  H = MBRU OR H                             H = number of parameters
invokevirtual10 PC = PC + 1; fetch                        Fetch first byte of # locals
invokevirtual11 TOS = SP - H                              TOS = address of OBJREF - 1
invokevirtual12 TOS = MAR = TOS + 1                       TOS = address of OBJREF (new LV)

[Figure: Manipulations of the local-variable frame after the previous instructions. Both panels show, from the top down, Parm2, Parm1, OBJREF, and the old local variable frame. The INVOKE SP (old SP) points at Parm2; in the second panel TOS, the eventual LV, points at OBJREF.]

Overwrite the OBJREF with the link pointer, the address just above the parameters and local variables (also where the return PC is going to go).
Set the SP to be at the top of the parameters and local variables.
Set the MAR to write the old PC and old LV.

Label           Operations                                Comments
invokevirtual13 PC = PC + 1; fetch                        Fetch second byte of # locals
invokevirtual14 H = MBRU << 8                             Shift and save first byte in H
invokevirtual15 H = MBRU OR H                             H = # locals
invokevirtual16 MDR = SP + H + 1; wr                      Overwrite OBJREF with link pointer
invokevirtual17 MAR = SP = MDR                            Set SP, MAR to location to hold old PC

[Figure: Manipulations of the local-variable frame after the previous instructions. Both panels show, from the top down, Parm2, Parm1, OBJREF, and the old local variable frame, with TOS (the eventual LV) at OBJREF. The first panel marks the INVOKE SP at Parm2; the second marks the new SP above the frame.]

Finally:
    Write the old PC to the stack.
    Write the old LV to the stack.
    Set the new LV to the address that holds the link pointer.
    Fetch the first instruction of the new method.
DONE!!!

Label           Operations                                Comments
invokevirtual18 MDR = OPC; wr                             Save old PC above the local variables
invokevirtual19 MAR = SP = SP + 1                         SP points to location to hold old LV
invokevirtual20 MDR = LV; wr                              Save old LV above saved PC
invokevirtual21 PC = PC + 1; fetch                        Fetch first opcode of new method
invokevirtual22 LV = TOS; goto main1                      Set LV to point to LV Frame

[Figure: Necessary local variable frame and stack changes in Data Memory, (a) before and (b) after invoking a method.]
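The net effect of invokevirtual11 through invokevirtual22 on data memory can be summarized in a short Python sketch. The function name invoke, the dictionary used as memory, and the assumption that num_parms counts OBJREF as a parameter are all illustrative choices; the address arithmetic itself is taken from the microinstructions above.

```python
def invoke(mem, sp, lv, return_pc, num_parms, lv_size):
    """Build the callee frame; return (new_sp, new_lv)."""
    # invokevirtual11-12: TOS = SP - H, then TOS + 1 = address of OBJREF (new LV)
    new_lv = sp - num_parms + 1
    # invokevirtual16: MDR = SP + H + 1, the link pointer, written over OBJREF
    link = sp + lv_size + 1
    mem[new_lv] = link
    # invokevirtual17-18: SP = link pointer; the old PC is saved there
    mem[link] = return_pc
    # invokevirtual19-20: SP = SP + 1; the old LV is saved above the old PC
    mem[link + 1] = lv
    # invokevirtual22: LV = TOS
    return link + 1, new_lv


# Example: OBJREF at 0x1E, Parm1 at 0x1F, Parm2 at 0x20, two local variables.
frame = {0x1E: 0, 0x1F: 11, 0x20: 22}
callee_sp, callee_lv = invoke(frame, sp=0x20, lv=0x10, return_pc=0x55,
                              num_parms=3, lv_size=2)
```

In this example the link pointer 0x23 replaces OBJREF at 0x1E, the return PC lands at 0x23, the caller's LV at 0x24, and the callee starts with SP = 0x24 and LV = 0x1E.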

Returning from a Method

Returning from a method must return operation to the original program, restore the original LV, restore the stack pointer to its previous location, and place a return value (from the TOS register) on the top of the stack.

IJVM Microinstruction 1 (MIC-1) Architecture Language

Label           Operations                                Comments
ireturn1        MAR = SP = LV; rd                         Reset SP, MAR to get link pointer
ireturn2                                                  Wait for read
ireturn3        LV = MAR = MDR; rd                        Set LV to link ptr; get old PC
ireturn4        MAR = LV + 1                              Set MAR to read old LV
ireturn5        PC = MDR; rd; fetch                       Restore PC; fetch next opcode
ireturn6        MAR = SP                                  Set MAR to write TOS
ireturn7        LV = MDR                                  Restore LV
ireturn8        MDR = TOS; wr; goto main1                 Save return value on original top of stack

The end of IJVM Microarchitecture 1 (Mic-1)
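The ireturn sequence undoes the frame built by invokevirtual. A minimal Python sketch of its net effect follows; the function name and the dictionary memory are assumptions, and read timing is ignored.

```python
def ireturn(mem, lv, tos):
    """Tear down the callee frame; return (pc, sp, lv) for the caller."""
    # ireturn1: MAR = SP = LV; rd  (the word at LV holds the link pointer)
    sp = lv
    link = mem[lv]
    # ireturn3: LV = MAR = MDR; rd  (the word at the link pointer is the old PC)
    old_pc = mem[link]
    # ireturn4-5: MAR = LV + 1; rd  (the next word up is the old LV)
    old_lv = mem[link + 1]
    # ireturn6-8: MAR = SP; MDR = TOS; wr  (return value replaces the old OBJREF slot)
    mem[sp] = tos
    return old_pc, sp, old_lv


# Example frame: link pointer at 0x1E, old PC at 0x23, old LV at 0x24.
ret_mem = {0x1E: 0x23, 0x23: 0x55, 0x24: 0x10}
ret_pc, ret_sp, ret_lv = ireturn(ret_mem, lv=0x1E, tos=99)
```

The caller resumes at PC = 0x55 with LV = 0x10, and the return value 99 sits at the new top of stack, 0x1E.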

Speeding Up Microarchitecture 1 of the Integer JAVA Virtual Machine

How can we attain higher clock rates for the machine?
    Take the basic concept and design it over more engineering time ($$).
    Add more gates where the design needs it: more time, more gates, more silicon ($$$).

Ideas:
(1) Reduce the number of clock cycles per instruction (CPI)
(2) Perform instruction fetch and data processing in parallel
(3) Make each one of the clock cycles shorter
(4) Simplify data flow to the ALU (H-register loading)
(5) Overlap the execution of instructions (pipeline)

Possibilities:
(1) Make an intelligent instruction fetch unit
    (a) Eliminate PC increment from using the execution unit
    (b) Eliminate main1
    (c) Eliminate multi-byte instruction fetch delays
    (d) Add an instruction queue (precursor to a cache)
(2) Eliminate the need to load the H register
    (a) Add a new data path bus, the A-Bus
    (b) Increase the MIR size to add an A-Bus decoder and multiplexer

Instruction Fetch Unit

(1) Make an intelligent instruction fetch unit
    (a) Eliminate PC increment from using the execution unit
    (b) Eliminate main1
    (c) Eliminate multi-byte instruction fetch delays
    (d) Add an instruction queue (precursor to a cache)

Architecture Elements of the New Instruction Fetch Unit
(1) The PC has its own incrementer!
(2) An instruction queue has been added so that future instructions can be preloaded.
    (a) This requires an instruction memory address register (IMAR) that runs ahead of the PC.
    (b) Special consideration must be made for branching (flush the queue and stall).
(3) Two pseudo-registers are created to hold 1-byte and 2-byte offsets, displacements, etc. (MBR1 and MBR2).
(4) Queued instruction interpretation predetermines the length of each instruction.
    (a) Necessary to drive the IMAR logic and prefetch
    (b) The IFU instruction fetch finite state machine

Note: IMAR fetches 32-bit words (4 bytes) whenever the queue nears empty.
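A toy model can make the queue behaviour concrete. The Python sketch below is an illustration only: the class name FetchUnit, the refill threshold of fewer than 2 queued bytes, and the use of a byte index for IMAR are assumptions, and the stall after a branch is not modelled.

```python
from collections import deque


class FetchUnit:
    """Toy instruction fetch unit: a byte queue feeding MBR1 and MBR2."""

    def __init__(self, program: bytes, start: int = 0):
        self.program = program
        self.pc = start        # advanced by the IFU itself, not by the ALU
        self.imar = start      # runs ahead of PC
        self.queue = deque()
        self._refill()

    def _refill(self):
        # Assumed policy: fetch 4 bytes whenever fewer than 2 are queued.
        while len(self.queue) < 2 and self.imar < len(self.program):
            self.queue.extend(self.program[self.imar:self.imar + 4])
            self.imar += 4

    def mbr1(self) -> int:
        # Consume one byte (unsigned view) and advance PC by 1.
        value = self.queue.popleft()
        self.pc += 1
        self._refill()
        return value

    def mbr2(self) -> int:
        # Consume two bytes as one 16-bit value and advance PC by 2.
        high = self.queue.popleft()
        low = self.queue.popleft()
        self.pc += 2
        self._refill()
        return (high << 8) | low

    def branch(self, target: int):
        # Flush the queue and restart fetching at the branch target.
        self.queue.clear()
        self.pc = self.imar = target
        self._refill()
```

Reading an opcode through mbr1() and then a 2-byte operand through mbr2() takes no separate PC arithmetic, which is the effect the slide attributes to the IFU.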

How the instruction queue is filled: queue depth and queue byte count

Advantages
    Eliminates the main1 code and its delays
    Eliminates PC manipulations in the data path
    Generates 8-bit and 16-bit instruction values (MBR1 and MBR2)
    Allows 32-bit memory accesses (similar to the data path)

Datapath for the Mic-2

Eliminate the need to load the H register
    (a) Add a new data path bus, the A-Bus
    (b) Increase the MIR size to add an A-Bus decoder and multiplexer (+ 4 bits)

Architecture Elements:
(1) A three-bus internal processor architecture: 2 operand buses with 1 result bus
(2) Hardware design optimization of the ALU. That is, use another ALU design with less delay!

Other methods were discussed, but are not shown.

The modified microinstruction code speedup

a.) Main1 removed (one clock cycle saved for every instruction executed)
b.) ALU arithmetic and logical operations: iadd, isub, iand, ior, dup, swap are all similar (reduced by one cycle per instruction)
    Limitation to speed-up: significant amount of reading and writing memory
c.) Stack manipulation:
    bipush eliminates byte reading
    iload and istore eliminate reading the offset value
    wide_iload and wide_istore eliminate reading the wide offset value
    ldc_w eliminates reading the wide offset value
    iinc eliminates both offset and immediate value fetching
    Significant improvement due to MBR1 and MBR2 availability
    Nominally 3 cycles per instruction savings
d.) Branching operations:
    goto eliminates target address computations
    iflt, ifeq, and if_icmpeq: no significant changes (memory read bound)
        T: no change
        F: must signal the Instruction Fetch Unit to remove the branch address from MBR2 (more logic)
e.) Subroutine call:
    invokevirtual goes from 22 to 11 cycles
    ireturn: no changes

Overall: A significant reduction in the required microcode store and a significantly faster machine.
    Instructions least affected: data reading and writing
    Instructions most affected: those requiring immediate values from the instruction stream
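The savings quoted above can be checked by counting microinstructions in the listings of these notes. The Python sketch below does that tally; the step counts were taken by hand from the Mic-1 and Mic-2 listings, one cycle per microinstruction is assumed, and memory waits and IFU refill stalls are ignored. The dictionary and function names are hypothetical.

```python
# Microinstructions per instruction, not counting main1 (counted from the listings).
MIC1_STEPS = {"iload": 5, "wide iload": 8, "invokevirtual": 22, "ireturn": 8}
# Mic-2 has no main1; "wide iload" is wide1 + wide_iload1 + iload2 + iload3.
MIC2_STEPS = {"iload": 3, "wide iload": 4, "invokevirtual": 11, "ireturn": 8}


def mic1_cycles(name: str) -> int:
    # Every Mic-1 instruction also pays one cycle for the main1 dispatch.
    return MIC1_STEPS[name] + 1


def cycles_saved(name: str) -> int:
    return mic1_cycles(name) - MIC2_STEPS[name]


savings = {name: cycles_saved(name) for name in MIC1_STEPS}
```

Under these assumptions iload saves 3 cycles, in line with the nominal figure on the slide, while ireturn saves only the main1 cycle.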

The modified microinstruction code (MIC-2): IJVM Microinstruction 2 (MIC-2) Architecture Language (1 of 4)

Label           Operations                                Comments
nop1            goto (MBR)                                Branch to next instruction
iadd1           MAR = SP = SP - 1; rd                     Read in next-to-top word on stack
iadd2           H = TOS                                   H = top of stack (optional)
iadd3           MDR = TOS = MDR + H; wr; goto (MBR1)      Add top two words; write to new top of stack
isub1           MAR = SP = SP - 1; rd                     Read in next-to-top word on stack
isub2           H = TOS                                   H = top of stack (optional)
isub3           MDR = TOS = MDR - H; wr; goto (MBR1)      Subtract TOS from fetched TOS-1
iand1           MAR = SP = SP - 1; rd                     Read in next-to-top word on stack
iand2           H = TOS                                   H = top of stack (optional)
iand3           MDR = TOS = MDR AND H; wr; goto (MBR1)    AND fetched TOS-1 with TOS
ior1            MAR = SP = SP - 1; rd                     Read in next-to-top word on stack
ior2            H = TOS                                   H = top of stack (optional)
ior3            MDR = TOS = MDR OR H; wr; goto (MBR1)     OR fetched TOS-1 with TOS
dup1            MAR = SP = SP + 1                         Increment SP; copy to MAR
dup2            MDR = TOS; wr; goto (MBR1)                Write new stack word
pop1            MAR = SP = SP - 1; rd                     Read in next-to-top word on stack
pop2                                                      Wait for read
pop3            TOS = MDR; goto (MBR1)                    Copy new word to TOS
swap1           MAR = SP - 1; rd                          Read 2nd word from stack; set MAR to SP
swap2           MAR = SP                                  Prepare to write new 2nd word
swap3           H = MDR; wr                               Save new TOS; write 2nd word to stack
swap4           MDR = TOS                                 Copy old TOS to MDR
swap5           MAR = SP - 1; wr                          Write old TOS to 2nd place on stack
swap6           TOS = H; goto (MBR1)                      Update TOS

Overview:
    Main1 removed
    iadd, isub, iand, ior, dup, swap are all similar (reduced by one cycle per instruction)
    Limitation to speed-up: significant amount of reading and writing memory
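The three iadd microinstructions rely on TOS caching the top word of the stack, so only the next-to-top word has to be read from memory. A minimal Python sketch of that behaviour follows; the function name, the dictionary memory, and the 32-bit wraparound mask are assumptions.

```python
WORD_MASK = 0xFFFFFFFF  # assumption: results wrap to 32 bits


def iadd(mem, sp, tos):
    """Return (new_sp, new_tos) after adding the top two stack words."""
    # iadd1: MAR = SP = SP - 1; rd  (read the next-to-top word)
    sp -= 1
    mdr = mem[sp]
    # iadd2: H = TOS  (the top word comes from the TOS register, not memory)
    h = tos
    # iadd3: MDR = TOS = MDR + H; wr  (write the sum to the new top of stack)
    tos = (mdr + h) & WORD_MASK
    mem[sp] = tos
    return sp, tos


# Example: 5 below 7 on the stack, with TOS mirroring the top word.
stack = {0x10: 5, 0x11: 7}
new_sp, new_tos = iadd(stack, sp=0x11, tos=7)
```

isub, iand, and ior follow the same pattern with a different ALU operation in the last step.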

IJVM Microinstruction 2 (MIC-2) Architecture Language (2 of 4)

Label           Operations                                Comments
bipush1         SP = MAR = SP + 1                         Set up MAR for writing to new top of stack
bipush2         MDR = TOS = MBR1; wr; goto (MBR1)         Update stack in TOS and memory
iload1          MAR = LV + MBR1U; rd                      Move LV + index to MAR; read operand
iload2          MAR = SP = SP + 1                         Increment SP; move new SP to MAR
iload3          TOS = MDR; wr; goto (MBR1)                Update stack in TOS and memory
istore1         MAR = LV + MBR1U                          Set MAR to LV + index
istore2         MDR = TOS; wr                             Copy TOS for storing
istore3         MAR = SP = SP - 1; rd                     Decrement SP; read new TOS
istore4                                                   Wait for read
istore5         TOS = MDR; goto (MBR1)                    Update TOS
wide1           goto (MBR1 OR 0x100)                      Next address is 0x100 ORed with opcode
wide_iload1     MAR = LV + MBR2U; rd; goto iload2         Identical to iload1 but using 2-byte index
wide_istore1    MAR = LV + MBR2U; goto istore2            Identical to istore1 but using 2-byte index
ldc_w1          MAR = CPP + MBR2U; rd; goto iload2        Same as wide_iload1 but indexing off CPP

Overview:
    bipush eliminates byte reading!
    iload and istore eliminate reading the offset value
    wide_iload and wide_istore eliminate reading the wide offset value
    ldc_w eliminates reading the wide offset value
    Significant improvement due to MBR1 and MBR2 availability
    Nominally 3 cycles per instruction savings
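The listing uses MBR1 for bipush but MBR1U for the iload and istore index. In this design the plain name delivers the byte sign-extended and the U suffix delivers it zero-extended, which matters for bytes of 0x80 and above. A small Python sketch of the two views follows; the function names are hypothetical.

```python
def mbr1_signed(byte: int) -> int:
    # Sign-extended view: bipush pushes a signed 8-bit constant.
    byte &= 0xFF
    return byte - 0x100 if byte & 0x80 else byte


def mbr1_unsigned(byte: int) -> int:
    # Zero-extended view: a local variable index is never negative.
    return byte & 0xFF


def mbr2_unsigned(high: int, low: int) -> int:
    # 16-bit zero-extended view used by wide_iload1 and ldc_w1.
    return ((high & 0xFF) << 8) | (low & 0xFF)
```

The same byte 0xFE therefore means -2 to bipush and index 254 to iload.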

IJVM Microinstruction 2 (MIC-2) Architecture Language (3 of 4)

Label           Operations                                Comments
iinc1           MAR = LV + MBR1U; rd                      Set MAR to LV + index for read
iinc2           H = MBR1                                  Set H to constant
iinc3           MDR = MDR + H; wr; goto (MBR1)            Increment by constant and update
goto1           H = PC - 1                                Copy PC to H
goto2           PC = H + MBR2                             Add offset and update PC
goto3                                                     Have to wait for IFU to fetch new opcode
goto4           goto (MBR1)                               Dispatch to next instruction
iflt1           MAR = SP = SP - 1; rd                     Read in next-to-top word on stack
iflt2           OPC = TOS                                 Save TOS in OPC temporarily
iflt3           TOS = MDR                                 Put new top of stack in TOS
iflt4           N = OPC; if (N) goto T; else goto F       Branch on N bit
ifeq1           MAR = SP = SP - 1; rd                     Read in next-to-top word of stack
ifeq2           OPC = TOS                                 Save TOS in OPC temporarily
ifeq3           TOS = MDR                                 Put new top of stack in TOS
ifeq4           Z = OPC; if (Z) goto T; else goto F       Branch on Z bit
if_icmpeq1      MAR = SP = SP - 1; rd                     Read in next-to-top word of stack
if_icmpeq2      MAR = SP = SP - 1                         Set MAR to read in new top-of-stack
if_icmpeq3      H = MDR; rd                               Copy second stack word to H
if_icmpeq4      OPC = TOS                                 Save TOS in OPC temporarily
if_icmpeq5      TOS = MDR                                 Put new top of stack in TOS
if_icmpeq6      Z = H - OPC; if (Z) goto T; else goto F   If top 2 words are equal, goto T, else goto F
T               H = PC - 1; goto goto2                    Same as goto1
F               H = MBR2                                  Touch bytes in MBR2 to discard
F2              goto (MBR1)

Overview:
    iinc eliminates both offset and immediate value fetching
    goto eliminates target address computations
    iflt, ifeq, and if_icmpeq: no significant changes (memory read bound)
        T: no change
        F: must signal the IFU to remove the branch address from MBR2
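The branch target arithmetic in goto1 and goto2 can be sketched as follows. The sketch assumes that PC has already been advanced one byte past the branch opcode when goto1 runs, so that PC - 1 is the opcode's own address, and that MBR2 is read as a signed 16-bit offset; the function names are hypothetical.

```python
def sign_extend_16(value: int) -> int:
    # MBR2 (without the U suffix) is treated as a signed 16-bit quantity.
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def branch_target(pc: int, mbr2: int) -> int:
    # goto1: H = PC - 1  (address of the branch opcode, under the assumption above)
    h = pc - 1
    # goto2: PC = H + MBR2
    return h + sign_extend_16(mbr2)


def iflt_next_pc(pc: int, mbr2: int, popped_value: int) -> int:
    # iflt4: branch if the popped value is negative (N bit set).
    if popped_value < 0:
        return branch_target(pc, mbr2)   # path T
    return pc + 2                        # path F: skip the 2 offset bytes
```

With the opcode at address 0x10 (PC = 0x11), an offset of 0xFFFD branches back to 0x0D and an offset of 0x0005 branches forward to 0x15.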

IJVM Microinstruction 2 (MIC-2) Architecture Language (4 of 4)

Label           Operations                                Comments
invokevirtual1  MAR = CPP + MBR2U; rd                     Put address of method pointer in MAR
invokevirtual2  OPC = PC                                  Save Return PC in OPC
invokevirtual3  PC = MDR                                  Set PC to 1st byte of method code
invokevirtual4  TOS = SP - MBR2U                          TOS = address of OBJREF - 1
invokevirtual5  TOS = MAR = H = TOS + 1                   TOS = address of OBJREF
invokevirtual6  MDR = SP + MBR2U + 1; wr                  Overwrite OBJREF with link pointer
invokevirtual7  MAR = SP = MDR                            Set SP, MAR to location to hold old PC
invokevirtual8  MDR = OPC; wr                             Prepare to save old PC
invokevirtual9  MAR = SP = SP + 1                         Inc. SP to point to location to hold old LV
invokevirtual10 MDR = LV; wr                              Save old LV
invokevirtual11 LV = TOS; goto (MBR1)                     Set LV to point to zeroth parameter
ireturn1        MAR = SP = LV; rd                         Reset SP, MAR to read link ptr
ireturn2                                                  Wait for link ptr
ireturn3        LV = MAR = MDR; rd                        Set LV, MAR to link ptr; read old PC
ireturn4        MAR = LV + 1                              Set MAR to point to old LV; read old LV
ireturn5        PC = MDR; rd                              Restore PC
ireturn6        MAR = SP
ireturn7        LV = MDR                                  Restore LV
ireturn8        MDR = TOS; wr; goto (MBR1)                Save return value on original top of stack

Overview:
    invokevirtual goes from 22 to 11 cycles
    ireturn: no changes

Overall: A significant reduction in the required microcode store and a significantly faster machine.
    Instructions least affected: data reading and writing
    Instructions most affected: those requiring immediate values from the instruction stream