An Overview of Stack Architecture and the PSC 1000 Microprocessor
Introduction

A stack is an important data-handling structure in computing. Specifically, a stack is a dynamic set of elements in which access to the elements is restricted to a pre-specified manner: a last-in, first-out (LIFO) policy. Conceptually, LIFO stacks are the easiest way to meet the temporary storage requirements of important computer tasks (Koopman 1989). To illustrate, some computer architectures implement an expression evaluation stack, used to save the intermediate values of arithmetic and logic expressions and to keep track of levels of precedence. Another implementation is a return address stack, used to save the address of the calling program whenever a subroutine is called. A local variable stack allows each instance of a subroutine to store its local variables on the stack, avoiding corruption by another instance of the same subroutine under recursion or reentrancy. Finally, a parameter stack is used to pass parameters when a subroutine is called. These stacks do not usually exist as separate structures in real computers; rather, they are combined using both hardware and software in a manner specific to each architecture. Any application that relies heavily on a stack model, such as the Java Virtual Machine, will benefit in efficiency and speed from a computer with hardware support for stacks (Shaw 1997).
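To make the expression-evaluation stack concrete, the following minimal C sketch (illustrative only; it is not taken from any of the cited sources, and the names and stack depth are arbitrary) evaluates (3 + 4) * 5 in postfix form using an explicit LIFO stack, the same push/pop discipline a hardware operand stack provides.

    #include <stdio.h>

    /* Illustrative software LIFO stack (names and size are arbitrary). */
    #define DEPTH 16
    static int stack[DEPTH];
    static int top = -1;                    /* index of the current top element */

    static void push(int v) { stack[++top] = v; }
    static int  pop(void)   { return stack[top--]; }

    int main(void)
    {
        /* Evaluate (3 + 4) * 5, written postfix as: 3 4 + 5 *.
           Intermediate results stay on the stack, so no explicit temporaries
           or precedence bookkeeping are needed. */
        push(3);
        push(4);
        push(pop() + pop());                /* "+" consumes two operands, yields one */
        push(5);
        push(pop() * pop());                /* "*" does the same */
        printf("result = %d\n", pop());     /* prints: result = 35 */
        return 0;
    }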
Dimensions of a Stack Computer

Most computer architectures implement some level of hardware support for stacks. There are many different methods of implementing hardware stack support, creating several different classes of stack computer architectures. These classes can be categorized using a three-dimensional design space (Koopman 1989). See Figure 1.

Figure 1: Stack Computer Design Space (Koopman 1989)

The three dimensions are the number of stacks supported by the hardware (single or multiple), the size of any dedicated buffer for stack elements (small or large), and the number of operands permitted by the instruction format (0, 1, or 2).

The number of stacks supported in a computer's architecture is the first dimension in the design of a stack computer. A single-stack computer supports only one stack, which is simpler in hardware and easier for an operating system to manage. Multiple-stack computers, by contrast, support two or more stacks. Unlike the single-stack architecture, the multiple-stack architecture allows control-flow information and data to be separated. Consequently, multiple stacks give the computer speed, because subroutine calls and data operations can occur simultaneously when the return address stack and the data stack are separate (Koopman 1989).

The amount of memory used to buffer stack elements is another important dimension in designing a stack computer. In general, computers that use program memory for the stack buffer are considered to have a small stack buffer architecture. Large stack buffer computers possess a stack buffer that does not use main memory; such a buffer has many possible implementations, for example a memory unit separate from main memory. A large stack buffer is usually advantageous because program memory cycles are not consumed to access the stack (Koopman 1989).
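To illustrate this trade-off, the following C sketch (a simplified model, not a description of any particular machine) implements a small dedicated stack buffer that spills its bottom element to main memory when it overflows and refills from memory when it empties; the counter shows the extra memory traffic that a sufficiently large dedicated buffer would avoid.

    #include <stdio.h>

    #define BUF_DEPTH 4            /* dedicated buffer: deliberately "small" */
    #define MEM_WORDS 1024

    static int buf[BUF_DEPTH];     /* dedicated stack buffer (on-chip in hardware) */
    static int buf_count = 0;
    static int mem[MEM_WORDS];     /* main (program) memory backing the stack */
    static int mem_count = 0;
    static long mem_cycles = 0;    /* program-memory cycles spent on stack traffic */

    static void push(int v)
    {
        if (buf_count == BUF_DEPTH) {              /* buffer full: spill bottom element */
            mem[mem_count++] = buf[0];
            for (int i = 1; i < BUF_DEPTH; i++)
                buf[i - 1] = buf[i];
            buf_count--;
            mem_cycles++;
        }
        buf[buf_count++] = v;
    }

    static int pop(void)
    {
        if (buf_count == 0) {                      /* buffer empty: refill from memory */
            buf[buf_count++] = mem[--mem_count];
            mem_cycles++;
        }
        return buf[--buf_count];
    }

    int main(void)
    {
        for (int i = 0; i < 8; i++)                /* push deeper than the buffer holds */
            push(i);
        for (int i = 0; i < 8; i++)
            printf("%d ", pop());                  /* prints: 7 6 5 4 3 2 1 0 */
        printf("\nmemory cycles spent on the stack: %ld\n", mem_cycles);
        return 0;
    }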
The number of operands permitted by the instruction format is the last dimension in the design of a stack computer. 0-operand instructions do not allow any operands to be specified; all operations use the top stack elements, and computers that use this instruction format are also known as pure stack machines. 1-operand instructions specify one operand and use the top stack element as the second. 2-operand instructions allow two or three operands, for example a source and a destination, to be specified.

Figure 2: Categories of Stack Computers (Koopman 1989)

Figure 2 categorizes several computers with respect to the three dimensions of the stack computer design space. The first letter in a category abbreviation stands for the number of stacks (Single or Multiple), the second letter stands for the size of the stack buffer (Small or Large), and the final digit represents the number of operands in an instruction (0, 1, or 2). As the figure shows, most computers, including the mainstream Intel 80x86, can be classified within the stack computer design space.

The PSC 1000 Microprocessor

The Patriot Scientific Corp. PSC 1000 microprocessor is a highly integrated 32-bit processor designed specifically for embedded applications where power consumption and cost are the important factors (Shaw 1999). Furthermore, the PSC 1000 is one of the first Java-like CPUs: the Java Virtual Machine maps very closely onto its architecture, because both are stack architectures (Shaw 1997). Many other languages, such as C and C++, can also be run efficiently on the PSC 1000, because their compilers implement a stack model.

Figure 3: PSC 1000 Block Diagram (Shaw 1999)
The PSC 1000 Central Processing Unit (CPU) has a dual-processor architecture: the microprocessing unit (MPU) and the virtual processing unit (VPU). The MPU is a 0-operand dual-stack processor that performs conventional processing tasks, while the VPU is an input-output processor that performs time-synchronous data transfers and may emulate dedicated peripheral functions (Shaw 1996). The CPU also contains global registers, a direct memory access controller (DMAC), an interrupt controller (INTC), on-chip resources, bit inputs, bit outputs, a programmable memory interface (MIF), and a clock, as shown in Figure 3. Since the MPU is the stack-based processor, it is the emphasis of the discussion that follows.

Micro-processing Unit

The MPU is an ML0 stack processor (multiple stacks, large stack buffers, 0-operand instructions), designed under an architectural philosophy of simplification and efficiency. By implementing a 0-operand architecture, the MPU achieves a high instruction bandwidth with its 8-bit instructions. Since the MPU is a 32-bit processor, four instructions, referred to collectively as an instruction group, may be fetched per memory cycle. The instructions are also hardwired into the MPU, adding further efficiency; as a result, most instructions execute in a single clock cycle. Consequently, the PSC 1000 MPU achieves twice the instruction bandwidth of most common RISC processors, an advantage gained from its hardware stacks and small instruction set (Shaw 1999).
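As a rough illustration of the instruction-group idea, the following C sketch (the shift amounts, opcode values, and left-to-right issue order are assumptions made for this example, not details taken from the reference manual) unpacks one fetched 32-bit cell into four 8-bit opcodes.

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint32_t group = 0xA1B2C3D4u;            /* one fetched cell (hypothetical contents) */
        uint8_t opcode[4];

        /* Take the high-order byte first, matching the big-endian cell layout
           described below; the decoder then issues the four opcodes in turn. */
        for (int slot = 0; slot < 4; slot++)
            opcode[slot] = (uint8_t)(group >> (24 - 8 * slot));

        for (int slot = 0; slot < 4; slot++)
            printf("slot %d: opcode 0x%02X\n", slot, opcode[slot]);

        /* One memory cycle thus delivers four instructions, which is the source
           of the instruction-bandwidth advantage described above. */
        return 0;
    }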
I. Registers/Stacks

The register set of the MPU has 52 general-purpose registers, including 16 global registers (g0-g15), a 16-deep local-register/return stack (r0-r15), an 18-deep operand stack (s0-s15), an index register (x), and a count register (ct). There is also a mode/status register (mode), two stack pointers (sa and la), and 41 on-chip resource registers used for I/O, configuration, and status. See Figure 4.

Figure 4: MPU Registers (Shaw 1999)

The local-register/return stack is used to hold subroutine return addresses and well-nested local variables. The operand stack is used for expression evaluation and for parameter passing. The registers on both stacks are referenced relative to the top of the stack: for example, when a value is pushed onto the operand stack, the element formerly in s0 is pushed down to s1 and the new element becomes s0. An unlimited number of values can be stored on either stack because, as the available register space fills, registers are spilled into memory; as the stack registers empty, they are refilled from memory.

II. ALU and Instruction Set (Appendix A)

All ALU opcodes are 8-bit encoded instructions. Since the MPU is a 0-operand stack processor, no bits are needed to specify source and destination operands; instead, the operands are assumed to be the top elements of the operand stack. For example, the add instruction adds the elements in s0 and s1 and places the result in s0. See Figure 5. Not all ALU instructions behave like add, which uses two source operands from the operand stack and returns one result to it; some, like the increment and decrement instructions, use only one source operand and return one result.

Figure 5: Example of the add instruction (Shaw 1999)
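The following C sketch mimics this behaviour (illustrative only: the stack depth is arbitrary, spilling to memory is omitted, and the function names echo the text rather than actual opcode mnemonics or encodings).

    #include <stdio.h>

    static int s[16];               /* operand stack model: s[0] plays the role of s0 */
    static int depth = 0;

    static void push(int v)         /* push a literal; existing elements move down */
    {
        for (int i = depth; i > 0; i--)
            s[i] = s[i - 1];
        s[0] = v;
        depth++;
    }

    static void add(void)           /* 0-operand add: sources are s0 and s1, result in s0 */
    {
        s[0] = s[0] + s[1];
        for (int i = 1; i < depth - 1; i++)
            s[i] = s[i + 1];
        depth--;
    }

    static void inc(void)           /* one-source, one-result instruction: only s0 is touched */
    {
        s[0] = s[0] + 1;
    }

    int main(void)
    {
        push(7);                    /* s0 = 7            */
        push(5);                    /* s0 = 5, s1 = 7    */
        add();                      /* s0 = 12           */
        inc();                      /* s0 = 13           */
        printf("s0 = %d\n", s[0]);  /* no operand fields were ever specified */
        return 0;
    }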
One problem with 8-bit opcodes is that there are no bits left to encode branch offsets or literal values for immediate arithmetic instructions. To maintain the consistency and simplicity of 8-bit opcodes, branch offsets are taken as the last three bits of a branch, loop, or skip opcode plus all the bits to the right of that opcode in the current instruction group (four opcodes). Thus, depending upon the location of the branch within its instruction group, the offset can be a 3-, 11-, 19-, or 27-bit two's complement value. The offset value is added to the program counter (pc) and execution transfers to the resulting cell-aligned address. Cells are four-byte blocks of memory, and most instructions address memory by cells. Furthermore, cells have a big-endian byte order, meaning that the high-order byte is at the lowest byte address within the cell. See Figure 6.

Figure 6: Memory Addressing (Shaw 1999)

Literals come in three sizes: nibble, byte, and long. A nibble literal is taken as the two's complement value of the four least significant bits of the push.n opcode. The data for a byte literal is encoded in the right-most byte of the instruction group containing the push.b instruction. Finally, a long (cell-sized) literal is taken as the two's complement value of the entire instruction group following the group that contains the push.l instruction. See Figure 7.

Figure 7: Instruction Formats (Shaw 1999)

For load and store instructions, the r0, x, or s0 registers are used as index registers. The adda (add address) instruction can also be used to compute an address index. Figure 8 shows a code example that adds a value from memory to a nibble-sized literal and then stores the result in memory. The address index is derived from the values in two global registers (g0 and g2) and a byte-sized literal using the add address instruction. The result of the add address operation sits at the top of the operand stack, s0, and is used as an index to load a value from memory into s0. After the value from memory is added to the nibble literal, the result is stored at another memory address, again derived using the add address instruction.

Figure 8: Load and Store Code Example (Shaw 1999)
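A C rendering of roughly that sequence is sketched below; the register contents, literal values, and memory contents are invented for illustration, and the adda, ld, and st helpers model only the operand-stack behaviour described in the text, not the actual PSC 1000 encodings or addressing details.

    #include <stdint.h>
    #include <stdio.h>

    #define MEM_CELLS 64
    static uint32_t mem[MEM_CELLS];  /* cell-addressed memory model */

    static uint32_t s[8];            /* operand stack model; s[0] plays the role of s0 */
    static int depth = 0;

    static void push(uint32_t v)
    {
        for (int i = depth; i > 0; i--) s[i] = s[i - 1];
        s[0] = v; depth++;
    }
    static uint32_t pop(void)
    {
        uint32_t v = s[0];
        for (int i = 0; i < depth - 1; i++) s[i] = s[i + 1];
        depth--; return v;
    }

    static void adda(void) { uint32_t a = pop(), b = pop(); push(a + b); } /* add address    */
    static void ld(void)   { s[0] = mem[s[0]]; }        /* replace address in s0 with contents */
    static void st(void)   { uint32_t a = pop(); mem[a] = pop(); } /* store s1 at address in s0 */

    int main(void)
    {
        uint32_t g0 = 8, g2 = 4;     /* hypothetical global-register contents */
        mem[16] = 100;               /* value that will be loaded */

        push(g0); push(g2); adda();  /* s0 = g0 + g2                     */
        push(4);  adda();            /* + byte-sized literal: s0 = 16    */
        ld();                        /* s0 = mem[16] = 100               */

        push(2);                     /* nibble-sized literal             */
        { uint32_t n = pop(); s[0] += n; }   /* 0-operand add: s0 = 102  */

        push(g0); push(24); adda();  /* destination address: s0 = 32, s1 = 102 */
        st();                        /* mem[32] = 102                    */

        printf("mem[32] = %u\n", (unsigned)mem[32]);
        return 0;
    }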
Conclusion

It is unlikely that stack processors will ever find a home in personal computers and workstations; their 0-operand addressing makes them less flexible than the more common processors used in those machines. However, the simplicity and low cost of most stack chips, together with their efficiency in handling stack-modeled applications such as the Java Virtual Machine, will make embedded systems, like Web TV control boxes, the primary domain of the stack computer.
Appendix A: PSC 1000 Instruction Set (Shaw 1999)
References

1. Koopman, Philip J., Jr. (1989). Stack Computers: The New Wave. Computers and Their Applications. West Sussex: Ellis Horwood Limited, 1989.

2. Shaw, George William (1999). PSC 1000 Microprocessor Reference Manual. Patriot Scientific Corp., 1999. http://www.ptsc.com/psc1000/documentation.html (29 Oct. 1999).

3. Shaw, George William (1996). "Second Processor Takes Care of I/O." EETimes Techweb News, February 5, 1996. http://www.ptsc.com/psc1000/articles.html (29 Oct. 1999).

4. Shaw, George William (1997). "Architecture Is Key to Execution in Java." EETimes Techweb News, June 16, 1997. http://www.ptsc.com/psc1000/articles.html (29 Oct. 1999).