Queue Machines for Next Generation Computer Systems
Masahiro Sowa
Graduate School of Information Systems, University of Electro-Communications, Chofugaoka 1-5-1, Chofu-shi, Tokyo, Japan

Arquimedes Canedo
IBM Tokyo Research Laboratory, Shimotsuruma, Yamato-shi, Kanagawa-ken, Japan

Abstract

Queue processors are a novel computer architecture with characteristics well suited to the next generation of computer systems. Whereas conventional register machines process data through registers referenced by name, queue machines use a nameless first-in first-out queue to perform operations: the head of the queue is the only place where data is read, and the tail is the only place where data is written. Queue computing thus allows instructions to read and write data without register names; as a consequence, instructions are short and queue programs are free of false dependencies. Short instructions are preferable to long ones because they reduce memory traffic, improve cache performance and effective cache size, and simplify the decoding logic. The absence of false dependencies allows programs to expose their maximum parallelism, which a queue processor can exploit without complex, power-hungry hardware such as register renaming and large instruction windows. Parallel execution allows queue processors to speed up applications. In this paper we present the special characteristics of queue machines that make this design very suitable for a future generation of computer systems demanding high performance, low power, and low cost. We also present the toolchain of compilers, translators, assembler, virtual machine, functional and cycle-accurate simulators, and RTL processors that we have designed and developed.
From our experimental results we demonstrate that queue programs are smaller than conventional register programs and have very similar characteristics in terms of parallelism.

Keywords: Queue Machines, Computer Systems, System Design

1 Introduction

Processor design faces a situation where performance can no longer be sustained by simply increasing the frequency of operation. As the size of transistors decreases and the frequency of operation rises, power dissipation becomes a limitation of modern technology. This situation has motivated designers to consider new execution paradigms for modern computer systems [8, 1, 9]. The key to sustaining performance improvements has been exploiting different sources of parallelism. Queue processors [6, 7, 8] are a novel execution paradigm that allows programs to be executed in parallel without complex hardware to find and exploit such parallelism. The instructions of a queue processor reference source and destination operands implicitly; thus the instructions are compact and free of false dependencies. The hardware implementation of a queue processor requires only simple circuits, yet it is able to execute instructions in parallel. The programs generated for the queue computation model are unrestricted in parallelism, and therefore queue processors can be used for scientific and parallel processing. Owing to its hardware simplicity, a queue processor has the potential for a low power-dissipation footprint, making it ideal for modern applications running on mobile devices. In this paper we present an overview of the principles of queue computing and explain its salient characteristics. We also present a complete queue computer system toolchain and give insight into its development. An
evaluation of the characteristics of queue computing is given to demonstrate the potential of queue-based systems against conventional computing paradigms.

2 Overview of Queue Computing

A queue machine is a computer that uses fast registers organized as a first-in first-out (FIFO) queue to perform computations. A FIFO queue establishes that all read accesses are performed at one end of the queue, called the head (QH), and all writes are performed at the opposite end, called the tail (QT). Since all accesses are done at known, fixed locations, QH and QT, instructions do not need to specify source and destination operands. However, the hardware must track and update the positions of QH and QT at all times: whenever a write is performed the QT position is updated, and whenever a read is performed the QH position is updated. Given the strict rules for accessing the queue, the programs for a queue machine are generated by a level-order traversal of the expression trees. The level-order traversal visits all levels from the deepest to the root and, within each level, walks all nodes from left to right. This scheduling method produces programs that can be directly executed on a queue processor. Consider the expression x = (a+b)/(c-d): first the expression tree is built and all nodes are visited in a level-order traversal, as shown in Figure 1(a). For every visited node a corresponding instruction of the queue processor is generated; the queue program for the sample expression is given in Figure 1(b). An important characteristic of the queue program is that the arithmetic instructions (add, sub, div) consist only of the operator; source and destination operands are absent, since they are implicitly accessed through QH and QT. This feature allows the instructions of a queue processor to be short, and as a consequence programs are compact.
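The level-order code generation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' compiler; the Node class is hypothetical, and an assumed "ld" mnemonic is used for operand loads, following the loads shown in Figure 1.

```python
# A binary expression-tree node; leaves hold variable names in `op`.
class Node:
    def __init__(self, op, left=None, right=None):
        self.op, self.left, self.right = op, left, right

def level_order_schedule(root):
    """Collect the tree level by level, then emit instructions from the
    deepest level up to the root, left to right within each level."""
    levels, frontier = [], [root]
    while frontier:
        levels.append(frontier)
        frontier = [c for n in frontier for c in (n.left, n.right) if c]
    program = []
    for level in reversed(levels):        # deepest level first
        for node in level:                # left to right within a level
            if node.left is None:         # leaf: load operand at QT
                program.append(f"ld {node.op}")
            else:                         # operator: implicit QH/QT access
                program.append(node.op)
    return program

# x = (a + b) / (c - d), the sample expression of Figure 1
tree = Node("div", Node("add", Node("a"), Node("b")),
                   Node("sub", Node("c"), Node("d")))
print(level_order_schedule(tree) + ["st x"])
# -> ['ld a', 'ld b', 'ld c', 'ld d', 'add', 'sub', 'div', 'st x']
```

Note that no instruction names a register: the four loads and the final store reference memory, while add, sub, and div carry no operands at all.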
This is favorable for the hardware, since manipulating small instructions requires less circuitry, smaller buses, and less complexity, and consequently consumes less power. In addition, small instructions have the potential to improve the performance of the cache memories and to reduce production costs in computer systems where memory size is constrained, as in embedded systems.

[Figure 1: The expression x = (a+b)/(c-d) is (a) traversed in a level-order manner to (b) generate a queue program, which is (c) executed in a FIFO queue.]

Data-independent instructions of a queue program are grouped together by the level-order scheduling. Figure 1(c) shows the execution of the sample queue program in the FIFO queue; the leftmost end of the queue is QH and the rightmost end is QT. The first four instructions, those of level L0, place the operands a, b, c, d in the queue. These four instructions are independent of each other and can be issued in parallel. The same happens with the instructions of level L1, add and sub: they are independent of each other and thus execute simultaneously. This holds for every level: all instructions within the same level can be executed in parallel. The absence of operands in the instructions has another desirable consequence: queue programs are free of false dependencies. The parallelism exposed by the level-order scheduling and the
lack of false dependencies in the programs allow queue machines to exploit all the parallelism available in an application. The grouping of independent instructions allows queue processor implementations to use smaller instruction windows, and the absence of false dependencies completely removes the need for register-renaming hardware. A parallel queue processor (PQP) offers a high-performance architecture with a simple hardware design and low-power characteristics.

3 Developing Queue Computer Systems

In Sowa Laboratory we have investigated the principles of queue computing and developed a toolchain for the design and implementation of queue processors. Figure 2 shows the current layers of our toolchain. Except for the applications, every element has been crafted from scratch for the queue computation model. There are fundamental differences between queue machines and conventional stack and register machines that prevent the use of conventional techniques and methods to build compilers, virtual machines, simulators, operating systems, and hardware. Therefore, we have researched and established new methods to build custom tools for queue computing.

[Figure 2: Layers of a queue-based computer system: Applications; Queue Compiler (profiler, optimizers); Queue Virtual Machine; Assembler, Linker, Loader; Simulator; Operating System; PQP Hardware.]

Queue computing and architecture design approaches take performance and power consumption into account early in the design cycle and maintain a power-centric focus across all levels of design abstraction.

Applications

We have considered a wide range of applications to analyze their characteristics when translated to queue programs. The experimental results show that applications benefit from queue computing, as their size can be reduced by up to 47% compared to the same application on a conventional RISC machine [3]. In terms of other qualitative characteristics, such as available parallelism and instruction count, queue programs are very similar to register programs. Most applications are written in high-level languages such as C, FORTRAN, and Java; therefore, we need a compiler to translate these programs into machine code for queue processors.
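As a concrete illustration of the execution rules of Section 2 (all reads pop from QH, all writes append at QT), the queue program of Figure 1 can be interpreted with an ordinary software FIFO. This is a toy sketch, not the laboratory's virtual machine; the env dictionary standing in for memory and the ld/st mnemonics are assumptions for illustration.

```python
from collections import deque

def run_queue_program(program, env):
    """Execute a queue program: reads pop from QH (left end),
    writes append at QT (right end)."""
    q = deque()
    binops = {"add": lambda a, b: a + b,
              "sub": lambda a, b: a - b,
              "div": lambda a, b: a / b}
    for insn in program:
        op, *arg = insn.split()
        if op == "ld":                       # write operand value at QT
            q.append(env[arg[0]])
        elif op == "st":                     # read result from QH into memory
            env[arg[0]] = q.popleft()
        else:                                # binary op: two reads, one write
            a, b = q.popleft(), q.popleft()
            q.append(binops[op](a, b))
    return env

env = run_queue_program(
    ["ld a", "ld b", "ld c", "ld d", "add", "sub", "div", "st x"],
    {"a": 6, "b": 2, "c": 5, "d": 1})
print(env["x"])   # (6+2)/(5-1) = 2.0
```

Tracing the queue contents reproduces Figure 1(c): after the loads it holds [6, 2, 5, 1]; add leaves [5, 1, 8]; sub leaves [8, 4]; div leaves [2.0]; st drains it.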
Queue Compiler

The queue compiler not only translates high-level languages into machine code but also optimizes programs for smaller execution time, higher parallelism, better resource utilization, and smaller code size, among other goals. Compiling for queue machines is different from compiling for register computers, and this is reflected in the structure of the queue compiler. Since a queue program is built around a queue rather than registers, novel code generation algorithms have been developed [2] to handle this characteristic. The compiler is a critical part of a queue computer system, as it schedules programs to satisfy the constraints of the hardware and achieve the highest performance.

Queue Virtual Machine

Virtual machines offer great flexibility at reasonable development cost for testing and validating a new computer system. Our queue virtual machine emulates a queue processor and interprets the programs generated by the queue compiler. The virtual machine can easily be modified to emulate different queue processors and to test new architectural features. It also helps refine the compiler, since it can execute programs, and executed programs have different characteristics than compiled programs.

Simulator

Simulators also emulate a queue processor and execute code generated by the compiler; the difference from the queue virtual machine is the level of detail at which execution is modeled. Our simulators emulate the pipeline of the queue processor and perform detailed timing of the execution. Pipeline hazards, stalls, instruction latencies, and other microarchitectural features are simulated. Furthermore, the cycle-accurate simulators can be configured to simulate a parallel queue processor that executes instructions simultaneously.
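The parallel execution that the cycle-accurate simulators model can be caricatured in a few lines: since the level-order schedule groups data-independent instructions, an idealized PQP with sufficient issue width can issue one whole level per cycle. This is a deliberately simplified sketch under that assumption, ignoring latencies, hazards, and finite issue width.

```python
def issue_schedule(leveled_program):
    """Idealized parallel issue: all instructions of one level per cycle.
    Returns (cycles, ILP) where ILP = instructions / cycles."""
    total = sum(len(level) for level in leveled_program)
    for cycle, level in enumerate(leveled_program):
        print(f"cycle {cycle}: issue {', '.join(level)}")
    return len(leveled_program), total / len(leveled_program)

# Levels of the Figure 1 program: loads, then add/sub, then div, then st.
levels = [["ld a", "ld b", "ld c", "ld d"],
          ["add", "sub"],
          ["div"],
          ["st x"]]
cycles, ilp = issue_schedule(levels)
print(cycles, "cycles, ILP =", ilp)   # 4 cycles, ILP = 2.0
```

Eight instructions complete in four cycles, whereas a strictly serial queue processor would need eight; the ratio is exactly the compile-time ILP reported later in Figure 4.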
Assembler, Linker, Loader

When targeting an actual queue processor, the generic assembly code generated by the compiler must be translated into binary code by the assembler. Having an assembler allows us to easily retarget a program to a new queue processor without changing the compiler. The functions of the linker and loader are no different than in conventional systems: the former resolves symbols and combines different objects into a single executable, and the latter prepares the code for execution under the operating system.

Operating System

This is the only part of the toolchain for which we have not built a prototype. Until now, we have concentrated our efforts on developing the tools needed to test the fundamental ideas of queue computing. As it is possible for us to run queue programs on the PQP hardware directly, the need for an operating system has been a secondary priority. However, our future work includes the development of an operating system to manage the PQP processor.

PQP Hardware

Several queue processors have been designed in hardware description languages, realized as RTL designs, and implemented on FPGA chips. The hardware has gradually evolved from a serial queue processor to a parallel queue processor (PQP). We have also developed hybrid architectures that combine register, stack, and queue in a single processor. Our latest research includes a multi-dimensional queue processor in which several queues are employed for data processing. Queue processors are characterized by hardware simplicity, low power consumption, and high performance.
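The code-size advantage of operand-free instructions is easy to see with a toy assembler. The encoding below is entirely hypothetical (it is not the PQP instruction format, which is not specified here): zero-operand arithmetic instructions take one byte, and ld/st take an opcode byte plus a one-byte memory offset, compared against four bytes per instruction on a typical 32-bit RISC.

```python
# Hypothetical compact encoding for illustration only.
OPCODES = {"add": 0x01, "sub": 0x02, "div": 0x03,
           "ld": 0x10, "st": 0x11}
SYMBOLS = {"a": 0, "b": 1, "c": 2, "d": 3, "x": 4}   # assumed symbol table

def assemble(program):
    """Translate generic queue assembly into a compact binary image."""
    code = bytearray()
    for insn in program:
        op, *arg = insn.split()
        code.append(OPCODES[op])
        if arg:                        # only ld/st carry an address offset
            code.append(SYMBOLS[arg[0]])
    return bytes(code)

prog = ["ld a", "ld b", "ld c", "ld d", "add", "sub", "div", "st x"]
binary = assemble(prog)
print(len(binary), "bytes vs", 4 * len(prog), "bytes on a 32-bit RISC")
# 13 bytes vs 32 bytes
```

Even this crude scheme shrinks the sample program by more than half, which is the mechanism behind the code-size reductions reported in the evaluation.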
4 Evaluation

In this section we give the experimental results obtained during the development of the PQP processor. Figure 3 shows the normalized code size of the SPEC CINT95 benchmarks [4] for the PQP processor (baseline) and the MIPS I processor [5]. From this graph we observe that the absence of operands in the queue instruction set yields small code sizes. For these benchmarks, PQP programs are from 27% to 47% smaller than MIPS programs.

[Figure 3: Normalized code size of the SPEC CINT95 benchmarks (go, 124.m88ksim, 126.gcc, 129.compress, 130.li, 132.ijpeg, 134.perl, 147.vortex) for MIPS relative to PQP; the MIPS ratios reach 1.88.]

Figure 4 shows the instruction-level parallelism (ILP) of embedded benchmarks, extracted at compile time, for MIPS programs and for PQP programs. Notice that the PQP programs have, on average, 13% more parallelism than optimized MIPS programs. This is because queue programs are not limited by false dependencies and the level-order scheduling exposes all of the application's parallelism. On the other hand, although MIPS programs are heavily optimized, the limited number of architected registers introduces false dependencies that limit the available parallelism.

[Figure 4: Instruction-level parallelism (ILP) of embedded benchmarks (H.263, MPEG2, FFT, Susan, Rijndael, Sha, Blowfish, Dijkstra, Patricia, Adpcm) for MIPS I (-O3) and PQP; PQP ILP reaches 3.75.]

5 Conclusion

As discussed, queue processors offer a good alternative for modern computer systems. Modern applications constantly demand higher and higher performance at low power consumption and low cost. The simplicity and characteristics of queue-based processors make them ideal candidates for such demands. In this paper we introduced the principles of queue computing and presented an overview of our current research on the PQP processor. We have developed a custom toolchain that includes compilers, virtual machines, cycle-accurate simulators, assemblers, linkers, loaders, operating system, libraries, and several RTL processors.
This research has contributed novel techniques for developing queue machines, and the results have shown that queue processors can produce programs up to 47% smaller and applications 13% more parallel than modern RISC processors.

References

[1] D. Burger, S. W. Keckler, K. S. McKinley, M. Dahlin, L. K. John, C. Lin, C. R. Moore, J. Burrill, R. G. McDonald, W. Yoder, and the TRIPS Team. Scaling to the end of silicon with EDGE architectures. Computer, 37(7):44-55, 2004.
[2] A. Canedo, B. Abderazek, and M. Sowa. A New Code Generation Algorithm for 2-offset Producer Order Queue Computation Model. Journal of Computer Languages, Systems & Structures, 34(4), June 2008.
[3] A. Canedo, B. Abderazek, and M. Sowa. Compiler Support for Code Size Reduction using a Queue-based Processor. Transactions on High-Performance Embedded Architectures and Compilers, 2(4).
[4] J. J. Dujmovic and I. Dujmovic. Evolution and evaluation of SPEC benchmarks. ACM SIGMETRICS Performance Evaluation Review, 26(3):2-9, December 1998.
[5] G. Kane and J. Heinrich. MIPS RISC Architecture. Prentice Hall, 1992.
[6] B. Preiss and C. Hamacher. Data Flow on Queue Machines. In 12th International IEEE Symposium on Computer Architecture, 1985.
[7] H. Schmit, B. Levine, and B. Ylvisaker. Queue Machines: Hardware Compilation in Hardware. In 10th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, page 152, 2002.
[8] M. Sowa, B. Abderazek, and T. Yoshinaga. Parallel Queue Processor Architecture Based on Produced Order Computation Model. Journal of Supercomputing, 32(3), June 2005.
[9] S. Swanson, K. Michelson, A. Schwerin, and M. Oskin. WaveScalar. In Proceedings of the 36th International Symposium on Microarchitecture (MICRO-36), 2003.
More informationPerformance Oriented Management System for Reconfigurable Network Appliances
Performance Oriented Management System for Reconfigurable Network Appliances Hiroki Matsutani, Ryuji Wakikawa, Koshiro Mitsuya and Jun Murai Faculty of Environmental Information, Keio University Graduate
More informationWeek 1 out-of-class notes, discussions and sample problems
Week 1 out-of-class notes, discussions and sample problems Although we will primarily concentrate on RISC processors as found in some desktop/laptop computers, here we take a look at the varying types
More informationPhotonic Networks for Data Centres and High Performance Computing
Photonic Networks for Data Centres and High Performance Computing Philip Watts Department of Electronic Engineering, UCL Yury Audzevich, Nick Barrow-Williams, Robert Mullins, Simon Moore, Andrew Moore
More informationChapter 12. Development Tools for Microcontroller Applications
Chapter 12 Development Tools for Microcontroller Applications Lesson 01 Software Development Process and Development Tools Step 1: Development Phases Analysis Design Implementation Phase 1 Phase 2 Phase
More informationMore on Pipelining and Pipelines in Real Machines CS 333 Fall 2006 Main Ideas Data Hazards RAW WAR WAW More pipeline stall reduction techniques Branch prediction» static» dynamic bimodal branch prediction
More informationExecution Cycle. Pipelining. IF and ID Stages. Simple MIPS Instruction Formats
Execution Cycle Pipelining CSE 410, Spring 2005 Computer Systems http://www.cs.washington.edu/410 1. Instruction Fetch 2. Instruction Decode 3. Execute 4. Memory 5. Write Back IF and ID Stages 1. Instruction
More informationCentral Processing Unit (CPU)
Central Processing Unit (CPU) CPU is the heart and brain It interprets and executes machine level instructions Controls data transfer from/to Main Memory (MM) and CPU Detects any errors In the following
More informationChapter 01: Introduction. Lesson 02 Evolution of Computers Part 2 First generation Computers
Chapter 01: Introduction Lesson 02 Evolution of Computers Part 2 First generation Computers Objective Understand how electronic computers evolved during the first generation of computers First Generation
More informationQ. Consider a dynamic instruction execution (an execution trace, in other words) that consists of repeats of code in this pattern:
Pipelining HW Q. Can a MIPS SW instruction executing in a simple 5-stage pipelined implementation have a data dependency hazard of any type resulting in a nop bubble? If so, show an example; if not, prove
More informationThe Central Processing Unit:
The Central Processing Unit: What Goes on Inside the Computer Chapter 4 Objectives Identify the components of the central processing unit and how they work together and interact with memory Describe how
More informationComponent visualization methods for large legacy software in C/C++
Annales Mathematicae et Informaticae 44 (2015) pp. 23 33 http://ami.ektf.hu Component visualization methods for large legacy software in C/C++ Máté Cserép a, Dániel Krupp b a Eötvös Loránd University mcserep@caesar.elte.hu
More informationESE566 REPORT3. Design Methodologies for Core-based System-on-Chip HUA TANG OVIDIU CARNU
ESE566 REPORT3 Design Methodologies for Core-based System-on-Chip HUA TANG OVIDIU CARNU Nov 19th, 2002 ABSTRACT: In this report, we discuss several recent published papers on design methodologies of core-based
More informationXeon+FPGA Platform for the Data Center
Xeon+FPGA Platform for the Data Center ISCA/CARL 2015 PK Gupta, Director of Cloud Platform Technology, DCG/CPG Overview Data Center and Workloads Xeon+FPGA Accelerator Platform Applications and Eco-system
More informationChapter 1. Dr. Chris Irwin Davis Email: cid021000@utdallas.edu Phone: (972) 883-3574 Office: ECSS 4.705. CS-4337 Organization of Programming Languages
Chapter 1 CS-4337 Organization of Programming Languages Dr. Chris Irwin Davis Email: cid021000@utdallas.edu Phone: (972) 883-3574 Office: ECSS 4.705 Chapter 1 Topics Reasons for Studying Concepts of Programming
More informationInnovative improvement of fundamental metrics including power dissipation and efficiency of the ALU system
Innovative improvement of fundamental metrics including power dissipation and efficiency of the ALU system Joseph LaBauve Department of Electrical and Computer Engineering University of Central Florida
More informationRapid System Prototyping with FPGAs
Rapid System Prototyping with FPGAs By R.C. Coferand Benjamin F. Harding AMSTERDAM BOSTON HEIDELBERG LONDON NEW YORK OXFORD PARIS SAN DIEGO SAN FRANCISCO SINGAPORE SYDNEY TOKYO Newnes is an imprint of
More information1/20/2016 INTRODUCTION
INTRODUCTION 1 Programming languages have common concepts that are seen in all languages This course will discuss and illustrate these common concepts: Syntax Names Types Semantics Memory Management We
More informationHigh Performance or Cycle Accuracy?
CHIP DESIGN High Performance or Cycle Accuracy? You can have both! Bill Neifert, Carbon Design Systems Rob Kaye, ARM ATC-100 AGENDA Modelling 101 & Programmer s View (PV) Models Cycle Accurate Models Bringing
More informationTDT 4260 lecture 11 spring semester 2013. Interconnection network continued
1 TDT 4260 lecture 11 spring semester 2013 Lasse Natvig, The CARD group Dept. of computer & information science NTNU 2 Lecture overview Interconnection network continued Routing Switch microarchitecture
More informationIntroduction to RISC Processor. ni logic Pvt. Ltd., Pune
Introduction to RISC Processor ni logic Pvt. Ltd., Pune AGENDA What is RISC & its History What is meant by RISC Architecture of MIPS-R4000 Processor Difference Between RISC and CISC Pros and Cons of RISC
More informationCISC, RISC, and DSP Microprocessors
CISC, RISC, and DSP Microprocessors Douglas L. Jones ECE 497 Spring 2000 4/6/00 CISC, RISC, and DSP D.L. Jones 1 Outline Microprocessors circa 1984 RISC vs. CISC Microprocessors circa 1999 Perspective:
More informationParallelization: Binary Tree Traversal
By Aaron Weeden and Patrick Royal Shodor Education Foundation, Inc. August 2012 Introduction: According to Moore s law, the number of transistors on a computer chip doubles roughly every two years. First
More informationIBM Deep Computing Visualization Offering
P - 271 IBM Deep Computing Visualization Offering Parijat Sharma, Infrastructure Solution Architect, IBM India Pvt Ltd. email: parijatsharma@in.ibm.com Summary Deep Computing Visualization in Oil & Gas
More informationThe Fastest Way to Parallel Programming for Multicore, Clusters, Supercomputers and the Cloud.
White Paper 021313-3 Page 1 : A Software Framework for Parallel Programming* The Fastest Way to Parallel Programming for Multicore, Clusters, Supercomputers and the Cloud. ABSTRACT Programming for Multicore,
More informationTopics. Introduction. Java History CS 146. Introduction to Programming and Algorithms Module 1. Module Objectives
Introduction to Programming and Algorithms Module 1 CS 146 Sam Houston State University Dr. Tim McGuire Module Objectives To understand: the necessity of programming, differences between hardware and software,
More informationEMC XtremSF: Delivering Next Generation Performance for Oracle Database
White Paper EMC XtremSF: Delivering Next Generation Performance for Oracle Database Abstract This white paper addresses the challenges currently facing business executives to store and process the growing
More informationIndex Terms Domain name, Firewall, Packet, Phishing, URL.
BDD for Implementation of Packet Filter Firewall and Detecting Phishing Websites Naresh Shende Vidyalankar Institute of Technology Prof. S. K. Shinde Lokmanya Tilak College of Engineering Abstract Packet
More informationOn Demand Loading of Code in MMUless Embedded System
On Demand Loading of Code in MMUless Embedded System Sunil R Gandhi *. Chetan D Pachange, Jr.** Mandar R Vaidya***, Swapnilkumar S Khorate**** *Pune Institute of Computer Technology, Pune INDIA (Mob- 8600867094;
More informationMaking Multicore Work and Measuring its Benefits. Markus Levy, president EEMBC and Multicore Association
Making Multicore Work and Measuring its Benefits Markus Levy, president EEMBC and Multicore Association Agenda Why Multicore? Standards and issues in the multicore community What is Multicore Association?
More informationOutline. Introduction. Multiprocessor Systems on Chip. A MPSoC Example: Nexperia DVP. A New Paradigm: Network on Chip
Outline Modeling, simulation and optimization of Multi-Processor SoCs (MPSoCs) Università of Verona Dipartimento di Informatica MPSoCs: Multi-Processor Systems on Chip A simulation platform for a MPSoC
More informationComputer Organization
Computer Organization and Architecture Designing for Performance Ninth Edition William Stallings International Edition contributions by R. Mohan National Institute of Technology, Tiruchirappalli PEARSON
More informationSPARC64 VIIIfx: CPU for the K computer
SPARC64 VIIIfx: CPU for the K computer Toshio Yoshida Mikio Hondo Ryuji Kan Go Sugizaki SPARC64 VIIIfx, which was developed as a processor for the K computer, uses Fujitsu Semiconductor Ltd. s 45-nm CMOS
More informationMONITORING power consumption of a microprocessor
IEEE TRANSACTIONS ON CIRCUIT AND SYSTEMS-II, VOL. X, NO. Y, JANUARY XXXX 1 A Study on the use of Performance Counters to Estimate Power in Microprocessors Rance Rodrigues, Member, IEEE, Arunachalam Annamalai,
More informationInstruction scheduling
Instruction ordering Instruction scheduling Advanced Compiler Construction Michel Schinz 2015 05 21 When a compiler emits the instructions corresponding to a program, it imposes a total order on them.
More informationDriving force. What future software needs. Potential research topics
Improving Software Robustness and Efficiency Driving force Processor core clock speed reach practical limit ~4GHz (power issue) Percentage of sustainable # of active transistors decrease; Increase in #
More informationEmbedded Software development Process and Tools:
Embedded Software development Process and Tools: Lesson-2 Integrated Development Environment (IDE) 1 1. IDE 2 Consists of Simulators editors, compilers, assemblers, etc., IDE 3 emulators logic analyzers
More informationTYPES OF COMPUTERS AND THEIR PARTS MULTIPLE CHOICE QUESTIONS
MULTIPLE CHOICE QUESTIONS 1. What is a computer? a. A programmable electronic device that processes data via instructions to output information for future use. b. Raw facts and figures that has no meaning
More informationPROBLEMS #20,R0,R1 #$3A,R2,R4
506 CHAPTER 8 PIPELINING (Corrisponde al cap. 11 - Introduzione al pipelining) PROBLEMS 8.1 Consider the following sequence of instructions Mul And #20,R0,R1 #3,R2,R3 #$3A,R2,R4 R0,R2,R5 In all instructions,
More informationpicojava TM : A Hardware Implementation of the Java Virtual Machine
picojava TM : A Hardware Implementation of the Java Virtual Machine Marc Tremblay and Michael O Connor Sun Microelectronics Slide 1 The Java picojava Synergy Java s origins lie in improving the consumer
More informationGEDAE TM - A Graphical Programming and Autocode Generation Tool for Signal Processor Applications
GEDAE TM - A Graphical Programming and Autocode Generation Tool for Signal Processor Applications Harris Z. Zebrowitz Lockheed Martin Advanced Technology Laboratories 1 Federal Street Camden, NJ 08102
More informationSoftware Pipelining. for (i=1, i<100, i++) { x := A[i]; x := x+1; A[i] := x
Software Pipelining for (i=1, i
More informationIA-64 Application Developer s Architecture Guide
IA-64 Application Developer s Architecture Guide The IA-64 architecture was designed to overcome the performance limitations of today s architectures and provide maximum headroom for the future. To achieve
More information