OpenTransputer: reinventing a parallel machine from the past


Dissertation Type: Enterprise

DEPARTMENT OF COMPUTER SCIENCE

OpenTransputer: reinventing a parallel machine from the past

David Keller, Andres Amaya Garcia

A dissertation submitted to the University of Bristol in accordance with the requirements of the degree of Master of Engineering in the Faculty of Engineering.

Thursday 25th June, 2015


Declaration

This dissertation is submitted to the University of Bristol in accordance with the requirements of the degree of MEng in the Faculty of Engineering. It has not been submitted for any other degree or diploma of any examining body. Except where specifically acknowledged, it is all the work of the authors.

David Keller, Andres Amaya Garcia
Thursday 25th June, 2015


Contents

1 Contextual Background
    Project Context
    Why Re-implement the Transputer?
    Overview of Computer Architecture
    Project Aims and Objectives
2 Technical Background
    Occam
    Transputer Architecture
    Transputer Microarchitecture
    Transputer Versions and Enhancements
3 Project Execution
    The OpenTransputer CPU
    External Communication
    I/O Interface
    The OpenTransputer System
    Digital Design with Hardware Description Languages (HDL)
4 Critical Evaluation
    Evaluating Design Decisions
    Design Verification
    Synthesis results
5 Conclusion
    Current Project Status
    Future Work
    Final Conclusions
A Transputer Instruction Set


List of Figures

1.1 A six level computer as described in [31]. Below each level the way the abstraction is implemented is indicated (including the program responsible for it in parentheses).
1.2 Instruction sizes for the four different machines; the zero-address machine has the highest code density as each instruction only takes up 8 bits. It should be noted that destination and source operands are addresses in memory.
2.1 Flow of events in sample Occam parallel program.
2.2 Transputer evaluation stack [18].
2.3 Transputer instruction format [18].
2.4 The Transputer's hardware-implemented process scheduler [18].
Conditional execution of microcodes in the Simple 42 Transputer [2].
Next microinstruction state address generation.
Stages of the microassembling process.
Example datapath implementation using three buses.
Example implementation using a wide datapath approach.
OpenTransputer's simplified datapath schematic.
Integration of the three main components of the OpenTransputer CPU.
Signal timing for correct CPU behaviour.
Example 8x8 Beneš network.
Folded-over Beneš network with capacity for 8 OpenTransputers.
Bit fields of the channel addresses for internal and external communication and I/O pins.
Example route of a package through a Beneš network of OpenTransputers.
Internal connections of 4 input and 4 output switches.
Data exchange protocol between two components of a network of OpenTransputers.
Packet formats for external communication in the Inmos Transputer and the OpenTransputer.
Interaction between input and output controllers to transfer a message over the network [16].
Major components of the OpenTransputer system.


List of Tables

1.1 High-level overview of three different architectures.
1.2 A comparison of how different architectures encode a simple add instruction.
1.3 A comparison of the assembly code of different architectures for the same high-level statement. It should be noted that all operands are addresses in memory. [9]
2.1 Occam primitive processes [14].
2.2 Occam constructs [14].
Execution time (in clock cycles) of primary instructions in both the Inmos Transputer [15] and the OpenTransputer.
Execution time (in clock cycles) of secondary instructions in both the Inmos Transputer [15] and the OpenTransputer; w is the number of 32-bit words to be moved.
FPGA resources used by the design. The first row shows how many resources are consumed in total. The bottom three rows show how many of these resources the two cores and the switch use individually.
FPGA resources used by a single core. The bottom three rows show the number of LUTs consumed by the individual components of the core.
FPGA resources used by the major components within the datapath.
Comparison of chip area and manufacturing process of the OpenTransputer and the original Transputer after synthesis.
A.1 Primary instructions [28].
A.2 Implemented secondary instructions [28] in the OpenTransputer.
A.3 Unimplemented secondary instructions [28] in the OpenTransputer.


List of Listings

2.1 Occam purely sequential program.
2.2 Occam parallel program.
Simple 42 microcode for bitwise AND operation.
OpenTransputer microinstruction for state GT (greater than instruction).
OpenTransputer microinstruction for state CCNT.
Assembly program for a register machine.
Assembly program using the Transputer Instruction Set Architecture (ISA).


Executive Summary

We have developed the OpenTransputer, a new microprocessor based on the Transputer architecture designed by Inmos in the 1980s. We believe that the features of the Transputer architecture, such as inbuilt communication and concurrency management, make it an interesting proposition for the emerging Internet of Things (IoT) market, where a small, general-purpose processor could serve as a building block for a broad range of networked applications. The OpenTransputer modernises many aspects of the original Transputer, featuring a new microarchitecture, a new external communication mechanism and an I/O interface. The following is a list of our achievements during the project:

- Developed a functional simulator in C of the Transputer architecture, which was used to evaluate the trade-offs of different design approaches before they were implemented in Verilog. During the late stages of the project, the functional simulator was used as a reference model to verify the behavioural correctness of our design.

- Developed a new implementation of the Transputer architecture in Verilog: the OpenTransputer. The new design introduces significant changes to the microarchitecture that take advantage of state-of-the-art manufacturing technologies.

- Replaced the routing mechanism for inter-process communication of the Inmos Transputer with a switch-based network that distributes data packets among the OpenTransputer nodes. The switches form a rearrangeably non-blocking Beneš network, which greatly improves the performance and usability of the processor as a building block to assemble any kind of system.

- Replaced the existing input and output link controllers for external communication of the 1980s Transputer with a more scalable implementation that uses the idea of virtual channels.

- Designed and implemented an I/O interface that builds on the existing channel communication functionality of the processor and can be used to connect external hardware devices.


Supporting Technologies

The following is a list of third-party software and hardware tools, libraries and components used during development of the OpenTransputer project.

- We used the D7202 Inmos Occam 2 Toolset compiler sources developed by Inmos. The compiler was used to generate assembly code from Occam programs. The output would then be translated into machine code using an assembler script written by us. The compiler's C source code is publicly accessible, yet the program is difficult to compile using modern versions of the GNU C Compiler (GCC). We introduced minor changes to the C source and corrected syntax mistakes in existing makefiles before we were able to generate an ELF executable that could be run on modern x86 machines.

- We used PyYAML to implement an assembler that parses configuration information associated with an Occam program that is meant to execute on a network of OpenTransputers. PyYAML is a Python library for parsing and emitting YAML, a human-readable serialisation language.

- We used Red Hat's hosting platform OpenShift to make the OpenTransputer website accessible. The OpenTransputer website was implemented using the WordPress software. Furthermore, we used the HTML5/CSS3-based theme onetone to produce the graphical interface of the website.

- We used the ZedBoard XC7Z020-CLG484 FPGA supplied by the Computer Science Department and programmed it with the Verilog description of the OpenTransputer to develop a demo application.

- The Xilinx Vivado Design Suite was used to simulate the Verilog description of the OpenTransputer and to program the FPGA mentioned above.

- We used the Distributed Memory Generator v8.0 from the Xilinx Vivado Intellectual Property (IP) catalogue to create a single RAM and three ROM structures used within the OpenTransputer design.

- To estimate the area of a hypothetical OpenTransputer chip, we used Synopsys Design Vision version G SP5 to synthesise the design for a silicon target. Also, we used the UMC synthesis library for a 180nm process.


Notation and Acronyms

The following is a list of acronyms used throughout this document.

ROM: Read-Only Memory
RAM: Random Access Memory
ISA: Instruction Set Architecture
HDL: Hardware Description Language
DMA: Direct Memory Access
FPGA: Field-Programmable Gate Array
CPU: Central Processing Unit
PL: Programmable Logic
Iptr: Instruction Pointer
Wptr: Workspace Pointer
I/O: Input/Output
IP: Intellectual Property
MUX: Multiplexer
DEMUX: Demultiplexer
RTL: Register-Transfer Level
RTE: Route Towards Edge
RTC: Route Towards Core
ALU: Arithmetic and Logic Unit
AU: Arithmetic Unit
LU: Logic Unit
LUT: Look-Up Table
CSP: Communicating Sequential Processes
CISC: Complex Instruction Set Computing
RISC: Reduced Instruction Set Computing
MISC: Minimal Instruction Set Computing
FSM: Finite State Machine
RTOS: Real-Time Operating System
IoT: Internet of Things
ILP: Instruction-Level Parallelism
LIFO: Last-In-First-Out


Acknowledgements

First and foremost, we would like to express our gratitude to our supervisor, Prof. David May, for his valuable guidance and support throughout the project. Without his patient explanations about the Inmos Transputer and his advice, our dissertation would not have been completed. Furthermore, we would like to thank Roger Shepherd for his early input and advice on how to develop the OpenTransputer. Thanks are also owed to Dr. Simon Hollis for his advice on digital design using the Vivado Design Suite and to Fred Barnes for his input on the compiler and configuration system of the Transputer. We would also like to thank Richard Grafton and the University of Bristol Computer Science Department, who supplied us with the FPGA used throughout the project. Finally, we thank Iman Malik for her help with the design of the OpenTransputer logo.


Chapter 1

Contextual Background

1.1 Project Context

The OpenTransputer is a re-implementation of the Transputer, a pioneering microprocessor architecture first released in the 1980s [3]. The original Transputer was considered revolutionary at its time for its integrated memory and serial communication links intended for parallel computing. Including memory and external links on the same chip made the Transputer essentially a computer on a chip. This was supposed to allow information systems to be designed at a higher level, with the Transputer functioning as a building block for parallel computing networks. Over the last few years, with the shift to cloud computing, there has been a trend in the technology world towards building large clusters of powerful computers that serve data to an ever-growing number of client devices, which themselves only feature tiny and low-powered processors. These currently include mobile phones and tablets, but will soon also comprise every other device that connects to the internet, ranging from washing machines to cars [11]. We think that the Transputer and its unique feature set make it an excellent processor for this emerging Internet of Things (IoT) market, specifically the connected homes and wearables markets. As such, the OpenTransputer project aims to modernise the Transputer microarchitecture and introduce a more scalable network for communication based on switches and an easy-to-use I/O interface. In the following, we will give a short overview of the Transputer and the recent developments in the field of computer architecture. This will help explain and give background on the design decisions made in the original Transputer, which are detailed in Chapter 2 of this document, and also motivate the changes we have introduced in the OpenTransputer, which are explained in Chapter 3.

1.2 Why Re-implement the Transputer?
The Transputer, while revolutionary at its time for being a computer on a single chip comprising processor, memory and communication links, did not receive the attention it deserved. However, it found its way into spacecraft [1], satellites [32], set-top boxes and supercomputers [3, 12]. The Transputer has in-built support for concurrency through message passing, which is used for both off- and on-chip communication between processes, and a single Transputer can maintain and schedule multiple processes. This makes it both a multitasking and multiprocessing platform and provides a useful abstraction, since to the programmer communication operations are used in exactly the same way regardless of where the sender and receiver processes are running. In other architectures, approaches used

for inter-process and inter-processor communication often vary and may require programmers to know about the intricacies of the operating system in the former case or the processor in the latter. With its own simple Real-Time Operating System (RTOS) implemented directly in hardware, the Transputer can not only perform context switches in a fraction of the time traditional platforms take, but it also simplifies the way programmers interact with the processor. That is, there is no need to know the low-level implementation details of the processor, which is often the case with other platforms. Instead, a programmer can simply write code in the high-level language Occam, which exposes constructs to explicitly describe concurrent processes and their communication patterns [18]. Its ability to serve as either a microcontroller or a small building block for a parallel network makes it an attractive starting point to build upon for the IoT market. In this technological landscape, we foresee small chips being the dominant contenders due to their size, low cost and small energy footprint. On the other hand, this market is also defined by a need for communication between devices. The Transputer fits both descriptions and provides further useful features, as detailed above.

1.3 Overview of Computer Architecture

The study of computer architecture is the study of the organisation and interconnection of components of computer systems [29]. A computer can be seen as a hierarchy of abstractions as shown in Figure 1.1, each level performing a certain function. These levels are the digital logic level, the microarchitecture level, the Instruction Set Architecture (ISA) level, the operating system machine level and the assembly language level. Each level builds on the one below it. "Processor" and "computer" are somewhat synonymous in this context; essentially they refer to a device that performs some sort of computation.
Nowadays, "processor" refers to the Central Processing Unit (CPU), which is usually part of a microchip and hence also known as a microprocessor. The CPU is a collection of logic tasked with executing programs that generate results for the user. In this section we want to briefly introduce how a CPU works, mostly ignoring the details of the implementation in digital logic. Instead we aim to focus on the concepts and mechanisms that dictate how a CPU will execute programs. We will give an overview of microarchitecture, the design of the physical hardware underlying the processor, and instruction set design, which refers to the interface to the hardware offered to the programmer. We will focus on basic processor design and omit issues in advanced computer architecture, such as Instruction-Level Parallelism (ILP), pipelining, superscalar and vector processing, as these are not necessary to understand the rest of this document. The inclined reader is advised to take a look at the material referenced, as it contains more information on these and other issues. On the microarchitecture level, we think of computers as constructed from basic building blocks such as memories, arithmetic units and buses. The functional behaviour of these building blocks is similar across most machines. The differences between computers lie in how the modules they are made up of are connected together, in the performance of these modules and in the way the entire computer is controlled by programs.

Instruction Set Architecture (ISA)

The instruction set of a processor is a common abstraction used in the literature to describe the way a processor's internals work. It abstracts away the details of the underlying implementation, instead providing an explanation of the high-level logical processor. Specifying a processor at this level means a logical processor with a certain ISA can have several different physical implementations or microarchitectures that ultimately work the same way but have

Level 5: Problem-oriented language level
         | Translation (compiler)
Level 4: Assembly language level
         | Translation (assembler)
Level 3: Operating system machine level
         | Partial interpretation (operating system)
Level 2: Instruction set architecture level
         | Interpretation (micro-program) or direct execution
Level 1: Micro-architecture level
         | Hardware
Level 0: Digital logic level

Figure 1.1: A six level computer as described in [31]. Below each level the way the abstraction is implemented is indicated (including the program responsible for it in parentheses).

Instruction set   Type                Design   Bits       Registers
ARM               Register-register   RISC     32/64      16/32
Transputer        Stack machine       MISC     16/32      -
x86               Register-memory     CISC     16/32/64   6/8/16

Table 1.1: High-level overview of three different architectures.

different performance characteristics and prices. The ISA level is a very useful abstraction for programmers, providing a common platform on which to execute programs. It means a program can be written in various high-level programming languages, which are then translated (compiled) into the ISA-level machine language the processor can actually understand. This machine language is commonly referred to as assembly language or assembly code. Most ISAs used in processors today can be classified as following either a Reduced Instruction Set Computing (RISC) or Complex Instruction Set Computing (CISC) design philosophy [9]. The former refers to instruction sets comprised of very simple instructions, each taking very few clock cycles to execute. In contrast, CISC designs typically include more complex instructions that take longer to execute, but perform more operations per instruction. Table 1.1 shows three different architectures: the Intel IA-32 architecture, commonly referred to as i386, is an example of CISC, while the ARM is an example of RISC. The Transputer is sometimes used as an example of a Minimal Instruction Set Computer (MISC).
MISC computers commonly have a very small number of basic operations and are typically stack-based. In reality, the Transputer falls somewhere in the middle of all three philosophies: a somewhat small and focused feature set would place it in the RISC camp, but its microcode design and

some of its more complex instructions, which schedule processes or move several memory words at once, resemble a CISC design.

Processor Design Issues

Having defined the concepts and established the level at which we want to lead the discussion, the issue we want to elaborate on now is the different options a processor designer has when implementing a processor at this level.

Number of Addresses

Architecture                        Instruction
Three-address                       add dest,src1,src2
Two-address                         add dest,src
One-address (accumulator machine)   add addr
Zero-address (stack machine)        add

Table 1.2: A comparison of how different architectures encode a simple add instruction.

A basic characteristic that has an implication on a processor's architecture is the number of addresses used in an instruction, as can be seen in Table 1.2. Most operations performed by an instruction are either binary or unary. A binary operation such as addition or multiplication requires two input operands, whereas a unary operation only has a single operand. Typically, an operation produces a single output result. This means a total of three addresses need to be encoded in the instruction: two addresses to specify the input operands and one additional address to specify the output result. Many processors specify all three addresses explicitly in the instruction. In some architectures, such as the Intel IA-32, a two-address format is used where one of the input addresses serves as both source and destination address. It is possible to construct architectures in such a way that instructions have only one or zero addresses explicitly encoded. The former are called accumulator machines, while the latter are referred to as stack machines. The Transputer and the OpenTransputer are stack machines.
A comparison of the assembly code, and, by extension, the machine code generated from a simple statement written in a high-level language such as C will help to discuss the benefits and disadvantages of each type of architecture. It should be noted that in this example, all four hypothetical machines expect operands to be addresses in memory. An operand is then just an offset to some base register or stack pointer. Consider the simple C statement:

A = B + C * D - E + F + A

Depending on the type of architecture, it will be translated into one of the pieces of assembly code in Table 1.3. RISC processors are usually three-address machines, meaning they define all three addresses explicitly in the instruction. As can be seen from the table, such an architecture needs the fewest instructions to execute the high-level statement. In the first column, the one depicting the assembly code for the three-address machine, we can identify T as both a source and result operand. This simple observation serves as a basic justification for two-address architectures: if most instructions use one operand as both source and destination address, then only encoding two addresses in an instruction does not lose much flexibility and only makes the generated assembly code a little more complex, while allowing for shorter instructions. As can be seen in the second column, a two-address machine uses slightly more instructions to execute the statement, which is due to the fact that the value of C is loaded into T in the first instruction.

Three-address   Two-address   One-address   Zero-address
mult T,C,D      load T,C      load C        push E
add T,T,B       mult T,D      mult D        push C
sub T,T,E       add T,B       add B         push D
add T,T,F       sub T,E       sub E         mult
add A,T,A       add T,F       add F         push B
                add A,T       add A         add
                              store A       sub
                                            push F
                                            add
                                            push A
                                            add
                                            pop A

Table 1.3: A comparison of the assembly code of different architectures for the same high-level statement. It should be noted that all operands are addresses in memory. [9]

Examining the second column of Table 1.3, we realise that all instructions use the same operand, T. Making this the default forms the basis for a one-address machine, one where the destination register, or accumulator, is implicit in the instruction. This kind of architecture is often used in environments where memory is constrained. Finally, in a zero-address machine all operands are implicit to the instruction and hence assumed to be at default locations. A stack is used to obtain the source input operands, and the result is written back onto the stack. A stack is a Last-In-First-Out (LIFO) data structure, supported and used by most processors, even non-zero-address ones. All operations in a stack machine assume that the top two values of the stack are the input operands. Results are placed (pushed) on the top of the stack. Notice that the pseudo-assembly instructions push and pop are an exception, as they take an address as operand.

Comparing Different Architectures

Each of the four address schemes in Table 1.3 has advantages and disadvantages. Just by a quick glance at the table it can be noted that the number of instructions needed increases as we go from the left-hand side to the right-hand side of the table, i.e. as the number of addresses encoded in an instruction decreases. A possible performance metric is the number of memory accesses. The lower this number, the faster we assume the processor is.
Hypothetically, if the three-address machine does not have any registers, then every instruction takes four memory accesses, and because there are five instructions it would take 20 memory accesses in total to execute the statement. The two-address machine also takes four accesses per instruction; it should be recalled that one address doubles as both source and destination operand. Including the load instruction, which requires three accesses to memory, the two-address machine needs 23 memory accesses in total. If we do introduce a single register to store T into both designs, to make the example a bit more realistic, we can reduce the number of memory accesses to 12 for the three-address machine and 13 for the two-address machine respectively. The accumulator and stack machines use registers to read and write values and need 14 and 19 accesses respectively. This, however, does not mean that a three-address or two-address machine with registers is faster than a stack machine. Code density, the number of instructions that can be stored per memory unit (byte, 32-bit word, etc.), is another factor that needs to be considered. A stack machine does not need to specify any addresses and therefore requires fewer bits to encode each instruction, as can be seen in Figure 1.2. This means we can store more instructions in memory, which makes for shorter programs and helps improve

memory bandwidth usage.

Format      Fields                                    Size
3-address   opcode (8 bits) + three 5-bit addresses   23 bits
2-address   opcode (8 bits) + two 5-bit addresses     18 bits
1-address   opcode (8 bits) + one 5-bit address       13 bits
0-address   opcode (8 bits)                            8 bits

Figure 1.2: Instruction sizes for the four different machines; the zero-address machine has the highest code density as each instruction only takes up 8 bits. It should be noted that destination and source operands are addresses in memory.

Control Unit

The control unit (also called the control path or decoder) is responsible for decoding instructions and telling the system what to do. There are two commonly used approaches to designing the physical logic within the control unit: hard-coded (or hardwired) and microcoded designs. RISC machines typically use a hard-coded control unit, while CISC processors use a more modular microcode design. A hard-coded design is a direct implementation of a state machine based on wires, flip-flops and logic gates that operates the datapath components. It is simple to implement, although the resulting state machines can be quite complicated. They are, however, efficient, as the state machine ideally does not contain any superfluous elements but is tailored precisely to what is required. Microcode is tightly linked to CISC. In the 1970s memory used to be very expensive, so computer architects tried to minimise the amount of storage a program took up. This meant that each individual instruction a program is made up of had to do more work. On the other hand, designing hardware to carry out complex instructions was also a very expensive task. Microprograms offered the solution: a small run-time interpreter takes a complex instruction and executes simple (micro) instructions. Computer architects use microcode routines to implement more complex instructions.

1.4 Project Aims and Objectives

The aim of the OpenTransputer project is to produce a re-implementation of the Transputer, keeping the original ISA whilst updating the microarchitecture. All changes have been proposed and implemented with the IoT, wearables and connected homes markets in mind. Specifically, the following are the main areas of improvement over the original Transputer that the project is focussed on:

1. Develop an implementation of the Transputer instruction set taking advantage of state-of-the-art technology.

2. Replace the communication mechanism used in the original 1980s Transputer with a network of switches to improve the usability of the processor as a building block for large parallel systems.

3. Introduce an easy-to-use I/O interface that can be used to connect hardware peripherals, such as sensors, commonly used in IoT applications.


Chapter 2

Technical Background

2.1 Occam

Occam is a programming language designed at Inmos to facilitate the development of concurrent, distributed systems [14]. The language was developed hand-in-hand with the Transputer to extract maximum performance from the architecture. In fact, the Transputer system description and documentation are written in Occam. Despite its purpose, Occam is a high-level language and not assembly. However, the fact that its model of concurrency was derived from Hoare's work on Communicating Sequential Processes (CSP) [13] means that Occam greatly differs from conventional programming languages such as C and Java [20]. Due to its simplicity, we use Occam to introduce the Transputer from a high-level point of view. Occam enables systems to be described as collections of concurrent processes that communicate with each other. Contrary to conventional languages, Occam programs are based on processes rather than procedures. Procedures can be thought of as sequences of instructions that are enclosed in callable constructs, enabling code reusability and modularity. On the other hand, processes can be thought of as independent entities or self-contained programs with separate memory spaces and state. These processes are specifically designed to run on identical Transputer system components containing on-chip memory and serial communication links to other components. The inter-process communication model is based on message passing [7], a scheme in which processes communicate by exchanging messages. Occam exposes point-to-point channel primitives to easily describe the connections between processes. Occam programs can be executed by a single Transputer that allocates CPU time to individual processes. Just as easily, the same concurrent programs can be executed by a network of Transputers without changing the Occam description. In this case, processes truly run in parallel and communicate by using the serial links.
Programs are built from three primitive processes, described in Table 2.1.

Syntax   Description
v := e   Assign expression e to variable v
c ! e    Output expression e to channel c
c ? v    Input from channel c to variable v

Table 2.1: Occam primitive processes [14].

These primitives are combined to form the constructs in Table 2.2. Similarly, multiple constructs can be combined to write complex processes that communicate using the point-to-point channels. It is important to mention that in Occam communication is synchronised,

Keyword   Description
SEQ       Components executed in sequence
PAR       Components executed in parallel
ALT       First ready component is executed
WHILE     Iterative instructions
IF        Conditional construct

Table 2.2: Occam constructs [14].

meaning that both sender and receiver processes wait until the transfer is complete before proceeding with execution.

Sequential Processes

The SEQ construct labels a process as sequential and causes its components to be executed one after another, just as in conventional programming languages. Listing 2.1 is an example of a process that first assigns the value 10 to variable x, then outputs it to channel buffer.out and finally inputs a value into x from buffer.in. Note that the instructions are completed exactly in this order.

    VAR x:
    SEQ
      x := 10
      buffer.out ! x
      buffer.in ? x

Listing 2.1: Occam purely sequential program.

Parallel Processes

The PAR keyword causes processes to be executed at the same time (if possible). Listing 2.2 shows two processes executing in parallel. The first process outputs x to channel c, while the other process receives it into a variable y. Also, note that the parallel construct is enclosed in a sequential construct and is followed by an assignment to z. In this case, the assignment will only be executed after both parallel processes have completed. The flow of events is as described in Figure 2.1.

    SEQ
      PAR
        c ! x
        c ? y
      z := y

Listing 2.2: Occam parallel program.

2.2 Transputer Architecture

Internal Process Representation

In the Transputer, each process is associated with a workspace in memory [15]. This area can be thought of as a stack where local variables, channel information and other values that describe the state of the process are stored. When the process is being executed, its context is also held in six registers:

Figure 2.1: Flow of events in sample Occam parallel program.

Figure 2.2: Transputer evaluation stack [18].

A, B and C. These registers form the evaluation stack used to compute results for expressions. Instructions push values onto the top of the stack and pop them when operations are executed, as shown in Figure 2.2. Source and destination operands do not need to be encoded within the instructions because all operations are performed on the stack.

Workspace pointer (Wptr). Holds the memory address of the top of the workspace stack. This is a special purpose register and can only be manipulated by selected instructions.

Instruction pointer (Iptr). Stores the byte address of the next instruction.

Operand. The Transputer instructions are 8 bits wide. The most significant four bits encode the instruction code (opcode) and the remaining bits are an immediate operand. When an instruction is executed, the immediate is loaded into the four least significant bits of the operand register, as shown in Figure 2.3. There are also two special prefix instructions that can be used to accumulate operands into the operand register to be used later by other instructions. It is also worth noting that the 4-bit opcode can only encode 16 different instructions, which clearly does not provide enough flexibility. For this reason, one of these opcodes causes the operand register to be treated as an instruction.

The Transputer can only execute a single process at a time, yet it must be able to keep track of many others and eventually allocate them CPU time. To do so, processes are queued in a linked list

with the processor maintaining pointers to the front and back nodes of the data structure in registers. Thus, processes can either be active or inactive. The former refers to processes that are currently being executed or are scheduled to be executed. In contrast, the latter refers to processes that are waiting, i.e. currently not executing, with their context stored in memory. A process may be in this situation because it is waiting for a communication operation to complete or a timer to expire, or because its allocated CPU time has finished. When a process is ready to be executed, it is placed at the back of a linked list that acts as a waiting queue, as shown in Figure 2.4. When the process reaches the front of the list it is executed [18].

Figure 2.3: Transputer instruction format [18].

The fact that the Transputer can automatically context switch and time slice its processing resources means that it effectively implements a scheduler in hardware. In most architectures these operations are entrusted to operating systems. However, context switches often involve storing a large number of registers in memory to preserve the state of the process (its context), making the task slow and more complex when implemented in software. The Transputer's stack-based architecture means that it can perform context switches extremely efficiently. According to the documentation in [15], the Transputer is able to stop an active process and start a new one in approximately 12 clock cycles.

The tight relationship between Occam and the Transputer resulted in many of the programming language primitives being implemented as single assembly instructions. This not only includes scheduling operations such as starting and terminating processes, but also inter-process communication tasks. Therefore, processes can perform input and output operations by executing single instructions.
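As an illustration of the synchronised, unbuffered channel semantics that these single communication instructions implement, the following Python sketch models a channel rendezvous, with threads standing in for Transputer processes. All names here (Channel, output, input_, run_par) are our own illustrative choices, not part of any Transputer toolchain:

```python
import threading
import queue

class Channel:
    """Unbuffered channel: sender and receiver rendezvous before either proceeds."""
    def __init__(self):
        self._data = queue.Queue(maxsize=1)
        self._ack = queue.Queue(maxsize=1)

    def output(self, value):      # Occam: c ! value
        self._data.put(value)
        self._ack.get()           # block until the receiver has taken the value

    def input_(self):             # Occam: c ? v
        value = self._data.get()
        self._ack.put(None)       # release the sender
        return value

def run_par():
    """Mirror Listing 2.2: PAR { c ! x ; c ? y } followed by z := y."""
    c, result, x = Channel(), {}, 10
    sender = threading.Thread(target=lambda: c.output(x))
    receiver = threading.Thread(target=lambda: result.update(y=c.input_()))
    for t in (sender, receiver):
        t.start()
    for t in (sender, receiver):
        t.join()                  # the PAR completes only when both finish
    return result['y']            # z := y
```

Mirroring Listing 2.2, run_par() performs the assignment z := y only after both sides of the PAR have finished, just as the sequential construct requires.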
It could be argued that the Transputer essentially comes with a Real-Time Operating System (RTOS) implemented in hardware.

Inter-process Communication

As mentioned above, two processes communicate by using the Occam channel primitives, which are directly operated by Transputer instructions. The processes might reside within the same or different machines, yet the same instructions are used. In the former case, a channel is represented by a word in memory. When the first process becomes ready, it writes its workspace pointer (its identity) into the channel and is descheduled. Then, when the second process is ready, the message is copied by the processor to the specified location, the first process is rescheduled and the channel is returned to the empty state [28].

On the other hand, if the processes reside in different Transputers, both sender and receiver are descheduled while the transfer takes place and rescheduled when it concludes. In this case the communication is performed by autonomous link controllers. The links fetch or store data using a Direct Memory Access (DMA) mechanism and exchange messages with remote components through serial point-to-point connections. Each Transputer comes with four bidirectional serial links that make it possible to connect the processor to up to four other devices. There is no limit on the number of Transputers that can be connected in this fashion, which allows large parallel networks to be constructed from individual Transputers [20]. However, as they are linked in a point-to-point fashion, the more components in the

system, the slower the communication will potentially be, as messages might have to be relayed over an increasing number of intermediate Transputers.

Figure 2.4: The Transputer's hardware-implemented process scheduler [18].

In the Occam communication model, channels are synchronised and unbuffered; therefore, both processes must be ready before messages are transferred [14]. Furthermore, the compiler ensures that before an input or output instruction is executed, the source and destination addresses, the channel address and the message length are available.

2.3 Transputer Microarchitecture

Microcode Implementation

Even though the Transputer is conceptually simple, many of its instructions are complex to implement in hardware and require multiple clock cycles to complete. At each step of the computation, the processor needs a different set of signals to control the datapath. The designers of the Transputer generated these signals by using a microcode approach, which is commonly found in CISC processors. The idea is that each assembly instruction is executed by a sequence of simpler hardware-implemented microinstructions that can each be completed in a single clock cycle. Each microinstruction is associated with a number of control signals that the Transputer stores in high-speed Read-Only Memory (ROM). A microcode ROM is used in all Transputer models, yet its size varies depending on the complexity of the processor. For instance, one of the early designs, known as the Simple 42, contains 122 microinstruction words of 68 bits each [10]. However, the more complex T414 version of the Transputer has approximately five times that

number of microinstructions and over 100 bits per word. For simplicity, we focus on the microcodes defined for the Simple 42 Transputer. The control signals of each microcode can be divided into groups or bit fields. The most relevant ones for this discussion are listed and explained below [10].

X and Y bus source select and Z destination select. Due to the limited manufacturing capabilities available during the 1980s, processors needed to be designed to contain only two layers of interconnect. This greatly limited the number of connections between the different components of the design, since the number of wire crossings had to be minimised. Thus, the Transputer uses three buses onto which data is multiplexed to be transported to where it is required. Two of these (X and Y) carry source values for the computation and the remaining one (Z) carries the result. This mechanism greatly simplifies the connections between the different components of the processor at the expense of reducing the number of operations that can be performed by the processor simultaneously.

ALU operations. These bits of the microcode select which operation is to be performed using the operands on the X and Y buses. Examples include addition, subtraction, etc.

Next microinstruction address base. These bits contain the address of the next microinstruction in ROM that must be executed in the next clock cycle. It is also possible that a microinstruction concludes the execution of an instruction, in which case these bits are not used.

Conditional select. It is desirable to execute conditional statements within the microcode routines. For instance, when executing a conditional jump instruction, the processor must first evaluate an integer comparison and, based on the result, decide whether to increment the instruction pointer by a specified offset or by 1.
The Transputer implements conditional microcode execution by replacing the least significant two bits of the next microinstruction address base with conditional bits. The new address formed selects one of four possible microinstructions, as shown in Figure 2.5. The two condition bits are generated by the operation specified by the conditional select bits associated with each microinstruction.

    AND : XfromA YfromB NoCarry ZfromXandY AfromZ Next ;

Listing 2.3: Simple 42 microcode for bitwise AND operation.

To illustrate the idea, consider the Simple 42 microinstruction to execute a bitwise AND instruction shown in Listing 2.3. In this case, the X and Y buses contain the data stored in the A and B stack registers. The Z bus contains the result of the AND operation, and this value is stored back into A.

The ROM is managed by a microcode engine that ensures that the correct control signals are available at the right time. When an instruction is executed, the engine associates its opcode with a microinstruction and loads the required word from the ROM [10]. At the next clock cycle, the engine will decide whether a new instruction or a further microinstruction is executed. In the first case, the next instruction will be fetched from memory and executed as normal. In the latter case, the engine uses the address encoded into the microinstruction to fetch the next set of control signals. However, the fact that the two least significant bits of the address are overwritten by the conditional bits means that there are four possible microinstructions that could be executed, yet the decision is deferred for the conditional bits to

be computed. For this reason, the engine loads four contiguous words from the ROM each time. Then, these are multiplexed using the conditional bits as control signals, as shown in Figure 2.5.

Figure 2.5: Conditional execution of microcodes in the Simple 42 Transputer [2].

Process Scheduling

We mentioned above that the Transputer maintains a linked list of ready processes to be executed. When the CPU becomes available (no process is currently being executed), the process at the front of the list is dequeued and executed. In reality, the Transputer maintains two linked lists of process pointers. One list manages the high priority processes, while the other takes care of low priority ones. The process at the front of the high priority list is run first. It is possible that at any time a background operation requires the CPU's attention, such as a communicating process becoming ready, in which case the executing high priority process waits until the request is completed. When the high priority process completes or blocks, the next process at the front of the high priority list is executed. This continues until the high priority queue is empty, at which point the Transputer starts executing processes in the low priority list in a similar fashion. However, if at any point in time a high priority process becomes active, the context of the low priority process will be stored in memory and the CPU executes the high priority process. For this reason, high priority processes should only be used in special cases where performance and responsiveness are paramount, otherwise they will starve low priority processes of CPU resources. In fact, when the Transputer is reset its priority is set to low by default [28].
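The priority rule described above reduces to a simple dispatch policy: always take from the front of the high priority list, and fall back to the low priority list only when the former is empty. A minimal Python sketch of this policy (class and method names are hypothetical, not taken from any Transputer documentation):

```python
from collections import deque

class TwoPriorityScheduler:
    """Front-of-list dispatch with strict high-over-low priority."""
    def __init__(self):
        self.high = deque()
        self.low = deque()

    def make_ready(self, proc, high_priority=False):
        # A ready process joins the back of the list for its priority.
        (self.high if high_priority else self.low).append(proc)

    def dispatch(self):
        """Return the next process to run, or None if both lists are empty."""
        if self.high:
            return self.high.popleft()
        if self.low:
            return self.low.popleft()
        return None
```

Note that this models only the queueing discipline; preemption of a running low priority process when a high priority one becomes active is handled by the context-switch machinery, not by the queues themselves.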
In Occam it is possible to declare timers, which behave like channels that cannot be output to. Timers are a simple mechanism to force processes to wait for a specified time without consuming any processing resources. The Transputer implements two additional linked lists (high and low priority) to support this feature. The processor also implements two clock registers, one for each priority. The high and low priority clock registers are incremented every 1µs and 64µs respectively [28]. Processes in timer lists are sorted with respect to the clock registers: processes whose wake-up time is closer to the time of the clock register are placed towards the front of the queue. When the time of the clock register passes the time of the process at the front of the queue, the Transputer dequeues and executes that process.
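The timer lists can be pictured as deadline-sorted queues inspected against the clock register. The sketch below is our own simplification of this behaviour; in particular, it ignores the wrap-around of the real clock registers, which the hardware must handle:

```python
import bisect

class TimerList:
    """Processes sorted by wake-up time, released once the clock reaches it."""
    def __init__(self):
        self.entries = []                 # sorted list of (wake_time, proc)

    def insert(self, wake_time, proc):
        # Keep the list ordered so the earliest deadline sits at the front.
        bisect.insort(self.entries, (wake_time, proc))

    def tick(self, clock):
        """Dequeue every process whose wake-up time the clock has reached."""
        ready = []
        while self.entries and self.entries[0][0] <= clock:
            ready.append(self.entries.pop(0)[1])
        return ready
```

One such list would exist per priority level, checked against the corresponding 1µs or 64µs clock register.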

Endianness and Memory Addressing

The Transputer is a purely little-endian processor [15], which means that the least significant byte of a word is stored at the lowest address. Also, contrary to most other processors, memory addresses in the Transputer are signed.

2.4 Transputer Versions and Enhancements

The first Transputer was released in 1984 and subsequent models were introduced thereafter. Inmos developed three Transputer variants: T2, T4 and T8. All these designs maintain the core features of the architecture with regards to concurrency management and inter-process communication, yet some additions and enhancements were introduced at each iteration. The T2 variant is the 16-bit version of the processor. There were multiple releases, but one of the later ones was the T225 [24], which contains 4 KB of on-chip RAM, an external memory interface and four communication links. In contrast, the T4 series are 32-bit processors. The smallest of these is the T400 [25], which contained only two communication links and 2 KB of on-chip memory. The larger counterparts of the T400 are the T414 and T425 [26], with the latter containing twice as much memory as the T400 and four autonomous link controllers. Finally, the T800 Transputers were introduced in 1987 [27]. This variant was still 32-bit, but also came with an extended instruction set and a 64-bit floating-point unit.

To make it easier to connect large numbers of Transputers in a network, Inmos introduced the C004 programmable link switch [23]. This device had 32 link inputs and 32 outputs and complied with the serial link protocol already used by the Transputers.

Chapter 3

Project Execution

The project is broadly divided into three stages. Firstly, we implemented the processor's core CPU, which can execute the main instructions for the scheduling, timer and internal communication operations. Secondly, we developed the autonomous links and network switch to enable message passing among multiple Transputers. Finally, support for I/O operations was introduced, building on the existing communications functionality. The following sections describe our design decisions and the implementation details at each stage.

3.1 The OpenTransputer CPU

To simplify the overwhelming complexity of the design, the processor's core can be divided into three major parts that were designed separately and subsequently integrated. Each of these components is described in turn, and differences with regards to the original Transputer are highlighted where appropriate.

Control Unit (Decoder)

As discussed in Section 2.3.1, the original Transputer used a microcode system to generate the control signals needed for the datapath at each clock cycle. These signals were stored in ROM and loaded by a microcode engine when required. The alternative approach is to hardwire logic that generates the same control signals by implementing combinatorial circuits with a few sequential elements. Considering that in the most complex Transputers there are over 600 microinstructions, each consisting of more than 100 bits, hardwired logic would have significantly increased the area and complexity of the design. An attractive feature of microcoded over hardwired systems is that they are much simpler to debug and change, even in later stages of the design process. This is because in these systems most of the machine's behaviour is described by microcode routines rather than by dedicated circuitry. Indeed, the development of these routines resembles programming in assembly code rather than hardware design.
The disadvantage is that microcoded processors that store their routines in ROM are potentially slower than their hardwired counterparts because there is a delay associated with reading the control signals from ROM. For the OpenTransputer, we have decided to take a middle ground between the microcoded and hardwired approaches by taking the best features of both: we maintain the flexibility and ease of use of microcodes and the performance of hardwired logic. Hence, we devised a simple assembly-like language to describe the microinstructions, and a microassembler was developed in Python. The script parses the microinstructions in plain text and converts them into an intermediate representation that can be easily translated into any implementation, such as ROM or hardwired logic. Therefore, when a bug in

the design is found or a new feature is introduced, only the source microinstruction is modified and a new version of the control unit can be generated with a few keystrokes. Furthermore, the script makes heavy use of Python's high-level data structures, such as lists and dictionaries, which can be easily customised to cope with changes in the other components of the Verilog design.

Microcode Language

The OpenTransputer microassembly language was inspired by the original Transputer microinstructions. Each command in the microprogram can be thought of as a different state in a Finite State Machine (FSM), having an integer identifier (address) and an associated set of control signals [30] which are generated by the Python script. We describe microinstructions to the tool by using high-level human-readable text commands. Each microinstruction has three parts:

1. State name. An alphanumeric word that uniquely identifies the state.

2. Body. Describes the datapath enabled by the control unit in a single clock cycle. Common commands used in the body of a microinstruction select inputs from multiplexers, generate specific Arithmetic Unit (AU) and Logic Unit (LU) operation codes, enable registers for writing, etc.

3. Control flow. At each clock cycle the CPU must be able to decide what happens next. Hence, all microinstructions include a bit field that encodes what action should be taken next. There are four possibilities:

(a) Execute the next instruction from the instruction buffer (see Section 3.1.3).

(b) Fetch the next instruction from memory. This only occurs when a microinstruction manipulates the instruction pointer, such as in jumps and context switches.

(c) Unconditionally execute the next microinstruction, whose address is embedded within this state.

(d) Jump to one of two or one of four states according to a condition specified by this microinstruction.

An example microinstruction with state name GT is shown in Listing 3.1.
The body compares the values of the A and B registers using the AU; the commands Auop0fromB and Auop1fromA select the output of those registers as the input operands to the AU. The result of the computation is stored in A and the register stack is popped, using the commands AfromAubool and BfromC respectively. Also, the operand register is cleared (set to 0) by the command OfromClear. Finally, the control flow command Gotoplus1 increments the instruction pointer by one byte and tells the processor to execute the next instruction. The microinstruction CCNT11 in Listing 3.2 performs an AU operation similar to GT, but uses the boolean result of this computation to decide whether control should be transferred to either CCNT13 (B < A) or CCNT14 (B ≥ A).

    GT BfromC AU(gt) AfromAubool Auop0fromB Auop1fromA OfromClear Gotoplus1 ;

Listing 3.1: OpenTransputer microinstruction for state GT (greater than instruction).
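The first stage of the microassembler splits lines such as Listing 3.1 into a state name and a command list. The following Python sketch shows one plausible way to do this; the grammar is inferred from the listings shown here rather than taken from the actual tool:

```python
import re

def parse_microinstruction(line):
    """Split 'NAME cmd cmd ... ;' into the state name and its command list.
    Commands may carry parenthesised arguments, e.g. AU(gt)."""
    body = line.strip().rstrip(';').strip()
    # Match either 'word(args)' (spaces tolerated) or a bare word.
    tokens = re.findall(r'\w+\s*\([^)]*\)|\w+', body)
    # Normalise away internal spaces such as 'AU( ulteq )' -> 'AU(ulteq)'.
    name, commands = tokens[0], [t.replace(' ', '') for t in tokens[1:]]
    return name, commands
```

Applied to Listing 3.1, this yields the state name GT and the commands BfromC, AU(gt), AfromAubool, Auop0fromB, Auop1fromA, OfromClear and Gotoplus1, ready for the later error-checking and address-allocation stages.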

Figure 3.1: Next microinstruction state address generation.

    CCNT11 Auop0fromB Auop1fromA AU( ulteq ) Condaubool ( CCNT13, CCNT14 );

Listing 3.2: OpenTransputer microinstruction for state CCNT11.

Microprogram Assembler

The steps followed by the tool to assemble a microprogram are shown in Figure 3.2. After the input is read and parsed, an intermediate representation is generated and analysed for errors. The error-checking facilities help prevent bugs in the design, yet they do not prove the functional correctness of the code. Common mistakes found by the microassembler include unknown commands, non-existent connections, conflicting instructions and missing states, among others.

The Inmos Transputer microinstructions contain a bit field that stores the base address in ROM of four contiguous microinstructions. Also, there are bits to decide which of these four would be executed next (see Section 2.3.1). The OpenTransputer operates in a similar fashion thanks to the address allocation process of the microassembler, which translates the alphanumeric identifiers to unique integers. The objective of the tool is to allocate contiguous addresses to groups of microinstructions that are possible targets of the same control flow condition. For instance, for the condition Condaubool(CCNT13,CCNT14) in Listing 3.2, the least significant bit of the addresses of CCNT13 and CCNT14 should be 0 and 1 respectively, while the remaining bits must be equal in order for the instructions to be contiguous. This is because, when resolving this condition, the processor will merge the base address encoded in the microinstruction for CCNT11 with the boolean result of the AU to form the address of the target microinstruction, as shown in Figure 3.1. In contrast to the original Transputer, in our design the microinstructions can be grouped in pairs and quadruples depending on the specified condition.
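The address merging of Figure 3.1 can be written down directly. In this sketch (the function name is our own), the low bit of the base is replaced for a pair and the low two bits for a quadruple, assuming the allocator has aligned the base accordingly:

```python
def next_microaddress(base, condition, group_size):
    """Replace the low bits of an aligned base address with the condition
    result: 1 bit for pairs, 2 bits for quadruples."""
    bits = {2: 1, 4: 2}[group_size]
    mask = (1 << bits) - 1
    assert base & mask == 0, "base must be aligned to its group size"
    return base | (condition & mask)
```

For Listing 3.2, the boolean AU result would select between the addresses of CCNT13 (condition 0) and CCNT14 (condition 1) within a pair.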
Therefore, to allocate the state addresses, the microassembler does a first pass through all the input microinstructions and generates a list of all groups. The script then allocates addresses to all groups of four, groups of two, and single microinstructions, in exactly that order. Note that there are no constraints on the addresses of individual instructions, yet the process terminates correctly only if (a) the least significant two bits of the base addresses of all groups of four are 0, and (b) the least significant bit of the base addresses of all groups of two is 0. The objective of the address allocation process is to assign a unique address to every microinstruction while meeting the constraints above and using the smallest range of addresses possible. In other words, we require that all addresses can be represented with the fewest number of bits possible.
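One way to satisfy both alignment constraints while keeping the address range minimal is to allocate the largest groups first, starting from address zero; alignment then holds automatically. The sketch below is our own simplification of this ordering, not the allocator's actual code:

```python
def allocate_addresses(quads, pairs, singles):
    """Assign contiguous addresses: groups of four first (bases aligned to 4),
    then pairs (bases aligned to 2), then single states. Because allocation
    starts at zero and proceeds largest-first, every base is aligned."""
    addresses, next_free = {}, 0
    for group in list(quads) + list(pairs) + [[s] for s in singles]:
        for offset, name in enumerate(group):
            addresses[name] = next_free + offset
        next_free += len(group)
    return addresses
```

After all quadruples are placed, the next free address is a multiple of 4 (hence also of 2), so pair bases meet constraint (b) without any padding, and singles fill the remaining addresses densely.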

The final step of the assembly process is to use the intermediate representation to generate an output that can be combined with the remaining modules of the design. The two main alternatives for this are a ROM similar to that of the original Transputer and hardwired logic. We decided to take the latter approach, since our design has less than half the states of the Inmos implementation. Furthermore, the microcodes are very sparse, meaning that in most microinstructions the majority of control signals are set to 0. Therefore, using combinatorial logic to generate control signals is potentially cheaper and faster than including a large ROM. For the purpose of this project, we are not concerned with extracting maximum performance from our implementation, so the script generates behavioural Verilog rather than combinatorial logic blocks. Due to this, the design is not as efficient as it could be, since we are essentially relying on the hardware synthesis process to generate the logic, which generally is not as optimal as manually developing the circuitry.

We cannot stress enough that this microcoded approach has greatly increased our productivity and allowed us to achieve the goals agreed at the beginning of the project. During the early stages of the project, we developed an early prototype implementation in Verilog using the hardwired approach, but manually coding all the control signal values. This is not only very tedious and error prone, but more importantly, effort and time are lost if major changes need to be introduced in later stages of the development process.

Figure 3.2: Stages of the microassembling process (read and parse input microinstructions; generate intermediate representation; check microprogram correctness; allocate addresses; generate output representation).

Microcode Bit Fields

The latest OpenTransputer can execute the majority of the T414 Transputer's instruction set with only 303 different microinstructions.
Each microinstruction has 134 bits, distributed as follows:

9 bits encode the base state address of the next microinstruction.

4 bits select the control flow condition.

22 bits correspond to register enable signals.

1 bit enables memory writes.

29 bits are multiplexer control signals. These are generally used to select write values to registers and input operands to modules such as the AU, LU and addressing units.

Figure 3.3: Example datapath implementation using three buses.

4 bits select the arithmetic and comparison operation.

2 bits select the logic operation.

1 bit differentiates between arithmetic and logical right shifts.

72 bits are used for various other, more specialised purposes.

Datapath

Due to constraints in the manufacturing process during the 1980s, it was not possible to fabricate chips with more than two layers of interconnect. Therefore, the number of connections between the different components of the processor had to be minimised to avoid issues such as wires crossing over each other's paths. This is the main reason why the original Transputer was implemented using three buses. Two of these buses contained the input operands for components such as the ALU. The third gathered the result of a computation and was also used to write the value back to its destination. Despite meeting manufacturing requirements, this design approach has the effect of reducing the number of simultaneous computations that can be performed in any clock cycle. This is due to the fact that it is not possible to transport the operands to the relevant components in parallel.

Modern manufacturing techniques can easily cope with multiple layers of interconnect, giving us the flexibility to include significantly more wiring. Hence, we have decided to replace the buses with large collections of wires that transport the output of a component to every part of the datapath that will eventually need that signal as an input. With the bus approach, shown in Figure 3.3, the outputs of all registers are multiplexed onto two buses and the outputs of other logic blocks are placed onto a results bus.
With our approach, the outputs of the registers and other system components are carried by individual wires, and multiplexers are only needed to select the inputs, making the datapath wider, as illustrated in Figure 3.4. From the diagram, it is clear that with only three buses the ALU and the addressing unit cannot operate simultaneously on different operands, while the other approach enables this possibility. The fact that most logic components in the original Transputer are fed the same input operands means that it is difficult to exploit the inherent parallelism of hardware using the bus approach. Thus, instead

of having multiple instances of a module to compute two different values within the same clock cycle, the Inmos engineers reused the same components at the expense of time. For example, instead of having two ALUs for computing two different arithmetic operations simultaneously, the processor would only have a single one and take twice as long to calculate the results. During the 1980s, this was a desirable trade-off, since replication of major components of a system such as the ALU and memory addressing unit was very expensive in terms of area and power consumption. Also, replication could have a significant impact on the manufacturing cost of the device. Nowadays, thanks to the dramatic decrease in the cost of electronic components and improved manufacturing techniques, replication is an acceptable mechanism to maximise performance. Hence, we have decided to take advantage of this idea to reduce the overall cycle count of many instructions. This in turn reduces the total number of microinstructions and potentially decreases the complexity of the hardwired control unit. The disadvantage of module replication and the replacement of the original bus system is a potentially more complex processor datapath than that of the original Transputer.

Figure 3.4: Example implementation using a wide datapath approach.

Figure 3.5 is a simplified schematic of the OpenTransputer's datapath. Clearly, the complete diagram contains many more logic modules, connections and multiplexers. However, for the purposes of this explanation this schematic suffices. The following are a number of key points to note about the datapath:

We have included three memory addressing units, two of which calculate word addresses, that is, given a base x and an offset y the output is x + (y × 4). The remaining addressing module is simply an adder to compute a byte address given a base and offset.
This enables the OpenTransputer to calculate three different addresses simultaneously, one of which is exclusively used for accessing memory and the other two for updating registers.

We have developed dedicated circuitry to copy blocks of data from one memory location to another. This circuitry is used by the processor to efficiently execute input and output instructions on behalf of processes communicating within a single machine.

Because the hardware essentially implements a small kernel, there are 14 constant integer values that

are commonly used in operations such as scheduling and communication. These values correspond to special memory addresses or to numbers that flag the states of processes and channels. Similar to the original Inmos design, the OpenTransputer stores these values in special ROM. However, we have decided to include twice as many such ROM modules to optimise performance. The added cost of the extra memory is negligible, since their maximum size is only 16 words.

Figure 3.5: OpenTransputer's simplified datapath schematic.

The OpenTransputer has separate arithmetic and logic units rather than a single ALU.

A final optimisation introduced in the OpenTransputer greatly reduces the time taken to perform a context switch between low and high priority processes. When a low priority process wishes to start a new high priority one, the original Transputer blocks the parent process by storing all of its context in memory and then executes the new one. This task is particularly slow because the processor must write data to memory at least six times to store the A, B, C, Iptr, Wptr and status registers. We have chosen to optimise context switches from low to high priority in the OpenTransputer by introducing shadow registers [17] within the register file. Shadow registers are additional registers whose only purpose is to temporarily save the value of another register that will be overwritten and would otherwise be lost. Thus, when a low priority process is blocked by a high priority one, the CPU copies all the context into the shadow registers in a single clock cycle. When the low priority process needs to be restored, the saved values are recovered from the shadow registers. This mechanism reduces the average context switch time from 12 to 4 clock cycles.
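The shadow-register optimisation can be modelled behaviourally as a one-shot copy of the whole context, in contrast with six separate memory writes. The following sketch uses the register names from the text; timing and the hardware details of the register file are not modelled:

```python
class RegisterFile:
    # The six registers that make up a process's context.
    CONTEXT = ('A', 'B', 'C', 'Iptr', 'Wptr', 'status')

    def __init__(self):
        self.regs = {name: 0 for name in self.CONTEXT}
        self.shadow = None

    def save_to_shadow(self):
        """Single-step capture of the low priority context (one clock cycle
        in hardware, versus at least six memory writes without shadows)."""
        self.shadow = dict(self.regs)

    def restore_from_shadow(self):
        """Recover the saved context once the high priority process is done."""
        self.regs = dict(self.shadow)
        self.shadow = None
```

Note that a single set of shadow registers suffices here because only one low-to-high preemption needs to be outstanding at a time.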

Fetch Unit

The fetch unit of the OpenTransputer is very simple. It consists of a 32-bit instruction buffer (four 8-bit instructions) and some logic to decide when the next fetch should be issued. Its operation is equally simple and is tightly related to the control unit. Firstly, when the machine is switched on, a fetch is issued before any instruction is executed. In normal operation the fetch unit uses the two least significant bits of the instruction pointer (the byte index) as an index into the instruction buffer to decide which instruction is executed next. When the byte index reaches 3, a new fetch is issued and execution proceeds. This mechanism does not take into account that instructions must also be fetched when the instruction pointer is manipulated, either by a jump or by another operation such as a context switch. For this reason, all microinstructions have a 1-bit flag that causes the fetch unit to retrieve a word from memory.

Component Integration and Signal Timing

The OpenTransputer does not have a conventional Fetch-Decode-Execute pipeline, a mechanism that is better suited to register machines than to stack-based architectures. This is because most instructions in machines such as the Transputer operate on the same register stack, causing many data dependencies. For instance, consider the sequence of instructions shown in Listing 3.3 for a hypothetical register machine. Clearly, the addition can easily be completed independently of the other instructions. Nevertheless, there is a dependency between the subtraction and the store instruction on register r0. In a register machine even this can easily be overcome by bypassing the result in r0 before it is written back to the register file, or by occasionally stalling the pipeline. On the other hand, consider the sequence of Transputer instructions in Listing 3.4.
Every single one of these instructions operates on the register stack and must complete before the subsequent instruction can be executed, making the architecture unsuitable for traditional pipelining techniques. The OpenTransputer therefore either executes a fetch or an instruction, but never both at the same time.

    add r1, r2, r3
    sub r0, r5, 56
    stw r4, r0, r7

Listing 3.3: Assembly program for a register machine.

    ldc 1
    ldc 2
    add
    ldc 3
    sub

Listing 3.4: Assembly program using the Transputer Instruction Set Architecture (ISA).

Although it is difficult to pipeline the Transputer, this idea can be applied to the individual microinstructions to a certain extent. However, this optimisation is outside the scope of this project and was not attempted. The three components described above (i.e. the fetch and control units and the datapath) are connected as shown in Figure 3.6. The fetch unit is the simplest and is in charge of retrieving instructions from memory and feeding them to the control unit, where they are associated with a microinstruction and executed. In contrast, the datapath is a collection of modules, such as the register files, AU, LU and memory addressing units, all connected together, that actually executes the instructions. Finally, the control unit is the authority in the processor and is the piece of hardware that generates the signals to enable or disable different paths within the machine, yet it does not directly interface with memory. The control unit

receives feedback from the fetch unit to coordinate accesses to memory. It also has inputs connected to components of the datapath to access the results of computations associated with the flow of control. For instance, the address of the next microinstruction is calculated by dedicated circuitry in the processor and the result is used every cycle to decide on the next set of control signals.

Figure 3.6: Integration of the three main components of the OpenTransputer CPU.

As in every processing system, the timing at which the signals are available is key for the CPU to behave correctly. Furthermore, the timing coordinates the interactions between the different parts of the system. In the OpenTransputer all registers are edge-triggered and their values update at the positive edge of the clock. In contrast, the original Transputer operated using level-triggered latches, a mechanism we consider difficult to use because each register effectively consists of two independent latches that are enabled by independent non-overlapping clocks. In our design, each microinstruction takes exactly one clock cycle to complete. Hence, the control signals are updated at the positive clock edge preceding instruction commit, and the instruction to be executed is visible to the control unit two cycles before. To illustrate this, consider the waveform shown in Figure 3.7; note that at time 9675 ns the value of instr changes to 4A₁₆, which corresponds to a ldc (load constant) instruction that updates both registers A and B. The control signals associated with this instruction are only set one clock cycle later (9685 ns) and the instruction is only executed two cycles after that (9695 ns). Now consider the waveform at time 9695 ns when the value of the instruction pointer (i.e.
riptr) changes. Clearly, the byte index of Iptr has reached 3 and the next four instructions must be fetched from memory. This is why the value of instr changes to 2D₁₆, which is the same instruction executed four clock cycles earlier. The control unit notices this situation and takes action by setting enfetchbuff high (9705 ns), completing the fetch operation one cycle later. It is important to highlight that when the processor executes a fetch, all the other components of the processor are disabled and the values input to the memory module are those output by the fetch unit.

Other Design Decisions

Since the purpose of this project is to develop a re-implementation of the Transputer's core functionality, we decided to omit a number of instructions that common programs are not expected to use. However, introducing support for the missing instructions is a relatively easy task since the OpenTransputer already implements most of the hardware that would execute them. In particular, our design cannot execute the following groups of instructions.

Figure 3.7: Signal timing for correct CPU behaviour.

The resetch low-level instruction that is used to reset an autonomous link controller.

Debugging instructions to read the scheduler front and back pointers.

Floating-point arithmetic.

Long integer arithmetic.

Multiplication and division.

For a detailed list of all instructions refer to Appendix A.

3.2 External Communication

Beneš Networks

The Inmos Transputer has four external bidirectional communication links that can be used to connect up to four devices. There is no limit on the number of Transputers that can be connected in this fashion [16]. However, as the network grows, communication operations may become slower since messages need to be relayed by intermediate nodes before the destination is reached. This clearly reduces the usability of the Transputer as a building block for assembling large parallel networks. Inmos realised this problem and developed a 32-way crossbar switch known as the C004 that is compatible with the serial protocol of the Transputer links. The idea is that the switch routes messages between the Transputers rather than joining the processors using direct point-to-point connections. Taking this into account, we implemented the OpenTransputer with a single bidirectional link rather than four. This link is meant to be connected to small switches arranged in a Beneš network. Beneš networks [5] are a special form of Clos networks [8] composed of simple 2×2 switches and three or more stages, as shown in Figure 3.8. They were formalised by Václav E. Beneš while working at Bell Labs researching different topologies for telephone switching networks. In a Beneš network, if r is the number of switches in every stage, the network supports N = 2r different inputs and the same number of outputs. Moreover, the network contains 2 log₂ N − 1 stages and a total of N log₂ N − N/2 switches.
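The sizing formulas above are easy to check numerically. The following sketch (function name ours) computes the number of stages and switches for a power-of-two network size:

```python
# Quick check of the Beneš network formulas quoted above: for N = 2**k
# inputs, the network has 2*log2(N) - 1 stages of N/2 switches each, i.e.
# N*log2(N) - N/2 switches in total.
import math

def benes_dimensions(n_inputs: int):
    k = int(math.log2(n_inputs))
    assert 2 ** k == n_inputs, "Beneš networks need a power-of-two size"
    stages = 2 * k - 1
    switches = n_inputs * k - n_inputs // 2   # = stages * (N/2)
    return stages, switches

# The 8x8 example of Figure 3.8: 5 stages of 4 switches.
print(benes_dimensions(8))    # -> (5, 20)
```

For the 2048-processor maximum mentioned later in this section, the same formulas give 21 stages and 21504 switches.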
In the case of the OpenTransputer, each processor needs both an input and an output link to communicate. Therefore, the Beneš network can be viewed as folded over on itself, as shown in Figure 3.9. The resulting network is composed of crossover switches with four inputs and four outputs (4×4) rather than the simpler 2×2 crossbar switches. Beneš networks are rearrangeably non-blocking [5], meaning that given any input a path can always be found to an output without colliding with any other path from other connections in the network.

Figure 3.8: Example 8×8 Beneš network.

Figure 3.9: Folded over Beneš network with capacity for 8 OpenTransputers.

However, in order for this to happen some of the previously established connections might have to be rearranged to use different intermediate switches. For the OpenTransputer, the non-blocking property of Beneš networks means that, given the communication pattern of a program, it is always possible to find an arrangement of the channels in the network such that no two paths cross. In other words, we can ensure that no two packets will try to use the same link while the program is being executed. Clearly, this is an attractive

property because in concurrent systems the performance of a program is often limited by network traffic. For instance, in a more traditional matrix arrangement, the shortest paths between two nodes inevitably use the switches on the main diagonal. As a result, the traffic through these nodes of the network is significantly higher, causing a large number of collisions and slowing down the whole system. The non-blocking property is the main reason for our choice of network configuration. The Occam programming model facilitates analysing programs statically in order to extract the communication pattern and efficiently map channels to non-blocking paths in the network. It is important to highlight that this requires the development of new software tools that generate such mappings for predefined network topologies.

Channel Addressing

In the OpenTransputer, channels may be used for internal communication, external communication and I/O pin handling. Since the same input and output instructions are used for all these operations, the processor uses channel addresses to differentiate what actions should be performed. As illustrated in Figure 3.10, the two least significant bits of the address identify the type of channel. Hence, if bit 0 is set, the processor knows that this is an external channel address and the I/O instructions are executed accordingly. On the contrary, if the least significant bit of the address is not set, the decision between internal communication and I/O pin handling relies on the value of the bit with index 1. Recall that for internal communication any word in memory can be used as the channel; therefore, the address consists of a memory address with the two least significant bits set to 0. On the other hand, for external communication the channel address encodes the route through the Beneš network followed by a packet to reach the destination OpenTransputer.
Finally, when a channel refers to an I/O pin, the address encodes the identifier of the pin to be used. In the OpenTransputer, I/O pins are not memory-mapped as in most conventional processors. For a detailed description of I/O pin handling refer to Section 3.3.

Figure 3.10: Bit fields of the channel addresses for internal and external communication and I/O pins.

Message Routing

The Beneš network is implemented as a collection of smaller four input and four output switches. As shown in Figure 3.9, the OpenTransputers are connected at one end of the network, which we call the core. The other end of the network always terminates in unconnected ports and is called the edge. The available connections at the edge can be used to attach more switches and increase the capacity of the network; every new layer of switches linked to the edge doubles the capacity. The routing operation within the network of switches consists of two stages. Firstly, the message is forwarded by the individual

switches towards the edge of the network. Then, when the message reaches a previously determined layer (the turning point), it is forwarded towards the core until it reaches the destination. In this scheme, the channel address encodes the complete route through the network. As illustrated in Figure 3.10, channel addresses have five components:

T-bit. This is the most significant bit of the address and can either be 0 or 1 depending on whether the channel is synchronous or asynchronous respectively. In the Occam communication model, all I/O operations are synchronised. Hence, to use asynchronous channels the language must be extended and the compiler modified to cope with the changes. This is outside the scope of this project and has not been attempted. We foresee that asynchronous channels will be fully supported in future iterations of the OpenTransputer, so we have decided to include the T-bit within the address.

Destination port. This is a four-bit value that encodes the destination port in the receiving OpenTransputer. Each processor has 1 output and 16 input ports, allowing up to 16 processes to wait for input operations.

Route towards edge (RTE). An 11-bit field that specifies the path of the message towards the edge of the network. Each switch between the sender OpenTransputer and the turning point forwards the message using its left or right port if the value of the corresponding bit of this field (starting from the least significant bit) is 0 or 1 respectively.

Route towards core (RTC). An 11-bit field that specifies the path of the message towards the core of the network. Each switch after the turning point forwards the message using its left or right port if the value of the corresponding bit of this field (starting from the least significant bit) is 0 or 1 respectively.

Turning point. A 4-bit field that encodes the depth (layer number) within the network that the message must reach before it turns around and heads back towards the core.
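The field widths sum to a full 32-bit word (1 + 4 + 11 + 11 + 4 field bits plus the type bit). The following sketch packs and classifies such addresses; the widths come from the text, but the exact bit positions are our reading of Figure 3.10 and should be treated as an assumption rather than the definitive layout.

```python
# Illustrative pack/classify for OpenTransputer channel addresses. Assumed
# layout of an external (hard) channel, MSB to LSB: T (1 bit), turning point
# (4), RTC (11), RTE (11), destination port (4), type bit (1, always 1).
def make_hard_channel(t: int, turn: int, rtc: int, rte: int, port: int) -> int:
    assert t < 2 and turn < 16 and rtc < 2048 and rte < 2048 and port < 16
    return (t << 31) | (turn << 27) | (rtc << 16) | (rte << 5) | (port << 1) | 1

def classify(addr: int) -> str:
    # Bit 0 set -> external channel; otherwise bit 1 picks I/O pin vs memory.
    if addr & 1:
        return "external"
    return "io-pin" if addr & 2 else "internal"

addr = make_hard_channel(t=0, turn=3, rtc=0b00, rte=0b10, port=5)
print(classify(addr))          # -> external
print((addr >> 27) & 0xF)      # turning point -> 3
```

An internal (soft) channel is simply a word-aligned memory address, which is why its two least significant bits are guaranteed to be 0.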
To better understand the routing mechanism, consider the example in Figure 3.11. This network consists of 12 switches and two OpenTransputers connected at opposite ends of the network. To output a message, the processor sends the channel address and the data byte to the switch in layer 1 that it is connected to. Since this is the first switch in line and the message has not reached the turning point, it inspects the least significant bit of the RTE field. Because the bit is not set, it forwards the message and address to the next switch in the path using its left output port. Similarly, the second switch inspects the address, yet this time it forwards the message using its right output port because the second bit of RTE is 1. The third switch in the path is the turning point as specified in the channel address. Thus, the switch in layer 3 forwards the message back to layer 2. The next stop in the network is the first switch after the turning point, so it checks the RTC field and decides to forward the message using its left output port. The process continues until the data reaches the destination OpenTransputer. Note that there is a single bidirectional link between each node in the network, so the processor is responsible for delivering the message to the appropriate input link controller according to the destination port specified in the channel address. It can easily be seen that this form of addressing allows us to connect up to 2048 OpenTransputers, since the route towards edge and route towards core bit fields are 11 bits each.

Collision Resolution

Each of the switches in the Beneš network consists of four input and four output ports connected as shown in Figure 3.12. Each of these ports is attached to an independent controller, meaning that the

Figure 3.11: Example route of a package through a Beneš network of OpenTransputers.

switch can route up to four messages simultaneously provided that their paths do not cross. The input controllers receive messages from other nodes in the network, process the channel address and decide where to forward the data. The output controllers are substantially simpler because they only have to moderate which of the three messages forwarded by the input controllers is output to the physical link next. The output controller implements a token mechanism that ensures all input controllers get a fair share of the physical link time. The token is assigned to a controller for an indefinite amount of time; when the controller finishes using the link, it passes the token to the next controller that needs to output a message. Despite the collision-avoiding properties of Beneš networks, sometimes the communication patterns of an application cannot be inferred at compile or load time. This will inevitably cause collisions, a condition in which two or more messages must be routed through the same physical link. To avoid data loss, each node in the network must implement a protocol that ensures each message sent has been successfully received at the other end of the physical link before the next packet is sent. Figure 3.13 gives a step-by-step description of the protocol used for the OpenTransputer project. When the sender wishes to transmit a packet to another node in the network, it makes the data available on the link (b) and sets the HasData line high (c).
The sender then blocks until there is a response from the receiver acknowledging that the data has been successfully transferred. When the receiver is able to attend to the

request, it copies the input data and channel address into an internal buffer and sets the ReadData line to acknowledge that the packet was received successfully (d). The sender notices that ReadData is 1 and sets HasData low (e). Finally, the receiver notices this change in its input signals and sets ReadData to 0, concluding the data transfer (f). The link is now ready to be used for another operation.

Autonomous Link Controllers

The OpenTransputer has a single bidirectional link that is connected to a network switch. The link is shared between the output and input link controllers, which were implemented using exactly the same microcoded approach explained earlier. In this section we describe the operation of the controllers and their interaction with the CPU.

Output Controller

A disadvantage of the original Transputer design is that each external channel must be bound to a hardware link. Therefore, there can only be four pending input or output operations at any one point in time, since there are only four serial communication links. Despite the compiler's support for detecting conditions where problems might arise, this issue increases the complexity of the software, which needs to take the limitation into account if the parallelism of a network is to be exploited. To mitigate this problem, we have decided to introduce virtual channels. Virtual channels are an abstraction that gives the programmer the impression that there is an infinitely large number of output links. In reality, there is a single link that is shared between all communicating processes. For this purpose, we developed an output link controller that implements a scheduler in hardware similar to that used by the CPU to allocate processing resources. The output link controller has two registers that store pointers to the front and back processes of a linked list.
This structure acts as a queue where new processes are appended to the back and dequeued from the front to be given a share of communication time. There are two such lists, for high and low priority processes, yet communicating processes cannot be preempted. That is, if a low priority process is using the link and a high priority process wishes to communicate, the latter is appended to the back of the high priority queue and run when the low priority process completes. The operation of the output controller is conceptually simple. When the OpenTransputer is reset, both queues are emptied and the controller is set to the ready state. When a process wishes to output a message, the processor sends the size of and pointer to the message, together with the workspace pointer, to the link

Figure 3.12: Internal connections of 4 input and 4 output switches.
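The two-priority, run-to-completion scheduling behaviour described above can be sketched as follows. All names are illustrative; the real controller keeps front and back pointers to linked lists threaded through process workspaces, not Python deques.

```python
# Behavioural sketch of the output link controller's virtual-channel
# scheduler: two FIFO queues (high and low priority), and a process that is
# already using the link always runs to completion.
from collections import deque

class OutputLinkScheduler:
    def __init__(self):
        self.queues = {"high": deque(), "low": deque()}
        self.current = None          # process currently using the link

    def request(self, process, priority):
        if self.current is None:
            self.current = process   # link free: start communicating at once
        else:
            self.queues[priority].append(process)

    def complete(self):
        # Current transfer finished: pick the next waiter, high priority first.
        finished = self.current
        self.current = (self.queues["high"].popleft() if self.queues["high"]
                        else self.queues["low"].popleft() if self.queues["low"]
                        else None)
        return finished              # the CPU reschedules this process

sched = OutputLinkScheduler()
sched.request("P1", "low")           # link free, P1 communicates
sched.request("P2", "low")
sched.request("P3", "high")          # waits: P1 cannot be preempted
sched.complete()                     # P1 done; P3 (high) goes next, before P2
print(sched.current)                 # -> P3
```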

Figure 3.13: Data exchange protocol between two components of a network of OpenTransputers.
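The HasData/ReadData exchange of Figure 3.13 is a classic four-phase handshake. The following behavioural model (wire and function names are ours) walks through steps (b) to (f); the real links also carry the data and channel-address buses in parallel with the control wires.

```python
# Behavioural model of the four-phase HasData/ReadData handshake shown in
# Figure 3.13. Steps are labelled to match the figure's (b)-(f) phases.
class Link:
    def __init__(self):
        self.has_data = 0
        self.read_data = 0
        self.data = None

def transfer(link: Link, payload):
    trace = []
    link.data = payload; link.has_data = 1          # (b)+(c) sender offers data
    trace.append("sender: HasData=1")
    received = link.data; link.read_data = 1        # (d) receiver latches data
    trace.append("receiver: ReadData=1")
    link.has_data = 0                               # (e) sender sees the ack
    trace.append("sender: HasData=0")
    link.read_data = 0                              # (f) handshake completes
    trace.append("receiver: ReadData=0")
    assert received == payload and link.has_data == link.read_data == 0
    return trace

print(transfer(Link(), 0x2A))
```

Because both wires return to 0 before the next packet, a sender can never overwrite data that the receiver has not yet latched, which is exactly the loss-avoidance property the text requires.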

controller. Since the link is not busy, the output operation is run by the controller. If another process wishes to use the link while it is busy, the controller uses the DMA mechanism to store the length of and pointer to the message in the process workspace, and enqueues the process according to its priority. When the first process finishes communicating, the output link controller signals the CPU to request that the process be scheduled for execution. The controller then removes the next waiting process from the front of the queue and runs its output operation. It cannot be stressed enough that the output link controller runs concurrently with and independently of the CPU. Hence, while the controller is performing an operation, the CPU will be executing another process.

Input Controllers

As mentioned before, the Inmos design had four input link controllers, which we consider a barrier to using the processor to build large parallel networks. To mitigate this problem we introduced the concept of virtual channels, yet it is difficult to implement this idea for input operations. If we modified the input link controller to implement a queue of processes, we would need a buffer of potentially unbounded size to store the partially received messages from other OpenTransputers. Furthermore, messages for any waiting process can arrive at any point in time. Thus, maintaining a simple linked list of all processes would be very inefficient, since in the worst case the controller would have to iterate through every item in the list to match an incoming message with its intended recipient. In this sense, a more sophisticated data structure would be needed to implement the scheduler of the input link controller. However, complex schemes are difficult to implement in hardware and have other undesirable side effects such as increased power consumption.
Due to these issues, we consider that implementing an input controller similar to the output one has more disadvantages than benefits, yet we still consider that four links, as in the original Transputer, is too limiting. For this reason, we decided to include 16 input link controllers per OpenTransputer, which share a single bidirectional link with the output controller. Each of the 16 input link controllers corresponds to a different port number as specified in the channel address for external communication. Like the output controller, the input controllers use the DMA mechanism to access memory and operate concurrently and independently of each other and the CPU. The interaction between the CPU and the input link controllers is slightly more complex than for the output link controller, because in Occam input operations can be used within alternative constructs. A controller can be in any of four states:

Ready. Data has been received from a remote process, but there is currently no receiver.

Requested. The controller is communicating on behalf of a process.

Enabled. In an alternative construct, processes are bound to guards or conditions, and the first process whose condition is met is executed. Naturally, the conditions can be logical expressions, but they can also be inputs from channel operations; in this case, the process bound to whichever input arrives first is the one executed. To support alternative constructs, the input link controllers can be set to enabled, meaning that they must notify the CPU when the first byte arrives on the link before communication proceeds.

Waiting. The controller is not communicating on behalf of a process and no data has arrived on the link.

When the controller is reset its state is set to waiting. If a process wishes to execute an input operation, the CPU interacts with one of the input controllers by sending it the message length and destination pointer as well as the process workspace pointer.
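The state transitions described above can be summarised in a small table. This is our reading of the text, with invented event names, and it simplifies the alternative-construct case (a disabled controller that already holds data goes to ready rather than waiting):

```python
# Behavioural sketch of the input-controller state machine: the four states
# (waiting, ready, requested, enabled) come from the text; event names and
# the transition table are illustrative, not the microcode.
TRANSITIONS = {
    ("waiting", "data_arrives"):   "ready",      # byte arrives, no receiver yet
    ("waiting", "cpu_requests"):   "requested",  # CPU starts an input operation
    ("ready",   "cpu_requests"):   "requested",  # a receiver turns up later
    ("waiting", "enable_alt"):     "enabled",    # guard of an alternative
    ("enabled", "data_arrives"):   "requested",  # this guard fires first
    ("enabled", "disable_alt"):    "waiting",    # another guard won instead
    ("requested", "msg_complete"): "waiting",    # CPU reschedules the process
}

def step(state: str, event: str) -> str:
    # Unlisted (state, event) pairs leave the controller where it is.
    return TRANSITIONS.get((state, event), state)

s = "waiting"
for ev in ("data_arrives", "cpu_requests", "msg_complete"):
    s = step(s, ev)
print(s)    # -> waiting
```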
The input controller then becomes requested and runs the operation to completion. When the message has been fully received, the CPU will be flagged to

Figure 3.14: Packet formats for external communication in the Inmos Transputer and the OpenTransputer.

reschedule the process. On the other hand, if data is received before the controller is requested, it enters the ready state and waits until a process performs an input operation on that controller; afterwards, the communication continues as normal. The final case is when the CPU executes an alternative construct. In this case, all involved input controllers are enabled, and the first of them to notify the CPU that data has been received is allowed to communicate. The remaining enabled controllers are set to the ready or waiting state (disabled) depending on whether they have received data from the link.

External Communication Protocol

The external communication protocol in the OpenTransputer is equivalent to that of the original Transputer. The protocol is designed to support synchronised inter-process communication using point-to-point channels as defined in Occam. To support this, the external links of both the OpenTransputer and the Inmos design are bidirectional, meaning that there are independent signal wires in each direction. When communicating, messages are broken down and sent one byte at a time, and every data packet must be acknowledged before the next data item is dispatched [18]. For the sake of clarity, the interactions between input and output controllers exchanging two bytes are shown in Figure 3.15. In this example the output controller initiates the data transfer before a receiving process is ready in the remote OpenTransputer. For this reason, the output controller waits for some time before it receives the first acknowledgement byte and communication can proceed.
Also, note that the sender process is only rescheduled after the last acknowledgement packet is received, to guarantee that communication is completely synchronous. Even though the communication protocol is the same in both the OpenTransputer and the original Transputer, the structure of the individual data packets differs. The Inmos engineers already struggled to fit a single Transputer into a chip; hence, manufacturing a device with more than a single processor within the same chip was not an option. Due to this (and potentially to reduce the amount of wiring), each Transputer was fitted with four serial communication links that allowed users to assemble large parallel networks. Either data or acknowledgement packets can be exchanged; their structure is shown in Figure 3.14. The former are 11-bit packets containing a single byte of actual information, and the latter are 2-bit packets that simply tell the sender that a data packet was successfully received. Both types of packet have the most significant bit set for synchronisation purposes, while the least significant (stop) bit is unset to differentiate between the end of a packet and the beginning of the next [14]. We envision that multiple OpenTransputers will be manufactured within the same chip, a setting in which communication is more reliable and we are less concerned with the amount of wiring needed to connect the processors to the switch. Hence, we have replaced the serial links with parallel connections for

Figure 3.15: Interaction between input and output controllers to transfer a message over the network [16].

on-chip communication. This also has the effect of speeding up data transactions, because all the bits belonging to a packet are transferred simultaneously. There are still two types of packet transferred by OpenTransputers, yet we have eliminated the need for the synchronisation bits, as shown in Figure 3.14. Also, the 32-bit channel address is forwarded in parallel with the packet. To assemble a network of two or more chips, the user would have to connect together the individual switches. However, the parallel connections approach introduces a usability problem: for instance, connecting two switches together would require over 160 individual wires. To tackle this issue, switches that implement a serial communication protocol, similar to that of the original Transputer links or the IEEE 1355 [4] standard, must be developed. This is out of the scope of this project and was not attempted.

3.3 I/O Interface

In many modern processors, the I/O pins are accessed through a mechanism known as memory-mapped I/O. The idea consists of using the same bus to address both memory and I/O pins.
To differentiate between the two possibilities, some reserved addresses correspond to the pins while the remaining addresses are actual memory locations. When an event occurs on any of the I/O pins, a signal is generated to attract the processor's attention so that the event is handled without further delay. In our opinion, this mechanism is difficult to grasp for the inexperienced user, and it does not take advantage of the other features of the Transputer, such as channels. The original Transputer developed at Inmos handles I/O pins as if they were external communication links. Therefore, the same input and output instructions can be used to read and write I/O pins attached

to the chip. Due to its simplicity, even the most basic Occam code can communicate with peripheral hardware such as sensors or displays. In the OpenTransputer we maintain this idea by implementing I/O pin handlers as special link controllers. To maximise their flexibility we also introduce a new instruction that enables the processor to configure a pin handler at run-time. The I/O pin handlers in the OpenTransputer expose the same interface as the communication links. In other words, the processor interacts in exactly the same way with both the pin handlers and the autonomous link controllers. When the processor wants to read or write a pin, it sends the handler the number of bytes to be read or written and the memory address where the data should be stored or read from. The process that is waiting for I/O is descheduled and enters a waiting state. When the pin handler finishes the requested operation, it signals the processor, which then reschedules the process, and execution proceeds as normal. The I/O pin handlers are microcoded components in exactly the same way as the communication link controllers and the CPU, yet much smaller. We have chosen the microcoded approach to guarantee that the I/O pin handlers can be easily modified in the future to add new functionality. The processor communicates with the I/O pins by using special channel addresses that must be stored in the B register when input or output instructions are executed. These addresses were carefully chosen so that there is no ambiguity between the link controllers, internal communication channels and the I/O pins (refer to Section ). Hence, when interacting with an I/O pin handler the least significant two bits of the address must be set to the binary pattern 10. The pin handler identifier itself is stored in bits 2 to 6 (see Figure 3.10) and it ranges from 17 to 31, meaning that the OpenTransputer has 15 pin handlers.
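The addressing rules above can be sketched in Python. The two-bit type code and the five-bit identifier field follow the text and Figure 3.10; the catch-all handling of the other type patterns is our own simplification, not part of the specification.

```python
def decode_channel_address(addr: int):
    """Decode an OpenTransputer channel address (sketch based on
    Figure 3.10: bits 1..0 select the channel type, bits 6..2 hold
    the controller/pin identifier)."""
    channel_type = addr & 0b11         # least significant two bits
    ident = (addr >> 2) & 0b11111      # bits 2 to 6
    if channel_type == 0b10 and 17 <= ident <= 31:
        return ("io_pin", ident)       # one of the 15 pin handlers
    return ("other", ident)            # link controller / internal channel

# An address whose low bits are 10 and whose identifier field is 20
# should be recognised as I/O pin handler 20.
addr = (20 << 2) | 0b10
print(decode_channel_address(addr))   # -> ('io_pin', 20)
```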
The choice of 15 handlers stems from the fact that there are 17 autonomous link controllers, which require five bits to be addressed, leaving 15 unused identifiers. To make the design more flexible, we have introduced a new instruction, confio, that can be used to configure the pins before usage. The same instruction can be used to reconfigure the pins at run-time as needed. confio is similar to other channel instructions in that, before it is executed, the programmer must ensure that the B register contains the address of the target I/O pin. Furthermore, the actual configuration is specified in the A register, enabling us to use up to 32 bits to control different options. Currently, it is only possible to set a pin in input or output mode. In the former case, when the processor makes a request, the handler simply reads the input pin and stores the value (i.e. either 1 or 0) in memory using the DMA controller. If the pin is configured as output, when a request is made to the handler, it reads the memory location where the data to output is stored and places it in a register that is directly connected to the pin. For the purposes of this project, the implemented functionality suffices to showcase the OpenTransputer, yet in many situations further control over the I/O interface is desirable. For this reason, we implemented the functionality in such a way that it is easy to upgrade and include more features. For example, a possible scheme to configure the I/O ports could use the least significant nine bits of the A register in the following way:

- 2 bits to configure the pin handler to reschedule a process when the pin signal is rising, falling or changing; this specifies what needs to occur before a pin is ready to be read.
- 1 bit to set the pin as handshaken or clocked. The former means that the pin handler changes its output signal as a result of changes in its input.
- 1 bit to configure the pin as master or slave.
This option is only relevant when the handler is used in handshaken mode. A master pin initiates communication with the hardware it is connected to; in other words, it changes its output signal first even though its input remains the same. On the other hand, a slave pin always waits for its input signal to change before making changes to its output.

- 4 bits to set the clock source for this pin.
- 1 bit to configure the port as input or output.

Finally, since the I/O pins are essentially treated as output channels, it might be desirable for I/O pins to be usable within alternatives as regular input channels. However, at the moment this feature is not needed, so we have not implemented it.

3.4 The OpenTransputer System

Figure 3.16: Major components of the OpenTransputer system.

The five major components of the OpenTransputer system are shown in Figure 3.16. The CPU is the main microcoded component that executes all core instructions as well as internal communication operations. It is also responsible for interfacing with and managing all other components of the system and ensuring that all processes are allocated a fair share of CPU time. The input and output link controllers run concurrently and independently of the CPU. The links interact with the processor to enable communication operations between processes that reside in different CPUs. Moreover, the link controllers interface with the network switches to pass messages between OpenTransputers. There are 15 I/O pin handlers that interact with the processor in the same fashion as the link controllers. However, instead of communicating with other processors, the handlers have a direct connection to the external I/O pins where hardware peripherals can be connected. The final major component of the OpenTransputer system is the on-chip Random Access Memory (RAM). The memory gives access to one 32-bit word per operation, and contrary to the original Inmos design, addresses start from 0 rather than from the most negative integer. The CPU has a direct connection to the memory, while communication links and I/O pin handlers access it by means of a DMA controller.
In the interest of prioritising communication operations over individual process execution, the DMA controller has a higher priority

to access memory than the CPU. To achieve this, we introduced a 1-bit field in each microinstruction that indicates whether a memory access is required. If this bit is not set, the microinstruction does not use memory and execution proceeds regardless of the DMA controller. In contrast, if the bit is set, the microinstruction will only be executed if the DMA is not active; otherwise the CPU will wait until the memory becomes available.

3.5 Digital Design with Hardware Description Languages (HDL)

Due to the increasing complexity of digital circuits, designers needed logic descriptions of their implementations that were not tied to any particular electronic technology such as CMOS. For this reason, Hardware Description Languages (HDL) were created that can be used to describe digital circuits at the register-transfer level (RTL). In some sense, HDLs are similar to conventional software programming languages like C, yet they have an explicit notion of time and concurrency [21]. Thanks to this, designs described in HDLs can be simulated to verify their behaviour. The two most widely used HDLs are VHDL and Verilog; for this project we use the latter. In Verilog there are three main levels of abstraction for modelling digital logic:

- Structural modelling. Also referred to as gate-level modelling, this allows systems to be described purely in terms of logic gate primitives such as AND, OR and NOT. This is a very low-level description of a digital circuit, consisting only of modules with declarations of input and output ports and a list of logic gates [6].
- Dataflow modelling. This enables designers to focus on the function of a system rather than on the individual logic gates. In dataflow modelling, circuits are described in terms of boolean equations, and there are operators that can act as input and output sources. In this sense, dataflow is the next level of abstraction up from structural modelling [22].
- Behavioural modelling.
Perhaps the highest level of abstraction, this enables the designer to describe the function of the hardware in an algorithmic manner. Behavioural modelling in Verilog resembles C programming [22].

Structural modelling is generally only used when the number of gates is very limited and the designer wishes to control how each gate is connected. For larger modules it is desirable to focus on the general function of the circuit rather than the specific connections; in this case, dataflow is preferred over structural modelling. On the other hand, behavioural Verilog is commonly used when the designers wish to evaluate the trade-offs of multiple techniques. This is referred to as architectural evaluation and takes place at an algorithmic level [21]. Despite potentially producing less efficient designs, we have implemented the OpenTransputer using a combination of dataflow and behavioural Verilog, because we had a limited amount of time to deliver the finalised design.

Nowadays there are tools that can transform a high-level C implementation of an algorithm into a hardware design. This process is called high-level synthesis and is often used as a preliminary step to evaluate a technique or idea, since it is generally faster than implementing circuits at the RTL level. Hence, we considered implementing the OpenTransputer using SystemC, a set of C++ class libraries that provides a cycle-accurate model for hardware design. Nevertheless, with large and complex designs such as processors, high-level synthesis will normally result in extremely large and inefficient hardware implementations. In our case, an overly large synthesised design would mean that our target FPGA does not have enough programmable logic to implement it. For this reason, high-level alternatives to HDLs were quickly discarded for this project.

Chapter 4

Critical Evaluation

4.1 Evaluating Design Decisions

Thanks to recent improvements in technology and manufacturing techniques, we have been able to approach the design of the OpenTransputer in a radically different way from that of the original Transputer. The following is a concise list of the changes introduced in our implementation of the architecture, all of which have been extensively discussed in Chapter 3.

- Replaced the microcode ROM with hardwired logic that generates the necessary control signals to drive the datapath at each clock cycle.
- Replaced the bus datapath used in the Inmos Transputer with a wide datapath that includes additional wiring to increase the number of connections between the different components.
- Included multiple instances of the same components, such as addressing units, to increase the number of parallel computations that can be carried out in each clock cycle.
- Introduced a new external communication mechanism. Firstly, the four serial communication links were replaced with a single parallel bidirectional link that is connected to a network switch. Secondly, the concept of virtual channels was introduced, allowing the OpenTransputer to maintain an unbounded number of simultaneous external output operations. Finally, the number of input ports was increased from 4 to 16.
- Completely changed the communication mechanism of the original Transputer, which was based on direct point-to-point connections between the processors. The OpenTransputer parallel link is directly wired to switches forming a Beneš network. This network ensures that connections can be established in a non-blocking fashion, which is an advantage over the Transputer approach, where messages had to be relayed via intermediate processor nodes.
- Introduced a new I/O interface that uses the OpenTransputer's external communication mechanism and its virtual channels, streamlining the way programmers interact with external devices.
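To give a sense of the scale of such a network: a standard n-input Beneš construction uses (2*log2(n) - 1) stages of n/2 two-by-two crossbars. The sketch below computes this count; note it describes the textbook construction, not the internal organisation of our particular switches.

```python
import math

def benes_switch_count(n: int) -> int:
    """Number of 2x2 crossbar switches in an n-input Benes network
    (n a power of two): (2*log2(n) - 1) stages of n/2 switches each."""
    k = int(math.log2(n))
    assert 2 ** k == n, "n must be a power of two"
    return (2 * k - 1) * (n // 2)

# The 8x8 network of Figure 3.7 needs 5 stages of 4 switches.
print(benes_switch_count(8))   # -> 20
```

The switch count grows as O(n log n), which is what makes the network cheap to scale while remaining rearrangeably non-blocking.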
Since the OpenTransputer is still in its early stages of development, it is difficult to make an objective comparative study with other architectures. Instead, we will compare the performance of our implementation with that of the Inmos Transputer. Our metric is the number of cycles taken to execute the core instructions in both the OpenTransputer and the Inmos Transputer. Table 4.1 shows the cycle count for 15 primary instructions of the instruction set.

Instruction   Inmos Transputer   OpenTransputer
ldl           2                  1
stl           1                  1
ldlp          1                  1
ldnl          2                  1
stnl          2                  1
ldnlp         1                  1
eqc           2                  1
ldc           1                  1
adc           1                  1
j             3                  1
cj
call          7                  4
ajw           1                  1
nfix          1                  1
pfix          1                  1

Table 4.1: Execution time (in clock cycles) of primary instructions in both the Inmos Transputer [15] and the OpenTransputer.

It can be seen from the table that, out of the 15 instructions displayed, the cycle count of 6 of them improved by at least a factor of 2. We consider this to be the effect of the significant changes introduced in the datapath of the processor, namely module replication and additional wiring between the components. The OpenTransputer can compute multiple results for the same operations simultaneously because there are additional modules to do so and because it is possible to feed the logic blocks with different operand values. This is not the case in the Inmos Transputer, where the bus system feeds most logic blocks with the same input values. Another interesting result in Table 4.1 is the time taken to execute a call instruction, which only improved by 3 clock cycles. call is executed when there is a call to a procedure in Occam: the instruction stores the A, B and C registers and the return address in memory and updates the instruction pointer. The OpenTransputer can only complete a single memory access per cycle, suggesting that the performance of this instruction is bounded by the number of memory operations. Table 4.2 presents the cycle count for some secondary instructions. Notice that the scheduling instructions (i.e. runp and startp) have improved by a factor of 2 in the worst case. This is because we have implemented shadow registers to improve the performance of context switches between high and low priority processes, as discussed in Section . The use of shadow registers eliminates the need to perform 6 memory accesses to store the context of the process being blocked.
Therefore, the worst-case scenario for startp and runp is reduced to only 5 clock cycles, while the Inmos Transputer takes approximately 12. The cycle count of most instructions in Table 4.2 reinforces our claim about the OpenTransputer: instructions such as alt and talt have improved by at least a factor of 2 due to module replication and the wide datapath. However, other instructions such as enbt and enbc remain almost unchanged, possibly because they do not contain any parallelism that the OpenTransputer could exploit. A final observation is that the performance of memory block transfer operations has decreased by approximately a factor of 2. Recall that the OpenTransputer contains dedicated combinatorial logic blocks that implement most of the functionality pertaining to these operations. However, it can be inferred from the table that the equivalent logic in the Inmos Transputer is significantly faster and is only bound by memory, since there are two accesses per word transferred. This is due to the lack of optimisations in the current OpenTransputer design and would be part of future development on the platform.
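The speedup claims above can be reproduced directly from the cycle counts in Tables 4.1 and 4.2. The short Python script below does so; cj is omitted because its cycle counts are not listed above.

```python
# Cycle counts (Inmos, OpenTransputer) from Table 4.1.
primary = {
    "ldl": (2, 1), "stl": (1, 1), "ldlp": (1, 1), "ldnl": (2, 1),
    "stnl": (2, 1), "ldnlp": (1, 1), "eqc": (2, 1), "ldc": (1, 1),
    "adc": (1, 1), "j": (3, 1), "call": (7, 4), "ajw": (1, 1),
    "nfix": (1, 1), "pfix": (1, 1),
}
# Instructions that improved by at least a factor of 2.
at_least_2x = sorted(op for op, (t, o) in primary.items() if t / o >= 2)
print(at_least_2x)   # -> ['eqc', 'j', 'ldl', 'ldnl', 'stnl']

# Block move (Table 4.2): 8 + 2w cycles on the Inmos part, 6 + 5w on
# ours, so for large w the OpenTransputer is slower by a factor
# approaching 5/2.
def move_inmos(w): return 8 + 2 * w
def move_open(w):  return 6 + 5 * w
print(move_open(256) / move_inmos(256))   # -> ~2.47
```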

Instruction   Inmos Transputer   OpenTransputer
runp
startp
endp          13                 3
alt           2                  1
altend        6                  2
talt          4                  2
move          8 + 2w             6 + 5w
enbt          8                  6
enbc          7                  6
csub0         2                  2
ccnt
testerr       3                  1
stoperr       2                  2
seterr        1                  1

Table 4.2: Execution time (in clock cycles) of secondary instructions in both the Inmos Transputer [15] and the OpenTransputer. w is the number of 32-bit words to be moved.

4.2 Design Verification

To ensure that the OpenTransputer exhibits the expected behaviour, we conducted verification throughout the development process. There is extensive documentation describing the internal behaviour of the processor, and we used these written sources as the specification for our design. In particular, we made heavy use of the Occam system description of the Transputer in [28] and the more formal definition of each instruction found in [15]. Especially during the early stages of development, each datapath component was implemented and individually verified, a process known as unit verification. These components are generally small blocks of combinatorial logic such as the block move, AU, LU and addressing units. The tests were generated using three different methodologies. First, large numbers of randomised tests [33] were generated offline using Python scripts. Second, other software tools written by us were used to generate constrained randomised tests that enforce particular conditions such as overflow and underflow. Finally, a small set of handcrafted tests targeting very specific edge cases was written. Since most of the complexity has been dealt with using high-level software tools, the testbenches written in Verilog are very simple. After gaining some confidence in the correctness of the individual datapath components, our efforts were directed towards integrating the complete CPU, namely the fetch and control units together with the datapath. At this stage, the objective is to ensure the correctness of the microcode sequencing system.
That is, we wish to ensure that the CPU can correctly execute instructions that take multiple cycles by stepping through the appropriate sequences of microcodes. Common pitfalls at this stage were:

- Timing issues. Signals were not ready at the required times. On many occasions this was due to register values being updated at the same clock edge at which the signal was also required elsewhere in the design.
- Microcode errors. To our surprise, the majority of problems with the microcode sequencing system were the result of errors in the microinstructions. Fortunately, due to our development strategy, fixes for these bugs only required minor changes to human-readable microinstructions.
- Incompatible interfaces. Sometimes the individual datapath components exposed interfaces that were not compatible with each other. For instance, a common mistake resulted from discrepancies in memory addressing between the OpenTransputer and the Transputer documentation. As mentioned before, in the original Inmos machine memory addresses start from the most

negative number, while in our implementation they start from 0. To solve these issues, additional wiring and in many cases extra logic components were included in the design, resulting in a significantly more complex datapath. We consider that this effect was caused by our use of a wide datapath as opposed to the bus approach of the original Transputer, as discussed in Section .

To test the integrated CPU we decided to use a combination of constrained randomised tests and compiled Occam programs. The randomised tests were generated using the same ideas outlined before. This proved to be a very effective mechanism to ensure the correctness of small and meaningful sequences of instructions such as loads and stores from memory, arithmetic, logic, etc. Nevertheless, it is difficult to automatically generate meaningful tests that resemble real programs and make use of the scheduler, timers and internal channel communication functionality. Instead, we developed a suite of simple Occam programs that enforced execution of specific microinstructions. To verify the correctness of the Verilog description of the OpenTransputer, we used the C functional simulator as a reference model. In other words, we ran the test suite on both the HDL description and the functional simulator and compared their state after each instruction was executed; in this case, state refers to the values of the architectural registers. During the final stage of development, the OpenTransputer CPU was integrated with the autonomous link controllers and the network switches. At this point the system had grown large, and the only feasible alternative was to verify the HDL description using compiled Occam programs in much the same way as described above. Most of our efforts were focused on ensuring that the protocols between the different system components are correctly implemented and that data is not lost on each transfer.
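As an illustration of the constrained-random approach described above, the sketch below generates overflow-forcing test vectors for the adc (add constant) instruction. It is a hypothetical reconstruction, not the project's actual test-generation scripts; the function name and operand ranges are our own.

```python
import random

WORD_MASK = 0xFFFFFFFF

def gen_adc_case(rng, force_overflow=False):
    """Generate one (operand, constant, expected) triple for a 32-bit
    adc instruction.  With force_overflow the operands are constrained
    so that signed overflow must occur -- the kind of edge case
    targeted by constrained randomised testing."""
    if force_overflow:
        a = rng.randint(0x7FFF0000, 0x7FFFFFFF)      # large positive A register
        c = rng.randint(0x00010000, 0x7FFFFFFF)      # constant big enough to wrap
    else:
        a = rng.randint(0, WORD_MASK)
        c = rng.randint(0, 0xFF)
    return a, c, (a + c) & WORD_MASK

rng = random.Random(42)   # fixed seed: reproducible test vectors
a, c, expected = gen_adc_case(rng, force_overflow=True)
signed = expected - (1 << 32) if expected & 0x80000000 else expected
print(signed < 0)         # True: the sum wrapped to a negative value
```

Vectors like these would then be written out for a simple Verilog testbench to replay against the unit under test.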
We have invested approximately 20% of the project in verifying the OpenTransputer. Thanks to the well-defined specification and our design approach using a microcoded control unit, we have been able to greatly increase our productivity. We uncovered and fixed bugs in 80% of all microinstructions as well as in the sequencing system. Nevertheless, we consider that further verification efforts are due before the OpenTransputer can confidently be used within an embedded product.

4.3 Synthesis results

We synthesised our Verilog design for two different targets: a ZedBoard XC7Z020-CLG484 FPGA and a silicon target for a manufacturing process. The Verilog design consists of two processors connected by a switch. In this section we briefly discuss the results of the synthesis process and mention prominent findings.

FPGA Synthesis

Component      LUTs     Registers   LUTs as logic   LUTs as memory
Entire Design  18,646   11,102      16,022          2,624
Switch
Core0          9,294    5,526       7,982           1,312
Core1          9,289    5,526       7,977           1,312

Table 4.3: FPGA resources used by the design. The first row shows how many resources are consumed in total. The bottom three rows show how many of these resources the two cores and the switch use individually.

An FPGA uses Look-Up Tables (LUTs) to implement logic. A LUT can be described as having an arbitrary number of inputs and one output; it can then be programmed to assume a certain output value

depending on the inputs. It can essentially model any kind of logic gate or combination of logic gates. The FPGA used in this project has 53,200 LUTs in total, of which the entire design uses 18,646, as can be seen in Table 4.3. In other words, our design utilises 35% of the FPGA's LUTs. The design only uses 10% of the available registers. Most LUTs are used as logic (86%) and only 14% are used as memory. As can be seen in the second row, the switch barely consumes any of the resources available on the FPGA; the cores, on the other hand, make up most of the design. It might look curious that Core0 uses 5 more LUTs than Core1. It can be seen from Table 4.3 that these LUTs are used as logic and as such are an artifact of the synthesis process, which is non-deterministic. As the switch uses a negligible number of resources, the proportion of resources used by the individual cores is essentially the same as that of the entire design.

Component   LUTs     Registers   LUTs as logic   LUTs as memory
Core0       9,294    5,526       7,982           1,312
Datapath    7,372    5,334       6,060           1,312
Decoder     1,916    160         1,916           0
Fetch       8        32          8               0

Table 4.4: FPGA resources used by a single core. The bottom three rows show the number of LUTs consumed by the individual components of Core0.

If we examine Table 4.4, we observe that the datapath consumes most of the resources inside the core: it uses 7,372 LUTs and 5,334 registers; in other words, 79% of all the LUTs and 57% of all the registers used by the processor. It also contains the entirety of the memory used within a core. The decoder uses 1,916 LUTs and 160 registers, which means it uses 21% of all the LUTs and only 2% of the registers allocated to the processor. Finally, the fetch unit only uses a minimal 8 LUTs and 32 registers, the latter corresponding to the instruction buffer.
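The utilisation percentages quoted in this section follow directly from the raw counts in Table 4.3:

```python
# Resource totals from Table 4.3 and the XC7Z020 device capacity.
fpga_luts = 53_200                 # LUTs available on the target FPGA
design_luts, design_regs = 18_646, 11_102
as_logic, as_memory = 16_022, 2_624

print(round(100 * design_luts / fpga_luts))    # -> 35 (% of FPGA LUTs used)
print(round(100 * as_logic / design_luts))     # -> 86 (% of LUTs used as logic)
print(round(100 * as_memory / design_luts))    # -> 14 (% of LUTs used as memory)
```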
Component         LUTs     Registers   LUTs as logic   LUTs as memory
Datapath          7,372    5,334       6,060           1,312
Autonomous links  4,061    4,599       4,061           0
RAM                                                    1,312
Register file     1,408    503         1,408           0
Other

Table 4.5: FPGA resources used by the major components within the datapath.

As the datapath is such a major component of the processor, Table 4.5 shows which of its parts use the largest number of resources on the FPGA. The autonomous controllers, which provide external communication and support for the I/O interface, use over half of all the LUTs and almost 90% of the registers in the datapath. The RAM, on the other hand, contains all the LUTs used as memory within the processor, yet very little logic. The register file, which contains the registers accessible to the programmer, uses 1,408 LUTs and 503 registers, or 19% of all the LUTs and 9% of all the registers in the datapath. It is surprising to see the datapath, and in particular the autonomous link controllers, consume this many FPGA resources. In an earlier iteration of the design, the control unit written in behavioural Verilog had taken up the majority of the resources, as it was synthesised into large collections of multiplexers. In this iteration, it hardly uses any resources, which is testament to how well the Verilog generated by the microcode assembler integrates into the processor design. The reason the link controllers use so many resources can be explained by their sheer number: there are one output and 16 input controllers per processor, more than twice as many as in the Inmos Transputer. Each link encompasses microcode sequencing logic similar to that of the processor's control logic and three different interfaces: to the DMA controller, the physical link and the CPU. As mentioned

before, this should be the subject of optimisations in future versions of the OpenTransputer.

Synthesis Timing Analysis

The Verilog design runs at 41 MHz on the FPGA. The maximum possible clock frequency is limited by the logic paths with the longest delay in the circuit, also known as critical paths. We expected the critical path to be within the control unit, which is implemented by large amounts of behavioural Verilog. Nevertheless, the synthesis timing reports show that the critical paths are mainly associated with the autonomous links. In particular, the critical paths that cross the interfaces to the CPU are the ones with the longest delay. An example critical path involves an intermediate register in the output link controller that is used to implement the interface with the CPU, and another register that stores the current working values of the controller. When a process wishes to perform an output operation, the CPU places the relevant information in the intermediate register; the value is then stored in the controller's work register in another clock cycle. The large delay associated with this path is explained by the fact that these registers are placed at a physically large distance from each other by the place-and-route algorithm used by the synthesiser. The algorithm instantiates some of the intermediate registers that are logically part of the links nearer to the core processor while putting other registers closer to the links, increasing the distance between them.

Manufacturing Process

To put our design in perspective, it is useful to compare it to a manufactured processor. For this purpose, we synthesised our Verilog design for an actual silicon target with 180 nm technology using Synopsys Design Vision and compared it to the original Transputer. Once more, it is important to highlight that the microarchitectures of the two designs are radically different despite implementing the same architecture.
In some respects, the OpenTransputer is more complex than the Inmos Transputer, since it uses a wide datapath with multiple instances of the same logic modules. On top of this, the OpenTransputer is still in its early stages of development and is not fully optimised; nevertheless, given the technological advancements of the last two decades, we expect the two designs to exhibit some relation in area.

                        OpenTransputer   Transputer
Area                    3.69 mm²         64 mm²
Manufacturing process   180 nm           1000 nm

Table 4.6: Comparison of chip area and manufacturing process of the OpenTransputer and the original Transputer after synthesis.

Moore's law refers to the observation that, in the history of modern computing, the number of transistors in an integrated circuit doubles every two years [19]. Keeping this in mind, we can make some interesting observations about the synthesis results listed in Table 4.6. We see that the area of the OpenTransputer is 3.69 mm² while the Transputer in 1985 had an area of approximately 64 mm²; in other words, there is a decrease in area by a factor of 17.3. Since the OpenTransputer is more complex than the original Transputer, i.e. is made up of more hardware components and by extension more transistors, an explanation for the area reduction is that the individual components have shrunk in size. As we see in the second row of Table 4.6, this has indeed been the case: the OpenTransputer is targeted at 180 nm technology, while the Transputer was targeted at 1000 nm. This implies a reduction in the size of transistors by a factor of 5.6. If the OpenTransputer and Transputer were completely identical in terms of transistor count, then the decrease in area by a factor of 17.3 would be due to a reduction of the

target technology by the square root of this factor, i.e. √17.3 or roughly 4.2. This implies that, to some extent, the area reduction of the OpenTransputer with regard to the Inmos Transputer follows Moore's law. The difference between this implied factor (4.2) and the actual technology reduction factor (5.6) can be attributed to the lack of optimisations and the more complex microarchitecture of the OpenTransputer.
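The arithmetic behind this comparison can be checked directly from the figures in Table 4.6:

```python
import math

open_area, inmos_area = 3.69, 64.0        # mm^2, Table 4.6
open_node, inmos_node = 180.0, 1000.0     # nm, Table 4.6

area_factor = inmos_area / open_area      # overall area reduction
tech_factor = inmos_node / open_node      # transistor size reduction
implied_tech = math.sqrt(area_factor)     # expected if transistor counts matched

print(round(area_factor, 1),              # -> 17.3
      round(tech_factor, 1),              # -> 5.6
      round(implied_tech, 1))             # -> 4.2
```

Since the actual technology factor (5.6) exceeds the factor implied by area alone (4.2), the OpenTransputer must contain more transistors than the original, which is consistent with its richer microarchitecture.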


Chapter 5

Conclusion

5.1 Current Project Status

We have developed a new implementation of the Transputer architecture that we call the OpenTransputer. We have designed a radically different microarchitecture that takes advantage of state-of-the-art manufacturing techniques and current developments in the field of computer architecture. The project can be broadly divided into three major components, as per our initial aims and objectives: the CPU, external communication and the I/O interface. With regard to the OpenTransputer CPU, we replaced the original microcode ROM with hardwired logic generated by a microprogram assembler. We also implemented a wide datapath that replaces the original bus system used in the Inmos Transputer. Furthermore, module replication was heavily used, enabling the OpenTransputer to perform more simultaneous operations within the same clock cycle than the 1980s design. This has the effect of greatly decreasing the average number of cycles taken by most instructions to execute, as described in Chapter 4. On the other hand, there are still a number of instructions that our design does not recognise. We introduced significant changes to the external communication mechanism used by the Transputer. The Inmos design was equipped with four serial communication links that could be used to connect the processors together and assemble parallel processing networks of arbitrary size. In the interest of making the OpenTransputer easier to use as a building block for any sort of system, we have replaced the four serial links with a single bidirectional parallel connection to a network of switches. These networks are arranged in a Beneš fashion, providing rearrangeably non-blocking communication between all OpenTransputer nodes. We have also developed a new message routing mechanism that uses the addresses of the Occam channels to describe the path between two nodes in the network.
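To illustrate the idea of a channel address describing a path, the sketch below consumes one routing bit per switch stage, as in classic source routing. The bit layout here is purely illustrative and does not reflect the actual OpenTransputer address format, which is defined in Figure 3.10.

```python
def route_through_benes(route_bits: int, stages: int):
    """Source-routing sketch: consume one bit per switch stage,
    0 = take the upper output, 1 = take the lower output.  The
    encoding is hypothetical and for illustration only."""
    path = []
    for stage in range(stages):
        bit = (route_bits >> stage) & 1
        path.append("lower" if bit else "upper")
    return path

# A 3-stage route encoded as binary 101: lower, upper, lower.
print(route_through_benes(0b101, 3))   # -> ['lower', 'upper', 'lower']
```

The appeal of this scheme is that the switches need no routing tables: each stage simply inspects its own bit and forwards the packet.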
The OpenTransputer includes drastically different input and output controllers for communication that implement the concept of virtual channels. This approach enables the processor to keep track of a virtually unbounded number of simultaneous output operations, compared to the original Transputer, which only supports up to four. Once more, the effect is to improve the usability of the processor as a building block. Moreover, the Beneš network configuration makes it possible to efficiently map a broad range of networks, including neural networks, onto a parallel processing system built of OpenTransputers. Since we envision the OpenTransputer being used as part of mass consumer products within the IoT realm, an I/O interface that is compatible with a range of hardware components, such as sensors and actuators, is required. Despite not fully achieving this goal due to the time constraints of this project, we have developed a basic I/O interface built upon the channel communication functionality. This means that even the simplest Occam program that merely outputs an integer to a channel can drive

hardware peripherals. Currently, the OpenTransputer I/O interface is not flexible enough to implement the communication protocols commonly used by hardware peripherals. However, the interface has been designed so that new features can be easily added, as described in Section 3.3.

Our implementation was developed entirely in Verilog HDL using the Xilinx Vivado Design Suite. By the end of the project a dual-core system with a single communication switch had been successfully synthesised for an FPGA target, which enabled us to develop a simple demo application to showcase the capabilities of our implementation. The synthesised design runs at 41 MHz and utilises approximately 35% of the target FPGA. To our surprise, the majority of the logic resources are consumed by the autonomous link controllers rather than by the core CPU components. This is probably a consequence of introducing support for virtual channels and of extending the number of input controllers from 4 to 16. Future work should focus on optimising these components of the OpenTransputer system to reduce the area of the design as a whole.

5.2 Future Work

We propose a number of directions to explore in future work:

- A software development environment that facilitates the use of the OpenTransputer is essential if the device is to be used for commercial purposes. Currently, our design has no practical means of loading and debugging programs on the actual hardware.

- Due to time constraints, the control unit of the OpenTransputer was written in behavioural Verilog, as described in Section 3.5. However, synthesis tools generally produce overly large, inefficient implementations from such descriptions. Alternative approaches to the control unit should be explored, using plain combinatorial logic blocks to generate the control signals. It would also be interesting to conduct an empirical study comparing the hardwired and ROM approaches to implementing the control unit.

- The OpenTransputer can easily be used as a building block to assemble multicore systems, but for the device to be practical a serial communication protocol must be implemented in the switches for off-chip connections.

5.3 Final Conclusions

With the rising popularity of ideas such as the IoT and the current state of the technology landscape, we believe there is an opportunity for the OpenTransputer to be widely used in small devices that gather information about their environment and respond to it intelligently. To the best of our ability, we have developed a device that improves on many aspects of the original Inmos implementation; nevertheless, more work needs to be done before the OpenTransputer can be used in commercial products.

Bibliography

[1] HETE-2 spacecraft [online], available: [07 May 2015].
[2] Simple 42 microcode in VBC. Technical report, Inmos Limited.
[3] HR Arabnia. The transputer family of products and their applications in building a high performance computer. In Belzer, J., Holzman, A. G., Kent, A. (eds.) Encyclopedia of Computer Science and Technology, 39:283.
[4] IEEE Standards Association. IEEE standard for heterogeneous interconnect (HIC) (low-cost, low-latency scalable serial interconnect for parallel system construction).
[5] Václav E Beneš. Optimal rearrangeable multistage connecting networks. Bell System Technical Journal, 43(4).
[6] Michael D Ciletti. Advanced Digital Design with the Verilog HDL, volume 2. Prentice Hall.
[7] Andrea Clematis and Ornella Tavani. An analysis of message passing systems for distributed memory computers. In Proceedings of the Euromicro Workshop on Parallel and Distributed Processing. IEEE.
[8] Charles Clos. A study of non-blocking switching networks. Bell System Technical Journal, 32(2).
[9] Sivarama P Dandamudi. Guide to RISC Processors: for Programmers and Engineers. Springer Science & Business Media.
[10] Guy Harriman and David May. Simple 42 documentation. Technical report, Inmos Limited, March.
[11] John L Hennessy and David A Patterson. Computer Architecture: A Quantitative Approach. Elsevier.
[12] Anthony JG Hey. Supercomputing with transputers: past, present and future, volume 18. ACM.
[13] Charles Antony Richard Hoare. Communicating sequential processes. Communications of the ACM, 21(8).
[14] Mark Homewood, David May, David Shepherd, and Roger Shepherd. The IMS T800 transputer. IEEE Micro, 7(5):10-26.
[15] Inmos Limited. Transputer Instruction Set: A Compiler Writer's Guide. Prentice Hall, July.
[16] Inmos Limited. Transputer databook. Inmos Limited.

[17] Jagan Jayaraj, Pravin Lawrence Rajendran, and Thiruvel Thirumoolam. Shadow register file architecture: a mechanism to reduce context switch latency. College of Engineering Guindy, Anna University, Chennai, India.
[18] David May. The transputer implementation of occam. Presented at the International Conference on Fifth Generation Computer Systems, Institute for New Generation Computer Technology, Tokyo, Japan, November.
[19] Gordon E Moore et al. Cramming more components onto integrated circuits. Proceedings of the IEEE, 86(1):82-85.
[20] JR Newport. An introduction to occam and the development of parallel software. Software Engineering Journal, 1(4).
[21] Samir Palnitkar. Verilog HDL: A Guide to Digital Design and Synthesis, volume 1. Prentice Hall Professional.
[22] Mohamed Rafiquzzaman. Fundamentals of Digital Logic and Microcomputer Design. John Wiley & Sons.
[23] SGS-Thomson Microelectronics. IMS C004 programmable link switch, September.
[24] SGS-Thomson Microelectronics. IMS T bit transputer, July.
[25] SGS-Thomson Microelectronics. IMS T400 low cost 32-bit transputer, July.
[26] SGS-Thomson Microelectronics. IMS T bit transputer, February.
[27] SGS-Thomson Microelectronics. IMS T bit floating-point transputer, February.
[28] Roger Shepherd. Transputer System Description. Inmos Limited, September.
[29] Harold S. Stone. Introduction to Computer Architecture. SRA.
[30] M Tanaka, N Fukuchi, Y Ooki, and C Fukunga. Design of a transputer core and its implementation in an FPGA. Proceedings of Communication Process Architecture.
[31] Andrew S. Tanenbaum. Structured Computer Organization. Prentice Hall PTR, Upper Saddle River, NJ, USA, 2nd edition.
[32] B Tatry and M-A Claire. Myriade microsatellites: a new way for agencies and industry to various missions. In Small Satellites, Systems and Services, volume 571, page 4.
[33] Bruce Wile, John C Goss, and Wolfgang Roesner. Comprehensive Functional Verification: The Complete Industry Cycle. Morgan Kaufmann.

Appendix A

Transputer Instruction Set

The Transputer instruction set comprises two groups of instructions: primary and secondary. The former is a group of 16 instructions that are executed directly from memory. The latter are the remaining instructions, whose codes must first be loaded into the operand register; the program then executes a special primary instruction that interprets the contents of that register as a secondary instruction.

A.0.1 Primary Instructions

The OpenTransputer implements all primary instructions shown in Table A.1. Each instruction is 8 bits wide: 4 bits are dedicated to the instruction code and the remaining 4 bits are an immediate value that is loaded into the operand register.

Hex code   Abbreviation   Description
#07        ldl            Load local
#0D        stl            Store local
#01        ldlp           Load local pointer
#03        ldnl           Load non-local
#0E        stnl           Store non-local
#05        ldnlp          Load non-local pointer
#0C        eqc            Equals constant
#04        ldc            Load constant
#08        adc            Add constant
#00        j              Jump
#0A        cj             Conditional jump
#09        call           Call
#0B        ajw            Adjust workspace
#02        pfix           Prefix
#06        nfix           Negative prefix
#0F        opr            Operate

Table A.1: Primary instructions [28].

A.0.2 Secondary Instructions

The OpenTransputer implements approximately 60% of all secondary instructions. Tables A.2 and A.3 show the implemented and missing instructions respectively.
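The operand-building mechanism described above can be modelled in a few lines. The sketch below is a behavioural illustration of the encoding in Table A.1, not a description of our hardware: every byte contributes a 4-bit function code and a 4-bit data nibble; pfix and nfix accumulate nibbles into the operand register, and any other function executes with the accumulated operand (for opr, that operand is the secondary instruction code).

```python
# Behavioural illustration of Transputer instruction decoding:
# each byte = 4-bit function code (high nibble) + 4-bit data (low nibble).
PFIX, NFIX, OPR = 0x2, 0x6, 0xF
MASK32 = 0xFFFFFFFF  # model a 32-bit operand register

def execute(code: bytes):
    """Yield (function, operand) pairs as the byte stream is decoded."""
    oreg = 0
    for byte in code:
        func, data = byte >> 4, byte & 0xF
        oreg |= data                       # data nibble joins the operand
        if func == PFIX:
            oreg = (oreg << 4) & MASK32    # keep building a larger operand
        elif func == NFIX:
            oreg = (~oreg << 4) & MASK32   # build a negative operand
        else:
            # A direct function executes with operand oreg; when
            # func == OPR (0xF), oreg is a secondary instruction code.
            yield func, oreg
            oreg = 0

# ldc #35 needs a prefix: pfix 3 then ldc 5 gives function 0x4 (ldc)
# with operand 0x35.
decoded = list(execute(bytes([0x23, 0x45])))
```

For example, mint (secondary code #42) is encoded as pfix 4 followed by opr 2, i.e. the bytes #24 #F2.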

Hex code   Abbreviation   Description
#00        rev            Reverse
#20        ret            Return
#1B        ldpi           Load pointer to instruction
#3C        gajw           General adjust workspace
#06        gcall          General call
#42        mint           Minimum integer
#21        lend           Loop end
#13        csub0          Check subscript from 0
#4D        ccnt1          Check count from 1
#29        testerr        Test error false and clear
#10        seterr         Set error
#55        stoperr        Stop on error
#57        clrhalterr     Clear halt-on-error
#58        sethalterr     Set halt-on-error
#59        testhalterr    Test halt-on-error
#02        bsub           Byte subscript
#0A        wsub           Word subscript
#34        bcnt           Byte count
#3F        wcnt           Word count
#4A        move           Move message
#46        and            And
#4B        or             Or
#33        xor            Exclusive or
#32        not            Bitwise not
#41        shl            Shift left
#40        shr            Shift right
#05        add            Add
#0C        sub            Subtract
#09        gt             Greater than
#04        diff           Difference
#52        sum            Sum
#0D        startp         Start process
#03        endp           End process
#39        runp           Run process
#15        stopp          Stop process
#1E        ldpri          Load current priority
#07        in             Input message
#0B        out            Output message
#0F        outword        Output word
#0E        outbyte        Output byte
#43        alt            Alt start
#44        altwt          Alt wait
#45        altend         Alt end
#48        enbc           Enable channel
#2F        disc           Disable channel
#22        ldtimer        Load timer
#2B        tin            Timer input
#4E        talt           Timer alt start
#51        taltwt         Timer alt wait
#47        enbt           Enable timer
#2E        dist           Disable timer

Table A.2: Secondary instructions implemented in the OpenTransputer [28].

Hex code   Abbreviation   Description
#01        lb             Load byte
#3B        sb             Store byte
#53        mul            Multiply
#2C        div            Divide
#1F        rem            Remainder
#08        prod           Product
#12        resetch        Reset channel
#49        enbs           Enable skip
#30        diss           Disable skip
#3A        xword          Extend to word
#56        cword          Check word
#1D        xdble          Extend to double
#4C        csngl          Check single
#16        ladd           Long add
#38        lsub           Long subtract
#37        lsum           Long sum
#4F        ldiff          Long difference
#31        lmul           Long multiply
#1A        ldiv           Long division
#36        lshl           Long shift left
#35        lshr           Long shift right
#19        norm           Normalise
#3E        saveh          Save high priority queue registers
#3D        savel          Save low priority queue registers
#18        sthf           Store high priority front pointer
#50        sthb           Store high priority back pointer
#1C        stlf           Store low priority front pointer
#17        stlb           Store low priority back pointer
#54        sttimer        Store timer
#63        unpacksn       Unpack single length fp number
#6D        roundsn        Round single length fp number
#6C        postnormsn     Post-normalise correction of single length fp number
#71        ldinf          Load single length infinity
#73        cferr          Check single length fp infinity or NaN
#72        fmul           Fractional multiply
#28        teststd        Store to Dreg for testing
#27        testste        Store to Ereg for testing
#26        teststs        Store to StatusReg for testing
#25        testldd        Load to Dreg for testing
#24        testlde        Load to Ereg for testing
#23        testlds        Load to StatusReg for testing
#19B                      Single step TimeOut for testing
#2D        testhardchan   Test hard channel stack

Table A.3: Secondary instructions not implemented in the OpenTransputer [28].


CPU Organization and Assembly Language COS 140 Foundations of Computer Science School of Computing and Information Science University of Maine October 2, 2015 Outline 1 2 3 4 5 6 7 8 Homework and announcements Reading: Chapter 12 Homework:

More information

Interpreters and virtual machines. Interpreters. Interpreters. Why interpreters? Tree-based interpreters. Text-based interpreters

Interpreters and virtual machines. Interpreters. Interpreters. Why interpreters? Tree-based interpreters. Text-based interpreters Interpreters and virtual machines Michel Schinz 2007 03 23 Interpreters Interpreters Why interpreters? An interpreter is a program that executes another program, represented as some kind of data-structure.

More information

A+ Guide to Managing and Maintaining Your PC, 7e. Chapter 1 Introducing Hardware

A+ Guide to Managing and Maintaining Your PC, 7e. Chapter 1 Introducing Hardware A+ Guide to Managing and Maintaining Your PC, 7e Chapter 1 Introducing Hardware Objectives Learn that a computer requires both hardware and software to work Learn about the many different hardware components

More information

RISC AND CISC. Computer Architecture. Farhat Masood BE Electrical (NUST) COLLEGE OF ELECTRICAL AND MECHANICAL ENGINEERING

RISC AND CISC. Computer Architecture. Farhat Masood BE Electrical (NUST) COLLEGE OF ELECTRICAL AND MECHANICAL ENGINEERING COLLEGE OF ELECTRICAL AND MECHANICAL ENGINEERING NATIONAL UNIVERSITY OF SCIENCES AND TECHNOLOGY (NUST) RISC AND CISC Computer Architecture By Farhat Masood BE Electrical (NUST) II TABLE OF CONTENTS GENERAL...

More information

More on Pipelining and Pipelines in Real Machines CS 333 Fall 2006 Main Ideas Data Hazards RAW WAR WAW More pipeline stall reduction techniques Branch prediction» static» dynamic bimodal branch prediction

More information

Chapter 6. Inside the System Unit. What You Will Learn... Computers Are Your Future. What You Will Learn... Describing Hardware Performance

Chapter 6. Inside the System Unit. What You Will Learn... Computers Are Your Future. What You Will Learn... Describing Hardware Performance What You Will Learn... Computers Are Your Future Chapter 6 Understand how computers represent data Understand the measurements used to describe data transfer rates and data storage capacity List the components

More information

Computer Systems Design and Architecture by V. Heuring and H. Jordan

Computer Systems Design and Architecture by V. Heuring and H. Jordan 1-1 Chapter 1 - The General Purpose Machine Computer Systems Design and Architecture Vincent P. Heuring and Harry F. Jordan Department of Electrical and Computer Engineering University of Colorado - Boulder

More information

Pentium vs. Power PC Computer Architecture and PCI Bus Interface

Pentium vs. Power PC Computer Architecture and PCI Bus Interface Pentium vs. Power PC Computer Architecture and PCI Bus Interface CSE 3322 1 Pentium vs. Power PC Computer Architecture and PCI Bus Interface Nowadays, there are two major types of microprocessors in the

More information

Introduction to RISC Processor. ni logic Pvt. Ltd., Pune

Introduction to RISC Processor. ni logic Pvt. Ltd., Pune Introduction to RISC Processor ni logic Pvt. Ltd., Pune AGENDA What is RISC & its History What is meant by RISC Architecture of MIPS-R4000 Processor Difference Between RISC and CISC Pros and Cons of RISC

More information

Reconfigurable Architecture Requirements for Co-Designed Virtual Machines

Reconfigurable Architecture Requirements for Co-Designed Virtual Machines Reconfigurable Architecture Requirements for Co-Designed Virtual Machines Kenneth B. Kent University of New Brunswick Faculty of Computer Science Fredericton, New Brunswick, Canada [email protected] Micaela Serra

More information

Programming Logic controllers

Programming Logic controllers Programming Logic controllers Programmable Logic Controller (PLC) is a microprocessor based system that uses programmable memory to store instructions and implement functions such as logic, sequencing,

More information

Getting off the ground when creating an RVM test-bench

Getting off the ground when creating an RVM test-bench Getting off the ground when creating an RVM test-bench Rich Musacchio, Ning Guo Paradigm Works [email protected],[email protected] ABSTRACT RVM compliant environments provide

More information

2) Write in detail the issues in the design of code generator.

2) Write in detail the issues in the design of code generator. COMPUTER SCIENCE AND ENGINEERING VI SEM CSE Principles of Compiler Design Unit-IV Question and answers UNIT IV CODE GENERATION 9 Issues in the design of code generator The target machine Runtime Storage

More information

Design and Verification of Nine port Network Router

Design and Verification of Nine port Network Router Design and Verification of Nine port Network Router G. Sri Lakshmi 1, A Ganga Mani 2 1 Assistant Professor, Department of Electronics and Communication Engineering, Pragathi Engineering College, Andhra

More information

Computer Organization & Architecture Lecture #19

Computer Organization & Architecture Lecture #19 Computer Organization & Architecture Lecture #19 Input/Output The computer system s I/O architecture is its interface to the outside world. This architecture is designed to provide a systematic means of

More information

Computer Organization

Computer Organization Computer Organization and Architecture Designing for Performance Ninth Edition William Stallings International Edition contributions by R. Mohan National Institute of Technology, Tiruchirappalli PEARSON

More information

The Central Processing Unit:

The Central Processing Unit: The Central Processing Unit: What Goes on Inside the Computer Chapter 4 Objectives Identify the components of the central processing unit and how they work together and interact with memory Describe how

More information

Chapter 2: OS Overview

Chapter 2: OS Overview Chapter 2: OS Overview CmSc 335 Operating Systems 1. Operating system objectives and functions Operating systems control and support the usage of computer systems. a. usage users of a computer system:

More information

Embedded Systems. Review of ANSI C Topics. A Review of ANSI C and Considerations for Embedded C Programming. Basic features of C

Embedded Systems. Review of ANSI C Topics. A Review of ANSI C and Considerations for Embedded C Programming. Basic features of C Embedded Systems A Review of ANSI C and Considerations for Embedded C Programming Dr. Jeff Jackson Lecture 2-1 Review of ANSI C Topics Basic features of C C fundamentals Basic data types Expressions Selection

More information

8051 MICROCONTROLLER COURSE

8051 MICROCONTROLLER COURSE 8051 MICROCONTROLLER COURSE Objective: 1. Familiarization with different types of Microcontroller 2. To know 8051 microcontroller in detail 3. Programming and Interfacing 8051 microcontroller Prerequisites:

More information

Computer Organization and Architecture

Computer Organization and Architecture Computer Organization and Architecture Chapter 11 Instruction Sets: Addressing Modes and Formats Instruction Set Design One goal of instruction set design is to minimize instruction length Another goal

More information

1 The Java Virtual Machine

1 The Java Virtual Machine 1 The Java Virtual Machine About the Spec Format This document describes the Java virtual machine and the instruction set. In this introduction, each component of the machine is briefly described. This

More information

Processor Architectures

Processor Architectures ECPE 170 Jeff Shafer University of the Pacific Processor Architectures 2 Schedule Exam 3 Tuesday, December 6 th Caches Virtual Memory Input / Output OperaKng Systems Compilers & Assemblers Processor Architecture

More information

CHAPTER 6: Computer System Organisation 1. The Computer System's Primary Functions

CHAPTER 6: Computer System Organisation 1. The Computer System's Primary Functions CHAPTER 6: Computer System Organisation 1. The Computer System's Primary Functions All computers, from the first room-sized mainframes, to today's powerful desktop, laptop and even hand-held PCs, perform

More information

Instruction Set Architecture. Datapath & Control. Instruction. LC-3 Overview: Memory and Registers. CIT 595 Spring 2010

Instruction Set Architecture. Datapath & Control. Instruction. LC-3 Overview: Memory and Registers. CIT 595 Spring 2010 Instruction Set Architecture Micro-architecture Datapath & Control CIT 595 Spring 2010 ISA =Programmer-visible components & operations Memory organization Address space -- how may locations can be addressed?

More information

CS101 Lecture 26: Low Level Programming. John Magee 30 July 2013 Some material copyright Jones and Bartlett. Overview/Questions

CS101 Lecture 26: Low Level Programming. John Magee 30 July 2013 Some material copyright Jones and Bartlett. Overview/Questions CS101 Lecture 26: Low Level Programming John Magee 30 July 2013 Some material copyright Jones and Bartlett 1 Overview/Questions What did we do last time? How can we control the computer s circuits? How

More information

COMPUTER HARDWARE. Input- Output and Communication Memory Systems

COMPUTER HARDWARE. Input- Output and Communication Memory Systems COMPUTER HARDWARE Input- Output and Communication Memory Systems Computer I/O I/O devices commonly found in Computer systems Keyboards Displays Printers Magnetic Drives Compact disk read only memory (CD-ROM)

More information

PART B QUESTIONS AND ANSWERS UNIT I

PART B QUESTIONS AND ANSWERS UNIT I PART B QUESTIONS AND ANSWERS UNIT I 1. Explain the architecture of 8085 microprocessor? Logic pin out of 8085 microprocessor Address bus: unidirectional bus, used as high order bus Data bus: bi-directional

More information

Chapter 7D The Java Virtual Machine

Chapter 7D The Java Virtual Machine This sub chapter discusses another architecture, that of the JVM (Java Virtual Machine). In general, a VM (Virtual Machine) is a hypothetical machine (implemented in either hardware or software) that directly

More information

Systems I: Computer Organization and Architecture

Systems I: Computer Organization and Architecture Systems I: Computer Organization and Architecture Lecture : Microprogrammed Control Microprogramming The control unit is responsible for initiating the sequence of microoperations that comprise instructions.

More information

Lecture 12: More on Registers, Multiplexers, Decoders, Comparators and Wot- Nots

Lecture 12: More on Registers, Multiplexers, Decoders, Comparators and Wot- Nots Lecture 12: More on Registers, Multiplexers, Decoders, Comparators and Wot- Nots Registers As you probably know (if you don t then you should consider changing your course), data processing is usually

More information

An Introduction to Computer Science and Computer Organization Comp 150 Fall 2008

An Introduction to Computer Science and Computer Organization Comp 150 Fall 2008 An Introduction to Computer Science and Computer Organization Comp 150 Fall 2008 Computer Science the study of algorithms, including Their formal and mathematical properties Their hardware realizations

More information

Chapter 4 Register Transfer and Microoperations. Section 4.1 Register Transfer Language

Chapter 4 Register Transfer and Microoperations. Section 4.1 Register Transfer Language Chapter 4 Register Transfer and Microoperations Section 4.1 Register Transfer Language Digital systems are composed of modules that are constructed from digital components, such as registers, decoders,

More information

SFWR 4C03: Computer Networks & Computer Security Jan 3-7, 2005. Lecturer: Kartik Krishnan Lecture 1-3

SFWR 4C03: Computer Networks & Computer Security Jan 3-7, 2005. Lecturer: Kartik Krishnan Lecture 1-3 SFWR 4C03: Computer Networks & Computer Security Jan 3-7, 2005 Lecturer: Kartik Krishnan Lecture 1-3 Communications and Computer Networks The fundamental purpose of a communication network is the exchange

More information

Switch Fabric Implementation Using Shared Memory

Switch Fabric Implementation Using Shared Memory Order this document by /D Switch Fabric Implementation Using Shared Memory Prepared by: Lakshmi Mandyam and B. Kinney INTRODUCTION Whether it be for the World Wide Web or for an intra office network, today

More information