OpenTransputer: reinventing a parallel machine from the past


Dissertation Type: Enterprise

DEPARTMENT OF COMPUTER SCIENCE

OpenTransputer: reinventing a parallel machine from the past

David Keller, Andres Amaya Garcia

A dissertation submitted to the University of Bristol in accordance with the requirements of the degree of Master of Engineering in the Faculty of Engineering.

Thursday 25th June, 2015


Declaration

This dissertation is submitted to the University of Bristol in accordance with the requirements of the degree of MEng in the Faculty of Engineering. It has not been submitted for any other degree or diploma of any examining body. Except where specifically acknowledged, it is all the work of the authors.

David Keller, Andres Amaya Garcia
Thursday 25th June, 2015


Contents

1 Contextual Background
    Project Context
    Why Re-implement the Transputer?
    Overview of Computer Architecture
    Project Aims and Objectives
2 Technical Background
    Occam
    Transputer Architecture
    Transputer Microarchitecture
    Transputer Versions and Enhancements
3 Project Execution
    The OpenTransputer CPU
    External Communication
    I/O Interface
    The OpenTransputer System
    Digital Design with Hardware Description Languages (HDL)
4 Critical Evaluation
    Evaluating Design Decisions
    Design Verification
    Synthesis results
5 Conclusion
    Current Project Status
    Future Work
    Final Conclusions
A Transputer Instruction Set


List of Figures

1.1 A six level computer as described in [31]. Below each level the way the abstraction is implemented is indicated (including the program responsible for it in parentheses).
1.2 Instruction sizes for the four different machines; the zero-address machine has the highest code density as each instruction only takes up 8 bits. It should be noted that destination and source operands are addresses in memory.
2.1 Flow of events in sample Occam parallel program.
2.2 Transputer evaluation stack [18].
2.3 Transputer instruction format [18].
2.4 The Transputer's hardware-implemented process scheduler [18].
Conditional execution of microcodes in the Simple 42 Transputer [2].
Next microinstruction state address generation.
Stages of the microassembling process.
Example datapath implementation using three buses.
Example implementation using a wide datapath approach.
OpenTransputer's simplified datapath schematic.
Integration of the three main components of the OpenTransputer CPU.
Signal timing for correct CPU behaviour.
Example 8x8 Beneš network.
Folded-over Beneš network with capacity for 8 OpenTransputers.
Bit fields of the channel addresses for internal and external communication and I/O pins.
Example route of a package through a Beneš network of OpenTransputers.
Internal connections of 4 input and 4 output switches.
Data exchange protocol between two components of a network of OpenTransputers.
Packet formats for external communication in the Inmos Transputer and the OpenTransputer.
Interaction between input and output controllers to transfer a message over the network [16].
Major components of the OpenTransputer system.


List of Tables

1.1 High-level overview of three different architectures.
1.2 A comparison of how different architectures encode a simple add instruction.
1.3 A comparison of the assembly code of different architectures for the same high-level statement. It should be noted that all operands are addresses in memory. [9]
2.1 Occam primitive processes [14].
2.2 Occam constructs [14].
Execution time (in clock cycles) of primary instructions in both the Inmos Transputer [15] and the OpenTransputer.
Execution time (in clock cycles) of secondary instructions in both the Inmos Transputer [15] and the OpenTransputer; w is the number of 32-bit words to be moved.
FPGA resources used by the design. The first row shows how many resources are consumed in total. The bottom three rows show how many of these resources the two cores and the switch use individually.
FPGA resources used by a single core. The bottom three rows show the number of LUTs consumed by the individual components of the core.
FPGA resources used by the major components within the datapath.
Comparison of chip area and manufacturing process of the OpenTransputer and the original Transputer after synthesis.
A.1 Primary instructions [28].
A.2 Implemented secondary instructions [28] in the OpenTransputer.
A.3 Unimplemented secondary instructions [28] in the OpenTransputer.


List of Listings

2.1 Occam purely sequential program.
2.2 Occam parallel program.
Simple 42 microcode for bitwise AND operation.
OpenTransputer microinstruction for state GT (greater than instruction).
OpenTransputer microinstruction for state CCNT.
Assembly program for a register machine.
Assembly program using the Transputer Instruction Set Architecture (ISA).


Executive Summary

We have developed the OpenTransputer, a new microprocessor based on the Transputer architecture designed by Inmos in the 1980s. We believe that the features of the Transputer architecture, such as inbuilt communication and concurrency management, make it an interesting proposition for the emerging Internet of Things (IoT) market, where a small, general-purpose processor could serve as a building block for a broad range of networked applications. The OpenTransputer modernises many aspects of the original Transputer, featuring a new microarchitecture, a new external communication mechanism and an I/O interface. The following is a list of our achievements during the project:

- Developed a functional simulator in C of the Transputer architecture, which was used to evaluate the trade-offs of different design approaches before they were implemented in Verilog. During the late stages of the project, the functional simulator was used as a reference model to verify the behavioural correctness of our design.

- Developed a new implementation of the Transputer architecture in Verilog: the OpenTransputer. The new design introduces significant changes to the microarchitecture that take advantage of state-of-the-art manufacturing technologies.

- Replaced the routing mechanism for inter-process communication of the Inmos Transputer with a switch-based network that distributes data packets among the OpenTransputer nodes. The switches form a rearrangeably non-blocking Beneš network, which greatly improves the performance and usability of the processor as a building block to assemble any kind of system.

- Replaced the existing input and output link controllers for external communication of the 1980s Transputer with a more scalable implementation that uses the idea of virtual channels.

- Designed and implemented an I/O interface that builds on the existing channel communication functionality of the processor and can be used to connect external hardware devices.


Supporting Technologies

The following is a list of third-party software and hardware tools, libraries and components used during development of the OpenTransputer project.

- We used the D7202 Inmos Occam 2 Toolset compiler sources developed by Inmos. The compiler was used to generate assembly code from Occam programs. The output would then be translated into machine code using an assembler script written by us. The compiler's C source code is publicly accessible, yet the program is difficult to compile using modern versions of the GNU C Compiler (GCC). We introduced minor changes to the C source and corrected syntax mistakes in existing makefiles before we were able to generate an ELF executable that could be run on modern x86 machines.

- We used PyYAML to implement an assembler that parses configuration information associated with an Occam program that is meant to execute on a network of OpenTransputers. PyYAML is a Python library for parsing and emitting YAML, a human-readable serialisation language.

- We used Red Hat's hosting platform OpenShift to make the OpenTransputer website accessible. The OpenTransputer website was implemented using the WordPress software. Furthermore, we used the HTML5/CSS3-based theme onetone to produce the graphical interface of the website.

- We used the ZedBoard XC7Z020-CLG484 FPGA supplied by the Computer Science Department and programmed it with the Verilog description of the OpenTransputer to develop a demo application.

- The Xilinx Vivado Design Suite was used to simulate the Verilog description of the OpenTransputer and to program the FPGA mentioned above.

- We used the Distributed Memory Generator v8.0 from the Xilinx Vivado Intellectual Property (IP) catalogue to create a single RAM and three ROM structures used within the OpenTransputer design.

- To estimate the area of a hypothetical OpenTransputer chip, we used Synopsys Design Vision version G SP5 to synthesise the design for a silicon target. Also, we used the UMC synthesis library for a 180nm process.


Notation and Acronyms

The following is a list of acronyms used throughout this document.

ROM: Read-Only Memory
RAM: Random Access Memory
ISA: Instruction Set Architecture
HDL: Hardware Description Language
DMA: Direct Memory Access
FPGA: Field-Programmable Gate Array
CPU: Central Processing Unit
PL: Programmable Logic
Iptr: Instruction Pointer
Wptr: Workspace Pointer
I/O: Input/Output
IP: Intellectual Property
MUX: Multiplexer
DEMUX: Demultiplexer
RTL: Register-Transfer Level
RTE: Route Towards Edge
RTC: Route Towards Core
ALU: Arithmetic and Logic Unit
AU: Arithmetic Unit
LU: Logic Unit
LUT: Look-Up Table
CSP: Communicating Sequential Processes
CISC: Complex Instruction Set Computing
RISC: Reduced Instruction Set Computing
MISC: Minimal Instruction Set Computing
FSM: Finite State Machine
RTOS: Real-Time Operating System
IoT: Internet of Things
ILP: Instruction-Level Parallelism
LIFO: Last-In-First-Out


Acknowledgements

First and foremost, we would like to express our gratitude to our supervisor, Prof. David May, for his valuable guidance and support throughout the project. Without his patient explanations about the Inmos Transputer and his advice, our dissertation would not have been completed. Furthermore, we would like to thank Roger Shepherd for his early input and advice on how to develop the OpenTransputer. Thanks are also owed to Dr. Simon Hollis for his advice on digital design using the Vivado Design Suite and to Fred Barnes for his input on the compiler and configuration system of the Transputer. We would also like to thank Richard Grafton and the University of Bristol Computer Science Department, who supplied us with the FPGA used throughout the project. Finally, we thank Iman Malik for her help with the design of the OpenTransputer logo.


Chapter 1

Contextual Background

1.1 Project Context

The OpenTransputer is a re-implementation of the Transputer, a pioneering microprocessor architecture first released in the 1980s [3]. The original Transputer was considered revolutionary at its time for its integrated memory and serial communication links intended for parallel computing. Including memory and external links on the same chip made the Transputer essentially a computer on a chip. This was supposed to allow information systems to be designed at a higher level, with the Transputer functioning as a building block for parallel computing networks. Over the last few years, with the shift to cloud computing, there has been a trend in the technology world towards building large clusters of powerful computers that serve data to an ever-growing number of client devices, which themselves only feature tiny and low-powered processors. These currently include mobile phones and tablets, but will soon also comprise every other device that connects to the internet, ranging from washing machines to cars [11]. We think that the Transputer and its unique feature set make it an excellent processor for this emerging Internet of Things (IoT) market, specifically the connected homes and wearables markets. As such, the OpenTransputer project aims to modernise the Transputer microarchitecture and introduce a more scalable network for communication based on switches and an easy-to-use I/O interface. In the following, we will give a short overview of the Transputer and the recent developments in the field of computer architecture. This will help explain and give background on the design decisions made in the original Transputer, which are detailed in Chapter 2 of this document, and also motivate the changes we have introduced in the OpenTransputer, which are explained in Chapter 3.

1.2 Why Re-implement the Transputer?
The Transputer, while revolutionary at its time for being a computer on a single chip comprising processor, memory and communication links, did not receive the attention it deserved. However, it found its way into spacecraft [1], satellites [32], set-top boxes and supercomputers [3, 12]. The Transputer has in-built support for concurrency through message passing, which is used for both off- and on-chip communication between processes, and a single Transputer can maintain and schedule multiple processes. This makes it both a multitasking and multiprocessing platform and provides a useful abstraction, since to the programmer communication operations are used in exactly the same way regardless of where the sender and receiver processes are running. In other architectures, approaches used

for inter-process and inter-processor communication often vary and may require programmers to know about the intricacies of the operating system in the former case or the processor in the latter. With its own simple Real-Time Operating System (RTOS) implemented directly in hardware, the Transputer can not only perform context switches in a fraction of the time traditional platforms take, but it also simplifies the way programmers interact with the processor. That is, there is no need to know the low-level implementation details of the processor, which is often the case with other platforms. Instead, a programmer can simply write code in the high-level language Occam, which exposes constructs to explicitly describe concurrent processes and their communication patterns [18]. Its ability to serve as either a microcontroller or a small building block for a parallel network makes it an attractive starting point to build upon for the IoT market. In this technological landscape, we foresee small chips being the dominant contenders due to their size, low cost and small energy footprint. On the other hand, this market is also defined by a need for communication between devices. The Transputer fits both descriptions and provides further useful features, as detailed above.

1.3 Overview of Computer Architecture

The study of computer architecture is the study of the organisation and interconnection of components of computer systems [29]. A computer can be seen as a hierarchy of abstractions as shown in Figure 1.1, each level performing a certain function. These levels are the digital logic level, the microarchitecture level, the Instruction Set Architecture (ISA) level, the operating system machine level and the assembly language level. Each level builds on the one below it. "Processor" and "computer" are somewhat synonymous in this context; essentially they refer to a device that performs some sort of computation.
Nowadays, "processor" refers to the Central Processing Unit (CPU), which is usually part of a microchip and hence also known as a microprocessor. The CPU is a collection of logic tasked with executing programs that generate results for the user. In this section we want to briefly introduce how a CPU works, mostly ignoring the details of the implementation in digital logic. Instead we aim to focus on the concepts and mechanisms that dictate how a CPU will execute programs. We will give an overview of microarchitecture, the design of the physical hardware underlying the processor, and instruction set design, which refers to the interface to the hardware offered to the programmer. We will focus on basic processor design and omit issues in advanced computer architecture, such as Instruction-Level Parallelism (ILP), pipelining, superscalar and vector processing, as these are not necessary to understand the rest of this document. The inclined reader is advised to take a look at the material referenced, as it contains more information on these and other issues. On the microarchitecture level, we think of computers as constructed from basic building blocks such as memories, arithmetic units and buses. The functional behaviour of these building blocks is similar across most machines. The differences between computers lie in how the modules they are made up of are connected together, in the performance of these modules and in the way the entire computer is controlled by programs.

Instruction Set Architecture (ISA)

The instruction set of a processor is a common abstraction used in the literature to describe the way a processor's internals work. It abstracts away the details of the underlying implementation, instead providing an explanation of the high-level logical processor. Specifying a processor at this level means a logical processor with a certain ISA can have several different physical implementations or microarchitectures that ultimately work the same way but have

Level 5: Problem-oriented language level
         | Translation (compiler)
Level 4: Assembly language level
         | Translation (assembler)
Level 3: Operating system machine level
         | Partial interpretation (operating system)
Level 2: Instruction set architecture level
         | Interpretation (micro-program) or direct execution
Level 1: Micro-architecture level
         | Hardware
Level 0: Digital logic level

Figure 1.1: A six level computer as described in [31]. Below each level the way the abstraction is implemented is indicated (including the program responsible for it in parentheses).

Instruction set   Type                Design   Bits       Registers
ARM               Register-register   RISC     32/64      16/32
Transputer        Stack machine       MISC     16/32      -
x86               Register-memory     CISC     16/32/64   6/8/16

Table 1.1: High-level overview of three different architectures.

different performance characteristics and prices. The ISA level is a very useful abstraction for programmers, providing a common platform on which to execute programs. It means a program can be written in various high-level programming languages, which are then translated (compiled) into the ISA-level machine language the processor can actually understand. This machine language is commonly referred to as assembly language or assembly code. Most ISAs used in processors today can be classified as following either a Reduced Instruction Set Computing (RISC) or Complex Instruction Set Computing (CISC) design philosophy [9]. The former refers to instruction sets comprised of very simple instructions, each taking very few clock cycles to execute. In contrast, CISC designs typically include more complex instructions that take longer to execute, but perform more operations per instruction. Table 1.1 shows three different architectures: the Intel IA-32 architecture, commonly referred to as i386, is an example of CISC, while the ARM is an example of RISC. The Transputer is sometimes used as an example of a Minimal Instruction Set Computer (MISC).
MISC computers commonly have a very small number of basic operations and are typically stack-based. In reality, the Transputer falls somewhere in the middle of all three philosophies: a somewhat small and focused feature set would place it in the RISC camp, but its microcode design and

some of its more complex instructions, which schedule processes or move several memory words at once, resemble a CISC design.

Processor Design Issues

Having defined the concepts and established the level at which we want to lead the discussion, the issue we want to elaborate on now is the different options a processor designer has when implementing a processor at this level.

Number of Addresses

Architecture                        Instruction
Three-address                       add dest,src1,src2
Two-address                         add dest,src
One-address (accumulator machine)   add addr
Zero-address (stack machine)        add

Table 1.2: A comparison of how different architectures encode a simple add instruction.

A basic characteristic that has an implication on a processor's architecture is the number of addresses used in an instruction, as can be seen in Table 1.2. Most operations performed by an instruction are either binary or unary. A binary operation such as addition or multiplication requires two input operands, whereas a unary operation only has a single operand. Typically, an operation produces a single output result. This means a total of three addresses need to be encoded in the instruction: two addresses to specify the input operands and one additional address to specify the output result. Many processors specify all three addresses explicitly in the instruction. In some architectures, such as the Intel IA-32, a two-address format is used where one of the input addresses serves as both source and destination address. It is possible to construct architectures in such a way that instructions have only one or zero addresses explicitly encoded. The former are called accumulator machines, while the latter are referred to as stack machines. The Transputer and the OpenTransputer are stack machines.
A comparison of the assembly code, and, by extension, the machine code generated from a simple statement written in a high-level language such as C will help to discuss the benefits and disadvantages of each type of architecture. It should be noted that in this example, all four hypothetical machines expect operands to be addresses in memory. An operand is then just an offset to some base register or stack pointer. Consider the simple C statement:

A = B + C * D - E + F + A

Depending on the type of architecture, it will be translated into one of the pieces of assembly code in Table 1.3. RISC processors are usually three-address machines, meaning they define all three addresses explicitly in the instruction. As can be seen from the table, such an architecture needs the fewest instructions to execute the high-level statement. In the first column, the one depicting the assembly code for the three-address machine, we can identify T as both a source and result operand. This simple observation serves as a basic justification for two-address architectures: if most instructions use one operand as both source and destination address, then only encoding two addresses in an instruction does not lose much flexibility and only makes the generated assembly code a little more complex, while allowing for shorter instructions. As can be seen in the second column, a two-address machine uses slightly more instructions to execute the statement, which is due to the fact that the value of C is loaded into T in the first instruction.

Three-address   Two-address   One-address   Zero-address
mult T,C,D      load T,C      load C        push E
add T,T,B       mult T,D      mult D        push C
sub T,T,E       add T,B       add B         push D
add T,T,F       sub T,E       sub E         mult
add A,T,A       add T,F       add F         push B
                add A,T       add A         add
                              store A       sub
                                            push F
                                            add
                                            push A
                                            add
                                            pop A

Table 1.3: A comparison of the assembly code of different architectures for the same high-level statement. It should be noted that all operands are addresses in memory. [9]

Examining the second column of Table 1.3, we realise that all instructions use the same operand, T. Making this the default forms the basis for a one-address machine, one where the destination register, or accumulator, is implicit in the instruction. This kind of architecture is often used in environments where memory is constrained. Finally, in a zero-address machine all operands are implicit to the instruction and hence assumed to be at default locations. A stack is used to obtain the source input operands, and the result is written back onto the stack. A stack is a Last-In-First-Out (LIFO) data structure, supported and used by most processors, even non-zero-address ones. All operations in a stack machine assume that the top two values of the stack are the input operands. Results are placed (pushed) on the top of the stack. Notice that the pseudo-assembly instructions push and pop are an exception, as they take an address as operand.

Comparing Different Architectures

Each of the four address schemes in Table 1.3 has advantages and disadvantages. Just by a quick glance at the table it can be noted that the number of instructions needed increases as we go from the left-hand side to the right-hand side of the table, i.e. as the number of addresses encoded in an instruction decreases. A possible performance metric is the number of memory accesses. The lower this number, the faster we assume the processor is.
Hypothetically, if the three-address machine does not have any registers, then every instruction takes four memory accesses, and because there are five instructions it would take 20 memory accesses in total to execute the statement. The two-address machine also takes four accesses per instruction; it should be recalled that one address doubles as both source and destination operand. Including the load instruction, which requires three accesses to memory, the two-address machine needs 23 memory accesses in total. If we do introduce a single register to store T into both designs, to make the example a bit more realistic, we can reduce the number of memory accesses to 12 for the three-address machine and 13 for the two-address machine respectively. The accumulator and stack machines use registers to read and write values and need 14 and 19 accesses respectively. This, however, does not mean that a three-address or two-address machine with registers is faster than a stack machine. Code density, the number of instructions that can be stored per memory unit (byte, 32-bit word, etc.), is another factor that needs to be considered. A stack machine does not need to specify any addresses and therefore requires fewer bits to encode each instruction, as can be seen in Figure 1.2. This means we can store more instructions in memory, which makes for shorter programs and helps improve

memory bandwidth usage.

Format      Fields                                    Size
3-address   opcode (8 bits) + three 5-bit addresses   23 bits
2-address   opcode (8 bits) + two 5-bit addresses     18 bits
1-address   opcode (8 bits) + one 5-bit address       13 bits
0-address   opcode (8 bits)                            8 bits

Figure 1.2: Instruction sizes for the four different machines; the zero-address machine has the highest code density as each instruction only takes up 8 bits. It should be noted that destination and source operands are addresses in memory.

Control Unit

The control unit (also called the control path or decoder) is responsible for decoding instructions and telling the system what to do. There are two commonly used approaches to designing the physical logic within the control unit: hard-coded (or hardwired) and microcoded designs. RISC machines typically use a hard-coded control unit, while CISC processors use a more modular microcode design. A hard-coded design is a direct implementation of a state machine based on wires, flip-flops and logic gates that operates the datapath components. It is simple to implement, although the resulting state machines can be quite complicated. They are, however, efficient, as the state machine ideally does not contain any superfluous elements but is tailored precisely to what is required. Microcode is tightly linked to CISC. In the 1970s memory used to be very expensive, so computer architects tried to minimise the amount of storage a program took up. This meant that each individual instruction a program is made up of had to do more work. On the other hand, designing hardware to carry out complex instructions was also a very expensive task. Microprograms offered the solution: a small run-time interpreter takes a complex instruction and executes simple (micro) instructions. Computer architects use microcode routines to implement more complex instructions.

1.4 Project Aims and Objectives

The aim of the OpenTransputer project is to produce a re-implementation of the Transputer, keeping the original ISA whilst updating the microarchitecture. All changes have been proposed and implemented with the IoT, wearables and connected homes markets in mind. Specifically, the following are the main areas of improvement over the original Transputer that the project is focussed on:

1. Develop an implementation of the Transputer instruction set taking advantage of state-of-the-art technology.

2. Replace the communication mechanism used in the original 1980s Transputer with a network of switches to improve the usability of the processor as a building block for large parallel systems.

3. Introduce an easy-to-use I/O interface that can be used to connect hardware peripherals, such as sensors, commonly used in IoT applications.


Chapter 2

Technical Background

2.1 Occam

Occam is a programming language designed at Inmos to facilitate the development of concurrent, distributed systems [14]. The language was developed hand-in-hand with the Transputer to extract maximum performance from the architecture. In fact, the Transputer system description and documentation are written in Occam. Despite its purpose, Occam is a high-level language and not assembly. However, the fact that its model of concurrency was derived from Hoare's work on Communicating Sequential Processes (CSP) [13] means that Occam greatly differs from conventional programming languages such as C and Java [20]. Due to its simplicity, we use Occam to introduce the Transputer from a high-level point of view. Occam enables systems to be described as collections of concurrent processes that communicate with each other. Contrary to conventional languages, Occam programs are based on processes rather than procedures. Procedures can be thought of as sequences of instructions that are enclosed in callable constructs, enabling code reusability and modularity. On the other hand, processes can be thought of as independent entities or self-contained programs with separate memory spaces and state. These processes are specifically designed to run on identical Transputer system components containing on-chip memory and serial communication links to other components. The inter-process communication model is based on message passing [7], a scheme in which processes communicate by exchanging messages. Occam exposes point-to-point channel primitives to easily describe the connections between processes. Occam programs can be executed by a single Transputer that allocates CPU time to individual processes. Just as easily, the same concurrent programs can be executed by a network of Transputers without changing the Occam description. In this case, processes truly run in parallel and communicate by using the serial links.
Programs are built from three primitive processes, described in Table 2.1.

Syntax   Description
v := e   Assign expression e to variable v
c ! e    Output expression e to channel c
c ? v    Input from channel c to variable v

Table 2.1: Occam primitive processes [14].

These primitives are combined to form the constructs in Table 2.2. Similarly, multiple constructs can be combined to write complex processes that communicate using the point-to-point channels. It is important to mention that in Occam communication is synchronised,

Keyword   Description
SEQ       Components executed in sequence
PAR       Components executed in parallel
ALT       First ready component is executed
WHILE     Iterative instructions
IF        Conditional construct

Table 2.2: Occam constructs [14].

meaning that both sender and receiver processes wait until the transfer is complete before proceeding with execution.

Sequential Processes

The SEQ construct labels a process as sequential and causes its components to be executed one after another, just as in conventional programming languages. Listing 2.1 is an example of a process that first assigns the value 10 to variable x, then outputs it to channel buffer.out and finally inputs a value into x from buffer.in. Note that the instructions are completed exactly in this order.

    VAR x:
    SEQ
      x := 10
      buffer.out ! x
      buffer.in ? x

Listing 2.1: Occam purely sequential program.

Parallel Processes

The PAR keyword causes processes to be executed at the same time (if possible). Listing 2.2 shows two processes executing in parallel. The first process outputs x to channel c, while the other process receives it into a variable y. Also, note that the parallel construct is enclosed in a sequential construct and is followed by an assignment to z. In this case, the assignment will only be executed after both parallel processes have completed. The flow of events is as described in Figure 2.1.

    SEQ
      PAR
        c ! x
        c ? y
      z := y

Listing 2.2: Occam parallel program.

2.2 Transputer Architecture

Internal Process Representation

In the Transputer, each process is associated with a workspace in memory [15]. This area can be thought of as a stack where local variables, channel information and other values that describe the state of the process are stored. When the process is being executed, its context is also held in six registers:

Figure 2.1: Flow of events in sample Occam parallel program.

Figure 2.2: Transputer evaluation stack [18].

A, B and C. These registers form the evaluation stack used to compute results for expressions. Instructions push values onto the top of the stack and pop them when operations are executed, as shown in Figure 2.2. Source and destination operands do not need to be encoded within the instructions because all operations are performed on the stack.

Workspace pointer (Wptr). Holds the memory address of the top of the workspace stack. This is a special purpose register and can only be manipulated by selected instructions.

Instruction pointer (Iptr). Stores the byte address of the next instruction.

Operand. The Transputer instructions are 8 bits wide. The most significant four bits encode the instruction code (opcode) and the remaining bits are an immediate operand. When an instruction is executed, the immediate is loaded into the four least significant bits of the operand register, as shown in Figure 2.3. There are also two special prefix instructions that can be used to accumulate operands into the operand register to be used later by other instructions. It is also worth noting that the 4-bit opcode can only encode 16 different instructions, which clearly does not provide enough flexibility. For this reason, one of these opcodes causes the operand register to be treated as an instruction.

The Transputer can only execute a single process at a time, yet it must be able to keep track of many others and eventually allocate them CPU time. To do so, processes are queued in a linked list

with the processor maintaining pointers to the front and back nodes of the data structure in registers. Thus, processes can either be active or inactive. The former refers to processes that are currently being executed or are scheduled to be executed. In contrast, the latter refers to processes that are waiting, i.e. currently not executing, with their context stored in memory. A process may be in this situation because it is waiting for a communication operation to complete or a timer to expire, or because its allocated CPU time has finished. When a process is ready to be executed, it is placed at the back of a linked list that acts as a waiting queue, as shown in Figure 2.4. When the process reaches the front of the list it is executed [18].

Figure 2.3: Transputer instruction format [18].

The fact that the Transputer can automatically context switch and time slice its processing resources means that it effectively implements a scheduler in hardware. In most architectures these operations are entrusted to operating systems. However, context switches often involve storing a large number of registers in memory to preserve the state of the process (its context), making the task slow and more complex when implemented in software. The Transputer's stack-based architecture means that it can perform context switches extremely efficiently. According to the documentation in [15], the Transputer is able to stop an active process and start a new one in approximately 12 clock cycles.

The tight relationship between Occam and the Transputer resulted in many of the programming language primitives being implemented as single assembly instructions. This not only includes scheduling operations such as starting and terminating processes, but also inter-process communication tasks. Therefore, processes can perform input and output operations by executing single instructions.
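As an illustration of the synchronised, unbuffered channel semantics that these single communication instructions implement, the following Python sketch models a channel rendezvous, with threads standing in for Transputer processes. All names here (Channel, output, input_, run_par) are our own illustrative choices, not part of any Transputer toolchain:

```python
import threading
import queue

class Channel:
    """Unbuffered channel: sender and receiver rendezvous before either proceeds."""
    def __init__(self):
        self._data = queue.Queue(maxsize=1)
        self._ack = queue.Queue(maxsize=1)

    def output(self, value):      # Occam: c ! value
        self._data.put(value)
        self._ack.get()           # block until the receiver has taken the value

    def input_(self):             # Occam: c ? v
        value = self._data.get()
        self._ack.put(None)       # release the sender
        return value

def run_par():
    """Mirror Listing 2.2: PAR { c ! x ; c ? y } followed by z := y."""
    c, result, x = Channel(), {}, 10
    sender = threading.Thread(target=lambda: c.output(x))
    receiver = threading.Thread(target=lambda: result.update(y=c.input_()))
    for t in (sender, receiver):
        t.start()
    for t in (sender, receiver):
        t.join()                  # the PAR completes only when both finish
    return result['y']            # z := y
```

Mirroring Listing 2.2, run_par() performs the assignment z := y only after both sides of the PAR have finished, just as the sequential construct requires.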
It could be argued that the Transputer essentially comes with a Real-Time Operating System (RTOS) implemented in hardware.

Inter-process Communication

As mentioned above, two processes communicate by using the Occam channel primitives, which are directly operated by Transputer instructions. The processes might reside within the same or different machines, yet the same instructions are used. In the former case, a channel is represented by a word in memory. When the first process becomes ready, it writes its workspace pointer (its identity) into the channel and is descheduled. Then, when the second process is ready, the message is copied by the processor to the specified location, the first process is rescheduled and the channel is returned to the empty state [28].

On the other hand, if the processes reside in different Transputers, both sender and receiver are descheduled while the transfer takes place and rescheduled when it concludes. In this case the communication is performed by autonomous link controllers. The links fetch or store data using a Direct Memory Access (DMA) mechanism and exchange messages with remote components through serial point-to-point connections. Each Transputer comes with four bidirectional serial links that make it possible to connect the processor to up to four other devices. There is no limit on the number of Transputers that can be connected in this fashion, which allows large parallel networks to be constructed from individual Transputers [20]. However, as they are linked in a point-to-point fashion, the more components in the

system, the slower the communication will potentially be, as messages might have to be relayed over an increasing number of intermediate Transputers.

Figure 2.4: The Transputer's hardware-implemented process scheduler [18].

In the Occam communication model, channels are synchronised and unbuffered; therefore, both processes must be ready before messages are transferred [14]. Furthermore, the compiler ensures that before an input or output instruction is executed, the source and destination addresses, the channel address and the message length are available.

2.3 Transputer Microarchitecture

Microcode Implementation

Even though the Transputer is conceptually simple, many of its instructions are complex to implement in hardware and require multiple clock cycles to complete. At each step of the computation, the processor needs a different set of signals to control the datapath. The designers of the Transputer generated these signals by using a microcode approach, which is commonly found in CISC processors. The idea is that each assembly instruction is executed by a sequence of simpler hardware-implemented microinstructions that can each be completed in a single clock cycle. Each microinstruction is associated with a number of control signals that the Transputer stores in high-speed Read-Only Memory (ROM). A microcode ROM is used in all Transputer models, yet its size varies depending on the complexity of the processor. For instance, one of the early designs, known as the Simple 42, contains 122 microinstruction words of 68 bits each [10]. However, the more complex T414 version of the Transputer has approximately five times that

number of microinstructions and over 100 bits per word. For simplicity, we focus on the microcodes defined for the Simple 42 Transputer. The control signals of each microcode can be divided into groups or bit fields. The most relevant ones for this discussion are listed and explained below [10].

X and Y bus source select and Z destination select. Due to the limited manufacturing capabilities available during the 1980s, processors needed to be designed to contain only two layers of interconnect. This greatly limited the number of connections between the different components of the design, since the number of wire crossings had to be minimised. Thus, the Transputer uses three buses onto which data is multiplexed to be transported to where it is required. Two of these (X and Y) carry source values for the computation and the remaining one (Z) carries the result. This mechanism greatly simplifies the connections between the different components of the processor at the expense of reducing the number of operations that can be performed by the processor simultaneously.

ALU operations. These bits of the microcode select which operation is to be performed using the operands on the X and Y buses. Examples include addition, subtraction, etc.

Next microinstruction address base. These bits contain the address of the next microinstruction in ROM that must be executed in the next clock cycle. It is also possible that a microinstruction concludes the execution of an instruction, in which case these bits are not used.

Conditional select. It is desirable to execute conditional statements within the microcode routines. For instance, when executing a conditional jump instruction, the processor must first evaluate an integer comparison and, based on the result, decide whether to increment the instruction pointer by a specified offset or by 1.
The Transputer implements conditional microcode execution by replacing the least significant two bits of the next microinstruction address base with conditional bits. The new address formed selects one of four possible microinstructions, as shown in Figure 2.5. The two condition bits are generated by the operation specified by the conditional select bits associated with each microinstruction.

    AND : XfromA YfromB NoCarry ZfromXandY AfromZ Next ;

Listing 2.3: Simple 42 microcode for bitwise AND operation.

To illustrate the idea, consider the Simple 42 microinstruction to execute a bitwise AND instruction shown in Listing 2.3. In this case, the X and Y buses contain the data stored in the A and B stack registers. The Z bus contains the result of the AND operation, and this value is stored back into A.

The ROM is managed by a microcode engine that ensures that the correct control signals are available at the right time. When an instruction is executed, the engine associates its opcode with a microinstruction and loads the required word from the ROM [10]. At the next clock cycle, the engine will decide whether a new instruction or a further microinstruction is executed. In the first case, the next instruction will be fetched from memory and executed as normal. In the latter case, the engine uses the address encoded into the microinstruction to fetch the next set of control signals. However, the fact that the two least significant bits of the address are overwritten by the conditional bits means that there are four possible microinstructions that could be executed, yet the decision is deferred for the conditional bits to

be computed. For this reason, the engine loads four contiguous words from the ROM each time. Then, these are multiplexed using the conditional bits as control signals, as shown in Figure 2.5.

Figure 2.5: Conditional execution of microcodes in the Simple 42 Transputer [2].

Process Scheduling

We mentioned above that the Transputer maintains a linked list of ready processes to be executed. When the CPU becomes available (no process is currently being executed), the process at the front of the list is dequeued and executed. In reality, the Transputer maintains two linked lists of process pointers. One list manages the high priority processes, while the other takes care of low priority ones. The process at the front of the high priority list is run first. It is possible that at any time a background operation requires the CPU's attention, such as a communicating process becoming ready, in which case the executing high priority process waits until the request is completed. When the high priority process completes or blocks, the next process at the front of the high priority list is executed. This continues until the high priority queue is empty, at which point the Transputer starts executing processes in the low priority list in a similar fashion. However, if at any point in time a high priority process becomes active, the context of the low priority process will be stored in memory and the CPU executes the high priority process. For this reason, high priority processes should only be used in special cases where performance and responsiveness are paramount, otherwise they will starve low priority processes of CPU resources. In fact, when the Transputer is reset its priority is set to low by default [28].
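The priority rule described above reduces to a simple dispatch policy: always take from the front of the high priority list, and fall back to the low priority list only when the former is empty. A minimal Python sketch of this policy (class and method names are hypothetical, not taken from any Transputer documentation):

```python
from collections import deque

class TwoPriorityScheduler:
    """Front-of-list dispatch with strict high-over-low priority."""
    def __init__(self):
        self.high = deque()
        self.low = deque()

    def make_ready(self, proc, high_priority=False):
        # A ready process joins the back of the list for its priority.
        (self.high if high_priority else self.low).append(proc)

    def dispatch(self):
        """Return the next process to run, or None if both lists are empty."""
        if self.high:
            return self.high.popleft()
        if self.low:
            return self.low.popleft()
        return None
```

Note that this models only the queueing discipline; preemption of a running low priority process when a high priority one becomes active is handled by the context-switch machinery, not by the queues themselves.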
In Occam it is possible to declare timers, which behave like channels that cannot be output to. Timers are a simple mechanism to force processes to wait for a specified time without consuming any processing resources. The Transputer implements two additional linked lists (high and low priority) to support this feature. The processor also implements two clock registers, one for each priority. The high and low priority clock registers are incremented every 1µs and 64µs respectively [28]. Processes in timer lists are sorted with respect to the clock registers: processes whose wake-up time is closer to the time of the clock register are placed towards the front of the queue. When the time of the clock register passes the time of the process at the front of the queue, the Transputer dequeues and executes that process.
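The timer lists can be pictured as deadline-sorted queues inspected against the clock register. The sketch below is our own simplification of this behaviour; in particular, it ignores the wrap-around of the real clock registers, which the hardware must handle:

```python
import bisect

class TimerList:
    """Processes sorted by wake-up time, released once the clock reaches it."""
    def __init__(self):
        self.entries = []                 # sorted list of (wake_time, proc)

    def insert(self, wake_time, proc):
        # Keep the list ordered so the earliest deadline sits at the front.
        bisect.insort(self.entries, (wake_time, proc))

    def tick(self, clock):
        """Dequeue every process whose wake-up time the clock has reached."""
        ready = []
        while self.entries and self.entries[0][0] <= clock:
            ready.append(self.entries.pop(0)[1])
        return ready
```

One such list would exist per priority level, checked against the corresponding 1µs or 64µs clock register.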

Endianness and Memory Addressing

The Transputer is a purely little-endian processor [15], which means that the least significant byte of a word is stored at the lowest address. Also, contrary to most other processors, memory addresses in the Transputer are signed.

2.4 Transputer Versions and Enhancements

The first Transputer was released in 1984 and subsequent models were introduced thereafter. Inmos developed three Transputer variants: T2, T4 and T8. All these designs maintain the core features of the architecture with regards to concurrency management and inter-process communication, yet some additions and enhancements were introduced at each iteration. The T2 variant is the 16-bit version of the processor. There were multiple releases, but one of the later ones was the T225 [24], which contains 4 KB of on-chip RAM, an external memory interface and four communication links. In contrast, the T4 series are 32-bit processors. The smallest of these is the T400 [25], which contained only two communication links and 2 KB of on-chip memory. The larger counterparts of the T400 are the T414 and T425 [26], with the latter containing twice as much memory as the T400 and four autonomous link controllers. Finally, the T800 Transputers were introduced in 1987 [27]. This variant was still 32-bit, but also came with an extended instruction set and a 64-bit floating-point unit.

To make it easier to connect large numbers of Transputers in a network, Inmos introduced the C004 programmable link switch [23]. This device had 32 link inputs and 32 outputs and complied with the serial link protocol already used by the Transputers.

Chapter 3

Project Execution

The project is broadly divided into three stages. Firstly, we implemented the processor's core CPU, which can execute the main instructions for the scheduling, timer and internal communication operations. Secondly, we developed the autonomous links and network switch to enable message passing among multiple Transputers. Finally, support for I/O operations was introduced, building on the existing communications functionality. The following sections describe our design decisions and the implementation details at each stage.

3.1 The OpenTransputer CPU

To simplify the overwhelming complexity of the design, the processor's core can be divided into three major parts that were designed separately and subsequently integrated. Each of these components is described in turn, and differences with regards to the original Transputer are highlighted where appropriate.

Control Unit (Decoder)

As discussed in Section 2.3.1, the original Transputer used a microcode system to generate the control signals needed for the datapath at each clock cycle. These signals were stored in ROM and loaded by a microcode engine when required. The alternative approach is to hardwire logic that generates the same control signals by implementing combinatorial circuits with a few sequential elements. Considering that in the most complex Transputers there are over 600 microinstructions, each consisting of more than 100 bits, hardwired logic would have significantly increased the area and complexity of the design. An attractive feature of microcoded over hardwired systems is that they are much simpler to debug and change, even in later stages of the design process. This is because in these systems most of the machine's behaviour is described by microcode routines rather than by dedicated circuitry. Indeed, the development of these routines resembles programming in assembly code rather than hardware design.
The disadvantage is that microcoded processors that store their routines in ROM are potentially slower than their hardwired counterparts because there is a delay associated with reading the control signals from ROM. For the OpenTransputer, we have decided to take a middle ground between the microcoded and hardwired approaches by taking the best features of both: we maintain the flexibility and ease of use of microcodes and the performance of hardwired logic. Hence, we devised a simple assembly-like language to describe the microinstructions, and a microassembler was developed in Python. The script parses the microinstructions in plain text and converts them into an intermediate representation that can be easily translated into any implementation, such as ROM or hardwired logic. Therefore, when a bug in

the design is found or a new feature is introduced, only the source microinstruction is modified and a new version of the control unit can be generated with a few keystrokes. Furthermore, the script makes heavy use of Python's high-level data structures, such as lists and dictionaries, which can be easily customised to cope with changes in the other components of the Verilog design.

Microcode Language

The OpenTransputer microassembly language was inspired by the original Transputer microinstructions. Each command in the microprogram can be thought of as a different state in a Finite State Machine (FSM), having an integer identifier (address) and an associated set of control signals [30] which are generated by the Python script. We describe microinstructions to the tool by using high-level human-readable text commands. Each microinstruction has three parts:

1. State name. An alphanumeric word that uniquely identifies the state.

2. Body. Describes the datapath enabled by the control unit in a single clock cycle. Common commands used in the body of a microinstruction select inputs from multiplexers, generate specific Arithmetic Unit (AU) and Logic Unit (LU) operation codes, enable registers for writing, etc.

3. Control flow. At each clock cycle the CPU must be able to decide what happens next. Hence, all microinstructions include a bit field that encodes what action should be taken next. There are four possibilities:

(a) Execute the next instruction from the instruction buffer (see Section 3.1.3).

(b) Fetch the next instruction from memory. This only occurs when a microinstruction manipulates the instruction pointer, such as in jumps and context switches.

(c) Unconditionally execute the next microinstruction, whose address is embedded within this state.

(d) Jump to one of two or one of four states according to a condition specified by this microinstruction.

An example microinstruction with state name GT is shown in Listing 3.1.
The body compares the values of the A and B registers using the AU; the commands Auop0fromB and Auop1fromA select the output of those registers as the input operands to the AU. The result of the computation is stored in A and the register stack is popped, using the commands AfromAubool and BfromC respectively. Also, the operand register is cleared (set to 0) by the command OfromClear. Finally, the control flow command Gotoplus1 increments the instruction pointer by one byte and tells the processor to execute the next instruction. The microinstruction CCNT11 in Listing 3.2 performs an AU operation similar to GT, but uses the boolean result of this computation to decide whether control should be transferred to either CCNT13 (B < A) or CCNT14 (B ≥ A).

    GT BfromC AU(gt) AfromAubool Auop0fromB Auop1fromA OfromClear Gotoplus1 ;

Listing 3.1: OpenTransputer microinstruction for state GT (greater than instruction).
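The first stage of the microassembler splits lines such as Listing 3.1 into a state name and a command list. The following Python sketch shows one plausible way to do this; the grammar is inferred from the listings shown here rather than taken from the actual tool:

```python
import re

def parse_microinstruction(line):
    """Split 'NAME cmd cmd ... ;' into the state name and its command list.
    Commands may carry parenthesised arguments, e.g. AU(gt)."""
    body = line.strip().rstrip(';').strip()
    # Match either 'word(args)' (spaces tolerated) or a bare word.
    tokens = re.findall(r'\w+\s*\([^)]*\)|\w+', body)
    # Normalise away internal spaces such as 'AU( ulteq )' -> 'AU(ulteq)'.
    name, commands = tokens[0], [t.replace(' ', '') for t in tokens[1:]]
    return name, commands
```

Applied to Listing 3.1, this yields the state name GT and the commands BfromC, AU(gt), AfromAubool, Auop0fromB, Auop1fromA, OfromClear and Gotoplus1, ready for the later error-checking and address-allocation stages.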

Figure 3.1: Next microinstruction state address generation.

    CCNT11 Auop0fromB Auop1fromA AU( ulteq ) Condaubool ( CCNT13, CCNT14 );

Listing 3.2: OpenTransputer microinstruction for state CCNT11.

Microprogram Assembler

The steps followed by the tool to assemble a microprogram are shown in Figure 3.2. After the input is read and parsed, an intermediate representation is generated and analysed for errors. The error-checking facilities help prevent bugs in the design, yet they do not prove the functional correctness of the code. Common mistakes found by the microassembler include unknown commands, non-existent connections, conflicting instructions and missing states, among others.

The Inmos Transputer microinstructions contain a bit field that stores the base address in ROM of four contiguous microinstructions. Also, there are bits to decide which of these four would be executed next (see Section 2.3.1). The OpenTransputer operates in a similar fashion thanks to the address allocation process of the microassembler, which translates the alphanumeric identifiers to unique integers. The objective of the tool is to allocate contiguous addresses to groups of microinstructions that are possible targets of the same control flow condition. For instance, for the condition Condaubool(CCNT13,CCNT14) in Listing 3.2, the least significant bit of the addresses of CCNT13 and CCNT14 should be 0 and 1 respectively, while the remaining bits must be equal in order for the instructions to be contiguous. This is because, when resolving this condition, the processor will merge the base address encoded in the microinstruction for CCNT11 with the boolean result of the AU to form the address of the target microinstruction, as shown in Figure 3.1. In contrast to the original Transputer, in our design the microinstructions can be grouped in pairs and quadruples depending on the specified condition.
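The address merging of Figure 3.1 can be written down directly. In this sketch (the function name is our own), the low bit of the base is replaced for a pair and the low two bits for a quadruple, assuming the allocator has aligned the base accordingly:

```python
def next_microaddress(base, condition, group_size):
    """Replace the low bits of an aligned base address with the condition
    result: 1 bit for pairs, 2 bits for quadruples."""
    bits = {2: 1, 4: 2}[group_size]
    mask = (1 << bits) - 1
    assert base & mask == 0, "base must be aligned to its group size"
    return base | (condition & mask)
```

For Listing 3.2, the boolean AU result would select between the addresses of CCNT13 (condition 0) and CCNT14 (condition 1) within a pair.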
Therefore, to allocate the state addresses, the microassembler does a first pass through all the input microinstructions and generates a list of all groups. The script then allocates addresses to all groups of four, groups of two, and single microinstructions, in exactly that order. Note that there are no constraints on the addresses of individual instructions, yet the process terminates correctly only if (a) the least significant two bits of the base addresses of all groups of four are 0, and (b) the least significant bit of the base addresses of all groups of two is 0. The objective of the address allocation process is to assign a unique address to every microinstruction while meeting the constraints above and using the smallest range of addresses possible. In other words, we require that all addresses can be represented with the fewest number of bits possible.
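One way to satisfy both alignment constraints while keeping the address range minimal is to allocate the largest groups first, starting from address zero; alignment then holds automatically. The sketch below is our own simplification of this ordering, not the allocator's actual code:

```python
def allocate_addresses(quads, pairs, singles):
    """Assign contiguous addresses: groups of four first (bases aligned to 4),
    then pairs (bases aligned to 2), then single states. Because allocation
    starts at zero and proceeds largest-first, every base is aligned."""
    addresses, next_free = {}, 0
    for group in list(quads) + list(pairs) + [[s] for s in singles]:
        for offset, name in enumerate(group):
            addresses[name] = next_free + offset
        next_free += len(group)
    return addresses
```

After all quadruples are placed, the next free address is a multiple of 4 (hence also of 2), so pair bases meet constraint (b) without any padding, and singles fill the remaining addresses densely.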

The final step of the assembly process is to use the intermediate representation to generate an output that can be combined with the remaining modules of the design. The two main alternatives for this are a ROM similar to that of the original Transputer and hardwired logic. We decided to take the latter approach, since our design has less than half the states of the Inmos implementation. Furthermore, the microcodes are very sparse, meaning that in most microinstructions the majority of control signals are set to 0. Therefore, using combinatorial logic to generate control signals is potentially cheaper and faster than including a large ROM. For the purpose of this project, we are not concerned with extracting maximum performance from our implementation, so the script generates behavioural Verilog rather than combinatorial logic blocks. Due to this, the design is not as efficient as it could be, since we are essentially relying on the hardware synthesis process to generate the logic, which generally is not as optimal as manually developing the circuitry.

We cannot stress enough that this microcoded approach has greatly increased our productivity and allowed us to achieve the goals agreed at the beginning of the project. During the early stages of the project, we developed an early prototype implementation in Verilog using the hardwired approach, but manually coding all the control signal values. This is not only very tedious and error prone, but more importantly, effort and time are lost if major changes need to be introduced in later stages of the development process.

Figure 3.2: Stages of the microassembling process (read and parse input microinstructions; generate intermediate representation; check microprogram correctness; allocate addresses; generate output representation).

Microcode Bit Fields

The latest OpenTransputer can execute the majority of the T414 Transputer's instruction set with only 303 different microinstructions.
Each microinstruction has 134 bits, distributed as follows:

9 bits encode the base state address of the next microinstruction.

4 bits select the control flow condition.

22 bits correspond to register enable signals.

1 bit enables memory writes.

29 bits are multiplexer control signals. These are generally used to select write values to registers and input operands to modules such as the AU, LU and addressing units.

Figure 3.3: Example datapath implementation using three buses.

4 bits select the arithmetic and comparison operation.

2 bits select the logic operation.

1 bit differentiates between arithmetic and logical right shifts.

72 bits are used for various other, more specialised purposes.

Datapath

Due to constraints in the manufacturing process during the 1980s, it was not possible to fabricate chips with more than two layers of interconnect. Therefore, the number of connections between the different components of the processor had to be minimised to avoid issues such as wires crossing over each other's paths. This is the main reason why the original Transputer was implemented using three buses. Two of these buses contained the input operands for components such as the ALU. The third gathered the result of a computation and was also used to write the value back to its destination. Despite meeting manufacturing requirements, this design approach has the effect of reducing the number of simultaneous computations that can be performed in any clock cycle. This is due to the fact that it is not possible to transport the operands to the relevant components in parallel.

Modern manufacturing techniques can easily cope with multiple layers of interconnect, giving us the flexibility to include significantly more wiring. Hence, we have decided to replace the buses with large collections of wires that transport the output of a component to every part of the datapath that will eventually need that signal as an input. With the bus approach, shown in Figure 3.3, the outputs of all registers are multiplexed onto two buses and the outputs of other logic blocks are placed onto a results bus.
With our approach, the outputs of the registers and other system components are carried by individual wires, and multiplexers are only needed to select the inputs, making the datapath wider, as illustrated in Figure 3.4. From the diagram, it is clear that with only three buses the ALU and the addressing unit cannot operate simultaneously on different operands, while the other approach enables this possibility. The fact that most logic components in the original Transputer are fed the same input operands means that it is difficult to exploit the inherent parallelism of hardware using the bus approach. Thus, instead

of having multiple instances of a module to compute two different values within the same clock cycle, the Inmos engineers reused the same components at the expense of time. For example, instead of having two ALUs for computing two different arithmetic operations simultaneously, the processor would only have a single one and take twice as long to calculate the results. During the 1980s, this was a desirable trade-off, since replication of major components of a system such as the ALU and memory addressing unit was very expensive in terms of area and power consumption. Also, replication could have a significant impact on the manufacturing cost of the device. Nowadays, thanks to the dramatic decrease in the cost of electronic components and improved manufacturing techniques, replication is an acceptable mechanism to maximise performance. Hence, we have decided to take advantage of this idea to reduce the overall cycle count of many instructions. This in turn reduces the total number of microinstructions and potentially decreases the complexity of the hardwired control unit. The disadvantage of module replication and the replacement of the original bus system is a potentially more complex processor datapath than that of the original Transputer.

Figure 3.4: Example implementation using a wide datapath approach.

Figure 3.5 is a simplified schematic of the OpenTransputer's datapath. Clearly, the complete diagram contains many more logic modules, connections and multiplexers. However, for the purposes of this explanation this schematic suffices. The following are a number of key points to note about the datapath:

We have included three memory addressing units, two of which calculate word addresses, that is, given a base x and an offset y the output is x + (y × 4). The remaining addressing module is simply an adder to compute a byte address given a base and offset.
This enables the OpenTransputer to calculate three different addresses simultaneously, one of which is exclusively used for accessing memory and the other two for updating registers.

We have developed dedicated circuitry to copy blocks of data from one memory location to another. This circuitry is used by the processor to efficiently execute input and output instructions on behalf of processes communicating within a single machine.

Because the hardware essentially implements a small kernel, there are 14 constant integer values that

are commonly used in operations such as scheduling and communication. These values correspond to special memory addresses or to numbers that flag the states of processes and channels. Similar to the original Inmos design, the OpenTransputer stores these values in special ROM. However, we have decided to include twice as many such ROM modules to optimise performance. The added cost of the extra memory is negligible, since their maximum size is only 16 words.

Figure 3.5: OpenTransputer's simplified datapath schematic.

The OpenTransputer has separate arithmetic and logic units rather than a single ALU.

A final optimisation introduced in the OpenTransputer greatly reduces the time taken to perform a context switch between low and high priority processes. When a low priority process wishes to start a new high priority one, the original Transputer blocks the parent process by storing all of its context in memory and then executes the new one. This task is particularly slow because the processor must write data to memory at least six times to store the A, B, C, Iptr, Wptr and status registers. We have chosen to optimise context switches from low to high priority in the OpenTransputer by introducing shadow registers [17] within the register file. Shadow registers are additional registers whose only purpose is to temporarily save the value of another register that will be overwritten and would otherwise be lost. Thus, when a low priority process is blocked by a high priority one, the CPU copies all the context into the shadow registers in a single clock cycle. When the low priority process needs to be restored, the saved values are recovered from the shadow registers. This mechanism reduces the average context switch time from 12 to 4 clock cycles.
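The shadow-register optimisation can be modelled behaviourally as a one-shot copy of the whole context, in contrast with six separate memory writes. The following sketch uses the register names from the text; timing and the hardware details of the register file are not modelled:

```python
class RegisterFile:
    # The six registers that make up a process's context.
    CONTEXT = ('A', 'B', 'C', 'Iptr', 'Wptr', 'status')

    def __init__(self):
        self.regs = {name: 0 for name in self.CONTEXT}
        self.shadow = None

    def save_to_shadow(self):
        """Single-step capture of the low priority context (one clock cycle
        in hardware, versus at least six memory writes without shadows)."""
        self.shadow = dict(self.regs)

    def restore_from_shadow(self):
        """Recover the saved context once the high priority process is done."""
        self.regs = dict(self.shadow)
        self.shadow = None
```

Note that a single set of shadow registers suffices here because only one low-to-high preemption needs to be outstanding at a time.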

Fetch Unit

The fetch unit of the OpenTransputer is very simple. It consists of a 32-bit instruction buffer (four 8-bit instructions) and some logic to decide when the next fetch should be issued. Its operation is equally simple and is tightly related to the control unit. Firstly, when the machine is switched on, a fetch is issued before any instruction is executed. In normal operation the fetch unit uses the two least significant bits of the instruction pointer (the byte index) as an index into the instruction buffer to decide which instruction is executed next. When the byte index reaches 3, a new fetch is issued and execution proceeds. This mechanism does not take into account that instructions must also be fetched when the instruction pointer is manipulated, either by a jump or by another operation such as a context switch. For this reason, all microinstructions have a 1-bit flag that causes the fetch unit to retrieve a word from memory.

Component Integration and Signal Timing

The OpenTransputer does not have a conventional Fetch-Decode-Execute pipeline, a mechanism that is better suited to register machines than to stack-based architectures. This is because most instructions in machines such as the Transputer operate on the same register stack, causing many data dependencies. For instance, consider the sequence of instructions shown in Listing 3.3 for a hypothetical register machine. Clearly, the addition can easily be completed independently of the other instructions. Nevertheless, there is a dependency between the subtraction and the store instruction on register r0. In a register machine even this can easily be overcome by bypassing the result in r0 before it is written back to the register file, or by occasionally stalling the pipeline. On the other hand, consider the sequence of Transputer instructions in Listing 3.4.
Every single one of these instructions operates on the register stack and must complete before the subsequent instruction can be executed, making the architecture unsuitable for traditional pipelining techniques. The OpenTransputer therefore either executes a fetch or an instruction, but never both at the same time.

    add r1, r2, r3
    sub r0, r5, 56
    stw r4, r0, r7

Listing 3.3: Assembly program for a register machine.

    ldc 1
    ldc 2
    add
    ldc 3
    sub

Listing 3.4: Assembly program using the Transputer Instruction Set Architecture (ISA).

Although it is difficult to pipeline the Transputer, this idea can be applied to the individual microinstructions to a certain extent. However, this optimisation is outside the scope of this project and was not attempted. The three components described above (i.e. the fetch and control units and the datapath) are connected as shown in Figure 3.6. The fetch unit is the simplest and is in charge of retrieving instructions from memory and feeding them to the control unit, where they are associated with a microinstruction and executed. In contrast, the datapath is a collection of modules, such as the register files, AU, LU and memory addressing units, all connected together, that actually executes the instructions. Finally, the control unit is the authority in the processor and is the piece of hardware that generates the signals to enable or disable different paths within the machine, yet it does not directly interface with memory. The control unit

receives feedback from the fetch unit to coordinate accesses to memory. It also has inputs connected to components of the datapath to access the results of computations associated with the flow of control. For instance, the address of the next microinstruction is calculated by dedicated circuitry in the processor and the result is used every cycle to decide on the next set of control signals.

Figure 3.6: Integration of the three main components of the OpenTransputer CPU.

As in every processing system, the timing at which the signals are available is key for the CPU to behave correctly. Furthermore, the timing coordinates the interactions between the different parts of the system. In the OpenTransputer all registers are edge-triggered and their values update at the positive edge of the clock. In contrast, the original Transputer operated using level-triggered latches, a mechanism we consider difficult to use because each register effectively consists of two independent latches that are enabled by independent non-overlapping clocks. In our design, each microinstruction takes exactly one clock cycle to complete. Hence, the control signals are updated at the positive clock edge preceding instruction commit, and the instruction to be executed is visible to the control unit two cycles before. To illustrate this, consider the waveform shown in Figure 3.7; note that at time 9675 ns the value of instr changes to 4A₁₆, which corresponds to a ldc (load constant) instruction that updates both registers A and B. The control signals associated with this instruction are only set one clock cycle later (9685 ns) and the instruction is only executed two cycles after that (9695 ns). Now consider the waveform at time 9695 ns when the value of the instruction pointer (i.e.
riptr) changes. Clearly, the byte index of Iptr has reached 3 and the next four instructions must be fetched from memory. This is why the value of instr changes to 2D₁₆, which is the same instruction executed four clock cycles earlier. The control unit notices this situation and takes action by setting enfetchbuff high (9705 ns), completing the fetch operation one cycle later. It is important to highlight that when the processor executes a fetch, all the other components of the processor are disabled and the values input to the memory module are those output by the fetch unit.

Other Design Decisions

Since the purpose of this project is to develop a re-implementation of the Transputer's core functionality, we decided to omit a number of instructions that common programs are not expected to use. However, introducing support for the missing instructions is a relatively easy task since the OpenTransputer already implements most of the hardware that would execute them. In particular, our design cannot execute the following groups of instructions.

Figure 3.7: Signal timing for correct CPU behaviour.

The resetch low-level instruction that is used to reset an autonomous link controller.

Debugging instructions to read the scheduler front and back pointers.

Floating-point arithmetic.

Long integer arithmetic.

Multiplication and division.

For a detailed list of all instructions refer to Appendix A.

3.2 External Communication

Beneš Networks

The Inmos Transputer has four external bidirectional communication links that can be used to connect up to four devices. There is no limit on the number of Transputers that can be connected in this fashion [16]. However, as the network grows, communication operations may become slower since messages need to be relayed by intermediate nodes before the destination is reached. This clearly reduces the usability of the Transputer as a building block for assembling large parallel networks. Inmos realised this problem and developed a 32-way crossbar switch known as the C004 that is compatible with the serial protocol of the Transputer links. The idea is that the switch routes messages between the Transputers rather than joining the processors using direct point-to-point connections. Taking this into account, we implemented the OpenTransputer with a single bidirectional link rather than four. This link is meant to be connected to small switches arranged in a Beneš network. Beneš networks [5] are a special form of Clos networks [8] composed of simple 2×2 switches and three or more stages, as shown in Figure 3.8. They were formalised by Václav E. Beneš while working at Bell Labs researching different topologies for telephone switching networks. In a Beneš network, if r is the number of switches in every stage, the network supports N = 2r different inputs and the same number of outputs. Moreover, the network contains 2 log₂ N − 1 stages and a total of N log₂ N − N/2 switches.
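The sizing formulas above are easy to check numerically. The following sketch (function name ours) computes the number of stages and switches for a power-of-two network size:

```python
# Quick check of the Beneš network formulas quoted above: for N = 2**k
# inputs, the network has 2*log2(N) - 1 stages of N/2 switches each, i.e.
# N*log2(N) - N/2 switches in total.
import math

def benes_dimensions(n_inputs: int):
    k = int(math.log2(n_inputs))
    assert 2 ** k == n_inputs, "Beneš networks need a power-of-two size"
    stages = 2 * k - 1
    switches = n_inputs * k - n_inputs // 2   # = stages * (N/2)
    return stages, switches

# The 8x8 example of Figure 3.8: 5 stages of 4 switches.
print(benes_dimensions(8))    # -> (5, 20)
```

For the 2048-processor maximum mentioned later in this section, the same formulas give 21 stages and 21504 switches.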
In the case of the OpenTransputer, each processor needs both an input and an output link to communicate. Therefore, the Beneš network can be viewed as folded over on itself, as shown in Figure 3.9. The resulting network is composed of crossover switches with four inputs and four outputs (4×4) rather than the simpler 2×2 crossbar switches. Beneš networks are rearrangeably non-blocking [5], meaning that given any input a path can always be found to an output without colliding with any other path from other connections in the network.

Figure 3.8: Example 8×8 Beneš network.

Figure 3.9: Folded over Beneš network with capacity for 8 OpenTransputers.

However, in order for this to happen some of the previously established connections might have to be rearranged to use different intermediate switches. For the OpenTransputer, the non-blocking property of Beneš networks means that, given the communication pattern of a program, it is always possible to find an arrangement of the channels in the network such that no two paths cross. In other words, we can ensure that no two packets will try to use the same link while the program is being executed. Clearly, this is an attractive

property because in concurrent systems the performance of a program is often limited by network traffic. For instance, in a more traditional matrix arrangement, the shortest paths between two nodes inevitably use the switches on the main diagonal. As a result, the traffic through these nodes of the network is significantly higher, causing a large number of collisions and slowing down the whole system. The non-blocking property is the main reason for our choice of network configuration. The Occam programming model facilitates analysing programs statically in order to extract the communication pattern and efficiently map channels to non-blocking paths in the network. It is important to highlight that this requires the development of new software tools that generate such mappings for predefined network topologies.

Channel Addressing

In the OpenTransputer, channels may be used for internal communication, external communication and I/O pin handling. Since the same input and output instructions are used for all these operations, the processor uses channel addresses to differentiate what actions should be performed. As illustrated in Figure 3.10, the two least significant bits of the address identify the type of channel. Hence, if bit 0 is set, the processor knows that this is an external channel address and the I/O instructions are executed accordingly. On the contrary, if the least significant bit of the address is not set, the decision between internal communication and I/O pin handling relies on the value of the bit with index 1. Recall that for internal communication any word in memory can be used as the channel; therefore, the address consists of a memory address with the two least significant bits set to 0. On the other hand, for external communication the channel address encodes the route through the Beneš network followed by a packet to reach the destination OpenTransputer.
Finally, when a channel refers to an I/O pin, the address encodes the identifier of the pin to be used. In the OpenTransputer, I/O pins are not memory-mapped as in most conventional processors. For a detailed description of I/O pin handling refer to Section 3.3.

Figure 3.10: Bit fields of the channel addresses for internal and external communication and I/O pins.

Message Routing

The Beneš network is implemented as a collection of smaller four input and four output switches. As shown in Figure 3.9, the OpenTransputers are connected at one end of the network, which we call the core. The other end of the network always terminates in unconnected ports and is called the edge. The available connections at the edge can be used to attach more switches and increase the capacity of the network; every new layer of switches linked to the edge doubles the capacity. The routing operation within the network of switches consists of two stages. Firstly, the message is forwarded by the individual

switches towards the edge of the network. Then, when the message reaches a previously determined layer (the turning point), it is forwarded towards the core until it reaches the destination. In this scheme, the channel address encodes the complete route through the network. As illustrated in Figure 3.10, channel addresses have five components:

T-bit. This is the most significant bit of the address and can either be 0 or 1 depending on whether the channel is synchronous or asynchronous respectively. In the Occam communication model, all I/O operations are synchronised. Hence, to use asynchronous channels the language must be extended and the compiler modified to cope with the changes. This is outside the scope of this project and has not been attempted. We foresee that asynchronous channels will be fully supported in future iterations of the OpenTransputer, so we have decided to include the T-bit within the address.

Destination port. This is a four-bit value that encodes the destination port in the receiving OpenTransputer. Each processor has 1 output and 16 input ports, allowing up to 16 processes to wait for input operations.

Route towards edge (RTE). An 11-bit field that specifies the path of the message towards the edge of the network. Each switch between the sender OpenTransputer and the turning point forwards the message using its left or right port if the value of the corresponding bit of this field (starting from the least significant bit) is 0 or 1 respectively.

Route towards core (RTC). An 11-bit field that specifies the path of the message towards the core of the network. Each switch after the turning point forwards the message using its left or right port if the value of the corresponding bit of this field (starting from the least significant bit) is 0 or 1 respectively.

Turning point. A 4-bit field that encodes the depth (layer number) within the network that the message must reach before it turns around and heads back towards the core.
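The field widths sum to a full 32-bit word (1 + 4 + 11 + 11 + 4 field bits plus the type bit). The following sketch packs and classifies such addresses; the widths come from the text, but the exact bit positions are our reading of Figure 3.10 and should be treated as an assumption rather than the definitive layout.

```python
# Illustrative pack/classify for OpenTransputer channel addresses. Assumed
# layout of an external (hard) channel, MSB to LSB: T (1 bit), turning point
# (4), RTC (11), RTE (11), destination port (4), type bit (1, always 1).
def make_hard_channel(t: int, turn: int, rtc: int, rte: int, port: int) -> int:
    assert t < 2 and turn < 16 and rtc < 2048 and rte < 2048 and port < 16
    return (t << 31) | (turn << 27) | (rtc << 16) | (rte << 5) | (port << 1) | 1

def classify(addr: int) -> str:
    # Bit 0 set -> external channel; otherwise bit 1 picks I/O pin vs memory.
    if addr & 1:
        return "external"
    return "io-pin" if addr & 2 else "internal"

addr = make_hard_channel(t=0, turn=3, rtc=0b00, rte=0b10, port=5)
print(classify(addr))          # -> external
print((addr >> 27) & 0xF)      # turning point -> 3
```

An internal (soft) channel is simply a word-aligned memory address, which is why its two least significant bits are guaranteed to be 0.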
To better understand the routing mechanism, consider the example in Figure 3.11. This network consists of 12 switches and two OpenTransputers connected at opposite ends of the network. To output a message, the processor sends the channel address and the data byte to the switch in layer 1 that it is connected to. Since this is the first switch in line and the message has not reached the turning point, it inspects the least significant bit of the RTE field. Because the bit is not set, it forwards the message and address to the next switch in the path using its left output port. Similarly, the second switch inspects the address, yet this time it forwards the message using its right output port because the second bit of RTE is 1. The third switch in the path is the turning point as specified in the channel address. Thus, the switch in layer 3 forwards the message back to layer 2. The next stop in the network is the first switch after the turning point, so it checks the RTC field and decides to forward the message using its left output port. The process continues until the data reaches the destination OpenTransputer. Note that there is a single bidirectional link between each node in the network, so the processor is responsible for delivering the message to the appropriate input link controller according to the destination port specified in the channel address. It can easily be seen that this form of addressing allows us to connect up to 2048 OpenTransputers, since the route towards edge and route towards core bit fields are 11 bits each.

Collision Resolution

Each of the switches in the Beneš network consists of four input and four output ports connected as shown in Figure 3.12. Each of these ports is attached to an independent controller, meaning that the

Figure 3.11: Example route of a package through a Beneš network of OpenTransputers.

switch can route up to four messages simultaneously provided that their paths do not cross. The input controllers receive messages from other nodes in the network, process the channel address and decide where to forward the data. The output controllers are substantially simpler because they only have to moderate which of the three messages forwarded by the input controllers is output to the physical link next. The output controller implements a token mechanism that ensures all input controllers get a fair share of the physical link time. The token is assigned to a controller for an indefinite amount of time; when the controller finishes using the link, it passes the token to the next controller that needs to output a message. Despite the collision-avoiding properties of Beneš networks, sometimes the communication patterns of an application cannot be inferred at compile or load time. This will inevitably cause collisions, a condition in which two or more messages must be routed through the same physical link. To avoid data loss, each node in the network must implement a protocol that ensures each message sent has been successfully received at the other end of the physical link before the next packet is sent. Figure 3.13 gives a step-by-step description of the protocol used for the OpenTransputer project. When the sender wishes to transmit a packet to another node in the network, it makes the data available on the link (b) and sets the HasData line high (c).
The sender then blocks until there is a response from the receiver acknowledging that the data has been successfully transferred. When the receiver is able to attend to the

request, it copies the input data and channel address into an internal buffer and sets the ReadData line to acknowledge that the packet was received successfully (d). The sender notices that ReadData is 1 and sets HasData low (e). Finally, the receiver notices this change in its input signals and sets ReadData to 0, concluding the data transfer (f). The link is now ready to be used for another operation.

Autonomous Link Controllers

The OpenTransputer has a single bidirectional link that is connected to a network switch. The link is shared between the output and input link controllers, which were implemented using exactly the same microcoded approach explained earlier. In this section we describe the operation of the controllers and their interaction with the CPU.

Output Controller

A disadvantage of the original Transputer design is that each external channel must be bound to a hardware link. Therefore, there can only be four pending input or output operations at any one point in time, since there are only four serial communication links. Despite the compiler's support for detecting conditions where problems might arise, this issue increases the complexity of the software, which needs to take the limitation into account if the parallelism of a network is to be exploited. To mitigate this problem, we have decided to introduce virtual channels. Virtual channels are an abstraction that gives the programmer the impression that there is an infinitely large number of output links. In reality, there is a single link that is shared between all communicating processes. For this purpose, we developed an output link controller that implements a scheduler in hardware similar to that used by the CPU to allocate processing resources. The output link controller has two registers that store pointers to the front and back processes of a linked list.
This structure acts as a queue where new processes are appended to the back and dequeued from the front to be given a share of communication time. There are two such lists, for high and low priority processes, yet communicating processes cannot be preempted. That is, if a low priority process is using the link and a high priority process wishes to communicate, the latter is appended to the back of the high priority queue and run when the low priority process completes. The operation of the output controller is conceptually simple. When the OpenTransputer is reset, both queues are emptied and the controller is set to the ready state. When a process wishes to output a message, the processor sends the size of and pointer to the message, together with the workspace pointer, to the link

Figure 3.12: Internal connections of 4 input and 4 output switches.
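The two-priority, run-to-completion scheduling behaviour described above can be sketched as follows. All names are illustrative; the real controller keeps front and back pointers to linked lists threaded through process workspaces, not Python deques.

```python
# Behavioural sketch of the output link controller's virtual-channel
# scheduler: two FIFO queues (high and low priority), and a process that is
# already using the link always runs to completion.
from collections import deque

class OutputLinkScheduler:
    def __init__(self):
        self.queues = {"high": deque(), "low": deque()}
        self.current = None          # process currently using the link

    def request(self, process, priority):
        if self.current is None:
            self.current = process   # link free: start communicating at once
        else:
            self.queues[priority].append(process)

    def complete(self):
        # Current transfer finished: pick the next waiter, high priority first.
        finished = self.current
        self.current = (self.queues["high"].popleft() if self.queues["high"]
                        else self.queues["low"].popleft() if self.queues["low"]
                        else None)
        return finished              # the CPU reschedules this process

sched = OutputLinkScheduler()
sched.request("P1", "low")           # link free, P1 communicates
sched.request("P2", "low")
sched.request("P3", "high")          # waits: P1 cannot be preempted
sched.complete()                     # P1 done; P3 (high) goes next, before P2
print(sched.current)                 # -> P3
```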

Figure 3.13: Data exchange protocol between two components of a network of OpenTransputers.
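The HasData/ReadData exchange of Figure 3.13 is a classic four-phase handshake. The following behavioural model (wire and function names are ours) walks through steps (b) to (f); the real links also carry the data and channel-address buses in parallel with the control wires.

```python
# Behavioural model of the four-phase HasData/ReadData handshake shown in
# Figure 3.13. Steps are labelled to match the figure's (b)-(f) phases.
class Link:
    def __init__(self):
        self.has_data = 0
        self.read_data = 0
        self.data = None

def transfer(link: Link, payload):
    trace = []
    link.data = payload; link.has_data = 1          # (b)+(c) sender offers data
    trace.append("sender: HasData=1")
    received = link.data; link.read_data = 1        # (d) receiver latches data
    trace.append("receiver: ReadData=1")
    link.has_data = 0                               # (e) sender sees the ack
    trace.append("sender: HasData=0")
    link.read_data = 0                              # (f) handshake completes
    trace.append("receiver: ReadData=0")
    assert received == payload and link.has_data == link.read_data == 0
    return trace

print(transfer(Link(), 0x2A))
```

Because both wires return to 0 before the next packet, a sender can never overwrite data that the receiver has not yet latched, which is exactly the loss-avoidance property the text requires.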

controller. Since the link is not busy, the output operation is run by the controller. If another process wishes to use the link while it is busy, the controller uses the DMA mechanism to store the length of and pointer to the message in the process workspace, and enqueues the process according to its priority. When the first process finishes communicating, the output link controller signals the CPU to request that the process be scheduled for execution. The controller then removes the next waiting process from the front of the queue and runs its output operation. It cannot be stressed enough that the output link controller runs concurrently with and independently of the CPU. Hence, while the controller is performing an operation, the CPU will be executing another process.

Input Controllers

As mentioned before, the Inmos design had four input link controllers, which we consider a barrier to using the processor to build large parallel networks. To mitigate this problem we introduced the concept of virtual channels, yet it is difficult to implement this idea for input operations. If we modified the input link controller to implement a queue of processes, we would need a buffer of potentially unbounded size to store the partially received messages from other OpenTransputers. Furthermore, messages for any waiting process can arrive at any point in time. Thus, maintaining a simple linked list of all processes would be very inefficient, since in the worst case the controller would have to iterate through every item in the list to match an incoming message with its intended recipient. In this sense, a more sophisticated data structure would be needed to implement the scheduler of the input link controller. However, complex schemes are difficult to implement in hardware and have other undesirable side effects such as increased power consumption.
Due to these issues, we consider that implementing an input controller similar to the output one has more disadvantages than benefits, yet we still consider that four links, as in the original Transputer, is too limiting. For this reason, we decided to include 16 input link controllers per OpenTransputer, which share a single bidirectional link with the output controller. Each of the 16 input link controllers corresponds to a different port number as specified in the channel address for external communication. Like the output controller, the input controllers use the DMA mechanism to access memory and operate concurrently and independently of each other and the CPU. The interaction between the CPU and the input link controllers is slightly more complex than for the output link controller, because in Occam input operations can be used within alternative constructs. A controller can be in any of four states:

Ready. Data has been received from a remote process, but there is currently no receiver.

Requested. The controller is communicating on behalf of a process.

Enabled. In an alternative construct, processes are bound to guards or conditions, and the first process whose condition is met is executed. Naturally, the conditions can be logical expressions, but they can also be inputs from channel operations; in this case, the process bound to whichever input arrives first is the one executed. To support alternative constructs, the input link controllers can be set to enabled, meaning that they must notify the CPU when the first byte arrives on the link before communication proceeds.

Waiting. The controller is not communicating on behalf of a process and no data has arrived on the link.

When the controller is reset its state is set to waiting. If a process wishes to execute an input operation, the CPU interacts with one of the input controllers by sending it the message length and destination pointer as well as the process workspace pointer.
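The state transitions described above can be summarised in a small table. This is our reading of the text, with invented event names, and it simplifies the alternative-construct case (a disabled controller that already holds data goes to ready rather than waiting):

```python
# Behavioural sketch of the input-controller state machine: the four states
# (waiting, ready, requested, enabled) come from the text; event names and
# the transition table are illustrative, not the microcode.
TRANSITIONS = {
    ("waiting", "data_arrives"):   "ready",      # byte arrives, no receiver yet
    ("waiting", "cpu_requests"):   "requested",  # CPU starts an input operation
    ("ready",   "cpu_requests"):   "requested",  # a receiver turns up later
    ("waiting", "enable_alt"):     "enabled",    # guard of an alternative
    ("enabled", "data_arrives"):   "requested",  # this guard fires first
    ("enabled", "disable_alt"):    "waiting",    # another guard won instead
    ("requested", "msg_complete"): "waiting",    # CPU reschedules the process
}

def step(state: str, event: str) -> str:
    # Unlisted (state, event) pairs leave the controller where it is.
    return TRANSITIONS.get((state, event), state)

s = "waiting"
for ev in ("data_arrives", "cpu_requests", "msg_complete"):
    s = step(s, ev)
print(s)    # -> waiting
```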
The input controller then becomes requested and runs the operation to completion. When the message has been fully received, the CPU will be flagged to

Figure 3.14: Packet formats for external communication in the Inmos Transputer and the OpenTransputer.

reschedule the process. On the other hand, if data is received before the controller is requested, it enters the ready state and waits until a process performs an input operation on that controller; afterwards, the communication continues as normal. The final case is when the CPU executes an alternative construct. In this case, all involved input controllers are enabled, and the first of them to notify the CPU that data has been received is allowed to communicate. The remaining enabled controllers are set to the ready or waiting state (disabled) depending on whether they have received data from the link.

External Communication Protocol

The external communication protocol in the OpenTransputer is equivalent to that of the original Transputer. The protocol is designed to support synchronised inter-process communication using point-to-point channels as defined in Occam. To support this, the external links of both the OpenTransputer and the Inmos design are bidirectional, meaning that there are independent signal wires in each direction. When communicating, messages are broken down and sent one byte at a time, and every data packet must be acknowledged before the next data item is dispatched [18]. For the sake of clarity, the interactions between input and output controllers exchanging two bytes are shown in Figure 3.15. In this example the output controller initiates the data transfer before a receiving process is ready in the remote OpenTransputer. For this reason, the output controller waits for some time before it receives the first acknowledgement byte and communication can proceed.
Also, note that the sender process is only rescheduled after the last acknowledgement packet is received, to guarantee that communication is completely synchronous. Even though the communication protocol is the same in both the OpenTransputer and the original Transputer, the structure of the individual data packets differs. The Inmos engineers already struggled to fit a single Transputer into a chip; hence, manufacturing a device with more than a single processor within the same chip was not an option. Due to this (and potentially to reduce the amount of wiring), each Transputer was fitted with four serial communication links that allowed users to assemble large parallel networks. Either data or acknowledgement packets can be exchanged; their structure is shown in Figure 3.14. The former are 11-bit packets containing a single byte of actual information, and the latter are 2-bit packets that simply tell the sender that a data packet was successfully received. Both types of packet have the most significant bit set for synchronisation purposes, while the least significant (stop) bit is unset to differentiate between the end of a packet and the beginning of the next [14]. We envision that multiple OpenTransputers will be manufactured within the same chip, a setting in which communication is more reliable and we are less concerned with the amount of wiring needed to connect the processors to the switch. Hence, we have replaced the serial links with parallel connections for

Figure 3.15: Interaction between input and output controllers to transfer a message over the network [16].

on-chip communication. This also has the effect of speeding up data transactions, because all the bits belonging to a packet are transferred simultaneously. There are still two types of packet transferred by OpenTransputers, yet we have eliminated the need for the synchronisation bits, as shown in Figure 3.14. Also, the 32-bit channel address is forwarded in parallel with the packet. To assemble a network of two or more chips, the user would have to connect together the individual switches. However, the parallel connections approach introduces a usability problem: for instance, connecting two switches together would require over 160 individual wires. To tackle this issue, switches that implement a serial communication protocol, similar to that of the original Transputer links or the IEEE 1355 [4] standard, must be developed. This is out of the scope of this project and was not attempted.

3.3 I/O Interface

In many modern processors, the I/O pins are accessed through a mechanism known as memory-mapped I/O. The idea consists of using the same bus to address both memory and I/O pins.
To differentiate between the two possibilities, some reserved addresses correspond to the pins while the remaining addresses are actual memory locations. When an event occurs on any of the I/O pins, a signal is generated to attract the processor's attention so that the event is handled without further delay. In our opinion, this mechanism is difficult to grasp for the inexperienced user, and it does not take advantage of the other features of the Transputer, such as channels. The original Transputer developed at Inmos handles I/O pins as if they were external communication links. Therefore, the same input and output instructions can be used to read and write I/O pins attached

to the chip. Due to its simplicity, even the most basic Occam code can communicate with peripheral hardware such as sensors or displays. In the OpenTransputer we maintain this idea by implementing I/O pin handlers as special link controllers. To maximise their flexibility we also introduce a new instruction that enables the processor to configure a pin handler at run-time. The I/O pin handlers in the OpenTransputer expose the same interface as the communication links. In other words, the processor interacts in exactly the same way with both the pin handlers and the autonomous link controllers. When the processor wants to read or write a pin, it sends the handler the number of bytes to be read or written and the memory address where the data should be stored or read from. The process that is waiting for I/O is descheduled and enters a waiting state. When the pin handler finishes the requested operation, it signals the processor, which then reschedules the process, and execution proceeds as normal. The I/O pin handlers are microcoded components in exactly the same way as the communication link controllers and the CPU, yet much smaller. We have chosen the microcoded approach to guarantee that the I/O pin handlers can be easily modified in the future to add new functionality. The processor communicates with the I/O pins by using special channel addresses that must be stored in the B register when input or output instructions are executed. These addresses were carefully chosen so that there is no ambiguity between the link controllers, internal communication channels and the I/O pins (refer to Section ). Hence, when interacting with an I/O pin handler the least significant two bits of the address must be set to the binary pattern 10. The pin handler identifier itself is stored in bits 2 to 6 (see Figure 3.10) and it ranges from 17 to 31, meaning that the OpenTransputer has 15 pin handlers.
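The addressing rules above can be sketched in Python. The two-bit type code and the five-bit identifier field follow the text and Figure 3.10; the catch-all handling of the other type patterns is our own simplification, not part of the specification.

```python
def decode_channel_address(addr: int):
    """Decode an OpenTransputer channel address (sketch based on
    Figure 3.10: bits 1..0 select the channel type, bits 6..2 hold
    the controller/pin identifier)."""
    channel_type = addr & 0b11         # least significant two bits
    ident = (addr >> 2) & 0b11111      # bits 2 to 6
    if channel_type == 0b10 and 17 <= ident <= 31:
        return ("io_pin", ident)       # one of the 15 pin handlers
    return ("other", ident)            # link controller / internal channel

# An address whose low bits are 10 and whose identifier field is 20
# should be recognised as I/O pin handler 20.
addr = (20 << 2) | 0b10
print(decode_channel_address(addr))   # -> ('io_pin', 20)
```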
The choice of 15 handlers stems from the fact that there are 17 autonomous link controllers, which require five bits to be addressed, leaving 15 unused identifiers. To make the design more flexible, we have introduced a new instruction, confio, that can be used to configure the pins before usage. The same instruction can be used to reconfigure the pins at run-time as needed. confio is similar to other channel instructions in that, before it is executed, the programmer must ensure that the B register contains the address of the target I/O pin. Furthermore, the actual configuration is specified in the A register, enabling us to use up to 32 bits to control different options. Currently, it is only possible to set a pin in input or output mode. In the former case, when the processor makes a request, the handler simply reads the input pin and stores the value (i.e. either 1 or 0) in memory using the DMA controller. If the pin is configured as output, when a request is made to the handler, it reads the memory location where the data to output is stored and places it in a register that is directly connected to the pin. For the purposes of this project, the implemented functionality suffices to showcase the OpenTransputer, yet in many situations further control over the I/O interface is desirable. For this reason, we implemented the functionality in such a way that it is easy to upgrade and include more features. For example, a possible scheme to configure the I/O ports could use the least significant nine bits of the A register in the following way:

- 2 bits to configure the pin handler to reschedule a process when the pin signal is rising, falling or changing; this specifies what needs to occur before a pin is ready to be read.
- 1 bit to set the pin as handshaken or clocked. The former means that the pin handler changes its output signal as a result of changes in its input.
- 1 bit to configure the pin as master or slave.
This option is only relevant when the handler is used in handshaken mode. A master pin initiates communication with the hardware it is connected to; in other words, it changes its output signal first even though its input remains the same. On the other hand, a slave pin always waits for its input signal to change before making changes to its output.

- 4 bits to set the clock source for this pin.
- 1 bit to configure the port as input or output.

Finally, since the I/O pins are essentially treated as output channels, it might be desirable for I/O pins to be usable within alternatives as regular input channels. However, at the moment this feature is not needed, so we have not implemented it.

3.4 The OpenTransputer System

Figure 3.16: Major components of the OpenTransputer system.

The five major components of the OpenTransputer system are shown in Figure 3.16. The CPU is the main microcoded component that executes all core instructions as well as internal communication operations. It is also responsible for interfacing with and managing all other components of the system and ensuring that all processes are allocated a fair share of CPU time. The input and output link controllers run concurrently and independently of the CPU. The links interact with the processor to enable communication operations between processes that reside in different CPUs. Moreover, the link controllers interface with the network switches to pass messages between OpenTransputers. There are 15 I/O pin handlers that interact with the processor in the same fashion as the link controllers. However, instead of communicating with other processors, the handlers have a direct connection to the external I/O pins where hardware peripherals can be connected. The final major component of the OpenTransputer system is the on-chip Random Access Memory (RAM). The memory gives access to one 32-bit word per operation, and contrary to the original Inmos design, addresses start from 0 rather than from the most negative integer. The CPU has a direct connection to the memory, while communication links and I/O pin handlers access it by means of a DMA controller.
In the interest of prioritising communication operations over individual process execution, the DMA controller has a higher priority

to access memory than the CPU. To achieve this, we introduced a 1-bit field in each microinstruction that indicates whether a memory access is required. If this bit is not set, the microinstruction does not use memory and execution proceeds regardless of the DMA controller. In contrast, if the bit is set, the microinstruction will only be executed if the DMA is not active; otherwise the CPU will wait until the memory becomes available.

3.5 Digital Design with Hardware Description Languages (HDL)

Due to the increasing complexity of digital circuits, designers needed logic descriptions of their implementations that were not tied to any particular electronic technology such as CMOS. For this reason, Hardware Description Languages (HDL) were created that can be used to describe digital circuits at the register-transfer level (RTL). In some sense, HDLs are similar to conventional software programming languages like C, yet they have an explicit notion of time and concurrency [21]. Thanks to this, designs described in HDLs can be simulated to verify their behaviour. The two most widely used HDLs are VHDL and Verilog; for this project we use the latter. In Verilog there are three main levels of abstraction for modelling digital logic:

- Structural modelling. Also referred to as gate-level modelling, this allows systems to be described purely in terms of logic gate primitives such as AND, OR and NOT. This is a very low-level description of a digital circuit, consisting only of modules with declarations of input and output ports and a list of logic gates [6].
- Dataflow modelling. This enables designers to focus on the function of a system rather than on the individual logic gates. In dataflow modelling, circuits are described in terms of boolean equations, and there are operators that can act as input and output sources. In this sense, dataflow is the next level of abstraction up from structural modelling [22].
- Behavioural modelling.
Perhaps the highest level of abstraction, this enables the designer to describe the function of the hardware in an algorithmic manner. Behavioural modelling in Verilog resembles C programming [22].

Structural modelling is generally only used when the number of gates is very limited and the designer wishes to control how each gate is connected. For larger modules it is desirable to focus on the general function of the circuit rather than the specific connections; in this case, dataflow is preferred over structural modelling. On the other hand, behavioural Verilog is commonly used when the designers wish to evaluate the trade-offs of multiple techniques. This is referred to as architectural evaluation and takes place at an algorithmic level [21]. Despite potentially producing less efficient designs, we have implemented the OpenTransputer using a combination of dataflow and behavioural Verilog, because we had a limited amount of time to deliver the finalised design.

Nowadays there are tools that can transform a high-level C implementation of an algorithm into a hardware design. This process is called high-level synthesis and is often used as a preliminary step to evaluate a technique or idea, since it is generally faster than implementing circuits at the RTL level. Hence, we considered implementing the OpenTransputer using SystemC, a set of C++ class libraries that provides a cycle-accurate model for hardware design. Nevertheless, with large and complex designs such as processors, high-level synthesis will normally result in extremely large and inefficient hardware implementations. In our case, an overly large synthesised design would mean that our target FPGA does not have enough programmable logic to implement it. For this reason, high-level alternatives to HDLs were quickly discarded for this project.

Chapter 4

Critical Evaluation

4.1 Evaluating Design Decisions

Thanks to recent improvements in technology and manufacturing techniques, we have been able to approach the design of the OpenTransputer in a radically different way from that of the original Transputer. The following is a concise list of the changes introduced in our implementation of the architecture, all of which have been extensively discussed in Chapter 3.

- Replaced the microcode ROM with hardwired logic that generates the necessary control signals to drive the datapath at each clock cycle.
- Replaced the bus datapath used in the Inmos Transputer with a wide datapath that includes additional wiring to increase the number of connections between the different components.
- Included multiple instances of the same components, such as addressing units, to increase the number of parallel computations that can be carried out in each clock cycle.
- Introduced a new external communication mechanism. Firstly, the four serial communication links were replaced with a single parallel bidirectional link that is connected to a network switch. Secondly, the concept of virtual channels was introduced, allowing the OpenTransputer to maintain an unbounded number of simultaneous external output operations. Finally, the number of input ports was increased from 4 to 16.
- Completely changed the communication mechanism of the original Transputer, which was based on direct point-to-point connections between the processors. The OpenTransputer parallel link is directly wired to switches forming a Beneš network. This network ensures that connections can be established in a non-blocking fashion, which is an advantage over the Transputer approach, where messages had to be relayed via intermediate processor nodes.
- Introduced a new I/O interface that uses the OpenTransputer's external communication mechanism and its virtual channels, streamlining the way programmers interact with external devices.
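To give a sense of the scale of such a network: a standard n-input Beneš construction uses (2*log2(n) - 1) stages of n/2 two-by-two crossbars. The sketch below computes this count; note it describes the textbook construction, not the internal organisation of our particular switches.

```python
import math

def benes_switch_count(n: int) -> int:
    """Number of 2x2 crossbar switches in an n-input Benes network
    (n a power of two): (2*log2(n) - 1) stages of n/2 switches each."""
    k = int(math.log2(n))
    assert 2 ** k == n, "n must be a power of two"
    return (2 * k - 1) * (n // 2)

# The 8x8 network of Figure 3.7 needs 5 stages of 4 switches.
print(benes_switch_count(8))   # -> 20
```

The switch count grows as O(n log n), which is what makes the network cheap to scale while remaining rearrangeably non-blocking.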
Since the OpenTransputer is still in its early stages of development, it is difficult to make an objective comparative study with other architectures. Instead, we will compare the performance of our implementation with that of the Inmos Transputer. Our metric is the number of cycles taken to execute the core instructions in both the OpenTransputer and the Inmos Transputer. Table 4.1 shows the cycle count for 15 primary instructions of the instruction set.

Instruction   Inmos Transputer   OpenTransputer
ldl           2                  1
stl           1                  1
ldlp          1                  1
ldnl          2                  1
stnl          2                  1
ldnlp         1                  1
eqc           2                  1
ldc           1                  1
adc           1                  1
j             3                  1
cj
call          7                  4
ajw           1                  1
nfix          1                  1
pfix          1                  1

Table 4.1: Execution time (in clock cycles) of primary instructions in both the Inmos Transputer [15] and the OpenTransputer.

It can be seen from the table that, out of the 15 instructions displayed, the cycle count of 6 of them improved by at least a factor of 2. We consider this to be the effect of the significant changes introduced in the datapath of the processor, namely module replication and additional wiring between the components. The OpenTransputer can compute multiple results for the same operations simultaneously because there are additional modules to do so and because it is possible to feed the logic blocks with different operand values. This is not the case in the Inmos Transputer, where the bus system feeds most logic blocks with the same input values. Another interesting result in Table 4.1 is the time taken to execute a call instruction, which only improved by 3 clock cycles. call is executed when there is a call to a procedure in Occam: the instruction stores the A, B and C registers and the return address in memory and updates the instruction pointer. The OpenTransputer can only complete a single memory access per cycle, suggesting that the performance of this instruction is bounded by the number of memory operations. Table 4.2 presents the cycle count for some secondary instructions. Notice that the scheduling instructions (i.e. runp and startp) have improved by a factor of 2 in the worst case. This is because we have implemented shadow registers to improve the performance of context switches between high and low priority processes, as discussed in Section . The use of shadow registers eliminates the need to perform 6 memory accesses to store the context of the process being blocked.
Therefore, the worst-case scenario for startp and runp is reduced to only 5 clock cycles, while the Inmos Transputer takes approximately 12. The cycle count of most instructions in Table 4.2 reinforces our claim about the OpenTransputer: instructions such as alt and talt have improved by at least a factor of 2 due to module replication and the wide datapath. However, other instructions such as enbt and enbc remain almost unchanged, possibly because they do not contain any parallelism that the OpenTransputer could exploit. A final observation is that the performance of memory block transfer operations has decreased by approximately a factor of 2. Recall that the OpenTransputer contains dedicated combinatorial logic blocks that implement most of the functionality pertaining to these operations. However, it can be inferred from the table that the equivalent logic in the Inmos Transputer is significantly faster and is only bound by memory, since there are two accesses per word transferred. This is due to the lack of optimisations in the current OpenTransputer design and would be part of future development on the platform.
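The speedup claims above can be reproduced directly from the cycle counts in Tables 4.1 and 4.2. The short Python script below does so; cj is omitted because its cycle counts are not listed above.

```python
# Cycle counts (Inmos, OpenTransputer) from Table 4.1.
primary = {
    "ldl": (2, 1), "stl": (1, 1), "ldlp": (1, 1), "ldnl": (2, 1),
    "stnl": (2, 1), "ldnlp": (1, 1), "eqc": (2, 1), "ldc": (1, 1),
    "adc": (1, 1), "j": (3, 1), "call": (7, 4), "ajw": (1, 1),
    "nfix": (1, 1), "pfix": (1, 1),
}
# Instructions that improved by at least a factor of 2.
at_least_2x = sorted(op for op, (t, o) in primary.items() if t / o >= 2)
print(at_least_2x)   # -> ['eqc', 'j', 'ldl', 'ldnl', 'stnl']

# Block move (Table 4.2): 8 + 2w cycles on the Inmos part, 6 + 5w on
# ours, so for large w the OpenTransputer is slower by a factor
# approaching 5/2.
def move_inmos(w): return 8 + 2 * w
def move_open(w):  return 6 + 5 * w
print(move_open(256) / move_inmos(256))   # -> ~2.47
```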

Instruction   Inmos Transputer   OpenTransputer
runp
startp
endp          13                 3
alt           2                  1
altend        6                  2
talt          4                  2
move          8 + 2w             6 + 5w
enbt          8                  6
enbc          7                  6
csub0         2                  2
ccnt
testerr       3                  1
stoperr       2                  2
seterr        1                  1

Table 4.2: Execution time (in clock cycles) of secondary instructions in both the Inmos Transputer [15] and the OpenTransputer. w is the number of 32-bit words to be moved.

4.2 Design Verification

To ensure that the OpenTransputer exhibits the expected behaviour, we conducted verification throughout the development process. There is extensive documentation describing the internal behaviour of the processor, and we used these written sources as the specification for our design. In particular, we made heavy use of the Occam system description of the Transputer in [28] and the more formal definition of each instruction found in [15]. Especially during the early stages of development, each datapath component was implemented and individually verified, a process known as unit verification. These components are generally small blocks of combinatorial logic such as the block move, AU, LU and addressing units. The tests were generated using three different methodologies. First, large numbers of randomised tests [33] were generated offline using Python scripts. Second, other software tools written by us were used to generate constrained randomised tests that enforce particular conditions such as overflow and underflow. Finally, a small set of handcrafted tests targeting very specific edge cases was written. Since most of the complexity has been dealt with using high-level software tools, the testbenches written in Verilog are very simple. After gaining some confidence in the correctness of the individual datapath components, our efforts were directed towards integrating the complete CPU, namely the fetch and control units together with the datapath. At this stage, the objective is to ensure the correctness of the microcode sequencing system.
That is, we wish to ensure that the CPU can correctly execute instructions that take multiple cycles by stepping through the appropriate sequences of microcodes. Common pitfalls at this stage were:

- Timing issues. Signals were not ready at the required times. On many occasions this was due to register values being updated at the same clock edge at which the signal was also required elsewhere in the design.
- Microcode errors. To our surprise, the majority of problems with the microcode sequencing system were the result of errors in the microinstructions. Fortunately, due to our development strategy, fixes for these bugs only required minor changes to human-readable microinstructions.
- Incompatible interfaces. Sometimes the individual datapath components exposed interfaces that were not compatible with each other. For instance, a common mistake resulted from discrepancies in memory addressing between the OpenTransputer and the Transputer documentation. As mentioned before, in the original Inmos machine memory addresses start from the most

negative number, while in our implementation they start from 0. To solve these issues, additional wiring and in many cases extra logic components were included in the design, resulting in a significantly more complex datapath. We consider that this effect was caused by our use of a wide datapath as opposed to the bus approach of the original Transputer, as discussed in Section .

To test the integrated CPU we decided to use a combination of constrained randomised tests and compiled Occam programs. The randomised tests were generated using the same ideas outlined before. This proved to be a very effective mechanism to ensure the correctness of small and meaningful sequences of instructions such as loads and stores from memory, arithmetic, logic, etc. Nevertheless, it is difficult to automatically generate meaningful tests that resemble real programs and make use of the scheduler, timers and internal channel communication functionality. Instead, we developed a suite of simple Occam programs that enforced execution of specific microinstructions. To verify the correctness of the Verilog description of the OpenTransputer, we used the C functional simulator as a reference model. In other words, we ran the test suite on both the HDL description and the functional simulator and compared their state after each instruction was executed; in this case, state refers to the values of the architectural registers. During the final stage of development, the OpenTransputer CPU was integrated with the autonomous link controllers and the network switches. At this point the system had grown large, and the only feasible alternative was to verify the HDL description using compiled Occam programs in much the same way as described above. Most of our efforts were focused on ensuring that the protocols between the different system components are correctly implemented and that data is not lost on each transfer.
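As an illustration of the constrained-random approach described above, the sketch below generates overflow-forcing test vectors for the adc (add constant) instruction. It is a hypothetical reconstruction, not the project's actual test-generation scripts; the function name and operand ranges are our own.

```python
import random

WORD_MASK = 0xFFFFFFFF

def gen_adc_case(rng, force_overflow=False):
    """Generate one (operand, constant, expected) triple for a 32-bit
    adc instruction.  With force_overflow the operands are constrained
    so that signed overflow must occur -- the kind of edge case
    targeted by constrained randomised testing."""
    if force_overflow:
        a = rng.randint(0x7FFF0000, 0x7FFFFFFF)      # large positive A register
        c = rng.randint(0x00010000, 0x7FFFFFFF)      # constant big enough to wrap
    else:
        a = rng.randint(0, WORD_MASK)
        c = rng.randint(0, 0xFF)
    return a, c, (a + c) & WORD_MASK

rng = random.Random(42)   # fixed seed: reproducible test vectors
a, c, expected = gen_adc_case(rng, force_overflow=True)
signed = expected - (1 << 32) if expected & 0x80000000 else expected
print(signed < 0)         # True: the sum wrapped to a negative value
```

Vectors like these would then be written out for a simple Verilog testbench to replay against the unit under test.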
We have invested approximately 20% of the project in verifying the OpenTransputer. Thanks to the well-defined specification and our design approach using a microcoded control unit, we have been able to greatly increase our productivity. We uncovered and fixed bugs in 80% of all microinstructions as well as in the sequencing system. Nevertheless, we consider that further verification efforts are due before the OpenTransputer can confidently be used within an embedded product.

4.3 Synthesis results

We synthesised our Verilog design for two different targets: a ZedBoard XC7Z020-CLG484 FPGA and a silicon target for a manufacturing process. The Verilog design consists of two processors connected by a switch. In this section we briefly discuss the results of the synthesis process and mention prominent findings.

FPGA Synthesis

Component      LUTs     Registers   LUTs as logic   LUTs as memory
Entire Design  18,646   11,102      16,022          2,624
Switch
Core0          9,294    5,526       7,982           1,312
Core1          9,289    5,526       7,977           1,312

Table 4.3: FPGA resources used by the design. The first row shows how many resources are consumed in total. The bottom three rows show how many of these resources the two cores and the switch use individually.

An FPGA uses Look-Up Tables (LUTs) to implement logic. A LUT can be described as having an arbitrary number of inputs and one output; it can then be programmed to assume a certain output value

depending on the inputs. It can essentially model any kind of logic gate or combination of logic gates. The FPGA used in this project has 53,200 LUTs in total, of which the entire design uses 18,646, as can be seen in Table 4.3. In other words, our design utilises 35% of the FPGA's LUTs. The design only uses 10% of the available registers. Most LUTs are used as logic (86%) and only 14% are used as memory. As can be seen in the second row, the switch barely consumes any of the resources available on the FPGA; the cores, on the other hand, make up most of the design. It might look curious that Core0 uses 5 more LUTs than Core1. It can be seen from Table 4.3 that these LUTs are used as logic and as such are an artifact of the synthesis process, which is non-deterministic. As the switch uses a negligible number of resources, the proportion of resources used by the individual cores is essentially the same as that of the entire design.

Component   LUTs     Registers   LUTs as logic   LUTs as memory
Core0       9,294    5,526       7,982           1,312
Datapath    7,372    5,334       6,060           1,312
Decoder     1,916    160         1,916           0
Fetch       8        32          8               0

Table 4.4: FPGA resources used by a single core. The bottom three rows show the number of LUTs consumed by the individual components of Core0.

If we examine Table 4.4, we observe that the datapath consumes most of the resources inside the core: it uses 7,372 LUTs and 5,334 registers; in other words, 79% of all the LUTs and 57% of all the registers used by the processor. It also contains the entirety of the memory used within a core. The decoder uses 1,916 LUTs and 160 registers, which means it uses 21% of all the LUTs and only 2% of the registers allocated to the processor. Finally, the fetch unit only uses a minimal 8 LUTs and 32 registers, the latter corresponding to the instruction buffer.
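The utilisation percentages quoted in this section follow directly from the raw counts in Table 4.3:

```python
# Resource totals from Table 4.3 and the XC7Z020 device capacity.
fpga_luts = 53_200                 # LUTs available on the target FPGA
design_luts, design_regs = 18_646, 11_102
as_logic, as_memory = 16_022, 2_624

print(round(100 * design_luts / fpga_luts))    # -> 35 (% of FPGA LUTs used)
print(round(100 * as_logic / design_luts))     # -> 86 (% of LUTs used as logic)
print(round(100 * as_memory / design_luts))    # -> 14 (% of LUTs used as memory)
```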
Component         LUTs     Registers   LUTs as logic   LUTs as memory
Datapath          7,372    5,334       6,060           1,312
Autonomous links  4,061    4,599       4,061           0
RAM                                                    1,312
Register file     1,408    503         1,408           0
Other

Table 4.5: FPGA resources used by the major components within the datapath.

As the datapath is such a major component of the processor, Table 4.5 shows which of its parts use the largest number of resources on the FPGA. The autonomous controllers, which provide external communication and support for the I/O interface, use over half of all the LUTs and almost 90% of the registers in the datapath. The RAM, on the other hand, contains all the LUTs used as memory within the processor, yet very little logic. The register file, which contains the registers accessible to the programmer, uses 1,408 LUTs and 503 registers, or 19% of all the LUTs and 9% of all the registers in the datapath. It is surprising to see the datapath, and in particular the autonomous link controllers, consume this many FPGA resources. In an earlier iteration of the design, the control unit written in behavioural Verilog had taken up the majority of the resources, as it was synthesised into large collections of multiplexers. In this iteration, it hardly uses any resources, which is testament to how well the Verilog generated by the microcode assembler integrates into the processor design. The reason the link controllers use so many resources can be explained by their sheer number: there are one output and 16 input controllers per processor, more than twice as many as in the Inmos Transputer. Each link encompasses microcode sequencing logic similar to that of the processor's control logic and three different interfaces: to the DMA controller, the physical link and the CPU. As mentioned

before, this should be the subject of optimisations in future versions of the OpenTransputer.

Synthesis Timing Analysis

The Verilog design runs at 41 MHz on the FPGA. The maximum possible clock frequency is limited by the logic paths with the longest delay in the circuit, also known as critical paths. We expected the critical path to be within the control unit, which is implemented by large amounts of behavioural Verilog. Nevertheless, the synthesis timing reports show that the critical paths are mainly associated with the autonomous links. In particular, the critical paths that cross the interfaces to the CPU are the ones with the longest delay. An example critical path involves an intermediate register in the output link controller that is used to implement the interface with the CPU, and another register that stores the current working values of the controller. When a process wishes to perform an output operation, the CPU places the relevant information in the intermediate register; the value is then stored in the controller's work register in another clock cycle. The large delay associated with this path is explained by the fact that these registers are placed at a physically large distance from each other by the place-and-route algorithm used by the synthesiser. The algorithm instantiates some of the intermediate registers that are logically part of the links nearer to the core processor while putting other registers closer to the links, increasing the distance between them.

Manufacturing Process

To put our design in perspective, it is useful to compare it to a manufactured processor. For this purpose, we synthesised our Verilog design for an actual silicon target with 180 nm technology using Synopsys Design Vision and compared it to the original Transputer. Once more, it is important to highlight that the microarchitectures of the two designs are radically different despite implementing the same architecture.
In some respects, the OpenTransputer is more complex than the Inmos Transputer, since it uses a wide datapath with multiple instances of the same logic modules. On top of this, the OpenTransputer is still in its early stages of development and is not fully optimised; nevertheless, given the technological advancements of the last two decades, we expect the two designs to exhibit some relation in area.

                        OpenTransputer   Transputer
Area                    3.69 mm²         64 mm²
Manufacturing process   180 nm           1000 nm

Table 4.6: Comparison of chip area and manufacturing process of the OpenTransputer and the original Transputer after synthesis.

Moore's law refers to the observation that, in the history of modern computing, the number of transistors in an integrated circuit doubles every two years [19]. Keeping this in mind, we can make some interesting observations about the synthesis results listed in Table 4.6. We see that the area of the OpenTransputer is 3.69 mm² while the Transputer in 1985 had an area of approximately 64 mm²; in other words, there is a decrease in area by a factor of 17.3. Since the OpenTransputer is more complex than the original Transputer, i.e. is made up of more hardware components and by extension more transistors, an explanation for the area reduction is that the individual components have shrunk in size. As we see in the second row of Table 4.6, this has indeed been the case: the OpenTransputer is targeted at 180 nm technology, while the Transputer was targeted at 1000 nm. This implies a reduction in the size of transistors by a factor of 5.6. If the OpenTransputer and Transputer were completely identical in terms of transistor count, then the decrease in area by a factor of 17.3 would be due to a reduction of the

target technology by the square root of this factor, i.e. √17.3 or roughly 4.2. This implies that, to some extent, the area reduction of the OpenTransputer with regard to the Inmos Transputer follows Moore's law. The difference between this implied factor (4.2) and the actual technology reduction factor (5.6) can be attributed to the lack of optimisations and the more complex microarchitecture of the OpenTransputer.
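The arithmetic behind this comparison can be checked directly from the figures in Table 4.6:

```python
import math

open_area, inmos_area = 3.69, 64.0        # mm^2, Table 4.6
open_node, inmos_node = 180.0, 1000.0     # nm, Table 4.6

area_factor = inmos_area / open_area      # overall area reduction
tech_factor = inmos_node / open_node      # transistor size reduction
implied_tech = math.sqrt(area_factor)     # expected if transistor counts matched

print(round(area_factor, 1),              # -> 17.3
      round(tech_factor, 1),              # -> 5.6
      round(implied_tech, 1))             # -> 4.2
```

Since the actual technology factor (5.6) exceeds the factor implied by area alone (4.2), the OpenTransputer must contain more transistors than the original, which is consistent with its richer microarchitecture.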


Chapter 5

Conclusion

5.1 Current Project Status

We have developed a new implementation of the Transputer architecture that we call the OpenTransputer. We have designed a radically different microarchitecture that takes advantage of state-of-the-art manufacturing techniques and current developments in the field of computer architecture. The project can be broadly divided into three major components, as per our initial aims and objectives: the CPU, external communication and the I/O interface. With regard to the OpenTransputer CPU, we replaced the original microcode ROM with hardwired logic generated by a microprogram assembler. We also implemented a wide datapath that replaces the original bus system used in the Inmos Transputer. Furthermore, module replication was heavily used, enabling the OpenTransputer to perform more simultaneous operations within the same clock cycle than the 1980s design. This has the effect of greatly decreasing the average number of cycles taken by most instructions to execute, as described in Chapter 4. On the other hand, there are still a number of instructions that our design does not recognise. We introduced significant changes to the external communication mechanism used by the Transputer. The Inmos design was equipped with four serial communication links that could be used to connect the processors together and assemble parallel processing networks of arbitrary size. In the interest of making the OpenTransputer easier to use as a building block for any sort of system, we have replaced the four serial links with a single bidirectional parallel connection to a network of switches. These networks are arranged in a Beneš fashion, providing rearrangeably non-blocking communication between all OpenTransputer nodes. We have also developed a new message routing mechanism that uses the addresses of the Occam channels to describe the path between two nodes in the network.
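To illustrate the idea of a channel address describing a path, the sketch below consumes one routing bit per switch stage, as in classic source routing. The bit layout here is purely illustrative and does not reflect the actual OpenTransputer address format, which is defined in Figure 3.10.

```python
def route_through_benes(route_bits: int, stages: int):
    """Source-routing sketch: consume one bit per switch stage,
    0 = take the upper output, 1 = take the lower output.  The
    encoding is hypothetical and for illustration only."""
    path = []
    for stage in range(stages):
        bit = (route_bits >> stage) & 1
        path.append("lower" if bit else "upper")
    return path

# A 3-stage route encoded as binary 101: lower, upper, lower.
print(route_through_benes(0b101, 3))   # -> ['lower', 'upper', 'lower']
```

The appeal of this scheme is that the switches need no routing tables: each stage simply inspects its own bit and forwards the packet.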
The OpenTransputer includes drastically different input and output controllers for communication that implement the concept of virtual channels. This approach enables the processor to keep track of a virtually unbounded number of simultaneous output operations, compared to the original Transputer, which only supports up to four. Once more, the effect is to improve the usability of the processor as a building block. Moreover, the Beneš network configuration makes it possible to efficiently map a broad range of networks, including neural networks, onto a parallel processing system built of OpenTransputers. Since we envision the OpenTransputer being used as part of mass consumer products within the IoT realm, an I/O interface that is compatible with a range of hardware components, such as sensors and actuators, is required. Despite not fully achieving this goal due to the time constraints of this project, we have developed a basic I/O interface built upon the channel communication functionality. This means that even the simplest Occam program that merely outputs an integer to a channel can drive

hardware peripherals. Currently, the OpenTransputer I/O interface is not flexible enough to implement the communication protocols commonly used by hardware peripherals. However, the interface has been designed so that new features can be easily added, as described in Section 3.3.

Our implementation was developed entirely in Verilog HDL using the Xilinx Vivado Design Suite. By the end of the project a dual-core system with a single communication switch had been successfully synthesised for an FPGA target, which enabled us to develop a simple demo application to showcase the capabilities of our implementation. The synthesised design runs at 41 MHz and utilises approximately 35% of the target FPGA. To our surprise, the majority of the logic resources are consumed by the autonomous link controllers rather than by the core CPU components. This is probably a consequence of introducing support for virtual channels and of extending the number of input controllers from 4 to 16. Future work should focus on optimising these components of the OpenTransputer system to reduce the area of the design as a whole.

5.2 Future Work

We propose a number of directions to explore in future work:

- A software development environment that facilitates the use of the OpenTransputer is essential if the device is to be used for commercial purposes. Currently, our design has no practical means of loading and debugging programs on the actual hardware.

- Due to time constraints, the control unit of the OpenTransputer was written in behavioural Verilog, as described in Section 3.5. However, synthesis tools generally produce overly large, inefficient implementations from such descriptions. Alternative approaches to the control unit should be explored, using plain combinatorial logic blocks to generate the control signals. It would also be interesting to conduct an empirical study comparing the hardwired and ROM approaches to implementing the control unit.

- The OpenTransputer can easily be used as a building block to assemble multicore systems, but for the device to be practical a serial communication protocol must be implemented in the switches for off-chip connections.

5.3 Final Conclusions

With the rising popularity of ideas such as the IoT and the current state of the technology landscape, we believe there is an opportunity for the OpenTransputer to be widely used in small devices that gather information about their environment and respond to it intelligently. To the best of our ability, we have developed a device that improves on many aspects of the original Inmos implementation; nevertheless, more work needs to be done before the OpenTransputer can be used in commercial products.

Bibliography

[1] HETE-2 spacecraft [online], available: [07 May 2015].
[2] Simple 42 microcode in VBC. Technical report, Inmos Limited.
[3] HR Arabnia. The transputer family of products and their applications in building a high performance computer. In Belzer, J., Holzman, A. G., Kent, A. (eds.) Encyclopedia of Computer Science and Technology, 39:283.
[4] IEEE Standards Association. IEEE standard for heterogeneous interconnect (HIC) (low-cost, low-latency scalable serial interconnect for parallel system construction).
[5] Václav E Beneš. Optimal rearrangeable multistage connecting networks. Bell System Technical Journal, 43(4).
[6] Michael D Ciletti. Advanced Digital Design with the Verilog HDL, volume 2. Prentice Hall.
[7] Andrea Clematis and Ornella Tavani. An analysis of message passing systems for distributed memory computers. In Proceedings of the Euromicro Workshop on Parallel and Distributed Processing. IEEE.
[8] Charles Clos. A study of non-blocking switching networks. Bell System Technical Journal, 32(2).
[9] Sivarama P Dandamudi. Guide to RISC Processors: for Programmers and Engineers. Springer Science & Business Media.
[10] Guy Harriman and David May. Simple 42 documentation. Technical report, Inmos Limited, March.
[11] John L Hennessy and David A Patterson. Computer Architecture: A Quantitative Approach. Elsevier.
[12] Anthony JG Hey. Supercomputing with transputers: past, present and future, volume 18. ACM.
[13] Charles Antony Richard Hoare. Communicating sequential processes. Communications of the ACM, 21(8).
[14] Mark Homewood, David May, David Shepherd, and Roger Shepherd. The IMS T800 transputer. IEEE Micro, 7(5):10-26.
[15] Inmos Limited. Transputer Instruction Set: A Compiler Writer's Guide. Prentice Hall, July.
[16] Inmos Limited. Transputer databook. Inmos Limited.

[17] Jagan Jayaraj, Pravin Lawrence Rajendran, and Thiruvel Thirumoolam. Shadow register file architecture: a mechanism to reduce context switch latency. College of Engineering Guindy, Anna University, Chennai, India.
[18] David May. The transputer implementation of occam. Presented at the International Conference on Fifth Generation Computer Systems, Institute for New Generation Computer Technology, Tokyo, Japan, November.
[19] Gordon E Moore et al. Cramming more components onto integrated circuits. Proceedings of the IEEE, 86(1):82-85.
[20] JR Newport. An introduction to occam and the development of parallel software. Software Engineering Journal, 1(4).
[21] Samir Palnitkar. Verilog HDL: A Guide to Digital Design and Synthesis, volume 1. Prentice Hall Professional.
[22] Mohamed Rafiquzzaman. Fundamentals of Digital Logic and Microcomputer Design. John Wiley & Sons.
[23] SGS-Thomson Microelectronics. IMS C004 programmable link switch, September.
[24] SGS-Thomson Microelectronics. IMS T bit transputer, July.
[25] SGS-Thomson Microelectronics. IMS T400 low cost 32-bit transputer, July.
[26] SGS-Thomson Microelectronics. IMS T bit transputer, February.
[27] SGS-Thomson Microelectronics. IMS T bit floating-point transputer, February.
[28] Roger Shepherd. Transputer System Description. Inmos Limited, September.
[29] Harold S. Stone. Introduction to Computer Architecture. SRA.
[30] M Tanaka, N Fukuchi, Y Ooki, and C Fukunga. Design of a transputer core and its implementation in an FPGA. Proceedings of Communication Process Architecture.
[31] Andrew S. Tanenbaum. Structured Computer Organization. Prentice Hall PTR, Upper Saddle River, NJ, USA, 2nd edition.
[32] B Tatry and M-A Claire. Myriade microsatellites: a new way for agencies and industry to various missions. In Small Satellites, Systems and Services, volume 571, page 4.
[33] Bruce Wile, John C Goss, and Wolfgang Roesner. Comprehensive Functional Verification: The Complete Industry Cycle. Morgan Kaufmann.

Appendix A

Transputer Instruction Set

The Transputer instruction set comprises two groups of instructions: primary and secondary. The former is a group of 16 instructions that are executed directly from memory. The latter are the remaining instructions, whose codes must first be loaded into the operand register; the program then executes a special primary instruction that interprets the contents of that register as a secondary instruction.

A.0.1 Primary Instructions

The OpenTransputer implements all primary instructions shown in Table A.1. Each instruction is 8 bits wide: 4 bits are dedicated to the instruction code and the remaining 4 bits are an immediate value that is loaded into the operand register.

Hex code   Abbreviation   Description
#07        ldl            Load local
#0D        stl            Store local
#01        ldlp           Load local pointer
#03        ldnl           Load non-local
#0E        stnl           Store non-local
#05        ldnlp          Load non-local pointer
#0C        eqc            Equals constant
#04        ldc            Load constant
#08        adc            Add constant
#00        j              Jump
#0A        cj             Conditional jump
#09        call           Call
#0B        ajw            Adjust workspace
#02        pfix           Prefix
#06        nfix           Negative prefix
#0F        opr            Operate

Table A.1: Primary instructions [28].

A.0.2 Secondary Instructions

The OpenTransputer implements approximately 60% of all secondary instructions. Tables A.2 and A.3 show the implemented and missing instructions respectively.
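The operand-building mechanism described above can be modelled in a few lines. The sketch below is a behavioural illustration of the encoding in Table A.1, not a description of our hardware: every byte contributes a 4-bit function code and a 4-bit data nibble; pfix and nfix accumulate nibbles into the operand register, and any other function executes with the accumulated operand (for opr, that operand is the secondary instruction code).

```python
# Behavioural illustration of Transputer instruction decoding:
# each byte = 4-bit function code (high nibble) + 4-bit data (low nibble).
PFIX, NFIX, OPR = 0x2, 0x6, 0xF
MASK32 = 0xFFFFFFFF  # model a 32-bit operand register

def execute(code: bytes):
    """Yield (function, operand) pairs as the byte stream is decoded."""
    oreg = 0
    for byte in code:
        func, data = byte >> 4, byte & 0xF
        oreg |= data                       # data nibble joins the operand
        if func == PFIX:
            oreg = (oreg << 4) & MASK32    # keep building a larger operand
        elif func == NFIX:
            oreg = (~oreg << 4) & MASK32   # build a negative operand
        else:
            # A direct function executes with operand oreg; when
            # func == OPR (0xF), oreg is a secondary instruction code.
            yield func, oreg
            oreg = 0

# ldc #35 needs a prefix: pfix 3 then ldc 5 gives function 0x4 (ldc)
# with operand 0x35.
decoded = list(execute(bytes([0x23, 0x45])))
```

For example, mint (secondary code #42) is encoded as pfix 4 followed by opr 2, i.e. the bytes #24 #F2.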

Hex code   Abbreviation   Description
#00        rev            Reverse
#20        ret            Return
#1B        ldpi           Load pointer to instruction
#3C        gajw           General adjust workspace
#06        gcall          General call
#42        mint           Minimum integer
#21        lend           Loop end
#13        csub0          Check subscript from 0
#4D        ccnt1          Check count from 1
#29        testerr        Test error false and clear
#10        seterr         Set error
#55        stoperr        Stop on error
#57        clrhalterr     Clear halt-on-error
#58        sethalterr     Set halt-on-error
#59        testhalterr    Test halt-on-error
#02        bsub           Byte subscript
#0A        wsub           Word subscript
#34        bcnt           Byte count
#3F        wcnt           Word count
#4A        move           Move message
#46        and            And
#4B        or             Or
#33        xor            Exclusive or
#32        not            Bitwise not
#41        shl            Shift left
#40        shr            Shift right
#05        add            Add
#0C        sub            Subtract
#09        gt             Greater than
#04        diff           Difference
#52        sum            Sum
#0D        startp         Start process
#03        endp           End process
#39        runp           Run process
#15        stopp          Stop process
#1E        ldpri          Load current priority
#07        in             Input message
#0B        out            Output message
#0F        outword        Output word
#0E        outbyte        Output byte
#43        alt            Alt start
#44        altwt          Alt wait
#45        altend         Alt end
#48        enbc           Enable channel
#2F        disc           Disable channel
#22        ldtimer        Load timer
#2B        tin            Timer input
#4E        talt           Timer alt start
#51        taltwt         Timer alt wait
#47        enbt           Enable timer
#2E        dist           Disable timer

Table A.2: Secondary instructions implemented in the OpenTransputer [28].

Hex code   Abbreviation   Description
#01        lb             Load byte
#3B        sb             Store byte
#53        mul            Multiply
#2C        div            Divide
#1F        rem            Remainder
#08        prod           Product
#12        resetch        Reset channel
#49        enbs           Enable skip
#30        diss           Disable skip
#3A        xword          Extend to word
#56        cword          Check word
#1D        xdble          Extend to double
#4C        csngl          Check single
#16        ladd           Long add
#38        lsub           Long subtract
#37        lsum           Long sum
#4F        ldiff          Long difference
#31        lmul           Long multiply
#1A        ldiv           Long division
#36        lshl           Long shift left
#35        lshr           Long shift right
#19        norm           Normalise
#3E        saveh          Save high priority queue registers
#3D        savel          Save low priority queue registers
#18        sthf           Store high priority front pointer
#50        sthb           Store high priority back pointer
#1C        stlf           Store low priority front pointer
#17        stlb           Store low priority back pointer
#54        sttimer        Store timer
#63        unpacksn       Unpack single length fp number
#6D        roundsn        Round single length fp number
#6C        postnormsn     Post-normalise correction of single length fp number
#71        ldinf          Load single length infinity
#73        cferr          Check single length fp infinity or NaN
#72        fmul           Fractional multiply
#28        teststd        Store to Dreg for testing
#27        testste        Store to Ereg for testing
#26        teststs        Store to StatusReg for testing
#25        testldd        Load to Dreg for testing
#24        testlde        Load to Ereg for testing
#23        testlds        Load to StatusReg for testing
#19B                      Single step TimeOut for testing
#2D        testhardchan   Test hard channel stack

Table A.3: Secondary instructions not implemented in the OpenTransputer [28].


CPU Organization and Assembly Language COS 140 Foundations of Computer Science School of Computing and Information Science University of Maine October 2, 2015 Outline 1 2 3 4 5 6 7 8 Homework and announcements Reading: Chapter 12 Homework:

More information

Interpreters and virtual machines. Interpreters. Interpreters. Why interpreters? Tree-based interpreters. Text-based interpreters

Interpreters and virtual machines. Interpreters. Interpreters. Why interpreters? Tree-based interpreters. Text-based interpreters Interpreters and virtual machines Michel Schinz 2007 03 23 Interpreters Interpreters Why interpreters? An interpreter is a program that executes another program, represented as some kind of data-structure.

More information

A+ Guide to Managing and Maintaining Your PC, 7e. Chapter 1 Introducing Hardware

A+ Guide to Managing and Maintaining Your PC, 7e. Chapter 1 Introducing Hardware A+ Guide to Managing and Maintaining Your PC, 7e Chapter 1 Introducing Hardware Objectives Learn that a computer requires both hardware and software to work Learn about the many different hardware components

More information

RISC AND CISC. Computer Architecture. Farhat Masood BE Electrical (NUST) COLLEGE OF ELECTRICAL AND MECHANICAL ENGINEERING

RISC AND CISC. Computer Architecture. Farhat Masood BE Electrical (NUST) COLLEGE OF ELECTRICAL AND MECHANICAL ENGINEERING COLLEGE OF ELECTRICAL AND MECHANICAL ENGINEERING NATIONAL UNIVERSITY OF SCIENCES AND TECHNOLOGY (NUST) RISC AND CISC Computer Architecture By Farhat Masood BE Electrical (NUST) II TABLE OF CONTENTS GENERAL...

More information

More on Pipelining and Pipelines in Real Machines CS 333 Fall 2006 Main Ideas Data Hazards RAW WAR WAW More pipeline stall reduction techniques Branch prediction» static» dynamic bimodal branch prediction

More information

Chapter 6. Inside the System Unit. What You Will Learn... Computers Are Your Future. What You Will Learn... Describing Hardware Performance

Chapter 6. Inside the System Unit. What You Will Learn... Computers Are Your Future. What You Will Learn... Describing Hardware Performance What You Will Learn... Computers Are Your Future Chapter 6 Understand how computers represent data Understand the measurements used to describe data transfer rates and data storage capacity List the components

More information

Computer Systems Design and Architecture by V. Heuring and H. Jordan

Computer Systems Design and Architecture by V. Heuring and H. Jordan 1-1 Chapter 1 - The General Purpose Machine Computer Systems Design and Architecture Vincent P. Heuring and Harry F. Jordan Department of Electrical and Computer Engineering University of Colorado - Boulder

More information

Pentium vs. Power PC Computer Architecture and PCI Bus Interface

Pentium vs. Power PC Computer Architecture and PCI Bus Interface Pentium vs. Power PC Computer Architecture and PCI Bus Interface CSE 3322 1 Pentium vs. Power PC Computer Architecture and PCI Bus Interface Nowadays, there are two major types of microprocessors in the

More information

Introduction to RISC Processor. ni logic Pvt. Ltd., Pune

Introduction to RISC Processor. ni logic Pvt. Ltd., Pune Introduction to RISC Processor ni logic Pvt. Ltd., Pune AGENDA What is RISC & its History What is meant by RISC Architecture of MIPS-R4000 Processor Difference Between RISC and CISC Pros and Cons of RISC

More information

Reconfigurable Architecture Requirements for Co-Designed Virtual Machines

Reconfigurable Architecture Requirements for Co-Designed Virtual Machines Reconfigurable Architecture Requirements for Co-Designed Virtual Machines Kenneth B. Kent University of New Brunswick Faculty of Computer Science Fredericton, New Brunswick, Canada [email protected] Micaela Serra

More information

Programming Logic controllers

Programming Logic controllers Programming Logic controllers Programmable Logic Controller (PLC) is a microprocessor based system that uses programmable memory to store instructions and implement functions such as logic, sequencing,

More information

Getting off the ground when creating an RVM test-bench

Getting off the ground when creating an RVM test-bench Getting off the ground when creating an RVM test-bench Rich Musacchio, Ning Guo Paradigm Works [email protected],[email protected] ABSTRACT RVM compliant environments provide

More information

2) Write in detail the issues in the design of code generator.

2) Write in detail the issues in the design of code generator. COMPUTER SCIENCE AND ENGINEERING VI SEM CSE Principles of Compiler Design Unit-IV Question and answers UNIT IV CODE GENERATION 9 Issues in the design of code generator The target machine Runtime Storage

More information

Design and Verification of Nine port Network Router

Design and Verification of Nine port Network Router Design and Verification of Nine port Network Router G. Sri Lakshmi 1, A Ganga Mani 2 1 Assistant Professor, Department of Electronics and Communication Engineering, Pragathi Engineering College, Andhra

More information

Computer Organization & Architecture Lecture #19

Computer Organization & Architecture Lecture #19 Computer Organization & Architecture Lecture #19 Input/Output The computer system s I/O architecture is its interface to the outside world. This architecture is designed to provide a systematic means of

More information

Computer Organization

Computer Organization Computer Organization and Architecture Designing for Performance Ninth Edition William Stallings International Edition contributions by R. Mohan National Institute of Technology, Tiruchirappalli PEARSON

More information

The Central Processing Unit:

The Central Processing Unit: The Central Processing Unit: What Goes on Inside the Computer Chapter 4 Objectives Identify the components of the central processing unit and how they work together and interact with memory Describe how

More information

Chapter 2: OS Overview

Chapter 2: OS Overview Chapter 2: OS Overview CmSc 335 Operating Systems 1. Operating system objectives and functions Operating systems control and support the usage of computer systems. a. usage users of a computer system:

More information

Embedded Systems. Review of ANSI C Topics. A Review of ANSI C and Considerations for Embedded C Programming. Basic features of C

Embedded Systems. Review of ANSI C Topics. A Review of ANSI C and Considerations for Embedded C Programming. Basic features of C Embedded Systems A Review of ANSI C and Considerations for Embedded C Programming Dr. Jeff Jackson Lecture 2-1 Review of ANSI C Topics Basic features of C C fundamentals Basic data types Expressions Selection

More information

8051 MICROCONTROLLER COURSE

8051 MICROCONTROLLER COURSE 8051 MICROCONTROLLER COURSE Objective: 1. Familiarization with different types of Microcontroller 2. To know 8051 microcontroller in detail 3. Programming and Interfacing 8051 microcontroller Prerequisites:

More information

Computer Organization and Architecture

Computer Organization and Architecture Computer Organization and Architecture Chapter 11 Instruction Sets: Addressing Modes and Formats Instruction Set Design One goal of instruction set design is to minimize instruction length Another goal

More information

1 The Java Virtual Machine

1 The Java Virtual Machine 1 The Java Virtual Machine About the Spec Format This document describes the Java virtual machine and the instruction set. In this introduction, each component of the machine is briefly described. This

More information

Processor Architectures

Processor Architectures ECPE 170 Jeff Shafer University of the Pacific Processor Architectures 2 Schedule Exam 3 Tuesday, December 6 th Caches Virtual Memory Input / Output OperaKng Systems Compilers & Assemblers Processor Architecture

More information

CHAPTER 6: Computer System Organisation 1. The Computer System's Primary Functions

CHAPTER 6: Computer System Organisation 1. The Computer System's Primary Functions CHAPTER 6: Computer System Organisation 1. The Computer System's Primary Functions All computers, from the first room-sized mainframes, to today's powerful desktop, laptop and even hand-held PCs, perform

More information

Instruction Set Architecture. Datapath & Control. Instruction. LC-3 Overview: Memory and Registers. CIT 595 Spring 2010

Instruction Set Architecture. Datapath & Control. Instruction. LC-3 Overview: Memory and Registers. CIT 595 Spring 2010 Instruction Set Architecture Micro-architecture Datapath & Control CIT 595 Spring 2010 ISA =Programmer-visible components & operations Memory organization Address space -- how may locations can be addressed?

More information

CS101 Lecture 26: Low Level Programming. John Magee 30 July 2013 Some material copyright Jones and Bartlett. Overview/Questions

CS101 Lecture 26: Low Level Programming. John Magee 30 July 2013 Some material copyright Jones and Bartlett. Overview/Questions CS101 Lecture 26: Low Level Programming John Magee 30 July 2013 Some material copyright Jones and Bartlett 1 Overview/Questions What did we do last time? How can we control the computer s circuits? How

More information

COMPUTER HARDWARE. Input- Output and Communication Memory Systems

COMPUTER HARDWARE. Input- Output and Communication Memory Systems COMPUTER HARDWARE Input- Output and Communication Memory Systems Computer I/O I/O devices commonly found in Computer systems Keyboards Displays Printers Magnetic Drives Compact disk read only memory (CD-ROM)

More information

PART B QUESTIONS AND ANSWERS UNIT I

PART B QUESTIONS AND ANSWERS UNIT I PART B QUESTIONS AND ANSWERS UNIT I 1. Explain the architecture of 8085 microprocessor? Logic pin out of 8085 microprocessor Address bus: unidirectional bus, used as high order bus Data bus: bi-directional

More information

Chapter 7D The Java Virtual Machine

Chapter 7D The Java Virtual Machine This sub chapter discusses another architecture, that of the JVM (Java Virtual Machine). In general, a VM (Virtual Machine) is a hypothetical machine (implemented in either hardware or software) that directly

More information

Systems I: Computer Organization and Architecture

Systems I: Computer Organization and Architecture Systems I: Computer Organization and Architecture Lecture : Microprogrammed Control Microprogramming The control unit is responsible for initiating the sequence of microoperations that comprise instructions.

More information

Lecture 12: More on Registers, Multiplexers, Decoders, Comparators and Wot- Nots

Lecture 12: More on Registers, Multiplexers, Decoders, Comparators and Wot- Nots Lecture 12: More on Registers, Multiplexers, Decoders, Comparators and Wot- Nots Registers As you probably know (if you don t then you should consider changing your course), data processing is usually

More information

An Introduction to Computer Science and Computer Organization Comp 150 Fall 2008

An Introduction to Computer Science and Computer Organization Comp 150 Fall 2008 An Introduction to Computer Science and Computer Organization Comp 150 Fall 2008 Computer Science the study of algorithms, including Their formal and mathematical properties Their hardware realizations

More information

Chapter 4 Register Transfer and Microoperations. Section 4.1 Register Transfer Language

Chapter 4 Register Transfer and Microoperations. Section 4.1 Register Transfer Language Chapter 4 Register Transfer and Microoperations Section 4.1 Register Transfer Language Digital systems are composed of modules that are constructed from digital components, such as registers, decoders,

More information

SFWR 4C03: Computer Networks & Computer Security Jan 3-7, 2005. Lecturer: Kartik Krishnan Lecture 1-3

SFWR 4C03: Computer Networks & Computer Security Jan 3-7, 2005. Lecturer: Kartik Krishnan Lecture 1-3 SFWR 4C03: Computer Networks & Computer Security Jan 3-7, 2005 Lecturer: Kartik Krishnan Lecture 1-3 Communications and Computer Networks The fundamental purpose of a communication network is the exchange

More information

Switch Fabric Implementation Using Shared Memory

Switch Fabric Implementation Using Shared Memory Order this document by /D Switch Fabric Implementation Using Shared Memory Prepared by: Lakshmi Mandyam and B. Kinney INTRODUCTION Whether it be for the World Wide Web or for an intra office network, today

More information