Queue Machines for Next Generation Computer Systems

Size: px
Start display at page:

Download "Queue Machines for Next Generation Computer Systems"

Transcription

1 Queue Machines for Next Generation Computer Systems Masahiro Sowa Graduate School of Information Systems University of Electro-Communications Chofugaoka 1-5-1, Chofu-Shi Tokyo, Japan Arquimedes Canedo IBM Tokyo Research Laboratory Shimotsuruma, Yamato-shi Kanagawa-ken Japan Abstract Queue processors are a novel computer architecture with characteristics that are suitable for the next generation computer systems. Compared to conventional register machines where registers referenced by their names are used for data processing, the queue machines use a nameless first-in first-out queue to perform operations. The head of the queue is the only place where data is read and the tail of the queue is the only place where data can be written. Thus, queue computing allows instructions to read and write data without using register names and, as a consequence, instructions are short and the queue programs are free of false dependencies. Small instructions are preferred over longer instructions since memory traffic can be reduced, the cache performance and size can be improved, and decoding hardware logic can be simplified. The nonexistence of false dependencies allows programs to expose maximum parallelism that the queue processor can execute without complex and power-hungry hardware such as register renaming and large instruction windows. Parallel processing allows queue processors to speed-up the execution of applications. In this paper we present the special characteristics of queue machines that make such design a very suitable architecture for the future generation of computer systems demanding high-performance, low power, and low cost. We present the toolchain of compilers, translators, assembler, virtual machine, functional and cycle accurate simulators, and RTL processors we have designed and developed. From our experimental results we demonstrate that queue programs are smaller than conventional register programs and have very similar characteristics in terms of parallelism. Keywords: Queue Machines, Computer Systems, System Design 1 Introduction Processor design faces a situation where performance cannot be sustained by simply increasing the frequency of operation. As the size of transistors decreases and frequency of operation is raised, power dissipation is a limitation of modern technology. This situation has motivated the designers to reconsider new execution paradigms for modern computer systems [8, 1, 9]. The key points to sustain performance improvements has been in exploiting different sources of parallelism. Queue processors [6, 7, 8] are a novel execution paradigm that allows program to be executed in parallel without needing complex hardware to find and exploit such parallelism. The instructions of a queue processor implicitly reference source and destination operands, thus the instructions are compact and free of false dependencies. The hardware implementation of a queue processor requires simple circuits while it is able to execute instructions in parallel. The programs generated for the queue computation model are unrestricted in parallelism and therefore queue processors can be used for scientific and parallel processing. From its hardware simplicity, a queue processor has the potential to have a low power dissipation footprint, making it ideal for modern applications running in mobile devices. In this paper we present an overview of the principles of queue computing and explain its salient characteristics. We also present a complete queue computer system toolchain and give an insight to its development. An

2 evaluation of the characteristics of queue computing is given to demonstrate the potential of queue-based systems against conventional computing paradigms. 2 Overview of Queue Computing A queue machine is a computer that uses fast registers organized as a first-in first-out (FIFO) queue to perform computations. A FIFO queue establishes that all read accesses are performed in one end of the queue called the head, or QH. And all writes are performed in the opposite end of the queue called the tail, or QT. Since all accesses are done at known and fixed locations, the QH and QT, instructions do not need to specify the source and destination operands. However, the hardware must update and track the position of QH and QT at all times: whenever a write is performed the QT position is updated, and whenever a read is performed the QH is updated. Given the strict rules for accessing the queue, the programs for a queue machine are generated by a level-order traversal of the expression trees. The level order traversal visits all levels from the deepest to the root, and for each level walks on all nodes from left to right. This scheduling method produces programs that can be directly executed in a queue processor. Consider the expression x = (a+b)/(c-d), first the expression tree is built and all nodes visited in a level-order traversal as shown in Figure 1.(a). For every visited node a corresponding instruction of the queue processor is generated, the queue program for the sample expression is given in Figure 1.(b). An important characteristic of the queue program to be noticed is that arithmetic instructions (add, sub, div) consist only of the operator, source and destination operands are missing since they are implicitly accessed through QH and QT. This feature allows the instructions of a queue processor to be short in length and as a consequence programs are compact. This is favorable for the hardware since manipulating small instructions requires fewer circuitry, smaller buses, less complexity, and as a consequence, less power consumption. On the other hand, small instructions have the potential of improving the performance of the cache memories, and the benefit of reducing production costs for computer systems where memory size is constrained as in embedded systems. L0 (root) L1 L2 L3 x / + - a b c d add sub div st a b c d x a b c d add sub a+b c-d div (a+b)/(c-d) st (a) Level-order traversal of expression tree (b) Queue Program (c) Execution of program in a FIFO queue Figure 1: Expression x = a+b c d is (a) traversed in a level-order manner to, (b) generate a queue program and (c) executed in a queue processor. Data independent instructions of a queue program are grouped by the level-order scheduling. Figure 1.(c) shows the execution of the sample queue program in the FIFO queue. The leftmost end of the queue is the QH and the rightmost the QT. The first four instructions of level L 0 place the operands a,b,c,d in the queue. These four instructions are independent from each other and can be issued in parallel. Same happens with the instructions of level L 1, add, sub, they are independent to each other and thus executed simultaneously. This situation applies for all levels, all instructions within the same level can be executed in parallel. The absence of operands in the instructions has another good characteristic, queue programs are free of false dependencies. The parallelism exposed by the level-order scheduling and the

3 IF CTR MEM lack of false dependencies in the program allows the queue machines to effectively exploit all parallelism available in the applications. The grouping of independent instructions allows the implementation of queue processors to require smaller instruction windows, and the absence of false dependencies completely removes the necessity of register renaming hardware. A parallel queue processor (PQP) offers a high-performance architecture with simple hardware design and low power characteristics. 3 Developing Queue Computer Systems In Sowa Laboratory 1 we have investigated the principles of queue computing and developed a toolchain for the design and implementation of queue processors. Figure 2 shows the current layers of our toolchain. Except for the applications, all other elements have been crafted from scratch for the queue computation model. There are fundamental differences between queue machines and conventional stack and register machines that prohibit the utilization of conventional techniques and methods to bui compilers, virtual machines, simulators, operating systems, and hardware. Therefore, we have researched and established new methods to bui custom tools for the queue computing. Applications Queue Compiler Profiler Optimizers Queue Virtual Table 4 Machine Assembler Linker PQP Hardware Simulator QC-2 processor design results: modules complexity as LE (logic elements) and TCF (total combinational functions) when synthesized for FPGAs (with Stratix device) and Structured ASIC (HardCopy II) families. Loader Descriptions Modules LE TCF instruction fetch unit IF instruction decode unit ID queue compute unit QCU barrier queue unit BQU issue unit IS Operating System execution unit EXE queue-register unit QREG memory access MEM control unit CTR QC-2 core QC EXE ID IS BQU QCU QREG Fig. 12. Floorplan of the placed and routed QC-2 core. optimizations area (ARA), speed (SPD) and balanced (BLD). For FPGA implementation, the complexity for SPD optimization is about 17% and 18% higher than that for ARA and BLD optimizations respectively. Figure 12 shows the floorplan of the placed and routed QC-2 core. The modules of the processor show considerable overlap as logic is mapped according to interconnect requirements. 5.3 Speed and Power Consumption Comparison with Synthesizable CPU Cores Figure 2: Layers of a Queue-based Computer System Queue computing and architecture design approaches take into account performance and power consumption considerations early in the design cycle and maintain a power-centric focus across all levels of design abstraction. In QC-2 processor, all instructions designed are fixed format 16-bit words with minimal 18 Applications We have considered a wide range of applications to analyze their characteristics when translated to queue programs. The experimental results show that applications benefit from the queue computing as their size can be reduced up to 47% when compared to the same application on a conventional RISC machine [3]. In terms of other qualitative characteristics such as available parallelism and instruction count, queue programs are very similar to register programs. Most applications are written in high level languages such as C, FORTRAN, Java. Therefore, we need a compiler to translate these programs into machine code for queue processors. 1

4 Queue Compiler The Queue Compiler not only translates high level languages into machine code but also optimizes the programs for smaller execution times, higher parallelism, better resource utilization, smaller code size, among others. Compiling for queue machines is different than compiling for register computers and this is reflected in the structure of the queue compiler. Since a queue program has the concept of queue rather than registers, novel code generation algorithms have been developed [2] to handle this characteristic. The compiler is a critical part of a queue computer system as it schedules the programs to satisfy the constraints of the hardware and achieve the highest performance. Queue Virtual Machine Virtual machines offer great flexibility at reasonable development cost for testing and validating a new computer system. Our queue virtual machine emulates a queue processor and interprets the programs generated by the queue compiler. The virtual machine can be easily modified to emulate different queue processors and to test new architectural features. The virtual machine also helps the refinement of the compiler as it is able to execute programs, and executed programs have different characteristics than compiled programs. Simulator Simulators also emulate a queue processor and execute code generated by the compiler, the difference with the queue virtual machine is the level of detail on which the execution of the program is performed. Our simulators emulate the pipeline of the queue processor and perform a detailed timing execution. Pipeline hazards, stalls, instruction latencies, and other microarchitectural features are simulated. Furthermore, the cycle accurate simulators can be configured to simulate a parallel queue processor that executes instructions simultaneously. Assembler, Linker, Loader When targeting the actual queue processor, the generic assembly code generated by the compiler must be translated into binary code for a queue processor by the assembler. Having an assembler allows us to easily retarget a program to a new queue processor without changing the compiler. The function of the linker and loader are no different than conventional systems, the former resolves symbols and combines different objects into a single executable, and the latter prepares the code for execution in the operating system. Operating System This is the only part of the toolchain where we have not built a prototype. Up until now, we have concentrated our efforts to develop the tools to test the fundamental ideas of queue computing. As it is possible for us to run our queue programs in the PQP hardware directly, the need of an operating system has become a secondary priority. However, our future work includes the development of an operating system to manage the PQP processor. PQP Hardware Several queue processors have been designed in hardware description languages, realized in RTL designs, and implemented in FPGA chips. The hardware has been gradually evolving from a serial queue processor to a parallel queue processor (PQP). We also have developed hybrid architectures that combine register, stack, and queue into a single processor. Our latest research includes a multi-dimensional queue processor where several queues are employed for data processing. Queue processors have the characteristic of hardware simplicity, low power consumption, and high-performance.

5 4 Evaluation In this section, we give the experimental results obtained in the development of the PQP processor. Figure 3 shows the normalized code size, SPEC CINT95 benchmarks [4], for the PQP processor (baseline) and MIPS I processor [5]. From this graph we observe that the absence of operands in the queue instruction set accomplishes small code sizes. For these benchmarks, PQP programs are from 27% to 47% smaller than MIPS programs MIPS 1.88 PQP Code Size go 124.m88ksim 126.gcc 129.compress 130.li 132.ijpeg 134.perl 147.vortex Figure 3: Code size for SPEC CINT95 benchmarks. Figure 4 shows the instruction level parallelism (ILP) for embedded benchmarks extracted at compile time for MIPS programs and for PQP programs. Notice that the PQP programs have, in average, 13% more parallelism than optimized MIPS programs. This is due to the fact that queue programs are not limited by false dependencies and the level-order scheduling exposes all application parallelism. On the other hand, although MIPS programs are heavily optimized, the limited number of architected registers introduces false dependencies that limit the available parallelism MIPS I -O3 PQP 3.75 ILP H.263 MPEG2 FFT Susan Rijndael Sha Blowfish Dijkstra Patricia Adpcm Figure 4: Instruction Level Parallelism (ILP) for embedded benchmarks 5 Conclusion As discussed, queue processors offer a good alternative to modern computer systems. Modern applications constantly demand higher and higher performance at low power consumption and low cost. The simplicity and characteristics of queue-based processors make them ideal candidates for such demands. In this paper, we introduced the principles of queue computing and we presented an overview of our current research on the PQP processor. We have developed a custom toolchain that includes compilers, virtual machines, cycle accurate simulators, assemblers, linkers, loaders, operating system, libraries, and several RTL processors. This research has contributed with novel techniques to develop queue machines and the results have

6 shown that queue processors can produce up to 47% smaller programs and 13% more parallel applications than modern RISC processors. References [1] D. Burger, S. W. Keckler, K. S. McKinley, M. Dahlin, L. K. John, C. Lin, C. R. Moore, J. Burrill, R. G. McDona, W. Yoder, and the TRIPS Team. Scaling to the end of silicon with edge architectures. Computer, 37(7):44 55, [2] A. Canedo, B. Abderazek, and M. Sowa. A New Code Generation Algorithm for 2-offset Producer Order Queue Computation Model. Journal of Computer Languages, Systems & Structures, 34(4): , June [3] A. Canedo, B. Abderazek, and M. Sowa. Compiler Support for Code Size Reduction using a Queue-based Processor. Transactions on High-Performance Embedded Architectures and Compilers, 2(4): , [4] J. J. Dujmovic and I. Dujmovic. Evolution and evaluation of SPEC benchmarks. ACM SIGMETRICS Performance Evaluation Review, 26(3):2 9, December [5] G. Kane and J. Heinrich. MIPS RISC Architecture. Prentice Hall, [6] B. Preiss and C. Hamacher. Data Flow on Queue Machines. In 12th Int. IEEE Symposium on computer Architecture, pages , [7] H. Schmit, B. Levine, and B. Ylvisaker. Queue Machines: Hardware Compilation in Hardware. In 10th Annual IEEE Symposium on Fie-Programmable Custom Computing Machines, page 152, [8] M. Sowa, B. Abderazek, and T. Yoshinaga. Parallel Queue Processor Architecture Based on Produced Order Computation Model. Journal of Supercomputing, 32(3): , June [9] S. Swanson, K. Michelson, A. Schwerin, and M. Oskin. Wavescalar. In In Proceedings of the 36th International Symposium on Microarchitecture(MICRO, pages , 2003.

Architectures and Platforms

Architectures and Platforms Hardware/Software Codesign Arch&Platf. - 1 Architectures and Platforms 1. Architecture Selection: The Basic Trade-Offs 2. General Purpose vs. Application-Specific Processors 3. Processor Specialisation

More information

İSTANBUL AYDIN UNIVERSITY

İSTANBUL AYDIN UNIVERSITY İSTANBUL AYDIN UNIVERSITY FACULTY OF ENGİNEERİNG SOFTWARE ENGINEERING THE PROJECT OF THE INSTRUCTION SET COMPUTER ORGANIZATION GÖZDE ARAS B1205.090015 Instructor: Prof. Dr. HASAN HÜSEYİN BALIK DECEMBER

More information

Solution: start more than one instruction in the same clock cycle CPI < 1 (or IPC > 1, Instructions per Cycle) Two approaches:

Solution: start more than one instruction in the same clock cycle CPI < 1 (or IPC > 1, Instructions per Cycle) Two approaches: Multiple-Issue Processors Pipelining can achieve CPI close to 1 Mechanisms for handling hazards Static or dynamic scheduling Static or dynamic branch handling Increase in transistor counts (Moore s Law):

More information

Introducción. Diseño de sistemas digitales.1

Introducción. Diseño de sistemas digitales.1 Introducción Adapted from: Mary Jane Irwin ( www.cse.psu.edu/~mji ) www.cse.psu.edu/~cg431 [Original from Computer Organization and Design, Patterson & Hennessy, 2005, UCB] Diseño de sistemas digitales.1

More information

Instruction Set Design

Instruction Set Design Instruction Set Design Instruction Set Architecture: to what purpose? ISA provides the level of abstraction between the software and the hardware One of the most important abstraction in CS It s narrow,

More information

Computer Architecture Lecture 2: Instruction Set Principles (Appendix A) Chih Wei Liu 劉 志 尉 National Chiao Tung University cwliu@twins.ee.nctu.edu.

Computer Architecture Lecture 2: Instruction Set Principles (Appendix A) Chih Wei Liu 劉 志 尉 National Chiao Tung University cwliu@twins.ee.nctu.edu. Computer Architecture Lecture 2: Instruction Set Principles (Appendix A) Chih Wei Liu 劉 志 尉 National Chiao Tung University cwliu@twins.ee.nctu.edu.tw Review Computers in mid 50 s Hardware was expensive

More information

LSN 2 Computer Processors

LSN 2 Computer Processors LSN 2 Computer Processors Department of Engineering Technology LSN 2 Computer Processors Microprocessors Design Instruction set Processor organization Processor performance Bandwidth Clock speed LSN 2

More information

FLIX: Fast Relief for Performance-Hungry Embedded Applications

FLIX: Fast Relief for Performance-Hungry Embedded Applications FLIX: Fast Relief for Performance-Hungry Embedded Applications Tensilica Inc. February 25 25 Tensilica, Inc. 25 Tensilica, Inc. ii Contents FLIX: Fast Relief for Performance-Hungry Embedded Applications...

More information

Instruction Set Architecture (ISA)

Instruction Set Architecture (ISA) Instruction Set Architecture (ISA) * Instruction set architecture of a machine fills the semantic gap between the user and the machine. * ISA serves as the starting point for the design of a new machine

More information

Chapter 2 Logic Gates and Introduction to Computer Architecture

Chapter 2 Logic Gates and Introduction to Computer Architecture Chapter 2 Logic Gates and Introduction to Computer Architecture 2.1 Introduction The basic components of an Integrated Circuit (IC) is logic gates which made of transistors, in digital system there are

More information

A Lab Course on Computer Architecture

A Lab Course on Computer Architecture A Lab Course on Computer Architecture Pedro López José Duato Depto. de Informática de Sistemas y Computadores Facultad de Informática Universidad Politécnica de Valencia Camino de Vera s/n, 46071 - Valencia,

More information

International Workshop on Field Programmable Logic and Applications, FPL '99

International Workshop on Field Programmable Logic and Applications, FPL '99 International Workshop on Field Programmable Logic and Applications, FPL '99 DRIVE: An Interpretive Simulation and Visualization Environment for Dynamically Reconægurable Systems? Kiran Bondalapati and

More information

VHDL DESIGN OF EDUCATIONAL, MODERN AND OPEN- ARCHITECTURE CPU

VHDL DESIGN OF EDUCATIONAL, MODERN AND OPEN- ARCHITECTURE CPU VHDL DESIGN OF EDUCATIONAL, MODERN AND OPEN- ARCHITECTURE CPU Martin Straka Doctoral Degree Programme (1), FIT BUT E-mail: strakam@fit.vutbr.cz Supervised by: Zdeněk Kotásek E-mail: kotasek@fit.vutbr.cz

More information

Agenda. Michele Taliercio, Il circuito Integrato, Novembre 2001

Agenda. Michele Taliercio, Il circuito Integrato, Novembre 2001 Agenda Introduzione Il mercato Dal circuito integrato al System on a Chip (SoC) La progettazione di un SoC La tecnologia Una fabbrica di circuiti integrati 28 How to handle complexity G The engineering

More information

Digitale Signalverarbeitung mit FPGA (DSF) Soft Core Prozessor NIOS II Stand Mai 2007. Jens Onno Krah

Digitale Signalverarbeitung mit FPGA (DSF) Soft Core Prozessor NIOS II Stand Mai 2007. Jens Onno Krah (DSF) Soft Core Prozessor NIOS II Stand Mai 2007 Jens Onno Krah Cologne University of Applied Sciences www.fh-koeln.de jens_onno.krah@fh-koeln.de NIOS II 1 1 What is Nios II? Altera s Second Generation

More information

ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM

ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM 1 The ARM architecture processors popular in Mobile phone systems 2 ARM Features ARM has 32-bit architecture but supports 16 bit

More information

Advanced Computer Architecture-CS501. Computer Systems Design and Architecture 2.1, 2.2, 3.2

Advanced Computer Architecture-CS501. Computer Systems Design and Architecture 2.1, 2.2, 3.2 Lecture Handout Computer Architecture Lecture No. 2 Reading Material Vincent P. Heuring&Harry F. Jordan Chapter 2,Chapter3 Computer Systems Design and Architecture 2.1, 2.2, 3.2 Summary 1) A taxonomy of

More information

Reconfigurable Architecture Requirements for Co-Designed Virtual Machines

Reconfigurable Architecture Requirements for Co-Designed Virtual Machines Reconfigurable Architecture Requirements for Co-Designed Virtual Machines Kenneth B. Kent University of New Brunswick Faculty of Computer Science Fredericton, New Brunswick, Canada ken@unb.ca Micaela Serra

More information

VLIW Processors. VLIW Processors

VLIW Processors. VLIW Processors 1 VLIW Processors VLIW ( very long instruction word ) processors instructions are scheduled by the compiler a fixed number of operations are formatted as one big instruction (called a bundle) usually LIW

More information

A Survey on ARM Cortex A Processors. Wei Wang Tanima Dey

A Survey on ARM Cortex A Processors. Wei Wang Tanima Dey A Survey on ARM Cortex A Processors Wei Wang Tanima Dey 1 Overview of ARM Processors Focusing on Cortex A9 & Cortex A15 ARM ships no processors but only IP cores For SoC integration Targeting markets:

More information

CHAPTER 1 INTRODUCTION

CHAPTER 1 INTRODUCTION 1 CHAPTER 1 INTRODUCTION 1.1 MOTIVATION OF RESEARCH Multicore processors have two or more execution cores (processors) implemented on a single chip having their own set of execution and architectural recourses.

More information

EE361: Digital Computer Organization Course Syllabus

EE361: Digital Computer Organization Course Syllabus EE361: Digital Computer Organization Course Syllabus Dr. Mohammad H. Awedh Spring 2014 Course Objectives Simply, a computer is a set of components (Processor, Memory and Storage, Input/Output Devices)

More information

INSTRUCTION LEVEL PARALLELISM PART VII: REORDER BUFFER

INSTRUCTION LEVEL PARALLELISM PART VII: REORDER BUFFER Course on: Advanced Computer Architectures INSTRUCTION LEVEL PARALLELISM PART VII: REORDER BUFFER Prof. Cristina Silvano Politecnico di Milano cristina.silvano@polimi.it Prof. Silvano, Politecnico di Milano

More information

Management Challenge. Managing Hardware Assets. Central Processing Unit. What is a Computer System?

Management Challenge. Managing Hardware Assets. Central Processing Unit. What is a Computer System? Management Challenge Managing Hardware Assets What computer processing and storage capability does our organization need to handle its information and business transactions? What arrangement of computers

More information

An Overview of Stack Architecture and the PSC 1000 Microprocessor

An Overview of Stack Architecture and the PSC 1000 Microprocessor An Overview of Stack Architecture and the PSC 1000 Microprocessor Introduction A stack is an important data handling structure used in computing. Specifically, a stack is a dynamic set of elements in which

More information

Contents. System Development Models and Methods. Design Abstraction and Views. Synthesis. Control/Data-Flow Models. System Synthesis Models

Contents. System Development Models and Methods. Design Abstraction and Views. Synthesis. Control/Data-Flow Models. System Synthesis Models System Development Models and Methods Dipl.-Inf. Mirko Caspar Version: 10.02.L.r-1.0-100929 Contents HW/SW Codesign Process Design Abstraction and Views Synthesis Control/Data-Flow Models System Synthesis

More information

Design Cycle for Microprocessors

Design Cycle for Microprocessors Cycle for Microprocessors Raúl Martínez Intel Barcelona Research Center Cursos de Verano 2010 UCLM Intel Corporation, 2010 Agenda Introduction plan Architecture Microarchitecture Logic Silicon ramp Types

More information

Logical Operations. Control Unit. Contents. Arithmetic Operations. Objectives. The Central Processing Unit: Arithmetic / Logic Unit.

Logical Operations. Control Unit. Contents. Arithmetic Operations. Objectives. The Central Processing Unit: Arithmetic / Logic Unit. Objectives The Central Processing Unit: What Goes on Inside the Computer Chapter 4 Identify the components of the central processing unit and how they work together and interact with memory Describe how

More information

EE482: Advanced Computer Organization Lecture #11 Processor Architecture Stanford University Wednesday, 31 May 2000. ILP Execution

EE482: Advanced Computer Organization Lecture #11 Processor Architecture Stanford University Wednesday, 31 May 2000. ILP Execution EE482: Advanced Computer Organization Lecture #11 Processor Architecture Stanford University Wednesday, 31 May 2000 Lecture #11: Wednesday, 3 May 2000 Lecturer: Ben Serebrin Scribe: Dean Liu ILP Execution

More information

7a. System-on-chip design and prototyping platforms

7a. System-on-chip design and prototyping platforms 7a. System-on-chip design and prototyping platforms Labros Bisdounis, Ph.D. Department of Computer and Communication Engineering 1 What is System-on-Chip (SoC)? System-on-chip is an integrated circuit

More information

Multi-Threading Performance on Commodity Multi-Core Processors

Multi-Threading Performance on Commodity Multi-Core Processors Multi-Threading Performance on Commodity Multi-Core Processors Jie Chen and William Watson III Scientific Computing Group Jefferson Lab 12000 Jefferson Ave. Newport News, VA 23606 Organization Introduction

More information

THE FLORIDA STATE UNIVERSITY COLLEGE OF ARTS AND SCIENCES APPLICATION CONFIGURABLE PROCESSORS CHRISTOPHER J. ZIMMER

THE FLORIDA STATE UNIVERSITY COLLEGE OF ARTS AND SCIENCES APPLICATION CONFIGURABLE PROCESSORS CHRISTOPHER J. ZIMMER THE FLORIDA STATE UNIVERSITY COLLEGE OF ARTS AND SCIENCES APPLICATION CONFIGURABLE PROCESSORS By CHRISTOPHER J. ZIMMER A Thesis submitted to the Department of Computer Science In partial fulfillment of

More information

Chapter 11 I/O Management and Disk Scheduling

Chapter 11 I/O Management and Disk Scheduling Operating Systems: Internals and Design Principles, 6/E William Stallings Chapter 11 I/O Management and Disk Scheduling Dave Bremer Otago Polytechnic, NZ 2008, Prentice Hall I/O Devices Roadmap Organization

More information

Introduction to Digital System Design

Introduction to Digital System Design Introduction to Digital System Design Chapter 1 1 Outline 1. Why Digital? 2. Device Technologies 3. System Representation 4. Abstraction 5. Development Tasks 6. Development Flow Chapter 1 2 1. Why Digital

More information

Real-time Java Processor for Monitoring and Test

Real-time Java Processor for Monitoring and Test Real-time Java Processor for Monitoring and Test Martin Zabel, Thomas B. Preußer, Rainer G. Spallek Technische Universität Dresden {zabel,preusser,rgs}@ite.inf.tu-dresden.de Abstract This paper introduces

More information

Design and Implementation of an On-Chip timing based Permutation Network for Multiprocessor system on Chip

Design and Implementation of an On-Chip timing based Permutation Network for Multiprocessor system on Chip Design and Implementation of an On-Chip timing based Permutation Network for Multiprocessor system on Chip Ms Lavanya Thunuguntla 1, Saritha Sapa 2 1 Associate Professor, Department of ECE, HITAM, Telangana

More information

Computer System: User s View. Computer System Components: High Level View. Input. Output. Computer. Computer System: Motherboard Level

Computer System: User s View. Computer System Components: High Level View. Input. Output. Computer. Computer System: Motherboard Level System: User s View System Components: High Level View Input Output 1 System: Motherboard Level 2 Components: Interconnection I/O MEMORY 3 4 Organization Registers ALU CU 5 6 1 Input/Output I/O MEMORY

More information

FPGA area allocation for parallel C applications

FPGA area allocation for parallel C applications 1 FPGA area allocation for parallel C applications Vlad-Mihai Sima, Elena Moscu Panainte, Koen Bertels Computer Engineering Faculty of Electrical Engineering, Mathematics and Computer Science Delft University

More information

AC 2007-2027: A PROCESSOR DESIGN PROJECT FOR A FIRST COURSE IN COMPUTER ORGANIZATION

AC 2007-2027: A PROCESSOR DESIGN PROJECT FOR A FIRST COURSE IN COMPUTER ORGANIZATION AC 2007-2027: A PROCESSOR DESIGN PROJECT FOR A FIRST COURSE IN COMPUTER ORGANIZATION Michael Black, American University Manoj Franklin, University of Maryland-College Park American Society for Engineering

More information

Instruction Set Architecture. or How to talk to computers if you aren t in Star Trek

Instruction Set Architecture. or How to talk to computers if you aren t in Star Trek Instruction Set Architecture or How to talk to computers if you aren t in Star Trek The Instruction Set Architecture Application Compiler Instr. Set Proc. Operating System I/O system Instruction Set Architecture

More information

Introduction to Cloud Computing

Introduction to Cloud Computing Introduction to Cloud Computing Parallel Processing I 15 319, spring 2010 7 th Lecture, Feb 2 nd Majd F. Sakr Lecture Motivation Concurrency and why? Different flavors of parallel computing Get the basic

More information

Binary search tree with SIMD bandwidth optimization using SSE

Binary search tree with SIMD bandwidth optimization using SSE Binary search tree with SIMD bandwidth optimization using SSE Bowen Zhang, Xinwei Li 1.ABSTRACT In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous

More information

Improving Grid Processing Efficiency through Compute-Data Confluence

Improving Grid Processing Efficiency through Compute-Data Confluence Solution Brief GemFire* Symphony* Intel Xeon processor Improving Grid Processing Efficiency through Compute-Data Confluence A benchmark report featuring GemStone Systems, Intel Corporation and Platform

More information

An Open Architecture through Nanocomputing

An Open Architecture through Nanocomputing 2009 International Symposium on Computing, Communication, and Control (ISCCC 2009) Proc.of CSIT vol.1 (2011) (2011) IACSIT Press, Singapore An Open Architecture through Nanocomputing Joby Joseph1and A.

More information

FPGA-based MapReduce Framework for Machine Learning

FPGA-based MapReduce Framework for Machine Learning FPGA-based MapReduce Framework for Machine Learning Bo WANG 1, Yi SHAN 1, Jing YAN 2, Yu WANG 1, Ningyi XU 2, Huangzhong YANG 1 1 Department of Electronic Engineering Tsinghua University, Beijing, China

More information

Programming Logic controllers

Programming Logic controllers Programming Logic controllers Programmable Logic Controller (PLC) is a microprocessor based system that uses programmable memory to store instructions and implement functions such as logic, sequencing,

More information

Floating Point Fused Add-Subtract and Fused Dot-Product Units

Floating Point Fused Add-Subtract and Fused Dot-Product Units Floating Point Fused Add-Subtract and Fused Dot-Product Units S. Kishor [1], S. P. Prakash [2] PG Scholar (VLSI DESIGN), Department of ECE Bannari Amman Institute of Technology, Sathyamangalam, Tamil Nadu,

More information

what operations can it perform? how does it perform them? on what kind of data? where are instructions and data stored?

what operations can it perform? how does it perform them? on what kind of data? where are instructions and data stored? Inside the CPU how does the CPU work? what operations can it perform? how does it perform them? on what kind of data? where are instructions and data stored? some short, boring programs to illustrate the

More information

Intel Pentium 4 Processor on 90nm Technology

Intel Pentium 4 Processor on 90nm Technology Intel Pentium 4 Processor on 90nm Technology Ronak Singhal August 24, 2004 Hot Chips 16 1 1 Agenda Netburst Microarchitecture Review Microarchitecture Features Hyper-Threading Technology SSE3 Intel Extended

More information

Computer Architecture TDTS10

Computer Architecture TDTS10 why parallelism? Performance gain from increasing clock frequency is no longer an option. Outline Computer Architecture TDTS10 Superscalar Processors Very Long Instruction Word Processors Parallel computers

More information

Chapter 2 Basic Structure of Computers. Jin-Fu Li Department of Electrical Engineering National Central University Jungli, Taiwan

Chapter 2 Basic Structure of Computers. Jin-Fu Li Department of Electrical Engineering National Central University Jungli, Taiwan Chapter 2 Basic Structure of Computers Jin-Fu Li Department of Electrical Engineering National Central University Jungli, Taiwan Outline Functional Units Basic Operational Concepts Bus Structures Software

More information

TPCalc : a throughput calculator for computer architecture studies

TPCalc : a throughput calculator for computer architecture studies TPCalc : a throughput calculator for computer architecture studies Pierre Michaud Stijn Eyerman Wouter Rogiest IRISA/INRIA Ghent University Ghent University pierre.michaud@inria.fr Stijn.Eyerman@elis.UGent.be

More information

NIOS II Based Embedded Web Server Development for Networking Applications

NIOS II Based Embedded Web Server Development for Networking Applications NIOS II Based Embedded Web Server Development for Networking Applications 1 Sheetal Bhoyar, 2 Dr. D. V. Padole 1 Research Scholar, G. H. Raisoni College of Engineering, Nagpur, India 2 Professor, G. H.

More information

Testing of Digital System-on- Chip (SoC)

Testing of Digital System-on- Chip (SoC) Testing of Digital System-on- Chip (SoC) 1 Outline of the Talk Introduction to system-on-chip (SoC) design Approaches to SoC design SoC test requirements and challenges Core test wrapper P1500 core test

More information

OC By Arsene Fansi T. POLIMI 2008 1

OC By Arsene Fansi T. POLIMI 2008 1 IBM POWER 6 MICROPROCESSOR OC By Arsene Fansi T. POLIMI 2008 1 WHAT S IBM POWER 6 MICROPOCESSOR The IBM POWER6 microprocessor powers the new IBM i-series* and p-series* systems. It s based on IBM POWER5

More information

Pipeline Hazards. Structure hazard Data hazard. ComputerArchitecture_PipelineHazard1

Pipeline Hazards. Structure hazard Data hazard. ComputerArchitecture_PipelineHazard1 Pipeline Hazards Structure hazard Data hazard Pipeline hazard: the major hurdle A hazard is a condition that prevents an instruction in the pipe from executing its next scheduled pipe stage Taxonomy of

More information

Techniques for Accurate Performance Evaluation in Architecture Exploration

Techniques for Accurate Performance Evaluation in Architecture Exploration IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 11, NO. 4, AUGUST 2003 601 Techniques for Accurate Performance Evaluation in Architecture Exploration George Hadjiyiannis and Srinivas

More information

A SystemC Transaction Level Model for the MIPS R3000 Processor

A SystemC Transaction Level Model for the MIPS R3000 Processor SETIT 2007 4 th International Conference: Sciences of Electronic, Technologies of Information and Telecommunications March 25-29, 2007 TUNISIA A SystemC Transaction Level Model for the MIPS R3000 Processor

More information

Performance Oriented Management System for Reconfigurable Network Appliances

Performance Oriented Management System for Reconfigurable Network Appliances Performance Oriented Management System for Reconfigurable Network Appliances Hiroki Matsutani, Ryuji Wakikawa, Koshiro Mitsuya and Jun Murai Faculty of Environmental Information, Keio University Graduate

More information

Week 1 out-of-class notes, discussions and sample problems

Week 1 out-of-class notes, discussions and sample problems Week 1 out-of-class notes, discussions and sample problems Although we will primarily concentrate on RISC processors as found in some desktop/laptop computers, here we take a look at the varying types

More information

Photonic Networks for Data Centres and High Performance Computing

Photonic Networks for Data Centres and High Performance Computing Photonic Networks for Data Centres and High Performance Computing Philip Watts Department of Electronic Engineering, UCL Yury Audzevich, Nick Barrow-Williams, Robert Mullins, Simon Moore, Andrew Moore

More information

Chapter 12. Development Tools for Microcontroller Applications

Chapter 12. Development Tools for Microcontroller Applications Chapter 12 Development Tools for Microcontroller Applications Lesson 01 Software Development Process and Development Tools Step 1: Development Phases Analysis Design Implementation Phase 1 Phase 2 Phase

More information

More on Pipelining and Pipelines in Real Machines CS 333 Fall 2006 Main Ideas Data Hazards RAW WAR WAW More pipeline stall reduction techniques Branch prediction» static» dynamic bimodal branch prediction

More information

Execution Cycle. Pipelining. IF and ID Stages. Simple MIPS Instruction Formats

Execution Cycle. Pipelining. IF and ID Stages. Simple MIPS Instruction Formats Execution Cycle Pipelining CSE 410, Spring 2005 Computer Systems http://www.cs.washington.edu/410 1. Instruction Fetch 2. Instruction Decode 3. Execute 4. Memory 5. Write Back IF and ID Stages 1. Instruction

More information

Central Processing Unit (CPU)

Central Processing Unit (CPU) Central Processing Unit (CPU) CPU is the heart and brain It interprets and executes machine level instructions Controls data transfer from/to Main Memory (MM) and CPU Detects any errors In the following

More information

Chapter 01: Introduction. Lesson 02 Evolution of Computers Part 2 First generation Computers

Chapter 01: Introduction. Lesson 02 Evolution of Computers Part 2 First generation Computers Chapter 01: Introduction Lesson 02 Evolution of Computers Part 2 First generation Computers Objective Understand how electronic computers evolved during the first generation of computers First Generation

More information

Q. Consider a dynamic instruction execution (an execution trace, in other words) that consists of repeats of code in this pattern:

Q. Consider a dynamic instruction execution (an execution trace, in other words) that consists of repeats of code in this pattern: Pipelining HW Q. Can a MIPS SW instruction executing in a simple 5-stage pipelined implementation have a data dependency hazard of any type resulting in a nop bubble? If so, show an example; if not, prove

More information

The Central Processing Unit:

The Central Processing Unit: The Central Processing Unit: What Goes on Inside the Computer Chapter 4 Objectives Identify the components of the central processing unit and how they work together and interact with memory Describe how

More information

Component visualization methods for large legacy software in C/C++

Component visualization methods for large legacy software in C/C++ Annales Mathematicae et Informaticae 44 (2015) pp. 23 33 http://ami.ektf.hu Component visualization methods for large legacy software in C/C++ Máté Cserép a, Dániel Krupp b a Eötvös Loránd University mcserep@caesar.elte.hu

More information

ESE566 REPORT3. Design Methodologies for Core-based System-on-Chip HUA TANG OVIDIU CARNU

ESE566 REPORT3. Design Methodologies for Core-based System-on-Chip HUA TANG OVIDIU CARNU ESE566 REPORT3 Design Methodologies for Core-based System-on-Chip HUA TANG OVIDIU CARNU Nov 19th, 2002 ABSTRACT: In this report, we discuss several recent published papers on design methodologies of core-based

More information

Xeon+FPGA Platform for the Data Center

Xeon+FPGA Platform for the Data Center Xeon+FPGA Platform for the Data Center ISCA/CARL 2015 PK Gupta, Director of Cloud Platform Technology, DCG/CPG Overview Data Center and Workloads Xeon+FPGA Accelerator Platform Applications and Eco-system

More information

Chapter 1. Dr. Chris Irwin Davis Email: cid021000@utdallas.edu Phone: (972) 883-3574 Office: ECSS 4.705. CS-4337 Organization of Programming Languages

Chapter 1. Dr. Chris Irwin Davis Email: cid021000@utdallas.edu Phone: (972) 883-3574 Office: ECSS 4.705. CS-4337 Organization of Programming Languages Chapter 1 CS-4337 Organization of Programming Languages Dr. Chris Irwin Davis Email: cid021000@utdallas.edu Phone: (972) 883-3574 Office: ECSS 4.705 Chapter 1 Topics Reasons for Studying Concepts of Programming

More information

Innovative improvement of fundamental metrics including power dissipation and efficiency of the ALU system

Innovative improvement of fundamental metrics including power dissipation and efficiency of the ALU system Innovative improvement of fundamental metrics including power dissipation and efficiency of the ALU system Joseph LaBauve Department of Electrical and Computer Engineering University of Central Florida

More information

Rapid System Prototyping with FPGAs

Rapid System Prototyping with FPGAs Rapid System Prototyping with FPGAs By R.C. Coferand Benjamin F. Harding AMSTERDAM BOSTON HEIDELBERG LONDON NEW YORK OXFORD PARIS SAN DIEGO SAN FRANCISCO SINGAPORE SYDNEY TOKYO Newnes is an imprint of

More information

1/20/2016 INTRODUCTION

1/20/2016 INTRODUCTION INTRODUCTION 1 Programming languages have common concepts that are seen in all languages This course will discuss and illustrate these common concepts: Syntax Names Types Semantics Memory Management We

More information

High Performance or Cycle Accuracy?

High Performance or Cycle Accuracy? CHIP DESIGN High Performance or Cycle Accuracy? You can have both! Bill Neifert, Carbon Design Systems Rob Kaye, ARM ATC-100 AGENDA Modelling 101 & Programmer s View (PV) Models Cycle Accurate Models Bringing

More information

TDT 4260 lecture 11 spring semester 2013. Interconnection network continued

TDT 4260 lecture 11 spring semester 2013. Interconnection network continued 1 TDT 4260 lecture 11 spring semester 2013 Lasse Natvig, The CARD group Dept. of computer & information science NTNU 2 Lecture overview Interconnection network continued Routing Switch microarchitecture

More information

Introduction to RISC Processor. ni logic Pvt. Ltd., Pune

Introduction to RISC Processor. ni logic Pvt. Ltd., Pune Introduction to RISC Processor ni logic Pvt. Ltd., Pune AGENDA What is RISC & its History What is meant by RISC Architecture of MIPS-R4000 Processor Difference Between RISC and CISC Pros and Cons of RISC

More information

CISC, RISC, and DSP Microprocessors

CISC, RISC, and DSP Microprocessors CISC, RISC, and DSP Microprocessors Douglas L. Jones ECE 497 Spring 2000 4/6/00 CISC, RISC, and DSP D.L. Jones 1 Outline Microprocessors circa 1984 RISC vs. CISC Microprocessors circa 1999 Perspective:

More information

Parallelization: Binary Tree Traversal

Parallelization: Binary Tree Traversal By Aaron Weeden and Patrick Royal Shodor Education Foundation, Inc. August 2012 Introduction: According to Moore s law, the number of transistors on a computer chip doubles roughly every two years. First

More information

IBM Deep Computing Visualization Offering

IBM Deep Computing Visualization Offering P - 271 IBM Deep Computing Visualization Offering Parijat Sharma, Infrastructure Solution Architect, IBM India Pvt Ltd. email: parijatsharma@in.ibm.com Summary Deep Computing Visualization in Oil & Gas

More information

The Fastest Way to Parallel Programming for Multicore, Clusters, Supercomputers and the Cloud.

The Fastest Way to Parallel Programming for Multicore, Clusters, Supercomputers and the Cloud. White Paper 021313-3 Page 1 : A Software Framework for Parallel Programming* The Fastest Way to Parallel Programming for Multicore, Clusters, Supercomputers and the Cloud. ABSTRACT Programming for Multicore,

More information

Topics. Introduction. Java History CS 146. Introduction to Programming and Algorithms Module 1. Module Objectives

Topics. Introduction. Java History CS 146. Introduction to Programming and Algorithms Module 1. Module Objectives Introduction to Programming and Algorithms Module 1 CS 146 Sam Houston State University Dr. Tim McGuire Module Objectives To understand: the necessity of programming, differences between hardware and software,

More information

EMC XtremSF: Delivering Next Generation Performance for Oracle Database

EMC XtremSF: Delivering Next Generation Performance for Oracle Database White Paper EMC XtremSF: Delivering Next Generation Performance for Oracle Database Abstract This white paper addresses the challenges currently facing business executives to store and process the growing

More information

Index Terms Domain name, Firewall, Packet, Phishing, URL.

Index Terms Domain name, Firewall, Packet, Phishing, URL. BDD for Implementation of Packet Filter Firewall and Detecting Phishing Websites Naresh Shende Vidyalankar Institute of Technology Prof. S. K. Shinde Lokmanya Tilak College of Engineering Abstract Packet

More information

On Demand Loading of Code in MMUless Embedded System

On Demand Loading of Code in MMUless Embedded System On Demand Loading of Code in MMUless Embedded System Sunil R Gandhi *. Chetan D Pachange, Jr.** Mandar R Vaidya***, Swapnilkumar S Khorate**** *Pune Institute of Computer Technology, Pune INDIA (Mob- 8600867094;

More information

Making Multicore Work and Measuring its Benefits. Markus Levy, president EEMBC and Multicore Association

Making Multicore Work and Measuring its Benefits. Markus Levy, president EEMBC and Multicore Association Making Multicore Work and Measuring its Benefits Markus Levy, president EEMBC and Multicore Association Agenda Why Multicore? Standards and issues in the multicore community What is Multicore Association?

More information

Outline. Introduction. Multiprocessor Systems on Chip. A MPSoC Example: Nexperia DVP. A New Paradigm: Network on Chip

Outline. Introduction. Multiprocessor Systems on Chip. A MPSoC Example: Nexperia DVP. A New Paradigm: Network on Chip Outline Modeling, simulation and optimization of Multi-Processor SoCs (MPSoCs) Università of Verona Dipartimento di Informatica MPSoCs: Multi-Processor Systems on Chip A simulation platform for a MPSoC

More information

Computer Organization

Computer Organization Computer Organization and Architecture Designing for Performance Ninth Edition William Stallings International Edition contributions by R. Mohan National Institute of Technology, Tiruchirappalli PEARSON

More information

SPARC64 VIIIfx: CPU for the K computer

SPARC64 VIIIfx: CPU for the K computer SPARC64 VIIIfx: CPU for the K computer Toshio Yoshida Mikio Hondo Ryuji Kan Go Sugizaki SPARC64 VIIIfx, which was developed as a processor for the K computer, uses Fujitsu Semiconductor Ltd. s 45-nm CMOS

More information

MONITORING power consumption of a microprocessor

MONITORING power consumption of a microprocessor IEEE TRANSACTIONS ON CIRCUIT AND SYSTEMS-II, VOL. X, NO. Y, JANUARY XXXX 1 A Study on the use of Performance Counters to Estimate Power in Microprocessors Rance Rodrigues, Member, IEEE, Arunachalam Annamalai,

More information

Instruction scheduling

Instruction scheduling Instruction ordering Instruction scheduling Advanced Compiler Construction Michel Schinz 2015 05 21 When a compiler emits the instructions corresponding to a program, it imposes a total order on them.

More information

Driving force. What future software needs. Potential research topics

Driving force. What future software needs. Potential research topics Improving Software Robustness and Efficiency Driving force Processor core clock speed reach practical limit ~4GHz (power issue) Percentage of sustainable # of active transistors decrease; Increase in #

More information

Embedded Software development Process and Tools:

Embedded Software development Process and Tools: Embedded Software development Process and Tools: Lesson-2 Integrated Development Environment (IDE) 1 1. IDE 2 Consists of Simulators editors, compilers, assemblers, etc., IDE 3 emulators logic analyzers

More information

TYPES OF COMPUTERS AND THEIR PARTS MULTIPLE CHOICE QUESTIONS

TYPES OF COMPUTERS AND THEIR PARTS MULTIPLE CHOICE QUESTIONS MULTIPLE CHOICE QUESTIONS 1. What is a computer? a. A programmable electronic device that processes data via instructions to output information for future use. b. Raw facts and figures that has no meaning

More information

PROBLEMS #20,R0,R1 #$3A,R2,R4

PROBLEMS #20,R0,R1 #$3A,R2,R4 506 CHAPTER 8 PIPELINING (Corrisponde al cap. 11 - Introduzione al pipelining) PROBLEMS 8.1 Consider the following sequence of instructions Mul And #20,R0,R1 #3,R2,R3 #$3A,R2,R4 R0,R2,R5 In all instructions,

More information

picojava TM : A Hardware Implementation of the Java Virtual Machine

picojava TM : A Hardware Implementation of the Java Virtual Machine picojava TM : A Hardware Implementation of the Java Virtual Machine Marc Tremblay and Michael O Connor Sun Microelectronics Slide 1 The Java picojava Synergy Java s origins lie in improving the consumer

More information

GEDAE TM - A Graphical Programming and Autocode Generation Tool for Signal Processor Applications

GEDAE TM - A Graphical Programming and Autocode Generation Tool for Signal Processor Applications GEDAE TM - A Graphical Programming and Autocode Generation Tool for Signal Processor Applications Harris Z. Zebrowitz Lockheed Martin Advanced Technology Laboratories 1 Federal Street Camden, NJ 08102

More information

IA-64 Application Developer s Architecture Guide

IA-64 Application Developer s Architecture Guide IA-64 Application Developer s Architecture Guide The IA-64 architecture was designed to overcome the performance limitations of today s architectures and provide maximum headroom for the future. To achieve

More information