Queue Machines for Next Generation Computer Systems
Masahiro Sowa
Graduate School of Information Systems, University of Electro-Communications, Chofugaoka 1-5-1, Chofu-shi, Tokyo, Japan

Arquimedes Canedo
IBM Tokyo Research Laboratory, Shimotsuruma, Yamato-shi, Kanagawa-ken, Japan

Abstract

Queue processors are a novel computer architecture with characteristics well suited to the next generation of computer systems. Whereas conventional register machines process data through registers referenced by name, queue machines use a nameless first-in first-out queue to perform operations: the head of the queue is the only place where data is read, and the tail is the only place where data is written. Queue computing thus allows instructions to read and write data without register names; as a consequence, instructions are short and queue programs are free of false dependencies. Short instructions are preferable to long ones because they reduce memory traffic, improve cache performance and effective cache size, and simplify the decoding logic. The absence of false dependencies allows programs to expose their maximum parallelism, which a queue processor can exploit without complex, power-hungry hardware such as register renaming and large instruction windows. Parallel execution allows queue processors to speed up applications. In this paper we present the special characteristics of queue machines that make this design very suitable for a future generation of computer systems demanding high performance, low power, and low cost. We also present the toolchain of compilers, translators, assembler, virtual machine, functional and cycle-accurate simulators, and RTL processors that we have designed and developed.
From our experimental results we demonstrate that queue programs are smaller than conventional register programs and have very similar characteristics in terms of parallelism.

Keywords: Queue Machines, Computer Systems, System Design

1 Introduction

Processor design faces a situation where performance can no longer be sustained by simply increasing the frequency of operation. As the size of transistors decreases and the frequency of operation rises, power dissipation becomes a limitation of modern technology. This situation has motivated designers to consider new execution paradigms for modern computer systems [8, 1, 9]. The key to sustaining performance improvements has been exploiting different sources of parallelism. Queue processors [6, 7, 8] are a novel execution paradigm that allows programs to be executed in parallel without complex hardware to find and exploit such parallelism. The instructions of a queue processor reference source and destination operands implicitly; thus the instructions are compact and free of false dependencies. The hardware implementation of a queue processor requires only simple circuits, yet it is able to execute instructions in parallel. The programs generated for the queue computation model are unrestricted in parallelism, and therefore queue processors can be used for scientific and parallel processing. Owing to its hardware simplicity, a queue processor has the potential for a low power-dissipation footprint, making it ideal for modern applications running on mobile devices. In this paper we present an overview of the principles of queue computing and explain its salient characteristics. We also present a complete queue computer system toolchain and give insight into its development. An
evaluation of the characteristics of queue computing is given to demonstrate the potential of queue-based systems against conventional computing paradigms.

2 Overview of Queue Computing

A queue machine is a computer that uses fast registers organized as a first-in first-out (FIFO) queue to perform computations. A FIFO queue establishes that all read accesses are performed at one end of the queue, called the head (QH), and all writes are performed at the opposite end, called the tail (QT). Since all accesses are done at known, fixed locations, QH and QT, instructions do not need to specify source and destination operands. However, the hardware must track and update the positions of QH and QT at all times: whenever a write is performed the QT position is updated, and whenever a read is performed the QH position is updated. Given the strict rules for accessing the queue, the programs for a queue machine are generated by a level-order traversal of the expression trees. The level-order traversal visits all levels from the deepest to the root and, within each level, walks all nodes from left to right. This scheduling method produces programs that can be directly executed on a queue processor. Consider the expression x = (a+b)/(c-d): first the expression tree is built and all nodes are visited in a level-order traversal, as shown in Figure 1(a). For every visited node a corresponding instruction of the queue processor is generated; the queue program for the sample expression is given in Figure 1(b). An important characteristic of the queue program is that the arithmetic instructions (add, sub, div) consist only of the operator; source and destination operands are absent, since they are implicitly accessed through QH and QT. This feature allows the instructions of a queue processor to be short, and as a consequence programs are compact.
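The level-order code generation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' compiler; the Node class is hypothetical, and an assumed "ld" mnemonic is used for operand loads, following the loads shown in Figure 1.

```python
# A binary expression-tree node; leaves hold variable names in `op`.
class Node:
    def __init__(self, op, left=None, right=None):
        self.op, self.left, self.right = op, left, right

def level_order_schedule(root):
    """Collect the tree level by level, then emit instructions from the
    deepest level up to the root, left to right within each level."""
    levels, frontier = [], [root]
    while frontier:
        levels.append(frontier)
        frontier = [c for n in frontier for c in (n.left, n.right) if c]
    program = []
    for level in reversed(levels):        # deepest level first
        for node in level:                # left to right within a level
            if node.left is None:         # leaf: load operand at QT
                program.append(f"ld {node.op}")
            else:                         # operator: implicit QH/QT access
                program.append(node.op)
    return program

# x = (a + b) / (c - d), the sample expression of Figure 1
tree = Node("div", Node("add", Node("a"), Node("b")),
                   Node("sub", Node("c"), Node("d")))
print(level_order_schedule(tree) + ["st x"])
# -> ['ld a', 'ld b', 'ld c', 'ld d', 'add', 'sub', 'div', 'st x']
```

Note that no instruction names a register: the four loads and the final store reference memory, while add, sub, and div carry no operands at all.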
This is favorable for the hardware, since manipulating small instructions requires less circuitry, smaller buses, and less complexity, and consequently consumes less power. In addition, small instructions have the potential to improve the performance of the cache memories and to reduce production costs in computer systems where memory size is constrained, as in embedded systems.

[Figure 1: The expression x = (a+b)/(c-d) is (a) traversed in a level-order manner to (b) generate a queue program, which is (c) executed in a FIFO queue.]

Data-independent instructions of a queue program are grouped together by the level-order scheduling. Figure 1(c) shows the execution of the sample queue program in the FIFO queue; the leftmost end of the queue is QH and the rightmost end is QT. The first four instructions, those of level L0, place the operands a, b, c, d in the queue. These four instructions are independent of each other and can be issued in parallel. The same happens with the instructions of level L1, add and sub: they are independent of each other and thus execute simultaneously. This holds for every level: all instructions within the same level can be executed in parallel. The absence of operands in the instructions has another desirable consequence: queue programs are free of false dependencies. The parallelism exposed by the level-order scheduling and the
lack of false dependencies in the programs allow queue machines to exploit all the parallelism available in an application. The grouping of independent instructions allows queue processor implementations to use smaller instruction windows, and the absence of false dependencies completely removes the need for register-renaming hardware. A parallel queue processor (PQP) offers a high-performance architecture with a simple hardware design and low-power characteristics.

3 Developing Queue Computer Systems

In Sowa Laboratory we have investigated the principles of queue computing and developed a toolchain for the design and implementation of queue processors. Figure 2 shows the current layers of our toolchain. Except for the applications, every element has been crafted from scratch for the queue computation model. There are fundamental differences between queue machines and conventional stack and register machines that prevent the use of conventional techniques and methods to build compilers, virtual machines, simulators, operating systems, and hardware. Therefore, we have researched and established new methods to build custom tools for queue computing.

[Figure 2: Layers of a queue-based computer system: Applications; Queue Compiler (profiler, optimizers); Queue Virtual Machine; Assembler, Linker, Loader; Simulator; Operating System; PQP Hardware.]

Queue computing and architecture design approaches take performance and power consumption into account early in the design cycle and maintain a power-centric focus across all levels of design abstraction.

Applications

We have considered a wide range of applications to analyze their characteristics when translated to queue programs. The experimental results show that applications benefit from queue computing, as their size can be reduced by up to 47% compared to the same application on a conventional RISC machine [3]. In terms of other qualitative characteristics, such as available parallelism and instruction count, queue programs are very similar to register programs. Most applications are written in high-level languages such as C, FORTRAN, and Java; therefore, we need a compiler to translate these programs into machine code for queue processors.
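As a concrete illustration of the execution rules of Section 2 (all reads pop from QH, all writes append at QT), the queue program of Figure 1 can be interpreted with an ordinary software FIFO. This is a toy sketch, not the laboratory's virtual machine; the env dictionary standing in for memory and the ld/st mnemonics are assumptions for illustration.

```python
from collections import deque

def run_queue_program(program, env):
    """Execute a queue program: reads pop from QH (left end),
    writes append at QT (right end)."""
    q = deque()
    binops = {"add": lambda a, b: a + b,
              "sub": lambda a, b: a - b,
              "div": lambda a, b: a / b}
    for insn in program:
        op, *arg = insn.split()
        if op == "ld":                       # write operand value at QT
            q.append(env[arg[0]])
        elif op == "st":                     # read result from QH into memory
            env[arg[0]] = q.popleft()
        else:                                # binary op: two reads, one write
            a, b = q.popleft(), q.popleft()
            q.append(binops[op](a, b))
    return env

env = run_queue_program(
    ["ld a", "ld b", "ld c", "ld d", "add", "sub", "div", "st x"],
    {"a": 6, "b": 2, "c": 5, "d": 1})
print(env["x"])   # (6+2)/(5-1) = 2.0
```

Tracing the queue contents reproduces Figure 1(c): after the loads it holds [6, 2, 5, 1]; add leaves [5, 1, 8]; sub leaves [8, 4]; div leaves [2.0]; st drains it.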
Queue Compiler

The queue compiler not only translates high-level languages into machine code but also optimizes programs for smaller execution time, higher parallelism, better resource utilization, and smaller code size, among other goals. Compiling for queue machines is different from compiling for register computers, and this is reflected in the structure of the queue compiler. Since a queue program is built around a queue rather than registers, novel code generation algorithms have been developed [2] to handle this characteristic. The compiler is a critical part of a queue computer system, as it schedules programs to satisfy the constraints of the hardware and achieve the highest performance.

Queue Virtual Machine

Virtual machines offer great flexibility at reasonable development cost for testing and validating a new computer system. Our queue virtual machine emulates a queue processor and interprets the programs generated by the queue compiler. The virtual machine can easily be modified to emulate different queue processors and to test new architectural features. It also helps refine the compiler, since it can execute programs, and executed programs have different characteristics than compiled programs.

Simulator

Simulators also emulate a queue processor and execute code generated by the compiler; the difference from the queue virtual machine is the level of detail at which execution is modeled. Our simulators emulate the pipeline of the queue processor and perform detailed timing of the execution. Pipeline hazards, stalls, instruction latencies, and other microarchitectural features are simulated. Furthermore, the cycle-accurate simulators can be configured to simulate a parallel queue processor that executes instructions simultaneously.
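The parallel execution that the cycle-accurate simulators model can be caricatured in a few lines: since the level-order schedule groups data-independent instructions, an idealized PQP with sufficient issue width can issue one whole level per cycle. This is a deliberately simplified sketch under that assumption, ignoring latencies, hazards, and finite issue width.

```python
def issue_schedule(leveled_program):
    """Idealized parallel issue: all instructions of one level per cycle.
    Returns (cycles, ILP) where ILP = instructions / cycles."""
    total = sum(len(level) for level in leveled_program)
    for cycle, level in enumerate(leveled_program):
        print(f"cycle {cycle}: issue {', '.join(level)}")
    return len(leveled_program), total / len(leveled_program)

# Levels of the Figure 1 program: loads, then add/sub, then div, then st.
levels = [["ld a", "ld b", "ld c", "ld d"],
          ["add", "sub"],
          ["div"],
          ["st x"]]
cycles, ilp = issue_schedule(levels)
print(cycles, "cycles, ILP =", ilp)   # 4 cycles, ILP = 2.0
```

Eight instructions complete in four cycles, whereas a strictly serial queue processor would need eight; the ratio is exactly the compile-time ILP reported later in Figure 4.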
Assembler, Linker, Loader

When targeting an actual queue processor, the generic assembly code generated by the compiler must be translated into binary code by the assembler. Having an assembler allows us to easily retarget a program to a new queue processor without changing the compiler. The functions of the linker and loader are no different than in conventional systems: the former resolves symbols and combines different objects into a single executable, and the latter prepares the code for execution under the operating system.

Operating System

This is the only part of the toolchain for which we have not built a prototype. Until now, we have concentrated our efforts on developing the tools needed to test the fundamental ideas of queue computing. As it is possible for us to run queue programs on the PQP hardware directly, the need for an operating system has been a secondary priority. However, our future work includes the development of an operating system to manage the PQP processor.

PQP Hardware

Several queue processors have been designed in hardware description languages, realized as RTL designs, and implemented on FPGA chips. The hardware has gradually evolved from a serial queue processor to a parallel queue processor (PQP). We have also developed hybrid architectures that combine register, stack, and queue in a single processor. Our latest research includes a multi-dimensional queue processor in which several queues are employed for data processing. Queue processors are characterized by hardware simplicity, low power consumption, and high performance.
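The code-size advantage of operand-free instructions is easy to see with a toy assembler. The encoding below is entirely hypothetical (it is not the PQP instruction format, which is not specified here): zero-operand arithmetic instructions take one byte, and ld/st take an opcode byte plus a one-byte memory offset, compared against four bytes per instruction on a typical 32-bit RISC.

```python
# Hypothetical compact encoding for illustration only.
OPCODES = {"add": 0x01, "sub": 0x02, "div": 0x03,
           "ld": 0x10, "st": 0x11}
SYMBOLS = {"a": 0, "b": 1, "c": 2, "d": 3, "x": 4}   # assumed symbol table

def assemble(program):
    """Translate generic queue assembly into a compact binary image."""
    code = bytearray()
    for insn in program:
        op, *arg = insn.split()
        code.append(OPCODES[op])
        if arg:                        # only ld/st carry an address offset
            code.append(SYMBOLS[arg[0]])
    return bytes(code)

prog = ["ld a", "ld b", "ld c", "ld d", "add", "sub", "div", "st x"]
binary = assemble(prog)
print(len(binary), "bytes vs", 4 * len(prog), "bytes on a 32-bit RISC")
# 13 bytes vs 32 bytes
```

Even this crude scheme shrinks the sample program by more than half, which is the mechanism behind the code-size reductions reported in the evaluation.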
4 Evaluation

In this section we give the experimental results obtained during the development of the PQP processor. Figure 3 shows the normalized code size of the SPEC CINT95 benchmarks [4] for the PQP processor (baseline) and the MIPS I processor [5]. From this graph we observe that the absence of operands in the queue instruction set yields small code sizes. For these benchmarks, PQP programs are from 27% to 47% smaller than MIPS programs.

[Figure 3: Normalized code size of the SPEC CINT95 benchmarks (go, 124.m88ksim, 126.gcc, 129.compress, 130.li, 132.ijpeg, 134.perl, 147.vortex) for MIPS relative to PQP; the MIPS ratios reach 1.88.]

Figure 4 shows the instruction-level parallelism (ILP) of embedded benchmarks, extracted at compile time, for MIPS programs and for PQP programs. Notice that the PQP programs have, on average, 13% more parallelism than optimized MIPS programs. This is because queue programs are not limited by false dependencies and the level-order scheduling exposes all of the application's parallelism. On the other hand, although MIPS programs are heavily optimized, the limited number of architected registers introduces false dependencies that limit the available parallelism.

[Figure 4: Instruction-level parallelism (ILP) of embedded benchmarks (H.263, MPEG2, FFT, Susan, Rijndael, Sha, Blowfish, Dijkstra, Patricia, Adpcm) for MIPS I (-O3) and PQP; PQP ILP reaches 3.75.]

5 Conclusion

As discussed, queue processors offer a good alternative for modern computer systems. Modern applications constantly demand higher and higher performance at low power consumption and low cost. The simplicity and characteristics of queue-based processors make them ideal candidates for such demands. In this paper we introduced the principles of queue computing and presented an overview of our current research on the PQP processor. We have developed a custom toolchain that includes compilers, virtual machines, cycle-accurate simulators, assemblers, linkers, loaders, operating system, libraries, and several RTL processors.
This research has contributed novel techniques for developing queue machines, and the results have shown that queue processors can produce programs up to 47% smaller and applications 13% more parallel than modern RISC processors.

References

[1] D. Burger, S. W. Keckler, K. S. McKinley, M. Dahlin, L. K. John, C. Lin, C. R. Moore, J. Burrill, R. G. McDonald, W. Yoder, and the TRIPS Team. Scaling to the end of silicon with EDGE architectures. Computer, 37(7):44-55, 2004.
[2] A. Canedo, B. Abderazek, and M. Sowa. A New Code Generation Algorithm for 2-offset Producer Order Queue Computation Model. Journal of Computer Languages, Systems & Structures, 34(4), June 2008.
[3] A. Canedo, B. Abderazek, and M. Sowa. Compiler Support for Code Size Reduction using a Queue-based Processor. Transactions on High-Performance Embedded Architectures and Compilers, 2(4).
[4] J. J. Dujmovic and I. Dujmovic. Evolution and evaluation of SPEC benchmarks. ACM SIGMETRICS Performance Evaluation Review, 26(3):2-9, December 1998.
[5] G. Kane and J. Heinrich. MIPS RISC Architecture. Prentice Hall, 1992.
[6] B. Preiss and C. Hamacher. Data Flow on Queue Machines. In 12th International IEEE Symposium on Computer Architecture, 1985.
[7] H. Schmit, B. Levine, and B. Ylvisaker. Queue Machines: Hardware Compilation in Hardware. In 10th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, page 152, 2002.
[8] M. Sowa, B. Abderazek, and T. Yoshinaga. Parallel Queue Processor Architecture Based on Produced Order Computation Model. Journal of Supercomputing, 32(3), June 2005.
[9] S. Swanson, K. Michelson, A. Schwerin, and M. Oskin. WaveScalar. In Proceedings of the 36th International Symposium on Microarchitecture (MICRO-36), 2003.
More informationPerformance Oriented Management System for Reconfigurable Network Appliances
Performance Oriented Management System for Reconfigurable Network Appliances Hiroki Matsutani, Ryuji Wakikawa, Koshiro Mitsuya and Jun Murai Faculty of Environmental Information, Keio University Graduate
More informationWeek 1 out-of-class notes, discussions and sample problems
Week 1 out-of-class notes, discussions and sample problems Although we will primarily concentrate on RISC processors as found in some desktop/laptop computers, here we take a look at the varying types
More informationPhotonic Networks for Data Centres and High Performance Computing
Photonic Networks for Data Centres and High Performance Computing Philip Watts Department of Electronic Engineering, UCL Yury Audzevich, Nick Barrow-Williams, Robert Mullins, Simon Moore, Andrew Moore
More informationChapter 12. Development Tools for Microcontroller Applications
Chapter 12 Development Tools for Microcontroller Applications Lesson 01 Software Development Process and Development Tools Step 1: Development Phases Analysis Design Implementation Phase 1 Phase 2 Phase
More informationMore on Pipelining and Pipelines in Real Machines CS 333 Fall 2006 Main Ideas Data Hazards RAW WAR WAW More pipeline stall reduction techniques Branch prediction» static» dynamic bimodal branch prediction
More informationExecution Cycle. Pipelining. IF and ID Stages. Simple MIPS Instruction Formats
Execution Cycle Pipelining CSE 410, Spring 2005 Computer Systems http://www.cs.washington.edu/410 1. Instruction Fetch 2. Instruction Decode 3. Execute 4. Memory 5. Write Back IF and ID Stages 1. Instruction
More informationCentral Processing Unit (CPU)
Central Processing Unit (CPU) CPU is the heart and brain It interprets and executes machine level instructions Controls data transfer from/to Main Memory (MM) and CPU Detects any errors In the following
More informationChapter 01: Introduction. Lesson 02 Evolution of Computers Part 2 First generation Computers
Chapter 01: Introduction Lesson 02 Evolution of Computers Part 2 First generation Computers Objective Understand how electronic computers evolved during the first generation of computers First Generation
More informationQ. Consider a dynamic instruction execution (an execution trace, in other words) that consists of repeats of code in this pattern:
Pipelining HW Q. Can a MIPS SW instruction executing in a simple 5-stage pipelined implementation have a data dependency hazard of any type resulting in a nop bubble? If so, show an example; if not, prove
More informationThe Central Processing Unit:
The Central Processing Unit: What Goes on Inside the Computer Chapter 4 Objectives Identify the components of the central processing unit and how they work together and interact with memory Describe how
More informationComponent visualization methods for large legacy software in C/C++
Annales Mathematicae et Informaticae 44 (2015) pp. 23 33 http://ami.ektf.hu Component visualization methods for large legacy software in C/C++ Máté Cserép a, Dániel Krupp b a Eötvös Loránd University mcserep@caesar.elte.hu
More informationESE566 REPORT3. Design Methodologies for Core-based System-on-Chip HUA TANG OVIDIU CARNU
ESE566 REPORT3 Design Methodologies for Core-based System-on-Chip HUA TANG OVIDIU CARNU Nov 19th, 2002 ABSTRACT: In this report, we discuss several recent published papers on design methodologies of core-based
More informationXeon+FPGA Platform for the Data Center
Xeon+FPGA Platform for the Data Center ISCA/CARL 2015 PK Gupta, Director of Cloud Platform Technology, DCG/CPG Overview Data Center and Workloads Xeon+FPGA Accelerator Platform Applications and Eco-system
More informationChapter 1. Dr. Chris Irwin Davis Email: cid021000@utdallas.edu Phone: (972) 883-3574 Office: ECSS 4.705. CS-4337 Organization of Programming Languages
Chapter 1 CS-4337 Organization of Programming Languages Dr. Chris Irwin Davis Email: cid021000@utdallas.edu Phone: (972) 883-3574 Office: ECSS 4.705 Chapter 1 Topics Reasons for Studying Concepts of Programming
More informationInnovative improvement of fundamental metrics including power dissipation and efficiency of the ALU system
Innovative improvement of fundamental metrics including power dissipation and efficiency of the ALU system Joseph LaBauve Department of Electrical and Computer Engineering University of Central Florida
More informationRapid System Prototyping with FPGAs
Rapid System Prototyping with FPGAs By R.C. Coferand Benjamin F. Harding AMSTERDAM BOSTON HEIDELBERG LONDON NEW YORK OXFORD PARIS SAN DIEGO SAN FRANCISCO SINGAPORE SYDNEY TOKYO Newnes is an imprint of
More information1/20/2016 INTRODUCTION
INTRODUCTION 1 Programming languages have common concepts that are seen in all languages This course will discuss and illustrate these common concepts: Syntax Names Types Semantics Memory Management We
More informationHigh Performance or Cycle Accuracy?
CHIP DESIGN High Performance or Cycle Accuracy? You can have both! Bill Neifert, Carbon Design Systems Rob Kaye, ARM ATC-100 AGENDA Modelling 101 & Programmer s View (PV) Models Cycle Accurate Models Bringing
More informationTDT 4260 lecture 11 spring semester 2013. Interconnection network continued
1 TDT 4260 lecture 11 spring semester 2013 Lasse Natvig, The CARD group Dept. of computer & information science NTNU 2 Lecture overview Interconnection network continued Routing Switch microarchitecture
More informationIntroduction to RISC Processor. ni logic Pvt. Ltd., Pune
Introduction to RISC Processor ni logic Pvt. Ltd., Pune AGENDA What is RISC & its History What is meant by RISC Architecture of MIPS-R4000 Processor Difference Between RISC and CISC Pros and Cons of RISC
More informationCISC, RISC, and DSP Microprocessors
CISC, RISC, and DSP Microprocessors Douglas L. Jones ECE 497 Spring 2000 4/6/00 CISC, RISC, and DSP D.L. Jones 1 Outline Microprocessors circa 1984 RISC vs. CISC Microprocessors circa 1999 Perspective:
More informationParallelization: Binary Tree Traversal
By Aaron Weeden and Patrick Royal Shodor Education Foundation, Inc. August 2012 Introduction: According to Moore s law, the number of transistors on a computer chip doubles roughly every two years. First
More informationIBM Deep Computing Visualization Offering
P - 271 IBM Deep Computing Visualization Offering Parijat Sharma, Infrastructure Solution Architect, IBM India Pvt Ltd. email: parijatsharma@in.ibm.com Summary Deep Computing Visualization in Oil & Gas
More informationThe Fastest Way to Parallel Programming for Multicore, Clusters, Supercomputers and the Cloud.
White Paper 021313-3 Page 1 : A Software Framework for Parallel Programming* The Fastest Way to Parallel Programming for Multicore, Clusters, Supercomputers and the Cloud. ABSTRACT Programming for Multicore,
More informationTopics. Introduction. Java History CS 146. Introduction to Programming and Algorithms Module 1. Module Objectives
Introduction to Programming and Algorithms Module 1 CS 146 Sam Houston State University Dr. Tim McGuire Module Objectives To understand: the necessity of programming, differences between hardware and software,
More informationEMC XtremSF: Delivering Next Generation Performance for Oracle Database
White Paper EMC XtremSF: Delivering Next Generation Performance for Oracle Database Abstract This white paper addresses the challenges currently facing business executives to store and process the growing
More informationIndex Terms Domain name, Firewall, Packet, Phishing, URL.
BDD for Implementation of Packet Filter Firewall and Detecting Phishing Websites Naresh Shende Vidyalankar Institute of Technology Prof. S. K. Shinde Lokmanya Tilak College of Engineering Abstract Packet
More informationOn Demand Loading of Code in MMUless Embedded System
On Demand Loading of Code in MMUless Embedded System Sunil R Gandhi *. Chetan D Pachange, Jr.** Mandar R Vaidya***, Swapnilkumar S Khorate**** *Pune Institute of Computer Technology, Pune INDIA (Mob- 8600867094;
More informationMaking Multicore Work and Measuring its Benefits. Markus Levy, president EEMBC and Multicore Association
Making Multicore Work and Measuring its Benefits Markus Levy, president EEMBC and Multicore Association Agenda Why Multicore? Standards and issues in the multicore community What is Multicore Association?
More informationOutline. Introduction. Multiprocessor Systems on Chip. A MPSoC Example: Nexperia DVP. A New Paradigm: Network on Chip
Outline Modeling, simulation and optimization of Multi-Processor SoCs (MPSoCs) Università of Verona Dipartimento di Informatica MPSoCs: Multi-Processor Systems on Chip A simulation platform for a MPSoC
More informationComputer Organization
Computer Organization and Architecture Designing for Performance Ninth Edition William Stallings International Edition contributions by R. Mohan National Institute of Technology, Tiruchirappalli PEARSON
More informationSPARC64 VIIIfx: CPU for the K computer
SPARC64 VIIIfx: CPU for the K computer Toshio Yoshida Mikio Hondo Ryuji Kan Go Sugizaki SPARC64 VIIIfx, which was developed as a processor for the K computer, uses Fujitsu Semiconductor Ltd. s 45-nm CMOS
More informationMONITORING power consumption of a microprocessor
IEEE TRANSACTIONS ON CIRCUIT AND SYSTEMS-II, VOL. X, NO. Y, JANUARY XXXX 1 A Study on the use of Performance Counters to Estimate Power in Microprocessors Rance Rodrigues, Member, IEEE, Arunachalam Annamalai,
More informationInstruction scheduling
Instruction ordering Instruction scheduling Advanced Compiler Construction Michel Schinz 2015 05 21 When a compiler emits the instructions corresponding to a program, it imposes a total order on them.
More informationDriving force. What future software needs. Potential research topics
Improving Software Robustness and Efficiency Driving force Processor core clock speed reach practical limit ~4GHz (power issue) Percentage of sustainable # of active transistors decrease; Increase in #
More informationEmbedded Software development Process and Tools:
Embedded Software development Process and Tools: Lesson-2 Integrated Development Environment (IDE) 1 1. IDE 2 Consists of Simulators editors, compilers, assemblers, etc., IDE 3 emulators logic analyzers
More informationTYPES OF COMPUTERS AND THEIR PARTS MULTIPLE CHOICE QUESTIONS
MULTIPLE CHOICE QUESTIONS 1. What is a computer? a. A programmable electronic device that processes data via instructions to output information for future use. b. Raw facts and figures that has no meaning
More informationPROBLEMS #20,R0,R1 #$3A,R2,R4
506 CHAPTER 8 PIPELINING (Corrisponde al cap. 11 - Introduzione al pipelining) PROBLEMS 8.1 Consider the following sequence of instructions Mul And #20,R0,R1 #3,R2,R3 #$3A,R2,R4 R0,R2,R5 In all instructions,
More informationpicojava TM : A Hardware Implementation of the Java Virtual Machine
picojava TM : A Hardware Implementation of the Java Virtual Machine Marc Tremblay and Michael O Connor Sun Microelectronics Slide 1 The Java picojava Synergy Java s origins lie in improving the consumer
More informationGEDAE TM - A Graphical Programming and Autocode Generation Tool for Signal Processor Applications
GEDAE TM - A Graphical Programming and Autocode Generation Tool for Signal Processor Applications Harris Z. Zebrowitz Lockheed Martin Advanced Technology Laboratories 1 Federal Street Camden, NJ 08102
More informationSoftware Pipelining. for (i=1, i<100, i++) { x := A[i]; x := x+1; A[i] := x
Software Pipelining for (i=1, i
More informationIA-64 Application Developer s Architecture Guide
IA-64 Application Developer s Architecture Guide The IA-64 architecture was designed to overcome the performance limitations of today s architectures and provide maximum headroom for the future. To achieve
More information