FLIX: Fast Relief for Performance-Hungry Embedded Applications
Tensilica Inc. February

2005 Tensilica, Inc.

Contents

FLIX: Fast Relief for Performance-Hungry Embedded Applications
Four Applications in Search of Acceleration
Works on Large Code Blocks Too
Conclusion

Figures

Figure 1: Designer-defined FLIX instructions for the Xtensa LX processor can be either 32 or 64 bits wide and can encode several independent operations in one instruction word.
Figure 2: Bit Manipulator application performance versus gate-count.
Figure 3: H.264 deblocking filter application performance versus gate-count
Figure 4: MPEG-4 decoder application performance versus gate-count
Figure 5: SAD (sum of absolute differences) application performance versus gate-count

Tables

Table 1: Results for Bit Manipulator application
Table 2: Results for H.264 deblocking filter
Table 3: Results for MPEG-4 decoder
Table 4: Results for SAD (sum of absolute differences) engine
By Steven Leibson and John Massingham
Tensilica, Inc.

Microprocessors are great building blocks for all types of embedded systems because they're so flexible. Compile some code for them and they can decode and play digital audio, route IP network packets, or decompress video (just to name a very few applications). If microprocessors were infinitely fast, there'd never be a need to design any other hardware. However, microprocessors aren't infinitely fast. Often, they're not even fast enough to meet project goals.

One of the bottlenecks in general-purpose microprocessor designs that prevent them from meeting performance goals is their insistence on executing one operation at a time. Modern RISC processor designs solve this problem somewhat through pipelining, which allows several instructions to be in various pipeline stages simultaneously. However, most RISC designs remain single-instruction-issue machines.

To combat this bottleneck, processor designers sometimes develop designs that issue and execute multiple independent operations simultaneously. These processors are often called VLIW (very long instruction word) machines because they encode multiple independent operations into one long instruction word. Many classes of programs benefit from the increased instruction parallelism provided by VLIW processor designs. However, VLIW instruction words must often be hundreds of bits long to allow the encoding of many simultaneous independent operations. As a result, VLIW programs tend to be large, which is the usual price for encoding multiple, independent operations.

Tensilica has developed a VLIW-like technology called FLIX (flexible-length instruction extensions) for its Xtensa processor core family. This technology offers developers a way to realize the performance of VLIW instructions but without the usual VLIW code bloat.
And with Tensilica's XPRES Compiler, SOC designers don't have to become processor designers to employ this technology: the XPRES Compiler exploits this capability when it automatically generates processor configurations.

FLIX instruction formats can be either 32 or 64 bits wide and can encode many independent operations in designer-defined operation slots within the FLIX instruction word, as shown in Figure 1. Note that as the number of independent operations encoded in each FLIX instruction increases, the number of bits available in each operation slot decreases because the number of bits in the instruction is constant. With fewer available encoding bits, general-purpose instructions become more specialized because there are fewer bits available to specify source and destination operands and immediate values. This fact will be important to remember when analyzing the results of the tests made for this white paper.
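The encoding trade-off described above can be illustrated with a little arithmetic. The sketch below is only a rough model: it assumes the instruction word is divided evenly among the slots, which real FLIX formats need not do, and the `format_bits` parameter (bits reserved to identify the format) is a hypothetical knob, not a documented Tensilica value.

```python
def bits_per_slot(instruction_bits: int, num_slots: int, format_bits: int = 0) -> int:
    """Average encoding bits left for each operation slot.

    Assumption: the word is split evenly among the slots. format_bits is a
    hypothetical reservation for bits that select the instruction format.
    """
    if num_slots < 1:
        raise ValueError("an instruction needs at least one operation slot")
    return (instruction_bits - format_bits) // num_slots


# More slots in a fixed-width word leave fewer bits for each operation.
for slots in (2, 3, 4, 5):
    print(f"64-bit word, {slots} slots: about {bits_per_slot(64, slots)} bits per slot")
```

With fewer bits per slot there is less room for register specifiers and immediate values, which is why operations in crowded formats tend to become more specialized.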
[Figure 1 diagram: designer-defined FLIX instruction formats with a designer-defined number of operations. Examples: a 3-operation, 64-bit instruction format; a 5-operation, 64-bit instruction format; a 4-operation, 32-bit instruction format.]

These three examples show (from top to bottom) a 3-operation, 64-bit FLIX instruction; a 5-operation, 64-bit FLIX instruction; and a 4-operation, 32-bit FLIX instruction. Note that as the number of operations in a FLIX instruction increases, the number of bits available to encode each operation decreases.

Figure 1: Designer-defined FLIX instructions for the Xtensa LX processor can be either 32 or 64 bits wide and can encode several independent operations in one instruction word.

The Xtensa LX C/C++ compiler that is generated along with a FLIX-enhanced Xtensa LX processor core can exploit the operational parallelism provided by FLIX instructions. Thus FLIX instructions can be used selectively to improve application performance where needed while the processor's native 24- and 16-bit instructions can be used in other sections of the code where parallelism isn't needed. This flexibility allows the compiler to generate compact code in sections of the application where the high performance of multiple operations/clock isn't required.

Four Applications in Search of Acceleration

To demonstrate the ability of FLIX instructions to accelerate code performance, Tensilica used the XPRES Compiler to automatically analyze the C code of four different applications. The XPRES Compiler is an ideal tool for this sort of architectural exploration. In much less than an hour, the XPRES Compiler analyzes the instruction flow within a C program and then generates hundreds of thousands or millions of candidate processor architectures, all based on Tensilica's Xtensa LX processor core.
It then selects the best candidates based on silicon area (cost) and performance criteria, and presents the final candidates to the system designer, who selects a final architecture based on project goals.

For each of the four applications considered in this white paper, we used a baseline single-instruction-issue processor configuration with one load/store unit that executes the full Xtensa LX instruction set to generate baseline performance numbers. We then allowed the XPRES Compiler to generate eight additional processor configurations (four with one load/store unit and four with two load/store units) with FLIX enhancements. In each group, the XPRES Compiler generated versions of the Xtensa LX processor with 2, 3, 4, and 5 operation slots in the FLIX instruction word. The addition of a second load/store unit allows the Xtensa LX processor to emulate XY memory operation, a popular performance-enhancing feature found in many DSP processors. Addition of the second load/store unit requires the use of FLIX technology because each load/store unit requires its own operation field.

For these experiments, we restricted the XPRES Compiler to use only one of its three optimization methods: the addition of FLIX instructions. The XPRES Compiler can also create new kinds of instructions using operator fusion and SIMD vectorization techniques, and these additional optimizations are discussed in another Tensilica white paper: The XPRES Compiler triple-threat solution to code performance challenges. However, for this white paper, we constrained the XPRES Compiler to the use of FLIX optimizations only, and no new operations were created. The XPRES Compiler was only allowed to replicate baseline processor instructions in the additional FLIX operation slots if the additional parallelism could increase the application's performance. The results from these experiments show which of the four application programs benefit from the addition of an extra load/store unit and which benefit from the additional operation slots.

The four test applications include:

- Bit Manipulator, a simple multi-operation algorithm that takes two numbers, masks each, shifts each, and then adds them together in a loop
- An H.264 (video) deblocking filter
- A SAD (sum of absolute differences) algorithm for video motion estimation
- An MPEG-4 video decoder algorithm

Cycle counts for these four algorithms running on the baseline single-operation/clock Xtensa LX processor range from a few tens of thousands to hundreds of millions of cycles. Performance improvements from FLIX extensions range from cycle-count reductions of as much as 63% (the code runs nearly three times faster) to about 6%, which shows that not all code benefits from the availability of multiple simultaneous operations. Some code is stubbornly serial and cannot be accelerated through the operational parallelism of SIMD units or even big VLIW architectures.

It's very important to note that the use of the XPRES Compiler allowed this design-space exploration to occur very quickly. XPRES can examine a block of code and generate multiple processor designs in less than an hour. Even a 6% performance improvement could help many projects meet performance goals. However, tripling an algorithm's speed in the time it takes to go out for a meal is truly a remarkable result.
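To give a feel for the kind of loop the Bit Manipulator description implies, here is a minimal sketch. It is not Tensilica's benchmark source: the function names, the use of left shifts, and the mask and shift values are all assumptions made for illustration.

```python
def bit_manipulate(a: int, b: int, mask_a: int, mask_b: int,
                   shift_a: int, shift_b: int) -> int:
    """Mask each input, shift each masked value, then add the two results.

    Assumption: left shifts; the white paper does not state the direction.
    """
    return ((a & mask_a) << shift_a) + ((b & mask_b) << shift_b)


def run_kernel(xs, ys, mask_a=0x0F, mask_b=0x03, shift_a=4, shift_b=0):
    """Apply the mask/shift/add step to every pair of inputs in a loop."""
    return [bit_manipulate(a, b, mask_a, mask_b, shift_a, shift_b)
            for a, b in zip(xs, ys)]


print(run_kernel([0xFF, 0x12], [0x0F, 0x07]))
```

The mask-and-shift work on `a` does not depend on the mask-and-shift work on `b`, so the two chains are natural candidates for separate operation slots in the same instruction word.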
Tables 1 through 4 (which appear at the end of this white paper) list the raw performance numbers for the four applications listed above. These tables show the performance of the unaugmented Xtensa LX processor core, a very competent 32-bit embedded RISC processor even without application-specific extensions, and they list the performance numbers for the enhanced 2-, 3-, 4-, and 5-slot Xtensa LX processors created by the XPRES Compiler.
[Figure 2 plot: cycle count versus incremental gate count, with one curve for processors with one load/store unit and one curve for processors with two load/store units.]

A plot for an Xtensa LX base processor and processors with 2-, 3-, 4-, and 5-slot FLIX extensions and one or two load/store units.

Figure 2: Bit Manipulator application performance versus gate-count.

The Bit Manipulator application results appear in Figure 2 and are perhaps the easiest to understand. The dark line in Figure 2 shows performance results for the baseline Xtensa LX processor and for processors that have been enhanced with 2-, 3-, 4-, and 5-slot FLIX instructions. All of the operation slots in these processors are filled with instances of Xtensa LX baseline instructions, but the processors with FLIX enhancements can execute multiple operations during each clock cycle. All of the processors represented by the dark line in the graph in Figure 2 have one load/store unit. The lighter line in Figure 2 plots the same results, but all of the processors on that line (except for the baseline processor) have two load/store units. The graph plots the cycle count required to execute the application versus the number of additional gates required to add the multiple execution units and the second load/store unit.

As you can see from this graph, adding the ability to execute multiple simultaneous operations greatly accelerates the Bit Manipulator application. This application places load, mask, shift, and store operations within a loop, and the Xtensa C/C++ compiler is able to profitably use the additional parallel execution resources to accelerate loop performance. A processor with 3-slot FLIX instructions requires only about 37% of the execution cycles to execute this application code compared to the baseline Xtensa LX processor; it's nearly three times faster for this application.
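The step from "37% of the execution cycles" to "nearly three times faster" is simple arithmetic, assuming the clock rate is the same for both processors. The helper names below are illustrative only.

```python
def speedup_from_cycle_fraction(fraction: float) -> float:
    """Speedup when the new design needs `fraction` of the baseline cycles."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return 1.0 / fraction


def speedup_from_reduction(reduction: float) -> float:
    """Speedup when the cycle count is cut by `reduction` (0.63 means 63%)."""
    return speedup_from_cycle_fraction(1.0 - reduction)


# 37% of the baseline cycles is the same statement as a 63% reduction.
print(round(speedup_from_cycle_fraction(0.37), 2))
print(round(speedup_from_reduction(0.63), 2))
```

Both calls describe the same result from two directions: running in 37% of the cycles and cutting the cycle count by 63% each correspond to a speedup of roughly 2.7 times.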
The addition of a FLIX instruction format with three operation slots is a general sort of extension and can be usefully employed to accelerate a wide range of application code.

There are three additional factors to note with respect to Figure 2:

1. Essentially all of the benefit from parallel operations is realized in the processor with 3-slot FLIX instructions. More operation slots add more hardware parallelism, but the Xtensa C/C++ compiler is unable to exploit the additional available instruction-level parallelism for this particular application program.

2. Processors with a second load/store unit (results shown by the lighter line in Figure 2) are no faster than the same processor configuration with one load/store unit. This result indicates that the Bit Manipulator application is compute intensive and that the load/store unit is not a bottleneck in this instance, so the additional cost (in terms of silicon area) of the second load/store unit is not merited in this case.

3. The 5-slot version of the Xtensa LX processor actually exhibits slightly degraded performance compared to the 3- and 4-slot versions. This result shows that forcing the XPRES Compiler to add more than the required number of operation slots can result in a loss of operation efficiency by reducing the number of operation-encoding bits available to each operation slot. Having more encoding bits available per operation allows the XPRES Compiler to create more comprehensive operations so that the C compiler needs fewer of these operations to execute a task.

Works on Large Code Blocks Too

Figure 3 illustrates these same trends but for a much larger application: an H.264 deblocking filter. This application program requires nearly 200 million cycles to run on an unaugmented Xtensa LX processor. The XPRES Compiler achieves about a 6% performance improvement (eliminating more than 10 million execution cycles from the application in a few minutes) by adding a FLIX instruction format with two operation slots. Based on the results, this particular application appears to be more limited by data movement than by a lack of computational resources because additional operation slots do not appear to improve performance. In fact, adding more than two operation slots appears to slightly degrade performance compared to the 2-slot result. However, the addition of a second load/store unit nearly doubles the achieved performance improvement, as shown by the lighter line in the graph in Figure 3. This result differs from the one observed for the Bit Manipulator application, demonstrating that different applications really do benefit from different processor optimizations.
[Figure 3 plot: cycle count versus incremental gate count, with one curve for processors with one load/store unit and one curve for processors with two load/store units.]

A plot for an Xtensa LX base processor and processors with 2-, 3-, 4-, and 5-slot FLIX extensions and one or two load/store units.

Figure 3: H.264 deblocking filter application performance versus gate-count

Another video application, an MPEG-4 video decoder, exhibits the same sort of results as the H.264 deblocking filter. Results for this application appear in Figure 4. The pattern of the results for the processors with one load/store unit is very similar to the results obtained for the H.264 deblocking filter, but the MPEG-4 video decoder application code benefits more from the optimizations of the XPRES Compiler, which achieves a 23% reduction in cycle count by adding a second operation slot to the Xtensa LX processor. In addition, a second load/store unit further increases performance, and the processor with two load/store units can gainfully exploit three operation slots for yet more performance.

Again, it's important to remember that all of this design-space exploration consumes only a few minutes because it's automated by the XPRES Compiler. Normally, a design team would not be able to conduct this sort of extensive architectural research because hand-designed processor variants require months of design time, not minutes.
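One simple way to reason about results like those plotted in Figures 2 through 4 is to keep only the configurations that are not beaten on both gate count and cycle count by some other configuration (a Pareto filter). The sketch below is a generic illustration of that idea, not a description of how the XPRES Compiler selects candidates, and the gate and cycle numbers in it are made up for the example rather than taken from the tables.

```python
from typing import List, NamedTuple


class Config(NamedTuple):
    name: str
    gates: int   # incremental gate count (hypothetical values below)
    cycles: int  # cycle count for the application (hypothetical values below)


def pareto_front(configs: List[Config]) -> List[Config]:
    """Return configurations not dominated in both gates and cycles."""
    front = []
    for c in configs:
        dominated = any(
            o.gates <= c.gates and o.cycles <= c.cycles
            and (o.gates < c.gates or o.cycles < c.cycles)
            for o in configs
        )
        if not dominated:
            front.append(c)
    return sorted(front, key=lambda c: c.gates)


# Hypothetical data shaped like the trends in the text: gains flatten after
# three slots, and the 5-slot design is slightly slower despite more gates.
candidates = [
    Config("baseline", 0, 1000),
    Config("2-slot", 100, 800),
    Config("3-slot", 200, 600),
    Config("4-slot", 250, 600),
    Config("5-slot", 300, 620),
]
print([c.name for c in pareto_front(candidates)])
```

With these made-up numbers, the 4- and 5-slot designs drop out because the 3-slot design is at least as fast with fewer gates, which mirrors the observation that extra slots do not always pay for themselves.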
[Figure 4 plot: cycle count versus incremental gate count, with one curve for processors with one load/store unit and one curve for processors with two load/store units.]

A plot for an Xtensa LX base processor and processors with 2-, 3-, 4-, and 5-slot FLIX extensions and one or two load/store units.

Figure 4: MPEG-4 decoder application performance versus gate-count

The fourth application, which is also from the video domain, is a SAD (sum of absolute differences) engine used for video motion estimation. Results achieved for this application (shown in Figure 5) are very similar to those achieved for the Bit Manipulator application, although the SAD application consumes about 20 times the number of cycles. The addition of 3-slot FLIX instructions cuts the cycle count by about 63%, and the addition of a second load/store unit provides no benefit to the SAD engine code.
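For readers unfamiliar with the SAD computation, a minimal sketch follows. It shows the general technique only, not Tensilica's benchmark code; representing each block as a flat sequence of pixel values is an assumption made to keep the example short.

```python
from typing import Sequence


def sad(block_a: Sequence[int], block_b: Sequence[int]) -> int:
    """Sum of absolute differences between two equally sized pixel blocks."""
    if len(block_a) != len(block_b):
        raise ValueError("blocks must contain the same number of pixels")
    return sum(abs(a - b) for a, b in zip(block_a, block_b))


# A lower SAD means the candidate block is a closer match to the reference,
# which is how motion estimation ranks candidate positions.
reference = [1, 5, 9]
candidate = [4, 5, 2]
print(sad(reference, candidate))
```

Each pixel pair is processed independently before the final accumulation, so the loop body offers the kind of operation-level parallelism that multi-slot instructions can exploit.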
[Figure 5 plot: cycle count versus incremental gate count, with one curve for processors with one load/store unit and one curve for processors with two load/store units.]

A plot for an Xtensa LX base processor and processors with 2-, 3-, 4-, and 5-slot FLIX extensions and one or two load/store units.

Figure 5: SAD (sum of absolute differences) application performance versus gate-count

Conclusion

Use of the XPRES Compiler in this white paper was artificially constrained, but the results are real and demonstrate both the ability of FLIX multi-operation instructions to accelerate various applications and the ability of the XPRES Compiler to rapidly explore processor configurations and extensions that can accelerate the execution of critical code blocks in a system. The XPRES Compiler results for each application discussed in this white paper consumed far less than an hour per run, resulting in performance enhancements that range from some performance improvement to a tripling of code execution speed. The automated XPRES Compiler allows the developer to discover how much performance benefit FLIX instruction extensions can provide to an application in the time it takes to eat lunch.

Some applications benefit only mildly from FLIX-type processor enhancement and others benefit substantially. Examination of results from experiments conducted with the XPRES Compiler by Tensilica customers has prompted design teams to restructure some application code. This code restructuring, taking only a day or two, has substantially boosted application performance in some cases.
Table 1: Results for the Bit Manipulator application
Columns: Number of instruction slots; One Load/Store Unit (Cycles, Gates); Two Load/Store Units (Cycles, Gates)
1 slot 3,746 NA NA
2-slot FLIX 5,384 2,263 5,384 35,284
3-slot FLIX ,29 24,662 ,287 48,46
4-slot FLIX ,29 24,73 ,287 49,43
5-slot FLIX ,287 26,366 2,34 49,946

Table 2: Results for the H.264 deblocking filter
Columns: Number of instruction slots; One Load/Store Unit (Cycles, Gates); Two Load/Store Units (Cycles, Gates)
1 slot 93,788,7 NA NA
2-slot FLIX 8,4,93 2, 74,638,399 36,92
3-slot FLIX 82,552,788 2,55 74,9,533 49,26
4-slot FLIX 82,679,72 2,684 77,978,299 48,75
5-slot FLIX 82,988,82 2,765 79,85,489 48,837

Table 3: Results for the MPEG-4 decoder
Columns: Number of instruction slots; One Load/Store Unit (Cycles, Gates); Two Load/Store Units (Cycles, Gates)
1 slot 22,8,59 NA NA
2-slot FLIX 93,684,776 4,74 9,69,824 38,833
3-slot FLIX ,669,84 27,992 85,259,477 5,56
4-slot FLIX ,7,48 28,9 9,647,42 5,549
5-slot FLIX ,36,58 28,29 87,847,53 52,439

Table 4: Results for the SAD (sum of absolute differences) engine
Columns: Number of instruction slots; One Load/Store Unit (Cycles, Gates); Two Load/Store Units (Cycles, Gates)
1 slot 729,322 NA NA
2-slot FLIX 369,37 2,2 369,37 35,23
3-slot FLIX 287,943 22, ,943 45,76
4-slot FLIX 27,43 22,747 27,43 45,862
5-slot FLIX 287,993 24,76 287,993 47,89
Lecture Handout Computer Architecture Lecture No. 2 Reading Material Vincent P. Heuring&Harry F. Jordan Chapter 2,Chapter3 Computer Systems Design and Architecture 2.1, 2.2, 3.2 Summary 1) A taxonomy of
More informationPROBLEMS. which was discussed in Section 1.6.3.
22 CHAPTER 1 BASIC STRUCTURE OF COMPUTERS (Corrisponde al cap. 1 - Introduzione al calcolatore) PROBLEMS 1.1 List the steps needed to execute the machine instruction LOCA,R0 in terms of transfers between
More informationThe Evolution of CCD Clock Sequencers at MIT: Looking to the Future through History
The Evolution of CCD Clock Sequencers at MIT: Looking to the Future through History John P. Doty, Noqsi Aerospace, Ltd. This work is Copyright 2007 Noqsi Aerospace, Ltd. This work is licensed under the
More informationComputer Architecture TDTS10
why parallelism? Performance gain from increasing clock frequency is no longer an option. Outline Computer Architecture TDTS10 Superscalar Processors Very Long Instruction Word Processors Parallel computers
More informationRouter Architectures
Router Architectures An overview of router architectures. Introduction What is a Packet Switch? Basic Architectural Components Some Example Packet Switches The Evolution of IP Routers 2 1 Router Components
More informationComputer Organization and Components
Computer Organization and Components IS5, fall 25 Lecture : Pipelined Processors ssociate Professor, KTH Royal Institute of Technology ssistant Research ngineer, University of California, Berkeley Slides
More informationPexip Speeds Videoconferencing with Intel Parallel Studio XE
1 Pexip Speeds Videoconferencing with Intel Parallel Studio XE by Stephen Blair-Chappell, Technical Consulting Engineer, Intel Over the last 18 months, Pexip s software engineers have been optimizing Pexip
More information5Get rid of hackers and viruses for