Radar Processing: FPGAs or GPUs? White Paper

While general-purpose graphics processing units (GPGPUs) offer high rates of peak floating-point operations per second (FLOPs), FPGAs now offer competing levels of floating-point processing. Moreover, Altera FPGAs now support OpenCL, a leading programming language used with GPUs.

Introduction

FPGAs and CPUs have long been an integral part of radar signal processing, with FPGAs traditionally used for front-end processing and CPUs for back-end processing. As radar systems increase in capability and complexity, processing requirements have increased dramatically. While FPGAs have kept pace with the increasing processing capabilities and throughput, CPUs have struggled to provide the signal processing performance required in next-generation radar. This struggle has led to increasing use of CPU accelerators, such as graphics processing units (GPUs), to support the heavy processing loads.

This white paper compares FPGA and GPU floating-point performance and design flows. In the last few years, GPUs have moved beyond graphics to become powerful floating-point processing platforms, known as GPGPUs, that offer high rates of peak FLOPs. FPGAs, traditionally used for fixed-point digital signal processing (DSP), now offer competing levels of floating-point processing, making them candidates for back-end radar processing acceleration. On the FPGA front, a number of verifiable floating-point benchmarks have been published at both 40 nm and 28 nm. Altera's next-generation high-performance FPGAs will support a minimum of 5 TFLOPs of performance by leveraging Intel's 14 nm Tri-Gate process, and up to 100 GFLOPs/W can be expected using this advanced semiconductor process. Moreover, Altera FPGAs now support OpenCL, a leading programming language used with GPUs.

Peak GFLOPs Ratings

Current FPGAs deliver 1+ peak TFLOPs (1), while AMD's and Nvidia's latest GPUs are rated even higher, at up to nearly 4 TFLOPs.
However, a peak GFLOPs or TFLOPs rating provides little information on the performance of a given device in a particular application; it merely indicates the total number of theoretical floating-point additions or multiplies that can be performed per second. The analysis in this paper indicates that FPGAs can, in many cases, exceed the throughput of GPUs on the algorithms and data sizes common in radar applications.
Algorithm Benchmarking

A common algorithm of moderate complexity is the fast Fourier transform (FFT). Because most radar systems perform much of their processing in the frequency domain, the FFT algorithm is used very heavily. For example, a 4,096-point FFT has been implemented using single-precision floating-point processing. It is able to input and output four complex samples per clock cycle. Each single FFT core can run at over 80 GFLOPs, and a large 28 nm FPGA has the resources to implement seven such cores. However, as Figure 1 indicates, the FFT algorithm on this FPGA delivers nearly 400 GFLOPs. This result is based on a push-button OpenCL compilation, with no FPGA expertise required. Using LogicLock and Design Space Explorer (DSE) optimizations, the seven-core design can approach the fMAX of the single-core design, boosting it to over 500 GFLOPs, with over 10 GFLOPs/W using 28 nm FPGAs.

Figure 1. Stratix V 5SGSD8 FPGA Floating-Point FFT Performance (4,096-point complex data; radix-2^2 feedforward architecture with four parallel lanes; the table reports ALMs, registers, DSP blocks, M20K blocks, fMAX (MHz), and GFLOPs for a single FFT core, a push-button seven-core design, and a fully optimized seven-core design)

This GFLOPs/W result is much higher than achievable CPU or GPU power efficiency. In terms of GPU comparisons, the GPU is not efficient at these FFT lengths, so no benchmarks are presented. The GPU becomes efficient, and can provide useful acceleration to a CPU, at FFT lengths of several hundred thousand points. However, shorter FFT lengths are prevalent in radar processing, where FFT lengths of 512 to 8,192 points are the norm.

In summary, the useful GFLOPs are often a fraction of the peak or theoretical GFLOPs. For this reason, a better approach is to compare performance on an algorithm that reasonably represents the characteristics of typical applications.
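The FFT throughput figures above follow from simple arithmetic. As a sketch, assuming the conventional 5 N log2(N) flop count for an N-point complex FFT and an illustrative fMAX of 350 MHz (the clock rate is not stated in the text), the sustained rate of a streaming core that accepts four complex samples per clock can be estimated as:

```python
import math

def fft_gflops(n, samples_per_clock, fmax_hz):
    """Sustained GFLOPs of a streaming FFT core.

    Assumes the usual 5*N*log2(N) flop count for an N-point complex
    FFT. A core that accepts samples_per_clock samples each cycle
    finishes one N-point transform every N/samples_per_clock cycles.
    """
    flops_per_fft = 5 * n * math.log2(n)
    ffts_per_sec = fmax_hz * samples_per_clock / n
    return flops_per_fft * ffts_per_sec / 1e9

# 4,096-point FFT, 4 complex samples/clock, assumed ~350 MHz fMAX
print(fft_gflops(4096, 4, 350e6))  # 84.0, consistent with "over 80 GFLOPs"
```

The per-core estimate depends on N only through the log2(N) factor, which is why a modest clock rate still yields tens of GFLOPs per core.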
As the complexity of the benchmarked algorithm increases, it becomes more representative of actual radar system performance. Rather than relying upon a vendor's peak GFLOPs ratings to drive processing technology decisions, an alternative is to rely upon third-party evaluations using examples of sufficient complexity. A common algorithm in space-time adaptive processing (STAP) radar is the Cholesky decomposition. This algorithm is often used in linear algebra for the efficient solving of multiple equations, and can be applied to correlation matrices.
The Cholesky algorithm has high numerical complexity and almost always requires floating-point representation for reasonable results. The computations required are proportional to N^3, where N is the matrix dimension, so the processing requirements are often demanding. As radar systems normally operate in real time, high throughput is a requirement. The result depends both on the matrix size and on the required matrix processing throughput, but can often exceed 100 GFLOPs.

Table 1 shows benchmarking results based on an Nvidia GPU rated at 1.35 TFLOPs, using various libraries, as well as a Xilinx Virtex-6 XC6VSX475T, an FPGA optimized for DSP processing with a density of 475K logic cells. These devices are similar in density to the Altera FPGA used for the Cholesky benchmarks. LAPACK and MAGMA are commercially supplied libraries, while the GPU GFLOPs column refers to the OpenCL implementation developed at the University of Tennessee (2); the latter is clearly more optimized at smaller matrix sizes.

Table 1. GPU and Xilinx FPGA Cholesky Benchmarks (2) (columns: matrix size of 512, 1,024, or 2,048 in single (SP) and double (DP) precision; LAPACK GFLOPs; MAGMA GFLOPs; GPU GFLOPs; FPGA GFLOPs)

A midsize Altera Stratix V FPGA (460K logic elements (LEs)) was benchmarked by Altera using the Cholesky algorithm in single-precision floating-point processing. As shown in Table 2, the Stratix V FPGA performance on the Cholesky algorithm is much higher than the Xilinx results. The Altera benchmarks also include the QR decomposition, another matrix processing algorithm of reasonable complexity. Both Cholesky and QRD are available as parameterizable cores from Altera.

Table 2. Altera FPGA Cholesky and QR Benchmarks (complex, single precision; columns: matrix size, vector size, fMAX (MHz), GFLOPs for the Cholesky and QR algorithms)
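To make the N^3 cost concrete, the following is a minimal pure-Python sketch of an unoptimized, real-valued Cholesky factorization. The benchmarked FPGA and GPU implementations are vectorized complex single-precision designs, so this is illustrative only; the point is that the inner loops perform on the order of N^3/3 multiply-adds:

```python
import math

def cholesky(A):
    """Lower-triangular Cholesky factor L of a symmetric
    positive-definite matrix A, so that A = L * L^T.

    The triple loop performs roughly N**3 / 3 multiply-adds, which is
    why throughput demands grow so quickly with matrix dimension.
    """
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

A = [[25.0, 15.0, -5.0],
     [15.0, 18.0,  0.0],
     [-5.0,  0.0, 11.0]]
for row in cholesky(A):
    print(row)
```

For this classic example the factor is L = [[5, 0, 0], [3, 3, 0], [-1, 1, 3]], which can be verified by multiplying L by its transpose.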
It should be noted that the matrix sizes of the benchmarks are not the same. The University of Tennessee results start at matrix sizes of [ ], while the Altera benchmarks go up to [360x360] for Cholesky and [450x450] for QRD. The reason is that GPUs are very inefficient at smaller matrix sizes, so there is little incentive to use them to accelerate a CPU in these cases. In contrast, FPGAs can operate efficiently on much smaller matrices. This efficiency is critical, as radar systems need fairly high throughput, in the thousands of matrices per second, so smaller sizes are used, even when that requires tiling a larger matrix into smaller ones for processing.

In addition, the Altera benchmarks are per Cholesky core. Each parameterizable Cholesky core allows selection of matrix size, vector size, and channel count. The vector size roughly determines the FPGA resources. The larger [ ] matrix size uses a larger vector size, allowing for a single core in this FPGA at 91 GFLOPs. The smaller [60x60] matrix uses fewer resources, so two cores could be implemented, for a total of 2 x 42 = 84 GFLOPs. The smallest [30x30] matrix size permits three cores, for a total of 3 x 25 = 75 GFLOPs.

FPGAs seem to be much better suited for problems with smaller data sizes, which is the case in many radar systems. The reduced efficiency of GPUs arises because the computational load increases as N^3 while the data I/O increases only as N^2, so the I/O bottlenecks of the GPU become less of a problem only as the dataset grows. In addition, as matrix sizes increase, the matrix throughput per second drops dramatically due to the increased processing per matrix. At some point, the throughput becomes too low to be usable for the real-time requirements of radar systems. For FFTs, the computational load increases as N log2 N, whereas the data I/O increases as N. Again, only at very large data sizes does the GPU become an efficient computational engine.
By contrast, the FPGA is an efficient computational engine at all data sizes, and is better suited to most radar applications, where FFT sizes are modest but throughput is at a premium.

GPU and FPGA Design Methodology

GPUs are programmed using either Nvidia's proprietary CUDA language or the open-standard OpenCL language. These languages are very similar in capability, with the biggest difference being that CUDA can be used only on Nvidia GPUs. FPGAs are typically programmed using the HDL languages Verilog or VHDL. Neither of these languages is well suited to supporting floating-point designs, although the latest versions do incorporate definition, though not necessarily synthesis, of floating-point numbers. For example, in SystemVerilog, a shortreal variable is analogous to an IEEE single (float), and a real to an IEEE double.

DSP Builder Advanced Blockset

Synthesis of floating-point datapaths into an FPGA using traditional methods is very inefficient, as shown by the low performance of Xilinx FPGAs on the Cholesky algorithm, implemented using Xilinx Floating-Point Core Gen functions. However, Altera offers two alternatives. The first is to use DSP Builder Advanced Blockset, a MathWorks-based design entry. This tool supports both fixed- and floating-point numbers, and offers seven different precisions of floating-point processing, including IEEE half-, single-, and double-precision implementations. It
also supports vectorization, which is needed for efficient implementation of linear algebra. Most important is its ability to map floating-point circuits efficiently onto today's fixed-point FPGA architectures, as demonstrated by benchmarks supporting close to 100 GFLOPs on the Cholesky algorithm in a midsize 28 nm FPGA. By comparison, the Cholesky implementation on a similarly sized Xilinx FPGA without this synthesis capability shows only 20 GFLOPs of performance on the same algorithm (2).

OpenCL for FPGAs

OpenCL is familiar to GPU programmers. An OpenCL compiler (3) for FPGAs means that OpenCL code written for AMD or Nvidia GPUs can be compiled onto an FPGA. In addition, an OpenCL compiler from Altera enables GPU programmers to use FPGAs without the necessity of developing the typical FPGA design skill set.

Using OpenCL with FPGAs offers several key advantages over GPUs. First, GPUs tend to be I/O limited: all input and output data must be passed by the host CPU through the PCI Express (PCIe) interface, and the resulting delays can stall the GPU processing engines, resulting in lower performance.

OpenCL Extensions for FPGAs

FPGAs are well known for their wide variety of high-bandwidth I/O capabilities. These capabilities allow data to stream in and out of the FPGA over Gigabit Ethernet (GbE), Serial RapidIO (SRIO), or directly from analog-to-digital converters (ADCs) and digital-to-analog converters (DACs). Altera has defined a vendor-specific extension of the OpenCL standard to support streaming operations. This extension is a critical feature in radar systems, as it allows data to move directly from the fixed-point front-end beamforming and digital downconversion processing to the floating-point processing stages for pulse compression, Doppler, STAP, moving target indicator (MTI), and other functions shown in Figure 2.
In this way, the data flow avoids the CPU bottleneck that arises when data must be staged through the host before being passed to a GPU accelerator, so the overall processing latency is reduced.
Figure 2. Generic Radar Signal Processing Diagram

FPGAs can also offer much lower processing latency than a GPU, even independent of I/O bottlenecks. It is well known that GPUs must operate on many thousands of threads to perform efficiently, due to the extremely long latencies to and from memory and even between the many processing cores of the GPU. In effect, the GPU must juggle many, many tasks to keep its processing cores from stalling as they await data, which results in very long latency for any given task.

The FPGA uses a coarse-grained parallel architecture instead. It creates multiple optimized and parallel datapaths, each of which outputs one result per clock cycle. The number of datapath instances depends upon the FPGA resources, but is typically much smaller than the number of GPU cores; however, each datapath instance has a much higher throughput than a GPU core. The primary benefit of this approach is low latency, a critical performance advantage in many applications.

Another advantage of FPGAs is their much lower power consumption, resulting in dramatically higher GFLOPs/W. FPGA power measurements using development boards show 5 to 6 GFLOPs/W for algorithms such as Cholesky and QRD, and about 10 GFLOPs/W for simpler algorithms such as FFTs. GPU energy-efficiency measurements are much harder to find, but using the GPU performance of 50 GFLOPs for Cholesky and a typical power consumption of 200 W gives 0.25 GFLOPs/W, twenty times more power consumed per useful FLOP.

For airborne or vehicle-mounted radar, size, weight, and power (SWaP) considerations are paramount. One can easily imagine radar-bearing drones with tens of TFLOPs in future systems, as the amount of processing power available is correlated with the achievable resolution and coverage of a modern radar system.
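The power-efficiency comparison is simple arithmetic, reproduced here with the figures from the text (50 GFLOPs and 200 W for the GPU; the low end of the 5 to 6 GFLOPs/W FPGA measurement):

```python
# Figures from the text: a GPU at ~50 GFLOPs on Cholesky drawing a
# typical 200 W, versus FPGA development-board measurements of
# ~5 GFLOPs/W on the same class of algorithm.
gpu_gflops, gpu_watts = 50.0, 200.0
gpu_eff = gpu_gflops / gpu_watts   # GFLOPs/W for the GPU
fpga_eff = 5.0                     # GFLOPs/W for the FPGA (low end)
print(gpu_eff, fpga_eff / gpu_eff) # 0.25, 20.0 (a 20x advantage)
```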
Fused Datapath

Both OpenCL and DSP Builder rely on a technique known as fused datapath (Figure 3), in which floating-point processing is implemented in a fashion that dramatically reduces the number of barrel-shifting circuits required, which in turn allows large-scale, high-performance floating-point designs to be built using FPGAs.

Figure 3. Fused Datapath Implementation of Floating-Point Processing (contrasts a DSP processor, which normalizes, rounds, and denormalizes at each IEEE 754 operation at low performance and high resource cost, with the Altera fused datapath, which uses slightly larger, wider operands with a true floating-point mantissa, removes the interim normalization steps, and achieves high performance with low resources)

To reduce the frequency of barrel shifting, the synthesis process looks for opportunities where larger mantissa widths can offset the need for frequent normalization and denormalization. The available hard multipliers allow significantly larger multiplier widths than the 23 bits required by single-precision implementations, and can also be combined to build multipliers larger than the 52 bits required for double-precision implementations. The FPGA logic is already optimized for implementation of large fixed-point adder circuits, with built-in carry look-ahead circuits.
Figure 4. Vector Dot Product Optimization

Where normalization and denormalization are required, an alternative implementation that avoids low performance and excessive routing is to use multipliers to perform the shifts. For a 24-bit single-precision mantissa (including the implied leading bit), the multiplier shifts the input by multiplying by 2^n. Again, the availability of hardened multipliers allows for extended mantissa sizes in single-precision implementations, and the multipliers can be combined to construct the sizes needed for double-precision implementations.

A vector dot product is the underlying operation consuming the bulk of the FLOPs used in many linear algebra algorithms. A single-precision implementation of a length-64 vector dot product would require 64 floating-point multipliers, followed by an adder tree made up of 63 floating-point adders. Such an implementation would require many barrel-shifting circuits. Instead, the outputs of the 64 multipliers can be denormalized to a common exponent, the largest of the 64 exponents. These 64 outputs can then be summed using a fixed-point adder circuit, with a single normalization performed at the end of the adder tree. This localized block floating-point processing, shown in Figure 4, dispenses with all the interim normalization and denormalization required at each individual adder. Even with IEEE 754 floating-point processing, the number with the largest exponent determines the exponent at the end, so this change merely moves the exponent adjustment to an earlier point in the calculation.
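The steps above can be sketched in software. This is a minimal Python model of the localized block floating-point dot product, emulating the hardware idea with integers; the mantissa and guard-bit widths are illustrative assumptions, not the widths of any particular Altera core:

```python
import math

def fused_dot(a, b, guard_bits=8):
    """Dot product in the spirit of the fused-datapath optimization:
    multiply, denormalize every product to one common exponent (the
    largest), sum with a single fixed-point (integer) adder tree, and
    normalize once at the end."""
    MANT = 24  # single-precision mantissa width, incl. the implied bit
    prods = [x * y for x, y in zip(a, b)]
    # Common exponent: the largest exponent among the nonzero products.
    exps = [math.frexp(p)[1] for p in prods if p != 0.0]
    if not exps:
        return 0.0
    e_max = max(exps)
    scale = 2 ** (MANT + guard_bits)
    # Denormalize: each product becomes an integer mantissa aligned
    # to the common exponent (small products lose low-order bits).
    ints = [round(p / 2.0 ** e_max * scale) for p in prods]
    total = sum(ints)  # plain integer addition: no barrel shifters
    return total / scale * 2.0 ** e_max  # single final normalization

print(fused_dot([1.5, -2.25, 3.0], [2.0, 4.0, 0.5]))  # -4.5
```

The guard bits model the extra mantissa width the text describes: they absorb the precision lost when small products are aligned to the largest exponent.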
However, when performing signal processing, the best results are obtained by carrying as much precision as possible and truncating the results only at the end of the calculation. The approach here compensates for the suboptimal early denormalization by carrying extra mantissa bits over and above those required by single-precision floating-point processing, usually from 27 to 36 bits. The mantissa extension is carried through the floating-point multipliers, eliminating the need to normalize the product at each step.

This approach can also produce one result per clock cycle. GPU architectures can perform all the floating-point multiplications in parallel, but cannot efficiently perform the additions in parallel, because different cores must pass data through local memory to communicate with each other, and so lack the connectivity flexibility of an FPGA architecture.

The fused datapath approach generates results that are more accurate than conventional IEEE 754 floating-point results, as shown in Table 3.

Table 3. Cholesky Decomposition Accuracy (Single Precision) (columns: complex input matrix size (n x n), vector size, error with MATLAB using a desktop computer, and error with DSP Builder Advanced Blockset generated RTL, for matrix sizes including 360 x 360)

These results were obtained by implementing large matrix inversions using the Cholesky decomposition algorithm. The same algorithm was implemented in three different ways: in MATLAB/Simulink with IEEE 754 single-precision floating-point processing; in RTL single-precision floating-point processing using the fused datapath approach; and in MATLAB with double-precision floating-point processing. The double-precision implementation is about one billion times (10^9) more precise than the single-precision implementation. Comparing the MATLAB single-precision errors and the RTL single-precision errors against the MATLAB double-precision results confirms the integrity of the fused datapath approach.
Errors are reported both as the normalized error across all the complex elements in the output matrix and as the matrix element with the maximum error. The overall error, or norm, is calculated using the Frobenius norm:

||E||_F = sqrt( sum_{i=1..n} sum_{j=1..n} |e_ij|^2 )

Because the norm includes the errors in all elements, it is often much larger than the individual errors.
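As a quick sketch, the Frobenius norm can be computed directly from the definition. The example error matrix below is arbitrary, chosen only to show that the norm exceeds any single element, as the remark above observes:

```python
import math

def frobenius_norm(E):
    """||E||_F: square root of the sum of squared magnitudes of all
    elements. abs() makes this work for complex error matrices too."""
    return math.sqrt(sum(abs(e) ** 2 for row in E for e in row))

# Illustrative error matrix (values are arbitrary)
E = [[3e-8, 4e-8],
     [0.0,  0.0]]
print(frobenius_norm(E))  # 5e-8 up to rounding: larger than any element
```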
In addition, both the DSP Builder Advanced Blockset and OpenCL tool flows transparently support, and optimize current designs for, next-generation FPGA architectures. Up to 100 peak GFLOPs/W can be expected, due to both architectural and process technology innovations.

Conclusion

High-performance radar systems now have new processing platform options. In addition to much improved SWaP, FPGAs can provide lower latency and higher GFLOPs than processor-based solutions. These advantages will be even more dramatic with the introduction of next-generation FPGAs optimized for high-performance computing. Altera's OpenCL compiler provides a near-seamless path for GPU programmers to evaluate the merits of this new processing architecture. Altera's OpenCL is 1.2 compliant, with a full set of math library support, and it abstracts away the traditional FPGA challenges of timing closure, DDR memory management, and PCIe host processor interfacing. For non-GPU developers, Altera offers the DSP Builder Advanced Blockset tool flow, which allows developers to build high-fMAX fixed- or floating-point DSP designs while retaining the advantages of a MathWorks-based simulation and development environment. This product has been used for years by radar developers using FPGAs, enabling a more productive workflow and simulation while offering the same fMAX performance as hand-coded HDL.

Further Information

1. White Paper: Achieving One TeraFLOPS with 28-nm FPGAs
2. Depeng Yang, Junqing Sun, JunKu Lee, Getao Liang, David D. Jenkins, Gregory D. Peterson, and Husheng Li, "Performance Comparison of Cholesky Decomposition on GPUs and FPGAs," Department of Electrical Engineering and Computer Science, University of Tennessee: saahpc.ncsa.illinois.edu/10/papers/paper_45.pdf
3. White Paper: Implementing FPGA Design with the OpenCL Standard

Acknowledgements

Michael Parker, DSP Product Planning Manager, Altera Corporation
Document Revision History

Table 4 shows the revision history for this document.

Table 4. Document Revision History (Version 2.0: minor text edits; updated Table 2 and Table 3. April: initial release.)
GIGA seminar 11.01.2010 MIMO detector algorithms and their implementations for LTE/LTE-A Markus Myllylä and Johanna Ketonen 11.01.2010 2 Outline Introduction System model Detection in a MIMO-OFDM system
The Evolution of Computer Graphics Tony Tamasi SVP, Content & Technology, NVIDIA Graphics Make great images intricate shapes complex optical effects seamless motion Make them fast invent clever techniques
Networking Remote-Controlled Moving Image Monitoring System First Prize Networking Remote-Controlled Moving Image Monitoring System Institution: Participants: Instructor: National Chung Hsing University
Digital Hardware Design Decisions and Trade-offs for Software Radio Systems John Patrick Farrell This thesis is submitted to the Faculty of Virginia Polytechnic Institute and State University in partial
All Programmable Logic Hans-Joachim Gelke Institute of Embedded Systems Institute of Embedded Systems 31 Assistants 10 Professors 7 Technical Employees 2 Secretaries www.ines.zhaw.ch Research: Education:
Quartus II Software Questions & Answers Following are the most frequently asked questions about the new features in Altera s Quartus II design software. PowerPlay Power Analysis & Optimization Technology
OBJECTIVE ANALYSIS WHITE PAPER MATCH ATCHING FLASH TO THE PROCESSOR Why Multithreading Requires Parallelized Flash T he computing community is at an important juncture: flash memory is now generally accepted
Product Development Flow Including Model- Based Design and System-Level Functional Verification 2006 The MathWorks, Inc. Ascension Vizinho-Coutry, email@example.com Agenda Introduction to Model-Based-Design
Computer Graphics Hardware An Overview Graphics System Monitor Input devices CPU/Memory GPU Raster Graphics System Raster: An array of picture elements Based on raster-scan TV technology The screen (and
High Performance Matrix Inversion on a Multi-core Platform with Several GPUs Pablo Ezzatti 1, Enrique S. Quintana-Ortí 2 and Alfredo Remón 2 1 Centro de Cálculo-Instituto de Computación, Univ. de la República
HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville CS 594 Lecture Notes March 4, 2015 1/18 Outline! Introduction - Hardware
MAX 10 Analog to Digital Converter User Guide Subscribe Last updated for Quartus Prime Design Suite: 16.0 UG-M10ADC 101 Innovation Drive San Jose, CA 95134 www.altera.com TOC-2 Contents MAX 10 Analog to
FPGA-based MapReduce Framework for Machine Learning Bo WANG 1, Yi SHAN 1, Jing YAN 2, Yu WANG 1, Ningyi XU 2, Huangzhong YANG 1 1 Department of Electronic Engineering Tsinghua University, Beijing, China
Nutaq PicoDigitizer 125-Series 16 or 32 Channels, 125 MSPS, FPGA-Based DAQ Solution PRODUCT SHEET QUEBEC I MONTREAL I N E W YO R K I nutaq.com Nutaq PicoDigitizer 125-Series The PicoDigitizer 125-Series
Moving Beyond CPUs in the Cloud: Will FPGAs Sink or Swim? Successful FPGA datacenter usage at scale will require differentiated capability, programming ease, and scalable implementation models Executive
Lattice QCD Performance on Multi core Linux Servers Yang Suli * Department of Physics, Peking University, Beijing, 100871 Abstract At the moment, lattice quantum chromodynamics (lattice QCD) is the most
July 2011 ED51006-1.2 8. Hardware Acceleration and ED51006-1.2 This chapter discusses how you can use hardware accelerators and coprocessing to create more eicient, higher throughput designs in OPC Builder.
Applications on HPC Special Issue on High Performance Computing Performance Tuning of a CFD Code on the Earth Simulator By Ken ichi ITAKURA,* Atsuya UNO,* Mitsuo YOKOKAWA, Minoru SAITO, Takashi ISHIHARA
Product Brief 6th Gen Intel Core Processors for Desktops: Sseries LOOKING FOR AN AMAZING PROCESSOR for your next desktop PC? Look no further than 6th Gen Intel Core processors. With amazing performance
OTU2 I.7 FEC IP Core (IP-OTU2EFECI7Z) Data Sheet Revision 0.02 Release Date 2015-02-24 Document number TD0382 . All rights reserved. ALTERA, ARRIA, CYCLONE, HARDCOPY, MAX, MEGACORE, NIOS, QUARTUS and STRATIX
ARM Processors for Computer-On-Modules Christian Eder Marketing Manager congatec AG COM Positioning Proprietary Modules Qseven COM Express Proprietary Modules Small Module Powerful Module No standard feature
Continuous-Time Converter Architectures for Integrated Audio Processors: By Brian Trotter, Cirrus Logic, Inc. September 2008 As consumer electronics devices continue to both decrease in size and increase
1. Amdahl s law Three enhancements with the following speedups are proposed for a new architecture: Speedup1 = 30 Speedup2 = 20 Speedup3 = 10 Only one enhancement is usable at a time. a) If enhancements
Introduction to GPU hardware and to CUDA Philip Blakely Laboratory for Scientific Computing, University of Cambridge Philip Blakely (LSC) GPU introduction 1 / 37 Course outline Introduction to GPU hardware
TMS320 DSP DESIGNER S NOTEBOOK Fast Logarithms on a Floating-Point Device APPLICATION BRIEF: SPRA218 Keith Larson Digital Signal Processing Products Semiconductor Group Texas Instruments March 1993 IMPORTANT
HARDWARE ACCELERATION IN FINANCIAL MARKETS A step change in speed NAME OF REPORT SECTION 3 HARDWARE ACCELERATION IN FINANCIAL MARKETS A step change in speed Faster is more profitable in the front office
L20: GPU Architecture and Models scribe(s): Abdul Khalifa 20.1 Overview GPUs (Graphics Processing Units) are large parallel structure of processing cores capable of rendering graphics efficiently on displays.
Evaluation of CUDA Fortran for the CFD code Strukti Practical term report from Stephan Soller High performance computing center Stuttgart 1 Stuttgart Media University 2 High performance computing center
4 Jan, 2008 The Bus (PCI and PCI-Express) The CPU, memory, disks, and all the other devices in a computer have to be able to communicate and exchange data. The technology that connects them is called the
The big data revolution Friso van Vollenhoven (Xebia) Enterprise NoSQL Recently, there has been a lot of buzz about the NoSQL movement, a collection of related technologies mostly concerned with storing
Project Denver Processor to Usher in a New Era of Computing Bill Dally January 5, 2011 http://blogs.nvidia.com/2011/01/project-denver-processor-to-usher-in-new-era-of-computing/ Project Denver Announced