Examining the Viability of FPGA Supercomputing
Stephen D. Craven and Peter Athanas
Bradley Department of Electrical and Computer Engineering
Virginia Polytechnic Institute and State University
Blacksburg, VA, USA

Abstract

For certain applications, custom computational hardware created using field programmable gate arrays (FPGAs) produces significant performance improvements over processors, leading some in academia and industry to call for the inclusion of FPGAs in supercomputing clusters. This paper presents a comparative analysis of FPGAs and traditional processors, focusing on floating-point performance and procurement costs, revealing economic hurdles in the adoption of FPGAs for general High-Performance Computing (HPC).

Index Terms: computational accelerator, digital arithmetic, field programmable gate arrays, high-performance computing, supercomputers.

I. INTRODUCTION

Supercomputers have experienced a recent resurgence, fueled by government research dollars and the development of low-cost supercomputing clusters. Unlike the Massively Parallel Processor (MPP) designs found in Cray and CDC machines of the 70s and 80s, featuring proprietary processor architectures, many modern supercomputing clusters are constructed from commodity PC processors, significantly reducing procurement costs. In an effort to improve performance, several companies offer machines that place one or more FPGAs in each node of the cluster. Configurable logic devices, of which FPGAs are one example, permit the device's hardware to be programmed multiple times after manufacture.

A wide body of research over two decades has repeatedly demonstrated significant performance improvements for certain classes of applications when implemented within an FPGA's configurable logic [1]. Applications well suited to speed-up by FPGAs typically exhibit massive parallelism and small integer or fixed-point data types. When implemented inside an FPGA, the Smith-Waterman gene sequencing algorithm can be accelerated two orders of magnitude over software implementations [2, 3]. Although an FPGA's clock rate rarely exceeds one-tenth that of a PC, hardware-implemented digital filters can process data at many times the rate of software implementations [4]. Additional performance gains have been described for cryptography [5], network packet filtering [6], target recognition [7], and pattern matching [8], among other applications. Because of these successes, FPGAs have begun to appear in some supercomputing clusters. SRC Computers [9], DRC Computer Corp. [10], Cray [11], Starbridge Systems [12], and SGI [13] all offer
clusters featuring programmable logic. Cray's XD1 architecture, characteristic of many of these systems, integrates 12 AMD Opteron processors in a chassis with six large Virtex 4 FPGAs. The smallest configurable logic device found in these systems is an XC4VLX60, with many systems featuring some of the largest parts, including the XC2VP100 and XC4VLX160. The architectures of the Cray and SGI machines are similar in that they feature a high-performance channel connecting the FPGA fabric with the core processors.

The addition of these high-end FPGAs to commodity processor-based clusters can significantly increase costs. Ideally, this expense would be more than offset by the increased computational throughput provided by the FPGA. High-Performance Computing (HPC) applications, however, traditionally utilize double-precision floating-point arithmetic, a domain for which FPGAs traditionally are not well suited. Efforts have been made to automate the conversion of floating-point arithmetic to the fixed-point arithmetic more amenable to hardware implementation [14]. Even with an automated flow, however, user intervention is required to specify an error bound and cost function.

Modern FPGAs are heterogeneous in nature, mixing configurable logic cells with dedicated memories and fast, fixed logic for performing integer arithmetic such as multiplication and Multiply Accumulate (MAC). The fine-grained programmability of the configurable fabric permits bit-wise hardware customization, enhancing performance for applications involving small data widths such as gene sequencing. The dedicated fixed-width multipliers improve performance for standard Digital Signal Processing (DSP) applications while improving computational density. Dedicated floating-point arithmetic logic, however, does not currently exist in FPGAs, requiring that floating-point units be constructed out of the dedicated integer multipliers and configurable fabric.

The intent of this paper is to present a thorough comparison of FPGAs to commodity processors for traditional HPC tasks. Three separate analyses are performed. In Section II, a comparison of IEEE-standard floating-point performance of FPGAs to processors is presented. Section III considers performance enhancements gained by using custom floating-point formats in hardware. For certain applications the flexibility of custom hardware permits alternative algorithms to be employed, better leveraging the strengths of FPGAs through all-integer implementations. The performance for one representative application, Mersenne prime identification, is discussed in Section IV. Based on the results of these analyses, Section V discusses the conclusions drawn.

II. IEEE STANDARD FLOATING-POINT COMPARISON

Many common HPC tasks utilize double precision floating-point arithmetic for applications such as weather modeling, weapons simulation, and computational fluid dynamics. Traditional processors, where the architecture has fixed the data path's width at 32 or 64 bits, provide no incentive to explore reduced-precision formats. While FPGAs permit data path width customization, some in the HPC community are loath to utilize a
nonstandard format owing to verification and portability difficulties. By standardizing data formats across different platforms, applications can be verified against known, golden results by a simple comparison of the numerical results. Many applications also require the full dynamic range of the double precision format to ensure numeric stability. Furthermore, standard formats ease portability of software across different architectures. This principle is at the heart of the Top500 List of fastest supercomputers [15], where ranked machines must exactly reproduce valid results when running the LINPACK benchmarks.

A. Present Day Cost-Performance Comparison

Owing to the prevalence of IEEE standard floating-point in a wide range of applications, several researchers have designed IEEE 754 compliant floating-point accelerator cores constructed out of the Xilinx Virtex-II Pro FPGA's configurable logic and dedicated integer multipliers [16-18]. Dou et al. published one of the highest performance benchmarks, 15.6 GFLOPS, by placing 39 floating-point processing elements on a theoretical Xilinx XC2VP125 FPGA [19]. Interpolating their results for the largest production Xilinx Virtex-II Pro device, the XC2VP100, produces 12.4 GFLOPS, compared to the peak 6.4 GFLOPS achievable for a 3.2 GHz Intel Pentium processor. Assuming that the Pentium can sustain 50% of its peak, the FPGA outperforms the processor by a factor of four for matrix multiplication. Dou et al.'s design comprises a linear array of MAC elements, linked to a host processor providing memory access. This architecture enables high computational density by simplifying routing and control, at the cost of requiring a host controller. Since the results of Dou et al. are superior to other published results, and even to Xilinx's own floating-point cores, they are taken as a conservative upper limit on FPGA double precision floating-point performance. Performance in any deployed system would be lower owing to the addition of interface logic and the reduced frequency caused by routing densely packed designs.

Table I below extrapolates Dou et al.'s performance results to other FPGA device families. Given the similar configurable logic architectures of the different Xilinx families, it has been assumed that Dou et al.'s requirements of 1,419 logic slices and nine dedicated multipliers hold for all families. While the slice requirement may be less for the Virtex 4 family, owing to the inclusion of a MAC function with the dedicated multipliers, as all considered Virtex 4 implementations were multiplier limited, the overestimate in required slices does not affect the results. The clock frequency has been scaled by a factor obtained by averaging the performance differential of Xilinx's double precision floating-point multiplier and adder cores [20] across the different families.

For comparison purposes several commercial processors have been included in the list. Sony's Cell processor powers the PlayStation 3 but was designed as a broadband media engine. The Cell features a PPC processing core along with eight independent processing cores sharing the same die. Intel's Pentium D 920 is a high-performance dual-core design. As the presented prices are merely for device cost, ignoring all required peripherals, data for the System X supercomputing cluster [21] has been included. The peak performance for each processor was reduced by 50%, taking into account compiler and system inefficiencies,
permitting a fairer comparison, as FPGA designs typically sustain a much higher percentage of their peak performance than processors. This 50% performance penalty is in line with the sustained performance seen in the Top500 List's LINPACK benchmark. In the table FPGAs are assumed to sustain their peak performance.

Table I. Double Precision Floating-Point Multiply Accumulate Cost-Performance (part-number suffixes and several speed and GFLOPS entries were lost in transcription and are marked with a dash)

| Device | Speed (MHz) | GFLOPS | Device Cost | $/GFLOPS |
|---|---|---|---|---|
| Virtex 4 FPGA: XC4VLX— | — | — | $7,010 | $1,250 |
| Virtex 4 FPGA: XC4VSX— | — | — | $542 | $97 |
| Virtex-II Pro FPGA: XC2VP— | — | — | $9,610 | $775 |
| Virtex-II Pro FPGA: XC2VP— | — | — | $6,860 | $613 |
| Virtex-II Pro FPGA: XC2VP— | — | — | $2,780 | $334 |
| Virtex-II Pro FPGA: XC2VP— | — | — | $781 | $244 |
| Spartan 3 FPGA: XC3S— | — | — | $242 | $78 |
| Spartan 3 FPGA: XC3S— | — | — | $164 | $59 |
| Pentium (50% peak) | — | — | $167 | $56 |
| Pentium D 920 (50% peak) | — x 2 | — | $203 | $36 |
| Cell Processor (50% peak) | 3200 x 9 | 10 [22] | $230 [23] | $23 |
| System X (HPC Rmax, full system cost) | 2300 x — | — | $5.8 M [24] | $473 |

As can be seen from the table, FPGA double precision floating-point performance is noticeably higher than for traditional Intel processors; however, when the cost of this performance is considered, processors fare better, with the worst processor beating the best FPGA. In particular, Sony's Cell processor is more than two times cheaper per GFLOPS than the best FPGA. The results indicate that the current generation of larger FPGAs, such as the XC2VP100 and XC4VLX200 found on many FPGA-augmented HPC clusters, are far from cost competitive with the current generation of processors for the double precision floating-point tasks typical of supercomputing applications.

With one exception, all costs in Table I cover only the price of the device and do not include other components (motherboard, memory, network, etc.) that are necessary to produce a functioning supercomputer. These additional costs are non-negligible and, while the FPGA accelerators would also incur additional costs for circuit board and components, it is likely that the cost of components to create a functioning HPC node from a processor, even factoring in economies of scale, would be larger than for creating an accelerator plug-in from an FPGA. To place these additional costs in perspective, the cost-performance for Virginia Tech's System X supercomputing cluster has been included. Constructed from 1,100 dual-processor Apple XServe nodes, the supercomputer, including the cost of all components, cost $473 per GFLOPS. Several of the larger FPGAs cost more per GFLOPS even without the memory, boards, and assembly required to create a functional accelerator.
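As a concrete illustration of the metric behind Table I, the short sketch below reproduces the cost-per-sustained-GFLOPS calculation from figures quoted in the text; the helper name is ours, and small differences from the tabulated values presumably reflect rounding or slightly different price points.

```python
# A minimal sketch (not from the paper) of the cost-performance metric
# used in Table I, with figures quoted in the text of Section II.A.

def dollars_per_gflops(device_cost: float, peak_gflops: float,
                       sustained_fraction: float) -> float:
    """Cost per sustained GFLOPS: cost / (peak * fraction sustained)."""
    return device_cost / (peak_gflops * sustained_fraction)

# FPGA: Dou et al.'s design interpolated to the XC2VP100, 12.4 GFLOPS,
# assumed to sustain 100% of peak; device cost as in Table I.
fpga = dollars_per_gflops(9610, 12.4, 1.0)    # ~$775 / GFLOPS

# Processor: 3.2 GHz Pentium, 6.4 GFLOPS peak, assumed 50% sustained.
cpu = dollars_per_gflops(167, 6.4, 0.5)       # ~$52 / GFLOPS

# The raw speedup cited in the text: FPGA vs. 50%-of-peak Pentium.
speedup = 12.4 / (6.4 * 0.5)                  # ~3.9, "a factor of four"
print(f"FPGA ${fpga:.0f}/GFLOPS, CPU ${cpu:.0f}/GFLOPS, {speedup:.1f}x")
```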
As the dedicated integer multipliers included by Xilinx, the largest configurable logic manufacturer, are only 18 bits wide, several multipliers must be combined to produce the 52-bit multiplication needed for double-precision floating-point multiplication. Xilinx's double precision floating-point core requires 16 of these 18-bit multipliers [20] for each floating-point multiplier, while the Dou et al. design needs only nine. For many FPGA device families this high multiplier requirement limits the number of floating-point multipliers that may be placed on the device. For example, while 31 of Dou's MAC units may be placed on an XC2VP100, the largest Virtex-II Pro device, the lack of sufficient dedicated multipliers permits only 10 to be placed on the largest Xilinx FPGA, an XC4VLX200. If this device were used solely as a matrix multiplication accelerator, as in Dou's work, over 80% of the device would be unused. Of course, this idle configurable logic could be used to implement additional multipliers, at a significant performance penalty. And while the device could implement many more floating-point adders, most HPC-related tasks, such as matrix multiplication, dot product, and Fast Fourier Transforms (FFTs), make heavy use of multiplication and would not benefit from a sea of adders.

B. Extrapolated Cost-Performance Comparison

While the larger FPGA devices that are prevalent in computational accelerators do not provide a cost benefit for the double precision floating-point calculations required by the HPC community, historical trends [25] suggest that FPGA performance is improving at a rate faster than that of processors. The question is then asked: when, if ever, will FPGAs overtake processors in cost-performance?

Several assumptions are made in the following analysis. As has been noted by some, the cost of the largest cutting-edge FPGA remains roughly constant over time, while performance and size improve. A first-order estimate of $8,000 has been made for the cost of the largest and newest FPGA, an estimate supported by the cost of the largest Virtex-II Pro and Virtex 4 devices. Furthermore, it is assumed that the cost of a processor remains constant at $500 over time as well. While these estimates are somewhat misleading, as these costs certainly do vary over time, the variability in the cost of computing devices between generations is much less than the increase in performance. The comparison further assumes, as before, that processors can sustain 50% of their peak floating-point performance while FPGAs sustain 100%. Whenever possible, estimates were rounded to favor FPGAs.

Two sources of data were used for performance extrapolation to increase the validity of the results. The work of Dou et al. [19], representing the fastest double precision floating-point MAC design, was extrapolated to the largest parts in several Xilinx device families. Additional data was obtained by extrapolating the results of Underwood's historical analysis [25] to include the Virtex 4 family. The initial results are shown below in Figure 1(a) for the Underwood data and Figure 1(b) for Dou et al.
[Figure 1: Extrapolated double precision floating-point MAC cost-performance (cost per GFLOPS versus year, log scale) for: (a) the Underwood design and (b) the Dou et al. design. Each panel plots FPGA and processor data points with extrapolated trend lines.]

As apparent from the graphs, there are significant differences between the results of Dou and Underwood. An additional data point exists for the Underwood graph, as his work included results for the Virtex E FPGAs. The design of Dou et al. could not be extrapolated to this device as the Virtex E has no dedicated
multipliers and their design requires nine. The Dou et al. design is higher performance and smaller, in terms of slices, than Underwood's design.

In both graphs the latest data point, representing the largest Virtex 4 device, displays worse cost-performance than the previous generation of devices. This is due to the shortage of dedicated multipliers on the larger Virtex 4 devices. The Virtex 4 architecture comprises three sub-families: the LX, SX, and FX. The Virtex 4 sub-family with the largest devices, by far, is the LX, and it is these devices that are found in FPGA-augmented HPC nodes. However, the LX sub-family is focused on logic density, trading most of the dedicated multipliers found in the smaller SX sub-family for configurable logic. This significantly reduces the floating-point multiplication performance of the larger Virtex 4 devices. As the graphs demonstrate, if this trend towards logic-centric large FPGAs continues, it is unlikely that the largest FPGAs will be cost effective compared to processors anytime soon, if ever.

The largest announced next-generation Virtex 5 device, the XC5VLX330, includes 192 dedicated multipliers, compared with only 96 present in the largest Virtex 4 device. Furthermore, the Virtex 5 multipliers are 25 bits by 18 bits wide, better supporting floating-point calculations. This multiplier design should reduce the number of dedicated multipliers required to construct a double precision floating-point multiplier from the present nine to four, eliminating the multiplier bottleneck that harms performance in the Virtex 4s. As preliminary Virtex 5 data suggests that the relatively poor floating-point performance of the Virtex 4 is an aberration and not indicative of a trend in FPGA architectures, it seems reasonable to reconsider the results excluding the Virtex 4 data points. Figure 2, below, excludes these potentially misleading data points and reconstructs the trend lines.

When the Virtex 4 data is ignored, the cost-performance of FPGAs for double precision floating-point matrix multiplication improves at a rate greater than that for processors. While there is always a danger in drawing conclusions from a small data set, both the Dou et al. and Underwood design results point to a crossover point sometime around 2009. At that point in time the largest FPGA devices, like those typically found in commercial FPGA-augmented HPC clusters, will be cost effective, compared to processors, for double precision floating-point calculations.
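The multiplier counts quoted in this section (16 for the Xilinx core, nine for Dou et al., and fewer with the wider Virtex 5 multipliers) follow from the usual schoolbook tiling of a wide significand product onto narrow hardware multipliers. The derivation below is our illustration of that arithmetic, under the simplifying assumption that each operand is split into equal chunks of c usable bits.

```latex
% Split each w-bit significand into k = \lceil w/c \rceil chunks of c bits;
% the wide product then decomposes into k^2 narrow partial products:
\[
  x = \sum_{i=0}^{k-1} x_i \, 2^{ci}, \qquad
  y = \sum_{j=0}^{k-1} y_j \, 2^{cj}, \qquad
  x\,y = \sum_{i=0}^{k-1} \sum_{j=0}^{k-1} x_i \, y_j \, 2^{c(i+j)}.
\]
```

For the 53-bit double-precision significand, using the 18-bit signed multipliers as 17-bit unsigned blocks gives k = 4 and hence 16 multipliers, consistent with the Xilinx core [20]; exploiting the full 18-bit width gives k = 3 and hence nine, consistent with Dou et al. [19]. Wider chunks, as in the Virtex 5's 25x18 multipliers, reduce the count further, though asymmetric tiling makes the exact figure design-dependent.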
[Figure 2: Extrapolated double precision floating-point MAC cost-performance (cost per GFLOPS versus year, log scale) excluding Virtex 4 data for: (a) the Underwood design and (b) the Dou et al. design.]
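The trend lines in Figures 1 and 2 are straight-line fits on a logarithmic cost axis, i.e., exponential improvement in cost-performance over time. The sketch below illustrates the crossover calculation; the (year, $/GFLOPS) points are placeholders for illustration only, not the data behind the figures.

```python
# Sketch of the trend-line extrapolation behind Figures 1 and 2.
# The data points are illustrative placeholders, NOT the paper's data;
# only the fitting and crossover methodology is being demonstrated.
import numpy as np

fpga_points = [(2000, 10000.0), (2002, 3000.0), (2004, 900.0), (2006, 300.0)]
cpu_points = [(2000, 320.0), (2002, 200.0), (2004, 120.0), (2006, 70.0)]

def fit_log_trend(points):
    """Least-squares fit of log10(cost) vs. year: cost ~ 10**(a*year + b)."""
    years, costs = zip(*points)
    a, b = np.polyfit(years, np.log10(costs), 1)
    return a, b

a_fpga, b_fpga = fit_log_trend(fpga_points)
a_cpu, b_cpu = fit_log_trend(cpu_points)

# The two exponential trends cross where a_fpga*t + b_fpga = a_cpu*t + b_cpu.
crossover_year = (b_cpu - b_fpga) / (a_fpga - a_cpu)
print(f"Extrapolated crossover year: {crossover_year:.1f}")
```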
III. NONSTANDARD FLOATING-POINT COMPARISON

The use of IEEE standard floating-point data formats in hardware implementations prevents the user from leveraging an FPGA's fine-grained configurability, effectively reducing an FPGA to a collection of floating-point units with configurable interconnect. Seeing the advantages of customizing the data format to fit the problem, several authors have constructed nonstandard floating-point units. One of the earlier projects demonstrated a 23x speedup on a 2-D FFT through the use of a custom 18-bit floating-point format [26]. More recent work has focused on parameterizable libraries of floating-point units that can be tailored to the task at hand [27-29]. By using a custom floating-point format sized to match the widths of the FPGA's internal integer multipliers, a speedup of 44 was achieved for a hydrodynamics simulation [30] using four large FPGAs.

Nakasato and Hamada's 38 GFLOPS of performance is impressive, even from a cost-performance standpoint. For the cost of their PROGRAPE-3 board, estimated at $15,000, it is likely that a 15-node processor cluster could be constructed producing 196 single precision peak GFLOPS. Even in the unlikely scenario that this cluster could sustain the same 10% of peak performance obtained by Nakasato and Hamada for their software implementation, the PROGRAPE-3 design would still achieve a 2x speedup.

As in many FPGA-to-CPU comparisons, it is likely that the analysis unfairly favors the FPGA solution. Hardware implementations require specialized skills in digital design and vendor-specific tool flows. Development time and costs are significantly higher than for software. Many comparisons in the literature spend significantly more time optimizing the hardware implementations than they do optimizing their software implementations. Previous research has demonstrated significant compiler inefficiency for common HPC functions [31]. For the DGEMM matrix multiplication function, a hand-coded version outperformed the compiler by greater than eight times. It appears likely that Nakasato and Hamada's speedup would be significantly reduced, and perhaps eliminated on a cost-performance basis, if equal effort were applied to optimizing the software at the assembly level. To permit their design to be more cost-competitive, even against efficient software implementations, smaller FPGAs could be used. As evident from Table I, FPGA cost declines sharply as the size of the device decreases.
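To make the custom-format idea concrete, the sketch below packs and unpacks a reduced-precision 18-bit float of the general kind used in [26]. The field split (1 sign bit, 6 exponent bits, 11 fraction bits) and the helper names are our assumptions for illustration; the cited designs may allocate the bits differently and handle rounding, denormals, and exceptions in hardware.

```python
import math

# Hypothetical 18-bit float: 1 sign, 6 exponent (bias 31), 11 fraction
# bits. This split is an illustrative assumption, not the format of [26].
BIAS = 31

def pack18(value: float) -> int:
    """Encode a float into 18 bits (truncating; no denormals or NaN)."""
    if value == 0.0:
        return 0
    sign = 1 if value < 0 else 0
    frac, exp = math.frexp(abs(value))     # value = frac * 2**exp, frac in [0.5, 1)
    mant = int(frac * (1 << 12)) & 0x7FF   # keep 11 bits below the implicit 1
    return (sign << 17) | (((exp + BIAS) & 0x3F) << 11) | mant  # exp must fit 6 bits

def unpack18(bits: int) -> float:
    """Decode the 18-bit encoding back into a float."""
    if bits == 0:
        return 0.0
    sign = -1.0 if (bits >> 17) & 1 else 1.0
    exp = ((bits >> 11) & 0x3F) - BIAS
    frac = 0.5 + (bits & 0x7FF) / (1 << 12)  # restore the implicit leading 1
    return sign * frac * 2.0 ** exp

print(unpack18(pack18(3.14159)))   # ~3.1406: 11 fraction bits of precision
```

The payoff in configurable logic is direct: an 11-bit significand product fits in a single 18x18 dedicated multiplier, whereas a double-precision product needs nine or more, which is part of the source of the speedups reported in [26] and [30].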
IV. ALL-INTEGER IMPLEMENTATIONS

The strength of configurable logic stems from the ability to customize a hardware solution to a specific problem at the bit level. The previously presented works all implemented coarse-grained floating-point units inside an FPGA for a wide range of HPC applications. For certain applications the full flexibility of configurable logic can be leveraged to create a custom solution to a specific problem, utilizing data types that play to the FPGA's strengths: integer arithmetic. One such application can be found in the Great Internet Mersenne Prime Search (GIMPS) [32]. The software used by GIMPS relies heavily on double precision floating-point FFTs. Through a careful analysis of the problem, an all-integer solution is possible that improves FPGA performance by a factor of two.

The largest known prime numbers are Mersenne primes: prime numbers of the form 2^q - 1, where q is also prime. The distributed computing project GIMPS has been created to identify large Mersenne primes, and a reward of $100,000 has been issued for the first person to identify a prime number with greater than 10 million digits. The algorithm used by GIMPS, the Lucas-Lehmer test, is iterative, repeatedly performing modular squaring (a small sketch of this iteration appears at the end of this section). As the numbers being squared are quite large (tens of millions of bits), specialized multiplication techniques are used. One of the most efficient multiplication algorithms for large integers utilizes the FFT, treating the number being squared as a long sequence of smaller numbers. Instead of one 24-million-bit number, the algorithm may, for example, create a 3-million-element sequence of 8-bit numbers. The linear convolution of this sequence with itself performs the squaring. As linear convolution in the time domain is equivalent to multiplication in the frequency domain, the FFT of the sequence is taken and the resulting frequency-domain sequence is squared element-wise before being brought back into the time domain. Floating-point arithmetic is used to meet the strict precision requirements across the time and frequency domains. The software used by GIMPS has been optimized at the assembly level for maximum performance on Pentium processors, making this application an effective benchmark of relative processor floating-point performance.

Previous work focused on an FPGA hardware implementation of the GIMPS algorithm to compare FPGA and processor floating-point performance [33]. To leverage the advantages of a configurable architecture, multiple FFT implementations were considered, including floating-point and all-integer designs. The results of this comparison indicated that an all-integer FFT, specifically an irrational base discrete weighted transform, could outperform a floating-point implementation by a factor of two. The final GIMPS accelerator, designed for the largest Virtex-II Pro FPGA, consisted of two 8-point FFT units fed by reorder caches constructed from the internal memories. To prevent a memory bottleneck, the design assumed four independent banks of DDR SDRAM. The final design could be clocked at 80 MHz and used 86% of the dedicated multipliers and 70% of the configurable logic.

In spite of the unique all-integer algorithmic approach, the customized FPGA implementation only achieved a speed-up of 1.76 compared to a 3.4 GHz Pentium 4 processor. Amdahl's Law limited the FPGA's performance due to the serial nature of certain steps in the algorithm. A slightly reworked implementation, designed as an FFT accelerator with all serial functions implemented on an attached processor, could achieve a speed-up of 2.6 compared to a processor alone, under the reasonable assumption that communication could be overlapped with computation on the FPGA.
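As a reference point for the algorithm described above, here is a minimal software sketch of the Lucas-Lehmer iteration. Python's built-in big integers stand in for the FFT-based squaring that GIMPS (and the FPGA design of [33]) uses at scale; the shift-based Mersenne reduction is the standard trick that avoids division.

```python
def mod_mersenne(n: int, q: int) -> int:
    """Reduce a non-negative n modulo 2**q - 1 using shifts, not division."""
    m = (1 << q) - 1
    while n > m:
        n = (n & m) + (n >> q)    # fold the high bits back in: 2**q == 1 (mod m)
    return 0 if n == m else n

def lucas_lehmer(q: int) -> bool:
    """True iff 2**q - 1 is prime, for an odd prime q."""
    m = (1 << q) - 1
    s = 4
    for _ in range(q - 2):
        s = mod_mersenne(s * s, q) - 2   # the repeated modular squaring
        if s < 0:
            s += m
    return s == 0

# 2**13 - 1 = 8191 is a Mersenne prime; 2**11 - 1 = 2047 = 23 * 89 is not.
print(lucas_lehmer(13), lucas_lehmer(11))   # True False
```

At GIMPS scale the schoolbook product s * s is replaced by the FFT-based convolution described above; the all-integer transform of [33] performs the same convolution without floating-point arithmetic.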
V. CONCLUSION

When comparing HPC architectures, many factors must be weighed, including memory and I/O bandwidth, communication latencies, and peak and sustained performance. However, as the recent focus on power consumption and commodity processor clusters demonstrates, cost-performance is of paramount importance. In order for FPGAs to gain acceptance within the general HPC community they must be cost-competitive with traditional processors for the floating-point arithmetic typical of supercomputing applications. The analysis of the cost-performance of various current generation FPGAs revealed that only the lower-end devices were cost-competitive with processors for double precision floating-point matrix multiplications. The limited capabilities of these lower-end parts make them unattractive for inclusion in HPC nodes, with industry favoring the largest FPGA devices. An extrapolation of the double precision floating-point cost-performance of these larger devices, using two different floating-point designs, suggests that the largest FPGAs will not be cost-competitive with processors until the crossover timeframe identified in Section II. However, FPGA floating-point performance is very sensitive to the mix of dedicated arithmetic units in the architecture, and for this cost-performance cross point to be achieved in the suggested timeframe requires architectures with significant dedicated multipliers.

For non-IEEE standard floating-point formats, current generation FPGAs fare much better, making FPGAs cost-competitive with processors. While completely integer implementations of floating-point applications permit the FPGA to fully leverage its strengths, for at least one such application the cost-performance of an all-integer implementation was significantly worse than a processor. This suggests that, as with traditional integer applications, only certain classes of problems will experience significant performance improvements with hardware implementation.

REFERENCES

1. Compton, K. and S. Hauck. Reconfigurable computing: a survey of systems and software. ACM Computing Surveys 34(2), 2002.
2. Puttegowda, K., et al. A Run-Time Reconfigurable System for Gene-Sequence Searching. In International VLSI Design Conference, New Delhi, India.
3. TimeLogic Corp. DeCypher Engine product literature.
4. Tessier, R. and W. Burleson. Reconfigurable Computing for Digital Signal Processing: A Survey. Journal of VLSI Signal Processing 28, 2001.
5. Patterson, C. High performance DES encryption in Virtex FPGAs using JBits.
6. Sinnappan, R. and S. Hazelhurst. A Reconfigurable Approach to Packet Filtering. In Lecture Notes in Computer Science.
7. Jean, J., et al. Accelerating an IR automatic target recognition application with FPGAs.
8. Baker, Z.K. and V.K. Prasanna. Time and area efficient pattern matching on FPGAs. In Proceedings of the 2004 ACM/SIGDA 12th International Symposium on Field Programmable Gate Arrays. ACM Press: Monterey, California, USA, 2004.
9. SRC Computers, Inc. SRC-7 Product Sheet.
10. Vance, A. Start-up could kick Opteron into overdrive. The Register, April 21, 2006.
11. Woods, G. Cray ARSC Presentation: FPGA. In ARSC High-Performance Reconfigurable Computing Workshop, Fairbanks, AK.
12. Collins, J., G. Kent, and J. Yardley. Using the Starbridge Systems FPGA-based Hypercomputer for Cancer Research. In International Conference on Military and Aerospace Programmable Logic Devices (MAPLD), Washington, D.C.
13. SGI. Extraordinary Acceleration of Workflows with Reconfigurable Application-specific Computing from SGI.
14. Leong, M.P., et al. Automatic floating to fixed point translation and its application to post-rendering 3D warping.
15. Meuer, H., J. Dongarra, and E. Strohmaier. Top500 List. Available from http://www.top500.org.
16. Underwood, K.D. and K.S. Hemmert. Closing the gap: CPU and FPGA trends in sustainable floating-point BLAS performance.
17. Zhuo, L. and V.K. Prasanna. Design tradeoffs for BLAS operations on reconfigurable hardware.
18. Ho, C.H., et al. Rapid prototyping of FPGA based floating point DSP systems. In Proceedings of the 13th IEEE International Workshop on Rapid System Prototyping.
19. Dou, Y., et al. 64-bit floating-point FPGA matrix multiplication. In Proceedings of the 2005 ACM/SIGDA 13th International Symposium on Field-Programmable Gate Arrays. ACM Press: Monterey, California, USA, 2005.
20. Xilinx. Floating-Point Operator v2.0 Datasheet.
21. Ribbens, C., et al. Balancing Computational Science and Computer Science Research on a Terascale Computing Facility. In ICCS 2005: 5th International Conference, Atlanta, GA, 2005.
22. Chen, T., et al. Cell Broadband Engine Architecture and its first implementation. IBM developerWorks, Nov. 29, 2005.
23. Lynch, M. Playstation 3 slippage looking more likely: implications. Technology Strategy report, Feb. 17.
24. Kahney, L. System X Faster, but Falls Behind. Wired News.
25. Underwood, K. FPGAs vs. CPUs: trends in peak floating-point performance. In Proceedings of the 2004 ACM/SIGDA 12th International Symposium on Field Programmable Gate Arrays. ACM Press: Monterey, California, USA, 2004.
26. Shirazi, N., P. Athanas, and A. Abbott. Implementation of a 2-D Fast Fourier Transform on an FPGA-Based Custom Computing Machine. In International Workshop on Field Programmable Logic and Applications, Oxford, UK.
27. Liang, J., R. Tessier, and O. Mencer. Floating point unit generation and evaluation for FPGAs.
28. Belanovic, P. and M. Leeser. A Library of Parameterized Floating Point Modules and Their Use. In International Conference on Field Programmable Logic and Applications, Montpellier, France, 2002.
29. Dido, J., et al. A flexible floating-point format for optimizing data-paths and operators in FPGA based DSPs. In Proceedings of the 2002 ACM/SIGDA Tenth International Symposium on Field-Programmable Gate Arrays. ACM Press: Monterey, California, USA, 2002.
30. Nakasato, N. and T. Hamada. Astrophysical hydrodynamics simulations on a reconfigurable system.
31. Gropp, W. Closing the Performance Gap. In DOE SciDAC PI Meeting, Napa, California.
32. GIMPS. The Great Internet Mersenne Prime Search. Available from http://www.mersenne.org.
33. Craven, S., C. Patterson, and P. Athanas. Super-sized Multiplies: How Do FPGAs Fare in Extended Digit Multipliers? In International Conference on Military and Aerospace Programmable Logic Devices (MAPLD), Washington, D.C.