Supercomputing: DataFlow vs. ControlFlow (Paradigm Shift in Both Benchmarking Methodology and Programming Model)
Milutinovic, V., Salom, J., Trifunovic, N., and Valero, M.

This paper was prepared in response to over 100 messages with questions from the CACM readership, inspired by our previous CACM contribution [1].

Introduction

The strength of DataFlow computers, compared to ControlFlow ones, lies in the fact that they accelerate data flows and application loops by one or more orders of magnitude; how many orders of magnitude depends on the amount of data reusability within the loops. This feature is enabled by compiling down to levels much below the machine code, which brings important additional effects: much lower execution time, equipment size, and power dissipation.

The previous paper [1] argues that the time has come to redefine Top 500 benchmarking. Concrete measurement data from real applications in GeoPhysics (Schlumberger) and FinancialEngineering (JPMorgan) show that a DataFlow machine (for example, the Maxeler MAX series) rates better than a ControlFlow machine (for example, the Cray Titan) if a different benchmark is used (e.g., a BigData benchmark), as well as a different ranking methodology (e.g., the benchmark execution time multiplied by the number of 1U boxes needed to achieve that execution time; it is assumed that, no matter what technology is inside, a 1U box always has the same size and always uses the same power).

In reaction to the previous paper [1], the scientific community insisted that more light be shed on two issues: (a) the programming paradigm and (b) the benchmarking methodology. Consequently, the stress of this viewpoint is on these two issues.

Programming Model

What is the fastest, the least complex, and the least power-consuming way to do computing?
Answer: Rather than writing one program to control the flow of data through the computer, one has to write a program to configure the hardware of the computer, so that input data, when they arrive, can flow through the computer hardware in only one way: the way in which the hardware has been configured. This is best achieved if the serial part of the application (the transactions) continues to run on the ControlFlow host, while the parallel part of the application (BigData crunching and loops) is migrated into a DataFlow accelerator.

The early works of Dennis [2] and Arvind [3] could prove the concept, but could not result in commercial successes, for three reasons: (a) reconfigurable hardware technology was not yet ready (contemporary ASIC was fast enough but not reconfigurable, while reconfigurable FPGA was nonexistent); (b) system software technology was not yet ready (methodologies for fast creation of system software did exist, but effective tools for large-scale efforts of this sort did not); and (c) applications of those days were not of the BigData type, so the streaming capabilities of the DataFlow computing model could not generate performance superiority (recent measurements show that, currently, Maxeler can move internally over 1TB of data per second [4]).

Each programming model is characterized by its quantity and quality. The quantity and quality aspects of the Maxeler DataFlow model are discussed in the next two paragraphs, based on Figure 1.

Quantitatively speaking, the complexity of DataFlow programming, in the case of Maxeler, is equal to 2n + 3, where n refers to the number of loops migrated from the ControlFlow host to the DataFlow accelerator. This means that the following programs have to be written:
- One kernel program per loop, to map the loop onto the DataFlow hardware;
- One kernel test program per loop, to test the above;
- One manager program (no matter how many kernels there are) to move data: (a) into the DataFlow accelerator, (b) between the kernels (if more than one kernel exists), and (c) out of the DataFlow accelerator;
- One simulation builder program, to test code without the time-consuming migration to the binary level;
- One hardware builder program, to exploit the code on the binary level.

In addition, in the host program (initially written in Fortran, Hadoop, MapReduce, MATLAB, Mathematica, C++, or C), instead of each migrated loop, one has to include a streaming construct (send data + receive results).
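The host-side change described above can be sketched in plain Java (the real SLiC interface is a C API; the Accelerator class and its calc method below are hypothetical stand-ins for the generated entry point, used only to show the shape of the migration):

```java
// Minimal sketch of loop migration: the host's loop body is replaced by a
// single streaming call (send data + receive results). All names here are
// illustrative, not the actual Maxeler SLiC API.
public class HostMigration {
    // Original host loop (the part that gets migrated to the accelerator).
    static int[] loopVersion(int[] x) {
        int[] y = new int[x.length];
        for (int i = 0; i < x.length; i++) y[i] = 3 * x[i] + 1; // example loop body
        return y;
    }

    // After migration: the host streams data in and collects results in one call.
    static int[] streamVersion(int[] x) {
        return Accelerator.calc(x, x.length);
    }

    // Stub modeling the configured DataFlow engine: data flows through one
    // fixed pipeline, the way the hardware was configured.
    static class Accelerator {
        static int[] calc(int[] x, int n) {
            int[] y = new int[n];
            for (int i = 0; i < n; i++) y[i] = 3 * x[i] + 1;
            return y;
        }
    }

    public static void main(String[] args) {
        int[] x = {1, 2, 3};
        System.out.println(java.util.Arrays.toString(streamVersion(x))); // prints [4, 7, 10]
    }
}
```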
Figure 1: An example Host code, Manager code, and Kernel code (a single-kernel case): 2D convolution. (The figure also shows the block diagram: CPU code with SLiC and MaxelerOS on the host, connected over PCI Express to the Manager and the kernel's c0..c4 datapath.)

CPU Code (.c):

    #include "MaxSLiCInterface.h"
    #include "Calc.max"

    int *x, *y;
    Calc(x, DATA_SIZE);

Manager (.java):

    Manager m = new Manager("Calc");
    Kernel k = new MyKernel();
    m.setKernel(k);
    m.setIO(link("x", PCIE), link("y", PCIE));
    m.addMode(modeDefault());
    m.build();

MyKernel (.java):

    DFEVar x = io.input("x", hwInt(32));
    DFEVar result = c0*fifo(x, 1) + c1*x + c2*fifo(x, -1)
                  + c3*fifo(x, L) + c4*fifo(x, -L);
    io.output("y", result, hwInt(32));

Legend:
SLiC = compiler support for selected domain-specific languages and customer applications;
DFEVar = keyword used for defining the variables that internally flow through the configured hardware (in contrast to standard Java variables, used to instruct the compiler).

Qualitatively speaking, the above quantity (2n + 3) is no longer difficult to realize, because of the existence of a DSL (domain-specific language) like MaxJ (an extension of standard Java with over 100 new functionalities). Figure 2 shows that a relatively complex BigData processing problem can be realized in only a few lines of MaxJ code. Note that the programming model implies the need for two types of variables: (a) standard Java variables, to control compile-time activities, and (b) DFE (Data Flow Engine) variables, which actually flow through the configured hardware (denoted with the DFE prefix in the examples of Figures 1 and 2).
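In conventional software terms, the Figure 1 kernel computes, for each stream position, a weighted sum of neighbours at offsets +1, 0, -1, +L, and -L (with L the row length, the five taps form a 2D cross stencil). A minimal software equivalent, assuming zero padding at the stream boundaries (the hardware handles edges via its FIFO offsets):

```java
// Software equivalent of the Figure 1 dataflow kernel: a five-tap 2D
// cross-stencil convolution over a linearized stream, zero-padded at edges.
public class Convolution2D {
    static double[] convolve(double[] x, int L,
                             double c0, double c1, double c2, double c3, double c4) {
        double[] y = new double[x.length];
        for (int i = 0; i < x.length; i++) {
            double xp1 = (i + 1 < x.length) ? x[i + 1] : 0;  // fifo(x,  1)
            double xm1 = (i - 1 >= 0)       ? x[i - 1] : 0;  // fifo(x, -1)
            double xpL = (i + L < x.length) ? x[i + L] : 0;  // fifo(x,  L)
            double xmL = (i - L >= 0)       ? x[i - L] : 0;  // fifo(x, -L)
            y[i] = c0 * xp1 + c1 * x[i] + c2 * xm1 + c3 * xpL + c4 * xmL;
        }
        return y;
    }

    public static void main(String[] args) {
        double[] x = {1, 1, 1, 1, 1, 1};
        double[] y = Convolution2D.convolve(x, 3, 1, 1, 1, 1, 1);
        System.out.println(y[3]); // one tap falls off the end, so prints 4.0
    }
}
```

On the DFE, this entire loop body becomes one spatial pipeline: one result flows out per cycle, which is where the loop acceleration comes from.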
Figure 2: An example of the Host and Kernel code (the programming is largely facilitated through the use of appropriate Java extensions).

HostCode (.c):

    for (t = 1; t < tmax; t++) {
        // Set up timestep
        if (t < tsrc) {
            source = generate_source_wavelet(t);
            maxlib_stream_region_from_host(maxlib, "source", source,
                srcx, srcy, srcz, srcx+1, srcy+1, srcz+1);
        }
        maxlib_stream_from_dram(maxlib, "curr", curr_ptr);
        maxlib_stream_from_dram(maxlib, "prev", prev_ptr);
        maxlib_stream_earthmodel_from_dram(maxlib, dvv_array);
        maxlib_stream_to_dram(maxlib, "next", next_ptr);
        maxlib_run(maxlib); // Execute timestep
        swap_buffers(prev_ptr, curr_ptr, next_ptr);
    }

FDKernel (.java):

    public class IsotropicModelingKernel extends FDKernel {
        public IsotropicModelingKernel(FDKernelParameters p) {
            super(p);
            Stencil stencil = fixedStencil(-6, 6, coeffs, 1/8.0);
            DFEVar curr = io.wavefieldInput("curr", 1.0, 6);
            DFEVar prev = io.wavefieldInput("prev", 1.0, 0);
            DFEVar dvv = io.earthmodelInput("dvv", 9.0, 0);
            DFEVar source = io.hostInput("source", 1.0, 0);
            DFEVar I = convolve(curr, ConvolveAxes.XYZ, stencil);
            DFEVar next = curr*2 - prev + dvv*I + source;
            io.wavefieldOutput("next", next);
        }
    }

Legend: DFEVar = keyword used for defining the variables that internally flow through the configured hardware (in contrast to standard Java variables, used to instruct the compiler).

Benchmarking Methodology

Bare speed is definitely neither the only issue of importance nor the most important one [5]. Consequently, the Top 500 ranking should not concentrate on only one issue of importance, no matter whether it is speed, power dissipation, or size (where size includes hardware complexity in the widest sense); it should concentrate on all three issues together, at the same time.
In this paper we argue that the best methodology for Top 500 benchmarking should be based on the holistic performance measure H(T_BigData, N_1U) (HPM), defined via the number of 1U boxes (N_1U), or equivalent, needed to achieve the desired execution time T_BigData on a given BigData benchmark. Issues like power dissipation (the monthly electricity bill) and the physical size of the equipment (the assumption here is that equipment size is proportional to the hardware complexity, and thus to the hardware production cost) are implicitly covered by H(T_BigData, N_1U). The selection of the performance measure H is coherent with the TPA concept introduced in [6] and described in Figure 3.
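The ranking rule described above (execution time multiplied by the number of 1U boxes, smaller being better) can be sketched as follows; the machine figures are made-up illustrative numbers, not measurements from the paper:

```java
// Sketch of the proposed holistic ranking: H = T_BigData * N_1U, i.e. the
// benchmark execution time multiplied by the number of 1U boxes needed to
// achieve it. The numbers below are purely illustrative.
public class HolisticMeasure {
    static double h(double tBigDataSeconds, int n1uBoxes) {
        return tBigDataSeconds * n1uBoxes; // smaller H ranks better
    }

    public static void main(String[] args) {
        double hControlFlow = h(100.0, 40); // e.g. a ControlFlow cluster: fast, many boxes
        double hDataFlow    = h(150.0, 2);  // e.g. a DataFlow node: slower, tiny footprint
        System.out.println(hDataFlow < hControlFlow); // prints true for these numbers
    }
}
```

Because each 1U box is assumed to draw the same power and occupy the same space, H folds time, power, and size into a single comparable figure.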
Figure 3: The TPA (Time, Power, and Area) concept of optimal computer design: design optimizations have to optimize the three essential issues jointly: T = Time, P = Power, and A = Area (complexity of a VLSI chip).

Note that the hardware design cost is not encompassed by the parameter A (parameter A encompasses only the hardware production cost), which causes the above-defined measure H to represent an upper bound for ControlFlow machines and a lower bound for DataFlow machines. This is due to the fact that ControlFlow machines are based on von Neumann logic, which is complex to design (execution control unit, cache control mechanism, prediction mechanisms, etc.), while DataFlow machines are based on FPGA logic, which is simple to design (mostly because the level of design repetitiveness is extremely high).

As indicated in the previous paper [1], the performance measure H puts PetaFlops out of date and brings PetaData into focus. Consequently, if the Top 500 ranking were based on the performance measure H, DataFlow machines would outperform ControlFlow machines. This statement is backed up by the performance data presented in the next section.

Most Recent Performance Data

A survey of recent implementations of various algorithms using the DataFlow paradigm can be found in [7]. Future trends in the development of the DataFlow paradigm can be found in [8]. For comparison purposes, future trends in the ControlFlow paradigm can be found in [9]. Some recent implementations of various DataFlow algorithms, interesting within the context of the performance measure H, are summarized below.
(1) Lindtjorn et al. [10] proved: (T) a speed-up of about 70 over a CPU and 14 over a GPU machine (application: Schlumberger, GeoPhysics), (P) using a 150MHz DataFlow machine, and (A) packaged as 1U. The algorithm involved was Reverse Time Migration (RTM), starting from the acoustic wave equation

    ∂²u/∂t² = v²∇²u,

where u is pressure and v is velocity.

(2) Oriato et al. [11] proved: (T) that two DataFlow nodes outperform 1,900 CPU cores (application: ENI, the velocity-stress form of the elastic wave equation), (P) using sixteen 150MHz FPGAs, and (A) packaged as 2U. The algorithm involved was a 3D finite difference scheme.

(3) Mencer et al. [12] proved: (T) that two DataFlow nodes outperform 382 CPU cores (application: ENI, CRS4 Lab, meteorological modelling), (P) using a 150MHz DataFlow machine, and (A) packaged as 1U. The algorithm involved the continuity and thermal energy equations.
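The time-stepping scheme behind RTM in item (1), and behind the FDKernel of Figure 2, can be sketched in one spatial dimension; this is a deliberate simplification (the real kernels use high-order 3D stencils), with dvv standing for (v·Δt/Δx)²:

```java
// Hedged 1D sketch of acoustic wave-equation time stepping:
//   next = 2*curr - prev + dvv * laplacian(curr)
// which discretizes d2u/dt2 = v^2 * lap(u) with a second-order scheme.
public class AcousticStep {
    static double[] step(double[] prev, double[] curr, double dvv) {
        double[] next = new double[curr.length];
        for (int i = 1; i < curr.length - 1; i++) {
            double lap = curr[i - 1] - 2 * curr[i] + curr[i + 1]; // 3-point Laplacian
            next[i] = 2 * curr[i] - prev[i] + dvv * lap;
        }
        return next;
    }

    public static void main(String[] args) {
        double[] prev = new double[5];
        double[] curr = {0, 0, 1, 0, 0};          // point source injected at t = 0
        double[] next = step(prev, curr, 0.25);
        System.out.println(next[1] + " " + next[2] + " " + next[3]); // 0.25 1.5 0.25
    }
}
```

Each timestep is one pass of the wavefield through the configured pipeline, which is why the host code in Figure 2 simply streams curr and prev through the DFE once per iteration.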
(4) Stojanović et al. [13] proved: (T) a speed-up of about 10, (P) a power reduction of about 17, and (A) all that on a 1U card, compared with an i7 CPU. The algorithm involved was the Gross-Pitaevskii equation.

(5) Chow et al. [14] proved: (T) a speed-up of about 163, (P) a power reduction of about 170, and (A) all that on a 1U card, compared with a quad-core CPU. The algorithm involved was Monte Carlo simulation:

    f_N = (1/N) * sum_{i=1..N} f(x_i),

where x_i is the input vector, N is the number of sample points, and f_N is the sampled mean value of the quantity.

(6) Arram et al. [15] proved: (T) a speed-up of about 13 over a 20-core CPU and 4 over a GPU machine, (P) using one 150MHz DataFlow machine, and (A) packaged as 1U. The algorithm involved (genetic sequence alignment) was based on the FM-index, which combines the properties of the suffix array (SA) with the Burrows-Wheeler transform (BWT). The SA interval is updated for each character x in pattern Q, moving from the last character to the first:

    k_new = c(x) + s(x, k - 1) + 1,
    l_new = c(x) + s(x, l),

where pointers k and l are, respectively, the smallest and largest indices in the SA that start with Q, c(x) (frequency) is the number of symbols in the BWT sequence that are lexicographically smaller than x, and
s(x, i) (occurrence) is the number of occurrences of the symbol x in the BWT sequence from the 0th position to the i-th position.

(7) Guo et al. [16] proved: (T) a speed-up of about 517 over a CPU and 28 over a GPU machine, (P) using one 150MHz DataFlow machine, and (A) packaged as 1U. The algorithm involved was Gaussian Mixture Models:

    p(x | λ) = sum_{i=1..M} w_i * g(x | μ_i, Σ_i),

where x is a D-dimensional continuous-valued data vector (i.e., measurements or features), w_i, i = 1, ..., M, are the mixture weights, and g(x | μ_i, Σ_i), i = 1, ..., M, are the component Gaussian densities.

Conclusion

This viewpoint sheds more light on the programming paradigm and the most recent performance achievements of the DataFlow concept (more details can be found in [8, 17]). It can be of interest to those who have to make decisions about future developments of their BigData centers. It also opens up an important new problem: the need for the development of a public cloud of ready-to-use applications.

Some researchers compare the use of MultiCore architectures to plowing with X horses, and the use of ManyCore architectures to plowing with 1,000X chickens; we compare the use of DataFlow architectures to plowing with 1,000,000X ants. One question is which method of plowing is the fastest (computational speed). If one plows a horizontal field (i.e., if one uses relatively simple benchmarks, like Linpack), horses are the fastest. If one plows a steep field (i.e., if one uses benchmarks that favor ManyCore over MultiCore), chickens are the fastest. However, if one plows a vertical stone field (BigData), ants are the fastest, and are the only animals that can do the task at all. Another question is the cost of feeding the animals used for plowing (power consumption). A haystack for horses is expensive, as are multiple corn feeders for chickens.
However, one can feed ants with crumbs left over from the plow driver's breakfast (power consumption is negligible, which enables seismic data processing to be completed even on the seismic vessels, i.e., at the very source of the data, without moving the data to remote mainland data centers; data movement is expensive [2]). Still another question is where to keep the animals (equipment size). For horses one needs a big stable; for chickens one needs small but numerous chicken cages. However, with ants there is no need for a new and expensive brick-and-mortar data center; they easily find a way to live in a corner crack of an existing data center.
REFERENCES

[1] Flynn, M., Mencer, O., Milutinovic, V., Rakocevic, G., Stenstrom, P., Trobec, R., Valero, M., "Moving from Petaflops to Petadata," Communications of the ACM, ACM, New York, NY, USA, Vol. 56, No. 5, May 2013.
[2] Dennis, J., Misunas, D., "A Preliminary Architecture for a Basic Data-Flow Processor," Proceedings of ISCA '75, the 2nd Annual Symposium on Computer Architecture, ACM, New York, NY, USA, 1975.
[3] Agerwala, T., Arvind, "Data Flow Systems: Guest Editors' Introduction," IEEE Computer, Vol. 15, No. 2, Feb. 1982.
[4] Mencer, O., "Multiscale Dataflow Computing," Proceedings of the Final Conference Supercomputacion y eCiencia, SyeC, Barcelona, Catalonia, Spain, May 27-28.
[5] Resch, M., "Future Strategies for Supercomputing," Proceedings of the Final Conference Supercomputacion y eCiencia, SyeC, Barcelona, Catalonia, Spain, May 27-28.
[6] Flynn, M., "Area - Time - Power and Design Effort: The Basic Tradeoffs in Application Specific Systems," Proceedings of the 2005 IEEE International Conference on Application-Specific Systems and Architecture Processors (ASAP'05), Samos, Greece, July 23-25.
[7] Salom, J., Fujii, H., "Overview of Acceleration Results of Maxeler FPGA Machines," IPSI Transactions on Internet Research, Vol. 5, No. 1, July 2013.
[8] Maxeler Technologies, FrontPage, London, UK, October 20.
[9] Patt, Y., "Future Microprocessors: What Must We do Differently if We Are to Effectively Utilize Multi-core and Many-core Chips?," IPSI Transactions on Internet Research, Vol. 5, No. 1, January 2009.
[10] Lindtjorn, O., Clapp, G., Pell, O., Mencer, O., Flynn, M., Fu, H., "Beyond Traditional Microprocessors for Geoscience High-Performance Computing Applications," IEEE Micro, Washington, USA, Vol. 31, No. 2, March/April 2011.
[11] Oriato, D., Pell, O., Andreoletti, C., Bienati, N., "FD Modeling Beyond 70Hz with FPGA Acceleration," Maxeler summary of a talk presented at the SEG 2010 HPC Workshop, Denver, Colorado, USA, October.
[12] Mencer, O., "Acceleration of a Meteorological Limited Area Model with Dataflow Engines," Proceedings of the Final Conference Supercomputacion y eCiencia, SyeC, Barcelona, Catalonia, Spain, May 27-28.
[13] Stojanovic, S., et al., "One Implementation of the Gross Pitaevskii Algorithm," Proceedings of Maxeler@SANU, Belgrade, Serbia, April 8.
[14] Chow, G. C. T., Tse, A. H. T., Jin, O., Luk, W., Leong, P. H. W., Thomas, D. B., "A Mixed Precision Monte Carlo Methodology for Reconfigurable Accelerator Systems," Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA), Monterey, CA, USA, February 2012.
[15] Arram, J., Tsoi, K. H., Luk, W., Jiang, P., "Hardware Acceleration of Genetic Sequence Alignment," Proceedings of the 9th International Symposium ARC 2013, Los Angeles, CA, USA, March 25-27.
[16] Guo, C., Fu, H., Luk, W., "A Fully-Pipelined Expectation-Maximization Engine for Gaussian Mixture Models," Proceedings of the 2012 International Conference on Field-Programmable Technology (FPT), Seoul, S. Korea, Dec. 2012.
[17] Flynn, M., Mencer, O., Greenspon, I., Milutinovic, V., Stojanovic, S., Sustran, Z., "The Current Challenges in DataFlow Supercomputer Programming," Proceedings of ISCA 2013, Tel Aviv, Israel, June.
PyCompArch: Python-Based Modules for Exploring Computer Architecture Concepts
PyCompArch: Python-Based Modules for Exploring Computer Architecture Concepts Workshop on Computer Architecture Education 2015 Dan Connors, Kyle Dunn, Ryan Bueter Department of Electrical Engineering University
Accelerate Cloud Computing with the Xilinx Zynq SoC
X C E L L E N C E I N N E W A P P L I C AT I O N S Accelerate Cloud Computing with the Xilinx Zynq SoC A novel reconfigurable hardware accelerator speeds the processing of applications based on the MapReduce
Best Practises for LabVIEW FPGA Design Flow. uk.ni.com ireland.ni.com
Best Practises for LabVIEW FPGA Design Flow 1 Agenda Overall Application Design Flow Host, Real-Time and FPGA LabVIEW FPGA Architecture Development FPGA Design Flow Common FPGA Architectures Testing and
An Introduction to Parallel Computing/ Programming
An Introduction to Parallel Computing/ Programming Vicky Papadopoulou Lesta Astrophysics and High Performance Computing Research Group (http://ahpc.euc.ac.cy) Dep. of Computer Science and Engineering European
HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK
HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK Steve Oberlin CTO, Accelerated Computing US to Build Two Flagship Supercomputers SUMMIT SIERRA Partnership for Science 100-300 PFLOPS Peak Performance
Infrastructure Matters: POWER8 vs. Xeon x86
Advisory Infrastructure Matters: POWER8 vs. Xeon x86 Executive Summary This report compares IBM s new POWER8-based scale-out Power System to Intel E5 v2 x86- based scale-out systems. A follow-on report
How To Build A Cloud Computer
Introducing the Singlechip Cloud Computer Exploring the Future of Many-core Processors White Paper Intel Labs Jim Held Intel Fellow, Intel Labs Director, Tera-scale Computing Research Sean Koehl Technology
Emerging storage and HPC technologies to accelerate big data analytics Jerome Gaysse JG Consulting
Emerging storage and HPC technologies to accelerate big data analytics Jerome Gaysse JG Consulting Introduction Big Data Analytics needs: Low latency data access Fast computing Power efficiency Latest
Data Center and Cloud Computing Market Landscape and Challenges
Data Center and Cloud Computing Market Landscape and Challenges Manoj Roge, Director Wired & Data Center Solutions Xilinx Inc. #OpenPOWERSummit 1 Outline Data Center Trends Technology Challenges Solution
Data Centric Systems (DCS)
Data Centric Systems (DCS) Architecture and Solutions for High Performance Computing, Big Data and High Performance Analytics High Performance Computing with Data Centric Systems 1 Data Centric Systems
Turbomachinery CFD on many-core platforms experiences and strategies
Turbomachinery CFD on many-core platforms experiences and strategies Graham Pullan Whittle Laboratory, Department of Engineering, University of Cambridge MUSAF Colloquium, CERFACS, Toulouse September 27-29
Hardware/Software Co-Design of a Java Virtual Machine
Hardware/Software Co-Design of a Java Virtual Machine Kenneth B. Kent University of Victoria Dept. of Computer Science Victoria, British Columbia, Canada [email protected] Micaela Serra University of Victoria
Scientific Computing Programming with Parallel Objects
Scientific Computing Programming with Parallel Objects Esteban Meneses, PhD School of Computing, Costa Rica Institute of Technology Parallel Architectures Galore Personal Computing Embedded Computing Moore
Performance Oriented Management System for Reconfigurable Network Appliances
Performance Oriented Management System for Reconfigurable Network Appliances Hiroki Matsutani, Ryuji Wakikawa, Koshiro Mitsuya and Jun Murai Faculty of Environmental Information, Keio University Graduate
Attaining EDF Task Scheduling with O(1) Time Complexity
Attaining EDF Task Scheduling with O(1) Time Complexity Verber Domen University of Maribor, Faculty of Electrical Engineering and Computer Sciences, Maribor, Slovenia (e-mail: [email protected]) Abstract:
Dr. Raju Namburu Computational Sciences Campaign U.S. Army Research Laboratory. The Nation s Premier Laboratory for Land Forces UNCLASSIFIED
Dr. Raju Namburu Computational Sciences Campaign U.S. Army Research Laboratory 21 st Century Research Continuum Theory Theory embodied in computation Hypotheses tested through experiment SCIENTIFIC METHODS
Pedraforca: ARM + GPU prototype
www.bsc.es Pedraforca: ARM + GPU prototype Filippo Mantovani Workshop on exascale and PRACE prototypes Barcelona, 20 May 2014 Overview Goals: Test the performance, scalability, and energy efficiency of
CUDA programming on NVIDIA GPUs
p. 1/21 on NVIDIA GPUs Mike Giles [email protected] Oxford University Mathematical Institute Oxford-Man Institute for Quantitative Finance Oxford eresearch Centre p. 2/21 Overview hardware view
Advances in Smart Systems Research : ISSN 2050-8662 : http://nimbusvault.net/publications/koala/assr/ Vol. 3. No. 3 : pp.
Advances in Smart Systems Research : ISSN 2050-8662 : http://nimbusvault.net/publications/koala/assr/ Vol. 3. No. 3 : pp.49-54 : isrp13-005 Optimized Communications on Cloud Computer Processor by Using
APPLICATIONS OF LINUX-BASED QT-CUDA PARALLEL ARCHITECTURE
APPLICATIONS OF LINUX-BASED QT-CUDA PARALLEL ARCHITECTURE Tuyou Peng 1, Jun Peng 2 1 Electronics and information Technology Department Jiangmen Polytechnic, Jiangmen, Guangdong, China, [email protected]
Computer System: User s View. Computer System Components: High Level View. Input. Output. Computer. Computer System: Motherboard Level
System: User s View System Components: High Level View Input Output 1 System: Motherboard Level 2 Components: Interconnection I/O MEMORY 3 4 Organization Registers ALU CU 5 6 1 Input/Output I/O MEMORY
A Dynamic Resource Management with Energy Saving Mechanism for Supporting Cloud Computing
A Dynamic Resource Management with Energy Saving Mechanism for Supporting Cloud Computing Liang-Teh Lee, Kang-Yuan Liu, Hui-Yang Huang and Chia-Ying Tseng Department of Computer Science and Engineering,
Moving Beyond CPUs in the Cloud: Will FPGAs Sink or Swim?
Moving Beyond CPUs in the Cloud: Will FPGAs Sink or Swim? Successful FPGA datacenter usage at scale will require differentiated capability, programming ease, and scalable implementation models Executive
High Performance Computing in the Multi-core Area
High Performance Computing in the Multi-core Area Arndt Bode Technische Universität München Technology Trends for Petascale Computing Architectures: Multicore Accelerators Special Purpose Reconfigurable
ultra fast SOM using CUDA
ultra fast SOM using CUDA SOM (Self-Organizing Map) is one of the most popular artificial neural network algorithms in the unsupervised learning category. Sijo Mathew Preetha Joy Sibi Rajendra Manoj A
CS 575 Parallel Processing
CS 575 Parallel Processing Lecture one: Introduction Wim Bohm Colorado State University Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5
Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi
Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi ICPP 6 th International Workshop on Parallel Programming Models and Systems Software for High-End Computing October 1, 2013 Lyon, France
Computers. Hardware. The Central Processing Unit (CPU) CMPT 125: Lecture 1: Understanding the Computer
Computers CMPT 125: Lecture 1: Understanding the Computer Tamara Smyth, [email protected] School of Computing Science, Simon Fraser University January 3, 2009 A computer performs 2 basic functions: 1.
Parallel AES Encryption with Modified Mix-columns For Many Core Processor Arrays M.S.Arun, V.Saminathan
Parallel AES Encryption with Modified Mix-columns For Many Core Processor Arrays M.S.Arun, V.Saminathan Abstract AES is an encryption algorithm which can be easily implemented on fine grain many core systems.
HPC Programming Framework Research Team
HPC Programming Framework Research Team 1. Team Members Naoya Maruyama (Team Leader) Motohiko Matsuda (Research Scientist) Soichiro Suzuki (Technical Staff) Mohamed Wahib (Postdoctoral Researcher) Shinichiro
NIOS II Based Embedded Web Server Development for Networking Applications
NIOS II Based Embedded Web Server Development for Networking Applications 1 Sheetal Bhoyar, 2 Dr. D. V. Padole 1 Research Scholar, G. H. Raisoni College of Engineering, Nagpur, India 2 Professor, G. H.
