Paradigm Shift in Big Data SuperComputing: DataFlow vs. ControlFlow
|
|
|
- Daniela Foster
- 10 years ago
- Views:
Transcription
1 Trifunovic et al. Journal of Big Data (2015) 2:4 DOI /s z METHODOLOGY Paradigm Shift in Big Data SuperComputing: DataFlow vs. ControlFlow Nemanja Trifunovic 1*, Veljko Milutinovic 1, Jakob Salom 2 and Anton Kos 3 Open Access * Correspondence: [email protected] 1 School of Electrical Engineering, University of Belgrade, Belgrade, Serbia Full list of author information is available at the end of the article Abstract The paper discusses the shift in the computing paradigm and the programming model for Big Data problems and applications. We compare DataFlow and ControlFlow programming models through their quantity and quality aspects. Big Data problems and applications that are suitable for implementation on DataFlow computers should not be measured using the same measures as ControlFlow computers. We propose a new methodology for benchmarking, which takes into account not only the execution time, but also the power and space, needed to complete the task. Recent research shows that if the TOP500 ranking was based on the new performance measures, DataFlow machines would outperform ControlFlow machines. To support the above claims, we present eight recent implementations of various algorithms using the DataFlow paradigm, which show considerable speed-ups, power reductions and space savings over their implementation using the ControlFlow paradigm. Introduction Big Data is becoming a reality in more and more research areas every year. Also, Big Data applications are becoming more visible as they are slowly entering areas concerning the general public. In other words, Big Data applications that were up to now present mainly in the highly specialized areas of research, like geophysics [1,2] and financial engineering [3], are making its way into more general areas, like medicine and pharmacy [4], biology, aviation [5], politics, acoustics [6], etc. In the last years the ratio of data volume increase is higher than the ratio of processing power increase. With the growing adoption of data-collecting technologies, like sensor networks, Internet of Things, and others, the data volume growth ratio is expected to continue to increase. Among others, one important question arises: how do we process such quantities of data. One possible answer lies is in the shift of the computing paradigm and the programming model. With Big Data problems, it is many times more reasonable to concentrate on data rather than on the process. This can be achieved by employing DataFlow computing paradigm, programming model, and computers. Background and literature review The strength of DataFlow, compared to ControlFlow computers is in the fact that they accelerate the data flows and application loops from 10 to How many orders 2015 Trifunovic et al.; licensee Springer. This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.
2 Trifunovic et al. Journal of Big Data (2015) 2:4 Page 2 of 9 of magnitude depends on the amount of data reusability within the loops. This feature is enabled by compiling down to levels much below the machine code, which brings important additional effects: much lower execution time, equipment size, and power dissipation. The above strengths can prove especially important in Big Data applications that can benefit from one or more of the DataFlow advantages. For instance: A daily periodic Big Data application, which would not finish in time, if executed on a ControlFlow computer, executes in time on a DataFlow computer of the same equipment size and power dissipation, A Big Data application with limited space and/or power resources (remote locations such as ships, research stations, etc.) executes in a reasonable amount of time, With Big Data applications, where execution time is not a prime concern, DataFlow computers can save space and energy. The previous paper [7] argues that time has come to redefine TOP500 benchmarking. Concrete measurement data from real applications in geophysics [1,2], financial engineering [3], and some other research fields [8,9,10-12], shows that a DataFlow machine (for example, the Maxeler MAX series) rates better than a ControlFlow machine (for example, Cray Titan), if a different benchmark is used (e.g., a Big Data benchmark), as well as a different ranking methodology (e.g., the benchmark execution time multiplied by the number of 1U boxes needed to accomplish the given execution time - 1U box represents one rack unit or equivalent - it is assumed, no matter what technology is inside, the 1U box always has the same size and always uses the same power). In reaction to the previous paper [7], scientific community insists that more light is shed on two issues: (a) Programming paradigm and (b) Benchmarking methodology. Consequently the stress of this viewpoint is on these two issues. Discussion What is the fastest, the least complex, and the least power consuming way to do (Big Data) computing? Answer: Rather than writing one program to control the flow of data through the computer, one has to write a program to configure the hardware of the computer, so that input data, when it arrives, can flow through the computer hardware in only one way - the way how the computer hardware has been configured. This is best achieved if the serial part of the application (the transactions) continues to run on the ControlFlow host and the parallel part of the application is migrated into a DataFlow accelerator. A DataFlow part of the application does (parallel) Big Data crunching and execution of loops. The early works of Dennis [13] and Arvind [14] could prove the concept, but could not result in commercial successes for three reasons: (a) Reconfigurable hardware technology was not yet ready. Contemporary ASIC was fast enough but not reconfigurable, while reconfigurable FPGA was nonexistent; (b) System software technology was not yet ready. Methodologies for fast creation of system software did exist, but effective tools for large scale efforts of this sort did not; and (c) Applications of those days were not of the Big Data type, so the streaming capabilities of the DataFlow computing
3 Trifunovic et al. Journal of Big Data (2015) 2:4 Page 3 of 9 model could not generate performance superiority. Recent measurements show, that, currently, Maxeler can move internally over 1 TB of data per second [15]. Programming model Each programming model is characterized with its quantity and quality. The quantity and quality aspects of the Maxeler DataFlow model, as one of the currently best evaluated, are explained in the next two paragraphs, based on Figure 1. Other DataFlow programming initiatives exist [16] that follow similar approaches as Maxeler systems. To the best of our knowledge Maxeler is the leading player on the field and employs the most advanced and flexible model. For that reason we are using the Maxeler system platform for the presentation of the DataFlow programming model (one of many possible). Quantitatively speaking, the complexity of DataFlow programming, in the case of Maxeler, is equal to 2n + 3, where n refers to the number of loops migrated from the ControlFlow host to the DataFlow accelerator. This means, the following programs have to be written: One kernel program per loop, to map the loop onto the DataFlow hardware; One kernel test program per loop, to test the above; One manager program (no matter how many kernels there are) to move data: (1) Into the DataFlow accelerator, (2) In between the kernels (if more than one kernel exists), and (a) (b) (c) Figure 1 An example of the Maxeler DataFlow programming model: (a) Host code, (b) Manager code, and (c) Kernel code (a single kernel case): 2D convolution. Legend: SLiC = Compiler support for selected domain specific languages and customer applications. DFEvar = keyword used for defining the variables that internally flow through the configured hardware (in contrast to standard Java variables used to instruct the compiler). FIFO = First In First Out.
4 Trifunovic et al. Journal of Big Data (2015) 2:4 Page 4 of 9 (3) Out of the DataFlow accelerator; One simulation builder program, to test code without the time consuming migration into the binary level; One hardware builder program, to exploit the code on the binary level. In addition, in the host program (initially written in Fortran, Hadoop, MapReduce, MathLab, Matematika, C++, or C), instead of each migrated loop, one has to include a streaming construct (send data + receive results), represented by automatically generated C function Calc(x, DATA_SIZE, see Figure 1 (a). Qualitatively speaking, the above quantity (2n + 3) is not any more difficult to realize because of the existence of a DSL (domain specific language) like MaxJ (an extension of standard Java with over 100 new functionalities). Figure 2 shows how a complex Big Data processing problem can be realized in MaxJ code. Note that the programming model implies the need for existence of two types of variables: (a) Standard Java variables, to control compile time activities, and (b) DFE (DataFlow Engine) variables, which actually flow through configured hardware (denoted with the DFE prefix in the examples of figures 1 and 2). The programming of the DataFlow part of the code is largely facilitated through the use of appropriate Java extensions. Research design and methodology Bare speed is definitely neither the only issue of importance nor the most crucial one [17]. Consequently, the TOP500 ranking should not concentrate on only one issue of importance, no matter if it is speed, power dissipation, or size (the size includes hardware complexity in the widest sense); it should concentrate on all three issues together, at the same time. In this paper we argue that the best methodology for TOP500 benchmarking should be based on the holistic performance measure H (T BigData,N 1U ) defined as the number of 1U boxes (N 1U = one rack units or equivalent) needed to accomplish the desired execution time using a given Big Data benchmark. Instead of using theoretical measures of size and volume, we have opted in this paper for a more practical measure, which is related to international standardization efforts: The size of a 1U box. Issues like power dissipation (monthly electricity bill), and the physical size of the equipment Figure 2 An example of the Host and Kernel code.
5 Trifunovic et al. Journal of Big Data (2015) 2:4 Page 5 of 9 (the assumption here is that the equipment size is proportional to the hardware complexity, and to the hardware production cost) are implicitly covered by H (T BigData, N 1U ) (every 1U box has limited power dissipation and size). Selection of the performance measure H is coherent with the TPA concept introduced in [18] and described in Figure 3. Note that the hardware design cost is not encompassed by the parameter A, which encompasses only the hardware production cost, and causes that the above defined H formula represents an upper bound for ControlFlow machines and a lower bound for DataFlow machines. This is due to the fact that ControlFlow machines are built using the Von Neumann logic, which is complex to design (execution control unit, cash control mechanism, prediction mechanisms, etc.), while the DataFlow machines are built using the FPGA logic, which is simple to design; mostly because the level of design repetitiveness is extremely high, etc. The latter is beneficiary for many Big Data problems, where a large amount of data is continuously processed through the use of relatively simple operations. As indicated in the previous paper [7], the performance measure H puts PetaFlops out of date, and brings PetaData into the focus. Consequently, if the TOP500 ranking was based on the performance measure H, DataFlow machines would outperform ControlFlow machines. This statement is backed up with performance data presented in the next section. Results A survey of recent implementations of various algorithms using the DataFlow paradigm can be found in [19]. Future trends in the development of the DataFlow paradigm can be found in [20]. For comparison purposes, future trends in the ControlFlow paradigm can be found in [21]. Figure 3 The TPA (Time, Power, and Area) concept of the optimal computer design. Design optimizations have to optimize the three essential issues jointly: T = Time, P = Power, and A = Area (complexity of a VLSI chip).
6 Trifunovic et al. Journal of Big Data (2015) 2:4 Page 6 of 9 Some of recent implementations of various DataFlow algorithms interesting within the context of the performance measure H are summarized below. (1) Lindtjorn et al. [1], proved: (T) That one DataFlow node has the performance equivalent to about 70 twin server Nehelem CPU machines and to 14 two card Tesla GPU machines (application: Schlumberger, GeoPhysics), (P) Using a 150 MHz FPGAs, and (A) Packaged as 1U. The algorithm involved was Reverse Time Migration (RTM). Starting from Acoustic wave equation: 2 u t 2 ¼ v2 2 u x 2 þ 2 u y 2 þ 2 u z 2 where u is acoustic pressure and v is velocity. (2) Oriato et al. [2], proved: (T) That two DataFlow nodes have the performance equivalent to more than 1,900 3 GHz X86 CPU cores (application: ENI, The velocity-stress form of the elastic wave equation), (P) Using sixteen 150 MHz FPGAs, and (A) Packaged as 2U. The algorithm involved (3D Finite Difference) was: υ χ; t b σij χ; t χ ¼ b 2 χ 4 f t χ i χ; t j σij χ; t λ υ 2 k χ; t χ δ ij μ υi χ; t χ 4 t χ k χj υ j þ 3 χ; t 5; χ j 3 χ; t 5 ¼ χ i m a ij þ m s ij χ; t : t where λ and μ are the so-called Lamé parameters describing the elastic properties of the medium, σ is the stress and f is the source function (driving force). (3) Mencer et al. [8] proved: (T) That one DataFlow node has the performance equivalent to more than 382 Intel Xeon 2.7 GHz CPU cores (application: ENI, CRS 4 Lab, Meteorological Modelling), (P) Using a 150 MHz FPGAs, and (A) Packaged as 1U. The algorithm involved (Continuity (1) and (2) and thermal energy (3) equations) was: Z p 1 s t ¼ 0 V h q t ¼ u q ah x λ θ t ¼ u θ ah x φ v a p dσ σ q q σ φ σ þ F q σ θ σ þ F θ ð1þ ð2þ ð3þ Where V h is the horizontal wind in vector form, for which Cartesian components are u and v, terms F u,f v,f q and F θ represent contributions to the tendencies from the parameterization of physical processes such as radiation, convection, dry adiabatic adjustments, surface friction, soil water and energy balance, large scale
7 Trifunovic et al. Journal of Big Data (2015) 2:4 Page 7 of 9 precipitation and evaporation. The PE describes the time evolution for the five prognostic variables u, v, p s, q and θ. (4) Stojanović et al. [9] proved: (T) That one DataFlow node has the performance of about 10 i7 CPU cores, (P) Power reduction of about 17, and (A) Packaged as 1U. The algorithm involved (Gross Pitaevskii equation) was: ih t Φðr; tþ ¼ h2 2 2m þ V ext ðþþg r j Φ ð r; t Þ j2 Φðr; tþ where m is the mass of the boson, r is the coordinate of the boson, V ext is the external potential, g is coupling constant and Φ is wave function. (5) Chow et al. [3] proved: (T) That one DataFlow node has the performance of about 163 quad core CPUs, (P) Power reduction of about 170, and (A) Packaged as 1U. The algorithm involved (Monte Carlo simulation) was: I hfhin ¼ 1 X N χ N i i¼1 where x i is the input vector, N is the number of sample points, I is approximated expected value, and fh N is the sampled mean value of the quantity. (6) Arram et al. [10] proved: (T) That one DataFlow node has the performance of about 13 Intel X core CPUs and about 4 NVDIA GTX 580 GPU machine, (P) Using one 150 MHz FPGAs, and (A) Packaged as 1U. The algorithm involved (Genetic Sequence Alignment) was based on FM-index. This index combines the properties of suffix array (SA) with the Burrows-Wheeler transform (BWT). SA interval is updated for each character in pattern Q, moving from the last character to the first: k new ¼ cðχþþsðχ; k current 1Þ l new ¼ cðχþþsðχ; l current 1Þ where pointers k and l are respectively the smallest and largest indices in the SA which starts with Q, c(x) (frequency) is the number of symbols in the BWT sequence that are lexicographically smaller than x and s(x, i) (occurrence) is the number of occurrences of the symbol x inthebwtsequencefromthe0th position to the i th position. (7) Guo et al. [11] proved: (T) That one DataFlow node has the performance of about 517 Intel i GHz CPU cores and of about 28 GPU machines, (P) Using one 150 MHz FPGA, and (A) Packaged as 1U. The algorithm involved (Gaussian Mixture Models) was: pðxjλþ ¼ XM i¼1 w i g Xjμi; X i where x is a d-dimensional continuous-valued data vector (i.e. measurement or features), w i,i=1,, M, are the mixture weights, and g(x ě i,ó i ), i = 1,, M, are the component Gaussian densities.
8 Trifunovic et al. Journal of Big Data (2015) 2:4 Page 8 of 9 (8) Kos et al. [12] proved: (T) That one DataFlow node has the performance of between 100 and 400 Intel Core2 Quad 2.66 GHz CPU cores, (P) using one 200 MHz FPGA, and (A) Packaged as 1U. The algorithm involved was network sorting. An example of the simple sorting network is given in the Figure 4. Size of the sorting network, the number of comparators needed, to sort N numbers is: S ¼ N log 2 N ðlog 2 N þ 1Þ The achieved speed-up depended on N and the bit-size of numbers being sorted (between 8 and 64 bits). Conclusion The viewpoint presented in this paper sheds more light on the recent development of the DataFlow computing concept (more details can be found in [20-22]). The DataFlow computing paradigm requires new ways of thinking and new ways of programming. In general it redefines the subordination of program and data; instead of writing a program that controls how the data flows, the data flow defines the way a program is written. DataFlow computing excells with applications which are having high repetettivenes of operation and some level of data reusability within the operations. The latter is particularly beneficiarry for many BigData problems, where a large amount of data is repetetively processed through the use of relatively simple operations. The newly presented benchmarking methodology performance measure H (defined as the number of 1U boxes needed to accomplish the desired execution time using a given Big Data benchmark), would considerably reorder the TOP500 list. If the TOP500 ranking was based on the performance measure H, DataFlow machines would outperform ControlFlow machines. This statement is backed up with the presented performance results. The results show that when using DataFlow computers, instead od ControlFlow computers, time, energy, and/or space can be saved. The above can be of great interest to those who have to make decisions about future developments of their Big Data centers. It also opens up a new important problem: The need for a development of a public cloud of ready-to-use Big Data applications. The only remaining question is: can a Big Data application be broken in to a set of tasks and operations that are easily mappable into a DataFlow execution graph for a FPGA structure? We argue that for most Big Data applications the answer is positive Figure 4 An example of a simple sorting network for sorting four input values with five comparators. Each comparator connects two wires and emits higher value to the bottom wire and lower value to the top wire. Two comparators on the left and two comparators in the middle can work in parallel. Parallel operation of this sorting network sorts the input numbers in three steps.
9 Trifunovic et al. Journal of Big Data (2015) 2:4 Page 9 of 9 Comment This paper was prepared in response to over 100 messages with questions from the CACM readership inspired by our previous CACM contribution [7]. Competing interests The authors declare that they have no competing interests. Authors contributions NT, VM, JS, and AK proposed a new methodology for benchmarking Big Data applications, which takes into account not only the execution time, but also the power and space, needed to complete the task. All authors read and approved the final manuscript. Author details 1 School of Electrical Engineering, University of Belgrade, Belgrade, Serbia. 2 Mathematical Institute of the Serbian Academy of Sciences and Arts, Belgrade, Serbia. 3 School of Electrical Engineering, University of Ljubljana, Ljubljana, Slovenia. References 1. Flynn M, Mencer O, Milutinovic V, Rakocevic G, Stenstrom P, Trobec R, Valero M (2013) Moving from petaflops to petadata. In: Communications of the ACM, vol 5, 56th edn. ACM, New York, NY, USA, pp Dennis J, Misunas D, A Preliminary Architecture for a Basic Data-Flow Processor, Proceedings of the ISCA '75 the 2nd Annual Symposium on Computer Architecture, ACM, New York, NY, USA, 1975, pp Agerwala T (1982) Arvind,., data flow systems: guest Editors introduction, IEEE. Computer 15(2): Mencer O, Multiscale Dataflow Computing, Proceedings of the Final Conference Supercomputacion y eciencia, Barcelona Supercomputing Center, SyeC, Barcelona, Catalonia, Spain, May 27 - May 28, Resch M, Future Strategies for Supercomputing, Proceedings of the Final Conference Supercomputacion y eciencia, Barcelona Supercomputing Center, SyeC, Barcelona, Catalonia, Spain, May 27 - May 28, Flynn M, Area - Time - Power and Design Effort: The Basic Tradeoffs in Application Specific Systems, Proceedings of the 2005 IEEE International Conference on Application-Specific Systems and Architecture Processors (ASAP'05), Samos, Greece, July 23-July 25, Salom J, Fujii H, Overview of Acceleration Results of Maxeler FPGA Machines, IPSI Transactions on Internet Research, July 2013, Volume 5, Number 1, pp Maxeler Technologies FrontPage, London, UK, October 20, Patt Y (2009) Future microprocessors: what must We do differently if We Are to effectively utilize multi-core and many-core chips? IPSI Transactions on Internet Res 5(1): Lindtjorn O, Clapp G, Pell O, Mencer O, Flynn M, Fu H, Beyond Traditional Microprocessors for Geoscience High-Performance Computing Applications, IEEE Micro, Washington, USA, March/April 2011, Vol. 31, No. 2, pp Oriato D, Pell O, Andreoletti C, Bienati N, FD Modeling Beyond 70 Hz with FPGA Acceleration, Maxeler Summary, Summary of a talk presented at the SEG 2010 HPC Workshop, Denver, Colorado, USA, October Mencer O, Acceleration of a Meteorological Limited Area Model with Dataflow Engines, Proceedings of the Final Conference Supercomputacion y eciencia, Barcelona Supercomputing Center, SyeC, Barcelona, Catalonia, Spain, May 27 - May 28, Stojanovic S et al., One Implementation of the Gross Pitaevskii Algorithm, Proceedings of Maxeler@SANU, MISANU, Belgrade, Serbia, April 8, Chow G C T, Tse A H T, Jin O, Luk W, Leong P H W, Thomas D B, A Mixed Precision Monte Carlo Methodology for Reconfigurable Accelerator Systems, Proceedings of ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA), ACM, Monterey, CA, USA, February 2012, pp Arram J, Tsoi K H, Luk W, Jiang P, Hardware Acceleration of Genetic Sequence Alignment, Proceedings of 9th International Symposium ARC 2013, ACM, Los Angeles, CA, USA, March 25 27, pp Guo C, Fu H, Luk W, A Fully-Pipelined Expectation-Maximization Engine for Gaussian Mixture Models, Proceedings of 2012 International Conference on Field-Programmable Technology (FPT), Seoul, S. Korea, Dec. 2012, pp Kos A, Ranković V, Tomažič S, Sorting Networks on the Maxeler Data Flow Super Computing Systems, Advances in Computers, Elsevier, 225 Wyman Street, Waltham, MA 02451, USA, Flynn M, Mencer O, Greenspon I, Milutinovic V, Stojanovic S, Sustran Z, The Current Challenges in DataFlow Supercomputer Programming, Proceedings of ISCA 2013, ACM, Tell Aviv, Israel, June Seebode C, Ort M, Regenbrecht C, Peuker M (2013) BIG DATA infrastructures for pharmaceutical research. IEEE International Conference on Big Data, California 20. Ayhan S, Pesce J, Comitz P, Sweet D, Bliesner S, Gerberick G, Predictive analytics with aviation big data, Integrated Communications, Navigation and Surveillance ICNS Conference, 2013, Edinburgh, UK 21. Jinglan Zhang, Kai Huang, Cottman-Fields M, Truskinger A, Roe P, Shufei Duan, Xueyan Dong, Towsey M, Wimmer J, Managing and Analysing Big Audio Data for Environmental Monitoring, IEEE 16th International Conference on Computational Science and Engineering CSE Conference, 2013, Sydney, Australia 22. Convey Computer Web Page: October
Supercomputing: DataFlow vs. ControlFlow (Paradigm Shift in Both, Benchmarking Methodology and Programming Model)
Supercomputing: DataFlow vs. ControlFlow (Paradigm Shift in Both, Benchmarking Methodology and Programming Model) Introductory frame #1: Milutinovic, V., Salom, J., Trifunovic, N., and Valero, M. This
Seeking Opportunities for Hardware Acceleration in Big Data Analytics
Seeking Opportunities for Hardware Acceleration in Big Data Analytics Paul Chow High-Performance Reconfigurable Computing Group Department of Electrical and Computer Engineering University of Toronto Who
International Workshop on Field Programmable Logic and Applications, FPL '99
International Workshop on Field Programmable Logic and Applications, FPL '99 DRIVE: An Interpretive Simulation and Visualization Environment for Dynamically Reconægurable Systems? Kiran Bondalapati and
GPU System Architecture. Alan Gray EPCC The University of Edinburgh
GPU System Architecture EPCC The University of Edinburgh Outline Why do we want/need accelerators such as GPUs? GPU-CPU comparison Architectural reasons for GPU performance advantages GPU accelerated systems
Unleashing the Performance Potential of GPUs for Atmospheric Dynamic Solvers
Unleashing the Performance Potential of GPUs for Atmospheric Dynamic Solvers Haohuan Fu [email protected] High Performance Geo-Computing (HPGC) Group Center for Earth System Science Tsinghua University
The School of Electrical Engineering University of Belgrade
The School of Electrical Engineering University of Belgrade THE TUTORIAL MARATHON FROM THE UNIVERSITY OF BELGRADE Institutions: School of Electrical Engineering (12) School of Mathematics (6) School of
Dataflow Programming with MaxCompiler
Dataflow Programming with MaCompiler Lecture Overview Programming DFEs MaCompiler Streaming Kernels Compile and build Java meta-programming 2 Reconfigurable Computing with DFEs Logic Cell (10 5 elements)
Going to the wire: The next generation financial risk management platform
Going to the wire: The next generation financial risk management platform Ari Studnitzer, MD of Platform Development, CME Group Oskar Mencer, CEO and Founder, Maxeler CME Group CME Group is the world s
Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU
Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Heshan Li, Shaopeng Wang The Johns Hopkins University 3400 N. Charles Street Baltimore, Maryland 21218 {heshanli, shaopeng}@cs.jhu.edu 1 Overview
Reconfigurable Architecture Requirements for Co-Designed Virtual Machines
Reconfigurable Architecture Requirements for Co-Designed Virtual Machines Kenneth B. Kent University of New Brunswick Faculty of Computer Science Fredericton, New Brunswick, Canada [email protected] Micaela Serra
Enhancing Cloud-based Servers by GPU/CPU Virtualization Management
Enhancing Cloud-based Servers by GPU/CPU Virtualiz Management Tin-Yu Wu 1, Wei-Tsong Lee 2, Chien-Yu Duan 2 Department of Computer Science and Inform Engineering, Nal Ilan University, Taiwan, ROC 1 Department
benchmarking Amazon EC2 for high-performance scientific computing
Edward Walker benchmarking Amazon EC2 for high-performance scientific computing Edward Walker is a Research Scientist with the Texas Advanced Computing Center at the University of Texas at Austin. He received
Implementation of emulated digital CNN-UM architecture on programmable logic devices and its applications
Implementation of emulated digital CNN-UM architecture on programmable logic devices and its applications Theses of the Ph.D. dissertation Zoltán Nagy Scientific adviser: Dr. Péter Szolgay Doctoral School
Dynamic resource management for energy saving in the cloud computing environment
Dynamic resource management for energy saving in the cloud computing environment Liang-Teh Lee, Kang-Yuan Liu, and Hui-Yang Huang Department of Computer Science and Engineering, Tatung University, Taiwan
FPGA-based MapReduce Framework for Machine Learning
FPGA-based MapReduce Framework for Machine Learning Bo WANG 1, Yi SHAN 1, Jing YAN 2, Yu WANG 1, Ningyi XU 2, Huangzhong YANG 1 1 Department of Electronic Engineering Tsinghua University, Beijing, China
Mixed Precision Iterative Refinement Methods Energy Efficiency on Hybrid Hardware Platforms
Mixed Precision Iterative Refinement Methods Energy Efficiency on Hybrid Hardware Platforms Björn Rocker Hamburg, June 17th 2010 Engineering Mathematics and Computing Lab (EMCL) KIT University of the State
Performance Analysis of Remote Desktop Virtualization based on Hyper-V versus Remote Desktop Services
MACRo 2015-5 th International Conference on Recent Achievements in Mechatronics, Automation, Computer Science and Robotics Performance Analysis of Remote Desktop Virtualization based on Hyper-V versus
Big Data Challenges in Bioinformatics
Big Data Challenges in Bioinformatics BARCELONA SUPERCOMPUTING CENTER COMPUTER SCIENCE DEPARTMENT Autonomic Systems and ebusiness Pla?orms Jordi Torres [email protected] Talk outline! We talk about Petabyte?
Parallelism and Cloud Computing
Parallelism and Cloud Computing Kai Shen Parallel Computing Parallel computing: Process sub tasks simultaneously so that work can be completed faster. For instances: divide the work of matrix multiplication
Secured Embedded Many-Core Accelerator for Big Data Processing
Secured Embedded Many- Accelerator for Big Data Processing Amey Kulkarni PhD Candidate Advisor: Professor Tinoosh Mohsenin Energy Efficient High Performance Computing (EEHPC) Lab University of Maryland,
A Dynamic Resource Management with Energy Saving Mechanism for Supporting Cloud Computing
A Dynamic Resource Management with Energy Saving Mechanism for Supporting Cloud Computing Liang-Teh Lee, Kang-Yuan Liu, Hui-Yang Huang and Chia-Ying Tseng Department of Computer Science and Engineering,
Binary search tree with SIMD bandwidth optimization using SSE
Binary search tree with SIMD bandwidth optimization using SSE Bowen Zhang, Xinwei Li 1.ABSTRACT In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous
http://www.ece.ucy.ac.cy/labs/easoc/people/kyrkou/index.html BSc in Computer Engineering, University of Cyprus
Christos Kyrkou, PhD KIOS Research Center for Intelligent Systems and Networks, Department of Electrical and Computer Engineering, University of Cyprus, Tel:(+357)99569478, email: [email protected] Education
How to write high quality papers in algorithmic or experimental Computer Science
How to write high quality papers in algorithmic or experimental Computer Science Veljko Milutinovic (University of Belgrade, Serbia, [email protected]) Jelica Protic (University of Belgrade, Serbia, [email protected])
Dr. Raju Namburu Computational Sciences Campaign U.S. Army Research Laboratory. The Nation s Premier Laboratory for Land Forces UNCLASSIFIED
Dr. Raju Namburu Computational Sciences Campaign U.S. Army Research Laboratory 21 st Century Research Continuum Theory Theory embodied in computation Hypotheses tested through experiment SCIENTIFIC METHODS
Scalability and Classifications
Scalability and Classifications 1 Types of Parallel Computers MIMD and SIMD classifications shared and distributed memory multicomputers distributed shared memory computers 2 Network Topologies static
Automata Designs for Data Encryption with AES using the Micron Automata Processor
IJCSNS International Journal of Computer Science and Network Security, VOL.15 No.7, July 2015 1 Automata Designs for Data Encryption with AES using the Micron Automata Processor Angkul Kongmunvattana School
Turbomachinery CFD on many-core platforms experiences and strategies
Turbomachinery CFD on many-core platforms experiences and strategies Graham Pullan Whittle Laboratory, Department of Engineering, University of Cambridge MUSAF Colloquium, CERFACS, Toulouse September 27-29
Parallel Computing. Benson Muite. [email protected] http://math.ut.ee/ benson. https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage
Parallel Computing Benson Muite [email protected] http://math.ut.ee/ benson https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage 3 November 2014 Hadoop, Review Hadoop Hadoop History Hadoop Framework
CFD Implementation with In-Socket FPGA Accelerators
CFD Implementation with In-Socket FPGA Accelerators Ivan Gonzalez UAM Team at DOVRES FuSim-E Programme Symposium: CFD on Future Architectures C 2 A 2 S 2 E DLR Braunschweig 14 th -15 th October 2009 Outline
Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging
Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging In some markets and scenarios where competitive advantage is all about speed, speed is measured in micro- and even nano-seconds.
Innovative improvement of fundamental metrics including power dissipation and efficiency of the ALU system
Innovative improvement of fundamental metrics including power dissipation and efficiency of the ALU system Joseph LaBauve Department of Electrical and Computer Engineering University of Central Florida
Systolic Computing. Fundamentals
Systolic Computing Fundamentals Motivations for Systolic Processing PARALLEL ALGORITHMS WHICH MODEL OF COMPUTATION IS THE BETTER TO USE? HOW MUCH TIME WE EXPECT TO SAVE USING A PARALLEL ALGORITHM? HOW
PERFORMANCE ANALYSIS OF KERNEL-BASED VIRTUAL MACHINE
PERFORMANCE ANALYSIS OF KERNEL-BASED VIRTUAL MACHINE Sudha M 1, Harish G M 2, Nandan A 3, Usha J 4 1 Department of MCA, R V College of Engineering, Bangalore : 560059, India [email protected] 2 Department
Data Mining with Hadoop at TACC
Data Mining with Hadoop at TACC Weijia Xu Data Mining & Statistics Data Mining & Statistics Group Main activities Research and Development Developing new data mining and analysis solutions for practical
FPGA area allocation for parallel C applications
1 FPGA area allocation for parallel C applications Vlad-Mihai Sima, Elena Moscu Panainte, Koen Bertels Computer Engineering Faculty of Electrical Engineering, Mathematics and Computer Science Delft University
High Performance Computing in CST STUDIO SUITE
High Performance Computing in CST STUDIO SUITE Felix Wolfheimer GPU Computing Performance Speedup 18 16 14 12 10 8 6 4 2 0 Promo offer for EUC participants: 25% discount for K40 cards Speedup of Solver
Energy Efficient MapReduce
Energy Efficient MapReduce Motivation: Energy consumption is an important aspect of datacenters efficiency, the total power consumption in the united states has doubled from 2000 to 2005, representing
FPGA-based Multithreading for In-Memory Hash Joins
FPGA-based Multithreading for In-Memory Hash Joins Robert J. Halstead, Ildar Absalyamov, Walid A. Najjar, Vassilis J. Tsotras University of California, Riverside Outline Background What are FPGAs Multithreaded
ultra fast SOM using CUDA
ultra fast SOM using CUDA SOM (Self-Organizing Map) is one of the most popular artificial neural network algorithms in the unsupervised learning category. Sijo Mathew Preetha Joy Sibi Rajendra Manoj A
Advances in Smart Systems Research : ISSN 2050-8662 : http://nimbusvault.net/publications/koala/assr/ Vol. 3. No. 3 : pp.
Advances in Smart Systems Research : ISSN 2050-8662 : http://nimbusvault.net/publications/koala/assr/ Vol. 3. No. 3 : pp.49-54 : isrp13-005 Optimized Communications on Cloud Computer Processor by Using
ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU
Computer Science 14 (2) 2013 http://dx.doi.org/10.7494/csci.2013.14.2.243 Marcin Pietroń Pawe l Russek Kazimierz Wiatr ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU Abstract This paper presents
The Green Index: A Metric for Evaluating System-Wide Energy Efficiency in HPC Systems
202 IEEE 202 26th IEEE International 26th International Parallel Parallel and Distributed and Distributed Processing Processing Symposium Symposium Workshops Workshops & PhD Forum The Green Index: A Metric
Online Failure Prediction in Cloud Datacenters
Online Failure Prediction in Cloud Datacenters Yukihiro Watanabe Yasuhide Matsumoto Once failures occur in a cloud datacenter accommodating a large number of virtual resources, they tend to spread rapidly
Hardware-Aware Analysis and. Presentation Date: Sep 15 th 2009 Chrissie C. Cui
Hardware-Aware Analysis and Optimization of Stable Fluids Presentation Date: Sep 15 th 2009 Chrissie C. Cui Outline Introduction Highlights Flop and Bandwidth Analysis Mehrstellen Schemes Advection Caching
OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC
OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC Driving industry innovation The goal of the OpenPOWER Foundation is to create an open ecosystem, using the POWER Architecture to share expertise,
Moving Beyond CPUs in the Cloud: Will FPGAs Sink or Swim?
Moving Beyond CPUs in the Cloud: Will FPGAs Sink or Swim? Successful FPGA datacenter usage at scale will require differentiated capability, programming ease, and scalable implementation models Executive
Challenges and Importance of Green Data Center on Virtualization Environment
Challenges and Importance of Green Data Center on Virtualization Environment Abhishek Singh Department of Information Technology Amity University, Noida, Uttar Pradesh, India Priyanka Upadhyay Department
Infrastructure Matters: POWER8 vs. Xeon x86
Advisory Infrastructure Matters: POWER8 vs. Xeon x86 Executive Summary This report compares IBM s new POWER8-based scale-out Power System to Intel E5 v2 x86- based scale-out systems. A follow-on report
IMPLEMENTATION OF FPGA CARD IN CONTENT FILTERING SOLUTIONS FOR SECURING COMPUTER NETWORKS. Received May 2010; accepted July 2010
ICIC Express Letters Part B: Applications ICIC International c 2010 ISSN 2185-2766 Volume 1, Number 1, September 2010 pp. 71 76 IMPLEMENTATION OF FPGA CARD IN CONTENT FILTERING SOLUTIONS FOR SECURING COMPUTER
A General Framework for Tracking Objects in a Multi-Camera Environment
A General Framework for Tracking Objects in a Multi-Camera Environment Karlene Nguyen, Gavin Yeung, Soheil Ghiasi, Majid Sarrafzadeh {karlene, gavin, soheil, majid}@cs.ucla.edu Abstract We present a framework
An Open-source Framework for Integrating Heterogeneous Resources in Private Clouds
An Open-source Framework for Integrating Heterogeneous Resources in Private Clouds Julio Proaño, Carmen Carrión and Blanca Caminero Albacete Research Institute of Informatics (I3A), University of Castilla-La
PERFORMANCE ENHANCEMENTS IN TreeAge Pro 2014 R1.0
PERFORMANCE ENHANCEMENTS IN TreeAge Pro 2014 R1.0 15 th January 2014 Al Chrosny Director, Software Engineering TreeAge Software, Inc. [email protected] Andrew Munzer Director, Training and Customer
Dell Reference Configuration for Hortonworks Data Platform
Dell Reference Configuration for Hortonworks Data Platform A Quick Reference Configuration Guide Armando Acosta Hadoop Product Manager Dell Revolutionary Cloud and Big Data Group Kris Applegate Solution
Keywords: Dynamic Load Balancing, Process Migration, Load Indices, Threshold Level, Response Time, Process Age.
Volume 3, Issue 10, October 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Load Measurement
HP ProLiant SL270s Gen8 Server. Evaluation Report
HP ProLiant SL270s Gen8 Server Evaluation Report Thomas Schoenemeyer, Hussein Harake and Daniel Peter Swiss National Supercomputing Centre (CSCS), Lugano Institute of Geophysics, ETH Zürich [email protected]
Dell Compellent Storage Center SAN & VMware View 1,000 Desktop Reference Architecture. Dell Compellent Product Specialist Team
Dell Compellent Storage Center SAN & VMware View 1,000 Desktop Reference Architecture Dell Compellent Product Specialist Team THIS WHITE PAPER IS FOR INFORMATIONAL PURPOSES ONLY, AND MAY CONTAIN TYPOGRAPHICAL
BSPCloud: A Hybrid Programming Library for Cloud Computing *
BSPCloud: A Hybrid Programming Library for Cloud Computing * Xiaodong Liu, Weiqin Tong and Yan Hou Department of Computer Engineering and Science Shanghai University, Shanghai, China [email protected],
ON SUITABILITY OF FPGA BASED EVOLVABLE HARDWARE SYSTEMS TO INTEGRATE RECONFIGURABLE CIRCUITS WITH HOST PROCESSING UNIT
216 ON SUITABILITY OF FPGA BASED EVOLVABLE HARDWARE SYSTEMS TO INTEGRATE RECONFIGURABLE CIRCUITS WITH HOST PROCESSING UNIT *P.Nirmalkumar, **J.Raja Paul Perinbam, @S.Ravi and #B.Rajan *Research Scholar,
Scientific Computing Programming with Parallel Objects
Scientific Computing Programming with Parallel Objects Esteban Meneses, PhD School of Computing, Costa Rica Institute of Technology Parallel Architectures Galore Personal Computing Embedded Computing Moore
Optimization of Preventive Maintenance Scheduling in Processing Plants
18 th European Symposium on Computer Aided Process Engineering ESCAPE 18 Bertrand Braunschweig and Xavier Joulia (Editors) 2008 Elsevier B.V./Ltd. All rights reserved. Optimization of Preventive Maintenance
GUEST OPERATING SYSTEM BASED PERFORMANCE COMPARISON OF VMWARE AND XEN HYPERVISOR
GUEST OPERATING SYSTEM BASED PERFORMANCE COMPARISON OF VMWARE AND XEN HYPERVISOR ANKIT KUMAR, SAVITA SHIWANI 1 M. Tech Scholar, Software Engineering, Suresh Gyan Vihar University, Rajasthan, India, Email:
TPCalc : a throughput calculator for computer architecture studies
TPCalc : a throughput calculator for computer architecture studies Pierre Michaud Stijn Eyerman Wouter Rogiest IRISA/INRIA Ghent University Ghent University [email protected] [email protected]
CHAPTER 1 INTRODUCTION
1 CHAPTER 1 INTRODUCTION 1.1 MOTIVATION OF RESEARCH Multicore processors have two or more execution cores (processors) implemented on a single chip having their own set of execution and architectural recourses.
Stream Processing on GPUs Using Distributed Multimedia Middleware
Stream Processing on GPUs Using Distributed Multimedia Middleware Michael Repplinger 1,2, and Philipp Slusallek 1,2 1 Computer Graphics Lab, Saarland University, Saarbrücken, Germany 2 German Research
Attaining EDF Task Scheduling with O(1) Time Complexity
Attaining EDF Task Scheduling with O(1) Time Complexity Verber Domen University of Maribor, Faculty of Electrical Engineering and Computer Sciences, Maribor, Slovenia (e-mail: [email protected]) Abstract:
HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK
HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK Steve Oberlin CTO, Accelerated Computing US to Build Two Flagship Supercomputers SUMMIT SIERRA Partnership for Science 100-300 PFLOPS Peak Performance
LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR
LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR Frédéric Kuznik, frederic.kuznik@insa lyon.fr 1 Framework Introduction Hardware architecture CUDA overview Implementation details A simple case:
Informatica Ultra Messaging SMX Shared-Memory Transport
White Paper Informatica Ultra Messaging SMX Shared-Memory Transport Breaking the 100-Nanosecond Latency Barrier with Benchmark-Proven Performance This document contains Confidential, Proprietary and Trade
Parallel Computing with MATLAB
Parallel Computing with MATLAB Scott Benway Senior Account Manager Jiro Doke, Ph.D. Senior Application Engineer 2013 The MathWorks, Inc. 1 Acceleration Strategies Applied in MATLAB Approach Options Best
Parallel Algorithm Engineering
Parallel Algorithm Engineering Kenneth S. Bøgh PhD Fellow Based on slides by Darius Sidlauskas Outline Background Current multicore architectures UMA vs NUMA The openmp framework Examples Software crisis
Cloud Storage Solution for WSN Based on Internet Innovation Union
Cloud Storage Solution for WSN Based on Internet Innovation Union Tongrang Fan 1, Xuan Zhang 1, Feng Gao 1 1 School of Information Science and Technology, Shijiazhuang Tiedao University, Shijiazhuang,
Examining the Viability of FPGA Supercomputing
1 Examining the Viability of FPGA Supercomputing Stephen D. Craven and Peter Athanas Bradley Department of Electrical and Computer Engineering Virginia Polytechnic Institute and State University Blacksburg,
PyCompArch: Python-Based Modules for Exploring Computer Architecture Concepts
PyCompArch: Python-Based Modules for Exploring Computer Architecture Concepts Workshop on Computer Architecture Education 2015 Dan Connors, Kyle Dunn, Ryan Bueter Department of Electrical Engineering University
Heterogeneous Workload Consolidation for Efficient Management of Data Centers in Cloud Computing
Heterogeneous Workload Consolidation for Efficient Management of Data Centers in Cloud Computing Deep Mann ME (Software Engineering) Computer Science and Engineering Department Thapar University Patiala-147004
Performance Oriented Management System for Reconfigurable Network Appliances
Performance Oriented Management System for Reconfigurable Network Appliances Hiroki Matsutani, Ryuji Wakikawa, Koshiro Mitsuya and Jun Murai Faculty of Environmental Information, Keio University Graduate
Large-Scale Data Sets Clustering Based on MapReduce and Hadoop
Journal of Computational Information Systems 7: 16 (2011) 5956-5963 Available at http://www.jofcis.com Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE
How To Store Data On An Ocora Nosql Database On A Flash Memory Device On A Microsoft Flash Memory 2 (Iomemory)
WHITE PAPER Oracle NoSQL Database and SanDisk Offer Cost-Effective Extreme Performance for Big Data 951 SanDisk Drive, Milpitas, CA 95035 www.sandisk.com Table of Contents Abstract... 3 What Is Big Data?...
MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research
MapReduce and Distributed Data Analysis Google Research 1 Dealing With Massive Data 2 2 Dealing With Massive Data Polynomial Memory Sublinear RAM Sketches External Memory Property Testing 3 3 Dealing With
Kashif Iqbal - PhD [email protected]
HPC/HTC vs. Cloud Benchmarking An empirical evalua.on of the performance and cost implica.ons Kashif Iqbal - PhD [email protected] ICHEC, NUI Galway, Ireland With acknowledgment to Michele MicheloDo
High Performance Computing in the Multi-core Area
High Performance Computing in the Multi-core Area Arndt Bode Technische Universität München Technology Trends for Petascale Computing Architectures: Multicore Accelerators Special Purpose Reconfigurable
A Load Balancing Algorithm based on the Variation Trend of Entropy in Homogeneous Cluster
, pp.11-20 http://dx.doi.org/10.14257/ ijgdc.2014.7.2.02 A Load Balancing Algorithm based on the Variation Trend of Entropy in Homogeneous Cluster Kehe Wu 1, Long Chen 2, Shichao Ye 2 and Yi Li 2 1 Beijing
How To Build An Ark Processor With An Nvidia Gpu And An African Processor
Project Denver Processor to Usher in a New Era of Computing Bill Dally January 5, 2011 http://blogs.nvidia.com/2011/01/project-denver-processor-to-usher-in-new-era-of-computing/ Project Denver Announced
Evaluation of CUDA Fortran for the CFD code Strukti
Evaluation of CUDA Fortran for the CFD code Strukti Practical term report from Stephan Soller High performance computing center Stuttgart 1 Stuttgart Media University 2 High performance computing center
Efficient Data Replication Scheme based on Hadoop Distributed File System
, pp. 177-186 http://dx.doi.org/10.14257/ijseia.2015.9.12.16 Efficient Data Replication Scheme based on Hadoop Distributed File System Jungha Lee 1, Jaehwa Chung 2 and Daewon Lee 3* 1 Division of Supercomputing,
HPC Wales Skills Academy Course Catalogue 2015
HPC Wales Skills Academy Course Catalogue 2015 Overview The HPC Wales Skills Academy provides a variety of courses and workshops aimed at building skills in High Performance Computing (HPC). Our courses
MapReduce on GPUs. Amit Sabne, Ahmad Mujahid Mohammed Razip, Kun Xu
1 MapReduce on GPUs Amit Sabne, Ahmad Mujahid Mohammed Razip, Kun Xu 2 MapReduce MAP Shuffle Reduce 3 Hadoop Open-source MapReduce framework from Apache, written in Java Used by Yahoo!, Facebook, Ebay,
PARALLEL & CLUSTER COMPUTING CS 6260 PROFESSOR: ELISE DE DONCKER BY: LINA HUSSEIN
1 PARALLEL & CLUSTER COMPUTING CS 6260 PROFESSOR: ELISE DE DONCKER BY: LINA HUSSEIN Introduction What is cluster computing? Classification of Cluster Computing Technologies: Beowulf cluster Construction
Three Paths to Faster Simulations Using ANSYS Mechanical 16.0 and Intel Architecture
White Paper Intel Xeon processor E5 v3 family Intel Xeon Phi coprocessor family Digital Design and Engineering Three Paths to Faster Simulations Using ANSYS Mechanical 16.0 and Intel Architecture Executive
Design and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms
Design and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms Amani AlOnazi, David E. Keyes, Alexey Lastovetsky, Vladimir Rychkov Extreme Computing Research Center,
Chapter 2. Why is some hardware better than others for different programs?
Chapter 2 1 Performance Measure, Report, and Summarize Make intelligent choices See through the marketing hype Key to understanding underlying organizational motivation Why is some hardware better than
Research Article Hadoop-Based Distributed Sensor Node Management System
Distributed Networks, Article ID 61868, 7 pages http://dx.doi.org/1.1155/214/61868 Research Article Hadoop-Based Distributed Node Management System In-Yong Jung, Ki-Hyun Kim, Byong-John Han, and Chang-Sung
HPC and Big Data. EPCC The University of Edinburgh. Adrian Jackson Technical Architect [email protected]
HPC and Big Data EPCC The University of Edinburgh Adrian Jackson Technical Architect [email protected] EPCC Facilities Technology Transfer European Projects HPC Research Visitor Programmes Training
Introduction to Cloud Computing
Introduction to Cloud Computing Parallel Processing I 15 319, spring 2010 7 th Lecture, Feb 2 nd Majd F. Sakr Lecture Motivation Concurrency and why? Different flavors of parallel computing Get the basic
