Accelerating Fast Fourier Transforms Using Hadoop and CUDA
Rostislav Tsiomenko [1]
SAIC
Columbia, Md., USA

Bradley S. Rees [1]
SAIC
Columbia, Md., USA
[email protected]

[1] Now at Novetta. [email protected]

Abstract

There has been considerable research into improving Fast Fourier Transform (FFT) performance through parallelization and optimization for specialized hardware. However, even with those advancements, processing of very large files, over 1TB in size, still remains prohibitively slow. Analysts performing signal processing are forced to wait hours or days for results, which disrupts their workflow and decreases their productivity. In this paper we present a unique approach that not only parallelizes the workload over multiple cores, but also distributes the problem over a cluster of graphics processing unit (GPU)-equipped servers. By utilizing Hadoop (The Apache Software Foundation) and CUDA (NVIDIA Corporation), we can take advantage of inexpensive servers while still exceeding the processing power of a dedicated supercomputer, as demonstrated by our results using Amazon EC2 (Amazon Technologies, Inc.).

Keywords: Distributed computing; Parallel processing; Fast Fourier transforms

I. INTRODUCTION

The Fast Fourier Transform (FFT) algorithm continues to play a critical role in many types of applications, from data compression, signal processing, and voice recognition, to image processing and simulation [5]. Our interest in the FFT algorithm relates to signal processing and its use in spectral analysis. Additionally, our interest lies in how to exploit existing FFT libraries to achieve high performance. The Cooley-Tukey algorithm is the most commonly used FFT algorithm and has been refined and ported to a number of high-performance platforms. Both the Math Kernel Library (MKL) from Intel Corporation [1] and the CUDA FFT (CUFFT) library from NVIDIA Corporation [2] offer highly optimized variants of the Cooley-Tukey algorithm. NVIDIA claims that CUFFT offers up to a tenfold increase in performance over MKL when using the latest NVIDIA GPUs. However, even with the base O(n log n) time complexity of the Cooley-Tukey algorithm, and the architecture-specific performance optimizations of MKL and CUDA, performing an FFT on large, terabyte-scale files remains computationally expensive.

In this paper, we present a novel approach for processing very large files that splits data processing across a cluster of servers, where each server can be equipped with GPUs for additional performance acceleration. This approach provides the ability to process very large files without the need for costly supercomputers. Runtime is estimated at O((n log n) / (C · 0.8S)), where C is the number of cores per server, S is the number of servers, and 0.8 is the efficiency factor (loss to overhead per server). We show that performance scales with the number of servers, with only a small loss of performance to Hadoop overhead. Our proof-of-concept software, discussed in Section III, uses Hadoop for handling file splitting and merging, and NVIDIA's CUDA for accelerating the computation of the FFT algorithm. We discuss results in Section IV, where we baseline single-server and single-server-with-GPU performance to illustrate the gains possible. We develop our conclusion in Section V and present a list of future research topics in Section VI.

II. RELATED WORK

The Cooley-Tukey algorithm is one of the most popular FFT algorithms, primarily due to its inherent support for parallelization.
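As a rough illustration of how this estimate behaves (a sketch of ours, not part of the implementation; the runtime expression is as reconstructed above and every constant is arbitrary), the following Java snippet compares the predicted cost of a cluster run against an ideal single server:

public class RuntimeModel {
    // Relative cost predicted by the estimate O(n log n / (C * efficiency * S)); only the
    // ratio between two configurations is meaningful, not the absolute number.
    static double relativeCost(double n, int coresPerServer, int servers, double efficiency) {
        return n * (Math.log(n) / Math.log(2)) / (coresPerServer * efficiency * servers);
    }

    public static void main(String[] args) {
        double n = Math.pow(2, 38);                    // illustrative sample count
        double ideal = relativeCost(n, 4, 1, 1.0);     // one server, no Hadoop overhead
        for (int servers : new int[] {1, 4, 8, 16}) {
            double cluster = relativeCost(n, 4, servers, 0.8);
            System.out.printf("servers=%2d -> predicted speedup %.1fx%n", servers, ideal / cluster);
        }
    }
}

Under these assumptions the predicted speedup grows nearly linearly with the number of servers, with the 0.8 factor accounting for the per-server overhead.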
Chen and Gao [3] recently performed an analysis of the algorithm on multi-core servers to see if they could predict performance. They showed that performance scales with the number of cores in the
server and that they could predict the runtime with fairly good accuracy. That work could be extended to our case to predict the number of servers needed, assuming all servers have an equal number of cores.

In 1994 Agarwal, Gustavson, and Zubair [5] proposed a variant of the Agarwal-Cooley algorithm that is optimized for multi-core systems, specifically the IBM SP (International Business Machines Corporation) platform. While their approach allowed the algorithm to scale to multiple cores, it requires a supercomputer-class machine to achieve the throughput we sought, and is tied to the IBM platform. Additionally, the 1.25 GFLOP performance they reached in 1994, while impressive for the time, pales in comparison with what can be achieved currently with inexpensive NVIDIA GPUs that can reach ~375 GFLOPS [2].

The performance benefit of using GPUs for FFTs has caught the attention of numerous researchers [4, 6, 7, 8, 9, 10]. However, the problem with GPUs is the limited amount of data that can fit within GPU memory. When the amount of data exceeds GPU memory, performance starts to degrade due to the latency of moving data between main memory and GPU memory. This problem was addressed by Gu, Siegel, and Li [6], who developed an efficient method for processing large data files between main memory and GPU memory.

Work closest to what we propose comes from Chen, Cui, and Mei [7], who developed an algorithm, along with supporting infrastructure, for performing FFTs using a GPU cluster. This work leverages a class of supercomputer that merges central processing units (CPUs) and GPUs for high-performance processing. GPU-based supercomputers ranked fifth, sixth, tenth, and fourteenth in the June 2012 release of the Top500 list of supercomputers [14]. Our goal is not to create a new variation of the FFT algorithm, or to rely on expensive hardware platforms. Instead, we would like to take advantage of the latest state-of-the-art implementations and distribute the processing over multiple servers using Hadoop [12, 13].

III. ALGORITHM DESCRIPTION

Our initial intuition was to split the data file into a number of Hadoop Record objects (Record here refers to a Hadoop class) encapsulating an <offset, FFT segment> pair, where offset indicates the byte offset from the beginning of the file, and FFT segment is an array of FFT-size samples (for an FFT size of 1024 and single-precision data, this would be 4096 bytes). However, this implementation was quickly deemed inefficient; given a 1TB input file, a 1024-point FFT, for example, would create 268,435,456 individual Records, with each Record holding only a 4096-byte value, thus requiring excessive overhead to track the completion progress of every Record inside the framework. Besides creating unneeded complexity, this approach severely decreases the efficiency of CUDA; almost all GPU calculations occur faster than memory transfers from the host machine to the device (and vice versa) [15], as the on-GPU memory bus is much faster than the peripheral component interconnect express (PCI-e) bus, which has lower transfer speeds and higher latency. It is therefore beneficial to minimize memory transfers to and from the GPU. In order to solve this problem, we implemented custom Hadoop input and output classes that treat a single Hadoop Distributed File System (HDFS) block as a Split (a Split is Hadoop's unit for breaking input data into a number of smaller chunks; one Split per block is the default Hadoop behavior), which is then treated as a single Record.
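The sketch below outlines this kind of whole-block input handling (class and field names are illustrative, not the exact classes from our implementation): each file split, sized to one HDFS block, becomes a single record whose key is the block's byte offset in the original file and whose value is the raw block contents.

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// One record per split: key = byte offset of the block in the original file,
// value = the raw bytes of the block.
public class BlockRecordInputFormat extends FileInputFormat<LongWritable, BytesWritable> {

    @Override
    public RecordReader<LongWritable, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new BlockRecordReader();
    }

    public static class BlockRecordReader extends RecordReader<LongWritable, BytesWritable> {
        private FileSplit fileSplit;
        private TaskAttemptContext context;
        private final LongWritable key = new LongWritable();
        private final BytesWritable value = new BytesWritable();
        private boolean done = false;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) {
            this.fileSplit = (FileSplit) split;
            this.context = context;
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (done) {
                return false;
            }
            Path path = fileSplit.getPath();
            FileSystem fs = path.getFileSystem(context.getConfiguration());
            byte[] buffer = new byte[(int) fileSplit.getLength()];   // one block, well under 2GB
            FSDataInputStream in = fs.open(path);
            try {
                in.seek(fileSplit.getStart());                        // jump to this block's offset
                IOUtils.readFully(in, buffer, 0, buffer.length);
            } finally {
                in.close();
            }
            key.set(fileSplit.getStart());                            // offset in the original file
            value.set(buffer, 0, buffer.length);                      // whole block as one value
            done = true;
            return true;
        }

        @Override public LongWritable getCurrentKey() { return key; }
        @Override public BytesWritable getCurrentValue() { return value; }
        @Override public float getProgress() { return done ? 1.0f : 0.0f; }
        @Override public void close() { }
    }
}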
The benefit of this approach is two-fold: using 512MB HDFS blocks, the number of map tasks launched for a 1TB file decreases to 2,048, with each map task having 512MB of data to work with. This data can be transferred to the GPU in a single pair of allocate and memory-copy operations, and the partitioning into FFT segments can be done inside GPU memory using CUFFT's batched FFT plan. The CPU implementation uses the same classes for file input and output, using a for-loop to go through the 512MB of data in place. The block size was set at 512MB, as that was the maximum limit for pointers to in-place FFT data on the CUDA cards we used for development; however, this number can easily be adjusted by changing the Hadoop dfs.block.size variable when copying the input file into HDFS, making it easy to change CUDA buffer sizes for clusters with different GPU configurations without rewriting the code.
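As an illustration of the batched approach (a minimal sketch, not our production code; error checking, sample layout, and buffer management are simplified, and the class name is ours), the JCufft calls involved look roughly like this:

import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.jcufft.JCufft;
import jcuda.jcufft.cufftHandle;
import jcuda.jcufft.cufftType;
import jcuda.runtime.JCuda;
import jcuda.runtime.cudaMemcpyKind;

public class BatchedFft {
    // Transforms 'block' in place on the GPU: one allocation, one copy in, one batched
    // complex-to-complex plan covering every FFT segment in the block, and one copy back.
    // Assumes 'block' holds interleaved complex single-precision samples (re, im), so a
    // single 1024-point FFT consumes 2048 floats.
    public static void transformBlock(float[] block, int fftSize) {
        int batch = block.length / (2 * fftSize);          // number of FFTs in this block
        long bytes = (long) block.length * Sizeof.FLOAT;

        Pointer deviceData = new Pointer();
        JCuda.cudaMalloc(deviceData, bytes);
        JCuda.cudaMemcpy(deviceData, Pointer.to(block), bytes,
                cudaMemcpyKind.cudaMemcpyHostToDevice);

        cufftHandle plan = new cufftHandle();
        JCufft.cufftPlan1d(plan, fftSize, cufftType.CUFFT_C2C, batch);
        JCufft.cufftExecC2C(plan, deviceData, deviceData, JCufft.CUFFT_FORWARD);

        JCuda.cudaMemcpy(Pointer.to(block), deviceData, bytes,
                cudaMemcpyKind.cudaMemcpyDeviceToHost);

        JCufft.cufftDestroy(plan);
        JCuda.cudaFree(deviceData);
    }
}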
The default HDFS block size is 64MB; it is not clear whether there are any Hadoop performance penalties for increasing this to 512MB or beyond, as the use of Hadoop block sizes larger than 128MB is poorly documented.

The traditional Reduce portion is problematic, however, as MapReduce will output as many files as there are reducers. Since we desire a single output file, one reducer has to be used; this results in all the data from our potentially terabyte-sized file being sent over the network to a single node, where it can be written back to HDFS. Thus, using this method would impose a significant penalty on the processing time, as one machine would have to receive all the map outputs over the network and then write them back to the cluster. A work-around was devised where the number of reducers is set to 0, and each map task writes its output directly to HDFS. This eliminates the reduce phase; the user must instead issue a post-operation getmerge call to HDFS, which merges the files in the output directory (named by their position in the original file, and thus correctly sorted) and copies them to the local file system, cutting down on the time needed to obtain a local copy of the final output. This approach is bottlenecked by the speed of the system writing the final merged file to local disk. Figure 1 represents the workflow of our final Hadoop implementation.

Figure 1: Hadoop Implementation
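A driver for this map-only configuration might look like the following sketch (written against the org.apache.hadoop.mapreduce API, which may differ from the Hadoop version we used; the mapper is a hypothetical stand-in that reuses the illustrative classes sketched earlier, and our actual implementation also used a custom output format):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DistributedFftDriver {

    // Hypothetical mapper: interprets the block as interleaved complex single-precision
    // samples, runs the batched GPU FFT sketched above, and re-emits the block keyed by
    // its byte offset in the original file.
    public static class FftMapper
            extends Mapper<LongWritable, BytesWritable, LongWritable, BytesWritable> {
        @Override
        protected void map(LongWritable offset, BytesWritable block, Context context)
                throws IOException, InterruptedException {
            byte[] raw = Arrays.copyOf(block.getBytes(), block.getLength());
            float[] samples = new float[raw.length / 4];
            ByteBuffer.wrap(raw).asFloatBuffer().get(samples);
            BatchedFft.transformBlock(samples, 1024);           // 1024-point FFTs, as in the text
            ByteBuffer out = ByteBuffer.allocate(raw.length);
            out.asFloatBuffer().put(samples);
            context.write(offset, new BytesWritable(out.array()));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "distributed-fft");
        job.setJarByClass(DistributedFftDriver.class);
        job.setInputFormatClass(BlockRecordInputFormat.class);  // one record per HDFS block
        job.setMapperClass(FftMapper.class);
        job.setNumReduceTasks(0);                               // map-only: maps write straight to HDFS
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(BytesWritable.class);           // default output format shown for brevity
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

After the job completes, the per-map output files can be pulled to the local file system with the getmerge operation described above.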
IV. RESULTS AND ANALYSIS

In order to model the potential performance improvement from parallelization, we utilize Amdahl's Argument, which states that the maximum speedup is determined by

S(N) = 1 / ((1 - P) + P / N),

where P is the proportion of calculations that can be parallelized, (1 - P) is the proportion that cannot be parallelized, and N is the number of concurrent threads. Simply put, Amdahl's Argument states that parallelization can only speed up an algorithm to a certain point; in our case, the non-parallel portion of the program, in a single-machine set-up, is the reading and writing of data to disk, thus making it the primary constraining factor for performance. The portion of the algorithm that is parallelized through the use of a GPU is the calculation of the FFT blocks; by moving the entire process to Hadoop, it is in theory possible to parallelize the reading and writing process as well. In order to estimate the P and (1 - P) proportions of the algorithm, we first ran single-CPU and single-GPU tests. File size was kept at 16GB to make multiple runs feasible. All single-machine tests were performed on an Intel E6400 (two cores with hyper-threading enabled), NVIDIA GT620 (96 CUDA cores) machine, utilizing JCUFFT (a Java (Oracle America, Inc.) wrapper around CUFFT) and JTransforms (a native Java FFT library comparable in performance to Intel's MKL).

Figure 2: Total processing time for a 16GB file

Figure 2 shows that total processing time for the GPU implementation was lower by only 10%-15% on average. Figure 3 eliminates reading/writing from the benchmark times and compares only FFT calculation time.

Figure 3: Total FFT calculation time

Figure 3 shows that NVIDIA's claim of a tenfold speedup is realizable, as we were able to decrease the total FFT calculation time by a factor of 5 on average using a 96-core GPU; to put things into perspective, current NVIDIA flagship GPUs have more than 1,500 cores. It follows logically that a major portion of the total calculation time must be taken up by input/output (I/O), which we illustrate in Figures 4 and 5.

Figure 4: Percent of time spent in I/O and FFT calculation for CPU test

Benchmarking I/O and FFT calculation times separately showed that the CPU implementation spends an average of 70%-75% of total calculation time reading from and writing to the disk; the remaining 20%-25%, spent calculating FFTs, even if reduced by a factor of 10, will not significantly decrease total calculation time (as seen in Figure 2). This is demonstrated further in Figure 5, where the FFT calculation time for the GPU makes up only 5%-8% of total time spent processing the file, and I/O dominates the benchmark.

Figure 5: Percent of time spent in I/O and FFT calculation for GPU test

From these tests, we can deduce that reading and writing constitute a large percentage of computation time for FFTs on large files, and though using a GPU is helpful, it can only take us so far.
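To make the argument concrete, the short sketch below (illustrative values only, not taken from our benchmark code) evaluates Amdahl's equation for a few values of P and N:

// Illustrative evaluation of Amdahl's Argument, S(N) = 1 / ((1 - P) + P / N).
// The P values are examples, not measurements.
public class AmdahlExample {
    static double speedup(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    public static void main(String[] args) {
        for (double p : new double[] {0.25, 0.75, 0.95}) {
            System.out.printf("P = %.2f: S(8) = %.2f, S(96) = %.2f, limit (N -> inf) = %.2f%n",
                    p, speedup(p, 8), speedup(p, 96), 1.0 / (1.0 - p));
        }
    }
}

For example, with P = 0.75 the speedup can never exceed 4x no matter how many threads or cores are added, which is why the serial I/O portion dominates the single-machine results above.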
Although it is tempting to set P at 0.75, these results will vary quite a bit across different hardware platforms and FFT libraries; everything from the number of CPU/GPU cores to hard disk read/write speed can change the variables in Amdahl's equation. Measuring these differences becomes even harder when utilizing Hadoop; even assuming the Hadoop cluster hardware is homogeneous, other factors such as current network load, link speed, and varying background processes running on computation nodes may affect results. In addition to the previously mentioned varying environment variables, obtaining a GPU cluster of an applicable size (at least 8 to 16 nodes) is expensive. We therefore decided to utilize Amazon's EC2 (Amazon Technologies, Inc.) cloud computing service, which offers GPU clusters for high performance computing (HPC) applications for hourly rental. Unfortunately, we were only able to run limited benchmarks on an eight-node cluster, the results of which can be seen in Figure 6.

Figure 6: Single machine vs Hadoop EC2 computation times

V. CONCLUSION

Hadoop is traditionally used for searching very large data sets, but that technology can be adapted to a wide variety of problems. In this paper, we have shown how Hadoop and GPUs can be combined to perform accelerated and distributed FFTs. Additionally, we have evidence to conclude that the combination can make FFT operations considerably faster than a single machine when processing terabyte-size data files. This performance gain will reduce the wait time experienced by the signal analyst. Additionally, we have shown that the accelerated FFT performance can be obtained using commodity servers with inexpensive GPUs, including the use of Amazon's EC2 environment.

VI. FUTURE WORK

Although the Hadoop implementation has tangible benefits, there is the possibility that performance could actually be better than what was observed. Amazon's EC2 virtualized environment is not the optimal test bed for benchmarking, as the virtual instances launched by the user reside on non-dedicated hardware that is always in use by other virtual instances. Amazon does not provide a way to track CPU/GPU utilization of the underlying hardware; therefore, the hosts (CPU/GPU) are always under unpredictable load. Furthermore, storage is entirely virtualized as well, and usually resides in a separate location from the virtual instance hosts, making it prone to fluctuations in network traffic and usage. In separate tests for I/O performance, we determined read/write speeds could vary by as much as 200%; using a virtualized operating system (OS) for Java benchmarking (by way of the System.nanoTime() call) also leads to granularity issues, and it is not clear whether a specialized benchmarking package would be able to fix this [16]. Additionally, Apache recommends against using Hadoop in a virtualized environment; the framework was designed for a cluster of dedicated commodity machines, and virtualization undermines some design concepts central to Hadoop [17]. We plan on performing additional tests on a dedicated cluster to determine optimal performance gains. Additionally, we plan on making a slight modification to the code that will allow for hybrid CPU/GPU clusters, which could compete with pure GPU clusters in terms of cost.
CUDA 5 will add support for RDMA (Remote Direct Memory Access) when released, meaning the concept of parallelizing an FFT across a server cluster could be implemented much faster using C for both computation and network distribution, entirely eliminating Java and Hadoop from the process. Such an implementation, however, would be both expensive and time-consuming.
Lastly, our first implementation does not support all types of FFT operations. We plan on expanding our work to allow overlapping FFT operations to be performed in our distributed environment.

VII. ACKNOWLEDGEMENT

The authors would like to thank Benjamin Greenberg for his help in running the benchmarks, and Graham Cruickshank for his proofreading skills and support.

VIII. REFERENCES

[1] Math Kernel Library (MKL) [software]. Intel Corporation.
[2] CUDA (version 4) [software]. NVIDIA Corporation.
[3] Chen, L. and Gao, G. R. (2010). Performance analysis of Cooley-Tukey FFT algorithms for a many-core architecture. In Proceedings of the 2010 Spring Simulation Multiconference, SpringSim '10, pages 81:1-81:8, San Diego, CA, USA. Society for Computer Simulation International.
[4] Nukada, A. and Matsuoka, S. (2009). Auto-tuning 3-D FFT library for CUDA GPUs. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC '09, pages 30:1-30:10, New York, NY, USA. ACM.
[5] Agarwal, R. C., Gustavson, F. G., and Zubair, M. (1994). A high performance parallel algorithm for 1-D FFT. In Proceedings of the 1994 ACM/IEEE Conference on Supercomputing, Supercomputing '94, pages 34-40, New York, NY, USA. ACM.
[6] Gu, L., Siegel, J., and Li, X. (2011). Using GPUs to compute large out-of-card FFTs. In Proceedings of the International Conference on Supercomputing, ICS '11, New York, NY, USA. ACM.
[7] Chen, Y., Cui, X., and Mei, H. (2010). Large-scale FFT on GPU clusters. In Proceedings of the 24th ACM International Conference on Supercomputing, ICS '10, New York, NY, USA. ACM.
[8] Hinitt, N. and Kocak, T. (2010). GPU-based FFT computation for multi-gigabit WirelessHD baseband processing. EURASIP J. Wirel. Commun. Netw., 2010:30:1-30:13.
[9] Moreland, K. and Angel, E. (2003). The FFT on a GPU. In Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, HWWS '03, Aire-la-Ville, Switzerland. Eurographics Association.
[10] Nukada, A., Maruyama, Y., and Matsuoka, S. (2012). High performance 3-D FFT using multiple CUDA GPUs. In Proceedings of the 5th Annual Workshop on General Purpose Processing with Graphics Processing Units, GPGPU-5, pages 57-63, New York, NY, USA. ACM.
[11] Schönhage-Strassen algorithm with MapReduce for multiplying terabit integers. In Proceedings of the 2011 International Workshop on Symbolic-Numeric Computation, SNC '11, pages 54-62, New York, NY, USA. ACM.
[12] Sadasivam, G. S. and Baktavatchalam, G. (2010). A novel approach to multiple sequence alignment using Hadoop data grids. In Proceedings of the 2010 Workshop on Massive Data Analytics on the Cloud, MDAC '10, pages 2:1-2:7, New York, NY, USA. ACM.
[13] Chen, L. and Agrawal, G. (2012). Optimizing MapReduce for GPUs with effective shared memory usage. In Proceedings of the 21st International Symposium on High-Performance Parallel and Distributed Computing, HPDC '12, New York, NY, USA. ACM.
[14] Top500 Supercomputer Sites. Retrieved November
[15] NVIDIA. "CUDA C Best Practices Guide." nvidia.com. NVIDIA, n.d. Web. 20 Jun < doc/cuda_c_best_practices_guide.pdf>
[16] Boyer, Brent. "Robust Java Benchmarking." N.p., 24 June Web. 20 July 2012 <
[17] Apache. "Virtual Hadoop." Apache.org. N.p., 22 Aug Web. 1 Sept <