High Performance Computing in Java and the Cloud
1 High Performance Computing in Java and the Cloud
Computer Architecture Group, University of A Coruña (Spain). [email protected]
January 21, 2013, Universitat Politècnica de Catalunya, Barcelona
2 Outline
1. Java for High Performance Computing
2. AWS IaaS for HPC and Big Data
3. Evaluation of HPC in the Cloud
3 Java is an Alternative for HPC in the Multi-core Era
Language popularity (% skilled developers): #1 C (17.8%), #2 Java (17.4%), #3 Objective-C (10.3%), #4 C++ (9.14%), #17 Matlab (0.64%), #25 Fortran (0.46%).
C and Java are the most popular languages, whereas Fortran, traditionally popular in HPC, ranks only #25 (0.46%).
4 Java is an Alternative for HPC in the Multi-core Era
Interesting features:
- Built-in networking
- Built-in multi-threading
- Portable, platform independent
- Object oriented
- Main training language
- Many productive parallel/distributed programming libraries:
  - Java shared memory programming (high-level facilities: Concurrency Framework)
  - Java Sockets
  - Java RMI
  - Message-Passing in Java (MPJ) libraries
  - Apache Hadoop
5 Java Adoption in HPC
HPC developers and users usually want to use Java in their projects: they value its high programming productivity, but they are highly concerned about performance.
Java code is no longer slow (Just-In-Time compilation)! But there are still performance penalties in Java communications.
6 Java Adoption in HPC
JIT performance: near-native. Thanks to dynamic compilation, Java can even outperform native languages (C, Fortran)!
7 Java Adoption in HPC
High Java communications overhead:
- Poor high-speed network support
- Data copies between the Java heap and native code through JNI
- Costly data serialization
- Use of communication protocols unsuitable for HPC
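The serialization cost mentioned above is easy to observe. A minimal sketch, assuming nothing from the talk (class and method names are mine): it compares default Java serialization of a double[] payload, as naive RMI/sockets code would do, against the raw buffer copy an HPC messaging layer performs.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.nio.ByteBuffer;

public class SerializationCost {
    // Default Java serialization (the mechanism behind RMI and naive sockets code).
    static byte[] javaSerialize(double[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(data);
        }
        return bos.toByteArray();
    }

    // Raw copy of the doubles into a byte buffer, as an HPC messaging layer would send them.
    static byte[] rawCopy(double[] data) {
        ByteBuffer buf = ByteBuffer.allocate(data.length * Double.BYTES);
        buf.asDoubleBuffer().put(data);
        return buf.array();
    }

    public static void main(String[] args) throws IOException {
        double[] msg = new double[1 << 20];          // 8 MB payload
        long t0 = System.nanoTime();
        byte[] ser = javaSerialize(msg);
        long t1 = System.nanoTime();
        byte[] raw = rawCopy(msg);
        long t2 = System.nanoTime();
        System.out.printf("serialize: %.1f ms (%d bytes), raw copy: %.1f ms (%d bytes)%n",
                (t1 - t0) / 1e6, ser.length, (t2 - t1) / 1e6, raw.length);
    }
}
```

On typical JVMs the serialization path is markedly slower than the plain copy, which is why MPJ libraries avoid it for primitive arrays.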
8 Emerging Interest in Java for HPC
9 Current State of Java for HPC
10 Java for HPC
Current options in Java for HPC:
- Java RMI (popular, but poor performance)
- Java Sockets (low-level programming)
- Message-Passing in Java (MPJ) (widespread, easy, and reasonably good performance)
- Apache Hadoop (for High-Throughput Computing)
11 Java for HPC
Java shared memory programming:
- Java Threads
- Concurrency Framework (thread pools, tasks...)
- Parallel Java (PJ)
- Java OpenMP (JOMP and JaMP)
12 Listing 1: Java Threads

    class BasicThread extends Thread {
        // This method is called when the thread runs
        public void run() {
            for (int i = 0; i < 1000; i++)
                Test.increaseCount();
        }
    }

    class Test {
        static int counter = 0;

        static void increaseCount() { counter++; }

        public static void main(String[] argv) {
            // Create and start the threads
            Thread t1 = new BasicThread();
            Thread t2 = new BasicThread();
            t1.start();
            t2.start();
            System.out.println("Counter=" + counter);
        }
    }
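The thread example above has a data race on the shared counter, and it prints the result before the threads have finished. A corrected sketch (the class name is mine) using AtomicInteger for the updates and Thread.join() to wait for completion:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class SafeThread extends Thread {
    static final AtomicInteger counter = new AtomicInteger(0);

    @Override
    public void run() {
        for (int i = 0; i < 1000; i++)
            counter.incrementAndGet();   // atomic increment: no lost updates
    }

    static int runTwoThreads() {
        counter.set(0);
        Thread t1 = new SafeThread();
        Thread t2 = new SafeThread();
        t1.start();
        t2.start();
        try {
            t1.join();                   // wait for both threads before reading
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println("Counter=" + runTwoThreads());  // always 2000
    }
}
```

Without the atomic increment two threads can read the same value and both write value+1, losing one update; without the joins the main thread may print 0.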
13 Listing 2: Java Concurrency Framework highlights

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    class Test {
        public static void main(String[] argv) throws InterruptedException {
            int numThreads = 4, numTasks = 16;
            // Create the thread pool
            ExecutorService threadPool = Executors.newFixedThreadPool(numThreads);
            // Run example tasks (tasks implement Runnable)
            for (int i = 0; i < numTasks; i++) {
                threadPool.execute(createTask(i));
            }
            // Close the pool and wait for all tasks to finish
            threadPool.shutdown();
            threadPool.awaitTermination(1, TimeUnit.MINUTES);
        }

        static Runnable createTask(final int id) {
            return () -> System.out.println("task " + id);
        }
    }

    // Other interesting classes from the Concurrency Framework:
    // CyclicBarrier, ConcurrentHashMap, PriorityBlockingQueue, Executors...
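One of the classes highlighted above, CyclicBarrier, coordinates phase-based computations: no worker starts phase p+1 until all workers have finished phase p. A minimal sketch (class and method names are mine, not from the talk):

```java
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicInteger;

public class BarrierDemo {
    static final AtomicInteger phasesDone = new AtomicInteger();

    static int runPhases(int numThreads, int numPhases) {
        phasesDone.set(0);
        // The barrier action runs once per generation, after all parties arrive.
        CyclicBarrier barrier = new CyclicBarrier(numThreads, phasesDone::incrementAndGet);
        Thread[] workers = new Thread[numThreads];
        for (int t = 0; t < numThreads; t++) {
            workers[t] = new Thread(() -> {
                try {
                    for (int p = 0; p < numPhases; p++) {
                        // ... compute this worker's share of phase p ...
                        barrier.await();   // nobody starts phase p+1 early
                    }
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
            workers[t].start();
        }
        try {
            for (Thread w : workers) w.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return phasesDone.get();
    }

    public static void main(String[] args) {
        System.out.println("phases completed: " + runPhases(4, 3));  // prints 3
    }
}
```

The barrier is "cyclic" because it resets after each generation, so the same instance synchronizes every iteration of the loop.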
14 JOMP is the Java OpenMP binding
Listing 3: JOMP example

    public static void main(String[] argv) {
        int myid;
        //omp parallel private(myid)
        {
            myid = OMP.getThreadNum();
            System.out.println("Hello from " + myid);
        }

        int n = 1000;
        double[] a = new double[n], b = new double[n];
        //omp parallel for
        for (int i = 1; i < n; i++) {
            b[i] = (a[i] + a[i-1]) * 0.5;
        }
    }
15 MPJ is the Java binding of MPI
Listing 4: MPJ example

    import mpi.*;

    public class Hello {
        public static void main(String[] args) {
            MPI.Init(args);
            int rank = MPI.COMM_WORLD.Rank();
            int msg_tag = 13;
            if (rank == 0) {
                int peer_process = 1;
                String[] msg = new String[1];
                msg[0] = new String("Hello");
                MPI.COMM_WORLD.Send(msg, 0, 1, MPI.OBJECT, peer_process, msg_tag);
            } else if (rank == 1) {
                int peer_process = 0;
                String[] msg = new String[1];
                MPI.COMM_WORLD.Recv(msg, 0, 1, MPI.OBJECT, peer_process, msg_tag);
                System.out.println(msg[0]);
            }
            MPI.Finalize();
        }
    }
16 Java Message-Passing implementations
[Table: pure Java MPJ projects (MPJava, Jcluster, Parallel Java, mpiJava, P2P-MPI, MPJ Express, MPJ/Ibis, JMPI, FastMPJ) compared by socket implementation (Java IO / Java NIO), high-speed network support (Myrinet, InfiniBand, SCI), and API (mpiJava 1.2, JGF MPJ, other APIs)]
17 FastMPJ
MPJ implementation developed at the Computer Architecture Group of the University of A Coruña. Features of FastMPJ include:
- High-performance intra-node communication
- Efficient RDMA transfers over InfiniBand and RoCE
- Fully portable, as Java
- Scalability up to thousands of cores
- Highly productive development and maintenance
- Ideal for multi-core servers, clusters and cloud computing
18 FastMPJ Supported Hardware
FastMPJ runs on InfiniBand through IB Verbs, can rely on TCP/IP, and has shared memory support through Java threads.
[Diagram: MPJ applications call the MPJ API (mpiJava 1.2) on top of the FastMPJ library (MPJ collective and point-to-point primitives). The xxdev device layer provides mxdev (MX/Open-MX over Myrinet/Ethernet), psmdev (InfiniPath PSM), ibvdev (IBV API over InfiniBand), niodev/iodev (Java sockets over TCP/IP Ethernet) and smdev (Java threads over shared memory), using JNI where native libraries are involved]
19 FastMPJ in MareNostrum3
[Figure: point-to-point latency (µs) and bandwidth (Gbps) on InfiniBand FDR for message sizes from 1 KB to 16 MB, comparing Open MPI and FastMPJ (ibvdev)]
20 Java for High Performance Computing
Table: Available solutions for GPGPU computing in Java

             Java bindings         User-friendly
    CUDA     JCUDA, jcuda          Java-GPU, Rootbeer
    OpenCL   JOCL, JogAmp's JOCL   Aparapi
21 Listing 5: Java code example

    final double in[], out[];
    for (int i = 0; i < in.length; i++)
        out[i] = in[i] * in[i];

Listing 6: Aparapi code example

    Kernel kernel = new Kernel() {
        public void run() {
            int i = getGlobalId();
            out[i] = in[i] * in[i];
        }
    };
    kernel.execute(in.length);

Listing 7: Thread code example

    Thread thread = new Thread(new Runnable() {
        public void run() {
            // ...
        }
    });
    thread.start();
    thread.join();
22 Listing 8: jCUDA matrix multiplication

    private static void dgemmJCublas(int n, double alpha, double A[],
                                     double B[], double beta, double C[]) {
        int nn = n * n;
        // Initialize JCublas
        JCublas.cublasInit();
        // Allocate memory on the device
        Pointer d_A = new Pointer();
        Pointer d_B = new Pointer();
        Pointer d_C = new Pointer();
        JCublas.cublasAlloc(nn, Sizeof.DOUBLE, d_A);
        JCublas.cublasAlloc(nn, Sizeof.DOUBLE, d_B);
        JCublas.cublasAlloc(nn, Sizeof.DOUBLE, d_C);
        // Copy the memory from the host to the device
        JCublas.cublasSetVector(nn, Sizeof.DOUBLE, Pointer.to(A), 1, d_A, 1);
        JCublas.cublasSetVector(nn, Sizeof.DOUBLE, Pointer.to(B), 1, d_B, 1);
        JCublas.cublasSetVector(nn, Sizeof.DOUBLE, Pointer.to(C), 1, d_C, 1);
        // Execute dgemm
        JCublas.cublasDgemm('n', 'n', n, n, n, alpha, d_A, n, d_B, n, beta, d_C, n);
        // Copy the result from the device to the host
        JCublas.cublasGetVector(nn, Sizeof.DOUBLE, d_C, 1, Pointer.to(C), 1);
        // Clean up
        JCublas.cublasFree(d_A);
        JCublas.cublasFree(d_B);
        JCublas.cublasFree(d_C);
        JCublas.cublasShutdown();
    }
23 Listing 9: Aparapi matrix multiplication

    class AparapiMatMul extends Kernel {
        double matA[], matB[], matC[];
        int rows, cols1, cols2;

        public void run() {
            int i = getGlobalId() / rows;
            int j = getGlobalId() % rows;
            double value = 0;
            for (int k = 0; k < cols1; k++) {
                value += matA[k + i * cols1] * matB[k * cols2 + j];
            }
            matC[i * cols1 + j] = value;
        }
    }
24 Java for High Performance Computing
Table: Description of the GPU-based testbed

    CPU              1 Intel Xeon hexa-core, 2.67 GHz
    CPU performance  64.08 GFLOPS DP (10.68 GFLOPS DP per core)
    GPU              1 NVIDIA Tesla Fermi M2050
    GPU performance  515 GFLOPS DP
    Memory           12 GB DDR3 (1333 MHz)
    OS               Debian GNU/Linux
    CUDA version     4.2 (SDK and Toolkit)
    JDK version      OpenJDK 1.6.0_24
25 Java for High Performance Computing
[Figure: matrix multiplication kernel performance (GFLOPS, SHOC GEMM), single and double precision, for problem sizes up to 8192x8192, comparing CUDA, jCUDA, Aparapi and Java]
26 Java for High Performance Computing
[Figure: Stencil2D kernel runtime (seconds, SHOC Stencil2D), single and double precision, for problem sizes 2048x2048 to 8192x8192, comparing Java, Aparapi, jCUDA and CUDA]
27 Java for High Performance Computing
[Figure: Stencil2D kernel performance (GFLOPS, SHOC Stencil2D), single and double precision, for problem sizes 2048x2048 to 8192x8192, comparing CUDA, jCUDA, Aparapi and Java]
28 Java for High Performance Computing
[Figure: FFT kernel runtime (SHOC FFT), single and double precision, for problem sizes 2048x2048 to 8192x8192, comparing Java, Aparapi, jCUDA and CUDA]
29 Java for High Performance Computing
[Figure: FFT kernel performance (GFLOPS, SHOC FFT), single and double precision, for problem sizes 2048x2048 to 8192x8192, comparing CUDA, jCUDA, Aparapi and Java]
30 NPB-MPJ characteristics (10,000 SLOC (Source LOC) in total)

    Name  Operation                SLOC   Communication intensiveness  Type
    CG    Conjugate Gradient       -      Medium                       Kernel
    EP    Embarrassingly Parallel  -      Low                          Kernel
    FT    Fourier Transformation   1700   High                         Kernel
    IS    Integer Sort             700    High                         Kernel
    MG    Multi-Grid               2000   High                         Kernel
    SP    Scalar Pentadiagonal     4300   Medium                       Application
31 Java HPC optimization: NPB-MPJ success case
NPB-MPJ optimizations:
- The JVM JIT-compiles heavy and frequently invoked methods using runtime information, so structured programming is the best option: small, frequent methods are better.
- Mapping elements from multidimensional to one-dimensional arrays (array flattening technique): arr3d[x][y][z] becomes arr3d[pos3d(lengthX, lengthY, x, y, z)].
The NPB-MPJ code was refactored accordingly, obtaining significant improvements (up to a 2800% performance increase).
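The array flattening technique above can be sketched as follows. This follows the slide's pos3d(lengthX, lengthY, x, y, z) signature with a column-major mapping; the exact mapping used in NPB-MPJ may differ, and the class name is mine:

```java
public class ArrayFlattening {
    // Column-major mapping of (x, y, z) into a 1D array of size lenX*lenY*lenZ.
    static int pos3d(int lenX, int lenY, int x, int y, int z) {
        return x + lenX * (y + lenY * z);
    }

    public static void main(String[] args) {
        int lenX = 3, lenY = 4, lenZ = 5;
        double[] flat = new double[lenX * lenY * lenZ];
        double[][][] cube = new double[lenX][lenY][lenZ];

        for (int x = 0; x < lenX; x++)
            for (int y = 0; y < lenY; y++)
                for (int z = 0; z < lenZ; z++) {
                    double v = 100 * x + 10 * y + z;
                    cube[x][y][z] = v;                     // three pointer dereferences
                    flat[pos3d(lenX, lenY, x, y, z)] = v;  // one contiguous access
                }

        // Same element, two layouts: the flat form keeps the data contiguous,
        // which is friendlier to the JIT compiler and the cache.
        System.out.println(cube[2][3][4] == flat[pos3d(lenX, lenY, 2, 3, 4)]);  // prints true
    }
}
```

In Java a double[][][] is an array of arrays of arrays, so each access chases three references; the flattened form replaces that with one index computation over a single contiguous block.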
32 Experimental Results on One Core (relative performance)
[Figure: NPB kernels (FT, CG, IS, MG, Class C) serial performance in MOPS on Amazon EC2 cc1.4xlarge and cc2.8xlarge instances, comparing the GNU compiler, the Intel compiler and the JVM]
33 Java HPC in an InfiniBand cluster (DAS-4)
[Figure: point-to-point latency (µs) and bandwidth (Gbps) on DAS-4 (IB-QDR) for message sizes from 1 KB to 16 MB, comparing MVAPICH2, Open MPI and FastMPJ (ibvdev)]
34 Java HPC in an InfiniBand cluster (DAS-4)
[Figure: NPB CG kernel performance (MOPS) on DAS-4 (IB-QDR) vs. number of cores, comparing MVAPICH2, Open MPI and FastMPJ (ibvdev)]
35 Java HPC in an InfiniBand cluster (DAS-4)
[Figure: NPB FT kernel performance (MOPS) on DAS-4 (IB-QDR) vs. number of cores, comparing MVAPICH2, Open MPI and FastMPJ (ibvdev)]
36 Java HPC in an InfiniBand cluster (DAS-4)
[Figure: NPB MG kernel performance (MOPS) on DAS-4 (IB-QDR) vs. number of cores, comparing MVAPICH2, Open MPI and FastMPJ (ibvdev)]
37 IaaS: Infrastructure-as-a-Service
Section: AWS IaaS for HPC and Big Data / Evaluation of HPC in the Cloud
Access to infrastructure supports the execution of computationally intensive tasks in the cloud. There are many IaaS providers:
- Amazon Web Services (AWS)
- Google Compute Engine (beta)
- IBM Cloud
- HP Cloud
- Rackspace
- ProfitBricks
- Penguin-On-Demand (POD)
In my opinion, AWS is the most suitable in terms of available resources, performance, and performance/cost ratio.
38 HPC and Big Data in AWS
AWS provides a set of instances for HPC:
- CC1: dual-socket, quad-core Nehalem Xeon (93 GFLOPS)
- CG1: dual-socket, quad-core Nehalem Xeon with 2 GPUs (1123 GFLOPS)
- CC2: dual-socket, octo-core Sandy Bridge Xeon (333 GFLOPS)
- Up to 63 GB RAM
- 10 Gigabit Ethernet high-performance network
39 HPL Linpack in AWS (TOP500)
Table: Configuration of our AWS EC2 cluster

    CPU                      2 Intel Xeon quad-core @ 2.93 GHz (46.88 GFLOPS DP each)
    EC2 Compute Units        33.5
    GPU                      2 NVIDIA Tesla Fermi M2050 (515 GFLOPS DP each)
    Memory                   22 GB DDR3
    Storage                  1690 GB
    Virtualization           Xen HVM 64-bit platform (PV drivers for I/O)
    Number of CG1 nodes      32
    Interconnection network  10 Gigabit Ethernet
    Total EC2 Compute Units  1072
    Total CPU cores          256 (3 TFLOPS DP)
    Total GPUs               64 (32.96 TFLOPS DP)
    Total FLOPS              35.96 TFLOPS DP
40 HPL Linpack in AWS (TOP500)
HPL Linpack ranks the performance of supercomputers in the TOP500 list. In AWS, 14 TFLOPS are available to everybody (67 USD/h on demand, 21 USD/h spot)!
[Figure: HPL performance (GFLOPS) and scalability (speedup) on Amazon EC2 for 1 to 32 instances, CPU-only vs. CPU+GPU (CUDA)]
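The price/performance point above reduces to simple arithmetic. A back-of-the-envelope sketch, assuming the 2013 prices quoted on the slide and the 14.23 TFLOPS HPL result (the helper method is mine):

```java
public class Ec2CostPerTflops {
    // Cost of one TFLOPS sustained for one hour at a given hourly cluster price.
    static double usdPerTflopsHour(double usdPerHour, double tflops) {
        return usdPerHour / tflops;
    }

    public static void main(String[] args) {
        double tflops = 14.23;   // measured HPL on the 32-node cluster (slide figure)
        double onDemand = 67.0;  // USD per hour, on-demand (slide figure)
        double spot = 21.0;      // USD per hour, spot (slide figure)
        System.out.printf("on-demand: %.2f USD per TFLOPS-hour%n",
                usdPerTflopsHour(onDemand, tflops));
        System.out.printf("spot:      %.2f USD per TFLOPS-hour%n",
                usdPerTflopsHour(spot, tflops));
    }
}
```

At these rates, spot instances cut the cost per TFLOPS-hour by roughly a factor of three relative to on-demand pricing.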
41 HPL Linpack in AWS (TOP500)
References:
- Our AWS cluster: 14.23 TFLOPS
- AWS (06/2011): 41.82 TFLOPS (#451 in the TOP500)
- AWS (11/2011): 240.1 TFLOPS (#42 in the TOP500, now #102)
42 Radiography enhanced through Monte Carlo simulation (MC-GPU)
43 Image processing with Monte Carlo simulation (MC-GPU)
Time: originally 187 minutes; in AWS with HPC resources, 6 seconds!
Cost: processing 500 radiographies in AWS costs approx. 70 USD, i.e., 0.14 USD per unit.
[Figure: MC-GPU execution time (s) and scalability (speedup) on Amazon EC2 for 1 to 32 instances, CPU-only vs. CPU+GPU (CUDA)]
44 AWS HPC and Big Data Remarks
- High computational power (new Sandy Bridge processors and GPUs)
- High-performance communications over 10 Gigabit Ethernet
- Efficient access to network file systems (e.g., NFS, OrangeFS)
- No waiting times when accessing resources
- Easy application and infrastructure deployment: ready-to-run applications in AMIs
45 NPB Performance Evaluation in AWS
[Figure: point-to-point latency (µs) and bandwidth (Gbps) over 10 GbE (cc2.8xlarge) for message sizes from 1 KB to 64 MB, comparing MPICH2, Open MPI and FastMPJ]
46 NPB Performance Evaluation in AWS
[Figure: point-to-point latency (µs) and bandwidth (Gbps) over shared memory (cc2.8xlarge) for message sizes from 1 KB to 64 MB, comparing MPICH2, Open MPI and FastMPJ]
47 NPB Performance Evaluation in AWS
[Figure: NPB CG Class C performance (MOPS) on cc2.8xlarge vs. number of cores, comparing MPICH2, Open MPI and FastMPJ]
48 NPB Performance Evaluation in AWS
[Figure: NPB CG Class C scalability (speedup) on cc2.8xlarge vs. number of cores, comparing MPICH2, Open MPI and FastMPJ]
49 NPB Performance Evaluation in AWS
[Figure: NPB FT Class C performance (MOPS) on cc2.8xlarge vs. number of cores, comparing MPICH2, Open MPI and FastMPJ]
50 NPB Performance Evaluation in AWS
[Figure: NPB FT Class C scalability (speedup) on cc2.8xlarge vs. number of cores, comparing MPICH2, Open MPI and FastMPJ]
51 NPB Performance Evaluation in AWS
[Figure: NPB MG Class C performance (MOPS) on cc2.8xlarge vs. number of cores, comparing MPICH2, Open MPI and FastMPJ]
52 NPB Performance Evaluation in AWS
[Figure: NPB MG Class C scalability (speedup) on cc2.8xlarge vs. number of cores, comparing MPICH2, Open MPI and FastMPJ]
53 HPC in AWS: Efficient Cloud Computing Support
[Figure: NPB CG kernel Open MPI performance (MOPS) on Amazon EC2 vs. number of processes (8 to 128), comparing default process placement with fill-up and tuned placement on 16 instances (left), and default placement on 16, 32 and 64 instances (right)]
54 Summary
Current state of HPC in Java and the Cloud:
- Java and IaaS are interesting and feasible alternatives (trading some performance for appealing features)
- Suitable programming models:
  - Java HPC: FastMPJ
  - Cloud HPC: hybrid MPI/MPJ + GPGPU/threads
- Active research on the suitability of Java and IaaS for HPC... but they are still not mainstream options
- Performance evaluations are highly important
- Analysis of current projects (promotion of joint efforts)
55 Questions?
High Performance Computing in Java and the Cloud
High Performance Computing Seminar
WWW: gltaboada/
Computer Architecture Group, University of A Coruña
ABSTRACT TALPALLIKAR, NIKHIL VIVEK. High-Performance Cloud Computing: VCL Case Study. (Under the direction of Dr. Mladen Vouk.) High performance computing used to be domain of specialties and of the relatively
High Performance. CAEA elearning Series. Jonathan G. Dudley, Ph.D. 06/09/2015. 2015 CAE Associates
High Performance Computing (HPC) CAEA elearning Series Jonathan G. Dudley, Ph.D. 06/09/2015 2015 CAE Associates Agenda Introduction HPC Background Why HPC SMP vs. DMP Licensing HPC Terminology Types of
RWTH GPU Cluster. Sandra Wienke [email protected] November 2012. Rechen- und Kommunikationszentrum (RZ) Fotos: Christian Iwainsky
RWTH GPU Cluster Fotos: Christian Iwainsky Sandra Wienke [email protected] November 2012 Rechen- und Kommunikationszentrum (RZ) The RWTH GPU Cluster GPU Cluster: 57 Nvidia Quadro 6000 (Fermi) innovative
Advancing Applications Performance With InfiniBand
Advancing Applications Performance With InfiniBand Pak Lui, Application Performance Manager September 12, 2013 Mellanox Overview Ticker: MLNX Leading provider of high-throughput, low-latency server and
CORRIGENDUM TO TENDER FOR HIGH PERFORMANCE SERVER
CORRIGENDUM TO TENDER FOR HIGH PERFORMANCE SERVER Tender Notice No. 3/2014-15 dated 29.12.2014 (IIT/CE/ENQ/COM/HPC/2014-15/569) Tender Submission Deadline Last date for submission of sealed bids is extended
Case Study on Productivity and Performance of GPGPUs
Case Study on Productivity and Performance of GPGPUs Sandra Wienke [email protected] ZKI Arbeitskreis Supercomputing April 2012 Rechen- und Kommunikationszentrum (RZ) RWTH GPU-Cluster 56 Nvidia
High Performance Data-Transfers in Grid Environment using GridFTP over InfiniBand
High Performance Data-Transfers in Grid Environment using GridFTP over InfiniBand Hari Subramoni *, Ping Lai *, Raj Kettimuthu **, Dhabaleswar. K. (DK) Panda * * Computer Science and Engineering Department
Agenda. HPC Software Stack. HPC Post-Processing Visualization. Case Study National Scientific Center. European HPC Benchmark Center Montpellier PSSC
HPC Architecture End to End Alexandre Chauvin Agenda HPC Software Stack Visualization National Scientific Center 2 Agenda HPC Software Stack Alexandre Chauvin Typical HPC Software Stack Externes LAN Typical
SR-IOV In High Performance Computing
SR-IOV In High Performance Computing Hoot Thompson & Dan Duffy NASA Goddard Space Flight Center Greenbelt, MD 20771 [email protected] [email protected] www.nccs.nasa.gov Focus on the research side
Rootbeer: Seamlessly using GPUs from Java
Rootbeer: Seamlessly using GPUs from Java Phil Pratt-Szeliga. Dr. Jim Fawcett. Dr. Roy Welch. Syracuse University. Rootbeer Overview and Motivation Rootbeer allows a developer to program a GPU in Java
Evaluation of Java Message Passing in High Performance Data Analytics
Evaluation of Java Message Passing in High Performance Data Analytics Saliya Ekanayake, Geoffrey Fox School of Informatics and Computing Indiana University Bloomington, Indiana, USA {sekanaya, gcf}@indiana.edu
David Rioja Redondo Telecommunication Engineer Englobe Technologies and Systems
David Rioja Redondo Telecommunication Engineer Englobe Technologies and Systems About me David Rioja Redondo Telecommunication Engineer - Universidad de Alcalá >2 years building and managing clusters UPM
Introduction to Hybrid Programming
Introduction to Hybrid Programming Hristo Iliev Rechen- und Kommunikationszentrum aixcelerate 2012 / Aachen 10. Oktober 2012 Version: 1.1 Rechen- und Kommunikationszentrum (RZ) Motivation for hybrid programming
The High Performance Internet of Things: using GVirtuS for gluing cloud computing and ubiquitous connected devices
WS on Models, Algorithms and Methodologies for Hierarchical Parallelism in new HPC Systems The High Performance Internet of Things: using GVirtuS for gluing cloud computing and ubiquitous connected devices
Evaluation of CUDA Fortran for the CFD code Strukti
Evaluation of CUDA Fortran for the CFD code Strukti Practical term report from Stephan Soller High performance computing center Stuttgart 1 Stuttgart Media University 2 High performance computing center
Comparing SMB Direct 3.0 performance over RoCE, InfiniBand and Ethernet. September 2014
Comparing SMB Direct 3.0 performance over RoCE, InfiniBand and Ethernet Anand Rangaswamy September 2014 Storage Developer Conference Mellanox Overview Ticker: MLNX Leading provider of high-throughput,
Cluster Computing in a College of Criminal Justice
Cluster Computing in a College of Criminal Justice Boris Bondarenko and Douglas E. Salane Mathematics & Computer Science Dept. John Jay College of Criminal Justice The City University of New York 2004
Enabling Technologies for Distributed and Cloud Computing
Enabling Technologies for Distributed and Cloud Computing Dr. Sanjay P. Ahuja, Ph.D. 2010-14 FIS Distinguished Professor of Computer Science School of Computing, UNF Multi-core CPUs and Multithreading
CENIC Private Cloud Pilot Using Amazon Reserved Instances (RIs) September 9, 2011
CENIC Private Cloud Pilot Using Amazon Reserved Instances (RIs) September 9, 2011 CENIC has been working with Amazon for some time to put in place procedures through which CENIC member institutions can
Graphics Cards and Graphics Processing Units. Ben Johnstone Russ Martin November 15, 2011
Graphics Cards and Graphics Processing Units Ben Johnstone Russ Martin November 15, 2011 Contents Graphics Processing Units (GPUs) Graphics Pipeline Architectures 8800-GTX200 Fermi Cayman Performance Analysis
HPC Wales Skills Academy Course Catalogue 2015
HPC Wales Skills Academy Course Catalogue 2015 Overview The HPC Wales Skills Academy provides a variety of courses and workshops aimed at building skills in High Performance Computing (HPC). Our courses
Multi-Threading Performance on Commodity Multi-Core Processors
Multi-Threading Performance on Commodity Multi-Core Processors Jie Chen and William Watson III Scientific Computing Group Jefferson Lab 12000 Jefferson Ave. Newport News, VA 23606 Organization Introduction
Optimizing a 3D-FWT code in a cluster of CPUs+GPUs
Optimizing a 3D-FWT code in a cluster of CPUs+GPUs Gregorio Bernabé Javier Cuenca Domingo Giménez Universidad de Murcia Scientific Computing and Parallel Programming Group XXIX Simposium Nacional de la
PARALLEL & CLUSTER COMPUTING CS 6260 PROFESSOR: ELISE DE DONCKER BY: LINA HUSSEIN
1 PARALLEL & CLUSTER COMPUTING CS 6260 PROFESSOR: ELISE DE DONCKER BY: LINA HUSSEIN Introduction What is cluster computing? Classification of Cluster Computing Technologies: Beowulf cluster Construction
Cluster Grid Interconects. Tony Kay Chief Architect Enterprise Grid and Networking
Cluster Grid Interconects Tony Kay Chief Architect Enterprise Grid and Networking Agenda Cluster Grid Interconnects The Upstart - Infiniband The Empire Strikes Back - Myricom Return of the King 10G Gigabit
Unleashing the Performance Potential of GPUs for Atmospheric Dynamic Solvers
Unleashing the Performance Potential of GPUs for Atmospheric Dynamic Solvers Haohuan Fu [email protected] High Performance Geo-Computing (HPGC) Group Center for Earth System Science Tsinghua University
GTC Presentation March 19, 2013. Copyright 2012 Penguin Computing, Inc. All rights reserved
GTC Presentation March 19, 2013 Copyright 2012 Penguin Computing, Inc. All rights reserved Session S3552 Room 113 S3552 - Using Tesla GPUs, Reality Server and Penguin Computing's Cloud for Visualizing
How to Deploy OpenStack on TH-2 Supercomputer Yusong Tan, Bao Li National Supercomputing Center in Guangzhou April 10, 2014
How to Deploy OpenStack on TH-2 Supercomputer Yusong Tan, Bao Li National Supercomputing Center in Guangzhou April 10, 2014 2014 年 云 计 算 效 率 与 能 耗 暨 第 一 届 国 际 云 计 算 咨 询 委 员 会 中 国 高 峰 论 坛 Contents Background
OpenCL Optimization. San Jose 10/2/2009 Peng Wang, NVIDIA
OpenCL Optimization San Jose 10/2/2009 Peng Wang, NVIDIA Outline Overview The CUDA architecture Memory optimization Execution configuration optimization Instruction optimization Summary Overall Optimization
10- High Performance Compu5ng
10- High Performance Compu5ng (Herramientas Computacionales Avanzadas para la Inves6gación Aplicada) Rafael Palacios, Fernando de Cuadra MRE Contents Implemen8ng computa8onal tools 1. High Performance
McMPI. Managed-code MPI library in Pure C# Dr D Holmes, EPCC [email protected]
McMPI Managed-code MPI library in Pure C# Dr D Holmes, EPCC [email protected] Outline Yet another MPI library? Managed-code, C#, Windows McMPI, design and implementation details Object-orientation,
