Java for High Performance Cloud Computing




Java for High Performance Cloud Computing. Guillermo López Taboada, Computer Architecture Group, University of A Coruña (Spain), taboada@udc.es. October 31st, 2012. Future Networks and Distributed Systems (FUNDS), Manchester Metropolitan University, Manchester (UK).

Outline:
1 Java for High Performance Computing: Java Shared Memory Programming; Java Message-Passing; Java GPGPU; Development of Efficient HPC Benchmarks
2 AWS IaaS for HPC and Big Data
3 Evaluation: Evaluation of Java HPC; Evaluation of HPC in the Cloud; Case study: ProtTest-HPC

Java is an Alternative for HPC in the Multi-core Era. Language popularity (% skilled developers): #1 C (19.8%), #2 Java (17.2%), #3 Objective-C (9.48%), #4 C++ (9.3%), #19 Matlab (0.59%), #27 Fortran (0.35%). C and Java are the most popular languages; Fortran, a popular language in HPC, is only #27.

Java is an Alternative for HPC in the Multi-core Era. Interesting features:
- Built-in networking
- Built-in multi-threading
- Portable, platform independent
- Object oriented
- Main training language
- Many productive parallel/distributed programming libraries: Java shared memory programming (high-level facilities: the Concurrency Framework), Java Sockets, Java RMI, Message-Passing in Java (MPJ) libraries, Apache Hadoop

Java Adoption in HPC. HPC developers and users usually want to use Java in their projects: it offers high programming productivity, but they are highly concerned about performance. Java code is no longer slow (Just-In-Time compilation), but there are still performance penalties in Java communications.

Java Adoption in HPC. JIT performance: comparable to native performance; Java can even outperform native languages (C, Fortran) thanks to dynamic compilation.

Java Adoption in HPC. High Java communications overhead, caused by:
- Poor high-speed network support
- Data copies between the Java heap and native code through JNI
- Costly data serialization
- The use of communication protocols unsuitable for HPC

Emerging Interest in Java for HPC

Current State of Java for HPC

Current options in Java for HPC:
- Java shared memory programming (popular)
- Java RMI (poor performance)
- Java Sockets (low-level programming)
- Message-Passing in Java (MPJ) (extended, easy, and reasonably good performance)
- Apache Hadoop (for High Throughput Computing)

Java shared memory programming:
- Java Threads
- Concurrency Framework (thread pools, tasks...)
- Parallel Java (PJ)
- Java OpenMP (JOMP and JaMP)

Listing 1: Java Threads

    class BasicThread extends Thread {
        // This method is called when the thread runs
        public void run() {
            for (int i = 0; i < 1000; i++)
                Test.increaseCount();
        }
    }

    class Test {
        static int counter = 0;

        static synchronized void increaseCount() { counter++; }

        public static void main(String[] argv) throws InterruptedException {
            // Create and start the threads
            Thread t1 = new BasicThread();
            Thread t2 = new BasicThread();
            t1.start();
            t2.start();
            t1.join();
            t2.join();
            System.out.println("Counter=" + counter);
        }
    }
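Listing 1 increments a shared counter from two threads; without synchronization and without joining the threads, the printed value is unpredictable. A minimal race-free sketch of the same idea (class and variable names are my own), using the standard java.util.concurrent.atomic API:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class SafeCounter {
    static final AtomicInteger counter = new AtomicInteger();

    public static void main(String[] args) throws InterruptedException {
        Runnable work = () -> {
            for (int i = 0; i < 1000; i++)
                counter.incrementAndGet();  // atomic, lock-free increment
        };
        Thread t1 = new Thread(work);
        Thread t2 = new Thread(work);
        t1.start();
        t2.start();
        t1.join();   // wait for both threads before reading the counter
        t2.join();
        System.out.println("Counter=" + counter.get());  // always 2000
    }
}
```

AtomicInteger avoids both the lost-update race on `counter++` and the cost of a synchronized method.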

Listing 2: Java Concurrency Framework highlights

    class Test {
        public static void main(String[] argv) {
            // create the thread pool
            ThreadPool threadPool = new ThreadPool(numThreads);
            // run example tasks (tasks implement Runnable)
            for (int i = 0; i < numTasks; i++) {
                threadPool.runTask(createTask(i));
            }
            // close the pool and wait for all tasks to finish
            threadPool.join();
            // Other interesting classes from the Concurrency Framework:
            // CyclicBarrier, ConcurrentHashMap, PriorityBlockingQueue, Executors...
        }
    }
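The ThreadPool/runTask names in Listing 2 look simplified for the slide; a minimal runnable equivalent with the standard ExecutorService API (numThreads, numTasks and the counting task body are illustrative choices of mine) might be:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class PoolDemo {
    static final AtomicInteger done = new AtomicInteger();

    public static void main(String[] args) throws InterruptedException {
        int numThreads = 4, numTasks = 16;
        // create the thread pool
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        // run example tasks (each task is a Runnable)
        for (int i = 0; i < numTasks; i++) {
            pool.execute(() -> done.incrementAndGet());
        }
        // close the pool and wait for all tasks to finish
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        System.out.println("tasks finished: " + done.get());
    }
}
```

shutdown() plus awaitTermination() plays the role of the slide's `threadPool.join()`: no new tasks are accepted, and the caller blocks until all submitted tasks complete.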

JOMP is the Java OpenMP binding.

Listing 3: JOMP example

    public static void main(String[] argv) {
        int myid;
        //omp parallel private(myid)
        {
            myid = OMP.getThreadNum();
            System.out.println("Hello from " + myid);
        }
        //omp parallel for
        for (i = 1; i < n; i++) {
            b[i] = (a[i] + a[i-1]) * 0.5;
        }
    }

MPJ is the Java binding of MPI.

Listing 4: MPJ example

    import mpi.*;

    public class Hello {
        public static void main(String[] args) {
            MPI.Init(args);
            int rank = MPI.COMM_WORLD.Rank();
            int msg_tag = 13;
            if (rank == 0) {
                int peer_process = 1;
                String[] msg = new String[1];
                msg[0] = new String("Hello");
                MPI.COMM_WORLD.Send(msg, 0, 1, MPI.OBJECT, peer_process, msg_tag);
            } else if (rank == 1) {
                int peer_process = 0;
                String[] msg = new String[1];
                MPI.COMM_WORLD.Recv(msg, 0, 1, MPI.OBJECT, peer_process, msg_tag);
                System.out.println(msg[0]);
            }
            MPI.Finalize();
        }
    }

Java message-passing (MPJ) libraries: MPJava, Jcluster, Parallel Java, mpiJava, P2P-MPI, MPJ Express, MPJ/Ibis, JMPI and F-MPJ. They differ in their pure Java socket implementation (Java IO vs. Java NIO), high-speed network support (Myrinet, InfiniBand, SCI) and API (mpiJava 1.2 API, JGF MPJ API, other APIs).

FastMPJ: an MPJ implementation developed at the Computer Architecture Group of the University of A Coruña. Features of FastMPJ include:
- High-performance intra-node communication
- Efficient RDMA transfers over InfiniBand and RoCE
- Fully portable, as Java
- Scalability up to thousands of cores
- Highly productive development and maintenance
- Ideal for multi-core servers, clusters, and cloud computing

FastMPJ supported hardware. FastMPJ runs on InfiniBand through IB Verbs, can rely on TCP/IP, and has shared memory support through Java threads. The stack: MPJ applications → MPJ API (mpiJava 1.2) → FastMPJ library (MPJ collective and point-to-point primitives) → the xxdev device layer: mxdev (MX/Open-MX over Myrinet/Ethernet), psmdev (InfiniPath PSM), ibvdev (IBV API over InfiniBand), niodev/iodev (Java sockets over TCP/IP Ethernet), smdev (Java threads, shared memory); native devices attach through the Java Native Interface (JNI).

Java GPGPU. Table: available solutions for GPGPU computing in Java. CUDA: Java bindings JCUDA and jcuda; user-friendly Java-GPU and Rootbeer. OpenCL: Java bindings JOCL and JogAmp's JOCL; user-friendly Aparapi.

Java GPGPU. Table: description of the GPU-based testbed.
- CPU: 1 Intel Xeon hexa-core X5650 @ 2.67 GHz
- CPU performance: 64.08 GFLOPS DP (10.68 GFLOPS DP per core)
- GPU: 1 NVIDIA Tesla "Fermi" M2050
- GPU performance: 515 GFLOPS DP
- Memory: 12 GB DDR3 (1333 MHz)
- OS: Debian GNU/Linux, kernel 3.2.0-3
- CUDA version: 4.2 (SDK/Toolkit)
- JDK version: OpenJDK 1.6.0_24

Java GPGPU. Figure: matrix multiplication kernel performance (SHOC GEMM), single and double precision: GFLOPS for CUDA, jcuda, Aparapi and Java at problem sizes 2048x2048, 4096x4096 and 8192x8192.

Java GPGPU. Figure: Stencil2D kernel performance (SHOC Stencil2D), single and double precision: runtime in seconds for Java, Aparapi, jcuda and CUDA at problem sizes 2048x2048, 4096x4096 and 8192x8192.

Java GPGPU. Figure: Stencil2D kernel performance (SHOC Stencil2D), single and double precision: GFLOPS for CUDA, jcuda, Aparapi and Java at problem sizes 2048x2048, 4096x4096 and 8192x8192.

Java GPGPU. Figure: FFT kernel performance (SHOC FFT), single and double precision: runtime in seconds for Java, Aparapi, jcuda and CUDA at problem sizes 2048x2048, 4096x4096 and 8192x8192.

Java GPGPU. Figure: FFT kernel performance (SHOC FFT), single and double precision: GFLOPS for CUDA, jcuda, Aparapi and Java at problem sizes 2048x2048, 4096x4096 and 8192x8192.

NPB-MPJ characteristics (about 10,000 SLOC, source lines of code):
- CG (Conjugate Gradient, kernel): 1000 SLOC, medium communication intensiveness
- EP (Embarrassingly Parallel, kernel): 350 SLOC, low
- FT (Fourier Transformation, kernel): 1700 SLOC, high
- IS (Integer Sort, kernel): 700 SLOC, high
- MG (Multi-Grid, kernel): 2000 SLOC, high
- SP (Scalar Pentadiagonal, application): 4300 SLOC, medium

Java HPC optimization: the NPB-MPJ success case. NPB-MPJ optimizations:
- The JVM JIT-compiles heavy and frequently invoked methods using runtime information, so structured programming is the best option: small, frequently called methods are better.
- Mapping elements from multidimensional to one-dimensional arrays (the array flattening technique): arr3d[x][y][z] → arr3dFlat[pos3d(lengthX, lengthY, x, y, z)].
The NPB-MPJ code was refactored accordingly, obtaining significant improvements (up to a 2800% performance increase).
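The array flattening step above can be sketched as follows. The index arithmetic in pos3d is my reconstruction from the slide's pos3d(lengthX, lengthY, x, y, z) signature, assuming an x-varies-fastest layout; NPB-MPJ's actual layout may differ:

```java
public class Flatten {
    // arr3d[x][y][z] -> arr3dFlat[pos3d(lengthX, lengthY, x, y, z)]
    static int pos3d(int lengthX, int lengthY, int x, int y, int z) {
        return (z * lengthY + y) * lengthX + x;  // x varies fastest
    }

    public static void main(String[] args) {
        int lengthX = 2, lengthY = 3, lengthZ = 4;
        double[][][] arr3d = new double[lengthX][lengthY][lengthZ];
        double[] arr3dFlat = new double[lengthX * lengthY * lengthZ];
        for (int x = 0; x < lengthX; x++)
            for (int y = 0; y < lengthY; y++)
                for (int z = 0; z < lengthZ; z++) {
                    arr3d[x][y][z] = x + 10 * y + 100 * z;
                    arr3dFlat[pos3d(lengthX, lengthY, x, y, z)] = arr3d[x][y][z];
                }
        // the flat array holds the same element at the computed index
        System.out.println(arr3dFlat[pos3d(lengthX, lengthY, 1, 2, 3)]);  // 321.0
    }
}
```

The one-dimensional layout replaces Java's arrays-of-arrays (one pointer dereference per dimension) with a single contiguous array, which is what lets the JIT generate tighter loop code.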

Experimental results on one core (relative performance). Figure: NPB kernels serial performance on Amazon EC2 (Class C): MOPS for the GNU compiler, Intel compiler and JVM, running FT, CG, IS and MG on cc1.4xlarge and cc2.8xlarge instances.

IaaS: Infrastructure-as-a-Service. Access to infrastructure supports the execution of computationally intensive tasks in the cloud. There are many IaaS providers: Amazon Web Services (AWS), Google Compute Engine (beta), IBM cloud, HP cloud, Rackspace, ProfitBricks, Penguin-On-Demand (POD). IMHO, AWS is the most suitable in terms of available resources, features, performance, and performance/cost ratio.

HPC and Big Data in AWS. AWS provides a set of instances for HPC:
- CC1: dual-socket quad-core Nehalem Xeon (93 GFLOPS)
- CG1: dual-socket quad-core Nehalem Xeon with 2 GPUs (1123 GFLOPS)
- CC2: dual-socket octo-core Sandy Bridge Xeon (333 GFLOPS)
- Up to 63 GB RAM
- 10 Gigabit Ethernet high-performance network

HPL Linpack in AWS (TOP500). Table: configuration of our EC2 AWS cluster.
- CPU: 2 quad-core Xeon X5570 @ 2.93 GHz (46.88 GFLOPS DP each); EC2 Compute Units: 33.5
- GPU: 2 NVIDIA Tesla "Fermi" M2050 (515 GFLOPS DP each)
- Memory: 22 GB DDR3; storage: 1690 GB
- Virtualization: Xen HVM 64-bit platform (PV drivers for I/O)
- Number of CG1 nodes: 32; interconnection network: 10 Gigabit Ethernet
- Total EC2 Compute Units: 1072; total CPU cores: 256 (3 TFLOPS DP); total GPUs: 64 (32.96 TFLOPS DP)
- Total: 35.96 TFLOPS DP

HPL Linpack in AWS (TOP500). HPL Linpack ranks the performance of supercomputers in the TOP500 list. In AWS, 14 TFLOPS are available to everybody (67 USD/h on demand, 21 USD/h spot). Figure: HPL performance (GFLOPS) and scalability (speedup) on Amazon EC2, CPU vs. CPU+GPU (CUDA), for 1 to 32 instances.

HPL Linpack in AWS (TOP500). References:
- Finis Terrae (CESGA): 14.01 TFLOPS
- Our AWS cluster: 14.23 TFLOPS
- AWS (06/11): 41.82 TFLOPS (#451 TOP500 06/11)
- MareNostrum (BSC): 63.83 TFLOPS (#299 TOP500)
- MinoTauro (BSC): 103.2 TFLOPS (#114 TOP500)
- AWS (11/11): 240.1 TFLOPS (#42 TOP500)

Radiography enhancement through Monte Carlo simulation (MC-GPU)

Image processing with Monte Carlo simulation (MC-GPU). Time: originally 187 minutes; in AWS with HPC, 6 seconds. Cost: processing 500 radiographs in AWS costs approx. 70 USD, i.e., 0.14 USD per unit. Figure: MC-GPU execution time and scalability on Amazon EC2, CPU vs. CPU+GPU (CUDA), for 1 to 32 instances.

AWS HPC and Big Data remarks:
- High computational power (new Sandy Bridge processors and GPUs)
- High-performance communications over 10 Gigabit Ethernet
- Efficient access to network file systems (e.g., NFS, OrangeFS)
- No waiting times when accessing resources
- Easy application and infrastructure deployment: ready-to-run applications in AMIs

Java HPC in an InfiniBand cluster. Figure: point-to-point performance on DAS-4 (IB-QDR): latency (µs) and bandwidth (Gbps) vs. message size for MVAPICH2, Open MPI and FastMPJ (ibvdev).

Java HPC in an InfiniBand cluster. Figure: NPB CG kernel performance on DAS-4 (IB-QDR): MOPS vs. number of cores (1 to 512) for MVAPICH2, Open MPI and FastMPJ (ibvdev).

Java HPC in an InfiniBand cluster. Figure: NPB FT kernel performance on DAS-4 (IB-QDR): MOPS vs. number of cores (1 to 512) for MVAPICH2, Open MPI and FastMPJ (ibvdev).

Java HPC in an InfiniBand cluster. Figure: NPB IS kernel performance on DAS-4 (IB-QDR): MOPS vs. number of cores (1 to 512) for MVAPICH2, Open MPI and FastMPJ (ibvdev).

Java HPC in an InfiniBand cluster. Figure: NPB MG kernel performance on DAS-4 (IB-QDR): MOPS vs. number of cores (1 to 512) for MVAPICH2, Open MPI and FastMPJ (ibvdev).

NPB in AWS. Figure: point-to-point performance on 10 GbE (cc1.4xlarge): latency (µs) and bandwidth (Gbps) vs. message size for MPICH2, Open MPI and FastMPJ.

NPB in AWS. Figure: point-to-point performance on 10 GbE (cc2.8xlarge): latency (µs) and bandwidth (Gbps) vs. message size for MPICH2, Open MPI and FastMPJ.

NPB in AWS. Figure: point-to-point performance on shared memory (cc1.4xlarge): latency (µs) and bandwidth (Gbps) vs. message size for MPICH2, Open MPI and FastMPJ.

NPB in AWS. Figure: point-to-point performance on shared memory (cc2.8xlarge): latency (µs) and bandwidth (Gbps) vs. message size for MPICH2, Open MPI and FastMPJ.

NPB in AWS. Figure: NPB CG Class C performance (cc1.4xlarge): MOPS vs. number of cores (1 to 512) for MPICH2, Open MPI and FastMPJ.

NPB in AWS. Figure: NPB FT Class C performance (cc1.4xlarge): MOPS vs. number of cores (1 to 512) for MPICH2, Open MPI and FastMPJ.

NPB in AWS. Figure: NPB IS Class C performance (cc1.4xlarge): MOPS vs. number of cores (1 to 512) for MPICH2, Open MPI and FastMPJ.

NPB in AWS. Figure: NPB MG Class C performance (cc1.4xlarge): MOPS vs. number of cores (1 to 512) for MPICH2, Open MPI and FastMPJ.

NPB in AWS. Figure: NPB CG Class C performance (cc2.8xlarge): MOPS vs. number of cores (1 to 512) for MPICH2, Open MPI and FastMPJ.

NPB in AWS. Figure: NPB FT Class C performance (cc2.8xlarge): MOPS vs. number of cores (1 to 512) for MPICH2, Open MPI and FastMPJ.

NPB in AWS. Figure: NPB IS Class C performance (cc2.8xlarge): MOPS vs. number of cores (1 to 512) for MPICH2, Open MPI and FastMPJ.

NPB in AWS. Figure: NPB MG Class C performance (cc2.8xlarge): MOPS vs. number of cores (1 to 512) for MPICH2, Open MPI and FastMPJ.

HPC in AWS: efficient Cloud Computing support. Figure: NPB CG kernel Open MPI performance on Amazon EC2, MOPS vs. number of processes (up to 512), comparing default, fill-up and tuned process mappings on 16 instances, and default mapping on 16/32/64-instance configurations.

ProtTest: a phylogeny application.
- ProtTest is a Java application widely used in phylogeny (4000 users)
- Implements 112 models to evaluate the maximum likelihood (ML) of the evolution of several taxa
- The input data are amino acid sequences from several taxa
- Based on Markov chain / Monte Carlo simulations
- A time-consuming sequential application: a fairly simple example can last weeks or months
- Relies on a C application for the computationally intensive part (ML optimization)

ProtTest-HPC phylogeny application

ProtTest-HPC design

ProtTest-HPC implementation (shared memory)

ProtTest-HPC implementation (MPJ)

ProtTest-HPC implementation (hybrid MPJ + threads)

ProtTest-HPC performance. Figure: ProtTest-HPC shared memory speedup on an 8-core machine: RIB, COX, HIV, RIBML, COXML and HIVML datasets, 1 to 8 threads.

ProtTest-HPC performance. Figure: ProtTest-HPC shared memory speedup on a 24-core machine: RIB, COX, HIV, RIBML, COXML and HIVML datasets, 1 to 24 threads.

ProtTest-HPC performance. Figure: ProtTest-HPC distributed memory speedup on a 256-core Harpertown cluster: RIBML, COXML and HIVML datasets (10K/20K/100K), 1 to 112 cores.

ProtTest-HPC performance. Figure: ProtTest-HPC hybrid shared/distributed memory speedup on 28 Harpertown nodes: RIBML, COXML and HIVML datasets (10K/20K/100K), 1 to 224 cores.

ProtTest-HPC performance. Figure: ProtTest-HPC hybrid shared/distributed memory speedup on 8 Nehalem nodes: RIBML, COXML and HIVML datasets (10K/20K/100K), 1 to 64 cores.

Summary:
- Current state of Java for HPC: an interesting and feasible alternative
- Available programming models in Java for HPC: shared memory, distributed memory, and distributed shared memory programming
- Active research on Java for HPC (>30 projects)... but it is still not a mainstream language for HPC
- Adoption of Java for HPC in the cloud: it is an alternative for the cloud (trading some performance for appealing features); performance evaluations are highly important; analysis of current projects (promotion of joint efforts)

Questions? Java for High Performance Cloud Computing, FUNDS Seminar Series 12. Email: taboada@udc.es. WWW: http://gac.udc.es/~gltaboada/. Computer Architecture Group, University of A Coruña.