High Performance Computing in Java and the Cloud
1 High Performance Computing in Java and the Cloud
Computer Architecture Group, University of A Coruña (Spain). [email protected]
January 21, 2013, Universitat Politècnica de Catalunya, Barcelona
2 Outline
1. Java for High Performance Computing
2. AWS IaaS for HPC and Big Data
3. Evaluation of HPC in the Cloud
3 Java is an Alternative for HPC in the Multi-core Era
Language popularity (% skilled developers): #1 C (17.8%), #2 Java (17.4%), #3 Objective-C (10.3%), #4 C++ (9.14%), #17 Matlab (0.64%), #25 Fortran (0.46%).
C and Java are the most popular languages, whereas Fortran, traditionally popular in HPC, ranks only #25 (0.46%).
4 Java is an Alternative for HPC in the Multi-core Era
Interesting features:
- Built-in networking
- Built-in multi-threading
- Portable, platform independent
- Object oriented
- Main training language
- Many productive parallel/distributed programming libraries:
  - Java shared memory programming (high-level facilities: Concurrency Framework)
  - Java Sockets
  - Java RMI
  - Message-Passing in Java (MPJ) libraries
  - Apache Hadoop
5 Java Adoption in HPC
HPC developers and users usually want to use Java in their projects: they value its high programming productivity, but they are highly concerned about performance.
Java code is no longer slow (Just-In-Time compilation)! But there are still performance penalties in Java communications.
6 Java Adoption in HPC
JIT performance: near-native. Thanks to dynamic compilation, Java can even outperform native languages (C, Fortran)!
7 Java Adoption in HPC
High Java communications overhead:
- Poor high-speed network support
- Data copies between the Java heap and native code through JNI
- Costly data serialization
- Use of communication protocols unsuitable for HPC
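The serialization cost mentioned above is easy to observe. A minimal sketch, assuming nothing from the talk (class and method names are mine): it compares default Java serialization of a double[] payload, as naive RMI/sockets code would do, against the raw buffer copy an HPC messaging layer performs.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.nio.ByteBuffer;

public class SerializationCost {
    // Default Java serialization (the mechanism behind RMI and naive sockets code).
    static byte[] javaSerialize(double[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(data);
        }
        return bos.toByteArray();
    }

    // Raw copy of the doubles into a byte buffer, as an HPC messaging layer would send them.
    static byte[] rawCopy(double[] data) {
        ByteBuffer buf = ByteBuffer.allocate(data.length * Double.BYTES);
        buf.asDoubleBuffer().put(data);
        return buf.array();
    }

    public static void main(String[] args) throws IOException {
        double[] msg = new double[1 << 20];          // 8 MB payload
        long t0 = System.nanoTime();
        byte[] ser = javaSerialize(msg);
        long t1 = System.nanoTime();
        byte[] raw = rawCopy(msg);
        long t2 = System.nanoTime();
        System.out.printf("serialize: %.1f ms (%d bytes), raw copy: %.1f ms (%d bytes)%n",
                (t1 - t0) / 1e6, ser.length, (t2 - t1) / 1e6, raw.length);
    }
}
```

On typical JVMs the serialization path is markedly slower than the plain copy, which is why MPJ libraries avoid it for primitive arrays.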
8 Emerging Interest in Java for HPC
9 Current State of Java for HPC
10 Java for HPC
Current options in Java for HPC:
- Java RMI (popular, but poor performance)
- Java Sockets (low-level programming)
- Message-Passing in Java (MPJ) (widespread, easy, and reasonably good performance)
- Apache Hadoop (for High-Throughput Computing)
11 Java for HPC
Java shared memory programming:
- Java Threads
- Concurrency Framework (thread pools, tasks...)
- Parallel Java (PJ)
- Java OpenMP (JOMP and JaMP)
12 Listing 1: Java Threads

    class BasicThread extends Thread {
        // This method is called when the thread runs
        public void run() {
            for (int i = 0; i < 1000; i++)
                Test.increaseCount();
        }
    }

    class Test {
        static int counter = 0;

        static void increaseCount() { counter++; }

        public static void main(String[] argv) {
            // Create and start the threads
            Thread t1 = new BasicThread();
            Thread t2 = new BasicThread();
            t1.start();
            t2.start();
            System.out.println("Counter=" + counter);
        }
    }
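The thread example above has a data race on the shared counter, and it prints the result before the threads have finished. A corrected sketch (the class name is mine) using AtomicInteger for the updates and Thread.join() to wait for completion:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class SafeThread extends Thread {
    static final AtomicInteger counter = new AtomicInteger(0);

    @Override
    public void run() {
        for (int i = 0; i < 1000; i++)
            counter.incrementAndGet();   // atomic increment: no lost updates
    }

    static int runTwoThreads() {
        counter.set(0);
        Thread t1 = new SafeThread();
        Thread t2 = new SafeThread();
        t1.start();
        t2.start();
        try {
            t1.join();                   // wait for both threads before reading
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println("Counter=" + runTwoThreads());  // always 2000
    }
}
```

Without the atomic increment two threads can read the same value and both write value+1, losing one update; without the joins the main thread may print 0.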
13 Listing 2: Java Concurrency Framework highlights

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    class Test {
        public static void main(String[] argv) throws InterruptedException {
            int numThreads = 4, numTasks = 16;
            // Create the thread pool
            ExecutorService threadPool = Executors.newFixedThreadPool(numThreads);
            // Run example tasks (tasks implement Runnable)
            for (int i = 0; i < numTasks; i++) {
                threadPool.execute(createTask(i));
            }
            // Close the pool and wait for all tasks to finish
            threadPool.shutdown();
            threadPool.awaitTermination(1, TimeUnit.MINUTES);
        }

        static Runnable createTask(final int id) {
            return () -> System.out.println("task " + id);
        }
    }

    // Other interesting classes from the Concurrency Framework:
    // CyclicBarrier, ConcurrentHashMap, PriorityBlockingQueue, Executors...
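One of the classes highlighted above, CyclicBarrier, coordinates phase-based computations: no worker starts phase p+1 until all workers have finished phase p. A minimal sketch (class and method names are mine, not from the talk):

```java
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicInteger;

public class BarrierDemo {
    static final AtomicInteger phasesDone = new AtomicInteger();

    static int runPhases(int numThreads, int numPhases) {
        phasesDone.set(0);
        // The barrier action runs once per generation, after all parties arrive.
        CyclicBarrier barrier = new CyclicBarrier(numThreads, phasesDone::incrementAndGet);
        Thread[] workers = new Thread[numThreads];
        for (int t = 0; t < numThreads; t++) {
            workers[t] = new Thread(() -> {
                try {
                    for (int p = 0; p < numPhases; p++) {
                        // ... compute this worker's share of phase p ...
                        barrier.await();   // nobody starts phase p+1 early
                    }
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
            workers[t].start();
        }
        try {
            for (Thread w : workers) w.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return phasesDone.get();
    }

    public static void main(String[] args) {
        System.out.println("phases completed: " + runPhases(4, 3));  // prints 3
    }
}
```

The barrier is "cyclic" because it resets after each generation, so the same instance synchronizes every iteration of the loop.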
14 JOMP is the Java OpenMP binding
Listing 3: JOMP example

    public static void main(String[] argv) {
        int myid;
        //omp parallel private(myid)
        {
            myid = OMP.getThreadNum();
            System.out.println("Hello from " + myid);
        }

        int n = 1000;
        double[] a = new double[n], b = new double[n];
        //omp parallel for
        for (int i = 1; i < n; i++) {
            b[i] = (a[i] + a[i-1]) * 0.5;
        }
    }
15 MPJ is the Java binding of MPI
Listing 4: MPJ example

    import mpi.*;

    public class Hello {
        public static void main(String[] args) {
            MPI.Init(args);
            int rank = MPI.COMM_WORLD.Rank();
            int msg_tag = 13;
            if (rank == 0) {
                int peer_process = 1;
                String[] msg = new String[1];
                msg[0] = new String("Hello");
                MPI.COMM_WORLD.Send(msg, 0, 1, MPI.OBJECT, peer_process, msg_tag);
            } else if (rank == 1) {
                int peer_process = 0;
                String[] msg = new String[1];
                MPI.COMM_WORLD.Recv(msg, 0, 1, MPI.OBJECT, peer_process, msg_tag);
                System.out.println(msg[0]);
            }
            MPI.Finalize();
        }
    }
16 Java Message-Passing implementations
[Table: pure Java MPJ projects (MPJava, Jcluster, Parallel Java, mpiJava, P2P-MPI, MPJ Express, MPJ/Ibis, JMPI, FastMPJ) compared by socket implementation (Java IO / Java NIO), high-speed network support (Myrinet, InfiniBand, SCI), and API (mpiJava 1.2, JGF MPJ, other APIs)]
17 FastMPJ
MPJ implementation developed at the Computer Architecture Group of the University of A Coruña. Features of FastMPJ include:
- High-performance intra-node communication
- Efficient RDMA transfers over InfiniBand and RoCE
- Fully portable, as Java
- Scalability up to thousands of cores
- Highly productive development and maintenance
- Ideal for multi-core servers, clusters and cloud computing
18 FastMPJ Supported Hardware
FastMPJ runs on InfiniBand through IB Verbs, can rely on TCP/IP, and has shared memory support through Java threads.
[Diagram: MPJ applications call the MPJ API (mpiJava 1.2) on top of the FastMPJ library (MPJ collective and point-to-point primitives). The xxdev device layer provides mxdev (MX/Open-MX over Myrinet/Ethernet), psmdev (InfiniPath PSM), ibvdev (IBV API over InfiniBand), niodev/iodev (Java sockets over TCP/IP Ethernet) and smdev (Java threads over shared memory), using JNI where native libraries are involved]
19 FastMPJ in MareNostrum3
[Figure: point-to-point latency (µs) and bandwidth (Gbps) on InfiniBand FDR for message sizes from 1 KB to 16 MB, comparing Open MPI and FastMPJ (ibvdev)]
20 Java for High Performance Computing
Table: Available solutions for GPGPU computing in Java

             Java bindings         User-friendly
    CUDA     JCUDA, jcuda          Java-GPU, Rootbeer
    OpenCL   JOCL, JogAmp's JOCL   Aparapi
21 Listing 5: Java code example

    final double in[], out[];
    for (int i = 0; i < in.length; i++)
        out[i] = in[i] * in[i];

Listing 6: Aparapi code example

    Kernel kernel = new Kernel() {
        public void run() {
            int i = getGlobalId();
            out[i] = in[i] * in[i];
        }
    };
    kernel.execute(in.length);

Listing 7: Thread code example

    Thread thread = new Thread(new Runnable() {
        public void run() {
            // ...
        }
    });
    thread.start();
    thread.join();
22 Listing 8: jCUDA matrix multiplication

    private static void dgemmJCublas(int n, double alpha, double A[],
                                     double B[], double beta, double C[]) {
        int nn = n * n;
        // Initialize JCublas
        JCublas.cublasInit();
        // Allocate memory on the device
        Pointer d_A = new Pointer();
        Pointer d_B = new Pointer();
        Pointer d_C = new Pointer();
        JCublas.cublasAlloc(nn, Sizeof.DOUBLE, d_A);
        JCublas.cublasAlloc(nn, Sizeof.DOUBLE, d_B);
        JCublas.cublasAlloc(nn, Sizeof.DOUBLE, d_C);
        // Copy the memory from the host to the device
        JCublas.cublasSetVector(nn, Sizeof.DOUBLE, Pointer.to(A), 1, d_A, 1);
        JCublas.cublasSetVector(nn, Sizeof.DOUBLE, Pointer.to(B), 1, d_B, 1);
        JCublas.cublasSetVector(nn, Sizeof.DOUBLE, Pointer.to(C), 1, d_C, 1);
        // Execute dgemm
        JCublas.cublasDgemm('n', 'n', n, n, n, alpha, d_A, n, d_B, n, beta, d_C, n);
        // Copy the result from the device to the host
        JCublas.cublasGetVector(nn, Sizeof.DOUBLE, d_C, 1, Pointer.to(C), 1);
        // Clean up
        JCublas.cublasFree(d_A);
        JCublas.cublasFree(d_B);
        JCublas.cublasFree(d_C);
        JCublas.cublasShutdown();
    }
23 Listing 9: Aparapi matrix multiplication

    class AparapiMatMul extends Kernel {
        double matA[], matB[], matC[];
        int rows, cols1, cols2;

        public void run() {
            int i = getGlobalId() / rows;
            int j = getGlobalId() % rows;
            double value = 0;
            for (int k = 0; k < cols1; k++) {
                value += matA[k + i * cols1] * matB[k * cols2 + j];
            }
            matC[i * cols1 + j] = value;
        }
    }
24 Java for High Performance Computing
Table: Description of the GPU-based testbed

    CPU              1 Intel Xeon hexa-core, 2.67 GHz
    CPU performance  64.08 GFLOPS DP (10.68 GFLOPS DP per core)
    GPU              1 NVIDIA Tesla Fermi M2050
    GPU performance  515 GFLOPS DP
    Memory           12 GB DDR3 (1333 MHz)
    OS               Debian GNU/Linux
    CUDA version     4.2 (SDK and Toolkit)
    JDK version      OpenJDK 1.6.0_24
25 Java for High Performance Computing
[Figure: matrix multiplication kernel performance (GFLOPS, SHOC GEMM), single and double precision, for problem sizes up to 8192x8192, comparing CUDA, jCUDA, Aparapi and Java]
26 Java for High Performance Computing
[Figure: Stencil2D kernel runtime (seconds, SHOC Stencil2D), single and double precision, for problem sizes 2048x2048 to 8192x8192, comparing Java, Aparapi, jCUDA and CUDA]
27 Java for High Performance Computing
[Figure: Stencil2D kernel performance (GFLOPS, SHOC Stencil2D), single and double precision, for problem sizes 2048x2048 to 8192x8192, comparing CUDA, jCUDA, Aparapi and Java]
28 Java for High Performance Computing
[Figure: FFT kernel runtime (SHOC FFT), single and double precision, for problem sizes 2048x2048 to 8192x8192, comparing Java, Aparapi, jCUDA and CUDA]
29 Java for High Performance Computing
[Figure: FFT kernel performance (GFLOPS, SHOC FFT), single and double precision, for problem sizes 2048x2048 to 8192x8192, comparing CUDA, jCUDA, Aparapi and Java]
30 NPB-MPJ characteristics (10,000 SLOC (Source LOC) in total)

    Name  Operation                SLOC   Communication intensiveness  Type
    CG    Conjugate Gradient       -      Medium                       Kernel
    EP    Embarrassingly Parallel  -      Low                          Kernel
    FT    Fourier Transformation   1700   High                         Kernel
    IS    Integer Sort             700    High                         Kernel
    MG    Multi-Grid               2000   High                         Kernel
    SP    Scalar Pentadiagonal     4300   Medium                       Application
31 Java HPC optimization: NPB-MPJ success case
NPB-MPJ optimizations:
- The JVM JIT-compiles heavy and frequently invoked methods using runtime information, so structured programming is the best option: small, frequent methods are better.
- Mapping elements from multidimensional to one-dimensional arrays (array flattening technique): arr3d[x][y][z] becomes arr3d[pos3d(lengthX, lengthY, x, y, z)].
The NPB-MPJ code was refactored accordingly, obtaining significant improvements (up to a 2800% performance increase).
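The array flattening technique above can be sketched as follows. This follows the slide's pos3d(lengthX, lengthY, x, y, z) signature with a column-major mapping; the exact mapping used in NPB-MPJ may differ, and the class name is mine:

```java
public class ArrayFlattening {
    // Column-major mapping of (x, y, z) into a 1D array of size lenX*lenY*lenZ.
    static int pos3d(int lenX, int lenY, int x, int y, int z) {
        return x + lenX * (y + lenY * z);
    }

    public static void main(String[] args) {
        int lenX = 3, lenY = 4, lenZ = 5;
        double[] flat = new double[lenX * lenY * lenZ];
        double[][][] cube = new double[lenX][lenY][lenZ];

        for (int x = 0; x < lenX; x++)
            for (int y = 0; y < lenY; y++)
                for (int z = 0; z < lenZ; z++) {
                    double v = 100 * x + 10 * y + z;
                    cube[x][y][z] = v;                     // three pointer dereferences
                    flat[pos3d(lenX, lenY, x, y, z)] = v;  // one contiguous access
                }

        // Same element, two layouts: the flat form keeps the data contiguous,
        // which is friendlier to the JIT compiler and the cache.
        System.out.println(cube[2][3][4] == flat[pos3d(lenX, lenY, 2, 3, 4)]);  // prints true
    }
}
```

In Java a double[][][] is an array of arrays of arrays, so each access chases three references; the flattened form replaces that with one index computation over a single contiguous block.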
32 Experimental Results on One Core (relative performance)
[Figure: NPB kernels (FT, CG, IS, MG, Class C) serial performance in MOPS on Amazon EC2 cc1.4xlarge and cc2.8xlarge instances, comparing the GNU compiler, the Intel compiler and the JVM]
33 Java HPC in an InfiniBand cluster (DAS-4)
[Figure: point-to-point latency (µs) and bandwidth (Gbps) on DAS-4 (IB-QDR) for message sizes from 1 KB to 16 MB, comparing MVAPICH2, Open MPI and FastMPJ (ibvdev)]
34 Java HPC in an InfiniBand cluster (DAS-4)
[Figure: NPB CG kernel performance (MOPS) on DAS-4 (IB-QDR) vs. number of cores, comparing MVAPICH2, Open MPI and FastMPJ (ibvdev)]
35 Java HPC in an InfiniBand cluster (DAS-4)
[Figure: NPB FT kernel performance (MOPS) on DAS-4 (IB-QDR) vs. number of cores, comparing MVAPICH2, Open MPI and FastMPJ (ibvdev)]
36 Java HPC in an InfiniBand cluster (DAS-4)
[Figure: NPB MG kernel performance (MOPS) on DAS-4 (IB-QDR) vs. number of cores, comparing MVAPICH2, Open MPI and FastMPJ (ibvdev)]
37 IaaS: Infrastructure-as-a-Service
Section: AWS IaaS for HPC and Big Data / Evaluation of HPC in the Cloud
Access to infrastructure supports the execution of computationally intensive tasks in the cloud. There are many IaaS providers:
- Amazon Web Services (AWS)
- Google Compute Engine (beta)
- IBM Cloud
- HP Cloud
- Rackspace
- ProfitBricks
- Penguin-On-Demand (POD)
In my opinion, AWS is the most suitable in terms of available resources, performance, and performance/cost ratio.
38 HPC and Big Data in AWS
AWS provides a set of instances for HPC:
- CC1: dual-socket, quad-core Nehalem Xeon (93 GFLOPS)
- CG1: dual-socket, quad-core Nehalem Xeon with 2 GPUs (1123 GFLOPS)
- CC2: dual-socket, octo-core Sandy Bridge Xeon (333 GFLOPS)
- Up to 63 GB RAM
- 10 Gigabit Ethernet high-performance network
39 HPL Linpack in AWS (TOP500)
Table: Configuration of our AWS EC2 cluster

    CPU                      2 Intel Xeon quad-core @ 2.93 GHz (46.88 GFLOPS DP each)
    EC2 Compute Units        33.5
    GPU                      2 NVIDIA Tesla Fermi M2050 (515 GFLOPS DP each)
    Memory                   22 GB DDR3
    Storage                  1690 GB
    Virtualization           Xen HVM 64-bit platform (PV drivers for I/O)
    Number of CG1 nodes      32
    Interconnection network  10 Gigabit Ethernet
    Total EC2 Compute Units  1072
    Total CPU cores          256 (3 TFLOPS DP)
    Total GPUs               64 (32.96 TFLOPS DP)
    Total FLOPS              35.96 TFLOPS DP
40 HPL Linpack in AWS (TOP500)
HPL Linpack ranks the performance of supercomputers in the TOP500 list. In AWS, 14 TFLOPS are available to everybody (67 USD/h on demand, 21 USD/h spot)!
[Figure: HPL performance (GFLOPS) and scalability (speedup) on Amazon EC2 for 1 to 32 instances, CPU-only vs. CPU+GPU (CUDA)]
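The price/performance point above reduces to simple arithmetic. A back-of-the-envelope sketch, assuming the 2013 prices quoted on the slide and the 14.23 TFLOPS HPL result (the helper method is mine):

```java
public class Ec2CostPerTflops {
    // Cost of one TFLOPS sustained for one hour at a given hourly cluster price.
    static double usdPerTflopsHour(double usdPerHour, double tflops) {
        return usdPerHour / tflops;
    }

    public static void main(String[] args) {
        double tflops = 14.23;   // measured HPL on the 32-node cluster (slide figure)
        double onDemand = 67.0;  // USD per hour, on-demand (slide figure)
        double spot = 21.0;      // USD per hour, spot (slide figure)
        System.out.printf("on-demand: %.2f USD per TFLOPS-hour%n",
                usdPerTflopsHour(onDemand, tflops));
        System.out.printf("spot:      %.2f USD per TFLOPS-hour%n",
                usdPerTflopsHour(spot, tflops));
    }
}
```

At these rates, spot instances cut the cost per TFLOPS-hour by roughly a factor of three relative to on-demand pricing.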
41 HPL Linpack in AWS (TOP500)
References:
- Our AWS cluster: 14.23 TFLOPS
- AWS (06/2011): 41.82 TFLOPS (#451 in the TOP500)
- AWS (11/2011): 240.1 TFLOPS (#42 in the TOP500, now #102)
42 Radiography enhanced through Monte Carlo simulation (MC-GPU)
43 Image processing with Monte Carlo simulation (MC-GPU)
Time: originally 187 minutes; in AWS with HPC resources, 6 seconds!
Cost: processing 500 radiographies in AWS costs approx. 70 USD, i.e., 0.14 USD per unit.
[Figure: MC-GPU execution time (s) and scalability (speedup) on Amazon EC2 for 1 to 32 instances, CPU-only vs. CPU+GPU (CUDA)]
44 AWS HPC and Big Data Remarks
- High computational power (new Sandy Bridge processors and GPUs)
- High-performance communications over 10 Gigabit Ethernet
- Efficient access to network file systems (e.g., NFS, OrangeFS)
- No waiting times when accessing resources
- Easy application and infrastructure deployment: ready-to-run applications in AMIs
45 NPB Performance Evaluation in AWS
[Figure: point-to-point latency (µs) and bandwidth (Gbps) over 10 GbE (cc2.8xlarge) for message sizes from 1 KB to 64 MB, comparing MPICH2, Open MPI and FastMPJ]
46 NPB Performance Evaluation in AWS
[Figure: point-to-point latency (µs) and bandwidth (Gbps) over shared memory (cc2.8xlarge) for message sizes from 1 KB to 64 MB, comparing MPICH2, Open MPI and FastMPJ]
47 NPB Performance Evaluation in AWS
[Figure: NPB CG Class C performance (MOPS) on cc2.8xlarge vs. number of cores, comparing MPICH2, Open MPI and FastMPJ]
48 NPB Performance Evaluation in AWS
[Figure: NPB CG Class C scalability (speedup) on cc2.8xlarge vs. number of cores, comparing MPICH2, Open MPI and FastMPJ]
49 NPB Performance Evaluation in AWS
[Figure: NPB FT Class C performance (MOPS) on cc2.8xlarge vs. number of cores, comparing MPICH2, Open MPI and FastMPJ]
50 NPB Performance Evaluation in AWS
[Figure: NPB FT Class C scalability (speedup) on cc2.8xlarge vs. number of cores, comparing MPICH2, Open MPI and FastMPJ]
51 NPB Performance Evaluation in AWS
[Figure: NPB MG Class C performance (MOPS) on cc2.8xlarge vs. number of cores, comparing MPICH2, Open MPI and FastMPJ]
52 NPB Performance Evaluation in AWS
[Figure: NPB MG Class C scalability (speedup) on cc2.8xlarge vs. number of cores, comparing MPICH2, Open MPI and FastMPJ]
53 HPC in AWS: Efficient Cloud Computing Support
[Figure: NPB CG kernel Open MPI performance (MOPS) on Amazon EC2 vs. number of processes (8 to 128), comparing default process placement with fill-up and tuned placement on 16 instances (left), and default placement on 16, 32 and 64 instances (right)]
54 Summary
Current state of HPC in Java and the Cloud:
- Java and IaaS are interesting and feasible alternatives (trading some performance for appealing features)
- Suitable programming models:
  - Java HPC: FastMPJ
  - Cloud HPC: hybrid MPI/MPJ + GPGPU/threads
- Active research on the suitability of Java and IaaS for HPC... but they are still not mainstream options
- Performance evaluations are highly important
- Analysis of current projects (promotion of joint efforts)
55 Questions?
High Performance Computing in Java and the Cloud
High Performance Computing Seminar
WWW: gltaboada/
Computer Architecture Group, University of A Coruña
ABSTRACT TALPALLIKAR, NIKHIL VIVEK. High-Performance Cloud Computing: VCL Case Study. (Under the direction of Dr. Mladen Vouk.) High performance computing used to be domain of specialties and of the relatively
High Performance. CAEA elearning Series. Jonathan G. Dudley, Ph.D. 06/09/2015. 2015 CAE Associates
High Performance Computing (HPC) CAEA elearning Series Jonathan G. Dudley, Ph.D. 06/09/2015 2015 CAE Associates Agenda Introduction HPC Background Why HPC SMP vs. DMP Licensing HPC Terminology Types of
RWTH GPU Cluster. Sandra Wienke [email protected] November 2012. Rechen- und Kommunikationszentrum (RZ) Fotos: Christian Iwainsky
RWTH GPU Cluster Fotos: Christian Iwainsky Sandra Wienke [email protected] November 2012 Rechen- und Kommunikationszentrum (RZ) The RWTH GPU Cluster GPU Cluster: 57 Nvidia Quadro 6000 (Fermi) innovative
Advancing Applications Performance With InfiniBand
Advancing Applications Performance With InfiniBand Pak Lui, Application Performance Manager September 12, 2013 Mellanox Overview Ticker: MLNX Leading provider of high-throughput, low-latency server and
CORRIGENDUM TO TENDER FOR HIGH PERFORMANCE SERVER
CORRIGENDUM TO TENDER FOR HIGH PERFORMANCE SERVER Tender Notice No. 3/2014-15 dated 29.12.2014 (IIT/CE/ENQ/COM/HPC/2014-15/569) Tender Submission Deadline Last date for submission of sealed bids is extended
Case Study on Productivity and Performance of GPGPUs
Case Study on Productivity and Performance of GPGPUs Sandra Wienke [email protected] ZKI Arbeitskreis Supercomputing April 2012 Rechen- und Kommunikationszentrum (RZ) RWTH GPU-Cluster 56 Nvidia
High Performance Data-Transfers in Grid Environment using GridFTP over InfiniBand
High Performance Data-Transfers in Grid Environment using GridFTP over InfiniBand Hari Subramoni *, Ping Lai *, Raj Kettimuthu **, Dhabaleswar. K. (DK) Panda * * Computer Science and Engineering Department
Agenda. HPC Software Stack. HPC Post-Processing Visualization. Case Study National Scientific Center. European HPC Benchmark Center Montpellier PSSC
HPC Architecture End to End Alexandre Chauvin Agenda HPC Software Stack Visualization National Scientific Center 2 Agenda HPC Software Stack Alexandre Chauvin Typical HPC Software Stack Externes LAN Typical
SR-IOV In High Performance Computing
SR-IOV In High Performance Computing Hoot Thompson & Dan Duffy NASA Goddard Space Flight Center Greenbelt, MD 20771 [email protected] [email protected] www.nccs.nasa.gov Focus on the research side
Rootbeer: Seamlessly using GPUs from Java
Rootbeer: Seamlessly using GPUs from Java Phil Pratt-Szeliga. Dr. Jim Fawcett. Dr. Roy Welch. Syracuse University. Rootbeer Overview and Motivation Rootbeer allows a developer to program a GPU in Java
Evaluation of Java Message Passing in High Performance Data Analytics
Evaluation of Java Message Passing in High Performance Data Analytics Saliya Ekanayake, Geoffrey Fox School of Informatics and Computing Indiana University Bloomington, Indiana, USA {sekanaya, gcf}@indiana.edu
David Rioja Redondo Telecommunication Engineer Englobe Technologies and Systems
David Rioja Redondo Telecommunication Engineer Englobe Technologies and Systems About me David Rioja Redondo Telecommunication Engineer - Universidad de Alcalá >2 years building and managing clusters UPM
Introduction to Hybrid Programming
Introduction to Hybrid Programming Hristo Iliev Rechen- und Kommunikationszentrum aixcelerate 2012 / Aachen 10. Oktober 2012 Version: 1.1 Rechen- und Kommunikationszentrum (RZ) Motivation for hybrid programming
The High Performance Internet of Things: using GVirtuS for gluing cloud computing and ubiquitous connected devices
WS on Models, Algorithms and Methodologies for Hierarchical Parallelism in new HPC Systems The High Performance Internet of Things: using GVirtuS for gluing cloud computing and ubiquitous connected devices
Evaluation of CUDA Fortran for the CFD code Strukti
Evaluation of CUDA Fortran for the CFD code Strukti Practical term report from Stephan Soller High performance computing center Stuttgart 1 Stuttgart Media University 2 High performance computing center
Comparing SMB Direct 3.0 performance over RoCE, InfiniBand and Ethernet. September 2014
Comparing SMB Direct 3.0 performance over RoCE, InfiniBand and Ethernet Anand Rangaswamy September 2014 Storage Developer Conference Mellanox Overview Ticker: MLNX Leading provider of high-throughput,
Cluster Computing in a College of Criminal Justice
Cluster Computing in a College of Criminal Justice Boris Bondarenko and Douglas E. Salane Mathematics & Computer Science Dept. John Jay College of Criminal Justice The City University of New York 2004
Enabling Technologies for Distributed and Cloud Computing
Enabling Technologies for Distributed and Cloud Computing Dr. Sanjay P. Ahuja, Ph.D. 2010-14 FIS Distinguished Professor of Computer Science School of Computing, UNF Multi-core CPUs and Multithreading
CENIC Private Cloud Pilot Using Amazon Reserved Instances (RIs) September 9, 2011
CENIC Private Cloud Pilot Using Amazon Reserved Instances (RIs) September 9, 2011 CENIC has been working with Amazon for some time to put in place procedures through which CENIC member institutions can
Graphics Cards and Graphics Processing Units. Ben Johnstone Russ Martin November 15, 2011
Graphics Cards and Graphics Processing Units Ben Johnstone Russ Martin November 15, 2011 Contents Graphics Processing Units (GPUs) Graphics Pipeline Architectures 8800-GTX200 Fermi Cayman Performance Analysis
HPC Wales Skills Academy Course Catalogue 2015
HPC Wales Skills Academy Course Catalogue 2015 Overview The HPC Wales Skills Academy provides a variety of courses and workshops aimed at building skills in High Performance Computing (HPC). Our courses
Multi-Threading Performance on Commodity Multi-Core Processors
Multi-Threading Performance on Commodity Multi-Core Processors Jie Chen and William Watson III Scientific Computing Group Jefferson Lab 12000 Jefferson Ave. Newport News, VA 23606 Organization Introduction
Optimizing a 3D-FWT code in a cluster of CPUs+GPUs
Optimizing a 3D-FWT code in a cluster of CPUs+GPUs Gregorio Bernabé Javier Cuenca Domingo Giménez Universidad de Murcia Scientific Computing and Parallel Programming Group XXIX Simposium Nacional de la
PARALLEL & CLUSTER COMPUTING CS 6260 PROFESSOR: ELISE DE DONCKER BY: LINA HUSSEIN
1 PARALLEL & CLUSTER COMPUTING CS 6260 PROFESSOR: ELISE DE DONCKER BY: LINA HUSSEIN Introduction What is cluster computing? Classification of Cluster Computing Technologies: Beowulf cluster Construction
Cluster Grid Interconects. Tony Kay Chief Architect Enterprise Grid and Networking
Cluster Grid Interconects Tony Kay Chief Architect Enterprise Grid and Networking Agenda Cluster Grid Interconnects The Upstart - Infiniband The Empire Strikes Back - Myricom Return of the King 10G Gigabit
Unleashing the Performance Potential of GPUs for Atmospheric Dynamic Solvers
Unleashing the Performance Potential of GPUs for Atmospheric Dynamic Solvers Haohuan Fu [email protected] High Performance Geo-Computing (HPGC) Group Center for Earth System Science Tsinghua University
GTC Presentation March 19, 2013. Copyright 2012 Penguin Computing, Inc. All rights reserved
GTC Presentation March 19, 2013 Copyright 2012 Penguin Computing, Inc. All rights reserved Session S3552 Room 113 S3552 - Using Tesla GPUs, Reality Server and Penguin Computing's Cloud for Visualizing
How to Deploy OpenStack on TH-2 Supercomputer Yusong Tan, Bao Li National Supercomputing Center in Guangzhou April 10, 2014
How to Deploy OpenStack on TH-2 Supercomputer Yusong Tan, Bao Li National Supercomputing Center in Guangzhou April 10, 2014 2014 年 云 计 算 效 率 与 能 耗 暨 第 一 届 国 际 云 计 算 咨 询 委 员 会 中 国 高 峰 论 坛 Contents Background
OpenCL Optimization. San Jose 10/2/2009 Peng Wang, NVIDIA
OpenCL Optimization San Jose 10/2/2009 Peng Wang, NVIDIA Outline Overview The CUDA architecture Memory optimization Execution configuration optimization Instruction optimization Summary Overall Optimization
10- High Performance Compu5ng
10- High Performance Compu5ng (Herramientas Computacionales Avanzadas para la Inves6gación Aplicada) Rafael Palacios, Fernando de Cuadra MRE Contents Implemen8ng computa8onal tools 1. High Performance
McMPI. Managed-code MPI library in Pure C# Dr D Holmes, EPCC [email protected]
McMPI Managed-code MPI library in Pure C# Dr D Holmes, EPCC [email protected] Outline Yet another MPI library? Managed-code, C#, Windows McMPI, design and implementation details Object-orientation,
