Retargeting PLAPACK to Clusters with Hardware Accelerators




Retargeting PLAPACK to Clusters with Hardware Accelerators. Manuel Fogué (1), Francisco Igual (1), Enrique S. Quintana-Ortí (1), Robert van de Geijn (2). (1) Departamento de Ingeniería y Ciencia de los Computadores, University Jaume I, Castellón (Spain). (2) Department of Computer Sciences and Institute for Computational Engineering and Sciences, The University of Texas at Austin. Retargeting PLAPACK to Clusters of GPUs 1 Francisco Igual

Motivation. Past, present and future of GPU computing:
Systems with one GPU. Software: CUDA, OpenCL, OpenGL, Cg, ..., ATI Stream.
Multi-GPU systems. Software: GPUSs, StarPU, FLAME (SuperMatrix).
Clusters of GPUs. Software: ?? Our approach: CUPLAPACK.
The future is today: China's new Nebulae supercomputer is No. 2 on the Top500 (June 2010), powered with Tesla C2050 GPUs.

Outline: 1 Introduction. 2 PLAPACK. 3 Porting PLAPACK to clusters of GPUs. 4 Experimental results. 5 Conclusions and future work.

Introduction. PLAPACK: parallel dense linear algebra package for message-passing architectures (i.e., clusters). CUPLAPACK: retargeting of PLAPACK to clusters with graphics processors. Why a PLAPACK-based approach? Modularity (layered design), programmability (FLAME methodology), and an object-based approach. Goals: 1 Programmability: transparent port to clusters of GPUs. 2 Performance: clusters of GPUs with hundreds of nodes. 3 Scalability: sustain performance as problem sizes grow.

PLAPACK overview. PLAPACK: Parallel Linear Algebra Package. A library infrastructure for the parallel implementation of linear algebra algorithms on distributed-memory machines. Offers a natural transition from algorithms to distributed-memory codes; focuses on programmability. Layered design; focuses on extensibility and flexibility. Based on the Physically Based Matrix Distribution (PBMD) and templates.

PLAPACK. Cholesky factorization, right-looking variant (I). Cholesky factorization of a (symmetric) positive definite matrix A: find lower triangular L such that A = L L^T. Partition

  A = \begin{pmatrix} A_{11} & \star \\ A_{21} & A_{22} \end{pmatrix}, \qquad
  L = \begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix},

where A_{11} and L_{11} are b \times b matrices. From A = L L^T, we have that

  \begin{pmatrix} A_{11} & \star \\ A_{21} & A_{22} \end{pmatrix}
  = \begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix}
    \begin{pmatrix} L_{11}^T & L_{21}^T \\ 0 & L_{22}^T \end{pmatrix}
  = \begin{pmatrix} L_{11} L_{11}^T & \star \\ L_{21} L_{11}^T & L_{21} L_{21}^T + L_{22} L_{22}^T \end{pmatrix}.

PLAPACK. Cholesky factorization, right-looking variant (II). This yields the equations

  L_{11} L_{11}^T = A_{11}, \qquad
  L_{21} = A_{21} L_{11}^{-T}, \qquad
  A_{22} - L_{21} L_{21}^T = L_{22} L_{22}^T,

from which the algorithm follows:
1 Partition A = \begin{pmatrix} A_{11} & \star \\ A_{21} & A_{22} \end{pmatrix}, with A_{11} a b \times b submatrix.
2 A_{11} := L_{11} such that A_{11} = L_{11} L_{11}^T (Cholesky factor of A_{11}).
3 A_{21} := L_{21} = A_{21} L_{11}^{-T}.
4 A_{22} := A_{22} - L_{21} L_{21}^T.
5 Continue recursively with A = A_{22} from step 1.

PLAPACK. Using PLAPACK as a library: from algorithms to code.

int PLA_Chol( int b, PLA_Obj A )
{
  PLA_Obj ABR = NULL, A11 = NULL, A21 = NULL;
  /* ... */

  /* View ABR = A */
  PLA_Obj_view_all( A, &ABR );

  while ( TRUE ) {
    /* Partition ABR = / A11  *  \
                       \ A21 ABR /   where A11 is b x b */
    PLA_Obj_split_4( ABR, b, b, &A11, PLA_DUMMY,
                                &A21, &ABR );

    /* A11 := L11 = Cholesky factor( A11 ) */
    PLA_Local_chol( PLA_LOWER_TRIANGULAR, A11 );

    /* Update A21 := L21 = A21 * inv( L11 )' */
    PLA_Trsm( PLA_SIDE_RIGHT, PLA_LOWER_TRIANGULAR,
              PLA_TRANSPOSE, PLA_NONUNIT_DIAG,
              one, A11, A21 );

    /* Update A22 := A22 - L21 * L21' */
    PLA_Syrk( PLA_LOWER_TRIANGULAR, PLA_NO_TRANS,
              minus_one, A21, one, ABR );
  }
  /* ... */
}

PLAPACK. Using PLAPACK as a library: Cholesky code. The slide shows the CPU code (the PLA_Chol routine of the previous slide) side by side with its GPU counterpart; the two are identical except that the local computational routines carry the CUPLA_ prefix:

int CUPLA_Chol( int b, PLA_Obj A )
{
  PLA_Obj ABR = NULL, A11 = NULL, A21 = NULL;
  /* ... */

  /* View ABR = A */
  PLA_Obj_view_all( A, &ABR );

  while ( TRUE ) {
    /* Partition ABR = / A11  *  \
                       \ A21 ABR /   where A11 is b x b */
    PLA_Obj_split_4( ABR, b, b, &A11, PLA_DUMMY,
                                &A21, &ABR );

    /* A11 := L11 = Cholesky factor( A11 ) */
    CUPLA_Local_chol( PLA_LOWER_TRIANGULAR, A11 );

    /* Update A21 := L21 = A21 * inv( L11 )' */
    CUPLA_Trsm( PLA_SIDE_RIGHT, PLA_LOWER_TRIANGULAR,
                PLA_TRANSPOSE, PLA_NONUNIT_DIAG,
                one, A11, A21 );

    /* Update A22 := A22 - L21 * L21' */
    CUPLA_Syrk( PLA_LOWER_TRIANGULAR, PLA_NO_TRANS,
                minus_one, A21, one, ABR );
  }
  /* ... */
}

Porting PLAPACK to clusters of GPUs. GPU acceleration of PLAPACK: how to accelerate it?
Naive implementation: objects are created in RAM, and GPU-CPU transfers are bound to calculations. Calls to the local BLAS are intercepted, and data is transferred to/from the GPU around each call. Straightforward to implement.
Tuned implementation: objects are created in GPU memory, and GPU-CPU transfers are bound to communications; data is transferred to RAM only when necessary (i.e., for a communication). Object creation involves allocation on the GPUs, and the communication layer of PLAPACK (PLA_Copy, PLA_Reduce) must be modified.
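The naive scheme amounts to a wrapper around each intercepted local BLAS call. The sketch below is illustrative only: dev_malloc, dev_copy, dev_free and dev_gemm are invented stand-ins for cudaMalloc, cudaMemcpy, cudaFree and a CUBLAS GEMM kernel, simulated here with plain host memory so the sketch compiles anywhere; only the transfer pattern around the call is the point.

```c
#include <stdlib.h>
#include <string.h>
#include <assert.h>

/* Host-memory stand-ins for the CUDA runtime calls used by the naive port. */
static void *dev_malloc(size_t n)                       { return malloc(n); }
static void  dev_copy(void *d, const void *s, size_t n) { memcpy(d, s, n); }
static void  dev_free(void *p)                          { free(p); }

/* Stand-in for a CUBLAS GEMM kernel: C := C + A * B, column-major. */
static void dev_gemm(int m, int n, int k,
                     const double *A, const double *B, double *C) {
    for (int j = 0; j < n; j++)
        for (int i = 0; i < m; i++) {
            double s = 0.0;
            for (int p = 0; p < k; p++) s += A[i + p*m] * B[p + j*k];
            C[i + j*m] += s;
        }
}

/* Naive interception: the object lives in RAM, so every intercepted local
   BLAS call is bracketed by a host->device and a device->host transfer. */
void local_gemm_naive(int m, int n, int k,
                      const double *A, const double *B, double *C) {
    size_t sa = (size_t)m * k * sizeof(double);
    size_t sb = (size_t)k * n * sizeof(double);
    size_t sc = (size_t)m * n * sizeof(double);
    double *dA = dev_malloc(sa), *dB = dev_malloc(sb), *dC = dev_malloc(sc);

    dev_copy(dA, A, sa);                  /* operands to the "GPU" ...      */
    dev_copy(dB, B, sb);
    dev_copy(dC, C, sc);
    dev_gemm(m, n, k, dA, dB, dC);        /* ... compute on the "GPU" ...   */
    dev_copy(C, dC, sc);                  /* ... result back to RAM at once */

    dev_free(dA); dev_free(dB); dev_free(dC);
}
```

The tuned scheme removes exactly these per-call transfers by keeping the objects resident in GPU memory between calls.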

Porting PLAPACK to clusters of GPUs. PLAPACK infrastructure: layered design. Necessary changes to use the GPU: 1 PLA_Local_BLAS: replaced by NVIDIA CUBLAS. 2 Linear algebra object manipulation: management of objects in GPU memory. 3 Communication layer: transfers to RAM prior to communications.

Porting PLAPACK to clusters of GPUs. LA object manipulation: Cholesky driver using the GPU.

int main( void ) {
  /* ... */
  /* Object creation on GPU */
  CUPLA_Matrix_create( MPI_FLOAT, size, size, templ,
                       PLA_ALIGN_FIRST, PLA_ALIGN_FIRST,
                       &A_GPU );

  /* Object initialization on GPU */
  CUPLA_Obj_set_to_SPD_random( A_GPU );

  /* ... */

  /* Cholesky factorization on GPU */
  CUPLA_Chol( nb_alg, A_GPU );
  /* ... */

  /* Object destruction on GPU */
  CUPLA_Obj_free( &A_GPU );
}

CUPLA_Matrix_create creates a linear algebra object on the GPU. CUPLA_Obj_free destroys an object on the GPU. CUPLA_Chol operates with objects created on the GPU.

Porting PLAPACK to clusters of GPUs. Communication layer. Communication is restricted to the routines PLA_Copy and PLA_Reduce:

int PLA_Copy( PLA_Obj obj_from, PLA_Obj obj_to )
Purpose: copy contents between linear algebra objects.
  IN      obj_from   object to be copied
  IN/OUT  obj_to     object into which to copy

int PLA_Reduce( PLA_Obj obj_from, MPI_Op op, PLA_Obj obj_to )
Purpose: reduce, using op, the contents of the duplicated object obj_from and overwrite obj_to with the result.
  IN      obj_from   object to be reduced
  IN      op         reduce operator to be used
  IN/OUT  obj_to     object into which to reduce the result
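In the tuned port, each such exchange is staged through host memory: device to host staging buffer, wire transfer, host staging buffer to device. The sketch below is a hypothetical host-only mock-up of that pattern: the "device" buffers are plain allocations, get_from_device/set_to_device stand in for cublasGetMatrix/cublasSetMatrix, and a memcpy stands in for the MPI_Send/MPI_Recv pair, so only the staging structure reflects the real code.

```c
#include <stdlib.h>
#include <string.h>
#include <assert.h>

/* Stand-in for cublasGetMatrix: device -> host staging buffer, column-major
   m x n view with leading dimensions ldd (device) and ldh (host). */
static void get_from_device(double *host, const double *dev,
                            int m, int n, int ldd, int ldh) {
    for (int j = 0; j < n; j++)
        memcpy(&host[j*ldh], &dev[j*ldd], (size_t)m * sizeof(double));
}

/* Stand-in for cublasSetMatrix: host staging buffer -> device. */
static void set_to_device(double *dev, const double *host,
                          int m, int n, int ldh, int ldd) {
    for (int j = 0; j < n; j++)
        memcpy(&dev[j*ldd], &host[j*ldh], (size_t)m * sizeof(double));
}

/* Staged copy between two "device" objects, as in the modified PLA_Copy:
   device -> host, wire transfer, host -> device. */
void staged_copy(const double *dev_from, double *dev_to,
                 int m, int n, int ld_from, int ld_to) {
    double *send_buf = malloc((size_t)m * n * sizeof(double));
    double *recv_buf = malloc((size_t)m * n * sizeof(double));

    get_from_device(send_buf, dev_from, m, n, ld_from, m);
    /* "the wire": stands in for the MPI_Send / MPI_Recv pair */
    memcpy(recv_buf, send_buf, (size_t)m * n * sizeof(double));
    set_to_device(dev_to, recv_buf, m, n, m, ld_to);

    free(send_buf);
    free(recv_buf);
}
```

The point of the design is that only this staging path touches host memory; all computation stays on data resident in GPU memory.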

Porting PLAPACK to clusters of GPUs. Modifications to PLA_Copy (I).

/* PLAPACK: PLA_Copy between column-aligned matrices */

while ( TRUE ) {
  PLA_Obj_split_size( from_cur, PLA_SIDE_TOP, &size_from, &owner_from );
  PLA_Obj_split_size( to_cur,   PLA_SIDE_TOP, &size_to,   &owner_to );

  if ( 0 == ( size = min( size_from, size_to ) ) ) break;

  PLA_Obj_horz_split_2( from_cur, size, &from_1, &from_cur );
  PLA_Obj_horz_split_2( to_cur,   size, &to_1,   &to_cur );

  if ( myrow == owner_from && owner_from == owner_to ) {
    PLA_Local_copy( from_1, to_1 );
  }
  else {
    if ( myrow == owner_from ) {
      PLA_Obj_get_local_contents( from_1, PLA_NO_TRANS, &dummy, &dummy,
                                  buffer_temp, size, 1 );

      MPI_Send( BF( buffer_temp ), size * local_width, datatype,
                owner_to, 0, comm_col );
    }
    if ( myrow == owner_to ) {
      MPI_Recv( BF( buffer_temp ), size * local_width, datatype,
                owner_from, MPI_ANY_TAG, comm_col, &status );

      PLA_Obj_set_local_contents( PLA_NO_TRANS, size, local_width,
                                  buffer_temp, size, 1, to_1 );
    }
  }
}
PLA_free( buffer_temp );

Porting PLAPACK to clusters of GPUs. Modifications to PLA_Copy (II). In PLAPACK, PLA_Obj_get_local_contents copies the local data column by column with memcpy; in CUPLAPACK, the same routine retrieves the local data from GPU memory with a single call to cublasGetMatrix:

/* PLAPACK: PLA_Obj_get_local_contents */
int PLA_Obj_get_local_contents(
    PLA_Obj obj, int trans,
    int *rows_in_buf, int *cols_in_buf,
    void *buf, int ldim_buf, int stride_buf )
  /* ... */
  for ( j = 0; j < n; j++ ) {
    tmp_local = buf_local + j * ldim_buf * typesize;
    tmp_obj   = buf_obj   + j * ldim     * typesize;
    memcpy( tmp_local, tmp_obj, m * typesize );
  }
  /* ... */

/* CUPLAPACK: CUPLA_Obj_get_local_contents */
int CUPLA_Obj_get_local_contents(
    PLA_Obj obj, int trans,
    int *rows_in_buf, int *cols_in_buf,
    void *buf, int ldim_buf, int stride_buf )
  /* ... */
  cublasGetMatrix( m, n, typesize, buf_obj, ldim, buf_local, ldim_buf );
  /* ... */

Necessary GPU-CPU transfers: data is transferred to RAM prior to a communication, and back to the GPU after each communication.

Experimental results. The LONGHORN visualization cluster.

                             Per node                        Per system (256 nodes)
Number of cores              8                               2,048
CPU                          Intel Xeon Nehalem @ 2.53 GHz
Available memory             48 GBytes                       13.5 TBytes
Interconnection network      QDR InfiniBand
Graphics system                                              128 NVIDIA Quadro Plex S4s
GPUs                         2 x NVIDIA Quadro FX5800        512 x NVIDIA Quadro FX5800
Interconnection bus          PCIExpress 2.0 (8x)
Available video memory       8 GBytes DDR3                   2 TBytes DDR3
Peak performance (SP)        161.6 GFLOPS                    41.40 TFLOPS
Peak performance (DP)        80.8 GFLOPS                     20.70 TFLOPS

Table: detailed features of the LONGHORN cluster.

Software: PLAPACK version R32; BLAS: NVIDIA CUBLAS 2.2 and MKL 10.1; MPI: MVAPICH 1.4.

Experimental results. GEMM: performance. Figure: performance (GFLOPS) of CUPLAPACK for the matrix-matrix multiplication on LONGHORN, for matrix sizes up to m=n=k=60,000 on 1 to 32 Quadro FX5800 GPUs.

Experimental results. Cholesky factorization: performance. Figure: performance (GFLOPS) of CUPLAPACK for the Cholesky factorization on LONGHORN, for matrix sizes up to 100,000 on 1 to 32 Quadro FX5800 GPUs.

Experimental results. GEMM: scalability. Figure: scalability (speed-up) of the PLAPACK-based codes for the matrix-matrix multiplication on LONGHORN, on 1 to 32 Quadro FX5800 GPUs. The reference is the performance of CUBLAS SGEMM on one GPU.

Experimental results. GEMM: PLAPACK vs. CUPLAPACK. Figure: GEMM performance (GFLOPS) on 16 nodes of LONGHORN; PLAPACK on 128 Intel Xeon Nehalem cores vs. CUPLAPACK on 16 and 32 Quadro FX5800 GPUs.

Experimental results. Cholesky factorization: PLAPACK vs. CUPLAPACK. Figure: Cholesky factorization performance (GFLOPS) on 16 nodes of LONGHORN; PLAPACK on 128 Intel Xeon Nehalem cores vs. CUPLAPACK on 16 and 32 Quadro FX5800 GPUs.

Experimental results. Multiple GPUs per node vs. one GPU per node. Figure: Cholesky factorization on 32 GPUs of LONGHORN, using one GPU on each of 32 nodes or two GPUs on each of 16 nodes.

Conclusions and future work. Conclusions: a retargeting of PLAPACK to hybrid architectures that is transparent for programmers and library developers, benefits from the layered design of PLAPACK, and attains remarkable performance and scalability on large GPU clusters. Future work: port the Elemental framework to GPUs, and add out-of-core + GPU support to deal with even larger problems.

Conclusions and future work. Acknowledgements. Thank you. Questions? We thank the Texas Advanced Computing Center for granting access to the platform where the experiments were conducted: http://www.tacc.utexas.edu/resources/visualization/#longhorn