Retargeting PLAPACK to Clusters with Hardware Accelerators
Retargeting PLAPACK to Clusters with Hardware Accelerators
Manuel Fogué (1), Francisco Igual (1), Enrique S. Quintana-Ortí (1), Robert van de Geijn (2)
(1) Departamento de Ingeniería y Ciencia de los Computadores, University Jaume I, Castellón (Spain).
(2) Department of Computer Sciences and Institute for Computational Engineering and Sciences, The University of Texas at Austin.
Retargeting PLAPACK to Clusters of GPUs - Francisco Igual
Motivation
Past, present and future of GPU computing:
- Systems with one GPU. Software: CUDA, OpenCL, OpenGL, Cg, ..., ATI Stream.
- Multi-GPU systems. Software: GPUSs, StarPU, FLAME (SuperMatrix).
- Clusters of GPUs. The future is today... Software: our approach, CUPLAPACK.
China's new Nebulae supercomputer is No. 2 on the Top500 list (June 2010), powered with Tesla C2050 GPUs.
Outline
1. Introduction
2. PLAPACK
3. Porting PLAPACK to clusters of GPUs
4. Experimental results
5. Conclusions and future work
Introduction
PLAPACK: parallel dense linear algebra package for message-passing architectures (i.e., clusters).
CUPLAPACK: retargeting of PLAPACK to clusters with graphics processors.
Why a PLAPACK-based approach? Modularity (layered design), programmability (FLAME methodology), and an object-based approach.
Goals:
1. Programmability: transparent port to clusters of GPUs.
2. Performance: clusters of GPUs with hundreds of nodes.
3. Scalability: keep performance with large problems.
PLAPACK overview
PLAPACK: Parallel Linear Algebra Package.
- Library infrastructure for the parallel implementation of linear algebra algorithms on distributed-memory machines.
- Natural transition from algorithms to distributed-memory codes; focuses on programmability.
- Layered design; focuses on extensibility and flexibility.
- Physically Based Matrix Distribution (PBMD) and templates.
PLAPACK: Cholesky factorization. Right-looking variant (I)
Cholesky factorization of a positive definite matrix A: A = L L^T.
Partition

  A = \begin{pmatrix} A_{11} & \star \\ A_{21} & A_{22} \end{pmatrix},
  \qquad
  L = \begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix},

where A_{11} and L_{11} are b x b matrices. From A = L L^T, we have that

  \begin{pmatrix} A_{11} & \star \\ A_{21} & A_{22} \end{pmatrix}
  = \begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix}
    \begin{pmatrix} L_{11}^T & L_{21}^T \\ 0 & L_{22}^T \end{pmatrix}
  = \begin{pmatrix} L_{11} L_{11}^T & \star \\ L_{21} L_{11}^T & L_{21} L_{21}^T + L_{22} L_{22}^T \end{pmatrix}.
PLAPACK: Cholesky factorization. Right-looking variant (II)
This yields the equations:

  L_{11} L_{11}^T = A_{11},
  \qquad
  L_{21} L_{11}^T = A_{21},
  \qquad
  A_{22} - L_{21} L_{21}^T = L_{22} L_{22}^T.

Concluding with the algorithm:
1. Partition A = \begin{pmatrix} A_{11} & \star \\ A_{21} & A_{22} \end{pmatrix}, with A_{11} a b x b submatrix.
2. A_{11} := L_{11} such that A_{11} = L_{11} L_{11}^T (Cholesky factor of A_{11}).
3. A_{21} := L_{21} = A_{21} L_{11}^{-T}.
4. A_{22} := A_{22} - L_{21} L_{21}^T.
5. Continue recursively with A = A_{22} from step 1.
PLAPACK: Using PLAPACK as a library. From algorithms to code

int PLA_Chol( int b, PLA_Obj A )
{
  PLA_Obj ABR = NULL, A11 = NULL, A21 = NULL;
  /* ... */

  /* View ABR = A */
  PLA_Obj_view_all( A, &ABR );

  while ( TRUE ) {
    /* Partition ABR = / A11  *  \
                       | ======= |
                       \ A21 ABR /
       where A11 is b x b */
    PLA_Obj_split_4( ABR, b, b, &A11, PLA_DUMMY,
                                &A21, &ABR );

    /* A11 := L11 = Cholesky factor( A11 ) */
    PLA_Local_chol( PLA_LOWER_TRIANGULAR, A11 );

    /* Update A21 := L21 = A21 * inv( L11 )^T */
    PLA_Trsm( PLA_SIDE_RIGHT, PLA_LOWER_TRIANGULAR,
              PLA_TRANSPOSE, PLA_NONUNIT_DIAG,
              one, A11, A21 );

    /* Update A22 := A22 - L21 * L21^T */
    PLA_Syrk( PLA_LOWER_TRIANGULAR, PLA_NO_TRANS,
              minus_one, A21, one, ABR );
  }
  /* ... */
}
PLAPACK: Using PLAPACK as a library. Cholesky code
The slide shows the PLAPACK and CUPLAPACK routines side by side: the CUPLAPACK version is identical except that the local computational calls (PLA_Local_chol, PLA_Trsm, PLA_Syrk) are replaced by their GPU-accelerated counterparts:

int CUPLA_Chol( int b, PLA_Obj A )
{
  PLA_Obj ABR = NULL, A11 = NULL, A21 = NULL;
  /* ... */

  /* View ABR = A */
  PLA_Obj_view_all( A, &ABR );

  while ( TRUE ) {
    /* Partition ABR = / A11  *  \
                       | ======= |
                       \ A21 ABR /
       where A11 is b x b */
    PLA_Obj_split_4( ABR, b, b, &A11, PLA_DUMMY,
                                &A21, &ABR );

    /* A11 := L11 = Cholesky factor( A11 ) */
    CUPLA_Local_chol( PLA_LOWER_TRIANGULAR, A11 );

    /* Update A21 := L21 = A21 * inv( L11 )^T */
    CUPLA_Trsm( PLA_SIDE_RIGHT, PLA_LOWER_TRIANGULAR,
                PLA_TRANSPOSE, PLA_NONUNIT_DIAG,
                one, A11, A21 );

    /* Update A22 := A22 - L21 * L21^T */
    CUPLA_Syrk( PLA_LOWER_TRIANGULAR, PLA_NO_TRANS,
                minus_one, A21, one, ABR );
  }
  /* ... */
}
Porting PLAPACK to clusters of GPUs: GPU acceleration of PLAPACK
How to accelerate PLAPACK?
Naive implementation (objects created in RAM; GPU-CPU transfers bound to calculations):
- Intercept calls to the local BLAS and transfer data to/from the GPU.
- Straightforward implementation.
Tuned implementation (objects created in GPU memory; GPU-CPU transfers bound to communications):
- Data is transferred to RAM only when necessary (i.e., for communication).
- Creation of objects involves allocation on the GPUs.
- Requires modifying the communication layer of PLAPACK (PLA_Copy, PLA_Reduce).
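The naive interception scheme can be sketched as follows. All names below are hypothetical, and plain malloc/memcpy stand in for device allocation and host-device transfers (e.g., cudaMalloc and cublasSetMatrix/cublasGetMatrix in a real CUDA port); the point is only the pattern: every intercepted local BLAS call pays for its own transfers.

```c
#include <stdlib.h>
#include <string.h>

/* "Device" kernel stand-in for, e.g., cublasDgemm:
   C := C + A*B, column-major, A is m x k, B is k x n, C is m x n. */
static void device_gemm(int m, int n, int k,
                        const double *A, const double *B, double *C) {
    for (int j = 0; j < n; j++)
        for (int p = 0; p < k; p++)
            for (int i = 0; i < m; i++)
                C[i + j*m] += A[i + p*m] * B[p + j*k];
}

/* Intercepted local gemm: transfers are bound to the calculation. */
void local_gemm_intercepted(int m, int n, int k,
                            const double *A, const double *B, double *C) {
    double *dA = malloc(sizeof(double) * m * k);   /* "cudaMalloc"      */
    double *dB = malloc(sizeof(double) * k * n);
    double *dC = malloc(sizeof(double) * m * n);
    memcpy(dA, A, sizeof(double) * m * k);         /* host -> "device"  */
    memcpy(dB, B, sizeof(double) * k * n);
    memcpy(dC, C, sizeof(double) * m * n);
    device_gemm(m, n, k, dA, dB, dC);              /* accelerated call  */
    memcpy(C, dC, sizeof(double) * m * n);         /* "device" -> host  */
    free(dA); free(dB); free(dC);
}
```

This is what makes the naive port straightforward but transfer-bound; the tuned implementation avoids it by keeping objects resident in GPU memory and transferring only around communications.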
Porting PLAPACK to clusters of GPUs: PLAPACK infrastructure
PLAPACK layered design. Necessary changes to use the GPU:
1. PLA_Local_BLAS: mapped to NVIDIA CUBLAS.
2. LA object manipulation: management of objects in GPU memory.
3. Communication layer: transfers to RAM prior to communications.
Porting PLAPACK to clusters of GPUs: LA object manipulation. Cholesky driver using the GPU

int main( void )
{
  /* ... */
  /* Object creation on GPU */
  CUPLA_Matrix_create( MPI_FLOAT, size, size, templ,
                       PLA_ALIGN_FIRST, PLA_ALIGN_FIRST,
                       &A_GPU );

  /* Object initialization on GPU */
  CUPLA_Obj_set_to_SPD_random( A_GPU );
  /* ... */

  /* Cholesky factorization on GPU */
  CUPLA_Chol( nb_alg, A_GPU );
  /* ... */

  /* Object destruction on GPU */
  CUPLA_Obj_free( &A_GPU );
}

CUPLA_Matrix_create creates a linear algebra object on the GPU.
CUPLA_Obj_free destroys an object on the GPU.
CUPLA_Chol operates with objects created on the GPU.
Porting PLAPACK to clusters of GPUs: Communication layer
Communication is restricted to the routines PLA_Copy and PLA_Reduce:

int PLA_Copy( PLA_Obj obj_from, PLA_Obj obj_to )
Purpose: copy the contents between linear algebra objects.
  IN      obj_from  Object to be copied
  IN/OUT  obj_to    Object into which to copy

int PLA_Reduce( PLA_Obj obj_from, MPI_Op op, PLA_Obj obj_to )
Purpose: reduce, using op, the contents of the duplicated object obj_from and overwrite obj_to with the result.
  IN      obj_from  Object to be reduced
  IN      op        Reduce operator to be used
  IN/OUT  obj_to    Object into which to reduce the result
Porting PLAPACK to clusters of GPUs: Modifications to PLA_Copy (I)

/* PLAPACK PLA_Copy between column-aligned matrices */

while ( TRUE ) {
  PLA_Obj_split_size( from_cur, PLA_SIDE_TOP, &size_from, &owner_from );
  PLA_Obj_split_size( to_cur,   PLA_SIDE_TOP, &size_to,   &owner_to );

  if ( 0 == ( size = min( size_from, size_to ) ) ) break;

  PLA_Obj_horz_split_2( from_cur, size, &from_1, &from_cur );
  PLA_Obj_horz_split_2( to_cur,   size, &to_1,   &to_cur );

  if ( myrow == owner_from && owner_from == owner_to ) {
    PLA_Local_copy( from_1, to_1 );
  }
  else {
    if ( myrow == owner_from ) {
      PLA_Obj_get_local_contents( from_1, PLA_NO_TRANS, &dummy, &dummy,
                                  buffer_temp, size, 1 );

      MPI_Send( BF( buffer_temp ), size * local_width, datatype,
                owner_to, 0, comm_col );
    }
    if ( myrow == owner_to ) {
      MPI_Recv( BF( buffer_temp ), size * local_width, datatype,
                owner_from, MPI_ANY_TAG, comm_col, &status );

      PLA_Obj_set_local_contents( PLA_NO_TRANS, size, local_width,
                                  buffer_temp, size, 1, to_1 );
    }
  }
}
PLA_free( buffer_temp );
Porting PLAPACK to clusters of GPUs: Modifications to PLA_Copy (II)

/* PLAPACK PLA_Obj_get_local_contents */

int PLA_Obj_get_local_contents(
      PLA_Obj obj, int trans,
      int *rows_in_buf, int *cols_in_buf,
      void *buf, int ldim_buf, int stride_buf )
{
  /* ... */
  for ( j = 0; j < n; j++ ) {
    tmp_local = buf_local + j * ldim_buf * typesize;
    tmp_obj   = buf_obj   + j * ldim     * typesize;
    memcpy( tmp_local, tmp_obj, m * typesize );
  }
  /* ... */
}

/* CUPLAPACK CUPLA_Obj_get_local_contents */

int CUPLA_Obj_get_local_contents(
      PLA_Obj obj, int trans,
      int *rows_in_buf, int *cols_in_buf,
      void *buf, int ldim_buf, int stride_buf )
{
  /* ... */
  cublasGetMatrix( m, n, typesize, buf_obj, ldim, buf_local, ldim_buf );
  /* ... */
}

Necessary GPU-CPU transfers:
- Data is transferred to RAM prior to a communication.
- Data is transferred to the GPU after each communication.
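The packing performed by PLA_Obj_get_local_contents can be sketched on the host as a column-by-column copy. This is a simplified stand-in with a hypothetical name (pack_local_contents), and it ignores the trans and stride arguments of the real routine; in CUPLAPACK the same operation becomes a single device-to-host cublasGetMatrix call.

```c
#include <string.h>

/* Pack the local m x n piece of a column-major object (leading dimension
   ldim) into a contiguous buffer with leading dimension ldim_buf, one
   column at a time -- the memcpy loop of PLA_Obj_get_local_contents. */
void pack_local_contents(int m, int n,
                         const double *obj, int ldim,
                         double *buf, int ldim_buf) {
    for (int j = 0; j < n; j++)
        memcpy(buf + j * ldim_buf,   /* column j of the buffer */
               obj + j * ldim,       /* column j of the object */
               m * sizeof(double));
}
```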
Experimental results: the LONGHORN visualization cluster

                            Per node                    Per system (256 nodes)
Number of cores             8                           2,048
CPU                         Intel Xeon 2.53 GHz
Available memory            48 GBytes                   13.5 TBytes
Interconnection network     QDR InfiniBand
Graphics system                                         128 NVIDIA Quadro Plex S4s
GPU                         2 x NVIDIA Quadro FX5800    512 x NVIDIA Quadro FX5800
Interconnection bus         PCIExpress 2.0 (8x)
Available video memory      8 GBytes DDR3               2 TBytes DDR3
Peak performance (SP)       GFLOPS                      TFLOPS
Peak performance (DP)       80.8 GFLOPS                 TFLOPS

Table: Detailed features of the LONGHORN cluster.

PLAPACK: version R32. BLAS: NVIDIA CUBLAS 2.2, Intel MKL. MPI: MVAPICH 1.4.
Experimental results: GEMM performance
Figure: Performance (GFLOPS) of CUPLAPACK for the matrix-matrix multiplication on LONGHORN, as a function of the matrix size (m = n = k), for configurations of up to 32 Quadro FX5800 GPUs.
Experimental results: Cholesky factorization performance
Figure: Performance (GFLOPS) of CUPLAPACK for the Cholesky factorization on LONGHORN, as a function of the matrix size, for configurations of up to 32 Quadro FX5800 GPUs.
Experimental results: GEMM scalability
Figure: Scalability (speedup) of the PLAPACK-based codes for the matrix-matrix multiplication on LONGHORN, as a function of the matrix size (m = n = k), for configurations of up to 32 Quadro FX5800 GPUs. The reference is the performance of CUBLAS SGEMM on one GPU.
Experimental results: GEMM, PLAPACK vs. CUPLAPACK
Figure: GEMM on 16 nodes of LONGHORN, PLAPACK vs. CUPLAPACK (GFLOPS as a function of the matrix size, m = n = k): CUPLAPACK on 32 Quadro FX5800, CUPLAPACK on 16 Quadro FX5800, and PLAPACK on the Intel Xeon Nehalem cores.
Experimental results: Cholesky factorization, PLAPACK vs. CUPLAPACK
Figure: Cholesky factorization on 16 nodes of LONGHORN, PLAPACK vs. CUPLAPACK (GFLOPS as a function of the matrix size): CUPLAPACK on 32 Quadro FX5800, CUPLAPACK on 16 Quadro FX5800, and PLAPACK on the Intel Xeon Nehalem cores.
Experimental results: Multiple GPUs per node vs. one GPU per node
Figure: Cholesky factorization on 32 GPUs of LONGHORN (GFLOPS as a function of the matrix size), using 32 Quadro FX5800 on 32 nodes (one GPU per node) or 32 Quadro FX5800 on 16 nodes (two GPUs per node).
Conclusions and future work
Conclusions:
- Retargeting of PLAPACK to hybrid architectures.
- Transparent for programmers and library developers.
- Benefits from the layered design of PLAPACK.
- Remarkable performance and scalability on large GPU clusters.
Future work:
- Port the Elemental framework to GPUs.
- Out-of-core + GPU support to deal with even larger problems.
Acknowledgements
Thank you... Questions?
We thank the Texas Advanced Computing Center for granting access to the platform where the experiments were conducted.
Information Technology Purchase of High Performance Computing (HPC) Central Compute Resources by Northwestern Researchers Effective for FY2016 Purpose This document summarizes High Performance Computing
How to choose a suitable computer
How to choose a suitable computer This document provides more specific information on how to choose a computer that will be suitable for scanning and post-processing your data with Artec Studio. While
Performance of the JMA NWP models on the PC cluster TSUBAME.
Performance of the JMA NWP models on the PC cluster TSUBAME. K.Takenouchi 1), S.Yokoi 1), T.Hara 1) *, T.Aoki 2), C.Muroi 1), K.Aranami 1), K.Iwamura 1), Y.Aikawa 1) 1) Japan Meteorological Agency (JMA)
HPC Cluster Decisions and ANSYS Configuration Best Practices. Diana Collier Lead Systems Support Specialist Houston UGM May 2014
HPC Cluster Decisions and ANSYS Configuration Best Practices Diana Collier Lead Systems Support Specialist Houston UGM May 2014 1 Agenda Introduction Lead Systems Support Specialist Cluster Decisions Job
www.xenon.com.au STORAGE HIGH SPEED INTERCONNECTS HIGH PERFORMANCE COMPUTING VISUALISATION GPU COMPUTING
www.xenon.com.au STORAGE HIGH SPEED INTERCONNECTS HIGH PERFORMANCE COMPUTING GPU COMPUTING VISUALISATION XENON Accelerating Exploration Mineral, oil and gas exploration is an expensive and challenging
A Study on the Scalability of Hybrid LS-DYNA on Multicore Architectures
11 th International LS-DYNA Users Conference Computing Technology A Study on the Scalability of Hybrid LS-DYNA on Multicore Architectures Yih-Yih Lin Hewlett-Packard Company Abstract In this paper, the
CUDAMat: a CUDA-based matrix class for Python
Department of Computer Science 6 King s College Rd, Toronto University of Toronto M5S 3G4, Canada http://learning.cs.toronto.edu fax: +1 416 978 1455 November 25, 2009 UTML TR 2009 004 CUDAMat: a CUDA-based
COMP/CS 605: Intro to Parallel Computing Lecture 01: Parallel Computing Overview (Part 1)
COMP/CS 605: Intro to Parallel Computing Lecture 01: Parallel Computing Overview (Part 1) Mary Thomas Department of Computer Science Computational Science Research Center (CSRC) San Diego State University
CUDA in the Cloud Enabling HPC Workloads in OpenStack With special thanks to Andrew Younge (Indiana Univ.) and Massimo Bernaschi (IAC-CNR)
CUDA in the Cloud Enabling HPC Workloads in OpenStack John Paul Walters Computer Scien5st, USC Informa5on Sciences Ins5tute [email protected] With special thanks to Andrew Younge (Indiana Univ.) and Massimo
Experiences on using GPU accelerators for data analysis in ROOT/RooFit
Experiences on using GPU accelerators for data analysis in ROOT/RooFit Sverre Jarp, Alfio Lazzaro, Julien Leduc, Yngve Sneen Lindal, Andrzej Nowak European Organization for Nuclear Research (CERN), Geneva,
ACCELERATING COMMERCIAL LINEAR DYNAMIC AND NONLINEAR IMPLICIT FEA SOFTWARE THROUGH HIGH- PERFORMANCE COMPUTING
ACCELERATING COMMERCIAL LINEAR DYNAMIC AND Vladimir Belsky Director of Solver Development* Luis Crivelli Director of Solver Development* Matt Dunbar Chief Architect* Mikhail Belyi Development Group Manager*
ArcGIS Pro: Virtualizing in Citrix XenApp and XenDesktop. Emily Apsey Performance Engineer
ArcGIS Pro: Virtualizing in Citrix XenApp and XenDesktop Emily Apsey Performance Engineer Presentation Overview What it takes to successfully virtualize ArcGIS Pro in Citrix XenApp and XenDesktop - Shareable
Several tips on how to choose a suitable computer
Several tips on how to choose a suitable computer This document provides more specific information on how to choose a computer that will be suitable for scanning and postprocessing of your data with Artec
Parallel Large-Scale Visualization
Parallel Large-Scale Visualization Aaron Birkland Cornell Center for Advanced Computing Data Analysis on Ranger January 2012 Parallel Visualization Why? Performance Processing may be too slow on one CPU
A distributed CPU-GPU sparse direct solver
A distributed CPU-GPU sparse direct solver Piyush Sao 1, Richard Vuduc 1, and Xiaoye Li 2 1 Georgia Institute of Technology, {piyush3,richie}@gatech.edu 2 Lawrence Berkeley National Laboratory, [email protected]
Programming models for heterogeneous computing. Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga
Programming models for heterogeneous computing Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga Talk outline [30 slides] 1. Introduction [5 slides] 2.
Evoluzione dell Infrastruttura di Calcolo e Data Analytics per la ricerca
Evoluzione dell Infrastruttura di Calcolo e Data Analytics per la ricerca Carlo Cavazzoni CINECA Supercomputing Application & Innovation www.cineca.it 21 Aprile 2015 FERMI Name: Fermi Architecture: BlueGene/Q
Graphic Processing Units: a possible answer to High Performance Computing?
4th ABINIT Developer Workshop RESIDENCE L ESCANDILLE AUTRANS HPC & Graphic Processing Units: a possible answer to High Performance Computing? Luigi Genovese ESRF - Grenoble 26 March 2009 http://inac.cea.fr/l_sim/
Introduction to GPU Computing
Matthis Hauschild Universität Hamburg Fakultät für Mathematik, Informatik und Naturwissenschaften Technische Aspekte Multimodaler Systeme December 4, 2014 M. Hauschild - 1 Table of Contents 1. Architecture
Towards Fast SQL Query Processing in DB2 BLU Using GPUs A Technology Demonstration. Sina Meraji [email protected]
Towards Fast SQL Query Processing in DB2 BLU Using GPUs A Technology Demonstration Sina Meraji [email protected] Please Note IBM s statements regarding its plans, directions, and intent are subject to
Performance Measurement of a High-Performance Computing System Utilized for Electronic Medical Record Management
Performance Measurement of a High-Performance Computing System Utilized for Electronic Medical Record Management 1 Kiran George, 2 Chien-In Henry Chen 1,Corresponding Author Computer Engineering Program,
HPC enabling of OpenFOAM R for CFD applications
HPC enabling of OpenFOAM R for CFD applications Towards the exascale: OpenFOAM perspective Ivan Spisso 25-27 March 2015, Casalecchio di Reno, BOLOGNA. SuperComputing Applications and Innovation Department,
The Asterope compute cluster
The Asterope compute cluster ÅA has a small cluster named asterope.abo.fi with 8 compute nodes Each node has 2 Intel Xeon X5650 processors (6-core) with a total of 24 GB RAM 2 NVIDIA Tesla M2050 GPGPU
Optimizing GPU-based application performance for the HP for the HP ProLiant SL390s G7 server
Optimizing GPU-based application performance for the HP for the HP ProLiant SL390s G7 server Technology brief Introduction... 2 GPU-based computing... 2 ProLiant SL390s GPU-enabled architecture... 2 Optimizing
ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU
Computer Science 14 (2) 2013 http://dx.doi.org/10.7494/csci.2013.14.2.243 Marcin Pietroń Pawe l Russek Kazimierz Wiatr ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU Abstract This paper presents
Several tips on how to choose a suitable computer
Several tips on how to choose a suitable computer This document provides more specific information on how to choose a computer that will be suitable for scanning and postprocessing of your data with Artec
The Green Index: A Metric for Evaluating System-Wide Energy Efficiency in HPC Systems
202 IEEE 202 26th IEEE International 26th International Parallel Parallel and Distributed and Distributed Processing Processing Symposium Symposium Workshops Workshops & PhD Forum The Green Index: A Metric
GPU Programming in Computer Vision
Computer Vision Group Prof. Daniel Cremers GPU Programming in Computer Vision Preliminary Meeting Thomas Möllenhoff, Robert Maier, Caner Hazirbas What you will learn in the practical course Introduction
Large-Scale Reservoir Simulation and Big Data Visualization
Large-Scale Reservoir Simulation and Big Data Visualization Dr. Zhangxing John Chen NSERC/Alberta Innovates Energy Environment Solutions/Foundation CMG Chair Alberta Innovates Technology Future (icore)
Energy efficient computing on Embedded and Mobile devices. Nikola Rajovic, Nikola Puzovic, Lluis Vilanova, Carlos Villavieja, Alex Ramirez
Energy efficient computing on Embedded and Mobile devices Nikola Rajovic, Nikola Puzovic, Lluis Vilanova, Carlos Villavieja, Alex Ramirez A brief look at the (outdated) Top500 list Most systems are built
Equalizer. Parallel OpenGL Application Framework. Stefan Eilemann, Eyescale Software GmbH
Equalizer Parallel OpenGL Application Framework Stefan Eilemann, Eyescale Software GmbH Outline Overview High-Performance Visualization Equalizer Competitive Environment Equalizer Features Scalability
Appro Supercomputer Solutions Best Practices Appro 2012 Deployment Successes. Anthony Kenisky, VP of North America Sales
Appro Supercomputer Solutions Best Practices Appro 2012 Deployment Successes Anthony Kenisky, VP of North America Sales About Appro Over 20 Years of Experience 1991 2000 OEM Server Manufacturer 2001-2007
OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC
OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC Driving industry innovation The goal of the OpenPOWER Foundation is to create an open ecosystem, using the POWER Architecture to share expertise,
LS-DYNA Scalability on Cray Supercomputers. Tin-Ting Zhu, Cray Inc. Jason Wang, Livermore Software Technology Corp.
LS-DYNA Scalability on Cray Supercomputers Tin-Ting Zhu, Cray Inc. Jason Wang, Livermore Software Technology Corp. WP-LS-DYNA-12213 www.cray.com Table of Contents Abstract... 3 Introduction... 3 Scalability
Introduction to Linux and Cluster Basics for the CCR General Computing Cluster
Introduction to Linux and Cluster Basics for the CCR General Computing Cluster Cynthia Cornelius Center for Computational Research University at Buffalo, SUNY 701 Ellicott St Buffalo, NY 14203 Phone: 716-881-8959
