ST810 Advanced Computing

1 ST810 Advanced Computing. Lecture 17: Parallel computing part I. Eric B. Laber, Hua Zhou. Department of Statistics, North Carolina State University. Mar 13, 2013

2 Outline: GPU computing. Hardware; GPU computing overview; Matlab; R

3 Hardware: Typical GPUs on current laptops. E.g., my MacBook Pro has an Intel® HD Graphics 4000 (built in with the i7-3720QM CPU) and an NVIDIA® GeForce GT 650M. The NVIDIA GT 650M has 1G memory, 524K L2 cache, [...] GHz. Theoretical throughput: [...] SP GFLOPS.

4 Hardware: Typical GPUs on current desktops. My desktop (Dell Alienware) has an NVIDIA® GeForce GTX 580. The GTX 580 has 1.5G memory, 786K L2 cache, 512 cores @ 1.[...] GHz. Theoretical throughput: 1581 SP GFLOPS. Release price: $500 (Nov 2010).
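The theoretical throughput quoted on the hardware slides is consistent with a simple peak-rate formula: number of cores times clock rate times flops per core per cycle. The sketch below checks this for the GTX 580. The core count is from the slide; the 1.544 GHz shader clock and the factor of 2 (one fused multiply-add per cycle) are assumptions taken from NVIDIA's published specification, not from the lecture.

```matlab
% Peak single-precision rate = cores x clock (GHz) x flops per core per cycle.
cores = 512;           % from the slide
clockGHz = 1.544;      % ASSUMPTION: shader clock from NVIDIA's published GTX 580 spec
flopsPerCycle = 2;     % ASSUMPTION: one fused multiply-add (2 flops) per core per cycle

peakGflops = cores * clockGHz * flopsPerCycle;
fprintf('Peak SP throughput: %.0f GFLOPS\n', peakGflops);
```

Under these assumptions the result rounds to the 1581 SP GFLOPS quoted for this card. It is an upper bound that real code rarely approaches.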

5 Hardware: Typical GPUs on current servers. The teaching server has 4 x NVIDIA® Tesla M2070Q. Each Tesla M2070Q has 6G memory (5.25G with ECC), 786K L2 cache, 448 cores @ 1.[...] GHz. Theoretical throughput: 4 x 1288 SP GFLOPS or 4 x 512 DP GFLOPS.

6 Hardware: Graphics Processing Units (GPUs). Ubiquitous in today's hardware (PCs, laptops, servers). Cost effective for high performance computing. Rapid growth in recent years. Our department has at least two GPU servers. Many nodes in NCSU HPC henry2 are equipped with GPUs too.

7 Hardware: GPU vs CPU architecture. GPUs contain 100s of processing cores on a single chip; several chips can fit in a desktop PC. Each core carries out the same operations in parallel on different input data: the single program, multiple data (SPMD) paradigm. Extremely high arithmetic intensity *if* one can transfer the data onto and the results off of the processors quickly.

8 Hardware: GPU vs CPU. An analogy taken from Andrew Beam's presentation in ST790.

9 GPU computing overview: GPGPU - General purpose GPU computing. My experience: it almost always involves (new) algorithm development and/or revamping CPU code. Research before going for GPGPU (next slide). Easier to develop in C/C++ (free compiler), Fortran (compiler $), and Matlab. Do not reinvent the wheel: use libraries.

10 GPU computing overview: Before using GPUs
0. Frustrated by slow code.
1. Am I using the right algorithm(s)? Go to your ST758 notes or a numerical analysis book. E.g., for massive data (terabytes), an O(n^2) algorithm vs an O(n log n) algorithm means a difference of [...] years vs 27 seconds on a TFLOPS supercomputer.
2. Repeat: profile and optimize the original code.
3. Can a compiled language or optimized library (MKL, ATLAS) help?
4. Identify the bottleneck routine and research the potential gain on GPU.
5. Can my data fit into GPU memory?
6. Can other routines be easily implemented on GPU? Is that necessary?
7. Decide the toolchain: Matlab, CUDA, PGI toolchain, ...
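The algorithm comparison in this checklist can be reproduced with a few lines of arithmetic. The sketch below assumes n = 10^12 data points, one flop per elementary operation, a machine sustaining exactly 10^12 flops per second, and a natural logarithm in the n log n term. None of these values is stated on the slide; they are chosen because they give a figure close to the 27 seconds quoted.

```matlab
% Rough run times for an O(n^2) vs an O(n log n) algorithm at 1 TFLOPS.
n = 1e12;                      % ASSUMPTION: 10^12 data points
rate = 1e12;                   % 1 TFLOPS = 10^12 flops per second
secPerYear = 365 * 24 * 3600;  % seconds in a (non-leap) year

tQuad = n^2 / rate;            % seconds for about n^2 flops
tLog = n * log(n) / rate;      % seconds for about n*log(n) flops (natural log)

fprintf('O(n^2):     %.0f years\n', tQuad / secPerYear);
fprintf('O(n log n): %.1f seconds\n', tLog);
```

Under these assumptions the quadratic algorithm needs tens of thousands of years, while the n log n algorithm finishes in under half a minute. No hardware upgrade closes a gap of that size, which is why the algorithm question comes first.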

11 GPU computing overview: GPGPU development. A few approaches to developing GPGPU code:
CUDA® toolchain provided by NVIDIA®: free; C/C++; only for NVIDIA cards.
PGI® toolchain (CUDA Fortran): $$$; C/C++, Fortran; only for NVIDIA cards.
OpenCL™ (Open Computing Language): open source; specs for cross-platform, parallel programming of modern processors (PCs, servers, handheld/embedded devices); adopted by Intel, AMD, ...
Use a higher level language such as Matlab.

12 GPU computing overview: Which card to use? AMD vs NVIDIA. NVIDIA cards are more widely adopted for GPGPU. E.g., GPU servers in our department and the NCSU henry2 cluster all have NVIDIA GPUs. NVIDIA has a much richer set of math libraries. The cross-platform feature of OpenCL is attractive.

                 AMD                  NVIDIA
Cards            ATI Radeon           GTX, Tesla
Language         OpenCL               CUDA C/C++, PGI CUDA Fortran
Math libraries   APPML (BLAS, FFT)    cuBLAS, cuFFT, cuSPARSE, cuRAND, CUDA Math, Thrust, ...
Platforms        Linux, Windows       Linux, Windows, MacOS

13 Matlab: GPU computing in Matlab. Getting started: gpuDevice() queries the GPU device; methods('gpuArray') lists the built-in functions that support GPU.
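The two getting-started calls can be combined into a short session. This is a minimal sketch that assumes the Parallel Computing Toolbox is installed; the choice of which device properties to print is illustrative and not from the lecture.

```matlab
% Query the current GPU (if any) and list the functions overloaded for gpuArray.
nGpu = gpuDeviceCount();                 % number of GPUs visible to Matlab
fprintf('GPUs found: %d\n', nGpu);

if nGpu > 0
    dev = gpuDevice();                   % select and query the default device
    fprintf('Name:               %s\n', dev.Name);
    fprintf('Compute capability: %s\n', dev.ComputeCapability);
    fprintf('Total memory:       %.2f GB\n', dev.TotalMemory / 2^30);
    fprintf('Multiprocessors:    %d\n', dev.MultiprocessorCount);
end

methods('gpuArray')                      % built-in functions that accept gpuArray input
```

Checking the total memory here is a quick way to answer the earlier question of whether the data fit on the card.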

14 Matlab: 290 built-in functions in Matlab 2012b support GPU.

15 Matlab: Scheme for GPU algorithm development in Matlab

    % transfer data to GPU and initialize GPU variables
    gx = gpuArray(X);
    gy = gpuArray(Y);
    gbetahat = gpuArray.randn(5, 1);
    ...
    % computation on GPU
    ...
    % transfer result off GPU
    betahat = gather(gbetahat);

Key: minimize memory transfer between host memory and GPU memory.
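The scheme above can be filled in with a concrete computation. The sketch below fits a least squares regression on the GPU by solving the normal equations. The synthetic data, the sizes n and p, and the use of the normal equations are assumptions made for illustration; the lecture does not say which computation goes in the elided middle step, and a QR-based solve would be numerically safer on ill-conditioned designs.

```matlab
% Synthetic regression data on the host (hypothetical sizes and coefficients).
n = 10000; p = 5;
X = randn(n, p);
betaTrue = (1:p)';
Y = X * betaTrue + 0.1 * randn(n, 1);

% Transfer data to the GPU once.
gx = gpuArray(X);
gy = gpuArray(Y);

% Computation stays on the GPU: solve the p-by-p normal equations.
gbetahat = (gx' * gx) \ (gx' * gy);

% Transfer only the small result back to the host.
betahat = gather(gbetahat);
disp(betahat');
```

Only the p-by-1 estimate crosses back to host memory, while the large n-by-p matrix is transferred a single time. That is the pattern the slide's key point asks for.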

16 Matlab: Benchmarking. Always benchmark the bottleneck routine before embarking on GPU. E.g., to benchmark A\b (solve linear equations) on my desktop: paralleldemo_gpu_backslash() in Matlab 2012b.
[Figure: two panels, "Single precision performance" and "Double precision performance", each plotting Gigaflops against matrix size for GPU and CPU. Intel i7 960 CPU vs NVIDIA GTX 580.]
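A hand-rolled version of this benchmark takes only a few lines. The sketch below is not the code of paralleldemo_gpu_backslash(); the matrix size, the test matrix, and the (2/3)n^3 flop estimate are assumptions chosen for illustration. It uses tic/toc with wait(gpuDevice) so that the timer stops only after the GPU has finished.

```matlab
% Time A\b on the CPU and on the GPU for one (hypothetical) matrix size.
n = 2000;
A = rand(n) + n * eye(n);       % diagonally dominant, hence well conditioned
b = rand(n, 1);

% CPU timing.
tic; x = A \ b; tCpu = toc;

% GPU timing: move the data first so that only the solve is timed.
dev = gpuDevice();
gA = gpuArray(A); gb = gpuArray(b);
gxWarm = gA \ gb; wait(dev);    % warm-up run, excluded from the timing
tic; gx = gA \ gb; wait(dev); tGpu = toc;

% Leading-order flop count of an LU-based solve: (2/3) n^3.
flops = (2/3) * n^3;
fprintf('CPU: %.1f GFLOPS, GPU: %.1f GFLOPS\n', flops / tCpu / 1e9, flops / tGpu / 1e9);
fprintf('Max abs difference: %g\n', max(abs(x - gather(gx))));
```

A single size can mislead, so in practice one would loop over several values of n and both precisions, as the figure on this slide does. Newer Matlab releases also provide gputimeit, which handles the synchronization and repetition automatically.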

17 R: GPU computing in R. GPU is not supported in base R (opportunity? HiPLARM package). A few contributed packages exist in specific application areas: gputools (some data-mining algorithms), cudaBayesreg (fMRI analysis), ... Develop in C/C++ or Fortran and call the compiled code from R.

