ST810 Advanced Computing
Lecture 17: Parallel Computing, Part I
Eric B. Laber, Hua Zhou
Department of Statistics, North Carolina State University
Mar 13, 2013
Outline
- GPU hardware
- GPU computing overview
- Matlab
- R
Hardware: typical GPUs on current laptops
- E.g., my MacBook Pro has an Intel HD Graphics 4000 (built into the i7-3720QM CPU) and an NVIDIA GeForce GT 650M
- The NVIDIA GT 650M has 1 GB memory, 524K L2 cache, and 384 cores @ 0.9 GHz
- Theoretical throughput: 641.3 SP GFLOPS
Hardware: typical GPUs on current desktops
- My desktop (Dell Alienware) has an NVIDIA GeForce GTX 580
- The GTX 580 has 1.5 GB memory, 786K L2 cache, and 512 cores @ 1.59 GHz
- Theoretical throughput: 1581 SP GFLOPS
- Release price: $500 (Nov 2010)
Hardware: typical GPUs on current servers
- The teaching server has 4 x NVIDIA Tesla M2070Q
- Each Tesla M2070Q has 6 GB memory (5.25 GB with ECC), 786K L2 cache, and 448 cores @ 1.15 GHz
- Theoretical throughput: 4 x 1288 SP GFLOPS or 4 x 512 DP GFLOPS
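The theoretical single-precision throughput figures on these slides follow from a simple formula: peak SP GFLOPS = 2 x cores x clock (GHz), assuming each CUDA core retires one fused multiply-add (2 floating-point operations) per cycle. A minimal sketch in Python; the 1.544 GHz figure below is my assumption for the GTX 580's shader clock (the clock at which the CUDA cores run), which reproduces the slide's 1581 GFLOPS:

```python
def peak_sp_gflops(cores, clock_ghz, flops_per_cycle=2):
    """Theoretical peak single-precision throughput.

    Assumes each core retires one fused multiply-add
    (2 floating-point operations) per clock cycle.
    """
    return flops_per_cycle * cores * clock_ghz

# GTX 580: 512 cores; 1.544 GHz shader clock (assumed here).
print(round(peak_sp_gflops(512, 1.544), 1))  # -> 1581.1
```

Real codes rarely approach this peak; it is an upper bound set by the arithmetic units alone, ignoring memory bandwidth.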
Hardware: Graphics Processing Units (GPUs)
- Ubiquitous in today's hardware (PCs, laptops, servers)
- Cost effective for high-performance computing
- Rapid growth in recent years
- Our department has at least two GPU servers; many nodes in the NCSU HPC cluster henry2 are equipped with GPUs too
Hardware: GPU vs CPU architecture
- GPUs contain 100s of processing cores on a single chip; several chips can fit in a desktop PC
- Each core carries out the same operations in parallel on different input data: the single program, multiple data (SPMD) paradigm
- Extremely high arithmetic intensity *if* one can transfer the data onto, and the results off of, the GPU quickly
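The SPMD idea can be mimicked in plain Python as a conceptual sketch (a real GPU runs the same kernel on hundreds of cores simultaneously, whereas this loop is serial; the kernel name `saxpy` and the data are my own illustration):

```python
def saxpy(a, x_i, y_i):
    """The 'single program' that every core executes on its own element."""
    return a * x_i + y_i

# 'Multiple data': each (x_i, y_i) pair is handled by the same kernel,
# conceptually one pair per GPU core, all in parallel.
x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]
result = [saxpy(2.0, xi, yi) for xi, yi in zip(x, y)]
print(result)  # -> [12.0, 24.0, 36.0, 48.0]
```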
Hardware: GPU vs CPU
[Figure: an analogy taken from Andrew Beam's presentation in ST790]
GPU computing overview: GPGPU (general-purpose computing on GPUs)
My experience:
- Almost always involves (new) algorithm development and/or revamping CPU code
- Do your research before going for GPGPU (next slide)
- Easier to develop in C/C++ (free compiler), Fortran (compiler costs $), and Matlab
- Do not reinvent the wheel: use libraries
GPU computing overview: before using GPUs
0. Frustrated by slow code...
1. Am I using the right algorithm(s)? Go to your ST758 notes or a numerical analysis book. E.g., for massive data (terabytes), an O(n^2) algorithm vs an O(n log n) algorithm means 31710 years vs 27 seconds on a TFLOPS supercomputer
2. Repeat: profile and optimize the original code
3. Can a compiled language or an optimized library (MKL, ATLAS) help?
4. Identify the bottleneck routine and research the potential gain on GPU
5. Can my data fit into GPU memory?
6. Can the other routines be easily implemented on GPU? Is that necessary?
7. Decide the toolchain: Matlab, CUDA, PGI toolchain, ...
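The 31710-years-vs-27-seconds claim in step 1 checks out arithmetically for n = 10^12 data points on a 10^12 FLOPS (1 TFLOPS) machine, taking log as the natural log and a 365-day year (assumptions on my part that reproduce the slide's numbers):

```python
import math

n = 1e12      # number of data points (terabyte-scale problem)
flops = 1e12  # a 1 TFLOPS supercomputer

quadratic_seconds = n**2 / flops          # O(n^2) work
nlogn_seconds = n * math.log(n) / flops   # O(n log n) work, natural log

years = quadratic_seconds / (365 * 24 * 3600)
print(round(years))             # -> 31710
print(round(nlogn_seconds, 1))  # -> 27.6
```

The slide's "27 seconds" truncates the 27.6 above; either way, the lesson stands: no hardware rescues the wrong algorithm.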
GPU computing overview: GPGPU development
A few approaches to developing GPGPU code:
- CUDA toolchain provided by NVIDIA: free; C/C++ only; NVIDIA cards only
- PGI toolchain (CUDA Fortran): $$$; C/C++ and Fortran; NVIDIA cards only
- OpenCL (Open Computing Language): open standard; specs for cross-platform, parallel programming of modern processors (PCs, servers, handheld/embedded devices); adopted by Intel, AMD, ...
- Use a higher-level language such as Matlab
GPU computing overview: which card to use? AMD vs NVIDIA
- NVIDIA cards are more widely adopted for GPGPU; e.g., the servers in our department and the NCSU henry2 cluster all have NVIDIA cards
- NVIDIA has a much richer set of math libraries
- The cross-platform feature of OpenCL is attractive

                   AMD                  NVIDIA
  Cards            ATI Radeon           GTX, Tesla
  Language         OpenCL               CUDA C/C++, PGI CUDA Fortran
  Math libraries   APPML (BLAS, FFT)    cuBLAS, cuFFT, cuSPARSE, cuRAND, CUDA Math, Thrust, ...
  Platforms        Linux, Windows       Linux, Windows, Mac OS
Matlab: GPU computing in Matlab
Getting started:
- gpuDevice(): query the GPU device
- methods('gpuArray'): list the built-in functions that support gpuArray inputs
Matlab 290 built-in functions in Matlab 2012b support
Matlab: scheme for GPU algorithm development in Matlab

    % transfer data to GPU and initialize variables
    gx = gpuArray(X);
    gy = gpuArray(Y);
    gbetahat = gpuArray.randn(5, 1);
    ...
    % computation on GPU
    ...
    % transfer result off GPU
    betahat = gather(gbetahat);

Key: minimize memory transfer between host memory and GPU memory
Matlab: benchmarking
Always benchmark the bottleneck routine before embarking on GPU development. E.g., to benchmark A\b (solving linear equations) on my desktop, run paralleldemo_gpu_backslash() in Matlab 2012b.
[Figure: single- and double-precision performance (gigaflops vs matrix size) for CPU and GPU backslash; Intel i7 960 CPU vs NVIDIA GTX 580]
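The gigaflops numbers in such benchmark plots are typically the routine's operation count divided by wall time; for an n x n dense linear solve via LU factorization, the standard count is (2/3)n^3 flops. A small Python sketch of that bookkeeping (the helper name and the example timing are my own illustration, not part of the Matlab demo):

```python
def backslash_gflops(n, seconds):
    """Achieved gigaflops for solving an n-by-n dense linear system,
    using the standard (2/3) n^3 flop count for LU factorization."""
    return (2.0 / 3.0) * n**3 / seconds / 1e9

# Hypothetical example: a 6000 x 6000 solve finishing in 0.9 s.
print(round(backslash_gflops(6000, 0.9)))  # -> 160
```

Plotting this quantity against n, for CPU and GPU timings of the same solve, yields exactly the kind of comparison curve the demo produces.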
R: GPU computing in R
- Not supported in base R (opportunity? cf. the HiPLARM package)
- A few contributed packages in specific application areas: gputools (some data-mining algorithms), cudaBayesreg (fMRI analysis), ...
- Or develop in C/C++ or Fortran and call the compiled code from R