Parallel Computing with MATLAB
GPU/CPU Parallel Computing Day of PEPI MACS
Olivier de Mouzon, INRA Gremaq, Toulouse School of Economics
Monday 28 November 2011, Paris
Presentation largely based on excerpts from a recent seminar: MathWorks Seminar @ GREMAQ TSE, 15th of November 2011
Mounir El Bedraoui, Sales Account Manager Academia; Stefan Duprey, Financial Application Engineer
© 2011 The MathWorks, Inc.
Outline
- Functions that are already (silently) parallelized
- Parallel Computing Toolbox (PCT)
  - PCT alone vs. MDCS (MATLAB Distributed Computing Server)
  - Local vs. cluster
- (Explicit) use of parallelized functions
- Use of the basic constructs: parfor & spmd
- Use of jobs and tasks
- A word on MPI
- A word on GPUs (NVIDIA CUDA)
Functions that are already (silently) parallelized
Multithreaded computations, introduced in R2007a, are now on by default. Many MATLAB functions are multithreaded, including:
- sort
- bsxfun
- mldivide for sparse rectangular matrix input
- qr for sparse matrix input
- filter for matrices and higher-dimensional arrays
- gamma, gammaln
- erf, erfc, erfcx, erfinv, erfcinv
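As a quick sketch (assuming a multicore machine), this implicit multithreading can be observed without writing any parallel code; maxNumCompThreads reports how many computational threads MATLAB is using:

```matlab
% No Parallel Computing Toolbox needed: sort is multithreaded since R2007a.
nThreads = maxNumCompThreads;   % number of computational threads in use
A = rand(2000);
tic
s = sort(A(:));                 % runs across all available threads
tMulti = toc;
```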
Parallel Computing Toolbox (PCT)

    matlabpool open local
    % ... parallel code ...
    matlabpool close

matlabpool open 9 requests a pool of 9 workers; by default, all cores are used (12 cores on the machine used here).
Parallel computing enables you to:
- Speed up computations: run independent tasks or iterations on a larger compute pool
- Work with large data: use a larger memory pool
[Figure: an array of data split across workers]
Parallel Computing with MATLAB
- Parallel Computing Toolbox: MATLAB workers on the user's desktop
- MATLAB Distributed Computing Server: MATLAB workers on a compute cluster
PCT: (explicit) use of parallelized functions — example

    %% Parallel bootstrap-aggregated trees
    % see also: crossval, jackknife, bootstrp
    nTrees = 50;
    matlabpool open local
    opt = statset('UseParallel', 'always');
    tic
    b = TreeBagger(nTrees, X, Y, 'Options', opt);
    toc
    matlabpool close
Tools with built-in parallel support — directly leverage functions in Parallel Computing Toolbox:
- Optimization Toolbox
- Global Optimization Toolbox
- Statistics Toolbox
- SystemTest
- Simulink Design Optimization
- Bioinformatics Toolbox
- Model-Based Calibration Toolbox
http://www.mathworks.com/products/parallel-computing/builtin-parallel-support.html
PCT: the basic constructs parfor and spmd
PCT: parfor

    N = 250;
    a = zeros(N, 1);
    matlabpool open local
    tic
    parfor i = 1:N   % compare with: for i = 1:N
        a(i) = max(eig(rand(300)));
    end
    toc
    matlabpool close
Case 1: speed up computations
Distributing similar problems to different processors (task parallelism).
The mechanics of parfor loops

    a = zeros(10, 1);
    parfor i = 1:10
        a(i) = i;
    end
    a

Each iteration a(i) = i is dispatched to one of the workers in the pool of MATLAB workers.
PCT: parfor
- Each iteration must be independent of the others
- Minimize data exchange with the workers:
  - input variables are sent to the workers
  - output variables are retrieved from them
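A minimal sketch of the independence requirement: the first loop is a valid parfor because each iteration writes only its own slice a(i), while the second carries a dependency between iterations and must remain an ordinary for loop.

```matlab
N = 100;
a = zeros(N, 1);
parfor i = 1:N
    a(i) = i^2;            % independent: each iteration touches only a(i)
end

b = zeros(N, 1);
b(1) = 1;
for i = 2:N                % loop-carried dependency: not valid as parfor
    b(i) = b(i-1) + i;     % b(i) needs b(i-1) from the previous iteration
end
```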
PCT: spmd

    matlabpool open 2
    n = 100;
    spmd  % simple spmd block
        a = rand(n, n);
        display(size(a))
        display(a(1:2, 1:2))
    end
    spmd  % creating codistributed arrays
        a = rand(n, n, codistributor());
        display(size(getLocalPart(a)))
    end
    spmd
        d = svd(a);
    end
    dGathered = gather(d);
    D = distributed.rand(1000);     % data is created and stored on the workers
    b = distributed.rand(1000, 1);  % created on the workers
    x = D \ b;
    matlabpool close
Case 2: work with large data
Distributing arrays to different processors (data parallelism).
[Figure: a large array split into pieces, each held by a different worker]
Examples of distributed and codistributed arrays — spmd blocks

    spmd
        % single program across workers
    end

- Runs on a pool of MATLAB resources
- Single Program: runs simultaneously across workers
- Multiple Data: spread across multiple workers
A mental model for spmd ... end

    x = 1;
    spmd
        y = x + 1;
    end
    y

x is sent to every worker in the pool; each worker evaluates y = x + 1, and y comes back to the client as a Composite.
Client-side distributed arrays and spmd
- Client-side distributed arrays (class distributed):
  - ability to create and manipulate them directly from the client
  - simpler access to memory on the labs
  - client-side visualization capabilities
- spmd:
  - block of code executed on the workers
  - worker-specific commands
  - explicit communication between workers
  - mixture of parallel and serial code
Enhanced MATLAB functions that operate on distributed arrays
PCT: jobs and tasks
- Use findResource to find a scheduler
- Use createJob and createTask to set up the problem
- Use submit to offload the work and run it in parallel
- Use getAllOutputArguments to retrieve all task outputs
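A minimal sketch of this workflow with the local scheduler (the two tasks, simple calls to rand, are just placeholders):

```matlab
sched = findResource('scheduler', 'type', 'local');  % find the local scheduler
job = createJob(sched);                              % set up the problem
createTask(job, @rand, 1, {3, 3});                   % task 1: one 3x3 output
createTask(job, @rand, 1, {2, 2});                   % task 2: one 2x2 output
submit(job);                                         % offload and run in parallel
waitForState(job, 'finished');                       % wait for completion
out = getAllOutputArguments(job);                    % one row of outputs per task
destroy(job);                                        % free the job's resources
```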
Scheduling
The client sends work to the scheduler, which dispatches it to the workers and returns the results.
Scheduling task-parallel applications
The client machine (MATLAB + Parallel Computing Toolbox) submits a job made of tasks to the scheduler. On the compute cluster, MATLAB Distributed Computing Server runs one worker per CPU; each worker executes a task and sends its result back through the scheduler to the client.
Scheduling data-parallel applications
Same architecture, but the workers cooperate as labs: the client submits a job whose tasks are spread across the labs (one per CPU), which jointly hold the distributed data and return their results through the scheduler.
PCT: MPI-based functions
- Use when a high degree of control over the parallel algorithm is required
- High-level abstractions of MPI functions:
  - labSendReceive, labBroadcast, and others
  - send, receive, and broadcast any MATLAB data type
- Automatic bookkeeping:
  - setup: communication, ranks, etc.
  - error detection: deadlocks and miscommunications
- Pluggable: use any MPI implementation that is binary-compatible with MPICH2
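A minimal spmd sketch of these primitives, assuming an open matlabpool of a few workers: lab 1 broadcasts a value to every lab, and each lab exchanges its index with its neighbours in a ring using a single deadlock-free labSendReceive call.

```matlab
spmd
    % lab 1 broadcasts; all other labs receive
    if labindex == 1
        v = labBroadcast(1, 42);
    else
        v = labBroadcast(1);
    end
    % ring exchange: send to the next lab, receive from the previous one
    next = mod(labindex, numlabs) + 1;
    prev = mod(labindex - 2, numlabs) + 1;
    got  = labSendReceive(next, prev, labindex);
end
```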
Using an InfiniBand network
Parallel Computing Toolbox does not have built-in support for InfiniBand; however, it provides all the necessary hooks to take advantage of one. Users need to supply their own custom build of MPI that supports InfiniBand. See "Using a Different MPI Build on UNIX Operating Systems" in the documentation for details.
Summary
Programming parallel applications

    Level of control   Parallel options
    Minimal            Support built into toolboxes
    Some               High-level programming constructs
                       (e.g. parfor, batch, distributed, jobs/tasks)
    Extensive          Low-level programming constructs (e.g. MPI-based)
Parallel Computing with MATLAB
- Built-in parallel functionality within specific toolboxes (also requires Parallel Computing Toolbox): Optimization Toolbox, Global Optimization Toolbox, Statistics Toolbox, SystemTest, Simulink Design Optimization, Bioinformatics Toolbox, Model-Based Calibration Toolbox
- High-level parallel language (MATLAB and parallel computing tools): parfor, matlabpool, batch
- Low-level parallel functions: createJob, createTask
- Built on industry-standard libraries: Message Passing Interface (MPI), ScaLAPACK
GPU support (R2010b)
What is a GPU?
- Originally for graphics acceleration, now also used for scientific calculations
- Massively parallel array of integer and floating-point processors, typically hundreds of processors per card
- GPU cores complement CPU cores
- Dedicated high-speed memory
GPU vs CPU
Performance gain with more hardware
- Using more cores (CPUs): a few cores sharing a cache
- Using GPUs: many lightweight cores with dedicated device memory
Technical language of GPUs
NVIDIA solutions

    GPU        Target
    GeForce    Mass market
    Quadro     Graphics
    Tesla      Calculations: ECC memory, faster PCIe communication
Supported cards and operating systems
To use the GPU functionality, the user needs:
- MATLAB + PCT R2010b
- 32-bit or 64-bit Microsoft Windows or Linux operating system
- NVIDIA CUDA-enabled device with compute capability 1.3 or greater
- NVIDIA CUDA device driver 3.0 or greater
- NVIDIA CUDA Toolkit 3.0 (recommended) for compiling PTX files
Using the GPU with PCT R2010b
Three main ways to access the GPU, from greatest ease of use to greatest functionality:
1. Use the GPU array interface and MATLAB built-in functions
2. Execute custom functions on the elements of a GPU array
3. Create kernels from your CUDA code and PTX files
1. Using MATLAB built-in functions
Feels like using distributed arrays:

    A = gpuArray(rand(1000));
    B = gpuArray(rand(1000));
    C = transpose(A);
    D = C * log(B);
    E = gather(D);
Performance: A\b with double precision
Supported functions
2. Using a MATLAB function file
Create a MATLAB function (the kernel):

    function c = myop(a, b)
    a1 = log(a);
    b1 = log(b);
    c = round(a1 .* b1);

Then, from MATLAB, apply it elementwise with arrayfun():

    a = gpuArray(1/2 * rand(1000));
    b = gpuArray(3 * rand(1000));
    c = arrayfun(@myop, a, b);
    d = gather(c);
Performance
Main limitations
- The code can call only supported functions and cannot call scripts
- Indexing is not supported
- Persistent or global variables are not supported
- if, for, while, parfor, spmd, switch, try/catch, and return are not supported
- single, double, int32, uint32, and logical are the only supported data type conversions
- Functional forms of arithmetic operators are not supported, but operator symbols are (i.e. + is supported, plus() is not)
3. Invoking CUDA code
- Develop the CUDA code (kernel) for your computation
- Compile the CUDA code with the NVIDIA compiler:

    nvcc -ptx myfun.cu

- Create a MATLAB function myfun.m containing:

    kernel = parallel.gpu.CUDAKernel('myfun.ptx', 'myfun.cu');  % create a kernel object
    res = feval(kernel, input_arguments);                       % evaluate the kernel on the GPU

- Execute the MATLAB function:

    res = myfun(input_arguments);
Example of CUDA code
Performance
Performance acceleration options in the Parallel Computing Toolbox

    Technology              Example      MATLAB workers   Execution target
    matlabpool              parfor       Required         CPU cores
    User-defined tasks      createTask   Required         CPU cores
    GPU-based parallelism   GPUArray     Not required     NVIDIA GPU with compute
                                                          capability 1.3 or greater
Questions? Thank you.