Parallel Computing with MATLAB

Transcript

1 Parallel Computing with MATLAB
GPU/CPU Parallel Computing Day of the PEPI MACS
Olivier de Mouzon, INRA, GREMAQ, Toulouse School of Economics
Monday 28 November 2011, Paris
Presentation largely based on excerpts from the recent seminar: MathWorks, GREMAQ TSE, 15 November 2011; Mounir El Bedraoui (Sales Account Manager, Academia) and Stefan Duprey (Financial Application Engineer). 2011 The MathWorks, Inc.

2 Outline
- Functions already (silently) parallelized
- Parallel Computing Toolbox (PCT)
- PCT alone vs. MDCS (MATLAB Distributed Computing Server): local vs. cluster
- (Explicit) use of parallelized functions
- Use of the basic constructs: parfor & spmd
- Use of jobs and tasks
- A note on MPI
- A note on GPU (NVIDIA CUDA)

3 Functions already (silently) parallelized
Multithreaded computations, introduced in R2007a, are now on by default. Many MATLAB functions are now multithreaded:
- sort
- bsxfun
- mldivide for sparse rectangular matrix input
- qr for sparse matrix input
- filter for matrices and higher-dimensional arrays
- gamma, gammaln
- erf, erfc, erfcx, erfinv, erfcinv
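One way to observe this implicit multithreading is to time the same call with one computational thread and then with the default setting. The sketch below is not from the slides: it assumes the built-in maxNumCompThreads function is available in your release, and the vector size is arbitrary. The speed-up, if any, depends on the machine.

```matlab
% Sketch: compare sort() with one thread and with the default thread count.
x = rand(5e6, 1);                        % arbitrary test vector (assumption)

nDefault = maxNumCompThreads(1);         % limit to 1 thread; returns the previous setting
tic; y1 = sort(x); tSingle = toc;

maxNumCompThreads(nDefault);             % restore the previous setting
tic; yN = sort(x); tMulti = toc;

fprintf('1 thread: %.3f s, %d threads: %.3f s\n', tSingle, nDefault, tMulti);
```

Both runs return the same sorted vector; only the elapsed time should differ.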

4 Parallel Computing Toolbox (PCT)
    matlabpool open local
    % ... code ...
    matlabpool close
    matlabpool open 9     % request 9 workers
Default: 12 cores

5 Parallel Computing enables you to:
- Larger compute pool: speed up computations (running independent tasks or iterations)
- Larger memory pool: work with large data

6 Parallel Computing with MATLAB
[Diagram: Parallel Computing Toolbox on the user's desktop; MATLAB Distributed Computing Server with MATLAB workers on the compute cluster.]

7 PCT: (explicit) use of parallelized functions - example
    %% Parallel bootstrap-aggregated (bagged) trees
    % Other functions accepting the same option: crossval, jackknife, bootstrp
    nTrees = 50;
    matlabpool open local;
    opt = statset('UseParallel', 'always');
    tic;
    b = TreeBagger(nTrees, X, Y, 'Options', opt);   % X, Y: predictors and responses, defined beforehand
    toc;
    matlabpool close;

8 Tools with Built-in Parallel Support
- Optimization Toolbox
- Global Optimization Toolbox
- Statistics Toolbox
- SystemTest
- Simulink Design Optimization
- Bioinformatics Toolbox
- Model-Based Calibration Toolbox
These toolboxes and blocksets directly leverage functions in Parallel Computing Toolbox to run on a pool of workers.

9 PCT: the basic constructs parfor and spmd

10 PCT: parfor
    N = 250;
    a = zeros(N, 1);
    matlabpool open local;
    tic;
    parfor i = 1:N      % serial version: for i = 1:N
        a(i) = max(eig(rand(300)));
    end
    toc;
    matlabpool close;

11 Case 1: Speed up Computations
Distributing similar problems to different processors (or task-parallelism)

12 The Mechanics of parfor Loops
    a = zeros(10, 1)
    parfor i = 1:10
        a(i) = i;
    end
    a
[Diagram: the iterations are distributed over a pool of MATLAB workers; each worker executes a(i) = i; for its share of the indices.]

13 PCT: parfor
- Each iteration must be independent of the others
- Minimize data exchange with the different cores:
  - passing the input variables
  - retrieving the output variables
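The independence rule above can be illustrated with a loop in which each iteration writes only its own element of the output and contributes to a sum. This is a minimal sketch, not taken from the slides: the variable names sq and total are illustrative. It should run with or without an open pool.

```matlab
% Sketch: a parfor loop whose iterations do not depend on each other.
N = 100;
sq = zeros(N, 1);
total = 0;
parfor i = 1:N
    sq(i) = i^2;           % sliced output: iteration i writes only element i
    total = total + i;     % reduction variable: the order of accumulation does not matter
end

% By contrast, a recurrence such as
%     for i = 2:N, c(i) = c(i-1) + 1; end
% cannot be converted to parfor, because iteration i needs the result of iteration i-1.
```

Only sq and total travel back to the client, and only N travels to the workers, which keeps data exchange small.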

14 PCT: spmd
    matlabpool open 2;
    n = 100;
    spmd   % simple spmd block
        a = rand(n, n);
        display(size(a))
        display(a(1:2, 1:2))
    end
    spmd   % creating distributed arrays
        a = rand(n, n, codistributor());
        display(size(getLocalPart(a)));
    end
    spmd
        d = svd(a)
    end
    dgathered = gather(d);
    D = distributed.rand(1000);      % Data is created and stored on the workers.
    b = distributed.rand(1000, 1);   % Created on the workers
    x = D \ b;
    matlabpool close

15 Case 2: Work with Large Data
Distributing arrays to different processors (or data-parallelism)
CONFIDENTIAL

16 Examples of distributed and codistributed arrays: spmd blocks
    spmd
        % single program across workers
    end
- Run on a pool of MATLAB resources
- Single Program: runs simultaneously across workers
- Multiple Data: spread across multiple workers

17 A mental model for spmd ... end
    x = 1
    spmd
        y = x + 1
    end
    y
[Diagram: each worker in the pool of MATLAB workers receives x = 1 and computes its own y = x + 1.]

18 Client-Side Distributed Arrays and spmd
Client-side distributed arrays:
- Class distributed
- Ability to create and manipulate directly from the client
- Simpler access to memory on labs
- Client-side visualization capabilities
spmd:
- Block of code executed on workers
- Worker-specific commands
- Explicit communication between workers
- Mixture of parallel and serial code

19 Enhanced MATLAB Functions That Operate on Distributed Arrays

20 PCT: jobs and tasks
- Use findResource to find the scheduler
- Use createJob and createTask to set up the problem
- Use submit to offload the job and run it in parallel
- Use getAllOutputArguments to retrieve all task outputs
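The four steps above can be sketched as follows with the local scheduler. This is an illustration under assumptions, not code from the slides: the task (summing a range of integers) is arbitrary, and waitForState and destroy are additional calls from the same slide-era API that the slide does not mention. More recent releases rename several of these functions (for example parcluster and fetchOutputs), so the sketch applies to releases contemporary with the talk.

```matlab
% Sketch: four independent tasks submitted as one job to the local scheduler.
sched = findResource('scheduler', 'type', 'local');   % 1. find the scheduler

job = createJob(sched);                               % 2. set up the problem
for k = 1:4
    % Each task computes sum(1:10*k) and returns one output argument.
    createTask(job, @sum, 1, {1:10*k});
end

submit(job);                                          % 3. offload and run
waitForState(job, 'finished');                        %    block until all tasks are done

results = getAllOutputArguments(job);                 % 4. one row per task (cell array)
destroy(job);                                         % free the scheduler's job data

disp(results);
```

Each row of results holds the outputs of one task, in the order the tasks were created.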

21 Scheduling Work
[Diagram: the client (toolboxes and blocksets) sends work to a scheduler, which dispatches it to the workers and returns the result.]

22 Scheduling Task-parallel Applications
[Diagram: the client machine (toolboxes and blocksets with Parallel Computing Toolbox) submits a job to the scheduler of the compute cluster (MATLAB Distributed Computing Server); the scheduler sends one task to each worker (CPU), collects the task results and returns the job result to the client.]

25 Scheduling Data-parallel Applications
[Diagram: the client machine (toolboxes and blocksets with Parallel Computing Toolbox) submits a job to the scheduler of the compute cluster (MATLAB Distributed Computing Server); the scheduler sends the task to each lab (CPU), collects the results and returns the job result to the client.]

28 PCT: MPI-Based Functions
- Use when a high degree of control over the parallel algorithm is required
- High-level abstractions of MPI functions: labSendReceive, labBroadcast, and others
- Send, receive, and broadcast any data type in MATLAB
- Automatic bookkeeping:
  - Setup: communication, ranks, etc.
  - Error detection: deadlocks and miscommunications
- Pluggable: use any MPI implementation that is binary-compatible with MPICH2
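As an illustration of these message-passing functions, the sketch below broadcasts a matrix from lab 1 and then passes each lab's index to its neighbour in a ring. It is a minimal example under assumptions, not code from the slides: the pool size of 2 and the variable names are arbitrary, and it uses the slide-era names (matlabpool, labindex, numlabs), which later releases have renamed.

```matlab
% Sketch: explicit communication between labs inside an spmd block.
matlabpool open 2;
spmd
    % Broadcast: lab 1 sends a matrix, every other lab receives it.
    if labindex == 1
        M = labBroadcast(1, magic(3));
    else
        M = labBroadcast(1);
    end

    % Ring shift: send my index to the next lab, receive from the previous one.
    to       = mod(labindex, numlabs) + 1;
    from     = mod(labindex - 2, numlabs) + 1;
    received = labSendReceive(to, from, labindex);
end

% Copy the values to the client before the pool (and its Composites) goes away.
r1 = received{1};     % value that lab 1 received from the last lab
m2 = M{2};            % copy of the broadcast matrix held by lab 2
matlabpool close;
```

Using labSendReceive for the exchange, rather than a separate send followed by a receive, avoids the deadlock that can occur when every lab waits to send at the same time.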

29 Using an InfiniBand network
Parallel Computing Toolbox does not have built-in support for InfiniBand. However, the toolboxes provide all the necessary hooks to take advantage of it. Users need to provide their own custom build of MPI that supports InfiniBand. See "Using a Different MPI Build on UNIX Operating Systems" for more details.

30 Summary

31 Programming Parallel Applications
Level of control | Parallel options
Minimal | Support built into toolboxes
Some | High-level programming constructs (e.g. parfor, batch, distributed, Jobs/Tasks)
Extensive | Low-level programming constructs (e.g. MPI-based)

32 Parallel Computing with MATLAB
- Built-in parallel functionality within specific toolboxes (also requires Parallel Computing Toolbox): Optimization Toolbox, Global Optimization Toolbox, Statistics Toolbox, SystemTest, Simulink Design Optimization, Bioinformatics Toolbox, Model-Based Calibration Toolbox
- High-level parallel language (MATLAB and Parallel Computing Tools): parfor, matlabpool, batch
- Low-level parallel functions: createJob, createTask
- Built on industry-standard libraries: Message Passing Interface (MPI), ScaLAPACK

33 GPU support - R2010b

34 What is a GPU
- Originally for graphics acceleration, now also used for scientific calculations
- Massively parallel array of integer and floating-point processors
- Typically hundreds of processors per card
- GPU cores complement CPU cores
- Dedicated high-speed memory

35 GPU vs CPU

36 Performance Gain with More Hardware
[Diagram with two panels: "Using More Cores (CPUs)", showing cores 1 to 4 and a cache; "Using GPUs", showing the GPU with its device memory.]

37 Technical language of GPU

38 NVIDIA solutions
GPU | Use
GeForce | Mass market
Quadro | Graphics
Tesla | Calculations: ECC memory, faster PCIe communication

39 Supported Cards and Operating Systems
To use GPU functionalities the user should have:
- MATLAB + PCT R2010b, 32-bit or 64-bit, on a Microsoft Windows or Linux operating system
- NVIDIA CUDA-enabled device with compute capability of 1.3 or greater
- NVIDIA CUDA device driver 3.0 or greater
- NVIDIA CUDA Toolkit 3.0 (recommended) for compiling PTX files

40 Using GPU with PCT R2010b
3 main ways to access the GPU, ordered from greatest ease of use to greatest functionality:
1. Use the GPU array interface and MATLAB built-in functions
2. Execute custom functions on elements of the GPU array
3. Create kernels from your CUDA code and PTX files

41 1. Using MATLAB Built-In Functions
Feels like using distributed arrays:
    A = gpuArray(rand(1000));
    B = gpuArray(rand(1000));
    C = transpose(A);
    D = C * log(B);
    E = gather(D);

42 Performance: A\b with Double Precision
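The slide's chart did not survive in this transcript. A comparison of this kind can be set up roughly as below. This is only a sketch under assumptions: the matrix size is arbitrary, a supported NVIDIA GPU is required, and no timings are implied. The GPU timing here includes the final gather, which also forces the computation to complete.

```matlab
% Sketch: solve the same double-precision system on the CPU and on the GPU.
n = 2000;                          % arbitrary problem size (assumption)
A = rand(n);
b = rand(n, 1);

tic; xCPU = A \ b; tCPU = toc;     % CPU solve

gA = gpuArray(A);                  % copy the data to device memory
gb = gpuArray(b);
tic;
gx   = gA \ gb;                    % GPU solve
xGPU = gather(gx);                 % copy the result back to the client
tGPU = toc;

relDiff = norm(xCPU - xGPU) / norm(xCPU);
fprintf('CPU: %.3f s, GPU: %.3f s, relative difference: %.2e\n', tCPU, tGPU, relDiff);
```

The transfer of A and b to the device is deliberately left outside the timed region; whether to count it depends on how often the data is reused on the GPU.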

43 Supported functions

44 2. Using a MATLAB function file
Create a MATLAB function (kernel):
    function c = myop(a, b)
    a1 = log(a);
    b1 = log(b);
    c = round(a1 .* b1);
From MATLAB, apply it element-wise with arrayfun():
    a = gpuArray(1/2*rand(1000));
    b = gpuArray(3*rand(1000));
    c = arrayfun(@myop, a, b);
    d = gather(c);

45 Performance

46 Main Limitations
- The code can call only supported functions and cannot call scripts
- Indexing is not supported
- Persistent or global variables are not supported
- if, for, while, parfor, spmd, switch, try/catch, and return are not supported
- single, double, int32, uint32, and logical are the only supported data type conversions
- Functional forms of arithmetic operators are not supported, but symbol operators are (i.e. + is supported, plus() is not)

47 3. Invoking CUDA code
- Develop the CUDA code (kernel) for your computation
- Compile the CUDA code to a PTX file using the NVIDIA compiler:
    nvcc -ptx myfun.cu
- Create a MATLAB function myfun.m containing the commands:
    kernel = parallel.gpu.CUDAKernel('myfun.ptx', 'myfun.cu');   % create a kernel object
    Res = feval(kernel, input_arguments);                        % evaluate the kernel on the GPU
- Execute the MATLAB function:
    Res = myfun(input_arguments);

48 Example of CUDA code

49 Performance

50 Performance Acceleration Options in the Parallel Computing Toolbox
Technology | Example | MATLAB Workers | Execution Target
matlabpool | parfor | Required | CPU cores
user-defined tasks | createTask | Required | CPU cores
GPU-based parallelism | GPUArray | No | NVIDIA GPU with compute capability 1.3 or greater

51 Questions? Thank you.