Introduction to Matlab Distributed Computing Server (MDCS) Dan Mazur and Pier-Luc St-Onge guillimin@calculquebec.ca December 1st, 2015 1
Partners and sponsors 2
Exercise 0: Login and Setup Ubuntu login: Username: csuser07 Password: @[S07 Example hand-out slip: 07: k41a0?wy# Guillimin login: ssh class07@guillimin.hpc.mcgill.ca Password: k41a0?wy# 3
Outline Introduction and Overview Configuring MDCS for Guillimin Submitting and monitoring jobs on Guillimin batch command Parallel toolbox parfor loops (parallel for loops) spmd sections (single program multiple data) distributed arrays (large memory problems) GPUs and Xeon Phis 4
Parallel Computing Toolbox (PCT) High-level constructs for parallel programming parallel for loops distributed arrays data parallel (spmd) sections Implicit (automatic) parallelism Implemented with MPI (MPICH2) Restricted to 12 cores on a single node Multi-node scalability is built into MPICH2, so this restriction is intentional: limiting scalability actually required extra engineering effort 5
MDCS Overview MDCS allows parallel toolbox users access to a number of workers (set by the license terms) on any number of nodes 6
MDCS vs. PCT differences MDCS jobs are submitted to the batch system on a cluster, not run locally Client - Server model In PCT, one explicitly starts a parpool environment In MDCS, this environment is requested in the batch() command 7
MDCS Overview Your PC runs Matlab, which sends your .m script plus attached files to the MDCS scheduler on Guillimin; the scheduler launches the job on the worker nodes and returns monitoring information to your Matlab client. Important: Do not attach large data files. Data transfer to and from Guillimin is best accomplished with scp or sftp. See http://www.hpc.mcgill.ca for large file transfers. 11
MDCS Licensing One N-worker MDCS job consumes N MDCS worker licenses, drawn from a pool of 64 MDCS licenses provided by McGill HPC. The desktop Matlab license, the Parallel Computing Toolbox license for the 1 master process, and any additional toolbox licenses are provided by the user (often via their institution). 12
MDCS Scenario Researchers begin with desktop Matlab under institutional licenses Eventually, researchers and research programs depend on the resulting software Problem sizes grow with time, eventually necessitating parallel computing In principle this is no problem: Mathworks implements the parallel toolbox on top of an MPI library with good scaling behaviour, provided by the free software community However, Mathworks restricts the number of nodes and cores, and requires additional licenses to remove these restrictions Because of decisions made years ago, researchers now face either Potentially expensive license fees to unlock their software's capabilities, or Financial and time barriers to switching vendors (i.e. porting code) 13
MDCS Alternatives Compile MPI functions with mex Use MatlabMPI: uses a global file system for MPI-like communication; low performance for tightly-coupled problems These approaches are difficult to maintain, cannot use PCT functions or the Matlab debugger, and require access to many individual Matlab licenses (e.g. a TAH license) Use GNU Octave: reduces the switching costs by re-implementing the Matlab programming language, but its parallel capabilities are less mature than Matlab's Port the code to another language (Python, R, Fortran, etc.): significant effort and time Contact us for help and advice guillimin@calculquebec.ca 14
MDCS Desktop Configuration 1) Install scripts used for communicating with scheduler 2) Configure the cluster profile 3) Verify your setup 15
Exercise 1: Install Scripts Download and unpack the .tar.gz configuration file on your local machine. E.g. Linux:
cd <workdir>
wget http://www.hpc.mcgill.ca/downloads/mdcs_config/guillimin_mdcs_config_v2.3.tar.gz
tar -xvf guillimin_mdcs_config_v2.3.tar.gz
Copy all "config/toolbox-local/*" files to the "<your_matlab_install>/toolbox/local" folder on your local machine. Start or restart Matlab. Then test your installation:
>> glmnversion 16
Permissions What if you don't have write access to the toolbox/local folder? Create a new folder in your home directory for Matlab scripts Add the new path to your Matlab path path('newpath', path); Set new path in a startup.m file Use MATLABPATH environment variable in Mac and Linux OSs 17
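As a sketch of the startup.m approach (the folder name below is a placeholder; use wherever you actually copied the scripts):

```matlab
% startup.m -- Matlab runs this file automatically at startup.
% 'matlab-local' is a hypothetical folder name for your copied scripts.
scriptdir = fullfile(getenv('HOME'), 'matlab-local');
path(scriptdir, path);   % prepend the folder to the Matlab search path
```

Placing this in a folder already on the path (or in the startup folder) means the integration scripts are found every session without write access to toolbox/local.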
MDCS Integration Scripts
glmncommsubfcn.m, glmnindsubfcn.m: main drivers for submitting jobs
glmngetremoteconn.m: establishes the connection to the cluster with ssh
glmndeletejobfcn.m: cancels a job on the cluster through Matlab
glmngetjobstatefcn.m: gets the job status from the cluster
glmnpbs.m: specifies the submission parameters
glmncreatesubscript.m: creates a script which will run on the cluster to submit the job
glmngensubmitstring.m: generates the qsub command
glmnextractjobid.m: gets the PBS jobid from the cluster
glmncommjobwrapper.sh, glmnindjobwrapper.sh: the script that is submitted to the worker nodes by qsub 18
Avoiding Metadata Corruption Each pair (Server, Matlab installation) requires a pair of metadata folders, one on the submitting computer and one on Guillimin E.g. installing a new version of Matlab and re-using the same metadata folders will result in corruption E.g. Submitting to a new MDCS server and re-using the same metadata folders will result in corruption E.g. Multiple users from the same client will require a shared metadata folder (read and write) or separate profiles Important: You cannot re-use your class account configuration for other Guillimin accounts 19
How many metadata folders? Servers: guillimin, orcinus. Clients: R2013a and R2014a on a lab computer, R2013a on a home computer. 20
How many metadata folders? Answer: 12. Each of the 3 client Matlab installations paired with each of the 2 servers needs its own pair of metadata folders (3 x 2 x 2 = 12). Servers: guillimin, orcinus. Clients: R2013a and R2014a on a lab computer, R2013a on a home computer. 21
Exercise 2: Configure your computer We have made a script, glmnconfigcluster.m, to make configuration easier Warning: glmnconfigcluster will overwrite any profiles called 'guillimin' >> glmnconfigcluster Enter a unique name for your local computer (e.g. the hostname): workshop Home directory on local computer (e.g. /home/alex, /Users/alex, or C:\\Users\\alex): /Users/dmazur Home directory on guillimin (e.g. /home/alex): /home/dmazur One last step: please connect to guillimin, and create your Matlab job directory: mkdir -p /home/dmazur/.matlab/jobs/workshop/guillimin/r2014a Once done, your local computer will be configured to submit jobs to guillimin. 22
Exercise 3: Validation You will want to test your new cluster with simple tests before trying more complicated codes Clicking the validation button in Matlab can take a long time and the final test is expected to fail Perform the validation procedure from the McGill HPC documentation Must be performed in the TestParfor directory cd examples/testparfor In glmnpbs.m, set procspernode to 3 23
A simple batch job mycluster = parcluster('guillimin') selects a cluster profile; j = batch(mycluster, ...) submits a job to the cluster. You will be prompted for your username; select 'no' when asked to use an identity file; you will then be prompted for your password. wait(j) waits for the job to finish. 24
Exercise 4: Simple Batch Job
>> mycluster = parcluster('guillimin')
>> j = batch(mycluster, @rand, 1, {10, 10}, 'CurrentDirectory', '.');
>> wait(j)
>> r = fetchOutputs(j) 25
glmnpbs.m For parallel jobs, we have a script (glmnpbs.m) to make job submission easier Place this script in your working directory Before submission, check that you have a valid glmnpbs.m file, and that your submission parameters are correct >> test = glmnpbs(); >> test.getsubmitargs() 26
classdef glmnpbs %Guillimin PBS submission arguments
    properties
        % Local script, remote working directory (home, by default)
        localscript = 'TestParfor';
        workingdirectory = '.';
        % nodes, ppn, gpus, phis and other attributes
        numberofnodes = 1;
        procspernode = 3;
        gpus = 0;
        phis = 0;
        attributes = '';
        % Specify the memory per process required
        pmem = '1700m';
        % Requested walltime
        walltime = '00:30:00';
        % Please use metaq unless you require a specific node type
        queue = 'metaq';
        % All jobs should specify an account or RAPid:
        % e.g. account = 'xyz-123-aa'
        account = '';
        % You may use otheroptions to append a string to the qsub command
        % e.g. otheroptions = '-M email[at]address.com -m bae'
        otheroptions = '';
    end 27
Submitting with glmnpbs.m
    methods (Static)
        function job = submitto(cluster)
            opt = glmnpbs();
            job = batch(cluster, opt.localscript, ...
                'matlabpool', opt.getnbworkers(), ...
                'CurrentDirectory', opt.workingdirectory ...
            );
        end
    end
>> cluster = parcluster('guillimin');
>> glmnpbs.submitto(cluster);
Note that glmnpbs.m must be present for all job submissions, even with batch(). It is called by glmncommsubfcn.m 28
Matlab Job Monitor Parallel > Monitor Jobs Select Profile: guillimin Enter username Select 'no' Enter password Tip: Set autoupdate to 'never', or use an identity file. Otherwise, Matlab interrupts your work with password requests. 29
Matlab Job Monitor Job Monitor can report the state, and more details such as output and errors (right click). 30
Monitoring Jobs on Guillimin
Show running and queued jobs: qstat -u class01
qstat shows both MDCS and other Guillimin jobs
Detailed scheduler information for the job with jobid=########: qstat -f ########
Meta-data is stored in job-specific folders: /home/username/.matlab/jobs/workshop/guillimin/r2014a/Job1
The .log files contain output and errors from Matlab itself
The .txt files contain output from disp() and fprintf()
You should create output with fprintf() and save Matlab (.mat) files with save() within your Guillimin storage (scratch, home, or project spaces) 31
Exercise 5: Submit Parallel Job Change the working directory to the examples/testparfor folder you copied from the.tar.gz configuration file Launch TestParFor.m using glmnpbs.m >> cluster = parcluster('guillimin') >> job = glmnpbs.submitto(cluster) 32
Make sure you are in the correct directory >> cluster = parcluster('guillimin') >> job = glmnpbs.submitto(cluster) This script runs for ~15 minutes. You may use showq or the job monitor to monitor its progress. 33
Exercise Codes While your job is waiting/running... Please download and extract the exercise codes from our website http://www.hpc.mcgill.ca/downloads/intro_mdcs/dec2015.tar.gz 34
Parallel Matlab Benefits of parallelism Computations complete faster Scale to larger data sets in the same amount of time Work with larger data sets using distributed memory 35
Parallel Matlab Implicit (automatic) parallelism Bioinformatics toolbox Image processing toolbox optimization toolbox signal processing toolbox statistics toolbox etc... Explicit parallelism parallel toolbox parfor spmd distributed() 36
TestParfor.m
function TestParfor;
clear all;
N=4000;
filename='~/output_test_parfor.txt';  % location of output file on Guillimin
outfile = fopen(filename,'w');
fprintf(outfile, 'CALCULATION LOG: \n\n');
tic;
for k=1:10  % serial 'for' loop executed on the head processor
    Ham(:,:,k)=rand(N)+i*rand(N);
    fprintf(outfile,'Serial: Doing K-point : %3i\n', k);
    inv(Ham(:,:,k));
end
t2=toc;
fprintf(outfile, 'Time serial = %12f\n', t2);
fclose(outfile);
tic;
parfor k=1:10  % parallel 'parfor' loop executed on 2 workers
    Ham(:,:,k)=rand(N)+i*rand(N);
    outfile = fopen(filename,'a');
    fprintf(outfile,'Parallel: Doing K-point : %3i\n', k);
    fclose(outfile);
    inv(Ham(:,:,k));
end
t2=toc;
outfile = fopen(filename,'a');
fprintf(outfile, 'Time parallel = %12f\n', t2);
fprintf(outfile, 'CALCULATIONS DONE... \n\n');
fclose(outfile); 37
Parfor [Diagram: in a serial for loop, iterations i=1, i=2, i=3, i=4 execute one after another in time; in a parallel parfor loop with 4 workers, all four iterations execute simultaneously.] 38
~/output_test_parfor.txt
CALCULATION LOG:
Serial: Doing K-point : 1
Serial: Doing K-point : 2
Serial: Doing K-point : 3
Serial: Doing K-point : 4
Serial: Doing K-point : 5
Serial: Doing K-point : 6
Serial: Doing K-point : 7
Serial: Doing K-point : 8
Serial: Doing K-point : 9
Serial: Doing K-point : 10
Time serial = 553.056296
Parallel: Doing K-point : 7
Parallel: Doing K-point : 4
Parallel: Doing K-point : 6
Parallel: Doing K-point : 3
Parallel: Doing K-point : 5
Parallel: Doing K-point : 2
Parallel: Doing K-point : 1
Parallel: Doing K-point : 9
Parallel: Doing K-point : 8
Parallel: Doing K-point : 10
Time parallel = 291.879429
CALCULATIONS DONE...
The serial 'for' loop ran on the head processor; the parallel 'parfor' loop ran on 2 workers, and its iterations complete out of order. Ideal speedup = 2.00X; actual speedup = 553.06 / 291.88 = 1.90X. 39
Parfor loops Loop index must be consecutive integers Iterations must be independent from one another Local or temporary variables modified inside the parfor loop can't be used after the for loop Cannot nest parfor loops Cannot be altered in the loop Don't need to be the outermost for loop Matlab editor will automatically warn about problems 40
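As a minimal sketch of these rules (variable names are illustrative, not from the course examples):

```matlab
n = 100;
y = zeros(1, n);          % preallocated output
parfor i = 1:n            % loop index: consecutive integers
    t = rand();           % temporary variable: NOT available after the loop
    y(i) = sin(t);        % sliced output variable, indexed by i -- allowed
end
% Invalid: each iteration depends on the previous one, so parfor is rejected:
% parfor i = 2:n
%     y(i) = y(i-1) + 1;  % the Matlab editor flags this dependence
% end
```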
Load Balancing Each iteration of the for loop should do an equal amount of work Good load balancing: Bad load balancing: parfor i = 1: 40 x = rand(1000, 1000); inv(x); end parfor i = 1: 40 x = rand(100*i, 100*i); inv(x); end 40th iteration has much more work than 1st iteration 41
Parallel Reduction >> s = 0; >> parfor i = 1:40 >> s = s + 1; >> end >> disp(s) 820 Operation will be done 'atomically' Operation must be associative e.g. addition or multiplication not subtraction or division 42
Aside: Atomic Operations
>> s = 0;
>> parfor i = 1:40
>>     s = s + i;
>> end
>> disp(s)
       820
Each update of s takes three steps: Step 1: read s from memory; Step 2: add the increment; Step 3: store the result back in s.
[Diagram: with non-atomic addition, two workers can both read the same old value of s, each add their increment, and both store their results, so one worker's update is lost. With atomic addition, each worker's read-add-store sequence completes before the other worker's begins, so no updates are lost.]
Matlab calls 's' a 'reduction variable' and these operations are automatically atomic. http://www.mathworks.com/help/distcomp/reduction-variables.html 45
Parallel Concatenation
>> y = [];
>> parfor i = 1:10
>>     y = [y, i];
>> end
>> disp(y)
     1  2  3  4  5  6  7  8  9  10
The matrix is stored in the 'correct' order according to the index i. 46
Parameter Sweep Damped harmonic oscillator Give an initial velocity for a variety of spring constants k and damping coefficients b, and record the maximum response amplitude 47
Exercise 6: Parameter Sweep paramsweep.m solves a second-order ordinary differential equation (ODE) for varying parameter values Modify this code to run in parallel on 2 workers Submit your modified code to the MDCS Retrieve the resulting plot from Guillimin using scp and view it on your laptop [laptop]$ scp \ class07@guillimin.hpc.mcgill.ca:~/paramsweep.png./ 48
Single Program Multiple Data The spmd command allows each worker to execute the same program on different data The variables labindex and numlabs are (for example) used to index the data; they are automatically defined inside spmd sections The functions labSend() and labReceive() are used to send and receive data between the workers 50
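A minimal sketch of worker-to-worker communication inside an spmd block (assuming an open pool of at least 2 workers; the variable names are illustrative):

```matlab
spmd
    if labindex == 1
        data = magic(4);
        labSend(data, 2);          % send the matrix to worker (lab) 2
    elseif labindex == 2
        data = labReceive(1);      % block until worker 1's message arrives
        fprintf('Worker 2 received a matrix with sum %d\n', sum(data(:)));
    end
end
```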
>> matlabpool(3)  % or parpool(3) on newer versions of Matlab
Starting matlabpool using the 'local' profile... connected to 3 workers.
>> spmd
     labindex
   end
Lab 1: ans = 1
Lab 2: ans = 2
Lab 3: ans = 3
>> spmd
     q = magic(labindex + 2);
   end
>> q
Lab 1: class = double, size = [3 3]
Lab 2: class = double, size = [4 4]
Lab 3: class = double, size = [5 5]
>> q{1}
ans =
     8     1     6
     3     5     7
     4     9     2
>> q{2}
ans =
    16     2     3    13
     5    11    10     8
     9     7     6    12
     4    14    15     1
51
SPMD data load example SPMD can be used to have each worker process data from separate files Example, process data stored in files datafile1.mat, datafile2.mat, etc... spmd infile = load(['datafile' num2str(labindex) '.mat']); result = myfunc(infile) end 52
Serial numerical integration
m = 10;
b = pi/2;
dx = b/m;
x = dx/2:dx:b-dx/2;
int = sum(cos(x)*dx) 53
SPMD Integral We would like to parallelize this integral using spmd In terms of m, b, numlabs and labindex: How many increments per lab? Integration length per lab? Local integration range? We can use gplus() to perform a global sum over workers 54
SPMD Integral We would like to parallelize this integral using spmd In terms of m, b, numlabs and labindex: How many increments per lab? n = m / numlabs Integration length per lab? Delta = dx * n = (b / m) * (m / numlabs) = b / numlabs Local integration range? ai = (labindex - 1) * Delta, bi = labindex * Delta We can use gplus(int, 1) to perform a global sum over int from each worker 55
SPMD Integral e.g.) m = 10, numlabs = 5 n = 10/5 = 2 increments per lab Delta = (pi/2)/5 = pi/10 ai = (labindex-1)*pi/10 bi = labindex*pi/10 Sum over increments for a worker: int = sum(cos(x)*dx); Global sum over all workers: int = gplus(int, 1); 56
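Putting the pieces above together, one possible sketch of the parallel midpoint rule (an untested outline, using the same variable names as the preceding slides):

```matlab
m = 10; b = pi/2; dx = b/m;
spmd
    n     = m / numlabs;              % increments per worker
    Delta = b / numlabs;              % integration length per worker
    ai    = (labindex - 1) * Delta;   % local lower bound
    x     = ai + dx/2 : dx : ai + Delta - dx/2;  % local midpoints
    int   = sum(cos(x) * dx);         % local partial sum
    int   = gplus(int, 1);            % global sum over workers, result on lab 1
end
disp(int{1})                          % approx. sin(pi/2) = 1
```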
Exercise 7: Numerical Integration integration.m is a serial numerical integration program Modify this code to run in parallel using the spmd command Submit your modified code to the MDCS using 2 workers 57
Distributed Arrays
matlabpool(4)
A = distributed( [ a b c d; e f g h; i j k l; m n o p] );
[Diagram: the four columns [a; e; i; m], [b; f; j; n], [c; g; k; o], [d; h; l; p] are stored one per MDCS worker, on workers 1 through 4.] 58
Distributed Arrays Allow large data sets to be distributed over multiple nodes Distributed by columns Can be constructed by partitioning a large array already in memory combining smaller arrays into one large array using distributed matrix constructor functions (distributed.rand(), distributed.zeros(), etc.) Operations on distributed arrays are automatically parallelized Arrays do not persist if the matlabpool is closed 59
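For example, a short sketch (matrix sizes are arbitrary):

```matlab
matlabpool(4)                    % parpool(4) on newer Matlab versions
A = distributed.rand(4000);      % 4000x4000, columns spread over the workers
B = distributed.rand(4000);
C = A * B;                       % the multiplication is parallelized automatically
c = gather(C(1:2, 1:2));         % gather() copies a piece back to the client
matlabpool close                 % C does not persist after the pool closes
```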
Codistributed Arrays Codistributed arrays provide much more control over how arrays are distributed Can be distributed by any dimension Can distribute different amounts of data to different workers Codistributed arrays can be declared inside spmd sections 60
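A sketch of a row-wise codistributed array declared inside an spmd section:

```matlab
spmd
    codist = codistributor1d(1);                  % distribute along dimension 1 (rows)
    A = codistributed.rand(4000, 4000, codist);   % each worker holds a block of rows
    localA = getLocalPart(A);                     % this worker's own rows
    fprintf('Worker %d holds %d rows\n', labindex, size(localA, 1));
end
```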
Exercise 8: Matrix Multiplication matrixmul.m is a serial matrix multiplication Modify this file to use distributed arrays create distributed random arrays a, b time a matrix multiplication: tic; c = a*b; toc Submit the job for 1 worker and then for 4 workers What is the speedup (serial time / parallel time)? 61
Using GPUs with Matlab The Parallel Computing Toolbox can utilize CUDA-capable GPUs on the system (e.g. the K20s on Guillimin) GPU-enabled functions: fft, filter, toolbox functions, linear-algebra operations Custom CUDA kernels (.cu or .ptx formats) 62
GPU Arrays Matlab can copy arrays to the GPU and perform matrix operations there to speed them up, e.g. x = rand(1000, 'single', 'gpuArray'); x2 = x.*x; % performed on the GPU 63
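Results computed on the GPU stay in GPU memory until explicitly copied back; a short sketch:

```matlab
x = rand(1000, 'single', 'gpuArray');  % allocate the array directly on the GPU
y = fft(x);                            % the FFT executes on the GPU
z = gather(y);                         % copy the result back to host memory
```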
Exercise 9: GPU Job fourier.m is a serial fast Fourier transform (FFT) code Modify this file to perform the same calculation using normal and GPU Arrays Use tic and toc to time both operations and output the results Submit this job to a Guillimin GPU node Hint: Simply request in glmnpbs.m numberofnodes = 1; procspernode = 1; gpus = 1; What is the speedup from the GPU? 64
Summary Today we learned: How to configure a desktop installation of Matlab to submit jobs to a cluster computer using MDCS How to submit jobs to a cluster and monitor their output How to write parallel Matlab applications using parfor, spmd, and distributed arrays Many Matlab programs can be parallelized with a very small change Note that parallel programming is a huge topic and we have only scratched the surface! 65
Questions What questions do you have? 66
Using Xeon Phi with Matlab Matlab uses the Intel MKL math library Version >= 11.0 of MKL has automatic offloading to Xeon Phi Included in Matlab R2014a and newer On Guillimin: module add ifort_icc export MKL_MIC_MAX_MEMORY=16G export MKL_MIC_ENABLE=1 matlab & 67