Parallel Computing with MATLAB (Calcul Parallèle sous MATLAB)




Parallel Computing with MATLAB
Journée Calcul Parallèle GPU/CPU du PEPI MACS
Olivier de Mouzon, INRA GREMAQ, Toulouse School of Economics
Monday, 28 November 2011, Paris

This presentation is largely based on excerpts from a recent seminar: MathWorks Seminar @ GREMAQ TSE, 15 November 2011, given by Mounir El Bedraoui (Sales Account Manager, Academia) and Stefan Duprey (Financial Application Engineer). © 2011 The MathWorks, Inc.

Outline
Functions that are already (silently) parallelized
Parallel Computing Toolbox (PCT)
  PCT alone vs. MDCS (MATLAB Distributed Computing Server)
  Local vs. cluster
Explicit use of parallelized functions
  Basic constructs: parfor & spmd
  Jobs and tasks
A note on MPI
A note on GPUs (NVIDIA CUDA)

Functions That Are Already (Silently) Parallelized
Multithreaded computations, introduced in R2007a, are now on by default. Many MATLAB functions are now multithreaded:
sort, bsxfun
mldivide for sparse rectangular matrix input
qr for sparse matrix input
filter for matrices and higher-dimensional arrays
gamma, gammaln
erf, erfc, erfcx, erfinv, erfcinv

Parallel Computing Toolbox (PCT)

matlabpool open local
... code ...
matlabpool close

matlabpool open 9
By default: up to 12 (local) cores.

Parallel computing enables you to:
Speed up computations by running independent tasks or iterations (larger compute pool)
Work with large data (larger memory pool)
[figure: an array whose columns are spread across workers]

Parallel Computing with MATLAB
Parallel Computing Toolbox runs on the user's desktop; MATLAB Distributed Computing Server provides MATLAB workers on the compute cluster.

PCT: Explicit Use of Parallelized Functions - Example

%% Parallel bootstrap-aggregated trees
% see also: crossval, jackknife, bootstrp
nTrees = 50;
matlabpool open local;
opt = statset('UseParallel', 'always');
tic;
b = TreeBagger(nTrees, X, Y, 'Options', opt);
toc;
matlabpool close;

Tools with Built-In Parallel Support
These toolboxes directly leverage functions in Parallel Computing Toolbox:
Optimization Toolbox
Global Optimization Toolbox
Statistics Toolbox
SystemTest
Simulink Design Optimization
Bioinformatics Toolbox
Model-Based Calibration Toolbox
http://www.mathworks.com/products/parallel-computing/builtin-parallel-support.html

PCT: Basic Constructs parfor and spmd

PCT: parfor

N = 250;
a = zeros(N, 1);
matlabpool open local;
tic;
parfor i = 1:N   % compare with: for i = 1:N
    a(i) = max(eig(rand(300)));
end
toc;
matlabpool close;

Case 1: Speed Up Computations
Distributing similar problems to different processors (task parallelism).

The Mechanics of parfor Loops

a = zeros(10, 1);
parfor i = 1:10
    a(i) = i;
end

The iterations 1 through 10 are divided among the pool of MATLAB workers; each worker executes a(i) = i for its own subset of iterations, and the results are assembled back into a.

PCT: parfor
Each iteration must be independent of the others.
Minimize data exchange with the workers: input variables are sent out to the cores, and output variables are collected back.
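The independence requirement can be illustrated with a minimal sketch (the loop bodies here are just illustrative):

```matlab
% OK for parfor: each iteration writes only its own slice of the output
% and reads nothing produced by other iterations.
a = zeros(100, 1);
parfor i = 1:100
    a(i) = i^2;
end

% NOT valid for parfor: each iteration reads the result of the previous
% one (a loop-carried dependency), so it must stay an ordinary for loop.
b = zeros(100, 1);
b(1) = 1;
for i = 2:100
    b(i) = b(i-1) + i;
end
```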

PCT: spmd

matlabpool open 2;
n = 100;
spmd  % simple spmd block
    a = rand(n, n);
    display(size(a))
    display(a(1:2, 1:2))
end
spmd  % creating codistributed arrays
    a = rand(n, n, codistributor());
    display(size(getLocalPart(a)));
end
spmd
    d = svd(a);
end
dGathered = gather(d);

D = distributed.rand(1000);     % data is created and stored on the workers
b = distributed.rand(1000, 1);  % created on the workers
x = D \ b;
matlabpool close

Case 2: Work with Large Data
Distributing arrays to different processors (data parallelism).

Examples of Distributed and Codistributed Arrays: spmd Blocks

spmd
    % single program across workers
end

Run on a pool of MATLAB resources: a Single Program runs simultaneously across the workers, with Multiple Data spread across those workers.

A Mental Model for spmd ... end

x = 1;
spmd
    y = x + 1;
end

The value x = 1 is copied to each worker in the pool; every worker then executes y = x + 1.

Client-Side Distributed Arrays and spmd
Client-side distributed arrays (class distributed):
Ability to create and manipulate arrays directly from the client
Simpler access to memory on the labs
Client-side visualization capabilities
spmd:
Block of code executed on the workers
Worker-specific commands
Explicit communication between workers
Mixture of parallel and serial code

Enhanced MATLAB Functions That Operate on Distributed Arrays

PCT: Jobs and Tasks
Use findResource to find a scheduler
Use createJob and createTask to set up the problem
Use submit to offload and run in parallel
Use getAllOutputArguments to retrieve all task outputs
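The steps above can be sketched with the local scheduler; a minimal, illustrative example using the PCT job API of that era (the tasks here just call rand):

```matlab
sched = findResource('scheduler', 'type', 'local');  % find a scheduler
job = createJob(sched);                              % set up the problem
createTask(job, @rand, 1, {3, 3});  % one task: 1 output, computes rand(3,3)
createTask(job, @rand, 1, {2, 2});  % a second, independent task
submit(job);                        % offload and run in parallel
waitForState(job, 'finished');      % block until all tasks are done
out = getAllOutputArguments(job);   % cell array with each task's outputs
destroy(job);                       % clean up the job on the scheduler
```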

Scheduling Work
The client sends work to the scheduler, which dispatches it to the workers and returns the results.

Scheduling Task-Parallel Applications
The client machine (Parallel Computing Toolbox) submits a job composed of tasks to the scheduler; on the compute cluster, MATLAB Distributed Computing Server workers (one per CPU) each execute a task and return their result through the scheduler.


Scheduling Data-Parallel Applications
The client machine (Parallel Computing Toolbox) submits a job to the scheduler; on the compute cluster, MATLAB Distributed Computing Server labs (one per CPU) cooperate on the tasks and return their results through the scheduler.


PCT: MPI-Based Functions
Use these when a high degree of control over the parallel algorithm is required.
High-level abstractions of MPI functions: labSendReceive, labBroadcast, and others
Send, receive, and broadcast any MATLAB data type
Automatic bookkeeping: setup (communication, ranks, etc.) and error detection (deadlocks and miscommunications)
Pluggable: use any MPI implementation that is binary-compatible with MPICH2
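A minimal sketch of these primitives inside an spmd block (assumes an open matlabpool with at least two labs; the payload values are just illustrative):

```matlab
spmd
    % Point-to-point: lab 1 sends a matrix to lab 2.
    if labindex == 1
        labSend(magic(4), 2);       % send to lab 2
    elseif labindex == 2
        data = labReceive(1);       % blocking receive from lab 1
    end

    % Collective: broadcast a value from lab 1 to every lab.
    if labindex == 1
        shared = labBroadcast(1, 42);  % lab 1 supplies the value
    else
        shared = labBroadcast(1);      % other labs receive it
    end
end
```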

Using an InfiniBand Network
Parallel Computing Toolbox does not have built-in support for InfiniBand. However, the toolboxes provide all the necessary hooks to take advantage of it: the user needs to supply their own build of MPI that supports InfiniBand. See "Using a Different MPI Build on UNIX Operating Systems" for more details.

Summary

Programming Parallel Applications

Level of control | Parallel options
Minimal          | Support built into toolboxes
Some             | High-level programming constructs (e.g. parfor, batch, distributed, jobs/tasks)
Extensive        | Low-level programming constructs (e.g. MPI-based)

Parallel Computing with MATLAB
Built-in parallel functionality within specific toolboxes (also requires Parallel Computing Toolbox): Optimization Toolbox, Global Optimization Toolbox, Statistics Toolbox, SystemTest, Simulink Design Optimization, Bioinformatics Toolbox, Model-Based Calibration Toolbox
High-level parallel language (MATLAB and parallel computing tools): parfor, matlabpool, batch
Low-level parallel functions: createJob, createTask
Built on industry-standard libraries: Message Passing Interface (MPI), ScaLAPACK

GPU Support (since R2010b)

What Is a GPU?
Originally designed for graphics acceleration, now also used for scientific calculations
Massively parallel array of integer and floating-point processors, typically hundreds of processors per card
GPU cores complement CPU cores
Dedicated high-speed memory

GPU vs. CPU

Performance Gain with More Hardware
Using more CPU cores (several cores sharing a cache) or using GPUs (with their own device memory).

GPU Terminology

NVIDIA GPU Solutions
GeForce: mass-market graphics
Quadro: professional graphics
Tesla: computation, with ECC memory and faster PCIe communication

Supported Cards and Operating Systems
To use the GPU functionality, the user needs:
MATLAB + PCT R2010b or later
32-bit or 64-bit Microsoft Windows or Linux operating system
NVIDIA CUDA-enabled device with compute capability 1.3 or greater
NVIDIA CUDA device driver 3.0 or greater
NVIDIA CUDA Toolkit 3.0 (recommended) for compiling PTX files
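A quick way to check these requirements from MATLAB is to query the selected device; a minimal sketch (requires PCT R2010b or later):

```matlab
% Query the currently selected GPU; errors out if no CUDA device is found.
dev = gpuDevice();
fprintf('GPU: %s\n', dev.Name);
fprintf('Compute capability: %s\n', dev.ComputeCapability);
if str2double(dev.ComputeCapability) >= 1.3
    disp('This device meets the PCT compute-capability requirement.');
end
```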

Using the GPU with PCT R2010b
Three main ways to access the GPU, in decreasing order of ease of use and increasing order of functionality:
1. Use the GPUArray interface and MATLAB built-in functions
2. Execute custom functions on elements of the GPUArray
3. Create kernels from your CUDA code and PTX files

1. Using MATLAB Built-In Functions
Feels like using distributed arrays:

A = gpuArray(rand(1000));
B = gpuArray(rand(1000));
C = transpose(A);
D = C * log(B);
E = gather(D);

Performance: A\b with Double Precision

Supported Functions

2. Using a MATLAB Function File
Create a MATLAB function (the kernel):

function c = myop(a, b)
a1 = log(a);
b1 = log(b);
c = round(a1 .* b1);

From MATLAB, apply it element-wise on the GPU with arrayfun():

a = gpuArray(1/2*rand(1000));
b = gpuArray(3*rand(1000));
c = arrayfun(@myop, a, b);
d = gather(c);

Performance

Main Limitations
The code can call only supported functions and cannot call scripts
Indexing is not supported
Persistent or global variables are not supported
if, for, while, parfor, spmd, switch, try/catch, and return are not supported
single, double, int32, uint32, and logical are the only supported data-type conversions
Functional forms of arithmetic operators are not supported, but operator symbols are (i.e. + is supported, plus() is not)

3. Invoking CUDA Code
Develop the CUDA code (kernel) for your computation.
Compile the CUDA code using the NVIDIA compiler:
    nvcc -ptx myfun.cu
Create a MATLAB function myFun.m containing the commands:
    kernel = parallel.gpu.CUDAKernel('myfun.ptx', 'myfun.cu');  % create a kernel object
    res = feval(kernel, input_arguments);  % evaluate the kernel on the GPU
Execute the MATLAB function:
    res = myFun(input_arguments);

Example of CUDA Code

Performance

Performance Acceleration Options in the Parallel Computing Toolbox

Technology            | Example    | MATLAB workers | Execution target
matlabpool            | parfor     | Required       | CPU cores
User-defined tasks    | createTask | Required       | CPU cores
GPU-based parallelism | GPUArray   | Not required   | NVIDIA GPU with compute capability 1.3 or greater

Questions? Thank you.