
1 Faculté Polytechnique
Optimizing Performance of Batch of applications on Cloud Servers exploiting Multiple GPUs
Sébastien Frémal, Michel Bagein, Pierre Manneback
2012 International Conference on Complex Systems, Palais des Roses Hotel, Agadir, Morocco, November 5-, 2012

2 Part 1 : GPU General Presentation
S. Fremal, M. Bagein, P. Manneback ICCS'12, Agadir, 5- November


3 CPU & GPU: architectural differences
- CPU : from 1 to 1 cores (tens of threads)
- GPU : from 32 to 800 cores (millions of threads)
- SP : Streaming Processor
- SM : Streaming Multiprocessor
- CPU : asynchronous code execution on cores
- GPU : synchronous code execution on the SPs of an SM

4 Device memories
Memory    Info.                    Size              Latency (clock cycles)
Register  On chip, thread scope    8192 x 4 bytes    1
Local     Off chip, thread scope   Undetermined      400 to 00
Shared    On chip, block scope     1 kb              4
Constant  Off chip, read only      8 kb              Min. 1
Texture   Off chip, read only      1 kb / core       4
Global    Off chip, main memory    Can reach 4 GB    400 to 00


5 Steps to execute a kernel on a GPU
(Diagram: host and device)
1. Device memory allocation and data transfer from the host memory to the device memory.
2. Kernel launching.
3. Data reading and processing.
4. Data writing.
5. Results transfer from the device memory to the host memory.
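The order of the five steps can be mimicked without a GPU. The sketch below is only an illustration: `FakeDevice` is a hypothetical class (not part of CUDA or of the authors' system) that models device memory as a Python dict, and the "kernel" is an ordinary Python function.

```python
class FakeDevice:
    """Hypothetical stand-in for a GPU: device memory is just a dict."""

    def __init__(self):
        self.memory = {}

    def alloc_and_copy_in(self, name, host_data):
        # Step 1: allocate device memory and copy host data into it.
        self.memory[name] = list(host_data)

    def launch(self, kernel, src, dst):
        # Step 2: launch the kernel.
        # Steps 3 and 4: the kernel reads and processes its input,
        # then writes its output to device memory.
        self.memory[dst] = [kernel(x) for x in self.memory[src]]

    def copy_out(self, name):
        # Step 5: copy the results back to host memory.
        return list(self.memory[name])


def square(x):
    return x * x


device = FakeDevice()
device.alloc_and_copy_in("input", [1, 2, 3, 4])
device.launch(square, src="input", dst="output")
host_result = device.copy_out("output")
print(host_result)
```

In a real CUDA program, steps 1 and 5 would typically be done with cudaMalloc and cudaMemcpy; the sketch only shows that both transfers bracket the kernel execution.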

6 Some constraints of GPUs
- Data transfers from one memory to another : enough computation is needed to compensate for the latency due to transfer times.
- Different applications are not distributed over all available GPUs; they are all assigned to the same default GPU.
- The number of coexisting CUDA contexts (GPU processes) is limited (about 30 on an NVIDIA GTX 580).
- CUDA context initialization takes some time (400 ms on an NVIDIA GTX 580).
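A rough way to see what the last two constraints cost for a batch is to count contexts. The sketch below uses the two GTX 580 figures quoted on the slide; the function names are hypothetical, and the model assumes that context initializations are serialized, which the slides do not state.

```python
CONTEXT_INIT_MS = 400   # per-context initialization time quoted for a GTX 580
MAX_CONTEXTS = 30       # approximate context limit quoted for a GTX 580


def context_count(n_apps, shared_contexts=None):
    """Contexts needed: one per application unless a shared pool is used."""
    return n_apps if shared_contexts is None else shared_contexts


def init_overhead_ms(n_apps, shared_contexts=None):
    """Total initialization time, assuming initializations are serialized."""
    return context_count(n_apps, shared_contexts) * CONTEXT_INIT_MS


def within_limit(n_apps, shared_contexts=None):
    """True if the number of coexisting contexts stays within the limit."""
    return context_count(n_apps, shared_contexts) <= MAX_CONTEXTS


print(init_overhead_ms(50), within_limit(50))
print(init_overhead_ms(50, shared_contexts=3), within_limit(50, shared_contexts=3))
```

Under these assumptions, a batch of 50 applications with one context each exceeds the limit, whereas three shared contexts stay well within it.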

7 Programming languages : CUDA & OpenCL
C for CUDA:
- CUDA = parallel computing architecture developed by NVIDIA
- Proprietary (NVIDIA)
- cudaMalloc, cudaMemcpy
OpenCL:
- OpenCL = open, royalty-free standard for cross-platform, parallel programming
- Open (NVIDIA, ATI)
- clCreateBuffer, clCreateKernel, clSetKernelArg

8 Part 2 : GPUs Management

9 Efficient use of a set of graphic processors
Default behavior of CUDA applications :
- They use the default GPU (GPU 0).
- The system freezes when there are too many CUDA contexts.
- Initialization of a CUDA context takes some time.
Goals of our work :
- CUDA context mutualization : sharing CUDA contexts between multiple applications avoids freezing the system when there are many applications and avoids the initialization time of CUDA contexts.
- Using all available GPUs : distributing the applications so that they transparently use all available GPUs, to improve performance.

10 A first system distributing applications on GPUs of a computer
- Assignment Manager : distributes applications over the GPUs.
- GPU Calls : interface that sends requests for GPU functions to the GPU Manager through message queues.
- GPU Manager : receives requests from applications and executes them on the GPUs. Each one has one thread managing one CUDA context, which provides context mutualization.
- Memory Manager : manages GPU memory and a memory zone shared with the applications.
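The interplay between the Assignment Manager, the message queues and the GPU Managers can be sketched in a few lines. This is not the authors' implementation: the in-process `queue.Queue` stands in for the message queues, the round-robin placement is an assumed policy, the function names are hypothetical, and no real GPU work is done.

```python
import queue
import threading


def gpu_manager(gpu_id, requests, results):
    """Worker thread standing in for one GPU Manager and its shared context."""
    while True:
        item = requests.get()
        if item is None:          # sentinel: no more requests for this GPU
            break
        app_id, func, arg = item
        results.put((app_id, gpu_id, func(arg)))


def run_batch(n_gpus, jobs):
    """Distribute jobs over n_gpus managers and collect the results."""
    queues = [queue.Queue() for _ in range(n_gpus)]
    results = queue.Queue()
    workers = [
        threading.Thread(target=gpu_manager, args=(g, queues[g], results))
        for g in range(n_gpus)
    ]
    for w in workers:
        w.start()
    # Assignment step (assumed policy): application i goes to GPU i mod n_gpus.
    for app_id, (func, arg) in enumerate(jobs):
        queues[app_id % n_gpus].put((app_id, func, arg))
    for q in queues:
        q.put(None)
    for w in workers:
        w.join()
    placement = {}
    while not results.empty():
        app_id, gpu_id, value = results.get()
        placement[app_id] = (gpu_id, value)
    return placement


def double(x):
    return 2 * x


placement = run_batch(3, [(double, i) for i in range(6)])
print(placement)
```

Because each worker thread lives for the whole batch, the applications reuse it instead of paying for a new context each, which is the point of mutualization.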

11 Results with the first system
[Figure: makespan of the application batch (s) versus number of applications per batch. Series: CUDA - 1 GPU, CUDA - 3 GPUs, Our system - 1 GPU, Our system - 3 GPUs.]
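The makespan of a batch is the time at which its last application finishes. As a minimal sketch, assuming round-robin placement, strictly sequential execution on each GPU and no overheads (none of which is stated on the slides), it can be computed as follows; `makespan` is a hypothetical helper.

```python
def makespan(durations, n_gpus):
    """Completion time of the last application in the batch.

    Application i is placed on GPU i mod n_gpus, and the applications
    placed on one GPU run one after another.
    """
    busy = [0.0] * n_gpus
    for i, d in enumerate(durations):
        busy[i % n_gpus] += d
    return max(busy)


batch = [1.0] * 10          # ten applications of one second each
print(makespan(batch, 1))   # one GPU
print(makespan(batch, 3))   # three GPUs
```

Under this simple model, three GPUs cannot do better than the most loaded GPU, which here runs four of the ten applications.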


12 GVirtuS : a system allowing an instantiated virtual machine to access GPGPUs
Similar to our system, with some differences :
- Communications are available through TCP, which allows the server to communicate with an application in a virtual machine.
- The frontend library uses the interfaces of the CUDA Runtime API functions, so applications do not need many modifications to use GVirtuS.
- GVirtuS uses one GPU (the default GPU).
- There is no context mutualization. For each application, the server spawns a process. This process initializes and manages a CUDA context; it receives the requests of the application it is bound to and executes them on the GPU.
Reference: A GPGPU transparent virtualization component for high performance computing clouds, G. Giunta et al.


13 Results with GVirtuS
[Figure: makespan of the application batch (s) versus number of applications per batch (10 and 20 applications). Series: CUDA - 1 GPU, GVirtuS - 1 GPU.]

14 GVirtuS + : improvements of GVirtuS
To improve the performance of GVirtuS, we brought to it two modifications that we had tried in our own system :
- CUDA context mutualization
- Using all available GPUs

15 Results with GVirtuS
[Figure: makespan of the application batch (s) versus number of applications per batch. Series: CUDA - 1 GPU, CUDA - 3 GPUs, GVirtuS GPU, GVirtuS GPUs.]

16 Results with GVirtuS + : virtualized applications
[Figure: makespan of the application batch (s) versus number of applications per batch. Series: CUDA - 1 GPU, CUDA - 3 GPUs, GVirtuS + Virtualized - 1 GPU, GVirtuS + Virtualized - 3 GPUs.]

17 Comparison of systems on 3 GPUs
[Figure: makespan of the application batch (s) versus number of applications per batch. Series: CUDA - 3 GPUs, Our system - 3 GPUs, GVirtuS GPUs, GVirtuS + Virtualized - 3 GPUs.]

18 Conclusions & Perspectives
- Applications benefit from the already initialized contexts, and mutualization allows more applications to be executed.
- Applications are properly distributed over the available GPUs.
- Communications will be improved to make global execution faster. This is especially important when applications are virtualized.
- A memory manager will be implemented to remove the memory limitation.
- CUDA streams will be tried in order to overlap transfers with kernel execution.
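The expected gain from overlapping transfers with kernel execution can be estimated with a simple two-stage pipeline model. The sketch below is an idealized calculation, not a measurement: it assumes the data is split into equal chunks, that a chunk's kernel can start as soon as its own transfer is done, and that transfer and compute never slow each other down. The function names are hypothetical.

```python
def serial_time(transfer_s, kernel_s):
    """No overlap: the whole transfer finishes before the kernel starts."""
    return transfer_s + kernel_s


def overlapped_time(transfer_s, kernel_s, n_chunks):
    """Two-stage pipeline over n_chunks equal chunks.

    The first chunk must be transferred, the last chunk must be computed,
    and in between the slower of the two stages sets the pace.
    """
    a = transfer_s / n_chunks   # transfer time per chunk
    b = kernel_s / n_chunks     # kernel time per chunk
    return a + (n_chunks - 1) * max(a, b) + b


print(serial_time(4.0, 4.0))
print(overlapped_time(4.0, 4.0, 4))
```

With a single chunk the model reduces to the serial case, and the benefit is largest when transfer and kernel times are of similar size.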
