Optimizing a 3D-FWT code in a cluster of CPUs+GPUs


Optimizing a 3D-FWT code in a cluster of CPUs+GPUs
Gregorio Bernabé, Javier Cuenca, Domingo Giménez
Universidad de Murcia, Scientific Computing and Parallel Programming Group
XXIX Simposium Nacional de la Unión Científica Internacional de Radio (URSI), Valencia, 3-5 September 2014

Outline
1 Introduction and Motivation
2 3D-FWT basics
3 Autotuning architecture to run the 3D-FWT kernel automatically on clusters
4 Experimental results
5 Conclusion and future work

Introduction and Motivation
Clusters of heterogeneous nodes are used to solve scientific problems such as the 3D Fast Wavelet Transform (3D-FWT).

Introduction and Motivation
The development and optimization of parallel code is a complex task:
Deep knowledge of the different components is needed to exploit their computing capacity.
Different programming paradigms (message passing, shared memory and SIMD on the GPU) must be programmed and combined efficiently.
Our proposal: an autotuning architecture to run the 3D-FWT kernel automatically on clusters of multicore+GPU nodes.

3D-FWT basics: 1D-FWT
The wavelet transform uses simple filters, which allows fast computation.
The filters are applied to the signal and the filter output is downsampled, generating two bands (low and high frequency).
The total amount of data is maintained at each additional decomposition level.
The access pattern is determined by the mother wavelet function.
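As a minimal sketch of one decomposition level, the following C routine applies a low-pass/high-pass filter pair and downsamples. The talk does not name its mother wavelet or filter lengths, so the Haar wavelet is an assumption here.

```c
#include <math.h>

/* One 1D-FWT level with the Haar wavelet (an assumption; the talk does
   not specify its mother wavelet): filter the signal and downsample,
   producing n/2 low-band and n/2 high-band coefficients. */
void fwt1d_haar_level(const float *in, float *low, float *high, int n)
{
    for (int i = 0; i < n / 2; i++) {
        low[i]  = (in[2*i] + in[2*i + 1]) / (float)sqrt(2.0); /* low-pass  */
        high[i] = (in[2*i] - in[2*i + 1]) / (float)sqrt(2.0); /* high-pass */
    }
}
```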

3D-FWT basics: 2D-FWT
The 1D-FWT is generalized to an image (2D) by applying the 1D-FWT to each row and to each column of the image.
[Figure: original image, columns transformed, rows transformed, and the result after a three-level application of the filters.]
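A sketch of one 2D level, reusing fwt1d_haar_level from the sketch above. The row-major buffer layout and the fixed maximum image height are illustrative assumptions, not details from the talk.

```c
/* One 2D-FWT level: transform every row into a temporary buffer, then
   every column back into the image (low band on top, high band below). */
#define MAX_H 4096   /* assumed upper bound on image height */

void fwt2d_level(float *img, float *tmp, int w, int h)
{
    for (int r = 0; r < h; r++)                     /* rows */
        fwt1d_haar_level(&img[r*w], &tmp[r*w], &tmp[r*w + w/2], w);

    float col[MAX_H], lo[MAX_H/2], hi[MAX_H/2];
    for (int c = 0; c < w; c++) {                   /* columns */
        for (int r = 0; r < h; r++) col[r] = tmp[r*w + c];
        fwt1d_haar_level(col, lo, hi, h);
        for (int r = 0; r < h/2; r++) {
            img[r*w + c]         = lo[r];           /* low band  */
            img[(h/2 + r)*w + c] = hi[r];           /* high band */
        }
    }
}
```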

3D-FWT basics: 3D-FWT
The 1D-FWT is generalized to a video sequence (3D): a 2D-FWT is applied to each of the Nframes frames, and Nrows x Ncols calls to the 1D-FWT are made along the time dimension.
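Building on the 2D sketch above, one 3D level might look as follows. The frame-major memory layout and the frame-count limit are assumptions for illustration.

```c
/* One 3D-FWT level over a group of nf frames: a 2D-FWT per frame,
   then a temporal 1D-FWT for every (row, col) pixel position. */
#define MAX_F 64     /* assumed upper bound on frames per group */

void fwt3d_level(float *vid, float *tmp, int w, int h, int nf)
{
    for (int f = 0; f < nf; f++)                    /* spatial pass */
        fwt2d_level(&vid[f*w*h], tmp, w, h);

    float t[MAX_F], lo[MAX_F/2], hi[MAX_F/2];
    for (int p = 0; p < w*h; p++) {                 /* temporal pass */
        for (int f = 0; f < nf; f++) t[f] = vid[f*w*h + p];
        fwt1d_haar_level(t, lo, hi, nf);
        for (int f = 0; f < nf/2; f++) {
            vid[f*w*h + p]          = lo[f];
            vid[(nf/2 + f)*w*h + p] = hi[f];
        }
    }
}
```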

3D-FWT basics: 3D-FWT with tiling
Frame processing is decomposed to improve data locality: instead of computing two entire frames at a time, the frames are split into small blocks, on which the 2D-FWT with tiling is applied.

Autotuning Architecture
The architecture consists of the following main objects:
Cluster Analysis
Cluster Performance Map
Theoretical Searching of the Best Number of Slave Nodes
Best Number of Slave Nodes

Cluster Analysis
Automatically detects the available GPUs and CPUs in each node (see the sketch after this list).
For each platform (GPU or CPU) in each node:
If Nvidia GPU: the CUDA 3D-FWT automatically calculates the block size.
If ATI GPU: the OpenCL 3D-FWT automatically computes the work-group size.
If CPU: a fast analysis of the tiling version with pthreads obtains the best number of threads.
One sequence is sent to measure the computing performance of the 3D-FWT kernel on each platform.
The performance of the interconnection network among the nodes is also measured.
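A minimal sketch of the GPU detection step using the CUDA runtime API; the autotuner's actual probing code is not shown in the talk, so this is only an illustration of the idea.

```c
#include <stdio.h>
#include <cuda_runtime.h>

/* Enumerate the CUDA-capable GPUs visible in the node. */
int main(void)
{
    int ngpus = 0;
    if (cudaGetDeviceCount(&ngpus) != cudaSuccess)
        ngpus = 0;                          /* no CUDA-capable device */
    for (int d = 0; d < ngpus; d++) {
        struct cudaDeviceProp p;
        cudaGetDeviceProperties(&p, d);
        printf("GPU %d: %s, %d multiprocessors\n",
               d, p.name, p.multiProcessorCount);
    }
    return 0;
}
```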

Cluster Performance Map
Stores, for each platform (CPU or GPU):
Number and type of CPUs and GPUs
Computing performance of the 3D-FWT kernel
Network bandwidth

Searching for the Best Number of Slave Nodes
Inputs: the Cluster Performance Map, the routines of the 3D-FWT, and the input data (the video sequences).
Outputs: the temporal distribution scheme for different numbers of slave nodes; from these, the best number of slave nodes is selected.

Searching for the Best Number of Slave Nodes
In a cluster of N + 1 nodes, a temporal distribution scheme TS(n) is built for the n most powerful slave nodes working jointly with the master node (for n = 1 to N), taking into account a double balancing goal (see the sketch below):
Inter-node, coarse-grain balance: the workload is distributed among the slaves according to their relative performance when solving data sequences.
Intra-node, fine-grain balance: inside each node, the CPU-GPU workload is balanced.
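A toy version of the search could look as follows. The cost model here (aggregate relative performance plus a per-slave communication overhead) is a simplified assumption of ours, not the model used in the talk.

```c
#include <stdio.h>

/* Estimate the time of TS(n) for n = 1..N and keep the best n.
   perf[] holds the slaves' relative performances, sorted by power. */
int best_slave_count(const double *perf, int N,
                     double work, double comm_overhead)
{
    double cap = 0.0, best_t = 0.0;
    int best_n = 0;
    for (int n = 1; n <= N; n++) {
        cap += perf[n - 1];                 /* aggregate performance */
        double t = work / cap + n * comm_overhead;
        if (best_n == 0 || t < best_t) { best_t = t; best_n = n; }
    }
    return best_n;
}

int main(void)
{
    double perf[] = { 2.0, 2.0, 1.0 };      /* e.g. Marte, Mercurio, Luna */
    printf("best n = %d\n", best_slave_count(perf, 3, 100.0, 0.5));
    return 0;
}
```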

Execution Environment
The master process runs in Saturno; the other nodes are the candidates to be slaves. The nodes are connected by InfiniBand.
Saturno: 4 Intel Xeon X7542 hexacore, 1 Nvidia Tesla C2050
Marte: AMD Phenom II X6 1075T, 2 GeForce GTX 590
Mercurio: AMD Phenom II X6 1075T, 2 GeForce GTX 590
Luna: Intel Core2 Quad Q6600, 1 GeForce 9800 GT

Cluster Analysis
Automatically detects the available GPUs and CPUs in each node.
Selects the CUDA implementation for the Nvidia GPUs and studies the problem for different block sizes and numbers of threads: block size 192 is selected for the Tesla C2050, the GeForce GTX 590 and the GeForce 9800 GT.
Selects the 3D-FWT tiling version with pthreads for the CPUs, with 4 threads for Luna and 12 for Saturno, Mercurio and Marte.
Executes an MPI program to determine the network bandwidth (1.5 GB/s), as sketched below.
Computes the performance of the 3D-FWT kernel for each node and platform.
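The bandwidth measurement could be done with a standard MPI ping-pong test like the following sketch (run with exactly 2 processes); the message size and repetition count are arbitrary choices of ours, not those of the talk.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    const int n = 1 << 24;                  /* 16 MB payload */
    const int reps = 10;
    char *buf = malloc(n);
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {                    /* ping from rank 0 ... */
            MPI_Send(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {             /* ... pong from rank 1 */
            MPI_Recv(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    if (rank == 0)                          /* bytes moved / elapsed time */
        printf("bandwidth: %.2f GB/s\n",
               2.0 * reps * n / (MPI_Wtime() - t0) / 1e9);
    free(buf);
    MPI_Finalize();
    return 0;
}
```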

Cluster Performance Map
[Figure: under the name of each platform in a node, three numbers give the relative speedups of that platform for image resolutions 512, 1024 and 2048.]

Cluster Performance Map
[Figure: under the name of each slave node, three numbers give the speedups of Marte and Mercurio with respect to the slowest slave node (Luna) for the different image resolutions.]

Searching for the Best Number of Slave Nodes
Objectives: obtain a distribution that provides jobs for all the platforms in the cluster, and maintain the proportion among the nodes.
Once the sequences are received at each slave node, they are divided in proportion to the computing performance of each platform in the node (a sketch follows).
Different possible temporal distribution schemes are built for one, two and three slaves.
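A sketch of the intra-node split: the sequences assigned to a node are divided among its platforms in proportion to each platform's measured 3D-FWT performance. The rounding policy is our assumption; the talk does not detail it.

```c
#include <stdio.h>

/* Divide nseq sequences among nplat platforms proportionally to perf[]. */
void split_sequences(int nseq, const double *perf, int nplat, int *share)
{
    double total = 0.0;
    for (int p = 0; p < nplat; p++) total += perf[p];
    int assigned = 0;
    for (int p = 0; p < nplat; p++) {
        share[p] = (int)(nseq * perf[p] / total);   /* floor */
        assigned += share[p];
    }
    share[0] += nseq - assigned;    /* leftovers go to the first platform */
}

int main(void)
{
    /* e.g. Marte: two GPUs at relative speed 2 and a multicore at 1 */
    double perf[] = { 2.0, 2.0, 1.0 };
    int share[3];
    split_sequences(5, perf, 3, share);
    printf("%d %d %d\n", share[0], share[1], share[2]);  /* prints 2 2 1 */
    return 0;
}
```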

TS for 1 slave node
For master node Saturno and slave node Marte, with 2048x2048 images:
The master sends 5 sequences to the slave, because the relative performance of the platforms in Marte is 2 for each GPU and 1 for the multicore.
Marte processes the 5 sequences and the master node waits for the results.
During shipment, processing and reception, the GPU and the remaining cores of Saturno process sequences: the Nvidia C2050 processes 6.06 sequences and the 23 CPU cores 5.60 sequences.
[Timeline figure with times 0.42 s, 0.48 s and 0.42 s.]
The speedups of scheme TS(1) are 1.55, 1.43 and 1.43 for the different image resolutions.

TS for 2 slave nodes
For master node Saturno and slave nodes Marte and Mercurio:
The master sends 5 sequences to each slave, and Marte and Mercurio process them.
During shipment, processing and reception, the GPU and the remaining cores of Saturno process sequences: the Nvidia C2050 processes 7.98 sequences and the 23 CPU cores 7.37 sequences.
[Timeline figure with times 0.42 s, 0.42 s, 0.06 + 0.42 s and 0.42 s.]
The speedups of TS(2) are 1.84, 1.63 and 1.65.

TS for 3 slave nodes
For master node Saturno and slave nodes Luna, Marte and Mercurio, with 2048x2048 images:
The proportion between Marte or Mercurio and Luna is 2 to 1.
The least common multiples of (1, 2, 2) and (2, 5, 5) are 2 and 10, so the master sends 20 (2 x 10) sequences each to Marte and Mercurio and 10 to Luna.

TS for 3 slave nodes
This pattern is repeated four times: in each repetition the master sends 5 sequences to Marte, 5 to Mercurio and 2 to Luna.
Meanwhile, the Nvidia C2050 processes 9.22 sequences and the 23 CPU cores 8.52 sequences.
[Timeline figure with times 0.42 s, 0.42 s and 0.17 s per repetition.]
The speedups of TS(3) are 1.89, 1.64 and 1.68.

Comparison with GPU
Speedups of the 3D-FWT with autotuning with respect to sending all the sequences to one or two of the GPUs of a node with the optimal block size selected, for a 2-hour video sequence at 25 frames per second, split into groups of 32 frames:

speedup                        512x512   1024x1024   2048x2048
AT vs Nvidia C2050 GPU         3.37      3.17        3.22
AT vs 2 GeForce GTX 590 GPUs   1.58      1.60        1.66
AT vs GeForce 9800 GT GPU      8.35      8.53        8.96

On average, a speedup of 4.49 versus a user who sends all the sequences to the GPU of a node with the optimal block size selected.

Comparison with execution in a node
Speedups of the 3D-FWT with autotuning with respect to sending all the sequences to a single node (CPU+GPU) with the optimal block size selected, for a 2-hour video sequence at 25 frames per second, split into groups of 32 frames:

speedup                      512x512   1024x1024   2048x2048
AT vs AT on Saturno          2.06      1.81        1.74
AT vs AT on Marte/Mercurio   1.34      1.33        1.43
AT vs AT on Luna             5.57      6.34        6.15

On average, a speedup of 3.09 versus processing all the sequences in only one node of the cluster using the autotuning proposed in:
G. Bernabé, J. Cuenca, D. Giménez. Optimization techniques for 3D-FWT on systems with manycore GPUs and multicore CPUs. ICCS 2013.

Conclusion
An autotuning engine runs the 3D-FWT kernel automatically on clusters of multicore+GPU nodes:
It detects the number and type of CPUs and GPUs, the computing performance of the 3D-FWT and the bandwidth of the interconnection network.
The theoretical search for the best number of slave nodes automatically computes the proportions in which the different video sequences are divided among the nodes of the cluster.
It searches for the temporal distribution scheme for n slave nodes working jointly with a master node, and chooses the best number of slave nodes.
Average gains of 3.09 and 4.49 are obtained versus an optimization engine for a single node (CPU+GPU) and for a single GPU, respectively.

Future work
This work is part of the development of an image processing library oriented to biomedical applications, allowing users to execute different routines efficiently and automatically.
The proposed autotuning methodology is applicable to other complex computing applications.
It can be extended to more complex computational systems (multicore+GPU+MIC).
Selection of the number of threads for a video sequence of 64 frames, at 1024x1024 and 2048x2048, on the Intel Xeon Phi (execution times):

#threads   1024x1024   2048x2048
1          3783.24     21774.10
32         317.34      903.40
64         389.16      1283.29
96         60.83       35.78
128        41.09       41.04
160        50.96       55.65
192        68.13       62.89
224        70.60       68.85
