Experiences on using GPU accelerators for data analysis in ROOT/RooFit


Experiences on using GPU accelerators for data analysis in ROOT/RooFit
Sverre Jarp, Alfio Lazzaro, Julien Leduc, Yngve Sneen Lindal, Andrzej Nowak
European Organization for Nuclear Research (CERN), Geneva, Switzerland
Workshop on Future Computing in Particle Physics, e-Science Institute, Edinburgh (UK), June 15-17, 2011

Tiny OpenCL intro

OpenCL device abstractions: different hardware/SDKs/drivers are represented by different "platform" objects. A platform object can have a range of devices (if you physically have them). An example:

cl_platform_id platform;
cl_device_id device;
cl_context context;
cl_command_queue queue;
cl_int status;

clGetPlatformIDs(1, &platform, NULL);
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
context = clCreateContext(NULL, 1, &device, NULL, NULL, &status);
queue = clCreateCommandQueue(context, device, 0, &status);

Tiny OpenCL intro: declaring a computational kernel

__kernel void evaluatePdfGaussian(const double mu, const double sigma,
                                  __global const double *data,
                                  __global double *results,
                                  const int N)
{
    int i = get_global_id(0);
    if (i >= N) return;
    double x = data[i];
    double temp = (x - mu) / sigma;
    temp *= temp;
    results[i] = exp(-0.5 * temp);
}

Executing a computational kernel:

// Assume we have the required arguments and a kernel object for the Gaussian kernel above
clSetKernelArg(evaluatePdfGaussian, 0, sizeof(double), (void*)&mu);    // double, matching the kernel signature
clSetKernelArg(evaluatePdfGaussian, 1, sizeof(double), (void*)&sigma);
clSetKernelArg(evaluatePdfGaussian, 2, sizeof(cl_mem), (void*)&data);
clSetKernelArg(evaluatePdfGaussian, 3, sizeof(cl_mem), (void*)&results);
clSetKernelArg(evaluatePdfGaussian, 4, sizeof(int), (void*)&N);
size_t workGroupSize = 128; // e.g.
size_t numWorkGroups = (N % workGroupSize == 0) ? N / workGroupSize : N / workGroupSize + 1;
size_t total = workGroupSize * numWorkGroups;
clEnqueueNDRangeKernel(queue, evaluatePdfGaussian, 1, NULL, &total, &workGroupSize, 0, NULL, NULL);
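The snippet above assumes the kernel object already exists. For completeness, a minimal sketch of how it could be obtained at run time (this step is not shown on the slides):

// Compile the kernel source at run time and fetch the kernel object.
// 'source' is assumed to hold the kernel string from the previous slide.
const char *source = "..." /* kernel source as shown above */;
cl_program program = clCreateProgramWithSource(context, 1, &source, NULL, &status);
clBuildProgram(program, 1, &device, NULL, NULL, NULL);   // check the build log on failure
cl_kernel evaluatePdfGaussian = clCreateKernel(program, "evaluatePdfGaussian", &status);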

GPU Implementation (OpenCL)

With OpenMP, each thread can evaluate the PDF tree top-down directly, fully in parallel. Using a GPU instead requires an explicit kernel call inside each PDF (see the 2nd illustration), suggesting lower parallel efficiency: it leads to a larger serial fraction, many kernel calls and, in general, stalls. Data is uploaded once, at the beginning of the run.
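As an illustration of this pattern (the buffer handling is our sketch; hostData is a hypothetical host-side array, not RooFit's actual code):

// Upload the event data once, at the beginning of the run.
cl_mem data = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                             N * sizeof(double), hostData, &status);
cl_mem results = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                                N * sizeof(double), NULL, &status);

// Afterwards, each PDF evaluation only updates its parameters and enqueues its
// kernel. One such launch per PDF node per iteration is what produces the many
// small kernel calls (and the larger serial fraction) described above.
clSetKernelArg(evaluatePdfGaussian, 0, sizeof(double), &mu);
clSetKernelArg(evaluatePdfGaussian, 1, sizeof(double), &sigma);
clEnqueueNDRangeKernel(queue, evaluatePdfGaussian, 1, NULL, &total,
                       &workGroupSize, 0, NULL, NULL);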

GPU Implementation (OpenCL)

Parallel block-wise reduction is used; it improves the speedup significantly (it uses GPU shared memory). Double precision and the general accuracy requirements prevent using the native transcendental units and also limit performance in general (GPUs are primarily built for single precision). The code is not memory-bound (on an NVIDIA GTX470, at least), since we are doing expensive computations, so the texture cache has no effect. It is a straightforward implementation: there is no possibility to use e.g. shared memory (except for the reduction), but this is also beneficial from a user perspective.
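The RooFit reduction kernel itself is not shown on the slides; a minimal sketch of a block-wise reduction in OpenCL local memory (the "shared mem" above) could look like this, with names and work-group handling being ours:

#pragma OPENCL EXTENSION cl_khr_fp64 : enable   // doubles on OpenCL 1.x devices

// Sketch of a block-wise reduction (our illustration). 'scratch' is local
// (shared) memory, allocated by the host via
// clSetKernelArg(kernel, 2, workGroupSize * sizeof(double), NULL).
__kernel void reducePartialSums(__global const double *in,
                                __global double *partial,
                                __local double *scratch,
                                const int N)
{
    int gid = get_global_id(0);
    int lid = get_local_id(0);

    // Each work-item loads one element (0.0 past the end of the data).
    scratch[lid] = (gid < N) ? in[gid] : 0.0;
    barrier(CLK_LOCAL_MEM_FENCE);

    // Tree reduction inside the work-group: halve the active stride each step.
    for (int s = get_local_size(0) / 2; s > 0; s >>= 1) {
        if (lid < s)
            scratch[lid] += scratch[lid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // One partial sum per work-group; a second pass (or the host) adds them up.
    if (lid == 0)
        partial[get_group_id(0)] = scratch[0];
}

Since the work-group boundaries and the order of additions are fixed, the summation order never changes between runs, which is what makes such a reduction deterministic.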

Downsides

Introduces more verbose code when setting up the environment and e.g. calling kernels. It also duplicates code, since we now use an OpenCL compiler in addition to the C/C++ compiler. It may be necessary to program explicitly with vector types to exploit the performance of AMD cards (we have not tested this yet; a sketch follows below).

We have also tried OpenCL for CPUs. Our experiences: you have to use vector types to achieve vectorization, but even then AMD's OpenCL compiler does not vectorize transcendentals, for instance. To obtain performant code it is necessary to do more work per OpenCL thread, like doing work by hand instead of making a computer do it. (We talked to an Intel OpenCL expert today; he says this is not the case with Intel's implementation.) It would of course be nice to have one unified programming model for any device, but so far that seems like somewhat of a silver bullet.
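For illustration only, explicit vectorization with OpenCL vector types would look roughly like this: the same Gaussian as before, with each work-item processing four events through double4 (our untested sketch; assumes N is a multiple of 4):

#pragma OPENCL EXTENSION cl_khr_fp64 : enable   // doubles on OpenCL 1.x devices

// Untested sketch: the Gaussian kernel rewritten with double4 so that the
// compiler gets explicit 4-wide vectors. Assumes N is a multiple of 4.
__kernel void evaluatePdfGaussian4(const double mu, const double sigma,
                                   __global const double4 *data,
                                   __global double4 *results)
{
    int i = get_global_id(0);                 // one work-item per 4 events
    double4 temp = (data[i] - mu) / sigma;    // scalar mu/sigma broadcast component-wise
    temp *= temp;
    results[i] = exp(-0.5 * temp);            // built-in exp() works on vector types
}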

GPU Test environment

PC (host): desktop system
- CPU: Intel Nehalem @ 3.2 GHz: 4 cores, 8 hardware threads
- Linux 64-bit, Intel C++ compiler version 11.1

GPU: ASUS NVIDIA GTX470, PCIe 2.0
- Commodity card (for gamers)
- Architecture: GF100 (Fermi)
- Memory: 1280 MB GDDR5
- Core/memory clock: 607 MHz / 837 MHz
- Maximum number of threads per block: 1024
- Number of SMs: 14
- Power consumption: 200 W
- Price: ~$300 (July 2010)

Performance

Note that this is not a fair CPU vs. GPU comparison, because the two implementations use different algorithms.

Conclusion

The two algorithms (OpenMP and OpenCL) can coexist seamlessly in the application. We see up to a factor of 2.5x (in our tests) with respect to OpenMP with 8 SMT threads (i7 965 vs. GTX470); the CPU scalability compared to one core is ~4.6x. The GPU behaves better with more events, as expected. Load balancing seems ideal, since equally priced products perform comparably. It is clear that the reduction must be done on the GPU to achieve high GPU performance; this reduction is deterministic, which can be a requirement from minimization algorithms. We have measured the GPU idle percentage to be around 12% in ideal cases, which is not too bad, taking the algorithm into account.

Conclusion

Note that our target is running at the user level on the GPUs of small systems (laptops, desktops), i.e. with a small number of CPU cores and commodity GPU cards. A comparison with a Tesla GPU card would be more appropriate against a CPU server system, which is not our goal. The main limitation is the algorithm and the double precision, with a small additional limitation due to CPU-GPU communication. Soon the code will be released in the standard RooFit (discussion with the authors of the package is ongoing).

Current/future developments

Try the code on LHC analyses. Test vector types on AMD cards to see whether they have any performance effect. Run concurrently on the CPU with OpenMP and on the GPU with OpenCL (see the sketch below).
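A rough sketch of how such concurrent execution could be organized (the split fraction, evaluateGaussianOnCpu, and hostData are hypothetical; the real scheme is still to be developed):

// Hypothetical sketch of the planned concurrent execution: split the events,
// let the GPU work asynchronously on its share while OpenMP threads process
// the rest on the CPU, then combine the partial results.
size_t nGpu = (size_t)(gpuFraction * N);   // gpuFraction: tunable load-balancing knob
// (in practice nGpu would be rounded to a multiple of workGroupSize)

clEnqueueNDRangeKernel(queue, evaluatePdfGaussian, 1, NULL, &nGpu,
                       &workGroupSize, 0, NULL, NULL);        // asynchronous

double cpuSum = 0.0;
#pragma omp parallel for reduction(+:cpuSum)
for (size_t i = nGpu; i < (size_t)N; ++i)                     // CPU share of the events
    cpuSum += evaluateGaussianOnCpu(mu, sigma, hostData[i]);  // hypothetical helper

clFinish(queue);  // wait for the GPU partial result before combining the sums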