The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System



The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System
Qingyu Meng, Alan Humphrey, Martin Berzins
Thanks to: John Schmidt and J. Davison de St. Germain (SCI Institute); Justin Luitjens and Steve Parker (NVIDIA)
Funding: DOE for the CSAFE project (97-10), DOE NETL, DOE NNSA; NSF via SDCI and PetaApps

Current and Past Uintah Applications
CPU pins, sandstone compaction, explosions, angiogenesis, shaped charges, fires, gas mixing, foam compaction, coal boiler, virtual soldier

Uintah Data Parallelism
Uintah uses both data and task parallelism.
- Grid and patches (physical domain): structured grid (flows) + particle system (solids)
- Patch-based domain decomposition for parallel processing
- Adaptive mesh refinement
- Dynamic load balancing: profiling + forecasting model, parallel space-filling curves, data migration

Uintah Task Parallelism and the Uintah Task Graph
- Patch-based domain decomposition: tasks run on patches
- The user defines Uintah tasks: serial code (callback functions) plus input and output variables
- Compilation is distributed: each node only creates tasks on its local patches
- The framework analyzes task dependencies and creates the task graph, with automatic MPI message generation
- Dynamic task execution (data-driven, overlapping communication with computation)

Uintah Runtime System: How Uintah Runs Tasks
Memory manager: the Uintah Data Warehouse (DW)
- Variable dictionary: a hashed map from (variable name, patch ID, material ID) keys to memory
- Provides interfaces for tasks to allocate variables, put variables into the DW, and get variables from the DW
- Automatic scrubbing (garbage collection)
- Checkpointing and restart (data archiver)
Task manager: the Uintah schedulers
- Decide when and where to run tasks
- Decide when to process MPI

Thread/MPI Scheduler (De-centralized)
[Figure: worker threads share the task graph, the task queues, and the Data Warehouse; threads PUT/GET variables, post MPI sends and receives to the network, and move tasks from satisfied to ready to running as their MPI data arrives.]
- Memory saving: reduced ghost copies and metadata
- Work stealing inside the node: all threads pull tasks directly from the task queues; no on-node MPI
- Full overlapping: all threads process MPI sends/receives and execute tasks
- Lock-free data structures (avoid locking overhead)

Scalability Improvement
[Figure: strong and weak scaling of AMR MPMICE, mean time per timestep (s) vs. core count; the original dynamic MPI-only scheduler runs from 12 to 98K cores, while the de-centralized MPI/thread hybrid scheduler (with the lock-free Data Warehouse) runs from 16 to 256K cores.]
- The hybrid scheduler achieves much better CPU scalability: 95% weak scaling efficiency on 256K cores (Jaguar XK6)
- Next step: use GPUs to accelerate Uintah components

First Step to GPU
- Profile and find the most time-consuming task (profile generated by the Google profiling tool, visualized with KCachegrind)
- Port the task's serial CPU code to the GPU: call the CUDA API inside the task code; the framework remains unaware of the GPU(s)
- Result: ~2x speedup (stencil code); PCIe latency must be hidden

Uintah GPU Task Management
The framework manages all CUDA data movement (not done inside task code):
- Uses the asynchronous CUDA API
- Automatically generates a CUDA stream for each dependency
- Concurrently executes kernels and memcopies; prefetches data before the task kernel executes
- Multi-GPU support
- Two callback functions per task, a CPU version and a GPU version: compatible with non-GPU nodes
Stages of a GPU task in the Uintah runtime:
1. Pin existing host memory with cudaHostRegister() (page-locked buffer for hostRequires)
2. cudaMemcpyAsync (H2D) stages the devRequires data
3-4. Computation: the callback is executed here (kernel runs, producing devComputes)
5. cudaMemcpyAsync (D2H) when the component requests the result back on the host (hostComputes)
6. Free the pinned host memory
[Figure: with page-locked memory, the data transfers and kernel executions of successive GPU tasks overlap, unlike with normal pageable memory.]
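The six stages above can be sketched for a single dependency in plain CUDA. This is not Uintah code, and the kernel and variable names are invented for illustration; it only shows the staging pattern the slide describes: pin existing host memory, then put the H2D copy, the kernel, and the D2H copy on one stream so the runtime can overlap them with other tasks' streams.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Stand-in for a task's devComputes work.
__global__ void compute(double* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0;
}

int main() {
    const int n = 1 << 20;
    double* h = new double[n];                     // existing host memory
    for (int i = 0; i < n; ++i) h[i] = 1.0;

    cudaHostRegister(h, n * sizeof(double), 0);    // 1. pin host memory
    double* d;
    cudaMalloc(&d, n * sizeof(double));
    cudaStream_t s;
    cudaStreamCreate(&s);                          // one stream per dependency

    cudaMemcpyAsync(d, h, n * sizeof(double),
                    cudaMemcpyHostToDevice, s);    // 2. async H2D prefetch
    compute<<<(n + 255) / 256, 256, 0, s>>>(d, n); // 3-4. kernel on same stream
    cudaMemcpyAsync(h, d, n * sizeof(double),
                    cudaMemcpyDeviceToHost, s);    // 5. async D2H copy
    cudaStreamSynchronize(s);                      // 6. result back on host

    printf("%f\n", h[0]);                          // expect 2.0
    cudaStreamDestroy(s);
    cudaFree(d);
    cudaHostUnregister(h);                         // unpin before freeing
    delete[] h;
}
```

Because every call after the pin is asynchronous with respect to the host, the framework can enqueue many such task pipelines on different streams and let copies and kernels overlap.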

Multistage Task Queues Architecture
Fully overlaps computation with PCIe transfers and MPI communication.

Unified Heterogeneous Scheduler & Runtime
[Figure: CPU threads and GPU kernels share the task graph and scheduler objects in host memory. Internal ready tasks and GPU-enabled ready tasks flow through separate CPU and GPU task queues; running tasks PUT/GET variables in the host Data Warehouse or the GPU Data Warehouse, with H2D/D2H streams and stream events managing device data, and MPI sends/receives connecting nodes over the network.]

GPU RMCRT Speedup Results (Multi-Node): All CPU Cores vs. a Single GPU
(Keeneland Initial Delivery System)

Nodes   CPU (sec)   CPU+GPU (sec)   Speedup
1       319.769     8.49714         38x
2       163.773     7.68571         21x
4       70.0943     4.80571         14x
8       39.5686     2.62857         15x
16      20.2414     3.52857         6x
32      18.8043     2.73857         6x
64      9.11571     2.95714         3x

* Per node: (2) Intel Xeon 6-core X5660 (Westmere) CPUs @ 2.8 GHz; (1) Nvidia M2090 GPU
* The GPU implementation quickly runs out of work, and scaling breaks down

Scaling Comparisons
Uintah strong scaling results when using MPI-only, multi-threaded MPI, and multi-threaded MPI with GPU.
Two problems:
- AMR MPMICE: nearest-neighbor communication, 3.62 billion particles
- GPU-enabled ray tracer: all-to-all communication, 100 rays per cell, 128³ cells
Uintah scaling overview:
- MPI-only AMR MPMICE: N = 6144 CPU cores; largest = 98K CPU cores
- Thread/MPI AMR MPMICE: N = 8192 CPU cores; largest = 256K CPU cores
- Thread/MPI ray tracing: N = 16 CPU cores; largest = 1024 CPU cores
- Thread/MPI/GPU ray tracing: N = 16 CPU cores and 1 GPU; largest = 1024 CPU cores and 64 GPUs

Scheduler Infrastructure Future Work
- GPU affinity for multi-socket/multi-GPU nodes
- Support Intel MIC (Xeon Phi) offload mode
- Use PETSc's GPU interface
- Mechanism to dynamically decide whether to run the GPU or CPU version of a task
- Optimize GPU code for Nvidia Kepler and CUDA 5.0: dynamic parallelism, GPU sub-scheduler

Questions?
Alstom Clean Coal Boiler simulation: Monte Carlo ray tracing on GPU, flow simulation on CPU
Software homepage: http://www.uintah.utah.edu/

Using GPUs in Energy Applications: ARCHES Combustion Component
Alstom Clean Coal Boiler problem: need to approximate the radiative transfer equation.
- Reverse Monte Carlo Ray Tracing (RMCRT)
- Rays are mutually exclusive, so they can be traced simultaneously: ideal for the SIMD parallelism of a GPU
- Offload ray tracing and RNG to the GPU(s); CPU cores can perform other computation

Performance Comparisons: Master-Slave Model vs. Unified

Execution times, CPU only (s):
#Cores        2      4      8      16     32
Master-Slave  57.28  20.72  9.40   4.81   2.95
Unified       29.79  15.70  8.23   4.54   2.78

Execution times with GPU (s):
#Cores        2      4      6      8      10     12
Master-Slave  4.55   4.09   3.95   3.68   3.64   3.34
Unified       3.82   3.52   3.09   2.90   2.50   2.09

CPU problem: combined MPMICE problem using AMR, run on a single Cray XE6 node with two 16-core AMD Opteron 6200 Series (Interlagos) processors @ 2.6 GHz.
GPU problem: Reverse Monte Carlo Ray Tracer, run on a single 12-core heterogeneous node: two 6-core Intel Xeon X5650 (Westmere) processors @ 2.67 GHz, (2) Nvidia Tesla C2070 GPUs, and (1) Nvidia GeForce GTX 570 GPU.