Unleashing the Performance Potential of GPUs for Atmospheric Dynamic Solvers

Haohuan Fu (haohuan@tsinghua.edu.cn)
High Performance Geo-Computing (HPGC) Group, Center for Earth System Science, Tsinghua University
April 5, 2016

Tsinghua HPGC Group
HPGC: high performance geo-computing (http://www.thuhpgc.org)
High performance computational solutions for geoscience applications:
  simulation-oriented research: providing highly efficient and highly scalable simulation applications (exploration geophysics, climate modeling)
  data-oriented research: data processing, data compression, and data mining
Combine optimizations from three different perspectives (Application, Algorithm, and Architecture), with a particular focus on new accelerator architectures

A Design Process That Combines Optimizations from Different Layers
[Diagram: Algorithm, Application, and Architecture combine into the best computational solution]

Tsinghua HPGC Group: a Quick Overview of Existing Projects
Application
  Exploration Geophysics
    GPU-based BEAM Migration (sponsored by Statoil)
    GPU-based ETE Forward Modeling (sponsored by BGP)
    Parallel Finite Element Electromagnetic Forward Modeling Method (sponsored by NSFC)
    FPGA-based RTM (sponsored by NSFC and IBM)
  Climate Modeling
    global-scale atmospheric simulation (800 Tflops Shallow Water Equation Solver on Tianhe-1A, 1.4 Pflops 3D Euler Equation Solver on Tianhe-2)
    FPGA-based atmospheric simulation (selected as one of the 27 significant papers in the 25 years of the FPL conference)
  Remote Sensing Data Processing
    data analysis and visualization (sponsored by Microsoft)
    deep learning based land cover mapping
Algorithm
  Parallel Stencil on Different HPC Architectures
  Parallel Sparse Matrix Solver
  Parallel Data Compression (PLZMA) (sponsored by ZTE)
  Hardware-Based Gaussian Mixture Model Clustering Engine: 517x speedup
Architecture
  multi-core/many-core (CPU, GPU, MIC)
  reconfigurable hardware (FPGA)

A Highly Scalable Framework for Atmospheric Modeling on Heterogeneous Supercomputers

The Gap between Software and Hardware
China's models (100T): pure CPU code, scaling to hundreds or thousands of cores
China's supercomputers (50P): heterogeneous systems with GPUs or MICs, millions of cores
The gap: millions of lines of legacy code; poor scalability; written for multi-core rather than many-core

Our Research Goals
A highly scalable framework that can efficiently utilize many-core accelerators
Automated tools to deal with the legacy code
100T~1P
China's models: pure CPU code, scaling to hundreds or thousands of cores
China's supercomputers: heterogeneous systems with GPUs or MICs, millions of cores

Example: Highly-Scalable Atmospheric Simulation Framework
Application: cloud resolving
Algorithm: cube-sphere grid or other grid; explicit, implicit, or semi-implicit method
Architecture: CPU, GPU, MIC, FPGA; C/C++, Fortran, MPI, CUDA, Java, ...
Team:
  Xue, Wei (Tsinghua University): computer science
  Yang, Chao (Institute of Software, CAS): computational mathematics
  Wang, Lanning (Beijing Normal University): climate modeling
  Fu, Haohuan (Tsinghua University): geo-computing
The Best Computational Solution

A Highly Scalable Framework for Atmospheric Modeling on Heterogeneous Supercomputers: Previous Efforts

Highly-Scalable Framework for Atmospheric Modeling
2012: solving 2D SWE using CPU + GPU
  800 Tflops on 40,000 CPU cores and 3750 GPUs
  For more details, please refer to our PPoPP 2013 paper: "A Peta-Scalable CPU-GPU Algorithm for Global Atmospheric Simulations", in Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), pp. 1-12, Shenzhen, 2013.

2013: 2D SWE on MIC and FPGA
  1.26 Pflops on 207,456 CPU cores and 25,932 MICs
  another 10x on FPGA
  For more details, please refer to our IPDPS 2014 paper: "Enabling and Scaling a Global Shallow-Water Atmospheric Model on Tianhe-2"; and our FPL 2013 paper: "Accelerating Solvers for Global Atmospheric Equations Through Mixed-Precision Data Flow Engine".

2014: 3D Euler on MIC
  1.7 Pflops on 147,456 CPU cores and 18,432 MICs
  For more details, please refer to our paper: "Ultra-scalable CPU-MIC Acceleration of Mesoscale Atmospheric Modeling on Tianhe-2", IEEE Transactions on Computers.

A Highly Scalable Framework for Atmospheric Modeling on Heterogeneous Supercomputers: 3D Euler on CPU+GPU

CPU-only Algorithm
Parallel version: multi-node & multi-core, MPI parallelism
[Figure labels: 25-point stencil; 3D channel]

CPU-only Algorithm
Parallel version: multi-node & multi-core, MPI parallelism
CPU algorithm per stencil sweep, for each subdomain:
  1. Update halo
  2. Calculate Euler stencil
     a. Compute local coordinates
     b. Compute fluxes
     c. Compute source terms
CPU workflow per stencil sweep: (1) halo updating, (2) stencil computation

Hybrid (CPU+GPU) Algorithm
Hybrid partition:
  GPU: inner stencil computation
  CPU: halo updating & outer stencil computation
CPU-GPU hybrid algorithm per stencil sweep, for each subdomain:
  GPU side: PETSc inner-part Euler stencil
  CPU side:
    1. Update halo
    2. Outer-part Euler stencil
  BARRIER
  CPU-GPU exchange
[Figure labels: 3D channel; 4 layers; outer part (CPU); inner part (GPU)]
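
The inner/outer partition above can be illustrated with a little index arithmetic. This is a minimal sketch, not the authors' code: it assumes that "4 layers" is the thickness of the CPU-side outer shell and that the shell is peeled off only along the two horizontal axes of the 3D channel. The function name `split_subdomain` and its parameters are hypothetical.

```python
from math import prod


def split_subdomain(shape, layers=4, axes=(0, 1)):
    """Split a 3D subdomain into an inner box (GPU) and an outer shell (CPU).

    shape  -- (nx, ny, nz) cell counts of one subdomain
    layers -- shell thickness in cells (assumed from the "4 layers" label)
    axes   -- axes along which the shell is peeled off (assumption: the two
              horizontal axes; the remaining axis stays whole)
    """
    inner = []
    for axis, n in enumerate(shape):
        if axis in axes:
            if n <= 2 * layers:
                raise ValueError("subdomain too small for the requested shell")
            inner.append((layers, n - layers))  # half-open index range
        else:
            inner.append((0, n))
    inner_cells = prod(hi - lo for lo, hi in inner)
    outer_cells = prod(shape) - inner_cells
    return inner, inner_cells, outer_cells


# Hypothetical 100 x 100 x 50 subdomain.
inner, gpu_cells, cpu_cells = split_subdomain((100, 100, 50))
print(inner, gpu_cells, cpu_cells)
```

For reasonably large subdomains the shell holds a small fraction of the cells, which is consistent with giving the bulk of the stencil work to the GPU while the CPU handles the cells that depend on halo data.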

Hybrid Algorithm Design
CPU-only workflow per stencil sweep: (1) CPU halo updating, (2) CPU stencil computation
Hybrid workflow per stencil sweep (steps 1 to 3):
  GPU: inner stencil computation
  CPU: G2C, halo updating, outer stencil computation, C2G
  Barrier
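
The overlap in the hybrid workflow can be sketched with a worker thread standing in for the asynchronous GPU work. This is only an illustration under stated assumptions: the four callables are hypothetical placeholders, the worker thread is a stand-in for a GPU kernel launch, and the exchange is placed after the barrier, following the per-sweep listing on the previous slide.

```python
from concurrent.futures import ThreadPoolExecutor


def hybrid_sweep(gpu_inner, cpu_update_halo, cpu_outer, exchange):
    """Run one stencil sweep with the GPU and CPU parts overlapped."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        gpu_task = pool.submit(gpu_inner)  # stand-in for an async GPU launch
        cpu_update_halo()                  # CPU step 1: update halo
        cpu_outer()                        # CPU step 2: outer-part stencil
        gpu_task.result()                  # barrier: wait for the inner part
    exchange()                             # CPU-GPU boundary-layer exchange


# Hypothetical usage: record the order in which the steps run.
log = []
hybrid_sweep(
    gpu_inner=lambda: log.append("gpu inner stencil"),
    cpu_update_halo=lambda: log.append("cpu halo update"),
    cpu_outer=lambda: log.append("cpu outer stencil"),
    exchange=lambda: log.append("cpu-gpu exchange"),
)
print(log)
```

The position of the GPU entry in the log can vary from run to run, which is the point: the inner stencil does not depend on the halo update, so the two sides only need to meet at the barrier.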

A Highly Scalable Framework for Atmospheric Modeling on Heterogeneous Supercomputers: GPU-related Optimizations

Optimizations
GPU Opt: Pinned Memory; SMEM/L1; AoS -> SoA; Register Adjustment; Kernel Splitting
Other Methods: Customizable Data Cache; Inner-thread Rescheduling
CPU Opt: OpenMP; SIMD Vectorization; Cache blocking

Optimizations: Pinned Memory
[Diagram: transfer from virtual memory through physical memory to the GPU (time T1) versus transfer from physical memory directly to the GPU (time T2)]
Theoretical: T2 = 1/3 * T1
In practice: T2 < 1/2 * T1

Optimizations: SMEM/L1
Compiler option: -Xptxas -dlcm=ca

Optimizations: Register Adjustment
Streaming multiprocessor: 64K registers, 2048 threads
Rt: registers per thread
Occupancy = (64*1024) / (2048*Rt)
With 256 registers per thread (Rt = 256): 1 block per SM, occupancy = (64*1024) / (2048*256) = 12.5%

Optimizations: Register Adjustment
Streaming multiprocessor: 64K registers, 2048 threads
Occupancy = (64*1024) / (2048*Rt)
With 64 registers per thread (Rt = 64): 4 blocks per SM, occupancy = (64*1024) / (2048*64) = 50%
Compiler option: -maxrregcount=64
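
The occupancy arithmetic on these two slides can be checked with a few lines of Python. This is a simplified sketch of the slide's formula only: it considers the register limit alone and ignores block-size granularity, shared memory, and other limits that a full occupancy calculation would include. The function name is hypothetical.

```python
def register_limited_occupancy(regs_per_thread,
                               regs_per_sm=64 * 1024,
                               max_threads_per_sm=2048):
    """Occupancy = regs_per_sm / (max_threads_per_sm * Rt), capped at 100%."""
    if regs_per_thread <= 0:
        raise ValueError("regs_per_thread must be positive")
    return min(1.0, regs_per_sm / (max_threads_per_sm * regs_per_thread))


for rt in (256, 64, 32):
    print(f"Rt = {rt:3d}: occupancy = {register_limited_occupancy(rt):.1%}")
```

Under this model, capping a kernel at 64 registers per thread quadruples the number of resident threads compared with 256 registers per thread, at the possible cost of register spilling, which the formula does not capture.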

A Highly Scalable Framework for Atmospheric Modeling on Heterogeneous Supercomputers: Results

Experimental Result
CPU Opt: OpenMP; SIMD Vectorization; Cache blocking
GPU Opt: Pinned Memory; SMEM/L1; AoS -> SoA; Kernel Splitting; Register Adjustment
Other Methods: Customized Data Cache; Inner-thread Rescheduling
[Chart: run time falls from 19.7 s to 5.91 s, 1.80 s, and 0.92 s across the optimization stages, with annotated reductions of 70%, 69%, and 49%; 31.64x speedup over a 12-core CPU (E5-2697 v2)]
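
The percentages on this slide appear to be the run-time reductions between consecutive timings. The sketch below checks that reading; pairing each percentage with a consecutive pair of timings is an assumption, since the transcript does not preserve the chart layout.

```python
# Timings (seconds) as they appear on the slide, in order.
times = [19.7, 5.91, 1.80, 0.92]


def stage_reductions(times):
    """Fractional run-time reduction between each pair of consecutive timings."""
    return [1.0 - after / before for before, after in zip(times, times[1:])]


for before, after, r in zip(times, times[1:], stage_reductions(times)):
    print(f"{before:5.2f} s -> {after:5.2f} s: {r:.1%} less time")

print(f"first-to-last ratio: {times[0] / times[-1]:.1f}x")
```

The computed reductions land within one percentage point of the 70%, 69%, and 49% annotations. The first-to-last ratio is about 21x rather than 31.64x, so the quoted speedup is presumably measured against a different baseline than the 19.7 s timing; the slide text does not say which.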

Experimental Result
First-round optimizations: five commonly used optimizations; 80 GFlops achieved on a single Tesla K40
Customized optimizations: a customized cache mechanism and inter-thread rescheduling; 146 GFlops achieved on a single Tesla K40
Experimental results: 451 GFlops on a single Tesla K80, a 31.64x speedup over a 12-core CPU (E5-2697 v2) and 16.87% of the Tesla K80 peak
Weak scaling result: 98.7% on 32 nodes

Acknowledgement

Thank You! haohuan@tsinghua.edu.cn