Accelerating classical MD for multi-core CPUs and GPUs




Accelerating classical MD for multi-core CPUs and GPUs
Dr. Axel Kohlmeyer
Associate Dean for Scientific Computing
College of Science and Technology, Temple University, Philadelphia
http://sites.google.com/site/akohlmey/  a.kohlmeyer@temple.edu
LAMMPS Users and Developers Workshop and Symposium, March 24th-28th, 2014

Standard LAMMPS Parallelization
- MPI based (with an MPI emulator for serial execution)
- Uses domain decomposition with 1 domain per MPI task (= processor)
- Each MPI task looks after the atoms in its domain
- Atoms move from MPI task to MPI task as they travel through the system
- Assumes the same amount of work (force computations) in each domain

Why Bother Adding OpenMP?
1. Why not do it?
   a) LAMMPS is already very parallel
   b) Even more run-time settings to optimize
   c) OpenMP is often less effective than MPI (for MD)
2. Why do it anyway?
   a) On multi-core machines (e.g. Cray XT5) LAMMPS can run faster with MPI when some CPU cores are left idle
   b) Parallelization over particles, not domains
   c) PPPM has scaling limitations; at high node counts it would be better to run it only on a subset of tasks

OpenMP Parallelization
- OpenMP is directive based => well-written code works with or without OpenMP, and OpenMP can be added incrementally
- OpenMP only works in shared memory => multi-core processors are now ubiquitous
- OpenMP hides the calls to a threads library => less flexible, more overhead, but less effort
- Caution: need to worry about race conditions, memory corruption, false sharing, and Amdahl's law
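The "works with or without OpenMP" point can be illustrated with a minimal sketch (not from the slides, not LAMMPS code): the same function compiles serially or threaded from one code path, since a non-OpenMP compiler simply ignores the pragma.

```c
/* Sum of squares: the pragma parallelizes the loop when OpenMP is
   enabled and is skipped otherwise, so the serial and threaded builds
   share one code path. The reduction clause gives every thread a
   private copy of s and combines the copies at the end (no race). */
double sum_squares(const double *x, long n)
{
    double s = 0.0;
#if defined(_OPENMP)
#pragma omp parallel for default(shared) reduction(+:s)
#endif
    for (long i = 0; i < n; ++i)
        s += x[i] * x[i];
    return s;
}
```

Compiling this with -fopenmp (GNU) enables the threading; without it, the exact same source builds as plain serial C.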

How to Add OpenMP to LAMMPS
- LAMMPS is very modular: just add new classes derived from the non-threaded implementations
- Pairwise interactions (consume the most time): the nested i,j loop over neighbors can be parallelized; each thread processes different i atoms
- Neighbor list build (binning still serial): nested i,j loop over atoms and neighboring bins
- Dihedrals and other bonded interactions
- Replace selected function(s) in the derived class
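A hypothetical, self-contained sketch of the "each thread processes different i atoms" strategy (illustrative only; the names numneigh/neighbors are toy stand-ins, not the real LAMMPS neighbor-list classes): with a full neighbor list, every pair appears twice, so each thread writes only the force of its own i atoms and no race condition arises.

```c
#define NATOMS 3

/* Toy full neighbor list for 3 atoms on a line: every pair is listed
   from both sides, so f[i] is written only by the thread owning atom i. */
static const int numneigh[NATOMS]     = {2, 2, 2};
static const int neighbors[NATOMS][2] = {{1, 2}, {0, 2}, {0, 1}};

void compute_forces(const double x[NATOMS], double f[NATOMS])
{
    /* threads take different i atoms; only f[i] is written per iteration */
#if defined(_OPENMP)
#pragma omp parallel for default(shared)
#endif
    for (int i = 0; i < NATOMS; ++i) {
        double fi = 0.0;
        for (int jj = 0; jj < numneigh[i]; ++jj) {
            int j = neighbors[i][jj];
            fi += 1.0 / (x[i] - x[j]);   /* toy 1/r pair force in 1D */
        }
        f[i] = fi;
    }
}
```

The price of the full list is that every pair interaction is computed twice, which is exactly the "no Newton's 3rd law" trade-off discussed later in the talk.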

Threading Class Relations
- PairLJ: serial implementation, all non-threaded code
- ThrOMP: thread-safe utility functions, reduction of per-thread forces
- PairLJOMP: derived from PairLJ and ThrOMP; replaces ::compute() with a threaded version; gets access to its ThrData instance from FixOMP
- ThrData: per-thread accumulators, one instance per thread
- FixOMP: regularly called during the MD loop; determines when to reduce forces; manages the ThrData instances; toggles thread-related features

Naive OpenMP LJ Kernel

    #if defined(_OPENMP)
    #pragma omp parallel for default(shared) \
        private(i) reduction(+:epot)
    #endif
    for (i = 0; i < (sys->natoms) - 1; ++i) {
        double rx1 = sys->rx[i];
        double ry1 = sys->ry[i];
        double rz1 = sys->rz[i];
        [...]
    #if defined(_OPENMP)
    #pragma omp critical
    #endif
        {
            sys->fx[i] += rx*ffac;
            sys->fy[i] += ry*ffac;
            sys->fz[i] += rz*ffac;
            sys->fx[j] -= rx*ffac;
            sys->fy[j] -= ry*ffac;
            sys->fz[j] -= rz*ffac;
        }
    }

- Each thread works on different values of i
- Race condition: i is unique for each thread, but not j; some j may be an i of another thread => multiple threads update the same location
- The critical directive lets only one thread execute the update block at a time
- Timings (108 atoms): serial: 4.0s, 1 thread: 4.2s, 2 threads: 7.1s, 4 threads: 7.7s, 8 threads: 8.6s

Alternatives to omp critical
- Use omp atomic to protect each force addition => requires hardware support (modern x86)
  1 thread: 6.3s, 2 threads: 5.0s, 4 threads: 4.4s, 8 threads: 4.2s
  => faster than omp critical for multiple threads, but still slower than the serial code (4.0s)
- Don't use Newton's 3rd law => no race condition
  1 thread: 6.5s, 2 threads: 3.7s, 4 threads: 2.3s, 8 threads: 2.1s
  => better scaling, but 2 threads ~= serial speed
  => this is what is done on GPUs (many threads)
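The omp atomic variant can be sketched in a self-contained toy form (again 1D and illustrative only, not the actual kernel): with Newton's 3rd law the same pair updates both f[i] and f[j], so each scalar update is protected individually rather than serializing the whole block with a critical section.

```c
/* Toy atomic-protected pair loop with Newton's 3rd law.
   f must be zero-initialized by the caller. */
void pair_force_atomic(const double *x, double *f, int n)
{
#if defined(_OPENMP)
#pragma omp parallel for default(shared)
#endif
    for (int i = 0; i < n - 1; ++i) {
        for (int j = i + 1; j < n; ++j) {
            double ffac = 1.0 / (x[i] - x[j]);   /* toy 1/r pair force */
            /* each atomic protects one memory update; much cheaper than
               a critical section, but needs hardware atomics */
#if defined(_OPENMP)
#pragma omp atomic
#endif
            f[i] += ffac;
#if defined(_OPENMP)
#pragma omp atomic
#endif
            f[j] -= ffac;
        }
    }
}
```

Because each pair is visited once and applied to both atoms, the forces still sum to zero, unlike the full-neighbor-list variant, which gives up Newton's 3rd law to avoid the atomics entirely.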

MPI-like Approach with OpenMP

    #if defined(_OPENMP)
    #pragma omp parallel reduction(+:epot)
    #endif
    {
        double *fx, *fy, *fz;

    #if defined(_OPENMP)
        /* the thread number is like an MPI rank */
        int tid = omp_get_thread_num();
    #else
        int tid = 0;
    #endif

        /* sys->fx holds storage for one full fx array per thread
           => the race condition is avoided */
        fx = sys->fx + (tid * sys->natoms); azzero(fx, sys->natoms);
        fy = sys->fy + (tid * sys->natoms); azzero(fy, sys->natoms);
        fz = sys->fz + (tid * sys->natoms); azzero(fz, sys->natoms);

        for (int i = 0; i < (sys->natoms - 1); i += sys->nthreads) {
            int ii = i + tid;
            if (ii >= (sys->natoms - 1)) break;
            rx1 = sys->rx[ii];
            ry1 = sys->ry[ii];
            rz1 = sys->rz[ii];
            [...]

MPI-like Approach with OpenMP (2)

We need to write our own force reduction:

    #if defined(_OPENMP)
        /* need to make certain all threads are done computing forces */
    #pragma omp barrier
    #endif

        /* use the threads to parallelize the reduction:
           each thread sums one block of atom indices */
        i = 1 + (sys->natoms / sys->nthreads);
        fromidx = tid * i;
        toidx = fromidx + i;
        if (toidx > sys->natoms) toidx = sys->natoms;
        for (i = 1; i < sys->nthreads; ++i) {
            int offs = i * sys->natoms;
            for (int j = fromidx; j < toidx; ++j) {
                sys->fx[j] += sys->fx[offs + j];
                sys->fy[j] += sys->fy[offs + j];
                sys->fz[j] += sys->fz[offs + j];
            }
        }
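The whole per-thread-force-array scheme can be condensed into one self-contained sketch (toy 1D forces, hypothetical names; only azzero and the round-robin/reduction structure mirror the slides): each thread accumulates into its own slice of a (nthreads x natoms) buffer, then the slices are reduced into the first slice with the threads splitting the atom indices among themselves.

```c
#include <string.h>
#if defined(_OPENMP)
#include <omp.h>
#endif

static void azzero(double *d, int n) { memset(d, 0, n * sizeof(double)); }

/* f must have room for nthreads * natoms doubles;
   the reduced result ends up in f[0 .. natoms-1]. */
void compute_and_reduce(const double *x, double *f, int natoms, int nthreads)
{
#if defined(_OPENMP)
#pragma omp parallel num_threads(nthreads)
#endif
    {
#if defined(_OPENMP)
        int tid = omp_get_thread_num();
#else
        int tid = 0;
        nthreads = 1;               /* serial fallback */
#endif
        double *fx = f + (long)tid * natoms;
        azzero(fx, natoms);

        /* round-robin distribution of i atoms over the threads;
           each thread writes only its own force slice => no race */
        for (int ii = tid; ii < natoms - 1; ii += nthreads) {
            for (int j = ii + 1; j < natoms; ++j) {
                double ffac = 1.0 / (x[ii] - x[j]);   /* toy pair force */
                fx[ii] += ffac;
                fx[j]  -= ffac;
            }
        }
#if defined(_OPENMP)
#pragma omp barrier   /* all threads must finish before reducing */
#endif
        /* threaded reduction: each thread sums one chunk of indices */
        int chunk = 1 + natoms / nthreads;
        int from  = tid * chunk;
        int to    = from + chunk;
        if (to > natoms) to = natoms;
        for (int t = 1; t < nthreads; ++t)
            for (int j = from; j < to; ++j)
                f[j] += f[(long)t * natoms + j];
    }
}
```

The memory cost is one full force array per thread, which is exactly why this approach works well only for a modest number of threads.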

OpenMP Timings Comparison
- omp critical: 1 thread: 4.2s, 2 threads: 7.1s, 4 threads: 7.7s, 8 threads: 8.6s
- omp atomic: 1 thread: 6.3s, 2 threads: 5.0s, 4 threads: 4.4s, 8 threads: 4.2s
- omp parallel region (MPI-like): 1 thread: 4.0s, 2 threads: 2.5s, 4 threads: 2.2s, 8 threads: 2.5s
- no Newton's 3rd law: 1 thread: 6.5s, 2 threads: 3.7s, 4 threads: 2.3s, 8 threads: 2.1s
=> the omp parallel variant is best for few threads; the no-Newton's-3rd variant is better for more threads
=> the cost of the force reduction grows with the number of threads

[Chart: 2x Intel Xeon 2.66 GHz (Harpertown) w/ DDR InfiniBand. CHARMM benchmark (lj/charmm/coul/long + pppm), 32,000 atoms. Left panel: time for 1000 MD steps (s) vs. number of nodes; right panel: parallel efficiency vs. number of nodes. Curves: 8 MPI/node, 4 MPI/node, 4 MPI + 2 OpenMP/node, 2 MPI + 4 OpenMP/node, 1 MPI + 8 OpenMP/node.]


Running Big
- Vesicle fusion study: impact of the lipid ratio in a binary mixture (cg/cmm/coul/long)
- Experimental size => 4M CG beads for 1 vesicle and solvent
- 30,000,000 CPU-hour INCITE project

Strong Scaling (Cray XT5)
[Chart: 1-vesicle CG system, 3,862,854 CG beads. Time per MD step (0.36s down to 0.04s) vs. number of nodes (27 to 1878) for 12 MPI/1 OpenMP, 6 MPI/2 OpenMP, 4 MPI/3 OpenMP, and 2 MPI/6 OpenMP per node.]

Strong Scaling (2) (Cray XT5)
[Chart: 8-vesicle CG system, 30,902,832 CG beads. Time per MD step (0.61s down to 0.1s) vs. number of nodes (256 to 6231) for the same MPI/OpenMP combinations.]

The Curse of the k-space (1)
[Chart: Rhodopsin benchmark, 860k atoms, 64 nodes, Cray XT5. Time in seconds, broken down into Pair, Bond, Kspace, Comm, Neighbor, and Other, for 128 to 768 PEs using 4, 6, or 12 MPI tasks/node, 2 MPI + 6 OpenMP/node, and 1 MPI/node + OpenMP.]

The Curse of the k-space (2)
[Chart: Rhodopsin benchmark, 860k atoms, 128 nodes, Cray XT5. Same time breakdown (Pair, Bond, Kspace, Comm, Neighbor, Other) for 256 to 1536 PEs with the same MPI/OpenMP combinations.]

The Curse of the k-space (3)
[Chart: Rhodopsin benchmark, 860k atoms, 512 nodes, Cray XT5. Same time breakdown (Pair, Bond, Kspace, Comm, Neighbor, Other) for 1024 to 6144 PEs with the same MPI/OpenMP combinations.]

Additional Improvements
- OpenMP threading added to the charge density accumulation and force application in PPPM
- Force reduction only done by the last /omp style
- Integrator style verlet/split, contributed by the Voth group, runs k-space on a separate partition (compatible with the OpenMP version of PPPM)
- Added threading to selected fixes, e.g. charge equilibration for the COMB many-body potential
- Added threading to the fix nve/sphere integrator

Current GPU Support in LAMMPS
- Multiple developments from different groups, converged into two efforts with two philosophies
- GPU package (minimalistic): pair styles, neighbor lists, and k-space (optional); downloads coordinates, retrieves forces; runs asynchronously to the bonded (and k-space) computations
- USER-CUDA package (see next talk): replaces all classes that touch atom data; data transfer between host and GPU only as needed

Special Features of the GPU Package
- Can be compiled for CUDA or OpenCL thanks to the Geryon preprocessor macros
- Can attach multiple MPI tasks to one GPU for improved GPU utilization (up to 4x oversubscription on Fermi, up to 15x on Kepler)
- Uses a fix to manage GPUs and compute kernel dispatch; styles dispatch kernels asynchronously, and the fix retrieves the forces after all other force computations are completed
- Tuned for good scaling with fewer atoms per GPU

1x GPU Performance in LAMMPS
[Chart: bulk water, LJ + long-range electrostatics, PPPM on the GPU; 5,376 water / 5000 steps and 21,504 water / 1000 steps. Time in seconds for 8 cores at 2.8 GHz versus GTX 480 (GeForce), C2050 (Tesla), and ATI V8800 (FirePro), each in single (sp), mixed (mp), and double (dp) precision.]

Multiple GPUs per Node
[Chart: bulk water, LJ + long-range electrostatics, PPPM on the GPU; 5,376 water / 5000 steps and 21,504 water / 1000 steps. Time in seconds for 8 cores at 2.8 GHz versus 1x/2x/4x C2050 (Tesla) and 1x/2x/4x V8800 (FirePro), mixed precision.]

Comments on GPU Acceleration
- Mixed precision (force computation in single, force accumulation in double precision) is a good compromise: little overhead, good accuracy on forces; stress/pressure less so
- GPU acceleration is larger for models that require more computation in the force kernel
- Acceleration drops with a lower number of atoms per GPU => limited strong scaling on big iron
- The amount of acceleration depends on both the host and the GPU
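The mixed-precision trade-off can be demonstrated with a toy example (not LAMMPS code): many small single-precision contributions are silently lost in a single-precision accumulator once the running sum grows large, while a double-precision accumulator keeps them, even though the individual terms are still only computed in single precision.

```c
/* Accumulate n copies of a single-precision term. A float accumulator
   saturates at 2^24 = 16777216 for term = 1.0f, because 16777217 is
   not representable in single precision; the double accumulator,
   fed the very same single-precision terms, stays exact. */
double sum_in_double(float term, long n)
{
    double s = 0.0;                 /* double-precision accumulator */
    for (long i = 0; i < n; ++i)
        s += term;                  /* terms computed in single precision */
    return s;
}

float sum_in_float(float term, long n)
{
    float s = 0.0f;                 /* single-precision accumulator */
    for (long i = 0; i < n; ++i)
        s += term;
    return s;
}
```

This is the same effect at work when millions of per-pair force contributions, each adequately accurate in single precision, are summed into per-atom forces and the virial.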

Installation of USER-OMP and GPU
- USER-OMP package:
  - make yes-user-omp to install the sources
  - add -fopenmp (GNU) or -openmp (Intel) to the CC and LINK definitions in your makefile to enable OpenMP
  - compilation without OpenMP => similar to the OPT package
- GPU package:
  - compile the library in lib/gpu for CUDA or OpenCL
  - make yes-gpu to install the style sources, which are wrappers for the GPU library
  - tweak lib/gpu/makefile.lammps.??? as needed

Using Accelerated Code
- All accelerated styles are optional and need to be activated in the input script or from the command line
- Naming convention: lj/cut -> lj/cut/omp or lj/cut/gpu
- From the command line: -sf omp or -sf gpu
- Inside a script: suffix omp or suffix gpu, plus suffix on or suffix off
- Use the package omp/gpu command to adjust acceleration settings and the selection of GPUs
- The -sf command line flag implies default package settings

Conclusions and Outlook: OpenMP
- OpenMP+MPI is almost always a win, especially at large node counts (=> capability computing)
- USER-OMP also contains serial optimizations and is thus useful even without OpenMP compiled in
- Minimal changes to the LAMMPS core code
- USER-OMP is only a transitional implementation, since it is efficient only for a small number of threads
- A longer-term solution also needs to consider vectorization, and thus should be more GPU-like and benefit from a different data layout (see next talk)