Towards Large-Scale Molecular Dynamics Simulations on Graphics Processors

Similar documents

LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR

Introduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1

GPU Parallel Computing Architecture and CUDA Programming Model

ultra fast SOM using CUDA

GPU Hardware and Programming Models. Jeremy Appleyard, September 2015

Introduction to GPU hardware and to CUDA

Graphics Cards and Graphics Processing Units. Ben Johnstone Russ Martin November 15, 2011

Introduction to GPU Programming Languages

GPU System Architecture. Alan Gray EPCC The University of Edinburgh

GPUs for Scientific Computing

Introduction to GPGPU. Tiziano Diamanti

CUDA programming on NVIDIA GPUs

NVIDIA CUDA Software and GPU Parallel Computing Architecture. David B. Kirk, Chief Scientist

Next Generation GPU Architecture Code-named Fermi

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

Parallel Programming Survey

GPGPU Computing. Yong Cao

Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming

NVIDIA GeForce GTX 580 GPU Datasheet

ST810 Advanced Computing

Interactive Level-Set Deformation On the GPU

The Evolution of Computer Graphics. SVP, Content & Technology, NVIDIA

OpenCL Optimization. San Jose 10/2/2009 Peng Wang, NVIDIA

Hardware-Aware Analysis and. Presentation Date: Sep 15 th 2009 Chrissie C. Cui

Introduction GPU Hardware GPU Computing Today GPU Computing Example Outlook Summary. GPU Computing. Numerical Simulation - from Models to Software

A GPU COMPUTING PLATFORM (SAGA) AND A CFD CODE ON GPU FOR AEROSPACE APPLICATIONS

Evaluation of CUDA Fortran for the CFD code Strukti

Clustering Billions of Data Points Using GPUs

~ Greetings from WSU CAPPLab ~

Home Exam 3: Distributed Video Encoding using Dolphin PCI Express Networks. October 20 th 2015

Seeking Opportunities for Hardware Acceleration in Big Data Analytics

GPU Computing with CUDA Lecture 2 - CUDA Memories. Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile

GPGPU accelerated Computational Fluid Dynamics

Mixed Precision Iterative Refinement Methods Energy Efficiency on Hybrid Hardware Platforms

OpenCL Programming for the CUDA Architecture. Version 2.3

Accelerating CFD using OpenFOAM with GPUs

Interactive Level-Set Segmentation on the GPU

Molecular Dynamics Simulations

Programming models for heterogeneous computing. Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga

Texture Cache Approximation on GPUs

Real-time Visual Tracker by Stream Processing

Stream Processing on GPUs Using Distributed Multimedia Middleware

Graphic Processing Units: a possible answer to High Performance Computing?

NVIDIA Tesla. GPU Computing Technical Brief. Version /24/07

HPC with Multicore and GPUs

ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU

GPU Accelerated Monte Carlo Simulations and Time Series Analysis

Parallel Computing with MATLAB

Turbomachinery CFD on many-core platforms experiences and strategies

Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) mzahran@cs.nyu.edu

Case Study on Productivity and Performance of GPGPUs

Enhancing Cloud-based Servers by GPU/CPU Virtualization Management

Hardware Acceleration for CST MICROWAVE STUDIO

GPU File System Encryption Kartik Kulkarni and Eugene Linkov

Parallel Firewalls on General-Purpose Graphics Processing Units

1 Bull, 2011 Bull Extreme Computing

Optimizing a 3D-FWT code in a cluster of CPUs+GPUs

CFD Implementation with In-Socket FPGA Accelerators

QuickSpecs. NVIDIA Quadro M GB Graphics INTRODUCTION. NVIDIA Quadro M GB Graphics. Overview

THE NAS KERNEL BENCHMARK PROGRAM

NVIDIA GeForce GTX 750 Ti

Optimizing GPU-based application performance for the HP for the HP ProLiant SL390s G7 server

22S:295 Seminar in Applied Statistics High Performance Computing in Statistics

GPGPU acceleration in OpenFOAM

HP ProLiant SL270s Gen8 Server. Evaluation Report

RWTH GPU Cluster. Sandra Wienke November Rechen- und Kommunikationszentrum (RZ) Fotos: Christian Iwainsky

Data Parallel Computing on Graphics Hardware. Ian Buck Stanford University

The Design and Implement of Ultra-scale Data Parallel. In-situ Visualization System

Introduction to GPU Architecture

L20: GPU Architecture and Models

Introduction to Cloud Computing

Overview on Modern Accelerators and Programming Paradigms Ivan Giro7o

Scalable and High Performance Computing for Big Data Analytics in Understanding the Human Dynamics in the Mobile Age

Applications to Computational Financial and GPU Computing. May 16th. Dr. Daniel Egloff

High Performance. CAEA elearning Series. Jonathan G. Dudley, Ph.D. 06/09/ CAE Associates

64-Bit versus 32-Bit CPUs in Scientific Computing

This Unit: Putting It All Together. CIS 501 Computer Architecture. Sources. What is Computer Architecture?

High Performance Computing in CST STUDIO SUITE

OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC

Introduction to GPU Computing

Accelerating BIRCH for Clustering Large Scale Streaming Data Using CUDA Dynamic Parallelism

NVIDIA VIDEO ENCODER 5.0

System requirements for Autodesk Building Design Suite 2017

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU

Unleashing the Performance Potential of GPUs for Atmospheric Dynamic Solvers

E6895 Advanced Big Data Analytics Lecture 14:! NVIDIA GPU Examples and GPU on ios devices

GPU for Scientific Computing. -Ali Saleh

Table of Contents. P a g e 2

GPU Computing - CUDA

Fast Implementations of AES on Various Platforms

ACCELERATION OF SPIKING NEURAL NETWORK ON GENERAL PURPOSE GRAPHICS PROCESSORS

Computer Graphics Hardware An Overview

SUBJECT: SOLIDWORKS HARDWARE RECOMMENDATIONS UPDATE

GPU Acceleration of the SENSEI CFD Code Suite

GPU Architectures. A CPU Perspective. Data Parallelism: What is it, and how to exploit it? Workload characteristics

10- High Performance Compu5ng

IP Video Rendering Basics

Transcription:

Towards Large-Scale Molecular Dynamics Simulations on Graphics Processors Joe Davis, Sandeep Patel, and Michela Taufer University of Delaware

Outline Introduction Introduction to GPU programming Why MD on GPUs Water simulation on GPU Performance Accuracy Ionic solution simulation on GPU Performance Membrane with explicit water on GPU Pitfalls Conclusions 3 3

MD on GPUs Why MD on GPU? Non-bond expand scales of time and physical dimension (system complexity) All-atom resolution (micro to milliseconds) Course-graining (seconds) Explicit solvent simulations Continuum physics with molecular detail? 4 4

MD Code Overview CUDA code emulating the CHARMM molecular modelling package (Brooks, B. R. et al, J. Comput. Chem., 1983, 4: 187) Intra-molecular potential Bonds, angles, and dihedral potentials Computed on GPU using a list-based evaluation Non-bond interactions (pair interactions) Non-bond list is generated by checking all pair distances against the cut-off in parallel (efficient tiling approach) A thread iterates through the non-bond list for a single atom and accumulates the non-bonded interactions 5 5

Water Simulations Water simulations on GPU vs. on CPU Reference simulation of CHARMM on Beowulf cluster Intel Xeon 5150 2.66 GHz (Woodcrest) NVIDIA GPUs: Single precision Quadro FX 5600 (1.5GB memory) Single precision GeForce 9800 GX2 (dual GPUs per card, 500MB memory) Single and double precision GTX 280 Single precision Tesla S1070 system 6 6

Water Model Flexible Water SPC/Fw (Wu et al, J. Chem. Phys., 2006) Intra-molecular potential: Computed on GPU using lists (bonds/angles lists) Non-bonded potential: Lennard-Jones potential Shifted-force electrostatics with cut-off only (no Ewald) Computed on GPU using a list-based evaluation 7 7

System Parameters Explicit water simulation Pre-equilibrated box PBC: Cubic Density = 1.012 g/ml t = 1 fs Integrator: Verlet on GPU Orig. Verlet with CHARMM on CPU 8 8 8

Performance Performance metrics: number of MD steps in one second NVIDIA GPU: single precision GeForce 9800 GX2 (dual GPUs per card, 500MB memory) (data from 100,000 MD steps) GPU is ~7x faster on average! 9 9

Accuracy Constant energy simulation presents significant drift Some GPU operations are not IEEE compliant Philip and Omar will present our solution Constant temperature simulation with velocity reassignment has no drift Adjust energy state at specific intervals thereby eliminating drift 10 10

Ionic Solutions Reproduce ab initio water interactions with sodium iodide (crystalline salt) System size: 988 water molecules, 18 Na + ions and 18 I- ions Ion model: Electrostatic and van der Waals only CHARMM parameters modified for more accurate interaction energies 1 1 Lamoureux, G. and Roux, B., J. Phys. Chem. B, 2006, 110: 3308. 11 11

Performance (I) Performance metrics: number of MD steps in one second NVIDIA GPUs: Single precision GeForce 9800 GX2 (dual GPUs per card, 500MB memory) Single precision Quadro FX 5600 (1.5GB memory) Double and single precision GTX 280 Single precision Tesla S 1070 12 12

Performance (II) Performance metrics: number of MD steps in one second NVIDIA GPUs: one GPU of the Tesla S1070 system Different system sizes, i.e., number of atoms and ions # water # NaI # atoms MD steps / sec 1865 35 5665 392.1 1935 0 5805 382.4 3665 70 11135 168.4 3805 0 11415 169.7 10243 194 31117 42.7 10631 0 31893 40 15333 290 46579 21.9 15913 0 47739 20.9 30339 570 92157 7.3 31485 0 94455 6.9 13 13

Membrane and Solvent Molecular system including a membrane and explicit water Intra-molecular potential Bonds, angles, and dihedral potentials Computed on GPU using a listbased evaluation Non-bonded potential: Lennard-Jones potential Shifted-force electrostatics with cut-off only (no Ewald) Computed on GPU using a listbased evaluation Study of performance and accuracy is work in progress 14 14

Pitfalls Single or double precision? G8x GPU FP is 32-bit, newer T10P GPU is 64-bit Some 32-bit operations are not IEEE compliant 64-bit arithmetic is more accurate, but more costly Fortran compilation? Fortran compiler is forthcoming and PGI allows integration of Fortran code We have rewritten code components for GPUs in C Code optimization is everything? Targets: memory access, efficient list building/updating, loops, conditional, data structure, etc. The optimization of our code is a work in progress Limited number of GPUs per card? 870: board (1 GPU), D870: desk-side unit (2 GPUs), S1070: 1u server unit (4 GPUs) 15 CUDA 2.2 uses parallel organizations of GPUs 15

Related Work Yang et al 1 Proof of concept Limited by graphics-specific programming interface Stone et al 2 Moved non-bond force calculation to GPU Focus mostly on modelling applications Anderson et al 3 MD running entirely on GPU (HOOMD) Neighbor list implementation Van Meel et al 4 Very similar to Anderson's study 1 Yang, J. et al, J. Comput. Phys., 2007, 221: 799. 2 Stone, J. E. et al, J. Comput. Chem., 2007, 28: 2618. 16 3 Anderson, J. A. et al, J. Comput. Phys., 2008, 227: 5342. 4 Van Meel, J. A. et al, Mol. Sim., 2008, 34: 259. 16

Conclusions Current achievements: Implementation of a local version of MD code on current generation of GPUs Straightforward, naive implementation Promising results Work in progress: Optimization and tuning of code Expand MD options (additional potentials, PME) Final goals: Effective compilation of CHARMM-like code on GPU Study of larger molecular systems for longer simulation times, up to 100ns 17 17

Acknowledgements Thanks to: David Hoff, Sumit Gupta, Scott LeGrand (NVIDIA) Joshua Anderson (Iowa State University) Sponsors: 18 18

Introduction (I) Graphics Processing Units (GPUs) have been extensively used in graphics intensive applications Development driven by economy, e.g., video game industry, motion picture The inherent parallelization of GPUs makes them suitable for scientific applications Recent exploration of potential of GPUs for mathematics and scientific, and clinical computing Medical diagnostics: GPUs coupled to MRI Hardware (Stone et al. Proc. of 2007 Computing Frontiers conference, 7-9 May, 2008) Molecular modelling: Electrostatic Potential Calculation (Stone et al. J. Comp. Chem. 28, #16, pp. 2618-2640) Ion Placement (Stone et al. J. Comp. Chem. 28, #16, pp. 2618-2640) Van der Waals Fluids / Polymers (Anderson et al. J. Comput. Physics 2008) 19 3

Introduction (II) Special purpose hardware: specific types of calculations Protein Explorer systems and its LSI 'MDGRAPE-3 chip (Taiji et al. in Proc. of 2003 ACM/IEEE Supercomputing Conference, 15-21 Nov. 2003) Anton and its 12 identical MD-specific ASICs (Shaw et al. in Proc. of the 34th Annual International Symposium on Computer Architecture, 9-13 June, 2007) General Purpose GPUs (or GPGPUs): cost effective and readily available in recent workstations GeForce FX5600 1.5GBytes memory Cost $2,795 GeForce 9800 GX2 Dual GPU-based graphics card 512MBytes memory per GPU Cost $665 20 4

GPU Overview (I) NVIDIA GeForce 8 Series: 16 Streaming Multiprocessor (SM) 8 Scalar Processors (cores) / SM 16, 8-way SIMD cores = 128 PEs Massively parallel multithreaded Up to 12,288 active threads handled by thread execution manager From CUDA Programming Guide, NVIDIA Actual application performance Molecular dynamics -VMD ion placement: 290 GFLOPS FFT: 52 benchfft GFLOPS 21 5

GPU Overview (II) Memory types: Read/write per thread Registers Local memory Read/write per block Shared memory Read/write per grid Global memory Read-only per grid Constant memory Texture memory From CUDA Programming Guide, NVIDIA Communication among devices and with CPU Through PCI bus 22 6

Programming Paradigm Program in C: Serial program executed on CPU Parallel kernels executed on GPU Parallel kernels composed of many threads Threads are grouped into thread blocks Threads in the same block can cooperate Thread block = a (data) parallel task (SIMD) Same entry point but can execute any code - conditions are allowed in block threads Different blocks are independent Several blocks = task parallelism Thread t Multiprocessor 0 t 1 0 t 1 n t t n 0 t 1 t n Block B Kernel launched by host (CPU) Kernel 0 Device processor array 23 7

Programming GPUs Past: APIs originally through graphics interfaces e.g., OpenGL Not easy to use for general usage: cast computation in terms of graphics operations: Draw the calculation Interpret image post-calculation Present: NVDIA CUDA (Compute Unified Device Architecture) language/library Easy to use: CUDA provides minimal set of extensions necessary to expose power of GPGPUs Includes C-compiler and development tools CUDA optimization strategy: Maximize independent parallelism Maximize arithmetic intensive computation Take advantage of on-chip per-block shared memory Do computation on the GPUs and avoid data transfer 24 24