Cosmological Simulations on Large, Accelerated Supercomputers


Cosmological Simulations on Large, Accelerated Supercomputers
Adrian Pope (LANL)
ICCS Workshop, 27 January 2011
Slide 1

People
- LANL: Jim Ahrens, Lee Ankeny, Suman Bhattacharya, David Daniel, Pat Fasel, Salman Habib, Katrin Heitmann, Zarija Lukic, Pat McCormick, Jon Woodring
- ORNL: Chung-Hsing Hsu
- Berkeley: Jordan Carlson, Martin White
- Virginia Tech: Paul Sathre
- IBM: Mike Perks
- Aerospace: Nehal Desai
Slide 2

The Survey Challenge
- Scaling of astronomical surveys
  - Driven by solid-state technology: CCD pixels in the focal plane
  - Data volume/rate scaling similar to Moore's law
  - Huge investment by scientists and funding agencies
- Transformational effect on the demands of theoretical modeling for cosmology
  - ~10% precision now: semi-analytic methods, perturbation theory
  - ~1% soon: numerical theory (e.g. simulations)
- Precise theoretical predictions to match survey observations
  - Many related probes: large-scale structure, weak lensing, clusters
  - Maximize return on the investment in surveys
Slide 3

[Survey sky-coverage images: SDSS (2000-2008, 1/4 sky); LSST (2018-2028, 1/2 sky with multiple scans; image from DLS); also Pan-STARRS, DES, etc.]
Slide 4

[Growth of galaxy redshift surveys: Gregory & Thompson, 1978 (~1,100 galaxies); CfA (de Lapparent, Geller, Huchra), 1986; SDSS, 2008 (~1,000,000 galaxies); BOSS, 2009-2014 (~1,000,000); images: M. Blanton]
Slide 5

Slide 6

[Scaling plot]
Roadrunner: ~1.1 Pflop
Slide 7

HACC (Hardware Accelerated Cosmology Code)
- Throughput
  - Dynamic range: volume for long-wavelength modes, resolution for halo/galaxy locations
  - Repeat runs: vary initial conditions; sample parameter space, emulators for observables (Coyote Universe, Cosmic Calibration)
  - Scale to current and future supercomputers (many MPI ranks, even more cores)
- Flexibility
  - Supercomputer architectures (CPU, Cell, GPGPU, BG)
  - Compute-intensive code takes advantage of the hardware
  - Bulk of the code easily portable (MPI)
- Development/maintenance
  - Few developers
  - Simpler code is easier to develop, maintain, and port to different architectures
- On-the-fly analysis and data reduction
  - Reduce the size/number of outputs, ease file-system stress
Slide 8

Collisionless Gravity

\[ \frac{\partial f}{\partial t} + \dot{x} \cdot \nabla_x f - \nabla\phi \cdot \nabla_p f = 0, \qquad p = a^2 \dot{x} \]
\[ \nabla^2 \phi = 4\pi G a^2 \left( \rho(x,t) - \rho_b(t) \right), \qquad \rho(x,t) = a^{-3} m \int d^3p \, f(x,\dot{x},t) \]

- Evolution of over-density perturbations in a smooth, expanding background (Vlasov-Poisson)
- Gravity has infinite extent and causes instabilities on all scales
- N-body method
  - Tracer particles for the phase-space distribution
  - Self-consistent force
  - Symplectic integrator (sketched below)
Slide 9
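A minimal sketch of the kick-drift-kick (leapfrog) symplectic update referred to above, in plain C++. The particle layout, the mass-normalized force array, and the force callback are illustrative assumptions, not HACC's actual interface.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical particle layout; shown in 1D for brevity.
struct ParticleSet {
    std::vector<double> x, v, f;  // positions, velocities, accelerations
};

// One symplectic kick-drift-kick step: half kick, full drift, force
// re-evaluation, half kick. computeForces stands in for the PM +
// short-range solve described on the following slides.
template <typename ForceFn>
void leapfrogStep(ParticleSet& p, double dt, ForceFn computeForces) {
    const std::size_t n = p.x.size();
    for (std::size_t i = 0; i < n; ++i) p.v[i] += 0.5 * dt * p.f[i]; // kick
    for (std::size_t i = 0; i < n; ++i) p.x[i] += dt * p.v[i];       // drift
    computeForces(p);                                // refresh accelerations
    for (std::size_t i = 0; i < n; ++i) p.v[i] += 0.5 * dt * p.f[i]; // kick
}
```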

Force
- Long-range (PM = particle-mesh)
  - Deposit particles on the grid (CIC; see the sketch after this slide)
  - Distributed-memory FFT (Poisson solve)
  - Pros: fast, good error control. Cons: uses memory
- Short-range
  - Inter-particle force calculation, several short steps per long step
  - Limited spatial extent; local n² comparisons
  - Several implementation choices: tree solver (TreePM: CPU) or direct particle-particle (P³M: Cell, OpenCL)
- Spectral smoothing at the handover
  - More flexible than real-space stencils (e.g. TSC)
Slide 10
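A sketch of the cloud-in-cell (CIC) deposit named above, reduced to 1D for brevity; the grid size, normalization, and function name are illustrative assumptions. Each particle's mass is split linearly between its two nearest grid points (in 3D, the same weights apply along each axis, touching 8 cells).

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// 1D cloud-in-cell deposit onto a periodic density grid of rho.size() cells.
// Positions are assumed normalized to [0, ng) in grid units.
void cicDeposit(const std::vector<double>& pos, double mass,
                std::vector<double>& rho) {
    const std::size_t ng = rho.size();
    for (double x : pos) {
        const std::size_t i = static_cast<std::size_t>(std::floor(x)) % ng;
        const double frac = x - std::floor(x);     // offset from left point
        rho[i]            += mass * (1.0 - frac);  // left grid point
        rho[(i + 1) % ng] += mass * frac;          // right point (periodic)
    }
}
```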

FFT Decomposition
- Compute:communication :: volume:area
- Independent of the particle decomposition; buffers to re-arrange data
- Roadrunner 1D tests: (weak) scaling up to 9000³ grids on up to 6000 MPI ranks
  - Probably about as far as 1D will go (thin slabs; illustrated below)
- Analysis: 2D should work for likely exascale systems; the 2D FFT is under testing
- Not as critical to calculate on accelerated hardware
  - Network-bandwidth limited
  - Still a relatively fast and accurate force calculation
- Decompositions: 1D slab, 2D pencil, 3D
Slide 11
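A small worked illustration of the thin-slab limit mentioned above: with a 1D slab decomposition, each rank owns roughly ng/ranks planes of an ng³ grid, so rank counts cannot usefully exceed ng, while a 2D pencil decomposition raises that ceiling to ng². The numbers below simply reproduce the slide's 9000³/6000-rank regime.

```cpp
#include <cstdio>

// 1D slab decomposition of an ng^3 FFT grid: planes per rank shrink as
// ranks grow, bottoming out at one plane per rank (ranks == ng).
int main() {
    const long ng = 9000;  // grid points per dimension (per the slide)
    for (long ranks : {1500L, 3000L, 6000L, 9000L}) {
        std::printf("ranks=%5ld  planes/rank=%6.2f\n",
                    ranks, static_cast<double>(ng) / ranks);
    }
    return 0;
}
```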

Particle Overloading
- 3D spatial decomposition (maximizes volume:area)
- Large-scale homogeneity = load balancing
- Cache nearby particles from neighboring sub-volumes (see the sketch below)
  - Update cached particles like the others; they move in/out of sub-volumes
  - Skip the short-range update at the very edge to avoid anisotropy
  - Can refresh the cache (error diffusion), though not every (long) time step
- Network communication
  - Mostly via the FFT, with occasional neighbor communication
  - None during the short-range force calculation
  - No MPI development needed for the short-range force
Slide 12
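A sketch of the geometry behind overloading: a rank owns [lo, hi)³ but also keeps cached copies of neighbors' particles within an overload width of its boundary, so the short-range force needs no communication. The names and cubic parameterization are illustrative assumptions.

```cpp
#include <array>

// True if a particle position falls inside this rank's padded region but
// outside its owned region, i.e. it is a cached copy owned by a neighbor.
// A cubic sub-volume [lo, hi)^3 with uniform overload width is assumed.
bool inOverloadZone(const std::array<double, 3>& pos,
                    double lo, double hi, double overload) {
    bool inPadded = true, inOwned = true;
    for (double x : pos) {
        inPadded = inPadded && (x >= lo - overload) && (x < hi + overload);
        inOwned  = inOwned  && (x >= lo) && (x < hi);
    }
    return inPadded && !inOwned;
}
```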

Architecture
[Node diagram: nodes, each with a CPU (~Gflops), caches, and memory, linked by an interconnect (~GB/s)]
Slide 13

Architecture
[Node diagram: as on Slide 13, plus an accelerator (~Tflops, many cores, its own memory) attached to each node over PCIe (~GB/s)]
Slide 14

Modular Code
- Decomposition and communication are independent of the hardware (MPI)
- Particles class
  - Particle/grid deposit for the long-range force (CIC, CIC⁻¹)
  - Particle position update (easy)
  - Short-range force and velocity update (the bottleneck)
  - Use algorithms and data structures to suit the hardware
  - Fixed set of public methods (sketched below)
Slide 15
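A sketch of what such a fixed public interface might look like; the method names and data layout are hypothetical, chosen to mirror the responsibilities listed above. Only the short-range kick needs hardware-specific internals (Cell, OpenCL, CPU tree), so callers never change across architectures.

```cpp
#include <vector>

// Hypothetical interface mirroring the slide's Particles class.
class Particles {
public:
    void cicDeposit(std::vector<double>& rho) const;      // particles -> grid
    void cicInterpolate(const std::vector<double>& grid); // grid -> particles (CIC^-1)
    void driftPositions(double dt);                       // easy, memory-bound
    void kickShortRange(double dt);   // bottleneck; backend-specific internals
private:
    std::vector<float> x, y, z, vx, vy, vz;  // layout may differ per backend
};
```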

Accelerators P³M
- Code development
  - Simple data structures; easier to avoid branching statements, etc.
  - The exact calculation can serve as a reference for approximate methods
- Chaining mesh: sort particles into buckets ~one force length on a side (see the sketch below)
- Memory hierarchy, coherence
  - Asynchronous transfers; overlap movement and computation
  - No competing writes to main memory
- Concurrent execution
  - Organize into independent work units
Slide 16
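A sketch of the chaining-mesh bucketing mentioned above: particles are binned into cells roughly one short-range cutoff on a side, so the n² pair search only visits a particle's own bucket and its 26 neighbors. Cell sizing and names are illustrative assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Bucket particle indices into a chaining mesh of nc^3 cells of side ~rcut
// covering a cubic box of side L, with positions in [0, L).
std::vector<std::vector<int>> buildChainingMesh(
        const std::vector<double>& x, const std::vector<double>& y,
        const std::vector<double>& z, double L, double rcut) {
    const int nc = std::max(1, static_cast<int>(L / rcut)); // cells per side
    std::vector<std::vector<int>> cells(
        static_cast<std::size_t>(nc) * nc * nc);
    for (std::size_t p = 0; p < x.size(); ++p) {
        const int i = std::min(nc - 1, static_cast<int>(x[p] / L * nc));
        const int j = std::min(nc - 1, static_cast<int>(y[p] / L * nc));
        const int k = std::min(nc - 1, static_cast<int>(z[p] / L * nc));
        cells[(static_cast<std::size_t>(i) * nc + j) * nc + k]
            .push_back(static_cast<int>(p));
    }
    return cells;
}
```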

Cell (LANL Roadrunner)
- Memory
  - Balanced between the CPU blade and the Cell blade
  - Particles live in Cell memory; the grid moves over PCIe
  - Multi-buffer by hand to SPU local store
- Heterogeneous
  - PPU: communication, sorting, low flop/byte work
  - SPU: high flop/byte work
- Node: 16 GB RAM
- Concurrency: scheduled by hand
[Node diagram: 2x dual-core CPUs with 8 GB RAM each, connected over PCIe to four CBEs, each with one PPU and eight SPUs]
Slide 17

OpenCL
- Memory
  - Possibly (probably?) unbalanced memory
  - Particles live in CPU main memory; the CPU does the low flop/byte operations
  - Stream slabs through GPU memory: pre-fetch, asynchronous result updates back to the CPU (see the sketch below)
- Concurrency
  - Data-parallel kernel execution
  - Many independent work units per slab, many threads
  - Efficient scheduling on the GPU
Slide 18
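A sketch of the double-buffered slab streaming described above, using standard OpenCL host API calls; the kernel, buffer creation, and queue setup are assumed to exist elsewhere, and all names are illustrative. Ping-ponging two in-order queues lets slab s+1's transfer overlap slab s's kernel, while in-order semantics on each queue prevent a buffer from being overwritten before its kernel finishes.

```cpp
#include <CL/cl.h>
#include <cstddef>

// Stream nSlabs slabs of particle data through two device buffers.
// queues[b] and devBuf[b] form a ping-pong pair; both queues are in-order.
void streamSlabs(cl_command_queue queues[2], cl_kernel kernel,
                 cl_mem devBuf[2], const float* host,
                 std::size_t slabFloats, std::size_t nSlabs,
                 std::size_t globalWork) {
    for (std::size_t s = 0; s < nSlabs; ++s) {
        const int b = static_cast<int>(s % 2);
        // Non-blocking host->device copy of slab s; in-order queue b
        // guarantees the kernel from iteration s-2 has already drained.
        clEnqueueWriteBuffer(queues[b], devBuf[b], CL_FALSE, 0,
                             slabFloats * sizeof(float),
                             host + s * slabFloats, 0, nullptr, nullptr);
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &devBuf[b]);
        clEnqueueNDRangeKernel(queues[b], kernel, 1, nullptr, &globalWork,
                               nullptr, 0, nullptr, nullptr);
        // Transfers/kernels on the other queue run concurrently.
    }
    clFinish(queues[0]);
    clFinish(queues[1]);
}
```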

PM Science
Slide 19

First Roadrunner Universe (RRU) Science Runs
- Roadrunner (LANL): 3060 nodes
  - 2x dual-core Opterons per node: ~10% of flops
  - 4x Cells per node (8 vector processors per Cell): ~90% of flops
  - 1 petaflops double precision, 2 petaflops single precision
- Simulation parameters
  - 750 Mpc/h side length
  - 64 billion particles (resolves the IGM Jeans mass)
  - ~100 kpc/h force resolution (resolves the IGM Jeans length)
  - 9 realizations (single cosmology)
  - 1000 nodes (1/3 of the machine), ~a day per simulation
- Analysis
  - Density along skewers calculated on the fly
  - Cross-correlations along nearby lines of sight in post-processing
Slide 20

Ly-α BAO Studies
- BOSS: cross-correlations along pairs of QSO sightlines
- DM-only simulation, some gas physics added in post-processing
- Can test noise/error scenarios
- White et al. 2010 (ApJ, arXiv:0911.5341)
Slide 21

Slide 22

P³M Commissioning
Slide 23

Cell
- Code comparison (256³): < 1% agreement
- Tests at scale
  - Roadrunner: 4 Gpc/h side length, 64 billion particles, 1000 nodes (1/3 of the machine)
  - Cerrillos (360 nodes, open network): 2 Gpc/h side length, 8 billion particles, 128 nodes
  - Both: ~5-10 kpc/h force resolution, 500x3 time steps, ~a week per run (+queue)
- Verifying results
Slide 24

Cell
[Halo mass function plot: dn/dlog M [1/Mpc³] vs. halo mass [M_sun/h] over ~10¹³-10¹⁵; results from MC³ (1/512 of the 8-billion-particle run) compared to the fit from Bhattacharya et al. 2010]
Slide 25

OpenCL
- Port of the Particles class
  - Summer student project; quicker/easier development than Cell
- SC10 demo
  - Calculation in real time (small problem size)
  - Mix of NVIDIA and ATI hardware
  - Interactive 3D visualization in real time
- Initial performance not awful
  - Fast on NVIDIA; improving ATI performance
- Kernels
  - Single kernel with optional vectorization, or kernels tuned per hardware?
  - Settle data structures
Slide 26

Future
- Cell
  - Debugging speed improvements (3x faster)
  - Clean up code from beta to 1.0
- OpenCL
  - Improve code from demo quality to production
  - Should soon have access to a machine large enough for real tests
- Local tree solver (CPU)
  - Data structures in place (threaded tree)
  - Need to implement the force-solve walk
- Early Science Program on Argonne BG/Q
  - OpenMP threading of some operations (planning)
  - P³M? Tree? Both?
- Baryon physics: exploring methods
Slide 27