Trends in High-Performance Computing for Power Grid Applications


Trends in High-Performance Computing for Power Grid Applications
Franz Franchetti, ECE, Carnegie Mellon University, www.spiral.net
Co-Founder, SpiralGen, www.spiralgen.com
This talk presents my personal views and is not endorsed by any other party mentioned in it. The copyright of all images is held by the respective owners; they are used under fair use.

This Talk Is Not About 2050
- Quantum computing
- DNA computing
- Optical computing
- Other far-out technologies

This Talk: HPC and Supercomputing in 2018
- Where are we today?
- How did we end up here?
- Where will we go?
- How will it impact you?

What Is High Performance Computing?
1 flop/s = one floating-point operation (addition or multiplication) per second
mega (M) = 10^6, giga (G) = 10^9, tera (T) = 10^12, peta (P) = 10^15, exa (E) = 10^18

Computing systems in 2010:
- Cell phone: 1 CPU, 1 Gflop/s, $300, 1 W power
- Laptop: 2 CPUs, 20 Gflop/s, $1,200, 30 W power
- Workstation: 8 CPUs, 1 Tflop/s, $10,000, 1 kW power
- HPC: 200 CPUs, 20 Tflop/s, $700,000, 8 kW power
- #1 supercomputer: 224,162 CPUs, 2.3 Pflop/s, $100,000,000, 7 MW power

Economical HPC for the power grid scenario:
- Central servers (planning, contingency analysis)
- Autonomous controllers (smart grids)
- Operator workstations (decision support)
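
As a rough illustration of what a flop/s figure means in practice, the sketch below (not from the talk) times a fixed number of floating-point operations and divides by the elapsed time; real rankings such as the Top500 use the HPL (LINPACK) benchmark instead, so treat this only as a minimal example.

    /* Minimal sketch: estimate achieved flop/s by timing a known number of
       floating-point operations. Illustrative only; not a real benchmark. */
    #include <stdio.h>
    #include <time.h>

    int main(void) {
        const long N = 200000000L;      /* 2*N floating-point ops in total */
        double x = 0.0, y = 1.000000001;
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < N; i++)
            x = x * y + 1e-9;           /* one multiply + one add per iteration */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        printf("%.2f Gflop/s (result %g)\n", 2.0 * N / secs / 1e9, x);
        return 0;
    }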

How Big Are the Computational Problems?
1 flop/s = one floating-point operation (addition or multiplication) per second
mega (M) = 10^6, giga (G) = 10^9, tera (T) = 10^12, peta (P) = 10^15, exa (E) = 10^18

Matrix-matrix multiplication:
for i=1:n, for j=1:n, for k=1:n: C[i,j] = C[i,j] + A[i,k]*B[k,j]

..running on:
- Cell phone (1 Gflop/s): 1k x 1k, 8 MB, 2 s
- Laptop (20 Gflop/s): 8k x 8k, 0.5 GB, 55 s
- Workstation (1 Tflop/s): 16k x 16k, 2 GB, 8 s
- HPC (20 Tflop/s): 64k x 64k, 32 GB, 28 s
- #1 supercomputer (2.3 Pflop/s): 1M x 1M, 8 TB, 1,000 s
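
A minimal C sketch of the triple loop above, together with the arithmetic behind the table: the multiplication performs 2*n^3 floating-point operations and each double-precision matrix occupies 8*n^2 bytes. The matrix size and machine rate below are example values matching the cell-phone row; everything else is standard C.

    /* Triple-loop matrix multiplication plus the flop/memory estimate
       behind the table above (2*n^3 flops, 8*n^2 bytes per matrix). */
    #include <stdio.h>
    #include <stdlib.h>

    static void matmul(int n, const double *A, const double *B, double *C) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int k = 0; k < n; k++)
                    sum += A[i*n + k] * B[k*n + j];   /* C[i,j] += A[i,k]*B[k,j] */
                C[i*n + j] = sum;
            }
    }

    int main(void) {
        int n = 1024;                      /* the "1k x 1k" case from the table */
        double rate_gflops = 1.0;          /* assumed machine: ~1 Gflop/s phone */
        double flops = 2.0 * n * n * n;    /* one mul + one add per k-iteration */
        double bytes_per_matrix = 8.0 * n * n;

        printf("per matrix: %.0f MB\n", bytes_per_matrix / 1e6);
        printf("estimated time: %.1f s\n", flops / (rate_gflops * 1e9));

        double *A = calloc((size_t)n * n, sizeof *A);
        double *B = calloc((size_t)n * n, sizeof *B);
        double *C = calloc((size_t)n * n, sizeof *C);
        if (A && B && C) matmul(n, A, B, C);
        free(A); free(B); free(C);
        return 0;
    }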

The Evolution of Performance

How Do We Compare?
1 flop/s = one floating-point operation (addition or multiplication) per second
mega (M) = 10^6, giga (G) = 10^9, tera (T) = 10^12, peta (P) = 10^15, exa (E) = 10^18

In 2010, each system class would have been the #1 supercomputer back in:
- Cell phone, 1 Gflop/s: Cray X-MP/48, 941 Mflop/s, 1984
- Laptop, 20 Gflop/s: NEC SX-3/44R, 23.2 Gflop/s, 1990
- Workstation, 1 Tflop/s: Intel ASCI Red, 1.338 Tflop/s, 1997
- HPC, 20 Tflop/s: Earth Simulator, 35.86 Tflop/s, 2002
- #1 supercomputer, 2.3 Pflop/s: the 2010 #1 itself

If History Predicted the Future
The performance of the #1 supercomputer of 2010 could be available as:
- #1 supercomputer: 1 Pflop/s, 2010
- HPC: 1 Pflop/s, 2018
- Workstation: 1 Pflop/s, 2023
- Laptop: 1 Pflop/s, 2030
- Cell phone: 1 Pflop/s, 2036
How do we get there?

HPC: ExtremeScale Computing
DARPA UHPC ExtremeScale system goals:
- Time frame: 2018
- 1 Pflop/s, air-cooled, single 19-inch cabinet
- Power budget: 57 kW, including cooling
- 50 Gflop/W for the HPL benchmark
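
A quick back-of-the-envelope check of these goals (the numbers are the slide's, the arrangement is mine): at 50 Gflop/W, sustaining 1 Pflop/s needs about 20 kW for computation, which leaves headroom inside the 57 kW cabinet budget for cooling and other overhead.

    /* Back-of-the-envelope check of the UHPC cabinet goal, using the
       slide's numbers: 1 Pflop/s at 50 Gflop/W -> ~20 kW of compute power. */
    #include <stdio.h>

    int main(void) {
        double pflops = 1.0;              /* sustained HPL, Pflop/s        */
        double gflop_per_watt = 50.0;     /* UHPC efficiency goal          */
        double cabinet_budget_kw = 57.0;  /* power budget, incl. cooling   */

        /* Pflop/s -> Gflop/s -> W -> kW */
        double compute_kw = pflops * 1e6 / gflop_per_watt / 1e3;
        printf("compute power: %.0f kW of %.0f kW budget\n",
               compute_kw, cabinet_budget_kw);
        return 0;
    }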

Developing The New #1: DoE ExaScale
Challenges to achieve ExaScale:
- Energy and power
- Memory and storage
- Concurrency and locality
- Resiliency
X-Stack: software for ExaScale:
- System software
- Fault management
- Programming environments
- Applications frameworks
- Workflow systems
#1 supercomputer: 2 Pflop/s (2010), 1 Eflop/s (2018)

Some Predictions for ExaScale Machines
Processors:
- 10 billion-way concurrency
- 100s of cores per die
- 10- to 100-way per-core concurrency
- 100 million to 1 billion cores at 1 to 2 GHz
- Multi-threaded fine-grain concurrency
- 10,000s of cycles system-wide latency
Memory:
- Global address space without cache coherence
- Explicitly managed high-speed buffer caches
- 128 PB capacity
- Deep memory hierarchies
Technology:
- 22 to 11 nanometer CMOS
- 3-D packaging of dies
- Optical communications at 1 TB/s
- Fault tolerance
- Active power management
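
The concurrency figures are roughly self-consistent; the small sketch below (illustrative only) multiplies the predicted core counts by the per-core concurrency to recover the quoted system-wide concurrency of about 10 billion.

    /* Illustrative arithmetic only: cores times per-core threads gives
       total concurrency in the range the slide predicts (~10 billion). */
    #include <stdio.h>

    int main(void) {
        double cores_low = 1e8, cores_high = 1e9;     /* predicted core counts */
        double threads_low = 10, threads_high = 100;  /* per-core concurrency  */
        printf("total concurrency: %.0e to %.0e\n",
               cores_low * threads_low, cores_high * threads_high);
        return 0;
    }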

International Semiconductor Roadmap
Near-term (through 2016) and long-term (2017 through 2024):
- Process Integration, Devices, and Structures
- RF and Analog/Mixed-signal Technologies for Wireless Communications
- Emerging Research Devices
- System Drivers
- Design
- Test and Test Equipment
- Front End Processes
- Lithography
- Interconnect
- Factory Integration
- Assembly and Packaging

Prediction 1: Network of Nodes
Why?
- State of the art for large machines
- Allows scaling from Tflop/s to Eflop/s
- Designs can be tailored to the application
- Fault tolerance
Implications:
- Segmented address space
- Multiple instruction, multiple data (MIMD)
- Packet-based messaging
- Long inter-node latencies
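
A minimal sketch (not from the talk) of what this model looks like to the programmer, using MPI, which appears later in the talk's list of standard HPC libraries: every node owns its own address space, and data moves only through explicit messages across the network.

    /* Network-of-nodes model: separate address spaces, explicit messages. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double local = (double)rank;   /* each node owns its own data */
        double sum = 0.0;

        /* Explicit communication: combine per-node values across the machine. */
        MPI_Reduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum over %d nodes = %g\n", size, sum);

        MPI_Finalize();
        return 0;
    }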

Prediction 2: Multicore CPUs
Why?
- State-of-the-art CPU design
- Growing transistor count (Moore's law)
- Limited power budget
Implications:
- On-chip multithreading
- Instruction set extensions targeting applications
- Physically segmented cache
- Software- and/or hardware-managed cache
- Non-uniform memory access (NUMA)
Examples:
- IBM Cell BE: 8+1 cores
- Intel Core i7: 8 cores, 2-way SMT
- IBM POWER7: 8 cores, 4-way SMT
- Intel SCC: 48 cores
- Nvidia Fermi: 448 cores, SMT
- Tilera TILE-Gx: 100 cores
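
For contrast with the message-passing sketch above, here is a minimal OpenMP example (OpenMP is also on the talk's list of HPC programming models) of on-chip multithreading in a shared address space; the NUMA and cache-segmentation effects the slide mentions do not show up in the source code, only in its performance.

    /* Shared-memory multithreading across the cores of one chip with OpenMP. */
    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        const int n = 1 << 20;
        static double a[1 << 20];
        double sum = 0.0;

        /* All threads share one address space; the loop is split across cores. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; i++) {
            a[i] = 1.0 / (i + 1);
            sum += a[i];
        }

        printf("threads available: %d, sum = %f\n", omp_get_max_threads(), sum);
        return 0;
    }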

Prediction 3: Accelerators
Why?
- Special purpose enables better efficiency
- 10x to 100x gain for data-parallel problems
- Limited applicability, thus co-processor
- Can be a discrete chip or integrated on die
Implications:
- Multiple programming models
- Coarse-grain partitioning necessary
- Programs often become non-portable
Examples:
- RoadRunner: 6,480 CPUs + 12,960 Cells, 3,240 TriBlades
- Rack-mount server components: 2 quad-core CPUs + 4 GPUs, 200 Gflop/s + 4 Tflop/s
- HPC cabinet: CPU blades + GPU blades, custom interconnect
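
A sketch of the coarse-grain partitioning the slide describes, written with OpenMP target offload for brevity (a standard that postdates the systems listed above; the machines shown use CUDA or the Cell SDK): the data-parallel loop is explicitly shipped to the accelerator together with its data, while the rest of the program stays on the host CPU.

    /* Coarse-grain partitioning: offload the data-parallel kernel, keep the
       rest on the host. Uses OpenMP target offload as one concrete example. */
    #include <stdio.h>

    #define N 1000000

    int main(void) {
        static float x[N], y[N];
        for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

        /* Copy inputs to the device, run the loop there, copy results back:
           the partition between host and accelerator is explicit. */
        #pragma omp target teams distribute parallel for map(to: x[0:N]) map(tofrom: y[0:N])
        for (int i = 0; i < N; i++)
            y[i] = 2.0f * x[i] + y[i];

        printf("y[0] = %f\n", y[0]);
        return 0;
    }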

Prediction 4: Memory Capacity Limited
Why?
- Good machine balance: 1 byte/flop
- Multicore CPUs have huge performance
- Limited power budget
- Need to limit memory size
Implications:
- Saving memory complicates programs
- Trade-off: memory vs. operations
- Requires new algorithm optimization
Examples:
- Dell PowerEdge R910: 2 x 8-core CPUs, 256 GB, 145 Gflop/s; per core: 16 GB for 9 Gflop/s = 1.7 byte/flop
- BlueGene/L: 65,536 dual-core CPUs, 16 TB RAM, 360 Tflop/s; per core: 128 MB for 2.8 Gflop/s = 0.045 byte/flop
- Nvidia Tesla M2050 (Fermi): 1 GPU, 448 cores, 6 GB, 515 Gflop/s; per core: 13 MB for 1.15 Gflop/s = 0.011 byte/flop
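
The byte/flop figures on the slide can be recomputed directly as per-core memory divided by per-core performance; the helper below uses the slide's per-core numbers as inputs and prints the resulting machine balance.

    /* Machine balance = bytes of memory per core / flop/s per core.
       Inputs are the per-core figures quoted on the slide. */
    #include <stdio.h>

    static void balance(const char *name, double bytes_per_core, double flops_per_core) {
        printf("%-22s %.3f byte/flop\n", name, bytes_per_core / flops_per_core);
    }

    int main(void) {
        balance("Dell PowerEdge R910", 16e9,   9e9);    /* 16 GB per core,  9 Gflop/s    */
        balance("BlueGene/L",          128e6,  2.8e9);  /* 128 MB per core, 2.8 Gflop/s  */
        balance("Nvidia Tesla M2050",  13e6,   1.15e9); /* ~13 MB per core, 1.15 Gflop/s */
        return 0;
    }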

HPC Software Development
Popular HPC programming languages:
- 1953: Fortran
- 1973: C
- 1985: C++
- 1997: OpenMP
- 2007: CUDA
Popular HPC libraries:
- 1979: BLAS
- 1992: LAPACK
- 1994: MPI
- 1995: ScaLAPACK
- 1995: PETSc
- 1997: FFTW
Proposed and maturing (?): Chapel, X10, Fortress, UPC, GA, HTA, OpenCL, Brook, Sequoia, Charm++, CnC, STAPL, TBB, Cilk, ...
Slow change in direction.

The Cost Of Portability and Maintainability
Matrix-matrix multiplication performance [Gflop/s] versus matrix size (256 to 8,192), on one chart:
- Best GPU code: 10,000 lines of CUDA
- Best CPU code: 100,000 lines, SSE, OpenMP
- Simple C code: 4 lines
The gap between the simple C code and the best GPU code is about 6,500x = 15 years of technology loss.

Summary
- Hardware vendors will somehow keep Moore's law on track
- Software development changes very slowly
- Portable and maintainable code costs performance
- Unoptimized program = 15 years technology loss