2013 IBM Corporation. Exploring the Capabilities of a Massively Scalable, Compute-in-Storage Architecture

Similar documents

Blue Gene Active Storage for High Performance BG/Q I/O and Scalable Data-centric Analytics

Hadoop on the Gordon Data Intensive Cluster

Data Centric Systems (DCS)

FPGA Accelerator Virtualization in an OpenPOWER cloud. Fei Chen, Yonghua Lin IBM China Research Lab

SMB Direct for SQL Server and Private Cloud

Cloud Data Center Acceleration 2015

Running on Blue Gene/Q at Argonne Leadership Computing Facility (ALCF)

Agenda. HPC Software Stack. HPC Post-Processing Visualization. Case Study National Scientific Center. European HPC Benchmark Center Montpellier PSSC

Copyright 2013, Oracle and/or its affiliates. All rights reserved.

David Rioja Redondo Telecommunication Engineer Englobe Technologies and Systems

Driving IBM BigInsights Performance Over GPFS Using InfiniBand+RDMA

Benchmarking Cassandra on Violin

Comparing SMB Direct 3.0 performance over RoCE, InfiniBand and Ethernet. September 2014

FLOW-3D Performance Benchmark and Profiling. September 2012

Appro Supercomputer Solutions Best Practices Appro 2012 Deployment Successes. Anthony Kenisky, VP of North America Sales

Performance Analysis of Flash Storage Devices and their Application in High Performance Computing

EOFS Workshop Paris Sept, Lustre at exascale. Eric Barton. CTO Whamcloud, Inc Whamcloud, Inc.

Advancing Applications Performance With InfiniBand

GPFS Storage Server. Concepts and Setup in Lemanicus BG/Q system" Christian Clémençon (EPFL-DIT)" " 4 April 2013"

ebay Storage, From Good to Great

Achieving Real-Time Business Solutions Using Graph Database Technology and High Performance Networks

PCIe Over Cable Provides Greater Performance for Less Cost for High Performance Computing (HPC) Clusters. from One Stop Systems (OSS)

An Alternative Storage Solution for MapReduce. Eric Lomascolo Director, Solutions Marketing

Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software

HPC Update: Engagement Model

Unstructured Data Accelerator (UDA) Author: Motti Beck, Mellanox Technologies Date: March 27, 2012

Scaling from Datacenter to Client

Sun Constellation System: The Open Petascale Computing Architecture

Solving I/O Bottlenecks to Enable Superior Cloud Efficiency

IOmark- VDI. Nimbus Data Gemini Test Report: VDI a Test Report Date: 6, September

Seeking Opportunities for Hardware Acceleration in Big Data Analytics

Can High-Performance Interconnects Benefit Memcached and Hadoop?

Oracle BI EE Implementation on Netezza. Prepared by SureShot Strategies, Inc.

Parallel Computing. Benson Muite. benson.

DIABLO TECHNOLOGIES MEMORY CHANNEL STORAGE AND VMWARE VIRTUAL SAN : VDI ACCELERATION

760 Veterans Circle, Warminster, PA Technical Proposal. Submitted by: ACT/Technico 760 Veterans Circle Warminster, PA

Data Deduplication in a Hybrid Architecture for Improving Write Performance

PCI Express Impact on Storage Architectures and Future Data Centers. Ron Emerick, Oracle Corporation

HPC data becomes Big Data. Peter Braam

Supercomputing and Big Data: Where are the Real Boundaries and Opportunities for Synergy?

Infrastructure Matters: POWER8 vs. Xeon x86

OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC

Achieving Performance Isolation with Lightweight Co-Kernels

Intel Cluster Ready Appro Xtreme-X Computers with Mellanox QDR Infiniband

Parallel file I/O bottlenecks and solutions

Building All-Flash Software Defined Storages for Datacenters. Ji Hyuck Yun Storage Tech. Lab SK Telecom

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging

Application and Micro-benchmark Performance using MVAPICH2-X on SDSC Gordon Cluster

Cluster Scalability of ANSYS FLUENT 12 for a Large Aerodynamics Case on the Darwin Supercomputer

THE EXPAND PARALLEL FILE SYSTEM A FILE SYSTEM FOR CLUSTER AND GRID COMPUTING. José Daniel García Sánchez ARCOS Group University Carlos III of Madrid

HP ProLiant BL660c Gen9 and Microsoft SQL Server 2014 technical brief

Enabling High performance Big Data platform with RDMA

How To Build A Supermicro Computer With A 32 Core Power Core (Powerpc) And A 32-Core (Powerpc) (Powerpowerpter) (I386) (Amd) (Microcore) (Supermicro) (

BSC vision on Big Data and extreme scale computing

IBM Spectrum Scale vs EMC Isilon for IBM Spectrum Protect Workloads

Accelerating Real Time Big Data Applications. PRESENTATION TITLE GOES HERE Bob Hansen

Scaling Objectivity Database Performance with Panasas Scale-Out NAS Storage

Enabling Technologies for Distributed Computing

Overview: X5 Generation Database Machines

GraySort on Apache Spark by Databricks

Where IT perceptions are reality. Test Report. OCe14000 Performance. Featuring Emulex OCe14102 Network Adapters Emulex XE100 Offload Engine

SR-IOV In High Performance Computing

SGI High Performance Computing

Performance Beyond PCI Express: Moving Storage to The Memory Bus A Technical Whitepaper

The Evolution of Microsoft SQL Server: The right time for Violin flash Memory Arrays

Mellanox Academy Online Training (E-learning)

Accelerating Hadoop MapReduce Using an In-Memory Data Grid

File System & Device Drive. Overview of Mass Storage Structure. Moving head Disk Mechanism. HDD Pictures 11/13/2014. CS341: Operating System

Deploying Flash- Accelerated Hadoop with InfiniFlash from SanDisk

LS DYNA Performance Benchmarks and Profiling. January 2009

Data-intensive HPC: opportunities and challenges. Patrick Valduriez

Pedraforca: ARM + GPU prototype

Enabling Technologies for Distributed and Cloud Computing

Direct NFS - Design considerations for next-gen NAS appliances optimized for database workloads Akshay Shah Gurmeet Goindi Oracle

Lustre * Filesystem for Cloud and Hadoop *

Cray DVS: Data Virtualization Service

HP ProLiant SL270s Gen8 Server. Evaluation Report

HADOOP ON ORACLE ZFS STORAGE A TECHNICAL OVERVIEW

News and trends in Data Warehouse Automation, Big Data and BI. Johan Hendrickx & Dirk Vermeiren

High Performance Data-Transfers in Grid Environment using GridFTP over InfiniBand

HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK

GPU System Architecture. Alan Gray EPCC The University of Edinburgh

High speed pattern streaming system based on AXIe s PCIe connectivity and synchronization mechanism

IT of SPIM Data Storage and Compression. EMBO Course - August 27th! Jeff Oegema, Peter Steinbach, Oscar Gonzalez

Netapp HPC Solution for Lustre. Rich Fenton UK Solutions Architect

Main Memory Data Warehouses

Big Data Challenges in Bioinformatics

Cloud Computing Where ISR Data Will Go for Exploitation

EDUCATION. PCI Express, InfiniBand and Storage Ron Emerick, Sun Microsystems Paul Millard, Xyratex Corporation

PCI Express and Storage. Ron Emerick, Sun Microsystems

Overlapping Data Transfer With Application Execution on Clusters

Transcription:

Exploring the Capabilities of a Massively Scalable, Compute-in-Storage Architecture Blake G. Fitch bgf@us.ibm.com

2 The Extended, World Wide Active Storage Fabric Team Blake G. Fitch Robert S. Germain Michele Franceschini Todd Takken Lars Schneidenbach Bernard Metzler Heiko J. Schick Peter Morjan Ben Krill T.J. Chris Ward Michael Deindl Michael Kaufmann Marc Dombrowa David Satterfield Joachim Fenkes Ralph Bellofatto. and many other associates and contributors.

Scalable Data-centric Computing Analytics Visualization Interpretation Data Modeling Simulation Data Acquisition Massive Scale Data and Compute 3 3

Blue Gene/Q: A Quick Review 3. Compute card: One chip module, 16 GB DDR3 Memory, Heat Spreader for H 2 O Cooling 4. Node Card: 32 Compute Cards, Optical Modules, Link Chips; 5D Torus 2. Single Chip Module 1. Chip: 16+2 P cores 5b. IO drawer: 8 IO cards w/16 GB 8 PCIe Gen2 x8 slots 3D I/O torus 7. System: 96 racks, 20PF/s 5a. Midplane: 16 Node Cards Sustained single node perf: 10x P, 20x L MF/Watt: (6x) P, (10x) L (~2GF/W, Green 500 criteria) Software and hardware support for programming models for exploitation of node hardware concurrency 4 6. Rack: 2 Midplanes

Blue Gene/Q Node: A System-on-a-Chip 5 Integrates processors, memory and networking logic into a single chip 360 mm² Cu-45 technology (SOI) 16 user + 1 service PPC processors plus 1 redundant processor all processors are symmetric 11 metal layer each 4-way multi-threaded 64 bits 1.6 GHz L1 I/D cache = 16kB/16kB L1 prefetch engines each processor has Quad FPU (4-wide double precision, SIMD) peak performance 204.8 GFLOPS @ 55 W Central shared L2 cache: 32 MB edram multiversioned cache supports transactional memory, speculative execution. supports scalable atomic operations Dual memory controller 16 GB external DDR3 memory 42.6 GB/s DDR3 bandwidth (1.333 GHz DDR3) (2 channels each with chip kill protection) Chip-to-chip networking 5D Torus topology + external link 5 x 2 + 1 high speed serial links each 2 GB/s send + 2 GB/s receive DMA, remote put/get, collective operations External (file) IO -- when used as IO chip. PCIe Gen2 x8 interface (4 GB/s Tx + 4 GB/s Rx) re-uses 2 serial links interface to Ethernet or Infiniband cards

HPC Applications Running on Blue Gene/Q at Release greatly enhanced by system level co-design Application Owner Application Owner Application Owner CFD Alya System Barcelona SC DFT igryd Jülich BM: SPEC2006, SPEC openmp SPEC CFD (Flame) AVBP CERFACS Consortium DFT KKRnano Jülich BM: NAS Parallel Benchmarks NASA CFD dns3d Argonne National Lab DFT ls3df Argonne National Lab BM: RZG (AIMS,Gadget,GENE, GROMACS,NEMORB,Octopus, Vertex) CFD OpenFOAM SGI DFT PARATEC NERSC / LBL Coulomb Solver - PEPC Jülich CFD NEK5000, NEKTAR Argonne, Brown U DFT CPMD IBM/Max Planck MPI PALLAS UCB CFD OVERFLOW NASA, Boeing DFT QBOX LLNL Mesh AMR CCSE, LBL CFD Saturne EDF DFT VASP U Vienna & Duisburg PETSC Argonne National Lab CFD LBM Erlanger-Nuremberg Q Chem GAMESS Ames Lab/Iowa State MpiBlast-pio Biology VaTech / ANL MD Amber UCSF Nuclear Physics GFMC Argonne National Lab RTM Seismic Imaging ENI MD Dalton Univ Oslo/Argonne Neutronics SWEEP3D LANL Supernova Ia FLASH Argonne National Lab MD ddcmd LLNL QCD CPS Columbia U/IBM Ocean HYCOM NOPP / Consortium MD LAMMPS Sandia National Labs QCD MILC Indiana University Ocean POP LANL/ANL/NCAR MD MP2C Jülich Plasma GTC PPPL Weather/Climate CAM NCAR MD NAMD UIUC/NCSA Plasma GYRO (Tokamak) General Atomics Weather/Climate Held-Suarez Test GFDL MD Rosetta U Washington KAUST Stencil Code Gen KAUST Climate HOMME NCAR DFT GPAW Argonne National Lab BM:sppm,raptor,AMG,IRS,sphot Livermore Weather/Climate WRF, CM1 NCAR, NCSA Accelerating Discovery and Innovation in: Materials Science Energy Engineering Climate & Environment Life Sciences RZG 6 Silicon Design Next Gen Nuclear High Efficiency Engines Oil Exploration Whole Organ Simulation

Scalable, Data-centric Programming Models Join Classic HPC Structured Comms Random Comms (*pre-partitioned) Data in Motion: High Velocity Mixed Variety High Volume* (*over time) Embarrassingly Parallel Data Intensive: Data at Rest JAQL, Java Input Data (on disk) Output Data Reducers Mappers Data Intensive: Streaming Data SPL, C, Java Reactive Analytics Extreme Ingestion Data at Rest*: High Volume Mixed Variety Low Velocity Extreme Scale-out Compute Intensive (Data Generators) C/C++, Fortran, MPI, OpenMP = compute node Network Dependent Data and Compute Intensive (Large Address Space) C/C++, Fortran, UPC, SHMEM, MPI, OpenMP Data is Moving: Long Running All Data View Small Messages Discrete Math Low Spatial Locality Data is Generated: Long Running Small Input Massive Output Generative Modeling Extreme Physics 7 7

8 Graph 500: Demonstrating Blue Gene/Q Data-centric Capability June 12: www.graph500.org/results_june_2012 A particularly important analytics kernel Random memory access pattern Very fine access granularity High load imbalance in Communication and Computation Data dependent communications patterns Blue Gene features which helped: cost-scalable bisectional bandwidth low latency network with high messaging rates large system memory capacity low latency memory systems on individual nodes

9 BG/Q TCO Internet Scale Cost-structure By the Rack Annualized TCO of HPC Systems (Cabot Partners) BG/Q saves ~$300M/yr!

10 Active Storage Concept: Scalable, Solid State Storage with BG/Q How to guide: Remove 512 of 1024 BG/Q compute nodes in rack to make room for solid state storage Integrate 512 Solid State (Flash+) Storage Cards in BG/Q compute node form factor Standard BG/Q Compute Fabric BQC Compute Card Node card 16 BQC + 16 PCI Flash cards PCIe Flash Board 10GbE FPGA Flash Storage Capacity I/O Bandwidth 2012 Targets 2 TB 2 GB/s IOPS 200 K PCIe BGAS Rack Targets 512 Hs4 Cards Nodes 512 Storage Cap 1 PB System Software Environment Linux OS enabling storage + embedded compute OFED RDMA & TCP/IP over BG/Q Torus failure resilient Standard middleware GPFS, DB2, MapReduce, Streams Active Storage Target Applications Parallel File and Object Storage Systems Graph, Join, Sort, order-by, group-by, MR, aggregation Application specific storage interface Key architectural balance point: All-to-all throughput roughly equivalent to Flash throughput I/O Bandwidth Random IOPS Compute Power Network Bisect. 1 TB/s 100 Million 104 TF External 10GbE 512 512 GB/s scale it like BG/Q.

11 An HPC System with Active Storage supporting I/O and Analysis Compute Dense Compute Fabric Active Storage Fabric Archival Storage Disk/Tape External Switch, BG/Q Compute Racks BG/Q IO Links BGAS Racks 10GbE Point-to-point links Visualization Server, etc IO nodes have non-volatile memory as storage and external Ethernet Compute nodes not required for data-centric systems but offer higher density for HPC Compute nodes and IO nodes likely have Blue Gene type nodes and torus network IO Node cluster supports file systems in non-volatile memory and on disk On-line data does may not leave IO fabric until ready for long term archive Manual or automated hierarchical storage manages data object migration among tiers

12 Blue Gene/Q I/O Node Packaging Air cooled BG/Q nodes 3U, 19 inch rack mount Optics / Link chips Single PCIe Gen 2.0 x8 slot per node Integrated BG/Q Torus Network I/O links connect to compute racks Fans Optics adapters I/O Nodes use same SoC chip as compute nodes 48V power input Clock card master connector BPE communication Clock distribution Service network Display

Blue Gene Active Storage 64 Node Prototype Scalable Hybrid Memory PCIe Flash device BG/Q I/O Drawer (8 nodes) 10GbE FPGA 8 Flash cards 13 PCIe Blue Gene Active Storage 64 Node Prototype (4Q12) IBM 19 inch rack w/ power, BG/Q clock, etc PowerPC Service Node 8 IO Drawers 3D Torus (4x4x4) System specification targets: 64 BG/Q Nodes 12.8TF 128 TB Flash Capacity (SLC) 128 GB/s Flash I/O Bandwidth 128 GB/s Network Bisection Bandwidth 4 GB/s Per node All-to-all Capability 128x10GbE External Connectivity 256 GB/s I/O Link Bandwidth to BG/Q Compute Nodes Software Targets Linux 2.6.32 TCP/IP, OFED RDMA GPFS MVAPICH SLURM BG/Q IO Rack 12 I/O Boxes

Linux Based Software Stack For BG/Q Node Java SLURM LAMMPS DB2 Applications PIMD InfoSphere Streams Userspace Benchmarks iperf IOR FIO JavaBench d Data-centric Embedded HPC Network File Systems GPFS Hadoo p Frameworks ROOT/PROOF MPICH2 Message Passing Interface OpenMPI MVAPICH2 Nuero Tissue Simulation (Memory/IO bound) Storage Class Memory HS4 User Library RDMA Library Open Fabrics RoQ Library Joins / Sorts Graphbased Algorithms HS4 Flash Device Driver RDMA Core SRP Resource partitioning hypervisor? Static or dynamic?" IPoIB RoQ Device Driver Linux Kernel 2.6.32-220.4.2.bgq.el6.ppc64bgq BG/Q Patches BGAS Patches IP-ETH over RoQ New High Performance Solid State Store Interfaces RoQ Microcode (OFED Device) Function-shipped I/O Network Resiliency Support Embedded MPI Runtime options (?): ZeptoOS Linux port of PAMI Firmware + CNK Fused OS KVM + CNK hypervised physical resources HS4 (PCIe Flash) Blue Gene/Q Chip Resource partitioning hypervisor? Static or dynamic? C C C M HW C C M N C C M N 14 14 2 TB Flash Memory 11 Cores (44 Threads) 11 GB Memory I/O Torus 1 GB Memory Collectives 4 GB Memory

Active Storage Stack Optimizes Network/Storage/NVM Data Path 15 Scalable, active storage currently involves three server side state machines Network (TCP/IP, OFED RDMA), Storage Server (GPFS, PIMD, etc), and Solid State Store (Flash Cntl) These state machines will evolve and potentially merge as to better server scalable, data-intensive applications. Early applications will shape this memory/storage interface evolution Storage Server State Machine RDMA READ WRITE (e.g. KV Store) USER SPACE QUEUES (NVM Express-like) BLOCK READ / WRITE Network State Machine (e.g. RDMA) e.g.: Stream SCM direct to network Flash Control State Machine (e.g. Flash DMA) Area of research

16 BGAS Full System Emulator -- BGAS on BG/Q Compute Nodes Leverages BG/Q Active Storage (BGAS) Environment BG/Q + Flash memory, Linux REHL 6.2 standard network interfaces (OFED RDMA, TCP/IP) standard middleware (GPFS, etc) BGAS environment + Soft Flash Controller + Flash Emulator SFC breaks the work up between device driver and FPGA logic The Flash emulator manages RAM as storage with Flash access times Explore scalable, SoC, SCM (Flash, PCM) challenges and opportunities Work with the many device queues necessary for BG/Q performance Work on the software interface between network and SCM RDMA direct into SCM Realistic BGAS development platform Allows development of storage systems that deal with BGAS challenges GPFS-SNC should run GPFS-Perseus type declustered raid Multi-job platform challenges QoS requirements on the network Resiliency in network, SCM, and cluster system software Running today: InfoSphere Streams, DB2, GPFS, Hadoop, MPI, etc

17 BGAS Platform Performance Emulated Storage Class Memory Software Environment Tests Linux Operating System OFED RDMA on BG/Q Torus Network Fraction of 16 GB DRAM used to emulate Flash storage (RamDisk) on each node GPFS uses emulated Flash to create a global shared file system IOR Standard Benchmark all nodes do large contiguous writes tests A2A BW internal to GPFS All-to-all OFED RDMA verbs interface MPI All-to-all in BG/Q product environment a light weight compute node kernel (CNK) Results IOR used to benchmark GPFS 512 nodes 0.8 TB/s bandwidth to emulated storage Network software efficiency for all-to-all BG/Q MPI on CNK: 95% OFED RDMA verbs 80% GPFS IOR 40% - 50% (room to improve!) GPFS Performance on BGAS 0.8 TB/s GPFS Bandwidth to scratch file system (IOR on 512 Linux nodes)

Questions? Analytics Visualization Interpretation Data Modeling Simulation Data Acquisition Massive Scale Data and Compute 18 18

19 Parallel In-Memory Database (PIMD) Key/value object store Similar in function to Berkeley DB Support for partitioned datasets named containers for groups of K/V records Variable length key, variable length value Record values maybe appended to, or accessed/updated by byte range Data consistency enforced at the record level by default In-Memory DRAM and/or Non-volatile memory (with migration to disk supported). Other functions Several types of interators from generic next-record to streaming, parallel, sorted keys Sub-record projections Bulk insert Server controlled embedded function -- could include further push down into FPGA Parallel Client/Server Storage System Server is a state machine is driven by OFED RDMA and Storage events MPI client library wraps OFED RDMA connections to servers Hashed data distribution Generally private hash to avoid data imbalance in servers Considering allowing user data placement with a maximum allocation at each server Resiliency Currently used for scratch storage which is serialized into files for resiliency Plan to enable scratch, replication, and network raid resiliency on PDS granularity

The Classic Parallel I/O Stack v. a Compute-In-Storage Approach Classic parallel IO stack to access external storage Compute-in-storage Apps directly connect to scalable K/V storage Application HDF5 MPI-IO Key/Value Clients RDMA RDMA Connections Network 20 From: http://e-archivo.uc3m.es/bitstream/10016/9190/1/thesis_fjblas.pdf Key/Value Servers Non-Volatile Mem

Compute-in-Storage: HDF5 Mapped to a Scalable Key/Value Inteface Storage-embedded parallel programs can use HDF5 Many scientific packages already use HDF5 for I/O HDF5 mapped scalable key/value storage (SKV) client interface: native key/value tuple representation of records MPI/IO (adio) --> support for high-level APIs: HDF5 --> broad range of applications SKV provides lightweight direct access to NVM Compute-in-storage Apps directly connect to scalable K/V storage Application HDF5 MPI-IO Key/Value Clients client-server communications use OFED RDMA Scalable Key/Value storage (SKV) Design stores key/value records in (non-volatile) memory distributed parallel client-server thin server core: mediate between network and storage client access: RDMA only Features: non-blocking requests (deep queues) global/local iterators access to partial values (insert/retrieve/update/append) tuple-based access with server side predicates and projection RDMA RDMA Connections Network Key/Value Servers 21 Non-Volatile Mem

22 Structure Of PETSc From: http://www.it.uu.se/research/conf/scse07/material/gropp.pdf

Query Acceleration: Parallel Join vs Distributed Scan Total size in normal form: 4 x 10 8 Bytes (Most compact representation) Star Scheme Data Warehouse Total size in denormalized form: 1.2 x 10 10 Bytes (60X growth in data volume) Denormalized Table Customer Table Customer id Gender Income Zip code Phone number 250,000 rows 10,000,000 rows Fact Table Primary key Customer id date/time Product id Store id 500 rows Store Table Store id Zip code Phone number Product Table Product id Product name Product description 4000 rows Database scheme denormalized for scan based query acceleration Denormalized Table/View Primary key Customer id Gender Income Zip code Phone number Product id Product name Product description Store id Zip code Phone number 10,000,000 rows Scheme denormalization is often done for query acceleration in systems with weak or nonexistant network bisection. Strong network bisection enables efficient, ad-hoc table joins allowing data to remain in compact normla form. 23 Data Size Primary key 8 bytes Customer id 8 bytes Date/time 11 bytes Product id 8 bytes Store id 4 bytes Gender 1 byte Zip code 9 bytes Phone number 10 bytes Product name varchar(128) Product description varchar(1024)

Observations, comments, and questions How big are the future datasets? How random are the accesses? How much concurrency in algorithms? Heroic programming will be probably be required to make 100,000 node programs work well what about down scaling? A program using a library will usually call multiple interfaces during its execution life cycle what are the options for data distribution? Domain workflow design may not be the same skill as building a scalable parallel operator what will it will take care to enable these activities to be independent? A program using a library will only spend part of its execution time in the library can/must parallel operators in the library be pipelined or execute in parallel? Load balance who s responsibility? Are interactive workflows needed? Would it help to reschedule a workflow while a person thinks to avoid speculative runs? Operator pushdown how far? There is higher bandwidth and lower latency in the parallel storage array than outside, in the node than on the network, in the storage controller (FPGA) than in the node Push operators close to data but keep a good network around for when that isn t possible 24 Workflow End User Workflow definition Domain Language (scripting?) Collection of heroically coded parallel operators Domain Data Model (e.g. FASTA, FASTQ K/V Datasets) Global Storage Layer GPFS, K/V, etc Memory/Storage Controllers Support offload to Node/FPGA Hybrid Non-volatile Memory DRAM + Flash + PCM?

25 DOE Extreme Scale: Conventional Storage Planning Guidelines http://www.nersc.gov/assets/hpc-requirements-for-science/hpssextremescalefinalpublic.pdf

26 HPC I/O Requirements 60 TB/s Drive From Rick Stevens: http://www.exascale.org/mediawiki/images/d/db/planningforexascaleapps-steven.pdf

27 Exascale IO: 60TB/s Drives Use of Non-volatile Memory 60 TB/s bandwidth required Driven by higher frequency check points due to low MTTI Driven by tier 1 file system requirements HPC program IO Scientific analytics at exascale Disks: ~100 MB/s per disk ~600,000 disks! ~600 racks??? Flash ~100 MBps write bandwidth per flash package ~600,000 Flash packages ~60 Flash packages per device ~6 GBps bandwidth per device (e.g. PCIe 3.0 x8 Flash adaptor) ~10,000 Flash devices Flash is already more cost effective than disk for performance (if not capacity) at the unit level and this effect is amplified by deployment requirements (power, packaging, cooling) Conclusion: Exascale class systems will benefit from integration of a large solid state storage subsystem

Torus Networks Cost-scalable To Thousands Of Nodes Logic Diagram Physical Layout Packaged Node Cost scales linearly with number of nodes Torus all-to-all throughput does fall rapidly for very small system sizes But, bisectional bandwidth continues to rise as system grows A hypothetical 5D Torus with 1GB/s links yields theoretical peak all-to-all bandwidth of: 1GB/s (1 link) at 32k nodes (8x8x8x8x8) Above 0.5GB/s out to 1M nodes Mesh/Torus networks can be effective for data intensive applications where cost/bisection-bw is required A2A BW/Node (GB/s) 4 3.5 3 2.5 2 1.5 1 0.5 3D-Torus per Node 3D-Torus Aggregate 4D-Torus per Node 4D-Torus Aggregate 5D-Torus per Node 5D-Torus Aggregate Full Bisection per Node Full Bisection Aggregate 40000 35000 30000 25000 20000 15000 10000 5000 Aggregate A2A BW (GB/s) 28 0 0 0 5000 10000 15000 20000 25000 30000 Node Count