Portable, Scalable, and High-Performance I/O Forwarding on Massively Parallel Systems. Jason Cope copej@mcs.anl.gov




Computation and I/O Performance Imbalance
- Leadership-class computational scale: >100,000 processes, multi-core architectures, lightweight operating systems on compute nodes
- Leadership-class storage scale: >100 servers, cluster file systems, commercial storage hardware
- The compute and storage imbalance in current leadership-class systems hinders application I/O performance
  - 1 GB/s of storage throughput for every 10 TF of computational performance
  - The gap has increased by a factor of 10 in recent years

DOE FastOS2 I/O Forwarding Scalability Layer (IOFSL) Project
Goal: design, build, and distribute a scalable, unified high-end computing I/O forwarding software layer that would be adopted by the DOE Office of Science and NNSA.
- Reduce the number of file system operations that the parallel file system handles
- Provide function shipping at the file system interface level
- Offload file system functions from simple or full OS client processes to a variety of targets
- Support multiple parallel file system solutions and networks
- Integrate with MPI-IO and any hardware features designed to support efficient parallel I/O

Outline
- I/O Forwarding Scalability Layer (IOFSL) overview
- IOFSL deployment on Argonne's IBM Blue Gene/P systems
- IOFSL deployment on Oak Ridge's Cray XT systems
- Optimizations and results
  - Pipelining in IOFSL
  - Request scheduling and merging in IOFSL
  - IOFSL request processing
- Future work and summary

HPC I/O Software Stack
"We need to make I/O software as efficient as possible."

IOFSL Architecture
- Client: MPI-IO using the ZoidFS ROMIO interface; POSIX using libsysio or FUSE
- Network: messages transmitted using BMI over TCP/IP, MX, IB, Portals, and ZOID; messages encoded using XDR
- Server: delegates I/O to the backend file systems using native drivers or libsysio (a sketch of this forwarding path follows below)
[Architecture diagram: the compute-node client stack (ROMIO, libsysio, FUSE, ZoidFS client, network API) crosses the system network to the I/O forwarding server (network API, ZoidFS server), which reaches PVFS, GPFS, and Lustre through native drivers and POSIX/libsysio]
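To make the forwarding path concrete, here is a minimal, self-contained sketch of function shipping at the file system interface level: the client encodes a write request into a flat message instead of executing it locally, and the server decodes the message and replays it against a backend file system. The struct, function names, and wire format are illustrative stand-ins, not the actual ZoidFS, XDR, or BMI interfaces.

```c
/* Minimal, self-contained sketch of function shipping at the file system
 * interface level. The "network" is just a handoff of an encoded request;
 * IOFSL uses the ZoidFS API, XDR encoding, and BMI transports (TCP/IP,
 * MX, IB, Portals, ZOID). All names here are illustrative. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

/* a flat, wire-format write request (stand-in for an XDR-encoded message) */
struct fwd_write_req {
    uint64_t offset;
    uint64_t length;
    char     data[4096];
};

/* client side: package the operation instead of performing it locally */
static void client_encode_write(struct fwd_write_req *req, uint64_t offset,
                                const void *buf, uint64_t len)
{
    req->offset = offset;
    req->length = len;
    memcpy(req->data, buf, len);
}

/* server side: decode the request and replay it against the backend file
 * system (plain POSIX here; IOFSL would use PVFS, GPFS, or Lustre drivers) */
static ssize_t server_execute_write(int backend_fd, const struct fwd_write_req *req)
{
    return pwrite(backend_fd, req->data, req->length, (off_t)req->offset);
}

int main(void)
{
    int fd = open("forwarded.out", O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    const char payload[] = "hello from the compute node\n";
    struct fwd_write_req req;
    client_encode_write(&req, 0, payload, sizeof(payload) - 1);

    /* in IOFSL the encoded request would now cross the system network */
    ssize_t n = server_execute_write(fd, &req);
    printf("forwarded write of %zd bytes\n", n);
    close(fd);
    return 0;
}
```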

Argonne's IBM Blue Gene/P Systems
[System diagram; figure courtesy of Robert Ross, ANL]

IOFSL Deployment on Argonne's IBM Blue Gene/P Systems
[Deployment diagram: compute nodes run the IOFSL and ZOID clients; I/O nodes (IONs), reached over the tree network, run the IOFSL servers, ZOID servers, PVFS2 clients, and GPFS clients; the PVFS2 and GPFS storage servers sit behind a 10 Gbit Ethernet network]

Initial IOFSL Results on Argonne's IBM Blue Gene/P Systems
[Plots: IOR read and IOR write average bandwidth (MiB/s) for CIOD and IOFSL, message sizes 4 KiB to 65536 KiB]

Initial IOFSL Results on Argonne's IBM Blue Gene/P Systems
[Plot: average bandwidth (MiB/s) versus number of clients (64 to 1024) for CIOD non-collective (t=8m) and IOFSL TASK mode (t=8m)]

Oak Ridge's Cray XT Systems
- Enterprise storage: controllers and large racks of disks are connected via InfiniBand; 48 DataDirect S2A9900 controller pairs with 1 TB drives and 4 InfiniBand connections per pair
- Storage nodes: run the parallel file system software and manage incoming FS traffic; 192 dual quad-core Xeon servers with 16 GB of RAM each
- SION network: provides connectivity between OLCF resources and primarily carries storage traffic; 3000+ port 16 Gbit/s InfiniBand switch complex
- Lustre router nodes: run the parallel file system client software and forward I/O operations from HPC clients; 192 (XT5) and 48 (XT4) dual-core Opteron nodes with 8 GB of RAM each
[Diagram: Jaguar XT5, Jaguar XT4, and other systems (viz, clusters) connect to the storage infrastructure; link technologies include Serial ATA (3 Gbit/s), InfiniBand (16 Gbit/s), and the XT5 SeaStar2+ 3D torus (9.6 GB/s), with aggregate bandwidths of 96 to 384 GB/s; figure courtesy of Galen Shipman, ORNL]

IOFSL Deployment on Oak Ridge's Cray XT Systems
[Deployment diagram: compute nodes run the IOFSL clients; Lustre router nodes, reached over a TCP network, run the IOFSL servers and Lustre clients; the Lustre storage servers sit behind a 16 Gbit InfiniBand network]

Initial IOFSL Results on Oak Ridge's Cray XT Systems
[Plot: average bandwidth (MiB/s) versus number of clients (128 to 4096) for IOFSL TASK mode (t=8m) and native XT4 non-collective (t=8m)]

IOFSL Optimization #1: Pipeline Data Transfers
Motivation:
- Limits on the amount of memory available on I/O nodes
- Limits on the number of posted network operations
- Need to overlap network operations and file system operations for sustained throughput
Solution: pipeline data transfers between the IOFSL client and server (see the sketch below)
- Negotiate the pipeline transfer buffer size
- Data buffers are aggregated or segmented at the negotiated buffer size
- Issue network transfer requests for each pipeline buffer
- Reformat pipeline buffers into the original buffer sizes
- Serial and parallel pipeline modes are currently supported
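The serial pipeline mode can be illustrated with a short, self-contained sketch, assuming a negotiated pipeline buffer size: the payload is streamed through one bounded buffer rather than staged whole in I/O-node memory. The helper names and the memcpy "network" are stand-ins; in IOFSL the chunk transfers are posted over BMI, and the parallel mode posts several of them so that receiving one chunk overlaps writing another.

```c
/* Sketch of serial pipelined data transfer with a negotiated buffer size.
 * The "network receive" is simulated with memcpy; IOFSL posts asynchronous
 * BMI transfers so the parallel mode overlaps network and storage I/O. */
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

#define PIPELINE_BUF_SIZE (1 << 20)   /* negotiated between client and server */

/* simulated network receive of one pipeline chunk from the client */
static size_t net_recv_chunk(const char *src, size_t total, size_t done,
                             char *chunk, size_t chunk_cap)
{
    size_t n = total - done < chunk_cap ? total - done : chunk_cap;
    memcpy(chunk, src + done, n);
    return n;
}

/* server-side pipelined write: stream the request through one bounded
 * buffer instead of staging the whole payload in I/O-node memory */
static int pipelined_write(int fd, const char *src, size_t total, off_t offset)
{
    char *chunk = malloc(PIPELINE_BUF_SIZE);
    if (!chunk) return -1;
    size_t done = 0;
    while (done < total) {
        size_t n = net_recv_chunk(src, total, done, chunk, PIPELINE_BUF_SIZE);
        if (pwrite(fd, chunk, n, offset + (off_t)done) != (ssize_t)n) {
            free(chunk);
            return -1;
        }
        done += n;            /* next iteration reuses the same bounded buffer */
    }
    free(chunk);
    return 0;
}

int main(void)
{
    size_t total = 8u << 20;                  /* an 8 MiB application buffer */
    char *src = calloc(total, 1);
    int fd = open("pipeline.out", O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (!src || fd < 0) return 1;
    int rc = pipelined_write(fd, src, total, 0);
    printf("pipelined write %s\n", rc == 0 ? "ok" : "failed");
    close(fd);
    free(src);
    return 0;
}
```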

Pipeline Data Transfer Results for Different IOFSL Server Configurations
[Plot: average bandwidth (MiB/s) versus pipeline buffer size (256 to 8192 MiB) for server configuration #1 (SM events) and server configuration #2 (TASK events)]

IOFSL Optimization #2: Request Scheduling and Merging
Request scheduling aggregates several requests into a bulk I/O request
- Reduces the number of client accesses to the file systems
- With pipeline transfers, overlaps network and storage I/O accesses
Two scheduling modes are supported
- FIFO mode aggregates requests as they arrive
- Handle-Based Round Robin (HBRR) iterates over all active file handles to aggregate requests
Request merging identifies and aggregates noncontiguous requests into contiguous requests (see the sketch below)
- Brute-force mode iterates over all pending requests
- Interval-tree mode compares requests that are on similar ranges
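A minimal sketch of the merging idea, assuming write requests to a single file handle: sort by offset and coalesce contiguous neighbors into larger requests. This is a plain sort-and-scan illustration, not the actual IOFSL scheduler; the real code also applies the FIFO/HBRR scheduling policies and offers the interval-tree merge mode described above.

```c
/* Simplified request merging: pending writes to one file handle are sorted
 * by offset and contiguous neighbors are coalesced into a single larger
 * request. Illustrative only; not the IOFSL implementation. */
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

struct io_req { uint64_t offset; uint64_t length; };

static int cmp_offset(const void *a, const void *b)
{
    const struct io_req *x = a, *y = b;
    return (x->offset > y->offset) - (x->offset < y->offset);
}

/* merge contiguous requests in place; returns the new request count */
static size_t merge_requests(struct io_req *reqs, size_t n)
{
    if (n == 0) return 0;
    qsort(reqs, n, sizeof(*reqs), cmp_offset);
    size_t out = 0;
    for (size_t i = 1; i < n; i++) {
        if (reqs[out].offset + reqs[out].length == reqs[i].offset)
            reqs[out].length += reqs[i].length;   /* contiguous: extend */
        else
            reqs[++out] = reqs[i];                /* gap: start a new request */
    }
    return out + 1;
}

int main(void)
{
    struct io_req reqs[] = {
        { 4096, 4096 }, { 0, 4096 }, { 8192, 4096 }, { 65536, 4096 },
    };
    size_t n = merge_requests(reqs, 4);
    for (size_t i = 0; i < n; i++)
        printf("merged request: offset=%llu length=%llu\n",
               (unsigned long long)reqs[i].offset,
               (unsigned long long)reqs[i].length);
    return 0;     /* prints two requests: [0, 12288) and [65536, 69632) */
}
```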

IOFSL Request Scheduling and Merging Results with the IOFSL GridFTP Driver
[Setup diagram: an MPI application reaches the IOFSL server through MPI-IO or FUSE; the IOFSL server uses its GridFTP driver to forward I/O over the WAN to GridFTP servers fronting a high-performance storage system and an archival storage system]
[Plot: average bandwidth (MiB/s) versus number of clients (8 to 128) with and without request scheduling]

IOFSL Optimization #3: Request Processing and Event Mode
Multi-threaded task mode
- A new thread for executing each I/O request
- Simple implementation
- Thread contention and scalability issues
State machine mode (see the sketch below)
- Uses a fixed number of threads from a thread pool to execute I/O requests
- Divides I/O requests into smaller units of work
- The thread pool schedules I/O requests to run non-blocking units of work (data manipulation, pipeline calculations, request merging)
- Yields execution of I/O requests on blocking resource accesses (network communication, timer events, memory allocations)
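A compact sketch of the state-machine idea, under the assumption that each request can be broken into short non-blocking steps: a fixed pool of worker threads drains a queue of per-request state machines, and a request that is not yet finished is pushed back onto the queue instead of holding a thread, which is where task mode runs into contention. The queue, states, and step function are all illustrative, not the IOFSL event framework.

```c
/* State-machine event mode sketch: a fixed thread pool drains a queue of
 * per-request state machines. Each step is a small non-blocking unit of
 * work; an unfinished request is re-enqueued (a stand-in for yielding on
 * network, timer, or memory events) rather than tying up its own thread. */
#include <stdio.h>
#include <pthread.h>

enum req_state { DECODE, MERGE, ISSUE_IO, REPLY, DONE };

struct request {
    int id;
    enum req_state state;
    struct request *next;
};

static struct request *queue_head;
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;

static void enqueue(struct request *r)
{
    pthread_mutex_lock(&queue_lock);
    r->next = queue_head;
    queue_head = r;
    pthread_mutex_unlock(&queue_lock);
}

static struct request *dequeue(void)
{
    pthread_mutex_lock(&queue_lock);
    struct request *r = queue_head;
    if (r) queue_head = r->next;
    pthread_mutex_unlock(&queue_lock);
    return r;
}

/* one non-blocking unit of work: decode, merge, issue I/O, or build reply */
static void run_step(struct request *r)
{
    printf("request %d: running state %d\n", r->id, (int)r->state);
    r->state = r->state + 1;
}

static void *worker(void *arg)
{
    (void)arg;
    struct request *r;
    while ((r = dequeue()) != NULL) {
        run_step(r);
        if (r->state != DONE)
            enqueue(r);                 /* not finished: yield back to the pool */
    }
    return NULL;
}

int main(void)
{
    struct request reqs[4];
    for (int i = 0; i < 4; i++) {
        reqs[i].id = i;
        reqs[i].state = DECODE;
        enqueue(&reqs[i]);
    }
    pthread_t pool[2];                  /* fixed-size thread pool */
    for (int i = 0; i < 2; i++)
        pthread_create(&pool[i], NULL, worker, NULL);
    for (int i = 0; i < 2; i++)
        pthread_join(pool[i], NULL);
    return 0;
}
```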

IOFSL Request Processing and Event Mode: Argonne's IBM Blue Gene/P Results
[Plot: average bandwidth (MiB/s) versus number of clients (64 to 1024) for CIOD non-collective, IOFSL TASK mode, and IOFSL SM mode, all with t=8m]

IOFSL Request Processing and Event Mode: Oak Ridge's Cray XT4 Results
[Plot: average bandwidth (MiB/s) versus number of clients (128 to 4096) for IOFSL TASK mode, IOFSL SM mode, and native XT4 non-collective, all with t=8m]

Current and Future Work
- Scaling and tuning of IOFSL on IBM BG/P and Cray XT systems
- Collaborative caching layer between IOFSL servers
- Security infrastructure
- Integrating IOFSL with end-to-end I/O tracing and visualization tools for the NSF HECURA IOVIS/Jupiter project

Project Participants and Support
- Argonne National Laboratory: Rob Ross, Pete Beckman, Kamil Iskra, Dries Kimpe, Jason Cope
- Los Alamos National Laboratory: James Nunez, John Bent, Gary Grider, Sean Blanchard, Latchesar Ionkov, Hugh Greenberg
- Oak Ridge National Laboratory: Steve Poole, Terry Jones
- Sandia National Laboratories: Lee Ward
- University of Tokyo: Kazuki Ohta, Yutaka Ishikawa
The IOFSL project is supported by the DOE Office of Science and NNSA.

IOFSL Software Access, Documentation, and Links
- IOFSL project website: http://www.iofsl.org
- IOFSL wiki and developers website: http://trac.mcs.anl.gov/projects/iofsl/wiki
- Access to the public IOFSL git repository: git clone http://www.mcs.anl.gov/research/projects/iofsl/git iofsl
Recent publications:
- K. Ohta, D. Kimpe, J. Cope, K. Iskra, R. Ross, and Y. Ishikawa, "Optimization Techniques at the I/O Forwarding Layer," IEEE Cluster 2010 (to appear).
- D. Kimpe, J. Cope, K. Iskra, and R. Ross, "Grids and HPC: Not as Different as You Might Think," Para2010 minisymposium on Real-time Access and Processing of Large Data Sets, April 2010.

Questions? Jason Cope copej@mcs.anl.gov