Accelerating Lustre! with Cray DataWarp
Steve Woods, Solutions Architect
Accelerate Your Storage!
The problem
A new storage hierarchy
DataWarp overview
End user perspectives
Use cases
Features
Examples
Configuration considerations
Summary
The Problem
Buying disk for bandwidth is expensive (HPCwire, May 1, 2014; attributed to Gary Grider, LANL)
New Storage Hierarchy
Traditional: on node, CPU and memory (DRAM); off node, storage (HDD).
Today: on node, CPU, near memory (HBM/HMC), and far memory (DRAM/NVDIMM); off node, near storage (SSD) and far storage (HDD).
The tiers run from highest effective cost and lowest latency at the top to lowest effective cost and highest latency at the bottom.
New Storage Hierarchy
DataWarp: software defined storage, a high performance storage pool (covers the bandwidth needed).
Sonexion: scalable file system, resilient storage (covers the capacity needed).
Problem solved: scale bandwidth separately from capacity, reduce overall solution cost, improve application run time.
Blending Flash with Disk for High Performance Lustre
Blended solution: DataWarp satisfies the bandwidth needs, Sonexion satisfies the capacity needs, driving down the cost of bandwidth ($/GB/s).
Sonexion-only solution: many SSUs are needed just for bandwidth, driving up the cost of bandwidth ($/GB/s).
DataWarp Overview
Hardware: Intel server, block-based SSDs, Aries I/O blade = raw performance.
Software: virtualizes the underlying hardware, a single solution spanning flash and HDD, automation via policy, intuitive interface = harnesses the performance.
Software Phases of DataWarp
Phase 0 (available 2014)
  Statically configured compute node swap
  Single server file systems, /flash/
Phase 1 (fall 2015) [CLE 5.2 UP04 + patches]
  Dynamic allocation and configuration of DataWarp storage to jobs (WLM support)
  Application controlled explicit movement of data between DataWarp and the parallel file system (stage_in and stage_out)
  DVS striping across DataWarp nodes
Phase 2 (late 2016) [CLE 6.0 UP02]
  DVS client caching
  Implicit movement of data between DataWarp and PFS storage (cache); no application changes required
DataWarp Hardware Package
Standard XC I/O blade with SSDs instead of PCIe cables = plugs right into the Aries network.
Capacity: 2 nodes per blade, 2 SSDs per node = 12.6 TB per blade (shown).
Performance: the node processors are already optimized for I/O and the Cray Aries network.
[Blade diagram: two DataWarp nodes on one Aries I/O blade, each with two 3.2 TB SSDs, connected through the fabric to Lustre storage.]
DataWarp Software
Service layer (DWS): defines the user experience.
Data Virtualization Service (DVS): virtualizes I/O, distributed file presentation.
File system layer (DWFS): virtualizes the pool of flash.
[Stack diagram: WLM / user / application over the DataWarp Service, the Data Virtualization Service, DWFS, the Logical Volume Manager, and the devices, alongside the PFS (open source file system).]
DataWarp User Perspectives
Transparent: new user, no change to their experience (e.g. PFS cache).
Active: experienced user, WLM script commands; common for most use cases.
Optimized: power user, control via library/CLI (e.g. asynchronous workflows).
DataWarp User Perspectives: Workload Manager (WLM) Integration
The researcher/engineer inserts DataWarp commands into the job script:
  I need this much space in the DataWarp pool.
  I need the space in DataWarp to be shared.
  I need the results saved out to the parallel file system.
The job script requests resources via the WLM: DataWarp capacity, compute nodes, files, and file locations.
The WLM automates clean up after the application completes.
WLM integration is the key: ease of use and dynamic provisioning. A sketch of such a job script follows.
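The following is a minimal sketch of a Moab/Torque-style script making those three requests. The application name, paths, sizes, and node counts are illustrative assumptions, and the exact staging directive syntax can vary by WLM and DataWarp release.

#!/bin/bash
#PBS -l walltime=1:00:00 -j oe -l nodes=8
#DW jobdw type=scratch access_mode=striped capacity=1TiB
#DW stage_in source=/lus/scratch/myuser/input destination=$DW_JOB_STRIPED/input type=directory
#DW stage_out source=$DW_JOB_STRIPED/results destination=/lus/scratch/myuser/results type=directory
cd $PBS_O_WORKDIR
# my_app is a placeholder; results written under $DW_JOB_STRIPED/results are
# copied back to Lustre by the stage_out directive when the job completes.
aprun -n 256 ./my_app $DW_JOB_STRIPED/input $DW_JOB_STRIPED/results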
DataWarp User Perspectives: Supported Workload Managers
SLURM, Moab/Torque, and PBS Pro.
[Stack diagram: WLM / user / application over the DataWarp Service, the Data Virtualization Service, DWFS and XFS, the Logical Volume Manager, and the devices, alongside the PFS.]
Use Cases for DataWarp
Shared storage: reference files, file interchange, high performance scratch (we'll focus here).
Local storage: private scratch space, swap space.
PFS cache: local cache for the PFS, transparent user model.
Burst buffer: checkpoint/restart.
Use Cases for DataWarp: Shared Storage
Reference files: read intensive, commonly used by multiple compute nodes.
DataWarp provides user directed behavior and automated provisioning of resources.
[Diagram: Cray HPC compute nodes reading shared reference files from DataWarp nodes.]
Use Cases for DataWarp: Shared Storage
File interchange: sharing intermediate work between compute nodes.
DataWarp provides user directed behavior and automated provisioning of resources.
[Diagram: Cray HPC compute nodes exchanging files through DataWarp nodes.]
Use Cases for DataWarp: Shared Storage
High performance scratch: files are striped across the DataWarp pool.
DataWarp provides user directed behavior and automated provisioning of resources.
[Diagram: Cray HPC compute nodes striping files across DataWarp nodes.]
Use Cases for DataWarp
Shared storage: reference files, file interchange, high performance scratch.
Local storage: private scratch space, swap space.
PFS cache: local cache for the PFS, transparent user model.
Burst buffer: checkpoint/restart.
DataWarp Application Flexibility
Four placements of DataWarp nodes between the Cray HPC compute nodes and Sonexion Lustre:
  Burst buffer: data bursts from the compute nodes to the DataWarp nodes, then trickles out to Sonexion Lustre.
  Shared storage
  Local storage
  PFS cache
#DW jobdw ... requests a job DataWarp instance
  Lifetime is the same as the batch job; only usable by that batch job.
capacity=<size>
  Indirect control over server count based on granularity; it might help to request more space than you need.
type=scratch
  Selects use of the DWFS file system.
type=cache
  Selects use of the DWFS file system in cache mode (see the cache example below).
#DW jobdw ... (continued)
access_mode=striped
  All compute nodes see the same file system.
  Files are striped across all allocated DW server nodes and are visible to all compute nodes using the instance.
  Aggregates both capacity and bandwidth per file.
access_mode=private
  Each compute node sees a different file system.
  Files only go to a single DW server node; a compute node always uses the same DW node, and its files are seen only by that compute node.
access_mode=striped,private
  Two mount points are created on each compute node; they share the same space.
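A minimal sketch showing both access modes in one job; the node count and capacity are arbitrary assumptions, and the mount-point environment variables are the ones used in the examples that follow.

#!/bin/bash
#PBS -l walltime=0:30:00 -j oe -l nodes=4
#DW jobdw type=scratch access_mode=striped,private capacity=400GiB
cd $PBS_O_WORKDIR
# $DW_JOB_STRIPED is the same shared file system on every compute node;
# $DW_JOB_PRIVATE is a per-compute-node view carved out of the same space.
aprun -n 4 -N 1 df -h $DW_JOB_STRIPED $DW_JOB_PRIVATE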
Simple DataWarp job with Moab
#!/bin/bash
#PBS -l walltime=2:00 -j oe -l nodes=8
#DW jobdw type=scratch access_mode=striped capacity=790GiB
. /opt/modules/default/init/bash
module load dws
dwstat most                        # show DW space available and allocated
cd $PBS_O_WORKDIR
aprun -n 1 df -h $DW_JOB_STRIPED   # mount point only visible on compute nodes
IOR=/home/users/dpetesch/bin/IOR.X
aprun -n 32 -N 4 $IOR -F -t 1m -b 2g -o $DW_JOB_STRIPED/IOR_file
DataWarp scratch vs. cache
Scratch (phase 1):
#!/bin/bash
#PBS -l walltime=4:00:00 -j oe -l nodes=1
#DW jobdw type=scratch access_mode=striped capacity=200GiB
cd $PBS_O_WORKDIR
export TMPDIR=$DW_JOB_STRIPED
NAST="/msc/nast20131/bin/nast20131 scr=yes bat=no sdir=$TMPDIR"
ccmrun ${NAST} input.dat mem=16gb mode=i8 out=dw_out

Cache (phase 2):
#!/bin/bash
#PBS -l walltime=4:00:00 -j oe -l nodes=1
#DW jobdw type=cache access_mode=striped pfs=/lus/scratch/dw_cache capacity=200GiB
cd $PBS_O_WORKDIR
export TMPDIR=$DW_JOB_STRIPED_CACHE
NAST="/msc/nast20131/bin/nast20131 scr=yes bat=no sdir=$TMPDIR"
ccmrun ${NAST} input.dat mem=16gb mode=i8 out=dw_cache_out
DataWarp Bandwidth
The DataWarp bandwidth seen by an application depends on multiple factors:
  Transfer size of the I/O requests.
  Number of active streams (files) per DataWarp server (for file-per-process I/O, this equals the number of processes).
  Number of DataWarp server nodes (which is related to the capacity requested).
  Other activity on the DW server nodes, including administrative work and other user jobs; it is a shared resource.
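One way to see how many server nodes and how much pool capacity are available (and therefore how widely a request can stripe) is the dwstat utility already used in the Moab example above. A short interactive sketch; pool names and granularities vary by site.

module load dws
dwstat pools    # pool names, granularity, free and total capacity
dwstat nodes    # DataWarp server nodes and their capacity
dwstat most     # summary of sessions, instances, configurations and fragments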
Minimize Compute Residence Time with DataWarp
[Timeline diagram comparing a Lustre-only run with a DataWarp run. Lustre only: the compute nodes spend wall time on the initial data load, compute, timestep writes, and final data writes, with Lustre I/O time on the critical path. With DataWarp: the DW preload and DW post dump are handled by the DataWarp nodes while the compute nodes are idle or released, and timestep writes go to DataWarp, shrinking the compute node wall time.]
DataWarp with MSC NASTRAN
Cray blog reference: http://www.cray.com/blog/io-accelerator-boosts-msc-nastran-simulations/
Job wall clock reduced by 2x with DataWarp compared to Lustre only.
Abaqus 2016, s4e model, 24M elements, 2 ranks per node, 16-core 2.3 GHz Haswell, 128 GB nodes
[Chart: elapsed seconds for the Standard solver versus core count (128 to 1,536 cores on 4 to 48 nodes) for four configurations: XC40 ABI Lustre, CS400 Lustre, XC40 ABI DataWarp, and CS400 /tmp.]
DataWarp Considerations
Know your workload: capacity requirement, bandwidth requirement, iteration interval.
Calculate the ratio of DataWarp to spinning disk:
  The share of the calculated bandwidth needed from DW vs. HDD.
  Whether excess bandwidth is needed to sync to HDD.
  The share of storage capacity DW needs to maintain performance and hold multiple iterations.
Budget. A worked sketch of the arithmetic follows.
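A back-of-the-envelope sketch of that sizing arithmetic, written as shell arithmetic so the inputs can be adjusted; every number here is a hypothetical workload, not a sizing recommendation.

#!/bin/bash
ckpt_gb=50000        # assumed data written per iteration, GB (50 TB)
interval_s=1800      # assumed iteration interval, seconds (30 minutes)
burst_window_s=300   # assumed time budget to absorb each burst into DataWarp, seconds
echo "DataWarp burst bandwidth needed: $(( ckpt_gb / burst_window_s )) GB/s"
echo "PFS drain bandwidth needed:      $(( ckpt_gb / interval_s )) GB/s"
# Capacity: keeping two iterations resident would need 2 x 50 TB = 100 TB of DataWarp space.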
DataWarp Bottom Line
It is about reducing time to solution: returning control back to compute and reducing the cost of time to solution.
DataWarp Summary
Faster time to insight. Easy to use. Accelerates performance. Dynamic. Flexible.
Questions?