Modeling Big Data/HPC Storage Using Massively Parallel Simulation
Chris Carothers (CCNI), Misbah Mubarak (CS), Rensselaer Polytechnic Institute, chrisc@cs.rpi.edu
Rob Ross, Phil Carns, MCS/ANL, rross@mcs.anl.gov
Cover image: "Connections of all the sub-networks in the world" by Bill Cheswick, Lumeta Corp, 1998
Brief History of PDES and NM&S
- 1994: Wireless PCS network model. 32x32 square grid → 1,024 LPs; mobile subscribers modeled as events sent to grid LPs; no physical layer modeled. Performance: ~25K ev/sec on 8 DECstation workstations.
- 2003: Packet-level TCP over the AT&T backbone. ~1 million packet-level TCP flows over the AT&T AS. Leveraged Intel P4 HT and multi-core processors. Performance: 200K to 400K ev/sec on 4-way systems.
- 2007: Slice-level BitTorrent model over the NBC network. Up to swarms of 256K BitTorrent clients with 2 to 16 seeders. Consumed upwards of 48 GB of memory. Serial performance only: ~50K ev/sec.
- 2007-present saw a significant bump in PDES performance!
Blue Gene/P Layout
- In ~2009: ALCF/ANL "Intrepid"
- 163K cores / 40 racks @ ~500 TFLOPS
- ~80 TB RAM
- ~8 PB of disk over GPFS
- Custom OS kernel
NSF MRI Balanced Cyberinstrument @ CCNI
- Blue Gene/Q, Phase 1: 400+ teraflops @ 2+ GF/watt; #1 on the Green 500 list (same architecture as the 10 PF and 20 PF DOE systems)
- Exec model: 64K threads / 16K cores; 32 TB RAM; 32 I/O nodes (4x over typical BG/Qs)
- RAM Storage Accelerator: 8 TB @ 60+ GB/sec; 32 servers @ 128 GB each
- Disk storage: 32 servers @ 24 TB disk; bandwidth: 5 to 24 GB/sec
- Viz systems: CCNI: 16 servers w/ dual GPUs; EMACS: display wall + servers
WHAT CAN WE DO WITH THIS COMPUTE POWER FOR STORAGE M&S?
ROSS is an optimistic Time Warp discrete-event simulation engine designed for massively parallel systems.
- PHOLD benchmark on ROSS w/ 1 M LPs @ 10 ev each:
- 12.27 billion ev/sec for 10% remote on 65,536 cores!!
- 4 billion ev/sec for 100% remote on 65,536 cores!!
- Observed similar scaling on Blue Gene/Q using 256K MPI ranks.
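The PHOLD benchmark can be illustrated with a minimal sequential sketch. This is hypothetical code, not the ROSS implementation (an optimistic parallel engine written in C): each LP starts with a fixed number of events, and processing an event schedules one successor at a random future time, sent to a remote LP with some probability.

```python
import heapq
import random

def phold(num_lps=1000, events_per_lp=10, remote_pct=0.10,
          max_events=100_000, seed=42):
    """Sequential PHOLD sketch: every processed event schedules exactly
    one successor, destined for a remote LP with probability remote_pct."""
    rng = random.Random(seed)
    pq = []            # min-heap of (timestamp, sequence, destination LP)
    seq = 0            # sequence number breaks timestamp ties
    for lp in range(num_lps):
        for _ in range(events_per_lp):
            heapq.heappush(pq, (rng.expovariate(1.0), seq, lp))
            seq += 1
    processed = 0
    last_ts = 0.0
    while pq and processed < max_events:
        ts, _, lp = heapq.heappop(pq)
        assert ts >= last_ts       # events come off in timestamp order
        last_ts = ts
        processed += 1
        # pick a destination: usually local (same LP), sometimes remote
        dest = rng.randrange(num_lps) if rng.random() < remote_pct else lp
        heapq.heappush(pq, (ts + rng.expovariate(1.0), seq, dest))
        seq += 1
    return processed
```

In the parallel setting the "remote" fraction is what forces cross-rank messages, which is why the 10% vs. 100% remote rates above differ so sharply.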
CODES Project: Co-Design of Exascale Storage
- Top I/O job: plasma physics: 67 TB per job; 10+ hrs to execute; over a 2 hr idle period.
- Overall, bursty I/O!
- How to design an exascale storage system? (e.g., today that's a ~1,000,000-drive system of 3 TB hard disks)
Modeling Complexity @ Every Level: e.g., File Open
- Application-level request → CIOD level → file-system level → storage level. Each box represents an event in the model!
- PVFS clients talk to PVFS servers iteratively to find the entry, then randomly select a file server and create the object.
- If we just model the file-system complexity, what happens?
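The "each box is an event" idea above can be sketched as a tiny event-driven chain: a file open is one event per layer, and completing a layer's event schedules the next layer's. The layer names follow the slide; the per-layer latencies are made-up placeholders, not measured values.

```python
import heapq

# One event per layer in the file-open path; latencies (in microseconds)
# are illustrative placeholders, not measurements from the model.
LAYERS = [
    ("application-level request", 5.0),
    ("CIOD level",                20.0),
    ("file system level",         50.0),
    ("storage level",             100.0),
]

def simulate_file_open():
    """Process the chain of layer events; return total modeled latency."""
    pq = [(0.0, 0)]                 # (timestamp, layer index)
    end_time = 0.0
    while pq:
        ts, layer = heapq.heappop(pq)
        _name, delay = LAYERS[layer]
        end_time = ts + delay
        if layer + 1 < len(LAYERS): # hand off to the next layer down
            heapq.heappush(pq, (end_time, layer + 1))
    return end_time
```

A real model would fan out here (e.g., the file-system event spawning several PVFS client/server exchanges), which is exactly where the complexity question on the slide comes from.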
Model vs. Data: Shared Unaligned Read Test
- Underscores the need for co-design with real experimental performance data!!
- Used the POSIX interface; 4 MB (4 * 10^6 byte) accesses for a total of 64 MB per process.
- Requests span multiple file stripe units, requiring that requests be serviced by two storage nodes rather than one.
- The simulated read performance curve is similar to the simulated write performance curve because the striping algorithm is the same.
- Max error is 30-40%. Possible reasons: lack of queuing at the file server and in the Myricom network layer.
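Why an unaligned 4 MB access touches two storage nodes can be checked with a small striping calculation. This is a generic round-robin striping sketch; the 4 MiB stripe unit and four-server count are assumptions for illustration, not values from the slide.

```python
def servers_touched(offset, length, stripe_unit, num_servers):
    """Storage servers hit by a [offset, offset+length) request under
    simple round-robin striping (stripe i lives on server i % num_servers)."""
    first_stripe = offset // stripe_unit
    last_stripe = (offset + length - 1) // stripe_unit
    return sorted({s % num_servers
                   for s in range(first_stripe, last_stripe + 1)})

# A 4 MB (4 * 10^6 byte) access starting on a stripe boundary stays on
# one server; the same access at process rank 1's unaligned offset
# crosses a 4 MiB stripe boundary and needs two servers.
aligned   = servers_touched(0,         4_000_000, 4 * 1024 * 1024, 4)
unaligned = servers_touched(4_000_000, 4_000_000, 4 * 1024 * 1024, 4)
```

The mismatch between the decimal access size (4 * 10^6) and a binary stripe unit (2^22) is exactly the kind of detail that makes the shared unaligned workload span two nodes.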
The Dragonfly Network Topology
- A two-level, directly connected topology.
- Uses high-radix routers: a large number of ports per router, each port with moderate bandwidth.
- p: number of compute nodes connected to a router
- a: number of routers in a group
- h: number of global channels per router
- Router radix: k = a + p + h - 1
- a = 2p = 2h (recommended configuration)
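The parameters above fix the size of a fully provisioned dragonfly: each router spends p ports on terminals, a - 1 on local links, and h on global links, and a maximal network has a*h + 1 groups. A small sketch (the h = 4 example in the test is an assumed value, not from the slide):

```python
def dragonfly_size(p, a, h):
    """Router radix, group count, and node count for a maximal dragonfly.

    Ports per router: p terminal + (a - 1) local + h global,
    i.e. the slide's k = a + p + h - 1.
    """
    radix = p + (a - 1) + h
    groups = a * h + 1          # fully provisioned global channels
    nodes = p * a * groups
    return radix, groups, nodes

def balanced(h):
    """Recommended configuration from the slide: a = 2p = 2h."""
    return dragonfly_size(p=h, a=2 * h, h=h)
```

For example, balanced(4) gives a radix-15 router, 33 groups, and 1,056 nodes; scaling h is what lets modest radix increases reach very large networks.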
ROSS Dragonfly Performance Results on BG/P vs. BG/Q (for a 50 million node model)
- Event efficiency stays high on both BG/P and BG/Q because each MPI task has a substantial workload.
- The computation performed at each MPI task dominates the number of rolled-back events.
Billion-Node Torus Network Model Using ROSS
- ROSS is a massively parallel discrete-event simulator that has scaled to 131,072 cores, yielding very good strong scaling / extreme execution-time compression.
- For accurate storage simulations, the network is clearly important! So, can we model an exascale-like network at the packet level?
- A 32^6 (~1 billion) node torus topology consumes > 2 TB of memory.
- A small torus model was validated against the Blue Gene/L torus network.

  processors                      4,096    8,192   16,384   32,768   65,536  131,072
  efficiency                     99.83%   99.90%   99.83%   99.55%   98.89%   97.51%
  event rate (M/sec)                639    1,192    2,260    4,002    7,307   12,359
  remote event percentage        11.71%   12.39%   13.77%   16.53%   16.88%   17.22%
  secondary rollback rate
  (x 10^-5)                        1.06    0.254   0.0429     0.51     3.87     21.7
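The scale of the 32^6 torus, and the per-node state budget implied by the > 2 TB figure, can be sanity-checked with a short sketch. The ~2 KB/node number is derived arithmetic from the slide's totals, not a stated model size.

```python
def torus_nodes(dims):
    """Number of nodes in an n-dimensional torus with the given radices."""
    n = 1
    for k in dims:
        n *= k
    return n

def torus_neighbors(coord, dims):
    """Wraparound (+/-1 per dimension) neighbors of a torus node."""
    nbrs = []
    for d, k in enumerate(dims):
        for step in (-1, 1):
            nbr = list(coord)
            nbr[d] = (nbr[d] + step) % k
            nbrs.append(tuple(nbr))
    return nbrs

nodes = torus_nodes([32] * 6)        # 32^6 = 1,073,741,824 (~1 billion)
bytes_per_node = 2 * 2**40 / nodes   # > 2 TB total -> ~2 KB of state/node
```

Each node has 12 torus links (two per dimension), so even a minimal per-link packet queue multiplies quickly into the multi-terabyte footprint above.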
Summary & Forward Challenges
1. TAKE AWAY: Big Data/HPC storage systems can be effectively modeled using massively parallel simulation tools and techniques. ROSS is open source and available at our wiki site: odin.cs.rpi.edu
2. Need models and co-design around hybrid HPC/cloud storage systems.
3. Need power, failure, and recovery models.
4. Need to exploit simulation's out-of-band capabilities.
5. Extend parallel simulation engines for parallel I/O and data collection.
6. Massively parallel dynamic load balancing of models, especially under irregular Big Data/HPC workloads.
7. Make simulation engines and models easier to use and configure.
8. Validation/verification techniques at scale when you don't have real experimental data.