MWA Archive: A multi-tiered dataflow and storage system based on NGAS
Chen Wu
Team: Dave Pallot, Andreas Wicenec, Chen Wu
Agenda
- Dataflow overview
- High-level requirements
- Next Generation Archive System (NGAS)
- Additional development to tailor NGAS for MWA
- Meet requirements
  - Data capturing
  - Data ingestion
  - Data storage
  - Data access
  - Data mirroring
- Work-in-progress
  - Data re-processing
- Conclusion
32T Dataflow (diagram)
Sites: MRO; ICRAR, Perth; PBStore, Perth; Fornax, Perth; MIT, USA
Components: online processing, online archive, staging archive, long-term archive, mirrored archive, staging & processing
Functions: data distribution, placement, scheduling
Link and data rates shown: 320 Gb/s, ~400 MB/s, 1 Gbps, 1/10 Gbps, 10 Gbps
Access: Web, VO, scripting UI, catalog, VO-Table
Other labels: AARNet, Research
128T Dataflow (diagram)
Sites: MRO; ICRAR, Perth; PBStore, Perth; Fornax, Perth; MIT, USA
Components: online processing, online archive, proxy archive, long-term archive, mirrored archive, staging & processing
Functions: data distribution, placement, scheduling
Link and data rates shown: 320 Gb/s, ~400 MB/s, 1 Gbps, 1/10 Gbps, 10 Gbps
Access: Web, VO, scripting UI, catalog, VO-Table
Other labels: AARNet, Research
Conceptual Dataflow (diagram)
Tier labels: Tier 0, Tier 1, Tier 2
Components: RF data stream, Online Processing, Online Archive, Proxy Archive, Mirrored Archive (x3), Further Processing (x3), Long-Term Archive, Offline Processing
Flows: metadata, control, monitor
High-level Requirements
- High-throughput data ingestion: ~400 MB/s visibility, 500 to 600 MB/s visibility + image cube
- Efficient data distribution to multiple locations: Australia / New Zealand / USA / India
- Secure and cost-effective storage of 8 to 10 TB of data collected daily
- Fast access to the science archive by astronomers from 3 continents
- Intensive re-processing of archived data on GPU clusters: calibration, imaging, etc.
- Continuous growth
  - Data volume (3 PB/year)
  - Data variety (visibility, image cube, catalogue; FITS, CASA, HDF5, JPEG2000)
  - Environment (Cortex → Pawsey)
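A rough consistency check of the figures on this slide can be sketched as below. It assumes decimal units (1 TB = 10^6 MB) and a constant 400 MB/s while observing; the function names are illustrative and the implied observing hours are derived here, not stated in the talk.

```python
# Back-of-envelope check of the stated ingest rate against the daily volume.
# Assumption: decimal units (1 TB = 1e6 MB) and a constant rate while observing.

def tb_per_day(rate_mb_s, hours=24.0):
    """Volume in TB produced at rate_mb_s sustained for `hours` in a day."""
    return rate_mb_s * hours * 3600 / 1e6


def hours_for_volume(volume_tb, rate_mb_s):
    """Hours of data taking at rate_mb_s needed to produce volume_tb."""
    return volume_tb * 1e6 / rate_mb_s / 3600


continuous_tb = tb_per_day(400)          # volume if the stream never paused
hours_low = hours_for_volume(8, 400)     # hours per day implied by 8 TB/day
hours_high = hours_for_volume(10, 400)   # hours per day implied by 10 TB/day
yearly_pb_low = 8 * 365 / 1000           # PB/year at 8 TB/day
yearly_pb_high = 10 * 365 / 1000         # PB/year at 10 TB/day

print(f"continuous: {continuous_tb:.2f} TB/day")
print(f"implied observing time: {hours_low:.1f} to {hours_high:.1f} h/day")
print(f"yearly volume: {yearly_pb_low:.2f} to {yearly_pb_high:.2f} PB")
```

Under these assumptions the 8 to 10 TB/day figure corresponds to roughly six to seven hours of data taking per day, and the low end is consistent with the quoted 3 PB/year.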
Next Generation Archive System (NGAS)
- Openness
  - Open source (LGPL), originated at ESO for data archiving by Knudstrup and Wicenec (2000)
  - Used by the astronomy community: VLT, HST@ST-ECF, ESO@Paranal & La Silla, ALMA, EVLA, MWA
  - ALMA: all 5 ALMA sites are based on NGAS, including all data transfers and mirroring from the observatory to all other sites across the globe; full high-availability setup with automatic fail-over
- Plugin-based: almost every feature is implemented as a plugin plugged into a light kernel
- Loss-free operation: consistency checking, replication, fault tolerance (e.g. power outage, disk failure)
- Object-based data storage
  - Spans multiple file systems distributed across multiple sites (e.g. VLT > 100 million objects)
  - Native HTTP interface (POST, GET, PUT, etc.) and location-transparent (Web admin UI)
- Scalable architecture
  - Horizontal scalability: add another NGAS instance
  - In-storage processing: compute inside the archive
- Low cost, hardware-neutral deployment, versatility
  - Linux machine + disk arrays
  - Supports mobile storage media with energy efficiency
  - Ingestion, staging, long-term archive, temporary storage, proxy, processing, etc.
Additional efforts needed for MWA
- High-throughput data ingestion
  - 66 MB/s (ALMA) vs. 400 MB/s (MWA)
  - Completely integrated systems vs. theoretical performance
- Efficient dataflow control
  - Multiple mirrored archives, each with distinct subscription rules
  - Saturate WAN bandwidth
  - In-transit data processing
- Multi-tiered data storage
  - Work well with HSM, e.g. retrieving a file from tape → HTTP GET timeout
  - Content-based access pattern classification for placement optimisation
- Workflow-optimised data staging to avoid I/O contention
  - Staging data from the long-term archive to Fornax (time and placement)
  - Tasks in a complex workflow exchange data through a shared file system (but can we exploit local storage as well?)
Data Capturing @ MRO (diagram)
Components: Data Producer, M&C Subsystem, M&C Interface, DataCaptureMgr, In-memory Buffer, DataHandlers, Staging Area, NGAS Client, local NGAS (NG/AMS) servers, remote NGAS (NG/AMS) servers, Archive DB, Web / Query / VO Interface
Storage media: HDD, Ramdisk, SSD
Networks: Data Network, Metadata Network, Wide Area Network
Properties listed: a single process, plugin-based, multi-threaded, throughput-oriented, fault-tolerant, admission control, file handling
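The combination of an in-memory buffer, admission control, and multi-threaded data handlers named on this slide can be illustrated with a bounded queue. This is a minimal sketch, not the MWA DataCaptureMgr code: the class `CaptureManager`, its methods, and the file names are hypothetical, and `None` is reserved here as a shutdown marker.

```python
import queue
import threading


class CaptureManager:
    """Hypothetical stand-in for a capture manager (illustration only)."""

    def __init__(self, handler, buffer_slots=4, n_handlers=2):
        self._buffer = queue.Queue(maxsize=buffer_slots)  # bounded in-memory buffer
        self._handler = handler
        self.failed = []  # items whose handler raised, kept for a later retry
        self._threads = [
            threading.Thread(target=self._drain, daemon=True)
            for _ in range(n_handlers)
        ]
        for t in self._threads:
            t.start()

    def offer(self, item):
        """Admission control: accept the item only if a buffer slot is free."""
        try:
            self._buffer.put_nowait(item)
            return True
        except queue.Full:
            return False

    def _drain(self):
        """Handler thread body: take items from the buffer until shutdown."""
        while True:
            item = self._buffer.get()
            if item is None:  # shutdown marker
                return
            try:
                self._handler(item)
            except Exception:
                self.failed.append(item)  # one bad item must not stop the thread

    def close(self):
        """Let the handlers finish the buffered items, then stop them."""
        for _ in self._threads:
            self._buffer.put(None)
        for t in self._threads:
            t.join()


handled = []
mgr = CaptureManager(handled.append, buffer_slots=8, n_handlers=2)
accepted = [mgr.offer(f"vis_{i}.dat") for i in range(5)]
mgr.close()
```

Rejecting new data when the buffer is full, rather than blocking the producer, is one possible admission policy; the talk does not say which policy the real system uses.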
Data Ingestion @ MRO
- Client push (synchronous) or server pull (asynchronous)
- Find a data-type-specific storage medium with available capacity
- Receive the data stream and compute the checksum (CRC) on the fly
- Data is stored temporarily in the Staging Area
- The Data Archive Plug-In is invoked
  - Quality check / compression
  - Move the file to the targeted volume / directory (e.g. MWA-DAPI)
- Register the file in the NGAS DB
- If replication is defined, trigger file delivery threads; file removal scheduling
Figure: 128T ingestion rate @ MRO
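The core of the ingestion steps (stream to the staging area, checksum on the fly, move to the target volume, register) can be sketched as follows. This is an illustration under assumptions, not NGAS code: the function `ingest`, the directory names, and the returned dictionary (standing in for an NGAS DB row) are hypothetical, and CRC-32 from `zlib` is used as one possible CRC.

```python
import io
import os
import shutil
import tempfile
import zlib


def ingest(stream, file_id, staging_dir, volume_dir, chunk_size=1 << 20):
    """Write `stream` to the staging area while computing CRC-32, then move
    the file to its target volume and return a registration record."""
    os.makedirs(staging_dir, exist_ok=True)
    os.makedirs(volume_dir, exist_ok=True)
    staged_path = os.path.join(staging_dir, file_id)

    crc = 0
    size = 0
    with open(staged_path, "wb") as out:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            crc = zlib.crc32(chunk, crc)  # checksum updated per chunk, no second pass
            out.write(chunk)
            size += len(chunk)

    # A real archive plug-in would run quality checks or compression here.
    final_path = os.path.join(volume_dir, file_id)
    shutil.move(staged_path, final_path)

    # Stand-in for the database registration step.
    return {"file_id": file_id, "path": final_path, "size": size,
            "crc32": crc & 0xFFFFFFFF}


with tempfile.TemporaryDirectory() as root:
    payload = b"visibility-data" * 1000
    record = ingest(io.BytesIO(payload), "obs_0001.dat",
                    os.path.join(root, "staging"),
                    os.path.join(root, "volume1"))
    stored_ok = os.path.exists(record["path"])
```

Computing the checksum while the data is being received avoids re-reading the file from disk, which matters at ingest rates of hundreds of MB/s.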
Figure: Archiving simulation throughput
x-axis: simulated data rate per client (MB/s), ticks 16 to 72
y-axis: total archive throughput (MB/s), ticks -200 to 1400
Legend:
A - 1 Gbps bandwidth for commissioning
B - 12 clients / 4 servers on 6 / 2 Fornax nodes
C - aggregated data producing rate for 12 clients
D - 12 clients / 4 servers on 6 iDataPlex / 1 Supermicro
E - 12 clients / 2 servers on 6 iDataPlex / 1 Supermicro
F - 24 clients / 4 servers on 6 / 1 Fornax nodes
G - 24 clients / 4 servers on 24 / 2 Fornax nodes
H - 24 clients / 4 servers on 12 / 2 Fornax nodes
I - 24 clients / 4 servers on 28 Fornax nodes
J - aggregated data producing rate for 24 clients
Figure: Archive Servers CPU and File Handling
Panels: 16 MB/s, 12 clients; 48 MB/s, 12 clients; 56 MB/s, 24 clients; 16 MB/s, 12 clients; 16 MB/s, 24 clients; 40 MB/s, 12 clients
Data Storage @ Cortex (32T)
Diagram labels: NGAS DB, Science DB, API, Web Portal, staging archive (ICRAR, Perth), 1 Gbps → 10 Gbps
Figures: distribution of observation size; 32T disk archiving rate
Data Storage @ Cortex
Diagram labels: NGAS DB, Science DB, API, Web Portal, online archive, 1 Gbps
Figures: distribution of observation size; 128T disk archiving rate
Embedded excerpt on LibSAM (developers.sun.com, solaris/articles/libsam.html):
- The time consumed to restore the data in case of disaster
- The high cost of resources required for storage, network, tape, and administration
- An alternative is to use the LibSAM library for the Sun StorageTek Storage Archive Manager (SAM), a paradigm for data management that goes beyond backup.
- What Is the LibSAM Library? Designed to use with Sun StorageTek SAM and Sun StorageTek QFS software, the LibSAM library, or API, allows you to manage data in a samfs file system from within an application.
- Archiving, the process of backing up a file by copying it from a file system to archive media, is typically the first component. The archive media can be a removable media cartridge or a disk partition of another file system. Figure shows the basic components of the archiving process.
Data Access @ Cortex
- Client pull (synchronous) or server push (asynchronous)
- Pull: given a file identifier and version, get the file regardless of its location
  - If multiple replicas are found, follow: host → cluster → domain → tape → mirrored archive
- Push: given a list of file identifiers and a deliver_to_url
  - Sort files, then push online files while staging offline ones from tape
  - Supports suspend / resume / cancel and persistency
  - Server-directed I/O → better optimisation (queue management, multiple-request aggregation, etc.)
- Data Processing Plug-Ins can be invoked prior to file retrieval
  - E.g. decompressing, sub-setting, explicit staging, etc.
  - Get FITS header info → implicit staging → blocking
Figure: 32T access stats release
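The replica preference order in pull mode can be sketched as a simple ranking. The `Replica` record, the location names, and the URLs below are hypothetical and only mirror the order given on the slide; how NGAS actually resolves locations is not described in the talk.

```python
from dataclasses import dataclass

# Preference order from the slide: nearest location first.
LOCATION_ORDER = ["host", "cluster", "domain", "tape", "mirrored_archive"]


@dataclass(frozen=True)
class Replica:
    file_id: str
    version: int
    location: str  # one of LOCATION_ORDER
    url: str


def pick_replica(replicas, file_id, version):
    """Return the most preferred replica of (file_id, version)."""
    candidates = [r for r in replicas
                  if r.file_id == file_id and r.version == version]
    if not candidates:
        raise LookupError(f"no replica of {file_id} version {version}")
    return min(candidates, key=lambda r: LOCATION_ORDER.index(r.location))


replicas = [
    Replica("obs_0001", 1, "tape", "http://tape.example/obs_0001"),
    Replica("obs_0001", 1, "cluster", "http://node7.example/obs_0001"),
    Replica("obs_0001", 1, "mirrored_archive", "http://mirror.example/obs_0001"),
    Replica("obs_0001", 2, "host", "http://localhost/obs_0001"),
]
best = pick_replica(replicas, "obs_0001", 1)
```

Here version 1 resolves to the cluster copy, because the only local-host replica belongs to version 2.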
Access pattern @ Cortex 32T
Figure panels: C118 Fornax A; C106 SUN Tracks; C104 EOR Fields; W44 Galactic plane field
Plotted quantities: # of accesses (overall), first access, inter-accesses
Access Patterns @ Cortex 32T
Data Mirroring
- Publish / subscribe mode
  - Criteria → Subscription Filter Plugin
  - Events: file ingestion, subscribe command, explicit triggering, etc.
  - Subscribe into the past
  - Each subscriber is given a queue
  - Each subscriber / queue is assigned a priority
- Transfer
  - HTTP / FTP / GridFTP (concurrent transfer, file pipeline, parallel connection)
  - The target can be anything that supports HTTP / FTP / GridFTP, e.g. python -m SimpleHTTPServer
  - Failed deliveries are recorded in the backlog
  - They are re-sent through explicit triggering once connections return to normal
- Support proxy mode
  - Bypass slow hops
  - In-transit data processing
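The publish / subscribe behaviour listed on this slide (filter per subscriber, one queue per subscriber, priorities, subscribing into the past, and a backlog for failed deliveries) can be sketched in a few lines. The classes `Subscriber` and `Mirror` and the convention that a lower number means higher priority are assumptions for illustration; actual transfers would use HTTP, FTP, or GridFTP rather than a Python callback.

```python
class Subscriber:
    def __init__(self, name, priority, accepts, deliver):
        self.name = name
        self.priority = priority  # assumption: lower number is served first
        self.accepts = accepts    # filter: file_id -> bool
        self.deliver = deliver    # transfer callback, raises OSError on failure
        self.queue = []           # one queue per subscriber
        self.backlog = []         # failed deliveries awaiting a re-trigger


class Mirror:
    def __init__(self):
        self.subscribers = []
        self.archived = []

    def subscribe(self, sub, from_past=False):
        self.subscribers.append(sub)
        if from_past:  # "subscribe into the past"
            sub.queue.extend(f for f in self.archived if sub.accepts(f))

    def on_ingest(self, file_id):
        """File-ingestion event: enqueue for every subscriber whose filter matches."""
        self.archived.append(file_id)
        for sub in self.subscribers:
            if sub.accepts(file_id):
                sub.queue.append(file_id)

    def deliver_all(self):
        for sub in sorted(self.subscribers, key=lambda s: s.priority):
            pending, sub.queue = sub.queue, []
            for file_id in pending:
                try:
                    sub.deliver(file_id)
                except OSError:
                    sub.backlog.append(file_id)

    def retry_backlog(self, sub):
        """Explicit trigger: move backlog entries back onto the queue."""
        sub.queue.extend(sub.backlog)
        sub.backlog = []


delivered = []
link_up = {"mit": False}


def to_mit(file_id):
    if not link_up["mit"]:
        raise OSError("link down")
    delivered.append(("mit", file_id))


mirror = Mirror()
mit = Subscriber("mit", 2, lambda f: f.endswith(".fits"), to_mit)
icrar = Subscriber("icrar", 1, lambda f: True,
                   lambda f: delivered.append(("icrar", f)))
mirror.subscribe(mit)
mirror.subscribe(icrar)
mirror.on_ingest("obs1.fits")
mirror.on_ingest("obs1.log")
mirror.deliver_all()            # the MIT delivery fails and goes to the backlog
backlog_after_failure = list(mit.backlog)
link_up["mit"] = True
mirror.retry_backlog(mit)
mirror.deliver_all()
```

Keeping a separate queue and backlog per subscriber means that a slow or unreachable site does not hold up deliveries to the others.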
Mirrored Archive @ MIT
- 12 streams, overall rate: 10.67 +/- 0.27 MB/s
- Cortex TCP window size: 16.0 MB, RTT ~0.281 s
- Max send buffer 128 KB: 11.67 MB/s (93.35 Mb/s)
- Max send buffer 64 MB: 17.15 MB/s (137.19 Mb/s), 47% higher
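The figures on this slide can be cross-checked with a little arithmetic. The sketch below verifies the quoted 47% gain and estimates the bandwidth-delay product; the 1 Gbps link capacity is an assumption taken from the 128T dataflow slide, and the window-based ceiling ignores packet loss and protocol overhead.

```python
RTT_S = 0.281            # round-trip time from the slide
RATE_128KB_MB_S = 11.67  # measured with a 128 KB max send buffer
RATE_64MB_MB_S = 17.15   # measured with a 64 MB max send buffer
TCP_WINDOW_MB = 16.0     # TCP window size from the slide
LINK_MBIT_S = 1000.0     # assumption: 1 Gbps path to the mirrored archive

# Relative improvement from enlarging the send buffer.
gain_percent = (RATE_64MB_MB_S - RATE_128KB_MB_S) / RATE_128KB_MB_S * 100

# Bandwidth-delay product: data that must be in flight to keep the link full.
link_mb_s = LINK_MBIT_S / 8
bdp_mb = link_mb_s * RTT_S

# Upper bound on a single stream limited only by the TCP window.
window_ceiling_mb_s = TCP_WINDOW_MB / RTT_S

print(f"gain: {gain_percent:.1f} %")
print(f"bandwidth-delay product at 1 Gbps: {bdp_mb:.1f} MB")
print(f"single-stream ceiling with a 16 MB window: {window_ceiling_mb_s:.1f} MB/s")
```

With a round trip of about 0.28 s, roughly 35 MB must be in flight to fill a 1 Gbps path, which is why buffer sizes in the tens of MB matter on this route.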
Data Re-processing (WIP)
- Requirements
  - Stage data from Cortex to Fornax
  - Process: calibration and imaging pipelines with data placed on Fornax
  - Archive calibrated data / image cubes back to Cortex (optional, not considered for 128T)
- Data staging strategies (when, how much, how frequently)
  - Decoupled from processing → no overlapping: staging is submitted by users as a separate job prior to submitting processing jobs
  - Loosely coupled with processing → some overlapping: staging is a task inside the same job and can be asynchronous
  - Tightly coupled with processing → intertwined: processing and data staging in coordination
- Data placement strategies (where, when to clean up / update)
  - Loosely coupled with processing → global file system (i.e. Lustre): user-friendly abstraction, POSIX API, easiest for developers
  - Tightly coupled with processing → application-optimised file system: expose file block location in file metadata, e.g. Google File System, HDFS
  - Dynamically coupled with processing → temporary data storage and pipeline co-design: cache storage, multi-level scheduling
- Stage, Process, Place
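The loosely coupled staging strategy, where staging runs asynchronously inside the same job and overlaps with processing, can be illustrated with a producer thread and a bounded queue. This is a sketch under assumptions: `run_pipeline`, `stage`, `process`, and `max_staged` are hypothetical names, and real staging would copy files from the archive to the compute cluster instead of transforming strings.

```python
import queue
import threading


def run_pipeline(file_ids, stage, process, max_staged=2):
    """Stage files in a background thread while the caller processes the
    ones already staged. At most `max_staged` files wait between the two."""
    staged = queue.Queue(maxsize=max_staged)
    done = object()  # marks the end of the staged stream

    def stager():
        try:
            for file_id in file_ids:
                staged.put(stage(file_id))  # blocks while the queue is full
        finally:
            staged.put(done)  # always unblock the consumer, even on error

    worker = threading.Thread(target=stager, daemon=True)
    worker.start()

    results = []
    while True:
        item = staged.get()
        if item is done:
            break
        results.append(process(item))
    worker.join()
    return results


results = run_pipeline(
    ["obs_a", "obs_b", "obs_c"],
    stage=lambda f: f + ".staged",  # stand-in for copying a file to the cluster
    process=str.upper,              # stand-in for calibration or imaging
)
```

The bound on the queue limits how much local storage is occupied by staged but unprocessed data; in the decoupled strategy the whole list would be staged before any processing starts.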
Data Re-processing on Fornax
- Uniqueness of Fornax for MWA
  - GPU cluster → MWA Real-Time System (RTS)
  - Local storage (7 TB/node x 96 nodes) → RTS compute storage → in-storage processing
  - Dual InfiniBand connections → RTS data movement
- Therefore, MWA has applied for 25% of the time on Fornax
- Fornax local storage is treated as a cache for MWA
Diagram (storage hierarchy): SRAM, DRAM, Fornax local storage, Cortex disk cache, Cortex tape libraries
Data units shown: frequency splits, visibility files
Fornax block diagram (Fornax Architecture: Chris Harris)
Processing-optimised NGAS Cache Storage (diagram)
Diagram labels: long-term archive, mirrored archive, Cortex, async staging, EOR
Conclusion
- Requirements: MWA data ingestion, transfer, storage, access, staging, re-processing
- NGAS: open (LGPL) software deployed in astronomy archive facilities around the globe
- NGAS tailored for MWA fulfills the above requirements and meets the MWA data challenge
  - "just access whole observations whenever I like, without going to file cabinets, getting tapes, mounting, and then trying to figure out what to do" (an MWA Super Science Fellow)
  - The NGAS-based MWA data archive solution is working well during commissioning
- Fornax
  - The major re-processing workhorse and data staging hotspot for MWA
  - One of few data-intensive clusters suited to MWA data re-processing
- What's next → science archive
  - Data modeling: ObsCore
  - Data access: VO
  - Tools: ObsTAP, Saada, OpenCADC