MWA Archive: A multi-tiered dataflow and storage system based on NGAS




MWA Archive: A multi-tiered dataflow and storage system based on NGAS
Chen Wu. Team: Dave Pallot, Andreas Wicenec, Chen Wu

Agenda
- Dataflow overview
- High-level requirements
- Next Generation Archive System (NGAS)
- Additional development to tailor NGAS for MWA
- Meeting the requirements: data capturing, data ingestion, data storage, data access, data mirroring
- Work-in-progress: data re-processing
- Conclusion

Dataflow (32T)
[Diagram: online processing at the MRO (320 Gb/s) feeds the 32T online archive (~400 MB/s); data moves over 1/10 Gbps and 1 Gbps links to the staging archive at ICRAR, Perth, and over 10 Gbps links to the long-term archive (PBStore, Perth) and staging & processing (Fornax, Perth), with a mirrored archive at MIT, USA reached via AARNet; a data distribution, placement and scheduling layer serves research access through Web, scripting UI and VO (VO-Table) interfaces against the catalogue.]

Dataflow (128T)
[Diagram: as for 32T, but the 128T online archive at the MRO (320 Gb/s in, ~400 MB/s out over 1/10 Gbps and 1 Gbps links) now feeds a Proxy Archive; from there, 10 Gbps links reach the long-term archive (PBStore, Perth) and staging & processing (Fornax, Perth), with the mirrored archive at MIT, USA reached via AARNet; research access is unchanged (Web, scripting UI and VO/VO-Table against the catalogue).]

Conceptual Dataflow
[Diagram: the RF data stream from Online Processing (with control and monitoring, producing metadata) feeds the Tier 0 Online Archive; a Proxy Archive and several further-processing stages connect it to the Long-Term Archive and Offline Processing, with multiple Mirrored Archives at the outer tiers (Tiers 1 and 2).]

High-level Requirements
- High-throughput data ingestion: ~400 MB/s for visibilities, 500-600 MB/s for visibilities plus image cubes
- Efficient data distribution to multiple locations: Australia / New Zealand / USA / India
- Secure and cost-effective storage of the 8-10 TB of data collected daily
- Fast access to the science archive by astronomers from 3 continents
- Intensive re-processing of archived data on GPU clusters: calibration, imaging, etc.
- Continuous growth in data volume (3 PB/year), data variety (visibility, image cube, catalogue; FITS, CASA, HDF5, JPEG2000) and environment (Cortex → Pawsey) (see the rough check below)
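As a quick sanity check on these numbers (my own arithmetic, not from the slides), the quoted daily volume is consistent with the quoted annual growth:

```python
# Back-of-the-envelope check that 8-10 TB/day matches ~3 PB/year (decimal units).
for tb_per_day in (8, 10):
    pb_per_year = tb_per_day * 365 / 1000.0
    print(f"{tb_per_day} TB/day -> {pb_per_year:.2f} PB/year")
# 8 TB/day -> 2.92 PB/year, 10 TB/day -> 3.65 PB/year
```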

Next Generation Archive System (NGAS)
- Openness: open source (LGPL), originated at ESO for data archiving (Knudstrup and Wicenec, 2000)
- Used by the astronomy community: VLT, HST@ST-ECF, ESO@Paranal & La Silla, ALMA, EVLA, MWA
  - ALMA: all 5 ALMA sites are based on NGAS, including all data transfers and mirroring from the observatory to all other sites across the globe, in a full high-availability setup with automatic fail-over
- Plugin-based: almost every feature is implemented as a plugin plugged into a light kernel
- Loss-free operation: consistency checking, replication, fault tolerance (e.g. power outage, disk failure)
- Object-based data storage
  - Spans multiple file systems distributed across multiple sites (e.g. VLT: > 100 million objects)
  - Native HTTP interface (POST, GET, PUT, etc.) and location transparency (web admin UI); see the sketch below
- Scalable architecture: horizontal scalability by adding another NGAS instance
- In-storage processing: compute inside the archive
- Low cost, hardware-neutral deployment, versatility
  - Linux machine + disk arrays; supports mobile storage media with energy efficiency
  - Roles: ingestion, staging, long-term archive, temporary storage, proxy, processing, etc.
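To make the "native HTTP interface" concrete, here is a minimal client-side sketch. It assumes the standard NGAS ARCHIVE and RETRIEVE commands over HTTP on port 7777; the host name and exact parameter names are illustrative and may differ between deployments, so treat this as a sketch rather than the MWA client code.

```python
# Minimal sketch of talking to NGAS over its native HTTP interface.
# Host, port and parameter names are assumptions for illustration.
import requests

NGAS = "http://ngas-host:7777"   # hypothetical server address

def archive(path, mime="application/octet-stream"):
    """Push a local file into the archive via HTTP POST (ARCHIVE command)."""
    with open(path, "rb") as f:
        r = requests.post(f"{NGAS}/ARCHIVE",
                          params={"filename": path, "mime_type": mime},
                          data=f)
    r.raise_for_status()
    return r.text                 # NGAS replies with an XML status document

def retrieve(file_id, version=None, out_path=None):
    """Fetch a file by identifier (and optional version) via HTTP GET (RETRIEVE command)."""
    params = {"file_id": file_id}
    if version is not None:
        params["file_version"] = version
    r = requests.get(f"{NGAS}/RETRIEVE", params=params, stream=True)
    r.raise_for_status()
    out_path = out_path or file_id
    with open(out_path, "wb") as out:
        for chunk in r.iter_content(1 << 20):
            out.write(chunk)
    return out_path
```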

Additional efforts needed for MWA
- High-throughput data ingestion
  - 66 MB/s (ALMA) vs. 400 MB/s (MWA)
  - Completely integrated systems vs. theoretical performance
- Efficient dataflow control
  - Multiple mirrored archives, each with distinct subscription rules
  - Saturate WAN bandwidth
  - In-transit data processing
- Multi-tiered data storage
  - Work well with HSM, e.g. retrieving a file from tape → HTTP GET timeout (see the staging sketch below)
  - Content-based access pattern classification for placement optimisation
- Workflow-optimised data staging to avoid I/O contention
  - Staging data from the long-term archive to Fornax (time and placement)
  - Tasks in a complex workflow exchange data through a shared file system (but can we exploit local storage as well?)
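One common way around the "HTTP GET timeout when the file is still on tape" problem is to make staging explicit and asynchronous: request staging, poll, then retrieve. The sketch below illustrates that client-side pattern only; the /STAGE and /STATUS endpoints and the status parsing are hypothetical placeholders, not the actual NGAS or HSM API.

```python
# Hedged sketch: stage a tape-resident file asynchronously, poll until it is
# online, and only then issue the short-lived RETRIEVE request.
import time
import requests

ARCHIVE = "http://long-term-archive:7777"   # hypothetical host and port

def retrieve_with_staging(file_id, poll_secs=30, timeout_secs=3600):
    # 1. ask the archive to stage the file from tape to its disk cache (async)
    requests.get(f"{ARCHIVE}/STAGE", params={"file_id": file_id}).raise_for_status()
    # 2. poll until the file is reported online, or give up
    deadline = time.time() + timeout_secs
    while time.time() < deadline:
        status = requests.get(f"{ARCHIVE}/STATUS", params={"file_id": file_id})
        status.raise_for_status()
        if "ONLINE" in status.text:          # placeholder for real status parsing
            break
        time.sleep(poll_secs)
    else:
        raise TimeoutError(f"{file_id} was not staged within {timeout_secs} s")
    # 3. the file is on disk now, so a normal RETRIEVE will not time out
    return requests.get(f"{ARCHIVE}/RETRIEVE", params={"file_id": file_id}, stream=True)
```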

Data Capturing @ MRO
[Diagram: data producers on the MRO data network and the M&C subsystem (via an M&C interface) feed the DataCaptureMgr, whose DataHandler threads move data through an in-memory buffer and an NGAS client into a staging area (HDD / ramdisk / SSD) and on to the local NG/AMS (NGAS) servers and archive DB; remote NGAS servers across the WAN and the metadata network expose the web / query / VO interface.]
DataCaptureMgr: a single process, plugin-based, multi-threaded, throughput-oriented and fault-tolerant, providing admission control, file handling and an in-memory buffer (see the sketch below).
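The sketch below illustrates the capture pattern described above: one multi-threaded process with a bounded in-memory buffer (which doubles as a crude admission control) between the producers and the handlers writing into the staging area. Names, paths and sizes are assumptions for illustration, not the actual DataCaptureMgr code.

```python
# Minimal sketch of the capture pattern: bounded in-memory buffer between
# data producers and DataHandler threads that write to the staging area.
import os
import queue
import threading

STAGING_AREA = "/mnt/staging"          # hypothetical path (HDD / ramdisk / SSD)
buf = queue.Queue(maxsize=64)          # bounded buffer = simple admission control

def producer(name, blocks):
    """Stands in for a data producer on the MRO data network."""
    for i, block in enumerate(blocks):
        buf.put((f"{name}_{i:06d}.dat", block))   # blocks when the buffer is full

def data_handler():
    """DataHandler thread: drain the buffer and write files to the staging area."""
    while True:
        item = buf.get()
        if item is None:               # poison pill -> shut down
            break
        fname, block = item
        with open(os.path.join(STAGING_AREA, fname), "wb") as f:
            f.write(block)             # a real handler would hand off to the NGAS client
        buf.task_done()

handlers = [threading.Thread(target=data_handler, daemon=True) for _ in range(4)]
for t in handlers:
    t.start()
```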

Data Ingestion @ MRO
- Client push (synchronous) or server pull (asynchronous)
- Find a data-type-specific storage medium with available capacity
- Receive the data stream and compute the checksum (CRC) on the fly (see the sketch below)
- Data is stored temporarily in the staging area
- The Data Archive Plug-In (e.g. MWA-DAPI) is invoked
  - Quality check / compression
  - Move the file to the targeted volume / directory
- Register the file in the NGAS DB
- If replication is defined, trigger file delivery threads and file removal scheduling
[Chart: 128T ingestion rate @ MRO]
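The "checksum on the fly" step can be sketched as follows: the CRC is accumulated chunk by chunk while the incoming stream is written to the staging area, so no second pass over the data is needed. Paths, chunk size and the function name are illustrative, not the NGAS implementation.

```python
# Hedged sketch: write an incoming byte stream to the staging area while
# accumulating its CRC-32, returning the staged path and the checksum.
import os
import zlib

def ingest_stream(stream, staging_dir, file_name, chunk_size=1 << 20):
    crc = 0
    staging_path = os.path.join(staging_dir, file_name)
    with open(staging_path, "wb") as out:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            crc = zlib.crc32(chunk, crc)   # accumulate CRC as data arrives
            out.write(chunk)
    return staging_path, crc & 0xFFFFFFFF

# After this step the archive would invoke the Data Archive Plug-In (quality
# check, compression, move to the target volume) and register the file and
# its checksum in the NGAS DB.
```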

[Chart: archiving simulation throughput. Total archive throughput (MB/s, up to ~1200) vs. simulated data rate per client (16-72 MB/s), for the configurations:
A - 1 Gbps bandwidth for commissioning
B - 12 clients / 4 servers on 6 / 2 Fornax nodes
C - aggregated data producing rate for 12 clients
D - 12 clients / 4 servers on 6 iDataPlex / 1 Supermicro
E - 12 clients / 2 servers on 6 iDataPlex / 1 Supermicro
F - 24 clients / 4 servers on 6 / 1 Fornax nodes
G - 24 clients / 4 servers on 24 / 2 Fornax nodes
H - 24 clients / 4 servers on 12 / 2 Fornax nodes
I - 24 clients / 4 servers on 28 Fornax nodes
J - aggregated data producing rate for 24 clients]

Archive Servers: CPU and File Handling
[Charts: archive server CPU and file-handling behaviour under different loads: 16 MB/s with 12 clients, 48 MB/s with 12 clients, 56 MB/s with 24 clients, 16 MB/s with 12 clients, 16 MB/s with 24 clients, 40 MB/s with 12 clients.]

Data Storage @ Cortex (32T)
[Diagram: the staging archive at ICRAR, Perth, with NGAS DB, Science DB, API and web portal, connected at 1 Gbps → 10 Gbps; charts show the distribution of observation sizes and the 32T disk archiving rate.]

Data Storage @ Cortex (128T)
[Slide: web portal, API, NGAS DB and Science DB (1 Gbps) in front of the online archive; charts of the distribution of observation sizes and the 128T disk archiving rate. The slide also embeds an excerpt of the Sun developer article on the LibSAM library (developers.sun.com/solaris/articles/libsam.html): designed for use with Sun StorageTek SAM and QFS, LibSAM lets an application manage data in a samfs file system directly, covering the four components of the SAM archive management system, namely archiving, releasing, staging and recycling, as an alternative to traditional backup.]

Data Access @ Cortex
- Client pull (synchronous) or server push (asynchronous)
- Pull: given a file identifier and version, get the file regardless of its location
  - If multiple replicas are found, follow the order: host → cluster → domain → tape → mirrored archive
- Push: given a list of file identifiers and a deliver_to_url, sort the files, then push online files while staging offline ones from tape (see the sketch below)
  - Supports suspend / resume / cancel and persistency
  - Server-directed I/O → better optimisation (queue management, aggregation of multiple requests, etc.)
- Data Processing Plug-Ins can be invoked prior to file retrieval
  - E.g. decompressing, sub-setting, explicit staging
  - Get FITS header info → implicit staging → blocking
[Chart: 32T access stats, release]
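The server-push idea, i.e. sort the requested files and push the online ones while the offline ones are being staged from tape, can be sketched as below. All helper functions are hypothetical placeholders; the real NGAS code adds queue management, suspend/resume and persistency on top of this.

```python
# Hedged sketch of server-directed push with overlap between tape staging
# and delivery of files that are already on disk.
import threading

def is_online(f):       ...   # placeholder: query the HSM/DB for file residency
def stage_from_tape(f): ...   # placeholder: ask the HSM to bring the file online
def push(f, url):       ...   # placeholder: HTTP POST the file to deliver_to_url

def server_push(file_ids, deliver_to_url):
    online  = [f for f in file_ids if is_online(f)]
    offline = [f for f in file_ids if not is_online(f)]

    # kick off tape staging asynchronously so it overlaps with the pushes
    stager = threading.Thread(target=lambda: [stage_from_tape(f) for f in offline])
    stager.start()

    for f in online:              # deliver what is already on disk first
        push(f, deliver_to_url)

    stager.join()                 # then deliver the newly staged files
    for f in offline:
        push(f, deliver_to_url)
```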

Access Patterns @ Cortex (32T)
[Charts: number of accesses (overall, first access, inter-access times) for selected fields: C118 (Fornax A), C106 (SUN tracks), C104 (EoR fields), W44 (Galactic plane field).]

Access Patterns @ Cortex (32T), continued
[Additional access-pattern charts.]

Data Mirroring
- Publish / subscribe model
  - Criteria → subscription filter plugin (see the sketch below)
  - Events: file ingestion, subscribe command, explicit triggering, etc.; can subscribe into the past
  - Each subscriber is given a queue; each subscriber / queue is assigned a priority
- Transfer: HTTP / FTP / GridFTP (concurrent transfers, file pipelining, parallel connections)
  - The target can be anything that supports HTTP / FTP / GridFTP, e.g. python -m SimpleHTTPServer
- Failed deliveries are recorded in the back log and re-sent through explicit triggering when connections return to normal
- Supports proxy mode: bypass slow hops, in-transit data processing
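Conceptually, a subscription filter plugin is just a per-file predicate evaluated against the subscriber's criteria before a file is queued for delivery. The sketch below shows that idea only; the function name, arguments and the project-code convention are illustrative assumptions and do not match the real NGAS plugin signature.

```python
# Hedged sketch of a subscription filter: decide, per file, whether it should
# be delivered to a given subscriber.
def mwa_subscription_filter(file_id, file_version, mime_type, plugin_pars):
    """Return True if this file matches the subscriber's criteria."""
    wanted_projects = set(plugin_pars.get("project_ids", "").split(","))
    # hypothetical convention: the file_id encodes the project code as its first token
    project = file_id.split("_")[0]
    return (mime_type == "application/fits") and (project in wanted_projects)

# A subscriber such as the MIT mirror could then be configured with, say,
# plugin_pars = {"project_ids": "G0009,G0010"} so that only files from those
# projects are queued for delivery over HTTP / FTP / GridFTP.
```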

Mirrored Archive @ MIT
- 12 streams, overall rate: 10.67 +/- 0.27 MB/s
- Cortex TCP window size: 16.0 MB, RTT ~0.281 s
- Max send buffer 128 KB: 11.67 MB/s (93.35 Mb/s)
- Max send buffer 64 MB: 17.15 MB/s (137.19 Mb/s), 47% higher (see the calculation below)
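These buffer sizes follow the usual long-fat-network rule of thumb: a TCP stream can move at most one window of data per round trip, so per-stream throughput is roughly bounded by window / RTT. A small worked check (my own arithmetic, using the RTT quoted above):

```python
# Rough bandwidth-delay-product check for the Cortex -> MIT path (illustrative).
rtt = 0.281                                   # seconds, from the measurement above

def per_stream_ceiling_mb_s(window_bytes):
    """Upper bound on one stream's throughput: one window per round trip."""
    return window_bytes / rtt / 1e6

print(per_stream_ceiling_mb_s(128 * 1024))    # ~0.47 MB/s with a 128 KB buffer
print(per_stream_ceiling_mb_s(64 * 2**20))    # ~239 MB/s with a 64 MB buffer
# With a 128 KB send buffer each stream is window-limited to well under 1 MB/s,
# so even 12 parallel streams cannot fill the link; with 64 MB the window is no
# longer the bottleneck, consistent with the ~47% higher measured rate.
```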

Data Re-processing (WIP)
Requirements:
- Stage data from Cortex to Fornax
- Process calibration and imaging pipelines with the data placed on Fornax
- Archive calibrated data / image cubes back to Cortex (optional, not considered for 128T)
Data staging strategies (when, how much, how frequently):
- Decoupled from processing → no overlapping: staging is submitted as a separate job by users prior to submitting processing jobs
- Loosely coupled with processing → some overlapping: staging is a task inside the same job and can be asynchronous (see the sketch below)
- Tightly coupled with processing → intertwined: processing and data staging in coordination
Data placement strategies (where; when to clean up / update):
- Loosely coupled with processing → global file system (i.e. Lustre): user-friendly abstraction, POSIX API, easiest for developers
- Tightly coupled with processing → application-optimised file system: exposes file block locations in file metadata, e.g. Google File System, HDFS
- Dynamically coupled with processing → temporary data storage and pipeline co-design: cache storage, multi-level scheduling (Stage, Process, Place)
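The loosely coupled variant can be sketched as a simple software pipeline: while the current observation is being processed on the GPU nodes, the next one is already being staged from the archive into local storage. stage(), process() and the observation IDs below are hypothetical placeholders, not the real MWA pipeline.

```python
# Hedged sketch: overlap staging of the next observation with processing of
# the current one, so GPU nodes do not sit idle while data moves.
from concurrent.futures import ThreadPoolExecutor

def stage(obs_id):
    """Placeholder: pull one observation from the long-term archive to local storage."""
    print(f"staging {obs_id} from Cortex to Fornax")

def process(obs_id):
    """Placeholder: run the calibration / imaging pipeline on staged data."""
    print(f"processing {obs_id} on Fornax")

def run(observations):
    with ThreadPoolExecutor(max_workers=1) as stager:
        pending = stager.submit(stage, observations[0])
        for current, nxt in zip(observations, observations[1:] + [None]):
            pending.result()                        # wait until 'current' is staged
            if nxt is not None:
                pending = stager.submit(stage, nxt) # stage the next obs in parallel...
            process(current)                        # ...while processing this one

run(["obs_000001", "obs_000002", "obs_000003"])     # made-up observation IDs
```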

Data Re-processing on Fornax
Uniqueness of Fornax for MWA:
- GPU cluster → MWA Real-Time System (RTS)
- Local storage (7 TB/node x 96 nodes) → RTS compute storage → in-storage processing
- Dual InfiniBand connections → RTS data movement
- Therefore, MWA has applied for 25% of the time on Fornax
Fornax local storage is treated as a cache for MWA, by analogy with SRAM/DRAM in a memory hierarchy.
[Diagram: visibility files move from the Cortex tape libraries through the Cortex disk cache into the Fornax local storage, split by frequency across nodes.]

Fornax block diagram!"#$%&'()*+,-.(/#012,-0,3#-(45-#52-6!)11&!)11'!)11.!)11- A-%:(9":-!)6,478<,?@(9":-+!""#$%%&'()*+,-')."'(/.012/'(3%3415!)6,47879, (045 (+:9,19 /),+;)6 Fornax Architecture: Chris Harris 2012 7"82$(9":-+ ()*+,$ ()*+,# ;"<*(9":-+ +)$ +)# EHF G,1-#$-, G&,-#$%> ;".<3,-(9":-+!""$!""#!"%&!"%'!"#$%&'()"*+,-.(/"0.(1#234-.2-,#.(567819:!/01#! B$C2$2D%$: EF(G,1-#$-,! =-5->"<.-$,(9":-+!"%.!"%-!"%%!$"" 73+,#-(9":-+!/01$!)11$!)11#!)112!)113!"J1"18>"K /Q(9/(A823 BC3#$D E$D'?C 9*AF ::G9 ;H'I1<(!"J1"18>"K(!FLM9//(;:G ;H'I1<(!"J1"18>"K(!FLM9//(;:G /)*(A823 -,!$/(QR 9/(A823 -,!$/(QR 9/(A823!"#$% &$'"(&)*)+,-.!"#$% ;71<=->#0!"#$?<'""$<# @+/(A823!"#$% ))/+(,0123$#!45(678(!569*:!"#$% &$'"(&)*)+,-. /)*(A823 BC3#$D E$D'?C 9*AF ::G9!"#$% ))/+(,0123$#!45(678(!569*: NO!:!L(P$3%>(,/+M)(A-. -,!$/(Q@* *S(A823 H'<>%(B#'?>I$(TMQ@PFU!! RVS(A823 -,!$/(QS 9/(A823 BLB4BLPL(,'"#?'%%$? HB!(!BL/++R 21

Processing-optimised NGAS Cache Storage
[Diagram: the Fornax architecture overlaid with NGAS cache storage: data is staged asynchronously from the Cortex long-term archive and the mirrored archives into NGAS cache instances on the Fornax nodes, close to the compute nodes (example shown for EoR data).]
The cache idea is sketched below.
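A minimal sketch of the cache idea, under assumed paths and thresholds: because the master copies stay in the Cortex long-term archive, staged files on a Fornax cache volume can simply be evicted in least-recently-accessed order when space runs low. This is an illustration, not the NGAS cache implementation.

```python
# Hedged sketch: evict least-recently-accessed staged files when the local
# cache volume runs low on space; master copies remain in the archive.
import os
import shutil

CACHE_DIR = "/local/mwa_cache"        # hypothetical per-node cache directory
MIN_FREE_BYTES = 500 * 10**9          # keep at least ~500 GB free (assumed)

def evict_lru():
    paths = (os.path.join(CACHE_DIR, f) for f in os.listdir(CACHE_DIR))
    files = sorted((p for p in paths if os.path.isfile(p)), key=os.path.getatime)
    for path in files:                # oldest access time first
        if shutil.disk_usage(CACHE_DIR).free >= MIN_FREE_BYTES:
            break
        os.remove(path)               # safe: the archive still holds the master copy
```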

Conclusion
- Requirements (MWA): data ingestion, transfer, storage, access, staging and re-processing
- NGAS: open (LGPL) software deployed in astronomy archive facilities around the globe
- NGAS tailored for MWA fulfills the above requirements and meets the MWA data challenge:
  "just access whole observations whenever I like, without going to file cabinets, getting tapes, mounting and then trying to figure out what to do" (an MWA Super Science Fellow)
- The NGAS-based MWA data archive solution has been working well during commissioning
- Fornax: the major re-processing workhorse and data-staging hotspot for MWA, and one of the few data-intensive clusters suited to MWA data re-processing
- What's next → the science archive: data modeling (ObsCore), data access (VO), tools (ObsTAP, Saada, OpenCADC)