MWA Archive A multi-tiered dataflow and storage system based on NGAS
1 MWA Archive A multi-tiered dataflow and storage system based on NGAS Chen Wu Team: Dave Pallot, Andreas Wicenec, Chen Wu
2 Agenda Dataflow overview High-level requirements Next Generation Archive System (NGAS) Additional development to tailor NGAS for MWA Meet requirements Data capturing Data ingestion Data storage Data access Data mirroring Work-in-progress Data re-processing Conclusion 1
3 Online archive 32T Dataflow Online processing 320Gb/s 1/10Gbps ~ 400 MB/s 1 Gbps Staging archive MRO Mirrored archive AARNet ICRAR, Perth MIT, USA 10Gbps 10Gbps VO Long-term archive Staging & processing Data distribution, placement, scheduling Web 10Gbps Research PBStore, Perth Fornax, Perth Scripting UI Catalog VO-Table VO-Table 2
4 Online archive 128T Dataflow Online processing 320Gb/s MRO ~ 400 MB/s 1 Gbps 1/10 Gbps Proxy Archive 1Gbps Mirrored archive AARNet ICRAR, Perth MIT, USA VO Long-term archive Staging & processing Data distribution, placement, scheduling Web 10Gbps Research PBStore, Perth Fornax, Perth Scripting UI Catalog VO-Table VO-Table 3
5 Conceptual Dataflow Tier 0 metadata Online Archive RF data stream Online Processing control monitor Further Processing (x3) Tier 2 Proxy Archive Tier 1 Mirrored Archive (x3) Long-Term Archive Offline Processing 4
6 High-level Requirements High throughput data ingestion ~400MB/s visibility, MB/s vis + image cube Efficient data distribution to multiple locations Australia / New Zealand / USA / India Secure and cost-effective storage of 8-10 TB of data collected daily Fast access to science archive by astronomers from 3 continents Intensive re-processing of archived data on GPU clusters calibration, imaging, etc. Continuous growth data volume (3PB/year), data variety (visibility, image cube, catalogue FITS, CASA, HDF5, JPEG2000) environment (Cortex → Pawsey) 5
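As a rough consistency check of the figures on this slide, the sketch below converts the quoted rates into daily and yearly volumes. It reads the daily figure as 8 to 10 TB and assumes decimal units (1 TB = 10^6 MB, 1 PB = 1000 TB) and a 365-day year; none of these assumptions is stated in the slides, and the function names are illustrative.

```python
SECONDS_PER_DAY = 86_400


def tb_per_day(rate_mb_s: float, duty_cycle: float = 1.0) -> float:
    """Daily volume in TB for a rate in MB/s sustained for a fraction of the day."""
    return rate_mb_s * SECONDS_PER_DAY * duty_cycle / 1e6


def pb_per_year(daily_tb: float, days: int = 365) -> float:
    """Yearly volume in PB for a given daily volume in TB."""
    return daily_tb * days / 1000


# Volume if the ~400 MB/s ingest rate ran around the clock.
continuous_daily_tb = tb_per_day(400)

# Yearly volume implied by 8 to 10 TB collected per day.
yearly_low_pb, yearly_high_pb = pb_per_year(8), pb_per_year(10)
```

Under these assumptions, 8 to 10 TB per day works out to roughly 2.9 to 3.7 PB per year, which is in line with the quoted 3 PB/year. A continuous 400 MB/s would produce about 34.6 TB per day, so the daily figure suggests the peak rate is not sustained for the whole day.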
7 Next Generation Archive System NGAS Openness Open source (LGPL) originated from ESO for data archiving by Knudstrup and Wicenec (2000) Used by the astronomy community VLT & La Silla, ALMA, EVLA, MWA ALMA: All 5 ALMA sites are based on NGAS, including all data transfers, mirroring from the observatory to all other sites across the globe. Full high availability setup with automatic fail-over Plugin-based - almost every feature is implemented as a plugin plugged into a light kernel Loss-free operation consistency checking, replication, fault-tolerance (e.g. power outage, disk failure, etc.) Object-based data storage Span multiple file systems distributed across multiple sites (e.g. VLT > 100 million objects) native HTTP interface (POST, GET, PUT, etc.) and location-transparent (Web admin UI) Scalable architecture Horizontal scalability add another NGAS instance In-storage processing Compute inside the archive Low cost, hardware-neutral deployment, versatility Linux machine + Disk arrays Support mobile storage media with energy efficiency Ingestion, Staging, Long-term Archive, Temporary storage, Proxy, Processing, etc. 6
8 Additional efforts needed for MWA High-throughput data ingestion 66 MB/s (ALMA) vs. 400 MB/s (MWA) Completely integrated systems vs. Theoretical performance Efficient dataflow control Multiple mirrored archives each with distinct subscription rules Saturate WAN bandwidth In-transit data processing Multi-tiered data storage Work well with HSM, e.g. retrieve a file from Tape → HTTP GET timeout Content-based access pattern classification for placement optimisation Workflow-optimised data staging to avoid I/O contention Staging data from long-term archive to Fornax (time and placement) Tasks in a complex workflow exchange data through a shared file system (But can we exploit local storage as well?) 7
9 Data MRO Wide Area Network NG/AMS Remote NGAS Servers Servers Web / Query / VO Interface Archive DB Metadata Network M&C Subsystem Local NG/AMS NGAS Servers Servers Data Network Data Producer DataCaptureMgr M&C Interface A Single Process Plugin-based Multi-threaded Throughput-oriented Fault-tolerant Admission control File handling In-memory Buffer DataHandler DataHandler Staging Area NGAS Client HDD Ramdisk SSD 8
10 Data MRO Client push (synchronous) or Server pull (asynchronous) Find data type-specific storage medium with available capacity Receive data stream, and compute checksum (CRC) on the fly Data is stored temporarily at the Staging Area Data Archive Plug-In is invoked Quality check / compression Move file to the targeted volume / directory (e.g. MWA-DAPI) Register file in NGAS DB If replication defined, trigger file delivery threads file removal scheduling 128T Ingestion MRO 9
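The ingestion steps above (receive the stream, checksum it on the fly, park it in the staging area, then move it to the target volume) can be sketched as follows. This is a minimal illustration, not NGAS code: the function name `ingest`, its parameters, and the use of CRC32 from `zlib` are assumptions, and the quality check, compression, and database registration steps are left out.

```python
import os
import shutil
import tempfile
import zlib
from typing import Iterable


def ingest(chunks: Iterable[bytes], staging_dir: str, target_path: str) -> int:
    """Write a data stream to staging while checksumming it, then move it into place."""
    crc = 0
    fd, staging_path = tempfile.mkstemp(dir=staging_dir)
    with os.fdopen(fd, "wb") as staged:
        for chunk in chunks:
            crc = zlib.crc32(chunk, crc)  # running checksum, no second read pass
            staged.write(chunk)

    parent = os.path.dirname(target_path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    # shutil.move falls back to copy-and-delete when staging and target
    # are on different file systems.
    shutil.move(staging_path, target_path)
    return crc & 0xFFFFFFFF
```

Computing the checksum while the chunks arrive avoids re-reading the file from disk, which matters when the disk is already the throughput bottleneck. The returned value is what a real system would store alongside the file record.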
11 1400 Archiving simulation throughput 1200 Total archive throughput - MB/s A - 1Gbps bandwidth for commissioning B - 12 clients / 4 servers on 6 / 2 Fornax nodes C - aggregated data producing rate for 12 clients D - 12 clients / 4 servers on 6 idataplex / 1 Supermicro E - 12 clients / 2 servers on 6 idataplex / 1 Supermicro F - 24 clients / 4 servers on 6 / 1 Fornax nodes G - 24 clients / 4 servers on 24 / 2 Fornax nodes H - 24 clients / 4 servers on 12 / 2 Fornax nodes I - 24 clients / 4 servers on 28 Fornax nodes J - aggregated data producing rate for 24 clients Simulated data rate per client - MB/s 10
12 Archive Servers CPU and File Handling 16MB/s 12 clients 48MB/s 12 clients 56MB/s 24 clients 16MB/s 12 clients 16MB/s 24clients 40MB/s 12 clients 11
13 Data Cortex 32T Distribution of observation Size NGAS DB Science DB API Web Portal 1Gbps → 10Gbps 32T disk archiving rate Staging archive ICRAR, Perth 12
14 Data Cortex 128T Web Portal Distribution of observation Size NGAS DB Science DB API 1Gbps Online archive 128T disk archiving rate 13
[Slide background: excerpt from a Sun article on LibSAM, developers.sun.com/solaris/articles/libsam.html] The time consumed to restore the data in case of disaster. The high cost of resources required for storage, network, tape, and administration. An alternative is to use the LibSAM library for the Sun StorageTek Storage Archive Manager (SAM), a paradigm for data management that goes beyond backup. What Is the LibSAM Library? Designed to use with Sun StorageTek SAM and Sun StorageTek QFS software, the LibSAM library, or API, allows you to manage data in a samfs file system from within an application. The model employed is client-server. A client process makes requests to a server process. The server processes the requests and returns the processing status to the client. In the simplest case, as is the case with LibSAM, the server and client run on the same machine. Therefore, all requests are local and translate into system calls to the kernel. Basic Concepts and Implementation. Before this article delves into the details of how LibSAM is used to overcome the limitations of traditional backup mechanisms, you should understand some basic concepts associated with each function. This article will first discuss the four major components of the Sun StorageTek SAM archive management system: Archiving, Releasing, Staging, Recycling. Archiving: Archiving, the process of backing up a file by copying it from a file system to archive media, is typically the first component. The archive media can be a removable media cartridge or a disk partition of another file system. Figure shows the basic components of the archiving process.
15 Data Cortex Client pull (synchronous) or Server push (asynchronous) Pull: Given a file identifier and version, get the file regardless of its location If multiple replicas are found, follow: host → cluster → domain → tape → mirrored archive Push: Given a list of file identifiers and deliver_to_url, sort files, then push online files while staging offline ones from the tape Support suspend/resume/cancel and persistency Server-directed I/O → better optimisation (queue mgmt., multiple requests aggregation, etc.) Data Processing Plug-Ins can be invoked prior to file retrieval E.g. decompressing, sub-setting, explicit staging, etc. Get FITS header info → Implicit staging → blocking 32T access stats release 14
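The replica preference order for a pull request (host, then cluster, domain, tape, and finally a mirrored archive) can be expressed as a simple ranking. The sketch below is illustrative only: the `Replica` record, the location labels, and `pick_replica` are hypothetical names, not part of the NGAS API.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Lower rank is preferred; the order follows the slide.
LOCATION_RANK = {"host": 0, "cluster": 1, "domain": 2, "tape": 3, "mirror": 4}


@dataclass(frozen=True)
class Replica:
    file_id: str
    version: int
    location: str  # one of the LOCATION_RANK keys
    url: str


def pick_replica(replicas: Sequence[Replica], file_id: str,
                 version: int) -> Optional[Replica]:
    """Return the closest replica of (file_id, version), or None if there is none."""
    candidates = [r for r in replicas
                  if r.file_id == file_id and r.version == version]
    return min(candidates, key=lambda r: LOCATION_RANK[r.location], default=None)
```

Because the caller only supplies an identifier and a version, the choice of replica stays on the server side, which is what makes the access location-transparent.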
16 Access Cortex 32T C118 Fornax A # of accesses Overall First Access. C106 SUN Tracks # of accesses Overall inter-accesses C104 EOR Fields First Access W44 Galactic plane field inter-accesses 15
17 Access Cortex 32T 16
18 Data Mirroring Publish / Subscribe mode Criteria → Subscription Filter Plugin Events File ingestion, subscribe command, explicit triggering, etc. Subscribe into the past Each subscriber is given a queue Each subscriber / queue is assigned a priority Transfer HTTP / FTP / GridFTP (concurrent transfer, file pipeline, parallel connection) Target can be anything that supports HTTP / FTP / GridFTP E.g. python -m SimpleHTTPServer Failed deliveries will be recorded in the backlog They get re-sent through explicit triggering when connections return to normal Support proxy mode Bypass slow hops In-transit data processing 17
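The publish/subscribe behaviour described on this slide (a filter per subscriber, one queue and priority each, and a backlog for failed deliveries) can be sketched in a few lines. Everything here is an assumption made for illustration: the class and function names, the convention that a lower priority value is served first, and the `send` callable standing in for an HTTP, FTP, or GridFTP transfer.

```python
from collections import deque
from typing import Callable, Deque, Iterable, List


class Subscriber:
    """One mirrored archive: a filter, a priority, a queue and a backlog."""

    def __init__(self, url: str, priority: int,
                 accepts: Callable[[str], bool],
                 send: Callable[[str, str], bool]) -> None:
        self.url = url
        self.priority = priority  # assumption: lower value is served first
        self.accepts = accepts    # stands in for a subscription filter plugin
        self.send = send          # hypothetical transport: (url, file_id) -> success
        self.queue: Deque[str] = deque()
        self.backlog: List[str] = []

    def offer(self, file_id: str) -> None:
        """Queue the file if it matches this subscriber's criteria."""
        if self.accepts(file_id):
            self.queue.append(file_id)

    def deliver(self) -> None:
        """Send every queued file; failures go to the backlog."""
        while self.queue:
            file_id = self.queue.popleft()
            if not self.send(self.url, file_id):
                self.backlog.append(file_id)

    def retry_backlog(self) -> None:
        """Explicit re-trigger once the connection is back."""
        self.queue.extend(self.backlog)
        self.backlog.clear()
        self.deliver()


def publish(subscribers: Iterable[Subscriber], file_id: str) -> None:
    for sub in subscribers:
        sub.offer(file_id)


def deliver_all(subscribers: Iterable[Subscriber]) -> None:
    for sub in sorted(subscribers, key=lambda s: s.priority):
        sub.deliver()
```

Keeping a separate queue per subscriber means a slow or unreachable mirror only grows its own backlog and does not hold up deliveries to the others.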
19 Mirrored MIT 12 streams - Overall rate: / MB/s Cortex TCP window size: 16.0 MB, RTT ~0.281 secs Max send buffer: 128KB MB/s (93.35Mb/s) Max send buffer: 64MB MB/s (137.19Mb/s), 47% higher 18
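To put the window size and round-trip time on this slide in context, the sketch below computes the textbook ceiling for a single TCP stream (at most one window of data per round trip) and checks the quoted relative improvement. It assumes "16.0 MB" means 16 MiB and that the two Mb/s figures are directly comparable; the ceiling is a theoretical bound, not a measurement from the slides.

```python
def window_limited_mb_s(window_bytes: int, rtt_s: float) -> float:
    """Upper bound on one TCP stream's throughput in MB/s: one window per round trip."""
    return window_bytes / rtt_s / 1e6


def relative_gain(before: float, after: float) -> float:
    """Fractional improvement of `after` over `before`."""
    return (after - before) / before


# Ceiling for one stream with a 16 MiB window over a 0.281 s round trip.
per_stream_ceiling = window_limited_mb_s(16 * 1024 * 1024, 0.281)

# Improvement from the 128KB to the 64MB send buffer, using the quoted Mb/s values.
gain = relative_gain(93.35, 137.19)
```

The gain evaluates to about 0.47, matching the "47% higher" on the slide. The same one-window-per-round-trip bound explains why a small send buffer hurts on a long path: with a 0.281 s round trip, a 128 KB buffer caps a single stream well below 1 MB/s, so larger buffers and parallel streams are both needed to fill the link.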
20 Data Re-processing (WIP) Requirements Stage data from Cortex to Fornax Process calibration, imaging pipelines with data placed on Fornax Archive calibrated data / image cubes back to Cortex (Optional, not considered for 128T) Data staging strategies (when, how much, how frequently) Decoupled from Processing → no overlapping Staging is submitted as another job by users prior to submitting processing jobs Loosely-coupled with Processing → some overlapping Staging is a task inside the same job, can be asynchronous Tightly-coupled with Processing → intertwined Processing and data staging in coordination Data placement strategies (where, when to cleanup / update) Loosely-coupled with Processing → Global file system (i.e. Lustre) User-friendly abstraction POSIX API, easiest for developers Tightly-coupled with Processing → Application-optimised file system Expose file block location in file metadata, E.g. Google File System, HDFS Dynamically-coupled with Processing → Temporary data storage and pipeline co-design Cache storage Multi-level Scheduling Stage Process Place 19
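The loosely-coupled staging strategy, where staging runs as an asynchronous task inside the same job and partly overlaps with processing, can be illustrated with a one-file look-ahead. This is a sketch under stated assumptions: `stage` and `process` are hypothetical callables supplied by the caller (for example, a copy from the archive to local disk and a calibration step), and a single background thread stands in for the staging task.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Optional, TypeVar

T = TypeVar("T")


def run_overlapped(file_ids: Iterable[str],
                   stage: Callable[[str], str],
                   process: Callable[[str], T]) -> List[T]:
    """Stage file i+1 in the background while file i is being processed."""
    ids: List[str] = list(file_ids)
    results: List[T] = []
    if not ids:
        return results

    upcoming: List[Optional[str]] = [*ids[1:], None]
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(stage, ids[0])
        for next_id in upcoming:
            local_path = pending.result()  # block until the current file is staged
            if next_id is not None:
                pending = pool.submit(stage, next_id)  # start staging the next one
            results.append(process(local_path))
    return results
```

With the decoupled strategy, all calls to `stage` would finish in a separate job before any call to `process` starts. The look-ahead here hides staging time behind processing time, at the cost of needing local space for two files at once.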
21 Data Re-processing on Fornax Uniqueness of Fornax for MWA GPU Cluster → MWA Real-Time System Local storage (7TB/node x 96 nodes) → RTS compute storage → in-storage processing Dual Infiniband connections → RTS data movement Therefore, MWA has applied 25% time on Fornax Fornax local storage is treated as a cache for MWA SRAM DRAM Frequency splits Fornax Local Storage Cortex Disk Cache Visibility files Cortex Tape Libraries 20
22 Fornax block diagram [block diagram labels not recoverable from the transcript] Fornax Architecture: Chris Harris 21
23 Processing-optimised NGAS Cache Storage [diagram: Long-term archive, Mirrored archive (two instances), Async staging, EOR; remaining labels not recoverable from the transcript] 22
24 Conclusion Requirements MWA Data ingestion, transferring, storage, access, staging, re-processing NGAS Open (LGPL) software deployed in astronomy archive facilities around the globe NGAS tailored for MWA fulfills the above requirements, meets the MWA data challenge "just access whole observations whenever I like, without going to file cabinets getting tapes, mounting and then trying to figure out what to do" (an MWA Super Science Fellow) The NGAS-based MWA data archive solution is working fine during commissioning Fornax The major re-processing power horse and data staging hotspot for MWA One of few data-intensive clusters suited for MWA data reprocessing What's Next → Science archive Data modeling: ObsCore, Data access: VO Tools: ObsTAP, Saada, OpenCADC 23
More informationThe glite File Transfer Service
The glite File Transfer Service Peter Kunszt Paolo Badino Ricardo Brito da Rocha James Casey Ákos Frohner Gavin McCance CERN, IT Department 1211 Geneva 23, Switzerland Abstract Transferring data reliably
More informationDesigning a Cloud Storage System
Designing a Cloud Storage System End to End Cloud Storage When designing a cloud storage system, there is value in decoupling the system s archival capacity (its ability to persistently store large volumes
More informationCHAPTER 1 - JAVA EE OVERVIEW FOR ADMINISTRATORS
CHAPTER 1 - JAVA EE OVERVIEW FOR ADMINISTRATORS Java EE Components Java EE Vendor Specifications Containers Java EE Blueprint Services JDBC Data Sources Java Naming and Directory Interface Java Message
More informationIBM WebSphere Distributed Caching Products
extreme Scale, DataPower XC10 IBM Distributed Caching Products IBM extreme Scale v 7.1 and DataPower XC10 Appliance Highlights A powerful, scalable, elastic inmemory grid for your business-critical applications
More informationHadoop Distributed File System. Dhruba Borthakur June, 2007
Hadoop Distributed File System Dhruba Borthakur June, 2007 Goals of HDFS Very Large Distributed File System 10K nodes, 100 million files, 10 PB Assumes Commodity Hardware Files are replicated to handle
More informationPerformance Analysis of Mixed Distributed Filesystem Workloads
Performance Analysis of Mixed Distributed Filesystem Workloads Esteban Molina-Estolano, Maya Gokhale, Carlos Maltzahn, John May, John Bent, Scott Brandt Motivation Hadoop-tailored filesystems (e.g. CloudStore)
More informationArcGIS for Server: Administrative Scripting and Automation
ArcGIS for Server: Administrative Scripting and Automation Shreyas Shinde Ranjit Iyer Esri UC 2014 Technical Workshop Agenda Introduction to server administration Command line tools ArcGIS Server Manager
More informationIBM Content Collector Deployment and Performance Tuning
Redpaper Wei-Dong Zhu Markus Lorch IBM Content Collector Deployment and Performance Tuning Overview This IBM Redpaper publication explains the key areas that need to be considered when planning for IBM
More informationCouchbase Server Under the Hood
Couchbase Server Under the Hood An Architectural Overview Couchbase Server is an open-source distributed NoSQL document-oriented database for interactive applications, uniquely suited for those needing
More informationOracle WebLogic Server 11g Administration
Oracle WebLogic Server 11g Administration This course is designed to provide instruction and hands-on practice in installing and configuring Oracle WebLogic Server 11g. These tasks include starting and
More informationCray DVS: Data Virtualization Service
Cray : Data Virtualization Service Stephen Sugiyama and David Wallace, Cray Inc. ABSTRACT: Cray, the Cray Data Virtualization Service, is a new capability being added to the XT software environment with
More informationPOWER ALL GLOBAL FILE SYSTEM (PGFS)
POWER ALL GLOBAL FILE SYSTEM (PGFS) Defining next generation of global storage grid Power All Networks Ltd. Technical Whitepaper April 2008, version 1.01 Table of Content 1. Introduction.. 3 2. Paradigm
More informationSAP HANA SAP s In-Memory Database. Dr. Martin Kittel, SAP HANA Development January 16, 2013
SAP HANA SAP s In-Memory Database Dr. Martin Kittel, SAP HANA Development January 16, 2013 Disclaimer This presentation outlines our general product direction and should not be relied on in making a purchase
More informationBig Data Visualization with JReport
Big Data Visualization with JReport Dean Yao Director of Marketing Greg Harris Systems Engineer Next Generation BI Visualization JReport is an advanced BI visualization platform: Faster, scalable reports,
More informationTake An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc hairong@yahoo-inc.com
Take An Internal Look at Hadoop Hairong Kuang Grid Team, Yahoo! Inc hairong@yahoo-inc.com What s Hadoop Framework for running applications on large clusters of commodity hardware Scale: petabytes of data
More informationHADOOP PERFORMANCE TUNING
PERFORMANCE TUNING Abstract This paper explains tuning of Hadoop configuration parameters which directly affects Map-Reduce job performance under various conditions, to achieve maximum performance. The
More informationRemoving Performance Bottlenecks in Databases with Red Hat Enterprise Linux and Violin Memory Flash Storage Arrays. Red Hat Performance Engineering
Removing Performance Bottlenecks in Databases with Red Hat Enterprise Linux and Violin Memory Flash Storage Arrays Red Hat Performance Engineering Version 1.0 August 2013 1801 Varsity Drive Raleigh NC
More informationDatabase Monitoring Requirements. Salvatore Di Guida (CERN) On behalf of the CMS DB group
Database Monitoring Requirements Salvatore Di Guida (CERN) On behalf of the CMS DB group Outline CMS Database infrastructure and data flow. Data access patterns. Requirements coming from the hardware and
More informationA Web Services Data Analysis Grid *
A Web Services Data Analysis Grid * William A. Watson III, Ian Bird, Jie Chen, Bryan Hess, Andy Kowalski, Ying Chen Thomas Jefferson National Accelerator Facility 12000 Jefferson Av, Newport News, VA 23606,
More informationCloud Application Development (SE808, School of Software, Sun Yat-Sen University) Yabo (Arber) Xu
Lecture 4 Introduction to Hadoop & GAE Cloud Application Development (SE808, School of Software, Sun Yat-Sen University) Yabo (Arber) Xu Outline Introduction to Hadoop The Hadoop ecosystem Related projects
More informationTCP Servers: Offloading TCP Processing in Internet Servers. Design, Implementation, and Performance
TCP Servers: Offloading TCP Processing in Internet Servers. Design, Implementation, and Performance M. Rangarajan, A. Bohra, K. Banerjee, E.V. Carrera, R. Bianchini, L. Iftode, W. Zwaenepoel. Presented
More informationVMware vrealize Automation
VMware vrealize Automation Reference Architecture Version 6.0 or Later T E C H N I C A L W H I T E P A P E R J U N E 2 0 1 5 V E R S I O N 1. 5 Table of Contents Overview... 4 What s New... 4 Initial Deployment
More informationOpen Text Archive Server and Microsoft Windows Azure Storage
Open Text Archive Server and Microsoft Windows Azure Storage Whitepaper Open Text December 23nd, 2009 2 Microsoft W indows Azure Platform W hite Paper Contents Executive Summary / Introduction... 4 Overview...
More informationGraySort and MinuteSort at Yahoo on Hadoop 0.23
GraySort and at Yahoo on Hadoop.23 Thomas Graves Yahoo! May, 213 The Apache Hadoop[1] software library is an open source framework that allows for the distributed processing of large data sets across clusters
More informationCloud Computing for Control Systems CERN Openlab Summer Student Program 9/9/2011 ARSALAAN AHMED SHAIKH
Cloud Computing for Control Systems CERN Openlab Summer Student Program 9/9/2011 ARSALAAN AHMED SHAIKH CONTENTS Introduction... 4 System Components... 4 OpenNebula Cloud Management Toolkit... 4 VMware
More informationBEST PRACTICES FOR INTEGRATING TELESTREAM VANTAGE WITH EMC ISILON ONEFS
Best Practices Guide BEST PRACTICES FOR INTEGRATING TELESTREAM VANTAGE WITH EMC ISILON ONEFS Abstract This best practices guide contains details for integrating Telestream Vantage workflow design and automation
More information