HDF5-iRODS Project. August 20, 2008



Similar documents
DataGrids 2.0 irods - A Second Generation Data Cyberinfrastructure. Arcot (RAJA) Rajasekar DICE/SDSC/UCSD

irods and Metadata survey Version 0.1 Date March Abhijeet Kodgire 25th

GridFTP: A Data Transfer Protocol for the Grid

THE CCLRC DATA PORTAL

Digital Preservation Lifecycle Management

Archiving, Indexing and Accessing Web Materials: Solutions for large amounts of data

Globus Striped GridFTP Framework and Server. Raj Kettimuthu, ANL and U. Chicago

I G E O N - I N D I A, Status Report - Sep, INDIA STATUS REPORT - SEPTEMBER 2007

IDL. Get the answers you need from your data. IDL

PoS(ISGC 2013)021. SCALA: A Framework for Graphical Operations for irods. Wataru Takase KEK wataru.takase@kek.jp

Assessment of RLG Trusted Digital Repository Requirements

PASIG May 12, Jacob Farmer, CTO Cambridge Computer

PACE Predictive Analytics Center of San Diego Supercomputer Center, UCSD. Natasha Balac, Ph.D.

The Mantid Project. The challenges of delivering flexible HPC for novice end users. Nicholas Draper SOS18

Deploying a distributed data storage system on the UK National Grid Service using federated SRB

Luc Declerck AUL, Technology Services Declan Fleming Director, Information Technology Department

CS 3530 Operating Systems. L02 OS Intro Part 1 Dr. Ken Hoganson

Early Cloud Experiences with the Kepler Scientific Workflow System

The THREDDS Data Repository: for Long Term Data Storage and Access

David Minor. Chronopolis Program Manager Director, Digital Preserva7on Ini7a7ves UCSD Library San Diego Supercomputer Center

MIGRATING DESKTOP AND ROAMING ACCESS. Migrating Desktop and Roaming Access Whitepaper

integrated Rule-Oriented Data System Reference

VisIt Visualization Tool

Data Management System for grid and portal services

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 14

Information Sciences Institute University of Southern California Los Angeles, CA {annc,

Attaching Cloud Storage to a Campus Grid Using Parrot, Chirp, and Hadoop

How To Handle Big Data With A Data Scientist

Efficient Metadata Generation to Enable Interactive Data Discovery over Large-scale Scientific Data Collections

Data-Intensive Science and Scientific Data Infrastructure

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

Sybase Unwired Platform 2.0

PLATFORA INTERACTIVE, IN-MEMORY BUSINESS INTELLIGENCE FOR HADOOP

Intro to Data Management. Chris Jordan Data Management and Collections Group Texas Advanced Computing Center

Data Grid Landscape And Searching

How To Speed Up A Flash Flash Storage System With The Hyperq Memory Router

How to Ingest Data into Google BigQuery using Talend for Big Data. A Technical Solution Paper from Saama Technologies, Inc.

Parallel Analysis and Visualization on Cray Compute Node Linux

Visualizing and Analyzing Massive Astronomical Datasets with Partiview

Data Management in an International Data Grid Project. Timur Chabuk 04/09/2007

Tools and Services for the Long Term Preservation and Access of Digital Archives

The Arctic Observing Network and its Data Management Challenges Florence Fetterer (NSIDC/CIRES/CU), James A. Moore (NCAR/EOL), and the CADIS team

A Survey Study on Monitoring Service for Grid

The NERC DataGrid (NDG)

irods Policy-Driven Data Preservation Integrating Cloud Storage and Institutional Repositories

Scientific versus Business Workflows

Active Measurement Data Analysis Techniques

Cross platform Migration of SAS BI Environment: Tips and Tricks

Fedora Distributed data management (SI1)

Implementing and Managing Microsoft Desktop Virtualization

Exploring the roles and responsibilities of data centres and institutions in curating research data a preliminary briefing.

An IDL for Web Services

DAME Astrophysical DAta Mining Mining & & Exploration Exploration GRID

Zhenping Liu *, Yao Liang * Virginia Polytechnic Institute and State University. Xu Liang ** University of California, Berkeley

XSEDE Service Provider Software and Services Baseline. September 24, 2015 Version 1.2

CiteSeer x in the Cloud

salsadpi: a dynamic provisioning interface for IaaS cloud

Understanding BeyondTrust Patch Management

Sybase Unwired Platform 2.1.x

Conceptualizing Policy-Driven Repository Interoperability (PoDRI) Using irods and Fedora

Oracle9i Data Warehouse Review. Robert F. Edwards Dulcian, Inc.

AN APPLICATION OF INFORMATION RETRIEVAL IN P2P NETWORKS USING SOCKETS AND METADATA

A Next-Generation Analytics Ecosystem for Big Data. Colin White, BI Research September 2012 Sponsored by ParAccel

Data Services for Campus Researchers

Data Management. Facility Access Challenges: Rudi Eigenmann NEES Operations Headquarters NEEScomm Center Purdue University

Advanced Peer to Peer Discovery and Interaction Framework

Science DMZs Understanding their role in high-performance data transfers

Data Management using irods

INTEGRATED RULE ORIENTED DATA SYSTEM (IRODS)

April 30,

Distributed File Systems

Utilizing the SDSC Cloud Storage Service

Gluster Filesystem 3.3 Beta 2 Hadoop Compatible Storage

Using Databases to Manage State Information for. Globally Distributed Data

Challenges in e-science: Research in a Digital World

A DISTRIBUTED CATALOG AND DATA SERVICES SYSTEM FOR REMOTE SENSING DATA

Infosys GRADIENT. Enabling Enterprise Data Virtualization. Keywords. Grid, Enterprise Data Integration, EII Introduction

Migrating a (Large) Science Database to the Cloud

In situ data analysis and I/O acceleration of FLASH astrophysics simulation on leadership-class system using GLEAN

Metadata driven framework for the Canada Research Data Centre Network

GridFTP GUI: An Easy and Efficient Way to Transfer Data in Grid

Concepts in Distributed Data Management or History of the DICE Group

Data Semantics Aware Cloud for High Performance Analytics

Automated and Scalable Data Management System for Genome Sequencing Data

MS_10324 Implementing and Managing Microsoft Desktop Virtualization

Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software?

Esri Maps for Business Intelligence (BI)

Efficiency of Web Based SAX XML Distributed Processing

Taking EPM to new levels with Oracle Hyperion Data Relationship Management WHITEPAPER

ENABLING DATA TRANSFER MANAGEMENT AND SHARING IN THE ERA OF GENOMIC MEDICINE. October 2013

iplant + irods: Enabling data driven collaborations Nirav Merchant iplant Collaborative/Univ. of Arizona nirav@ .arizona.edu VAMP 2012 Utrecht

MDSplus Automated Build and Distribution System

MS Updating your Microsoft SQL Server 2008 BI Skills to SQL Server 2008 R2

Data processing goes big

PART 1. Representations of atmospheric phenomena

Mitglied der Helmholtz-Gemeinschaft. System monitoring with LLview and the Parallel Tools Platform

Panasas High Performance Storage Powers the First Petaflop Supercomputer at Los Alamos National Laboratory

An Integrated CyberSecurity Approach for HEP Grids. Workshop Report.

LSST Data Management. Tim Axelrod Project Scientist - LSST Data Management. Thursday, 28 Oct 2010

Chronopolis: A Partnership. The Chronopolis: Digital Preservation Archive Development and Demonstration Program

Transcription:

P A G E 1 HDF5-iRODS Project Final report Peter Cao The HDF Group 1901 S. First Street, Suite C-2 Champaign, IL 61820 xcao@hdfgroup.org Mike Wan San Diego Supercomputer Center University of California La Jolla, CA 92093 mwan@sdsc.edu August 20, 2008 Acknowledgements This project was sponsored by CIP/NLADR, an NSF PACI Project in Support of NCSA- SDSC Collaboration. The project was managed by the CyberInfrastructure Partnership (CIP), a joint effort led by the National Center for Supercomputing Applications (NCSA) and the San Diego Supercomputer Center (SDSC). The work was carried out by The HDF Group and the SDSC SRB team. The ASC/Alliance Center for Astrophysical Thermonuclear Flashes at the University of Chicago provided FLASH simulation data (HDF5 files) and other help for this project. The islice tool was built based on the extract_slice_from_chkpnt tool developed by Paul Ricker (UIUC/NCSA). Ruth Aydt (The HDF Group) also contributed to the project.

P A G E 2 Executive Summary Numerous scientific teams use the HDF5 format to store very large datasets. Efficient use of this data in a distributed environment depends on client applications being able to read any subset of the data without transferring the entire file to the local machine. The goal of the HDF5-iRODS Project was to develop an HDF5-iRODS module for the irods datagrid server that supported this capability, and to apply the technology to an NCSA/SDSC Strategic Applications Program (SAP) project, FLASH. A joint team from The HDF Group (representing NCSA) and the SDSC SRB group collaborated to accomplish the project goal. The team implemented five HDF5 microservices functions on the irods server, and developed an irods FLASH slice client application. The client implementation also includes a JNI interface that allows HDFView, a standard tool for browsing HDF5 files, to access HDF5 files stored remotely in irods. Finally, three new collection client/server calls were added to the irods APIs, making it easier for users to query the content of an irods collection. Introduction Simple data reduction operations and filters are commonly applied to large datasets. The resulting datasets are often quite small relative to the size of the original datasets. The time required to transfer a large file to a local machine can be very long when the file is stored remotely. A great deal of time can often be saved by performing the data reduction and filtering operations at the remote site and only transferring the results to the local machine. An application developed previously, HDF5-SRB, enabled a client to access objects in an HDF5 file stored remotely in a Storage Resource Broker (SRB) without transferring the entire file. The success of the HDF5-SRB project prompted NCSA and SDSC to undertake the current project which implemented similar capabilities for the new irods technology, and provided support for an NCSA/SDSC Strategic Applications Programs (SAP) project. FLASH was the SAP project selected for the initial implementation. Since the fall of 2004, the SDSC SRB team has been working on the next generation data management cyber-infrastructure, i Rule Oriented Data Systems (irods). Unlike SRB, where the policies used to manage the data at the server level are hard-coded, the irods architecture provides the means for encoding customized data management functionalities in an easy and declarative fashion using the Rule Oriented Programming (ROP) paradigm. Also, irods is open source. Because of the advantages of irods over SRB, NCSA, SDSC, and The HDF Group agreed to move to irods technology. A major task for this project was the transfer of previouslydeveloped HDF5-SRB technology to HDF5-iRODS technology. Once this work was complete, its usefulness was verified for the FLASH SAP project. FLASH Cosmology is a set of FLASH modules for cosmological applications, particularly the simulation of large-scale structures. FLASH simulations produce a variety of output data, require very fast I/O, and must run on high-performance parallel machines. Because of these characteristics, FLASH simulation output data is stored in HDF5 files.

P A G E 3 Accessing FLASH simulation data stored remotely presents major challenges. Due to the large scale structures, FLASH simulations generate terabytes of data. While HDF5 supports highperformance access to subsets of the data in a local file, it can take hours to transfer the entire file over the network. Furthermore, having multiple copies of the entire file greatly increases overall storage capacity requirements. A main goal of the HDF5-iRODS project was to address these issues. The HDF5-iRODS module offers the following major benefits to the FLASH community: 1. Reduces the need for large storage on local machine. Terabytes of data reside remotely, only small subsets are staged locally. 2. Supports fast browsing of data objects and metadata. Users can browse data objects by examining the structure of a file without loading the data content. This capability can make an enormous difference for files with large numbers of data objects, or with very large objects. 3. Provides rapid access to selected data and metadata. By transferring only the selected data/metadata to the client, instead of the entire multi-terabyte file, transfer time is reduced tremendously, especially when wide-area networks are involved. 4. Facilitates data sharing among scientists. An irods server provides a common lookup for distributed data, making it easily shared by many scientists without requiring them to have multiple copies of the same file on their local machines. Furthermore, everyone will have access to updated data after a new simulation run. Work Performed The irods micro-services are small, well-defined procedures/functions that perform certain tasks within the irods infrastructure. In order to support HDF5 with irods, five new micro-services were implemented on the server side. o msih5file_open o msih5file_close o msih5dataset_read o msih5dataset_read_attribute o msih5group_read_attribute To make these micro-services easier to use, five client functions were also created. o clh5file_open o clh5file_close o clh5dataset_read o clh5dataset_read_attribute o clh5group_read_attribute

P A G E 4 Three new collection client/server calls were added to make it easier for users to query the contents of a collection registered with an irods server. o rcopencollection o rcreadcollection o rcclosecollection islice, a command line tool for use with FLASH data was developed. The tool extracts a slice, perpendicular to one of the coordinate axes, from a FLASH output file stored on an irods server. The selection of a slice is based on the extract_slice_from_chkpnt tool, developed by Paul Ricker (UIUC/NCSA). The extracted slice is saved to raw floats and jpeg files on the local machine. The islice tool has been tested with FLASH simulation data provided by the ASC/Alliance Center for Astrophysical Thermonuclear Flashes at the University of Chicago. irods is written in C. An HDF5-iRODS JNI was implemented to interoperate between Java applications and the HDF5-iRODS module. This JNI allowed a Java application to be built that could talk to an irods server and display HDF5 data residing on the server. To support irods connectivity from HDFView, a set of irods HDF5 Java objects were added to the HDF-Java products. o H5SrbFile.java o H5SrbGroup.java o H5SrbScalarDS.java o H5SrbCompoundDS.java o H5SrbDatatype.java The HDFView GUI was improved to support irods. The improved GUI does not depend on the jargon package, but instead uses the new collection client/server calls. The HDF5-iRODS module, the FLASH slice tool, and the improved HDFView were tested using an irods server run by The HDF Group. irods 1.1, which includes the HDF5-iRODS module, has been released. The HDF5-iRODS User s Guide, final project report, and distribution README files have been written. The User s Guide is undergoing a final editorial pass prior to publication. Remaining Tasks The following tasks remain to be completed. The HDF Group and the SDSC SRB team will continue to work on these tasks over the next few months. Deploy HDF5-iRODS to evaluation server at NCSA or SDSC (TBD) Release islice, the FLASH slice tool (Nov. 10, 2008) Release HDFView with HDF5-iRODS support (Oct. 10, 2008)

P A G E 5 Publish the User s Guide (Dec. 10, 2008) Present the work at the HDF and HDF-EOS Workshop XII (Oct. 15-17, 2008, Aurora, CO) Present a poster at the escience 2008 conference, pending acceptance by the program committee (Dec. 7-12, 2008, Indianapolis, IN) Proposed Next Steps The current project supported the development of the HDF5-iRODS technology and the demonstration of its usefulness for a particular high-performance science application. In order for the work to have a broader and sustained impact on the science community, further action is needed. Some proposed work that could be undertaken with additional support is outlined here. If funding is possible, a more detailed proposal can be written. Perform updates and tests of the HDF5-iRODS module as HDF5, irods, and underlying system technologies continue to evolve. Support the staff responsible for running irods servers and installing the HDF5-iRODS software. Or, alternatively, perform those activities directly. Support science communities as they become familiar with HDF5-iRODS. Add new functions and capabilities requested by users. Support additional FLASH operations such as statistics, data projections, etc. Build a general framework that will allow the HDF-iRODS module to be applied to other SAP applications. Implement general command line tools to access HDF5 files stored on irods servers, such as ih5ls (list data objects in an HDF5 file) and ih5read (read a specific dataset or subset of a dataset).