irods Overview Intro to Data Grids and Policy-Driven Data Management!!Leesa Brieger, RENCI! Reagan Moore, DICE & RENCI!



Similar documents
irods Overview Introduction to Data Grids, Policy-Driven Data Management, and Enterprise irods

Data Management using irods

Policy Policy--driven Distributed driven Distributed Data Management (irods) Richard M arciano Marciano marciano@un

INTEGRATED RULE ORIENTED DATA SYSTEM (IRODS)

Technical. Overview. ~ a ~ irods version 4.x

irods and Metadata survey Version 0.1 Date March Abhijeet Kodgire 25th

Automated and Scalable Data Management System for Genome Sequencing Data

irods Policy-Driven Data Preservation Integrating Cloud Storage and Institutional Repositories

Data Management in an International Data Grid Project. Timur Chabuk 04/09/2007

Conceptualizing Policy-Driven Repository Interoperability (PoDRI) Using irods and Fedora

Data grid storage for digital libraries and archives using irods

Archiving, Indexing and Accessing Web Materials: Solutions for large amounts of data

Nexus Professional Whitepaper. Repository Management: Stages of Adoption

How To Retire A Legacy System From Healthcare With A Flatirons Eas Application Retirement Solution

Deploying a distributed data storage system on the UK National Grid Service using federated SRB

Meister Going Beyond Maven

Integrated Rule-based Data Management System for Genome Sequencing Data

The National Consortium for Data Science (NCDS)

Abstract. 1. Introduction. irods White Paper 1

irods at CC-IN2P3: managing petabytes of data

Software configuration management

THE CCLRC DATA PORTAL

Using the NDMP File Service for DMA- Driven Replication for Disaster Recovery. Hugo Patterson

Servers. Servers. NAT Public Subnet: /20. Internet Gateway. VPC Gateway VPC: /16

Diagram 1: Islands of storage across a digital broadcast workflow

Oracle Recovery Manager

In ediscovery and Litigation Support Repositories MPeterson, June 2009

Managing Next Generation Sequencing Data with irods

PoS(ISGC 2013)021. SCALA: A Framework for Graphical Operations for irods. Wataru Takase KEK wataru.takase@kek.jp

BMC Cloud Management Functional Architecture Guide TECHNICAL WHITE PAPER

Comparing Microsoft SQL Server 2005 Replication and DataXtend Remote Edition for Mobile and Distributed Applications

How To Manage Your Digital Assets On A Computer Or Tablet Device

Open Source Backup with Amanda

IKAN ALM Architecture. Closing the Gap Enterprise-wide Application Lifecycle Management

Grid Sun Carlo Nardone. Technical Systems Ambassador GSO Client Solutions

Symantec OpenStorage Date: February 2010 Author: Tony Palmer, Senior ESG Lab Engineer

Pluggable Rule Engine

11. Oracle Recovery Manager Overview and Configuration.

WHAT S NEW WITH EMC NETWORKER

Functional Requirements for Digital Asset Management Project version /30/2006

irods for Big Data Management in Research Driven Organizations Charles Schmitt CTO & Director of Informatics RENCI

MDSplus Automated Build and Distribution System

BlueArc unified network storage systems 7th TF-Storage Meeting. Scale Bigger, Store Smarter, Accelerate Everything

W H I T E P A P E R E X E C U T I V E S U M M AR Y S I T U AT I O N O V E R V I E W. Sponsored by: EMC Corporation. Laura DuBois May 2010

Media Exchange really puts the power in the hands of our creative users, enabling them to collaborate globally regardless of location and file size.

Distributed File Systems An Overview. Nürnberg, Dr. Christian Boehme, GWDG

Jenkins World Tour 2015 Santa Clara, CA, September 2-3

PASIG May 12, Jacob Farmer, CTO Cambridge Computer

WOS Cloud. ddn.com. Personal Storage for the Enterprise. DDN Solution Brief

Jenkins Continuous Build System. Jesse Bowes CSCI-5828 Spring 2012

Local Loading. The OCUL, Scholars Portal, and Publisher Relationship

Clodoaldo Barrera Chief Technical Strategist IBM System Storage. Making a successful transition to Software Defined Storage

EXTENDING ORACLE WEBCENTER TO MOBILE DEVICES: BANNER ENGINEERING SUCCEEDS WITH MOBILE SALES ENABLEMENT

Data Management Resources at UNC: The Carolina Digital Repository and Dataverse Network

ILM et Archivage Les solutions IBM

Veritas Backup Exec : Protecting Microsoft SharePoint

The Data Grid: Towards an Architecture for Distributed Management and Analysis of Large Scientific Datasets

EMC NETWORKER AND DATADOMAIN

<Insert Picture Here> Oracle Secure Backup 10.3 Secure Your Data, Protect Your Budget

White Paper. Software Development Best Practices: Enterprise Code Portal

IBM EXAM QUESTIONS & ANSWERS

Digital Preservation. OAIS Reference Model

Ikasan ESB Reference Architecture Review

NightOwlDiscovery. EnCase Enterprise/ ediscovery Strategic Consulting Services

Case Study. Data Governance Portal Brainvire Infotech Pvt Ltd Page 1 of 1

Ex Libris Rosetta: A Digital Preservation System Product Description

Software design (Cont.)

Managing your Red Hat Enterprise Linux guests with RHN Satellite

Oracle Data Integrator 12c: Integration and Administration

Columbia University Digital Library Architecture. Robert Cartolano, Director Library Information Technology Office October, 2009

Oracle Data Integrator 11g: Integration and Administration

Middleware- Driven Mobile Applications

Data Grids. Lidan Wang April 5, 2007

SOFTWARE TESTING TRAINING COURSES CONTENTS

Object storage in Cloud Computing and Embedded Processing

Content Distribution Management

Transcription:

irods Overview Intro to Data Grids and Policy-Driven Data Management!!Leesa Brieger, RENCI! Reagan Moore, DICE & RENCI!

Renaissance Computing Institute (RENCI) A research unit of UNC Chapel Hill Current Director: Stan Ahalt, formerly from the Ohio Supercomputer Center State-supported Governed by the Triangle universities: UNC Chapel Hill NC State University (Raleigh) Duke University (Durham) 2!

Data Intensive Cyber Environments (DICE Group) Directed by Reagan Moore Joint appointment in RENCI and SILS (UNC) Developed SRB, the Storage Resource Broker Began at SDSC, the San Diego Supercomputer Center Most of the group migrated to UNC Chapel Hill in 2008 The group is bi-coastal: DICE-UNC, DICE-UCSD irods, the integrated Rule-Oriented Data System, released in 2009 3!

irods Evolution Based on decade-long SRB development experience for managing distributed data Community-driven irods picked up where SRB left off Modular, extensible, customizable Open source (BSD license) Supported by DICE and RENCI at UNC: irods@renci 4!

irods I. Data grid middleware II. Data management infrastructure III. Framework for implementing policy-driven data management We shall cover all these aspects of irods. 5!

I. irods as Data Grid Middleware 6!

irods Unified Virtual Collection irods View of Distributed Data User Client User sees a single collection My Data: disk, filesystem, WOS storage unit... My Data: tape, database, filesystem... Partner s Data remote disk, tape, filesystem... irods installs over heterogeneous data resources Users share & manage distributed data as a single collection 7!

irods as a Data Grid Sharing data across: geographic and institutional boundaries heterogeneous resources (hardware/software) Virtual (logical) collections of distributed data Global name spaces data/files users: single sign on storage: virtual resources Metadata catalogue (icat) manages mappings between logical and physical name spaces Beyond a single-site repository model 8!

A RENCI Data Grid A complete data grid (zone) has one metadata catalogue (icat) irods Server irods Server irods Server irods Server irods Server irods Server Metadata Catalog (icat) Client asks for data request goes to an irods server Server contacts the icat- enabled server Informa@on (loca@on, access rights, etc) is retrieved from the icat Server containing data is signaled to send data to authorized client 9!

TUCASI Infrastructure Project (TIP) Federated Data Grids Independent data grids (zones), with separate metadata catalogues (icats), can federate 10!

II. irods for Data Management 11!

Issues in Data Management Organization and Usage Distributed data/virtual collections Distributed access to data (groups); data sharing (remote access and permissions/protection) Publishing (general distribution) Back-ups and replicas Metadata collection and tagging Evolution in usage model (life cycle) 12!

Data Life Cycle Creation Local Policy Active Use Service/Use Publication & Sharing Reference Collection Distribution Discovery and Re- purposing Archival Collection/ Deletion Retention/ Preservation Usage s ds evolution across the stages of the data life cycle.

More Issues in Data Management Requirements on Data Infrastructure Data integrity, authenticity/verification, provenance Discoverability (metadata management and query support) Audit tracking/accounting Reanalysis, reproducibility Interfaces to the data Data services (derived products, new formats, ) 14!

Further Considerations Ingestion How will data be submitted? Virus checks Data types Necessary metadata Sharing Access rights Copyright and intellectual property terms Curation Setting policy requirements into action implementation Managing remote data Preservation Long-term: life span, archival policy Metadata to support policy 15!

irods for Data Management irods infrastructure supports these and other capabilities for data management. Need an overall data management plan in order to move to implementation. Clear statement of data management plan = data policy Use irods can be used to implement that data policy. 16!

III. irods for Policy-Driven Data Management 17!

What is Policy-Driven Data Management? Policy determines when management procedures are run Define a data policy Identify the management steps to carry out the policy Define computer procedures for implementation Trigger the procedures when policy requires Events in the data grid such as putting, getting, replicating, collections or files creating users, resources, groups can trigger procedures State of the data grid can trigger procedures sdfsdf when given time period has elapsed whenever a file of prescribed format is ingested when data reaches a prescribed age 18!

Policy-Oriented Data Infrastructure Implement management policies Each community defines their own policies Manage administrative tasks Data administrator (not system administrator) manages data collections Assessment criteria for checking policy compliance Point-in-time: queries on metadata catalog Compliance over time: queries on audit tables 19!

Policy-Based Data Environments Assembled/Distributed Collections Properties - attributes that ensure the purpose of the collections Policies - methodologies for enforcing desired properties Procedures - functions that implement the policies implemented as computer actionable rules/workflows Persistent state information - results of applying the procedures contained in system metadata Assessment criteria - validation that state information conforms to the desired purpose mapped to periodically executed policies 20!

Additional irods Design Goals Abstract out the data management Policy-based data management Separate policy enforcement from storage administration Policy follows data around the grid: collection management independent of remote storage locations Scalability and extensibility Enable versioning of policies and procedures and data [coming] Support differentiated services (storage, metadata, messaging, workflow, scheduling) Microservice modules 21!

irods Policy Implementation Microservices and Rules Microservice the functional unit of work (C programs) Rules workflows of microservices (and rules) Provide server-side (data-side) services Event-triggered rule execution 22!

Microservices Modular New libraries of microservices can be developed without touching the core code Allow for customization and extension of data grid for community-specific policy 23!

Asdfadsfasd afsdfasdfsdf 24!

Server Functions Peer-to-peer architecture icat-enabled server (IES) interacts with remote metadata catalog requests forwarded to server where data reside Translate from client action to storage protocol Authenticate and authorize requested operations Implement/enforce policies from local rule engine Manage execution of processes at the storage location

Highly Controlled Environment All accesses are authenticated GSI / Kerberos / Challenge-response / Shibboleth All operations are authorized ACLs on files, storage Constraints on each rule (only authorized users can run them) Local rule base controls interactions with local storage

Some Management Capabilities Replication Registration of files into the data grid Synchronization of remote directory Managed file transport (idrop) Automated metadata extraction Queries on metadata, tags Server-side workflows (loop over result sets) Parallel I/O streams & RBUDP (Reliable Blast UDP) transport

irods Extensible Infrastructure Some extensions handled by user groups Clients Policies Procedures irods (extensible) core infrastructure: Network transport Authentication/Authorization Distributed storage access Metadata management Messaging Rule engine 28!

An irods Overview 29!

irods Version 3.0 Released autumn 2011 New features New rule engine Improved, cleaner syntax Strong parameter typing Optimized performance (can run thousands of rules) Expanded math operators Extended rule language Rule versioning Rule debugging Distributed rule-based management Symbolic links to external systems (based on microservices) Restart for transport of large files Improved Java interface: Jargon Pending features Windows native client port, Windows server support Dropbox interface (idrop)

irods Books irods microservices book - available from Amazon Describes new rule language Lists examples for using the current microservices Input / output parameters defined Executable rule provided for each micro-service Testing of micro-service execution or Illustration of use with other micro-services or Demonstration of use in a management policy irods Primer also available from Amazon

Applications Astronomy NOAO, CyberSKA, LSST High Energy Physics BaBar, KEK Earth Systems NASA (MODIS data set) Australian Research Collaboration Service (ARCS) BeSTGRID coordinating New Zealand research organizations (public and private) Genomics Broad Institute (MIT, Harvard), Sanger Institute (UK) Carolina Digital Repository Texas Digital Libraries Seismology Southern California Earthquake Center Bibliothèque Nationale de France

irods User Meeting March 1, 2, 2012 Tucson, Arizona Organized by the DICE group, hosted by the iplant consortium National and international participation by many in the irods user community 33!

Enterprise irods RENCI s support for long-term irods sustainability Target new funding models move beyond traditional public research funds Beta Release due out in March 2012 RedHat Fedora model: Community code (DICE) releases every 4 months Enterprise code (RENCI) releases every 18 months Product lines, service agreements based on irods and E-iRODS 34!

Enterprise irods 3.0 Beta Initial release based on irods 3.0 - Tracks community code, with a delay Hardened binary release of irods - Passes continuous integration with back-ported bug fixes from community trunk - Packaging: initially RPM and DEB Certification Documentation Subscription Support Contracts 35!

Certification 100% test coverage of server-side APIs across selected platforms and topologies: n-way testing across all combinations Certified Platforms (initially) CentOS 5.7 Ubuntu 11.04 Mac OSX 10.5.8 Topologies Single zone: icat server + 2 non-icat servers Federation: two single zones 36!

Proposed Support Packages Options for Negotiation Tutorial packages User and admin tutorials On-site hands-on or web conferencing Technology preview packages - Helpdesk response to usage problems: irods, E-iRODS Production support packages - Bug fixes and problem closure for E-iRODS supported components on supported platforms Development support packages - Community or proprietary feature development 37!

RENCI irods Automated Testing Track testing by a functional Testing Matrix dimensions represent variables which affect the behavior of data grid: Platform, Configuration, Database, System Features Automate an exhaustive walk of the Testing Matrix Libraries of test scripts for System Features Regular testing driven by Hudson via a Celery Framework which runs on a Virtual Distributed testing grid Also useful for non-functional testing: Load Testing, Scalability 38!

RENCI Collaborative Development and Test Environment Git distributed revision control system GForge project management system hosting & version control bug-tracking messaging Hudson/Jenkins Continuous Integration environment: incremental quality control Nexus Maven repository that tracks dependencies and bundles for check-out (Java) 39!

Continuous Integration Automated via Hudson (moving to Jenkins) A risk reduction technique Push code frequently to the repository Build & test for each new commit in order to catch defects as early as possible Automated CI removes a level of burden from developers and provides constant insight to the state of the project 40!

Open Source Software for the Test Environment git Developed at RENCI: python gridbundle - celery - schema.json - nose - validator.js erlang deploy_gridbundle.py - rabbitmq javascript asserticmd/ asserticmdfail - node.js bash 41!

Code Hardening Defensive Programming anticipate errors and design to avoid them or identify them immediately when they occur Leverage Static Analysis Tools audit code regularly for common programming errors, potential vulnerabilities, and security concerns; enhance code coverage Peer review of code to catch errors missed by static analysis as well as potential design issues and opportunities for refactoring Refactor code, leveraging industry best practices for security, extensibility, & maintainability 42!

Documentation & Support Create irods Server installation packages for supported platforms: RPM, DEB, MSI, etc Polish the installation procedure & scripts for irods Add support for remote icat configuration, automated MySQL installation, etc. Work towards a comprehensive offline Administration Guide Begin a Configuration Cookbook: rule configurations for well known use cases 43!