INTEGRATED RULE ORIENTED DATA SYSTEM (iRODS)
Todd BenDor, Associate Professor, Dept. of City and Regional Planning, UNC-Chapel Hill
bendor@unc.edu
http://irods.org/
SESYNC Model Integration Workshop
Important note: Most of this is blatantly ripped off from:
- iRODS official documentation
- iRODS executive summary/introduction
- Reagan Moore, UNC SILS
- Adil Hasan, Liverpool
Punchline
- Model integration and flexibility need sophisticated tools for collecting, curating, and updating databases.
- What happens when our model integration dreams come true?
- Maintaining data provenance and facilitating loose/tight coupling are key.
Overview
- What is iRODS?
- Who Uses iRODS?
- What Can iRODS Do?
What is iRODS?
Integrated Rule-Oriented Data System. iRODS is open source data grid middleware that implements:
- Data Virtualization
- Automation of Data Operations
- A Robust Metadata Catalog
- Data Management Policy Enforcement and Compliance Verification
iRODS?
- Based on considerable experience with the Storage Resource Broker (SRB), developed by the Data Intensive Cyber Environments (DICE) group at UNC, UCSD, and the San Diego Supercomputer Center.
- Many groups were found to use SRB to store large quantities of data.
- A lot of server-side post-processing of the data is required (e.g. replicating files, converting them to a different format, checksumming).
- Almost all management is policy driven.
Emergence of iRODS
SRB experience motivated the requirements for a new data management system:
- Contain all SRB functionality
- Add workflow to manage server-side post-processing (model coupling potential grows)
- Configurable: only include the 'services' you need
- Open source: the SRB license imposed severe restrictions on the academic community
Claims to fame
- An infinitely configurable data janitor
- iRODS is the kind of technology you need to host everyone's unstructured data.
- A powerful data migration tool
- A data preservation technology
- A tool for providing fine-grained privacy and security controls
- Can be seen as a basis for a Digital Repository/Archive
- A Digital Repository/Archive is a Policy Driven System
But wait, there's more!
- Extensible: iRODS has command-line clients, APIs for numerous programming languages, and web clients.
- Data is accessed using familiar APIs.
- Supports new plug-ins for storage resources, authentication mechanisms, microservices, and network protection.
- iRODS lets system administrators roll out an extensible data grid without changing their infrastructure.
- OPEN SOURCE
How will models interface with large amounts of heterogeneous data?
[Diagram labels: Microservices, Authentication Mechanisms, Integrated Custom Client, Command Line Client, Familiar APIs, Storage Resources, Network Services, Web Client, Coordinating Resources, Databases]
iRODS is Middleware
mid·dle·ware /ˈmidlˌwe(ə)r/ noun: software that acts as a bridge between an operating system or database and applications, especially on a network.
[Diagram labels: Applications/Users, iRODS Data Grid, Operating System Filesystem, (Heterogeneous) Storage Systems]
iRODS Middleware Layer
- abstracts out the low-level I/O
- provides a uniform interface to heterogeneous storage systems
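The "uniform interface" idea can be illustrated with a minimal Python sketch. It does not use any iRODS API: `Storage`, `MemoryStorage`, `DirectoryStorage`, and `round_trip` are hypothetical names invented here, and the two backends simply stand in for heterogeneous storage systems.

```python
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Dict


class Storage(ABC):
    """Uniform interface seen by applications (hypothetical, not an iRODS class)."""

    @abstractmethod
    def put(self, name: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, name: str) -> bytes: ...


class MemoryStorage(Storage):
    """Backend 1: objects held in a dictionary."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, name: str, data: bytes) -> None:
        self._objects[name] = data

    def get(self, name: str) -> bytes:
        return self._objects[name]


class DirectoryStorage(Storage):
    """Backend 2: objects stored as files under a root directory."""

    def __init__(self, root: Path) -> None:
        self._root = root

    def put(self, name: str, data: bytes) -> None:
        (self._root / name).write_bytes(data)

    def get(self, name: str) -> bytes:
        return (self._root / name).read_bytes()


def round_trip(storage: Storage, name: str, data: bytes) -> bytes:
    """Application code: identical regardless of which backend is behind it."""
    storage.put(name, data)
    return storage.get(name)


with tempfile.TemporaryDirectory() as tmp:
    backends = [MemoryStorage(), DirectoryStorage(Path(tmp))]
    results = [round_trip(b, "parcels.csv", b"id,area\n1,42\n") for b in backends]

print(results[0] == results[1])
```

The application function never touches low-level I/O; swapping a backend changes nothing on the caller's side, which is the role the slide assigns to the middleware layer.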
Data Virtualization across Grids
- Administrators control how the grid is presented to users.
- Replication, load-distribution, and archiving policies can be implemented so that they are completely transparent to the user.
- Independent grids can be federated with one another to allow controlled access to remote grids or grids operated by separate workgroups.
[Diagram: iRODS Data Grid Federation; iRODS Data Grid sites: Boston, Chicago, RTP System 1, RTP System 2, London]
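As a rough illustration of replication that stays transparent to the user, the sketch below keeps a logical path separate from the resources that hold its replicas. `VirtualGrid` and its methods are hypothetical and are not part of iRODS; the resource names are borrowed from the diagram and the file path is made up.

```python
from typing import Dict, List


class VirtualGrid:
    """Toy logical namespace over named storage resources (hypothetical)."""

    def __init__(self, resources: List[str], replicate_to: List[str]) -> None:
        self._stores: Dict[str, Dict[str, bytes]] = {name: {} for name in resources}
        self._replicate_to = replicate_to  # administrator-defined policy

    def put(self, logical_path: str, data: bytes, resource: str) -> None:
        self._stores[resource][logical_path] = data
        # Replication policy runs without the user asking for it.
        for target in self._replicate_to:
            self._stores[target][logical_path] = data

    def get(self, logical_path: str) -> bytes:
        # The user names only the logical path; any replica will do.
        for store in self._stores.values():
            if logical_path in store:
                return store[logical_path]
        raise KeyError(logical_path)

    def replicas(self, logical_path: str) -> List[str]:
        return [name for name, store in self._stores.items() if logical_path in store]

    def take_offline(self, resource: str) -> None:
        """Simulate losing a resource by dropping everything it holds."""
        self._stores[resource].clear()


grid = VirtualGrid(["Boston", "Chicago", "London"], replicate_to=["London"])
grid.put("/grid/landuse.tif", b"raster bytes", resource="Boston")
print(grid.replicas("/grid/landuse.tif"))

grid.take_offline("Boston")
print(grid.get("/grid/landuse.tif"))  # still served, now from the London replica
```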
Automation of Data Operations
With iRODS, any agent can initiate any action upon any trigger. This powerful capability allows administrators to automate policies such as:
- Validating checksums every time a new file is placed in a folder.
- Backing up a set of files every second Thursday.
- Archiving data that hasn't been accessed in over 1 month.
- Logging each time a file is replicated or destroyed.
- Permitting a file to be accessed by multiple independently defined user groups.
These operations can be distributed to the storage resource or client.
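The trigger/action pattern behind two of the policies above (checksum validation on upload, logging on replication) can be sketched in plain Python. This is not the iRODS rule engine: `PolicyEngine`, the trigger names `"put"` and `"replicate"`, and the event fields are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict
from typing import Any, Callable, Dict, List

# Hypothetical names throughout; none of this is an iRODS API.
Event = Dict[str, Any]
Action = Callable[[Event], None]


class PolicyEngine:
    """Maps a trigger name to the list of actions it fires."""

    def __init__(self) -> None:
        self._policies: Dict[str, List[Action]] = defaultdict(list)
        self.audit_log: List[str] = []

    def on(self, trigger: str, action: Action) -> None:
        self._policies[trigger].append(action)

    def fire(self, trigger: str, event: Event) -> None:
        for action in self._policies[trigger]:
            action(event)


engine = PolicyEngine()


def validate_checksum(event: Event) -> None:
    """Policy: verify the checksum every time a new file arrives."""
    actual = hashlib.md5(event["data"]).hexdigest()
    status = "ok" if actual == event["expected_md5"] else "MISMATCH"
    engine.audit_log.append(f"checksum {status}: {event['path']}")


def log_replication(event: Event) -> None:
    """Policy: log each time a file is replicated."""
    engine.audit_log.append(f"replicated {event['path']} to {event['target']}")


engine.on("put", validate_checksum)
engine.on("replicate", log_replication)

payload = b"model output"
engine.fire("put", {"path": "/zone/home/run1.csv", "data": payload,
                    "expected_md5": hashlib.md5(payload).hexdigest()})
engine.fire("replicate", {"path": "/zone/home/run1.csv", "target": "archive"})
print("\n".join(engine.audit_log))
```

The audit log doubles as the kind of trail that could later be checked for compliance; scheduled policies (such as the Thursday backup) would need a timer-based trigger, which this sketch leaves out.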
Policy Enforcement and Compliance Verifications
- Metadata + automation = infrastructure to enforce mandated data management policies
- Records retention and privacy protection requirements
- Audit trails generated by iRODS can be used to verify compliance with policy.
Who Uses iRODS?
- iPlant: 15,000 users on an iRODS data grid with 100 million files (http://www.iplantcollaborative.org/)
- IN2P3 (French National Institute of Nuclear and Particle Physics): over 6 PB of data managed by iRODS
- Sanger Institute (Genome Research): 20+ PB of iRODS data
- NASA Center for Climate Simulations: 300 million metadata attributes
- CineGRID (exchanging digital media): sites distributed across Japan, the US, and Europe
What Can iRODS Do?
iRODS simplifies data grid management
BnF (2008)
- Data on different storage devices at different locations can be centrally managed.
- In situ migration to new hardware can be managed by replicating the legacy resource before repurposing or decommissioning it.
- Backup and archiving are transparent and highly configurable using the automation and metadata capabilities in iRODS.
What Can iRODS Do?
iRODS simplifies data discovery, data validation, and data processing.
[Figure: example metadata attributes attached to stored data: library, total_reads, type, lane, is_paired_read, study_accession_number, library_id, sample_accession_number, sample_public_name, manual_qc, tag, sample_common_name, md5, tag_index, study_title, study_id, reference, sample, target, sample_id, id_run, study, alignment]
- User-defined and intrinsic metadata make stored data searchable.
- Validation and analytical tools can be automated to process incoming data.
- The results and process steps can be stored in the iCAT metadata catalog.
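To make "metadata make stored data searchable" concrete, here is a minimal sketch of attribute-based lookup. The dictionary is a hypothetical in-memory stand-in, not the iCAT schema or an iRODS query; the attribute names come from the slide, while the paths and values are made up.

```python
from typing import Dict, List

# Hypothetical catalog: data object path -> {attribute: value}.
catalog: Dict[str, Dict[str, str]] = {
    "/zone/seq/run1_lane1.bam": {"study_id": "S1", "lane": "1",
                                 "is_paired_read": "1", "manual_qc": "1"},
    "/zone/seq/run1_lane2.bam": {"study_id": "S1", "lane": "2",
                                 "is_paired_read": "1", "manual_qc": "0"},
    "/zone/seq/run2_lane1.bam": {"study_id": "S2", "lane": "1",
                                 "is_paired_read": "0", "manual_qc": "1"},
}


def find(catalog: Dict[str, Dict[str, str]], **criteria: str) -> List[str]:
    """Return paths whose metadata match every attribute=value criterion."""
    return sorted(
        path
        for path, meta in catalog.items()
        if all(meta.get(attr) == value for attr, value in criteria.items())
    )


# Files from study S1 that passed manual quality control.
print(find(catalog, study_id="S1", manual_qc="1"))

# Everything sequenced on lane 1, regardless of study.
print(find(catalog, lane="1"))
```

A model that needs inputs could then ask for data by what it is (study, lane, QC status) rather than by where it is stored.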
What Can iRODS Do?
- Distributed Data Management
- Data Preservation
- Digital Archives
- Data at scale
- Data Maintenance
- Automated Data Processing
- Data Curation
- Digital Libraries
- Large number of users
- Complex management tasks
- Critical policy enforcement
- Data Virtualization
- Data Protection and Security
- Data Sharing and Access
- Policy Enforcement
Where iRODS fits?
- Client interacts with the digital repository to access data.
- iRODS spans the digital repository and storage domains.
[Diagram labels: Access, Digital Repository, Storage]
Rules
- Policies are implemented in iRODS as rules.
- A rule is a series of logically connected steps.
- Each step is realised as a micro-service.
- iRODS rules are fully featured: they can contain loops and branches, and rules can be contained within rules.
- iRODS rules are read from a rule file (called core.irb by default in early releases; core.re in later ones).
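The structure described above (a rule as a chain of micro-services, with a branch, a loop, and a rule nested in a rule) can be sketched in Python. This is not the iRODS rule language: plain functions stand in for micro-services, and every name here (`run_rule`, `checksum`, `replicate`, `ingest_rule`, the state keys) is hypothetical.

```python
import hashlib
from typing import Any, Callable, Dict, List

State = Dict[str, Any]
MicroService = Callable[[State], State]


def checksum(state: State) -> State:
    """Micro-service: compute the MD5 of the incoming data."""
    state["md5"] = hashlib.md5(state["data"]).hexdigest()
    return state


def replicate(state: State) -> State:
    """Micro-service: record a replica on the current target resource."""
    state["replicas"] = state.get("replicas", []) + [state["target"]]
    return state


def run_rule(steps: List[MicroService], state: State) -> State:
    """A rule is a series of steps applied in order."""
    for step in steps:
        state = step(state)
    return state


def replicate_everywhere(state: State) -> State:
    """Nested rule containing a loop: replicate to every listed resource."""
    for target in state["resources"]:
        state["target"] = target
        state = run_rule([replicate], state)
    return state


def ingest_rule(state: State) -> State:
    """Top-level rule: checksum, then branch on the result."""
    state = run_rule([checksum], state)
    if state["md5"] == state["expected_md5"]:
        state = replicate_everywhere(state)  # rule within a rule
        state["status"] = "stored"
    else:
        state["status"] = "rejected"
    return state


data = b"scenario A results"
result = ingest_rule({"data": data,
                      "expected_md5": hashlib.md5(data).hexdigest(),
                      "resources": ["disk", "tape"]})
print(result["status"], result["replicas"])
```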
Questions/ideas?
Todd BenDor, Department of City and Regional Planning
bendor@unc.edu
http://todd.bendor.org