Data grid storage for digital libraries and archives using iRODS

Mark Hedges, Centre for e-Research, King's College London
eResearch Australasia, Melbourne, 30th Sept. 2008

Background: Project History
  Data grid project at AHDS and STFC ended at demise of AHDS
  Used SRB (Storage Resource Broker)
  AHDS Executive -> Centre for e-Research at King's College London
  Centre incorporates staff and expertise of AHDS and other groups
  Continuity, but some change of focus
  New data grid project (using iRODS)

Background: Data Challenge
  Subject areas: History, Visual Arts, Performing Arts, Archaeology, Literature/Linguistics
  Ongoing growth of corpora due to major digitisation projects
  Highly diverse in type and size: images, text, music, video, database, multi-media
  Require specialised knowledge
  Highly complex, contextual, fuzzy, uncertain, inconsistent, incomplete
  Rapid expansion: AHDS data size increased 20-fold between 2005 and 2008
  Increasing number of large objects (e.g. video, archaeology scans)

Data Grids
  Storage Resource Broker (SRB): a widely used data grid technology developed by the San Diego Supercomputer Center
  Addresses storage issues for digital repository and preservation environments
  Provides uniform, searchable access to virtualised, distributed resources, so the digital library (DL) is insulated from:
    physical location of data
    types of storage
    migration to new hardware
  Scalable: as the library grows, new resources can be added dynamically
  Auditing facilities

SRB Storage
  [Figure: SRB storage. Elements shown: client application (e.g. digital repository); client request / client response; disseminator; disseminator impl (web service); get (entire object retrieved); SRB storage holding object1, object2, object3 and datastream1, datastream2, datastream3; distributed / virtualised resources.]

Issues
  Not open source
  Very effective for storage management, but not integrated with wider infrastructure
  Not easy to integrate application-specific requirements (either change the core code, implement in the client, or use proxy commands); some examples in later slides
  No built-in implementation of workflow (this has to be scripted outside SRB, whether server or client side), or of asynchronous processing
  Requires choreography between the SRB admin and the person running the workflow
  Relatively restricted support for metadata extension

iRODS
  The open source successor to SRB
  Provides similar data virtualisation
  Rule-Oriented Data management System
  Rule Engine allows data management policies to be defined and realised as rules
  Rules are sets of operations that you want to impose on an object (e.g. file, user, resource, ...)
  Rules allow virtualisation of policies: the digital library is insulated from how these policies are implemented

What are rules?
  Rules are built up cumulatively from atomic operations called micro-services
  Micro-services and rules can be added and modified to meet local needs
  Triggered by certain events: event-condition-action model
  Great potential to hide processing from the application layer
  Create server-side workflows
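The event-condition-action model can be illustrated with a toy sketch. This is not the iRODS rule engine or its API: the class names, the "post_put" event name and the two actions are all hypothetical, chosen only to show how an event plus a condition selects a chain of small operations.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical toy model of event-condition-action; not the iRODS API.
Context = Dict[str, str]


@dataclass
class Rule:
    event: str                                # event that triggers the rule
    condition: Callable[[Context], bool]      # must hold for the rule to run
    actions: List[Callable[[Context], None]]  # stand-ins for micro-services


class ToyRuleEngine:
    def __init__(self) -> None:
        self.rules: List[Rule] = []

    def register(self, rule: Rule) -> None:
        self.rules.append(rule)

    def fire(self, event: str, ctx: Context) -> int:
        """Run each rule registered for `event` whose condition holds.

        Returns the number of rules that ran.
        """
        ran = 0
        for rule in self.rules:
            if rule.event == event and rule.condition(ctx):
                for action in rule.actions:
                    action(ctx)
                ran += 1
        return ran


def record_checksum(ctx: Context) -> None:
    ctx["checksum"] = "recorded"


def replicate(ctx: Context) -> None:
    ctx["replicas"] = "2"


engine = ToyRuleEngine()
engine.register(
    Rule(
        event="post_put",
        condition=lambda c: c.get("format") == "application/msword",
        actions=[record_checksum, replicate],
    )
)
```

Calling engine.fire("post_put", {"format": "application/msword"}) runs both actions, while a context with a different format runs none. The client that stored the object never sees which actions were applied, which is the sense in which processing is hidden from the application layer.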

Definition of rules
  The components of a rule definition are as follows:
    actiondef | condition | workflowchain | recoverychain
  Where:
    actiondef identifies the action to be carried out
    condition is a necessary condition for execution
    workflowchain is the sequence of actions to be executed
    recoverychain is the corresponding sequence of recovery actions (to ensure a consistent state)
  Rules can be built up cumulatively from other rules
  Data is passed into and within rules (via parameters/context)

Examples of rule use
  Some examples of using rules:
    Digital preservation
    Processing digital material on ingest
    Fedora disseminators -> rules/microservices
    Shibboleth integration
    Integration with provenance systems
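A minimal sketch of how the four components might be pulled apart. It assumes that a rule is written on one line with the components separated by "|" and the steps of each chain separated by "##"; the "|" separator is an assumption here, and the rule text in the usage example is made up. The parser is illustrative only and is not part of iRODS.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RuleDefinition:
    action: str                # actiondef: the action being defined
    condition: str             # condition: empty string means "always"
    workflow_chain: List[str]  # actions to execute, in order
    recovery_chain: List[str]  # matching recovery actions


def split_chain(chain: str) -> List[str]:
    """Split a '##'-separated chain into its individual steps."""
    return [step.strip() for step in chain.split("##") if step.strip()]


def parse_rule(line: str) -> RuleDefinition:
    """Parse 'actiondef|condition|workflowchain|recoverychain'.

    Assumption: '|' never appears inside a component, so a condition
    that itself contains '|' is not handled by this sketch.
    """
    parts = [part.strip() for part in line.split("|")]
    if len(parts) != 4:
        raise ValueError(f"expected 4 components, got {len(parts)}")
    action, condition, workflow, recovery = parts
    return RuleDefinition(
        action=action,
        condition=condition,
        workflow_chain=split_chain(workflow),
        recovery_chain=split_chain(recovery),
    )


# Hypothetical rule text, for illustration only.
example = parse_rule("acexample|$size > 0|stepone##steptwo|nop##undosteptwo")
```

Here example.workflow_chain holds the two steps in order and example.recovery_chain holds one recovery action per step, which makes the pairing between the two chains explicit.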

Example: digital preservation
  Execution triggered when an object has been ingested

  acpostprocforput||accheckobjectintegrity##acanalyseobject##acnormaliseobject##msisysrepldataobj(presrescgrp,all)|nop##nop##nop##msicleanupreplicas

Example: rule processing text objects on ingest
  Processing depends on type of object

  acpostprocforput|$format == "application/msword" && $objectcategory == "textcategoryx"|acvalidateobjfortextprocessingx##acexecutetextprocessingx##acvalidatetextprocessingx|nop##msicleanuptextprocx##nop
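The pairing of each workflow step with a recovery action in the examples above can be sketched as follows. This is a toy model, not iRODS code. It assumes that when a step fails, the recovery action for that step and for every earlier step is run in reverse order; the step functions and the simulated failure are hypothetical.

```python
from typing import Callable, Dict, List, Sequence

Context = Dict[str, List[str]]
Step = Callable[[Context], None]


def nop(ctx: Context) -> None:
    """Recovery placeholder for steps that need no undo."""


def run_with_recovery(
    workflow: Sequence[Step], recovery: Sequence[Step], ctx: Context
) -> bool:
    """Run workflow steps in order; on failure, unwind via recovery steps.

    Assumed semantics: recovery runs for the failed step and all earlier
    steps, last first. Returns True if every step succeeded.
    """
    if len(workflow) != len(recovery):
        raise ValueError("workflow and recovery chains must match in length")
    for index, step in enumerate(workflow):
        try:
            step(ctx)
        except Exception:
            for undo in reversed(recovery[: index + 1]):
                undo(ctx)
            return False
    return True


# Hypothetical steps standing in for micro-services.
def check_integrity(ctx: Context) -> None:
    ctx["log"].append("checked")


def replicate(ctx: Context) -> None:
    ctx["log"].append("replicating")
    raise RuntimeError("simulated storage failure")


def cleanup_replicas(ctx: Context) -> None:
    ctx["log"].append("cleaned up replicas")


ctx: Context = {"log": []}
ok = run_with_recovery(
    [check_integrity, replicate], [nop, cleanup_replicas], ctx
)
```

In this run the second step fails, so its recovery action cleans up and the call returns False. Using nop for steps that change nothing keeps the two chains the same length, which matches the shape of the rules shown above.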

Example: data-side processing
  Fedora retrieves entire objects for processing
  Inefficient, and not always necessary
  Implement processing close to the data
    Fedora disseminators -> iRODS rules
    Client-side workflows -> iRODS rules

iRODS Storage & Rules
  [Figure: iRODS storage and rules. Elements shown: client application (e.g. digital repository); client request / client response; disseminator; iget / irule; iRODS storage layer + rule execution; Rule Engine; rule definition; processing impl; arrows labelled "rule triggers" and "executes"; object1, object2, object3 and datastream1, datastream2, datastream3; distributed / virtualised resources.]

iRODS Access Management: Shibboleth
  [Figure: Shibboleth integration. Elements shown: client; Apache with mod_shib; access request; response; PIP (capture & store attributes); admin attributes; iRODS + RE; Rule; PDP; PEP; three micro-services.]

Provenance & iRODS
  [Figure: provenance integration. Elements shown: client stores data in iRODS; rule causes micro-service to access external system (External Provenance System); iRODS + iCAT + RE (update iCAT file metadata); iRODS + RE (update iCAT file metadata); iRODS system; iRODS + RE (rule engine runs, manipulations recorded); Internal Provenance System.]

Next steps
  More prototyping
  Developing a more comprehensive set of rules for curation and preservation
  Finish Shibboleth & provenance integration
  Dynamic deployment of rules
  Prototypes -> production

Acknowledgements
  Thanks for contributions from:
    Tobias Blanke, King's College London
    Adil Hasan, University of Liverpool
    Jens Jensen, Science & Technology Facilities Council
    Andrea Weise, Science & Technology Facilities Council
  Also, thanks to the JISC, which funded part of the work.

Contacts
  mark.hedges at kcl.ac.uk