Scalable Services for Digital Preservation: A Perspective on Cloud Computing
Rainer Schmidt, Christian Sadilek, and Ross King
Digital Preservation (DP)
- Providing long-term access to growing collections of digital assets.
- Not just a question of storage; not just a question of files.
- Software preservation rather than hardware preservation: prevent objects from becoming uninterpretable bit-streams.
- Requires the establishment of research infrastructures and networks; not an out-of-the-box solution.
- A number of large EU projects and initiatives are dealing with the implications of digital preservation (FP6/FP7: Planets, CASPAR, DPE, SHAMAN).
Planets: Permanent Long-term Access through NETworked Services
- Addresses the problem of digital preservation, driven by national libraries and archives.
- Project instrument: FP6 Integrated Project, 5th IST Call.
- Consortium: 16 organisations from 7 countries.
- Duration: 48 months, June 2006 to May 2010.
- Budget: 14 million euro.
- http://www.planets-project.eu/
Why does DP need HPC resources?
- Digital object management systems, repositories, and archives are designed for storing and providing access to large amounts of data, with many different data streams and metadata models.
- They are not designed to support continuous data modification; the focus is on storage, with no support for HPC.
- Digital preservation is an e-science problem: processing vast amounts of complex data (e.g. analyse, migrate), experiments in distributed and heterogeneous environments, crossing administrative and institutional boundaries.
Towards Grid and Cloud Computing
- The Planets preservation architecture is based on services; it supports interoperability and a distributed environment.
- Sufficient for controlled experiments (Testbed), but not sufficient for a production environment: massive, uncontrolled user requests; mass migration of hundreds of terabytes of data.
- Content holders are faced with losing vast amounts of data and do not hold sufficient computational resources in-house.
- There is a clear requirement for incorporating HPC facilities -> Grid and Cloud Computing.
Tiered Architecture
[Architecture diagram: a web portal with browser clients provides preservation planning, workflow generation, and a workbench for Testbed experiments; a Planets instance maintains a service registry, workflow definitions, a data and metadata registry, a tools and format registry, the data model, and an execution engine for reference lookup; execution is delegated via Web/Grid services to Planets application services, 3rd-party tool services, and Grid/Cloud execution services. The experimental environment yields qualitative results, the production environment quantitative results.]
Integrating Clouds and Virtual Clusters
- Basic idea: extending the Planets SOA with Grid services.
- The Planets IF Job Submission Services (JSS) allow job submission to a PC cluster (e.g. Hadoop, Condor) via standard Grid protocols/interfaces (SOAP, HPC-BP, JSDL).
- Cluster nodes are instantiated from pre-configured system images; most preservation tools are 3rd-party applications, so the software needs to be preinstalled on the cluster nodes.
- Cluster and JSS can be instantiated in-house (e.g. on a PC lab) or on top of (leased) cloud resources (AWS EC2); computation can be moved to the data or vice versa. A minimal client sketch follows below.
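The deck does not show client code, so the following is only a sketch of how a JSDL document (such as the example at the end of this deck) might be posted to the JSS over SOAP using the generic JAX-WS Dispatch API. The endpoint URL, namespace, service/port names, and the job file name are all illustrative assumptions, not the actual Planets IF interface.

import java.io.File;
import javax.xml.namespace.QName;
import javax.xml.transform.Source;
import javax.xml.transform.stream.StreamSource;
import javax.xml.ws.Dispatch;
import javax.xml.ws.Service;
import javax.xml.ws.soap.SOAPBinding;

public class SubmitJob {
    public static void main(String[] args) {
        // Hypothetical endpoint and names; real values depend on the
        // deployed Planets IF Job Submission Service.
        String endpoint = "http://example.org/planets-if/JobSubmissionService";
        QName serviceName = new QName("http://example.org/jss", "JobSubmissionService");
        QName portName = new QName("http://example.org/jss", "JobSubmissionPort");

        Service service = Service.create(serviceName);
        service.addPort(portName, SOAPBinding.SOAP11HTTP_BINDING, endpoint);

        // Dispatch sends the JSDL document as a raw SOAP payload,
        // in the spirit of HPC-BP job submission.
        Dispatch<Source> dispatch =
                service.createDispatch(portName, Source.class, Service.Mode.PAYLOAD);
        Source request = new StreamSource(new File("job.jsdl"));
        Source response = dispatch.invoke(request);
        System.out.println("Job submitted, response received: " + response);
    }
}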
Experimental Architecture
[Architecture diagram: a JSDL job description file is submitted to the JSS over HPCBP; the JSS drives a virtual cluster (Apache Hadoop) of virtual nodes (Xen) running on cloud infrastructure (EC2); raw data is staged from the storage infrastructure (S3) by a data transfer service.]
Mass Migration of Digital Objects
- MapReduce implements a framework and programming model for processing large documents (sorting, searching, indexing) on multiple nodes: automated decomposition (split), mapping to intermediary pairs (map), optionally (combine), and merging of output (reduce).
- Provides an implementation for data-parallel, I/O-intensive problems.
- Example: conversion of a digital object (e.g. website, folder, archive). Decompose it into atomic pieces (e.g. file, image, movie); on each node, convert each piece to the target format; merge the pieces and create the new data object. A sketch of such a job follows below.
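A minimal sketch of such a migration job, written against the old org.apache.hadoop.mapred API to match the Hadoop 0.18 setup described later. The input format (one "object-id TAB piece-path" record per line), the ps2pdf invocation, and all class names are illustrative assumptions, not the project's actual code.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class MigrateJob {
    // Map: each record names one piece of a decomposed digital object;
    // shell out to a migration tool preinstalled on every node
    // (here, hypothetically, ps2pdf) and emit the converted piece.
    public static class MigrateMapper extends MapReduceBase
            implements Mapper<Text, Text, Text, Text> {
        public void map(Text objectId, Text piecePath,
                        OutputCollector<Text, Text> out, Reporter reporter)
                throws IOException {
            String src = piecePath.toString();
            String dst = src + ".pdf";
            Process p = Runtime.getRuntime().exec(new String[]{"ps2pdf", src, dst});
            try {
                p.waitFor();
            } catch (InterruptedException e) {
                throw new IOException(e.toString());
            }
            out.collect(objectId, new Text(dst));
        }
    }

    // Reduce: merge the converted pieces of one object into a single
    // record describing the new digital object.
    public static class MergeReducer extends MapReduceBase
            implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text objectId, Iterator<Text> pieces,
                           OutputCollector<Text, Text> out, Reporter reporter)
                throws IOException {
            StringBuilder merged = new StringBuilder();
            while (pieces.hasNext()) {
                merged.append(pieces.next().toString());
                if (pieces.hasNext()) merged.append(',');
            }
            out.collect(objectId, new Text(merged.toString()));
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(MigrateJob.class);
        conf.setJobName("object-migration");
        conf.setInputFormat(KeyValueTextInputFormat.class); // key TAB value
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        conf.setMapperClass(MigrateMapper.class);
        conf.setReducerClass(MergeReducer.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}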
Experimental Results: Setup
- Amazon Elastic Compute Cloud (EC2): 1 to 150 cluster nodes; custom image based on Fedora 8 i386.
- Amazon Simple Storage Service (S3): max. 1 TB I/O; ~32.5 MB/s download, ~13.8 MB/s upload (cloud-internally).
- Apache Hadoop (v0.18) MapReduce implementation.
- Preinstalled command-line tools (e.g. ps2pdf).
Experimental Results 1: Scaling Job Size
[Plot: execution time in minutes (0 to 10) over the number of tasks (1 to 1000), with the number of nodes fixed at 5, comparing EC2 against single local execution (SLE) for payload sizes of 0.07 MB, 7.5 MB, and 250 MB. With speedup defined as x(1k) = t_seq / t_parallel at 1000 tasks, the measured values are x(1k) = 3.5 (0.07 MB), 4.4 (7.5 MB), and 3.6 (250 MB).]
Experimental Results 2: Scaling the Number of Nodes
[Plot: execution time in minutes over the number of nodes (0 to 150) for 1000 tasks of 0.07 MB, EC2 vs. single local execution (SLE). Annotated data points (n = nodes, t = time in minutes, s = speedup, e = efficiency):]
  n = 1 (local SLE): t = 26
  n = 1:   t = 36,   s = 1,   e = 100%
  n = 5:   t = 4.5,  s = 4.5, e = 90%
  n = 10:  t = 4.5,  s = 7.6, e = 75%
  n = 50:  t = 1.68, s = 21,  e = 43%
  n = 100: t = 1.03, s = 35,  e = 35%
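For reference, the speedup and efficiency annotations in the plot follow the usual definitions (relative to the single-node EC2 run, t(1) = 36 min); a worked instance for the 100-node run:

\[
s(n) = \frac{t(1)}{t(n)}, \qquad e(n) = \frac{s(n)}{n}
\]
\[
s(100) = \frac{36\ \text{min}}{1.03\ \text{min}} \approx 35, \qquad
e(100) = \frac{35}{100} = 35\%
\]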
Conclusion
- Preservation systems need to employ HPC resources, but content holders and data repository systems are not yet ready to utilize computational Grids.
- There is a need to bridge the research communities in the areas of digital preservation and e-science.
- Cloud computing provides a powerful solution for getting on-demand access to appropriate HPC resources; many integration issues remain: security, legal aspects, reliability, standardization.
- The Planets IF Job Submission Service is a first step: submission to a virtual cluster of DP nodes based on Grid protocols/interfaces.
Fin
The Planets Service Framework
- Defines a Service-Oriented Architecture for digital preservation: a set of preservation services, interfaces, and a common data model.
- Implements common services: authentication and authorization, monitoring, logging, notification, service registration and lookup.
- Provides a workflow enactment service and engine: component-based, with XML serialization.
- APIs for applications that use Planets: Testbed experiments, executing preservation plans.
<?xml version="1.0" encoding="UTF-8"?>
<jsdl:JobDefinition xmlns="http://www.example.org/"
    xmlns:jsdl="http://schemas.ggf.org/jsdl/2005/11/jsdl"
    xmlns:jsdl-posix="http://schemas.ggf.org/jsdl/2005/11/jsdl-posix"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://schemas.ggf.org/jsdl/2005/11/jsdl jsdl.xsd">
  <jsdl:JobDescription>
    <jsdl:JobIdentification>
      <jsdl:JobName>start vi</jsdl:JobName>
    </jsdl:JobIdentification>
    <jsdl:Application>
      <jsdl:ApplicationName>ls</jsdl:ApplicationName>
      <jsdl-posix:POSIXApplication>
        <jsdl-posix:Executable>/bin/ls</jsdl-posix:Executable>
        <jsdl-posix:Argument>-la file.txt</jsdl-posix:Argument>
        <jsdl-posix:Environment name="LD_LIBRARY_PATH">/usr/local/lib</jsdl-posix:Environment>
        <jsdl-posix:Input>/dev/null</jsdl-posix:Input>
        <jsdl-posix:Output>stdout.${JOB_ID}</jsdl-posix:Output>
        <jsdl-posix:Error>stderr.${JOB_ID}</jsdl-posix:Error>
      </jsdl-posix:POSIXApplication>
    </jsdl:Application>
  </jsdl:JobDescription>
</jsdl:JobDefinition>