Scalable Services for Digital Preservation

Scalable Services for Digital Preservation: A Perspective on Cloud Computing
Rainer Schmidt, Christian Sadilek, and Ross King

Digital Preservation (DP)
- Providing long-term access to growing collections of digital assets.
- Not just a question of storage, and not just a question of files: software preservation rather than hardware preservation.
- Prevent objects from becoming uninterpretable bit-streams.
- Requires the establishment of research infrastructures and networks; not an out-of-the-box solution.
- A number of large EU projects and initiatives are dealing with the implications of digital preservation (FP6/FP7: Planets, CASPAR, DPE, SHAMAN).

Planets: Preservation and Long-term Access through NETworked Services
- Addresses the problem of digital preservation, driven by national libraries and archives.
- Project instrument: FP6 Integrated Project, 5th IST Call.
- Consortium: 16 organisations from 7 countries.
- Duration: 48 months, June 2006 to May 2010.
- Budget: 14 million Euro.
- http://www.planets-project.eu/

Why does DP need HPC resources?
- Digital object management systems, repositories, and archives are designed for storing and providing access to large amounts of data:
  - Many different data streams and metadata models.
  - Not designed to support continuous data modification.
  - Focus on storage; no support for HPC.
- Digital preservation is an e-science problem:
  - Processing vast amounts of complex data (e.g. analyse, migrate),
  - Experiments in distributed and heterogeneous environments,
  - Crossing administrative and institutional boundaries.

Towards Grid and Cloud Computing
- The Planets preservation architecture is based on services.
  - Supports interoperability and a distributed environment.
  - Sufficient for controlled experiments (Testbed).
- Not sufficient for handling a production environment:
  - Massive, uncontrolled user requests,
  - Mass migration of hundreds of terabytes of data.
- Content holders are faced with losing vast amounts of data while not holding sufficient computational resources in-house.
- There is a clear requirement for incorporating HPC facilities -> Grid and Cloud Computing.

Tiered Architecture
[Diagram: a tiered execution architecture. Top tier: preservation planning and workflow generation (workbench for Testbed experiments, web portal, browser clients). Middle tier: a Planets instance with service registry, workflow definitions, data + metadata registry, tools + format registry, common data model, and an execution engine for reference lookup and maintenance. Bottom tier: Planets application services, 3rd-party tool services, and Grid/Cloud execution services, exposed as Web/Grid services over the resources of an experimental environment (qualitative results) or a production environment (quantitative results).]

Integrating Clouds and Virtual Clusters
- Basic idea: extending the Planets SOA with Grid services.
- The Planets IF Job Submission Service (JSS):
  - Allows job submission to a PC cluster (e.g. Hadoop, Condor) via standard Grid protocols/interfaces (SOAP, HPC-BP, JSDL); see the client sketch after this list.
  - Cluster nodes are instantiated from pre-configured system images.
- Most preservation tools are 3rd-party applications, so the software needs to be preinstalled on the cluster nodes.
- Cluster and JSS can be instantiated in-house (e.g. in a PC lab) or on top of (leased) cloud resources (AWS EC2).
- Computation can be moved to the data, or vice versa.
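To make the submission path concrete, the following is a minimal client sketch, assuming a BES/HPCBP-style factory endpoint reachable over plain JAX-WS; the endpoint URL and the service/port names are invented placeholders, while the CreateActivity/ActivityDocument wrapper follows the OGSA-BES factory schema that HPC-BP profiles.

import java.io.StringReader;
import javax.xml.namespace.QName;
import javax.xml.transform.Source;
import javax.xml.transform.stream.StreamSource;
import javax.xml.ws.Dispatch;
import javax.xml.ws.Service;
import javax.xml.ws.soap.SOAPBinding;

public class JssClient {
    // OGSA-BES factory namespace used by HPC Basic Profile endpoints.
    static final String BES = "http://schemas.ggf.org/bes/2006/08/bes-factory";

    public static void main(String[] args) {
        // Service and port names are assumptions; a real deployment would
        // take them from the service's WSDL.
        QName serviceName = new QName(BES, "BESFactoryService");
        QName portName = new QName(BES, "BESFactoryPort");
        Service service = Service.create(serviceName);
        service.addPort(portName, SOAPBinding.SOAP11HTTP_BINDING,
                "https://jss.example.org/hpcbp");   // placeholder endpoint

        Dispatch<Source> dispatch =
                service.createDispatch(portName, Source.class, Service.Mode.PAYLOAD);

        // Wrap a JSDL JobDefinition (see the full example at the end of this
        // talk) in a CreateActivity request and post it.
        String jsdl = "<jsdl:JobDefinition xmlns:jsdl="
                + "\"http://schemas.ggf.org/jsdl/2005/11/jsdl\">...</jsdl:JobDefinition>";
        String request = "<bes:CreateActivity xmlns:bes=\"" + BES + "\">"
                + "<bes:ActivityDocument>" + jsdl + "</bes:ActivityDocument>"
                + "</bes:CreateActivity>";
        // The response payload carries a reference to the created activity.
        Source response = dispatch.invoke(new StreamSource(new StringReader(request)));
        System.out.println("activity created, response payload: " + response);
    }
}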

Experimental Architecture
[Diagram: a Job Submission Service (JSS) accepts a JSDL job description file over HPCBP and dispatches the job to a virtual cluster (Apache Hadoop) of virtual nodes (Xen) running on cloud infrastructure (EC2); raw data moves between the cluster and the storage infrastructure (S3) via a data transfer service.]

Mass Migration of Digital Objects
- MapReduce provides a framework and programming model for processing large documents (sorting, searching, indexing) on multiple nodes:
  - Automated decomposition of the input (split),
  - Mapping to intermediate key/value pairs (map), optionally combining them locally (combine),
  - Merging of the output (reduce).
- Provides an implementation vehicle for data-parallel, I/O-intensive problems.
- Example: conversion of a digital object (e.g. website, folder, archive); a mapper sketch follows this list.
  - Decompose the object into atomic pieces (e.g. file, image, movie).
  - On each node, convert a piece to the target format.
  - Merge the pieces and create the new data object.
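As a sketch of how the conversion step maps onto Hadoop's 0.18-era API (not the authors' actual code): each input record names one decomposed piece, the map task shells out to a preinstalled command-line converter (ps2pdf, mentioned on the setup slide), and emits where the converted piece landed so the reduce step can merge the results. The path layout is a made-up convention.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MigrateMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, Text> out, Reporter reporter)
            throws IOException {
        String src = line.toString().trim();   // e.g. /data/in/page-0042.ps
        String dst = src.replace("/in/", "/out/").replace(".ps", ".pdf");
        // Run the preinstalled converter locally on the cluster node.
        Process p = Runtime.getRuntime().exec(new String[] {"ps2pdf", src, dst});
        try {
            int rc = p.waitFor();
            // Key: source piece; value: result location or an error marker.
            out.collect(new Text(src), new Text(rc == 0 ? dst : "FAILED rc=" + rc));
        } catch (InterruptedException e) {
            throw new IOException("conversion interrupted: " + e);
        }
        reporter.progress();   // signal liveness during long conversions
    }
}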

Experimental Results - Setup
- Amazon Elastic Compute Cloud (EC2): 1 to 150 cluster nodes; custom image based on Fedora 8 i386.
- Amazon Simple Storage Service (S3): max. 1 TB I/O; ~32.5 MB/s download / ~13.8 MB/s upload (cloud-internal).
- Apache Hadoop (v0.18) as the MapReduce implementation; a driver sketch wiring up such a job follows.
- Preinstalled command-line tools (e.g. ps2pdf).
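A matching driver sketch against the old (v0.18) mapred API; the input directory holds the file lists consumed by MigrateMapper above, the default identity reducer merges the per-piece results into one report, and both paths are placeholders:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MigrationJob {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(MigrationJob.class);
        conf.setJobName("mass-migration");
        conf.setMapperClass(MigrateMapper.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(conf, new Path("/data/filelists"));
        FileOutputFormat.setOutputPath(conf, new Path("/data/report"));
        JobClient.runJob(conf);   // blocks until the job finishes
    }
}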

Experimental Results 1 - Scaling Job Size
[Chart: execution time in minutes (0 to 10) over job sizes of 1 to 1000 tasks on 5 EC2 nodes, for task payloads of 0.07 MB, 7.5 MB, and 250 MB, each compared against a local reference execution (SLE). Annotated speedups at 1000 tasks: x(1k) = 3.5, 4.4, and 3.6, where x(1k) = t_seq / t_parallel.]

Experimental Results 2 - Scaling the Number of Nodes
[Chart: execution time in minutes (0 to 40) for a job of 1000 x 0.07 MB tasks over 0 to 150 EC2 nodes, compared against a local reference execution (SLE). Annotated data points as given on the slide: n=1, t=36, s=1, e=100%; n=1 (local), t=26; n=5, t=4.5, s=4.5, e=90%; n=10, t=4.5, s=7.6, e=75%; n=50, t=1.68, s=21, e=43%; n=100, t=1.03, s=35, e=35%.]
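Reading the annotations: t is the runtime in minutes, s the speedup, and e the parallel efficiency. Reconstructed from the slide's earlier definition x = t_seq / t_parallel, and checked against the n = 100 data point:

\[
  s(n) = \frac{t_{\mathrm{seq}}}{t(n)}, \qquad e(n) = \frac{s(n)}{n}
\]
\[
  s(100) = \frac{36}{1.03} \approx 35, \qquad e(100) = \frac{35}{100} = 35\,\%
\]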

Conclusion
- Preservation systems need to employ HPC resources, but content holders and data repository systems are not ready to utilize computational Grids.
- There is a need to bridge the research communities in the areas of digital preservation and e-science.
- Cloud computing provides a powerful solution for getting on-demand access to appropriate HPC resources.
- Many integration issues remain: security, legal aspects, reliability, standardization.
- The Planets IF Job Submission Service is a first step: submission to a virtual cluster of DP nodes based on standard Grid protocols/interfaces.

Fin

The Planets Service Framework
- Defines a Service-Oriented Architecture for digital preservation: a set of preservation services, interfaces, and a common data model (an illustrative sketch follows this list).
- Implements common services: authentication and authorization, monitoring, logging, notification, service registration and lookup.
- Provides a workflow enactment service and engine: component-based, with XML serialization.
- APIs for applications that use Planets: Testbed experiments, executing preservation plans.
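Purely as an illustration of what "a set of preservation services, interfaces, and a common data model" can look like, here is a hypothetical sketch; the type and method names are stand-ins, not the framework's published API:

import java.net.URI;

/** Minimal stand-in for the common data model's digital object. */
interface DigitalObject {
    URI permanentUri();     // stable identifier of the object
    byte[] content();       // the object's bit-stream
}

/** A migration service, one of the preservation service interfaces. */
interface Migrate {
    /** Convert an object from one registered format to another. */
    DigitalObject migrate(DigitalObject source, URI inputFormat, URI outputFormat);
}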

<?xml version="1.0" encoding="UTF-8"?>
<jsdl:JobDefinition xmlns="http://www.example.org/"
    xmlns:jsdl="http://schemas.ggf.org/jsdl/2005/11/jsdl"
    xmlns:jsdl-posix="http://schemas.ggf.org/jsdl/2005/11/jsdl-posix"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://schemas.ggf.org/jsdl/2005/11/jsdl jsdl.xsd">
  <jsdl:JobDescription>
    <jsdl:JobIdentification>
      <jsdl:JobName>start vi</jsdl:JobName>
    </jsdl:JobIdentification>
    <jsdl:Application>
      <jsdl:ApplicationName>ls</jsdl:ApplicationName>
      <jsdl-posix:POSIXApplication>
        <jsdl-posix:Executable>/bin/ls</jsdl-posix:Executable>
        <jsdl-posix:Argument>-la file.txt</jsdl-posix:Argument>
        <jsdl-posix:Environment name="LD_LIBRARY_PATH">/usr/local/lib</jsdl-posix:Environment>
        <jsdl-posix:Input>/dev/null</jsdl-posix:Input>
        <jsdl-posix:Output>stdout.${JOB_ID}</jsdl-posix:Output>
        <jsdl-posix:Error>stderr.${JOB_ID}</jsdl-posix:Error>
      </jsdl-posix:POSIXApplication>
    </jsdl:Application>
  </jsdl:JobDescription>
</jsdl:JobDefinition>