A Service for Data-Intensive Computations on Virtual Clusters

Size: px

Start display at page:

Download "A Service for Data-Intensive Computations on Virtual Clusters"

Brian Burke
10 years ago
Views:

1 A Service for Data-Intensive Computations on Virtual Clusters Executing Preservation Strategies at Scale Rainer Schmidt, Christian Sadilek, and Ross King

2 Planets Project Permanent Long-term Access through NETworked Services Addresses the problem of digital preservation driven by National Libraries and Archives Project instrument: FP6 Integrated Project 5. IST Call Consortium: 16 organisations from 7 countries Duration: 48 months, June 2006 May 2010 Budget: 14 Million Euro

3 Outline Digital Preservation Need for escience Repositories The Planets Preservation Environment Distributed Data and Metadata Integration Data-Intensive Computations Grid Service for Job Execution on AWS Experimental Results Conclusions

Metadata Integration Data-Intensive Computations Grid

4 Drivers for Digital Preservation Exponential growth of digital data across all sectors of society Large quantities of often irreplaceable information Research and scientific data Data becomes significantly more complex and diverse Digital information is fragile Ensure long term access and interpretability Requires data curation and preservation Development and assessment of preservation strategies Need for integrated and largely automated environments

complex and diverse Digital information is fragile Ensure long term access and interpretability Requires data

5 The Planets Environment A collaborative research infrastructure for the systematic development and evaluation and of preservation strategies. A decision-support system for the development of preservation plans An integrated environment for dynamically providing, discovering and accessing a wide range of preservation tools, services, and technical registries. A workflow environment for the data integration, process execution, and the maintenance of preservation information.

A decision-support system for the development of preservation plans An integrated environment for dynamically

6 Service Gateway Architecture Administration UI Preservation Planning (Plato) Experimentation Testbed Application Workflow Execution UI User Applications Authentication and Authorization Notification and Logging System Workflow Execution and Monitoring Experiment Data and Metadata Repository Service and Tool Registry Portal Services Application Services Execution Services Data Storage Services Application Execution and Data Services Physical Resources, Computers, Networks

Execution and Monitoring Experiment Data and Metadata Repository Service and Tool Registry Portal Services Application

7 Data and Metadata Integration Support distributed and remote storage facilities data registry and metadata repository Integrate different data sources and encoding standards Support range of protocols and description schemas Connect distributed data repositories and archives. Development of common digital object model Dynamic mapping of objects from different institutions (memory inst., research collections, content networks) Recording of provenance information, preservation, integrity, and descriptive metadata

8 Data Intensive Applications Development Planets Job Submission Services Allow Job Submission to a PC cluster (e.g. Hadoop, Condor) Standard grid protocols/interfaces (SOAP, HPC-BP, JSDL) Demand for massive compute and storage resources Instantiate cluster on top of (leased) cloud resources based on AWS EC2 + S3, (or alternatively in-house in a PC lab.) Allows computation to be moved to data or vice-versa On-demand cluster based on virtual machine images Many Preservation Tools are/rely on 3rd party applications Software can be preinstalled on virtual cluster nodes

on top of (leased) cloud resources based on AWS EC2 + S3, (or alternatively in-house in a PC lab.

9 Virtual Cluster and File System (Apache Hadoop) Experimental Architecture HPCBP Virtual Node (Xen) Cloud Infrastructure (EC2) Workflow Environment JSDL Job Description File JSS Job Storage Infrastructure (S3) Digital Objects Data Service

Infrastructure (EC2) Workflow Environment JSDL Job

10 A Light-Weight Job Execution Service WS - Container Account Manager JSDL Parser Session Handler TLS/WS Security BES/HPCBP API Exec. Manager Data Resolver Input Generator Job Manager

11 A Light-Weight Job Execution Service WS - Container Account Manager JSDL Parser Session Handler TLS/WS Security BES/HPCBP API Exec. Manager Data Resolver Input Generator Job Manager S3 HDFS Hadoop

12 The MapReduce Application Map-Reduce implements a framework and prog. model for processing large documents (Sorting, Searching, Indexing) on multiple nodes. Automated decomposition (split) Mapping to intermediary pairs (map), optionally (combine) Merge output (reduce) Provides implementation for data parallel + i/o intensive applications Migrating a digital object (e.g web page, folder, archive) Decompose into atomic pieces (e.g. file, image, movie) On each node, process input splits Merge pieces and create new data collections

Automated decomposition (split) Mapping to intermediary pairs (map), optionally (combine) Merge output (reduce) Provides

13 Experimental Setup Amazon Elastic Compute Cloud (EC2) cluster nodes Custom image based on Fedora 8 i386 Amazon Simple Storage Service (S3) max. 1TB I/O per experiment All measurements based on a single core per virtual machine Apache Hadoop (v.0.18) MapReduce Implementation Preinstalled command line tools Quantitative evaluation of VMs (AWS) based on execution time, number of tasks, number of nodes, physical size

1TB I/O per experiment All measurements based on a single core per virtual machine Apache Hadoop (v.0.

14 Experimental Results 1 Scaling Job Size time [min] 10,00 9,00 8,00 7,00 6,00 5,00 4,00 3,00 2,00 1,00 0,00 x(1k) = 3,5 x(1k) = 4,4 x(1k) = 3, number of nodes = 5 EC2 0,07 MB EC2 7,5 MB EC2 250 MB SLE 0,07 MB SLE 7,5 MB SLE 250 MB tasks x(1k) = t_seq / t_parallel and tasks = 1000

10 100 1000 number of nodes = 5 EC2 0,07 MB EC2 7,5 MB EC2 250 MB SLE

15 Experimental Results 2 Scaling #nodes 40,00 35,00 X n=1, t=36, s1 = 0.72, e=72% time [min] 30,00 25,00 20,00 15,00 10,00 5,00 0,00 X X n=1 (local), t=26 n=5, t=4.5, s=3.3, e=66% n=10, t=4.5, s=5.5, e=55% X X n=50, t=1.68, s=16, e=31% X n=100, t=1.03, s=26, e=26% EC x 0,07 MB SLE 1000 x 0,07 MB nodes

t=26 n=5, t=4.5, s=3.3, e=66% n=10, t=4.5, s=5.5, e=55% X X n=50, t=1.

16 Results and Observations AWS + Hadoop + JSS Robust and fault tolerant, scalable up to large numbers of nodes ~32,5MB/s download / ~13,8MB/s upload (cloud internally) Also small clusters of virtual machines reasonable S = 4,4 for p = 5, with n=1000 and s=7,5mb Ideally, size of cluster grows/shrinks on demand Overheads: SLE vs. Cloud (p=1, n=1000) 30% due to file system master In average, less then 10% due to S3 Small overheads due to coordination, latencies, pre-processing

with n=1000 and s=7,5mb Ideally, size of cluster grows/shrinks on demand Overheads: SLE vs.

17 Conclusion Challenges in digital preservation are diversity and scale Focus: scientific data, arts and humanities Repositories Preservation systems need to be embedded with largescale distributed environments grids, clouds, escience environments Cloud Computing provides a powerful solution for getting ondemand access to an IT infrastructure. Currently integration issues: Policies, Legal Aspects, SLAs We have presented current developments in the context of the EU project Planets on building an integrated research environment for the development and evaluation of preservation strategies

getting ondemand access to an IT infrastructure.

Leveraging BlobSeer to boost up the deployment and execution of Hadoop applications in Nimbus cloud environments on Grid 5000

Leveraging BlobSeer to boost up the deployment and execution of Hadoop applications in Nimbus cloud environments on Grid 5000 Alexandra Carpen-Amarie Diana Moise Bogdan Nicolae KerData Team, INRIA Outline