A Service for Data-Intensive Computations on Virtual Clusters




A Service for Data-Intensive Computations on Virtual Clusters
Executing Preservation Strategies at Scale
Rainer Schmidt, Christian Sadilek, and Ross King
rainer.schmidt@arcs.ac.at

Planets Project: Permanent Long-term Access through NETworked Services
- Addresses the problem of digital preservation, driven by national libraries and archives
- Project instrument: FP6 Integrated Project, 5th IST Call
- Consortium: 16 organisations from 7 countries
- Duration: 48 months, June 2006 - May 2010
- Budget: 14 million euro
- http://www.planets-project.eu/

Outline
- Digital preservation and the need for eScience repositories
- The Planets preservation environment
- Distributed data and metadata integration
- Data-intensive computations
- Grid service for job execution on AWS
- Experimental results
- Conclusions

Drivers for Digital Preservation
- Exponential growth of digital data across all sectors of society
- Large quantities of often irreplaceable information, including research and scientific data
- Data is becoming significantly more complex and diverse
- Digital information is fragile: ensuring long-term access and interpretability requires data curation and preservation
- Development and assessment of preservation strategies call for integrated and largely automated environments

The Planets Environment
- A collaborative research infrastructure for the systematic development and evaluation of preservation strategies
- A decision-support system for the development of preservation plans
- An integrated environment for dynamically providing, discovering, and accessing a wide range of preservation tools, services, and technical registries
- A workflow environment for data integration, process execution, and the maintenance of preservation information

Service Gateway Architecture (layered component diagram)
- Portal Services: Administration UI, Preservation Planning (Plato), Experimentation Testbed, Application Workflow Execution UI, User Applications
- Application Services: Authentication and Authorization, Notification and Logging System, Workflow Execution and Monitoring
- Execution Services and Data Storage Services: Experiment Data and Metadata Repository, Service and Tool Registry
- Application Execution and Data Services
- Physical Resources: computers, networks

Data and Metadata Integration
- Support for distributed and remote storage facilities: data registry and metadata repository
- Integration of different data sources and encoding standards; support for a range of protocols and description schemas
- Connecting distributed data repositories and archives
- Development of a common digital object model
- Dynamic mapping of objects from different institutions (memory institutions, research collections, content networks)
- Recording of provenance, preservation, integrity, and descriptive metadata

Data-Intensive Applications
- Development of the Planets Job Submission Services (JSS)
- Allow job submission to a PC cluster (e.g. Hadoop, Condor)
- Standard grid protocols and interfaces (SOAP, HPC-BP, JSDL)
- Demand for massive compute and storage resources: instantiate a cluster on top of (leased) cloud resources based on AWS EC2 + S3, or alternatively in-house in a PC lab
- Allows computation to be moved to the data, or vice versa
- On-demand cluster based on virtual machine images
- Many preservation tools are, or rely on, 3rd-party applications; the software can be preinstalled on virtual cluster nodes
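The slides name JSDL as the job description format accepted by the Job Submission Services. A minimal JSDL 1.0 document of the kind such a service might accept could look as follows; the job name, executable path, and argument are illustrative placeholders, not taken from the project:

```xml
<jsdl:JobDefinition xmlns:jsdl="http://schemas.ggf.org/jsdl/2005/11/jsdl"
                    xmlns:jsdl-posix="http://schemas.ggf.org/jsdl/2005/11/jsdl-posix">
  <jsdl:JobDescription>
    <jsdl:JobIdentification>
      <jsdl:JobName>migrate-collection</jsdl:JobName>
    </jsdl:JobIdentification>
    <jsdl:Application>
      <jsdl-posix:POSIXApplication>
        <!-- Hypothetical migration tool preinstalled on the cluster nodes -->
        <jsdl-posix:Executable>/usr/local/bin/migrate-tool</jsdl-posix:Executable>
        <jsdl-posix:Argument>--target-format=pdf</jsdl-posix:Argument>
      </jsdl-posix:POSIXApplication>
    </jsdl:Application>
  </jsdl:JobDescription>
</jsdl:JobDefinition>
```

Such a document would be posted to the service's BES/HPCBP endpoint over SOAP, decoupling the client from the cluster technology (Hadoop, Condor) that actually runs the job.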

Experimental Architecture (diagram)
- Virtual cluster and file system: Apache Hadoop
- Virtual nodes (Xen) running on the cloud infrastructure (EC2)
- Workflow environment submits a JSDL job description file to the JSS via HPCBP
- Storage infrastructure (S3) holding the digital objects; data service

A Light-Weight Job Execution Service (components)
- WS container exposing a BES/HPCBP API with TLS/WS-Security
- Account Manager, Session Handler, JSDL Parser
- Job Manager, Execution Manager, Input Generator, Data Resolver
- Backends: Hadoop with HDFS, and S3

The MapReduce Application
- MapReduce implements a framework and programming model for processing large document sets (sorting, searching, indexing) on multiple nodes
- Automated decomposition (split), mapping to intermediate pairs (map), optionally (combine), merging of output (reduce)
- Provides an implementation model for data-parallel and I/O-intensive applications
- Migrating a digital object (e.g. web page, folder, archive): decompose it into atomic pieces (e.g. file, image, movie); on each node, process the input splits; merge the pieces and create new data collections
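The split/map/reduce pattern the slide describes can be sketched in plain Python; this is not the project's code, and the uppercase conversion stands in for a real format-migration tool:

```python
from collections import defaultdict

def split(collection):
    """Decompose digital objects (e.g. archives) into atomic pieces."""
    return [(obj_id, piece) for obj_id, pieces in collection.items()
                            for piece in pieces]

def map_fn(obj_id, piece):
    """Process one input split, e.g. migrate a single file to a new format."""
    return (obj_id, piece.upper())  # stand-in for a real migration step

def reduce_fn(pairs):
    """Merge migrated pieces back into per-object collections."""
    merged = defaultdict(list)
    for obj_id, piece in pairs:
        merged[obj_id].append(piece)
    return dict(merged)

collection = {"webpage-1": ["index.html", "logo.png"]}
result = reduce_fn(map_fn(k, p) for k, p in split(collection))
# result == {"webpage-1": ["INDEX.HTML", "LOGO.PNG"]}
```

In the actual system, Hadoop performs the split and shuffle automatically and runs the map tasks on the cluster nodes where the data resides.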

Experimental Setup
- Amazon Elastic Compute Cloud (EC2): 1-150 cluster nodes; custom image based on Fedora 8 i386
- Amazon Simple Storage Service (S3): max. 1 TB I/O per experiment
- All measurements based on a single core per virtual machine
- Apache Hadoop (v0.18) MapReduce implementation; preinstalled command-line tools
- Quantitative evaluation of VMs (AWS) based on execution time, number of tasks, number of nodes, and physical size

Experimental Results 1: Scaling Job Size
(Plot: execution time [min] vs. number of tasks, 1-1000, with number of nodes = 5; series: EC2 and SLE at payload sizes 0.07 MB, 7.5 MB, and 250 MB.)
- Speedup x(1k) = t_seq / t_parallel at 1000 tasks: measured values 3.5, 4.4 (7.5 MB payload), and 3.6

Experimental Results 2: Scaling the Number of Nodes
(Plot: execution time [min] vs. number of nodes, 0-150, for 1000 tasks x 0.07 MB, EC2 vs. SLE.)
- n=1 (local): t=26 min
- n=1: t=36 min, s=0.72, e=72%
- n=5: t=4.5 min, s=3.3, e=66%
- n=10: t=4.5 min, s=5.5, e=55%
- n=50: t=1.68 min, s=16, e=31%
- n=100: t=1.03 min, s=26, e=26%
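The metrics above follow the standard definitions: speedup s = t_seq / t_parallel and efficiency e = s / n for n nodes (e.g. s=5.5 at n=10 gives e=55%). A small helper, with illustrative numbers rather than the slide's measurements:

```python
def speedup(t_seq, t_par):
    """Speedup s = t_seq / t_par (sequential vs. parallel execution time)."""
    return t_seq / t_par

def efficiency(t_seq, t_par, n):
    """Parallel efficiency e = s / n for n worker nodes."""
    return speedup(t_seq, t_par) / n

# Hypothetical run: 100 min sequentially, 10 min on 20 nodes.
s = speedup(100, 10)         # s = 10.0
e = efficiency(100, 10, 20)  # e = 0.5, i.e. 50%
```

Efficiency below 100% reflects the coordination, latency, and file-system overheads quantified on the next slide.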

Results and Observations
- AWS + Hadoop + JSS: robust and fault-tolerant, scalable up to large numbers of nodes
- ~32.5 MB/s download / ~13.8 MB/s upload (cloud-internal)
- Even small clusters of virtual machines are reasonable: s = 4.4 for p = 5 nodes, with n = 1000 tasks of 7.5 MB each
- Ideally, the size of the cluster grows and shrinks on demand
- Overheads, SLE vs. cloud (p = 1, n = 1000): 30% due to the file system master; on average, less than 10% due to S3; small overheads due to coordination, latencies, and pre-processing

Conclusion
- The challenges in digital preservation are diversity and scale; focus: scientific data, arts and humanities repositories
- Preservation systems need to be embedded within large-scale distributed environments: grids, clouds, eScience environments
- Cloud computing provides a powerful solution for getting on-demand access to an IT infrastructure; current integration issues: policies, legal aspects, SLAs
- We have presented current developments in the context of the EU project Planets on building an integrated research environment for the development and evaluation of preservation strategies