The EU DataGrid Data Management



Similar documents
EDG Project: Database Management Services

Grid Data Management. Raj Kettimuthu

Replica Management Services in the European DataGrid Project

Workload Management Services. Data Management Services. Networking. Information Service. Fabric Management

The Data Grid: Towards an Architecture for Distributed Management and Analysis of Large Scientific Datasets

Advanced School in High Performance and GRID Computing November 2008

Data Management System for grid and portal services

Data Management in an International Data Grid Project. Timur Chabuk 04/09/2007

File Transfer Software and Service SC3

Chapter 12 Distributed Storage

Data Management Services Design and Development

Web Service Based Data Management for Grid Applications

The glite File Transfer Service

A simple object storage system for web applications Dan Pollack AOL

MIGRATING DESKTOP AND ROAMING ACCESS. Migrating Desktop and Roaming Access Whitepaper

Data Grids. Lidan Wang April 5, 2007

Analisi di un servizio SRM: StoRM

Deploying complex applications to Google Cloud. Olia Kerzhner

Plateforme de Calcul pour les Sciences du Vivant. SRB & glite. V. Breton.

Data Management. Issues

Deploying a distributed data storage system on the UK National Grid Service using federated SRB

Performance and Scalability of a Replica Location Service

ZooKeeper. Table of contents

Roberto Barbera. Centralized bookkeeping and monitoring in ALICE

Simplified Management With Hitachi Command Suite. By Hitachi Data Systems

Configuration Worksheets for Oracle WebCenter Ensemble 10.3

Tivoli Storage Manager Explained

DFSgc. Distributed File System for Multipurpose Grid Applications and Cloud Computing

Introduction to OpenStack

The Globus Replica Location Service: Design and Experience

The dcache Storage Element

w w w. u l t i m u m t e c h n o l o g i e s. c o m Infrastructure-as-a-Service on the OpenStack platform

Symantec NetBackup OpenStorage Solutions Guide for Disk

Distributed Data Management

Mass Storage at GridKa

OSG PUBLIC STORAGE. Tanya Levshina

High Availability Databases based on Oracle 10g RAC on Linux

Comparing Microsoft SQL Server 2005 Replication and DataXtend Remote Edition for Mobile and Distributed Applications

Building and Deploying Enterprise M2M Applications with Axeda Platform

Private Cloud Storage for Media Applications. Bang Chang Vice President, Broadcast Servers and Storage

The CMS analysis chain in a distributed environment

Oracle Recovery Manager

Configuration Management of Massively Scalable Systems

Distributed Database Access in the LHC Computing Grid with CORAL

Using the NDMP File Service for DMA- Driven Replication for Disaster Recovery. Hugo Patterson

A Web Services Data Analysis Grid *

Analyses on functional capabilities of BizTalk Server, Oracle BPEL Process Manger and WebSphere Process Server for applications in Grid middleware

Client/Server Grid applications to manage complex workflows

For large geographically dispersed companies, data grids offer an ingenious new model to economically share computing power and storage resources

Hadoop Distributed File System. Dhruba Borthakur June, 2007

Mule Enterprise Service Bus (ESB) Hosting

Exchange Storage Meeting Requirements with Dot Hill

IBM Solutions Grid for Business Partners Helping IBM Business Partners to Grid-enable applications for the next phase of e-business on demand

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

Distributed File Systems

Snapshots in Hadoop Distributed File System

Delegated Administration Quick Start

Migration and Building of Data Centers in IBM SoftLayer with the RackWare Management Module

Agenda. Overview Configuring the database for basic Backup and Recovery Backing up your database Restore and Recovery Operations Managing your backups

Oracle Secure Backup 10.2 Policy-Based Backup Management. An Oracle White Paper December 2007

Globus Striped GridFTP Framework and Server. Raj Kettimuthu, ANL and U. Chicago

Data Management System - Developer Guide

Customer Bank Account Management System Technical Specification Document

Clodoaldo Barrera Chief Technical Strategist IBM System Storage. Making a successful transition to Software Defined Storage

glibrary: Digital Asset Management System for the Grid

Cloud Storage Backup for Storage as a Service with AT&T

HDFS Architecture Guide

EUROPEAN ORGANIZATION FOR NUCLEAR RESEARCH CERN ACCELERATORS AND TECHNOLOGY SECTOR A REMOTE TRACING FACILITY FOR DISTRIBUTED SYSTEMS

MuleSoft Blueprint: Load Balancing Mule for Scalability and Availability

Technical. Overview. ~ a ~ irods version 4.x

Alfresco Enterprise on AWS: Reference Architecture

The EU DataGrid Information and Monitoring Services

CMB 207 1I Citrix XenApp and XenDesktop Fast Track

Web Application Hosting Cloud Architecture

WebLogic Server Foundation Topology, Configuration and Administration

Continuous security audit automation with Spacewalk, Puppet, Mcollective and SCAP

Oracle BI Publisher Enterprise Cluster Deployment. An Oracle White Paper August 2007

Cloud-Era File Sharing and Collaboration

Storage Virtualization. Andreas Joachim Peters CERN IT-DSS

PROGRESS Portal Access Whitepaper

Oracle Identity Analytics Architecture. An Oracle White Paper July 2010

Automated file management with IBM Active Cloud Engine

A Survey Study on Monitoring Service for Grid

This training is targeted at System Administrators and developers wanting to understand more about administering a WebLogic instance.

Heterogeneous Database Replication Gianni Pucciani

Hypertable Architecture Overview

THE CCLRC DATA PORTAL

Removing Failure Points and Increasing Scalability for the Engine that Drives webmd.com

Migration and Building of Data Centers in IBM SoftLayer with the RackWare Management Module

Network monitoring in DataGRID project

QuickStart Guide for Mobile Device Management

CASE STUDY: Oracle TimesTen In-Memory Database and Shared Disk HA Implementation at Instance level. -ORACLE TIMESTEN 11gR1

Avid. Avid Interplay Web Services. Version 2.0

11. Oracle Recovery Manager Overview and Configuration.

AGENDA: INTRODUCTION: 1. How is our cloud monitoring setup? 2. Which are the tools used? 3. How do we access monitoring dashboard?

Oracle WebLogic Server 11g: Administration Essentials

OpenAM. 1 open source 1 community experience distilled. Single Sign-On (SSO) tool for securing your web. applications in a fast and easy way

About the Author About the Technical Contributors About the Technical Reviewers Acknowledgments. How to Use This Book

Application Discovery Manager User s Guide vcenter Application Discovery Manager 6.2.1

Transcription:

The EU DataGrid Data Management The European DataGrid Project Team http://www.eu-datagrid.org DataGrid is a project funded by the European Union Grid Tutorial 4/3/2004 n 1

EDG Tutorial Overview Workload Management Services Data Management Services Networking Information Service Fabric Management Grid Tutorial - 4/3/2004 Data Management Services - n 2

Overview Data Management Issues Main Components Replica Manager Replica Location Service Replica Metadata Catalog Replica Optimization Service Grid Tutorial - 4/3/2004 Data Management Services - n 3

File Management Motivation Site A Site B Storage Element A Storage Element B File A File X File B File Y File Transfer File A File C File B File D Grid Tutorial - 4/3/2004 Data Management Services - n 4

File Management Motivation Replica Catalog: Map Logical to Site files Site A Site B Storage Element A Storage Element B File A File X File B File Y File Transfer File A File C File B File D Grid Tutorial - 4/3/2004 Data Management Services - n 5

File Management Motivation Replica Catalog: Map Logical to Site files Replica Selection: Get best file Site A Site B Storage Element A Storage Element B File A File X File B File Y File Transfer File A File C File B File D Grid Tutorial - 4/3/2004 Data Management Services - n 6

File Management Motivation Replica Catalog: Map Logical to Site files Replica Selection: Get best file Pre- Post-processing: Prepare files for transfer Validate files after transfer Site A Site B Storage Element A Storage Element B File A File X File B File Y File Transfer File A File C File B File D Grid Tutorial - 4/3/2004 Data Management Services - n 7

File Management Motivation Replica Catalog: Map Logical to Site files Replica Selection: Get best file Pre- Post-processing: Prepare files for transfer Validate files after transfer Site A Replication Automation: Data Source subscription Site B Storage Element A Storage Element B File A File X File B File Y File Transfer File A File C File B File D Grid Tutorial - 4/3/2004 Data Management Services - n 8

File Management Motivation Replica Catalog: Map Logical to Site files Replica Selection: Get best file Pre- Post-processing: Prepare files for transfer Validate files after transfer Site A Storage Element A Replication Automation: Data Source subscription Site B Load balancing: Replicate based on usage Storage Element B File A File X File B File Y File Transfer File A File C File B File D Grid Tutorial - 4/3/2004 Data Management Services - n 9

File Management Replica Catalog: Map Logical to Site files Replica Manager: atomic replication operation single client interface orchestrator Replica Selection: Get best file Pre- Post-processing: Prepare files for transfer Validate files after transfer Site A Storage Element A Replication Automation: Data Source subscription Site B Load balancing: Replicate based on usage Storage Element B File A File X File B File Y File Transfer File A File C File B File D Grid Tutorial - 4/3/2004 Data Management Services - n 10

File Management Replica Catalog: Map Logical to Site files Replica Manager: atomic replication operation single client interface orchestrator Replica Selection: Get best file Pre- Post-processing: Prepare files for transfer Validate files after transfer Site A Metadata: LFN Storage metadata Element A Transaction information Access patterns File A File X File B File Y File Transfer Replication Automation: Data Source subscription Site B Load balancing: Replicate based on usage Storage Element B File A File C File B File D Grid Tutorial - 4/3/2004 Data Management Services - n 11

File Management Replica Catalog: Map Logical to Site files Replica Manager: atomic replication operation single client interface orchestrator Replica Selection: Get best file Pre- Post-processing: Prepare files for transfer Validate files after transfer Site A Metadata: LFN Storage metadata Element A Transaction information Access patterns File A File X File B File Y File Transfer Replication Automation: Data Source subscription Site B Load balancing: Replicate based on usage Storage Element B File A File C File B File D Grid Tutorial - 4/3/2004 Data Management Services - n 12

Data Management Tools Tools for Locating data Copying data Managing and replicating data Meta Data management On EDG Testbed you have Replica Location Service (RLS) Replica Metadata Service (RMC) Repica Optimisation Service (ROS) Replica Manager (RM) RM RLS RMC ROS Grid Tutorial - 4/3/2004 Data Management Services - n 13

What is a Replica? Logical File Name (LFN) Physical Name on Storage Stockie LollyPop Grid Tutorial - 4/3/2004 Data Management Services - n 14

What is a Replica? Logical File Name (LFN) Physical Name on Storage Stockie LollyPop Grid Tutorial - 4/3/2004 Data Management Services - n 15

What is a Replica? Logical File Name (LFN) Physical Name on Storage Stockie LollyPop Grid Tutorial - 4/3/2004 Data Management Services - n 16

What is a Replica? Logical File Name (LFN) Physical Name on Storage Stockie LollyPop Grid Tutorial - 4/3/2004 Data Management Services - n 17

What is a Replica? Logical File Name (LFN) Physical Name on Storage Stockie LollyPop Grid Tutorial - 4/3/2004 Data Management Services - n 18

Replication Services: Basic Functionality Each file has a unique Grid ID. Locations corresponding to the GUID are kept in the Replica Location Service. Users may assign aliases to the GUIDs. These are kept in the Replica Metadata Catalog. Files have replicas stored at many Grid sites on Storage Elements. Replica Manager Replica Metadata Catalog Replica Location Service Storage Element Storage Element The Replica Manager provides atomicity for file operations, assuring consistency of SE and catalog contents. Grid Tutorial - 4/3/2004 Data Management Services - n 19

Higher Level Replication Services Hooks for user-defined pre- and postprocessing for replication operations are The available. Replica Manager may call on the Replica Optimization service to find the best replica among many based on network and SE monitoring. Replica Manager Replica Metadata Catalog Replica Location Service Replica Optimization Service Storage Element Storage Element SE Monitor Network Monitor Grid Tutorial - 4/3/2004 Data Management Services - n 20

Interactions with other Grid components Resource Broker User Interface or Worker Node Virtual Organization Membership Service Information Service Replica Manager Replica Metadata Catalog Replica Location Service Replica Optimization Service Storage Element Storage Element SE Monitor Applications and users interface to data through the Replica Manager Network Monitor either directly or through the Resource Broker. Management calls should never go directly to the SRM. Grid Tutorial - 4/3/2004 Data Management Services - n 21

Storage Resource Management (1) Data are stored on disk pool servers or Mass Storage Systems storage resource management needs to take into account Transparent access to files (migration to/from disk pool) File pinning Space reservation File status notification Life time management SRM (Storage Resource Manager) takes care of all these details SRM is a Grid Service that takes care of local storage interaction and provides a Grid interaface to outside world In EDG we originally used the term Storage Elemement now we use the term SRM to refer to the new service Grid Tutorial - 4/3/2004 Data Management Services - n 22

Storage Resource Management (2) Original SRM design specification: LBL, JNL, FNAL, CERN Support for local policy Each storage resource can be managed independently Internal priorities are not sacrificed by data movement between Grid agents Disk and tape resources are presented as a single element Temporary locking/pinning Files can be read from disk caches rather than from tape Reservation on demand and advance reservation Space can be reserved for registering a new file Plan the storage system usage File status and estimates for planning Provides info on file status Provide estimates on space availability/usage EDG provides one implementation as part of the current software release Grid Tutorial - 4/3/2004 Data Management Services - n 23

Simplified Interaction Replica Manager - SRM Replica Manager client 1 6 6 3 2 Replica Catalog SRM 4 5 Storage 1. The Client asks a catalog to provide the location of a file 2. The catalog responds with the name of an SRM 3. The client asks the SRM for the file 4. The SRM asks the storage system to provide the file 5. The storage system sends the file to the client through the SRM or 6. directly Grid Tutorial - 4/3/2004 Data Management Services - n 24

Naming Conventions Logical File Name (LFN) An alias created by a user to refer to some item of data e.g. lfn:cms/20030203/run2/track1 Site URL (SURL) (or Physical File Name (PFN)) The location of an actual piece of data on a storage system e.g. srm://pcrd24.cern.ch/flatfiles/cms/output10_1 Globally Unique Identifier (GUID) A non-human readable unique identifier for an item of data e.g. guid:f81d4fae-7dec-11d0-a765-00a0c91e6bf6 Logical File Name 1 Physical File SURL 1 Logical File Name 2 GUID Logical File Name n Physical File SURL n Grid Tutorial - 4/3/2004 Data Management Services - n 25

Replica Metadata Catalog (RMC) vs. Replica Location Service (RLS) RMC: Stores LFN-GUID mappings RM RLS RLS: Stores GUID-SURL mappings RMC ROS Logical File Name 1 Physical File SURL 1 Logical File Name 2 GUID Logical File Name n Physical File SURL n RMC RLS Grid Tutorial - 4/3/2004 Data Management Services - n 26

Replica Location Service (RLS) The Replica Location Service is a system that maintains and provides access to information about the physical location of copies of data files. It is a distributed service that stores mappings between globally unique identifiers of datafiles and the physical identifiers of all existing replicas of these datafiles. Design is a joint collaboration between Globus and EDG-WP2 RM RLS RMC ROS Grid Tutorial - 4/3/2004 Data Management Services - n 27

Replica Location Service RLS Local Catalogs hold the actual name mappings Remote Indices redirect inquiries to LRCs actually having the file LRCs are configured to send index updates to any number of RLIs Indexes are Bloom Filters Grid Tutorial - 4/3/2004 Data Management Services - n 28

RLS Components (1) Local Replica Catalog (LRC) Stores GUID to SURL (PFN) mappings for a single SRM Stores attributes on SURL (PFN), e.g file size, creator Maintains local state independently of other RLS components complete local record of all replicas on a single SRM Many Local Replica Catalogs in a Grid can be configured, for instance one LRC per site or one LRC per SRM Fairly permanent fixture LRC coupled to the SRM, SRM removal is infrequent new LRCs added with addition of new SRMs to a site RM RLS RMC ROS RLI LRC Grid Tutorial - 4/3/2004 Data Management Services - n 29

RLS Components (2) Replica Location Index (RLI) Stores GUID to LRC mappings Distributed index over Local Replica Catalogs in a Grid Receives periodic soft state updates from LRCs information has an associated expiration time LRCs configured to send updates to RLIs (push) Maintains collective state inconsistencies due to the soft state update mechanism Can be installed anywhere not inherently associated with anything Uses Bloom Filter Indexes RM RLS RMC ROS RLI LRC Grid Tutorial - 4/3/2004 Data Management Services - n 30

LRC Implementation LRC data stored in a Relational Database Runs with either Oracle 9i or MySQL Catalogs implemented in Java and hosted in a J2EE application server Tomcat4 or Oracle 9iAS for application server Java and C++ APIs exposed to clients through Apache Axis (Java) and gsoap (C++) Catalog APIs exposed using WSDL Vendor neutral approach taken to allow different deployment options Grid Tutorial - 4/3/2004 Data Management Services - n 31

RLI Implementation Updates implemented as a push from the LRCs RLIs less permanent than LRCs Bloom filter updates only implemented O(10 6 ) (or more ) entries in an LRC May contain false positives, but no false negatives rate depends on the configutation of the bloom filter Bloom filters stored to disk Impractical to send full LRC lists to multiple RLIs fine for tests not scalable in a production environment Implemented in Java as a web service as for the LRCs Grid Tutorial - 4/3/2004 Data Management Services - n 32

Usage Here at NCP Only one Local Replica Catalog (LRC) One Replica Metadata Catalog pcncp27.ncp.edu.pk Grid Tutorial - 4/3/2004 Data Management Services - n 33

User Interfaces for Data Management Users are mainly referred to use the interface of the Replica Manager client: Management commands Catalog commands Optimization commands File Transfer commands RM RLS RMC ROS The services RLS, RMC and ROS provide additional user interfaces Mainly for additional catalog operations (in case of RLS, RMC) Additional server administration commands Should mainly be used by administrators Can also be used the check the availability of a service Grid Tutorial - 4/3/2004 Data Management Services - n 34

The Replica Manager Interface Management Commands copyandregisterfile args: source, dest, lfn, protocol, streams Copy a file into grid-aware storage and register the copy in the Replica Catalog as an atomic operation. replicatefile args: source/lfn, dest, protocol, streams Replicate a file between grid-aware stores and register the replica in the Replica Catalog as an atomic operation. deletefile args: source/sehost, all Delete a file from storage and unregister it. Example edg-rm --vo=tutor copyandregisterfile file:/home/bob/analysis/data5.dat -d lxshare0384.cern.ch Grid Tutorial - 4/3/2004 Data Management Services - n 35

The Replica Manager Interface Catalog Commands (1) registerfile args: source, lfn Register a file in the Replica Catalog that is already stored on a Storage Element. unregisterfile args: source, guid Unregister a file from the Replica Catalog. listreplicas args: lfn/surl/guid List all replicas of a file. registerguid args: surl, guid Register an SURL with a known GUID in the Replica Catalog. listguid args: lfn/surl Print the GUID associated with an LFN or SURL. Grid Tutorial - 4/3/2004 Data Management Services - n 36

The Replica Manager Interface Catalog Commands (2) addalias args: guid, lfn Add a new alias to GUID mapping removealias args: guid, lfn Remove an alias LFN from a known GUID. printinfo() Print the information needed by the Replica Manager to screen or to a file. getversion() Get the versions of the replica manager client. Grid Tutorial - 4/3/2004 Data Management Services - n 37

The Replica Manager Interface Optimization Commands listbestfile args: lfn/guid, sehost Return the 'best' replica for a given logical file identifier. getbestfile args: lfn/guid, sehost, protocol, streams Return the storage file name (SFN) of the best file in terms of network latencies. getaccesscost args: lfn/guid[], ce[], protocol[] Calculates the expected cost of accessing all the files specified by logicalname from each Computing Element host specified by cehosts. Grid Tutorial - 4/3/2004 Data Management Services - n 38

The Replica Manager Interface File Transfer Commands copyfile args: soure, dest Copy a file to a non-grid destination. listdirectory args: dir List the directory contents on an SRM or a GridFTP server. Grid Tutorial - 4/3/2004 Data Management Services - n 39

Replica Management Use Case edg-rm copyandregisterfile -l lfn:higgs CERN LYON edg-rm listreplicas -l lfn:higgs edg-rm replicatefile -l lfn:higgs NIKHEF edg-rm listbestfile -l lfn:higgs CERN edg-rm getaccesscost -l lfn:higgs CERN NIKHEF LYON edg-rm getbestfile -l lfn:higgs CERN edg-rm deletefile -l lfn:higgs LYON edg-rm listbestfile -l lfn:higgs CERN Grid Tutorial - 4/3/2004 Data Management Services - n 40

Metadata Management and Security Project Spitfire 'Simple' Grid Persistency Grid Metadata Application Metadata Unified Grid enabled front end to relational databases. Metadata Replication and Consistency Publish information on the metadata service Secure Grid Services Grid authentication, authorization and access control mechanisms enabled in Spitfire Modular design, reusable by other Grid Services Grid Tutorial - 4/3/2004 Data Management Services - n 41

Conclusions and Further Information The second generation Data Management services have been designed and implemented based on the Web Service paradigm Flexible, extensible service framework Deployment choices : robust, highly available commercial products supported (eg. Oracle) as well as open-source (MySQL, Tomcat) First experiences with these services show that their performance meets the expectations Further information / documentation: www.cern.ch/edg-wp2 Grid Tutorial - 4/3/2004 Data Management Services - n 42