The Data Grid: Towards an Architecture for Distributed Management and Analysis of Large Scientific Datasets


Ethar Elsaka

- Large data collections appear in many scientific domains, such as climate studies.
- Users and resources are geographically distributed.
- Numerous solutions to individual issues exist, but there is no overarching architecture.
- Solution: the Data Grid
  - A specialization and extension of the Grid paradigm.

- Mechanism neutrality
  - The Data Grid architecture is independent of lower-level storage and access systems. Local peculiarities are encapsulated.
- Policy neutrality
  - Design decisions with performance implications are exposed to the user, i.e., the user sets the priorities.
- Compatibility with Grid infrastructure
  - Exploit Grid services such as authentication and resource management.
- Uniformity of information infrastructure
  - Use the same data model and interface as are used for the underlying Grid infrastructure.

- Data access
  - Provides mechanisms for accessing, managing, and transferring data held in storage systems.
- Metadata access
  - Provides mechanisms for accessing and managing information about stored data.
- Separating data and metadata maximizes flexibility.

- Storage system
  - Provides functions for creating, destroying, reading, writing, and manipulating file instances.
  - Unit of information: a file instance, which consists of a named, uninterrupted sequence of bytes.
  - A file instance can reside in a file system, a database, or another storage system such as SRB.
  - The storage system associates a set of properties with each file instance.

- Data access API
  - Describes the operations that can be performed on storage systems and file instances.
  - Main operations:
    - Remote requests to read or write named file instances.
    - Determining file instance attributes.
    - Supporting optimized implementations of replication.
  - Data Grid complications:
    - Heterogeneous security environments.
    - Reservation capabilities on storage systems and environments when increased performance is required.
    - Performance monitoring and self-optimization.
    - Error detection and reporting.

- Metadata service
  - Manages information about the grid, file instances, the contents of file instances, and the storage systems contained in the Data Grid.
  - Metadata types:
    - Application metadata
    - Replica metadata
    - System configuration metadata
  - Application queries are sent to the metadata service. The repository consists of references to logical files, which can be mapped to actual replicas.
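The storage system and data access operations listed above can be made concrete with a small sketch. This is only an illustration under stated assumptions: the class and method names (`InMemoryStorageSystem`, `create`, `attributes`, and so on) are hypothetical and do not come from the slides or from any Grid toolkit, and an in-memory dictionary stands in for a real file system, database, or SRB.

```python
from dataclasses import dataclass, field


@dataclass
class FileInstance:
    """A named sequence of bytes plus the properties kept by the storage system."""
    name: str
    data: bytearray = field(default_factory=bytearray)
    properties: dict = field(default_factory=dict)


class InMemoryStorageSystem:
    """Hypothetical storage system; a real one might wrap a file system or database."""

    def __init__(self):
        self._files = {}

    def create(self, name, **properties):
        if name in self._files:
            raise FileExistsError(name)
        self._files[name] = FileInstance(name, properties=dict(properties))

    def destroy(self, name):
        del self._files[name]

    def write(self, name, offset, payload):
        data = self._files[name].data
        # Pad with zero bytes if the write starts beyond the current end.
        if offset > len(data):
            data.extend(bytes(offset - len(data)))
        data[offset:offset + len(payload)] = payload

    def read(self, name, offset=0, length=None):
        data = self._files[name].data
        end = len(data) if length is None else offset + length
        return bytes(data[offset:end])

    def attributes(self, name):
        """Return the file instance attributes: its size plus stored properties."""
        inst = self._files[name]
        return {"size": len(inst.data), **inst.properties}


store = InMemoryStorageSystem()
store.create("climate/run1.nc", owner="alice")
store.write("climate/run1.nc", 0, b"hello grid")
print(store.read("climate/run1.nc", 6, 4))   # b'grid'
print(store.attributes("climate/run1.nc"))   # {'size': 10, 'owner': 'alice'}
```

Because callers only see `read`, `write`, and `attributes`, the backing store could be swapped without changing client code, which is the point of mechanism neutrality.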

- Metadata complications
  - It is difficult to select a uniform representation and interface: the numerous existing metadata representations reflect different needs and philosophies (XML, indexing data structures, etc.).
  - Large-scale metadata issues: scalability, heterogeneity, a distributed environment, ownership and local control over data access, and robustness in the face of partial failures.
  - Hierarchical and distributed system.
  - Similar systems: LDAP, existing Grid metadata systems.

- Other basic services
  - Authorization and authentication
  - Resource reservation and co-allocation
  - Performance measurements
  - Instrumentation services

- Replica management
  - Responsible for creating and deleting copies of file instances (replicas) within specified storage systems.
  - A replica is created to achieve better performance.
  - It maintains a repository, or catalog.
  - It does not determine when or where replicas are created.
- Replica selection
  - Responsible for choosing the replica that will optimize performance.
  - The Grid information service and the metadata repository provide it with information about the network and the file instances.

- LDAP was used to construct the catalogs.
  - Catalog: a tree structure known as a Directory Information Tree (DIT).
  - Two applications: climate modeling and data visualization.
- Climate modeling
  - DIT: a root node, a node for each collection, and a node for each logical file in a collection.
  - Metadata: XML. Data is accessed via URLs.
  - Prototype: the user manually chooses the replica.
  - A query can be mapped to a URL using LDAP, and the data moved using a general-purpose transfer API.
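The split between replica management (maintaining the catalog) and replica selection (choosing a copy) can be sketched as follows. This is a minimal illustration, not the LDAP-based prototype: a nested dictionary imitates the collection and logical-file levels of the DIT, the cost model (latency plus size divided by bandwidth) is an assumption, and all host names and figures are made up.

```python
from urllib.parse import urlparse


class ReplicaCatalog:
    """Hypothetical catalog: collection -> logical file -> list of replica URLs."""

    def __init__(self):
        self._tree = {}

    def register(self, collection, logical_file, url):
        self._tree.setdefault(collection, {}).setdefault(logical_file, []).append(url)

    def unregister(self, collection, logical_file, url):
        self._tree[collection][logical_file].remove(url)

    def locate(self, collection, logical_file):
        return list(self._tree.get(collection, {}).get(logical_file, []))


def select_replica(urls, size_bytes, bandwidth_bps, latency_s):
    """Pick the URL with the lowest estimated transfer time (assumed cost model)."""
    def estimate(url):
        host = urlparse(url).hostname
        return latency_s[host] + size_bytes * 8 / bandwidth_bps[host]
    return min(urls, key=estimate)


catalog = ReplicaCatalog()
catalog.register("sim42", "t0001", "gsiftp://a.example.org/sim42/t0001")
catalog.register("sim42", "t0001", "gsiftp://b.example.org/data/sim42/t0001")

# In a real deployment these figures would come from the Grid information service.
bandwidth = {"a.example.org": 100e6, "b.example.org": 1e9}
latency = {"a.example.org": 0.05, "b.example.org": 0.2}

best = select_replica(catalog.locate("sim42", "t0001"), 10**9, bandwidth, latency)
print(best)  # gsiftp://b.example.org/data/sim42/t0001
```

Note that the catalog never decides where replicas go; it only records them, matching the statement that replica management does not determine when or where replicas are created.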

- Data visualization application
  - Locates files for distance visualization applications.
  - Each file is a timestep in a series.
    - Files can have different resolutions.
    - Files can have different data layouts.
  - Several thousand files.
  - Each file has multiple replicas listed in the catalog.
  - The implementation scales poorly: too many objects in the replica catalog.
  - Solution: organize logical files into collections.

- Keep metadata and replica management separate.
  - The replica and metadata catalogs should be distinct.
  - Information about data contents should be stored in the metadata catalog.
  - Data location information and mappings should be stored in the replica catalog.

- Scientific applications
  - Require the ability to transfer, publish, replicate, discover, share, and analyze data.
- Business applications
  - Maintain database consistency, manage data replication, facilitate data discovery, and respond dynamically to changes in the load that users place on the database.

- Structured data
  - Data whose structure is explicitly defined in a way that can be exploited by operations.
  - Data is structured using:
    - The relational data model
    - XML documents
  - Structured data categories:
    - Primary structured data
    - Ancillary data
    - Collaboration data
    - Personal data
    - Service data
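The recommendation to keep the metadata catalog and the replica catalog distinct, and to register replicas per collection rather than per file, can be illustrated with a short sketch. Everything here is hypothetical: the attribute names, the collection name `sim42`, the URLs, and the assumption that each replica site holds a complete copy of the collection.

```python
# Metadata catalog: describes data contents, one entry per logical file.
metadata_catalog = [
    {"logical_file": "t0001.low", "collection": "sim42", "timestep": 1, "resolution": "low"},
    {"logical_file": "t0001.high", "collection": "sim42", "timestep": 1, "resolution": "high"},
    {"logical_file": "t0002.low", "collection": "sim42", "timestep": 2, "resolution": "low"},
]

# Replica catalog: describes data location, one entry per collection.
# Assumes every listed site stores the whole collection.
replica_catalog = {
    "sim42": ["gsiftp://a.example.org/sim42/", "gsiftp://b.example.org/data/sim42/"],
}


def find_logical_files(**attributes):
    """Query the metadata catalog by content attributes only."""
    return [
        entry for entry in metadata_catalog
        if all(entry.get(key) == value for key, value in attributes.items())
    ]


def locate(entry):
    """Resolve a metadata entry to replica URLs through the replica catalog."""
    return [base + entry["logical_file"] for base in replica_catalog[entry["collection"]]]


matches = find_logical_files(timestep=1, resolution="high")
urls = locate(matches[0])
print(urls)
# ['gsiftp://a.example.org/sim42/t0001.high', 'gsiftp://b.example.org/data/sim42/t0001.high']
```

With several thousand timestep files, the replica catalog in this layout holds one object per collection and site instead of one per file and site, which is the scaling problem the slide describes.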

- Application requirements
  - Diverse usage scenarios.
  - Heterogeneity at all system levels.
  - Performance demands associated with access, manipulation, and analysis of large quantities of data.

- Data sources
  - Any facility that may accept or provide data (e.g., databases and file systems).
  - A data source may be temporary or permanent.
  - The Grid contains diverse data sources with different storage systems, data types, data models, and access mechanisms.

- Data discovery, movement, and replication
  - The ability to discover desired data based on metadata attributes rather than knowledge of its name or location.
  - The ability to move data between storage systems, or between programs and data storage.
  - The ability to create replicas to reduce access latency, maintain local control over necessary data, or improve reliability and load balancing.

- Data analysis and processing
  - Data analysis requires a series of data processing and analysis operations.
  - Executing these operations in sequence raises planning, scheduling, and monitoring challenges.
  - Once the execution plan is designed, the operations must be mapped onto available resources.
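The last two points, designing an execution plan and then mapping its operations onto resources, can be sketched with the standard library. This is only an illustration: the operation names and resource names are hypothetical, and round-robin assignment is a deliberately naive placeholder for a real scheduler.

```python
from graphlib import TopologicalSorter  # Python 3.9+
from itertools import cycle


def plan(dependencies, resources):
    """Order operations so prerequisites come first, then assign resources round-robin."""
    order = list(TopologicalSorter(dependencies).static_order())
    return list(zip(order, cycle(resources)))


# Each operation maps to the set of operations it depends on.
dependencies = {
    "subset": {"fetch"},
    "analyze": {"subset"},
    "plot": {"analyze"},
}

schedule = plan(dependencies, ["node-a", "node-b"])
for operation, resource in schedule:
    print(f"{operation} -> {resource}")
# fetch -> node-a
# subset -> node-b
# analyze -> node-a
# plot -> node-b
```

A production planner would also weigh data location and resource load, and the monitoring challenge mentioned above is not addressed here at all.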

- Designs for an infrastructure that supports data-intensive applications vary from a monolithic integrated architecture to a layered protocol architecture.
- The data-oriented services:
  - Resource-level services for data sources
    - High-performance data access and update.
  - Collective services for managing data
    - Managing data across different data sources, e.g., data transfer and data discovery.
  - Collective services that federate data sources
    - Combining data from different data sources.
  - Domain-specific services
    - Specialized data processing and analysis operations.

- Data access
  - Responsible for the interfaces through which the user accesses data on the devices, and for their performance characteristics.
  - GridFTP as a file access service
    - Designed as a fundamental data access and data transport service.
    - Third-party transfers
    - Parallel and striped transfers
    - Partial file transfer
    - Fault tolerance and restart
  - Data Access and Integration (enterprise applications)
    - Standard interface
    - Establish a session
    - Submit a series of queries
    - Each submission receives a response
    - OGSA-DAI components are suitable for Grid environments.

- Managing data sources
  - Responsible for managing the storage systems that hold the data collections being accessed.
  - NeST
    - A software-only grid storage appliance.
    - A lot provides a specific capacity to a particular user or group.
    - Lots logically group sets of files to facilitate sharing, discovery, and removal.
    - Lots can be either best-effort or guaranteed.
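The idea behind the partial and parallel transfers listed for GridFTP can be sketched in a few lines: a file is divided into byte ranges, each range is fetched independently, and the pieces are reassembled in order. This is not the GridFTP protocol or any client library; `split_ranges`, `fetch_parallel`, and the `read_range` callback are hypothetical names, and a local byte string stands in for the remote file.

```python
from concurrent.futures import ThreadPoolExecutor


def split_ranges(size, streams):
    """Return (offset, length) pairs covering [0, size) in at most `streams` pieces."""
    if size < 0 or streams < 1:
        raise ValueError("size must be >= 0 and streams >= 1")
    base, extra = divmod(size, streams)
    ranges, offset = [], 0
    for i in range(streams):
        length = base + (1 if i < extra else 0)
        if length:
            ranges.append((offset, length))
        offset += length
    return ranges


def fetch_parallel(read_range, size, streams=4):
    """Fetch each byte range concurrently and join the parts in offset order."""
    ranges = split_ranges(size, streams)
    with ThreadPoolExecutor(max_workers=streams) as pool:
        parts = list(pool.map(lambda r: read_range(*r), ranges))
    return b"".join(parts)


# Stand-in for a remote file: a partial read is just a slice.
source = bytes(range(256)) * 4


def read_range(offset, length):
    return source[offset:offset + length]


print(split_ranges(10, 4))                         # [(0, 3), (3, 3), (6, 2), (8, 2)]
print(fetch_parallel(read_range, len(source)) == source)  # True
```

Restart after failure could be layered on the same structure by re-requesting only the ranges that did not complete, though that is not shown here.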

- NeST main components:
  - Storage manager
  - Transfer manager
  - Dispatcher
  - Protocol layer

- Storage Resource Manager
  - Manages its associated storage through a range of techniques, including cache management, preallocation, and advance reservation of space.
  - Manages two types of shared resources: files and space.

- Data transport services
  - Responsible for managing the transport of data between data storage services.
  - Multiple data object transfer services or reliable data transfer services.

- General data discovery services
  - Responsible for identifying data items that match specified characteristics or metadata.
  - MCAT and MCS provide a general metadata schema.

- Workflow management, planning, and scheduling
  - The workflow manager is responsible for initiating the execution of individual tasks in the specified order.

- Federated databases
  - Many databases contribute data and resources to a multi-database federation, but each participant retains full local autonomy.
  - A federation can be loosely coupled or tightly coupled.
  - The goal is to provide access to sophisticated analysis and simulation computations using integrated data from multiple resources.

- Data mediation services
  - Provide transparency with respect to data models.
- Replication services for location transparency
  - Provide replication transparency.
  - Types:
    - Replica management services
    - Replica location services
    - Consistency services
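The Storage Resource Manager's treatment of space as a shared resource, with advance reservation, can be illustrated with a small bookkeeping sketch. This is an assumption-laden toy, not the SRM interface: `SpaceManager`, its methods, and the integer tokens are all hypothetical, and only byte counts are tracked, not actual files.

```python
class SpaceManager:
    """Hypothetical bookkeeping for reserved space on one storage resource."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._reserved = {}  # token -> bytes reserved
        self._used = {}      # token -> bytes written so far
        self._next_token = 0

    def free(self):
        """Bytes not yet promised to any reservation."""
        return self.capacity - sum(self._reserved.values())

    def reserve(self, nbytes):
        """Reserve space in advance; fail if it cannot be guaranteed."""
        if nbytes > self.free():
            raise RuntimeError("not enough unreserved space")
        token = self._next_token
        self._next_token += 1
        self._reserved[token] = nbytes
        self._used[token] = 0
        return token

    def put(self, token, nbytes):
        """Account for a file written into an existing reservation."""
        if self._used[token] + nbytes > self._reserved[token]:
            raise RuntimeError("reservation exceeded")
        self._used[token] += nbytes

    def release(self, token):
        """Give the reserved space back to the shared pool."""
        del self._reserved[token]
        del self._used[token]


manager = SpaceManager(capacity=100)
token = manager.reserve(60)
manager.put(token, 50)
print(manager.free())   # 40
manager.release(token)
print(manager.free())   # 100
```

Reserving before writing means a long transfer cannot fail halfway for lack of space, which is the practical benefit of preallocation and advance reservation.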