EDG Project: Database Management Services

Leanne Guy, for the EDG Data Management Work Package (EDG::WP2)
Leanne.Guy@cern.ch
http://cern.ch/leanne
17 April 2002, DAI Workshop Presentation

Information in EDG Databases

Databases are used within the EDG Project to store information pertaining to:
- File replication
  - Replica catalogue, replica location index
- Metadata
  - Application specific: physics collections, medical image headers
  - Grid middleware: storage element metadata, replication metadata
- Logging and bookkeeping
- Monitoring
  - Network statistics
  - Status of grid services
- Service lists
- and much more

Grid services using databases

- Logging and Bookkeeping service (WP1: Workload management)
  - Grid service that provides information about jobs: pending, running, current status, where submitted, etc.
  - Events stored in an SQL database
  - Relational model allows fast, complex querying
- Network performance information (WP7: Networking)
  - Measurement of network performance metrics: round trip time, packet loss, throughput
  - Needed by other grid services such as query optimisation
  - Information stored in an SQL database
  - Summary statistics, information and reports generated by querying
- Replica Catalogue (WP2: Data Management)
  - Contains LFN <-> PFN mappings
  - Associated metadata stored in an SQL database
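The point about the relational model allowing complex querying can be illustrated with a minimal sketch. It assumes a hypothetical `job_events` table (one row per job state change) held in an in-memory SQLite database; the table name, columns and sample data are illustrative and are not the actual Logging and Bookkeeping schema.

```python
import sqlite3

# Hypothetical schema: one row per job state-change event.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE job_events ("
    " job_id TEXT, status TEXT, site TEXT, seq INTEGER)"
)
conn.executemany(
    "INSERT INTO job_events VALUES (?, ?, ?, ?)",
    [
        ("job-1", "pending", "cern", 1),
        ("job-1", "running", "cern", 2),
        ("job-2", "pending", "lyon", 1),
        ("job-3", "pending", "cern", 1),
        ("job-3", "running", "cern", 2),
        ("job-3", "done", "cern", 3),
    ],
)


def current_status(conn):
    """Return {job_id: latest status}, using the event with the highest seq."""
    rows = conn.execute(
        "SELECT e.job_id, e.status FROM job_events e"
        " JOIN (SELECT job_id, MAX(seq) AS last"
        "       FROM job_events GROUP BY job_id) m"
        " ON e.job_id = m.job_id AND e.seq = m.last"
        " ORDER BY e.job_id"
    ).fetchall()
    return dict(rows)


def jobs_at_site(conn, site):
    """Return the sorted list of distinct jobs that were submitted to a site."""
    rows = conn.execute(
        "SELECT DISTINCT job_id FROM job_events WHERE site = ? ORDER BY job_id",
        (site,),
    ).fetchall()
    return [r[0] for r in rows]
```

A single join answers "what is every job doing now", which would need custom code over a flat event log.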

Services using databases (2)

Metadata
- Grid middleware services
  - Query optimisation service
  - Replica catalogue
- Header information for medical images
  - Nominative and anonymous, confidential and public
- Applications metadata
  - Collections, logical file sets

Spitfire: A Relational DB Service for the Grid

- Spitfire: Grid Metadata Service provided by EDG::WP2
  - Provides access to various implementations of database backends via a grid-enabled front end
  - Decouples the client from the RDBMS backend via a mediator
  - Interoperability, ease of use, performance and pluggability
- Atomic operations: consistency
- Core SQL functionality: update/insert/delete/query
- Web services
  - JDBC interface to RDBMS
  - SOAP interfaces
- Authorisation is role based
- jwget tool is provided for client-side command-line usage
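The mediator idea (clients never touch the RDBMS directly, and every request is checked against the caller's role) can be sketched as follows. This is not Spitfire's code or API: the `Mediator` class, the role table, the `files` table and the user names are all hypothetical, and SQLite stands in for the JDBC backend.

```python
import sqlite3

# Hypothetical role table: which core SQL operations each role may perform.
ROLE_PERMISSIONS = {
    "reader": {"query"},
    "writer": {"query", "insert", "update", "delete"},
}


class Mediator:
    """Sits between clients and the database; clients never see the connection."""

    def __init__(self, conn, user_roles):
        self._conn = conn
        self._user_roles = user_roles  # user name -> role name

    def _check(self, user, operation):
        role = self._user_roles.get(user)
        if operation not in ROLE_PERMISSIONS.get(role, set()):
            raise PermissionError(f"{user!r} may not {operation}")

    def insert(self, user, lfn, size):
        self._check(user, "insert")
        with self._conn:  # commit on success, roll back on error
            self._conn.execute("INSERT INTO files VALUES (?, ?)", (lfn, size))

    def query(self, user, lfn):
        self._check(user, "query")
        return self._conn.execute(
            "SELECT size FROM files WHERE lfn = ?", (lfn,)
        ).fetchone()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (lfn TEXT PRIMARY KEY, size INTEGER)")
mediator = Mediator(conn, {"alice": "writer", "bob": "reader"})
mediator.insert("alice", "lfn:run42", 1024)
```

Because clients depend only on the mediator's methods, the backend could be swapped without changing client code, which is the decoupling the slide describes.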

Spitfire architecture

[Figure: Spitfire layered architecture. From top to bottom: SOAP connections into the Connecting Layer (Global Spitfire Layer); SOAP connections to the Local Spitfire Layer (OracleLayer, DB2Layer, PGLayer, MyLayer); JDBC-enabled RDBMS backends (Oracle, DB2, PostGres, MySQL).]

Spitfire local layer architecture

[Figure: Spitfire local layer inside a Tomcat server. Components shown, in order: Client API, XML query, WSDL/SOAP, Security Servlet, MD Servlet, Translator Servlet, JDBC, DB Backend.]

Spitfire local layer API

Client APIs to facilitate request issue and administration:
- DB Administration API
  - Create / Delete Database
  - Create / Drop Table
- User Management API
  - Create / Delete / Update Role
  - Register / Unregister Users
- User Functionality API
  - Insert / Update / Delete / Select
  - Set Table Timestamp
  - Set Row Timestamp
  - Open Dedicated Connection
  - Close Connection
- DB Information service API
  - Quotas, Memory, Disk space
  - User Info & roles, Schemata

Spitfire global layer

Higher-level functionality. Not currently defined; will possibly include:
- Distributed querying
  - Interactions with other metadata servers on the grid
  - Biomedical applications will need to query several metadata catalogues to obtain lists of specific medical images
  - Requires a definition of common schemata and indices
- Replication / caching mechanisms
  - Replication of part or all of a database to another metadata service
- Expiration & cleanup
  - Automatic removal of stale information based on stored timestamps
- Transactions
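The "Expiration & cleanup" item can be sketched as a timestamp-based purge. Since the global layer is explicitly not yet defined, this is only an illustration under assumptions: the `metadata` table, its `updated_at` column and the `purge_stale` helper are hypothetical, and the current time is passed in explicitly so the behaviour is easy to follow.

```python
import sqlite3


def purge_stale(conn, now, max_age):
    """Delete rows whose stored timestamp is more than max_age seconds old.

    Returns the number of rows removed.
    """
    with conn:  # run the delete as one transaction
        cur = conn.execute(
            "DELETE FROM metadata WHERE updated_at < ?", (now - max_age,)
        )
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metadata (key TEXT, value TEXT, updated_at REAL)")
conn.executemany(
    "INSERT INTO metadata VALUES (?, ?, ?)",
    [("a", "old", 100.0), ("b", "fresh", 950.0), ("c", "older", 10.0)],
)

# With now=1000 and max_age=300, anything stamped before 700 is stale.
removed = purge_stale(conn, now=1000.0, max_age=300.0)
remaining = conn.execute("SELECT key FROM metadata").fetchall()
```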

Spitfire for biomedical applications

- Medical metadata: WP10
  - Use Spitfire to store metadata for medical images
  - Patient-specific & image-specific data
  - Database will contain LFNs of collections of images
  - Separation of confidential and non-confidential data
  - One SF metadata database instance per DICOM server
  - Queries on numerous distributed databases
- Example usage:
  - Patient queries the SF database for all her/his medical images
  - Physician queries the SF database for all records on one patient
  - Researcher queries the SF database for all ultrasound images

Spitfire for data management (1)

- Replica Catalogue
  - Stores PFN <-> LFN mappings
  - Metadata for files (PFNs) on the local storage element
    - File size
    - File status: durable, volatile, permanent
- Replica Catalogue Metadata Catalogue: RepMeC
  - Instance of Spitfire
  - Stores attributes of LFNs independent of any associated PFNs
    - Location of master and secondary copies for an LFN
    - VO-specific data, e.g. replication policy for LFNs
    - High-level access control
  - RepMeC API provides methods to define, store and extract metadata on replicas
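A PFN <-> LFN mapping with per-file metadata fits naturally into one relational table. The sketch below is an assumption-laden illustration, not the EDG replica catalogue schema: the `replicas` table, the helper names and the example LFNs and PFNs (on `example.org` hosts) are all made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE replicas ("
    " lfn TEXT NOT NULL,"          # logical file name, shared by all copies
    " pfn TEXT NOT NULL UNIQUE,"   # physical file name, one per copy
    " size INTEGER,"
    " status TEXT CHECK (status IN ('durable', 'volatile', 'permanent')))"
)
conn.executemany(
    "INSERT INTO replicas VALUES (?, ?, ?, ?)",
    [
        ("lfn:run42.root", "gsiftp://se1.example.org/data/run42.root", 2048, "permanent"),
        ("lfn:run42.root", "gsiftp://se2.example.org/cache/run42.root", 2048, "volatile"),
        ("lfn:run43.root", "gsiftp://se1.example.org/data/run43.root", 4096, "durable"),
    ],
)


def pfns_for(conn, lfn):
    """LFN -> all known physical copies, sorted by PFN."""
    rows = conn.execute(
        "SELECT pfn FROM replicas WHERE lfn = ? ORDER BY pfn", (lfn,)
    ).fetchall()
    return [r[0] for r in rows]


def lfn_for(conn, pfn):
    """PFN -> its logical file name, or None if the PFN is not registered."""
    row = conn.execute(
        "SELECT lfn FROM replicas WHERE pfn = ?", (pfn,)
    ).fetchone()
    return row[0] if row else None
```

The `CHECK` constraint keeps the status column within the three values listed on the slide.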

Spitfire status and releases

- Spitfire 1.2: April release
  - Integrated XSQL-Spitfire
  - Installation decoupled from Tomcat & MySQL
  - Tested to date with MySQL & PostgreSQL
  - Java API, SOAP interfaces designed, not currently provided
- Spitfire 1.3: end May 2002
  - Alpha release of SOAP interface
  - Schema and User Management
  - Test suite designed, with JUnit-based tests implemented
- Spitfire 1.4: end September (testbed 2.0)
  - Version 1.0: Full integration of all components
  - More running applications

Service Index: problem statement

In a distributed system such as a DataGrid:
- We want to maintain and query dynamic and timely information about Grid services, resources and virtual organisations.
- How should a database node maintain information populated from a large variety of unreliable, frequently changing, autonomous and heterogeneous remote data sources?
- In particular, how should it do so without sacrificing reliability, predictability and simplicity?
- How can powerful queries be expressed over time-sensitive dynamic information?

Use a Hyperlink Registry:
- A registry node has a database that holds a set of tuples
- A tuple may contain a piece of content, e.g. a service description, file, picture, network load information, storage element information

Hyperlink registry properties

- The XML data model allows for structured and semi-structured data, which is important for integration of heterogeneous content.
- The XQuery language allows for powerful searching, which is critical for non-trivial applications.
- Database state maintenance based on soft state enables reliable, predictable and simple content integration from a large number of autonomous distributed content providers.
- Content link, content cache and a hybrid pull/push communication model allow for a wide range of dynamic content freshness policies, driven by all three system components: content provider, hyper registry and client.
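Soft-state maintenance means a tuple survives only while its provider keeps refreshing it, so a crashed or unreachable provider disappears from the registry without explicit cleanup. The sketch below is a deliberately simplified illustration of that idea, not the hyper registry implementation: the `SoftStateRegistry` class, its single time-to-live and the explicit `now` arguments are assumptions.

```python
class SoftStateRegistry:
    """Toy registry: every tuple expires unless its provider re-publishes it."""

    def __init__(self, ttl):
        self._ttl = ttl        # seconds a tuple stays valid after publication
        self._tuples = {}      # content link -> (content, expiry time)

    def publish(self, link, content, now):
        """Insert a new tuple or refresh an existing one."""
        self._tuples[link] = (content, now + self._ttl)

    def query(self, now):
        """Drop expired tuples, then return the links that are still alive."""
        self._tuples = {
            link: entry for link, entry in self._tuples.items() if entry[1] > now
        }
        return sorted(self._tuples)


registry = SoftStateRegistry(ttl=60)
registry.publish("http://se1.example.org/rc", "replica catalogue", now=0)
registry.publish("http://se2.example.org/rc", "replica catalogue", now=0)
# Only the first provider refreshes its tuple before the 60 s lifetime ends.
registry.publish("http://se1.example.org/rc", "replica catalogue", now=50)
alive = registry.query(now=70)
```

At time 70 only the refreshed tuple remains; the registry never had to contact the silent provider to find out it was gone.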

Service discovery example

Find all services that implement a replica catalog service interface, that CMS members are allowed to use, and that have an HTTP binding for the replica catalog operation XML getpfns(string LFN).

XQuery:

LET $repcat := "http://gridforum.org/interface/replicacatalog-1.0"
FOR $tuple IN /tupleset/tuple[@type="service"]
WHERE SOME $op IN $tuple/content/service/interface[@type = $repcat]/operation
      SATISFIES ($op/name="xml getpfns(string LFN)"
                 AND $op/bindhttp/@verb="get"
                 AND contains($op/allow, "http://cms.cern.ch/everybody"))
RETURN $tuple
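For readers without an XQuery engine to hand, the same selection logic can be sketched in Python with the standard `xml.etree.ElementTree` module. The sample tuple set below is hypothetical: its element layout is inferred from the paths used in the query, and the `link` attribute and `example.org` hosts are made up for illustration.

```python
import xml.etree.ElementTree as ET

REPCAT = "http://gridforum.org/interface/replicacatalog-1.0"

# Hypothetical tuple set; structure inferred from the paths in the XQuery.
SAMPLE = """
<tupleset>
  <tuple type="service" link="http://se1.example.org/rc">
    <content><service>
      <interface type="http://gridforum.org/interface/replicacatalog-1.0">
        <operation>
          <name>xml getpfns(string LFN)</name>
          <bindhttp verb="get"/>
          <allow>http://cms.cern.ch/everybody</allow>
        </operation>
      </interface>
    </service></content>
  </tuple>
  <tuple type="service" link="http://se2.example.org/rc">
    <content><service>
      <interface type="http://gridforum.org/interface/replicacatalog-1.0">
        <operation>
          <name>xml getpfns(string LFN)</name>
          <bindhttp verb="post"/>
          <allow>http://cms.cern.ch/everybody</allow>
        </operation>
      </interface>
    </service></content>
  </tuple>
  <tuple type="file" link="http://se1.example.org/readme.txt"/>
</tupleset>
"""


def operation_matches(op):
    """True if one <operation> satisfies all three conditions of the query."""
    bind = op.find("bindhttp")
    return (
        op.findtext("name") == "xml getpfns(string LFN)"
        and bind is not None
        and bind.get("verb") == "get"
        and "http://cms.cern.ch/everybody" in (op.findtext("allow") or "")
    )


def find_services(xml_text):
    """Return the link of every service tuple with at least one matching operation."""
    root = ET.fromstring(xml_text)  # root element is <tupleset>
    matches = []
    for tup in root.findall("tuple"):
        if tup.get("type") != "service":
            continue
        ops = [
            op
            for iface in tup.findall("content/service/interface")
            if iface.get("type") == REPCAT
            for op in iface.findall("operation")
        ]
        if any(operation_matches(op) for op in ops):  # SOME ... SATISFIES
            matches.append(tup.get("link"))
    return matches
```

The `any(...)` call plays the role of the query's `SOME ... SATISFIES` clause: a tuple is returned as soon as one of its operations meets every condition.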

HEP: Monte Carlo event production

- Detector and physics event simulation
  - Evaluate reconstruction software, assist detector design
  - Estimate eventual requirements for storage and CPU
- Characteristics
  - Scheduled, coordinated activity; sites assigned a specific task
  - Lots of data output: O(10^8) events output for a given dataset
- Goal: maximise throughput
- Simulation use case steps:
  - Generate job script
    - Event type, kinematical parameters, software versions
    - Estimate of max/min output file size, job CPU time
  - Identify CE and SE with sufficient space for output datasets
  - Move job to CE, execute job
  - Move output files to final storage; register output data with grid (optional)
  - Store job metadata in a database
    - CE & SE node, provenance data, date run, code version/release, platform

HEP: Event reconstruction

- Characteristics:
  - Distributed access
  - Input data: output of earlier Monte Carlo data production
  - Output data size typically smaller than input data set
- Event reconstruction use case steps:
  - Generate job script
    - Input LFNs where known, software version
  - Submit job on local host to run on a remote host
  - Grid services identify best CE for job, based on distribution of physical instances of input LFNs and other criteria, e.g. software version
  - Replica Manager identifies PFNs for input LFNs
  - Move PFNs to designated execution site
  - Run reconstruction job
  - User monitors job progress on remote host from local host
  - Move output data to final storage site
    - Storage site may be different from execution and submission site
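The "identify best CE" step can be sketched as a simple ranking: among sites that have the required software version, prefer the one that already holds the most input replicas, so that the least data has to move. This is an illustrative heuristic only, not the EDG resource broker algorithm; the function name, the site names and the data layout are assumptions.

```python
def choose_ce(input_lfns, replica_sites, ce_software, required_version):
    """Pick the site with the required software that holds the most input replicas.

    replica_sites: LFN -> set of sites holding a physical copy
    ce_software:   site -> set of installed software versions
    Returns None if no site has the required version.
    """
    best_site, best_count = None, -1
    for site in sorted(ce_software):  # sorted gives a deterministic tie-break
        if required_version not in ce_software[site]:
            continue
        count = sum(1 for lfn in input_lfns if site in replica_sites.get(lfn, ()))
        if count > best_count:
            best_site, best_count = site, count
    return best_site


replica_sites = {
    "lfn:a": {"cern", "lyon"},
    "lfn:b": {"lyon"},
    "lfn:c": {"ral"},
}
ce_software = {"cern": {"v1", "v2"}, "lyon": {"v2"}, "ral": {"v1"}}
inputs = ["lfn:a", "lfn:b", "lfn:c"]

site_for_v2 = choose_ce(inputs, replica_sites, ce_software, "v2")
```

Here `lyon` holds two of the three inputs and has version `v2`, so it wins over `cern`, which holds only one.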

HEP: Physics data analysis

- Characteristics:
  - Many physicists accessing the same data sets
  - Entire job is only as fast as the slowest event analysis
  - Chaotic and random access patterns
  - Input datasets not always known a priori
- Goal: minimise latency
- Physics analysis use case steps:
  - User queries metadata to extract LFN list for an analysis
    - Input data could be reconstructed MC or detector ("real") data
  - Resource broker identifies site for job execution
  - Replica manager moves data to site for job execution
  - Run analysis job
  - Move output data to final SE
  - Register output data with grid

BioMedical usage models (1)

- Analysis of a set of melanoma medical images
  - Images will contain nominative and anonymous or proprietary information
  - Confidential patient and physician details
  - Cannot replicate in a grid environment!
- Medical analysis use case steps:
  - Physician queries a metadata service (Spitfire) for interesting images
    - Match analysis parameters: year, melanoma type, etc.
  - Submit job to analyse images
  - Image files located only on hospital DICOM servers need to be replicated to grid-aware storage
    - All nominative information must be stripped from data prior to replication
    - Header-to-image mappings stored in a local metadata database (Spitfire)
  - Identify CE and run job
  - Output data files may be registered with the Grid
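The stripping step can be sketched as splitting each image header into a public part, which may be replicated, and a nominative part, which stays in the local mapping. The field names, the `anonymise` helper and the use of a random key are assumptions for illustration; real DICOM headers and the actual WP10 procedure are not modelled here.

```python
import uuid

# Hypothetical set of header fields that identify a person.
NOMINATIVE_FIELDS = {"patient_name", "patient_id", "physician"}


def anonymise(header, local_mapping):
    """Return the replicable part of a header; keep the rest in local_mapping.

    An opaque random key links the two halves, so the public part carries
    no patient information but the hospital can still resolve it.
    """
    image_key = uuid.uuid4().hex
    public = {k: v for k, v in header.items() if k not in NOMINATIVE_FIELDS}
    private = {k: v for k, v in header.items() if k in NOMINATIVE_FIELDS}
    public["image_key"] = image_key
    local_mapping[image_key] = private  # never leaves the hospital side
    return public


local_mapping = {}
header = {
    "patient_name": "Jane Doe",
    "patient_id": "P-0001",
    "physician": "Dr. Example",
    "year": 2001,
    "melanoma_type": "superficial spreading",
}
public_header = anonymise(header, local_mapping)
```

Only `public_header` would accompany the image to grid storage; `local_mapping` is what later allows a result to be traced back to the examining physician.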

BioMedical usage models (2)

Example: subsequent to an analysis, a misdiagnosis is discovered
- A melanoma is in fact malignant and not benign
- Researcher who performed the analysis must inform the examining physician
- All nominative information was stripped from image files
- Researcher updates a metadata database diagnosis field
  - Update reflects new diagnosis and triggers an action
  - Header information is found and the examining physician is informed
- Researcher has no access to nominative information but has communicated the analysis results
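One way an update could "trigger an action" is a database trigger that records every diagnosis change for later follow-up on the hospital side. The sketch below uses SQLite's `CREATE TRIGGER` as a stand-in; the tables, the trigger and the `img-1` key are hypothetical, and the slides do not say how Spitfire would implement this.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE image_meta (image_key TEXT PRIMARY KEY, diagnosis TEXT);
    CREATE TABLE notifications (image_key TEXT, old_diag TEXT, new_diag TEXT);

    -- Record every real change of the diagnosis field.
    CREATE TRIGGER diagnosis_changed
    AFTER UPDATE OF diagnosis ON image_meta
    WHEN OLD.diagnosis <> NEW.diagnosis
    BEGIN
        INSERT INTO notifications
        VALUES (NEW.image_key, OLD.diagnosis, NEW.diagnosis);
    END;
    """
)

conn.execute("INSERT INTO image_meta VALUES ('img-1', 'benign')")

# The researcher sees only the opaque key, never the patient's identity.
conn.execute(
    "UPDATE image_meta SET diagnosis = 'malignant' WHERE image_key = 'img-1'"
)
pending = conn.execute("SELECT * FROM notifications").fetchall()
```

A hospital-side process could then resolve each `image_key` in `notifications` through its local header mapping and inform the examining physician.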

Documents and references

WP2 Documents
1. http://cern.ch/grid-data-management/publications.html

Spitfire
1. http://cern.ch/hep-proj-spitfire

Service Index
1. Wolfgang Hoschek, "A Unified Peer-to-Peer Database Framework for XQueries over Dynamic Distributed Content and its Application for Scalable Service Discovery", PhD Thesis, Technical University of Vienna (submitted), 2002.
2. Wolfgang Hoschek, "A Data Model and Query Language for Service Discovery", Data Grid TechReport DataGrid-02-TED-0409, April 2002.