A Grid Data Integration Service (OGSA-DQP)




Paul Watson, University of Newcastle-upon-Tyne
Based on the work of:
- Norman Paton, Tasos Gounaris, Alvaro Fernandes, Rizos Sakellariou (University of Manchester)
- Jim Smith, Arijit Mukherjee, Paul Watson (University of Newcastle-upon-Tyne)
www.neresc.ac.uk

The Problem
- Many grid applications would benefit from access to distributed data
- Data sources are scattered and autonomous
- Integration is often done by a tedious manual process or (recently) by hand-coded workflows
- We are interested in how to simplify the process of querying distributed data
  - Focussing initially on information held in (relational) databases

Distributed Query Processing
- Queries are expressed in OQL
  - allows computations to be included in the query
- A single query may reference data at multiple sites
  - the data locations may be transparent to the query author

select p.proteinid, Blast(p.sequence)
from protein p, proteinterm t
where t.termid = S92 and p.proteinid = t.proteinid
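To make the meaning of the example query concrete, the sketch below evaluates the same join, filter and function call over small in-memory tables. It is only an illustration of the query's semantics, not of how OGSA-DQP executes it: the function `run_query`, the row layout (dictionaries) and the stand-in `blast` function (which just returns the sequence length) are all assumptions made for this example.

```python
def blast(sequence):
    # Hypothetical stand-in for the Blast analysis call in the query;
    # it returns the sequence length so the example stays self-contained.
    return len(sequence)


def run_query(protein, proteinterm, termid):
    """Evaluate the slide's query as a plain nested-loop join.

    Equivalent to:
        select p.proteinid, Blast(p.sequence)
        from protein p, proteinterm t
        where t.termid = <termid> and p.proteinid = t.proteinid
    """
    return [
        (p["proteinid"], blast(p["sequence"]))
        for p in protein
        for t in proteinterm
        if t["termid"] == termid and p["proteinid"] == t["proteinid"]
    ]


if __name__ == "__main__":
    # Made-up rows; in the slides the two tables may live at different sites.
    protein = [
        {"proteinid": 1, "sequence": "MKV"},
        {"proteinid": 2, "sequence": "MKVLA"},
    ]
    proteinterm = [
        {"proteinid": 2, "termid": "S92"},
        {"proteinid": 1, "termid": "X1"},
    ]
    print(run_query(protein, proteinterm, "S92"))
```

The query author writes only the declarative form; where the two tables are stored and where `Blast` runs are left to the query processor.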

Query Compiler
- OGSA-DQP automatically compiles and executes the query on a set of Grid nodes, in parallel where possible
[Figure: query compiler pipeline. Labels, in order: OQL, Parser, Logical Optimiser, Physical Optimiser, Single-node optimiser, Multi-node optimiser, Partitioner, Scheduler, Evaluator]

Execution Plan

select p.proteinid, Blast(p.sequence)
from protein p, proteinterm t
where t.termid = S92 and p.proteinid = t.proteinid

- The plan is split into a set of partitions
- Grid resources are acquired to execute the partitions in parallel where possible, required and affordable
[Figure: execution plan tree. Operator labels: table_scan (protein), table_scan termid=s92 (proteinterm), reduce, exchange, hash_join (proteinid), op_call (Blast). Node labels: 1, 2, 3-8, 9,10]
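One way to picture how a plan is split into partitions is to cut the operator tree at every exchange operator. The sketch below does that for a small tree built from the operator names in the figure. The exact tree shape, the `Op` class and the functions `partition` and `example_plan` are assumptions for illustration; the slides do not show how OGSA-DQP's partitioner is implemented.

```python
from dataclasses import dataclass, field


@dataclass
class Op:
    """A plan operator with zero or more input operators (hypothetical type)."""
    name: str
    children: list = field(default_factory=list)


def partition(root):
    """Split a plan into partitions, cutting at every exchange operator.

    Returns a list of partitions, each a list of operator names from the
    top of the partition downwards. Assumes the root is not an exchange.
    """
    partitions = []

    def walk(op, current):
        current.append(op.name)
        for child in op.children:
            if child.name == "exchange":
                # An exchange is a partition boundary: each of its inputs
                # starts a new partition that can run on another node.
                for below in child.children:
                    fresh = []
                    partitions.append(fresh)
                    walk(below, fresh)
            else:
                walk(child, current)

    top = []
    partitions.append(top)
    walk(root, top)
    return partitions


def example_plan():
    # Assumed shape: two scans feed a hash join through exchanges,
    # and the Blast call runs above the join.
    scan_p = Op("reduce", [Op("table_scan(protein)")])
    scan_t = Op("reduce", [Op("table_scan(proteinterm)")])
    join = Op("hash_join(proteinid)",
              [Op("exchange", [scan_p]), Op("exchange", [scan_t])])
    return Op("reduce", [Op("op_call(Blast)", [join])])


if __name__ == "__main__":
    for i, part in enumerate(partition(example_plan())):
        print(i, part)
```

With this assumed tree the two scans end up in their own partitions, which could run next to the databases, while the join and the Blast call form a third partition.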

Evaluation on the Grid
- OGSA-DQP builds on OGSA-DAI
  - accesses relational databases wrapped by OGSA-DAI
  - Oracle, DB2, MySQL
- Data streams between nodes
  - flow control
- All services are OGSI-compliant
  - built on GT3
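The slide mentions flow control on the data streams between nodes without saying how it is done. A common way to get that behaviour is a bounded buffer: the producer blocks when the consumer falls behind. The sketch below shows this with Python's standard `queue.Queue` and a thread. It is an assumed mechanism for illustration only, and the function `stream` and its `buffer_size` parameter are hypothetical names, not part of OGSA-DQP.

```python
import queue
import threading


def stream(tuples, buffer_size=2):
    """Pass tuples from a producer thread to the caller via a bounded buffer.

    The bounded queue provides the flow control: put() blocks while the
    buffer is full, so the producer cannot run ahead of the consumer.
    """
    buf = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel marking the end of the stream

    def producer():
        for t in tuples:
            buf.put(t)  # blocks when buffer_size items are waiting
        buf.put(done)

    threading.Thread(target=producer, daemon=True).start()

    received = []
    while True:
        item = buf.get()
        if item is done:
            break
        received.append(item)
    return received


if __name__ == "__main__":
    print(stream([("p1", 3), ("p2", 5), ("p3", 8)]))
```

A smaller `buffer_size` keeps less data in flight between the two sides, and a larger one lets the producer run further ahead.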

Execution on the Grid
[Figure: service interactions during query execution across nodes N0 to N4. Service labels: Client, GDS, GDT, GDQS, GQESF (Factory), GQES1, GQES2, GQES3, Web Services (BLAST). Call labels: perform(query), createservice, perform(querysubplan), results. Operator labels: sequential_scan, sequential_scan (term=8372), reduce (proteinid), reduce (proteinid,sequence), hash_join (p.proteinid=t.proteinid), operation_call blast(p.sequence), reduce (p.proteinid, blast)]

Mutual Benefit
- The Grid needs DQP:
  - Declarative, high-level resource integration with implicit parallelism
- DQP needs the Grid:
  - Systematic access to remote data and computational resources
  - Cost based optimisation
  - Dynamic resource discovery and allocation

Summary
- DQP is a potentially important technology for the Grid
- OGSA-DQP supports:
  - declarative expression of queries
  - location transparency
  - access to both data and computational resources
  - dynamic deployment on Grid resources
  - implicit parallelism
- First release made in September 2003
  - available for download
- Dynamic adaptation now being investigated
  - fault-tolerance, performance, cost

Experiences and Issues
- Remote service deployment not yet available for Grids, but some work is under way:
  - PhD Project at Newcastle (Chris Fowler)
    - dynamically deploy individual services remotely
    - initial prototype by end of November 2003
    - working on security issues
    - WS only
  - GridShed project (Newcastle + BT)
    - design of hosting environments for Grids
    - install execution images on nodes as required

Experiences & Issues
- DQP vs Workflow?
  - for what space of problems is each better
- DQP advantages?
  - declarative expression of intent
  - cost-based choice of execution plans
  - implicit parallelisation
- Investigating with Bioinformatics applications in the myGrid project
  - DQP with workflows & workflows with DQP

Projects/Sponsors
- Projects: OGSA-DAI, Polar, Polar*, myGrid
- Sponsors: [logos not reproduced in the transcript]