Grid and Cloud Database Management





Sandro Fiore, Giovanni Aloisio (Editors)

Editors

Sandro Fiore, Ph.D.
Faculty of Engineering, Department of Innovation Engineering
University of Salento
Via per Monteroni, 73100 Lecce, Italy
and
Euro Mediterranean Center for Climate Change (CMCC)
Via Augusto Imperatore 16, 73100 Lecce, Italy
sandro.fiore@unisalento.it

Prof. Giovanni Aloisio
Faculty of Engineering, Department of Innovation Engineering
University of Salento
Via per Monteroni, 73100 Lecce, Italy
and
Euro Mediterranean Center for Climate Change (CMCC)
Via Augusto Imperatore 16, 73100 Lecce, Italy
giovanni.aloisio@unisalento.it

ISBN 978-3-642-20044-1
e-ISBN 978-3-642-20045-8
DOI 10.1007/978-3-642-20045-8
Springer Heidelberg Dordrecht London New York
Library of Congress Control Number: 2011929352
ACM Computing Classification (1998): C.2, H.2, H.3, J.2, J.3

© Springer-Verlag Berlin Heidelberg 2011

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.

The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

Cover design: deblik, Berlin
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)

Preface

Since the 1960s, database systems have played a relevant role in the information technology field. By the mid-1960s, several systems were also available for commercial purposes. Hierarchical and network database systems provided two different perspectives and data models to organize data collections. In 1970, E. F. Codd published the paper "A Relational Model of Data for Large Shared Data Banks," proposing a model relying on relational table structures. Relational databases became appealing to industry in the 1980s, and their wide adoption fostered new research and development activities toward advanced data models such as the object-oriented and the extended relational ones. The online transaction processing (OLTP) support provided by relational database systems was fundamental to the success of this data model.

Even though traditional operational systems were the best solution for managing transactions, new needs related to data analysis and decision support tasks led, in the late 1980s, to a new architectural model called the data warehouse. It includes extraction, transformation, and loading (ETL) primitives and online analytical processing (OLAP) support to analyze data. From OLTP to OLAP, from transaction to analysis, from data to information, from the entity-relationship data model to a star/snowflake one, and from a customer-oriented perspective to a market-oriented one, data warehouses emerged as a data repository architecture for performing data analysis and mining tasks. Relational, object-oriented, transactional, spatiotemporal, and multimedia data warehouses are some examples of database sources. Yet the World Wide Web can be considered another fundamental, distributed data source (in the Web 2.0 era it stores information about user preferences, navigation, and access patterns that is crucial from a market perspective).
Accessing and processing large amounts of data distributed across several countries requires a huge amount of computational power, storage, middleware services, specifications, and standards. Since the 1990s, thanks to Ian Foster and Carl Kesselman, grid computing has emerged as a revolutionary paradigm to access and manage distributed, heterogeneous, and geographically spread resources, promising computer power as easy to access as an electric power grid. The term "resources" also includes databases, yet successful grid database management research efforts started only after 2000. Later on, around 2007, a new paradigm named cloud computing brought the promise of providing easy and inexpensive access to remote hardware and storage resources. Exploiting pay-per-use models and virtualization for resource provisioning, cloud computing has been rapidly accepted and used by researchers, scientists, and industries.

Grid and cloud computing are exciting paradigms, and how they deal with database management is the key topic of this book. By exploring current and future developments in this area, the book tries to provide a thorough understanding of the principles and techniques involved in these fields.

The idea of writing this book dates back to a tutorial on Grid Database Management that was organized at the 4th International Conference on Grid and Pervasive Computing (GPC 2009), held in Geneva (4-8 May 2009). Following up on an initial idea from Ralf Gerstner (Springer Senior Editor, Computer Science), we decided to act as editors of the book. We invited internationally recognized experts, asking them to contribute on challenging topics related to grid and cloud database management. After two review steps, 16 chapters were accepted for publication.

Ultimately, the book provides the reader with a collection of chapters dealing with open standards and specifications (Sect. 1), research efforts on grid database management (Sect. 2), cloud data management (Sect. 3), and some scientific case studies (Sect. 4). The presented topics are well balanced and complementary. They range from well-known research projects and real case studies to standards and specifications, as well as to nonfunctional aspects such as security, performance, and scalability, showing how these can be effectively addressed in grid- and cloud-based environments.

Section 1 discusses the open standards and specifications related to grid and cloud data management. In particular, Chap. 1 presents an overview of the WS-DAI family of specifications, the motivation for defining them, and their relationships with other OGF and non-OGF standards. Chapter 2, in turn, outlines the OCCI specifications and demonstrates (by presenting three interesting use cases) how they can be used in data management-related setups.

Section 2 presents three relevant research efforts on grid database management systems. Chapter 3 provides a complete overview of the Grid Relational Catalog (GRelC) Project, a grid database research effort started in 2001. The project's main features, its interoperability with gLite-based production grids, and a relevant showcase in the environmental domain are also presented. Chapter 4 provides a complete overview of the OGSA-DAI framework, the main components for distributed data management via workflows, distributed query processing, and the most relevant security and performance aspects. Chapter 5 gives a detailed overview of the architecture and implementation of DASCOSA-DB. A complete description of novel features, developed to support typical data-intensive applications running on a grid system, is also presented.

Section 3 provides a wide overview of several cloud data management topics. Some chapters (Chaps. 6 to 8) focus specifically on database aspects, whereas the remaining ones (Chaps. 9 to 12) are wider in scope and address more general cloud data management issues. In this second case, the way these concepts apply to the database world is clarified through practical examples or comments provided by the authors. In particular, Chap. 6 proposes a new security technique to measure the trustiness of cloud resources. Using the metadata of resources and access policies, the technique builds privilege chains and binds authorization policies to compute the trustiness of cloud database management. Chapter 7 presents a method for managing dirty data and obtaining query results with quality assurance over such data. A dirty database storage structure for cloud databases is presented along with a multilevel index structure for query processing on dirty data. Chapter 8 examines column-oriented databases in virtual environments and provides evidence that they can benefit from virtualization in cloud and grid computing scenarios. Chapter 9 introduces a Windows Azure case study demonstrating the advantages of cloud computing and how the generic resources offered by cloud providers can be integrated to produce a large dynamic data store. Chapter 10 presents CloudMiner, which offers a cloud of data services running on a cloud service provider infrastructure. An example related to database management exploiting OGSA-DAI is also discussed. Chapter 11 defines the requirements of e-science provenance systems and presents a novel solution addressing these requirements, named the Vienna e-Science Provenance System (VePS). Chapter 12 examines the state of the art of workload management for data-intensive computing in clouds. It presents a taxonomy of workload management for data-intensive computing in the cloud and uses that taxonomy to classify and evaluate current workload management mechanisms.

Section 4 presents a set of scientific use cases connected with genomics, health, disaster monitoring, and Earth science. In particular, Chap. 13 explores the implementation of an algorithm, often used to analyze microarray data, on top of an intelligent runtime that abstracts away the hard parts of file tracking and scheduling in a distributed system. This novel formulation is compared with a traditional method of expressing data-parallel computations in a distributed environment using explicit message passing. Chapter 14 describes the use of grid technologies for satellite data processing and management within the international disaster monitoring projects carried out by the Space Research Institute NASU-NSAU, Ukraine (SRI NASU-NSAU). Chapter 15 presents the CDM ActiveStorage infrastructure, a scalable and inexpensive transparent data cube for interactive analysis and high-resolution mapping of environmental and remote sensing data. Finally, Chap. 16 presents a mechanism for distributed storage of multidimensional EEG time series obtained from epilepsy patients on a cloud computing infrastructure (Hadoop cluster) using a column-oriented database (HBase).

The bibliography of the book covers the essential reference material. The aim is to convey useful information to interested readers, including researchers actively involved in the field, students (both undergraduate and graduate), system designers, and programmers.

The book may serve as both an introduction and a technical reference for grid and cloud database management topics. Our desire and hope is that it will prove useful to readers exploring the main subject, as well as the research and industry efforts involved, and that it will contribute to new advances in this scientific field.

Lecce, February 2010
Sandro Fiore
Giovanni Aloisio

Contents

Part I  Open Standards and Specifications

1  Open Standards for Service-Based Database Access and Integration
   Steven Lynden, Oscar Corcho, Isao Kojima, Mario Antonioletti, and Carlos Buil-Aranda
2  Open Cloud Computing Interface in Data Management-Related Setups
   Andrew Edmonds, Thijs Metsch, and Alexander Papaspyrou

Part II  Research Efforts on Grid Database Management

3  The GRelC Project: From 2001 to 2011, 10 Years Working on Grid-DBMSs
   Sandro Fiore, Alessandro Negro, and Giovanni Aloisio
4  Distributed Data Management with OGSA-DAI
   Michael J. Jackson, Mario Antonioletti, Bartosz Dobrzelecki, and Neil Chue Hong
5  The DASCOSA-DB Grid Database System
   Jon Olav Hauglid, Norvald H. Ryeng, and Kjetil Nørvåg

Part III  Cloud Data Management

6  Access Control and Trustiness for Resource Management in Cloud Databases
   Jong P. Yoon
7  Dirty Data Management in Cloud Database
   Hongzhi Wang, Jianzhong Li, Jinbao Wang, and Hong Gao
8  Virtualization and Column-Oriented Database Systems
   Ilia Petrov, Vyacheslav Polonskyy, and Alejandro Buchmann
9  Scientific Computation and Data Management Using Microsoft Windows Azure
   Steven Johnston, Simon Cox, and Kenji Takeda
10  The CloudMiner
    Andrzej Goscinski, Ivan Janciak, Yuzhang Han, and Peter Brezany
11  Provenance Support for Data-Intensive Scientific Workflows
    Fakhri Alam Khan and Peter Brezany
12  Managing Data-Intensive Workloads in a Cloud
    R. Mian, P. Martin, A. Brown, and M. Zhang

Part IV  Scientific Case Studies

13  Managing and Analysing Genomic Data Using HPC and Clouds
    Bartosz Dobrzelecki, Amrey Krause, Michal Piotrowski, and Neil Chue Hong
14  Grid Technologies for Satellite Data Processing and Management Within International Disaster Monitoring Projects
    Nataliia Kussul, Andrii Shelestov, and Sergii Skakun
15  Transparent Data Cube for Spatiotemporal Data Mining and Visualization
    Mikhail Zhizhin, Dmitry Medvedev, Dmitry Mishin, Alexei Poyda, and Alexander Novikov
16  Distributed Storage of Large-Scale Multidimensional Electroencephalogram Data Using Hadoop and HBase
    Haimonti Dutta, Alex Kamil, Manoj Pooleery, Simha Sethumadhavan, and John Demme

Index