Query Processing and Optimization in Distributed Databases. Erdem USLU Nihal SARMAŞIK

Similar documents
Topics. Distributed Databases. Desirable Properties. Introduction. Distributed DBMS Architectures. Types of Distributed Databases

chapater 7 : Distributed Database Management Systems

DISTRIBUTED AND PARALLELL DATABASE

Distributed Databases. Concepts. Why distributed databases? Distributed Databases Basic Concepts

Fragmentation and Data Allocation in the Distributed Environments

Physical Database Design and Tuning

1. Physical Database Design in Relational Databases (1)

Distributed Data Management

Distributed Architectures. Distributed Databases. Distributed Databases. Distributed Databases

Distributed Database Management Systems

Distributed Databases in a Nutshell

Objectives. Distributed Databases and Client/Server Architecture. Distributed Database. Data Fragmentation

Chapter 22: Distributed Databases

VII. Database System Architecture

Distributed Databases

Principles of Distributed Database Systems

Distributed Databases

An Overview of Distributed Databases

A Review of Database Schemas

Chapter 3: Distributed Database Design

Data Management in the Cloud. Zhen Shi

AN OVERVIEW OF DISTRIBUTED DATABASE MANAGEMENT

Chapter Outline. Chapter 2 Distributed Information Systems Architecture. Middleware for Heterogeneous and Distributed Information Systems

Introduction to Parallel and Distributed Databases

Distributed Databases. Fábio Porto LBD winter 2004/2005

Chapter 5: Overview of Query Processing

CS2Bh: Current Technologies. Introduction to XML and Relational Databases. The Relational Model. The relational model

Distributed Database Design (Chapter 5)

Contents RELATIONAL DATABASES

AHAIWE Josiah Information Management Technology Department, Federal University of Technology, Owerri - Nigeria jahaiwe@yahoo.

In Memory Accelerator for MongoDB

low-level storage structures e.g. partitions underpinning the warehouse logical table structures

How To Build A Fault Tolerant Mythrodmausic Installation

Object Oriented Databases. OOAD Fall 2012 Arjun Gopalakrishna Bhavya Udayashankar

Chapter 2: DDBMS Architecture

BBM467 Data Intensive ApplicaAons

When an organization is geographically dispersed, it. Distributed Databases. Chapter 13-1 LEARNING OBJECTIVES INTRODUCTION

Distributed Databases

Concepts of Database Management Seventh Edition. Chapter 9 Database Management Approaches

Objectives of Lecture. Network Architecture. Protocols. Contents


PARALLELS CLOUD STORAGE

A Comparison of Approaches to Large-Scale Data Analysis

Query Optimization for Distributed Database Systems Robert Taylor Candidate Number : Hertford College Supervisor: Dr.

MongoDB and Couchbase

Query Processing in Data Integration Systems

TECHNIQUES FOR DATA REPLICATION ON DISTRIBUTED DATABASES

A Novel Data Placement Model for Highly-Available Storage Systems

SODDA A SERVICE-ORIENTED DISTRIBUTED DATABASE ARCHITECTURE

Distributed Database Design

Principles and characteristics of distributed systems and environments

LiTH, Tekniska högskolan vid Linköpings universitet 1(7) IDA, Institutionen för datavetenskap Juha Takkinen

B.Com(Computers) II Year DATABASE MANAGEMENT SYSTEM UNIT- V

Chapter 6: Physical Database Design and Performance. Database Development Process. Physical Design Process. Physical Database Design

Data Management in the Cloud

Data Modeling. Database Systems: The Complete Book Ch ,

Availability Digest. Data Deduplication February 2011

Big Data Analytics. Lucas Rego Drumond

Advanced Oracle SQL Tuning

Technologies & Applications

Advanced Database Group Project - Distributed Database with SQL Server

8 9 +

Advanced Database Management Systems

CitusDB Architecture for Real-Time Big Data

A B S T R A C T. Index Terms : Apache s Hadoop, Map/Reduce, HDFS, Hashing Algorithm. I. INTRODUCTION

Who am I? Copyright 2014, Oracle and/or its affiliates. All rights reserved. 3

DESIGN OF FILE SYSTEM ARCHITECTURE WITH CLUSTER FORMATION ALONG WITH MOUNT TABLE

Part 5: More Data Structures for Relations

A SURVEY OF POPULAR CLUSTERING TECHNOLOGIES

14 Databases. Source: Foundations of Computer Science Cengage Learning. Objectives After studying this chapter, the student should be able to:


The Relational Model. Why Study the Relational Model? Relational Database: Definitions. Chapter 3

Central vs. Distributed Systems

H4 DATABASE DEVELOPMENT SOLUTIONS & MARKING SCHEME JUNE 2013

Database Scalability {Patterns} / Robert Treat

Outline. Mariposa: A wide-area distributed database. Outline. Motivation. Outline. (wrong) Assumptions in Distributed DBMS

not necessarily strictly sequential feedback loops exist, i.e. may need to revisit earlier stages during a later stage

A Generic Model for Querying Multiple Databases in a Distributed Environment Using JDBC and an Uniform Interface

Secure Data Transfer and Replication Mechanisms in Grid Environments p. 1

Distributed Database Management Systems

Distributed and Parallel Database Systems

Inside the PostgreSQL Query Optimizer

Topics in basic DBMS course

Active/Active DB2 Clusters for HA and Scalability

VALLIAMMAI ENGNIEERING COLLEGE SRM Nagar, Kattankulathur

Using Temporary Tables to Improve Performance for SQL Data Services

Chapter 5: Logical Database Design and the Relational Model Part 2: Normalization. Introduction to Normalization. Normal Forms.

Principles for Distributed Databases in Telecom Environment Imran Ashraf Amir Shahzed Khokhar

Diagram 1: Islands of storage across a digital broadcast workflow

The Data Grid: Towards an Architecture for Distributed Management and Analysis of Large Scientific Datasets

Data warehousing with PostgreSQL

Chapter 3. Database Environment - Objectives. Multi-user DBMS Architectures. Teleprocessing. File-Server

Physical Design in OODBMS

news from Tom Bacon about Monday's lecture

Relational model. Relational model - practice. Relational Database Definitions 9/27/11. Relational model. Relational Database: Terminology

The Relational Data Model: Structure

Joins for Hybrid Warehouses: Exploiting Massive Parallelism in Hadoop and Enterprise Data Warehouses

Redis Cluster. a pragmatic approach to distribution

IV Distributed Databases - Motivation & Introduction -

Transcription:

Query Processing and Optimization in Distributed Databases Erdem USLU Nihal SARMAŞIK

CONTENT Introduction to Distributed DBMS Storing Data in Distributed DBMS Distributed Query Processing Cost-Based Query Optimization

Introduction to Distributed DBMS Main Considerations: Distributed Transaction Atomicity users should be able to write transactions that access and update data at several sites just as they would write transactions over purely local data Location Transparency user does not have to know the location of the data data requests automatically forwarded to appropriate sites

Introduction to Distributed DBMS Types of Distributed Database System Homogeneous Distributed Database System All sites of the database system have identical setup, i.e., same database system software. The underlying operating system may be different Window Site 4 Oracle Window Site 5 Site 3 Linux Oracle Oracle Communications network Site 2 Linux Unix Site 1 Oracle Oracle

Introduction to Distributed DBMS Types of Distributed Database System Heterogeneous Distributed Database System (Multidatabase System) Each site may run different database system/software Window Site 4 Object Oriented Unix Relational Site 5 Unix Site 1 Hierarchical Communications network Object Oriented Site 3 Linux Network DBMS Site 2 Linux Relational

Storing Data in Distributed DBMS Fragmentation Division of relation r into fragments r1, r2,, rn which contain sufficient information to reconstruct relation r allows a relation to be split so that tuples are located where they are most frequently accessed Fragmentation types Horizontal Fragmentation Vertical Fragmentation Mixed Fragmentation

Storing Data in Distributed DBMS Horizontal Fragmentation ( Row-wise ) R A1 A2 An R1 A1 A2 An T1 T2 T3 T100 T1 T2 T3 T100 T101 T102 T103 T600 R2 T101 T102 T103 A1 A2.. An R = R1 U R2 T600

Storing Data in Distributed DBMS Vertical Fragmentation (Coloumn-wise ) The schema for relation r is split into several smaller schemas R All schemas must contain a common candidate key (or superkey) to ensure lossless join property R1 R2 A special attribute, the tuple-id attribute may be added to each schema to serve as a candidate key R = R1 R2 R1.TID=R2.TID

Storing Data in Distributed DBMS Mixed fragmentation R = (Rs1 U Rs2) (Rs3 U Rs4)

Storing Data in Distributed DBMS Replication Multiple copies of data, stored in different sites Adventages of Replication Availability: failure of site containing relation r does not result in unavailability of r is replicas exist Parallelism: queries on r may be processed by several nodes in parallel. Reduced data transfer: relation r is available locally at each site containing a replica of r

Critera for measuring the cost of a query evalution strategy For centralized DBMSs number of disk accesses (# blocks read / written) For distributed databases, additionally The cost of data transmission over the network Potential gain in performance from having several sites processing parts of the query in parallel

Nonjoin Queries in Distributed DBMS Example: EMP = EMP1 U EMP2 (Horizontal Fragmentation) EMP1 EMP2 SITE 1 SITE 2 EMP1= σ emp_id<5000 (EMP) EMP2= σ emp_id>=5000 (EMP)

Nonjoin Queries in Distributed DBMS QUERY: Select all employees whose ages between 20 and 30. SELECT emp_id, name FROM EMP1 WHERE age > 20 AND age < 30 UNION SELECT emp_id, name FROM EMP2 WHERE age > 20 AND age < 30

Nonjoin Queries in Distributed DBMS QUERY: What is average age of all employee? SELECT (T1.age+T2.age)/ (T1.emp_number+T2.emp_number) FROM (SELECT SUM(age) as age, count(*) as emp_number FROM EMP1) as T1, (SELECT SUM(age) as age, count(*) as emp_number FROM EMP2) as T2

Join Queries in Distributed DBMS JOIN STRATEGY Ship whole Which strategy is better for me? Fetch as needed Semijoins Bloomjoins

Join Queries in Distributed DBMS (Ship Whole) Ship Whole: Transferring the complete relation Example: QUERY: The query asks for R S

Join Queries in Distributed DBMS (Ship whole) COST ANALYSIS When execution at noder noder: send data request message (relation S) to nodes nodes: send requested data (relation S) to noder Total costs: 2 messages, 18 attribute value When execution at nodes nodes: send data request message (relation R) to noder noder: send requested data (relation R) to nodes Total costs: 2 messages, 14 attribute value

Join Queries in Distributed DBMS (Fetch As Needed) Fetch As Needed: Transferring the relation piecewise Example: QUERY: The query asks for R S

Join Queries in Distributed DBMS (Fetch As Needed) COST ANALYSIS When execution at noder noder: send data request message (tuples of relation S with B = 7 ) to nodes nodes: send requested data (0 tuples of relation S with B = 7 ) to noder noder: send data request message (tuples of relation S with B = 1 ) to nodes nodes: send requested data (1 tuple of relation S with B = 1 ) to noder... Total costs: 7 * 2 = 14 messages, 7 + 2 * 3 = 13 attribute value When execution at nodes nodes: send data request message (tuples of relation S with B = 9 ) to noder noder: send requested data (0 tuples of relation S with B = 9 ) to nodes nodes: send data request message (tuples of relation S with B = 1 ) to noder noder: send requested data (1 tuple of relation S with B = 1 ) to nodes... Total costs: 6 * 2 = 12 messages, 6 + 2 * 2 = 10 attribute value

Join Queries in Distributed DBMS Ship Whole vs Fetch as needed: Fetch as needed results in a high number of messages Ship whole results in high amounts of transferred data

Join Queries in Distributed DBMS (Semijoin) Semijoin: Requesting all join partners in just one step

Join Queries in Distributed DBMS (Bloom join) Bloom join: Also known as bit-vector join Avoiding to transfer all join attribute values to the other node Instead transferring a bitvector B[1 : : : n] Transformation Choose an appropriate hash function h Apply h to transform attribute values to the range [1 : : : n] Set the corresponding bits in the bitvector B[1 : : : n] to 1

Join Queries in Distributed DBMS (Bloom join)

Join Queries in Distributed DBMS (Bloom join) Conclusion: Transferring the bit-vector reduces network load Bit-vector only indicates potential join partners because multiple attribute values might map to the same hash value Might result in transferring unnecessary tuples Requirements: an appropriate hash function h and n needs to be large enough to avoid a high number of collision

Join Queries in Distributed DBMS Semijoin vs bloom-join * The cost of shipping is less in bloom-join since bit-vector is used rather than projection * In Bloom join the size of the reductioned part which transferred back is likely to be larger than in Semijoin, so the costs of shipping are likely to be higher

Cost-Based Query Optimization Main Consideration for Query Optimizition Communication cost If there is several copies of a relation, decide which copy to use Amount of data being shipped Relative processing speed at each site Site selection

Cost-Based Query Optimization Global plan: Includes several local plans (subqueries) If response time is critical, subqueries can be carried out in parallel Local plans constructed by optimizer of each site

THANK YOU!