CS377: Database Systems Distributed Databases. Li Xiong Department of Mathematics and Computer Science Emory University


Centralized DBMS on a Network
[Figure: Sites 1 to 5 connected by a communication network]

Distributed DBMS Environment
[Figure: Sites 1 to 5 connected by a communication network]

Distributed Database System
A distributed database (DDB) is a collection of multiple, logically interrelated databases distributed over a computer network.
A distributed database management system (D-DBMS) is the software that manages the DDB and provides an access mechanism that makes this distribution transparent to the users.
Distributed database system (DDBS) = DDB + D-DBMS

Distributed Database System
The EMPLOYEE, PROJECT, and WORKS_ON tables may be fragmented horizontally and stored with possible replication.
[Figure: fragments of EMPLOYEE, PROJECT, and WORKS_ON allocated to sites; not reproduced in this transcript]

Distributed DBMS Promises
1. Transparent management of distributed, fragmented, and replicated data
2. Improved reliability/availability through distributed transactions
3. Improved performance
4. Easier and more economical system expansion

Distributed DBMS Issues
Distributed Database Design
  How to distribute the database
Query Processing
  Optimize cost = data transmission + local processing

Distributed DBMS Issues
Concurrency Control
  Synchronization of concurrent accesses
  Consistency and isolation of transactions' effects
  Deadlock management
Reliability
  How to make the system resilient to failures
  Atomicity and durability

Distributed database design
Data distribution
  Top-down: mostly used when designing systems from scratch
  Bottom-up: used when the databases already exist at a number of sites
Unit of distribution
  relation
  fragments of relations (sub-relations)
    Data are inherently fragmented, e.g. by locality
    Fragments allow concurrent execution of a number of transactions that access different portions of a relation

Example
Employee relation E(#, name, loc, sal, ...)
  40% of queries, Qa: select * from E where loc=Sa and ...
  40% of queries, Qb: select * from E where loc=Sb and ...
Motivation: two sites, Sa and Sb; Qa is directed at Sa and Qb at Sb.

Fragmentation Alternatives: Horizontal
PROJ1: projects with budgets less than $200,000
PROJ2: projects with budgets greater than or equal to $200,000

PROJ
PNO  PNAME              BUDGET  LOC
P1   Instrumentation    150000  Montreal
P2   Database Develop.  135000  New York
P3   CAD/CAM            250000  New York
P4   Maintenance        310000  Paris
P5   CAD/CAM            500000  Boston

PROJ1
PNO  PNAME              BUDGET  LOC
P1   Instrumentation    150000  Montreal
P2   Database Develop.  135000  New York

PROJ2
PNO  PNAME              BUDGET  LOC
P3   CAD/CAM            250000  New York
P4   Maintenance        310000  Paris
P5   CAD/CAM            500000  Boston

Fragmentation Alternatives: Vertical
PROJ1: information about project budgets
PROJ2: information about project names and locations

PROJ
PNO  PNAME              BUDGET  LOC
P1   Instrumentation    150000  Montreal
P2   Database Develop.  135000  New York
P3   CAD/CAM            250000  New York
P4   Maintenance        310000  Paris
P5   CAD/CAM            500000  Boston

PROJ1
PNO  BUDGET
P1   150000
P2   135000
P3   250000
P4   310000
P5   500000

PROJ2
PNO  PNAME              LOC
P1   Instrumentation    Montreal
P2   Database Develop.  New York
P3   CAD/CAM            New York
P4   Maintenance        Paris
P5   CAD/CAM            Boston

Data Fragmentation, Replication and Allocation
Horizontal fragmentation
  A horizontal subset of a relation, containing the tuples that satisfy a selection condition.
  E.g. the Employee relation with selection condition (DNO = 5).
  Can be specified by a $\sigma_{C_i}(R)$ operation in the relational algebra.
Complete horizontal fragmentation
  A set of horizontal fragments whose conditions $C_1, C_2, \ldots, C_n$ include all the tuples in $R$: every tuple in $R$ satisfies ($C_1$ OR $C_2$ OR ... OR $C_n$).
Disjoint complete horizontal fragmentation
  No tuple in $R$ satisfies ($C_i$ AND $C_j$) where $i \neq j$.
How to reconstruct R from complete horizontal fragments?
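The definitions above can be checked on a small example. The following is a minimal sketch, assuming a toy Employee relation held as a list of dicts; the attribute names, rows, and the three conditions are illustrative and not taken from the slides. It builds the fragments with a selection, tests completeness and disjointness, and reconstructs R as the union of the fragments, which is the standard answer to the question on the slide.

```python
# Hypothetical toy Employee relation; attribute names and rows are illustrative.
employee = [
    {"ssn": 1, "name": "Ann", "dno": 4},
    {"ssn": 2, "name": "Bob", "dno": 5},
    {"ssn": 3, "name": "Cy", "dno": 5},
    {"ssn": 4, "name": "Dee", "dno": 1},
]

# Selection conditions C1..C3 that define the horizontal fragments (assumed).
conditions = [
    lambda t: t["dno"] == 5,
    lambda t: t["dno"] == 4,
    lambda t: t["dno"] not in (4, 5),
]


def select(relation, cond):
    """sigma_cond(relation): keep the tuples that satisfy cond."""
    return [t for t in relation if cond(t)]


fragments = [select(employee, c) for c in conditions]

# Complete: every tuple satisfies at least one condition (C1 OR ... OR Cn).
is_complete = all(any(c(t) for c in conditions) for t in employee)

# Disjoint: no tuple satisfies two different conditions (Ci AND Cj, i != j).
is_disjoint = all(sum(c(t) for c in conditions) <= 1 for t in employee)

# Reconstruction: R is the union of its fragments. Keying by ssn removes
# duplicates in case the fragmentation is complete but not disjoint.
by_key = {t["ssn"]: t for frag in fragments for t in frag}
rebuilt = sorted(by_key.values(), key=lambda t: t["ssn"])
```

Keying the union by the primary key keeps the reconstruction correct even when fragments overlap; for a disjoint fragmentation a plain concatenation would do.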

Three common horizontal partitioning techniques
  Round robin
  Hash partitioning
  Range partitioning

Round robin
Tuples of R are assigned to the disks D0, D1, D2 in turn.
R    Disk
t1   D0
t2   D1
t3   D2
t4   D0
t5   D1
...

Hash partitioning
Each tuple of R goes to the disk selected by the hash of its key.
R    Hash value   Disk
t1   h(k1) = 2    D2
t2   h(k2) = 0    D0
t3   h(k3) = 0    D0
t4   h(k4) = 1    D1
...

Range partitioning
Partitioning vector: V0 = 4, V1 = 7
R    A value   Disk
t1   A = 5     D1
t2   A = 8     D2
t3   A = 2     D0
t4   A = 3     D0
...
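The three techniques can be sketched in a few lines. This is an illustrative sketch rather than how any particular DBMS implements partitioning; the function names are hypothetical, the hash function for the example is a lookup table that simply reproduces the h(k) values on the slide, and the treatment of values equal to a split point (sent to the higher disk) is an assumption, since the slide does not show a boundary case.

```python
from bisect import bisect_right


def round_robin(tuples, n):
    """Send the i-th tuple (0-based) to disk i mod n."""
    disks = [[] for _ in range(n)]
    for i, t in enumerate(tuples):
        disks[i % n].append(t)
    return disks


def hash_partition(tuples, n, key, h=hash):
    """Send each tuple to disk h(key(t)) mod n."""
    disks = [[] for _ in range(n)]
    for t in tuples:
        disks[h(key(t)) % n].append(t)
    return disks


def range_partition(tuples, vector, key):
    """Send each tuple to the range its key falls in.

    vector holds the sorted split points. Assumption: a value equal to a
    split point goes to the higher-numbered disk.
    """
    disks = [[] for _ in range(len(vector) + 1)]
    for t in tuples:
        disks[bisect_right(vector, key(t))].append(t)
    return disks


# Round robin over three disks, as on the round-robin slide.
rr = round_robin(["t1", "t2", "t3", "t4", "t5"], 3)

# Hash partitioning; the table stands in for h and mirrors the slide values.
slide_hash = {"k1": 2, "k2": 0, "k3": 0, "k4": 1}
hp = hash_partition(
    [("t1", "k1"), ("t2", "k2"), ("t3", "k3"), ("t4", "k4")],
    3,
    key=lambda t: t[1],
    h=slide_hash.__getitem__,
)

# Range partitioning on attribute A with partitioning vector (4, 7).
rp = range_partition(
    [("t1", 5), ("t2", 8), ("t3", 2), ("t4", 3)],
    [4, 7],
    key=lambda t: t[1],
)
```

Round robin balances load but gives no help in locating a tuple by value; hash partitioning supports exact-match lookups on the key; range partitioning also supports range queries, at the risk of skew when values cluster.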

Data Fragmentation, Replication and Allocation
Vertical fragmentation
  A vertical subset of a relation that contains a subset of its columns.
  E.g. the Employee relation: a vertical fragment of Name, Bdate, Sex.
  Can be specified by a $\Pi_{L_i}(R)$ operation in the relational algebra.
  Each fragment must include the primary key attribute of the parent relation.
Complete vertical fragmentation
  A set of vertical fragments whose projection lists $L_1, L_2, \ldots, L_n$ include all the attributes in $R$ but share only the primary key of $R$.
  $L_1 \cup L_2 \cup \cdots \cup L_n = ATTRS(R)$
  $L_i \cap L_j = PK(R)$ for any $i \neq j$
How to reconstruct R from complete vertical fragments?
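As a sketch of the two conditions and of the reconstruction, the example below uses the PROJ relation from the vertical fragmentation slide. The helper names `project` and `join_on_key` are hypothetical. Reconstruction is done by joining the fragments on the shared primary key PNO, which is the standard answer to the question above.

```python
# PROJ relation as shown on the vertical fragmentation slide.
proj = [
    {"PNO": "P1", "PNAME": "Instrumentation", "BUDGET": 150000, "LOC": "Montreal"},
    {"PNO": "P2", "PNAME": "Database Develop.", "BUDGET": 135000, "LOC": "New York"},
    {"PNO": "P3", "PNAME": "CAD/CAM", "BUDGET": 250000, "LOC": "New York"},
    {"PNO": "P4", "PNAME": "Maintenance", "BUDGET": 310000, "LOC": "Paris"},
    {"PNO": "P5", "PNAME": "CAD/CAM", "BUDGET": 500000, "LOC": "Boston"},
]

L1 = ["PNO", "BUDGET"]          # projection list of PROJ1
L2 = ["PNO", "PNAME", "LOC"]    # projection list of PROJ2


def project(relation, attrs):
    """Pi_attrs(relation), with duplicate elimination."""
    seen, out = set(), []
    for t in relation:
        row = tuple((a, t[a]) for a in attrs)
        if row not in seen:
            seen.add(row)
            out.append(dict(row))
    return out


def join_on_key(left, right, key):
    """Equi-join of two fragments on a shared key attribute."""
    index = {t[key]: t for t in right}
    return [{**l, **index[l[key]]} for l in left if l[key] in index]


proj1 = project(proj, L1)
proj2 = project(proj, L2)

# Completeness: the lists cover ATTRS(R) and share only the primary key.
covers_all = set(L1) | set(L2) == set(proj[0])
shares_only_pk = set(L1) & set(L2) == {"PNO"}

# Reconstruction: join the fragments on the primary key.
rebuilt = join_on_key(proj1, proj2, "PNO")
```

Because every fragment carries the primary key, the join is lossless: each PNO value matches exactly one tuple in each fragment.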

Data Fragmentation, Replication and Allocation
Mixed (hybrid) fragmentation
  A combination of vertical fragmentation and horizontal fragmentation.
  This is achieved by SELECT-PROJECT operations, represented by $\Pi_{L_i}(\sigma_{C_i}(R))$.

Data Fragmentation, Replication and Allocation
Fragmentation schema
  A definition of a set of fragments (horizontal, vertical, or mixed) from which the original database can be reconstructed.
Allocation schema
  The distribution of fragments to the sites of the distributed database. Data can be fully replicated, partially replicated, or partitioned.
Data replication
  Full replication: the database is replicated to all sites.
  Partial replication: some selected part is replicated.



Query Processing in Distributed Databases
The cost of transferring data (files and results) over the network is usually high.
Example: Employee at site 1 and Department at site 2
  Employee at site 1: 10,000 rows, row size = 100 bytes, table size = $10^6$ bytes.
    Attributes: Fname, Minit, Lname, SSN, Bdate, Address, Sex, Salary, Superssn, Dno
  Department at site 2: 100 rows, row size = 35 bytes, table size = 3,500 bytes.
    Attributes: Dname, Dnumber, Mgrssn, Mgrstartdate
  Q submitted at site 3: retrieve the employee name and the name of the department where the employee works.
    $\Pi_{Fname, Lname, Dname}(Employee \bowtie_{Dno = Dnumber} Department)$
  The result has 10,000 tuples, and each result tuple is 40 bytes.

Query Processing in Distributed Databases
Strategies:
1. Transfer Employee and Department to site 3. Total transfer size?
2. Transfer Employee to site 2, execute the join at site 2, and send the result to site 3. Total transfer size?
3. Transfer Department to site 1, execute the join at site 1, and send the result to site 3. Total bytes transferred?
Optimization criterion: minimizing data transfer. Which strategy?

Query Processing in Distributed Databases
Strategies:
1. Transfer Employee and Department to site 3.
   Total transfer size = 1,000,000 + 3,500 = 1,003,500 bytes.
2. Transfer Employee to site 2, execute the join at site 2, and send the result to site 3.
   Query result size = 40 * 10,000 = 400,000 bytes.
   Total transfer size = 400,000 + 1,000,000 = 1,400,000 bytes.
3. Transfer Department to site 1, execute the join at site 1, and send the result to site 3.
   Total transfer size = 400,000 + 3,500 = 403,500 bytes.
Optimization criterion: minimizing data transfer.
Preferred approach: strategy 3.
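The arithmetic for the three strategies can be reproduced directly from the table and result sizes in the example. This is a small sketch that counts only bytes shipped over the network, as the slides do; the variable names are illustrative.

```python
# Sizes from the example (bytes).
EMP_ROWS, EMP_ROW_BYTES = 10_000, 100
DEPT_ROWS, DEPT_ROW_BYTES = 100, 35
RESULT_ROWS, RESULT_ROW_BYTES = 10_000, 40

emp = EMP_ROWS * EMP_ROW_BYTES            # Employee table at site 1
dept = DEPT_ROWS * DEPT_ROW_BYTES         # Department table at site 2
result = RESULT_ROWS * RESULT_ROW_BYTES   # join result

# Bytes shipped by each strategy when Q is submitted at site 3.
costs = {
    "1: ship Employee and Department to site 3": emp + dept,
    "2: ship Employee to site 2, result to site 3": emp + result,
    "3: ship Department to site 1, result to site 3": dept + result,
}

# Optimization criterion: minimize data transfer.
best = min(costs, key=costs.get)

for name, cost in costs.items():
    print(f"{name}: {cost:,} bytes")
print("preferred:", best)
```

Local processing cost is ignored here; the full objective stated earlier in the deck is data transmission plus local processing.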

Query Processing in Distributed Databases
What if Q is submitted at site 2?
Example: Employee at site 1 and Department at site 2
  Employee at site 1: 10,000 rows, row size = 100 bytes, table size = $10^6$ bytes.
    Attributes: Fname, Minit, Lname, SSN, Bdate, Address, Sex, Salary, Superssn, Dno
  Department at site 2: 100 rows, row size = 35 bytes, table size = 3,500 bytes.
    Attributes: Dname, Dnumber, Mgrssn, Mgrstartdate
  Q submitted at site 2: retrieve the employee name and the name of the department where the employee works.
    $\Pi_{Fname, Lname, Dname}(Employee \bowtie_{Dno = Dnumber} Department)$
  The result has 10,000 tuples, and each result tuple is 40 bytes.

Query Processing in Distributed Databases
Semijoin: the objective is to reduce the number of tuples in a relation before transferring it to another site.
Example execution of Q:
1. Project the join attributes of Department at site 2, and transfer them to site 1. For Q, 4 * 100 = 400 bytes are transferred.
2. Join the transferred file with the Employee relation at site 1, and transfer the required attributes from the resulting file to site 2. For Q, 32 * 10,000 = 320,000 bytes are transferred.
3. Execute the query by joining the transferred file with Department, and present the result to the user at site 2.
Left semi-join: $R \ltimes S = \Pi_{R}(R \bowtie S)$, where $\Pi_{R}$ projects onto the attributes of $R$.
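The three steps can be traced on toy data. The sketch below is illustrative only: the rows are invented, and the "sites" are just variables in one process. The byte counts at the end use the figures from the slides; the two alternatives they are compared against are derived here from the table and result sizes given in the example, not stated on the slide.

```python
# Hypothetical toy data; only the attribute names come from the slides.
department = [  # stored at site 2
    {"Dnumber": 1, "Dname": "HQ"},
    {"Dnumber": 5, "Dname": "Research"},
]
employee = [  # stored at site 1
    {"Fname": "Ann", "Lname": "Lee", "Dno": 5, "Salary": 100},
    {"Fname": "Bob", "Lname": "Ray", "Dno": 1, "Salary": 90},
    {"Fname": "Cy", "Lname": "Orr", "Dno": 9, "Salary": 80},  # no matching department
]

# Step 1 (site 2 -> site 1): project the join attribute of Department.
dnumbers = {d["Dnumber"] for d in department}

# Step 2 (site 1 -> site 2): keep only employees with a join partner and
# only the attributes the query needs.
shipped = [
    {"Fname": e["Fname"], "Lname": e["Lname"], "Dno": e["Dno"]}
    for e in employee
    if e["Dno"] in dnumbers
]

# Step 3 (at site 2): join the shipped file with Department.
dname = {d["Dnumber"]: d["Dname"] for d in department}
result = [(e["Fname"], e["Lname"], dname[e["Dno"]]) for e in shipped]

# Transfer cost with the sizes from the slides (bytes).
semijoin_bytes = 4 * 100 + 32 * 10_000    # step 1 + step 2
ship_employee_bytes = 10_000 * 100        # alternative: send Employee to site 2
ship_dept_and_result_bytes = 100 * 35 + 40 * 10_000  # alternative: join at site 1
```

In the slide's example every employee has a matching department, so the saving comes from shipping 32-byte projected tuples instead of 100-byte rows; when many tuples have no join partner, the semijoin also cuts the row count.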

Parallel Databases
Parallel database
  Uses parallel processors
Architectures
  Shared memory
  Shared disk
  Shared nothing
Data partitioning (sharding)