BBM467 Data Intensive Applications


Hacettepe University, Department of Computer Engineering
BBM467 Data Intensive Applications
Dr. Fuat Akal
akal@hacettepe.edu.tr

Foundations of Data[base] Clusters
- Database Clusters
- Hardware Architectures
- Data Design Schemes
- Replication Schemes
- Query Parallelism
- Logical Cluster Organization
- Replication Management

Database Clusters
A cluster of computers can be thought of as a single computing resource. It utilizes multiple machines to provide a more powerful computing environment through a single system image.
There are two types of clusters:
- high availability (HA) clusters
- high performance computing (HPC) clusters

Hardware Architectures: Shared Memory
All processors have access to the main memory and to the disks. The processors are tightly coupled inside the same box and interconnected with a special switch. Interprocess communication is done through the shared memory.
The shared-memory approach is simple and allows for load balancing as well as inter-query parallelism, which comes for free. However, it is expensive, since it requires a special interconnect among the processors. Its performance and scalability are limited by the available memory and communication bandwidths.
[Figure: shared-memory architecture; processors (P) share the memory (M) and the disks (D)]

Hardware Architectures: Shared Disk
In the shared-disk approach, all processors have their own memory, but they share the disks. Interprocess communication occurs over a common high-speed bus.
This approach provides high availability: all data is still accessible even when a node fails. Since each node has its own data cache, cache coherency must be maintained, e.g. by means of a lock manager, which reduces performance. Shared-disk systems have limited scalability due to the bandwidth of the high-speed bus and the potential bottlenecks of shared hardware.
[Figure: shared-disk architecture; each processor (P) has its own memory (M), and all processors share the disks (D)]

Hardware Architectures: Shared Nothing
In a shared-nothing architecture, each node is a complete stand-alone computer with its own memory and disk. The nodes are connected via a switch or LAN, but they do not share anything.
The main advantages of such systems are very good scalability and high availability. However, data management is complicated, and programming with this model is harder because data partitioning and allocation become critical.
[Figure: shared-nothing architecture; each node has its own processor (P), memory (M), and disk (D)]

Partitioning Schemes
Vertical Partitioning: Vertical partitioning divides the columns of a table into separate tables. It makes projections and joins easier and helps optimize cache access by reducing the size of the tuples. However, executing a query may still require access to the whole table.
Horizontal Partitioning: Horizontal partitioning divides a table along its tuples. Its basic advantage is that it allows parallel scans or projections.
- Hash partitioning is based on a hash function that distributes the tuples according to a hashing key. It is useful for parallel exact-match queries and hash-join operations. It is not appropriate for range queries or for operations on attributes other than the partitioning key.
- Range partitioning is based on value intervals of the partitioning key. It supports efficient evaluation of range queries. Its performance depends on the interval size.
- Round-robin partitioning distributes the tuples across the partitions in turn. This approach is also called striping. The number of logically consecutive tuples placed together forms a striping unit. The relative size of the striping unit directly affects performance: small striping units result in more I/O parallelism for scans and long range queries, whereas larger striping units may add latency to completing scans.

Partitioning Schemes
[Figure: an original table with columns A, B, C and rows 1 to 10, split by a) vertical partitioning, b) hash partitioning, c) range partitioning, d) round-robin partitioning]
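The partitioning schemes above can be illustrated with a minimal Python sketch. It is not taken from the slides: the table contents, the function names, and the use of a plain modulo as the hash function are all assumptions made for illustration, and a real system would partition data on disk rather than in memory.

```python
import bisect

# Hypothetical table: columns A (integer key), B, C; rows 1 to 10.
rows = [{"A": k, "B": f"b{k}", "C": f"c{k}"} for k in range(1, 11)]

def vertical_partition(rows, key, column_groups):
    """Split columns into separate tables; each keeps the key for re-joining."""
    return [[{c: r[c] for c in [key, *cols]} for r in rows]
            for cols in column_groups]

def hash_partition(rows, key, n):
    """Place each tuple by a hash of its key (assumed here: key modulo n)."""
    parts = [[] for _ in range(n)]
    for r in rows:
        parts[r[key] % n].append(r)
    return parts

def range_partition(rows, key, upper_bounds):
    """upper_bounds: ascending, inclusive limits; the last partition takes the rest."""
    parts = [[] for _ in range(len(upper_bounds) + 1)]
    for r in rows:
        parts[bisect.bisect_left(upper_bounds, r[key])].append(r)
    return parts

def round_robin_partition(rows, n, striping_unit=1):
    """Deal out groups of `striping_unit` consecutive tuples to partitions in turn."""
    parts = [[] for _ in range(n)]
    for i, r in enumerate(rows):
        parts[(i // striping_unit) % n].append(r)
    return parts
```

With this layout, an exact-match lookup on A touches a single hash partition, while a range query on A touches only the range partitions whose intervals overlap the predicate.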

Virtual Partitioning
Virtual partitioning, also called query partitioning, assumes that all tables are fully replicated on each cluster node. In this approach, a query is decomposed into subqueries that access small pieces of data by appending range predicates to the WHERE clause of that query. Each subquery then deals with only a small part of the data.

Virtual Partitioning (Example)

original query
    SELECT Sum(L_ExtendedPrice*L_Discount) AS Revenue
    FROM LineItem
    WHERE L_Discount BETWEEN 0.03 AND 0.05

subquery1
    SELECT Sum(L_ExtendedPrice*L_Discount) AS Revenue
    FROM LineItem
    WHERE L_Discount BETWEEN 0.03 AND 0.05
      AND L_OrderKey BETWEEN 0 AND 3000000

subquery2
    SELECT Sum(L_ExtendedPrice*L_Discount) AS Revenue
    FROM LineItem
    WHERE L_Discount BETWEEN 0.03 AND 0.05
      AND L_OrderKey BETWEEN 3000001 AND 6000000

[Figure: the subqueries are evaluated on the LineItem replicas at node A and node B]
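As a rough sketch of how such subqueries might be generated, the Python function below appends disjoint, covering range predicates to a query. The function name and its parameters are hypothetical; it assumes the query already ends in a WHERE clause, that the partitioning key is an integer with known bounds, and that there are at least as many key values as nodes.

```python
def virtual_partitions(query, key, low, high, n_nodes):
    """Return one subquery per node, each restricted to a slice of [low, high]."""
    total = high - low + 1
    step, extra = divmod(total, n_nodes)
    subqueries = []
    start = low
    for i in range(n_nodes):
        # The first `extra` slices take one more key so the whole range is covered.
        size = step + (1 if i < extra else 0)
        end = start + size - 1
        subqueries.append(f"{query} AND {key} BETWEEN {start} AND {end}")
        start = end + 1
    return subqueries

# Hypothetical usage with the query from the example above.
base = ("SELECT Sum(L_ExtendedPrice*L_Discount) AS Revenue FROM LineItem "
        "WHERE L_Discount BETWEEN 0.03 AND 0.05")
for sq in virtual_partitions(base, "L_OrderKey", 1, 6000000, 2):
    print(sq)
```

Because the aggregate here is a sum, the final result is obtained by adding the partial results returned by the nodes. Other aggregates, such as averages, would need a different combination step.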

Replication Schemes
Full Replication: Tables are duplicated on each cluster node. That is, each node holds an exact copy of the original database.
Partial Replication: Only parts of the original database are replicated on the different cluster nodes.
Mixed Replication: Both full and partial replication are used at the same time.

Replication Schemes
[Figure: an original database distributed by a) full replication, b) partial replication, c) mixed replication]

Mixed Data Design
- Organize as node groups (NG)
- Freely design every NG
[Figure: a global database scheme with three co-existing design schemes; nodes 1 to 5 of the database cluster are organized into node groups NG 1, NG 2, and NG 3]

Query Parallelism in a Cluster
Inter-query parallelism: The capability of the database management system to accept queries from multiple users simultaneously. Each query is executed independently of the others.
Intra-query parallelism: Achieved by decomposing queries into subqueries and evaluating them simultaneously. It can take the form of inter-partition, intra-partition, or hybrid parallelism.

[Figure: queries Q1 to Q5 running against database partitions, showing a) inter-query parallelism, b) intra-query and intra-partition parallelism, c) intra-query and inter-partition parallelism, d) intra-query, intra-partition, and inter-partition (hybrid) parallelism]
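Intra-query, inter-partition parallelism can be sketched in Python as follows. This is only an illustration under stated assumptions: the partitions are small in-memory lists of hypothetical (extended price, discount) pairs, and a thread pool stands in for the cluster nodes that would each evaluate one subquery.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical partitions of a LineItem-like table: (extended_price, discount).
partitions = [
    [(100.0, 0.04), (200.0, 0.10)],
    [(300.0, 0.05), (50.0, 0.03)],
    [(400.0, 0.01)],
]

def partial_revenue(partition):
    """Subquery: sum price * discount over one partition, filtered on discount."""
    return sum(p * d for p, d in partition if 0.03 <= d <= 0.05)

def parallel_revenue(partitions):
    """Evaluate one subquery per partition concurrently, then combine the partial sums."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        return sum(pool.map(partial_revenue, partitions))

print(parallel_revenue(partitions))
```

Inter-query parallelism would instead submit several independent queries to the pool, each running to completion without being decomposed.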

Logical Cluster Organization
Flat Cluster Architecture:
- Allows any cluster node to be accessed by clients.
- Forms a federated database of distinct databases running on independent servers.
- The servers are connected by a LAN and share no resources, such as disks.
- Provides high availability and a simple design.
- Replication is difficult to implement with this model.
Middleware-Based Cluster Architecture:
- A client can only interact with the cluster through a coordination middleware.
- The middleware is responsible for scheduling and routing the clients' requests.
- The middleware has knowledge of the underlying cluster. It can be used to ensure correct execution of concurrent updates and reads.
- It can also improve overall throughput by choosing better-suited components, e.g. less loaded ones, to perform client requests.
- It is a single point of failure: if the middleware fails, the cluster becomes unusable.
- The middleware must be decentralized to improve scalability.

Logical Cluster Organization
[Figure: clients accessing a database cluster in a) a flat architecture and b) a middleware-based architecture, where a coordination middleware sits between the clients and the cluster]
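The routing role of a coordination middleware can be sketched as below. The class, its method names, and the least-loaded policy are assumptions chosen for illustration; the slides only say that the middleware may pick components with less load. Concurrency control and failure handling are deliberately left out.

```python
class Middleware:
    """Hypothetical coordinator that routes each request to the least-loaded node."""

    def __init__(self, nodes):
        # Number of requests currently running on each node.
        self.load = {node: 0 for node in nodes}

    def route(self, request):
        """Pick the node with the fewest running requests (ties: first listed)."""
        node = min(self.load, key=self.load.get)
        self.load[node] += 1
        return node

    def finished(self, node):
        """Called when a node completes a request."""
        self.load[node] -= 1

mw = Middleware(["node1", "node2", "node3"])
print([mw.route(f"q{i}") for i in range(4)])
```

Since every request passes through this single object, the sketch also shows why the middleware is a single point of failure unless it is decentralized.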

Replication Management
Replication is an essential technique to improve availability and scalability by fully or partially duplicating data objects among the nodes of a distributed system.
Replication management is responsible for the maintenance of replicas and ensures consistency of multiple copies of the same data object residing on different nodes. That is, replication management is not simply copying data objects onto different nodes of a distributed system.

Synchronization of Updates
There are two possibilities for the location of updates:
- Updates can be centralized on one primary copy.
- Updates can be distributed on (a subset of) all replicas (update everywhere).
[Figure: a) primary copy and b) update everywhere; the legend distinguishes updates, propagation, updatable objects, and read-only objects]
Synchronization of updates can be done in two ways: eager and lazy.

Synchronization of Updates
Eager (or synchronous) replication:
- All copies of an object are synchronized within the same database transaction.
- Allows early detection of conflicts and offers a simple way to provide consistency.
- Has performance drawbacks due to the high communication overhead among the replicas and the high probability of deadlocks.
Lazy (or asynchronous) replication:
- Replica maintenance is decoupled from the original database transaction.
- The transactions that keep the replicas up to date and consistent run as separate and independent database transactions after the original transaction has committed.
- Compared to eager approaches, lazy approaches require additional effort to guarantee serializable executions.
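The difference between eager and lazy propagation can be sketched for the primary-copy case as follows. This is a toy model, not a replication protocol: the class and method names are hypothetical, dictionaries stand in for databases, and there are no real transactions, locks, or failures.

```python
from collections import deque

class PrimaryCopy:
    """Toy primary-copy store with eager or lazy propagation to replicas."""

    def __init__(self, n_replicas, eager):
        self.primary = {}
        self.replicas = [{} for _ in range(n_replicas)]
        self.eager = eager
        self.pending = deque()  # updates not yet propagated (lazy mode only)

    def write(self, key, value):
        self.primary[key] = value
        if self.eager:
            # Eager: every replica is updated as part of the same write.
            for replica in self.replicas:
                replica[key] = value
        else:
            # Lazy: the write returns first; propagation happens later.
            self.pending.append((key, value))

    def propagate(self):
        """Lazy mode: apply queued updates to all replicas in commit order."""
        while self.pending:
            key, value = self.pending.popleft()
            for replica in self.replicas:
                replica[key] = value
```

In lazy mode a read from a replica between `write` and `propagate` returns stale data, which is the kind of inconsistency that lazy schemes need extra machinery to control.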

Eager Primary Copy Replication

Eager Update Everywhere Replication

Lazy Primary Copy Replication with Immediate Updates

Lazy Primary Copy Replication with Deferred Updates

Lazy Update Everywhere Replication