Big Data & Scripting storage networks and distributed file systems



1, Big Data & Scripting storage networks and distributed file systems

2, adaptivity: Cut-and-Paste [1]
- distribute blocks to [0, 1] using a hash function (see the hashing sketch below)
- start with n nodes N_1, ..., N_n: split [0, 1] into n equal parts, node N_i gets ((i-1)/n, i/n]
- adding a node: cut a stripe from each part, a 1/(n+1) fraction of it (absolute width 1/(n(n+1))), and reassign it to the new node N_{n+1}
- move the blocks in these stripes to N_{n+1}
- fuse the new part together, sorted by previous block assignment, and consider it as one continuous interval

[1] Brinkmann, Salzwedel, Scheideler: Efficient, distributed data placement strategies for storage area networks, 2000
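
As an illustration of the first step, a minimal sketch of mapping a block identifier to a point in [0, 1); the choice of SHA-256 and the 64-bit truncation are assumptions for this sketch, not prescribed by the slides:

    import hashlib

    def block_to_point(block_id: bytes) -> float:
        """Hash a block identifier to a pseudo-random point in [0, 1)."""
        # take the first 8 bytes of the digest as a 64-bit integer ...
        h = int.from_bytes(hashlib.sha256(block_id).digest()[:8], "big")
        return h / 2**64  # ... and scale it into the unit interval

    # example: two different blocks land at (almost surely) different points
    print(block_to_point(b"block-42"), block_to_point(b"block-43"))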

3, adaptivity: Cut-and-Paste
- assumption: the hash function yields a uniform random distribution, so an equal distribution of blocks to nodes is expected
- node addresses can be determined without keeping a table: retrace the splitting scheme; the ordering before fusing allows exact location (lookup sketch below)
- not applicable in heterogeneous situations; extensions to the heterogeneous case exist [2]

[2] Brinkmann, Salzwedel, Scheideler: Compact adaptive placement schemes for non-uniform requirements, 2002
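
A minimal sketch of the table-free lookup, retracing the splitting scheme. It assumes the system grew one node at a time starting from a single node and that the stripes cut for a new node are pasted together in the order of the old node ids, as described above; this is a simplified reading of the scheme, not the full algorithm of the paper:

    def locate(x: float, n: int) -> int:
        """Return the node (1..n) responsible for hash value x in [0, 1),
        replaying every node addition from 1 node up to n nodes."""
        node, pos = 1, x                  # with one node, x sits at offset x in its interval
        for k in range(1, n):             # simulate adding node k+1
            keep = 1.0 / (k + 1)          # each old node keeps an interval of this length ...
            stripe = 1.0 / (k * (k + 1))  # ... and hands a stripe of this width to node k+1
            if pos >= keep:               # x lies in the stripe that is cut off
                # stripes are pasted into node k+1's interval, ordered by old node id
                pos = (node - 1) * stripe + (pos - keep)
                node = k + 1
            # otherwise x stays on the same node at the same offset
        return node

    # example: distribute ten hash values over 4 nodes
    print([locate(i / 10, 4) for i in range(10)])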

4, data access
- until now: store and retrieve blocks; distribution and redundancy by duplicates
- open: accessing this data
- possible guarantees/properties:
  - parallel data access, distributed among nodes
  - synchronization with read operations (i.e. a read always sees the current state)
  - synchronization of write/update operations (e.g. which update wins)
- performance: latency = time until the system reacts to a request; bandwidth = transferable data per time unit (e.g. Mb/s) (worked example below)
- the update strategy determines the guarantees and the strategies for read access
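
A small worked example of how the two performance measures combine; the additive model time = latency + size / bandwidth and the concrete numbers are assumptions for illustration only:

    def transfer_time(size_mb: float, latency_s: float, bandwidth_mbit_s: float) -> float:
        """Rough time to serve one request: wait for the first reaction, then stream the data."""
        return latency_s + (size_mb * 8) / bandwidth_mbit_s  # convert MB to Mbit

    # e.g. 100 MB over a 100 Mb/s link with 10 ms latency: about 8.01 s, dominated by bandwidth
    print(transfer_time(100, 0.010, 100))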

5, data access: consistency and synchronicity
- redundancy: a block update affects multiple nodes
- synchronous updates: a sequence of read/write accesses yields correct results
  - example: write operation A_1 affects one copy of a block B_i, the next read access A_2 reads a different copy of B_i
  - all copies of B_i should be identical (synchronous)
- consistency: parallel updates should not lead to alternative histories
  - example: simultaneous updates A_1 and A_2 to different copies of block B_i leave two copies of B_i in different states
  - determine a winner copy of B_i or prevent the parallel update (toy example below)
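
A toy illustration of the two failure modes with uncoordinated duplicates held in memory; the class and method names are made up for this sketch:

    class NaiveReplica:
        """A block replicated over several copies, with no coordination at all."""
        def __init__(self, copies: int, value: str = ""):
            self.copies = [value] * copies

        def write(self, copy_idx: int, value: str):
            self.copies[copy_idx] = value   # update only the copy we happen to reach

        def read(self, copy_idx: int) -> str:
            return self.copies[copy_idx]    # return whatever that copy currently holds

    b = NaiveReplica(copies=2, value="v0")
    b.write(0, "v1")       # A_1 updates copy 0
    print(b.read(1))       # A_2 reads copy 1 -> still "v0": no synchronicity
    b.write(1, "v2")       # a parallel update hits copy 1
    print(b.copies)        # ['v1', 'v2']: alternative histories, no consistency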

6, data access: master copy
- a master copy can be used to ensure consistency:
  - block B_i has r duplicates B_i^1, ..., B_i^r
  - a single, distinguished copy, say B_i^1, is B_i's master copy
  - consistency is ensured by limiting write access to the master copy
- reading: use any copy; reading in parallel is allowed
- writing: write only to the master copy; update the duplicates from the master

7, data access: master copy
- updates can be executed using different strategies, resulting in different guarantees (sketch below)
- immediately: hold read access while the master copy is written; can guarantee synchronicity
- lazily: e.g. when the system load allows, without blocking access; copies can be in an obsolete state
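
A minimal sketch of the master-copy discipline with both update strategies; the class, the lazy flag and the flush() method are illustrative assumptions, and the blocking of concurrent reads during an immediate update is omitted:

    class MasterCopyReplica:
        """Block with one master copy (index 0) and r-1 duplicates."""
        def __init__(self, r: int, value: str = "", lazy: bool = False):
            self.copies = [value] * r
            self.lazy = lazy
            self.pending = None            # update not yet pushed to the duplicates

        def write(self, value: str):
            self.copies[0] = value         # writes go to the master copy only
            if self.lazy:
                self.pending = value       # propagate later, e.g. when load allows
            else:
                self._propagate(value)     # immediate: duplicates updated right away

        def flush(self):
            if self.pending is not None:   # lazy strategy: push the outstanding update
                self._propagate(self.pending)
                self.pending = None

        def read(self, copy_idx: int) -> str:
            return self.copies[copy_idx]   # any copy may be read, in parallel

        def _propagate(self, value: str):
            for i in range(1, len(self.copies)):
                self.copies[i] = value

    b = MasterCopyReplica(r=3, value="v0", lazy=True)
    b.write("v1")
    print(b.read(2))   # "v0": the duplicate is obsolete until the lazy update is flushed
    b.flush()
    print(b.read(2))   # "v1"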

8, data access: master copy
- access blocking needs centralized write access
- centralization leads to bottlenecks in efficiency and security
- guarantees like synchronicity can often only be given in exchange for efficiency:
  - synchronous systems with immediate updates have to block read access on updates (centralization); they can guarantee synchronicity and consistency
  - asynchronous / lazy updates can still guarantee consistency, but there is no synchronicity without centralized read access

9, data access: without master copy
- writing is allowed to any duplicate
- synchronicity could be achieved by blocking duplicates, but blocking in a distributed situation is complicated and easily leads to locking situations
- updates without blocking or master copy:
  - parallel writing is allowed, so two duplicates can be in different updated states
  - neither synchronicity nor consistency is guaranteed
  - a synchronization strategy is needed to resolve inconsistencies: duplicates in different states have to be reunified
  - example: the latest update always wins (sketch below)
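
A minimal sketch of the "latest update wins" reconciliation, tagging every write with a timestamp and reunifying diverged duplicates afterwards; the timestamp mechanism is an assumption, the slides only name the rule, and a real system would need synchronized or logical clocks instead of one local clock:

    import time

    class TimestampedCopy:
        """One duplicate of a block, remembering when its value was last written."""
        def __init__(self, value: str = ""):
            self.value, self.stamp = value, 0.0

        def write(self, value: str):
            # a local high-resolution clock stands in for real timestamps
            self.value, self.stamp = value, time.perf_counter()

    def reconcile(copies):
        """Reunify diverged duplicates: the copy with the latest timestamp wins."""
        winner = max(copies, key=lambda c: c.stamp)
        for c in copies:
            c.value, c.stamp = winner.value, winner.stamp

    a, b = TimestampedCopy("v0"), TimestampedCopy("v0")
    a.write("v1")            # parallel updates hit different duplicates
    b.write("v2")
    reconcile([a, b])
    print(a.value, b.value)  # both copies now hold the later update, "v2"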

10, data access: transactions
- abstraction of grouped updates; implements invariants in storage systems
- example: online store with (1) outstanding debts, (2) bills to be sent, (3) goods available, (4) goods to be sent
  - when an order is placed, ensure that (1)-(4) are either all updated (purchase successful, transaction completed) or not updated at all (purchase failed, transaction failed)
- transactions group individual updates together; usually either all updates are executed or none (atomic)
- the system can guarantee properties for transactions, e.g. ACID (atomic, consistent, isolated, durable) in RDBMS
- the implementation is very complex in distributed systems (all-or-nothing sketch below)
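
A minimal sketch of the all-or-nothing behaviour for the online-store example, on a single in-memory store; the snapshot-and-rollback approach is a deliberate simplification that ignores isolation, durability and the distributed case:

    store = {"debts": 0, "bills": 0, "stock": 5, "to_ship": 0}

    def place_order(price: int):
        """Apply updates (1)-(4) atomically: either all of them, or none on failure."""
        snapshot = dict(store)          # remember the state before the transaction
        try:
            store["debts"] += price     # (1) outstanding debts
            store["bills"] += 1         # (2) bills to be sent
            if store["stock"] == 0:     # (3) goods available?
                raise RuntimeError("out of stock")
            store["stock"] -= 1
            store["to_ship"] += 1       # (4) goods to be sent
        except Exception:
            store.clear()
            store.update(snapshot)      # roll back: transaction failed, nothing is updated
            raise

    place_order(10)
    print(store)   # all four fields updated together, or the order raised and nothing changed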

11, data access: the CAP theorem
- formalize the guarantees and the network model:
  - consistency: transactions guarantee the ACID properties; allows the system to be kept in a consistent state
  - availability: data is still accessible if nodes fail (e.g. backups)
  - partition tolerance: tolerate disconnection of the network; individual parts function independently; allows arbitrary scaling
- network model: no global clock (asynchronous); nodes decide on local information

12, data access: the CAP theorem
- CAP theorem: at most two of the three properties consistency, availability and partition tolerance can be guaranteed at the same time
- conjectured by Brewer, 2000; proven for an asynchronous network model [3]
- improvements are possible when nodes have a global timer (synchronous network model), but keeping all nodes on a single, global time is very complicated (if possible at all)

[3] Gilbert, Lynch: Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services, 2002

13, data access: trade-offs
- redundancy enables parallelism and failure tolerance, but enforces updates in multiple places and collides with efficiency, synchronicity and consistency
- systems differ in the subset of properties they provide; the optimal system for a particular task is a result of the required guarantees
  - distributed RDBMS systems: consistency, availability
  - NoSQL [4] systems: availability, partition tolerance

[4] NoSQL: not only SQL