Parallel & Distributed Data Management



Similar documents
Distributed storage for structured data

Cloud Computing at Google. Architecture

Whitepaper. Big Data implementation: Role of Memory and SSD in Microsoft SQL Server Environment

A programming model in Cloud: MapReduce

Query Optimization in Cloud Environment

Distributed Systems. Tutorial 12 Cassandra

Hypertable Architecture Overview

Lecture 10: HBase! Claudia Hauff (Web Information Systems)!

Big Table A Distributed Storage System For Data

Database Scalability and Oracle 12c

F1: A Distributed SQL Database That Scales. Presentation by: Alex Degtiar (adegtiar@cmu.edu) /21/2013

Cluster Computing. ! Fault tolerance. ! Stateless. ! Throughput. ! Stateful. ! Response time. Architectures. Stateless vs. Stateful.

SQL VS. NO-SQL. Adapted Slides from Dr. Jennifer Widom from Stanford

Cloud Computing mit mathematischen Anwendungen

Database Scalability {Patterns} / Robert Treat

High Throughput Computing on P2P Networks. Carlos Pérez Miguel

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl

White Paper. Optimizing the Performance Of MySQL Cluster

NoSQL and Hadoop Technologies On Oracle Cloud

A Review of Column-Oriented Datastores. By: Zach Pratt. Independent Study Dr. Maskarinec Spring 2011

Advanced Data Management Technologies

2.1.5 Storing your application s structured data in a cloud database

Big Data Database Revenue and Market Forecast,

Hadoop Ecosystem B Y R A H I M A.

NoSQL Data Base Basics

Apache Hadoop Ecosystem

Study and Comparison of Elastic Cloud Databases : Myth or Reality?

Execution Strategies for SQL Subqueries

High Availability for Database Systems in Cloud Computing Environments. Ashraf Aboulnaga University of Waterloo

Introduction to Apache Cassandra

1. Comments on reviews a. Need to avoid just summarizing web page asks you for:

Introducing DocumentDB

Comparing Microsoft SQL Server 2005 Replication and DataXtend Remote Edition for Mobile and Distributed Applications

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Using RDBMS, NoSQL or Hadoop?

NoSQL. Thomas Neumann 1 / 22

Facebook: Cassandra. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India. Overview Design Evaluation

Comparing SQL and NOSQL databases

extensible record stores document stores key-value stores Rick Cattel s clustering from Scalable SQL and NoSQL Data Stores SIGMOD Record, 2010

Hadoop IST 734 SS CHUNG

How To Scale Out Of A Nosql Database

Cloud Service Model. Selecting a cloud service model. Different cloud service models within the enterprise

Cassandra A Decentralized, Structured Storage System

Big Data Analytics. Lucas Rego Drumond

Scalable Internet Services and Load Balancing

Why NoSQL? Your database options in the new non- relational world IBM Cloudant 1

Cloud Computing Is In Your Future

Distributed File Systems

How To Write A Database Program

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Large-Scale Web Applications

How To Improve Performance In A Database

Design Patterns for Distributed Non-Relational Databases

The Google File System

Amr El Abbadi. Computer Science, UC Santa Barbara

In Memory Accelerator for MongoDB

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Microsoft Azure Data Technologies: An Overview

Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>

Evaluation of NoSQL databases for large-scale decentralized microblogging

Big Data Technology CS , Technion, Spring 2013

Hosting Transaction Based Applications on Cloud

Petabyte Scale Data at Facebook. Dhruba Borthakur, Engineer at Facebook, SIGMOD, New York, June 2013

Structured Data Storage

Bigtable: A Distributed Storage System for Structured Data

How To Build Cloud Storage On Google.Com

Big Systems, Big Data

Apache HBase. Crazy dances on the elephant back

Cloud Application Development (SE808, School of Software, Sun Yat-Sen University) Yabo (Arber) Xu

ICOM 6005 Database Management Systems Design. Dr. Manuel Rodríguez Martínez Electrical and Computer Engineering Department Lecture 2 August 23, 2001

Distributed File Systems

Contents. 1. Introduction

CLOUD scale storage Anwitaman DATTA SCE, NTU Singapore CE 7490 ADVANCED TOPICS IN DISTRIBUTED SYSTEMS

bigdata Managing Scale in Ontological Systems

SAP HANA - Main Memory Technology: A Challenge for Development of Business Applications. Jürgen Primsch, SAP AG July 2011

Overview of Databases On MacOS. Karl Kuehn Automation Engineer RethinkDB

Scalable Internet Services and Load Balancing

EDG Project: Database Management Services

NoSQL Database Options

Performance And Scalability In Oracle9i And SQL Server 2000

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective

The Sierra Clustered Database Engine, the technology at the heart of

Where We Are. References. Cloud Computing. Levels of Service. Cloud Computing History. Introduction to Data Management CSE 344

Introduction to NOSQL

Bigtable is a proven design Underpins 100+ Google services:

<Insert Picture Here> Oracle NoSQL Database A Distributed Key-Value Store

Exploring Power-Performance Tradeoffs in Database Systems

Big Data With Hadoop

Referential Integrity in Cloud NoSQL Databases

Web Application Deployment in the Cloud Using Amazon Web Services From Infancy to Maturity

these three NoSQL databases because I wanted to see a the two different sides of the CAP

COSC 6374 Parallel Computation. Parallel I/O (I) I/O basics. Concept of a clusters

Scalability of web applications. CSCI 470: Web Science Keith Vertanen

From 1000/day to 1000/sec

Program Agenda. Safe Harbor Statement. What are SecureFiles? Performance. Features DBFS. Internals. Wrap Up 11/22/2014

A survey of big data architectures for handling massive data

Lecture 3: Scaling by Load Balancing 1. Comments on reviews i. 2. Topic 1: Scalability a. QUESTION: What are problems? i. These papers look at

Cloud DBMS: An Overview. Shan-Hung Wu, NetDB CS, NTHU Spring, 2015

Transcription:

Parallel & Distributed Data Management Kai Shen Data Management Data management Efficiency: fast reads/writes Durability and consistency: data is safe and sound despite failures Usability: convenient interfaces for application developers Scalable data management Large volume and high volume of accesses, for example, in data centers Parallelism enables high efficiency Distributed systems are necessary for reliability 1 2 Google File System Reliability Google File System (or Colossus) works on such assumptions A modest number of large files (100s of MBs or GBs or more); large grained accesses are common throughput is more important than latency Reads are more often than writes replication is cost effective Design/implementation Chunk based (64MB per chunk), a single large file contains multiple chunks which are distributed (allowing parallel accesses) Many chunk servers, but one master node Google file system employs replication (default 3) for reliability, but consistent replication for mutable data is challenging: Writes have to reach every replica Writes are done in the same order on all replicas A client may read stale data from a replica who is behind on applying writes What consistency model is supported? Availability Single master design is simpler (no need for distributed consensus), but requires quick, consistent recovery from failure Replication of the master, but only one allows writes 3 4 1

Unstructured vs. Structured Data Example SQL Statement (TPC H Q2) Is file system API convenient for application developers? Data model Unstructured data: General byte stream in a file Structured data: Specific, regular data organization that contains semantics to enable powerful/flexible search and update Most prominent example of structured data management: relational database Data is organized into tables, views, keys (references), indexes etc. that allow SQL queries and updates with potential multiple table joins select s_acctbal, s_name, n_name, p_partkey, p_mfgr, s_address, s_phone, s_comment from part, supplier, partsupp, nation, region where p_partkey = ps_partkey and s_suppkey = ps_suppkey and p_size = 22 and p_type like '%COPPER' and s_nationkey = n_nationkey and n_regionkey = r_regionkey and r_name = 'AFRICA' and ps_supplycost = ( select min(ps_supplycost) from partsupp, supplier, nation, region where p_partkey = ps_partkey and s_suppkey = ps_suppkey and s_nationkey = n_nationkey and n_regionkey = r_regionkey and r_name = 'AFRICA' ) 5 6 Relational Databases Key value/nosql Store Your favorite relational database? Database typically runs on one machine. For scalability and reliability parallel/distributed databases: Data partitioning makes distributed query complex and expensive Data replication makes it challenging to maintain data consistency in updates Transactions are already complex (much more complex if transactions commit in a distributed fashion) There is space between no structure (plain file system) and strongly structured data (relational DBs supporting SQL) Key value or hash table store Data is organized into sets of key value pairs Support lookup(key), insert(key,value), delete(key), and replace(key,new_value) What data fits the key value model? Webtable: key is the URL, value is the web page content What else? Nosql data stores 7 8 2

Specific Key value Stores Key value Stores vs. Relational DBs Before the times of relational databases dbm, developed by Ken Thompson (AT&T), 1970s After the wide uses of relational databases TokyoCabinet/KyotoCabinet, 2000s LevelDB, 2010s Data semantics isn t as flexible What if your data semantics does fit into one single table; semantic linking/joining of multiple tables are needed to answer queries or support updates? Easier to scale up Operations are uniformly simple, each operating on a single data item (no whole table scans or cross table joins) Data partitioning won t cause distributed operations Replication consistency is easier to maintain And leaner and faster Really? Some argue SQLite is lean, fast, AND supporting SQL. 9 10 Scalable Structured Store: Bigtable Bigtable: Data Locality Developed at Google [Chang et al. 2006] Data model Enhanced key value model (row:string, column:string, time:int64) string Two dimensional key allows multiple attributes for each key: in Webtable, a web page has content, incoming references, other labels (spam, ), see Figure 1 of the paper Timestamp allows version management (earlier versions may be useful; allows garbage collection) Data layout Data in lexicographic order by row key Applications should devise the row key in a way such that rows often referenced together have lexicographically nearby keys Reversing URL hostname components for row keys in Webtable www.google.com/index.html com.google.www/index.html Processing near data Data processing scripts can be supplied to run at data server 11 12 3

Bigtable: Distributed Organization Bigtable: Distributed Consistency Centralized or decentralized management? A master server and many tablet servers A special METADATA table that records the location of user tables Scalability and robustness in centralized management Relieve the master from common tasks won t become the scaling bottleneck Fast recovery in case of master failure Maintain consistency in distributed system Ensure that everyone agrees with one master at a time Ensure that everyone agrees with the root METADATA table Distributed consensus Paxos distributed consensus Google s implementation: Chubby lock service 13 14 Bigtable: Performance Figure 6 of the paper memcache In memory key value store, not durable Limited memory space, older things are evicted (following LRU order) when space runs out, effectively a cache Typically work together with a durable store mem cache Application 1. lookup(k) 2. SELECT 3. insert(k,v) Durable database Application 2. delete or replace mem cache 1. UPDATE Durable database 15 16 4

Distributed memcache [Nishtala et al. 2013] Distributed memcache: Scalabilty Key idea: Decouple the performance/scalability from I/O and durability, if the workload is read mostly A two layer data management system: distributed memcache and durable data store (relational databases or ) Both layers handle writes Distributed memcache handles reads alone (mostly) If the workload is read mostly, only the distributed memcache layer needs to be scalable and fast Can work with legacy databases that don t scale well Used in many places including Facebook Since memcache isn t concerned with durability or I/O, its scalability is easier to accomplish: A balanced data partition to many memcache servers A scalable, robust, distributed algorithm to tell which memcache server has your data C x D B A hash code of a data object hash code of a memcache server 17 18 Bigtable vs. Distributed memcache Megastore Performance / scalability? A memcache system is fundamentally constrained by the underlying durable data store if application performs a nontrivial amount of writes. A recent Google system [Baker et al. 2011] that is a step closer to relational databases Support relational DB like data model but do not support operations that hinder scalability (e.g., SQL table joins) SQL table joins can be implemented by applications after initial lookups on the based tables with indexes A new point in the flexibility vs. scalability design space: Bridge to the SQL world, but require applications to explicitly support expensive operations 19 20 5

Summary A variety of data management systems Unstructured file system Relational databases with SQL support (MySQL, SQLite, Oracle, Microsoft SQL Server, ) Single machine key value stores (KyotoCabinet, LevelDB, ) Bigtable Distributed memcache Megastore What is right for my scalable data management application? Data model and data access semantics Scalability / performance requirements Ease of development and ease of use 21 6